
MIE1628 Big Data Analytics Lecture7

The document discusses Azure Synapse Analytics, a limitless analytics service that brings together data warehousing and big data analytics. It allows querying large amounts of data using either serverless on-demand or provisioned resources at scale. Key features include industry-leading SQL and Apache Spark support, data integration via pipelines, and unified management. Azure Synapse uses a massively parallel processing architecture with separate CPUs and storage to efficiently process large datasets in parallel.

Uploaded by Viola Song


Big Data Science
Lecture 7

Objectives
• Big Data Architecture

• Data Warehousing

• Azure Synapse Analytics

• Azure Synapse Use-case

• Fraud Detection at Scale


Big Data Architectures
• Big data solutions typically involve one or more of the following types
of workload:
• Batch processing of big data sources at rest.
• Real-time processing of big data in motion.
• Interactive exploration of big data.
• Predictive analytics and machine learning.
• Consider big data architectures when you need to:
• Store and process data in volumes too large for a traditional database.
• Transform unstructured data for analysis and reporting.
• Capture, process, and analyze unbounded streams of data in real time, or
with low latency.
Components of a Big Data Architecture
Batch Processing
Real-time Processing
Combine Batch and Stream Processing
Lambda Architecture
• All data coming into the system goes through these two paths:
• A batch layer (cold path) stores
all of the incoming data in its raw
form and performs batch
processing on the data. The result
of this processing is stored as
a batch view.
• A speed layer (hot path)
analyzes data in real time. This
layer is designed for low latency,
at the expense of accuracy.
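The split between the two layers can be sketched in a few lines of Python (a toy illustration with made-up event data, not any Azure API):

```python
# Minimal sketch of the lambda architecture's two paths.
# Events flow into both a batch layer (complete, slower) and a
# speed layer (incremental, low-latency but potentially approximate).

events = [("clicks", 3), ("clicks", 2), ("views", 7)]

def batch_view(all_events):
    """Cold path: recompute totals from the full raw history."""
    totals = {}
    for key, value in all_events:
        totals[key] = totals.get(key, 0) + value
    return totals

class SpeedLayer:
    """Hot path: maintain running totals incrementally, per event."""
    def __init__(self):
        self.totals = {}
    def ingest(self, key, value):
        self.totals[key] = self.totals.get(key, 0) + value

speed = SpeedLayer()
for key, value in events:
    speed.ingest(key, value)

# A serving layer would merge the batch view with recent hot-path
# results; here both paths agree because no events arrived mid-batch.
print(batch_view(events))  # {'clicks': 5, 'views': 7}
```

In a real deployment the batch view is rebuilt periodically over all raw data, while the speed layer only covers events that arrived since the last batch run.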
Kappa Architecture
• The kappa architecture was proposed by Jay Kreps as an alternative to the lambda architecture.
• It has the same basic goals as the lambda architecture, but with an important distinction: all data flows through a single path, using a stream processing system.
Data Warehousing
• Data warehouses have to handle big data. Big data is the term used for large quantities of data: data collected in escalating volumes, at higher velocities, and in a greater variety of formats than ever before.
• It can be historical (meaning stored) or real time (meaning streamed from the source).
• Businesses typically depend on their big data to help make critical business decisions.
Modern Data Warehousing
Azure Data Services for Modern Data Warehousing: Azure Data Factory, Azure Data Lake Storage, Azure Databricks, Azure Synapse Analytics.
• A data lake holds raw data, but a data warehouse holds structured information.
• A data warehouse also stores large quantities of data, but the data in a warehouse has been processed to convert it into a format for efficient analysis.
• Azure Synapse Analytics is
an analytics engine. It's
designed to process large
amounts of data very
quickly.
• Azure Synapse Analytics
leverages a massively
parallel processing (MPP)
architecture. This
architecture includes a
control node and a pool of
compute nodes.
Azure Analysis Services
Compare Analysis Services with Synapse Analytics

• Use Azure Synapse Analytics for:


• Very high volumes of data (multi-terabyte to petabyte sized datasets).
• Very complex queries and aggregations.
• Data mining, and data exploration.
• Complex ETL operations. ETL stands for Extract, Transform, and Load, and refers to the way in
which you can retrieve raw data from multiple sources, convert this data into a standard
format, and store it.
• Low to mid concurrency (128 users or fewer).
• Use Azure Analysis Services for:
• Smaller volumes of data (a few terabytes).
• Multiple sources that can be correlated.
• High read concurrency (thousands of users).
• Detailed analysis, and drilling into data, using functions in Power BI.
• Rapid dashboard development from tabular data.
Azure HDInsight
Azure Synapse Analytics
• Azure Synapse Analytics is a limitless analytics service that brings together
enterprise data warehousing and Big Data analytics. It gives you the
freedom to query data on your terms, using either serverless on-demand or
provisioned resources, at scale. Azure Synapse brings these two worlds
together with a unified experience to ingest, prepare, manage, and serve
data for immediate business intelligence and machine learning needs.
What is dedicated SQL pool (formerly SQL
DW) in Azure Synapse Analytics?
Key Features
• Industry-leading SQL
• Industry-standard Apache Spark
• Interop of SQL and Apache Spark on your data lake
• Built-in data integration via pipelines
• Unified management, monitoring, and security
• Synapse Studio
Azure Synapse Analytics MPP Intro
Parallelism

SMP – Symmetric Multiprocessing
• Multiple CPUs used to complete individual processes simultaneously
• All CPUs share the same memory, disks, and network controllers (scale-up)
• All SQL Server implementations up until now have been SMP
• Mostly, the solution is housed on a shared SAN

MPP – Massively Parallel Processing
• Uses many separate CPUs running in parallel to execute a single program
• Shared nothing: each CPU has its own memory and disk (scale-out)
• Segments communicate using a high-speed network between nodes
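The MPP scatter/gather idea can be sketched in Python (purely illustrative; the worker assignment and merge here are hypothetical stand-ins for what the engine does):

```python
# Sketch of MPP-style scatter/gather aggregation: the control node
# splits rows across "shared nothing" workers, each worker aggregates
# only its own partition, and the partial results are merged at the end.

rows = [("UK", 10), ("FR", 5), ("UK", 7), ("DE", 3), ("FR", 2)]
NUM_WORKERS = 4

# Scatter: a deterministic hash on the grouping key picks a worker,
# so all rows for a given key land on the same worker.
partitions = [[] for _ in range(NUM_WORKERS)]
for country, qty in rows:
    partitions[hash(country) % NUM_WORKERS].append((country, qty))

def worker_group_by(partition):
    """Each worker computes a partial GROUP BY over its own data."""
    partial = {}
    for country, qty in partition:
        partial[country] = partial.get(country, 0) + qty
    return partial

# Gather: the control node merges the partial results.
result = {}
for partial in map(worker_group_by, partitions):
    for country, qty in partial.items():
        result[country] = result.get(country, 0) + qty

print(result)  # totals per country, e.g. UK=17, FR=7, DE=3
```

Because each key is routed to exactly one worker, the merge step never has to reconcile conflicting partials; this co-location is what hash distribution buys in a real MPP engine.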


SQL DW Logical Architecture (overview)
[Diagram: user requests arrive at a "Control" node (a SQL engine plus the Data Movement Service, DMS). The Control node fans work out to multiple "Compute" nodes, each with its own SQL instance, DMS, and balanced storage.]
Tables
Tables – Distributions
Round-robin distributed: distributes table rows evenly across all distributions at random.
Hash distributed: distributes table rows across the Compute nodes by using a deterministic hash function to assign each row to one distribution.
Replicated: full copy of the table accessible on each Compute node.

CREATE TABLE dbo.OrderTable
(
    OrderId INT NOT NULL,
    Date    DATE NOT NULL,
    Name    VARCHAR(2),
    Country VARCHAR(2)
)
WITH
(
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = HASH([OrderId]) | ROUND_ROBIN | REPLICATED
);
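The three options can be contrasted with a small Python sketch (the `hash_distribution` helper below is a hypothetical stand-in for the engine's internal hash function, not the real one):

```python
# Toy contrast of the three distribution options over the fixed
# 60 distributions a dedicated SQL pool uses.
import itertools

NUM_DISTRIBUTIONS = 60
order_ids = [85016, 85018, 85216, 85395, 82147, 86881]

# Round-robin: rows simply cycle through distributions. Even spread,
# but keyless, so joins on OrderId may need data movement at query time.
rr = {}
cycle = itertools.cycle(range(NUM_DISTRIBUTIONS))
for oid in order_ids:
    rr.setdefault(next(cycle), []).append(oid)

def hash_distribution(order_id):
    """Stand-in for the engine's deterministic hash on the column."""
    return order_id % NUM_DISTRIBUTIONS

# Hash: the same OrderId always lands in the same distribution,
# so joins/aggregations on OrderId can be co-located.
hashed = {}
for oid in order_ids:
    hashed.setdefault(hash_distribution(oid), []).append(oid)

# Replicated: every compute node holds a full copy of the table.
replicated = {node: list(order_ids) for node in range(4)}
```

The key property to notice: `hash_distribution` is deterministic per key, while round-robin placement depends only on arrival order.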
Tables – Partitions
Overview
• Table partitions divide data into smaller groups
• In most cases, partitions are created on a date column
• Supported on all table types
• RANGE RIGHT – used for time partitions
• RANGE LEFT – used for number partitions

Benefits
• Improves efficiency and performance of loading and querying by limiting the scope to a subset of the data.
• Offers significant query performance enhancements where filtering on the partition key can eliminate unnecessary scans and eliminate IO.

CREATE TABLE partitionedOrderTable
(
    OrderId INT NOT NULL,
    Date    DATE NOT NULL,
    Name    VARCHAR(2),
    Country VARCHAR(2)
)
WITH
(
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = HASH([OrderId]),
    PARTITION (
        [Date] RANGE RIGHT FOR VALUES (
            '2000-01-01', '2001-01-01', '2002-01-01',
            '2003-01-01', '2004-01-01', '2005-01-01'
        )
    )
);
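How RANGE RIGHT boundaries enable partition elimination can be sketched in Python (an illustration of the boundary semantics only, not Synapse internals):

```python
# RANGE RIGHT: each boundary value belongs to the partition on its
# right, so partition i holds rows with boundaries[i-1] <= d < boundaries[i].
# A filter on the partition column lets the engine skip partitions
# whose range cannot match.
from bisect import bisect_right
from datetime import date

# Yearly boundaries, as in the CREATE TABLE example above.
boundaries = [date(y, 1, 1) for y in range(2000, 2006)]

def partition_of(d):
    """Index of the partition holding date d under RANGE RIGHT."""
    return bisect_right(boundaries, d)

def partitions_to_scan(lo, hi):
    """Which partitions a filter lo <= Date <= hi must touch."""
    return list(range(partition_of(lo), partition_of(hi) + 1))

# A query filtered to calendar year 2002 scans only one partition
# out of seven, eliminating the IO for the rest.
scan = partitions_to_scan(date(2002, 1, 1), date(2002, 12, 31))
print(scan)  # [3]
```

Note how the boundary date 2002-01-01 itself falls in the partition to its right, which is exactly what makes RANGE RIGHT the natural choice for time partitions.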
Common table distribution methods
Table category → recommended distribution option:

Fact: use hash-distribution with clustered columnstore index. Performance improves because hashing enables the platform to localize certain operations within the node itself during query execution. Operations that benefit:
  COUNT(DISTINCT( <hashed_key> ))
  OVER (PARTITION BY <hashed_key>)
  most JOIN <table_name> ON <hashed_key>
  GROUP BY <hashed_key>

Dimension: use replicated for smaller tables. If tables are too large to store on each Compute node, use hash-distributed.

Staging: use round-robin for the staging table. The load with CTAS is faster. Once the data is in the staging table, use INSERT…SELECT to move the data to production tables.
Tables – Distributions & Partitions
Logical table structure vs. physical data distribution (hash distribution on OrderId, date partitions).

Logical table (OrderId, Date, Name, Country):
85016  11-2-2018  V  UK
85018  11-2-2018  Q  SP
85216  11-2-2018  Q  DE
85395  11-2-2018  V  NL
82147  11-2-2018  Q  FR
86881  11-2-2018  D  UK
93080  11-3-2018  R  UK
94156  11-3-2018  S  FR
96250  11-3-2018  Q  NL
98799  11-3-2018  R  NL
98015  11-3-2018  T  UK
98310  11-3-2018  D  DE
98979  11-3-2018  Z  DE
98137  11-3-2018  T  FR
…

Physically, rows are hashed across 60 distributions (shards). For example, Distribution1 (OrderId 80,000 – 100,000) is split into an 11-2-2018 partition and an 11-3-2018 partition holding the matching rows above.
• Each shard is partitioned with the same date partitions.
• A minimum of 1 million rows per distribution and partition is needed for optimal compression and performance of clustered columnstore tables.
SQL DW Data Layout Options
Star schema example:
• Sales Fact: Date Dim ID, Store Dim ID, Prod Dim ID, Cust Dim ID, Qty Sold, Dollars Sold
• Time Dim: Date Dim ID, Calendar Year, Calendar Qtr, Calendar Mo, Calendar Day
• Product Dim: Prod Dim ID, Prod Category, Prod Sub Cat, Prod Desc
• Store Dim: Store Dim ID, Store Name, Store Mgr, Store Size
• Customer Dim: Cust Dim ID, Cust Name, Cust Addr, Cust Phone, Cust Email

Replicated: the table is copied to each Compute node.
Distributed: the table is spread across Compute nodes based on a hash.
• https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/memory-concurrency-limits
Architecture for DW100
Azure SQL Data Warehouse: the engine sends all work to a single worker (Worker1), which owns all 60 distributions (D1–D60) backed by Azure Storage Blob(s).
Architecture for DW600
Azure SQL Data Warehouse: the engine spreads work across six workers, each owning 10 of the 60 distributions backed by Azure Storage Blob(s) (Worker1: D1–D10, Worker2: D11–D20, …, Worker6: D51–D60).
Azure Synapse Analytics
Data Warehouse Architecture
[Diagram: a Control node distributes incoming queries across a grid of Compute nodes, each processing its own slice of the data in parallel.]
Maximizing Query Performance
• Round-robin tables
• Hash distributed tables
• Replicated tables
Maximizing Query Performance
Round-robin tables
• The default option for newly created tables
• Evenly distributes the data across the available compute nodes in a random manner, giving an even distribution of data across all nodes
• Loading into round-robin tables is fast
• Queries on round-robin tables may require more data movement, as data is "reshuffled" to organize it for the query
• Great to use for loading staging tables


Maximizing Query Performance
Hash distributed tables
• Distributes rows based on the value in the distribution column, using a deterministic hash function to assign each row to one distribution
• Designed to achieve high performance for queries that run against large fact tables in a star schema
• Choosing a good distribution column is important to ensure the hash distribution performs well
• As a starting point, use on tables that are greater than 2 GB in size and have frequent inserts, updates, and deletes
• But don't choose a volatile column as the hash distribution column
Maximizing Query Performance
Replicated tables
• A full copy of the table is placed on every single compute node to minimize data movement
• Works well for dimension tables in a star schema that are less than 2 GB in size and are used regularly in queries with simple predicates
• Should not be used on dimension tables that are updated on a regular basis
• You can convert existing round-robin tables to replicated tables using a CTAS statement to take advantage of the feature
Azure Synapse Analytics Studio
Synapse Studio: https://web.azuresynapse.net
Synapse Studio
Synapse Studio is divided into Activity hubs. These organize the tasks needed for building an analytics solution.
• Overview – quick access to common gestures, most-recently used items, and links to tutorials and documentation.
• Data – explore structured and unstructured data.
• Develop – write code and define the business logic of the pipeline via notebooks, SQL scripts, data flows, etc.
• Orchestrate – design pipelines that move and transform data.
• Monitor – centralized view of all resource usage and activities in the workspace.
• Manage – configure the workspace, pools, and access to artifacts.
Data Hub
Data Hub – Storage accounts
Data Hub – Databases
• SQL pool
• SQL on-demand
• Spark
Familiar gestures generate T-SQL scripts from SQL metadata objects such as tables. Starting from a table, you can auto-generate a single line of PySpark code that makes it easy to load a SQL table into a Spark dataframe.
Data Hub – Datasets
Develop Hub
Overview
• Provides a development experience to query, analyze, and model data
Develop Hub – SQL scripts
SQL Script
• Author SQL scripts
• Execute SQL scripts on a provisioned SQL pool or SQL on-demand
• Publish individual SQL scripts, or multiple SQL scripts through the Publish All feature
• Language support and IntelliSense
• View results in table or chart form and export results in several popular formats
Develop Hub – Notebooks
Notebooks
• Write multiple languages in one notebook using %%<name of language>
• Use temporary tables across languages
• Language support for syntax highlighting, syntax errors, code completion, smart indent, and code folding
• Export results
Develop Hub – Power BI
• Publish changes by simply saving the report in the workspace
Orchestrate Hub
• Provides the ability to create pipelines that ingest, transform, and load data, with 90+ built-in connectors
• Offers a wide range of activities that a pipeline can perform
Monitor Hub
Manage Hub
Languages
Overview
Supports multiple languages for developing notebooks:
• PySpark (Python)
• Spark (Scala)
• .NET Spark (C#)
• Spark SQL
• Java
• R (early 2020)

Benefits
• Write multiple languages in one notebook using %%<name of language>
• Use temporary tables across languages
Synapse workspace
SQL pools
Apache Spark pools
Fraud Detection Use Case
Azure Synapse Fraud Detection Using Big Data
Analytics
• Clearsale, a leading fraud detection company in Brazil, is using Azure Synapse to modernize its operational analytics data platform. Clearsale helps customers verify an average of half a million transactions daily, using big data analytics to detect fraud across the world. Host Jeremy Chapman speaks with Jelther Goncalves, Data Engineer at Clearsale, to discuss how Clearsale is using Azure Synapse to expand their machine learning analytics for anomaly detection and to operate at greater scale.
• Video
References
• https://docs.microsoft.com/en-us/azure/synapse-analytics/overview-what-is
• https://docs.microsoft.com/en-us/azure/synapse-analytics/get-started-create-workspace
