MIE1628 Big Data Analytics Lecture 7
Science
Lecture 7
Objectives
• Big Data Architecture
• Data Warehousing
It can be historical (meaning stored) or real time
(meaning streamed from the source).
• INDUSTRY-LEADING SQL
• INDUSTRY-STANDARD APACHE SPARK
• INTEROP OF SQL AND APACHE SPARK ON YOUR DATA LAKE
• BUILT-IN DATA INTEGRATION VIA PIPELINES
• UNIFIED MANAGEMENT, MONITORING, AND SECURITY
• SYNAPSE STUDIO
Azure Synapse Analytics MPP Intro
Parallelism
SMP - Symmetric Multiprocessing
• Multiple CPUs used to complete individual processes simultaneously
• All CPUs share the same memory, disks, and network controllers (scale-up)
• All SQL Server implementations up until now have been SMP
• Mostly, the solution is housed on a shared SAN
MPP - Massively Parallel Processing
• Uses many separate CPUs running in parallel to execute a single program
• Shared Nothing: Each CPU has its own memory and disk (scale-out)
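The shared-nothing idea can be illustrated with a small sketch. This is a simulation under stated assumptions, not how any engine is implemented: the function names (`split_shared_nothing`, `compute_node`, `control_node`) are hypothetical, and the "nodes" run one after another here, whereas real MPP nodes run in parallel.

```python
def split_shared_nothing(data, nodes):
    """Give each node its own private slice of the data (no sharing)."""
    return [data[i::nodes] for i in range(nodes)]


def compute_node(local_rows):
    """Each node works only on the rows it owns."""
    return sum(local_rows)


def control_node(data, nodes):
    """Fan the work out to the nodes, then combine the partial results."""
    slices = split_shared_nothing(data, nodes)
    partials = [compute_node(s) for s in slices]  # sequential stand-in for parallel work
    return sum(partials)


sales = list(range(1, 101))
total = control_node(sales, nodes=4)
print(total)
```

Adding nodes (scale-out) shrinks each slice, which is the contrast with SMP, where more work means a bigger single machine (scale-up).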
[Diagram: “Control” node (SQL, DMS) and multiple DMS instances]
Dimension: Use replicated for smaller tables. If tables are too large to store on each Compute node, use hash-distributed.
Staging: Use round-robin for the staging table. The load with CTAS is faster. Once the data is in the staging table, use INSERT…SELECT to move the data to production tables.
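The three placement strategies named above can be compared with a minimal sketch. It assumes 60 distributions per table; `zlib.crc32` is only a stand-in for the engine's hash function, which the slides do not show, and all function names are hypothetical.

```python
import zlib
from collections import defaultdict

NUM_DISTRIBUTIONS = 60  # assumed number of distributions per table


def round_robin(rows):
    """Assign rows to distributions in turn, ignoring their content."""
    placement = defaultdict(list)
    for i, row in enumerate(rows):
        placement[i % NUM_DISTRIBUTIONS].append(row)
    return placement


def hash_distributed(rows, key):
    """Assign each row by a hash of its distribution column (crc32 is a stand-in)."""
    placement = defaultdict(list)
    for row in rows:
        d = zlib.crc32(str(row[key]).encode()) % NUM_DISTRIBUTIONS
        placement[d].append(row)
    return placement


def replicated(rows, compute_nodes):
    """Copy the full table to every compute node."""
    return {node: list(rows) for node in range(compute_nodes)}


staging = [{"OrderId": 80000 + i, "Qty": i % 5} for i in range(600)]
rr = round_robin(staging)                    # even spread, no key needed
hd = hash_distributed(staging, "OrderId")    # same key always lands in the same place
rep = replicated([{"StoreId": 1}, {"StoreId": 2}], compute_nodes=6)
```

Round-robin needs no hashing work per row, which fits a staging load; hash distribution keeps rows with the same key together; replication trades storage for having the small table on every node.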
Tables – Distributions & Partitions
Logical table structure
OrderId  Date       Name  Country
82147    11-2-2018  Q     FR
85216    11-2-2018  Q     DE
85395    11-2-2018  V     NL
86881    11-2-2018  D     UK
93080    11-3-2018  R     UK
94156    11-3-2018  S     FR
96250    11-3-2018  Q     NL
98015    11-3-2018  T     UK
98310    11-3-2018  D     DE
98799    11-3-2018  R     NL
98979    11-3-2018  Z     DE
…
Physical data distribution
( Hash distribution (OrderId), Date partitions )
• x 60 distributions (shards)
• Each shard is partitioned with the same date partitions (for example, the 11-2-2018 partition and the 11-3-2018 partition)
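The two-level layout on this slide (hash on OrderId picks the shard, the date picks the partition inside it) can be sketched as follows. The rows are taken from the slide; `zlib.crc32` is an assumed stand-in hash, and `distribution_of` and `locate` are hypothetical helper names.

```python
import zlib
from collections import defaultdict

NUM_DISTRIBUTIONS = 60  # "x 60 distributions (shards)"

rows = [
    (82147, "11-2-2018", "Q", "FR"),
    (85216, "11-2-2018", "Q", "DE"),
    (86881, "11-2-2018", "D", "UK"),
    (93080, "11-3-2018", "R", "UK"),
    (94156, "11-3-2018", "S", "FR"),
    (96250, "11-3-2018", "Q", "NL"),
]


def distribution_of(order_id):
    """Pick a shard from the hash of OrderId (crc32 is a stand-in hash)."""
    return zlib.crc32(str(order_id).encode()) % NUM_DISTRIBUTIONS


# layout[distribution][date partition] -> rows stored there
layout = defaultdict(lambda: defaultdict(list))
for row in rows:
    order_id, date = row[0], row[1]
    layout[distribution_of(order_id)][date].append(row)


def locate(order_id, date):
    """Return the (distribution, partition) pair that holds a given row."""
    return distribution_of(order_id), date
```

A lookup by OrderId and date touches one partition of one shard, while a query filtered only by date still goes to every shard but can skip the other date partitions in each.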
Sales Fact: Date Dim ID, Store Dim ID, Prod Dim ID, Cust Dim ID, Qty Sold, Dollars Sold
Store Dim: Store Dim ID, Store Name, Store Mgr, Store Size
Customer Dim: Cust Dim ID, Cust Name, Cust Addr, Cust Phone, Cust Email
Replicated
Table copied to each compute node
Distributed
Table spread across compute nodes based on “hash”
[Diagram: each “Compute” node runs SQL and DMS over balanced storage]
• https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/memory-concurrency-limits
Architecture for DW100
Azure SQL Data Warehouse: Engine, Worker1
Azure Storage Blob(s): distributions D1 to D60, all attached to Worker1
D1 D2 D3 D4 D5 D6 D7 D8 D9 D10
D11 D12 D13 D14 D15 D16 D17 D18 D19 D20
D21 D22 D23 D24 D25 D26 D27 D28 D29 D30
D31 D32 D33 D34 D35 D36 D37 D38 D39 D40
D41 D42 D43 D44 D45 D46 D47 D48 D49 D50
D51 D52 D53 D54 D55 D56 D57 D58 D59 D60
Architecture for DW600
Azure SQL Data Warehouse: Engine, Worker1 to Worker6
Azure Storage Blob(s): distributions D1 to D60, ten per worker
Worker1 D1 D2 D3 D4 D5 D6 D7 D8 D9 D10
Worker2 D11 D12 D13 D14 D15 D16 D17 D18 D19 D20
Worker3 D21 D22 D23 D24 D25 D26 D27 D28 D29 D30
Worker4 D31 D32 D33 D34 D35 D36 D37 D38 D39 D40
Worker5 D41 D42 D43 D44 D45 D46 D47 D48 D49 D50
Worker6 D51 D52 D53 D54 D55 D56 D57 D58 D59 D60
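The two layouts differ only in how the same 60 distributions are shared among workers. A minimal sketch of that mapping, assuming the distributions are handed out in contiguous blocks as drawn and that the worker count divides 60 evenly (`assign_distributions` is a hypothetical helper name):

```python
NUM_DISTRIBUTIONS = 60  # fixed number of distributions shown in both diagrams


def assign_distributions(workers):
    """Map D1..D60 onto workers in contiguous, equally sized blocks."""
    if NUM_DISTRIBUTIONS % workers != 0:
        raise ValueError("this sketch assumes the worker count divides 60")
    per_worker = NUM_DISTRIBUTIONS // workers
    return {
        f"Worker{w + 1}": [f"D{w * per_worker + i + 1}" for i in range(per_worker)]
        for w in range(workers)
    }


dw100 = assign_distributions(1)  # one worker holds all 60 distributions
dw600 = assign_distributions(6)  # six workers hold 10 distributions each
print(dw600["Worker2"])
```

Because the data stays in storage and only the worker-to-distribution mapping changes, scaling from one layout to the other does not require redistributing the rows themselves.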
Azure Synapse Analytics
Data Warehouse Architecture
Monitor
Centralized view of all resource usage and activities in the workspace.
Manage
Configure the workspace, pool, access to artifacts
Data Hub
Data Hub – Storage accounts
Data Hub – Databases
SQL pool
SQL on-demand
Spark
Data Hub – Databases
• Familiar gesture to generate T-SQL scripts from SQL metadata objects such as tables.
• Starting from a table, auto-generate a single line of PySpark code that makes it easy to load a SQL table into a Spark dataframe
Data Hub – Datasets
Synapse Studio hub
Develop Hub
Overview
It provides a development experience to query, analyze, and model data.
Develop Hub - SQL scripts
SQL Script
• Authoring SQL Scripts
• Execute SQL script on provisioned SQL Pool or SQL On-demand
• Publish individual SQL script or multiple SQL scripts through Publish all feature
• Language support and intellisense
Develop Hub - SQL scripts
SQL Script
View results in Table or Chart form and export results in several
popular formats
Develop Hub - Notebooks
Notebooks
Benefits
• Allows you to write multiple languages in one notebook by starting a cell with %%<Name of language>
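How a `%%<Name of language>` line selects the language of a cell can be sketched in a few lines. This is an illustration only, not Synapse's implementation: `cell_language` is a hypothetical helper, and the default of `pyspark` for cells without a magic line is an assumption.

```python
def cell_language(cell_source, default="pyspark"):
    """Return the language named by a leading %%<language> line, else the default."""
    lines = cell_source.lstrip().splitlines()
    first = lines[0].strip() if lines else ""
    if first.startswith("%%") and len(first) > 2:
        return first[2:].split()[0].lower()
    return default


# Three cells of one notebook, each written in a different way.
notebook = [
    "%%pyspark\ndf = spark.range(10)\ndf.createOrReplaceTempView('numbers')",
    "%%sql\nSELECT COUNT(*) FROM numbers",
    "print('no magic line, so the default language applies')",
]
languages = [cell_language(cell) for cell in notebook]
print(languages)
```

In the first two cells the PySpark code registers a temporary view and the SQL cell queries it, which is the typical reason to mix languages in one notebook.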