Azure Synapse DW - Pool Best Practices & Field Guidance
Prepared by
Revision 1
5/8/2020
The high-level architecture, migration dispositions, and guidelines in this document were developed in
consultation and collaboration with Microsoft Corporation technical architects. Because Microsoft must
respond to changing market conditions, this document should not be interpreted as an invitation to
contract or a commitment on the part of Microsoft.
Microsoft has provided generic high-level guidance in this document with the understanding that
MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, WITH RESPECT TO THE INFORMATION
CONTAINED HEREIN.
This document is provided “as-is”. Information and views expressed in this document, including URL and
other Internet Web site references, may change without notice.
Some examples depicted herein are provided for illustration only and are fictitious. No real association or
connection is intended or should be inferred.
This document does not provide you with any legal rights to any intellectual property in any Microsoft
product. You may copy and use this document for your internal, reference purposes.
Note: The detail provided in this document has been harvested as part of a customer engagement
sponsored through the Data SQL Ninja Engineering.
Customers will have many questions about what Microsoft recommends based on its experience with other
global customers, and about how to configure and maintain a Synapse SQL pool to get the best
performance and concurrency at the same time.
This document helps you pass our field experience on to customers and make their environments
production ready.
We discuss and explain data loading, transformation, and performance strategies for the Synapse
SQL pool.
Azure Synapse is an analytics service that brings together enterprise data warehousing
and Big Data analytics. It gives you the freedom to query data on your terms, using
either serverless on-demand or provisioned resources—at scale. Azure Synapse brings
these two worlds together with a unified experience to ingest, prepare, manage, and
serve data for immediate BI and machine learning needs.
Synapse SQL leverages Azure Storage to keep your user data safe. Since your data is
stored and managed by Azure Storage, there is a separate charge for your storage
consumption. The data is sharded into distributions to optimize the performance of the
system. You can choose which sharding pattern to use to distribute the data when you
define the table. These sharding patterns are supported:
• Hash
• Round Robin
• Replicate
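As a minimal sketch (table and column names are illustrative), the sharding pattern is chosen in the WITH clause of CREATE TABLE:

-- Hash-distributed fact table (rows are assigned to distributions by CustomerKey)
CREATE TABLE dbo.FactSales
( SaleId BIGINT NOT NULL, CustomerKey INT NOT NULL, Amount DECIMAL(18,2) )
WITH ( DISTRIBUTION = HASH(CustomerKey), CLUSTERED COLUMNSTORE INDEX );

-- Round-robin staging table (rows are spread evenly with no distribution key)
CREATE TABLE dbo.StageSales
( SaleId BIGINT, CustomerKey INT, Amount DECIMAL(18,2) )
WITH ( DISTRIBUTION = ROUND_ROBIN, HEAP );

-- Replicated dimension table (a full copy is cached on each Compute node)
CREATE TABLE dbo.DimCurrency
( CurrencyKey INT NOT NULL, CurrencyName NVARCHAR(50) )
WITH ( DISTRIBUTION = REPLICATE, CLUSTERED INDEX (CurrencyKey) );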
The Control node is the brain of the architecture. It is the front end that interacts with all
applications and connections. The MPP engine runs on the Control node to optimize
and coordinate parallel queries. When you submit a T-SQL query, the Control node
transforms it into queries that run against each distribution in parallel.
The Compute nodes provide the computational power. Distributions map to Compute
nodes for processing. As you pay for more compute resources, distributions are
remapped to available Compute nodes. The number of compute nodes ranges from 1 to
60 and is determined by the service level for Synapse SQL.
Each Compute node has a node ID that is visible in system views. You can see the
Compute node ID by looking for the node_id column in system views whose names
begin with sys.pdw_nodes. For a list of these system views, see MPP system views.
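For example, a quick way to list the nodes and their IDs (a sketch against the documented sys.dm_pdw_nodes view):

-- List Control and Compute nodes with their node IDs
SELECT pdw_node_id, type, name
FROM sys.dm_pdw_nodes
ORDER BY type, pdw_node_id;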
Data Movement Service (DMS) is the data transport technology that coordinates data
movement between the Compute nodes. Some queries require data movement to
ensure the parallel queries return accurate results. When data movement is required,
DMS ensures the right data gets to the right location.
3.2 Distributions
A distribution is the basic unit of storage and processing for parallel queries that run on
distributed data. When SQL Analytics runs a query, the work is divided into 60 smaller
queries that run in parallel.
Each of the 60 smaller queries runs on one of the data distributions. Each Compute
node manages one or more of the 60 distributions. A SQL pool with maximum compute
resources has one distribution per Compute node. A SQL pool with minimum compute
resources has all the distributions on one compute node.
A replicated table provides the fastest query performance for small tables.
A table that is replicated caches a full copy of the table on each compute node.
Consequently, replicating a table removes the need to transfer data among compute
nodes before a join or aggregation. Replicated tables are best utilized with small tables.
Extra storage is required and there is additional overhead incurred when writing
data, which makes replicating large tables impractical.
The diagram below shows a replicated table that is cached on the first distribution on
each compute node.
Before migrating, you want to be certain SQL Data Warehouse is the right solution for
your workload. SQL Data Warehouse is a distributed system, designed to perform
analytics on large volumes of data. Migrating to SQL Data Warehouse requires some
design changes that are not too hard to understand but might take some time to
implement. If your business requires an enterprise-class data warehouse (DW), the
benefits are worth the migration effort.
A Synapse SQL pool represents a collection of analytic resources that are being
provisioned. Analytic resources are defined as a combination of CPU, memory, and IO.
These three resources are bundled into units of compute scale called Data Warehouse
Units (DWUs). A DWU represents an abstract, normalized measure of compute resources
and performance.
A change to your service level alters the number of DWUs that are available to the
system, which in turn adjusts the performance, and the cost, of your system.
For higher performance, you can increase the number of data warehouse units. For less
performance, reduce data warehouse units. Storage and compute costs are billed
separately, so changing data warehouse units does not affect storage costs.
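As a hedged example (the pool name and target service level are illustrative), the service level can be changed with T-SQL while connected to the master database:

-- Scale the SQL pool to a different DWU setting
ALTER DATABASE [MySqlPool]
MODIFY ( SERVICE_OBJECTIVE = 'DW1000c' );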
Performance for data warehouse units is based on these data warehouse workload
metrics:
• How fast a standard SQL pool query can scan a large number of rows and then perform a
complex aggregation. This operation is I/O and CPU intensive.
• How fast the SQL pool can ingest data from Azure Storage Blobs or Azure Data Lake. This
operation is network and CPU intensive.
• How fast the CREATE TABLE AS SELECT T-SQL command can copy a table. This
operation involves reading data from storage, distributing it across the nodes of the
appliance and writing to storage again. This operation is CPU, IO, and network intensive.
Increasing DWUs:
• Linearly changes performance of the system for scans, aggregations, and CTAS statements
• Increases the number of readers and writers for PolyBase load operations
• Increases the maximum number of concurrent queries and concurrency slots.
The ideal number of data warehouse units depends very much on your workload and
the amount of data you have loaded into the system.
SQL pool is a scale-out system that can provision vast amounts of compute and query
sizeable quantities of data.
To see its true capabilities for scaling, especially at larger DWUs, we recommend scaling
the data set as you scale to ensure that you have enough data to feed the CPUs. For
scale testing, we recommend using at least 1 TB.
This section provides helpful tips and best practices for building Azure Synapse
solutions.
7.3 Land data into Azure Blob or Azure Data Lake Store
7.3.1 Extract the source data into text files
Getting data out of your source system depends on the storage location. The goal is to
extract the data into delimited text or CSV files that PolyBase and the COPY statement support.
With PolyBase and the COPY statement, you can load data from UTF-8 and UTF-16
encoded delimited text or CSV files. In addition to delimited text and CSV files, they load
from Hadoop file formats such as ORC and Parquet. PolyBase and the COPY
statement can also load data from Gzip and Snappy compressed files.
Extended ASCII, fixed-width format, and nested formats such as WinZip or XML aren't
supported. If you're exporting from SQL Server, you can use the bcp command-line tool to
export the data into delimited text files.
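A minimal bcp export sketch; the server, database, table, and credential values are illustrative (-c exports character data; use -w instead for UTF-16 output):

bcp dbo.DimCustomer out C:\extract\DimCustomer.txt -S myserver.database.windows.net -d SourceDB -U loaduser -P <password> -c -t "|"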
To land the data in Azure storage, you can move it to Azure Blob storage or Azure Data
Lake Store Gen2. In either location, the data should be stored in text files. PolyBase and
the COPY statement can load from either location.
Tools and services you can use to move data to Azure Storage:
• Azure ExpressRoute service enhances network throughput, performance, and
predictability. ExpressRoute is a service that routes your data through a dedicated private
connection to Azure. ExpressRoute connections do not route data through the public
internet. The connections offer more reliability, faster speeds, lower latencies, and higher
security than typical connections over the public internet.
• AZCopy utility moves data to Azure Storage over the public internet. This works if your
data sizes are less than 10 TB. To perform loads on a regular basis with AZCopy, test the
network speed to see if it is acceptable (a sample command appears after this list).
• Azure Data Factory (ADF) has a gateway that you can install on your local server. Then you
can create a pipeline to move data from your local server up to Azure Storage. To use Data
Factory with SQL Analytics, see Loading data for SQL Analytics.
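As an illustration of the AZCopy option above (the account, container, and SAS token are placeholders), a typical AzCopy v10 upload looks like:

azcopy copy "C:\extract\*.txt" "https://<account>.blob.core.windows.net/<container>/sales?<SAS-token>"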
If you are using PolyBase, you need to define external tables in your SQL pool before
loading. External tables are not required by the COPY statement. PolyBase uses external
tables to define and access the data in Azure Storage.
An external table is similar to a database view. The external table contains the table
schema and points to data that is stored outside the SQL pool.
First, load your data into Azure Data Lake Storage or Azure Blob Storage. Next, use
PolyBase to load your data into staging tables; a heap table with round-robin distribution
works well for staging.
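A minimal PolyBase staging-load sketch, assuming Azure Blob Storage secured with a storage account key; all object names, paths, and secrets are illustrative:

-- A database master key must exist before creating the scoped credential
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCred
WITH IDENTITY = 'user', SECRET = '<storage-account-key>';

CREATE EXTERNAL DATA SOURCE AzureBlobStore
WITH ( TYPE = HADOOP,
       LOCATION = 'wasbs://<container>@<account>.blob.core.windows.net',
       CREDENTIAL = AzureStorageCred );

CREATE EXTERNAL FILE FORMAT PipeDelimitedText
WITH ( FORMAT_TYPE = DELIMITEDTEXT,
       FORMAT_OPTIONS ( FIELD_TERMINATOR = '|', USE_TYPE_DEFAULT = FALSE ) );

-- External table describing the landed files, then a CTAS load into a round-robin heap staging table
CREATE EXTERNAL TABLE dbo.ExtSales
( SaleId BIGINT, CustomerKey INT, Amount DECIMAL(18,2) )
WITH ( LOCATION = '/sales/', DATA_SOURCE = AzureBlobStore, FILE_FORMAT = PipeDelimitedText );

CREATE TABLE dbo.StageSales_Load
WITH ( DISTRIBUTION = ROUND_ROBIN, HEAP )
AS SELECT * FROM dbo.ExtSales;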
Resource groups are used as a way to allocate memory to queries. If you need more
memory to improve query or loading speed, you should allocate higher resource classes.
On the flip side, using larger resource classes impacts concurrency. You want to take
that into consideration before moving all of your users to a large resource class.
If you notice that queries take too long, check that your users do not run in large
resource classes. Large resource classes consume many concurrency slots. They can
cause other queries to queue up.
Finally, on Gen2 SQL pools, each resource class gets 2.5 times more memory than on
Gen1.
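For example (the user name is illustrative), a loading user can be placed in a larger dynamic resource class with the documented role procedures:

-- Grant a larger resource class to a dedicated loading user
EXEC sp_addrolemember 'largerc', 'LoadUser';

-- Revert to the default small resource class when finished
EXEC sp_droprolemember 'largerc', 'LoadUser';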
Indexing is helpful for reading tables quickly. There is a unique set of technologies that
you can use based on your needs: clustered columnstore indexes (the default, best suited to
large fact tables), heap tables (best for staging and small tables), and clustered or
nonclustered indexes (best for selective point lookups).
Tips:
You might partition your table when you have a large fact table (greater than 1 billion
rows). In 99 percent of cases, the partition key should be based on date. Be careful to
not over partition, especially when you have a clustered columnstore index.
With staging tables that require ELT, you can benefit from partitioning. It facilitates data
lifecycle management. Be careful not to over partition your data, especially on a
clustered columnstore index.
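A sketch of a date-partitioned clustered columnstore fact table (names and boundary values are illustrative); keep the partition count low enough that each partition still holds tens of millions of rows:

CREATE TABLE dbo.FactOrders
( OrderDateKey INT NOT NULL, CustomerKey INT NOT NULL, Amount DECIMAL(18,2) )
WITH
( DISTRIBUTION = HASH(CustomerKey),
  CLUSTERED COLUMNSTORE INDEX,
  PARTITION ( OrderDateKey RANGE RIGHT FOR VALUES (20190101, 20200101, 20210101) )
);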
• PolyBase is by far the fastest and most scalable SQL Data Warehouse loading
method to date, so we recommend it as your default loading mechanism. PolyBase
is a scalable, query processing framework compatible with Transact-SQL that can
be used to combine and bridge data across relational database management
systems, Azure Blob Storage, Azure Data Lake Store and Hadoop database
platform ecosystems (APS only)
• As a general rule, we recommend making PolyBase your first choice for loading
data into SQL Data Warehouse unless you can’t accommodate PolyBase-supported
file formats. Currently PolyBase can load data from UTF-8 and UTF-16 encoded
delimited text files as well as the popular Hadoop file formats RCFile, ORC, and
Parquet (non-nested format). PolyBase can load data from gzip, zlib, and Snappy
compressed files. PolyBase currently does not support extended ASCII, fixed-width file
formats, WinZip, and semi-structured data such as Parquet (nested/hierarchical),
JSON, and XML. A popular pattern to load semi-structured data is to use Azure
Databricks or similarly HDI/Spark to load the data, flatten/transform to the
supported format, then load into SQL DW.
Each HDFS bridge of the DMS service on every Compute node can connect to an
external resource such as Azure Blob Storage,
and then bidirectionally transfer data between SQL Data Warehouse and the external
resource.
As illustrated in Table 1 below, each DWU has a specific number of readers and
writers. As you scale out, each node gets an additional number of readers and
writers. The static and dynamic resource classes also vary in the number of readers
and writers. Note that Parquet files typically have half the number of readers compared
to non-Parquet files. The number of readers and writers is an important factor in
determining your load performance.
Table 1. Number of readers and writers for Gen 2 SQL DW xlargerc resource class
To check the number of readers and writers, use the following query (adjust the
request_id and step_index as appropriate). For more information, see Monitoring your
workload using DMVs.
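A sketch of such a query against sys.dm_pdw_dms_workers (add a WHERE clause for the request_id and step_index you are investigating):

-- Count DMS readers and writers per request and load step
SELECT request_id, step_index, type, COUNT(*) AS worker_count
FROM sys.dm_pdw_dms_workers
GROUP BY request_id, step_index, type
ORDER BY request_id, step_index;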
For source formats that don’t reflect the defaults, you must explicitly specify a custom
date format. However, if multiple non-default formats are used within one file, there is
currently no method for specifying multiple custom date formats within the PolyBase
command.
Fixed-length file formats are not supported.
The COPY statement provides the following capabilities:
• Use lower privileged users to load without needing strict CONTROL permissions on
the data warehouse
• Execute a single T-SQL statement without having to create any additional database
objects
• Properly parse and load CSV files where delimiters (string, field, row) are escaped
within string delimited columns
• Specify a finer permission model without exposing storage account keys using
Share Access Signatures (SAS)
• Use a different storage account for the ERRORFILE location
(REJECTED_ROW_LOCATION)
• Customize default values for each target column and specify source data fields to
load into specific target columns
• Specify a custom row terminator for CSV files
• Leverage SQL Server Date formats for CSV files
• Specify wildcards and multiple files in the storage location path
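A minimal COPY sketch for CSV files secured with a SAS token; the table, storage paths, and secret are illustrative:

COPY INTO dbo.StageSales_Load
FROM 'https://<account>.blob.core.windows.net/<container>/sales/*.csv'
WITH
( FILE_TYPE = 'CSV',
  CREDENTIAL = ( IDENTITY = 'Shared Access Signature', SECRET = '<sas-token>' ),
  FIELDTERMINATOR = ',',
  ROWTERMINATOR = '0x0A',
  FIRSTROW = 2,              -- skip a header row
  ERRORFILE = '/rejects/'    -- rejected rows land in this folder of the same container
);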
The COPY command will have better performance depending on your workload. For the
best loading performance during public preview, consider splitting your input into multiple files.
7.6.2 What is the file splitting guidance for the COPY command loading
CSV files?
Guidance on the number of files is outlined in the table below. Once the recommended
number of files is reached, performance improves with larger files. For a
simple file-splitting experience, refer to the following documentation.
DWU #Files
100 60
200 60
300 60
400 60
500 60
1,000 120
1,500 180
2,000 240
2,500 300
3,000 360
5,000 600
6,000 720
7,500 900
10,000 1200
15,000 1800
30,000 3600
7.6.3 What is the file splitting guidance for the COPY command loading
Parquet or ORC files?
There is no need to split Parquet and ORC files because the COPY command will
automatically split files. Parquet and ORC files in the Azure storage account should be
256MB or larger for best performance.
If you have a single large CSV file, COPY does not yet split files, so a single reader
will read that file. If you partition the file manually (or produce multiple files, more than
60), COPY will be able to load them in parallel.
We are working on automatically splitting a single big file (just like PolyBase does
today); this should be available by GA.
COPY comes with a more granular permission model than PolyBase, and it will become the
preferred way of loading data.
The COPY command will be generally available by the end of this calendar year (2020).
• LOB support such as (n)varchar(max) is not available in the COPY statement. This will be
available early next year.
If you're going to incrementally load your data, first make sure that you allocate larger
resource classes to loading your data. This is particularly important when loading into
tables with clustered columnstore indexes. See resource classes for further details.
We recommend using PolyBase and ADF V2 for automating your ELT pipelines into your
data warehouse.
For a large batch of updates in your historical data, consider using a CTAS to write the
data you want to keep in a table rather than using INSERT, UPDATE, and DELETE.
While data is in the staging table, perform transformations that your workload requires.
Then move the data into a production table.
Data must be broadcast or shuffled across nodes before the query can execute. This
takes time.
Skew means that most of the work for a query takes place on only a few nodes, leading
to slow performance because balanced parallel processing cannot be achieved.
Do not join on columns with mismatched data types or across incompatible table
distributions; such incompatible joins force data movement.
Avoid UPDATE and DELETE commands; use CTAS instead if you are updating more than 10% of
the records, as sketched below.
For incremental load performance, check the health of the clustered columnstore index (CCI).
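A sketch of the CTAS-plus-rename pattern mentioned above (table, column names, and the update logic are illustrative):

-- Rebuild the table with the updated values instead of running a large UPDATE
CREATE TABLE dbo.FactSales_New
WITH ( DISTRIBUTION = HASH(CustomerKey), CLUSTERED COLUMNSTORE INDEX )
AS
SELECT SaleId,
       CustomerKey,
       CASE WHEN [Status] = 'open' THEN 'closed' ELSE [Status] END AS [Status],
       Amount
FROM dbo.FactSales;

-- Swap the new table in and drop the old copy
RENAME OBJECT dbo.FactSales TO FactSales_Old;
RENAME OBJECT dbo.FactSales_New TO FactSales;
DROP TABLE dbo.FactSales_Old;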
The result-set cache persists even if a data warehouse is paused and resumed later.
Query cache is invalidated and refreshed when underlying table data or query code
changes.
Result cache is evicted regularly based on a time-aware least recently used algorithm
(TLRU).
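For reference, a hedged example of turning result-set caching on and checking for cache hits (the database name is illustrative; the ALTER must run from the master database):

ALTER DATABASE [MySqlPool] SET RESULT_SET_CACHING ON;

-- result_cache_hit = 1 indicates the result came from the cache
SELECT request_id, command, result_cache_hit
FROM sys.dm_pdw_exec_requests
WHERE result_cache_hit = 1;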
Indexed views are automatically updated when data in underlying tables are changed.
This is a synchronous operation that occurs as soon as the data is changed.
The auto-caching functionality allows the SQL DW query optimizer to consider using an
indexed view even if the view is not referenced in the query.
Supported aggregations: MAX, MIN, AVG, COUNT, COUNT_BIG, SUM, VAR, STDEV
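In a Synapse SQL pool these views are created with CREATE MATERIALIZED VIEW; a minimal sketch with illustrative names:

CREATE MATERIALIZED VIEW dbo.mvSalesByDay
WITH ( DISTRIBUTION = HASH(OrderDateKey) )
AS
SELECT OrderDateKey,
       COUNT_BIG(*)      AS RowCnt,
       COUNT_BIG(Amount) AS AmountCnt,
       SUM(Amount)       AS TotalAmount
FROM dbo.FactSales
GROUP BY OrderDateKey;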
Assess the actual business requirements, type of workload, and loading/reporting
windows to design workload isolation and importance.
Partitioning CCIs is only useful when the row count is greater than 60 million times the
number of partitions.
A non-clustered index may improve performance of joins when fact tables are joined to
very large (billion+) dimensions
If you find it is taking too long to update all of your statistics, you may want to try to be
more selective about which columns need frequent statistics updates. For example, you
might want to update date columns, where new values may be added, daily.
You will gain the most benefit by having updated statistics on columns involved in joins,
columns used in the WHERE clause and columns found in GROUP BY.
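For example (statistics and column names are illustrative):

-- Create single-column statistics on a commonly joined or filtered column
CREATE STATISTICS stat_FactSales_OrderDateKey ON dbo.FactSales (OrderDateKey);

-- Refresh just that statistics object, for example daily for a date column
UPDATE STATISTICS dbo.FactSales (stat_FactSales_OrderDateKey);

-- Or refresh every statistics object on the table (heavier)
UPDATE STATISTICS dbo.FactSales;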
However, if you need to load thousands or millions of rows throughout the day, you
might find that singleton INSERTS just can't keep up. Instead, develop your processes so
that they write to a file and another process periodically comes along and loads this file.
When you are temporarily landing data, you may find that using a heap table will make
the overall process faster. If you are loading data only to stage it before running more
transformations, loading the data into a heap table will be much faster than loading it
into a clustered columnstore table.
In addition, loading data into a temporary table is much faster than loading a table
into permanent storage. Temporary tables start with a "#" and are only accessible by the
session that created them, so they may only work in limited scenarios.
Heap tables are defined in the WITH clause of a CREATE TABLE. If you do use a
temporary table, remember to create statistics on that temporary table too.
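A minimal sketch of a temporary heap staging table loaded with CTAS (the source table and names are illustrative):

-- Land the data quickly in a session-scoped heap table, then create statistics on it
CREATE TABLE #StageSales
WITH ( DISTRIBUTION = ROUND_ROBIN, HEAP )
AS
SELECT SaleId, CustomerKey, Amount
FROM dbo.ExtSales;

CREATE STATISTICS stat_StageSales_CustomerKey ON #StageSales (CustomerKey);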
When querying a columnstore table, queries will run faster if you select only the
columns you need.
SQL pool uses resource groups as a way to allocate memory to queries. Out of the box,
all users are assigned to the small resource class, which grants 100 MB of memory per
distribution. Since there are always 60 distributions and each distribution is given a
minimum of 100 MB, system wide the total memory allocation is 6,000 MB, or just under
6 GB.
Certain queries, like large joins or loads to clustered columnstore tables, will benefit
from larger memory allocations. Some queries, like pure scans, will yield no benefit.
However, utilizing larger resource classes reduces concurrency, so you will want to take
this impact into consideration before moving all of your users to a large resource class.
See also Resource classes for workload management.
SQL pool supports loading and exporting data through several tools including Azure
Data Factory, PolyBase, and BCP. For small amounts of data where performance isn't
critical, any tool may be sufficient for your needs. However, when you are loading or
exporting large volumes of data or fast performance is needed, PolyBase is the best
choice.
PolyBase is designed to leverage the MPP (Massively Parallel Processing) architecture
and will load and export data orders of magnitude faster than any other tool. PolyBase loads can
be run using CTAS or INSERT INTO.
Using CTAS minimizes transaction logging and is the fastest way to load your data.
Azure Data Factory also supports PolyBase loads and can achieve similar performance as
CTAS. PolyBase supports a variety of file formats including Gzip files.
To maximize throughput when using gzip text files, break up files into 60 or more files
to maximize parallelism of your load. For faster total throughput, consider loading data
concurrently.
See also:
• Load data
• Guide for using PolyBase
• SQL pool loading patterns and strategies
To quickly find queries in the monitoring DMVs, using the LABEL option with your queries
can help.
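For example (the label text is illustrative):

-- Tag a query with a label...
SELECT COUNT(*) FROM dbo.FactSales
OPTION ( LABEL = 'Nightly load: row count check' );

-- ...then find it by that label in the request DMV
SELECT request_id, [status], total_elapsed_time, command
FROM sys.dm_pdw_exec_requests
WHERE [label] = 'Nightly load: row count check';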
• Load data into Azure SQL Data Warehouse by using Azure Data Factory
• Monitor your Azure Synapse Analytics SQL pool workload using DMVs
If you have feedback or suggestions for improving this data migration asset, please contact the
Data Migration Jumpstart Team ([email protected]). Thanks for your support!
Note: For additional information about migrating various source databases to Azure, see the
Azure Database Migration Guide.