Introduction to Azure Data Factory
Manuel Quintana
Agenda
• What is Azure Data Factory and Provisioning
• Integration Runtimes
• Linked Services
• Datasets
• Pipelines
• Data Flows
• Synapse Pipelines and Dataflows
Working with Azure Data Factory
Provisioning Azure Data Factory
Prerequisites
Azure Subscription
Must have an existing Azure Subscription
Azure Roles
Member of the Contributor or Owner role, or
Administrator of the Azure subscription
Resource Groups
What is a resource group?
Container that holds related resources
Can hold all resources for a solution, or only selected resources
Deploy, update, and delete them as a group
Stores metadata about the resources
Azure Storage account
What is an Azure Storage account?
General-purpose storage account
What services are available?
Tables
Queues
Files
Blobs
Azure VM Disks
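Each of these services is reachable at its own endpoint under the storage account. A minimal sketch of the endpoint pattern, assuming a hypothetical account name:

```python
# Endpoint pattern for the core services of a general-purpose
# Azure Storage account ("mystorageacct" is a hypothetical name).
ACCOUNT = "mystorageacct"

def service_endpoint(account: str, service: str) -> str:
    """Build the public endpoint URL for a storage service."""
    assert service in {"blob", "table", "queue", "file"}
    return f"https://{account}.{service}.core.windows.net"

endpoints = {s: service_endpoint(ACCOUNT, s)
             for s in ("blob", "table", "queue", "file")}
# e.g. endpoints["blob"] -> "https://mystorageacct.blob.core.windows.net"
```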
Azure SQL DB
What is Azure SQL DB?
General-purpose relational database
What structures are supported?
Relational data
JSON
Spatial
XML
Provisioning Azure Data Factory
Data Factory
The name must be globally unique.
Subscription
Resource Group
Version (V1 vs V2)
Location
Version Control
Create Resources
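Before Azure checks global uniqueness, the factory name must satisfy basic naming rules. A sketch of that local validation — the 3-63 length and letter/digit/hyphen rules are assumptions based on the commonly documented constraints:

```python
import re

# Assumed ADF naming rules: 3-63 characters, letters/digits/hyphens,
# starting and ending with a letter or digit. Global uniqueness can
# only be verified by Azure itself at creation time.
NAME_PATTERN = re.compile(r"^[A-Za-z0-9][A-Za-z0-9-]{1,61}[A-Za-z0-9]$")

def is_valid_factory_name(name: str) -> bool:
    return bool(NAME_PATTERN.match(name))

print(is_valid_factory_name("adf-demo-2024"))  # True
print(is_valid_factory_name("-bad-name"))      # False
```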
Demo
Data Factory Navigation
Let’s get started – Home Hub
Actions
Ingest (Copy Data Activity Wizard)
Orchestrate (Create Pipeline)
Transform Data (Create Data flow)
Configure SSIS Runtime
Other Areas
Discover More
Recent Resources
Feature showcase
Resources
Author Hub
Design Area
Pipelines
Datasets
Data flows
Power Query
Monitoring Hub
Monitoring Options
Dashboards
Pipeline Runs
Trigger Runs
Integration Runtimes
Data flow debug
Manage Hub
Admin Options
Connections
Source Control
Author
Security
Data Factory Resources
Integration Runtimes
Linked Services
Datasets
Integration Runtimes
Integration Runtimes (Manage Hub)
The Integration Runtime is the compute infrastructure used by ADF to provide the following data integration capabilities:
1. Data Movement (Azure IR, Self-Hosted IR)
2. SSIS package execution (Azure-SSIS IR)
Self-hosted integration runtime
Capable of running copy activities between cloud data stores and private data stores
Linked Services and Datasets
Linked Services (Manage Hub)
Defines connection information so that Data Factory can connect to the data source.
Can be reused among pipelines in a Data Factory
Datasets (Author Hub)
Named view that points to or references the data
Data Stores: Tables, Files, Folders, and Documents
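The relationship between the two resource types is easiest to see in their JSON definitions: the dataset points at data through a linked service. A sketch of the shapes, with placeholder names and connection values:

```python
# Linked service: the connection. (Connection string is a placeholder.)
linked_service = {
    "name": "AzureSqlLS",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {"connectionString": "<connection-string>"},
    },
}

# Dataset: a named view of data, referencing the linked service above.
dataset = {
    "name": "ProductTable",
    "properties": {
        "type": "AzureSqlTable",
        "linkedServiceName": {
            "referenceName": "AzureSqlLS",
            "type": "LinkedServiceReference",
        },
        "typeProperties": {"tableName": "SalesLT.Product"},
    },
}
```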
Resource Organization
Folders
Used to group pipeline resources together
Used to group dataset resources together
Used to group data flows together
Create Linked Service
Demo
Copy Activity Wizard
Copy Activity Wizard
Task cadence or schedule
Run once now
Run regularly on a schedule (creates a trigger)
Source Data Store
Choose existing data set
Create new data set
Destination data store
Choose existing data set
Create new data set
Settings
Data Integration Unit
Degree of copy parallelism
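Behind the wizard, these choices land in a Copy activity definition. A sketch of the resulting JSON shape (dataset names are placeholders; `dataIntegrationUnits` and `parallelCopies` correspond to the settings page above):

```python
# Sketch of a Copy activity definition produced by the wizard.
copy_activity = {
    "name": "CopyProducts",
    "type": "Copy",
    "inputs": [{"referenceName": "SourceDataset", "type": "DatasetReference"}],
    "outputs": [{"referenceName": "SinkDataset", "type": "DatasetReference"}],
    "typeProperties": {
        "source": {"type": "AzureSqlSource"},
        "sink": {"type": "BlobSink"},
        "dataIntegrationUnits": 4,   # compute power per copy run
        "parallelCopies": 2,         # degree of copy parallelism
    },
}
```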
Copy Activity Wizard
Demo
Pipeline Basics
Demo
Get Metadata Activity
Get Metadata activity
Purpose
Retrieve metadata information about the data
Metadata options
Item Name
Item Type
Size
Created
Last Modified
Child Items
Content MD5
Structure
Column Count
Exists
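In the activity JSON these options become entries in `fieldList`. A sketch (dataset name is a placeholder), plus the dynamic-content shape used downstream to read an output:

```python
# Sketch of a Get Metadata activity definition.
get_metadata = {
    "name": "GetFileMetadata",
    "type": "GetMetadata",
    "typeProperties": {
        "dataset": {"referenceName": "InputFolder", "type": "DatasetReference"},
        "fieldList": ["itemName", "lastModified", "childItems", "exists"],
    },
}

# Downstream activities read a result with dynamic content, e.g.:
modified_expression = "@activity('GetFileMetadata').output.lastModified"
```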
Output Parameters
Output Parameters
Outputs can be used in other activities
Output parameter names
Add dynamic content
Debug results (activity output)
Pipeline Design
Metadata Activity → Stored Procedure Activity
Get Metadata Activity
Demo
Stored Procedure Activity
Stored Procedure Activity
Purpose
Invoke a stored procedure
Utilize outputs from other activities
Supports
Azure SQL Database
Synapse Analytics (Azure SQL DW)
SQL Server Database
Limitations
No output parameters to ADF
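Outputs from a previous activity typically flow in as parameter values via dynamic content. A sketch of the activity JSON — the procedure and parameter names are placeholders:

```python
# Sketch of a Stored Procedure activity that logs output from a
# prior Get Metadata activity (all names are hypothetical).
sp_activity = {
    "name": "LogFileMetadata",
    "type": "SqlServerStoredProcedure",
    "typeProperties": {
        "storedProcedureName": "dbo.InsertFileLog",
        "storedProcedureParameters": {
            "FileName": {
                # Dynamic content referencing the prior activity's output
                "value": "@activity('GetFileMetadata').output.itemName",
                "type": "String",
            }
        },
    },
}
```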
Stored Procedure Activity
Demo
Lookup Activity
Pipeline Design
Design Pattern
Lookup Activity
Purpose
Retrieve a dataset
Supports
Any Azure Data Factory data source
Executing Stored Procedures
Executing SQL Scripts
Output parameters
Outputs
Single Value
Array / Object
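The output shape depends on the `firstRowOnly` flag: a single row object, or an array of rows under `value`. A sketch of the definition and the two access patterns (names are placeholders):

```python
# Sketch of a Lookup activity definition.
lookup_activity = {
    "name": "LookupConfig",
    "type": "Lookup",
    "typeProperties": {
        "source": {"type": "AzureSqlSource",
                   "sqlReaderQuery": "SELECT * FROM dbo.Config"},
        "dataset": {"referenceName": "ConfigTable", "type": "DatasetReference"},
        "firstRowOnly": True,
    },
}

# firstRowOnly = true  -> single object:
single_value = "@activity('LookupConfig').output.firstRow.ConfigValue"
# firstRowOnly = false -> array of rows:
row_array = "@activity('LookupConfig').output.value"
```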
Lookup Activity
Demo
If Condition Activity
Pipeline Design
If Condition Activity
Purpose
If statement functionality
Boolean expression (True/False)
Supports
ADF Expressions and Functions
If True Activities
If False Activities
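A sketch of the activity JSON: the boolean expression uses ADF functions, and each branch carries its own activity list (activity names are placeholders):

```python
# Sketch of an If Condition activity definition.
if_condition = {
    "name": "CheckFileExists",
    "type": "IfCondition",
    "typeProperties": {
        "expression": {
            "value": "@activity('GetFileMetadata').output.exists",
            "type": "Expression",
        },
        "ifTrueActivities": [{"name": "CopyFile", "type": "Copy"}],
        "ifFalseActivities": [{"name": "SendAlert", "type": "WebActivity"}],
    },
}
```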
If Condition Activity
Demo
Data Flows
Overview
What are Data Flows
Purpose
Allows for data transformations
Items
Source
Transformations
Sink
How to Execute
Debug
Data Flow Activity
Data flow logic is converted to Scala
Data flows are executed on Azure Databricks clusters
Automatic scaling-out as needed
What is Parquet?
File Format
Column-oriented data storage format (vs. row-oriented)
Benefits
Storage
Performance
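The column-vs-row difference is easy to see in plain Python: the same three records stored row-wise and column-wise. Keeping each column's values together is what lets columnar formats like Parquet compress well and skip unneeded columns (sample data is made up):

```python
# Row-oriented: records stored together (good for whole-record reads).
rows = [
    {"id": 1, "name": "Bike", "price": 499.0},
    {"id": 2, "name": "Helmet", "price": 25.0},
    {"id": 3, "name": "Lock", "price": 12.5},
]

# Column-oriented: each column stored contiguously (good for scans
# that touch few columns, and for compressing similar values).
columns = {key: [r[key] for r in rows] for key in rows[0]}

# Reading just one column touches a single contiguous list:
print(columns["price"])  # [499.0, 25.0, 12.5]
```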
Source
Available Options
Azure SQL Data Warehouse
Azure SQL Database
Cosmos DB
Azure Blob
ADLS Gen1/2
Synapse Analytics
Items
Minimum of 1 Source
Transformations
Available Options
New Branch
Join
Conditional Split
Derived Column
Lookup
Select
Sort
Filter
Etc…
Expressions
Visual Expression Builder
Certain transformations require the use of the ADF expression language
Debug
Lets you see a live, in-progress preview of the data results from the expression you are building
Sink
Available Options
Azure SQL Data Warehouse
Azure SQL Database
Cosmos DB
Azure Blob
ADLS Gen1/2
Synapse Analytics
Items
Minimum of 1 Sink
Setup
Business Scenario
• My business has requested to get a file that lists all of the products our
company sells. (Source)
• They also want the model description of the product which comes from a
different table. (Lookup & Select)
• The shipping weight needs to be included but needs to be calculated by
padding the actual weight by 10% to account for packing (Derived Column)
• We also do not need products which have a list price of $0.00 (Filter)
• Finally, we need to order the data in the file by list price, descending (Sort &
Sink)
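The same scenario can be sketched in plain Python to show what each transformation contributes (sample data is made up; the real work happens in the data flow's Spark execution):

```python
# Source: products. Lookup table: model descriptions. (Sample data.)
products = [
    {"ProductID": 1, "Name": "Bike", "ModelID": 10, "Weight": 20.0, "ListPrice": 499.0},
    {"ProductID": 2, "Name": "Sticker", "ModelID": 11, "Weight": 0.1, "ListPrice": 0.0},
    {"ProductID": 3, "Name": "Helmet", "ModelID": 12, "Weight": 1.0, "ListPrice": 25.0},
]
models = {10: "Road bike", 11: "Promo item", 12: "Safety gear"}

# Lookup & Select: join in the model description.
joined = [{**p, "ModelDescription": models[p["ModelID"]]} for p in products]
# Derived Column: pad the weight by 10% to account for packing.
derived = [{**p, "ShippingWeight": round(p["Weight"] * 1.1, 2)} for p in joined]
# Filter: drop products with a $0.00 list price.
filtered = [p for p in derived if p["ListPrice"] > 0]
# Sort & Sink: order by list price descending.
result = sorted(filtered, key=lambda p: p["ListPrice"], reverse=True)

print([p["Name"] for p in result])  # ['Bike', 'Helmet']
```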
Data Flow Overview
Demo
Scheduling a Pipeline
Triggers
Triggers
Schedule trigger
Invokes pipeline on a wall-clock schedule
Tumbling window trigger
Operates on a periodic interval, while also retaining state
Event-based trigger
Responds to an event
Schedule Trigger
Schedule Recurrence:
Every Minute
Hourly
Daily
Weekly
Monthly
Pipeline Assignment
Multiple pipelines to single trigger
Assignment performed from pipeline
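A sketch of a schedule trigger definition: the recurrence plus the pipeline references it starts (pipeline name and start time are placeholders):

```python
# Sketch of a schedule trigger definition.
schedule_trigger = {
    "name": "DailyTrigger",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",   # Minute, Hour, Day, Week, Month
                "interval": 1,
                "startTime": "2024-01-01T06:00:00Z",
            }
        },
        # One trigger can start multiple pipelines.
        "pipelines": [
            {"pipelineReference": {"referenceName": "CopyProductsPipeline",
                                   "type": "PipelineReference"}}
        ],
    },
}
```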
Schedule Triggers
Demo
Other ADF Features
Lift and Shift
Executing SSIS packages stored in Azure
Using Azure resources, not on-prem resources
Power Query
Can leverage the Power Query Editor Online to transform data in a pipeline
Flowlet
Store reusable data flow logic
Data flow libraries (preview)
Custom functions using the expression builder for reuse