06. Introduction to Data Factory
Learning Objectives
• Azure Data Factory
• Considerations
• Components
• Demos
What can you do in Azure Data Factory?
Definition
Azure Data Factory (ADF) is a hybrid data integration service
that enables you to quickly and efficiently create automated
data pipelines – without having to write any code!
Azure Data Factory
• Hybrid Data Integration Service
• Simplifies ETL at scale
• Enables modern data integration
• Drag and drop interface
• Over 80 connectors available
• Move, transform and save data
• Managed Service
• Create data-driven workflows
• Orchestrate and automate data movement
• Transform and store data
• Operationalize the process
• ETL or ELT scenarios
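As a rough illustration of "managed service" and "operationalize the process", the sketch below uses the azure-mgmt-datafactory Python SDK to create a factory and trigger a pipeline run. The subscription ID, resource group, factory, and pipeline names are placeholders, and the pipeline itself is defined in the later sketches; this is not the only way to work with ADF (the portal's drag-and-drop UI needs no code at all).

```python
# Minimal sketch: create a Data Factory and trigger a pipeline run.
# Subscription, resource group, factory and pipeline names are illustrative.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

subscription_id = "<subscription-id>"
rg_name = "rg-adf-demo"          # assumed existing resource group
df_name = "adf-demo-factory"

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Create (or update) the Data Factory itself -- the managed service
adf_client.factories.create_or_update(rg_name, df_name, Factory(location="eastus"))

# Later, once a pipeline has been defined, kick off a run and poll its status
run = adf_client.pipelines.create_run(rg_name, df_name, "CopyPipeline", parameters={})
status = adf_client.pipeline_runs.get(rg_name, df_name, run.run_id)
print(status.status)
```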
Data Factory Considerations
Azure Data Factory Components
Data Factory Pipeline
• Data Factories can contain one or more pipelines
• Logical group of Activities
• Manage Activities as a set
• One pipeline can have one or more activities
Azure Data Factory Activities
• Represents a processing step in a pipeline
• Actions to perform on data
• Ingest data
• Transform data
• Store data
• Can be linked
• Execute sequentially or run in parallel
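A small sketch of linking activities, continuing the SDK example above. Wait activities stand in for real processing steps; all names are illustrative. An activity with depends_on waits for its predecessor, while an activity with no dependency is free to run in parallel.

```python
# Sketch: chaining activities with depends_on (sequential vs. parallel).
from azure.mgmt.datafactory.models import (
    PipelineResource, WaitActivity, ActivityDependency,
)

step_a = WaitActivity(name="StepA", wait_time_in_seconds=5)
step_b = WaitActivity(
    name="StepB",
    wait_time_in_seconds=5,
    # StepB starts only after StepA succeeds -> sequential execution
    depends_on=[ActivityDependency(activity="StepA", dependency_conditions=["Succeeded"])],
)
# No depends_on -> StepC can run in parallel with StepA
step_c = WaitActivity(name="StepC", wait_time_in_seconds=5)

pipeline = PipelineResource(activities=[step_a, step_b, step_c])
adf_client.pipelines.create_or_update(rg_name, df_name, "ChainedPipeline", pipeline)
```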
Activity Types
• Data movement activities
• Copy data between data stores located on-premises and in the cloud
• Data stores – Blob Storage, Cosmos DB, Amazon Redshift, MariaDB, etc.
• Data transformation activities
• Transform and enrich data, e.g. with Hive, Pig, MapReduce, Spark, or Databricks
• Control activities
• Control pipeline flow e.g. ForEach, Web
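The most common data movement activity is the Copy activity. Below is a sketch of one wired into a pipeline, continuing the earlier SDK example; the dataset names refer to datasets that are defined later and are purely illustrative.

```python
# Sketch of a data movement (Copy) activity copying blob data to blob storage.
from azure.mgmt.datafactory.models import (
    PipelineResource, CopyActivity, DatasetReference, BlobSource, BlobSink,
)

copy = CopyActivity(
    name="CopyBlobToBlob",
    inputs=[DatasetReference(reference_name="InputDataset", type="DatasetReference")],
    outputs=[DatasetReference(reference_name="OutputDataset", type="DatasetReference")],
    source=BlobSource(),   # read side of the copy
    sink=BlobSink(),       # write side of the copy
)

adf_client.pipelines.create_or_update(
    rg_name, df_name, "CopyPipeline", PipelineResource(activities=[copy])
)
```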
Data Flows
• Data Flow is a new feature of Azure Data Factory (ADF) that allows
you to develop graphical data transformation logic that can be
executed as activities within ADF pipelines.
• Two types:
• Mapping
• Wrangling
Dataset
• Simply points to or references the data
• References the data used in an activity
• Files
• Folders
• Documents
• Tables
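A sketch of a dataset that simply points at a file in blob storage. It assumes a linked service named "BlobStorageLinkedService" has already been created (see the linked service sketch in the next section); container, folder, and file names are illustrative.

```python
# Sketch: a dataset referencing a blob file through an existing linked service.
from azure.mgmt.datafactory.models import (
    DatasetResource, AzureBlobDataset, LinkedServiceReference,
)

input_ds = DatasetResource(
    properties=AzureBlobDataset(
        linked_service_name=LinkedServiceReference(
            reference_name="BlobStorageLinkedService", type="LinkedServiceReference"
        ),
        folder_path="input-container/data",
        file_name="sales.csv",
    )
)
adf_client.datasets.create_or_update(rg_name, df_name, "InputDataset", input_ds)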
Linked Service
• Similar to a connection string
• Represents the connection information needed to connect to external
resources
• Data stores, e.g. Azure SQL Server
• Compute resources, e.g. a Spark cluster
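A sketch of a linked service holding the connection information for an Azure Storage account, again using the Python SDK; the connection string value is a placeholder and should come from a secure source such as Key Vault in practice.

```python
# Sketch: a linked service carrying the connection info for Azure Storage.
from azure.mgmt.datafactory.models import (
    LinkedServiceResource, AzureStorageLinkedService, SecureString,
)

storage_ls = LinkedServiceResource(
    properties=AzureStorageLinkedService(
        connection_string=SecureString(
            value="DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
        )
    )
)
adf_client.linked_services.create_or_update(
    rg_name, df_name, "BlobStorageLinkedService", storage_ls
)
```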
ADF Components
Integration Runtimes
• Provides fully managed, serverless compute infrastructure
• You don't have to worry about infrastructure provisioning, software
installation, patching, or capacity scaling
• Pay only for the duration of actual use
• Acts as the bridge between an activity and a linked service
• The activity defines the action
• The linked service defines the location
Integration Runtimes
• Data Integration Capabilities
• Data Flow
• Data Movement
• Format conversion, column mapping, serialization/deserialization, etc.
• Provides the native compute to move data between cloud data stores in a secure,
reliable, and high-performance manner
• Activity dispatch (e.g. Databricks Notebook, HDInsight Hive, Pig, and
Spark activities, Stored Procedure, Azure Data Lake Analytics U-SQL activity)
• SSIS Package execution
Types of Integration Runtime
Specifies the infrastructure on which activities run
1. Azure Integration Runtime
• Works on public networks
• Responsible for data flows, data movement, and activity dispatch
2. Self-hosted Integration Runtime
• Works on public and private networks
• Provides data movement and activity dispatch capabilities
• Needs to be installed on an on-premises machine or a virtual machine inside a private network
3. SSIS Integration Runtime
• Supports SSIS package execution
• Works on public and private networks
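For the self-hosted case, the runtime must be registered in the factory and then installed on the on-premises machine using its authentication key. A minimal sketch with the Python SDK, continuing the earlier example; the runtime name and description are illustrative.

```python
# Sketch: registering a self-hosted integration runtime and fetching the
# authentication key used when installing the IR on the on-premises node.
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeResource, SelfHostedIntegrationRuntime,
)

ir = IntegrationRuntimeResource(
    properties=SelfHostedIntegrationRuntime(description="Runtime for on-premises sources")
)
adf_client.integration_runtimes.create_or_update(rg_name, df_name, "SelfHostedIR", ir)

keys = adf_client.integration_runtimes.list_auth_keys(rg_name, df_name, "SelfHostedIR")
print(keys.auth_key1)  # entered during IR installation on the private-network machine
```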
Integration Runtimes
IR Type       Public Network                                   Private Network
Azure         Data Flow, Data movement, Activity dispatch      –
Self-hosted   Data movement, Activity dispatch                 Data movement, Activity dispatch
Azure-SSIS    SSIS package execution                           SSIS package execution