Data Engineering
The objective of practical training is to gain hands-on exposure to industry and to become familiar with the working style of a technical professional, so as to adjust easily to an industrial environment. This report deals with the relevant tools, their relationships, and their general operating principles. Data engineering is a multi-disciplinary field that comprises learning, statistics, databases, visualisation, optimisation, and information theory. It is slightly younger than its sibling, data science. Data engineering is a set of operations aimed at creating interfaces and mechanisms for the flow and access of information. It takes dedicated specialists, data engineers, to maintain data so that it remains available and usable by others. In short, data engineers set up and operate the organisation's data infrastructure, preparing it for further analysis by data analysts and scientists. The first type of data engineering is SQL-focused: the work and primary storage of the data is in relational databases, and all of the data processing is done with SQL or a SQL-based language. Data engineering is a broad field, in which a data engineer transforms data into a useful format for analysis. This report provides a brief introduction to data engineering.
Table of Contents
Chapter-1
  Introduction
    What is Data Engineering?
    The data engineer role
    Data engineer responsibilities
    Data engineer vs. data scientist
Chapter-2
  SQL
    What is SQL?
    Why SQL?
    History of SQL
    Process of SQL
    SQL vs No-SQL
    Advantages of SQL
    Disadvantages of SQL
    SQL Commands
    What is SQL Server?
    SQL Server Basics
    SQL Server Views
    Advantages of views
    Managing views in SQL Server
    SQL Server Indexes
    Stored Procedures
    Using SQL constraints
Chapter-3
  Azure Data Factory
    How does it work?
    Create a data factory by using the Azure portal
    Create a data factory
    Advanced creation in the Azure portal
    Pipelines and activities in Azure Data Factory and Azure Synapse Analytics
    Creating a pipeline with UI
    Linked services in Azure Data Factory and Azure Synapse Analytics
    Linked service with UI: Azure Data Factory
    Create linked services
    Setting up ADF
    Integration Runtime
    Linked Service
    Data Set
    Source and Sink
    Simple project on ADF
References
Chapter-1
Introduction
What is Data Engineering?
A data engineer is an IT worker whose primary job is to prepare data for analytical or
operational uses. Data engineers are typically responsible for building data
pipelines that bring together information from different source systems. They integrate,
consolidate and cleanse data and structure it for use in analytics applications. They aim to
make data easily accessible and to optimise their organisation's big data ecosystem.
The amount of data an engineer works with varies with the organisation, particularly with
respect to its size. The bigger the company, the more complex the analytics architecture,
and the more data the engineer will be responsible for. Certain industries are more data-
intensive, including healthcare, retail and financial services.
Data engineers work in conjunction with data science teams, improving data transparency
and enabling businesses to make more trustworthy business decisions. They deliver data in
usable formats to the data scientists who run queries and algorithms against the information
for predictive analytics, machine learning and data mining applications. Data engineers also
deliver aggregated data to business executives, analysts and other end users so they can
analyse it and apply the results to improving business operations.
Data engineers deal with both structured and unstructured data. Structured data is
information that can be organized into a formatted repository like a database. Unstructured
data -- such as text, images, audio and video files -- doesn't conform to conventional data
models. Data engineers must understand different approaches to data architecture and
applications to handle both data types. A variety of big data technologies, such as open
source data ingestion and processing frameworks, are also part of the data engineer's toolkit.
Chapter-2
SQL
What is SQL?
SQL is short for Structured Query Language, and it is pronounced as S-Q-L or sometimes
as "sequel". This database language is mainly designed for maintaining data in relational
database management systems. It is a special tool used by data professionals for handling
structured data (data stored in the form of tables). It is also designed for stream processing
in relational data stream management systems (RDSMS).
With SQL, you can easily create and manipulate databases and access and modify table
rows and columns. The query language became an ANSI standard in 1986 and an ISO
standard in 1987.
If you want a job in the field of data science, SQL is one of the most important query
languages to learn. Big enterprises like Facebook, Instagram, and LinkedIn use SQL for
storing data in the back end.
Why SQL?
Nowadays, SQL is widely used in data science and analytics. Following are the reasons
which explain why it is widely used:
• The basic use of SQL for data professionals and SQL users is to insert, update, and delete
data in a relational database.
• SQL allows data professionals and users to retrieve data from relational database
management systems.
• It also helps them describe structured data.
• It allows SQL users to create, drop, and manipulate databases and their tables.
• It also helps in creating views, stored procedures, and functions in a relational database.
• It allows you to define data and modify the stored data in a relational database.
• It also allows SQL users to set permissions or constraints on table columns, views,
and stored procedures.
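As a minimal illustration of these basic operations, the following T-SQL sketch uses a hypothetical employees table; the table and the values in it are invented for illustration only.

-- Hypothetical table used only for illustration.
CREATE TABLE employees (
    employee_id INT PRIMARY KEY,
    full_name   VARCHAR(100) NOT NULL,
    salary      DECIMAL(10, 2)
);

-- Insert a row.
INSERT INTO employees (employee_id, full_name, salary)
VALUES (1, 'Asha Rao', 55000.00);

-- Retrieve rows.
SELECT employee_id, full_name, salary
FROM employees;

-- Update a row.
UPDATE employees
SET salary = 60000.00
WHERE employee_id = 1;

-- Delete a row.
DELETE FROM employees
WHERE employee_id = 1;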
History of SQL
"A Relational Model of Data for Large Shared Data Banks" was a paper which was
published by the great computer scientist "E.F. Codd" in 1970.
The IBM researchers Raymond Boyce and Donald Chamberlin developed SEQUEL
(Structured English Query Language) after learning from the paper by E.F. Codd,
working at IBM's San Jose Research Laboratory in the early 1970s.
At the end of the 1970s, Relational Software Inc. developed its own SQL using the
concepts of E.F. Codd, Raymond Boyce, and Donald Chamberlin. This SQL was totally
based on the RDBMS model. Relational Software Inc., now known as Oracle Corporation,
introduced Oracle V2 in June 1979, the first commercial implementation of the SQL
language. This Oracle V2 version ran on VAX computers.
Process of SQL
When we execute a SQL command on any relational database management
system, the system automatically finds the best routine to carry out our request, and
the SQL engine determines how to interpret that particular command.
Structured Query Language contains the following four components in its process:
o Query Dispatcher
o Optimization Engines
o Classic Query Engine
o SQL Query Engine
SQL vs No-SQL
• Query language: SQL database systems use the structured query language, a declarative
language; No-SQL database systems use non-declarative query languages.
• Scalability: SQL databases are vertically scalable, while No-SQL databases are
horizontally scalable.
• Data model: the database type of SQL is tables, i.e., rows and columns; the database type
of No-SQL is documents, key-value pairs, and graphs.
• Complex queries: complex queries are easily managed in a SQL database, whereas
No-SQL databases cannot handle complex queries well.
• Hierarchical data: a SQL database is not the best choice for storing hierarchical data,
while a No-SQL database is a good option for it.
• Object-relational mapping: SQL databases generally require object-relational mapping;
many No-SQL databases do not.
• Adopters: Gauges, CircleCI, and Hootsuite are among the top enterprises using SQL;
Airbnb, Uber, and Kickstarter are among the top enterprises using No-SQL.
• Examples: SQLite, MS SQL Server, Oracle, PostgreSQL, and MySQL are examples of
SQL database systems; Redis, MongoDB, HBase, BigTable, CouchDB, and Cassandra are
examples of No-SQL database systems.
Advantages of SQL
SQL provides various advantages which make it popular in the field of data science.
It is a powerful query language that allows data professionals and users to communicate
with the database. The following are the main advantages of Structured Query
Language:
1. No programming needed
SQL does not require a large number of coding lines for managing the database systems.
We can easily access and maintain the database by using simple SQL syntactical rules.
These simple rules make SQL user-friendly.
2. High-Speed Query Processing
A large amount of data can be accessed quickly and efficiently from the database by using
SQL queries. Insert, delete, and update operations on data are also performed in
less time.
3. Standardized Language
SQL follows the long-established standards of ISO and ANSI, which offer a uniform
platform across the globe to all its users.
4. Portability
The structured query language can be easily used on desktop computers, laptops, tablets,
and even smartphones. It can also be used with other applications according to the user's
requirements.
5. Interactive language
We can easily learn and understand the SQL language, and we can use it for
communicating with the database because it is a simple query language. It can also
return the answers to complex queries in a few seconds.
6. More than one Data View
The SQL language also helps in creating multiple views of the database structure for
different database users.
Disadvantages of SQL
With the advantages of SQL, it also has some disadvantages, which are as follows:
1. Cost
The operation cost of some SQL versions is high, which puts the Structured Query
Language out of reach for some programmers.
2. Interface is Complex
Another big disadvantage is that the interface of Structured Query Language is complex,
which can make it hard for SQL users to use and manage.
3. Partial Database control
Because the business rules are hidden, data professionals and users of this
query language cannot get full control of the database.
SQL Commands
• SQL commands are instructions used to communicate with the database and to perform
specific tasks, functions, and queries of data.
• SQL can perform various tasks such as creating a table, adding data to tables, dropping a
table, modifying a table, and setting permissions for users.
Types of SQL Commands
There are five types of SQL commands: DDL, DML, DCL, TCL, and DQL.
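To make these categories concrete, here is a small sketch with one representative command from each category; the students table and the user name app_user are hypothetical.

-- DDL (Data Definition Language): define database structure.
CREATE TABLE students (
    student_id INT PRIMARY KEY,
    name       VARCHAR(50) NOT NULL
);

-- DML (Data Manipulation Language): change data.
INSERT INTO students (student_id, name) VALUES (1, 'Ravi');

-- DQL (Data Query Language): read data.
SELECT student_id, name FROM students;

-- DCL (Data Control Language): grant or revoke permissions.
GRANT SELECT ON students TO app_user;

-- TCL (Transaction Control Language): group statements into transactions.
BEGIN TRANSACTION;
UPDATE students SET name = 'Ravi K' WHERE student_id = 1;
COMMIT;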
What is SQL Server?
SQLOS provides many operating system services such as memory and I/O management.
Other services include exception handling and synchronization services.
6. SQL Server Services and Tools
Microsoft provides both data management and business intelligence (BI) tools and
services together with SQL Server.
For data management, SQL Server includes SQL Server Integration Services (SSIS), SQL
Server Data Quality Services, and SQL Server Master Data Services. To develop
databases, SQL Server provides SQL Server Data Tools; to manage, deploy, and
monitor databases, it offers SQL Server Management Studio (SSMS).
For data analysis, SQL Server offers SQL Server Analysis Services (SSAS). SQL Server
Reporting Services (SSRS) provides reports and visualization of data. The Machine
Learning Services technology first appeared in SQL Server 2016; it was originally
named R Services.
7. SQL Server Editions
SQL Server has four primary editions that have different bundled services and tools. Two
editions are available free of charge:
SQL Server Developer edition, for use in database development and testing.
SQL Server Express edition, for small databases with up to 10 GB of disk storage
capacity.
For larger and more critical applications, SQL Server offers the Enterprise edition that
includes all SQL Server’s features.
SQL Server Standard edition includes a subset of the Enterprise edition's features, with
limits on the number of processor cores and the amount of memory that the server can
use.
For detailed information on the editions, check out the available SQL Server 2019
editions.
This section gave you a brief overview of SQL Server, including its architecture,
services, tools, and editions.
• ALTER TABLE ADD COLUMN – show you how to add one or more columns to
an existing table.
• ALTER TABLE ALTER COLUMN – show you how to change the definition of
existing columns in a table.
• ALTER TABLE DROP COLUMN – learn how to drop one or more columns
from a table.
• COMPUTED COLUMNS – how to use computed columns to reuse the
calculation logic in multiple queries.
• DROP TABLE – show you how to delete tables from the database.
• TRUNCATE TABLE – delete all data from a table faster and more efficiently.
• SELECT INTO – learn how to create a table and insert data from a query into it.
• RENAME TABLE – walk you through the process of renaming a table to a new
one.
• TEMPORARY TABLE – introduce you to temporary tables for storing
intermediate data temporarily in stored procedures or a database session.
• SYNONYM – explain synonyms and show you how to create synonyms
for database objects.
13. SQL Server Data Types
• SQL SERVER DATA TYPES – give you an overview of the built-in SQL Server
data types.
• BIT – store bit data i.e., 0, 1, or NULL in the database with the BIT data type.
• INT – learn about various integer types in SQL server including BIGINT, INT,
SMALLINT, and TINYINT.
• DECIMAL – show you how to store exact numeric values in the database by
using DECIMAL or NUMERIC data type.
• CHAR – learn how to store fixed-length, non-Unicode character string in the
database.
• NCHAR – show you how to store fixed-length, Unicode character strings and
explain the differences between CHAR and NCHAR data types.
• VARCHAR – store variable-length, non-Unicode string data in the database.
• NVARCHAR – learn how to store variable-length, Unicode string data in a table
and understand the main differences between VARCHAR and NVARCHAR.
• DATETIME2 – illustrate how to store both date and time data in a database.
• DATE – discuss the date data type and how to store the dates in the table.
• TIME – show you how to store time data in the database by using the TIME data
type.
• DATETIMEOFFSET – show you how to manipulate datetime with the time zone.
• GUID – learn about the GUID and how to use the NEWID() function to generate
GUID values.
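As an illustration of several of these types used together, here is a hedged sketch of a hypothetical orders table; the table and column names are invented for this example.

CREATE TABLE orders (
    order_id    INT            PRIMARY KEY,            -- integer type
    order_code  CHAR(8)        NOT NULL,               -- fixed-length, non-Unicode
    customer    NVARCHAR(100)  NOT NULL,               -- variable-length, Unicode
    amount      DECIMAL(10, 2) NOT NULL,               -- exact numeric value
    is_paid     BIT            NOT NULL,               -- bit data: 0 or 1
    order_date  DATE           NOT NULL,               -- date only
    created_at  DATETIME2      DEFAULT SYSDATETIME()   -- date and time
);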
14. Constraints
• PRIMARY KEY – introduce you to the primary key concept and show you how to
use the primary key constraint to manage a primary key of a table.
• FOREIGN KEY – introduce you to the foreign key concept and show you how to use
the FOREIGN KEY constraint to enforce the link of data in two tables.
• NOT NULL CONSTRAINT – show you how to ensure that a column does not accept
NULL values.
• UNIQUE CONSTRAINT – ensure that data contained in a column, or a group of
columns, is unique among rows in a table.
• CHECK CONSTRAINT – walk you through the process of adding logic for
checking data before storing it in tables.
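The sketch below ties these constraints together in one hypothetical table definition; it assumes a branches table with a branch_id key already exists, and all names here are invented.

CREATE TABLE accounts (
    account_id  INT            PRIMARY KEY,                    -- primary key constraint
    owner_email VARCHAR(255)   NOT NULL UNIQUE,                -- NOT NULL and UNIQUE constraints
    branch_id   INT            NOT NULL
                FOREIGN KEY REFERENCES branches (branch_id),   -- foreign key constraint
    balance     DECIMAL(12, 2) NOT NULL
                CHECK (balance >= 0)                           -- check constraint
);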
15. Expressions
• CASE – add if-else logic to SQL queries by using simple and searched CASE
expressions.
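For example, a simple searched CASE expression over the hypothetical orders table from the data types sketch above might look like this.

SELECT
    order_id,
    CASE
        WHEN amount >= 1000 THEN 'Large'
        WHEN amount >= 100  THEN 'Medium'
        ELSE 'Small'
    END AS order_size
FROM orders;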
Suppose you have written a query that joins the production.products and production.brands
tables. To get the same result set later, you could save this query into a text file,
open it, and execute it again.
SQL Server provides a better way: saving the query in the database catalog as a
view.
A view is a named query stored in the database catalog that allows you to refer to it later.
So the query above can be stored as a view using the CREATE VIEW statement as
follows:
CREATE VIEW sales.product_info
AS
SELECT
product_name,
brand_name,
list_price
FROM
production.products p
INNER JOIN production.brands b
ON b.brand_id = p.brand_id;
Later, you can reference the view in a SELECT statement as if it were a table:
SELECT * FROM sales.product_info;
When receiving this query, SQL Server executes the following query:
SELECT
    *
FROM (
    SELECT
        product_name,
        brand_name,
        list_price
    FROM
        production.products p
        INNER JOIN production.brands b
            ON b.brand_id = p.brand_id
) AS product_info;
Advantages of views
Generally speaking, views provide the following advantages:
Security
You can restrict users from accessing a table directly and instead allow them to access a
subset of its data via views.
For example, you can allow users to access customer names, phones, and emails via a view,
but restrict them from accessing bank account numbers and other sensitive information.
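A sketch of such a restricted view, assuming a hypothetical sales.customers table that also holds sensitive columns; app_user is a hypothetical database user.

CREATE VIEW sales.customer_contact
AS
SELECT
    customer_name,
    phone,
    email
FROM
    sales.customers;

-- Grant access to the view only, not to the underlying table.
GRANT SELECT ON sales.customer_contact TO app_user;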
Simplicity
A relational database may have many tables with complex relationships, e.g., one-to-one
and one-to-many, that make it difficult to navigate.
However, you can simplify complex queries with joins and conditions using a set of
views.
Consistency
Sometimes, you need to write a complex formula or logic in every query.
To keep them consistent, you can hide the complex query logic and calculations in views.
Once views are defined, you can reference the logic from the views rather than rewriting
it in separate queries.
SQL Server Indexes
Indexes are special data structures associated with tables or views that help speed up
queries. SQL Server provides two types of indexes: clustered indexes and non-clustered
indexes.
In this section, you will learn everything you need to know about indexes to come up with
a good index strategy and optimise your queries.
• Clustered Indexes – introduction to clustered indexes and learn how to create
clustered indexes for tables.
• Non Clustered Indexes – learn how to create non-clustered indexes using
the CREATE INDEX statement.
• Rename indexes – replace the current index name with the new name using
sp_rename stored procedure and SQL Server Management Studio.
• Disable indexes – show you how to disable indexes of a table to make the indexes
ineffective.
• Enable indexes – learn various statements to enable one or all indexes on a table.
• Unique indexes – enforce the uniqueness of values in one or more columns.
• Drop indexes – describe how to drop indexes from one or more tables.
• Indexes with included columns – describe how to add non-key columns to a non-
clustered index to improve the speed of queries.
• Filtered indexes – create an index on a portion of rows in a table.
• Indexes on computed columns – walk you through how to simulate function-based
indexes using the indexes on computed columns.
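As a brief illustration, here is a sketch against the hypothetical sales.customers table used above; note that a table can have only one clustered index, which is usually created automatically with its primary key.

-- Non-clustered index on a frequently searched column.
CREATE NONCLUSTERED INDEX ix_customers_email
    ON sales.customers (email);

-- Non-clustered index with included (non-key) columns.
CREATE NONCLUSTERED INDEX ix_customers_name
    ON sales.customers (customer_name)
    INCLUDE (phone, email);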
Stored Procedures
A stored procedure is a set of SQL statements with an assigned name that can be shared
and reused by multiple programs.
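A minimal sketch of the syntax, reusing the production.products table from the view example above; the procedure name is hypothetical.

CREATE PROCEDURE usp_product_list
AS
BEGIN
    SELECT product_name, list_price
    FROM production.products
    ORDER BY product_name;
END;
GO  -- batch separator in SSMS/sqlcmd

-- Execute the stored procedure:
EXEC usp_product_list;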
Chapter-3
Azure Data Factory
In the world of big data, raw, unorganised data is often stored in relational, non-relational,
and other storage systems. However, on its own, raw data doesn't have the proper context
or meaning to provide meaningful insights to analysts, data scientists, or business decision
makers.
Big data requires a service that can orchestrate and operationalise processes to refine these
enormous stores of raw data into actionable business insights. Azure Data Factory is a
managed cloud service that's built for these complex hybrid extract-transform-load (ETL),
extract-load-transform (ELT), and data integration projects.
Usage scenarios:
For example, imagine a gaming company that collects petabytes of game logs that are
produced by games in the cloud. The company wants to analyse these logs to gain insights
into customer preferences, demographics, and usage behaviour. It also wants to identify up-
sell and cross-sell opportunities, develop compelling new features, drive business growth,
and provide a better experience to its customers.
To analyze these logs, the company needs to use reference data such as customer
information, game information, and marketing campaign information that is in an on-
premises data store. The company wants to utilise this data from the on-premises data store,
combining it with additional log data that it has in a cloud data store.
To extract insights, it hopes to process the joined data by using a Spark cluster in the cloud
(Azure HDInsight), and publish the transformed data into a cloud data warehouse such as
Azure Synapse Analytics to easily build a report on top of it. They want to automate this
workflow, and monitor and manage it on a daily schedule. They also want to execute it
when files land in a blob store container.
Azure Data Factory is the platform that solves such data scenarios. It is the cloud-based
ETL and data integration service that allows you to create data-driven workflows for
orchestrating data movement and transforming data at scale. Using Azure Data Factory,
you can create and schedule data-driven workflows (called pipelines) that can ingest data
from disparate data stores. You can build complex ETL processes that transform data
visually with data flows or by using compute services such as Azure HDInsight Hadoop,
Azure Databricks, and Azure SQL Database.
Additionally, you can publish your transformed data to data stores such as Azure Synapse
Analytics for business intelligence (BI) applications to consume. Ultimately, through Azure
Data Factory, raw data can be organised into meaningful data stores and data lakes for
better business decisions.
How does it work?
A visual guide in the product documentation gives a detailed overview of the complete
Data Factory architecture.
4. Monitor
After you have successfully built and deployed your data integration pipeline, you can
monitor the scheduled activities and pipelines for success and failure rates. Azure Data
Factory has built-in support for pipeline monitoring via Azure Monitor, API, PowerShell,
Azure Monitor logs, and health panels on the Azure portal.
5. Top-level concepts
An Azure subscription might have one or more Azure Data Factory instances (or data
factories). Azure Data Factory is composed of the following key components:
• Pipelines
• Activities
• Datasets
• Linked services
• Data Flows
• Integration Runtimes
These components work together to provide the platform on which you can compose data-
driven workflows with steps to move and transform data.
6. Pipeline
A data factory might have one or more pipelines. A pipeline is a logical grouping of
activities that performs a unit of work. Together, the activities in a pipeline perform a task.
For example, a pipeline can contain a group of activities that ingests data from an Azure
blob, and then runs a Hive query on an HDInsight cluster to partition the data.
The benefit of this is that the pipeline allows you to manage the activities as a set instead
of managing each one individually. The activities in a pipeline can be chained together to
operate sequentially, or they can operate independently in parallel.
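For orientation, a pipeline is defined in JSON behind the UI. The sketch below is a minimal copy pipeline; the pipeline, activity, and dataset names are all hypothetical, and the source and sink types vary with the connectors used.

{
    "name": "ExamplePipeline",
    "properties": {
        "description": "Copy data from Blob storage to Azure SQL Database",
        "activities": [
            {
                "name": "CopyBlobToSql",
                "type": "Copy",
                "inputs": [
                    { "referenceName": "InputDataset", "type": "DatasetReference" }
                ],
                "outputs": [
                    { "referenceName": "OutputDataset", "type": "DatasetReference" }
                ],
                "typeProperties": {
                    "source": { "type": "BlobSource" },
                    "sink": { "type": "SqlSink" }
                }
            }
        ]
    }
}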
7. Mapping data flows
Create and manage graphs of data transformation logic that you can use to transform any-
sized data. You can build up a reusable library of data transformation routines and execute
those processes in a scaled-out manner from your ADF pipelines. Data Factory executes
your logic on a Spark cluster that spins up and spins down when you need it, so you never
have to manage or maintain clusters.
8. Activity
Activities represent a processing step in a pipeline. For example, you might use a copy
activity to copy data from one data store to another data store. Similarly, you might use a
Hive activity, which runs a Hive query on an Azure HDInsight cluster, to transform or
analyze your data. Data Factory supports three types of activities: data movement activities,
data transformation activities, and control activities.
9. Datasets
Datasets represent data structures within the data stores, which simply point to or reference
the data you want to use in your activities as inputs or outputs.
10. Linked services
Linked services are much like connection strings, which define the connection information
that's needed for Data Factory to connect to external resources. Think of it this way: a linked
service defines the connection to the data source, and a dataset represents the structure of
the data. For example, an Azure Storage-linked service specifies a connection string to
connect to the Azure Storage account. Additionally, an Azure blob dataset specifies the
blob container and the folder that contains the data.
11. Linked services are used for two purposes in Data Factory:
• To represent a data store that includes, but isn't limited to, a SQL Server database,
Oracle database, file share, or Azure blob storage account. For a list of supported data
stores, see the copy activity article.
• To represent a compute resource that can host the execution of an activity. For
example, the HDInsightHive activity runs on an HDInsight Hadoop cluster. For a list of
transformation activities and supported compute environments, see the transform data
article.
Create a data factory by using the Azure portal
If you don't have an Azure subscription, create a free account before you begin.
Azure roles
To learn about the Azure role requirements to create a data factory, refer to Azure Roles
requirements.
Create a data factory
1. Launch Microsoft Edge or Google Chrome web browser. Currently, Data Factory
UI is supported only in Microsoft Edge and Google Chrome web browsers.
2. Go to the Azure Data Factory Studio and choose the Create a new data factory
radio button.
3. You can use the default values to create directly, or enter a unique name and choose
a preferred location and subscription to use when creating the new data factory.
4. After creation, you can directly enter the homepage of the Azure Data Factory
Studio.
Advanced creation in the Azure portal
2. After landing on the data factories page of the Azure portal, click Create.
1. For Resource Group, take one of the following steps:
a. Select an existing resource group from the drop-down list.
b. Select Create new, and enter the name of a new resource group.
2. To learn about resource groups, see Use resource groups to manage your Azure
resources.
3. For Region, select the location for the data factory.
The list shows only locations that Data Factory supports, and where your Azure Data
Factory metadata will be stored. The associated data stores (like Azure Storage and
Azure SQL Database) and computes (like Azure HDInsight) that Data Factory uses can
run in other regions.
4. For Name, enter ADFTutorialDataFactory.
The name of the Azure data factory must be globally unique. If you see the following
error, change the name of the data factory (for example,
<yourname>ADFTutorialDataFactory) and try creating again. For naming rules for Data
Factory artifacts, see the Data Factory - naming rules article.
Pipelines and activities in Azure Data Factory and Azure Synapse Analytics
An input dataset represents the input for an activity in the pipeline, and an output dataset
represents the output for the activity. Datasets identify data within different data stores,
such as tables, files, folders, and documents. After you create a dataset, you can use it
with activities in a pipeline. For example, a dataset can be an input/output dataset of a
Copy Activity or an HDInsightHive Activity. For more information about datasets, see
Datasets in Azure Data Factory article.
Control flow activities include the following:
• Execute Pipeline activity – allows a Data Factory or Synapse pipeline to invoke another
pipeline.
• ForEach activity – defines a repeating control flow in your pipeline. It iterates over a
collection and executes specified activities in a loop, similar to the foreach looping
structure in programming languages.
• Get Metadata activity – retrieves metadata of any data in a Data Factory or Synapse
pipeline.
• Lookup activity – reads or looks up a record, table name, or value from any external
source. The output can be referenced by succeeding activities.
• Wait activity – the pipeline waits for the specified time before continuing with the
execution of subsequent activities.
• Web activity – calls a custom REST endpoint from a pipeline. You can pass datasets and
linked services to be consumed and accessed by the activity.
• Webhook activity – calls an endpoint and passes a callback URL. The pipeline run waits
for the callback to be invoked before proceeding to the next activity.
Creating a pipeline with UI
The Data Factory UI displays the pipeline editor, where you can find:
1. All activities that can be used within the pipeline.
2. The pipeline editor canvas, where activities will appear when added to the
pipeline.
3. The pipeline configurations pane, including parameters, variables, general
settings, and output.
4. The pipeline properties pane, where the pipeline name, optional description, and
annotations can be configured. This pane will also show any related items to the
pipeline within the data factory.
Linked services in Azure Data Factory and Azure Synapse Analytics
A pipeline is a logical grouping of activities. For example, you might use a copy
activity to copy data from SQL Server to Azure Blob storage. Then, you might use a Hive
activity that runs a Hive script on an Azure HDInsight cluster to process data from Blob
storage to produce output data. Finally, you might use a second copy activity to copy the
output data to Azure Synapse Analytics, on top of which business intelligence (BI)
reporting solutions are built. For more information about pipelines and activities, see
Pipelines and activities.
Now, a dataset is a named view of data that simply points to or references the data you
want to use in your activities as inputs and outputs.
Before you create a dataset, you must create a linked service to link your data store to the
Data Factory or Synapse Workspace. Linked services are much like connection strings,
which define the connection information needed for the service to connect to external
resources. Think of it this way: the dataset represents the structure of the data within the
linked data stores, and the linked service defines the connection to the data source. For
example, an Azure Storage linked service links a storage account to the service. An Azure
Blob dataset represents the blob container and the folder within that Azure Storage
account that contains the input blobs to be processed.
Here is a sample scenario. To copy data from Blob storage to a SQL Database, you create
two linked services: Azure Storage and Azure SQL Database. Then, create two datasets:
Azure Blob dataset (which refers to the Azure Storage linked service) and Azure SQL
Table dataset (which refers to the Azure SQL Database linked service). The Azure
Storage and Azure SQL Database linked services contain connection strings that the
service uses at runtime to connect to your Azure Storage and Azure SQL Database,
respectively. The Azure Blob dataset specifies the blob container and blob folder that
contains the input blobs in your Blob storage. The Azure SQL Table dataset specifies the
SQL table in your SQL Database to which the data is to be copied.
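As a sketch of how one of those definitions might look in JSON, here is a blob dataset referencing its linked service; all names are hypothetical, and the exact type properties vary by connector.

{
    "name": "AzureBlobInputDataset",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": {
            "referenceName": "AzureStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "folderPath": "inputcontainer/inputfolder"
        }
    }
}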
The relationship among pipeline, activity, dataset, and linked service in the service is:
a pipeline groups activities, an activity consumes input datasets and produces output
datasets, and each dataset relies on a linked service for its connection information.
Linked service with UI: Azure Data Factory
After selecting New to create a new linked service, you will be able to choose any of the
supported connectors and configure its details accordingly. Thereafter, you can use the
linked service in any pipelines you create.
A linked service is defined in JSON format as follows:
{
    "name": "<Name of the linked service>",
    "properties": {
        "type": "<Type of the linked service>",
        "typeProperties": {
            "<type-specific properties>"
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}
The following properties appear in the above JSON:
• name (required) – Name of the linked service. See Naming rules.
• type (required) – Type of the linked service. For example: AzureBlobStorage (data store)
or AzureBatch (compute). See the description for typeProperties.
• typeProperties (required) – The type properties are different for each data store or
compute.
Setting up ADF
ADF is a data pipeline orchestrator and ETL tool that is part of the Microsoft Azure cloud
ecosystem. ADF can pull data from the outside world (FTP, Amazon S3, Oracle, and many
more), transform it, filter it, enhance it, and move it along to another destination. In my
work for a health-data project we are using ADF to drive our data flow from raw ingestion
to polished analysis that is ready to display.
There are many good resources for learning ADF, including an introduction and a
quickstart. When I was starting out with ADF, however, I did not find a clear explanation
of the basic underlying concepts it is built upon. This article is an attempt to fill that gap.
Getting ADF to do real work for you involves the following layers of technology, listed
from the highest level of abstraction that you interact with down to the software closest to
the data.
• Pipeline – the graphical user interface where you place widgets and draw data paths
• Activity – a graphical widget that does something to your data
• Source and Sink – the parts of an activity that specify where data is coming from and
going to
• Data Set – an explicitly defined set of data that ADF can operate on
• Linked Service – the connection information that allows ADF to access a specific
outside data resource
• Integration Runtime – a glue/gateway layer that lets ADF talk to software outside of
itself
Understanding the purpose of each layer and how it contributes to an overall ADF solution
is key to using the tool well. I find it easiest to understand ADF by considering the layers
in reverse order, starting at the bottom near the data.
Integration Runtime
An integration runtime provides the gateway between ADF and the actual data or compute
resources you need. If you are using ADF to marshal native Azure resources, such as an
Azure Data Lake or Databricks, then ADF knows how to talk to those resources. Just use
the built-in integration runtime and don’t think about it — no set up or configuration
required.
But suppose you want ADF to operate on data that is stored on an Oracle Database server
under your desk, or computers and data within your company’s private network. In these
cases you must set up the gateway with a self-hosted integration runtime.
Linked Service
A linked service tells ADF how to see the particular data or computers you want to operate
on. To access a specific Azure storage account, you create a linked service for it and include
access credentials. To read/write another storage account, you create another linked service.
To allow ADF to operate on an Azure SQL database, your linked service will state the
Azure subscription, server name, database name, and credentials.
Data Set
A data set makes a linked service more specific; it describes the folder you are using within
a storage container, or the table within a database, etc.
A data set might, for example, point to one directory in one container in one Azure storage
account, with the container and directory names supplied as parameters. Note how a data
set references a linked service. A data set can also specify that the data is zipped, which
allows ADF to automatically unzip the data as you read it.
Activity
Activities are the GUI widgets within Data Factory that do specific kinds of data movement
or transformation. There is a CopyData activity to move data, a ForEach activity to loop
over a file list, a Filter activity that chooses a subset of files, etc. Most activities have a
source and a sink.
Pipeline
An ADF pipeline is the top-level concept that you work with most directly. Pipelines are
composed of activities and data flow arrows. You program ADF by creating pipelines. You
get work done by running pipelines, either manually or via automatic triggers. You look at
the results of your work by monitoring pipeline execution.
For example, a pipeline might take inbound data from an initial Data Lake folder, move it
to cold archive storage, get a list of the files, loop over each file, copy the files to an
unzipped working folder, and then apply an additional filter by file type.
Simple project on ADF
The project below sets up a linked service for Azure Table Storage and builds a small
pipeline that uses it.
We should see another side panel that allows us to configure our table storage connection
information.
Let's name our service AzureTableStorage.
Next, we select our Storage account name for use. If you didn't set one up previously,
you'll need to do that first.
Finally, click the Test connection button to ensure it is working and then click
the Create button.
Test
Finally, we test that our new pipeline works. For this we'll use the debug feature.
Click the Debug button.
This will show the output tab and allow us to see the status of our pipeline run.
References
1. Microsoft Learn
2. Wikipedia.com
3. Google.com
4. Databricks.com