ADF Pipeline Management and File Handling Guide

ADF Scenarios

If a pipeline fails, how do you resume it from the failed activity instead of from the start?

Go to the Monitor tab, open the pipeline run, and choose one of these options:

Rerun – reruns the entire pipeline.
Rerun from activity – click a specific activity and rerun from that point.
Rerun from failed activity – reruns the pipeline starting from the activity that failed.

How do you separate bad records during a copy activity from ADLS to SQL?

To separate bad records while copying from Azure Data Lake Storage (ADLS) to a SQL table in Azure Data Factory (ADF), follow these steps:
1. Select the Copy activity.
2. In the Settings tab, enable the Fault tolerance option and check "Skip incompatible rows." This identifies and skips rows that do not conform to the destination schema, separating the bad records from the rest.
3. If you want to store the bad records, check Enable logging and provide the storage account where the logs will be written.
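As a rough sketch of what the steps above look like in the Copy activity's JSON definition (the linked service name `LS_LogStorage` and the `badrecords` path are placeholders, not part of the original walkthrough):

```json
{
    "name": "CopyADLStoSQL",
    "type": "Copy",
    "typeProperties": {
        "enableSkipIncompatibleRow": true,
        "redirectIncompatibleRowSettings": {
            "linkedServiceName": {
                "referenceName": "LS_LogStorage",
                "type": "LinkedServiceReference"
            },
            "path": "badrecords"
        }
    }
}
```

Skipped rows are written as log files under the given path, so they can be inspected and reprocessed later.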

Follow me on LinkedIn – Shivakiran kotur


How do you get the latest files using ADF?

To get the latest file from a folder in Azure Data Factory, follow these steps:
1. Use a Get Metadata activity and select 'Child items' in the field list.
2. Add a ForEach activity and pass the child items to it.
3. Inside the ForEach, add a second Get Metadata activity (GetMetadata2) and select 'Last modified' in its field list.
4. At the pipeline level, under the Variables section, create two variables, LatestFileName and PreviousModifiedDate, and assign initial values.
5. Add an If Condition activity and compare dates with an expression such as
@greater(formatDateTime(activity('GetMetadata2').[Link],
'yyyyMMddHHmmss'), formatDateTime(variables('PreviousModifiedDate'),
'yyyyMMddHHmmss')).
6. In the If Condition's True branch, add a Set Variable activity to update the variable with @activity('GetMetadata2').[Link].
7. Finally, add a Copy activity after the ForEach to copy the latest file to the desired output folder.
These steps ensure that only the latest files are processed, optimizing your data workflow.
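Assuming the second Get Metadata activity exposes its result as `lastModified` in the activity output (the exact property path depends on your pipeline), the If Condition's JSON might look roughly like this:

```json
{
    "name": "IsNewerFile",
    "type": "IfCondition",
    "typeProperties": {
        "expression": {
            "value": "@greater(formatDateTime(activity('GetMetadata2').output.lastModified, 'yyyyMMddHHmmss'), formatDateTime(variables('PreviousModifiedDate'), 'yyyyMMddHHmmss'))",
            "type": "Expression"
        }
    }
}
```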

How do you delete files older than 30 days?
To delete files older than 30 days in Azure Data Factory, follow these steps:

1. Use a Get Metadata activity, selecting 'Child items' in the field list. Filter by last-modified date using the expression @adddays(utcnow(), -30) as the End time (UTC).
2. Add a ForEach activity with the items expression @activity('GetMetadata').[Link].
3. Inside the ForEach, add a Delete activity and configure it to delete each matching file.
These steps streamline the cleanup of outdated files and keep your data storage optimized.
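A minimal sketch of the ForEach wrapping the Delete activity, assuming the Get Metadata activity is named 'GetMetadata' and exposes its file list as `childItems` (both names are illustrative):

```json
{
    "name": "ForEachOldFile",
    "type": "ForEach",
    "typeProperties": {
        "items": {
            "value": "@activity('GetMetadata').output.childItems",
            "type": "Expression"
        },
        "activities": [
            { "name": "DeleteOldFile", "type": "Delete" }
        ]
    }
}
```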

How do you move files to an archive folder in ADLS?


When managing data in Azure, efficiently moving files to an archive is crucial. Here's a streamlined approach:
1. Get Metadata: retrieve the list of objects, including files and subfolders, from the source folder. Note that it does not retrieve objects recursively.
2. Filter: refine the Get Metadata output to select files only.
3. ForEach: iterate over the filtered file list, passing each file to the subsequent activities.
4. Copy: transfer each file from the source to the destination store.
5. Delete: remove the copied file from the source store.
By following these steps, you ensure a smooth and efficient archival process. Happy data management!
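The five steps above chain together roughly as the following pipeline fragment (activity names are illustrative, and dataset references are omitted for brevity):

```json
{
    "activities": [
        { "name": "GetFileList", "type": "GetMetadata" },
        { "name": "FilesOnly", "type": "Filter",
          "dependsOn": [ { "activity": "GetFileList", "dependencyConditions": [ "Succeeded" ] } ] },
        { "name": "ForEachFile", "type": "ForEach",
          "dependsOn": [ { "activity": "FilesOnly", "dependencyConditions": [ "Succeeded" ] } ],
          "typeProperties": {
              "activities": [
                  { "name": "CopyToArchive", "type": "Copy" },
                  { "name": "DeleteFromSource", "type": "Delete",
                    "dependsOn": [ { "activity": "CopyToArchive", "dependencyConditions": [ "Succeeded" ] } ] }
              ]
          }
        }
    ]
}
```

Making the Delete depend on a successful Copy ensures a file is never removed from the source before its archived copy exists.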

Scenario: Move multiple files from one container to another


Managing multiple files in Azure? Here's an efficient way to move them from one container to another:

1. Fetch Metadata: use a Get Metadata activity to retrieve the file list into an array.
2. ForEach Activity: pass the Get Metadata output into a ForEach activity.
3. Copy Activity: within the ForEach loop, use a Copy activity to transfer each file from the source container (Blob) to the sink container (ADLS).
This approach ensures smooth and organized file transfers.

Inside the ForEach, add a Copy activity. Because the file name is not fixed but dynamic, the source dataset must be parameterized: create the dataset, open it, and add a file-name parameter.

In the Copy activity, select the parameterized dataset and define a dynamic value for the file path parameter.

Create a sink dataset for ADLS and link it to a linked service (create a new one if none exists).

Copy only files starting with "customer" and add a date to the sink container name in ADLS.

1. Fetch Metadata: use a Get Metadata activity to retrieve the file list into an array.
2. Filter Activity: add a Filter activity after the Get Metadata activity to keep only files starting with "customer".
3. ForEach Activity: pass the Filter activity's output into a ForEach activity.
4. Copy Activity: within the ForEach loop, use a Copy activity to transfer each file from the source (Blob) to the sink container (ADLS).

The Filter activity takes the Get Metadata output as its input:
Items → @activity('Get Metadata1').[Link]
Condition → @startsWith(item().name, 'cust')

The ForEach items are then set to @activity('Filter1').[Link].
For the Copy activity, the source remains the same; for the sink, create a dynamic container name in the dataset with the expression:
@concat('customer-', formatDateTime(utcnow(), 'yyyyMMddHHmmss'))


Copy files based on filenames from an input to an output container.

Step 1: Create Datasets

Input Container Dataset: define datasets for the CSV, JSON, and Parquet files in the input container.
Output Container Dataset: define the dataset for the output container (DS_sink).

Step 2: Parameterize the Source Dataset

Create a parameter in the source dataset for SrcContainer.
Pass the parameter dynamically in the file path of the dataset.


Step 3: Set Up the Get Metadata Activity
Add a new dataset for the Get Metadata activity.
In the Get Metadata activity, select Child items under the field list to get all filenames in the container. The output includes:
Exists: indicates whether files are present.
Item type: the object type (folder).
Item name: the container name.

Step 4: Apply the Filter Activity

Link the Get Metadata activity to a Filter activity.
In the Filter activity, set the items dynamically from Child items.
Set the filter condition to startswith([Link], 'cp') and endswith([Link], '.csv') using JSON data parsing.

Step 5: Implement the ForEach and Copy Activities

Add a ForEach activity linked to the Filter activity's output.
Inside the ForEach loop, add a Copy activity with the source and sink datasets.
Pass the filename dynamically using item().name.
Additional Filter Condition
Add a second filter condition for JSON and Parquet files so that only those file types are processed.
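For the additional filter, a condition along these lines would keep only JSON and Parquet files (shown as it would appear in the Filter activity definition; the `item().name` property path is the usual one but may differ in your pipeline):

```json
{
    "condition": {
        "value": "@or(endswith(item().name, '.json'), endswith(item().name, '.parquet'))",
        "type": "Expression"
    }
}
```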



Pipeline → FULL LOAD ACTIVITY
1. How to copy multiple tables from an Azure SQL database to Blob storage.
2. How to avoid running the Copy activity if a table is not available on the source side.
3. How to create and maintain a metadata table.
4. How to log pipeline execution details to an Azure SQL database on both failure and success of the pipeline.
5. How to send notifications using a Logic App on both failure and success of the pipeline.

Scenario → Source: Azure SQL → 100 tables on 1st Oct

On 11th → 5 new tables added
Destination: Blob → 100 tables
On 1st Oct → full load → 100 tables
On 11th → incremental (delta) load → only the 5 new tables should be copied

REQUIRED COMPONENTS
1. Azure Data Factory
• 4 datasets
• 2 linked services
• Pipeline
• Lookup activity, ForEach activity (Get Metadata activity, If activity (True (Copy activity → Success SP, Failure SP) or False)), Web activity (success or failure)
2. Azure SQL Server
3. Storage accounts
Lookup activity → create a dataset referencing the metadata table (which holds the info of all tables to load).
Get Metadata activity → dynamically checks whether the table exists.
ForEach loop → wraps the If activity to evaluate the true/false condition, then links to the Stored Procedure activity for success or failure.
If the table does not exist → False; if it exists → True → the Copy activity runs and either succeeds or fails.
Logic Apps → send the notification by email.
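A sketch of the Lookup activity definition for this pipeline; the query filters the metadata table to active entries only (ISDISABLE = 0), and First row only is disabled so all rows are returned for the ForEach:

```json
{
    "name": "Lookup1",
    "type": "Lookup",
    "typeProperties": {
        "source": {
            "type": "AzureSqlSource",
            "sqlReaderQuery": "SELECT SCHEMANAME, TABLENAME, BLOBCONTAINER FROM METADATA WHERE ISDISABLE = 0"
        },
        "firstRowOnly": false
    }
}
```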

STEPS INVOLVED
Step 1 → Script execution in the database
This involves creating the tables and inserting data, creating the metadata table (with info for all tables inserted, ISDISABLE = 0 for active tables and 1 for disabled ones), and creating the stored procedure for logs.
Create the SQL database in Azure, set the firewall, connect with SSMS using your credentials, and execute the following scripts in the database.
→SQL SCRIPTS:
→TABLE CREATIONS
CREATE TABLE PRODUCT(PID INT, PNAME VARCHAR(50))
CREATE TABLE SELLS(SELLSID INT, STORENAME VARCHAR(50))
CREATE TABLE TRANSACTIONS(TID INT, TAMOUNT BIGINT)
CREATE TABLE CUST(CID INT, CLOCATION VARCHAR(50))
CREATE TABLE EMP(EMPID INT, EMPNAME VARCHAR(50))

--INSERT TABLES

INSERT INTO PRODUCT VALUES(111,'LAPTOP'),(222,'MOBILE')


INSERT INTO SELLS VALUES(1,'FLIPKART'),(222,'AMAZON')
INSERT INTO TRANSACTIONS VALUES(101,1000),(202,2000)
INSERT INTO CUST VALUES(1000,'BANGLORE'),(2000,'HYDERABAD')
INSERT INTO EMP VALUES(888,'RAMA'),(999,'KRISHNA')

--METADATA TABLE CREATION


CREATE TABLE METADATA
(SCHEMANAME VARCHAR(50),
TABLENAME VARCHAR(50),
BLOBCONTAINER VARCHAR(50),
ISDISABLE INT)

--INSERT DATA INTO METADATA TABLE (ALL THE TABLES INFO INSERTED)
INSERT INTO METADATA VALUES
('DBO','PRODUCT','PRODUCTOUTPUT',0),
('DBO','SELLS','SELLSOUTPUT',0),
('DBO','TRANSACTIONS','TRANSACTIONSOUTPUT',0),
('DBO','CUST','CUSTOUTPUT',0),
('DBO','EMP','EMPOUTPUT',0),
('DBO', 'ORDERS','ORDERSOUTPUT',1);

--CREATE A TABLE FOR AUDITING PURPOSE.


CREATE TABLE [DBO].[PIPELINE_LOG](
DATAFACTORY_NAME VARCHAR(100) NULL,
PIPELINENAME VARCHAR(100) NULL,
RUNID VARCHAR(100) NULL,
SOURCE VARCHAR(100) NULL,
DESTINATION VARCHAR(100) NULL,
TRIGGERID VARCHAR(100) NULL,
TRIGGERTYPE VARCHAR(100) NULL,
TRIGGERNAME VARCHAR(100) NULL,
TRIGGERTIME VARCHAR(100) NULL,
ROWSCOPIED INT NULL,
ROWSREAD INT,
NO_PARALLELCOPIES INT NULL,
COPYDURATION_IN_SECS INT NULL,
EFFECTIVEINTERGATIONRUNTIME VARCHAR(100) NULL,
SOURCE_TYPE VARCHAR(100) NULL,
SINK_TYPE VARCHAR(100) NULL,
COPYACTIVITY_START_TIME DATETIME NULL,
COPYACTIVITY_END_TIME DATETIME NULL,
EXECUTION_STATUS VARCHAR(100) NULL,
EXECUTION_STATUS_CODE VARCHAR(100) NULL,
ERROR_MESSAGE VARCHAR(100) NULL
);
--STORED PROCEDURE: FOR PIPELINE LOGS
CREATE PROCEDURE PIPELINE_LOG_USP
(
@DATAFACTORY_NAME VARCHAR(100) ,
@PIPELINENAME VARCHAR(100),
@RUNID VARCHAR(100) ,
@SOURCE VARCHAR(100),
@DESTINATION VARCHAR(100) ,
@TRIGGERID VARCHAR(100),
@TRIGGERTYPE VARCHAR(100) ,
@TRIGGERNAME VARCHAR(100) ,
@TRIGGERTIME VARCHAR(100) ,
@ROWSREAD INT,
@ROWSCOPIED INT ,
@NO_PARALLELCOPIES INT,
@COPYDURATION_IN_SECS INT,
@EFFECTIVEINTERGATIONRUNTIME VARCHAR(100),
@SOURCE_TYPE VARCHAR(100) ,
@SINK_TYPE VARCHAR(100) ,
@COPYACTIVITY_START_TIME DATETIME ,
@COPYACTIVITY_END_TIME DATETIME ,
@EXECUTION_STATUS VARCHAR(100) ,
@EXECUTION_STATUS_CODE VARCHAR(100),
@ERROR_MESSAGE VARCHAR(100) )
AS
BEGIN
INSERT INTO PIPELINE_LOG VALUES(
@DATAFACTORY_NAME ,
@PIPELINENAME,
@RUNID ,
@SOURCE,
@DESTINATION ,
@TRIGGERID ,   -- values listed in PIPELINE_LOG column order
@TRIGGERTYPE,
@TRIGGERNAME,
@TRIGGERTIME,
@ROWSCOPIED ,
@ROWSREAD,
@NO_PARALLELCOPIES ,
@COPYDURATION_IN_SECS,
@EFFECTIVEINTERGATIONRUNTIME ,
@SOURCE_TYPE ,
@SINK_TYPE ,
@COPYACTIVITY_START_TIME ,
@COPYACTIVITY_END_TIME ,
@EXECUTION_STATUS ,
@EXECUTION_STATUS_CODE,
@ERROR_MESSAGE )
END

Step 2 → Create an Azure Data Factory for the pipeline.

Step 3 → Create a storage account.
Under the storage account → Security + networking → Access keys (copy the connection string for the key vault secret).
Step 4 → Create a Key Vault service → go to Secrets → Generate/Import → generate a secret for Blob storage.
For the value, paste the connection string copied from the storage account's access keys.
Similarly, create one for SQL Server: go to the SQL database → Connection strings → [Link] → SQL auth → copy the string, replace the password placeholder, and use it to create the secret in the key vault.
Step 5 → Create linked services for SQL and Blob in ADF.
Azure Key Vault is a cloud service for securely storing and accessing secrets. A secret is anything that you want to tightly control access to, such as API keys, passwords, certificates, or cryptographic keys.

First create a linked service for the key vault: pass the subscription and key vault details created earlier, test the connection, and create it.
Second, create the linked service for the Azure SQL database, selecting Azure Key Vault instead of Connection String.
Note → if the secret key does not load, follow these steps:

Go to your key vault → Access policies → + Create → select the Secret permissions → Principal → type your data factory name and create it. For more reference, see the link below.
[Link]
Step 6 → Create the dataset for the Lookup activity

This dataset is used by the Lookup activity to fetch the list of tables; the table name is left empty because it is fetched dynamically.
Step 7 → In ADF, add the Lookup activity.

In its output we can see the count value and the record info of the query in JSON form.

Step 8 → Add a ForEach activity and pass the output of the Lookup activity to it dynamically, i.e. @activity('Lookup1').[Link]
Step → Go inside the ForEach activity.

Add a Get Metadata activity and create a dataset for it. Get Metadata is used to check whether the table exists.
For the Get Metadata dataset, create two parameters, for the source schema and the table name.

The Schema_src value is passed dynamically from the Lookup activity's JSON output.
Give the table name and schema name from the JSON (case sensitive).

In the field list → Arguments → select EXISTS.

Step → Add an If activity next and pass it the output of the Get Metadata activity, i.e. check the true condition:

activity('Get Metadata1').[Link]
Inside the True branch of the If activity, add a Copy activity.

Create a SQL dataset for the copy source and a sink dataset.

Go to the Copy activity and pass the variable.

Similarly, go to Sink_ds and create a parameter for desti_container.

In the Copy activity, pass the container column name dynamically, following the JSON output.

Step → Next, add Stored Procedure activities following the Copy activity: one for success and one for failure.
In the Stored Procedure activity → linked service → test the connection → select the SP → click Import to import all the parameters of the procedure.
Note → check that each variable matches; all parameters must be passed dynamically to the stored procedure's parameters (refer to the Microsoft documentation for the main ones).
IMPORTANT → the Copy activity fails if the container name is in upper case, so convert it to lower case (container names must be lower case):

@toLower(item().BLOBCONTAINER)
Set this in the sink dataset properties above.

(All dynamic expressions are case sensitive and should follow the JSON structure.)
Parameter passed
CopyActivity_End_Time: @activity('Copy data1').ExecutionEndTime
CopyActivity_Start_Time: @activity('Copy data1').ExecutionStartTime
copyDuration_in_secs: @activity('Copy data1').[Link]
Datafactory_Name: @pipeline().DataFactory
Destination: @item().blobcontainer
effectiveIntergationRuntime: @activity('Copy data1').[Link]
Error_Message: @activity('Copy data1').error
Execution_Status: @activity('Copy data1').status
Execution_Status_code: @activity('Copy data1').statuscode
No_ParallelCopies: @activity('Copy data1').[Link]
PipelineName: @pipeline().Pipeline
RowsCopied: @activity('Copy data1').[Link]
RowsRead: @activity('Copy data1').[Link]
Sink_Type: @activity('Copy data1').[Link][0].[Link]
Source: @item().tablename
Source_Type: @activity('Copy data1').[Link][0].[Link]
TriggerId: @pipeline().TriggerId
TriggerName: @pipeline().TriggerName
TriggerTime: @pipeline().TriggerTime
triggertype:@pipeline().TriggerType
Step → Add an email notification to the pipeline. Create a Logic App service → open the designer → search for the HTTP request trigger → add a send-email action → add its parameters (recipient address, body, subject) → authenticate with your email account → save → a URL is generated → copy it → in the pipeline, add a Web activity → connect it to the failure path → paste the URL.
Save the app and copy the URL.
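A sketch of the Web activity that calls the Logic App; the URL is the one generated by the Logic App's HTTP trigger (shown here as a placeholder), and the body fields are illustrative assumptions:

```json
{
    "name": "NotifyOnFailure",
    "type": "WebActivity",
    "typeProperties": {
        "url": "<logic-app-http-trigger-url>",
        "method": "POST",
        "body": {
            "pipeline": "@{pipeline().Pipeline}",
            "status": "Failed"
        }
    }
}
```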
Incremental Data Loading using Azure Data Factory
This section covers the process for incrementally loading data from an on-premises SQL Server to an Azure SQL database. Once the full data set is loaded from a source to a sink, the source data may later be added to or modified. In that case it is not always possible, or recommended, to refresh all data again from source to sink. Incremental load methods help reflect changes in the source to the sink every time the source data is modified.

There are different methods for incremental data loading. I will discuss the step-by-
step process for incremental loading, or delta loading, of data through a watermark.

Watermark
A watermark is a column in the source table that holds the last-updated timestamp or an incrementing key. After every iteration of data loading, the maximum value of the watermark column for the source table is recorded. When the next iteration starts, only the records with a watermark value greater than the last recorded one are fetched from the data source and loaded into the data sink. The latest maximum value of the watermark column is recorded at the end of that iteration.

The workflow for this approach can be depicted with the following diagram (as given in the Microsoft documentation):

Here, I discuss the step-by-step implementation process for incremental loading of data.

Step 1: Table creation and data population on premises

In the on-premises SQL Server, I create a database first. Then I create a table named dbo.Student, insert 3 records, and verify them. This table's data will be copied to the Student table in an Azure SQL database. The updateDate column of the Student table will be used as the watermark column.

CREATE TABLE [dbo].[Student](


[studentId] [int] IDENTITY(1,1) NOT NULL,
[studentName] [varchar](100) NULL,
[stream] [varchar](50) NULL,
[marks] [int] NULL,
[createDate] [datetime2] NULL,
[updateDate] [datetime2] NULL
) ON [PRIMARY]
GO
INSERT INTO [Link]
(studentName,stream,marks,createDate,updateDate)
VALUES
('xxx', 'CSE',90,GETDATE(), GETDATE()),
('yyy', 'CSE',90,GETDATE(), GETDATE()),
('zzz', 'CSE',90,GETDATE(), GETDATE())
SELECT studentid, studentName,stream,marks,createDate,updateDate
FROM [Link]

Step 2: Table creation and data population in Azure

I create an Azure SQL Database through the Azure portal and connect to it through SSMS. Once connected, I create a table named Student with the same structure as the Student table created in the on-premises SQL Server. The studentId column in this table is not defined as IDENTITY, as it will store the studentId values from the source table.

I create another table named stgStudent with the same structure as Student. I will use this table for staging before loading data into the Student table, truncating it before each load.

I create a table named WaterMark. Watermark values for multiple tables in the source database can be maintained here. For now, I insert one record with the tableName column value 'Student' and an initial default waterMarkVal of '1900-01-01 [Link]'.

CREATE TABLE [dbo].[stgStudent](


[studentId] [int] NOT NULL,
[studentName] [varchar](100) NULL,
[stream] [varchar](50) NULL,
[marks] [int] NULL,
[createDate] [datetime2] NULL,
[updateDate] [datetime2] NULL
) ON [PRIMARY]
GO
CREATE TABLE [dbo].[Student](
[studentId] [int] NOT NULL,
[studentName] [varchar](100) NULL,
[stream] [varchar](50) NULL,
[marks] [int] NULL,
[createDate] [datetime2] NULL,
[updateDate] [datetime2] NULL
) ON [PRIMARY]
GO
CREATE TABLE [dbo].[Watermark](
[tableName] [varchar](50) NULL,
[waterMarkVal] [datetime2] NULL
) ON [PRIMARY]
GO
INSERT INTO [dbo].[WaterMark]
([tableName],[waterMarkVal])
VALUES
('Student','1900-01-01 [Link]')
SELECT tableName,waterMarkVal
FROM [dbo].[WaterMark]

Step 3: Create a Self-Hosted Integration Runtime

Next, I create an ADF resource from the Azure portal. I open the ADF resource, go to the Manage tab, and create a new self-hosted integration runtime. The Integration Runtime (IR) is the compute infrastructure used by ADF for data flows, data movement, and SSIS package execution. A self-hosted IR is required to move data from an on-premises SQL Server to Azure SQL.

I click the link under Option 1: Express setup and follow the steps to complete the installation of the IR. The name for this runtime is selfhostedR1-sd.
Step 4: Create the Azure Integration Runtime

An Azure Integration Runtime (IR) is required to copy data between cloud data
stores. I choose the default options and set up the runtime with the name azureIR2.

Step 5: Create a Linked Service for SQL Server

The linked service helps to link the source data store to the Data Factory. A Linked
Service is similar to a connection string, as it defines the connection information
required for the Data Factory to connect to the external data source.
I provide details for the on-premises SQL Server and create the linked service, named sourceSQL. There is an option to connect via integration runtime; I select the self-hosted IR created in the previous step.

Step 6: Create a Linked Service for Azure SQL

I provide details for the Azure SQL database and create the linked service, named AzureSQLDatabase1. In the connect-via-integration-runtime option, I select the Azure IR created in the previous step.
Step 7: Create the Dataset for the SQL Server table

A dataset is a named view of data that simply points to or references the data to be used in the ADF activities as inputs and outputs. I create this dataset, named SqlServerTable1, for the table [Link] in the on-premises SQL Server.

Step 8: Create a second Dataset for the Azure table

I create this dataset, named AzureSqlTable1, for the table, [Link], in the
Azure SQL database.
Step 9: Create the Watermark Dataset

I create this dataset, named AzureSqlTable2, for the table, [Link], in the
Azure SQL database.

Step 10: Create a Pipeline

I go to the Author tab of the ADF resource and create a new pipeline. I name it
pipeline_incrload.
Step 11: Add Parameters

I go to the Parameters tab of the pipeline and add the following parameters and set
their default values as detailed below.

• finalTableName (default value: Student)


• srcTableName (default value: Student)
• waterMarkCol (default value: updateDate)
• stgTableName (default value: stgStudent)
• storedProcUpsert (default value: usp_upsert_Student)
• storedProcWaterMark (default value: usp_update_WaterMark)

These parameter values can be modified to load data from a different source table to a different sink table.

Step 12: Create the Lookup Activity

A Lookup activity reads and returns the content of a configuration file or table. It also
returns the result of executing a query or stored procedure. The output from Lookup
activity can be used in a subsequent copy or transformation activity if it's a singleton
value.

I create the first Lookup activity, named lookupOldWaterMark. The source dataset is set to AzureSqlTable2 (pointing to the [Link] table). I write the following query to retrieve the waterMarkVal column value from the WaterMark table for the value 'Student'. Here, the tableName data is compared with the finalTableName parameter of the pipeline. Based on the value selected for the parameter at runtime, I may retrieve watermark data for different tables.
I check the First row only checkbox, as only one record from the table is required.

-- parameterized query
SELECT waterMarkVal
FROM [dbo].[WaterMark]
WHERE tableName = '@{pipeline().[Link]}'

-- as resolved at runtime
SELECT waterMarkVal
FROM [dbo].[WaterMark]
WHERE tableName = 'Student'

Step 13: Create a Second Lookup activity

I create the second Lookup activity, named lookupNewWaterMark. The source dataset is set to SqlServerTable1, pointing to the [Link] table in the on-premises SQL Server.

I write the following query to retrieve the maximum value of the updateDate column of the Student table. I reference the pipeline parameters in the query, so I may change the parameter values at runtime to select a different watermark column from a different table.

Here too I check the First row only checkbox, as only one record is required.

SELECT MAX(@{pipeline().[Link]}) AS NewwaterMarkVal


FROM @{pipeline().[Link]}

SELECT MAX(updatedate) AS NewwaterMarkVal


FROM [Link]

Step 14: Create a Copy data activity

A Copy data activity is used to copy data between data stores located on-premises
and in the cloud. I create the Copy data activity, named CopytoStaging, and add the
output links from the two lookup activities as input to the Copy data activity.

In the source tab, the source dataset is set to SqlServerTable1, pointing to the [Link] table in the on-premises SQL Server. I write the following query to retrieve all records from the SQL Server Student table where the updateDate column value is greater than the value stored in the WaterMark table, as retrieved from the lookupOldWaterMark activity output, and less than or equal to the maximum value of updateDate, as retrieved from the lookupNewWaterMark activity output.

I have used pipeline parameters for table name and column name values.

--query for source


select * from @{pipeline().[Link]}
where @{pipeline().[Link]} >
'@{activity('lookupOldWaterMark').[Link]}'
and @{pipeline().[Link]} <=
'@{activity('lookupNewWaterMark').[Link]}'

select * from [Link]


where updatedate >
'@{activity('GetOldWaterMarkVal-Cloud').[Link]}'
and updatedate <=
'@{activity('GetNewWaterMarkVal-Source').[Link]}'

In the sink tab, I select AzureSQLTable1 as the sink dataset. This points to the staging table [Link]. I write the pre-copy script to truncate the staging table stgStudent before each load.

I want to load data from the output of the source query to the stgStudent table.

--pre copy script for sink


TRUNCATE TABLE @{pipeline().[Link]}
Step 15: Create the Stored Procedure activity

I create a stored procedure activity next to the Copy Data activity. This will be
executed after the successful completion of Copy Data activity. I set the linked
service to AzureSqlDatabase1 and the stored procedure to usp_upsert_Student.

Here is the code for the stored procedure. The purpose of this stored procedure is to
update and insert records in Student table from the staging stgStudent. If the
student already exists, it will be updated. New students will be inserted.

CREATE PROCEDURE dbo.usp_upsert_Student


AS
BEGIN
MERGE [Link] AS t
USING (SELECT
studentId,studentName,stream,marks,createDate,updateDate FROM
[Link])
AS s (studentId,studentName,stream,marks,createDate,updateDate)
ON ([Link] = [Link])
WHEN MATCHED THEN
UPDATE SET studentName = [Link],
stream = [Link],
marks = [Link],
createDate = [Link],
updateDate = [Link]
WHEN NOT MATCHED THEN
INSERT (studentId,studentName,stream,marks,createDate,updateDate)
VALUES
([Link],[Link],[Link],[Link],[Link],[Link]);
END
GO

Step 16: Create the Stored Procedure to Update the Watermark

I create the second Stored Procedure activity, named uspUpdateWaterMark. It will be executed after the successful completion of the first Stored Procedure activity, uspUpsertStudent. I set the linked service to AzureSqlDatabase1 and the stored procedure to usp_write_watermark.

The purpose of this stored procedure is to update the watermarkval column of the
WaterMark table with the latest value of updateDate column from the Student table
after the data is loaded. This procedure takes two parameters: LastModifiedtime and
TableName. The values of these parameters are set with the lookupNewWaterMark
activity output and pipeline parameters respectively.

The LastModifiedtime value is set as


@{activity('lookupNewWaterMark').[Link]} and
TableName value is set as @{pipeline().[Link]}.

CREATE PROCEDURE [dbo].[usp_write_watermark]
@LastModifiedtime datetime,
@TableName varchar(100)
AS
BEGIN
UPDATE [dbo].[WaterMark]
SET waterMarkVal = @LastModifiedtime
WHERE tableName = @TableName
END
GO

Step 17: Debugging the Pipeline

Once all five activities are in place, I publish all the changes. Then I press the Debug button for a test execution of the pipeline. The Output tab of the pipeline shows the status of the activities.

I follow the debug progress and see all activities are executed successfully.
Step 18: Check the data in Azure SQL Database

As I select data from the [Link] table, I can see the waterMarkVal column value has changed; it is equal to the maximum value of the updateDate column of the [Link] table in SQL Server.

As I select data from [Link] table, I can see all the records inserted in the
[Link] table in SQL Server are now available in the Azure SQL Student table.

SELECT tableName,waterMarkVal
FROM [Link]
SELECT studentid, studentName,stream,marks,createDate,updateDate
FROM [Link]

Step 19: Update and Insert Data in SQL Server

Now, I update the stream value in one record of the [Link] table in SQL Server.
The updateDate column value is also modified with the GETDATE() function output. I
also add a new student record. The inserted and updated records have the latest
values in the updateDate column.

In the next load, only the update and insert in the source table needs to be reflected
in the sink table. The other records should remain the same.

UPDATE [Link]
SET stream = 'ECE',
updateDate = GETDATE()
WHERE studentId = 3
INSERT INTO [Link]
(studentName,stream,marks,createDate,updateDate)
VALUES
('aaa', 'CSE',100,GETDATE(), GETDATE())

Step 20: Debug the Pipeline

I execute the pipeline again by pressing the Debug button. I follow the progress and
all the activities execute successfully.

Step 21: Check Data in Azure SQL Database

As I select data from the [Link] table, I can see the waterMarkVal column value has changed; it is now equal to the maximum value of the updateDate column of the [Link] table in SQL Server. As I select data from the [Link] table, I can see that one existing student record has been updated and a new record inserted.

So, I have successfully completed the incremental load of data from the on-premises SQL Server to the Azure SQL database table.

SELECT tableName,waterMarkVal
FROM [Link]
SELECT studentid, studentName,stream,marks,createDate,updateDate
FROM [Link]

Conclusion
The step-by-step process above can be followed to incrementally load data from an on-premises SQL Server source table to an Azure SQL database sink table. Pipeline parameter values can be supplied to load data from any source to any sink table, and the source table column used as the watermark can also be configured. Once the pipeline is complete and debugged, a trigger can be created to schedule the ADF pipeline execution.
Implementing Slowly Changing Dimensions (Type 2) in Azure Data Flow 🌟

Managing data history is crucial for any organization. Type 2 Slowly Changing Dimensions
(SCD2) allows us to retain full historical records of data changes, ensuring we capture the
complete evolution of our data over time.

Implementing Type 2 Slowly Changing Dimensions in Azure Data Flow allows us to maintain a complete history of data changes. In this approach, when a chosen attribute's value changes, the current record is closed and a new record is created to reflect the updated information. Each record includes effective and expiration dates to indicate the period during which the record was active.
For example, you can create a table like this:

CREATE TABLE scd2 (
    surrkey INT IDENTITY(1,1),
    id INT,
    name NVARCHAR(100),
    address NVARCHAR(100),
    isactive NVARCHAR(100)
);

INSERT INTO scd2 VALUES (1, 'john', 'chennai', 1);
INSERT INTO scd2 VALUES (2, 'antony', 'chennai', 1);
INSERT INTO scd2 VALUES (3, 'antony', 'chennai', 1);

Step 1: Add the source dataset (the dataset should point to the file located in your source
layer).
Step 2: Add a Derived Column transformation, add a column named isactive, and set its value
to 1.

Step 3: Configure your sink mappings as shown below

Step 4: Add a SQL dataset as another source.

Step 5: Use a Select transformation to rename columns from the SQL table.

Step 6: Add a Lookup transformation (it requires two sources: the first source will be your
Select transformation and the second will be your source file).

Step 7: The output of the Lookup transformation should look as shown below (for
non-matching rows there should be nulls).

Step 8: Now let's filter out the rows that have non-nulls in the source file columns.
Step 9: Select only the required columns that you are going to insert or update in the SQL table.

Step 10: Add a Derived Column transformation, add the isactive column, and set its value to 0.
Step 11: Add an Alter Row transformation and configure it as shown below.

Step 12: Add a sink and configure it as shown below.

Step 13: Finally, under the Settings section of your data flow, select sink2 first and sink1
second.
Step 14: After successfully running your pipeline, verify the data in your SQL table.
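The core SCD2 decision behind these steps — expire the current version of a changed record, then insert the new version as active — can be sketched in Python. The records and the changed address below are hypothetical; the real pipeline does this through the Alter Row update and the second sink insert:

```python
# Existing dimension rows (isactive = 1 means the current version).
dim = [
    {"id": 1, "name": "john",   "address": "chennai", "isactive": 1},
    {"id": 2, "name": "antony", "address": "chennai", "isactive": 1},
]

# Incoming source record with a changed address for id 1.
incoming = {"id": 1, "name": "john", "address": "bangalore"}

# SCD2: expire the matching current row (isactive -> 0)...
for row in dim:
    if row["id"] == incoming["id"] and row["isactive"] == 1:
        row["isactive"] = 0

# ...and insert a new active row carrying the updated values.
dim.append({**incoming, "isactive": 1})

active = [r for r in dim if r["id"] == 1 and r["isactive"] == 1]
print(active[0]["address"])
```

Both versions of the record remain in the table, so the full change history is preserved — the defining property of Type 2.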
Enhancing Data Workflows with Logic Apps for Real-Time Notifications

In today’s fast-paced data environment, ensuring timely notifications for pipeline events is
crucial. Here’s how I integrated Logic Apps to automate notifications for my Azure Data
Factory (ADF) pipeline, sending alerts directly to my email.

Step-by-Step Implementation:

1. Set Up the Logic App for Email Notification:

1. Create a New Logic App: Navigate to the Azure portal and create a new Logic App.
2. Design the Workflow:
o Open the Logic App designer.
o Search for the HTTP trigger and add it to the workflow.
o Configure the trigger method as GET.

2. Add Email Action:

1. Add New Step: Search for the Send an email (V2) action.
2. Configure Email Parameters:
o Add email address parameters.
o Set up the email body and subject dynamically.
o Save the Logic App to generate a unique URL.
3. Integrate with ADF Pipeline:

1. Copy the URL: Copy the generated Logic App URL.


2. Add Web Activity to ADF:
o Go to your ADF pipeline.
o Add a Web activity.
o Configure it to trigger on pipeline failure.
o Paste the Logic App URL into the URL field of the Web activity.
3. Connect to Failure Path: Ensure the Web activity is connected to the failure path in
your pipeline to trigger the notification.
Result:

Now, whenever there is a failure in the ADF pipeline, an automated email notification is sent
out, providing real-time alerts and ensuring prompt action can be taken to address the issue.
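Since the Logic App trigger above uses GET (which carries no request body), failure details can travel to the Logic App as query parameters on the callback URL. A Python sketch of what the Web activity's URL might look like — the URL, parameter names, and values are hypothetical; in ADF they would come from expressions such as the pipeline name system variable:

```python
from urllib.parse import urlencode

# Hypothetical callback URL generated when the Logic App is saved.
logic_app_url = (
    "https://prod-00.eastus.logic.azure.com"
    "/workflows/abc/triggers/manual/paths/invoke"
)

# Failure details appended as query parameters; the Logic App maps
# these into the email subject and body.
params = {"pipeline": "CopySalesData", "error": "Copy1 failed"}
url = logic_app_url + "?" + urlencode(params)
print(url)
```

If you need a larger payload, switching the Logic App trigger to POST and sending a JSON body from the Web activity is the usual alternative.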
Building a Secure Azure Data Factory Pipeline with Azure Key Vault Integration

In today's data-driven world, securing access to your data sources is paramount. Here’s a
step-by-step guide on how I created a secure Azure Data Factory (ADF) pipeline by
integrating Azure Key Vault for managing secrets like storage and database connection
strings.

Step 1: Create a Storage Account

1. Set Up the Storage Account: Navigate to the Azure portal and create a new storage
account.
2. Access Keys: Under the storage account, go to Security + Networking and select
Access keys. Copy the connection string.

Step 2: Create a Key Vault

1. Set Up Key Vault: Create a new Key Vault service.


2. Generate Secrets:
o Go to Secrets and select Generate/Import.
o For the storage account, paste the copied connection string from the storage
account's access keys.
o Similarly, generate a secret for your SQL server connection string. Go to the
SQL database, copy the [Link] connection string (SQL Authentication),
and replace it in the Key Vault.
Step 3: Create Linked Services in ADF

1. Linked Service for Key Vault:


o In ADF, create a linked service for Azure Key Vault.
o Pass the subscription and Key Vault details created earlier.
o Test the connection and create the linked service.
2. Linked Service for Azure SQL Database:
o Create a linked service for the Azure SQL database in ADF.
o Instead of a direct connection string, select Key Vault to fetch the connection
string securely.
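Under the hood, the linked service definition references the secret rather than embedding the connection string. A sketch of the JSON (the names AzureKeyVaultLS and sql-conn-secret are placeholders for whatever you named your Key Vault linked service and secret) might look like:

```json
{
  "name": "AzureSqlDatabaseLS",
  "properties": {
    "type": "AzureSqlDatabase",
    "typeProperties": {
      "connectionString": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "AzureKeyVaultLS",
          "type": "LinkedServiceReference"
        },
        "secretName": "sql-conn-secret"
      }
    }
  }
}
```

Because only the secret name appears in the definition, rotating the connection string in Key Vault requires no change to the linked service.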
Troubleshooting Secret Access

If the secret key is not loading, follow these steps:

1. Access Policies: Go to your Key Vault and select Access Policies.


2. Create Access Policy:
o Click + Add Access Policy.
o Select Secret Permissions and add your Data Factory's managed identity as
the principal.
o Save and create the access policy.

For more details, you can refer to the official Microsoft documentation.

Benefits:

Enhanced Security: By using Azure Key Vault, sensitive connection strings and secrets
are securely managed.

Simplified Management: Centralized secret management simplifies configuration and
reduces the risk of credential leakage.

Scalability: Easily scale your data operations with secure and efficient credential
management.
Scenario Series 1: Fetch Files Not Present in Destination

Follow me on LinkedIn – Shivakiran kotur


To create an Azure Data Factory (ADF) pipeline that compares the source files
with the destination files and copies only those files not present in the
destination, you can follow these steps:

Step 1: Create Datasets


1. Source Dataset: Define a dataset for your source files (e.g.,
SourceDataset).
o Linked Service: Configure a linked service for your source ADL.
o Dataset Configuration: Set up the dataset to point to the folder
containing [Link], [Link], etc.
2. Destination Dataset: Define a dataset for your destination files (e.g.,
DestinationDataset).
o Linked Service: Configure a linked service for your destination
ADL.
o Dataset Configuration: Set up the dataset to point to the folder
containing [Link] and [Link].
Step 2: Create a Pipeline
1. Pipeline Variables:
o Define a pipeline variable (e.g.,
@pipeline().[Link]) to store the list of source
files.
o Define another variable (e.g.,
@pipeline().[Link]) to store the list of
destination files.
2. Get Source Files:
o Use a Get Metadata activity to list all the files in the source folder.
o Output should be saved to the SourceFiles variable.
3. Get Destination Files:
o Use another Get Metadata activity to list all the files in the
destination folder.
o Output should be saved to the DestinationFiles variable.



4. Filter Files:
o Use a Filter activity to compare the source files against the
destination files.
o Set the condition to include files that are in the source but not in
the destination.



5. Copy Activity:
o For each file identified by the filter, use a ForEach activity to
iterate over the list.
o Inside the ForEach activity, use a Copy Data activity to copy each
file from the source to the destination.
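The Filter step reduces to a set difference over file names. A Python sketch of what the activity computes — the file names below are hypothetical, and the dictionaries mirror the childItems shape that Get Metadata returns:

```python
# childItems as returned by Get Metadata for each folder.
source_files = [{"name": "a.csv"}, {"name": "b.csv"}, {"name": "c.csv"}]
dest_files = [{"name": "a.csv"}]

dest_names = {f["name"] for f in dest_files}

# Keep only files present in the source but missing from the destination;
# the ForEach + Copy activities then process exactly this list.
to_copy = [f["name"] for f in source_files if f["name"] not in dest_names]
print(to_copy)
```

In ADF itself, the Filter activity's condition would express this same membership test over the destination child-item names.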

Scenario Series 2 → ADF Pipeline Setup: Fetching and Storing
Employee Names

To pass the final array of employee names to another variable (`ename`) after
appending them using the Append Variable activity in Azure Data Factory
(ADF), follow these steps:



Steps:

1. Add a Lookup Activity:


- Configure the Lookup activity to fetch the `ename` column from the
`employee` table:

SELECT ename FROM employee;

- Ensure First Row Only is set to False to retrieve all names.

2. Create an Array Variable:


- Create a new array variable (e.g., `namesArray`) in the pipeline.

3. Add a ForEach Activity:


- Drag a ForEach activity onto the canvas.
- Set the Items property to:

@activity('LookupActivityName').[Link]



- This allows the ForEach loop to iterate over each `ename` value.

4. Inside the ForEach Loop:


- Add an Append Variable activity inside the ForEach loop.
- Set the Variable Name to `namesArray`.

- Set the Value to:

@item().ename

5. Pass the Final Array to Another Variable:


- After the ForEach loop, add a Set Variable activity.
- Variable Name: Select a new variable (e.g., `ename`) of type Array.

- Value: Set the value to the `namesArray` variable:



@variables('namesArray')

Final Setup:

- Lookup Activity: Fetch `ename` values.


- ForEach Activity: Iterate through each `ename`.
- Append Variable Activity: Append each `ename` to `namesArray`.
- Set Variable Activity: Assign the final `namesArray` to the `ename` variable.

This setup will allow you to gather all employee names into an array and store
them in the `ename` variable, ready for use in subsequent pipeline activities.
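The Lookup → ForEach → Append Variable → Set Variable flow above is equivalent to this Python sketch (the lookup output and names are hypothetical; the dictionary mirrors the shape Lookup returns when First Row Only is off):

```python
# Shape of the Lookup activity output with "First row only" set to False.
lookup_output = {"value": [{"ename": "alice"}, {"ename": "bob"}, {"ename": "carol"}]}

names_array = []                        # the pipeline's namesArray variable
for item in lookup_output["value"]:     # the ForEach activity
    names_array.append(item["ename"])   # the Append Variable activity

ename = list(names_array)               # the final Set Variable activity
print(ename)
```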



Scenario Series 3 - Incrementing a Variable Until It Reaches 5

The Until activity in Azure Data Factory (ADF) is designed to repeatedly execute
a series of activities in a loop until a specified condition evaluates to true. This
is particularly useful for scenarios where you need to wait for a certain state or
perform iterative operations.



Steps:

1. Declare the Variable:


- Create a variable `X` to hold the initial value (e.g., 0).
- Use a Set Variable activity to assign the initial value to `X`.

2. Configure the Until Activity:


- Add an Until activity to your pipeline.
- Set the Expression for the Until activity to check if `X > 5`. The loop will
continue until this condition is met:

@greater(variables('X'), 5)



3. Inside the Until Activity:
- Since you can't directly reference the variable in its own expression, create
a temporary variable, say `VarB`.
- Set Variable activity: Assign `VarB` the current value of `X`.
- Set Variable activity: Increment `X` by 1 using the temporary variable:

@add(variables('VarB'), 1)

- This step increments `X` by 1 in each iteration of the loop.



How the Until Activity Works:

The Until activity runs a loop, executing all the activities defined inside it, until
the condition associated with it evaluates to true. In this example, the loop
increments the value of `X` by 1 on each iteration. The loop continues until `X`
becomes greater than 5.

By using a temporary variable (`VarB`), we avoid self-referencing issues and
ensure that the increment operation is performed correctly.

This approach allows the loop to progress smoothly, ensuring that the
condition (`X > 5`) is eventually met, thus stopping the loop.
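The loop above is equivalent to this Python sketch; the comments map each line back to the ADF activity it stands in for (the two-step copy through `var_b` mirrors the fact that a Set Variable expression cannot reference the variable it is setting):

```python
x = 0                  # Set Variable: initialize X
while not x > 5:       # Until condition: @greater(variables('X'), 5)
    var_b = x          # Set Variable: copy X into VarB
    x = var_b + 1      # Set Variable: @add(variables('VarB'), 1)
print(x)
```

Note that the Until activity checks its condition after each pass, so the body runs while `X` is 0 through 5 and the loop exits once `X` reaches 6.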



Scenario series 4→ Conditional File Copy in Azure Data Factory
Based on Row Count Using Lookup and If Condition Activities

To check the row count for a single file in Azure Data Factory (ADF) using the
Lookup activity, and then copy the file only if it has rows, follow these steps:

Step-by-Step Process:
1. Lookup Activity:
o Use the Lookup activity to check if the CSV file has rows.
o You can configure this to run a query that counts the rows in the
file.



➔ Here we use a Lookup activity to fetch the file. We won't pass any filename
in the dataset; instead, we make the filename a dynamic dataset parameter so it
can be passed during pipeline execution.



Once you debug, it will ask you to pass the filename; provide any CSV file name
present in the input container.

2. If Condition Activity:
o Add an If Condition activity to evaluate whether the row count is greater
than 0.
o If the condition is true, the file will be copied; otherwise, it will be
skipped.



3. Copy Data Activity:
o Add a Copy Data activity inside the If Condition (true case) activity to
copy the file if the row count is greater than 0.
o The file name will be passed as a parameter to ensure it is used
correctly.



➔ When you debug, it will ask for the filename. If the row count is not greater
than 0, no file is copied and the pipeline simply succeeds; if the file has
rows, it is copied to the destination.



Summary:
o Lookup Activity checks the row count for the specified file.
o If Condition Activity determines if the file should be copied based on
the row count.
o Copy Data Activity copies the file to the destination if the row count is
greater than 0.

This setup ensures that the file is only copied if it contains rows, and you can
dynamically handle the file name using parameters.
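The decision the If Condition activity makes can be sketched as a small predicate over the Lookup output — the dictionary shape mirrors how Lookup returns rows under a `value` array, and the sample rows are hypothetical:

```python
def should_copy(lookup_output):
    # Lookup returns the file's rows under "value"; copy only when
    # non-empty, mirroring an If Condition on the row count.
    return len(lookup_output.get("value", [])) > 0

print(should_copy({"value": [{"col1": "a"}]}))  # file has rows
print(should_copy({"value": []}))               # empty file, skipped
```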

Feel free to comment below on how you would approach handling this scenario
when working with multiple files at a time in Azure Data Factory!



Hint→



Scenario series 5 → Pass the response from a REST API call to a
variable in Azure Data Factory (ADF)

Step-by-Step Process:
1. Create a Web Activity to Call the REST API:
o This activity will fetch the JSON response from your REST API.
2. Create a Set Variable Activity:
o Use this activity to assign the response from the Web Activity to a
pipeline variable.
3. Configure the Web Activity and Set Variable Activity:
o Web Activity: Configure it to call the REST API and get the JSON
response.
o Set Variable Activity: Set up to assign the response to a variable.



Let's take the REST API:



The output response is not in the form of an array; it is a string (due to a limitation of ADF).

Take a variable and set its type to Array.



➔ The output of the Web activity is a string, and the variable we are passing is
an array, so in the expression we have to convert the Web activity response to JSON.



Handling JSON Response:

If the response is a JSON array or nested, you might need to use functions to
parse or access specific parts of the JSON. Here’s how you can handle different
cases:
• Access Specific Elements:

@activity('CallRESTAPI').[Link][0].name

• Extract Specific Values:


@json(activity('CallRESTAPI').[Link]).propertyName

Summary:
1. Web Activity retrieves the JSON response from the REST API.
2. Set Variable Activity assigns the JSON response to a pipeline variable.
3. Use expressions to access or manipulate the JSON data as needed.

This setup allows you to dynamically pass the response from a REST API to a
variable for further processing within your ADF pipeline.
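The string-to-array conversion that the Set Variable expression performs is the same operation as `json.loads` in Python. A sketch with a hypothetical response string:

```python
import json

# The Web activity's response arrives as a string, e.g.:
response_text = '[{"id": 1, "name": "alpha"}, {"id": 2, "name": "beta"}]'

# Equivalent of the @json(...) conversion in the ADF Set Variable expression:
parsed = json.loads(response_text)

# Elements are then addressed just like [0].name in an ADF expression.
print(parsed[0]["name"])
```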
