ADF Pipeline Management and File Handling Guide
How do you separate bad records during a copy activity from ADLS to SQL?
To separate bad records while performing a copy activity from Azure Data Lake Storage
(ADLS) to a SQL table in Azure Data Factory (ADF), follow these steps:
1. Select the Copy activity.
2. In the Settings tab, enable the Fault tolerance option and check "Skip incompatible rows." This option identifies and skips rows that do not conform to the destination schema, effectively separating the bad records from the rest.
3. If you want to store the bad records, check the Enable logging option, where you will have to provide the storage account in which the logs will be stored.
Copy only the latest modified file from a folder.
To process only the most recently modified file in Azure Data Factory, follow these steps:
1. Use the GetMetadata activity and select 'Child items' in the field list.
2. Implement a ForEach activity over the child items.
3. Inside the ForEach activity, add another GetMetadata activity (GetMetadata2) and select 'Last modified' in the field list.
4. Under the pipeline's Variables section, create two variables, LatestFileName and PreviousModifiedDate, and assign initial values.
5. Add an If Condition activity and write an expression to compare dates:
@greater(formatDateTime(activity('GetMetadata2').[Link], 'yyyyMMddHHmmss'), formatDateTime(variables('PreviousModifiedDate'), 'yyyyMMddHHmmss'))
6. Within the If Condition's True branch, add Set Variable activities to update the variables with @activity('GetMetadata2').[Link].
7. Finally, add a Copy activity outside the ForEach to copy the latest file to the desired output folder.
These steps help ensure that only the latest file is processed, optimizing your data workflow.
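The loop logic above can be sketched outside ADF as a short Python script. The field names `name` and `lastModified` here are illustrative stand-ins for the GetMetadata output, not the exact ADF JSON shape:

```python
from datetime import datetime

def pick_latest_file(files):
    """Return the name of the most recently modified file.

    Mirrors the ForEach + If Condition + Set Variable pattern:
    track the running maximum of the last-modified timestamp.
    """
    latest_name = None
    previous_modified = datetime.min   # plays the role of PreviousModifiedDate
    for f in files:
        if f["lastModified"] > previous_modified:   # the If Condition check
            previous_modified = f["lastModified"]   # Set Variable: new max date
            latest_name = f["name"]                 # Set Variable: LatestFileName
    return latest_name

files = [
    {"name": "a.csv", "lastModified": datetime(2024, 1, 1)},
    {"name": "b.csv", "lastModified": datetime(2024, 3, 5)},
    {"name": "c.csv", "lastModified": datetime(2024, 2, 9)},
]
print(pick_latest_file(files))  # b.csv
```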
Delete files older than 31 days.
1. Use the GetMetadata activity, selecting 'Child items' in the field list. Filter by 'Last modified' using the expression @adddays(utcnow(), -31) under End time (UTC).
2. Implement a ForEach activity with the expression @activity('GetMetadata').[Link].
3. Inside the ForEach activity, add a Delete activity and configure it to select and delete the appropriate filenames.
These steps streamline the process of managing and deleting outdated files, ensuring your data storage remains optimized.
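The date-cutoff selection can be sketched in Python; the file records and field names below are illustrative assumptions, not ADF's exact output shape:

```python
from datetime import datetime, timedelta

def files_to_delete(files, now, days=31):
    # Mirrors @adddays(utcnow(), -31): anything last modified before the
    # cutoff is what the Delete activity inside the ForEach would remove.
    cutoff = now - timedelta(days=days)
    return [f["name"] for f in files if f["lastModified"] < cutoff]

now = datetime(2024, 6, 1)
files = [
    {"name": "old.csv", "lastModified": datetime(2024, 3, 1)},
    {"name": "new.csv", "lastModified": datetime(2024, 5, 20)},
]
print(files_to_delete(files, now))  # ['old.csv']
```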
Copy files from Blob to ADLS.
1. Fetch Metadata: use the GetMetadata activity to retrieve file information, storing it in an array.
2. ForEach Activity: pass the output of the GetMetadata activity into the ForEach activity.
3. Copy Activity: within the ForEach loop, use the Copy activity to transfer each file from the source (Blob) to the sink container (ADLS).
This approach ensures smooth and organized file transfers.
Copy only files starting with "customer" and add a date in ADLS.
1. Fetch Metadata: use the GetMetadata activity to retrieve file information, storing it in an array.
2. Filter Activity: add a Filter activity after the GetMetadata activity, linked to it, to keep only files starting with "customer".
3. ForEach Activity: pass the output of the Filter activity into the ForEach activity:
@activity('Filter1').[Link]
4. Copy Activity: within the ForEach loop, use the Copy activity to transfer each file from the source (Blob) to the sink container (ADLS).
For the Copy activity, the source remains the same; for the sink, create a dynamic container name in the dataset with the expression:
@concat('customer-', formatDateTime(utcnow(), 'yyyyMMddHHmmss'))
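A rough Python equivalent of the Filter activity and the dynamic container-name expression (the field names are assumptions, and the container format follows the @concat expression above):

```python
from datetime import datetime

def customer_files(child_items):
    # Filter activity: keep only files whose name starts with "customer".
    return [f for f in child_items if f["name"].startswith("customer")]

def sink_container(now):
    # @concat('customer-', formatDateTime(utcnow(), 'yyyyMMddHHmmss'))
    return "customer-" + now.strftime("%Y%m%d%H%M%S")

items = [{"name": "customer_jan.csv"}, {"name": "orders.csv"}]
print(customer_files(items))                          # [{'name': 'customer_jan.csv'}]
print(sink_container(datetime(2024, 1, 2, 3, 4, 5)))  # customer-20240102030405
```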
REQUIRED ACTIVITIES
1. Azure data factory
• 4 datasets
• 2 linked services
• Pipeline
• Lookup activity → ForEach activity (GetMetadata activity → If activity (True: Copy activity with Success SP / Failure SP; or False)) → Web activity (success or failure)
2. Azure SQL Server
3. Storage accounts
Lookup activity → create a dataset referencing the metadata table (where the info of all the tables to be loaded is stored).
Get Metadata activity → dynamically checks whether the table exists or not.
ForEach loop → use an If activity inside it to evaluate the true or false condition, then link to the Stored Procedure activity for success or failure.
If the table does not exist → False; if it exists → True → if True, the Copy activity then succeeds or fails.
Logic Apps → use a logic app to send the notification through mail.
STEPS INVOLVED
Step 1 → Script execution in the database.
This involves creating the tables and inserting data, creating the metadata table (with active as 1 and inactive as 0) into which information about all the tables is inserted, and creating the stored procedures for logs.
Create the SQL database in Azure, set the firewall, and connect to SSMS using the credentials; in the database, execute the following scripts.
→SQL SCRIPTS:
→TABLE CREATIONS
CREATE TABLE PRODUCT(PID INT, PNAME VARCHAR(50))
CREATE TABLE SELLS(SELLSID INT, STORENAME VARCHAR(50))
CREATE TABLE TRANSACTIONS(TID INT, TAMOUNT BIGINT)
CREATE TABLE CUST(CID INT, CLOCATION VARCHAR(50))
CREATE TABLE EMP(EMPID INT, EMPNAME VARCHAR(50))
--INSERT TABLES
--INSERT DATA INTO METADATA TABLE (ALL THE TABLES INFO INSERTED)
INSERT INTO METADATA VALUES
('DBO','PRODUCT','PRODUCTOUTPUT',0),
('DBO','SELLS','SELLSOUTPUT',0),
('DBO','TRANSACTIONS','TRANSACTIONSOUTPUT',0),
('DBO','CUST','CUSTOUTPUT',0),
('DBO','EMP','EMPOUTPUT',0),
('DBO', 'ORDERS','ORDERSOUTPUT',1);
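The metadata-driven pattern the pipeline follows — the Lookup reads these rows, the ForEach iterates them, and only active tables are copied into lower-cased containers — can be sketched in Python. The row shape and flag semantics are assumed from the INSERT above:

```python
# Each tuple mirrors a METADATA row: schema, table, target container, active flag.
metadata = [
    ("DBO", "PRODUCT", "PRODUCTOUTPUT", 0),
    ("DBO", "ORDERS", "ORDERSOUTPUT", 1),
]

def active_copy_tasks(rows):
    # The Lookup activity would return these rows; the ForEach iterates them,
    # and only active tables (flag = 1) are copied.  Container names are
    # lower-cased up front, since blob container names must be lower case.
    return [
        {"table": f"{schema}.{table}", "container": container.lower()}
        for schema, table, container, active in rows
        if active == 1
    ]

print(active_copy_tasks(metadata))  # [{'table': 'DBO.ORDERS', 'container': 'ordersoutput'}]
```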
First, create a linked service for Key Vault: pass the subscription and Key Vault details created earlier, test the connection, and create it.
2nd → create the linked service for Azure SQL Database, selecting Key Vault instead of Connection String.
Note → if the secret key is not loading, follow these steps:
Go to your Key Vault >> Access Policy → +Create → select the Secret permission → Principal → type your Data Factory name and create it. For more reference, see the link below.
[Link]
Step 8 → take a ForEach activity and pass the output of the Lookup activity to the ForEach dynamically, i.e. @activity('Lookup1').[Link]
Step → go inside the ForEach activity.
Take a GetMetadata activity and create a dataset for it. GetMetadata is used to check whether the table exists or not.
For the GetMetadata dataset, create two parameters, for the source schema and the table name.
Step → take the next activity, the If activity; here we pass the output of the GetMetadata activity.
Create SQL datasets for the copy source and sink.
In the Copy activity, pass the container column name dynamically, referring to the JSON output.
Step → next, add the Stored Procedure activity following the Copy activity: one for success and one for failure.
In the Stored Procedure activity → for the activity → linked service → test the connection → select the SP → click Import to import all the parameters of the procedure.
Note → check whether the variables match.
All the parameters need to be passed dynamically to the parameters of the stored procedure; refer to the Microsoft site for the main parameters.
IMP NOTE → the Copy activity fails if the container name is in UPPER case, so it has to be changed into lower case, as container names must be lower case:
@toLower(item().BLOBCONTAINER)
Use this in the sink dataset properties above.
(All dynamic expressions are case sensitive and should follow the JSON structure.)
Parameter passed
CopyActivity_End_Time: @activity('Copy data1').ExecutionEndTime
CopyActivity_Start_Time: @activity('Copy data1').ExecutionStartTime
copyDuration_in_secs: @activity('Copy data1').[Link]
Datafactory_Name: @pipeline().DataFactory
Destination: @item().blobcontainer
effectiveIntergationRuntime: @activity('Copy data1').[Link]
Error_Message: @activity('Copy data1').error
Execution_Status: @activity('Copy data1').status
Execution_Status_code: @activity('Copy data1').statuscode
No_ParallelCopies: @activity('Copy data1').[Link]
PipelineName: @pipeline().Pipeline
RowsCopied: @activity('Copy data1').[Link]
RowsRead: @activity('Copy data1').[Link]
Sink_Type: @activity('Copy data1').[Link][0].[Link]
Source: @item().tablename
Source_Type: @activity('Copy data1').[Link][0].[Link]
TriggerId: @pipeline().TriggerId
TriggerName: @pipeline().TriggerName
TriggerTime: @pipeline().TriggerTime
triggertype:@pipeline().TriggerType
Step → add an email notification to the pipeline. Create the service called Logic App → go to the app designer → search for HTTP → take email → under email, add parameters → method → GET → new step → get authenticated → add the email address → body, subject → save → a URL is generated → copy it → go to the pipeline, take a Web activity → connect it to the failure output → paste the URL.
Save the app → copy the URL.
Incremental Data Loading using Azure Data Factory
This section covers the process for the incremental load of data from an on-premises SQL Server to an Azure SQL database. Once the full data set is loaded from a source to a sink, there may be additions or modifications of the source data. In that case, it is not always possible, or recommended, to refresh all data again from source to sink. Incremental load methods help to reflect the changes in the source to the sink every time a data modification is made on the source.
There are different methods for incremental data loading. I will discuss the step-by-
step process for incremental loading, or delta loading, of data through a watermark.
Watermark
A watermark is a column in the source table that has the last updated timestamp or an incrementing key. After every iteration of data loading, the maximum value of the watermark column for the source data table is recorded. Once the next iteration is started, only the records having a watermark value greater than the last recorded watermark value are fetched from the data source and loaded into the data sink. The latest maximum value of the watermark column is recorded at the end of this iteration.
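One iteration of the watermark approach can be sketched in Python. The row shape and column names are illustrative, and the dict-based merge only stands in for the load into the sink:

```python
from datetime import datetime

def incremental_load(source_rows, sink_rows, watermark):
    """One iteration of watermark-based delta loading: fetch only rows whose
    updateDate is newer than the stored watermark, apply them to the sink,
    and record the new maximum watermark for the next iteration."""
    delta = [r for r in source_rows if r["updateDate"] > watermark]
    sink = {r["id"]: r for r in sink_rows}
    for r in delta:
        sink[r["id"]] = r            # upsert: update existing, insert new
    new_watermark = max((r["updateDate"] for r in delta), default=watermark)
    return list(sink.values()), new_watermark

source = [
    {"id": 1, "updateDate": datetime(2024, 1, 1)},
    {"id": 2, "updateDate": datetime(2024, 2, 1)},
]
# First run starts from the initial default watermark.
rows, wm = incremental_load(source, [], datetime(1900, 1, 1))
print(len(rows), wm)
```

On the next run, passing `wm` back in means unchanged rows are skipped and only new or modified rows are fetched.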
The workflow for this approach is depicted in a diagram in the Microsoft documentation.
In the on-premises SQL Server, I create a database first. Then, I create a table named dbo.Student. I insert 3 records in the table and check the same. This table data will be copied to the Student table in an Azure SQL database. The updateDate column of the Student table will be used as the watermark column.
I create an Azure SQL Database through the Azure portal. I connect to the database through SSMS. Once connected, I create a table named Student, which has the same structure as the Student table created in the on-premises SQL Server. The studentId column in this table is not defined as IDENTITY, as it will be used to store the studentId values from the source table.
I create another table named stgStudent with the same structure as Student. I will use this table as a staging table before loading data into the Student table. I will truncate this table before each load.
I create a table named WaterMark. Watermark values for multiple tables in the
source database can be maintained here. For now, I insert one record in this table. I
put the tablename column value as 'Student' and waterMarkVal value as an initial
default date value '1900-01-01 [Link]'.
Next, I create an ADF resource from the Azure portal. I open the ADF resource, go to the Manage link of the ADF, and create a new self-hosted integration runtime.
The Integration Runtime (IR) is the compute infrastructure used by ADF for data flow, data movement, and SSIS package execution. A self-hosted IR is required for movement of data from on-premises SQL Server to Azure SQL.
I click the link under Option 1: Express setup and follow the steps to complete the
installation of the IR. The name for this runtime is selfhostedR1-sd.
Step 4: Create the Azure Integration Runtime
An Azure Integration Runtime (IR) is required to copy data between cloud data
stores. I choose the default options and set up the runtime with the name azureIR2.
The linked service helps to link the source data store to the Data Factory. A Linked
Service is similar to a connection string, as it defines the connection information
required for the Data Factory to connect to the external data source.
I provide details for the on-premise SQL Server and create the linked service, named
sourceSQL. There is an option to connect via Integration runtime. I select the self-
hosted IR as created in the previous step.
I provide details for the Azure SQL database and create the linked service, named AzureSQLDatabase1. In the connect via Integration runtime option, I select the Azure IR as created in the previous step.
Step 7: Create the Dataset for the SQL Server table
A dataset is a named view of data that simply points or references the data to be
used in the ADF activities as inputs and outputs. I create this dataset, named
SqlServerTable1, for the table, [Link], in on-premise SQL Server.
Step 8: Create the Dataset for the Azure SQL table
I create this dataset, named AzureSqlTable1, for the table, [Link], in the Azure SQL database.
Step 9: Create the Watermark Dataset
I create this dataset, named AzureSqlTable2, for the table, [Link], in the
Azure SQL database.
Step 10: Create the Pipeline
I go to the Author tab of the ADF resource and create a new pipeline. I name it pipeline_incrload.
Step 11: Add Parameters
I go to the Parameters tab of the pipeline and add the following parameters and set
their default values as detailed below.
A Lookup activity reads and returns the content of a configuration file or table. It also
returns the result of executing a query or stored procedure. The output from Lookup
activity can be used in a subsequent copy or transformation activity if it's a singleton
value.
I create the first lookup activity, named lookupOldWaterMark. The source dataset is set to AzureSqlTable2 (pointing to the [Link] table). I write the following query to retrieve the waterMarkVal column value from the WaterMark table for the value Student. Here, the tablename data is compared with the finalTableName parameter of the pipeline. Based on the value selected for the parameter at runtime, I may retrieve watermark data for different tables.
I click on the First Row Only checkbox, as only one record from the table is required.
SELECT waterMarkVal
FROM [dbo].[WaterMark]
WHERE tableName = '@{pipeline().[Link]}'
SELECT waterMarkVal
FROM [dbo].[WaterMark]
WHERE tableName = 'Student'
In the second lookup activity, named lookupNewWaterMark, I write a query to retrieve the maximum value of the updateDate column of the Student table. I reference the pipeline parameters in the query. I may change the parameter values at runtime to select a different watermark column from a different table.
Here also I click the First Row Only checkbox, as only one record from the table is required.
A Copy data activity is used to copy data between data stores located on-premises
and in the cloud. I create the Copy data activity, named CopytoStaging, and add the
output links from the two lookup activities as input to the Copy data activity.
I have used pipeline parameters for table name and column name values.
In the sink tab, I select AzureSqlTable1 as the sink dataset. This points to the staging table [Link]. I write the pre-copy script to truncate the staging table stgStudent every time before data loading.
I want to load data from the output of the source query to the stgStudent table.
I create a stored procedure activity next to the Copy Data activity. This will be
executed after the successful completion of Copy Data activity. I set the linked
service to AzureSqlDatabase1 and the stored procedure to usp_upsert_Student.
The purpose of this stored procedure is to update and insert records in the Student table from the staging table stgStudent. If the student already exists, it will be updated. New students will be inserted.
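The upsert rule the procedure implements can be sketched in Python. The column names follow the Student table, and the dict-based merge is only an illustration of the update-or-insert behaviour, not the actual T-SQL:

```python
def upsert_students(student, stg_student):
    # Rows from the staging table update matching studentId rows in Student;
    # unmatched staging rows are inserted as new students.
    by_id = {r["studentId"]: dict(r) for r in student}
    for row in stg_student:
        by_id[row["studentId"]] = dict(row)
    return sorted(by_id.values(), key=lambda r: r["studentId"])

student = [{"studentId": 1, "stream": "CSE"}, {"studentId": 2, "stream": "EEE"}]
staging = [{"studentId": 2, "stream": "ECE"}, {"studentId": 3, "stream": "ME"}]
print(upsert_students(student, staging))
```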
The purpose of this stored procedure is to update the watermarkval column of the
WaterMark table with the latest value of updateDate column from the Student table
after the data is loaded. This procedure takes two parameters: LastModifiedtime and
TableName. The values of these parameters are set with the lookupNewWaterMark
activity output and pipeline parameters respectively.
Once all the five activities are completed, I publish all the changes. Then, I press the
Debug button for a test execution of the pipeline. The output tab of the pipeline
shows the status of the activities.
I follow the debug progress and see all activities are executed successfully.
Step 18: Check the data in Azure SQL Database
As I select data from the [Link] table, I can see the waterMarkVal column value has changed, and it is equal to the maximum value of the updateDate column of the [Link] table in SQL Server.
As I select data from [Link] table, I can see all the records inserted in the
[Link] table in SQL Server are now available in the Azure SQL Student table.
SELECT tableName,waterMarkVal
FROM [Link]
SELECT studentid, studentName,stream,marks,createDate,updateDate
FROM [Link]
Now, I update the stream value in one record of the [Link] table in SQL Server.
The updateDate column value is also modified with the GETDATE() function output. I
also add a new student record. The inserted and updated records have the latest
values in the updateDate column.
In the next load, only the updates and inserts in the source table need to be reflected in the sink table. The other records should remain the same.
UPDATE [Link]
SET stream = 'ECE',
updateDate = GETDATE()
WHERE studentId = 3
INSERT INTO [Link]
(studentName,stream,marks,createDate,updateDate)
VALUES
('aaa', 'CSE',100,GETDATE(), GETDATE())
I execute the pipeline again by pressing the Debug button. I follow the progress and
all the activities execute successfully.
As I select data from [Link] table, I can see the waterMarkVal column value
is changed. It is now equal to the maximum value of the updateDate column of
[Link] table in SQL Server. As I select data from [Link] table, I can see
one existing student record is updated and a new record is inserted.
So, I have successfully completed incremental load of data from on-premise SQL
Server to Azure SQL database table.
SELECT tableName,waterMarkVal
FROM [Link]
SELECT studentid, studentName,stream,marks,createDate,updateDate
FROM [Link]
Conclusion
The step-by-step process above can be followed to incrementally load data from an on-premises SQL Server source table to an Azure SQL database sink table.
Pipeline parameter values can be supplied to load data from any source to any sink
table. The source table column to be used as a watermark column can also be
configured. Once the pipeline is completed and debugging is done, a trigger can be
created to schedule the ADF pipeline execution.
Implementing Slowly Changing Dimensions (Type 2) in Azure Data Flow 🌟
Managing data history is crucial for any organization. Type 2 Slowly Changing Dimensions
(SCD2) allows us to retain full historical records of data changes, ensuring we capture the
complete evolution of our data over time.
Step 1: Add the source dataset (the dataset should point to the file located in your source layer).
Step 2: Add a Derived Column transformation, add a column named isactive, and set its value to 1.
Step 6: Add a Lookup transformation (it requires two sources, so the first source will be your Select transformation).
Step 8: Now filter out the rows which have non-nulls in the source file columns.
Step 9: Select only the required columns that you are going to insert or update in the SQL table.
Step 10: Add a Derived Column transformation, add the isactive column to the table, and set its value to 0.
Step 11: Add an Alter Row transformation and configure it.
Step 12: Add a sink and configure it.
Step 13: Finally, under the Settings section of your data flow, select sink2 first and sink1 second.
Step 14: After successfully running your pipeline, verify the data in your SQL table.
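The core SCD2 merge rule the data flow applies — close the old version with isactive = 0 and insert the new version with isactive = 1 — can be sketched in Python. The `key` and `value` columns are placeholders for your business key and tracked attributes:

```python
def scd2_merge(dimension, incoming):
    """Type 2 merge: when a tracked value changes, deactivate the current
    row and append the new version, preserving full history."""
    out = []
    incoming_by_key = {r["key"]: r for r in incoming}
    for row in dimension:
        changed = (
            row["isactive"] == 1
            and row["key"] in incoming_by_key
            and incoming_by_key[row["key"]]["value"] != row["value"]
        )
        # Close the old version (isactive = 0) if its value changed.
        out.append({**row, "isactive": 0} if changed else row)
    current = {r["key"]: r["value"] for r in out if r["isactive"] == 1}
    for r in incoming:
        # Insert a new active version for changed or brand-new keys.
        if current.get(r["key"]) != r["value"]:
            out.append({**r, "isactive": 1})
    return out

dim = [{"key": "C1", "value": "Pune", "isactive": 1}]
new = [{"key": "C1", "value": "Mumbai"}]
print(scd2_merge(dim, new))  # old row closed, new active row appended
```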
Enhancing Data Workflows with Logic Apps for Real-Time Notifications
In today’s fast-paced data environment, ensuring timely notifications for pipeline events is
crucial. Here’s how I integrated Logic Apps to automate notifications for my Azure Data
Factory (ADF) pipeline, sending alerts directly to my email.
Step-by-Step Implementation:
1. Create a New Logic App: Navigate to the Azure portal and create a new Logic App.
2. Design the Workflow:
o Open the Logic App designer.
o Search for the HTTP trigger and add it to the workflow.
o Configure the trigger method as GET.
3. Add New Step: Search for the Send an email (V2) action.
4. Configure Email Parameters:
o Add email address parameters.
o Set up the email body and subject dynamically.
o Save the Logic App to generate a unique URL.
5. Integrate with ADF Pipeline:
Now, whenever there is a failure in the ADF pipeline, an automated email notification is sent
out, providing real-time alerts and ensuring prompt action can be taken to address the issue.
Building a Secure Azure Data Factory Pipeline with Azure Key Vault Integration
In today's data-driven world, securing access to your data sources is paramount. Here’s a
step-by-step guide on how I created a secure Azure Data Factory (ADF) pipeline by
integrating Azure Key Vault for managing secrets like storage and database connection
strings.
1. Set Up the Storage Account: Navigate to the Azure portal and create a new storage
account.
2. Access Keys: Under the storage account, go to Security + Networking and select
Access keys. Copy the connection string.
For more details, you can refer to the official Microsoft documentation.
Benefits:
Enhanced Security: By using Azure Key Vault, sensitive connection strings and secrets
are securely managed.
Scalability: Easily scale your data operations with secure and efficient credential
management.
Scenario Series 1: Fetch Files Not Present in Destination
To pass the final array of employee names to another variable (`ename`) after
appending them using the Append Variable activity in Azure Data Factory
(ADF), follow these steps:
SELECT ename FROM employee;
@activity('LookupActivityName').[Link]
@item().ename
Final Setup:
This setup will allow you to gather all employee names into an array and store
them in the `ename` variable, ready for use in subsequent pipeline activities.
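The ForEach/Append Variable pattern can be sketched in Python; the Lookup output shape is an assumption:

```python
def collect_names(lookup_rows):
    # ForEach iterates the Lookup output; each iteration appends
    # @item().ename via the Append Variable activity.
    names = []
    for item in lookup_rows:
        names.append(item["ename"])
    return names  # finally copied into the `ename` variable

rows = [{"ename": "Asha"}, {"ename": "Ravi"}]
print(collect_names(rows))  # ['Asha', 'Ravi']
```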
The Until activity in Azure Data Factory (ADF) is designed to repeatedly execute
a series of activities in a loop until a specified condition evaluates to true. This
is particularly useful for scenarios where you need to wait for a certain state or
perform iterative operations.
@greater(variables('X'), 5)
@add(variables('VarB'), 1)
The Until activity runs a loop, executing all the activities defined inside it, until
the condition associated with it evaluates to true. In this example, the loop
increments the value of `X` by 1 on each iteration. The loop continues until `X`
becomes greater than 5.
This approach allows the loop to progress smoothly, ensuring that the
condition (`X > 5`) is eventually met, thus stopping the loop.
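The Until semantics can be sketched in Python: the body runs, then the condition is evaluated, and the loop repeats until the condition is true:

```python
def run_until(x, limit=5):
    # Until activity: execute the inner activities, then evaluate
    # @greater(variables('X'), limit); repeat until it is true.
    iterations = 0
    while True:
        x = x + 1            # inner Set Variable: @add(variables('X'), 1)
        iterations += 1
        if x > limit:        # the Until condition
            break
    return x, iterations

print(run_until(0))  # (6, 6)
```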
To check the row count for a single file in Azure Data Factory (ADF) using the Lookup activity, and then copy the file only if it has rows, you can follow these steps:
Step-by-Step Process:
1. Lookup Activity:
o Use the Lookup activity to check if the CSV file has rows.
o You can configure this to run a query that counts the rows in the
file.
2. If Condition Activity:
o Add an If Condition activity to evaluate whether the row count is greater
than 0.
o If the condition is true, the file will be copied; otherwise, it will be
skipped.
This setup ensures that the file is only copied if it contains rows, and you can
dynamically handle the file name using parameters.
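The check-then-copy decision can be sketched in Python using the standard csv module; the CSV content below is a stand-in for the file the Lookup activity would read:

```python
import csv
import io

def copy_if_nonempty(csv_text):
    # Lookup: count the data rows; If Condition: copy only when count > 0.
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if len(rows) > 0:
        return rows          # stands in for the Copy activity
    return None              # file skipped

print(copy_if_nonempty("id,name\n1,a\n"))   # [{'id': '1', 'name': 'a'}]
print(copy_if_nonempty("id,name\n"))        # None
```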
Step-by-Step Process:
1. Create a Web Activity to Call the REST API:
o This activity will fetch the JSON response from your REST API.
2. Create a Set Variable Activity:
o Use this activity to assign the response from the Web Activity to a
pipeline variable.
3. Configure the Web Activity and Set Variable Activity:
o Web Activity: Configure it to call the REST API and get the JSON
response.
o Set Variable Activity: Set up to assign the response to a variable.
If the response is a JSON array or nested, you might need to use functions to
parse or access specific parts of the JSON. Here’s how you can handle different
cases:
• Access Specific Elements:
@activity('CallRESTAPI').[Link][0].name
Summary:
1. Web Activity retrieves the JSON response from the REST API.
2. Set Variable Activity assigns the JSON response to a pipeline variable.
3. Use expressions to access or manipulate the JSON data as needed.
This setup allows you to dynamically pass the response from a REST API to a
variable for further processing within your ADF pipeline.
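The response handling can be sketched with Python's json module. The response body and the `value[0].name` path here are hypothetical, standing in for whatever your REST API returns:

```python
import json

# A hypothetical REST response body; the real one comes from the Web activity.
response_text = '{"value": [{"name": "pipelineA"}, {"name": "pipelineB"}]}'

payload = json.loads(response_text)        # Web activity output, parsed
first_name = payload["value"][0]["name"]   # like accessing output.value[0].name
print(first_name)  # pipelineA
```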