Azure Project

This document provides instructions for developing a data pipeline to process order files from a landing folder to staging and discarded folders. Key steps include: 1. Creating Azure resources like storage, Databricks, and Data Factory. 2. Developing logic in a Databricks notebook to check for duplicate order IDs and valid order statuses, and move files accordingly. 3. Creating a pipeline in Data Factory to automate execution when files arrive, triggering the Databricks notebook. The pipeline is then enhanced to make it more generalized and handle any file, rather than a hardcoded name, by dynamically retrieving filenames from the trigger. Mounting logic is also improved to check if storage is already mounted before mounting.


Disclaimer: These slides are copyrighted and strictly for personal use only

• This document is reserved for people enrolled into the Ultimate Big Data Masters Program (Cloud Focused) by Sumit Sir

• Please do not share this document; it is intended for personal use and exam preparation only, thank you.

• If you’ve obtained these slides for free on a website that is not the course’s website, please reach out to [email protected]. Thanks!

• All the Best and Happy Learning!


Developing a Data Pipeline

Project Use-case

A third-party service drops a file named orders.csv in the landing folder.

Requirement:

As soon as the file arrives in the landing folder, perform the following checks:

1. Check for duplicate order_id values in the orders data

2. Check for valid order_status values

If both conditions are true, move the file to the staging folder; else move it to the discarded folder.

Note: In the future, if the list of valid order_status values changes, we should be able to incorporate the change dynamically.

Services required to implement the solution:

As soon as the data is added to the storage
|
Run the Pipeline (Data Factory - trigger is a Storage Event)
|
Databricks Notebook executes the Spark code to perform the necessary checks.
Building the Data Pipeline for the Project

Creating the necessary Resources:

1. Storage Account

-> Create a storage account with hierarchical namespace enabled

-> Create a container named “sales”. Inside the sales container, create the directories: landing, staging, discarded

2. Databricks

-> Create a Databricks Workspace

3. Data Factory

-> Create a Data Factory service

-> Two components have to be connected to Azure Data Factory:

1. ADLS Gen2 (Storage)

2. Databricks (Compute)

-> During this process, we will use a Key Vault to store any of the passwords/secret keys.

-> Linked Services need to be created between the Data Factory and the 2 components - ADLS Gen2 & Databricks.

- During the creation of the linked service for Databricks, we cannot directly add the access token; instead, store the access token in the Key Vault and reference it by name after switching the option to “Azure Key Vault”.

- While connecting to the Key Vault in the Databricks linked service, a Key Vault linked service should also be created. To create a linked service to the Key Vault, we need to allow Data Factory to access the Key Vault by enabling it in the access policies.

In total, 3 linked services will be created ->

1. ADLS Gen2 | 2. Databricks | 3. Key Vault

Creating Azure SQL DB: Lookup Table to maintain the list of valid order_status values

ON_HOLD

PAYMENT_REVIEW

PROCESSING

CLOSED

SUSPECTED_FRAUD

COMPLETE

PENDING

CANCELLED

PENDING_PAYMENT
To keep the above list in a lookup table, an Azure SQL database is required along with the following database details:

Database Name

Server Name

Username

Password (the SQL password is stored in the Key Vault)

-> Once the database is ready, allow Azure services (like Databricks) to access this database by enabling client access in the firewall settings.

-> Create a lookup table - valid_order_status

create table valid_order_status(status_name varchar(50))

-> Inserting values into the table

insert into valid_order_status values
('ON_HOLD'), ('PAYMENT_REVIEW'), ('PROCESSING'), ('CLOSED'), ('SUSPECTED_FRAUD'),
('COMPLETE'), ('PENDING'), ('CANCELED'), ('PENDING_PAYMENT');

select * from valid_order_status;

Developing Logic using an Interactive Databricks Cluster

-> Create an interactive cluster in Databricks that will execute the Databricks notebook.

dbutils.fs.mount(
    source = 'wasbs://[email protected]',
    mount_point = '/mnt/sales',
    extra_configs = {'fs.azure.account.key.trendtechsa.blob.core.windows.net': 'MQsEdge/QEAbx95+lIlFcujt5AnU7Q9ErfjTiDEEXkBv4jHFNfsTEozFawfr8KUUrkd3qf9/LSDZ+AStDaWbXw=='}
)

sales@trendtechsa : “sales” - container name, “trendtechsa” - storage account name

fs.azure.account.key.trendtechsa.blob.core.windows.net : “trendtechsa” - storage account name
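
To quickly confirm the mount, the mount points and the landing folder can be listed (a minimal verification snippet; the paths follow the example above):

# List the configured mount points and the contents of the landing folder
display(dbutils.fs.mounts())
display(dbutils.fs.ls('/mnt/sales/landing'))
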
Perform the following in the Databricks Notebook:

1. Create a Mount Point.

2. Create a Dataframe.

3. Load the orders.csv file into a Dataframe and perform the validation (checking for duplicate order_id values) - see the sketch after this list.

4. Establish a connection with Azure SQL Server from Databricks. During this process, a secret scope has to be created so that Databricks can access the sql-password which is stored in the Key Vault.

Note:
In the case of Data Factory, it can directly access the Key Vault by creating a linked service, but for Databricks, a secret scope is needed to access the Key Vault.

While creating the secret scope, keep the workspace URL up to .net and append “#secrets/createScope”:

https://<databricks-instance>#secrets/createScope

Eg:
https://adb-8497524354556085.5.azuredatabricks.net/#secrets/createScope
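
A minimal sketch of steps 2 and 3 - reading orders.csv from the mounted landing folder into a DataFrame (the schema is inferred here; column names such as order_id come from the dataset itself):

# Read the orders file from the mounted landing folder into a DataFrame
ordersDf = spark.read.csv('/mnt/sales/landing/orders.csv',
                          header = True, inferSchema = True)
ordersDf.printSchema()
ordersDf.show(5)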

5. First Validation Check - check if there are any duplicate order_ids. Once the validation is successful, create a Spark temp view from the Dataframe.

errorFlg = False

ordersCount = ordersDf.count()
distinctOrdersCount = ordersDf.select('order_id').distinct().count()

if ordersCount != distinctOrdersCount:
    errorFlg = True

if errorFlg:
    dbutils.fs.mv('/mnt/sales/landing/orders.csv', '/mnt/sales/discarded')
    dbutils.notebook.exit('{"errorFlg": "true", "errorMsg": "Orderid is repeated"}')

ordersDf.createOrReplaceTempView("orders")
6. Then the second validation (check for valid order_status) is evaluated.
For this validation, we need to connect to the Azure SQL DB from the Databricks notebook. The following details are required for connecting to the SQL DB:

dbServer
dbPort
dbName
dbUser
dbPassword
databricksScope

connectionUrl = 'jdbc:sqlserver://{}.database.windows.net:{};database={};user={};'.format(dbServer, dbPort, dbName, dbUser)

dbPassword = dbutils.secrets.get(scope = databricksScope, key = 'sql-password')

connectionProperties = {
    'password': dbPassword,
    'driver': 'com.microsoft.sqlserver.jdbc.SQLServerDriver'
}

Note: Databricks cannot directly access the Key Vault (where the database password is stored). A secret scope needs to be created in Databricks in order to access the database password present in the Key Vault.

In the URL of the Databricks workspace, append #secrets/createScope to create a secret scope (databricksScope).

Ex:
https://adb-5015943971662126.6.azuredatabricks.net/onboarding?o=5015943971662126#secrets/createScope

Connecting to the SQL Database to read the valid order_status list:

validStatusDf = spark.read.jdbc(url = connectionUrl, table = 'dbo.valid_order_status',
                                properties = connectionProperties)

7. Once the second validation is also successful, the notebook exits with the following message and the file is moved to the staging folder.

Notebook exited: {"errorFlg": "false", "errorMsg": "All Good"}
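
The comparison itself is not spelled out above, so here is a minimal sketch of how the second check and the final move could look, reusing ordersDf and validStatusDf (the column names order_status and status_name follow the earlier definitions; the exact logic may differ in the course notebook):

# Collect the valid statuses from the lookup table into a Python list
validStatusList = [row.status_name for row in validStatusDf.collect()]

# Count order_status values in the orders data that are not in the lookup list
invalidCount = ordersDf.filter(~ordersDf.order_status.isin(validStatusList)).count()

if invalidCount > 0:
    dbutils.fs.mv('/mnt/sales/landing/orders.csv', '/mnt/sales/discarded')
    dbutils.notebook.exit('{"errorFlg": "true", "errorMsg": "Invalid order_status found"}')

# Both checks passed - move the file to the staging folder and exit
dbutils.fs.mv('/mnt/sales/landing/orders.csv', '/mnt/sales/staging')
dbutils.notebook.exit('{"errorFlg": "false", "errorMsg": "All Good"}')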
Creating a Pipeline in Azure Data Factory (Automating the pipeline execution)

-> Add the Databricks Notebook activity to the pipeline.
-> In the Settings tab, select the Notebook to execute.
-> Add Trigger -> Type (Storage Event Trigger) -> Storage Account Name -> Container Name (Ex: sales) -> Path Begins With (Ex: landing) -> Path Ends With (Ex: orders.csv) -> Continue.
-> The Monitor tab gives an overview of all the triggered pipelines.
-> Process:
Once the data is uploaded to the specified container in the storage account, the storage event triggers the pipeline execution by launching the cluster, and the Databricks notebook is executed. Once the execution is complete, the cluster deployed for the pipeline execution is terminated.

Scenario 2

Problem Statement: Automatically trigger a pipeline execution when any file (not a specific hardcoded filename - orders.csv as in the previous example) is added to the Storage Account container.

Solution:
Adopt a parameterized approach to dynamically read files, moving away from hardcoding specific filenames.

TriggerBody —> Pipeline —> Databricks Notebook

[Code]
# The filename is passed in from the pipeline as a notebook parameter
filename = dbutils.widgets.get('filename')
fnamewithoutExt = filename.split('.')[0]
print(filename)

ordersDf = spark.read.csv('/mnt/sales/landing/{}'.format(filename),
                          inferSchema = True, header = True)

Note:
-> The trigger (first point of contact) should have the capability to dynamically capture the filenames of newly added files in the Storage Account and pass them to the Pipeline.
In this context, the file names are no longer hardcoded; instead, they are dynamically retrieved from the pipeline and stored in the "filename" parameter. The DataFrame is then created using the “filename” value.
-> We retrieve the filename from the trigger body (@triggerBody().filename).
-> Changes to be made in the Notebook:

filename = dbutils.widgets.get('filename')
fnamewithoutExt = filename.split('.')[0]
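
For interactive testing outside the pipeline, a default can be declared for the widget so the notebook also runs standalone (a minimal sketch; the default value 'orders.csv' is only an assumption for local runs, and a base parameter passed from Data Factory overrides it):

# Declare the widget with a default so the notebook can also run interactively;
# a 'filename' base parameter passed by the pipeline takes precedence
dbutils.widgets.text('filename', 'orders.csv')
filename = dbutils.widgets.get('filename')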

Scenario 3

Problem Statement: Enhance the Pipeline by creating generalized pipelines with generic mount code.

Solution: The mount code logic should be developed in a way that if the mount point is already set, no mounting takes place; otherwise, the storage is mounted.

alreadyMounted = False

# Check whether /mnt/sales is already in the list of mount points
for x in dbutils.fs.mounts():
    if x.mountPoint == '/mnt/sales':
        alreadyMounted = True
        break

print(alreadyMounted)

if not alreadyMounted:
    dbutils.fs.mount(
        source = 'wasbs://[email protected]',
        mount_point = '/mnt/sales',
        extra_configs = {'fs.azure.account.key.trendtechsa.blob.core.windows.net': 'MQsEdge/QEAbx95+lIlFcujt5AnU7Q9ErfjTiDEEXkBv4jHFNfsTEozFawfr8KUUrkd3qf9/LSDZ+AStDaWbXw=='}
    )
    alreadyMounted = True
    print('Mounting done successfully')
else:
    print('Already mounted')

Key Points: Enhancements to the Pipeline

1. Dynamically picking the filenames using the parameterization approach
2. Generic code for mounting the storage
3. A secure way of accessing the Storage Account Key with Key Vault (see the sketch below)
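
A minimal sketch combining key points 2 and 3 - fetching the storage account key from the Key Vault through the secret scope instead of hardcoding it (the secret name 'storage-account-key' and the reuse of databricksScope are assumptions; container and account names follow the earlier example):

# Fetch the storage account key from the Key Vault-backed secret scope
# (the secret name 'storage-account-key' is assumed here)
storageAccountKey = dbutils.secrets.get(scope = databricksScope, key = 'storage-account-key')

# Mount only if /mnt/sales is not already mounted
if not any(m.mountPoint == '/mnt/sales' for m in dbutils.fs.mounts()):
    dbutils.fs.mount(
        source = 'wasbs://[email protected]',
        mount_point = '/mnt/sales',
        extra_configs = {'fs.azure.account.key.trendtechsa.blob.core.windows.net': storageAccountKey}
    )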

Background Activity: Mimicking the scenario - data provided by a Third Party

1. order_items data (present in Amazon S3 in JSON format)

A third-party service adds the order_items file in JSON format to a bucket in Amazon S3 (we need to mimic this scenario).
Requirement - Get the data from Amazon S3 to ADLS Gen2 using Azure Data Factory.

[ You would need an AWS account for this activity. Create an Amazon S3 bucket and a folder to upload the order_items data file in JSON format. An IAM access key ID and secret key need to be generated for external services to access this JSON file in the S3 bucket.
To access the file from Azure, the Access Key ID and Secret Key values should be added as secrets in the Key Vault. ]
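
To mimic the third-party drop, the file can be uploaded to the bucket with a short boto3 script (a sketch; the bucket name 'trendytech-order-items' and the folder 'order_items/' are hypothetical, and AWS credentials are assumed to be configured locally):

import boto3

# Upload order_items.json to the S3 bucket to mimic the third-party service
s3 = boto3.client('s3')  # picks up credentials from the local AWS configuration
s3.upload_file('order_items.json', 'trendytech-order-items', 'order_items/order_items.json')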
Accessing the file in the Amazon S3 bucket from Azure:
Create a Linked Service for Amazon S3 - provide the S3 details (the Access Key ID and Secret Key stored in the Azure Key Vault).

Tasks to be performed as soon as the file (orders.csv) arrives in the landing folder of ADLS Gen2:
a. Get the data file order_items.json from Amazon S3 and load it into ADLS Gen2:
Create a data pipeline in ADF with Amazon S3 as the source (file in JSON format) and ADLS Gen2 as the sink (file to be loaded in CSV format).
b. Execute the Databricks Notebook.

2. customers data

A third-party agency will be publishing this data in an Azure SQL DB (we need to mimic this scenario).
Requirement - Get the data from the Azure SQL DB to ADLS Gen2 using Azure Data Factory.

- Upload the customers.csv file to a folder in ADLS Gen2.
- Create an Azure SQL DB and a table with the schema of the customers data file.
- Create two Linked Services ->
  a. Linked Service pointing to the Azure SQL DB
  b. Linked Service pointing to the ADLS Gen2 storage
- Create a Pipeline in Azure Data Factory with a Copy Data activity:
  Source: customers.csv in ADLS Gen2
  Sink: customers table in Azure SQL DB
  (Schema Mapping - ensure that the schemas of the Source and Sink match)
Project Requirement

1. Number of orders placed by each customer
2. Amount spent by each customer

Solution: Join the tables (orders, order_items, customers) to calculate the number of orders placed and the amount spent by each customer - see the sketch below.
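
A minimal sketch of the final aggregation in PySpark, assuming the three datasets are already loaded as ordersDf, orderItemsDf and customersDf, and that the join keys and amount column are named order_customer_id, customer_id, order_item_order_id and order_item_subtotal (adjust to the actual schemas):

from pyspark.sql import functions as F

# Join orders with order_items and customers, then aggregate per customer
resultDf = (ordersDf
    .join(orderItemsDf, ordersDf.order_id == orderItemsDf.order_item_order_id)
    .join(customersDf, ordersDf.order_customer_id == customersDf.customer_id)
    .groupBy('customer_id')
    .agg(F.countDistinct('order_id').alias('num_orders'),
         F.round(F.sum('order_item_subtotal'), 2).alias('amount_spent')))

resultDf.show(10)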
