Disclaimer: These slides are copyrighted and strictly for personal use only.
• This document is reserved for people enrolled in the Ultimate Big Data Masters Program (Cloud Focused) by Sumit Sir.
• Please do not share this document; it is intended for personal use and exam preparation only. Thank you.
• If you’ve obtained these slides for free on a website that is not the course’s website, please reach out to [email protected]. Thanks!
• All the Best and Happy Learning!
Developing a Data Pipeline
Project Use-case
A third party service drops a file named orders.csv in the landing folder.
Requirement:
As soon as the file arrives in the landing folder, perform the following checks:
1. Check for duplicate order_id values
2. Check for valid order_status values
If both checks pass, move the file to the staging folder; otherwise move it to the
discarded folder.
Note: In the future, if the list of valid order_status values changes, we should be
able to incorporate the changes dynamically.
Services required to implement the solution:
As soon as the data is added to the storage
|
Run the pipeline (Data Factory - trigger is a Storage Event)
|
Databricks Notebook executes the Spark code to perform the necessary checks.
Building the Data Pipeline for the Project
Creating the necessary Resources:
1. Storage Account
-> Create a storage account with the hierarchical namespace enabled
-> Create a container named “sales”. Inside the sales container, create the
directories: landing, staging, discarded
2. Databricks
-> Create a Databricks Workspace
3. Data Factory
-> Create a Data Factory service
-> Two components have to be connected to Azure Data Factory:
1. ADLS Gen2 (Storage)
2. Databricks (Compute)
-> During this process, we will use a Key Vault to store any passwords/secret keys.
-> Linked Services need to be created between the Data Factory and the 2
components - ADLS Gen2 & Databricks.
- During the creation of the linked service for Databricks, we cannot add the
access token directly; instead, store the access token in the Key Vault and
reference it by name after switching the authentication option to “Azure Key Vault”.
- While connecting to the Key Vault in the Databricks linked service, a Key Vault
linked service must also be created. To create a linked service to the Key Vault,
we need to allow the Data Factory to access the Key Vault by granting it
permissions in the Key Vault’s access policies.
In total, 3 linked services will be created ->
1. ADLS Gen2 | 2. Databricks | 3. Key Vault
Creating an Azure SQL DB: a lookup table to maintain the list of valid
order_status values
ON_HOLD
PAYMENT_REVIEW
PROCESSING
CLOSED
SUSPECTED_FRAUD
COMPLETE
PENDING
CANCELED
PENDING_PAYMENT
To keep the above list in a lookup table, an Azure SQL database is required,
along with the following database details:
Database name
Server name
Username
Password (the SQL password is stored in the Key Vault)
-> Once the database is ready, allow Azure services (like Databricks) to access
this database by enabling client access in the server's firewall settings.
-> Create a lookup table - valid_order_status
create table valid_order_status (status_name varchar(50));
-> Inserting values into the table
insert into valid_order_status
values ('ON_HOLD'), ('PAYMENT_REVIEW'), ('PROCESSING'), ('CLOSED'),
       ('SUSPECTED_FRAUD'), ('COMPLETE'), ('PENDING'), ('CANCELED'),
       ('PENDING_PAYMENT');

select * from valid_order_status;
Developing Logic using Interactive Databricks Cluster
-> Create an interactive cluster in Databricks that will execute the Databricks
notebook.
dbutils.fs.mount(
    source='wasbs://[email protected]',
    mount_point='/mnt/sales',
    extra_configs={'fs.azure.account.key.trendtechsa.blob.core.windows.net': 'MQsEdge/QEAbx95+lIlFcujt5AnU7Q9ErfjTiDEEXkBv4jHFNfsTEozFawfr8KUUrkd3qf9/LSDZ+AStDaWbXw=='})

sales@trendtechsa:
    "sales" - container name
    "trendtechsa" - storage account name
fs.azure.account.key.trendtechsa.blob.core.windows.net:
    "trendtechsa" - storage account name
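To confirm the mount succeeded, a quick listing of the mount point can be run in the notebook (a minimal check; it assumes the landing, staging and discarded directories were already created in the sales container):

# Verify the mount by listing the container contents.
display(dbutils.fs.ls('/mnt/sales'))   # should list landing/, staging/ and discarded/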
Perform the following in the Databricks Notebook:
1. Create a Mount Point.
2. Create a DataFrame.
3. Load the orders.csv file into a DataFrame and perform validation (checking
for duplicate order_id values).
4. Establish a connection with Azure SQL Server from Databricks. During this
process, a secret scope has to be created so that Databricks can access the
sql-password stored in the Key Vault.
Note:
In the case of Data Factory, the Key Vault can be accessed directly by creating a
linked service, but for Databricks, a secret scope is needed to access the Key
Vault.
While creating the secret scope, take the workspace URL up to .net and append
“#secrets/createScope”:
https://<databricks-instance>#secrets/createScope
Eg:
https://adb-8497524354556085.5.azuredatabricks.net/#secrets/createScope
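Once the scope exists, a quick sanity check from a notebook cell confirms that Databricks can see it and read the secret (a minimal sketch; the scope name databricksScope and key name sql-password follow the naming used later in these notes):

print(dbutils.secrets.listScopes())                                              # the new scope should be listed
sqlPassword = dbutils.secrets.get(scope='databricksScope', key='sql-password')   # value shows as [REDACTED] if printed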
5. First validation check - whether there are any duplicate order_ids. Once the
validation is successful, create a Spark table from the DataFrame.
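The ordersDf used below comes from steps 2-3 above; a minimal sketch of that read, assuming the file has a header row and the schema is inferred:

# Load orders.csv from the mounted landing folder into a DataFrame.
ordersDf = spark.read.csv('/mnt/sales/landing/orders.csv', header=True, inferSchema=True)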
errorFlg = False
ordersCount = ordersDf.count()
distinctOrdersCount = ordersDf.select('order_id').distinct().count()

if ordersCount != distinctOrdersCount:
    errorFlg = True

if errorFlg:
    dbutils.fs.mv('/mnt/sales/landing/orders.csv', '/mnt/sales/discarded')
    dbutils.notebook.exit('{"errorFlg": "true", "errorMsg": "order_id is repeated"}')

ordersDf.createOrReplaceTempView('orders')
6. Then the second validation (check for valid order_status values) is evaluated.
For this validation, we need to connect to the Azure SQL DB from the Databricks
notebook. The following details are required for connecting to the SQL DB:
dbServer
dbPort
dbName
dbUser
dbPassword
databricksScope
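For illustration, these could be assigned as follows (placeholder values only; the real server, database, user, and scope names come from your own setup):

dbServer = 'trendytech-sqlserver'     # placeholder: Azure SQL logical server name (without .database.windows.net)
dbPort = 1433                         # default SQL Server port
dbName = 'salesdb'                    # placeholder database name
dbUser = 'sqladmin'                   # placeholder SQL admin user
databricksScope = 'databricksScope'   # secret scope created above; dbPassword is fetched below from the Key Vault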
connectionUrl = 'jdbc:sqlserver://{}.database.windows.net:{};database={};user={};'.format(dbServer, dbPort, dbName, dbUser)

dbPassword = dbutils.secrets.get(scope=databricksScope, key='sql-password')

connectionProperties = {
    'password': dbPassword,
    'driver': 'com.microsoft.sqlserver.jdbc.SQLServerDriver'
}
Note: Databricks cannot directly access the Key Vault (where the database
password is stored). A secret scope needs to be created in Databricks in order to
access the database password present in the Key Vault.
In the URL of the Databricks workspace, append the following to create a secret
scope (databricksScope):
#secrets/createScope
Ex:
https://adb-5015943971662126.6.azuredatabricks.net/#secrets/createScope
Connecting to the SQL Database for order_status
validStatusDf = spark.read.jdbc(url=connectionUrl, table='dbo.valid_order_status', properties=connectionProperties)
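The comparison itself is not spelled out on the slide; a minimal sketch of the check, using the orders temp view created earlier and an anti join against the lookup data (names follow the variables above):

validStatusDf.createOrReplaceTempView('valid_order_status')

# Count orders whose status does not appear in the lookup table.
invalidCount = spark.sql("""
    select o.order_status
    from orders o
    left anti join valid_order_status v
    on o.order_status = v.status_name
""").count()

if invalidCount > 0:
    dbutils.fs.mv('/mnt/sales/landing/orders.csv', '/mnt/sales/discarded')
    dbutils.notebook.exit('{"errorFlg": "true", "errorMsg": "Invalid order_status found"}')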
7. Once the second validation is also successful, the notebook exits with the
following message and the file is moved to the staging folder.
Notebook exited: {"errorFlg": "false", "errorMsg": "All Good"}
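The corresponding success-path code is along these lines (a sketch, mirroring the move to discarded shown earlier):

dbutils.fs.mv('/mnt/sales/landing/orders.csv', '/mnt/sales/staging')
dbutils.notebook.exit('{"errorFlg": "false", "errorMsg": "All Good"}')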
Creating a Pipeline in Azure Data Factory (Automating the pipeline
execution)
-> Add the Databricks Notebook to the pipeline.
-> In the Settings tab, Select the Notebook to execute
-> Add Trigger -> Type (Storage Event Trigger) -> Storage Account Name ->
Container Name (Ex: sales) -> Path Begins with (Ex: landing/) -> Path Ends
with (Ex: orders.csv) -> Continue.
-> The Monitor tab gives an overview of all the triggered pipelines.
-> Process
Once the data is uploaded to the specified container in the storage
account, the storage event triggers the pipeline execution by launching the
cluster, and the Databricks notebook is executed. Once the execution is
complete, the cluster deployed for the pipeline execution is terminated.
Scenario 2
Problem Statement: Automatically trigger a pipeline execution when any file
(not a specific hardcoded filename such as orders.csv in the previous example) is
added to the Storage Account container.
Solution:
Adopt a parameterized approach to read files dynamically, moving away
from hardcoding specific filenames.
TriggerBody —> Pipeline —> Databricks Notebook
[Code]
filename = dbutils.widgets.get('filename')   # the filename is passed in from the pipeline
fnamewithoutExt = filename.split('.')[0]
print(filename)
ordersDf = spark.read.csv('/mnt/sales/landing/{}'.format(filename), inferSchema=True, header=True)
Note:
-> The trigger (first point of contact) should be able to dynamically capture the
filenames of newly added files in the Storage Account and pass them to the
Pipeline.
In this context, the file names are no longer hardcoded; instead, they are
dynamically retrieved from the pipeline and stored in the "filename" variable.
The DataFrame is then created using the "filename" variable.
-> We are retrieving the filename from the trigger body
(@triggerBody().fileName)
-> Changes to be made in the Notebook
filename = dbutils.widgets.get('filename')
fnamewithoutExt = filename.split('.')[0]
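With the filename parameterized, the hardcoded orders.csv paths in the move calls should follow suit (a sketch):

dbutils.fs.mv('/mnt/sales/landing/{}'.format(filename), '/mnt/sales/discarded')   # when a check fails
dbutils.fs.mv('/mnt/sales/landing/{}'.format(filename), '/mnt/sales/staging')     # when both checks pass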
Scenario 3
Problem Statement: Enhance the pipeline by creating generalized pipelines
with generic mount code.
Solution: The mount logic should be written so that if the mount point is already
set, no mounting takes place; otherwise the storage is mounted.
alreadyMounted = False

for x in dbutils.fs.mounts():
    if x.mountPoint == '/mnt/sales':
        alreadyMounted = True
        break

print(alreadyMounted)

if not alreadyMounted:
    dbutils.fs.mount(
        source='wasbs://[email protected]',
        mount_point='/mnt/sales',
        extra_configs={'fs.azure.account.key.trendtechsa.blob.core.windows.net': 'MQsEdge/QEAbx95+lIlFcujt5AnU7Q9ErfjTiDEEXkBv4jHFNfsTEozFawfr8KUUrkd3qf9/LSDZ+AStDaWbXw=='})
    alreadyMounted = True
    print('Mounting done successfully')
else:
    print('Already mounted')
Key Points : Enhancements to the Pipeline
1. Dynamically picking the Filenames using the Parameterization
approach
2. Generic code for mounting the storage
3. A secure way of accessing the Storage Account Key with Key
Vault.
Background Activity: Mimicking the scenario - data provided by a third party
1. order_items data (present in Amazon S3 in JSON format)
A third-party service is adding the order_items file in JSON format to a
bucket in Amazon S3 (need to mimic this scenario).
Requirement - Get the data from Amazon S3 to ADLS Gen2
using Azure Data Factory.
[ You would need an AWS account for this activity. Create an Amazon S3
bucket and a folder, and upload the order_items data file in JSON format. An IAM
user with an access key ID and secret access key needs to be created so that
external services can access this JSON file in the S3 bucket.
To access the file from Azure, the access key ID and secret access key values
should be added as secrets in the Key Vault. ]
Accessing a file in an Amazon S3 bucket from Azure
Create a Linked Service for Amazon S3 - provide the S3 details (the access
key ID and secret key stored in the Azure Key Vault).
Tasks to be performed as soon as the file (orders.csv) arrives in the
landing folder of ADLS Gen2:
a. Get the data file order_items.json from Amazon S3 and load it into
ADLS Gen2:
Create a data pipeline in ADF with Amazon S3 (file in JSON format) as the
source and ADLS Gen2 (file to be loaded in CSV format) as the sink.
b. Execute the Databricks Notebook
2. customers data
A third-party agency will be publishing this data in an Azure SQL DB (need
to mimic this scenario).
Requirement - Get the data from the Azure SQL DB to ADLS Gen2
using Azure Data Factory.
- Upload the customers.csv file to a folder in ADLS Gen2
- Create an Azure SQL DB and a table with the schema of the customers
data file
- Create two Linked Services ->
a. Linked Service pointing to the Azure SQL DB
b. Linked Service pointing to the ADLS Gen2 storage
- Create a pipeline in Azure Data Factory with a Copy Data activity
Source: customers.csv in ADLS Gen2
Sink: customers table in Azure SQL DB
(Schema Mapping - ensure that the schemas of the source and sink
match)
Project Requirement
1. Number of orders placed by each customer
2. Amount spent by each customer
Solution: Join the tables (orders, order_items, customers) to calculate the
number of orders placed and the amount spent by each customer; a sketch follows below.
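A minimal sketch of the aggregation, assuming the three datasets are registered as temp views named orders, order_items, and customers and follow the usual retail_db column names (order_customer_id, order_item_order_id, order_item_subtotal - adjust to your actual schema):

# Orders per customer and total amount spent per customer.
resultDf = spark.sql("""
    select c.customer_id,
           count(distinct o.order_id) as total_orders,
           round(sum(oi.order_item_subtotal), 2) as total_amount
    from customers c
    join orders o on c.customer_id = o.order_customer_id
    join order_items oi on o.order_id = oi.order_item_order_id
    group by c.customer_id
""")
resultDf.show(10)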