SCD Type 2 in Databricks Azure
IN THE SPARK FRAMEWORK
Pandas
Pandas is an open-source Python library. Pandas DataFrames look similar to tables and can be used to perform SQL-style data manipulation and handling.
Since Spark supports multiple programming languages, native Python code can be used within the Spark framework.
The code logic below may require changes to the Spark configuration if the data volume is large.
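As a minimal sketch of that interop (not taken from the original job), the snippet below reads the history table, converts it to a pandas DataFrame, and notes the session settings that typically need tuning for larger volumes. The path reuses the one from WRITE_TO_HIVE further down; the memory values are illustrative placeholders.
# Minimal pandas <-> Spark interop sketch; values here are illustrative, not from the original job
# Arrow-based conversion speeds up toPandas()/createDataFrame() (Spark 3.x property name shown)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
HIST_FULL_SDF = spark.read.parquet('hdfs:/data/hadoop/Hist_DB/Hist_Table/')
HIST_FULL_DF = HIST_FULL_SDF.toPandas()   # collects the full table onto the driver
# Because toPandas() materializes everything on the driver, size these at cluster launch:
#   spark.driver.memory          e.g. 16g   (placeholder)
#   spark.driver.maxResultSize   e.g. 8g    (placeholder)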
Code Highlights
Split the inactive records from the history table into a separate Dataframe for the overwrite at the end.
Identify the business key columns for new insert records using set operations on the defined columns (see the sketch after this list).
Identify the Type2 columns for update records by merging Dataframes.
Populate the required fields for the insert records.
Generate a unique ID, similar to a surrogate key generator.
Merge all the Dataframes into a single one to overwrite at the end with the values updated as per the delta table.
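A minimal sketch of the second and third highlights, assuming the delta arrives as BUS_UNT_DEL_DF and the active history as HIST_DF (names used later in the code); the merge-based comparison itself is not shown in the excerpt, so the intermediate CMPR_DF is an illustrative name.
# Hypothetical sketch: new business keys found with set operations
new_BU = list(set(BUS_UNT_DEL_DF.bus_unt_nm) - set(HIST_DF.bus_unt_nm))
# Hypothetical sketch: existing keys whose Type2 column (bus_unt_desc) changed,
# found by merging the delta against the history on the business key
CMPR_DF = BUS_UNT_DEL_DF.merge(HIST_DF[['bus_unt_nm', 'bus_unt_desc']],
                               on='bus_unt_nm', suffixes=('', '_hist'))
CHNGD_BU_DF = CMPR_DF.loc[CMPR_DF.bus_unt_desc != CMPR_DF.bus_unt_desc_hist,
                          ['bus_unt_nm', 'bus_unt_desc']]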
import datetime
import pandas as pd
from decimal import *
##########################################
# Use this DF containing the active records for processing; during the overwrite, merge the history data with HIST_INACTV_DF
HIST_DF = HIST_FULL_DF.loc[MASK, :]
HIST_INACTV_DF = HIST_FULL_DF.loc[~MASK, :]
###########################################################################
###### Functions ######
###########################################################################
def POPULATE_FLDS_INSERT(DF, ID):
###########################################################################
# THIS function populates all the fields for inserting new records into the Hist_Table hive table
# INPUT   -- 1 Dataframe and BU_ID. 1) DF with the new BU NAME and BU DESC 2) Value of the last BU_ID generated
# RETURNS -- 1 Dataframe. DF with all fields populated for insert
###########################################################################
    global last_ID                        # keep the last generated ID visible to the caller
    BU_ID = [(ID + i + 1) for i in range(len(DF))]
    last_ID = BU_ID[-1]
    DF = DF.assign(bus_unt_id=BU_ID, row_efctv_to_tmsp='2200-04-11 18:47:16',
                   crtd_by_nm='Type2Job.py', updtd_by_nm='Type2Job.py', is_curr_row_ind='Y')
    x = '{:%Y-%m-%d %H:%M:%S}'.format(datetime.datetime.now())
    DF = DF.assign(crtd_tmsp=x, updtd_tmsp=x, row_efctv_from_tmsp=x)
    DF = DF[['bus_unt_id', 'bus_unt_nm', 'bus_unt_desc', 'crtd_by_nm', 'crtd_tmsp', 'updtd_by_nm',
             'updtd_tmsp', 'row_efctv_from_tmsp', 'row_efctv_to_tmsp', 'is_curr_row_ind']]
    return DF
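As a usage illustration (the sample rows are made up), the function is called with the new-key rows and the last surrogate key, and returns fully populated insert rows:
# Hypothetical usage of POPULATE_FLDS_INSERT; the sample data is illustrative
NEW_BU_DF = pd.DataFrame({'bus_unt_nm': ['FINANCE'], 'bus_unt_desc': ['Finance business unit']})
DF_HIVE_INSRT = POPULATE_FLDS_INSERT(NEW_BU_DF, last_ID)   # IDs continue from last_ID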
#######################################################################################
def EXPIRE_AND_INSERT_UPDT(HIS_DF, DEL_DF):   # function name assumed; the original def line is not shown in the source
#######################################################################################
# THIS function identifies and expires records that need an update, and subsequently creates new records with the updates
# INPUT   -- 2 Dataframes. 1) HIS_DF of the History table 2) DEL_DF of the Delta table
# RETURNS -- 1 Dataframe. DF with the old records that needed an update expired, and new rows inserted with the updates
#######################################################################################
    # Get the required columns into a DF for validation
    HIS_DF_TRIM = HIS_DF[['bus_unt_nm', 'bus_unt_desc']]
    # Type check: use the merge function to identify updated records (Type2 column - bus_unt_desc)
    global CHNGD_BU_DF
    CHNGD_BU_DF = pd.DataFrame()
    # identifies records with a mismatch in descr for a given BU name
    # When no updated records are found, return the HIS_DF that already has the latest BUS_UNT data
    if len(CHNGD_BU_DF) == 0:
        return HIS_DF
    else:
        CHNGD_BU = CHNGD_BU_DF.bus_unt_nm.tolist()
        EXPRD_BU_ID = HIS_DF['bus_unt_id'].loc[HIS_DF.bus_unt_nm.isin(CHNGD_BU)].tolist()
        HIS_DF.loc[HIS_DF.bus_unt_nm.isin(CHNGD_BU), 'row_efctv_to_tmsp'] = cur_tmsp
        # After the above call, INSRT_UPDTBU_DF will have new IDs generated for the Type2 update rows
        # create a DF with the old and new BU_ID
        global OLD_NEW_ID
        OLD_NEW_ID = pd.DataFrame()
        OLD_NEW_ID = INSRT_UPDTBU_DF[['bus_unt_id']]
        OLD_NEW_ID['EXPRD_BU_ID'] = EXPRD_BU_ID
        OLD_NEW_ID = OLD_NEW_ID.rename(columns={'bus_unt_id': 'new_bus_unt_id',
                                                'EXPRD_BU_ID': 'bus_unt_id'})
        HIS_DF = pd.concat([HIS_DF, INSRT_UPDTBU_DF], ignore_index=True)
        return HIS_DF
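The excerpt omits the lines inside this function that populate CHNGD_BU_DF, cur_tmsp, and INSRT_UPDTBU_DF; a plausible sketch, assuming the same helpers and naming, would be:
# Hypothetical reconstruction of the elided steps inside the update function
cur_tmsp = '{:%Y-%m-%d %H:%M:%S}'.format(datetime.datetime.now())   # expiry timestamp for old rows
# New versions of the changed rows, with fresh surrogate keys from the generator above
INSRT_UPDTBU_DF = POPULATE_FLDS_INSERT(CHNGD_BU_DF, last_ID)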
def WRITE_TO_HIVE(DF, TB_NM):
###################################################################################
# THIS function converts a PANDAS DF to a SPARK DF and overwrites it to the HIVE table
# INPUT   -- 1) DF with the data to be loaded to the HIVE table 2) HDFS path name as a string
# RETURNS -- Not Applicable
###################################################################################
    DF = DF.astype('str')
    DF.columns = map(str.upper, DF.columns)
    # convert the pandas DF to a Spark DF
    DQ_SDF = spark.createDataFrame(DF)
    DQ_SDF.write.format("parquet").mode("overwrite").save(TB_NM)  # e.g. 'hdfs:/data/hadoop/Hist_DB/Hist_Table/'
    return
###########################################################################
###### Validation for new insert, update SCD records ######
###########################################################################
# Get the last ID from the HIST table for surrogate key generation
DQ_BUS_ID = HIST_FULL_DF.bus_unt_id.tolist()
DQ_BUS_ID.sort()
global last_ID
last_ID = DQ_BUS_ID[-1]
# Create a DF with the new BU NAME and DESCR from the CSV
BU_INSRT = BUS_UNT_DEL_DF.loc[BUS_UNT_DEL_DF.bus_unt_nm.isin(new_BU)]
# Append the new rows to the HIST_DF dataframe for the hive table insert and reset the index
HIST_DF = pd.concat([HIST_DF, DF_HIVE_INSRT], ignore_index=True)
HIVE_LOAD_DF = pd.concat([HIVE_LOAD_DF, HIST_INACTV_DF], ignore_index=True)
# The WRITE call is a MUST here, as this run has new records to be inserted into the HIVE table
print('************New Bus Unt Identified for this Run**************')
WRITE_TO_HIVE(HIVE_LOAD_DF, VAR_TB_NM)
if len(CHNGD_BU_DF) == 0: