AZURE DATA ENGINEERING
PROJECT
By: Amit Singh
MBA AI&DS
1404307
System Architecture:
The pipeline is structured into five main components (outlined in the sketch after this list):
1. Data Source: The starting point where the data resides (e.g., HTTP endpoints, GitHub repositories, public APIs, or other external sources).
2. Data Ingestion: Azure Data Factory (ADF) is used to extract raw data and load it into
a storage solution.
3. Raw Data Store: Data Lake Gen2 stores unprocessed data in its raw format for
scalability and secure storage.
4. Transformation: Azure Databricks processes the raw data, cleaning, transforming,
and enriching it for analysis.
5. Serving and Reporting:
- Processed data is loaded into Azure Synapse Analytics for advanced analysis.
- Power BI connects to Synapse to create interactive reports and dashboards.
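To make the end-to-end flow concrete before the detailed steps, the outline below sketches the five stages as placeholder Python functions. The function names, file paths, and table name are purely illustrative and not part of the project code; each stage stands for the Azure service named in the comments and is described in Steps 1-5.

```python
# Illustrative outline of the five-stage pipeline; not project code.
# Each placeholder function stands for the Azure service named in its comment,
# and the returned paths/table names are hypothetical.

def ingest_raw_data() -> str:
    """Azure Data Factory: copy raw data from the GitHub/HTTP source."""
    return "raw/sales_data.csv"      # landing path in Data Lake Gen2

def transform_raw_data(raw_path: str) -> str:
    """Azure Databricks: clean, transform, and enrich the raw data."""
    return "processed/sales"         # processed path in Data Lake Gen2

def load_to_synapse(processed_path: str) -> str:
    """Azure Synapse Analytics: serve the data for analysis and Power BI."""
    return "dbo.Sales"               # table queried by Power BI reports

if __name__ == "__main__":
    raw = ingest_raw_data()
    processed = transform_raw_data(raw)
    table = load_to_synapse(processed)
    print("GitHub -> ADF -> {} -> Databricks -> {} -> Synapse {} -> Power BI"
          .format(raw, processed, table))
```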
Step 1: Data Ingestion Using Azure Data Factory (ADF)
Objective: Automate the extraction of raw data from a GitHub repository.
Process:
1. Creating ADF Instance:
- Logged into the Azure portal and created an Azure Data Factory instance.
- Configured essential settings such as the resource group and region.
2. Configuring Pipelines:
- Built a pipeline in ADF to extract data from the GitHub repository.
- Used the GitHub connector in ADF to establish a secure connection
with the repository.
- Scheduled the pipeline to automate the extraction process at defined intervals.
3. Storing Extracted Data:
- Verified the pipeline run to confirm that the data was extracted successfully.
- Stored the data temporarily in ADF's staging area for further processing (see the sketch after this step).
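As a rough illustration of how this ingestion step can be driven from code, the sketch below triggers and monitors an ADF pipeline run using the azure-identity and azure-mgmt-datafactory Python packages. The subscription, resource group, factory, and pipeline names are placeholders, and in the project itself the run is started by the scheduled trigger rather than by a script.

```python
# Minimal sketch: trigger and monitor an ADF pipeline run from Python.
# Assumes a pipeline named "CopyGitHubToDataLake" already exists in the factory;
# all resource names below are hypothetical.
import time

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "rg-data-engineering"      # placeholder
FACTORY_NAME = "adf-github-ingestion"       # placeholder
PIPELINE_NAME = "CopyGitHubToDataLake"      # placeholder

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Start the pipeline run (the same action the scheduled trigger performs automatically).
run = adf_client.pipelines.create_run(RESOURCE_GROUP, FACTORY_NAME, PIPELINE_NAME)

# Poll until the run finishes, then report its status.
while True:
    status = adf_client.pipeline_runs.get(RESOURCE_GROUP, FACTORY_NAME, run.run_id).status
    if status not in ("Queued", "InProgress"):
        break
    time.sleep(15)

print(f"Pipeline run {run.run_id} finished with status: {status}")
```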
Step 2: Setting up Azure Storage Account and Data Lake Gen2
Objective: Store raw data securely and at scale for further processing.
Process:
1. Creating Azure Storage Account:
- Set up an Azure Storage Account to provide a foundation for the data lake.
- Enabled redundancy options (e.g., Geo-redundant storage)
to ensure high availability.
2. Configuring Data Lake Gen2:
- Enabled the hierarchical namespace for efficient data organization.
- Structured the Data Lake into directories (e.g., /raw, /processed) for better management.
3. Uploading Data:
- Automated the transfer of data from ADF into the raw data directory in Data Lake Gen2 (see the sketch after this step).
- Ensured data integrity by validating the uploads.
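For reference, the sketch below shows how a file could be landed in the /raw directory of the Data Lake Gen2 container with the azure-storage-file-datalake package. In the project this landing is performed by the ADF copy activity, so the account URL, container, and file names here are assumptions used only for illustration.

```python
# Minimal sketch: upload a local file into the /raw directory of Data Lake Gen2.
# Account, container, and file names are placeholders; the /raw directory is
# assumed to exist already (created when the lake was structured).
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

ACCOUNT_URL = "https://<storage-account>.dfs.core.windows.net"
CONTAINER = "datalake"                       # placeholder file system (container) name

service_client = DataLakeServiceClient(account_url=ACCOUNT_URL,
                                       credential=DefaultAzureCredential())
file_system = service_client.get_file_system_client(CONTAINER)

# /raw and /processed mirror the directory layout described above.
raw_dir = file_system.get_directory_client("raw")
file_client = raw_dir.create_file("sales_data.csv")       # hypothetical file name

with open("sales_data.csv", "rb") as data:                # local file to land
    file_client.upload_data(data, overwrite=True)

print("Upload complete: raw/sales_data.csv")
```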
Step 3: Data Transformation Using Azure Databricks
Objective: Clean and transform raw data to make it analysis-ready.
Process:
1. Setting up Azure Databricks:
- Created a Databricks workspace and linked it to the Azure environment.
- Configured a cluster with appropriate compute resources for efficient processing.
2. Developing Notebooks:
- Built Python and SQL-based notebooks within Databricks to perform data cleaning and
transformation.
- Applied techniques like:
- Removing duplicates and null values.
- Standardizing data formats.
- Aggregating data for summary statistics.
3. Storing Transformed Data:
- Saved the cleaned and processed data back into Data Lake Gen2 under the /processed directory (see the sketch after this step).
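A minimal PySpark sketch of the kind of notebook logic described in this step is shown below. The storage paths, column names, and the monthly aggregation are assumed for illustration and are not taken from the actual dataset; the spark session is the one Databricks provides in every notebook.

```python
# Minimal Databricks (PySpark) sketch of the cleaning/transformation step.
# Paths, column names, and the aggregation are illustrative placeholders;
# `spark` is predefined in Databricks notebooks.
from pyspark.sql import functions as F

RAW_PATH = "abfss://datalake@<storage-account>.dfs.core.windows.net/raw/sales_data.csv"
PROCESSED_PATH = "abfss://datalake@<storage-account>.dfs.core.windows.net/processed/sales"

# Read the raw CSV landed by ADF.
raw_df = spark.read.option("header", True).csv(RAW_PATH)

clean_df = (
    raw_df
    .dropDuplicates()                                  # remove duplicate rows
    .dropna(subset=["order_id", "order_date"])         # drop rows missing key fields
    .withColumn("order_date", F.to_date("order_date", "yyyy-MM-dd"))  # standardise format
    .withColumn("amount", F.col("amount").cast("double"))
)

# Example aggregation for summary statistics (monthly revenue and order counts).
summary_df = (
    clean_df
    .groupBy(F.date_trunc("month", "order_date").alias("month"))
    .agg(F.sum("amount").alias("total_revenue"),
         F.count("*").alias("order_count"))
)

# Write the cleaned data back to the /processed directory as Parquet.
clean_df.write.mode("overwrite").parquet(PROCESSED_PATH)
summary_df.write.mode("overwrite").parquet(PROCESSED_PATH + "_monthly_summary")
```

Writing the output as Parquet keeps the /processed layer compact and splittable, which also simplifies the load into Synapse in the next step.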
Step 4: Loading Transformed Data into Azure Synapse Analytics
Objective: Enable scalable analysis and query optimization.
Process:
1. Configuring Azure Synapse:
- Created a Synapse Analytics workspace.
- Configured a dedicated SQL pool to store the transformed data in an optimized tabular format.
2. Data Loading:
- Transferred the processed data from Data Lake Gen2 to Synapse Analytics using integration tools such as Azure Data Factory or Databricks (see the sketch after this step).
3. Optimization:
- Partitioned and indexed the data for faster querying.
- Validated data integrity post-migration.
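If the Databricks route is chosen for this load, the write can look roughly like the sketch below, which uses the Azure Synapse connector that ships with Azure Databricks (format "com.databricks.spark.sqldw"). The JDBC URL, staging container, table name, and the tableOptions values (including the order_id distribution column) are placeholders; an ADF copy activity or a COPY INTO statement in Synapse would achieve the same result.

```python
# Minimal sketch: load the processed data from Data Lake Gen2 into a dedicated
# SQL pool table using the Azure Synapse connector in Databricks.
# All connection details and names are placeholders; in practice the credentials
# would come from a secret scope rather than being hard-coded.
PROCESSED_PATH = "abfss://datalake@<storage-account>.dfs.core.windows.net/processed/sales"
TEMP_DIR = "abfss://staging@<storage-account>.dfs.core.windows.net/synapse-tmp"
JDBC_URL = ("jdbc:sqlserver://<synapse-workspace>.sql.azuresynapse.net:1433;"
            "database=<dedicated-sql-pool>;user=<user>;password=<password>;"
            "encrypt=true;loginTimeout=30;")

processed_df = spark.read.parquet(PROCESSED_PATH)    # spark is predefined in Databricks

(processed_df.write
    .format("com.databricks.spark.sqldw")            # Azure Synapse connector
    .option("url", JDBC_URL)
    .option("forwardSparkAzureStorageCredentials", "true")
    .option("dbTable", "dbo.Sales")                  # target table in the SQL pool
    .option("tempDir", TEMP_DIR)                     # staging location for PolyBase/COPY
    # Hash distribution plus a clustered columnstore index cover the
    # partitioning/indexing optimization mentioned above when the connector
    # creates the table; "order_id" is a hypothetical distribution column.
    .option("tableOptions", "CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = HASH(order_id)")
    .mode("overwrite")
    .save())
```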
Step 5: Data Visualization and Reporting in Power BI
Objective: Deliver interactive and user-friendly analytics.
Process:
1. Connecting Power BI to Synapse:
- Established a live connection between Power BI and Synapse Analytics for near-real-time data access.
- Imported the datasets required for the visualizations.
2. Designing Reports and Dashboards:
- Created multiple reports focusing on business requirements.
- Used visuals like bar charts, line graphs, maps, and KPIs for comprehensive insights.
3. Interactivity:
- Added filters, slicers, and drill-through capabilities for better data exploration
by end-users.