
End-to-End Data Engineering Projects

Arslan Ali is a skilled Data Engineer with extensive experience in data analysis, engineering, and cleaning, particularly in the banking sector, proficient in Python and Power BI. He has worked on various projects involving end-to-end data pipelines, data lake implementations, and ETL processes using Azure technologies, Apache Spark, and machine learning models. Arslan holds a Bachelor of Science in Computer Engineering and relevant certifications, including Databricks Data Engineer Associate Certification.


Arslan Ali

arslanmushtaq4343@[Link] | +923106119450 | [Link]/in/arslanali434343 | [Link]
Career Objective
Skilled Data Engineer with extensive experience in data analysis, data engineering, and data
cleaning within the banking sector. Proficient in Python and Power BI for data validation and
transformation. Adept at leveraging cloud-based technologies to enhance data-driven
decision-making.
Skills
• Databricks Certified • Apache Spark • ETL • SSIS • Hadoop • Python • SQL • Data Lake
• Data Warehouse • Lakehouse • Airbyte • Airflow • Jenkins • Git • ML Forecasting Model
• Image Recognition Models • Text Recognition Models • SQL Server • Oracle • Power BI
• Microsoft Azure • AWS • Azure Data Lake Storage • Azure Data Factory • Azure Databricks • Azure Data
Lake Analytics • Azure Active Directory • Azure Key Vault

Experience

Techlogix, Lahore
Software Engineer (Data Engineer)

Project: End-to-End Data Pipeline for Financial Data Processing May 2024 – Present

• Designed an end-to-end data pipeline for processing financial data, orchestrated using Azure Data Factory for
task automation.
• Implemented data ingestion and processing using Azure Databricks with Apache Spark, efficiently handling
large-scale financial datasets in a Docker-containerized environment.
• Enforced security measures for data encryption and access control using Azure Key Vault to ensure compliance
with data governance standards.
• Processed and ingested data using Spark and delivered it to Azure Data Lake Storage for distributed storage and
parallel processing.
• Utilized Azure Data Lake Analytics to analyze the processed data, enabling further analysis and reporting.
• Integrated the database with external systems through Azure API Management, allowing front-end platform
services to fetch data in real time for business intelligence and customer interaction.
• Leveraged Azure Kubernetes Service (AKS) for managing Docker containers throughout the pipeline, enhancing
portability, scalability, and efficient resource management.
• Tools and Technologies: Azure Data Factory, Azure Databricks, Azure Key Vault, Azure Data Lake Storage, Azure
Data Lake Analytics, Azure API Management, Azure Kubernetes Service (AKS), Docker.
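The ingest → transform → store flow described above can be sketched in plain Python. The stage names, record layout, and sample data below are illustrative only; in production these stages ran as Azure Data Factory activities invoking Databricks/Spark jobs:

```python
# Illustrative sketch of the ingest -> transform -> store stages of a
# financial-data pipeline. Function names and record shapes are
# hypothetical, not the production Azure code.

def ingest(raw_rows):
    """Parse raw CSV-like rows into typed records."""
    records = []
    for row in raw_rows:
        account, amount = row.split(",")
        records.append({"account": account, "amount": float(amount)})
    return records

def transform(records):
    """Aggregate transaction amounts per account."""
    totals = {}
    for rec in records:
        totals[rec["account"]] = totals.get(rec["account"], 0.0) + rec["amount"]
    return totals

def store(totals, sink):
    """Write aggregated results to a sink (stand-in for Data Lake Storage)."""
    sink.update(totals)
    return sink

raw = ["acc1,100.0", "acc2,50.0", "acc1,-25.0"]
sink = {}
store(transform(ingest(raw)), sink)
print(sink)  # {'acc1': 75.0, 'acc2': 50.0}
```

Keeping each stage a pure function like this is what makes the pipeline easy to orchestrate and unit-test independently.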

Project: Data Lake Implementation for Scalable Data Processing Jan 2024 – Apr 2024

• Developed a comprehensive Azure Data Lake architecture to support scalable data ingestion and processing
workflows for various data sources, including structured and unstructured data.
• Implemented data ingestion pipelines using Azure Data Factory, enabling seamless movement of data from on-
premises and cloud sources into Azure Data Lake Storage.
• Employed Azure Databricks for data processing, leveraging Apache Spark to transform and analyze large
datasets efficiently.
• Utilized Azure Data Lake Analytics to perform analytics directly on data stored in the Data Lake, allowing for
flexible querying and data manipulation.
• Established data governance and security measures by integrating Azure Active Directory and Azure Key Vault
to control access and manage sensitive information.
• Created a series of data visualizations using Power BI, connecting directly to Azure Data Lake Storage to enable
real-time insights and reporting.
• Tools and Technologies: Azure Data Lake Storage, Azure Data Factory, Azure Databricks, Azure Data Lake
Analytics, Azure Active Directory, Azure Key Vault, Power BI.

Project: Spark Optimization and SQL to Spark Migration with Polar POC Oct 2023 – Dec 2023

• Optimized local Spark jobs by tuning executor memory, shuffle partitions, and parallelism settings for
efficient resource utilization.
• Improved Spark memory management using RDD persistence, broadcast variables, and in-memory
caching strategies.
• Configured key Spark parameters such as spark.executor.memory, [Link], and
spark.sql.shuffle.partitions to enhance performance and reduce costs.
• Developed PySpark unit tests using pytest to validate data transformations and logic integrity.
• Conducted a POC comparing Apache Spark with PolarDB, achieving 50x faster query execution on
Polar for specific workloads.
• Led the migration from SQL to PySpark using the DataFrames API and Spark SQL, ensuring optimized
query performance.
• Tools and Technologies: Spark, PolarDB, PySpark, SQL, Unit Testing (pytest), Spark Configuration, RDD
Caching, Memory Management.
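The sizing arithmetic behind that tuning can be sketched as a small helper. The rules of thumb used here (roughly 2–3 tasks per core for shuffle partitions, per-executor memory carved out of node memory minus overhead) are common guidance, and the function names are hypothetical:

```python
# Hypothetical helpers illustrating common Spark sizing rules of thumb,
# not an official API. The actual tuning was done via Spark config keys
# such as spark.sql.shuffle.partitions and spark.executor.memory.

def suggest_shuffle_partitions(total_cores, tasks_per_core=3):
    """spark.sql.shuffle.partitions ~= total cores * tasks per core."""
    return total_cores * tasks_per_core

def suggest_executor_memory_gb(node_memory_gb, executors_per_node,
                               overhead_fraction=0.10):
    """spark.executor.memory: split node memory, reserving OS/overhead."""
    usable = node_memory_gb * (1 - overhead_fraction)
    return usable / executors_per_node

# Example: 4 worker nodes x 16 cores, 64 GB per node, 2 executors per node.
print(suggest_shuffle_partitions(4 * 16))           # 192
print(round(suggest_executor_memory_gb(64, 2), 1))  # 28.8
```

Starting from values like these and then profiling actual shuffle sizes is usually cheaper than guessing per-job.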

Project: Data Management and ETL for Blue Cross Blue Shield (BCBS) Jun 2023 – Sep 2023

• Managed petabyte-scale data for Blue Cross Blue Shield (BCBS), utilizing Databricks for advanced data
processing and analytics.
• Used Stonebranch for workflow orchestration and AWS S3 for secure data storage.
• Designed and implemented a robust ETL pipeline following the Medallion architecture.
• Employed optimization techniques, including Delta operations, to reduce DBU costs.
• Created comprehensive documentation to ensure clarity and ease of understanding for stakeholders
and team members.
• Tools and Technologies: Databricks, Stonebranch, AWS S3, Medallion architecture, Delta operations.
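The Medallion layering (bronze → silver → gold) can be illustrated with pandas as a lightweight stand-in for the Delta tables used in Databricks; the claim records and column names below are made up for the sketch:

```python
import pandas as pd

# Illustrative medallion flow: bronze (raw as landed) -> silver (cleaned,
# deduplicated) -> gold (business aggregate). In the actual project these
# layers were Delta tables in Databricks; pandas only stands in here.

# Bronze: raw claims, including a duplicate and a row with a missing member.
bronze = pd.DataFrame({
    "claim_id": [1, 2, 2, 3],
    "member":   ["a", "b", "b", None],
    "amount":   [100.0, 250.0, 250.0, 75.0],
})

# Silver: deduplicate on the business key and drop rows failing basic checks.
silver = bronze.drop_duplicates(subset="claim_id").dropna(subset=["member"])

# Gold: business-level aggregate consumed by reporting.
gold = silver.groupby("member", as_index=False)["amount"].sum()
print(gold)
```

The same separation is what lets each layer be reprocessed independently when upstream data is corrected.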

Project: Data Ingestion and Integration for End-to-End ELT Pipeline Jan 2023 – May 2023

• Led integration of heterogeneous data sources, including Oracle DB, CSV, and Excel (~58 GB, 77 million
rows), using SQL Server Integration Services (SSIS) to perform ETL operations.
• Utilized Python (Pandas, NumPy) for data cleaning and validation, rectifying discrepancies in banking
data. Jupyter Notebooks were used for validation workflows and exploratory data analysis, along with
SQL-based checks for regulatory compliance.
• Implemented data segmentation through Python scripts, integrated into the ETL process for real-time
processing.
• Established validation checkpoints within the ETL pipeline to ensure data quality pre- and post-
transformation. Performed complex data transformations using aggregations, CTEs, and window
functions for balance calculations.
• Automated validation scripts in Python to ensure ETL execution integrity, performing before-and-after
comparisons for data accuracy. Streamlined the validation process with automated quality checks.
• Tools and Technologies: SSIS, Python, Pandas, NumPy, SQL, Jupyter Notebooks, Oracle DB, CSV, Excel.

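A validation checkpoint plus a running-balance calculation (the pandas analogue of a SQL window function) can be sketched as follows; column names, sample rows, and the checkpoint rule are illustrative:

```python
import pandas as pd

# Illustrative pre/post-transformation checkpoint and a running balance
# per account, the pandas equivalent of
# SUM(amount) OVER (PARTITION BY account ORDER BY row) in SQL.
# Sample data and column names are hypothetical.

txns = pd.DataFrame({
    "account": ["A", "A", "B", "A", "B"],
    "amount":  [100.0, -40.0, 200.0, 10.0, -50.0],
})

rows_before = len(txns)
clean = txns.dropna(subset=["account", "amount"])

# Checkpoint: cleaning must not silently drop rows on this sample.
assert len(clean) == rows_before, "row count changed during cleaning"

clean = clean.assign(balance=clean.groupby("account")["amount"].cumsum())
print(clean["balance"].tolist())  # [100.0, 60.0, 200.0, 70.0, 150.0]
```

Wiring checks like the row-count assertion into the pipeline itself is what makes before-and-after comparisons automatic rather than manual.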
Project: Data Modeling for BOP CC and BOP RF for Customer Analysis Sep 2022 – Dec 2022

• Created multiple dashboards for customer analysis using DAX and SQL.
• Modeled measures and applied binning and sorting using DAX.
• Worked with Excel and CSV files as data sources.
• Used PowerPoint for presentations and analysis summaries.
• Tools and Technologies: DAX, SQL, Excel, CSV, PowerPoint.

Project: Feature Risk May 2022 – Aug 2022

• Applied machine learning techniques to develop and enhance forecasting models.
• Developed forecasting models including XGBoost, ARIMA, and Facebook Prophet.
• Focused on optimization for balanced predictions.
• Tools and Technologies: Machine Learning, Python, SQL, Jupyter Notebook.
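The evaluation loop around such models can be sketched in pure Python. The project used XGBoost, ARIMA, and Prophet; here a naive last-value forecast stands in for those models so the sketch stays dependency-free, and the series values are made up:

```python
# Illustrative forecast-evaluation sketch. A naive last-value forecast
# stands in for the real models (XGBoost, ARIMA, Prophet); the metric
# loop is the same regardless of which model produces the predictions.

def naive_forecast(history, horizon):
    """Forecast every future step as the last observed value."""
    return [history[-1]] * horizon

def mape(actual, predicted):
    """Mean absolute percentage error, a common forecast-balance metric."""
    errors = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)

series = [120.0, 130.0, 125.0, 140.0, 150.0, 160.0]
train, test = series[:4], series[4:]
preds = naive_forecast(train, horizon=len(test))
print(round(mape(test, preds), 2))  # 9.58
```

Benchmarking every candidate model against a naive baseline like this is a standard sanity check before trusting a more complex forecaster.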

Broadstone (Python Developer), Lahore June 2021 – Dec 2021


• Developed and maintained databases using Oracle Database.
• Created and managed APIs for seamless data integration and extraction.
• Automated data validation and cleaning processes using Python, enhancing data quality for financial
reporting.

Education
Bachelor of Science in Computer Engineering Sep 2018 – June 2022
University of Engineering and Technology, Lahore

Major Courses: Database Systems, Data Mining

Certifications
Databricks Data Engineer Associate Certification Sep 2018 – June 2022

Apache Airflow Fundamentals Jul 2024 – Jul 2026


The Apache Airflow Fundamentals certification demonstrates the core skills
needed to create, manage, and monitor DAGs effectively in Apache Airflow.
