Data Engineering Project using the Indian Rainfall Data API

An end-to-end data engineering pipeline on Databricks, built on the publicly available Indian Rainfall Data API. This project covers:

  • Data Ingestion (Bronze)
  • Data Processing & Cleaning (Silver)
  • Data Quality & Delivery (Gold)

Medallion Layers:

Layer  | Purpose
-------|--------
Bronze | Ingest raw data from the API into Parquet
Silver | Clean, dedupe, and enrich; enforce schemas with PySpark
Gold   | Split the data into Fact & Dimension tables

Project Architecture

(architecture diagram)

Phase 1: Bronze Layer (Raw Ingestion)

(bronze layer diagram)
  • Storage
    • All raw ingestions are stored as Parquet in the rainfall_data/bronze_layer container; a minimal ingestion sketch follows below
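A minimal bronze-layer ingestion sketch, assuming a JSON API reachable with requests and a volume path for the bronze container; the endpoint URL, query parameters, and path are illustrative placeholders, not the project's actual values.

```python
# Minimal bronze ingestion sketch (Databricks notebook, where `spark` is predefined).
# API_URL and BRONZE_PATH are hypothetical placeholders, not the project's values.
import requests

API_URL = "https://example.org/rainfall/api"                   # hypothetical endpoint
BRONZE_PATH = "/Volumes/workspace/rainfall_data/bronze_layer"  # hypothetical volume path

response = requests.get(API_URL, params={"format": "json", "limit": 1000})
response.raise_for_status()
records = response.json()  # assumed: a list of flat JSON records

# Land the payload as-is; no cleaning happens at the bronze stage.
raw_df = spark.createDataFrame(records)
raw_df.write.mode("append").parquet(BRONZE_PATH)
```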

Phase 2: Silver (Cleansing & Enrichment)

  • Transformations
    • Split multi-valued columns (e.g., Daily Actual)
    • Remove duplicates
    • Cast data types for analytics readiness
  • Storage
    • Cleaned Parquet files in the rainfall_data/silver_layer container; see the cleansing sketch below
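A PySpark sketch of the silver-layer transformations listed above; the column name daily_actual and the "/" delimiter are assumptions for illustration, not the dataset's actual schema.

```python
# Silver cleansing sketch; column names and the delimiter are assumptions.
from pyspark.sql import functions as F

bronze_df = spark.read.parquet("/Volumes/workspace/rainfall_data/bronze_layer")

silver_df = (
    bronze_df
    # Split an assumed multi-valued column (e.g., "12.4/10.1") into its parts.
    .withColumn("daily", F.split(F.col("daily_actual"), "/").getItem(0))
    .withColumn("actual", F.split(F.col("daily_actual"), "/").getItem(1))
    .drop("daily_actual")
    # Remove exact duplicate rows.
    .dropDuplicates()
    # Cast to analytics-ready types.
    .withColumn("daily", F.col("daily").cast("double"))
    .withColumn("actual", F.col("actual").cast("double"))
)

silver_df.write.mode("overwrite").parquet("/Volumes/workspace/rainfall_data/silver_layer")
```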

Phase 3: Gold (Quality & Aggregation)

  • Transformations

    • Remove unnecessary columns
    • Split the dataset into Fact & Dimension tables
    • Rename columns for ease of use
  • Output

    • Tables written as rainfall_data.rain_fact_table and workspace.rainfall_data.state_table; see the sketch below
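A gold-layer sketch of the Fact & Dimension split; the table names come from the project, but the state column, the surrogate-key approach, and the renamed measure columns are assumptions.

```python
# Gold sketch: split the cleaned data into a fact and a dimension table.
# The `state` column and join key are assumptions; table names are the project's.
from pyspark.sql import functions as F

silver_df = spark.read.parquet("/Volumes/workspace/rainfall_data/silver_layer")

# Dimension: one row per state with a surrogate key.
state_dim = (
    silver_df.select("state").distinct()
    .withColumn("state_id", F.monotonically_increasing_id())
)

# Fact: measurements keyed by state_id, with renamed columns for ease of use.
rain_fact = (
    silver_df.join(state_dim, on="state", how="left")
    .select(
        "state_id",
        F.col("daily").alias("daily_rainfall_mm"),
        F.col("actual").alias("actual_rainfall_mm"),
    )
)

state_dim.write.mode("overwrite").saveAsTable("workspace.rainfall_data.state_table")
rain_fact.write.mode("overwrite").saveAsTable("rainfall_data.rain_fact_table")
```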

Technology Stack

Component        | Purpose
-----------------|--------
Databricks       | Spark-based ETL & Delta Live Tables
Unity Catalog    | Governed, managed storage for Delta tables
Python / PySpark | Data transformation logic

Future Improvements

  • Implement concurrency to reduce the time taken by the API calls (see the sketch below).
  • Move to external storage instead of Unity Catalog managed storage.
  • Automate the whole process with a Databricks Jobs pipeline.
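A hedged sketch of the planned concurrency improvement, assuming the API supports offset/limit pagination; the endpoint URL and parameter names are placeholders.

```python
# Concurrency sketch: fetch API pages in parallel with a thread pool.
# The endpoint URL and pagination parameters are assumptions.
import requests
from concurrent.futures import ThreadPoolExecutor

API_URL = "https://example.org/rainfall/api"  # hypothetical endpoint

def fetch_page(offset, limit=1000):
    resp = requests.get(API_URL, params={"format": "json", "offset": offset, "limit": limit})
    resp.raise_for_status()
    return resp.json()  # assumed: a list of records per page

# Issue several page requests concurrently instead of sequentially.
with ThreadPoolExecutor(max_workers=8) as pool:
    pages = list(pool.map(fetch_page, range(0, 10_000, 1_000)))

records = [row for page in pages for row in page]
```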
