An end-to-end data engineering pipeline on Databricks that leverages the publicly available Rainfall API (NDAP). This project covers:
- Data Ingestion (Bronze)
- Data Processing & Cleaning (Silver)
- Data Quality & Delivery (Gold)
Medallion Layers:
| Layer | Purpose |
|---|---|
| Bronze | Ingest raw data from API into Parquet |
| Silver | Clean, dedupe, enrich; enforce schemas with PySpark |
| Gold | Split the data into Fact & Dimension tables |
Bronze Layer:
- Sources
  - "District-wise Rainfall Distribution" as the API request
  - Source: "https://ndap.niti.gov.in/dataset/7319"
- Storage
  - All raw ingestions stored as Parquet in the `rainfall_data/bronze_layer` container (see the ingestion sketch below)
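A minimal sketch of the Bronze ingestion, assuming the dataset page exposes a JSON endpoint whose response carries records under a `data` key; the endpoint behaviour, the `fetch_rainfall_records` helper, and the response layout are assumptions for illustration, not the confirmed NDAP API contract.

```python
import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Dataset page from the README; the JSON endpoint behind it is an assumption.
API_URL = "https://ndap.niti.gov.in/dataset/7319"
BRONZE_PATH = "rainfall_data/bronze_layer"


def fetch_rainfall_records(url: str) -> list:
    """Call the rainfall API and return the raw records as a list of dicts."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    payload = response.json()
    # Assumed response layout: {"data": [...records...]}
    return payload.get("data", [])


# Land the raw records untouched in the bronze Parquet container.
records = fetch_rainfall_records(API_URL)
raw_df = spark.createDataFrame(records)
raw_df.write.mode("append").parquet(BRONZE_PATH)
```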
Silver Layer:
- Transformations
  - Split multi-valued columns (e.g., Daily Actual)
  - Remove duplicates
  - Cast data types for analytics readiness
- Storage
  - Cleaned Parquet files in the `rainfall_data/silver_layer` container (see the cleaning sketch below)
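A minimal sketch of the Silver cleaning step, assuming a `Daily Actual` column delimited by `|`; the column names and the split convention are assumptions based on the bullets above, not the exact source schema.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

bronze_df = spark.read.parquet("rainfall_data/bronze_layer")

silver_df = (
    bronze_df
    # Split the multi-valued "Daily Actual" column (delimiter is an assumption).
    .withColumn("daily_actual", F.split(F.col("Daily Actual"), r"\|").getItem(0))
    # Drop exact duplicate rows.
    .dropDuplicates()
    # Cast the measure to a numeric type for analytics readiness.
    .withColumn("daily_actual", F.col("daily_actual").cast("double"))
)

silver_df.write.mode("overwrite").parquet("rainfall_data/silver_layer")
```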
Gold Layer:
- Transformations
  - Remove unnecessary columns
  - Split the dataset into Fact & Dimension tables
  - Rename columns for ease of use
- Output
  - Aggregated the data into `rainfall_data.rain_fact_table` and `workspace.rainfall_data.state_table` (see the sketch below)
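A minimal sketch of the Gold split, assuming the Silver output carries `state` and `daily_actual` columns; only the two target table names come from this README, the rest is illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

silver_df = spark.read.parquet("rainfall_data/silver_layer")

# Dimension table: one row per state with a surrogate key.
state_dim = (
    silver_df.select("state").distinct()
    .withColumn("state_id", F.monotonically_increasing_id())
)

# Fact table: measurements keyed by state_id, with friendlier column names.
rain_fact = (
    silver_df
    .join(state_dim, on="state", how="left")
    .drop("state")                                          # remove the now-redundant column
    .withColumnRenamed("daily_actual", "daily_actual_mm")   # rename for ease of use
)

# Persist using the table names from the README.
state_dim.write.mode("overwrite").saveAsTable("workspace.rainfall_data.state_table")
rain_fact.write.mode("overwrite").saveAsTable("rainfall_data.rain_fact_table")
```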
Tech Stack:
| Component | Purpose |
|---|---|
| Databricks | Spark-based ETL & Delta Live Tables |
| Unity Catalog | Governance and managed storage for the layer tables |
| Python / PySpark | Data transformation logic |
Future Improvements:
- Implement concurrency to reduce the time taken by the API calls (see the sketch after this list).
- Move to external storage instead of Unity Catalog managed storage.
- Automate the whole process with a Databricks Jobs pipeline.
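A minimal sketch of the planned concurrency improvement, assuming the API can be paged with a `page` query parameter; both the endpoint behaviour and the parameter name are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

import requests

# Dataset page from the README; a paged JSON endpoint is an assumption.
API_URL = "https://ndap.niti.gov.in/dataset/7319"


def fetch_page(page: int) -> list:
    """Fetch one page of rainfall records (assumed 'page' query parameter)."""
    response = requests.get(API_URL, params={"page": page}, timeout=30)
    response.raise_for_status()
    return response.json().get("data", [])


# Issue the page requests in parallel instead of one after another.
with ThreadPoolExecutor(max_workers=8) as pool:
    pages = list(pool.map(fetch_page, range(1, 21)))

records = [record for page in pages for record in page]
```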