🌤️ Delhi Weather Predictor — ML Pipeline

End-to-end machine learning pipeline for forecasting Delhi's daily mean temperature using XGBoost, LightGBM, Bidirectional LSTM, ARIMA, SARIMA, and a tuned Ensemble — served via a Streamlit web app.

📁 Project Structure

ai_ml_wheather_prediction-main/
│
├── app/
│   └── main.py                          ← Streamlit web app (all models)
│
├── data/
│   ├── raw/
│   │   ├── Train.csv                    ← Raw training data (2013–2016)
│   │   └── Test.csv                     ← Raw test data (2017)
│   ├── processed/                       ← Auto-generated cleaned CSVs
│   └── predictions/                     ← Per-model prediction CSVs
│
├── models/                              ← Saved model files (generated by notebooks)
│   ├── xgboost_model.pkl
│   ├── lightgbm_model.pkl
│   ├── lstm_model.keras
│   ├── lstm_scaler.pkl                  ← MinMaxScaler fitted on LSTM training data
│   ├── lstm_features.pkl                ← Feature column list for LSTM inference
│   ├── arima_model.pkl
│   ├── sarima_model.pkl
│   ├── ensemble_weights.pkl
│   └── feature_meta.pkl
│
├── notebooks/
│   ├── 02_eda_cleaning.py               ← EDA, outlier removal, clean CSV export
│   ├── 03_feature_engineering.py        ← 40+ lag, rolling, time, interaction features
│   ├── 03_feature_engineering_baseline.py
│   ├── 04_model_train_evaluate.py       ← XGBoost + LightGBM training + SHAP
│   ├── 04_model_train_baseline.py
│   ├── 05_lstm_model.py                 ← BiLSTM + MultiHeadAttention training
│   ├── 05_arima_model.py                ← ARIMA + SARIMA training
│   └── 06_ensemble.py                   ← Combine all predictions, evaluate ensemble
│
├── reports/                             ← Auto-generated plots and figures
│   ├── figure/
│   └── shap_plots/
│
├── src/                                 ← Reserved for utility modules
├── RUN_IN_COLAB.ipynb                   ← ⭐ Open this in Google Colab to train models
├── requirements.txt
├── .gitignore
└── README.md

⚙️ Pipeline Overview

Run the notebooks in order in Google Colab before launching the app.

| Step | File | What it does |
|------|------|--------------|
| 1 | `02_eda_cleaning.py` | Load raw CSVs, fix outliers, impute missing values, save cleaned data |
| 2 | `03_feature_engineering.py` | Create 40+ lag, rolling, EMA, seasonal, and cross-features |
| 3 | `04_model_train_evaluate.py` | Train XGBoost + LightGBM, generate SHAP plots, save models |
| 4 | `05_lstm_model.py` | Train Bidirectional LSTM + MultiHeadAttention, save model + scaler + feature list |
| 5 | `05_arima_model.py` | Fit ARIMA and SARIMA on temperature series, save models |
| 6 | `06_ensemble.py` | Merge all predictions, compute weighted ensemble, evaluate all models |
| 7 | `app/main.py` | Run the Streamlit app |

🚀 Running the App

pip install -r requirements.txt
streamlit run app/main.py

Make sure all model files are present in models/ before launching.
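
A quick way to confirm this before launching is a small check script. The sketch below is not part of the repository; the function name `missing_model_files` is hypothetical, and the file list is copied from the project structure in this README.

```python
from pathlib import Path

# File names copied from the models/ listing in this README.
EXPECTED_MODEL_FILES = [
    "xgboost_model.pkl",
    "lightgbm_model.pkl",
    "lstm_model.keras",
    "lstm_scaler.pkl",
    "lstm_features.pkl",
    "arima_model.pkl",
    "sarima_model.pkl",
    "ensemble_weights.pkl",
    "feature_meta.pkl",
]


def missing_model_files(models_dir="models"):
    """Return the expected file names that are absent from models_dir."""
    root = Path(models_dir)
    return [name for name in EXPECTED_MODEL_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_model_files()
    if missing:
        print("Missing model files:", ", ".join(missing))
    else:
        print("All model files found.")
```

Run it from the repository root so that the relative `models` path resolves.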

🧠 Model Architecture

XGBoost & LightGBM

Trained on 40+ engineered features: lag temperatures, rolling means/std, EMA, heat index, pressure delta, seasonal cyclical encodings
Feature importance analysed with SHAP
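
To give a feel for the kinds of features listed above, here is a minimal pandas sketch. It is an illustration only: the function name, column names other than `meantemp`, and the 7-day window are assumptions, not the settings used in `03_feature_engineering.py`.

```python
import numpy as np
import pandas as pd


def add_basic_features(df):
    """Add a few lag, rolling, EMA and cyclical features to a daily frame.

    Assumes a DatetimeIndex and a 'meantemp' column. Window sizes are
    illustrative, not the project's actual settings.
    """
    out = df.copy()
    # Shift by one day first so every derived feature uses only past values.
    past = out["meantemp"].shift(1)
    out["temp_lag1"] = past
    out["temp_lag2"] = out["meantemp"].shift(2)
    out["temp_roll7_mean"] = past.rolling(7).mean()
    out["temp_roll7_std"] = past.rolling(7).std()
    out["temp_ema7"] = past.ewm(span=7, adjust=False).mean()
    # Cyclical encoding keeps 31 December and 1 January close together.
    doy = out.index.dayofyear.to_numpy()
    out["doy_sin"] = np.sin(2 * np.pi * doy / 365.25)
    out["doy_cos"] = np.cos(2 * np.pi * doy / 365.25)
    # Rows without a full history contain NaNs and are dropped.
    return out.dropna()
```

Shifting before computing rolling statistics is a deliberate choice here: it prevents the current day's target from leaking into its own features.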

LSTM — Bidirectional + Multi-Head Attention

Architecture: BiLSTM(256) → BiLSTM(128) + Residual → MultiHeadAttention(4 heads) + LSTM(64) → Dense(128→64→32→1)
Input: 30-day sliding window of 40+ features (velocity, momentum, z-score, interaction terms)
Trained with Huber loss, Adam optimizer, EarlyStopping (patience=35), ReduceLROnPlateau
Important: lstm_scaler.pkl and lstm_features.pkl are saved alongside the model and must be present for correct inference
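
The 30-day sliding window input can be sketched with NumPy as follows. This is a hedged illustration: the function name `make_windows` is hypothetical, and pairing each window with the temperature of the following day is an assumption about how `05_lstm_model.py` frames the task.

```python
import numpy as np


def make_windows(features, target, window=30):
    """Turn a (n_days, n_features) array into LSTM-ready samples.

    Sample i holds days i .. i+window-1 and is paired with the target
    on day i+window, so each window predicts the following day
    (an assumption made for this sketch).
    """
    features = np.asarray(features, dtype=np.float32)
    target = np.asarray(target, dtype=np.float32)
    n_samples = len(features) - window
    if n_samples <= 0:
        raise ValueError("need more rows than the window length")
    # Resulting shape: (n_samples, window, n_features).
    X = np.stack([features[i:i + window] for i in range(n_samples)])
    y = target[window:]
    return X, y
```

In a setup like this, the features would be scaled with the saved scaler and ordered by the saved feature list before windowing, which is why both files must match the trained model.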

ARIMA & SARIMA

Statistical time-series models fitted on the training temperature series
Used for one-step-ahead forecasting appended to the observed series

Ensemble

Tuned manual weights: XGBoost 35% · LightGBM 35% · LSTM 15% · ARIMA 8% · SARIMA 7%
Weights auto-renormalise if any model file is missing
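
Renormalisation of this kind can be sketched in a few lines. The weights come from this README; the function name, the dictionary keys, and the convention of passing `None` for an unavailable model are assumptions for illustration, not the logic in `app/main.py`.

```python
# Weights as listed in this README; the key names are assumptions.
BASE_WEIGHTS = {
    "xgboost": 0.35,
    "lightgbm": 0.35,
    "lstm": 0.15,
    "arima": 0.08,
    "sarima": 0.07,
}


def ensemble_predict(predictions, weights=BASE_WEIGHTS):
    """Weighted average over whichever models produced a prediction.

    predictions maps model name -> forecast; a missing model is either
    absent from the mapping or mapped to None.
    """
    available = {
        name: value
        for name, value in predictions.items()
        if value is not None and name in weights
    }
    total = sum(weights[name] for name in available)
    if total == 0:
        raise ValueError("no weighted model produced a prediction")
    # Dividing by the surviving total rescales the weights to sum to 1.
    return sum(weights[name] / total * value for name, value in available.items())
```

For example, if only XGBoost and LSTM are available, their weights 0.35 and 0.15 are rescaled to 0.70 and 0.30.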

🖥️ App Features

Live sliders for today's temperature, humidity, wind speed, and pressure
Lag inputs for yesterday's and 2-days-ago temperature
Individual prediction cards for all 5 models + ensemble
Bar chart comparison across all models
Summary table with RMSE and R² reference values
Graceful degradation — ensemble still runs if optional models (LSTM/ARIMA/SARIMA) are not loaded

📦 Requirements

See requirements.txt. Key dependencies:

streamlit
pandas
numpy
scikit-learn
xgboost
lightgbm
tensorflow
statsmodels
joblib
matplotlib
shap

📝 Dataset

Source: Delhi Climate Data 2013–2017
Target: meantemp — daily mean temperature in °C
Features: humidity, wind_speed, meanpressure + 40+ engineered features
Train: 2013–2016 | Test: 2017 (114 rows)

👤 Authors

Divyansh Prakash | GitHub: @DivyanshPrakashIIT

Siddharth Shukla | GitHub: @Siddharth
