End-to-end machine learning pipeline for forecasting Delhi's daily mean temperature using XGBoost, LightGBM, Bidirectional LSTM, ARIMA, SARIMA, and a tuned Ensemble — served via a Streamlit web app.
ai_ml_wheather_prediction-main/
│
├── app/
│ └── main.py ← Streamlit web app (all models)
│
├── data/
│ ├── raw/
│ │ ├── Train.csv ← Raw training data (2013–2016)
│ │ └── Test.csv ← Raw test data (2017)
│ ├── processed/ ← Auto-generated cleaned CSVs
│ └── predictions/ ← Per-model prediction CSVs
│
├── models/ ← Saved model files (generated by notebooks)
│ ├── xgboost_model.pkl
│ ├── lightgbm_model.pkl
│ ├── lstm_model.keras
│ ├── lstm_scaler.pkl ← MinMaxScaler fitted on LSTM training data
│ ├── lstm_features.pkl ← Feature column list for LSTM inference
│ ├── arima_model.pkl
│ ├── sarima_model.pkl
│ ├── ensemble_weights.pkl
│ └── feature_meta.pkl
│
├── notebooks/
│ ├── 02_eda_cleaning.py ← EDA, outlier removal, clean CSV export
│ ├── 03_feature_engineering.py ← 40+ lag, rolling, time, interaction features
│ ├── 03_feature_engineering_baseline.py
│ ├── 04_model_train_evaluate.py ← XGBoost + LightGBM training + SHAP
│ ├── 04_model_train_baseline.py
│ ├── 05_lstm_model.py ← BiLSTM + MultiHeadAttention training
│ ├── 05_arima_model.py ← ARIMA + SARIMA training
│ └── 06_ensemble.py ← Combine all predictions, evaluate ensemble
│
├── reports/ ← Auto-generated plots and figures
│ ├── figure/
│ └── shap_plots/
│
├── src/ ← Reserved for utility modules
├── RUN_IN_COLAB.ipynb ← ⭐ Open this in Google Colab to train models
├── requirements.txt
├── .gitignore
└── README.md
Run the notebooks in order in Google Colab before launching the app.
| Step | File | What it does |
|---|---|---|
| 1 | 02_eda_cleaning.py | Load raw CSVs, fix outliers, impute missing values, save cleaned data |
| 2 | 03_feature_engineering.py | Create 40+ lag, rolling, EMA, seasonal, and cross-features |
| 3 | 04_model_train_evaluate.py | Train XGBoost + LightGBM, generate SHAP plots, save models |
| 4 | 05_lstm_model.py | Train Bidirectional LSTM + MultiHeadAttention, save model + scaler + feature list |
| 5 | 05_arima_model.py | Fit ARIMA and SARIMA on temperature series, save models |
| 6 | 06_ensemble.py | Merge all predictions, compute weighted ensemble, evaluate all models |
| 7 | app/main.py | Run the Streamlit app |
pip install -r requirements.txt
streamlit run app/main.py

Make sure all model files are present in models/ before launching.
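A quick pre-flight check can confirm that the saved artifacts exist before the app starts. The sketch below is not part of the repository: the helper name `missing_model_files` is hypothetical, and the file names are copied from the `models/` listing in the project structure.

```python
from pathlib import Path

# File names as listed under models/ in the project structure.
EXPECTED_FILES = [
    "xgboost_model.pkl",
    "lightgbm_model.pkl",
    "lstm_model.keras",
    "lstm_scaler.pkl",
    "lstm_features.pkl",
    "arima_model.pkl",
    "sarima_model.pkl",
    "ensemble_weights.pkl",
    "feature_meta.pkl",
]


def missing_model_files(models_dir="models"):
    """Return the expected file names that are not present in models_dir."""
    root = Path(models_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Run from the repository root; an empty list means every file was found.
print(missing_model_files("models"))
```

An empty list means every listed file is in place; any names printed point to a notebook step that still needs to be run.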
- Trained on 40+ engineered features: lag temperatures, rolling means/std, EMA, heat index, pressure delta, seasonal cyclical encodings
- Feature importance analysed with SHAP
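
To give a feel for the feature types named above, here is a minimal pandas sketch of lag, rolling, EMA, and cyclical features. It is an illustration only, not the code in `03_feature_engineering.py`: the function name, the output column names, and the 7-day window are assumptions, and only `meantemp` comes from the dataset.

```python
import numpy as np
import pandas as pd


def add_basic_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add a few illustrative features to a frame with a DatetimeIndex
    and a `meantemp` column. Column names here are hypothetical."""
    out = df.copy()
    # Shift by one day so rolling/EMA features only see past values.
    past = out["meantemp"].shift(1)
    out["temp_lag_1"] = past
    out["temp_lag_2"] = out["meantemp"].shift(2)
    out["temp_roll_mean_7"] = past.rolling(window=7).mean()
    out["temp_roll_std_7"] = past.rolling(window=7).std()
    out["temp_ema_7"] = past.ewm(span=7, adjust=False).mean()
    # Cyclical encoding: late December ends up close to early January.
    day = out.index.dayofyear.to_numpy()
    out["doy_sin"] = np.sin(2 * np.pi * day / 365.25)
    out["doy_cos"] = np.cos(2 * np.pi * day / 365.25)
    # The first rows lack enough history for the rolling window.
    return out.dropna()


# Synthetic data, for demonstration only.
idx = pd.date_range("2013-01-01", periods=30, freq="D")
demo = pd.DataFrame({"meantemp": np.linspace(10.0, 20.0, 30)}, index=idx)
features = add_basic_features(demo)
print(features.shape)
```

Shifting before rolling keeps the target day's own temperature out of its features, which matters when the target is `meantemp` itself.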
- Architecture: `BiLSTM(256) → BiLSTM(128) + Residual → MultiHeadAttention(4 heads) + LSTM(64) → Dense(128→64→32→1)`
- Input: 30-day sliding window of 40+ features (velocity, momentum, z-score, interaction terms)
- Trained with Huber loss, Adam optimizer, EarlyStopping (patience=35), ReduceLROnPlateau
- Important: `lstm_scaler.pkl` and `lstm_features.pkl` are saved alongside the model and must be present for correct inference
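
The 30-day sliding window can be sketched with NumPy as follows. This is a minimal illustration, not the code in `05_lstm_model.py`: the function name is hypothetical, and pairing each window with the next day's target is an assumption about how the samples are built.

```python
import numpy as np


def make_windows(features: np.ndarray, target: np.ndarray, window: int = 30):
    """Turn a (n_days, n_features) array into (n_samples, window, n_features).

    Sample i holds days i .. i+window-1 and is paired with the target
    on day i+window (assumed next-day target).
    """
    n_samples = len(features) - window
    if n_samples <= 0:
        raise ValueError("need more rows than the window length")
    X = np.stack([features[i:i + window] for i in range(n_samples)])
    y = target[window:]
    return X, y


# Synthetic data, for demonstration only: 50 days, 2 features.
demo_features = np.arange(100, dtype=float).reshape(50, 2)
demo_target = np.arange(50, dtype=float)
X, y = make_windows(demo_features, demo_target, window=30)
print(X.shape, y.shape)
```

At inference time the same column order and scaling used in training have to be reproduced, which is why the saved feature list and scaler matter.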
- Statistical time-series models fitted on the training temperature series
- Used for one-step-ahead forecasting appended to the observed series
- Tuned manual weights: XGBoost 35% · LightGBM 35% · LSTM 15% · ARIMA 8% · SARIMA 7%
- Weights auto-renormalise if any model file is missing
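
The renormalisation can be sketched in a few lines of plain Python. The weights are the ones listed above; the function name and dictionary keys are hypothetical, and the app's own implementation may differ.

```python
# Weights as listed above; the keys are illustrative names.
BASE_WEIGHTS = {
    "xgboost": 0.35,
    "lightgbm": 0.35,
    "lstm": 0.15,
    "arima": 0.08,
    "sarima": 0.07,
}


def ensemble_predict(predictions, weights=BASE_WEIGHTS):
    """Weighted average over the models that actually produced a prediction.

    Weights of missing models are dropped and the rest are rescaled to sum to 1.
    """
    available = {name: w for name, w in weights.items() if name in predictions}
    total = sum(available.values())
    if total == 0:
        raise ValueError("no weighted model produced a prediction")
    return sum(predictions[name] * w / total for name, w in available.items())


# Only the two tree models are loaded: each ends up with weight 0.5.
print(ensemble_predict({"xgboost": 30.0, "lightgbm": 32.0}))
```

With only XGBoost and LightGBM available, the two 35% weights rescale to 50% each, so the example averages the two predictions.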
- Live sliders for today's temperature, humidity, wind speed, and pressure
- Lag inputs for yesterday's and 2-days-ago temperature
- Individual prediction cards for all 5 models + ensemble
- Bar chart comparison across all models
- Summary table with RMSE and R² reference values
- Graceful degradation — ensemble still runs if optional models (LSTM/ARIMA/SARIMA) are not loaded
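
For reference, the two metrics shown in the summary table can be computed as below. This is a generic NumPy sketch of RMSE and R², not the project's evaluation code, and the numbers in the example are made up.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target (°C here)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Made-up values, for demonstration only.
actual = [20.0, 22.0, 24.0, 26.0]
predicted = [21.0, 21.0, 25.0, 25.0]
print(rmse(actual, predicted), r2(actual, predicted))
```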
See requirements.txt. Key dependencies:
streamlit
pandas
numpy
scikit-learn
xgboost
lightgbm
tensorflow
statsmodels
joblib
matplotlib
shap
- Source: Delhi Climate Data 2013–2017
- Target: `meantemp` (daily mean temperature in °C)
- Features: `humidity`, `wind_speed`, `meanpressure` + 40+ engineered features
- Train: 2013–2016 | Test: 2017 (114 rows)
Divyansh Prakash | GitHub: @DivyanshPrakashIIT
Siddharth Shukla | GitHub: @Siddharth