Yangon 30-Day Weather Forecast in °C

This document presents a study on weather forecasting, specifically temperature prediction in the Yangon Region using machine learning techniques. It discusses the integration of time-series analysis with machine learning models like ARIMA and SARIMA, emphasizing the importance of user-friendly interfaces for accessibility. The research methodology includes data collection from Open Weather API, exploratory data analysis, and the implementation of various machine learning models to enhance prediction accuracy and reliability.

Uploaded by

Thet Mg Mg


CHAPTER I

INTRODUCTION

1.1 Introduction

The last few years have seen increased interest in machine learning with the
advancement of natural language processing models such as GPT. In this study, we
focus on weather forecasting, with a specific emphasis on temperature prediction in
the Yangon Region using machine learning. Weather has a large impact on our
day-to-day lives and is a critical input to decision making; knowing conditions
ahead of time plays a pivotal role in various sectors, including agriculture,
energy, and public safety.

1.2 Machine Learning

There has been increasing interest in machine learning with the advancement
of natural language processing models such as GPT and the release of image generation
models such as Midjourney, DALL·E, and Stable Diffusion. Exponential growth in the
amount of available data, together with increasing computational power and
ever-cheaper, higher-capacity storage, has contributed to better machine learning models.
Additionally, the rise of cloud computing has made it easier for researchers and
developers to access powerful computing resources for training complex machine
learning models. As a result, the field of machine learning continues to evolve rapidly,
with new breakthroughs and applications being developed.

1.3 User-interface and accessibility

We design a clear and user-friendly interface that highlights essential weather
information, making it easily accessible and understandable for all users. Users should
be able to navigate easily and access essential information at a glance. Clear and
concise graphical representations are used throughout: visual aids enhance
comprehension and enable users to quickly grasp trends and patterns. The interface
follows responsive and adaptive design principles so that the forecasts remain usable
across different platforms, devices, and screen sizes.

1.4 Expected result

The expected outcome is the development and implementation of an advanced
weather forecasting system for the Yangon Region that significantly enhances the
accuracy, reliability, and accessibility of weather predictions through the integration
of meteorological models, comprehensive data sources, and a user-friendly interface.
The system should provide users with an extended lead time for weather prediction,
allowing better preparedness and decision-making in their daily activities.

1.5 Organization of Thesis

This paper comprises six chapters. Chapter 1 serves as an introduction to the
topic, providing an overview of the subject matter and outlining the expected results
of the research. Chapter 2 offers a comprehensive understanding of time series
analysis, covering foundational components, transformation techniques, and
forecasting methods, along with visualizations that help interpret results. Chapter
3 deals with the methodology of machine learning techniques. Chapter 4 presents the
implementation of the machine learning methodology and statistical analysis, along
with the deployment of our model as a web application. Chapter 5 is dedicated to the
results of the research. Chapter 6 discusses our findings and possible future work.

CHAPTER II

LITERATURE REVIEW

2.1 Integration of Time-Series with Machine Learning Techniques in Recent Developments

Recent advancements in data collection methodologies have significantly
improved the quality and scope of time-series datasets. Liu (2020) and Talavera
(2022) both highlight the importance of data augmentation techniques, such as Add
Noise, Permutation, Scaling, and Warping, in enhancing the performance of deep
learning models in time series classification tasks. These techniques are particularly
useful in scenarios with limited data resources. Struckov (2019) emphasizes the need
for modern tools and techniques for storing time-series data, which is crucial for
maintaining the integrity and accessibility of these datasets. Wen (2020) further
underscores the significance of data augmentation in deep learning, providing a
comprehensive review of different methods and their applications in time series
analysis tasks.

2.2 ARIMA and SARIMA models

The Autoregressive Integrated Moving Average (ARIMA) model, developed by Box
and Jenkins in 1970 and also known as the Box–Jenkins method, stands, together with
its seasonal counterpart (SARIMA), as a prominent statistical technique widely
applied in meteorology for time-series forecasting. Recognized for its efficacy in
capturing short-term variations and long-term trends, ARIMA remains a valuable tool
in the dynamic field of meteorological predictions.

The Seasonal Autoregressive Integrated Moving Average (SARIMA) model,
an extension of the well-established ARIMA methodology, plays a pivotal role in
unraveling the complexities of meteorological time-series forecasting, with a
particular emphasis on capturing seasonal patterns (Box & Jenkins, 1976). Developed
as an evolution of ARIMA, SARIMA integrates seasonal components into its three
fundamental elements: Autoregressive (AR), Integrated (differencing), and Moving
Average (MA), denoted as p, d, q. Additionally, SARIMA introduces seasonal
parameters P, D, Q and m, where m represents the seasonal period. This integration
allows SARIMA to offer a comprehensive and flexible framework tailored to the
intricacies of seasonal variations present in meteorological data.

2.3 Integration with Machine Learning

Machine learning and statistical techniques, particularly ARIMA and SARIMA, have been
successfully applied in weather forecasting. Shivhare (2021) developed an ARIMA-based
daily weather forecasting tool, achieving accurate results for rainfall and
temperature data. Li (2012) proposed a neuro-fuzzy system with ARIMA models,
demonstrating superior performance in time series forecasting. Khashei (2012)
introduced a hybrid model combining SARIMA and computational intelligence
techniques, overcoming the limitations of SARIMA models and improving
forecasting accuracy. Ashwini (2021) used SARIMA to predict Tamil Nadu monsoon
rainfall, achieving accurate results with low error. These studies collectively highlight
the effectiveness of ARIMA and SARIMA in weather forecasting, particularly when
combined with other machine learning techniques.

2.4 Categorization of Weather Condition

A range of studies have explored the use of various models and techniques for
weather prediction and categorization. Mantri (2021) developed a hybrid model using
neural networks and k-nearest neighbors to predict temperature, humidity, and
weather conditions, achieving high accuracy. However, Thompson (1952) highlighted
the limitations of conventional categorical weather forecasts, suggesting the need for
more tailored and probabilistic approaches. Chauhan (2014) provided a
comprehensive review of data mining techniques for weather prediction, emphasizing
the potential for further advancements in this field. Lastly, Sapronova (2014)
demonstrated the effectiveness of categorization in improving the accuracy of an
artificial neural network-based model for wind speed and direction forecasting.

2.5 Limitation of Weather Datasets

The use of real data in weather and climate models can be limited by its
representativeness, cost, and licensing restrictions (Meyer, 2020). This is particularly
relevant in developing countries, where the lack of accessible and reliable
meteorological datasets can introduce uncertainties in crop yield response estimates
(Parkes, 2019). The scarcity of datasets with pasture yield and nutritional parameters,
remote sensing, and weather information further limits the design of prediction
models for forage conditions (Defalque, 2024).

CHAPTER III

RESEARCH METHODOLOGY

3.1 Artificial Intelligence

Artificial intelligence (AI) refers to computer systems designed to perform
tasks that typically require human intelligence. These systems can learn from data,
recognize patterns, and make decisions with minimal human intervention. AI is used
in various applications such as speech recognition (like Siri or Alexa), image
recognition (identifying objects in photos), and natural language processing
(understanding and generating human language). Machine learning enables computers
to improve their performance on tasks over time through experience, without explicit
programming. As technology advances, AI's capabilities are expected to expand,
influencing society in profound ways.

3.1.1 Machine Learning

Machine learning is a branch of artificial intelligence (AI) where computers
learn to perform tasks without being explicitly programmed. There are different types
of machine learning. Supervised learning involves training the computer on labeled
data, where it learns to map inputs (like images) to outputs (like "cat" or "dog").
Machine learning is used in many practical applications today, from predicting
weather patterns to recommending movies on streaming platforms. As computers
process more data and algorithms improve, machine learning continues to advance,
making our technology smarter and more capable.

3.2 Development Environment

Our study is primarily conducted in Jupyter Notebook and Visual studio code
using python as main language and we will be utilizing PHP for building web-
application. Also, we will be implementing Git to perform version control activities.

Table 3.1 Development Tools

No  Name                           Description
1   Jupyter Notebook               Development environment for model development and training
2   PHP (Hypertext Preprocessor)   Intended programming language for model deployment
3   Python                         Main programming language for core functions and models
4   Visual Studio Code             Development environment for model deployment
5   Git                            Version control system

Table 3.2 lists the libraries and packages used in our research. The datetime
module is used for manipulating dates and times and performing calculations on them.
JSON (JavaScript Object Notation) is a lightweight data interchange format; to parse
JSON files we use Python's json module. Pandas is a library commonly used for data
manipulation and statistical analysis. Pytz is a Python library for time zone
calculations and conversions. Requests is used for making network requests in Python.
We used the time module to pause between web requests so as not to hit the rate limit.

Table 3.2 Libraries Used in Analysis

No  Name      Description
1   datetime  For manipulating dates and times
2   json      To handle JSON files
3   Pandas    Data manipulation and analysis
4   Pytz      Time zone conversion
5   Requests  For HTTP requests
6   time      Accessing times and pausing between requests


As shown in Table 3.3, Scikit-learn is a widely used Python library for machine
learning tasks like classification, regression, clustering, and dimensionality reduction.
Statsmodels is a specialized Python library for statistical modeling, hypothesis testing,
and time series analysis, offering a comprehensive suite of statistical models and
methods.

Table 3.3 Libraries Used for Machine Learning and Statistical Learning

No  Name          Description
1   Scikit-learn  For machine learning, model ensembles, and analysis
2   Statsmodels   For statistical models and analysis

Table 3.4 shows the tools used for visualization in our research. Matplotlib,
Seaborn, and Plotly are all popular Python libraries for data visualization.
Matplotlib is used for basic line, bar, and scatter plots. Plotly excels at interactive
visuals and 3D plots. Seaborn is used for heatmaps and other advanced statistical
plots (such as box plots and pair plots).

Table 3.4 Libraries Used for Data Visualization

No  Name        Description
1   Matplotlib  Basic simple visualization
2   Plotly      Customizable interactive visualization
3   Seaborn     Advanced statistical visualization

3.3 Flowchart of Weather Forecasting of the Yangon Region Using Machine Learning

Fig 3.1 Flow Chart

The flow chart visually illustrates the process involved in our research, from
data collection through model deployment.

3.4 Expected Timeline

The expected timeline represents the time each step of our research will take
and shows readers how much of the research has been completed.

Fig 3.2 Expected Timeline

3.5 Limitations of Our Research

The following limitations may have an impact on our research:

Handling Missing Values and Outliers:

Imputation techniques such as mean imputation or interpolation, while
commonly used for fixing missing values, can introduce bias and distort the original
data distribution, potentially affecting model accuracy. Likewise, outlier detection
methods based on statistical thresholds or visualization may fail to capture subtle
outliers, leading to incomplete data cleansing and potential model instability.

Handling High and Low Cardinality Variables:

Grouping infrequent categories into an "other" category or dropping high
cardinality variables may oversimplify the data representation, potentially
overlooking important distinctions and reducing the model's ability to capture
nuanced relationships. Additionally, one-hot encoding of low cardinality variables can
lead to increased dimensionality and computational complexity, impacting model
training and performance.

Handling Multicollinearity:

While Pearson correlation analysis and heatmap visualization effectively
identify multicollinearity, they may not capture nonlinear relationships or complex
interactions among predictors. Failing to address such nuances could lead to inflated
standard errors, misleading coefficient estimates, and reduced model interpretability,
particularly in models reliant on precise variable relationships.
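A minimal sketch of the mean-imputation and Pearson-correlation screening mentioned above, using pandas on a tiny illustrative frame (the values are made up, not our weather data):

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame with missing values (not our real dataset)
df = pd.DataFrame({"temp": [25.0, np.nan, 26.5, 27.0],
                   "humidity": [70.0, 72.0, np.nan, 68.0]})

# Mean imputation: simple, but it can bias the distribution as noted above
imputed = df.fillna(df.mean(numeric_only=True))

# Pearson correlation matrix used to screen for multicollinearity
corr = imputed.corr(method="pearson")
```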

3.6 Data Collection

Weather data for the Yangon region was sourced from Open Weather ([Link]),
a comprehensive weather data provider offering real-time and historical weather
information through an accessible API. Open Weather provides a wide range of
meteorological parameters including temperature, atmospheric pressure, humidity,
minimum temperature, maximum temperature, wind speed, wind degree, wind gust,
cloudiness, and rain.

The data collection process involved the following steps:

i. Accessing Open Weather API: Utilizing the Open Weather API, which
provides developers access to weather data for various geographic locations
worldwide, including the Yangon region.
ii. API Requests: Requests were made to the Open Weather API specifying the
geographical coordinates (latitude and longitude) of the Yangon region and the
desired time range for historical weather data retrieval.
iii. Data Retrieval: Historical weather data for Yangon was retrieved from Open
Weather in JSON format, containing detailed information for each specified
parameter at regular intervals (e.g., hourly or daily measurements).

Using Open Weather as the primary data source facilitated access to
comprehensive and up-to-date weather data for the Yangon region, enabling robust
analysis and modeling for weather prediction purposes.
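A sketch of how such a request might be assembled with the Requests library. The endpoint and parameter names follow OpenWeather's public History API documentation at the time of writing, but treat them as assumptions to verify against the current docs:

```python
def build_history_request(lat, lon, start_unix, end_unix, api_key):
    """Assemble the URL and query parameters for one historical-data call."""
    base_url = "https://history.openweathermap.org/data/2.5/history/city"
    params = {
        "lat": lat,            # latitude of the Yangon region
        "lon": lon,            # longitude of the Yangon region
        "type": "hour",        # hourly measurements
        "start": start_unix,   # UNIX timestamp (UTC) of the window start
        "end": end_unix,       # UNIX timestamp (UTC) of the window end
        "appid": api_key,      # personal API key
    }
    return base_url, params

def fetch_history(lat, lon, start_unix, end_unix, api_key):
    """Perform one API call and return the parsed JSON response."""
    import requests  # imported here so the URL builder can be tested offline
    url, params = build_history_request(lat, lon, start_unix, end_unix, api_key)
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()
    return response.json()
```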

3.7 Exploratory Data Analysis and Feature Engineering

During Exploratory Data Analysis (EDA) and Feature Engineering, we
conduct a comprehensive examination of the dataset using descriptive analysis and
visualizations to uncover patterns, trends, and relationships. Techniques such as
descriptive statistics, data visualization, correlation analysis, missing data analysis,
and outlier detection are employed to deepen our understanding of the data's structure.
Furthermore, we systematically engineer features to optimize model performance and
interpretability through transformations, creations, selections, and encoding of
variables. This includes normalization, scaling, log transformations, temporal, and
spatial aggregation to enrich the dataset with informative variables, ensuring
adherence to model assumptions and consistent scales for improved predictive
accuracy and insights. Combining these steps allows for a holistic approach to dataset
exploration and refinement, enabling us to uncover valuable insights and prepare the
data effectively for model training and evaluation.
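A brief sketch of these EDA and feature engineering steps with pandas, run on a small hypothetical hourly frame (the column names and values are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical hourly frame standing in for the real Yangon dataset
idx = pd.date_range("2023-01-22", periods=48, freq="h", tz="Asia/Yangon")
rng = np.random.default_rng(0)
df = pd.DataFrame({"temp": 25 + 5 * rng.random(48),
                   "humidity": 60 + 20 * rng.random(48)}, index=idx)

summary = df.describe()                       # descriptive statistics
corr = df.corr(method="pearson")              # correlation analysis
df["log_humidity"] = np.log(df["humidity"])   # log transformation
df["hour"] = df.index.hour                    # temporal feature creation
daily_mean = df[["temp"]].resample("D").mean()  # temporal aggregation
```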

3.8 Models and Methods Used in Research

We plan to utilize the following models and methods in the study to choose the
best option for our research:

ARIMA: ARIMA is a model used to forecast time series data by combining past
values, differencing, and moving averages. ARIMA has three components, namely
Autoregressive (AR), Integrated (I), and Moving Average (MA), parameterized as
(p, d, q) respectively. The AR(p) component captures the linear relationship
between the current observation and its past values, while the MA(q) component
captures the linear relationship between the current observation and its past errors.
The I(d) parameter represents the number of differences needed to make the series
stationary.

The equation for ARIMA(p,d,q) is as follows:

Yt=c+ϕ1Yt−1+⋯+ϕpYt−p+θ1et−1+⋯+θqet−q+et

Where:

 Yt is the value of the time series at time t.


 c is a constant term (also called the intercept).
 ϕ1,ϕ2,…,ϕp are the parameters of the autoregressive (AR) terms.
 θ1,θ2,…,θq are the parameters of the moving average (MA) terms.
 et is the error term at time t, which is assumed to be normally distributed with
mean 0 and constant variance.

SARIMA: The SARIMA model is an extension of the ARIMA model that incorporates
seasonality into the analysis. It is useful for time series data that exhibit
seasonality. SARIMA is denoted as SARIMA(p,d,q)(P,D,Q)m, where the additional
(P, D, Q) terms represent the seasonal components and m represents the seasonal
period.

The SARIMA model can be expressed as follows:

Yt = c + ϕ1Yt−1 + ϕ2Yt−2 + … + ϕpYt−p + θ1et−1 + θ2et−2 + … + θqet−q
   + Φ1Yt−m + Φ2Yt−2m + … + ΦPYt−Pm + Θ1et−m + Θ2et−2m + … + ΘQet−Qm + et

Where:

 Yt is the value of the time series at time t.


 c is a constant term (intercept).
 ϕ1,ϕ2,…,ϕp are the parameters of the autoregressive (AR) terms.
 θ1,θ2,…,θq are the parameters of the moving average (MA) terms.
 Φ1,Φ2,…,ΦP are the parameters of the seasonal autoregressive (SAR) terms.
 Θ1,Θ2,…,ΘQ are the parameters of the seasonal moving average (SMA) terms.
 et is the error term at time t, assumed to be normally distributed with mean 0
and constant variance.
 m represents the seasonal period.

Linear Regression: Linear regression, one of the most commonly used algorithms,
models the relationship between a dependent variable Y and one or more
independent variables X. In simple linear regression there is only one
independent variable, while multiple linear regression involves two or more
independent variables.

For simple linear regression:

Y=β0+β1X+e

Where:

 Y is the dependent variable (the variable we are trying to predict).


 X is the independent variable (the variable used to make predictions).
 β0 is the y-intercept, the value of Y when X is 0.
 β1 is the slope of the line, representing the change in Y for a one-unit change in
X.
 e is the error term, representing the difference between the observed values of
Y and the values predicted by the model.

For multiple linear regression with n independent variables:

Y=β0+β1X1+β2X2+…+βnXn+e

Where:

 Y is the dependent variable.


 X1,X2,…,Xn are the independent variables.
 β0 is the y-intercept.
 β1,β2,…,βn are the coefficients (slopes) associated with each independent
variable.
 e is the error term, representing the difference between the observed values of
Y and the values predicted by the model.

The goal of linear regression is to estimate the coefficients β0,β1,…,βn that
minimize the difference between the observed values of Y and the values predicted by
the model.
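For example, the coefficient estimation described above can be sketched with scikit-learn on a toy dataset generated from Y = 1 + 2X (an illustrative assumption, not our weather data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated from Y = 1 + 2X with no noise
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 1 + 2 * X.ravel()

model = LinearRegression()
model.fit(X, y)                       # estimates beta_0 (intercept_) and beta_1 (coef_)
prediction = model.predict([[10.0]])  # expected to be close to 1 + 2*10 = 21
```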

Ensemble models: Ensemble models combine multiple individual models to improve
predictive performance and robustness. Ensemble methods include Bagging,
Boosting, and Stacking, among others, each with its own approach to combining
predictions.

i. Bagging (Bootstrap Aggregating): Bagging involves training multiple base
models independently on different bootstrap samples of the training data. By
averaging their predictions (for regression) or using a majority vote (for
classification), bagging reduces overfitting and enhances predictive accuracy.

ii. Boosting: Boosting is an ensemble method that improves predictive accuracy
by training a series of models sequentially, emphasizing instances where
previous models struggled, leading to enhanced overall performance and
adaptability to complex patterns in the data. Popular algorithms include
AdaBoost and Gradient Boosting (e.g., XGBoost, LightGBM).

iii. Stacking (Stacked Generalization): Stacking combines predictions from
diverse base models by training a meta-learner on their outputs, exploiting the
complementary strengths of different algorithms for improved overall
accuracy and robustness.
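The three ensemble strategies above can be sketched with scikit-learn; the synthetic data and the choice of base learners here are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import (BaggingRegressor, GradientBoostingRegressor,
                              StackingRegressor)
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Toy regression problem standing in for a weather target
rng = np.random.default_rng(2)
X = rng.uniform(0, 10, (200, 1))
y = np.sin(X.ravel()) + rng.normal(0, 0.1, 200)

# Bagging: many trees on bootstrap samples, predictions averaged
bagging = BaggingRegressor(DecisionTreeRegressor(), n_estimators=20,
                           random_state=0).fit(X, y)

# Boosting: trees trained sequentially on the errors of their predecessors
boosting = GradientBoostingRegressor(random_state=0).fit(X, y)

# Stacking: a meta-learner combines the base models' predictions
stacking = StackingRegressor(
    estimators=[("tree", DecisionTreeRegressor(max_depth=3)),
                ("lin", LinearRegression())],
    final_estimator=LinearRegression(),
).fit(X, y)
```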

3.8.1 AdaBoost (Adaptive Boosting)

AdaBoost, short for Adaptive Boosting, is an ensemble learning algorithm that
combines multiple weak learners to form a strong classifier. It is particularly effective
for binary classification tasks and works by sequentially training weak learners,
focusing on the instances that previous classifiers misclassified. The goal of AdaBoost
is to improve the accuracy of the model by emphasizing harder-to-classify instances.

The algorithm proceeds as follows (steps 2–5 are repeated for each iteration t):

(1) Initialize Weights: Assign equal weights to all training samples. Initially,

ω_i^(1) = 1/N, ∀ i = 1, 2, …, N

where N is the number of training samples.

(2) Train Weak Learner: Train a weak learner ht using the weighted training samples.

(3) Calculate Weighted Error: Compute the error of the weak learner,

ϵ_t = Σ_{i=1}^{N} ω_i^(t) · I(y_i ≠ h_t(x_i))

where I(⋅) is the indicator function, y_i is the actual label, and h_t(x_i) is the
predicted label.

(4) Compute Weak Learner Weight: Calculate the weight of the weak learner,

α_t = (1/2) · ln((1 − ϵ_t) / ϵ_t)

(5) Update Sample Weights: Adjust the weights of the training samples,

ω_i^(t+1) = ω_i^(t) · exp(−α_t y_i h_t(x_i))

Normalize the weights so they sum to 1:

ω_i^(t+1) ← ω_i^(t+1) / Σ_{j=1}^{N} ω_j^(t+1)

(6) Final Hypothesis: Combine the weak learners to form the final strong classifier,

H(x) = sign( Σ_{t=1}^{T} α_t h_t(x) )

where T is the total number of iterations.

Explanation of Terms

 Weak Learner: A simple model that performs slightly better than random
guessing. Common examples include decision stumps (one-level decision
trees).
 Weighted Error ϵ_t: The proportion of incorrectly classified samples, weighted
by their importance.
 Learner Weight α_t: The contribution of each weak learner to the final model,
calculated based on its accuracy.
 Sample Weights ω_i: Weights assigned to each training sample, adjusted to
emphasize harder-to-classify instances in subsequent iterations.
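The procedure above corresponds to scikit-learn's AdaBoostClassifier; a minimal sketch on a toy problem follows (the data is an illustrative assumption, not weather data):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy binary classification problem: class depends on the sign of x0 + x1
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Depth-1 trees ("decision stumps") are the classic AdaBoost weak learner
clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                         n_estimators=50, random_state=0)
clf.fit(X, y)
train_accuracy = clf.score(X, y)
```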

3.9 Validation of Machine Learning Models

Model validation is an important part of the machine learning process: it ensures
that a model performs well on new, unseen data and meets the intended goal of the
problem being solved. Besides confirming performance on unseen data, validation can
be used to select the best model among multiple candidates. Ways to validate machine
learning models include the following techniques:

(1) Cross-Validation: a technique used to assess how the results of a statistical
analysis will generalize to an independent data set.
(1) i K-Fold Cross-Validation: The data is divided into k subsets (folds).
The model is trained k times, each time using k−1 folds for training
and the remaining fold for validation. The performance metric is
averaged over all k folds.
(1) ii Leave-One-Out Cross-Validation (LOOCV): A special case of k-fold
cross-validation where k equals the number of data points. Each data
point is used as the validation set once, and the remaining data points
are used for training.
(2) Stratified Cross-Validation: A variation of k-fold cross-validation where the
folds are created to preserve the proportion of classes in each fold, which is
especially useful for imbalanced datasets.
(3) Train-Test Split: involves dividing the dataset into two separate parts: one for
training the model and one for testing its performance.
(3) i Simple Train-Test Split: The data is divided into a training set
(typically 70-80% of the data) and a testing set (20-30% of the data).
The model is trained on the training set and evaluated on the testing
set.
(3) ii Validation Set: An additional split where a portion of the training data
is held out to tune hyperparameters.
(4) Holdout Validation: is similar to the train-test split but may involve additional
steps. The dataset is divided into a training set and a holdout set (also called a
test set) that is used only for final evaluation.
(5) Bootstrap Aggregating (Bagging): involves repeatedly sampling from the
dataset with replacement and training a model on each sample to evaluate its
performance. Multiple models are trained on different bootstrapped samples,
and their predictions are averaged (for regression) or voted on (for
classification).
(6) Performance Metrics: Performance metrics are used to evaluate the model’s
effectiveness. Common metrics include:

(6) i Classification Metrics:

Accuracy: The proportion of correctly predicted instances.

Accuracy = Number of Correct Predictions / Total Number of Predictions

Precision: The proportion of true positive predictions among all positive predictions.

Precision = True Positives / (True Positives + False Positives)

Recall (Sensitivity): The proportion of true positive predictions among all
actual positives.

Recall = True Positives / (True Positives + False Negatives)

F1 Score: The harmonic mean of precision and recall.

F1 Score = 2 × (Precision × Recall) / (Precision + Recall)

ROC-AUC Score: The area under the Receiver Operating Characteristic curve,
indicating the model's ability to distinguish between classes.

(6) ii Regression Metrics:

Mean Absolute Error (MAE): The average of absolute differences
between predicted and actual values.

MAE = (1/n) Σ |Predicted Value − Actual Value|

Mean Squared Error (MSE): The average of squared differences between
predicted and actual values.

MSE = (1/n) Σ (Predicted Value − Actual Value)²

Root Mean Squared Error (RMSE): The square root of the average
squared differences between predicted and actual values.

RMSE = √MSE

R-Squared: The proportion of variance in the dependent variable that is
predictable from the independent variables.

(7) Model Selection Criteria: Model selection criteria like AIC and BIC are used
to choose the best model from a set of candidate models. They incorporate
both the goodness of fit (how well the model explains the data) and the
complexity of the model (to penalize overfitting).
(7) i AIC (Akaike Information Criterion): Balances model fit with the
number of parameters. Lower AIC values indicate a better model.
Formula:
AIC = −2 ln(L̂) + 2k
 L̂: the maximized value of the model's likelihood function.
 k: the number of parameters in the model.
(7) ii BIC (Bayesian Information Criterion): Similar to AIC but imposes a
stronger penalty for models with more parameters, especially as the
sample size increases. Lower BIC values indicate a better model.

Formula:

BIC = −2 ln(L̂) + k ln(n)

 L̂: the maximized value of the model's likelihood function.
 k: the number of parameters in the model.
 n: the number of observations.
(8) Hyperparameter Tuning: Hyperparameter Tuning involves adjusting the
model’s hyperparameters to improve performance.
(8) i Grid Search: Exhaustively searches a predefined set of
hyperparameters.
(8) ii Random Search: Samples random combinations of hyperparameters
and evaluates them.
(8) iii Bayesian Optimization: Uses probabilistic models to select
hyperparameters based on past evaluation results.
(9) Model Comparison: Model comparison involves evaluating several models to
choose the best one. This can be done by assessing different models based on
the same validation criteria or by comparing the model against baseline
models or existing solutions.
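A compact sketch tying several of these validation ideas together (train-test split, k-fold cross-validation, grid search, and regression metrics) with scikit-learn on synthetic data; the data and the Ridge model are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import (GridSearchCV, KFold, cross_val_score,
                                     train_test_split)

# Synthetic regression data standing in for real features and targets
rng = np.random.default_rng(4)
X = rng.uniform(0, 10, (150, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, 150)

# Simple train-test split (80/20)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# 5-fold cross-validation of one candidate model
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(Ridge(), X_tr, y_tr, cv=cv,
                         scoring="neg_mean_absolute_error")

# Grid search over the regularization strength (hyperparameter tuning)
grid = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=cv).fit(X_tr, y_tr)

# Final held-out evaluation with MAE and RMSE
pred = grid.predict(X_te)
mae = mean_absolute_error(y_te, pred)
rmse = float(np.sqrt(mean_squared_error(y_te, pred)))
```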

3.10 Expected Result & What to do with it

We expect our models to achieve our research objective of forecasting weather
with good accuracy. Leveraging the Python programming language and essential
libraries such as scikit-learn and statsmodels, we will develop and evaluate various
machine learning algorithms and statistical models. First, the dataset will be split into
training and testing subsets to assess model performance effectively. Subsequently, we
will explore a range of algorithms including ARIMA, SARIMA, regression models,
and ensemble methods to identify the most suitable approach for predicting weather
patterns in the Yangon region. Furthermore, we will conduct visual analysis through
time series plots, residual plots, and prediction intervals to scrutinize model
performance and identify potential areas for improvement. Through comprehensive
evaluation and comparison of multiple models, we aim to identify the most suitable
approach for accurate and reliable weather forecasting in the Yangon region, ensuring
the practical applicability and effectiveness of our predictive models in real-world
scenarios.

After achieving our expected results, we will use Python and a development
framework to optimize model performance and scalability for real-time use. We will
also use PHP to build a user-friendly web application that lets end-users interact with
our system conveniently, helping them make informed decisions before going out, with
timely and accurate forecasts. We intend to create a comfortable environment for
end-users to learn about current and future weather, and we also aim to reach people
who do not usually watch weather forecasts on television, alerting them before they
begin their daily activities.

CHAPTER IV

IMPLEMENTATION

4.1 Data Collection

For our weather prediction project, we obtained our data from OpenWeatherMap,
an online service owned by OpenWeather Ltd. We were given access to the Developer
plan for current weather and forecasts and to the Medium plan for historical weather
collection.

The data we collected is hourly data ranging from January 22, 2023 to January 27, 2024. We collected it with a Python script that requests data from the OpenWeather API and saves the responses as JSON files, since they arrive in JSON format. As the history API has a limit of 169 records per call, we iterated our calls in steps of 604,800 s (168 hours) until we had received the desired data. The data includes time at a 1-hour interval, temperature, feels-like temperature, atmospheric pressure, humidity, minimum temperature, maximum temperature, wind speed, wind direction, wind gust, cloudiness, weather description, and rain.
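The collection loop described above can be sketched as follows. This is a minimal sketch: the endpoint URL and query parameter names are assumptions modeled on OpenWeatherMap's History API, and the API key is a placeholder.

```python
import json
import urllib.request


def hour_windows(start_unix, end_unix, step=604800):
    """Split [start_unix, end_unix] into windows of `step` seconds.

    604800 s = 168 h, which keeps each request under the
    169-records-per-call limit of the history API.
    """
    windows = []
    t = start_unix
    while t < end_unix:
        windows.append((t, min(t + step, end_unix)))
        t += step
    return windows


def fetch_history(api_key, lat, lon, start, end):
    """Request one window of hourly history and return the parsed JSON.

    The URL and parameter names are illustrative, not confirmed by the
    thesis -- consult the provider's documentation before use.
    """
    url = ("https://history.openweathermap.org/data/2.5/history/city"
           f"?lat={lat}&lon={lon}&type=hour&start={start}&end={end}"
           f"&appid={api_key}")
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)
```

Each `(start, end)` window can then be fetched in turn and written with `json.dump` to its own file.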

Fig 4.1 Data Collection Process



4.2 System Flow Chart

Fig 4.2 System Flow Chart

4.3 Data Preprocessing

The collected data was initially in UNIX timestamp format in the UTC time zone. We converted it to datetime format (YYYY-MM-DD hh:mm:ss+06:30), where +06:30 indicates the time zone. Since rain is measured each hour only when data is available and arrives in JSON format, we extracted these values into a new column called rain_1h. Upon analyzing the data, we checked for duplicate timestamps and found 51 duplicated data points, which we removed from the dataset. We also inspected the data for null values and found that wind_gust and rain contain nulls, as they are recorded only when data is available. We removed weather_id and weather_icon because they are not used in our model, and also removed wind_gust and main_feels_like because we do not consider them in our model. After that, we set dt as the index to begin modelling. With dt as the index, we can easily perform operations such as resampling and time-based indexing, which are crucial in time series analysis. Table 4.1 describes the data before duplicate removal, with a RangeIndex of 8955 entries.
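The preprocessing steps above can be sketched with pandas. This is a sketch, not the thesis code; the column names assume the flattened OpenWeatherMap JSON shown in the tables below.

```python
import pandas as pd


def preprocess(df):
    """Apply the preprocessing described in the text (sketch)."""
    # UNIX timestamp (UTC) -> timezone-aware datetime in Asia/Yangon (+06:30)
    df["dt"] = (pd.to_datetime(df["dt"], unit="s", utc=True)
                  .dt.tz_convert("Asia/Yangon"))
    # 'rain' holds dicts like {"1h": 0.5} when present; extract into rain_1h
    df["rain_1h"] = df["rain"].apply(
        lambda r: r.get("1h") if isinstance(r, dict) else None)
    # drop duplicated timestamps and the unused columns
    df = df.drop_duplicates(subset="dt")
    df = df.drop(columns=["rain", "weather_id", "weather_icon",
                          "wind_gust", "main_feels_like"], errors="ignore")
    # set dt as the index for resampling and time-based indexing
    return df.set_index("dt").sort_index()
```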

Table 4.1 Info of data before removal

# Column Non-Null Count Dtype
1 dt 8955 non-null datetime64[ns, Asia/Yangon]
2 main_temp 8955 non-null float64
3 main_feels_like 8955 non-null float64
4 main_pressure 8955 non-null int64
5 main_humidity 8955 non-null int64
6 main_temp_min 8955 non-null float64
7 main_temp_max 8955 non-null float64
8 wind_speed 8955 non-null float64
9 wind_deg 8955 non-null int64
10 wind_gust 578 non-null float64
11 clouds_all 8955 non-null int64
12 weather_id 8955 non-null int64
13 weather_main 8955 non-null object
14 weather_description 8955 non-null object
15 weather_icon 8955 non-null object
16 rain 912 non-null object

After preprocessing and setting the datetime as the index, dt becomes the index with 8904 entries, ranging from 2023-01-22 [Link]+06:30 to 2024-01-27 [Link]+06:30. Table 4.2 shows the result of the process.

Table 4.2 Info of data after preprocessing and setting datetime as index

# Column Non-Null Count Dtype
0 main_temp 8904 non-null float64
1 main_pressure 8904 non-null int64
2 main_humidity 8904 non-null int64
3 main_temp_min 8904 non-null float64
4 main_temp_max 8904 non-null float64
5 wind_speed 8904 non-null float64
6 wind_deg 8904 non-null int64
7 clouds_all 8904 non-null int64
8 weather_main 8904 non-null object
9 weather_description 8904 non-null object
10 rain_1h 906 non-null float64

4.4 Exploratory Data Analysis (EDA) & Feature Engineering

After the preprocessing phase, EDA starts with descriptive statistics of the data. Table 4.3 shows the descriptive statistics of our data after converting the temperature unit from Kelvin to Celsius for easier comprehension.
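The unit conversion and summary statistics can be sketched as follows; the column names are assumptions matching the tables in this section.

```python
import pandas as pd

KELVIN_OFFSET = 273.15


def to_celsius(df, cols=("main_temp", "main_temp_min", "main_temp_max")):
    """Return a copy with the temperature columns converted from K to degC."""
    df = df.copy()
    for c in cols:
        if c in df.columns:
            df[c] = df[c] - KELVIN_OFFSET
    return df

# The descriptive statistics of Table 4.3 come from an equivalent call:
# stats = to_celsius(df).describe()
```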

Table 4.3 Descriptive Statistics of Data

      main_temp   main_pressure  main_humidity  main_temp_min
count 8904.000000 8904.000000    8904.000000    8904.000000
mean  27.940309   1009.011343    79.859164      27.940115
std   3.419749    3.309927       16.220746      3.419805
min   17.570000   1000.000000    14.000000      17.570000
25%   25.980000   1007.000000    70.000000      25.980000
50%   26.980000   1009.000000    83.000000      26.980000
75%   29.980000   1011.000000    94.000000      29.980000
max   40.980000   1019.000000    100.000000     40.980000

Table 4.3 Descriptive Statistics of Data Continued

      main_temp_max  wind_speed  wind_deg    clouds_all
count 8904.000000    8904.000000 8904.000000 8904.000000
mean  27.940478      2.541006    205.840746  52.654313
std   3.419748       1.445915    100.331652  33.008408
min   17.570000      0.000000    0.000000    0.000000
25%   25.980000      1.540000    150.000000  20.000000
50%   26.980000      2.060000    230.000000  40.000000
75%   29.980000      3.600000    280.000000  75.000000
max   40.980000      7.720000    360.000000  100.000000

Temperature, the primary focus of our research, has a mean of 27.9403°C and a median of 26.98°C. Its minimum and maximum values are 17.57°C and 40.98°C, respectively, and its standard deviation is 3.4197°C.

Fig 4.3 Boxplot of Yangon Temperature (January 22, 2023 to January 27,2024)

In the Fig 4.3 box plot, some values exceed the upper and lower whiskers. To check whether these are true outliers, we examined the times at which they occur and found that the lowest and highest temperatures were recorded on 2024-01-22 and 2023-04-26, respectively. As these dates fall in winter and summer, we concluded that they are not outliers and kept them unmodified, as they are reasonable values.

Fig 4.4 Line Chart of Yangon Temperature in Daily Average (January 22, 2023
to January 27,2024)

In Fig 4.4, we can see how the temperature moves from day to day, and we can assume that the series is stationary. To make sure, we will check the moving average, which removes short-term fluctuations and exposes longer-term trends or cycles.
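The smoothing step can be sketched with pandas' rolling mean; a window of 24 matches one day of hourly data.

```python
import pandas as pd


def moving_average(series, window=24):
    """24-step (one-day) moving average to smooth short-term fluctuations."""
    return series.rolling(window=window).mean()
```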

Fig 4.5 Line Chart of Yangon Temperature in Moving Average of 24 (January


22, 2023 to January 27,2024)

As observed in Figures 4.4 and 4.5, our data does not exhibit a significant trend. However, to confirm its stationarity, we will employ the Augmented Dickey-Fuller (ADF) test. This test examines the null hypothesis that a unit root is present, implying non-stationarity. The alternative hypothesis posits the absence of a unit root, indicating that our data is stationary.

Null Hypothesis H0:

Unit Root = 1

Alternative Hypothesis H1:

Unit Root < 1

The results of the ADF test can be seen in Table 4.4. As shown in the table, the p-value of 0.000028 allows us to reject the null hypothesis of a unit root, thereby confirming the stationarity of our data.

Table 4.4 ADF Test statistics of Temperature Data

ADF Statistic -4.951574

p-value: 0.000028

Critical Value 1% -3.431

Critical Value 5% -2.862

Critical Value 10% -2.567



Fig 4.6 Seasonal Decomposition of Yangon Temperature (January 22, 2023 to


January 27,2024)

Fig 4.6 presents the seasonal decomposition of our temperature data, breaking
it down into Trend, Seasonal, and Residual components. The 'Observed' plot
represents our original temperature data. The 'Trend' plot confirms the absence of a
significant trend, while the 'Seasonal' plot clearly illustrates the presence of
seasonality in our data. The 'Residual' plot, which represents the remaining data after
the removal of trend and seasonal components, does not exhibit any discernible
pattern. This lack of pattern in the residuals assures us that our assumptions have not
overlooked any significant aspects of the data.

Fig 4.7 ACF and PACF of Yangon Temperature (January 22, 2023 to January
27,2024)

Fig 4.7 displays the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots of our data. Both plots exhibit a gradual decay, indicative of a possible autoregressive process. The ACF plot shows significant spikes at regular intervals, suggesting seasonality in our data; this pattern further confirms the presence of a seasonal component in our dataset.

4.5 Model Building and Evaluation for Forecasting Temperature

In this step, we aim to identify a suitable model for our data. We begin by splitting the dataset into a 75% training set (spanning 2023-01-22 to 2023-10-26) and a 25% testing set (2023-10-27 to 2024-01-27), using scikit-learn's train_test_split function with shuffle=False. We then establish a baseline by predicting the training mean, which yields a Mean Absolute Error (MAE) of 2.6651. Our goal is to develop a model that outperforms this baseline.
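The chronological split and mean baseline can be sketched as follows (on a stand-in array rather than the thesis series):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

y = np.arange(100, dtype=float)  # stand-in for the hourly temperature series

# chronological 75/25 split -- shuffle=False keeps the time order intact
y_train, y_test = train_test_split(y, test_size=0.25, shuffle=False)

# baseline: predict the training mean for every test point
baseline_pred = np.full(len(y_test), y_train.mean())
baseline_mae = mean_absolute_error(y_test, baseline_pred)
```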

We attempted to forecast the temperature using autoregression (AR), which is a component of the ARIMA model and uses the series' lagged values as inputs. We attempted to find the best lag using a grid search.

The algorithm used for the grid search is as follows:

(1) Split the dataset into training and testing sets.
(2) Define the range of the hyperparameter to tune: the lag.
(3) Initialize a list to store BIC values.
(4) Loop over each lag:
a. Create a pipeline with autoregression.
b. Set the lag hyperparameter of the autoregression within the pipeline.
c. Fit the pipeline to the training data.
d. Predict temperature for both training and testing data.
e. Append a tuple of the BIC value and the corresponding lag to the BIC list.
(5) After iterating through all lags, sort the list of BIC tuples in ascending order based on BIC values, as lower BIC values indicate a better model fit.

The top 5 AR lags can be seen in Table 4.5.

Table 4.5 Top 5 AR Lag from Grid Search

Lag Value BIC

5 19331.2755

6 19332.7548

7 19335.7229

9 19340.5138

10 19341.3911

After fitting the training data and predicting the test data, the AR(5) model resulted in an MAE of 2.8885, which is higher than the baseline.

Fig 4.8 AR(5) Temperature Forecast Vs Actual

To validate the model better, walk-forward validation (WFV) is applied. The MAE improved from 2.8885 to 2.3232.

The improvement can be seen in Fig 4.9, which shows the AR(5) model with WFV applied.

Fig 4.9 AR(5) with WFV Temperature Forecast Vs Actual



To identify the order of the model, a grid search is used again; this time we use the AIC to select the best orders for the ARIMA and SARIMA models to see if they perform better. The results of the grid search can be seen in Table 4.6.

The general grid search algorithm attempted for ARIMA and SARIMA to find their best parameters is as follows:

(1) Define the range of values for the ARIMA model parameters: p (autoregressive terms), d (differencing order), and q (moving average terms); for SARIMA, we also add the seasonal parameters P, D, and Q.
(2) Initialize an empty list to store tuples of AIC (Akaike Information Criterion) values and their corresponding order parameters (p, d, q) and, where applicable, (P, D, Q).
(3) Iterate over each combination of values within the defined ranges:
a. Construct the order tuple (p, d, q) for the current iteration.
b. Attempt to fit an ARIMA model with the current order parameters to the training data.
c. If the model fitting is successful, retrieve the AIC and BIC values from the fitted model.
d. Append a tuple of the AIC value and the corresponding order parameters to the AIC list.
e. If an error occurs during model fitting (e.g., the model is not identifiable with the given parameters), skip the current combination and continue with the next iteration.
(4) After iterating through all combinations, sort the list of AIC tuples in ascending order based on AIC values, as lower AIC values indicate a better model fit.

The top 5 ARIMA orders can be seen in Table 4.6.

Table 4.6 Top 5 ARIMA Order from Grid Search

No ARIMA Order AIC

1 (3, 0, 3) 18332.2367

2 (4, 0, 2) 18332.7918

3 (5, 0, 5) 18339.0224

4 (5, 0, 4) 19296.2740

5 (3, 0, 2) 19297.0853

The result of fitting the training data to ARIMA(3,0,3) can be seen in Table 4.7.

Table 4.7 ARIMA (3,0,3) Results for Temperature training data

SARIMAX Results

Dep. Variable Main_temp

Model ARIMA (3, 0, 3)

AIC 18332.237

BIC 18386.689

[Link] 6678

Sample From 01-21-2023 To 10-27-2023

Ljung-Box (L1) (Q) 0.02

Prob(Q) 0.90

Prob (JB) 0.00

Heteroskedasticity (H) 0.60



We then forecast our test data using this fitted model, compared the forecast to the actual data, and illustrated the comparison using a line graph (Fig 4.10).

Fig 4.10 ARIMA (3,0,3) Forecast Vs Actual

This resulted in a Mean Absolute Error (MAE) of 2.8099. We noticed that our model's forecast flattened out after a few steps, as it moved far from the last observed data point. To improve this, we used walk-forward validation with a window size of 24 steps. This method, used to assess the performance of our time series model, involves forecasting 24 points ahead (one day's worth of hourly data) at a time and then including the actual data for those 24 hours in the training set for the next iteration. This approach, which closely mimics the real-world scenario of continually updating the model as new data becomes available, improved our ARIMA model's MAE from 2.8099 to 1.8834; the resulting plot can be seen in Fig 4.11.

Fig 4.11 ARIMA (3,0,3) Forecast Vs Actual after Walk-Forward Validation



Given that our data is hourly and exhibits seasonality according to the seasonal plot, we decided to use a Seasonal ARIMA (SARIMA) model. We used a grid search to find the best SARIMA(p, d, q)(P, D, Q)m model, similar to what we did for the ARIMA model. It is important to note that this grid search for SARIMA is computationally intensive and took over 72,000 seconds to identify the best model, even with parallel processing. We then selected the best model using the AIC score. The top 5 SARIMA orders are shown in Table 4.8.

Table 4.8 Top 5 SARIMA Order from Grid Search

No. ARIMA Order SARIMA Order AIC

1 (1, 0, 1) (2, 1, 2)24 16986.6623
2 (2, 0, 0) (2, 1, 2)24 16986.9364
3 (2, 0, 1) (2, 1, 2)24 16988.5437
4 (1, 0, 2) (2, 1, 2)24 16988.5895
5 (2, 0, 2) (2, 1, 2)24 16990.1672

The result of fitting the training data to SARIMA(1,0,1)(2,1,2)24 can be seen in Table 4.9.

Table 4.9 SARIMA(1,0,1)(2,1,2)24 Results for Temperature Training Data

SARIMAX RESULTS

Dep. Variable Main_temp

Model SARIMAX(1,0,1)x(2,1,[1,2],24)

AIC 16986.662

BIC 17034.283

[Link] 6678

Sample From 01-21-2023 To 10-26-2023

Ljung-Box(L1)(Q) 0.00

Prob(Q) 0.97

Prob(JB) 0.00

Heteroskedasticity(H) 0.75

We then forecasted our test data using our fitted SARIMA model and
compared the forecast to the actual data using a line graph (Figure 4.12). This resulted
in an MAE of 1.7735, showing an improvement compared to ARIMA.

Fig 4.12 SARIMA (1,0,1) (2,1,2)24 Temperature Forecast Vs Actual

After testing all three models with the test data for temperature, the mean
absolute error of each model is as shown in Table 4.10:

Table 4.10 Model Comparison

No. Model MAE Training Time Grid Search Time
1 SARIMA 1.7735 ≈872 s ≈72,000 s (with parallel processing)
2 ARIMA (after WFV) 1.8834 ≈6 s ≈188 s
3 AR(5) (after WFV) 2.3232 <1 s ≈20 s
4 Baseline 2.6651 <1 s -
5 ARIMA (before WFV) 2.8099 ≈6 s ≈188 s
6 AR(5) (before WFV) 2.8885 <1 s ≈20 s



Only the ARIMA model (after WFV), the AR model (after WFV), and the SARIMA model surpassed the baseline MAE. As the SARIMA model has the best score, we chose it as the model to deploy.

4.6 Model Building and Evaluation for Main Weather Feature Classification

After forecasting the temperature, the AdaBoost algorithm is used to predict the main weather feature, such as Rain, Clear, Clouds, Drizzle, and so on. Temperature, humidity, and cloudiness are used as predictors. The following figures show the distribution of the classes.

Fig 4.13 Distribution of Weather Main Features



Fig 4.14 Proportional Distribution of Weather Main Features

As shown in Fig 4.13 and Fig 4.14, the class distribution is imbalanced, with Clouds being the majority class at 74.53%. AdaBoost can perform well in scenarios with imbalanced data because it focuses on difficult-to-classify instances, improving the model's ability to generalize from the minority classes.

A grid search is attempted to find the optimal hyperparameters for our AdaBoost algorithm. The hyperparameters are the number of estimators and the learning rate. The algorithm used for the grid search of AdaBoost is as follows:

(1) Split the dataset into training and testing sets.


(2) Define the range of hyperparameters for tuning: n_estimators and
learning_rate.
(3) Initialize arrays to store accuracy scores for both training and testing data.
(4) Loop over each combination of n_estimators and learning_rate:
a. Create a pipeline with AdaBoostClassifier.
b. Set the hyperparameters of the AdaBoost classifier within the pipeline.
c. Fit the pipeline to the training data.
d. Predict labels for both training and testing data.
e. Calculate and store the accuracy for both training and testing
predictions.

(5) Plot the accuracy scores along with the corresponding hyperparameters for
both training and testing data.
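The steps above can be sketched with scikit-learn; the synthetic, imbalanced dataset stands in for the (temperature, humidity, cloudiness) features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic, imbalanced 3-feature dataset (majority class ~75%)
X, y = make_classification(n_samples=600, n_features=3, n_informative=3,
                           n_redundant=0, weights=[0.75], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = []
for n_estimators in (10, 50, 100):
    for learning_rate in (0.1, 0.5, 1.0):
        pipe = make_pipeline(
            StandardScaler(),
            AdaBoostClassifier(n_estimators=n_estimators,
                               learning_rate=learning_rate,
                               random_state=0))
        pipe.fit(X_tr, y_tr)
        acc = accuracy_score(y_te, pipe.predict(X_te))
        scores.append((acc, n_estimators, learning_rate))

best_acc, best_n, best_lr = max(scores)  # highest test accuracy wins
```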

The resulting plot can be seen in Fig 4.15.

Fig 4.15 Hyperparameter Tuning Results of Adaboost Classifier

We can see that the learning rate affects our accuracy the most. A learning rate of 1.0 and 10 estimators are chosen as the hyperparameters for the AdaBoost classifier, as they yield a high accuracy of 0.8492. The classification report after fitting the training data to the model and predicting the test data for validation can be seen in Table 4.11.

Table 4.11 Classification report

Precision Recall F1-Score Support
Clear 0.97 1.00 0.99 200
Clouds 0.83 1.00 0.91 1317
Drizzle 0.00 0.00 0.00 42
Fog 0.00 0.00 0.00 1
Haze 0.00 0.00 0.00 14
Mist 1.00 0.11 0.20 36
Rain 0.00 0.00 0.00 173
Thunderstorm 0.00 0.00 0.00 8
Weighted avg 0.74 0.85 0.78 1791



The weighted average is the average over classes weighted by support, i.e., the number of occurrences of each class in the test data set.

Confusion Matrix of the test data can be seen in Fig 4.16.

Fig 4.16 Confusion Matrix of Test Data

We can see that our model performs poorly at classifying certain classes, such as Thunderstorm and Fog.

Fig 4.17 is the precision-recall curve of the model weighted by support instances.

Fig 4.17 Weighted Precision-Recall Curve

The weighted Precision-Recall (PR) curve is a graphical representation of the trade-off between precision and recall at different thresholds. The Area Under the Curve (AUC) of 0.80 suggests that the model has a good balance of precision and recall across all classes. With a weighted F1-score of 0.78, which is the harmonic mean of precision and recall weighted by the number of instances per class, we consider that our model is performing well.
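Support-weighted metrics can be reproduced with scikit-learn's `average="weighted"` option; the label lists here are tiny illustrative stand-ins, not the thesis data.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# illustrative predictions: Rain is never predicted, mimicking a weak minority class
y_true = ["Clouds", "Clouds", "Clear", "Rain", "Clouds", "Clear"]
y_pred = ["Clouds", "Clouds", "Clear", "Clouds", "Clouds", "Clear"]

# 'weighted' averages each class's score by its support (instance count)
weighted_f1 = f1_score(y_true, y_pred, average="weighted")
weighted_precision = precision_score(y_true, y_pred, average="weighted",
                                     zero_division=0)
weighted_recall = recall_score(y_true, y_pred, average="weighted")
```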

4.7 Users Interface for Yangon Region’s weather forecast

Fig 4.18 shows the home page of the weather forecast for the Yangon Region. On this page, weather conditions for today and tomorrow are displayed. From this page, users can go to the “Overview”, “Hourly”, “Daily”, “Report an issue” and “Contact” pages. A line graph of the temperatures for the next 12 hours can also be seen.

Fig 4.18 Home page of The Yangon Region’s Weather Forecast

In Fig 4.19, an overview of today's weather, such as “Feels like”, “Humidity”, “Wind speed” and “UV Index”, is displayed. Users can go to the respective pages showing detailed weather information by clicking the dropdown icon beside each overview item.

Fig 4.19 Overview Page



Users can see detailed information on feels-like temperature, humidity, wind speed and UV index, as shown in Figs 4.20, 4.21, 4.22 and 4.23. These pages present detailed information with graphs, daily summaries, daily comparisons and unit changes.

Fig 4.20 Feels Like Page

Fig 4.21 Humidity Page



Fig 4.22 Wind Speed Page

Fig 4.23 UV Index Page

Fig 4.24 shows the page of hourly weather forecast for Yangon Region. It
shows hourly temperature, times and weather conditions.

Fig 4.24 Hourly Weather Forecast Page

Fig 4.25 shows the weather forecast information for the next three days and also allows users to report an issue.

Fig 4.25 Daily Weather Forecast Page

As shown in Fig 4.26, users can report the actual weather conditions in their areas. Users can select the weather conditions that actually occurred in their cities and also select where they live. This lets us learn the precise weather conditions and supports further development and maintenance.

Fig 4.26 Report An Issue Page

Users can contact our admin team from the contact page and can also send an email, as shown in Fig 4.27.

Fig 4.27 Contact Page



To accept users' reports and emails, we created an admin dashboard. The Admin Login Form can be seen in Fig 4.28.

Fig 4.28 Admin Login Form

In Figs 4.29 and 4.30, it can be seen that reports and contact messages from users are systematically stored.

Fig 4.29 User’s Report Page



Fig 4.30 Customer Contact Page
