Yangon 30-Day Weather Forecast in °C
CHAPTER I
INTRODUCTION
1.1 Introduction
The last few years have seen increased interest in machine learning with the advancement of natural language processing models such as GPT. In this study, we focus on weather forecasting, with a specific emphasis on temperature prediction in the Yangon Region using machine learning. Knowing weather conditions ahead of time helps us with our day-to-day decision making. Weather has a huge impact on our daily lives and plays a pivotal role in various sectors, including agriculture, energy, and public safety.
This interest in machine learning has grown further with the release of image generation models such as Midjourney, DALL-E, and Stable Diffusion. The exponentially increasing amount of data, together with growing computational power and storage that keeps getting larger and cheaper, has contributed to better machine learning models.
Additionally, the rise of cloud computing has made it easier for researchers and
developers to access powerful computing resources for training complex machine
learning models. As a result, the field of machine learning continues to evolve rapidly
with new breakthroughs and applications being developed.
This paper comprises six chapters. Chapter 1 serves as an introduction to the topic, providing an overview of the subject matter and outlining the expected results of the research. Chapter 2 offers a comprehensive treatment of time series analysis, covering foundational components, transformation techniques, and forecasting methods, along with visualizations that help interpret results. Chapter 3 deals with the methodology of the machine learning techniques. Chapter 4 covers the implementation of the machine learning methodology and statistical analysis, along with the deployment of our model in a web application. Chapter 5 is dedicated to the results of the research. Chapter 6 discusses our findings and possible future work.
CHAPTER II
LITERATURE REVIEW
A range of studies have explored the use of various models and techniques for
weather prediction and categorization. Mantri (2021) developed a hybrid model using
neural networks and k-nearest neighbors to predict temperature, humidity, and
weather conditions, achieving high accuracy. However, Thompson (1952) highlighted
the limitations of conventional categorical weather forecasts, suggesting the need for
more tailored and probabilistic approaches. Chauhan (2014) provided a
comprehensive review of data mining techniques for weather prediction, emphasizing
the potential for further advancements in this field. Lastly, Sapronova (2014)
demonstrated the effectiveness of categorization in improving the accuracy of an
artificial neural network-based model for wind speed and direction forecasting.
The use of real data in weather and climate models can be limited by its
representativeness, cost, and licensing restrictions (Meyer, 2020). This is particularly
relevant in developing countries, where the lack of accessible and reliable
CHAPTER III
RESEARCH METHODOLOGY
3.1.1 Machine Learning
Our study is conducted primarily in Jupyter Notebook and Visual Studio Code, using Python as the main language, and we utilize PHP for building the web application. We also use Git for version control.
Table 3.2 lists all the libraries, packages, and tools we used in our research. The datetime module is used for manipulating dates and times and performing calculations on them. JSON (JavaScript Object Notation) is a lightweight data interchange format; to import JSON files, we use Python's json module. Pandas is a library commonly used for data manipulation and statistical analysis. Pytz is a Python library for time zone calculations and conversions. Requests is used for making network requests in Python. We use the time module to pause between web requests so as not to hit the rate limit.
Table 3.3 Libraries used for Machine Learning and Statistical learning
Table 3.4 shows the tools utilized for visualization in our research. Matplotlib, Seaborn, and Plotly are all popular data visualization libraries in Python. Matplotlib is used for basic line, bar, and scatter plots. Plotly excels at creating interactive visuals and generating 3D plots. Seaborn is used for heatmaps and other advanced statistical plots (such as box plots and pair plots).
The flow chart visually illustrates the process involved in our research, from data collection to model deployment.
The expected timeline represents the time each step of our research will take and shows end users how much of the research has been completed.
The following limitations can impact our research:
Handling multicollinearity:
Weather data for the Yangon region was sourced from OpenWeather ([Link]), a comprehensive weather data provider offering real-time and historical weather information through an accessible API. OpenWeather provides a wide range of meteorological parameters, including temperature, atmospheric pressure, humidity, minimum temperature, maximum temperature, wind speed, wind degree, wind gust, cloudiness, and rain.
i. Accessing Open Weather API: Utilizing the Open Weather API, which
provides developers access to weather data for various geographic locations
worldwide, including the Yangon region.
ii. API Requests: Requests were made to the Open Weather API specifying the
geographical coordinates (latitude and longitude) of the Yangon region and the
desired time range for historical weather data retrieval.
iii. Data Retrieval: Historical weather data for Yangon was retrieved from Open
Weather in JSON format, containing detailed information for each specified
parameter at regular intervals (e.g., hourly or daily measurements).
We plan to utilize the following models and methods in the study to choose the best option for our research:
ARIMA: ARIMA is a model used to forecast time series data by combining past values, differencing, and moving averages. ARIMA has three components, namely Auto Regressive (AR), Integrated (I), and Moving Average (MA), with orders (p, d, q) respectively. The AR(p) component captures the linear relationship between the current observation and its past values, while the MA(q) component captures the linear relationship between the current observation and its past errors. The I(d) parameter represents the number of differences needed to make the series stationary.
Y_t = c + φ_1 Y_{t−1} + … + φ_p Y_{t−p} + θ_1 e_{t−1} + … + θ_q e_{t−q} + e_t

For the seasonal (SARIMA) extension with seasonal period m:

Y_t = c + φ_1 Y_{t−1} + … + φ_p Y_{t−p} + θ_1 e_{t−1} + … + θ_q e_{t−q} + Φ_1 Y_{t−m} + … + Φ_P Y_{t−Pm} + Θ_1 e_{t−m} + … + Θ_Q e_{t−Qm} + e_t

Where: Y_t is the observation at time t, c is a constant, φ_i and θ_j are the non-seasonal AR and MA coefficients, Φ_i and Θ_j are the seasonal AR and MA coefficients, m is the seasonal period, and e_t is the error term at time t.
Linear Regression: Linear regression, one of the most commonly used algorithms, models the relationship between a dependent variable Y and one or more independent variables X. Simple linear regression has only one independent variable, while multiple linear regression involves two or more independent variables.
Y = β_0 + β_1 X + e

Where: Y is the dependent variable, X is the independent variable, β_0 is the intercept, β_1 is the slope coefficient, and e is the error term.

For multiple linear regression:

Y = β_0 + β_1 X_1 + β_2 X_2 + … + β_n X_n + e

Where: X_1, …, X_n are the independent variables and β_1, …, β_n are their coefficients.
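As a sketch, the coefficients β_0 and β_1 can be estimated with scikit-learn's LinearRegression on toy data (not our weather data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated from Y = 2 + 3X plus small noise
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 1))
Y = 2.0 + 3.0 * X[:, 0] + rng.normal(0, 0.1, size=200)

# Fit recovers beta_0 (intercept_) and beta_1 (coef_) close to 2 and 3
reg = LinearRegression().fit(X, Y)
print(round(reg.intercept_, 1), round(reg.coef_[0], 1))
```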
(1) Initialize Weights: Assign equal weights to all training samples. Initially,
ω_i^(1) = 1/N, ∀ i = 1, 2, …, N
(2) Train Weak Learner: Train a weak learner h_t using the weighted training samples.
(3) Calculate Weighted Error: Compute the error of the weak learner,
ε_t = Σ_{i=1}^{N} ω_i^(t) I(y_i ≠ h_t(x_i))
where I(⋅) is the indicator function, y_i is the actual label, and h_t(x_i) is the predicted label.
(4) Compute Weak Learner Weight: Calculate the weight of the weak learner,
α_t = (1/2) ln((1 − ε_t) / ε_t)
(5) Update Sample Weights: Re-weight the training samples so that misclassified ones gain importance,
ω_i^(t+1) = ω_i^(t) exp(−α_t y_i h_t(x_i))
(6) Final Hypothesis: Combine the weak learners to form the final strong classifier,
H(x) = sign(Σ_{t=1}^{T} α_t h_t(x))
Explanation of Terms
Weak Learner: A simple model that performs slightly better than random guessing. Common examples include decision stumps (one-level decision trees).
Weighted Error ε_t: The proportion of incorrectly classified samples, weighted by their importance.
Learner Weight α_t: The contribution of each weak learner to the final model, calculated based on its accuracy.
Sample Weights ω_i: Weights assigned to each training sample, adjusted to emphasize harder-to-classify instances in subsequent iterations.
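As a minimal sketch of these update rules (using scikit-learn decision stumps as the weak learners and a toy one-dimensional dataset, not our weather data; labels must be in {−1, +1}):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=10):
    """Minimal AdaBoost following the numbered steps; y must be in {-1, +1}."""
    N = len(y)
    w = np.full(N, 1.0 / N)                      # (1) equal initial weights
    stumps, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)  # (2)
        pred = h.predict(X)
        eps = np.clip(np.sum(w * (pred != y)), 1e-10, 1 - 1e-10)  # (3) weighted error
        alpha = 0.5 * np.log((1 - eps) / eps)    # (4) learner weight
        w = w * np.exp(-alpha * y * pred)        # (5) re-weight samples
        w /= w.sum()
        stumps.append(h)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # (6) sign of the weighted vote of all weak learners
    votes = sum(a * h.predict(X) for h, a in zip(stumps, alphas))
    return np.sign(votes)

# Toy separable data
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])
stumps, alphas = adaboost_fit(X, y, T=5)
print((adaboost_predict(stumps, alphas, X) == y).all())  # True
```

In practice we use scikit-learn's AdaBoostClassifier, which implements the same scheme.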
(1) Cross Validation: A technique used to assess how the results of a statistical analysis will generalize to an independent data set.
(1) i K-Fold Cross-Validation: The data is divided into k subsets (folds). The model is trained k times, each time using k−1 folds for training and the remaining fold for validation.
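A minimal k-fold illustration with scikit-learn (toy data):

```python
import numpy as np
from sklearn.model_selection import KFold

# 10 samples split into 5 folds: each iteration trains on 8 and validates on 2
X = np.arange(20).reshape(10, 2)
kf = KFold(n_splits=5)
sizes = [(len(train_idx), len(test_idx)) for train_idx, test_idx in kf.split(X)]
print(sizes)  # [(8, 2), (8, 2), (8, 2), (8, 2), (8, 2)]
```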
Precision: The proportion of true positive predictions among all positive predictions,
Precision = True Positives / (True Positives + False Positives)
Recall: The proportion of true positive predictions among all actual positives,
Recall = True Positives / (True Positives + False Negatives)
F1 Score: The harmonic mean of precision and recall,
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
ROC-AUC Score: The area under the Receiver Operating Characteristic curve, indicating the model's ability to distinguish between classes.
Mean Absolute Error (MAE): The average absolute difference between predicted and actual values,
MAE = (1/n) Σ |Predicted Value − Actual Value|
Mean Squared Error (MSE): The average squared difference between predicted and actual values,
MSE = (1/n) Σ (Predicted Value − Actual Value)²
Root Mean Squared Error (RMSE): The square root of the average squared differences between predicted and actual values,
RMSE = √MSE
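These error metrics can be computed directly, for example (toy numbers):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

actual = np.array([20.0, 22.0, 25.0, 24.0])
pred = np.array([23.0, 26.0, 25.0, 24.0])  # absolute errors: 3, 4, 0, 0

mae = mean_absolute_error(actual, pred)   # (3 + 4 + 0 + 0) / 4 = 1.75
mse = mean_squared_error(actual, pred)    # (9 + 16 + 0 + 0) / 4 = 6.25
rmse = np.sqrt(mse)                       # sqrt(6.25) = 2.5
print(mae, mse, rmse)  # 1.75 6.25 2.5
```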
(7) Model Selection Criteria: Model selection criteria like AIC and BIC are used
to choose the best model from a set of candidate models. They incorporate
both the goodness of fit (how well the model explains the data) and the
complexity of the model (to penalize overfitting).
(7) i AIC (Akaike Information Criterion): Balances model fit with the number of parameters. Lower AIC values indicate a better model.
Formula:
AIC = −2 ln(L̂) + 2k
L̂: the maximum value of the likelihood function of the model.
k: the number of parameters in the model.
(7) ii BIC (Bayesian Information Criterion): Similar to AIC but imposes a stronger penalty for models with more parameters, especially as the sample size increases. Lower BIC values indicate a better model.
Formula:
BIC = −2 ln(L̂) + k ln(n)
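As a worked example of the two formulas (the log-likelihood value is made up purely for illustration):

```python
import numpy as np

def aic(log_likelihood, k):
    # AIC = -2 ln(L) + 2k
    return -2 * log_likelihood + 2 * k

def bic(log_likelihood, k, n):
    # BIC = -2 ln(L) + k ln(n): the penalty grows with sample size n
    return -2 * log_likelihood + k * np.log(n)

# With ln(L) = -100, k = 3 parameters, n = 1000 observations:
print(aic(-100.0, 3))        # 206.0
print(bic(-100.0, 3, 1000))  # 200 + 3*ln(1000) ≈ 220.72
```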
CHAPTER IV
IMPLEMENTATION
The data we collected is hourly data ranging from January 22, 2023 to January 27, 2024. We collected the data using a Python script that requests data from the OpenWeather API and saves it as JSON files, since the responses come in JSON format. As the history API has a limit of 169 records per call, we iterated our calls in steps of 604,800 s (one week, i.e. 168 hours, which covers 169 hourly records with both endpoints inclusive) until we received the desired data. The data includes time at 1-hour intervals, temperature, feels-like temperature, atmospheric pressure, humidity, minimum temperature, maximum temperature, wind speed, wind degree, wind gust, cloudiness, weather description, and rain.
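A sketch of the windowing arithmetic behind the retrieval loop (the actual request code, endpoint URL, and API key are omitted; each yielded window corresponds to one history-API call, with a time.sleep between calls to respect the rate limit):

```python
WEEK = 604800  # seconds in one week (168 hours)

def weekly_windows(start_ts, end_ts, step=WEEK):
    """Yield (window_start, window_end) UNIX-timestamp pairs covering the range."""
    t = start_ts
    while t < end_ts:
        yield t, min(t + step, end_ts)
        t += step

# Example: 2023-01-22 00:00 UTC onward, four weeks of hourly history
start = 1674345600  # 2023-01-22 00:00:00 UTC
windows = list(weekly_windows(start, start + 4 * WEEK))
print(len(windows))  # 4
```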
The collected timestamps were initially in UNIX format and in the UTC time zone. We converted them to datetime format (YYYY-MM-DD hh:mm:ss+06:30), where +06:30 indicates the time zone. Since rain is measured each hour whenever data is available and arrives as nested JSON, we extracted these values into a new column called 1h_rain. Upon analyzing the data, we checked for duplicate timestamps and found 51 duplicate data points, which we removed from the dataset. We also inspected the data for null values and found that wind_gust and rain contain nulls, as they are only recorded when data is available. We removed weather_id and weather_icon, as they will not be used in our model, and also removed wind_gust and main_feels_like, as we will not consider them in the model. After that, we set dt as our index to start modelling. With dt as the index, we can easily perform operations like resampling and time-based indexing, which are crucial in time series analysis. Table 4.1 shows the data before removal of duplicates, with a RangeIndex of 8955 entries.
After preprocessing and setting the datetime as the index, dt became the index with 8904 entries, ranging from 2023-01-22 [Link]+06:30 to 2024-01-27 [Link]+06:30. Table 4.2 shows the result of this process.
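The preprocessing steps above can be sketched in pandas on a toy frame (the column names main_temp and rain are illustrative stand-ins for the raw fields):

```python
import pandas as pd

# Toy frame mimicking raw OpenWeather rows: UNIX seconds, one duplicate, nested rain
raw = pd.DataFrame({
    "dt": [1674345600, 1674349200, 1674349200, 1674352800],
    "main_temp": [22.1, 23.4, 23.4, 24.0],
    "rain": [None, {"1h": 0.5}, {"1h": 0.5}, None],
})

# UNIX -> timezone-aware datetime in Myanmar time (+06:30)
raw["dt"] = pd.to_datetime(raw["dt"], unit="s", utc=True).dt.tz_convert("Asia/Yangon")
# Extract the nested hourly rain value into its own 1h_rain column
raw["1h_rain"] = raw["rain"].apply(lambda r: r["1h"] if isinstance(r, dict) else None)
# Drop duplicate timestamps and set dt as the index for time series work
clean = raw.drop_duplicates(subset="dt").drop(columns="rain").set_index("dt")
print(len(clean))  # 3
```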
Table 4.2 Info of data after preprocessing and setting datetime as index
The minimum and maximum values are 17.57°C and 40.98°C respectively, with a standard deviation of 3.4197.
Fig 4.3 Boxplot of Yangon Temperature (January 22, 2023 to January 27,2024)
In the Fig 4.3 box plot, some values exceed the upper and lower whiskers. To check whether these are truly outliers, we examined the times at which those temperatures occurred and found that the lowest and highest temperatures were recorded on 2024-01-22 and 2023-04-26 respectively. As those dates fall in winter and summer, we concluded that they are not outliers and should be kept unmodified, as they are reasonable values.
Fig 4.4 Line Chart of Yangon Temperature in Daily Average (January 22, 2023
to January 27,2024)
In Fig 4.4, we can see how the temperature moves from day to day, and we can assume that it is stationary. To make sure, we will examine the moving average, which removes short-term fluctuations and highlights longer-term trends or cycles.
As observed in Figures 4.4 and 4.5, our data does not exhibit a significant trend. However, to confirm its stationarity, we employ the Augmented Dickey-Fuller (ADF) test. This test examines the null hypothesis that a unit root is present, implying non-stationarity. The alternative hypothesis posits the absence of a unit root, indicating that our data is stationary.
The results of the ADF test can be seen in Table 4.4. As shown in the table, the p-value of 0.000028 allows us to reject the null hypothesis of a unit root, thereby confirming the stationarity of our data.
Fig 4.6 presents the seasonal decomposition of our temperature data, breaking
it down into Trend, Seasonal, and Residual components. The 'Observed' plot
represents our original temperature data. The 'Trend' plot confirms the absence of a
significant trend, while the 'Seasonal' plot clearly illustrates the presence of
seasonality in our data. The 'Residual' plot, which represents the remaining data after
the removal of trend and seasonal components, does not exhibit any discernible
pattern. This lack of pattern in the residuals assures us that our assumptions have not
overlooked any significant aspects of the data.
Fig 4.7 shows the ACF and PACF of Yangon temperature (January 22, 2023 to January 27, 2024). The ACF plot shows significant spikes at regular intervals, suggesting seasonality in our data and further confirming the presence of a seasonal component in the dataset.
In this step, we aim to identify a suitable model for our data. We begin by splitting the dataset into a 75% training set (spanning 2023-01-22 to 2023-10-26) and a 25% testing set (2023-10-27 to 2024-01-27), using scikit-learn's train_test_split function with shuffle=False to preserve the time order. We then establish a baseline for our model using the mean, which results in a Mean Absolute Error (MAE) of 2.6651. Our goal is to develop a model that outperforms this baseline.
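The split-and-baseline step can be sketched as follows (the toy series stands in for our temperature data; the 75/25 split mirrors the text):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Toy "temperature" series; shuffle=False keeps the time order intact
temps = np.linspace(20, 30, 100)
train, test = train_test_split(temps, test_size=0.25, shuffle=False)

# Baseline: predict the training mean for every test point
baseline_pred = np.full(len(test), train.mean())
baseline_mae = mean_absolute_error(test, baseline_pred)
print(len(train), len(test))  # 75 25
```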
Lag order (p)   AIC
5               19331.2755
6               19332.7548
7               19335.7229
9               19340.5138
10              19341.3911
After fitting the training data and predicting the test data, the AR(5) model resulted in an MAE of 2.889, which is higher than the baseline.
To validate the model better, walk-forward validation (WFV) is applied. The MAE improved from 2.889 to 2.3232.
The improvement can be seen in Fig 4.9, which shows the AR(5) model with WFV applied.
To identify the order of the model, a grid search is used again, and this time we use the AIC to select the best orders for the ARIMA and SARIMA models to see if they perform better. The results of the grid search can be seen in Table 4.6.
(1) Define the range of values for the ARIMA model parameters: p (autoregressive terms), d (differencing order), and q (moving average terms); for SARIMAX, we also add the seasonal parameters P, D, and Q.
(2) Initialize an empty list to store tuples of AIC (Akaike Information Criterion) values and their corresponding order parameters (p, d, q) and (P, D, Q) if they exist; in the case of the AR model, we calculate the MAE and store it in the empty list.
(3) Iterate over each combination of values within the defined ranges:
a. Construct the order tuple (p, d, q) for the current iteration.
b. Attempt to fit an ARIMA model with the current order parameters to the training data.
c. If the model fitting is successful, retrieve the AIC and BIC values from the fitted model.
d. Append a tuple of the AIC value and the corresponding order parameters to the AIC list.
e. If an error occurs during model fitting (e.g., the model is not identifiable with the given parameters), skip the current combination of parameters and continue with the next iteration.
(4) After iterating through all combinations, sort the list of AIC tuples in ascending order based on AIC values, as lower AIC values indicate a better model fit.
Rank   Order (p, d, q)   AIC
1      (3, 0, 3)         18332.2367
2      (4, 0, 2)         18332.7918
3      (5, 0, 5)         18339.0224
4      (5, 0, 4)         19296.2740
5      (3, 0, 2)         19297.0853
The result of fitting the training data to ARIMA(3,0,3) can be seen in Table 4.7.
SARIMAX Results
AIC 18332.237
BIC 18386.689
[Link] 6678
Prob(Q) 0.90
We then forecasted our test data using this fitted model, compared the forecast to the actual data, and illustrated the comparison with a line graph (Fig 4.10).
This resulted in a Mean Absolute Error (MAE) of 2.8099. We noticed that the model's forecasts flattened out after a few steps, drifting far from the actual data. To improve this, we used walk-forward validation with a window size of 24 steps. This method, used to assess the performance of our time series model, involves forecasting 24 points ahead (one day's worth of hourly data) at a time and then including the actual data for those 24 hours in the training set for the next iteration. This approach, which closely mimics a real-world scenario of continually updating the model as new data becomes available, improved our ARIMA model's MAE from 2.8099 to 1.8834; the resulting plot can be seen in Fig 4.11.
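The walk-forward mechanics can be sketched as follows; for brevity this uses a naive last-value forecaster in place of refitting the ARIMA model at each step, so only the windowing-and-refold logic mirrors our procedure:

```python
import numpy as np

def walk_forward_mae(series, train_size, window=24):
    """Forecast `window` steps at a time, then fold the actuals back into history."""
    history = list(series[:train_size])
    errors = []
    i = train_size
    while i < len(series):
        step = min(window, len(series) - i)
        forecast = [history[-1]] * step          # naive stand-in for model.forecast(step)
        actual = series[i:i + step]
        errors.extend(abs(f - a) for f, a in zip(forecast, actual))
        history.extend(actual)                   # include the observed 24 hours
        i += step
    return float(np.mean(errors))

# Toy daily-cycle series: 240 hourly points, 180 for training
series = 27 + 5 * np.sin(np.linspace(0, 20, 240))
mae = walk_forward_mae(series, train_size=180, window=24)
print(0 <= mae <= 10)  # True: errors are bounded by the series amplitude
```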
Given that our data is hourly and exhibits seasonality according to the seasonal plot, we decided to use a Seasonal ARIMA (SARIMA) model. We used a grid search to find the best SARIMA(p, d, q)(P, D, Q)m model, similar to what we did for the ARIMA model. It is important to note that this grid search for SARIMA is computationally intensive and took over 72,000 seconds to identify the best model, even with parallel processing. We then selected the best model using the AIC score. The top 5 SARIMA orders are shown in Table 4.8.
Table 4.9 shows the result of fitting the training data to SARIMA(1,0,1)(2,1,2)24.
SARIMAX Results
Model                    SARIMAX(1,0,1)x(2,1,[1,2],24)
AIC                      16986.662
BIC                      17034.283
[Link]                   6678
Ljung-Box (L1) (Q)       0.00
Prob(Q)                  0.97
Prob(JB)                 0.00
Heteroskedasticity (H)   0.75
We then forecasted our test data using our fitted SARIMA model and
compared the forecast to the actual data using a line graph (Figure 4.12). This resulted
in an MAE of 1.7735, showing an improvement compared to ARIMA.
After testing all three models with the test data for temperature, the mean absolute error of each model is shown in Table 4.10.
Only the ARIMA model (after WFV), the AR model (after WFV), and the SARIMA model surpassed the baseline MAE. As the SARIMA model has the best score, we chose it as the model to deploy.
As shown in Fig 4.13 and Fig 4.14, the class distribution is imbalanced, with Clouds being the majority class at 74.53%. AdaBoost performs well in scenarios with imbalanced data as it focuses more on difficult-to-classify instances, improving the model's ability to generalize from the minority classes.
A grid search was used to find the optimal hyperparameters for our AdaBoost algorithm. The hyperparameters include the number of estimators and the learning rate. The algorithm used for the grid search of AdaBoost is as follows:
(5) Plot the accuracy scores along with the corresponding hyperparameters for
both training and testing data.
We can see that the learning rate affects our accuracy the most. A learning rate of 1.0 and 10 estimators were chosen as the hyperparameters for the AdaBoost classifier, as they give a high accuracy of 0.8492. The classification report after fitting the training data to the model and predicting the test data for validation can be seen in Table 4.11.
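A sketch of such a grid search using scikit-learn's GridSearchCV (toy imbalanced data standing in for the weather classes; the parameter values shown are illustrative, overlapping those we searched):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# Toy imbalanced binary data: positives are the minority class
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.3 * rng.normal(size=300) > 0.8).astype(int)

# Search over number of estimators and learning rate with 3-fold CV
param_grid = {"n_estimators": [10, 50], "learning_rate": [0.5, 1.0]}
search = GridSearchCV(AdaBoostClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(sorted(search.best_params_))  # ['learning_rate', 'n_estimators']
```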
Class          Precision   Recall   F1-score   Support
Fog            0           0        0          1
Haze           0           0        0          14
Rain           0           0        0          173
Thunderstorm   0           0        0          8
The weighted average refers to the average across classes, weighted by the number of occurrences of each class in the test data set (the support).
We can see that our model performs poorly at classifying certain classes such as Thunderstorm and Fog.
Fig 4.17 is the precision-recall curve of the model weighted by support instances.
Fig 4.18 shows the home page of the weather forecast for the Yangon Region. On this page, weather conditions for today and tomorrow are displayed. From this page, users can go to the “Overview”, “Hourly”, “Daily”, “Report an issue” and “Contact” pages. A line graph of the temperatures for the next 12 hours can also be seen.
Users can see detailed information on feels-like temperature, humidity, wind speed, and UV index, as shown in Fig 4.20, 4.21, 4.22 and 4.23. These pages show detailed information with graphs, daily summaries, daily comparisons, and unit changes.
Fig 4.24 shows the hourly weather forecast page for the Yangon Region. It shows hourly temperatures, times, and weather conditions.
Fig 4.25 shows the weather forecast information for the next three days and also allows users to report an issue.
As shown in Fig 4.26, users can report the actual weather conditions in their areas. Users can select the weather conditions that actually occur in their cities and can also select where they live. This lets us know the precise weather conditions and continue further development and maintenance.
Users can contact our admin team from the contact page and can also send an email, as seen in Fig 4.27.
To accept users' reports and emails, we created an admin dashboard. The admin login form can be seen in Fig 4.28.
In Fig 4.29 and 4.30, it can be seen that reports and contact messages from users are systematically stored.