Rain Prediction Based On Machine Learning: Ye Zhao, Hanqi Shi, Yifei Ma, Mengyan He, Haotian Deng, Zhou Tong
Proceedings of the 2022 8th International Conference on Humanities and Social Science Research (ICHSSR 2022)
ABSTRACT
Our purpose is to use machine learning algorithms to predict the next day's weather, since whether it will rain tomorrow is a very important indicator of the weather. To find the attributes most predictive of rain, the researchers used line charts, matrix graphs, and scatter plots for visualization and analysis, and found that several pairs of attributes have a high degree of similarity and correlation. In the fitting stage, the researchers used simple models such as KNN, decision tree, and ridge regression to evaluate the basic prediction quality and found that the accuracy rate is around 0.78. Since the visualization stage showed that samples that rained today have a slightly higher probability of rain the next day, the researchers tried to use LSTM to analyze the impact of historical weather and found that the relationship is not strong. Finally, logistic regression turns out to have the highest accuracy of 0.85, followed by AdaBoost with an accuracy of 0.82. Whether it will rain remains unpredictable to some extent.
to predict weather changes, a practical problem that people care about in daily life. The researchers also want to figure out which attribute has the closest relationship with whether it will rain the next day, using some basic visualization skills.

2. BACKGROUND

In the late 1930s, during World War II, the British made the innovative discovery that radar could not only be used to monitor enemy aircraft, but could also receive echoes from raindrops of certain specific wavelengths (5-10 cm) [2]. This technique can be used to track and study individual showers and observe the precipitation structure of larger storms. This was the early use of scientific and technological means by modern scientists to predict the possibility of future rainfall, but it is clear that this can only predict rainfall over a short period of time. With the rapid development of science and technology, scientists began to use geophysical knowledge and large amounts of research data to make judgments. The first choice is to use machine learning in artificial intelligence. In January 2020, Google engineer Jason Hickey introduced an application of machine learning to weather radar charts. The idea is to convert the radar chart into a "computer vision" problem that machines are good at. He tried to use a large amount of data to drive machines to learn physical principles from algorithms, and later used U-Net in convolutional neural networks (CNNs). In comparison, his machine learning algorithm performed better in ultra-short-term forecasting than the three traditional approaches: HRRR numerical forecasting, the optical flow method, and the persistent model [3]. When the experiment uses decision trees to predict rain in Australia, some data processing methods and models are needed. The actual techniques are described below.

2.1. Initial Stage

The z-score method standardizes the dataset. Before data analysis, the researchers usually need to standardize the data first and then work with the standardized data. The z-score standard rescales each indicator of the original data using its mean and standard deviation, so that every column is centered around 0 with a standard deviation of 1.

2.2. Encoding

In many machine learning tasks, features are not always continuous values, but may be categorical values. For each such feature, if it has m possible values, it becomes m binary features after one-hot encoding; these features do not affect each other, and only one is active at a time. In this way, the data become distinct categories. The main benefits of encoding the data this way are: 1. solving the problem that classifiers handle categorical attributes badly; 2. to a certain extent, expanding the role of the features.

2.3. Balance the data

Class imbalance refers to the situation where the numbers of training samples of different classes in a classification task differ greatly. There are three commonly used remedies, namely 1. undersampling, 2. oversampling, and 3. threshold moving. This prediction task is affected mainly for the following reason: the model must see enough counter-examples to learn from them. Following this is visualization, which processes the data so that it is observable and easy to inspect. Once a clean data set is obtained after analysis, the next step is Exploratory Data Analysis (EDA). EDA can help discover the data, and can also be used to find patterns, relationships, or anomalies to guide the analysis. One of the effective starting tools is the scatter plot matrix. The scatter plot matrix shows both the separate distributions of the variables and the relationships between them: the histograms on the diagonal show the distribution of each variable, while the scatter plots off the diagonal show the relationship between each pair. Another important tool is the line chart, the visualization form of many visualization tools. By drawing the data as a line, the magnitude and trend of a series can be grasped intuitively, especially on occasions where the trend is more important.

2.4. Models Applied

Logistic Regression is a machine learning method used to solve classification problems and to estimate the probability of an event. Logistic Regression and Linear Regression are both generalized linear models. The basic procedure is to first write the maximum likelihood function and take its logarithm, then use the gradient descent method to find the minimum of the cost function. The KNN classifier, as a relatively easy-to-understand classification algorithm in supervised learning, is often used in various classification tasks. The core idea of the KNN model is very simple: it calculates the Euclidean distance between each sample point of the test set and each sample in the training set, then takes the K points with the smallest Euclidean distance (K is the number of neighbors, set manually, and the choice of K affects the results of the algorithm), counts the class frequencies of those K training samples, and assigns the most frequent class as the predicted class of the test sample [4]. LSTM and the ensemble algorithms boosting and bagging are also used.
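The preprocessing steps of sections 2.1 and 2.2 can be sketched as follows. This is a minimal illustration assuming pandas; the toy frame below is invented and only mimics columns from the Kaggle "Rain in Australia" dataset, it is not the paper's actual data.

```python
import pandas as pd

# Hypothetical toy frame standing in for the weather data (illustrative only).
df = pd.DataFrame({
    "MinTemp": [13.4, 7.4, 12.9, 9.2],
    "MaxTemp": [22.9, 25.1, 25.7, 28.0],
    "WindGustDir": ["W", "WNW", "WSW", "NE"],
})

# 2.1 z-score standardization: subtract each column's mean and divide by its
# standard deviation, so every numeric column ends up with mean 0 and std 1.
numeric = ["MinTemp", "MaxTemp"]
df[numeric] = (df[numeric] - df[numeric].mean()) / df[numeric].std()

# 2.2 one-hot encoding: a categorical feature with m values becomes m binary
# columns, exactly one of which is active per row.
df = pd.get_dummies(df, columns=["WindGustDir"])

print(df.filter(like="WindGustDir").sum(axis=1).tolist())  # one active dummy per row
```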
Advances in Social Science, Education and Humanities Research, volume 664
Group A includes MinTemp, MaxTemp, Temp9am, Temp3pm.

Group B includes Sunshine, WindGustDir, WindDir9am, WindDir3pm.

Group C includes Rainfall, Evaporation, Humidity9am, Humidity3pm.

Group D includes WindGustSpeed, WindSpeed9am, WindSpeed3pm.

Group E includes Pressure9am, Pressure3pm.

Group F includes Cloud9am, Cloud3pm, RainToday.

Lastly, there was the problem of unbalanced data. There are many more cases of no rain tomorrow than cases of rain tomorrow. To make the data more balanced, the researchers drew 50 random samples without replacement from each class. This may have caused some inaccuracy, because the data used for the graphs is only part of the original data. However, since they are random samples, the researchers believe they can still represent the whole dataset to some degree. After graphing each group of features, the resulting graphs are as follows (Fig 2).
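The per-class sampling step described above can be written in a few lines. A minimal sketch assuming pandas; the imbalanced toy frame is invented for illustration.

```python
import pandas as pd

# Hypothetical imbalanced frame: "RainTomorrow" is the class label.
df = pd.DataFrame({
    "Humidity3pm": range(1000),
    "RainTomorrow": ["No"] * 780 + ["Yes"] * 220,  # far more "No" than "Yes"
})

# Draw 50 random samples without replacement from each class, as described.
balanced = df.groupby("RainTomorrow").sample(n=50, replace=False, random_state=0)

print(balanced["RainTomorrow"].value_counts().to_dict())
```

Because the draws are random, repeated runs with different seeds give different subsets, which is why the text notes the plotted subset only approximately represents the whole dataset.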
Fig. 2: This group of line charts shows the distributions of the features, grouped by tomorrow's weather.
Fig 4: Next comes the splitter, the strategy used to choose the split at each node. 'Best' chooses the best split, and 'Random' chooses randomly among the best splits. Using 'best' is more stable and its overall performance is better than 'random', so the researchers decided to use 'best' as the splitter. The remaining variation is presumably due to the random_state.

Fig. 5: Then the researchers decided the max depth of the tree. If a tree is too deep, it easily overfits the training set, so a suitable depth must be found as a compromise between overfitting and underfitting. Eight turned out to be the best depth.
Fig. 6: Next are min_samples_leaf and min_samples_split. min_samples_leaf is the minimum number of samples required at a leaf node. It is also used to prevent overfitting: a node that contains too few points may even be an outlier, and using it for prediction is useless, while a node with many points may still have the potential to split more precisely. min_samples_split is the minimum number of samples required to split an internal node. It is similar to min_samples_leaf: if a node contains fewer samples than required, it becomes a leaf.

Fig. 7: Last is max_features, the number of features to consider when looking for the best split. It limits the number of features considered when branching, and features beyond the limit are discarded. This is a crude way to reduce dimensionality: without knowing the importance of each feature in the decision tree, forcibly setting this parameter may cause the model to learn insufficiently. The chart shows that with fewer than 20 features the overall performance is worse, yet a lower max_features, such as 16, can even lead to a lower error rate. This suggests that many attributes are not useful and may even distort the prediction, supporting our inference that only a few attributes are needed to predict whether it will rain tomorrow.
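The hyperparameter choices discussed in Figs. 4-7 can be collected into a single estimator. A minimal sketch assuming scikit-learn; synthetic data stands in for the weather features, since the real dataset is not bundled here, so the printed accuracy will not match the paper's.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the weather features (illustrative only).
X, y = make_classification(n_samples=2000, n_features=22, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Parameter choices mirroring the text: splitter='best', max_depth=8,
# leaf/split minimums against overfitting, and max_features=16.
tree = DecisionTreeClassifier(
    splitter="best",
    max_depth=8,
    min_samples_leaf=5,
    min_samples_split=10,
    max_features=16,
    random_state=0,
)
tree.fit(X_tr, y_tr)
print(round(tree.score(X_te, y_te), 3))
```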
Fig. 8: Here is the overall performance of the decision tree. Its accuracy is approximately 0.78757. The confusion matrix is as follows: 0 means no rain tomorrow, and 1 means it will rain tomorrow.

Fig. 9: In the next experiment, the researchers compare the parameter combinations l1 penalty with c = 0.1, l2 penalty with c = 0.01, and no penalty. The result, presented in the figure, again shows the intriguing feature of higher accuracy on the validation dataset. The graph indicates that the l2 penalty with c = 0.01 is the most accurate and no penalty is the least accurate. It can further be inferred that, because ridge regression works better than lasso regression and regression with no regularization, all features are used to determine tomorrow's weather, yet some are less important than others. Because c is relatively small, there is a large discrepancy between the contributions of the important attributes and of the unimportant attributes in this classification.
Fig. 10: The confusion matrix for the l2 penalty with c = 0.01, our optimal parameters, shows that the model is much more accurate at predicting non-raining circumstances than raining circumstances. One possible explanation is that, because the training data contains an imbalanced number of raining and non-raining cases, the model learns to classify non-raining circumstances better.
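The penalty comparison of Figs. 9-10 can be sketched as below, assuming scikit-learn and synthetic stand-in data. One caveat: the "no penalty" case is approximated here with a very large C (weak l2 regularization) so the sketch runs on any scikit-learn version; the accuracies will not match the paper's.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=22, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# The three parameter combinations compared in the text.
candidates = {
    "l1, C=0.1":  LogisticRegression(penalty="l1", C=0.1, solver="liblinear"),
    "l2, C=0.01": LogisticRegression(penalty="l2", C=0.01),
    "no penalty": LogisticRegression(C=1e6),  # huge C ~ effectively unregularized
}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    print(name, round(model.score(X_te, y_te), 3))

# Confusion matrix for the chosen model (rows: true class, columns: predicted).
best = candidates["l2, C=0.01"]
print(confusion_matrix(y_te, best.predict(X_te)))
```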
5.2. LSTM

The dataset is organized not only by location in Australia but also by date. This inspires us to postulate that weather predictions may depend on the previous week's weather: there can exist a subtle pattern of rain over a period of time. To test this hypothesis, the researchers decided to use Long Short-Term Memory (LSTM), a type of recurrent neural network introduced by Sepp Hochreiter in 1997 [6].

The unprocessed dataset, though containing much more data, is not perfect: there are void numerical and categorical values. For void numerical values, the researchers fill them with the mean of their column. For categorical variables, the researchers use logistic regression to make predictions and fill the voids with the results. Then, the researchers designed three models to test the hypothesis. Model 1.1 uses data from the last seven days, without data about whether it rained on each day, to predict the rain status for the next day. Model 1.2 uses data from the last fourteen days, again without data about whether it rained on each day, to predict the rain status for the next day. Interestingly, in both models the prediction is not as accurate as the researchers had imagined. In fact, the two LSTM models perform similarly to the ridge regression model. It is nonetheless pleasing to see that in Model 1.2 the prediction for no rain tomorrow is slightly more accurate than both the logistic regression and LSTM Model 1.1. Although a 2% increase may be negligible, it still implies that perhaps rain displays a pattern over longer periods of time, but past data does not drastically improve the model.
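The sequence inputs for Model 1.1 (seven days of features, next day's rain as target) can be assembled with a sliding window. A minimal NumPy sketch with invented random data; shapes, not values, are the point, and a real LSTM would then consume arrays of exactly this (samples, timesteps, features) shape.

```python
import numpy as np

# Hypothetical daily data: 100 days x 5 environmental features, plus a rain
# label per day (randomly generated here, purely illustrative).
days, n_feat, window = 100, 5, 7
features = np.random.default_rng(0).normal(size=(days, n_feat))
rain = np.random.default_rng(1).integers(0, 2, size=days)

# Model 1.1-style samples: each input is the previous 7 days of features,
# and the target is the rain status of the following day.
X = np.stack([features[i : i + window] for i in range(days - window)])
y = rain[window:]

print(X.shape, y.shape)  # (93, 7, 5) (93,)
```

Model 1.2 is the same construction with window = 14, and Model 2 would append the daily rain label as an extra feature column.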
Fig 11: In the line chart and confusion matrix, the researchers noticed that RainToday differs considerably between the cases where it rains tomorrow and where it does not. To test whether the prediction is directly related to this feature, the researchers excluded it from LSTM Model 1.1 and Model 1.2. LSTM Model 2, however, includes this feature: it predicts the rain status of the next day based on environmental data from the last seven days, including whether it rained on each day. Again, the result is not what the researchers had pictured, and the accuracy is almost the same as Model 1.1. Whether it rained is not as important as the researchers had assumed; the model is just as likely to predict rain correctly from environmental data alone. Perhaps, even though the alternation of rainy days and clear days is visible, it is just a visual presentation of rain, whereas other subtle environmental factors can also indicate the likelihood of rain.

Fig 12: The LSTM model, whether using seven or fourteen days of past data, is not much different from the logistic regression model. The implication is that whether it will rain does not depend on whether it rained on the past days, but rather can be predicted from just the previous day's data. However, because the initial data is incomplete, the researchers have filled void data with statistically derived values rather than recorded ones. This can compromise the accuracy of the whole dataset, and thus lower the accuracy. Moreover, knowing whether it rained on each day does not help the model improve, suggesting that what we see as rain is just a direct representation of what has happened in the environment, for instance the change in humidity and temperature.
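The void-filling strategy mentioned above (column means for numeric voids, a fitted classifier for categorical voids) can be sketched as follows. A minimal illustration assuming pandas and scikit-learn; the tiny frame and its values are invented.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical frame with missing values, mimicking the raw weather data.
df = pd.DataFrame({
    "Humidity3pm": [55.0, np.nan, 60.0, 58.0, np.nan, 62.0],
    "Pressure3pm": [1010.0, 1008.0, np.nan, 1012.0, 1009.0, 1011.0],
    "RainToday":   ["No", "Yes", None, "No", "Yes", None],
})

# Numeric voids: fill with the column mean, as the text describes.
for col in ["Humidity3pm", "Pressure3pm"]:
    df[col] = df[col].fillna(df[col].mean())

# Categorical voids: train a classifier on the complete rows and predict
# labels for the missing ones (the paper uses logistic regression).
known = df["RainToday"].notna()
clf = LogisticRegression().fit(
    df.loc[known, ["Humidity3pm", "Pressure3pm"]],
    df.loc[known, "RainToday"],
)
df.loc[~known, "RainToday"] = clf.predict(df.loc[~known, ["Humidity3pm", "Pressure3pm"]])

print(df["RainToday"].isna().sum())  # no voids remain
```

As the caption notes, these are statistically derived values rather than recorded observations, which can itself lower downstream accuracy.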
Fig. 13: The x axis represents the number of iterations and the y axis the accuracy of the resulting models. The score rises with some fluctuation, peaks at 750 iterations, and after that the model seems to start overfitting. In conclusion, 750 might be the suitable number.

The learning rate also plays an indispensable role in AdaBoost methods. Its influence is shown in the following line charts.

[Link] Algorithm

Bagging algorithms are methods that do not adaptively change the weights of the weak classifiers. Bagging, meaning bootstrap aggregating, is one of the simplest and earliest ensemble methods, with excellent performance. The difference between bagging and boosting algorithms mainly lies in the sampling scheme. Bagging methods split the training dataset by sampling, apply each sample to a base model, and then synthesize the predicted results of all base models to obtain the final prediction [7]. Bagging methods simply merge the decisions of the simple learners by voting, and these trees are expected to be similar and to produce the same classification for each test instance (Subasi and Qaisar). The main advantage of bagging is that it balances the instability of the base models: in each stage, the method alters the training dataset by omitting certain instances and repeating others.

The researchers use bagging algorithms to improve the accuracy scores of the decision tree.
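The contrast drawn above (AdaBoost re-weights samples each round, bagging does not) can be sketched side by side. A minimal illustration assuming scikit-learn and synthetic data; 100 boosting rounds are used instead of the paper's 750 only to keep the sketch fast, and the printed scores will not match the paper's.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=22, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# AdaBoost adaptively re-weights the training samples after every round.
ada = AdaBoostClassifier(n_estimators=100, learning_rate=1.0, random_state=0)

# Bagging trains each tree on a bootstrap resample (some instances omitted,
# others repeated) and merges the votes, without adaptive re-weighting.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=75, random_state=0)

for name, model in [("AdaBoost", ada), ("Bagging", bag)]:
    model.fit(X_tr, y_tr)
    print(name, round(model.score(X_te, y_te), 3))
```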
Figure 16: 75 decision trees are the best choice for bagging in the range from one to one hundred. Surprisingly, the prediction result does not change strongly with this value. To find the possible reason for this phenomenon, a line chart was created.

Fig 17: The accuracy drops slightly after 75 and then keeps floating in a small range, which indicates that the curve approaches a straight line: since bagging works as a mean over all the weak learners, the accuracy reaches a stable value as the number of weak learners of the same type increases.

[Link] Classifier

The KNN classifier is based on the KNN algorithm, one of the simplest algorithms in machine learning. The main idea of a KNN classifier is that the category of every sample can be represented by its k nearest neighbors. The KNN classifier mainly depends on the limited surrounding samples, rather than on discriminating the class domain, to determine the category. Therefore, the KNN method is more suitable than other methods for sample sets whose class domains overlap heavily. According to the scatter plot matrix, most points are very close to each other, and the points of different categories overlap over a large area, indicating that KNN is a suitable algorithm for our topic. In addition, many classification problems have been successfully solved by the KNN algorithm, for example using four different manners to train a KNN method to detect DTMF

[Link] adjusting

Since every sample is represented by its k nearest neighbors in the KNN algorithm, determining the value of k is particularly important. If a smaller value of K is selected, it is equivalent to forecasting with a smaller neighborhood of training samples, so the approximation error decreases: only training samples close or similar to the input instance affect the forecast. At the same time, the estimation error increases: decreasing K makes the whole model more complicated, and overfitting occurs easily. If a larger value of K is selected, it is equivalent to making predictions with training examples from a larger neighborhood. Its advantage is that it reduces the estimation error of learning, but its disadvantage is that the approximation error increases: training samples far from the input instance also act on the prediction, causing errors, and increasing K makes the overall model simpler (Li). Here the researchers use a looping statement to test the prediction accuracy of different K values on the test set one by one. The KNN algorithm is implemented with the KNeighborsClassifier function in the sklearn library. All parameters of KNeighborsClassifier except the K value are left at their default values. K values range from 1 to 20.

It can be seen that when the k value is small, the accuracy of the model is low. However, when the k value is greater than 3, the accuracy of the model improves significantly, exceeding 78 percent. As the k value continues to increase, the accuracy of the model is maintained at a good level, about 80 percent, although it does not continue to increase.
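The K-adjusting loop described above (k = 1..20, every other KNeighborsClassifier parameter at its default) can be sketched directly. A minimal illustration assuming scikit-learn; synthetic data stands in for the weather features, so the per-k accuracies will not match the paper's curve.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=22, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Loop over k = 1..20, recording the test accuracy of each k,
# with all other parameters left at their defaults.
scores = {}
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    scores[k] = knn.score(X_te, y_te)

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```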
Model                  Accuracy (out of 1.00)
Decision Tree          0.78
Logistic Regression    0.85
LSTM                   0.85
AdaBoost               0.82
KNN                    0.798
Bagging method         0.80
whether it will rain tomorrow just by some simple attributes. Or maybe the attributes the researchers chose simply cannot represent whether it will rain tomorrow.

ACKNOWLEDGMENT

Hanqi Shi and Yifei Ma are both the second authors.

REFERENCES

[1] Young, Joe. "Rain in Australia." Kaggle, 2018, [Link]/jsphyg/weather-dataset-rattle-package/version/2. Accessed 20 Sept. 2021.

[2] Rafferty, John P., editor. "Numerical Weather Prediction (NWP) Models." Encyclopaedia Britannica Online, Encyclopaedia Britannica, [Link]/science/weather-forecasting/Numerical-weather-prediction-NWP-models. Accessed 19 Sept. 2021.

[3] Hickey, Jason. "Using Machine Learning to 'Nowcast' Precipitation in High Resolution." Google AI Blog, Google, 13 Jan. 2020, [Link]/2020/01/using-machine-[Link]. Accessed 18 Sept. 2021.

[4] Zhou, Zhi-Hua. Ensemble Methods: Foundations and Algorithms. E-book ed., Taylor and Francis Group, 2012.

[5] Tibshirani, Robert. "Regression Shrinkage and Selection via the Lasso." Royal Statistical Society, vol. 58, no. 1, 1996, pp. 267-88. JSTOR, [Link]/stable/2346178?seq=1#metadata_info_tab_contents. Accessed 22 Sept. 2021.

[6] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long Short-Term Memory." Neural Computation, vol. 9, no. 8, 15 Nov. 1997, pp. 1735-80. MIT Press Direct, [Link] Accessed 22 Sept. 2021.

[7] Lemmens, Aurélie, and Christophe Croux. "Bagging and Boosting Classification Trees to Predict Churn." Journal of Marketing Research, vol. 43, no. 2, 1 May 2006, pp. 276-86. SAGE Journals Online, [Link] Accessed 22 Sept. 2021.

[8] Soui, Makram, et al. "NSGA-II as Feature Selection Technique and AdaBoost Classifier for COVID-19 Prediction Using Patient's Symptoms." National Library of Medicine, PubMed, [Link] Accessed 22 Sept. 2021.

[9] Liu, Wenfei, et al. "Comparisons on KNN, SVM, BP and the CNN for Handwritten Digit Recognition." IEEE Xplore, Aug. 2020, pp. 25-27. IEEE Xplore, [Link]82. Accessed 22 Sept. 2021.

[10] Mirceva, G., et al. "Classifying Protein Structures by Using Protein Ray Based Descriptor, KNN and FuzzyKNN Classification Methods." IEEE, Nov. 2020. IEEE Xplore, [Link]42. Accessed 22 Sept. 2021.