Journal Pone 0262008
RESEARCH ARTICLE
Abstract
Background
Climate change is expected to exacerbate diarrhoea outbreaks across the developing
world, most notably in Sub-Saharan countries such as South Africa. In South Africa,
diseases related to diarrhoea outbreaks are a leading cause of morbidity and mortality. In this
study, we modelled the impacts of climate change on diarrhoea with various machine
learning (ML) methods to predict daily outbreaks of diarrhoea cases in nine South African
provinces.
OPEN ACCESS
Citation: Abdullahi T, Nitschke G, Sweijd N (2022) Predicting diarrhoea outbreaks with climate
change. PLoS ONE 17(4): e0262008. https://doi.org/10.1371/journal.pone.0262008
considered when predicting outbreak of diarrhoea in South Africa were precipitation, humidity,
evaporation and temperature conditions.
Conclusions
Overall, experiments indicated that the prediction capacity of our DL methods (Convolutional
Neural Networks) was found to be superior (with statistical significance) in terms of
prediction accuracy across most provinces. This study’s results have important implications
for the development of automated early warning systems for diarrhoea (and related disease)
outbreaks across the globe.
Africa (https://www.clicksgroup.co.za/) or the Council for Scientific and Industrial Research
(CSIR) (https://www.csir.co.za/). Furthermore, I confirm that others would be able to access these
data by request and permission from the Clicks Group Limited, South Africa. I also confirm that no
special privileges were involved when the data was secured. All other datasets (real-world climate data
and synthetic datasets for all variables including diarrhoea) used in this study can be accessed from
our GitHub repository https://github.com/aminalawal/Predicting-Diarrhoea-Outbreak-with-Climate-Change.
methods. ML methods are known for their ability to handle high-dimensional data and model
complex predictive problems.
Several supervised learning-based ML techniques such as Support Vector Machines (SVMs)
[16], and deep learning techniques such as Convolutional Neural Networks (CNNs) [17] and Long
Short-Term Memory Networks (LSTMs) [18], have been applied in medical research for
developing predictive and diagnostic models for various diseases [14, 15]. For example, CNNs have
been used to detect malaria parasites [19] and tuberculosis [20] in individuals.
LSTMs have also been used to predict outbreaks of diseases such as typhoid, chickenpox
and scarlet fever [14]. SVMs have also been used for hepatitis detection [21]. These ML
methods are widely used for modelling infectious diseases because of the numerous advantages
they possess. For instance, CNNs are popular for their powerful feature extraction capabilities
[17]. LSTMs are commonly used to handle sequential tasks such as time series forecasting
because of their ability to capture long-term dependencies [14]. SVMs are widely accepted for
their ability to solve nonlinear regression estimation problems; their non-parametric nature
enables them to represent complex and nonlinear functions easily [16].
Despite advances in a range of health-care applications using such predictive-based ML [14,
15, 21], there is a lack of research and data on the efficacy of such predictive ML methods for
diarrhoea outbreak prediction in Sub-Saharan Africa. Additionally, the overall task perfor-
mance of ML algorithms, applied to many health-care applications and more broadly to any
predictive classification task, largely depends on the manual tuning and calibration by algo-
rithm designers and experimenters of methodological parameters over the course of several
experimental trials [22, 23]. Such manual tuning is often ineffective and significantly limits the
full potential of task performance achieved by the ML method, especially for high-dimen-
sional, partially observable, noisy and complex task domains [22], as are typified by the nature
of data-sets in many health-care applications including diarrhoea outbreak prediction. Task
performance also largely depends on the amount of available training data [24], which is a sig-
nificant challenge for most predictive ML in health-care applications due to the sensitive and
controlled nature of health-care data-sets [25]. The inaccessibility of data adds to the difficulty
of method comparison, accuracy, and the advancement of ML as a whole [24, 26].
The overall aim of this study is to ascertain the suitability of various ML methods given vari-
ous climate factors and synthetic (generative) training data for accurately predicting diarrhoea
outbreaks. Specifically, the study aims to elucidate what type of ML method is most appropri-
ate when coupled with specific training and test data-sets (that is, specific climate variables,
data-sparseness, data-noise and synthetic data complement), in order to optimise prediction
efficacy. Thus, we compared task-performance of three ML methods (CNNs, LSTMs and
SVMs) to ascertain the most suitable method for predicting future number of daily diarrhoea
cases in nine South African provinces. The average predictive accuracy of each method was
compared across multiple datasets and experiment replications. Given the sparse and noisy
nature of the data-sets used for method training and testing, we necessarily augmented the
available data (real-world data) with synthetic data generated using Generative Adversarial
Networks (GANs). GANs were selected as they have been previously demonstrated as effective
for generating different types of realistic data [24, 25]. Also, since there was a lack of previous
research to guide parameter tuning and calibration for optimising such ML methods applied
to diarrhoea outbreak prediction, we used the Relevance Estimation and Value Calibration
(REVAC) method [27]. REVAC is an evolutionary algorithm designed for meta-heuristic parameter
tuning, and as such was applied to optimise methodological parameters of the ML methods
used in this study. Previous work has demonstrated the effectiveness of REVAC for
parameter tuning and attaining optimal algorithm performance across a range of complex,
noisy and high-dimensional search spaces [28, 29].
Methods
Study population
This study focused on the nine South African provinces: Western Cape, Eastern
Cape, Northern Cape, North West, Free State, Limpopo, KwaZulu Natal, Gauteng, and
Mpumalanga. Most provinces in South Africa experience rainfall in the summer, with the
exception of Western Cape, which has a Mediterranean climate and receives rainfall
during winter, with an average annual rainfall of 515 mm. KwaZulu Natal,
Free State and Mpumalanga experience the highest annual rainfall, between
800 and 1054 mm, while Eastern Cape, Limpopo, Gauteng, Northern Cape, and North West
receive an annual rainfall of between 400 and 600 mm. In terms of temperature,
Limpopo, Northern Cape, Mpumalanga and North West usually record the
highest temperatures, with annual averages between 27.1 and 30˚C, while the lowest annual average
temperatures, between 22.1 and 23.3˚C, are usually recorded for Western Cape and Eastern
Cape.
Datasets
The datasets used for all experiments consist of nine features categorized into two data
subsets: a diarrhoea feature and a set of eight climate features.
For each province, daily sales records of loperamide, an anti-diarrhoea compound that has
been evaluated in the treatment of patients with chronic non-specific diarrhoea in South Africa
and other parts of the world, were obtained from Clicks Group Limited, South Africa
(https://www.clicksgroup.co.za/). The data cover a 10-year period of total loperamide
purchases between November 2008 and March 2018. These data were used as a proxy for
diarrhoea cases in the region: the number of diarrhoea cases per day for a specific
province was computed as the number of loperamide sales per day associated with the
province. Six-hourly data on maximum temperature, minimum temperature, air temperature,
specific humidity, potential evaporation rate, precipitation rate, surface pressure, and wind
velocity for each South African province between November 2008
and October 2019 were obtained from the National centres for Atmospheric Research and
Atmospheric Prediction (see https://psl.noaa.gov/).
Generative Adversarial Networks (GANs) [25] were used to generate 20,000 synthetic time-series
samples with 24 time-steps each for the diarrhoea and eight climate variables in each province.
Data augmentation was performed to have sufficient data for making predictions. The
synthetic data were combined with the real-world data-sets in two ways: upward augmentation
and downward augmentation. When the data-sets were augmented upwards, the training set
included a combination of the real-world and synthetic samples, but the test set included only
the synthetic data-sets. When the data-sets were augmented downwards, the training set
included mainly the synthetic data-sets and the test set included the real-world data-set.
Technical details on the GAN implementation can be found in S1 Appendix.
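The composition of the upward and downward augmented data-sets can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the implementation used in the study: the function name `build_augmented_split` and the `synthetic_train_fraction` parameter are hypothetical, and the downward case is simplified to train on synthetic samples only, whereas the text states that the training set was mainly synthetic.

```python
def build_augmented_split(real, synthetic, direction, synthetic_train_fraction=0.7):
    """Return (train, test) for 'upward' or 'downward' augmentation.

    real, synthetic: lists of samples (any type).
    synthetic_train_fraction is an assumed value, not taken from the study.
    """
    if direction == "upward":
        # Train on real + part of the synthetic data, test on held-out synthetic data.
        cut = int(len(synthetic) * synthetic_train_fraction)
        train = list(real) + list(synthetic[:cut])
        test = list(synthetic[cut:])
    elif direction == "downward":
        # Simplification: train on synthetic data only, test on real-world data.
        train = list(synthetic)
        test = list(real)
    else:
        raise ValueError("direction must be 'upward' or 'downward'")
    return train, test


# Toy records standing in for time-series samples.
real_samples = [("real", i) for i in range(4)]
synthetic_samples = [("synthetic", i) for i in range(10)]
up_train, up_test = build_augmented_split(real_samples, synthetic_samples, "upward")
down_train, down_test = build_augmented_split(real_samples, synthetic_samples, "downward")
```

In both cases the test samples never appear in the training set, which is the property the two schemes have in common.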
The violin plots in Fig 1 show the distribution of the augmented dataset used in the study
for each province. The distribution of the diarrhoea case variable (loperamide) is similar across
Western Cape, KwaZulu Natal and Gauteng with Western Cape having the highest spread of
cases among all provinces. The distribution of the pressure variable is symmetric
across all provinces, meaning that its values are spread evenly on either side of the centre, whereas the precipitation
variable is positively skewed; thus, the mean value for each province is greater than the median.
The distribution of the other climate variables is shown to be approximately identical across
provinces.
Fig 1. Violin plots showing the distribution of loperamide (diarrhoea) and climate variables across the provinces.
EC = Eastern Cape, FS = Free State, GA = Gauteng, KZ = KwaZulu Natal, LP = Limpopo, MP = Mpumalanga,
NC = Northern Cape, NW = North West, WC = Western Cape. The distribution of the real-world and synthetic data
(augmented data) are shown in S1 and S2 Figs respectively.
https://doi.org/10.1371/journal.pone.0262008.g001
Data preprocessing
The real-world climate and diarrhoea case data-sets collected for each province
were numerical and ordered as time series. To predict daily diarrhoea cases,
the six-hourly climate feature data-sets for each province were converted into daily averages.
For all experiments, we adopted Min-Max normalization for our CNNs and
LSTMs because it is widely adopted for neural network
regression models [30]. For our SVM methods, we adopted the Standard Scaling technique
since SVMs assume that the data given as input is within a standard range [31]. We used the
Python Scikit-Learn (https://scikit-learn.org/) library to implement all our normalizations. For
all experiments, we divided our data-sets in a ratio of 70:30 for training and testing our
methods. The data with the earlier dates were used for training, while the data with
later dates were used to test and verify the accuracy of the methods.
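The preprocessing steps described above can be sketched in plain Python. This is a dependency-free illustration rather than the study's Scikit-Learn code; the helper names are hypothetical, and fitting the scalers on the training portion only is an assumption, since the text does not state which portion of the data the scalers were fitted on.

```python
def daily_means(six_hourly, readings_per_day=4):
    """Average consecutive six-hourly readings into one value per day."""
    last_start = len(six_hourly) - readings_per_day + 1
    return [
        sum(six_hourly[i:i + readings_per_day]) / readings_per_day
        for i in range(0, last_start, readings_per_day)
    ]


def chronological_split(series, train_fraction=0.7):
    """Earlier observations go to training, later ones to testing (no shuffling)."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]


def fit_min_max(train):
    """Return a function mapping values to [0, 1] relative to the training range."""
    lo, hi = min(train), max(train)
    span = (hi - lo) or 1.0  # guard against a constant series
    return lambda x: (x - lo) / span


def fit_standard(train):
    """Return a function that centres on the training mean and scales to unit variance."""
    mean = sum(train) / len(train)
    variance = sum((x - mean) ** 2 for x in train) / len(train)
    std = variance ** 0.5 or 1.0  # guard against a constant series
    return lambda x: (x - mean) / std
```

Values in the test portion that fall outside the training range are mapped outside [0, 1] by the Min-Max function, which is the expected behaviour when the scaler is fitted on training data only.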
In Eq (1), x_i is the actual value while y_i is the predicted value and n is the total number of
observations to be analysed. The ML method with the smallest RMSE error is considered to be
the best performing method in terms of prediction accuracy.
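Eq (1) is not reproduced in this text. Assuming it is the standard definition of the Root Mean Square Error, RMSE = sqrt((1/n) * sum over i of (x_i - y_i)^2), the metric can be computed as in the following sketch (the function name `rmse` is introduced here for illustration):

```python
import math


def rmse(actual, predicted):
    """Root Mean Square Error between actual values x_i and predictions y_i."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = ((x - y) ** 2 for x, y in zip(actual, predicted))
    return math.sqrt(sum(squared_errors) / len(actual))
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.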
Configuration of ML methods
This study adopted two popular deep learning methods, namely CNNs and LSTMs, and a traditional
ML method, SVM, for all experiments. These methods were chosen because of their
success in time series predictive tasks such as [14, 33]. Aside from the powerful feature representation
capabilities of deep learning models, the LSTM network is a powerful technique for analyzing
temporal data. While other traditional ML methods such as decision trees [34]
and ARIMA [12] exist, SVM was chosen because it is a widely used nonlinear regression
estimation technique [16]. In addition, our preliminary analysis showed that the chosen ML
methods outperform decision trees (see the S1 Appendix). The rest of this section
provides details on how the chosen methods were implemented.
CNN method. CNNs are a class of feed-forward, deep neural networks that consist of multiple
convolutional and activation layers, pooling layers, and a fully connected layer, as shown
in Fig 2. These layers are designed to perform specific tasks in order to extract important
features from the input data. After several iterations of convolutions, node activations and
pooling, the final output is computed in the fully connected layer of the network. Our CNN method
was designed with 1D convolutions to match the sequential nature of our input data.
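The sequence of operations in a 1D CNN (convolution, activation, pooling, fully connected output) can be illustrated with a forward pass written in plain Python. This is a toy sketch with hand-picked weights, not the Keras model used in the study; the kernel, pool size and dense weights are assumptions chosen only to make the arithmetic easy to follow.

```python
def conv1d(signal, kernel, bias=0.0):
    """'Valid' 1D convolution as used in CNN layers (sliding dot product)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k)) + bias
        for i in range(len(signal) - k + 1)
    ]


def relu(values):
    """Rectified linear activation applied element-wise."""
    return [max(0.0, v) for v in values]


def max_pool1d(values, size=2):
    """Keep the maximum of each non-overlapping window of `size` values."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


def dense(values, weights, bias=0.0):
    """Fully connected layer with a single output unit."""
    return sum(v * w for v, w in zip(values, weights)) + bias


series = [1, 3, 2, 5, 4, 6]                 # toy input sequence
features = conv1d(series, kernel=[-1, 0, 1])  # responds to change over two steps
pooled = max_pool1d(relu(features), size=2)
prediction = dense(pooled, weights=[0.5, 0.5])
```

In a trained network the kernel and dense weights are learned from data and several convolution and pooling stages are stacked; the order of operations is the same.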
LSTM method. LSTMs, as shown in Fig 3, are a type of
Recurrent Neural Network (RNN) that addresses the issue of exploding and vanishing
gradients. They contain memory cells that maintain their state over time. The memory cells are
managed by gating units that control how they memorise, erase, and expose information. These
gating units are called the input gate, forget gate and output gate respectively.
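The role of the three gates can be made concrete with a single time step of a scalar LSTM cell. The sketch below follows the standard LSTM update equations and is not taken from the study's code; the parameter layout (one `(w, u, b)` triple per gate) is a hypothetical simplification of the weight matrices used in practice.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a scalar LSTM cell.

    params maps each of "input", "forget", "output", "candidate" to a tuple
    (w, u, b): input weight, recurrent weight and bias.
    """
    def pre_activation(name):
        w, u, b = params[name]
        return w * x + u * h_prev + b

    i = sigmoid(pre_activation("input"))     # how much new information to memorise
    f = sigmoid(pre_activation("forget"))    # how much of the old state to keep
    o = sigmoid(pre_activation("output"))    # how much of the state to expose
    g = math.tanh(pre_activation("candidate"))

    c = f * c_prev + i * g                   # updated memory cell state
    h = o * math.tanh(c)                     # hidden state passed to the next step
    return h, c
```

The additive update of the cell state `c` is what lets information and gradients persist across many time steps.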
Fig 2. Basic architecture of the Convolutional Neural Network (CNN) with two convolution and pooling layers.
https://doi.org/10.1371/journal.pone.0262008.g002
Fig 3. Basic structure of the Long Short-Term Memory (LSTM) method with two LSTM layers.
https://doi.org/10.1371/journal.pone.0262008.g003
SVM method. SVMs are mathematical models whose main function is to find hyperplanes
capable of creating margins that separate data points in a high-dimensional feature
space with the smallest structural risk using kernel functions. We used the Scikit-Learn package
(https://scikit-learn.org/) to develop all our SVM methods with a Radial Basis Function (RBF)
kernel for predictions, as shown in Fig 4.
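For reference, the RBF kernel used by the SVM can be written in a few lines. This sketch uses the standard definition k(a, b) = exp(-gamma * ||a - b||^2); the default `gamma` value is an arbitrary assumption and not a tuned value from the study.

```python
import math


def rbf_kernel(a, b, gamma=0.5):
    """Radial Basis Function kernel between two equal-length feature vectors."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)
```

The kernel equals 1 for identical inputs and decays towards 0 as inputs move apart, with `gamma` controlling how quickly the similarity falls off.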
Fig 4. Structure of the Support Vector Machine (SVM) regression method. The mappings of the input vectors and
the final output are discerned with the RBF kernel function.
https://doi.org/10.1371/journal.pone.0262008.g004
For all the methods used in this study, we prepared our input data in a lag format (described
in the Experiments setup section); no manual feature extraction step was conducted. Both deep
learning methods (CNN and LSTM) were implemented with the Keras and TensorFlow
(https://keras.io/) deep learning libraries. The methods were configured to produce reproducible results;
thus, a fixed random seed (https://www.tensorflow.org/) was set for all experiments. For all
ML methods, we kept some parameters fixed (based on parameter values established in previous
related work [14, 15]), while others were tuned. See Table 1 for the list of tuned
parameters.
Experiments setup
Table 2 gives an overview of the experiments conducted for this study and Fig 5 presents the
overall pipeline used to predict daily diarrhoea cases in our experiments. Since this is a regres-
sion task, the input data were all in numerical format. Previous studies such as [12, 15] have
shown that the basic form of feature engineering applied to a time series prediction task is taking
past observations into consideration. Although deep learning methods are known for
automatic feature engineering [17], we applied this feature engineering step across all models
for consistency. This approach is also consistent with previous works such as [14, 15, 33]. To
forecast the number of daily diarrhoea cases, we considered past observations
(lags) in all our methods because patterns of the past are likely to be repeated in the
future. We tested the predictions of the three ML methods with respect to four different lag
periods from all input features: a lag of one (1) day, a lag
of five (5) days, a lag of two (2) weeks and a lag of three (3) weeks. For example, a lag of one day
means that the prediction made by a method for the 6th of January 2018 was made with input
variables (for all features) of the 5th of January 2018, while a lag of five days means that the prediction
for the 6th of January 2018 was made with input variables (for all features) of the 1st to the 5th
of January 2018. These specific lag periods were chosen since our preliminary analyses showed
that they produce more accurate predictions.
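The lag format can be sketched as a sliding window over the daily records. The function name `make_lagged_samples` and the toy values are hypothetical; the sketch assumes that each input concatenates all features from the previous `lag` days and that the target is the diarrhoea case count on the following day.

```python
def make_lagged_samples(rows, target_index, lag):
    """Build (inputs, targets) from chronologically ordered daily feature rows.

    rows: list of per-day feature lists (diarrhoea cases plus climate features).
    target_index: position of the diarrhoea case count within each row.
    lag: number of preceding days used as input for each prediction.
    """
    inputs, targets = [], []
    for t in range(lag, len(rows)):
        window = rows[t - lag:t]  # the `lag` days before day t
        inputs.append([value for day in window for value in day])
        targets.append(rows[t][target_index])
    return inputs, targets


# Toy data: [cases, humidity] for four consecutive days.
days = [[10, 0.1], [11, 0.2], [12, 0.3], [13, 0.4]]
x_lag1, y_lag1 = make_lagged_samples(days, target_index=0, lag=1)
x_lag2, y_lag2 = make_lagged_samples(days, target_index=0, lag=2)
```

A longer lag yields wider input vectors but fewer samples, since the first `lag` days have no complete history.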
Thereafter, optimal parameters were selected and we determined the best performing ML
method by comparing the RMSE of the predictions made by the three ML algorithms (with
respect to the four lag periods) in three different experiments. For each ML method,
predictions were repeated three times for each lag and each province, and the average RMSE
was computed.
The first experiment (Experiment I) was implemented with the real-world case data which
contained the diarrhoea cases and eight climate features. The objective was to determine
which ML method performs best given the amount of data instances contained in the real-
world data-set. In order to obtain optimal training parameter values for each ML method
across each province, the grid-search method was used in this experiment. For most deep neural
networks, such as CNNs, the computational complexity can be computed as O(n^2) for
both training and inference time, where n is the input dataset size [35]. However, for networks
that deal with sequential learning such as the LSTM, the learning complexity per time step is
O(W), where W is the number of parameters in a standard network [36]. The computational
complexity of SVM, on the other hand, is O(n^3) [37]. Because the computational complexity
of each model is different, training and test times will also differ. In order to account for this,
the average run-time of each model was computed for this experiment (see the S1 Appendix
for more details).
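The grid-search tuning used in Experiment I can be sketched as an exhaustive loop over every parameter combination. The parameter names and candidate values below are hypothetical (Table 1 is not reproduced in this text), and the toy `evaluate` function stands in for training a model and returning its RMSE.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Evaluate every combination in param_grid and return (best_params, best_score).

    param_grid: dict mapping parameter name to a list of candidate values.
    evaluate: callable taking a params dict and returning an error to minimise.
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid and a toy error surface with its minimum at (0.01, 50).
grid = {"learning_rate": [0.1, 0.01, 0.001], "epochs": [10, 50, 100]}
toy_error = lambda p: (p["learning_rate"] - 0.01) ** 2 + abs(p["epochs"] - 50)
best_params, best_score = grid_search(grid, toy_error)
```

The number of evaluations is the product of the candidate list lengths, which is why grid search becomes costly as more parameters are tuned.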
After concluding the first experiment, we measured the degree of importance of each cli-
mate variable to the best performing diarrhoea prediction method in a specific province by
conducting a sensitivity analysis [38]. We adopted the Backward stepwise method [38] in
which we measured the effect of one variable at a time while keeping the other variables fixed.
Sensitivity is then measured by observing changes in the RMSE error of the given method
based on the omission of a certain variable. The larger the increase in RMSE, the higher the
importance of the omitted variable.
The second experiment (Experiment II) was conducted to
determine the effect of augmented training and testing data as well as the effect of a larger
training data size on the prediction performance of the three ML methods. The data-sets used
in this experiment were combinations of the synthetic and real-world data-set, that is, the
upward and downward augmented data in each province. Predictions by each ML method
were made with each input data-set separately for each province. The data preprocessing steps
and the parameters selected by the grid-search tuning in the first experiment were maintained
for each ML method with regards to a specific province.
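The sensitivity analysis described earlier in this section, in which one climate variable is omitted at a time and the change in RMSE is recorded, can be sketched as follows. The function name and the toy `evaluate_rmse` callable are hypothetical; in the study this step would involve retraining and evaluating the best performing method for each reduced feature set.

```python
def variable_importance(feature_names, evaluate_rmse):
    """Rank features by the RMSE increase observed when each one is omitted.

    evaluate_rmse: callable taking a list of feature names and returning the
    RMSE of a model trained and tested with only those features.
    """
    baseline = evaluate_rmse(list(feature_names))
    increases = {}
    for name in feature_names:
        reduced = [f for f in feature_names if f != name]
        increases[name] = evaluate_rmse(reduced) - baseline
    # Largest RMSE increase first: the most important variable.
    return sorted(increases.items(), key=lambda item: item[1], reverse=True)


# Toy stand-in: each feature removes a fixed amount of error when present.
toy_gain = {"precipitation": 0.30, "humidity": 0.20, "pressure": 0.05}
toy_rmse = lambda features: 1.0 - sum(toy_gain[f] for f in features)
ranking = variable_importance(list(toy_gain), toy_rmse)
```

A negative increase would indicate that omitting the variable improved the prediction, which can happen with noisy or redundant inputs.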
The third experiment (Experiment III) was performed to determine the effect of REVAC
parameter tuning on the prediction performance of the three ML methods with the upward
and downward augmented data. The major difference between the second and third experi-
ment is the method used for tuning the parameters of each ML method. For all the prediction
tasks carried out in the third experiment, data preprocessing steps taken for the three ML
methods were the same as the previous experiments. However, the parameter values of each
ML method were tuned with the REVAC tuning method. Once the REVAC parameter tuning
tasks were completed for each ML method, the fittest set of parameter values for each province
were used to carry out final predictions.
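To give an intuition for this kind of evolutionary parameter tuning, the sketch below implements a simplified, REVAC-style loop: a population of parameter vectors is maintained, the best vectors act as parents, a child is formed by picking each parameter from a random parent, and each value is then resampled from an interval spanned by neighbouring parent values. This is an illustrative simplification and not the REVAC implementation or settings used in the study; the function name, population sizes, neighbourhood width `h` and toy objective are all assumptions.

```python
import random


def revac_style_tune(evaluate, bounds, pop_size=20, n_parents=10, h=2,
                     iterations=200, seed=0):
    """Simplified REVAC-style tuner. Returns (best_params, best_score)."""
    rng = random.Random(seed)
    names = list(bounds)
    population = []  # list of (params, score), oldest first
    for _ in range(pop_size):
        params = {n: rng.uniform(*bounds[n]) for n in names}
        population.append((params, evaluate(params)))
    best = min(population, key=lambda member: member[1])

    for _ in range(iterations):
        parents = sorted(population, key=lambda member: member[1])[:n_parents]
        child = {}
        for n in names:
            # Crossover: take this parameter from a randomly chosen parent.
            value = rng.choice(parents)[0][n]
            # Mutation: resample between the h-th neighbours among parent values.
            ordered = sorted(parent[0][n] for parent in parents)
            idx = ordered.index(value)
            low = ordered[max(0, idx - h)]
            high = ordered[min(len(ordered) - 1, idx + h)]
            child[n] = rng.uniform(low, high)
        score = evaluate(child)
        population.pop(0)  # the child replaces the oldest member
        population.append((child, score))
        if score < best[1]:
            best = (child, score)
    return best


# Toy objective standing in for "train the model and return its RMSE".
toy_objective = lambda p: (p["x"] - 3.0) ** 2 + (p["y"] + 1.0) ** 2
search_space = {"x": (-10.0, 10.0), "y": (-10.0, 10.0)}
tuned_params, tuned_score = revac_style_tune(toy_objective, search_space)
```

Because each new value is drawn from an interval defined by the current parents, the sampling distribution narrows around well-performing regions as the search proceeds.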
Results
Table 3 reports the average RMSE for predictions made with real-world data in all provinces.
We observed that the CNN method performed best, closely followed by the
LSTM method, while the SVM showed the poorest performance. Table 3 also shows
that the CNN method had the lowest overall RMSE average of 31.55%, while the LSTM and SVM
averages were 32.91% and 33.89% respectively. We can infer from these results that the RMSE
errors are lower for the deep learning methods (CNN and LSTM).
Fig 6 shows that the use of augmented data greatly improved the performance of the three
ML methods in each province. Predictions for Limpopo province show the highest improvement,
with over 50% increase for each ML method when both upward and downward augmented
data were used for predictions. However, over most provinces, the percentage increase
in performance for predictions with the LSTM and SVM methods was more than for the CNN
method.
Table 3. Root Mean Square Error (RMSE) averages for predictions using real-world data.
ML method                               RMSE
Convolutional Neural Network (CNN)      31.55%
Long Short-Term Memory (LSTM)           32.91%
Support Vector Machine (SVM)            33.89%
Standard Deviation                      0.008
https://doi.org/10.1371/journal.pone.0262008.t003
Fig 6. Percentage change in performance of each ML method for predictions in Experiment II. (a) & (b):
Percentage change in performance of each ML method over each province when predictions were made with the (a)
upward augmented data-set and (b) downward augmented data-set instead of the real-world data. High percentage
RMSE indicates an improvement in performance and vice-versa.
https://doi.org/10.1371/journal.pone.0262008.g006
Table 4 compares the performance of the overall predictions made by each ML method
when the augmented data-sets were used, based on the parameter tuning technique selected.
Comparing the RMSE of predictions made with the augmented data and grid-search parameters (Table 4)
against the RMSE of predictions made with the real-world data (Table 3) shows that, for each ML
method, the predictions made with the augmented datasets yielded lower RMSE
than the predictions with the real-world data-sets. Thus, we can infer that the amount of
training data significantly affects the prediction performance of all three
ML methods. Comparing the average RMSE percentages across data-sets also shows that
CNN outperformed the other methods when the real-world dataset was used alone, while
LSTM outperformed the other methods when either of the augmented datasets was used.
Fig 7 shows the results when the parameters of the three ML methods were tuned with
REVAC instead of grid-search. We found that the CNN method’s prediction results improved
across all provinces. The highest percentage increase recorded for CNN was over 12% and the
smallest increase was about 2.5%. The LSTM method’s performance also increased across most
provinces; however, its predictive task performance declined in Limpopo, KwaZulu Natal and
Free State provinces. Among the three methods, the SVM recorded the highest number of
provinces that saw a decline in task performance. The average increase of SVM task performance
across all provinces was also the smallest.
Fig 8 shows the provincial prediction results of the ML methods when augmented data was
used for training. In Fig 8a, when grid-search parameters were used, the LSTM method
outperformed all the other methods in most provinces with both augmented data-sets and was
closely followed by the SVM, except in Western Cape and KwaZulu Natal provinces where the
CNN outperformed the SVM. When REVAC tuning parameters were used, as shown in Fig 8b,
the LSTM method still outperformed the other methods for most provinces and was closely
followed by the CNN for most of the data-sets. However, in Gauteng province, the SVM
outperformed the CNN.
Table 4. RMSE averages for REVAC and grid-search method parameter tuning.
                      REVAC tuning                        Grid-search tuning
ML Method             Upward           Downward           Upward           Downward
                      augmented data   augmented data     augmented data   augmented data
CNN                   22.07%           23.86%             23.11%           25.80%
LSTM                  21.60%           23.61%             21.93%           23.78%
SVM                   22.17%           27.30%             22.17%           27.97%
Standard Deviation    0.003            0.021              0.006            0.134
https://doi.org/10.1371/journal.pone.0262008.t004
Fig 7. Percentage change in performance of each ML method for predictions in Experiment III. (a) & (b):
Percentage change in performance of each ML method over each province when predictions were made with the
parameters from REVAC tuning instead of the grid-search parameters for (a) upward augmented data and (b)
downward augmented data-set. High percentage RMSE indicates an improvement in task performance and vice-versa.
https://doi.org/10.1371/journal.pone.0262008.g007
The results of the sensitivity study (Fig 9) show that the relative importance
of each climate variable differs across provinces. For instance, in
Western Cape, Eastern Cape and Free State, the pressure variable was the most sensitive
when training any given diarrhoea outbreak prediction method, whereas in North West
and Mpumalanga, evaporation was the most sensitive climate variable. In Gauteng, maximum
temperature was most important, while in KwaZulu Natal, minimum temperature was
more sensitive. In Limpopo, humidity was the most sensitive variable, while wind speed was more
important in the Northern Cape.
Fig 8. Provincial results of the ML methods with the augmented data-sets in Experiments II & III. (a) & (b): Results
of the predictions with the augmented data-sets for each province (a) represents the results with grid-search tuned
parameters and (b) represents the results with REVAC tuning parameters. Low RMSE averages indicate better task
performance and vice-versa.
https://doi.org/10.1371/journal.pone.0262008.g008
Fig 9. Variable importance plot. Result of the sensitivity analysis carried out for the CNN prediction method for each province. The x-axis indicates
the prediction accuracy of the method once the variable on the y-axis is omitted from the method. The longer the bar, the larger the loss in accuracy and
the higher the importance of that variable.
https://doi.org/10.1371/journal.pone.0262008.g009
Discussion
The results of our experiments revealed that although the Deep Learning (DL) methods
(Configuration of ML methods section) outperformed the SVM (SVM method section) in most
tasks, there was no clear best ML method overall. The ML methods showed different levels of
skill based on the availability of training data and the type of parameter tuning method used
during training.
The prediction performance of all ML methods improved when the augmented data-sets
were used for training, with the LSTM (LSTM method section) giving the overall best performance.
This implies that a large training set size boosts the performance of most ML algorithms.
We also surmise that the LSTM method performs better when the size of training data
is large, which is perhaps the reason for its relatively poor performance in the first experiment, where
only real-world data with a limited training set was used. A study conducted by [39] has shown
that LSTM benefits from a large training set size. In addition, another study by [14] reported
that LSTMs are state of the art for capturing the long-term dependencies specific to a given
data-set, hence their ability to learn patterns in sequential data with sufficient training size
regardless of its noisy nature.
Table 5. Root Mean Square Error (RMSE) performance comparison with the existing diarrhoea prediction studies.
Study        CNN     LSTM    SVM      RF       ARIMA
Our study    0.22    0.21    0.22     -        -
[14]         -       1.43    -        -        1.38
[15]         -       -       49.91    48.14    -
[41]         -       -       -        0.45     0.31
https://doi.org/10.1371/journal.pone.0262008.t005
Sensitivity analysis
Our parameter sensitivity analysis (Experiments setup section) demonstrated that the predic-
tion of diarrhoea outbreak by the given ML methods is influenced by specific climate factors.
The most prominent (influential) factors are precipitation, humidity, evaporation and tempera-
ture, although their levels of influence differ across South African provinces. Our findings are
in agreement with studies such as [7, 8] that have shown that diarrhoea cases increase for every
1˚C increase in temperature. In addition, related work by [42] reported that evaporation rate
is strongly linked to high temperature. Since increases in diarrhoea cases have been associated
with high temperature, perhaps diarrhoea can also be linked to evaporation rate. Other studies
[9, 15] have also demonstrated that precipitation rate and humidity are strongly related to
reported increases in diarrhoea-related hospitalizations.
Study contributions
A key contribution of this research is the first comprehensive study and application of perti-
nent ML methods to real-world health-care data sourced from various South African medical
institutions in order to formalise an effective predictive machine learning methodology for
Sub-Saharan Africa (currently, one of the most adversely affected areas, globally, by diarrhoea
outbreaks [1, 3]). A second key contribution of this research is the use of evolutionary optimi-
sation for automating parameter tuning for a given ML method and associated training data-
set, as well as demonstration of data augmentation techniques, such as use of generative mod-
els to generate artificial data [24, 25] to complement training data deficiencies.
While our study has demonstrated that ML can be used for diarrhoea outbreak prediction
with climate factors, the results can be improved in some ways. For example, taking other
human and environmental factors that cause the spread of infectious diseases into consideration
may improve the accuracy of future diarrhoea prediction models. Given the different
strengths of each ML algorithm, developing a hybrid method that combines the advantages
of at least two ML algorithms may result in a methodology that yields consistently
high predictive task performance regardless of the conditions set in an experiment.
Conclusion
The global burden of diarrhoea is a major public health problem that causes both personal and
widespread harm. This study ascertained the applicability of various Machine Learning (ML)
methods in the development of an automated early warning system for predicting outbreaks of
diarrhoea in South Africa given specific climate variables. We compared the predictive task
performance of various ML methods, including Support Vector Machines (SVMs), Long Short-Term
Memory Neural Networks (LSTMs) and Convolutional Neural Networks (CNNs), for predicting
daily diarrhoea cases over nine South African provinces. Prediction comparisons were with
respect to a specific set of climate variables and varying proportional combinations of
real-world and synthetic (data augmentation) training and testing data. Results indicated that
overall (for all real-world data-sets), our CNN yielded the most accurate predictions, supporting
the well established predictive capacity and efficacy of deep-learning systems. However, given
synthetic training and testing data-augmentation, our LSTM yielded the most accurate
predictions overall. This study also elucidated that the climate variables: precipitation, humidity,
evaporation, and temperature, yielded the greatest impact on daily diarrhoea cases across
South Africa, and were thus the data-set variables integral to the predictive success of our
tested methods. Thus, a key contribution of this study is the guidance it provides researchers
in selecting a suitable ML method for disease outbreak prediction (diarrhoea case prediction
in this study), given real-world and augmented training and testing data-sets containing
specific types of climate variables. Current research is applying further predictive machine
learning methods in an ongoing effort to develop automated early-warning systems for broad-
spectrum disease outbreak prediction across various developing nations with deficient public
health systems.
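The proportional combinations of real-world and synthetic training data mentioned above can be sketched as follows. This is a minimal illustration only, assuming each row is an independent (features, target) example such as a fixed-length window. The function name `mix_training_data`, the proportions, and the placeholder rows are hypothetical and are not taken from the study's code.

```python
import random


def mix_training_data(real, synthetic, real_fraction, size, seed=0):
    """Draw `size` rows, of which a share `real_fraction` is real data.

    Rows are sampled without replacement from each source and then
    shuffled, which assumes the rows are independent examples.
    """
    if not 0.0 <= real_fraction <= 1.0:
        raise ValueError("real_fraction must lie in [0, 1]")
    n_real = round(size * real_fraction)
    n_synthetic = size - n_real
    if n_real > len(real) or n_synthetic > len(synthetic):
        raise ValueError("not enough rows in one of the sources")
    rng = random.Random(seed)  # fixed seed for reproducibility
    mixed = rng.sample(real, n_real) + rng.sample(synthetic, n_synthetic)
    rng.shuffle(mixed)
    return mixed


# Placeholder rows tagged with their origin (hypothetical data).
real_rows = [("real", i) for i in range(100)]
synthetic_rows = [("synthetic", i) for i in range(100)]

# Build training sets with decreasing shares of real-world data.
for fraction in (1.0, 0.75, 0.5, 0.25, 0.0):
    train = mix_training_data(real_rows, synthetic_rows, fraction, size=40)
    n_real = sum(1 for source, _ in train if source == "real")
    print(f"real fraction {fraction:.2f}: {n_real} real, {len(train) - n_real} synthetic")
```

Each mixed set could then be passed to the same model and evaluation routine, so that differences in accuracy can be attributed to the real-to-synthetic ratio rather than to the training set size.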
Supporting information
S1 Appendix.
(PDF)
S1 Fig. Violin plots showing the distribution of the upward augmented data for loperamide
(diarrhoea) and climate variables across the provinces. EC = Eastern Cape, FS = Free
State, GA = Gauteng, KZ = KwaZulu Natal, LP = Limpopo, MP = Mpumalanga,
NC = Northern Cape, NW = North West, WC = Western Cape.
(TIF)
S2 Fig. Violin plots showing the distribution of the downward augmented data for loperamide
(diarrhoea) and climate variables across the provinces. EC = Eastern Cape, FS = Free
State, GA = Gauteng, KZ = KwaZulu Natal, LP = Limpopo, MP = Mpumalanga,
NC = Northern Cape, NW = North West, WC = Western Cape.
(TIF)
Acknowledgments
The authors would like to extend their gratitude to the Applied Center for Climate and Earth
Systems Research (ACCESS) under the Council for Scientific and Industrial Research (CSIR),
South Africa, and Clicks Pharmaceuticals, South Africa, for providing data relevant to this study.
Author Contributions
Conceptualization: Tassallah Abdullahi, Geoff Nitschke, Neville Sweijd.
Data curation: Tassallah Abdullahi, Neville Sweijd.
Methodology: Tassallah Abdullahi, Geoff Nitschke.
Resources: Neville Sweijd.
Supervision: Geoff Nitschke.
Writing – original draft: Tassallah Abdullahi.
Writing – review & editing: Tassallah Abdullahi, Geoff Nitschke, Neville Sweijd.
References
1. Troeger C, et al. Estimates of the Global, Regional, and National Morbidity, Mortality, and Aetiologies of
Diarrhoea in 195 Countries: A Systematic Analysis for the Global Burden of Disease Study 2016. The
Lancet Infectious Diseases. 2018; 18(11):1211–1228. https://doi.org/10.1016/S1473-3099(18)30362-1
2. WHO, et al. The World Health Report: 1996: Fighting Disease, Fostering Development. WHO, Geneva,
Switzerland: WHO; 1996.
3. Kosek M, Bern C, Guerrant R. The Global Burden of Diarrhoeal Disease, as Estimated from Studies
Published between 1992 and 2000. Bulletin of the World Health Organization. 2003; 81:197–204.
PMID: 12764516
4. Bradshaw D, Groenewald P, Laubscher R, Nannan N, Nojilana B, Norman R, et al. Initial Burden of Dis-
ease Estimates for South Africa, 2000. South African Medical Journal. 2003; 93(9):682–688. PMID:
14635557
27. Nannen V, Eiben A. Efficient Relevance Estimation and Value Calibration of Evolutionary Algorithm
Parameters. In: 2007 IEEE congress on evolutionary computation. IEEE; 2007. p. 103–110.
28. Smit S, Eiben A. Comparing Parameter Tuning Methods for Evolutionary Algorithms. In: 2009 IEEE
Congress on Evolutionary Computation. IEEE; 2009. p. 399–406.
29. Smit S, Eiben A. Beating the ‘World Champion’ Evolutionary Algorithm via REVAC Tuning. In: IEEE
Congress on Evolutionary Computation. IEEE; 2010. p. 1–8.
30. Jorge S, Sevilla J. Importance of Input Data Normalization for the Application of Neural Networks to
Complex Industrial Problems. IEEE Transactions on Nuclear Science. 1997; 44(3):1464–1468. https://
doi.org/10.1109/23.589532
31. Borer S, Graf A. Normalization in Support Vector Machines. In: Joint Pattern Recognition Symposium.
Munich, Germany: Springer; 2001. p. 277–282.
32. Pelanek R. Metrics for Evaluation of Student Models. Journal of Educational Data Mining. 2015; 7(2):1–
19.
33. Lara-Benítez P, Carranza-García M, Luna-Romera JM, Riquelme JC. Temporal convolutional networks
applied to energy-related time series forecasting. Applied Sciences. 2020; 10(7):2322. https://doi.org/
10.3390/app10072322
34. Safavian SR, Landgrebe D. A survey of decision tree classifier methodology. IEEE Transactions on
Systems, Man, and Cybernetics. 1991; 21(3):660–674. https://doi.org/10.1109/21.97458
35. Ding C, Liao S, Wang Y, Li Z, Liu N, Zhuo Y, et al. CirCNN: accelerating and compressing deep neural
networks using block-circulant weight matrices. In: Proceedings of the 50th Annual IEEE/ACM
International Symposium on Microarchitecture; 2017. p. 395–408.
36. Sak H, Senior A, Beaufays F. Long short-term memory based recurrent neural network architectures for
large vocabulary speech recognition. arXiv preprint arXiv:1402.1128. 2014.
37. Abdiansah A, Wardoyo R. Time complexity analysis of support vector machines (SVM) in LibSVM.
International Journal of Computer Applications. 2015; 128(3):28–34. https://doi.org/10.5120/
ijca2015906480
38. Gevrey M, Dimopoulos I, Lek S. Review and Comparison of Methods to Study the Contribution of Vari-
ables in Artificial Neural Network Models. Ecological Modelling. 2003; 160(3):249–264. https://doi.org/
10.1016/S0304-3800(02)00257-0
39. Yang S, Yu X, Zhou Y. LSTM and GRU Neural Network Performance Comparison Study: Taking Yelp
Review Dataset as an Example. In: 2020 International Workshop on Electronic Communication and
Artificial Intelligence (IWECAI). IEEE; 2020. p. 98–101.
40. Nguyen V, Gupta S, Rane S, Li C, Venkatesh S. Bayesian Optimization in Weakly Specified Search
Space. In: 2017 IEEE International Conference on Data Mining (ICDM). IEEE; 2017. p. 347–356.
41. Fang X, Liu W, Ai J, He M, Wu Y, Shi Y, et al. Forecasting incidence of infectious diarrhea using random
forest in Jiangsu Province, China. BMC Infectious Diseases. 2020; 20(1):1–8. https://doi.org/10.1186/
s12879-020-4930-2 PMID: 32171261
42. Kamai T, Weisbrod N, Dragila M. Impact of Ambient Temperature on Evaporation from Surface-
exposed Fractures. Water Resources Research. 2009; 45(2). https://doi.org/10.1029/2008WR007354