05-SIJ Sinta2-Haikal S2-AHW KS
Abstract.
Purpose: Numerous factors can affect the duration of COVID-19 recovery, one of which is the use of natural
herbal medication. This study seeks to determine the variables influencing the duration of COVID-19 recovery and to
compare discriminant analysis and support vector machine models using COVID-19 patient data from West Sumatra.
Methods: Two data mining methods, Discriminant Analysis and Support Vector Machine with three types of
kernels (linear, polynomial, and radial basis function), were employed to classify the time of COVID-19 recovery in
this work. The study used 428 observations, with 75% allocated to training data and 25% to testing data. The
independent variables were selected according to their information value (IV), which gauges their
influence on the dependent variable. Three data resampling techniques (undersampling, oversampling, and SMOTE)
were employed to tackle the problem of data imbalance. The two methods were compared on their
balanced accuracy.
Result: The Discriminant Analysis with SMOTE achieved a balanced accuracy of 66.54%, outperforming the linear
kernel Support Vector Machine with SMOTE, which had a balanced accuracy of 63.20% in this dataset.
Novelty: This study compares Discriminant Analysis and SVM algorithms on data with mixed categorical and
continuous independent variables. It also explores techniques for managing imbalanced data
(undersampling, oversampling, and SMOTE) in combination with variable selection based on information
value.
Keywords: Discriminant analysis, Support vector machine, Mixed independent variable, Resampling, COVID-19
Received November 2023 / Revised February 2024 / Accepted February 2024
This work is licensed under a Creative Commons Attribution 4.0 International License.
INTRODUCTION
SARS-CoV-2, the virus that causes COVID-19, can cause a variety of effects ranging from no symptoms to
multi-organ failure and death. [1] During the initial stages of the pandemic in early 2020, approximately 80% of
people who contracted SARS-CoV-2 showed no symptoms, while around 13% experienced severe illness
necessitating respiratory assistance, and about 7% needed intensive care due to clinical manifestations such
as acute respiratory infection (ARI), sepsis, and multi-organ failure [2].
Natural herbal treatments have been used globally for treating COVID-19. [3] Several COVID-19
patients in different countries, such as China, have been treated using traditional herbal medicine
prescriptions. [4] As a tropical country, Indonesia has abundant medicinal plants, and the West Sumatra
region is particularly rich in natural medicinal flora. The inhabitants of West Sumatra have traditionally
used indigenous botanicals to treat various illnesses, including COVID-19. The leaves of the sungkai tree
(Peronema canescens) in West Sumatra are thought to provide medicinal potential for treating COVID-19.
[5] The leaves of the sungkai tree are traditionally used to cure fever, colds, diarrhoea, hypertension, and
malaria. They are also being explored as an alternative therapy for COVID-19. Yani [6] conducted research
indicating that extracts from young sungkai leaves can enhance immunity by raising the white blood cell
count in the blood, therefore strengthening the immune system against many infectious diseases.
COVID-19 has an incubation period, which is the duration between viral infection and the appearance of
illness symptoms [7]. The incubation period of COVID-19 is reportedly up to 14 days [8], [9]. This study will focus on
*
Corresponding author.
Email addresses: [email protected] (Haikal)
DOI: 10.15294/sji.v11i1.48565
A classification approach was selected to categorise the COVID-19 recovery time. Classification
algorithms predict the group to which an observation belongs, based on existing class categories, using
independent variables. [11] Imbalanced class data, in which data points are unequally distributed among the
classes, can cause misclassification [12] and impair the model's performance [13]. The study's response
variable, the duration of COVID-19 recovery, has imbalanced classes, and this imbalance needs to be
addressed. Resampling techniques can assist in addressing imbalanced data [14]. This study employed
undersampling, oversampling, and the Synthetic Minority Oversampling Technique (SMOTE).
Undersampling is a resampling technique that randomly removes data from the majority class until its size
matches or comes close to that of the minority class [15]. Qian [16] found that undersampling enhanced
classification accuracy in Support Vector Machine (SVM) and Discriminant Analysis. Oversampling
randomly duplicates data in the minority class until its size matches or approximates that of the majority
class, addressing the issue of class imbalance. [17], [18], [19] SMOTE is a resampling technique
that has emerged as a suitable alternative for addressing issues associated with imbalanced data. [20] It is
an oversampling technique that equalises the class distribution of a dataset by adding synthetic samples
to the minority class. [21] Wang [22] observed that SMOTE is an effective technique for addressing
imbalanced data and enhancing accuracy metrics.
Discriminant Analysis is a statistical technique for categorising and assigning new objects to predetermined
groups. [23] Ronald A. Fisher established it in 1936, and it is regarded as a classic data mining method.
Discriminant Analysis initially had limitations as it exclusively operated with continuous independent
variables. [24] Mbina [25] expanded Discriminant Analysis to accommodate mixed categorical-continuous
independent variables, providing an alternative for discriminant models with categorical variables.
Categorical independent variables are managed by constructing cells from a multinomial table of
categorical values in each group rather than converting them into dummy variables [26]. This research
employs the Support Vector Machine (SVM) approach alongside Discriminant Analysis.
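The cell construction described above can be illustrated with a short sketch. It is only a hedged illustration of the idea of multinomial cells, not the authors' implementation: the function names `build_cells` and `assign_cell`, the variable names, and the example patient are all hypothetical.

```python
from itertools import product

def build_cells(levels):
    """Map every combination of categorical levels to one cell index."""
    combos = product(*(levels[name] for name in levels))
    return {combo: index for index, combo in enumerate(combos)}

def assign_cell(row, levels, cells):
    """Look up the cell of one observation from its categorical values."""
    key = tuple(row[name] for name in levels)
    return cells[key]

# Hypothetical categorical variables (three symptom levels, two genders).
levels = {
    "symptoms": ["mild", "moderate", "severe"],
    "gender": ["male", "female"],
}
cells = build_cells(levels)

# Hypothetical patient; the continuous variable (age) is not used for the cell.
patient = {"symptoms": "moderate", "gender": "female", "age": 41}
cell = assign_cell(patient, levels, cells)
print(len(cells), cell)
```

With three and two levels the table has six cells. In a mixed-variable discriminant model, the continuous variables would then be modelled separately within each cell, which is why no dummy coding is needed.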
Guhathakurata [27] evaluated the performance of a Support Vector Machine (SVM) against
other classification algorithms, namely K-Nearest Neighbour (kNN), Classification Tree (CART), Random
Forest, Naïve Bayes, and AdaBoost, in categorising COVID-19 patient symptoms. The findings indicated
that SVM outperformed the other methods in predictive accuracy. James [28] emphasised SVM's
excellent performance in object classification. The SVM approach aims to identify the hyperplane that
separates the classes with the largest margin. A hyperplane is a decision boundary, defined by a
mathematical function, that distinguishes between different classes.
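In the linear case, the hyperplane and the resulting classification rule can be written compactly. The notation below ($w$ for the weight vector, $b$ for the intercept, $\hat{y}$ for the predicted class) is the conventional textbook form and is an assumption, since the source does not state its own symbols.

$$f(x) = w^\top x + b, \qquad \hat{y} = \operatorname{sign}\big(f(x)\big)$$

$$\text{hyperplane: } \{x : w^\top x + b = 0\}, \qquad \text{margin width: } \frac{2}{\lVert w \rVert}$$

In the separable case, maximising the margin is therefore equivalent to minimising $\lVert w \rVert$ subject to every training point lying on the correct side of the boundary.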
Scholars have studied mixed independent variables in Discriminant Analysis and Support Vector Machines
(SVM). Mahat [29] studied the process of selecting continuous variables in Discriminant Analysis with
mixed independent variables. Mbina [25] investigated variable selection in Discriminant Analysis with
mixed categorical-continuous independent variables. Neither study addressed the handling of imbalanced
class data or the use of Information Value for variable selection. Guhathakurata [27] reported that
SVM is the most effective method for categorising COVID-19 symptoms but did not address
imbalanced classes or variable selection. Anggrawan [30] explains the use of
SMOTE to overcome the problem of imbalanced data in SVM but does not describe the variable selection
approach, notably the use of Information Value. This study investigates the classification
outcomes of the two techniques using data on the duration of COVID-19 recovery in West Sumatra, which
include mixed independent variables and imbalanced class data.
The variables used in the study include mixed categorical and continuous independent variables, as shown
in Table 1.
Table 1. Description of dataset
Variable  Feature                                               Type         Description
Y         Duration of recovery of patients from COVID-19        Categorical  1: ≤ 14 days; 0: > 14 days
X1        Duration of COVID-19 Symptoms Disappearing            Continuous   Days
X2        Age                                                   Continuous   Years
X3        Duration of Consumption                               Continuous   Days
X4        Amount of Sungkai Leaves Consumed in the Potion       Continuous   Leaves
X5        Symptoms Experienced during COVID-19 Infection        Categorical  Mild, moderate, severe
X6        Number of Glasses of Sungkai Leaf Potion              Continuous   Glasses per day
X7        Daily Intensity of Drinking the Sungkai Leaf Potion   Continuous   Intensity per day
X8        Gender                                                Categorical  Male, female
Information Value
Information value (IV) is a commonly used technique for selecting independent variables in classification
algorithms with binary response variables. [31] The Information Value (IV) is computed by analysing data
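for each independent variable, which is segmented into certain intervals referred to as bins
$\{B_1, B_2, B_3, \ldots, B_B\}$. Next, calculate the information value using equation [32].
$$\mathrm{pos}_b = \frac{|\{(x_i, y_i) : x_i \in B_b \text{ and } y_i = g_2\}|}{|\{(x_i, y_i) : y_i = g_2\}|}$$
$$\mathrm{neg}_b = \frac{|\{(x_i, y_i) : x_i \in B_b \text{ and } y_i = g_1\}|}{|\{(x_i, y_i) : y_i = g_1\}|}$$
$$\mathrm{IV} = \sum_{b=1}^{B} (\mathrm{pos}_b - \mathrm{neg}_b) \times \log\left(\frac{\mathrm{pos}_b}{\mathrm{neg}_b}\right)$$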
$x_i$ represents data on the independent variable, $y_i$ represents data on the dependent variable, $B_b$ represents
the $b$th bin, and $g_1$ and $g_2$ represent the categories of the dependent variable. The IV indicates, before the
classification model is constructed, whether an independent variable is a useful or a poor predictor of the
dependent variable. Stojanovic [33] classifies IV values into several categories, as displayed in Table 2.
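The information value formula above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the study's code: the function name `information_value`, the bin edges, the toy data, and the small floor `eps` (added here only to avoid taking the logarithm of zero for empty bins) are all hypothetical. Both classes are assumed to be present in the data.

```python
import math

def information_value(x, y, edges, pos_label=1, eps=1e-6):
    """Sum over bins of (pos_b - neg_b) * log(pos_b / neg_b)."""
    n_bins = len(edges) + 1
    pos_counts = [0] * n_bins
    neg_counts = [0] * n_bins
    for xi, yi in zip(x, y):
        b = sum(xi > edge for edge in edges)  # index of the bin containing xi
        if yi == pos_label:
            pos_counts[b] += 1
        else:
            neg_counts[b] += 1
    pos_total, neg_total = sum(pos_counts), sum(neg_counts)
    iv = 0.0
    for p, n in zip(pos_counts, neg_counts):
        pos_b = max(p / pos_total, eps)  # share of the positive class in bin b
        neg_b = max(n / neg_total, eps)  # share of the negative class in bin b
        iv += (pos_b - neg_b) * math.log(pos_b / neg_b)
    return iv

# Hypothetical toy variable split into two bins at the value 5.
x = [1, 1, 1, 9, 1, 9, 9, 9]
y = [1, 1, 1, 1, 0, 0, 0, 0]
iv = information_value(x, y, edges=[5])
print(round(iv, 4))
```

A variable whose bins contain the same share of both classes contributes zero, so larger values indicate stronger separation between the two recovery groups.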
Resampling Data
Resampling is a technique used to address the issue of imbalanced data. Imbalanced data refers to a
situation where the response variable contains a majority class and a minority class [34]. The majority class
contains more data than the minority class, resulting in an imbalance in the distribution of data points
between the two classes [35]. Imbalanced data can result in models that assign most observations
to the majority class and show low sensitivity to the minority class [36]. The study used
three resampling techniques to address data imbalance: undersampling, oversampling, and the Synthetic
Minority Oversampling Technique (SMOTE).
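The three resampling techniques can be sketched with the standard library alone. This is a simplified illustration, not the procedure or software used in the study: the function names, the toy data, and the neighbour count `k=3` are assumptions, and the SMOTE sketch handles continuous features only.

```python
import random

def random_undersample(majority, minority, rng):
    """Keep a random subset of the majority class as large as the minority class."""
    return rng.sample(majority, len(minority)), list(minority)

def random_oversample(majority, minority, rng):
    """Duplicate randomly chosen minority rows until both classes are the same size."""
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority), list(minority) + extra

def smote(majority, minority, rng, k=3):
    """Add synthetic minority rows on the segment between a minority row
    and one of its k nearest minority neighbours (continuous features only)."""
    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    synthetic = []
    for _ in range(len(majority) - len(minority)):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [m for j, m in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda m: squared_distance(base, m))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return list(majority), list(minority) + synthetic

# Hypothetical toy data: 10 majority rows and 4 minority rows with two features.
rng = random.Random(0)
majority = [(float(i), float(i % 3)) for i in range(10)]
minority = [(20.0, 1.0), (21.0, 2.0), (22.0, 1.5), (23.0, 0.5)]

under = random_undersample(majority, minority, rng)
over = random_oversample(majority, minority, rng)
synth = smote(majority, minority, rng)
print([len(c) for c in under], [len(c) for c in over], [len(c) for c in synth])
```

Undersampling discards majority rows, oversampling repeats existing minority rows, and SMOTE creates new minority rows that lie between existing ones, so only SMOTE introduces values not observed in the data.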
$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$
The confusion matrix displays the four possible combinations of predicted and actual values. The symbols $\pi_i$,
$i = 1, 2$, denote the individual categories of the response variable. True positive (TP) is the count of observations
correctly predicted to be in the first category. [41] False positive (FP) refers to the number of observations
predicted to be in the first category that belong to the second category. [42] False negative (FN) refers to the
number of observations predicted to be in the second category that belong to the first category [43]. True
negative (TN) is the count of observations correctly predicted to be in the second category. [44] Several
metrics for evaluating the model's performance are calculated from the confusion matrix [45]. Sensitivity and
specificity are used to compute balanced accuracy, which is especially useful when the response variable
is imbalanced [46].
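For reference, sensitivity and balanced accuracy can be written in the same notation as the specificity formula. These are the standard definitions, and they agree with the worked calculations reported in the results.

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}$$

$$\mathrm{Balanced\ accuracy} = \frac{\mathrm{Sensitivity} + \mathrm{Specificity}}{2}$$

Because each class contributes equally to the average, a model that assigns every observation to the majority class scores only 0.5, however large that class is.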
Analysis Flowchart
Data Exploration
This study uses mixed independent variables, including continuous and categorical variables. Figure 2
displays a summary of the continuous independent variables.
Figure 2 displays a boxplot for each continuous independent variable. Outliers were observed regarding
age, duration of COVID-19 symptoms fading, number of sungkai leaves consumed in the concoction, and
number of glasses of sungkai leaf concoction. Data exploration of categorical independent variables may
be found in Figure 3.
Preprocessing Data
The information value of each independent variable is used to select the independent variables that affect
the dependent variable. The information value indicates the impact of the independent variable on the
dependent variable. The information values are displayed in Table 4.
Table 4 indicates that, of the eight independent variables, two have strong predictive
power, three have moderate predictive power, one has weak predictive power, and two have no predictive
power. This study uses the independent variables categorised as strong or moderate predictors based on
their information value: age, duration of COVID-19 symptom
resolution, duration of sungkai leaf intake, symptoms during COVID-19 infection, and the quantity of
sungkai leaves consumed in the herbal remedy. A covariance homogeneity test must be carried out
before performing Discriminant Analysis, so the test was conducted on the
continuous independent variables using Box's M test [48]. The Box's M test
yielded a p-value of 0.333. Because the p-value is greater than α (0.05), the data meet
the condition of covariance homogeneity.
The response variable is unevenly distributed across its classes, so the data require a method for handling
imbalance. An uneven distribution of the response variable can bias the model
towards assigning objects to the predominant class, diminishing prediction accuracy [49]. The
dataset was divided into 75% training data and 25% testing data before modelling. Training data are
used to construct the model, whereas testing data are used to assess it. Three
resampling techniques were applied to address the imbalance:
undersampling, oversampling, and SMOTE. Table 5 displays the quantity of data following resampling.
Table 5 shows that each of the three techniques balances
the classes of the response variable.
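The order of operations described above (split first, then resample the training data) can be sketched as follows. This is a hedged illustration on synthetic rows, not the study's code: the function names, the seed, and the label pattern are hypothetical, and the study's exact split sizes may differ, for example if the split was stratified.

```python
import random

def train_test_split(rows, train_fraction=0.75, seed=42):
    """Shuffle the rows and cut them into training and testing sets."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

def class_counts(rows):
    """Count how many (id, label) rows fall in each class."""
    counts = {}
    for _, label in rows:
        counts[label] = counts.get(label, 0) + 1
    return counts

def oversample_to_balance(train, seed=42):
    """Duplicate random rows of the smaller classes in the training set only."""
    rng = random.Random(seed)
    by_class = {}
    for row in train:
        by_class.setdefault(row[1], []).append(row)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        balanced.extend(members + extra)
    return balanced

# Hypothetical data: 428 (id, label) rows with an imbalanced binary label.
rows = [(i, 0 if i % 4 == 0 else 1) for i in range(428)]
train, test = train_test_split(rows)
balanced_train = oversample_to_balance(train)
print(len(train), len(test), class_counts(balanced_train))
```

Leaving the testing set untouched keeps the evaluation representative of the original class distribution, which is the situation balanced accuracy is meant to assess.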
Table 6 displays the percentage of each training data set using unbalanced data handling. A discriminant
analysis model was created for each training data set. The discriminant analysis models were compared
using their balanced accuracy scores. The optimal model is the one with the maximum balanced accuracy.
Table 7 displays the balanced accuracy values for each model.
Table 7 displays the balanced accuracy of the four models in predicting the recovery duration of COVID-19
patients in West Sumatra using the testing data. The undersampling strategy in mixed independent
variable Discriminant Analysis reduces accuracy, which differs from earlier studies that found this method
can improve accuracy [16]. Previous research has shown that oversampling and SMOTE are valuable
methods for addressing imbalanced data and improving accuracy. [17] [22] The discriminant analysis
model that uses SMOTE to handle the imbalanced data is considered the best because it has the
highest balanced accuracy, 66.54%.
Table 8 displays the balanced accuracy values of the 12 models used to predict the test data. The use of
undersampling, oversampling, and SMOTE in SVM aligns with prior studies that have demonstrated the
effectiveness of these methods in addressing data imbalance and enhancing accuracy. [16] [17] [22]
The linear kernel SVM with SMOTE is the best-performing model among the 12, with a
balanced accuracy of 63.20%.
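The three kernel types compared in the SVM models can be written as small functions. The parameterisation below (`gamma`, `coef0`, `degree`) follows a common convention and is an assumption; the source does not report the kernel parameters or the software used, and the default values here are purely illustrative.

```python
import math

def linear_kernel(x, z):
    """Inner product of the two feature vectors."""
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, degree=3, gamma=1.0, coef0=1.0):
    """(gamma * <x, z> + coef0) raised to the given degree."""
    return (gamma * linear_kernel(x, z) + coef0) ** degree

def rbf_kernel(x, z, gamma=0.5):
    """exp(-gamma * squared Euclidean distance between x and z)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

# Hypothetical feature vectors.
x, z = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(x, z), polynomial_kernel(x, z, degree=2), rbf_kernel(x, x))
```

The linear kernel yields a flat decision boundary in the original feature space, whereas the polynomial and radial basis function kernels allow curved boundaries.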
Prediction   Actual > 14 days   Actual ≤ 14 days
> 14 days    16                 31
≤ 14 days    7                  54
Sensitivity, specificity, and balanced accuracy values from the confusion matrix are as follows:
$$\mathrm{Sensitivity} = \frac{16}{16 + 7} = 0.6957$$
$$\mathrm{Specificity} = \frac{54}{54 + 31} = 0.6353$$
$$\mathrm{Balanced\ accuracy} = \frac{\mathrm{Sensitivity} + \mathrm{Specificity}}{2} = \frac{0.6957 + 0.6353}{2} \approx 0.6654$$
The optimal SVM model, utilizing the SMOTE resampling method, yields the confusion matrix results
presented in Table 10.
Sensitivity, specificity, and balanced accuracy values from the confusion matrix are as follows:
$$\mathrm{Sensitivity} = \frac{15}{15 + 8} = 0.6522$$
$$\mathrm{Specificity} = \frac{52}{52 + 33} = 0.6118$$
$$\mathrm{Balanced\ accuracy} = \frac{\mathrm{Sensitivity} + \mathrm{Specificity}}{2} = \frac{0.6522 + 0.6118}{2} = 0.6320$$
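The arithmetic for both models can be checked with a short script. The counts are taken from the calculations reported above; the function name `metrics` and its argument names are illustrative.

```python
def metrics(tp, fn, tn, fp):
    """Return sensitivity, specificity, and balanced accuracy from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    balanced_accuracy = (sensitivity + specificity) / 2
    return sensitivity, specificity, balanced_accuracy

# Counts from the two worked calculations in the text.
da = metrics(tp=16, fn=7, tn=54, fp=31)   # Discriminant Analysis with SMOTE
svm = metrics(tp=15, fn=8, tn=52, fp=33)  # linear kernel SVM with SMOTE

for name, (sens, spec, bal) in [("DA", da), ("SVM", svm)]:
    print(f"{name}: sensitivity={sens:.4f} specificity={spec:.4f} balanced={bal:.4f}")
```

The script reproduces the reported values to within rounding in the last decimal place, and confirms that Discriminant Analysis scores higher than the SVM on all three measures.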
Table 11 compares the best model from each of the two methods, based on the evaluation results
above.
Table 11. Comparative evaluation of model fit
                                         Goodness of fit value
No  Method                               Balanced Accuracy  Sensitivity  Specificity
1   Discriminant Analysis with SMOTE     0.6654             0.6957       0.6353
2   SVM Linear Kernel with SMOTE         0.6320             0.6522       0.6118
Table 11 displays the goodness of fit for each method. The Discriminant Analysis applied to imbalanced
data with the SMOTE method yielded a sensitivity of 69.57%, specificity of 63.53%, and balanced
accuracy of 66.54%. The linear kernel Support Vector Machine with SMOTE achieved a sensitivity of
65.22%, specificity of 61.18%, and balanced accuracy of 63.20%.
CONCLUSION
Analysis results indicate that addressing data imbalance using SMOTE yields the highest balanced accuracy
for both approaches in this dataset. Discriminant Analysis with data balancing using SMOTE achieves a
balanced accuracy of 66.54%, whereas the support vector machine with a linear
kernel and data balancing using SMOTE achieves a balanced accuracy of 63.20%. The results indicate that the
discriminant analysis model outperforms the support vector machine on this dataset.
Two recommendations for future research follow from these findings. First, future research should investigate the
impact of underlying disorders or comorbidities on the duration of COVID-19 recovery using COVID-19
data. Second, discriminant analysis and support vector
machine (SVM) approaches could be applied at a spatial level, incorporating mixed independent variables.
REFERENCES
[1] Z. Wu and J. M. McGoogan, “Characteristics of and Important Lessons From the Coronavirus
Disease 2019 (COVID-19) Outbreak in China,” JAMA, vol. 323, no. 13, p. 1239, 2020, doi:
10.1001/jama.2020.2648.
[2] P. K. Perera and A. C. B. Meedeniya, “Curcumin as a Potential Treatment for COVID-19,” Front.
Pharmacol., vol. 12, no. September 2021, pp. 1–10, 2021, doi: 10.3389/fphar.2021.675287.
[3] Ö. Güngör and H. Baykal, “Attitudes toward herbal medicine for COVID-19 in healthcare workers:
A cross-sectional observational study,” Med. (United States), vol. 102, no. 38, p. E35176, 2023,
doi: 10.1097/MD.0000000000035176.
[4] J. Ren, A. Zhang, and X. Wang, “Traditional Chinese Medicine for COVID-19 Treatment,”
Pharmacol. Res., p. 104743, 2020, doi: 10.1016/j.phrs.2020.104743.
[5] R. F. Noor’An, Karmilasanti, and C. B. Wiati, “Potential and distribution of Vitex sp and Peronema
canescens jack as anti -COVID 19 plants in East Kalimantan Province, Indonesia,” IOP Conf. Ser.
Earth Environ. Sci., vol. 886, no. 1, 2021, doi: 10.1088/1755-1315/886/1/012030.
[6] A. P. Yani, A. Ruyani, I. Ansyori, and R. Irwanto, “UJI POTENSI DAUN MUDA SUNGKAI
(Peronema canescens) UNTUK KESEHATAN (IMUNITAS) PADA MENCIT (Mus.muculus)
The Potential Test of Sungkai Young Leaves (Peronema canescens) to Maintain Goodhelth
(Immunity)in Mice (Mus musculus),” Semin. Nas. XI Pendidik. Biol. FKIP UNS 245, pp. 245–250,
2014.
[7] M. Kakehashi and S. Kawano, Fundamentals of Mathematical Models of Infectious Diseases and
Their Application to Data Analyses, 1st ed., vol. 36. Elsevier B.V., 2017. doi:
10.1016/bs.host.2017.06.002.
[8] S. A. Lauer et al., “The incubation period of coronavirus disease 2019 (CoVID-19) from publicly
reported confirmed cases: Estimation and application,” Ann. Intern. Med., vol. 172, no. 9, pp. 577–
582, 2020, doi: 10.7326/M20-0504.
[9] C. Elias, A. Sekri, P. Leblanc, M. Cucherat, and P. Vanhems, “The incubation period of COVID-
19: A meta-analysis,” Int. J. Infect. Dis., vol. 104, pp. 708–710, 2021, doi:
10.1016/j.ijid.2021.01.069.
[10] E. Zdravevski, P. Lameski, A. Kulakov, and D. Gjorgjevikj, “Feature selection and allocation to
diverse subsets for multi-label learning problems with large datasets,” 2014 Fed. Conf. Comput.
Sci. Inf. Syst. FedCSIS 2014, vol. 2, pp. 387–394, 2014, doi: 10.15439/2014F500.
[11] A. J. Izenman, Linear Discriminant Analysis 8.1. 2013. doi: 10.1007/978-0-387-78189-1.
[12] Y. Sun, A. K. C. Wong, and M. S. Kamel, “Classification of imbalanced data: A review,” Int. J.
Pattern Recognit. Artif. Intell., vol. 23, no. 4, pp. 687–719, 2009, doi:
10.1142/S0218001409007326.
[13] R. Van Den Goorbergh, M. Van Smeden, D. Timmerman, and Ben Van Calster, “The harm of class
imbalance corrections for risk prediction models: Illustration and simulation using logistic
regression,” J. Am. Med. Informatics Assoc., vol. 29, no. 9, pp. 1525–1534, 2022, doi: