
International Journal of Current Research and Review | Research Article
DOI: http://dx.doi.org/10.31782/IJCRR.2021.SP173
IJCRR Section: Healthcare
ISI Impact Factor (2019-20): 1.628; IC Value (2019): 90.81; SJIF (2020) = 7.893

Prediction of COVID-19 Possibilities using K-Nearest Neighbour Classification Algorithm

Prasannavenkatesan Theerthagiri*, Jeena Jacob I, Usha Ruby A, Vamsidhar Yendapalli
Department of Computer Science and Engineering, GITAM School of Technology, GITAM University, Bengaluru-561203, India

Copyright@IJCRR

ABSTRACT
Introduction: COVID-19 is an acute respiratory illness that directly affects the lungs. There is a pressing need to predict the possibility of occurrence of COVID-19 in patients based on their characteristics.
Objective: This paper studies different machine learning classification algorithms to predict the COVID-19 recovered and deceased cases.
Methods: The k-fold cross-validation resampling technique is used to validate the prediction model. The prediction scores of each algorithm are evaluated with performance metrics such as prediction accuracy, precision, recall, mean square error, confusion matrix, and kappa score. For the preprocessed dataset, the k-nearest neighbour (KNN) classification algorithm produces 80.4% prediction accuracy, a 1.5 to 3.3% improvement over the other algorithms.
Results: The KNN algorithm predicts 92% (true positive rate) of the deceased cases correctly, with 0.077% misclassification. Further, the KNN algorithm produces the lowest error rate, 0.19, on the prediction of accurate COVID-19 cases among the compared algorithms. It also produces a receiver operating characteristic curve with an output value of 82%.
Conclusion: Based on the prediction results of various machine learning classification algorithms on the COVID-19 dataset, this paper shows that the KNN algorithm predicts COVID-19 possibilities better for the smaller (730 records) dataset than the other algorithms.
Key Words: COVID-19, Prediction, Classification, Machine learning algorithms, KNN

INTRODUCTION

COVID-19 is a disease caused by a virus called coronavirus.1-3 It became a global epidemic disease, according to the World Health Organisation (WHO). It started in Wuhan, China, at the end of 2019. The symptoms of this disease at an early stage are cough, fever, fatigue, and myalgias.1 Later, patients suffer from heart damage, respiratory problems, and secondary infections. COVID-19 spreads very fast because it is transmitted through contact, contaminated surfaces, and infected fluids. When the condition of a patient worsens with respiratory issues, the patient needs to be treated in an intensive care unit with ventilation.

The mortality of this disease increases day by day, and the disease has become a big threat to humankind across the entire world. Along with clinical research, the analysis of related data will help mankind. Many types of research have already been done on the Computed Tomography (CT) images of patients, their symptom-based analysis, and the influencing factors.4,5 Research on CT images was done to identify the characteristics of the disease and also to diagnose the disease early. CT images of COVID-19 cases have similarities in terms of inward and circular diffusion.4

The classifications of COVID-19 are Influenza-A viral pneumonia, COVID-19, and healthy.4 That research was done on 618 CT images, with 224 images of Influenza patients, 219 images of COVID-19 patients, and 175 of healthy humans, and achieved 87.6% accuracy. Another study was done for segmenting and quantifying the infection in CT images.5 They used CT images of the chest and lungs and implemented the method using deep learning techniques. They used 249 images for training and 300 images for testing and achieved an accuracy of 91.6%. Pathological tests and the analysis of CT images take some time, so research has also been done on the possibility of disease prediction based on symptoms. This work uses classification techniques for predicting the possibility of occurrence of COVID-19 based on patient characteristics.

Corresponding Author:
Prasannavenkatesan Theerthagiri, Department of Computer Science and Engineering, GITAM School of Technology, GITAM University,
Bengaluru-561203, India; Email: [email protected]
ISSN: 2231-2196 (Print) ISSN: 0975-5241 (Online)
Received: 01.11.2020 Revised: 01.01.2021 Accepted: 29.01.2021 Published: 30.03.2021


Most of the existing works concentrate on COVID-19 prediction using images. This work proposes patient-data-based prediction of COVID-19 possibilities (recovered or deceased) using the KNN classification algorithm. The prediction performance over 730 records of the COVID-19 patient dataset is evaluated using the KNN algorithm. Further, in this work, the MSE rate, kappa scores, and classification report are analysed for the proposed KNN algorithm. The results of this research work suggest that the proposed KNN algorithm produces better results for the smaller dataset than other algorithms.

The organization of this paper is as follows. Section 2 gives the related work on classification techniques. Section 3 discusses the different machine learning models. Section 4 analyses the performance metrics, experimental analyses, and results, and Section 5 gives the concluding remarks with future work.

The emergence of Artificial Intelligence (AI) has transformed the world in all fields. Machine learning (ML), a subset of AI, helps humans find solutions for highly complex problems and also plays a vital role in making human life sophisticated. The application areas of ML include business applications, intelligent robots, autonomous vehicles (AV), healthcare, climate modelling, image processing, natural language processing (NLP), and gaming. ML mimics human intelligence and is implemented based on the trial-and-error method; the instructions to an algorithm were traditionally given mainly using control statements such as a conditional if.6 Many prediction-based algorithms are available in ML.7 ML techniques are used for classification and prediction in various fields like disease prediction, the stock market, weather forecasting, and business.

In the medical field, too, many ML algorithms are used for disease prediction,8 such as coronary artery disease,9 predicting cardiovascular disease,10 and prediction of breast cancer.11 Several types of research have also been done for COVID-19 confirmed case live forecasting12 and for predicting the COVID-19 outbreak.13 These works will aid the higher authorities of a country in taking decisions to handle the situation by foreseeing it.14 At first, COVID-19 was misinterpreted as pneumonia,15 but multi-organ failure and high mortality rates made it a pandemic across the whole world.16

Classification techniques are broadly categorized into semi-supervised,17,18 supervised,19 and unsupervised.20-23 Supervised learning takes information about the classes and learns based on that information; based on this knowledge, the technique can predict the classes of new data. In unsupervised learning, the information about the classes is unknown, and similar data are clustered by identifying the similarity among them. Semi-supervised techniques know some of the class information, and the classification is done based on it. Logistic regression is used for relationship analysis between various dependent variables.24,25 Basically, it was used for identifying the existence of a class or event and was later extended to classify more objects. An artificial neural network (ANN) is based on learning and classifies effectively.26-31 Here, the nodes are arranged in an input layer, hidden layers, and an output layer; the number of hidden layers varies with the objective function. The Support Vector Machine (SVM) is another classification technique that separates the variables using a hyperplane.32-35

Many classification and prediction algorithms have been applied to study the possibility of spreading COVID-19. Research on the occurrence of asymptomatic infection found it to be higher (15.8%) in children under 10 years.36 Some studies have worked on identifying the symptoms and found that reduced senses of taste and smell are signs of COVID-19.37 Another work studied the transmission process of this disease.38

In recent years, predictive medical analysis using machine learning techniques has shown tremendous growth with promising results. Machine learning algorithms are effectively applied in numerous types of applications in diverse fields. Many kinds of research have proved that machine learning predictive algorithms provide better assistance for clinical support as well as for decision making based on patient data.39 In the healthcare field, disease predictive analysis is one of the most useful and supportive applications of machine learning prediction algorithms. This research work applies predictive disease analysis using the KNN machine learning prediction algorithm to the novel COVID-19 disease.

The contributions of the proposed work are listed as follows:
• This research work investigates COVID-19 patient data to assess the outcome possibilities of the patient.
• The KNN classification algorithm is proposed in this work to predict the outcome possibilities of patients, such as recovered or deceased.
• The prediction results of the proposed KNN algorithm are evaluated for the accuracy rate, mean square error rate, Kappa score, the area under the curve, indices, sensitivity, specificity, and F1 score values.
• This work considers only two parameters of the patients, with 730 records, and the KNN algorithm based outcome prediction results are compared to other algorithms.


MATERIALS AND METHODS

Data Preprocessing and Cleaning
The COVID-19 dataset from Kaggle is used for the predictive analysis in this research work.40 The dataset was cleaned using data preprocessing and data cleaning methodologies, and the resulting dataset was then used for a number of experiments over different classification algorithms. The COVID-19 dataset contains the patients' details with recovered and deceased status. The vital patient information is used to diagnose and predict the COVID-19 disease among the infected population.

The considered COVID-19 dataset contains 100284 records. The dataset contains features of patients such as patient number, state patient number, date announced, estimated onset date, age bracket, gender, detected city, detected district, detected state, state code, current status, notes, contracted from which patient (suspected), nationality, type of transmission, status change date, source_1, source_2, source_3 (source of patient information), backup notes, num cases, and entry_id.40

The data preprocessing and cleaning process (mean-imputation technique) removes the missing and outlier data values from the dataset. After preprocessing, the dataset is reduced to 730 records with the three required relevant features of patient details; 99554 records were missing required essential values. Of the 730 patient records, 156 cases are in the class 'recovered from COVID-19 disease' and 574 cases are in the class 'deceased by the COVID-19 disease'. Two numerical features from the dataset are taken as the input attributes, and one feature is considered as the output attribute. The COVID-19 patient information is presented in Table 1.

Table 1: Sample records of the cleaned dataset
Age  Gender  Outcome
13   Female  Recovered
96   Male    Recovered
89   Female  Recovered
85   Male    Recovered
27   Male    Recovered
69   Female  Deceased
26   Male    Recovered
65   Male    Deceased
76   Male    Deceased
45   Female  Recovered

The patient features age and gender are considered as input variables, and the outcome is taken as the output variable. The features are: 1. Age denotes the patient's age at the time of infection by the COVID-19 virus; 2. Gender classifies whether the patient is male or female; 3. Outcome denotes whether the patient has recovered from COVID-19 disease or deceased due to COVID-19 disease. Figure 1(a) illustrates the population infected by COVID-19 with respect to age. Figure 1(b) and Figure 1(c) depict the count plots of gender and outcome of COVID-19, respectively.

Figure 1(a): Population vs Age.

Figure 1(b): Gender.

Figure 1(c): Outcome.
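A minimal sketch of the preprocessing step described above, assuming a pandas workflow. The file name and raw column names are assumptions made for illustration; the paper only states that the Kaggle dataset40 is reduced to the Age, Gender, and Outcome features with mean imputation.

```python
import pandas as pd

# Assumed file and column names; the paper only states that the Kaggle
# COVID-19 dataset is reduced to Age, Gender and Outcome.
df = pd.read_csv("covid19_india.csv")

# Keep the three relevant features described in the paper.
df = df[["age_bracket", "gender", "current_status"]]
df.columns = ["Age", "Gender", "Outcome"]

# Retain only the two outcome classes used for classification.
df = df[df["Outcome"].isin(["Recovered", "Deceased"])]

# Mean imputation for the numeric feature, as mentioned in the paper,
# and removal of rows that still lack required values.
df["Age"] = pd.to_numeric(df["Age"], errors="coerce")
df["Age"] = df["Age"].fillna(df["Age"].mean())
df = df.dropna(subset=["Gender", "Outcome"])

# Encode the categorical columns for the classifiers.
df["Gender"] = df["Gender"].map({"Male": 1, "Female": 0})
df["Outcome"] = df["Outcome"].map({"Recovered": 1, "Deceased": 0})
df = df.dropna()

print(df.shape)  # the cleaned dataset in the paper has 730 records
```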
This research work analyses the prediction of recovered and deceased patients infected by COVID-19. Different classification models are applied to the COVID-19 dataset, and their performance in terms of accuracy, error rates, and other metrics is evaluated. The classifiers evaluated in this research work are Logistic Regression (LR), K-Nearest Neighbors Classifier (KNN), Decision Tree (DT), Support Vector Machines (SVM), and Multi-Layer Perceptron (MLP).

Logistic Regression (LR)
Logistic regression is one of the simple and powerful prediction algorithms. Logistic regression uses the sigmoid function for predictive modelling of the given problem: it models the dataset by mapping values into the range between 0 and 1. Logistic regression performs the predictive analysis based on the relationship between the binary dependent variable and one or more independent variables from the given dataset. To predict the output value (Y), the input values (X1, X2, ..., Xn) are linearly combined using the coefficient values.41 Considering 'Y' as the output prediction variable and X1 and X2 as input variables, the logistic regression equation is given as (1),

Y = (1/2) [ e^(mX1 + c) / (1 + e^(mX1 + c)) + e^(mX2 + c) / (1 + e^(mX2 + c)) ]    (1)

where 'c' represents the intercept and 'm' is the coefficient of the input values X1 and X2 (in our case, X1 and X2 are age and gender). The coefficient value 'm' is learned from the training dataset for each input value (X1, X2).41 This work classifies the deceased and recovered cases of the COVID-19 disease using equation (1).
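The following sketch shows how a logistic regression classifier of this kind could be fitted on the two input features. scikit-learn and the cleaned frame `df` from the preprocessing sketch above are assumed; this is an illustration, not the authors' code.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# X holds the two inputs (X1 = Age, X2 = Gender), Y the Recovered/Deceased label.
X = df[["Age", "Gender"]].values
Y = df["Outcome"].values

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.30, random_state=42)

# The sigmoid of the linear combination m*X + c in equation (1) is what
# LogisticRegression fits internally (coef_ plays the role of m, intercept_ of c).
lr = LogisticRegression()
lr.fit(X_train, Y_train)

print("coefficients m:", lr.coef_, "intercept c:", lr.intercept_)
print("test accuracy:", lr.score(X_test, Y_test))
```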


K-Nearest Neighbors (KNN) Classifier
The K-Nearest Neighbors algorithm is a non-parametric algorithm. The learning and prediction analysis is performed based on the given problem or dataset. In the KNN classification model, the prediction is purely based on neighbouring data values, without any assumption about the dataset. In KNN, 'K' represents the number of nearest neighbour data values. Based on 'K', i.e., the number of nearest neighbours, the decision is made by the KNN algorithm on classifying the given dataset.41 The KNN model classifies directly against the training dataset: the prediction for a new instance is made by searching for the 'K' most similar instances in the entire training set and classifying based on the majority class of those instances. A similar instance is determined using the Euclidean distance formula. The Euclidean distance is the square root of the sum of squared differences between the new instance (xi) and the existing instance (xj),42

Euclidean(i, j) = sqrt( Σ_{k=1}^{n} (x_ik − y_jk)^2 )    (2)
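A sketch of the KNN classifier under the same assumptions as the logistic regression snippet (scikit-learn and the same train/test split). The settings k = 2 and the Euclidean metric follow the values reported later in the paper.

```python
from sklearn.neighbors import KNeighborsClassifier

# k = 2 nearest neighbours with the Euclidean distance of equation (2).
knn = KNeighborsClassifier(n_neighbors=2, metric="euclidean")
knn.fit(X_train, Y_train)

# A new instance is classified by the majority class of its 2 closest
# training instances.
print("test accuracy:", knn.score(X_test, Y_test))
```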
Decision Tree (DT)
Decision tree algorithms are powerful prediction models used for both classification and regression problems. Decision tree models are represented in the form of a binary tree; that is, the given problem/dataset is solved by splitting or classifying it as a binary tree. In the decision tree, the prediction is made by taking the root node of the binary tree with a single input variable (x), splitting the dataset based on that variable, and treating the leaf nodes of the binary tree as the resulting output variable (y). That is, from the root node, the tree is traversed through each branch and its divisions, and the prediction is made based on the leaf nodes. It uses a greedy method for splitting the dataset in a binary manner.43

In this research work, the COVID-19 dataset with two inputs (x), age and gender, is taken, and the output is whether the patient is recovered or deceased. The decision tree classification algorithm uses the Gini index function to determine the impurity level of the leaf nodes for the predictions. The Gini index function (G) is given in equation (3),

G = Σ_{k=1}^{n} x_k (1 − x_k)    (3)

where x_k is the proportion of training instances in the input class 'k'. The binary tree representation of the dataset makes predictions straightforward.43

Support Vector Machines (SVM)
The support vector machine can handle categorical and continuous variables. The SVM model also works well on classification and regression problems. The support vector machine is a classification algorithm that creates hyperplanes for each class label in the multidimensional space by employing margin values. The SVM intends to maximize the margins among different classes by optimally separating hyperplanes.44 The hyperplane is defined by data instances of the given dataset used as the support vectors. The margin is the maximum distance between the support vector and the hyperplane.44 If the given dataset is linearly bounded, then a linear SVM can be adopted; if the dataset is non-linearly bounded, then a non-linear SVM can be adopted for the classification task.45

Let us consider a dataset (A1, B1, ..., An, Bn), where (A1, ..., An) is the set of input variables, (B1, ..., Bn) is the output variable, and 'C' is the intercept; then the SVM classifier44 is given as in equation (4),

SVM = Σ_{i=1}^{n} β_i − (1/2) Σ_{i,j=1}^{n} b_i b_j C(a_i, a_j) β_i β_j    (4)

In equation (4), i = 1, 2, 3, ..., n and C = b_i β_i + b_j β_j. The SVM equation (4) is used in this research work to classify the deceased and recovered cases of the COVID-19 disease.
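As a small illustration of equation (3) from the decision tree subsection above, a Gini impurity helper could be written as follows. This is a plain-Python sketch; the decision tree classifier used later delegates this computation to the library.

```python
def gini_index(labels):
    """Gini impurity G = sum_k x_k * (1 - x_k) over the class proportions x_k."""
    n = len(labels)
    if n == 0:
        return 0.0
    proportions = [labels.count(c) / n for c in set(labels)]
    return sum(p * (1.0 - p) for p in proportions)

# Example: a leaf with 3 'Recovered' and 1 'Deceased' case.
print(gini_index(["Recovered", "Recovered", "Recovered", "Deceased"]))  # 0.375
```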

S-159 Int J Cur Res Rev | Vol 13 • Issue 06 • March 2021


Theerthagiri et al: Prediction of COVID-19 possibilities using k- nearest neighbour classification algorithm

Multilayer Perceptron (MLP)
The Multilayer Perceptron algorithm is suitable for classification problems and predictive analysis. The MLP is the classical neural network with one or more layers of hidden neurons. It comprises the input layer (where the data variables are fed), the hidden layer (with functions that operate on the data), and the output layer (which contains the predicted values). The MLP uses back-propagation to learn from the given input and output dataset. The activation function Aj(X, W) of the MLP is the summation of the inputs (Xi) multiplied by their respective weights (Wij), as represented in equation (5). The output function (Oj), with the sigmoid activation function of the MLP back-propagation46 algorithm, is given in equation (6).

A_j(X, W) = Σ_{i=0}^{n} X_i W_ij    (5)

O_j(X, W) = 1 / (1 + e^(−A_j(X, W)))    (6)
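For completeness, a sketch of how the remaining three classifiers could be instantiated with scikit-learn. The Gini criterion, the RBF kernel, and the single hidden layer with a logistic activation are illustrative defaults, not settings reported by the paper.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

# Decision tree split on the Gini impurity of equation (3).
dt = DecisionTreeClassifier(criterion="gini", random_state=42)

# SVM with a kernel C(a_i, a_j) as in equation (4); the RBF kernel is assumed.
svm = SVC(kernel="rbf", probability=True, random_state=42)

# MLP with one hidden layer and the logistic (sigmoid) activation of equation (6),
# trained by back-propagation.
mlp = MLPClassifier(hidden_layer_sizes=(10,), activation="logistic",
                    max_iter=2000, random_state=42)

for name, model in [("DT", dt), ("SVM", svm), ("MLP", mlp)]:
    model.fit(X_train, Y_train)
    print(name, "test accuracy:", model.score(X_test, Y_test))
```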

RESULTS AND DISCUSSION

This section summarizes the prediction results of the logistic regression, k-nearest neighbours classifier, decision tree, support vector machines, and multilayer perceptron algorithms.

Cross-Validation
To evaluate and validate the performance of a machine learning model, resampling methods are adopted. This method estimates the prediction ability of the machine learning algorithm on new, unseen input data. k-fold cross-validation is the resampling procedure used in this work to validate the machine learning models on the limited data sample. Here 'k' represents the number of splits of the data sample; each split of the data sample is called a subsample or sampling group. These subsamples are used to validate the training dataset. In this work, the 'k' value is chosen as 7; therefore, it can be called a 7-fold cross-validation resampling method. The 7-fold cross-validation method intends to reduce the bias of the prediction model.47

Performance Metrics
Typically, the performance of machine learning prediction algorithms is measured using metrics based on the classification algorithm. In this work, the prediction results are evaluated using metrics such as accuracy, mean square error (MSE), root mean square error (RMSE), Kappa score, confusion matrix, the area under the curve (ROC_AUC), classification performance indices, sensitivity, specificity, and F1 score values.

Mean Square Error (MSE): It is the average of the squared differences between the predicted results (Pi) and actual results (Ai). It is calculated using equation (7), where n is the number of samples.48

MSE = (1/n) Σ_{i=1}^{n} (P_i − A_i)^2    (7)

Root Mean Squared Error (RMSE): The RMSE is the square root of the average of the squared differences between predicted and actual results, as given in (8). It depicts the inconsistencies between the observed and predicted values.49

RMSE = sqrt( (1/n) Σ_{i=1}^{n} (P_i − A_i)^2 )    (8)

Accuracy: The accuracy of the prediction algorithm is the ratio of the total number of correct predictions of a class to the actual class of the dataset. Equation (9) calculates the accuracy of the model. Typically, any prediction model produces four different results: true positive (TP), true negative (TN), false positive (FP), and false negative (FN).42

Accuracy = (TP + TN) / (TP + TN + FN + FP)    (9)

Precision: The precision of the prediction algorithm is the number of correctly predicted recovered COVID-19 cases that belong to the actual recovered COVID-19 cases.42,47

Precision = TP / (TP + FP) = True Positive / Total predicted positive

Recall: The recall of the prediction algorithm is the number of correctly predicted recovered COVID-19 cases out of all recovered COVID-19 cases in the dataset. It is the true positive rate.42,47

Recall = TP / (TP + FN) = True Positive / Total actual positive

F1 Score: It is the balanced score (harmonic mean) of both precision and recall.42

F1 Score = 2 × (Precision × Recall) / (Precision + Recall)

Cohen's kappa Score: Cohen's kappa score estimates the consistency of the prediction model. It compares the result of the predicted model with the actual results. It is a statistic with a value between 0 and 1; a value near 1 indicates high consistency.47

K = [ (TP + TN)/N − P_e ] / (1 − P_e), where P_e = [ (TP + FN)(TP + FP) + (TN + FN)(TN + FP) ] / N^2

Confusion matrix: The confusion matrix provides complete insight into the performance of a prediction model. It presents the prediction results in matrix form, with the number of correctly predicted cases, incorrectly predicted cases, and the errors of the incorrect and correct prediction cases.47

Receiver Operating Characteristic (ROC) - Area Under Curve (ROC_AUC): The ROC_AUC curve is a graphical illustration of the performance of the prediction model.47 The ROC curve shows the relationship between the true positive rate and the false positive rate over varying threshold values, where the threshold is the cut-off for the positive predictions of the model. The ROC_AUC curve is plotted with the false positive rate on the x-axis and the true positive rate on the y-axis. Its value ranges from 0 to 1.47
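A sketch of the 7-fold cross-validation and the metrics of equations (7) to (9) using scikit-learn, under the same assumptions as the earlier snippets (the cleaned frame, the 70/30 split, and the fitted `knn` estimator).

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import (accuracy_score, mean_squared_error, cohen_kappa_score,
                             confusion_matrix, classification_report, roc_auc_score)

# 7-fold cross-validation, as chosen in this work; each fold serves once as validation data.
kfold = KFold(n_splits=7, shuffle=True, random_state=42)
cv_scores = cross_val_score(knn, X, Y, cv=kfold, scoring="accuracy")
print("7-fold accuracies:", cv_scores, "mean:", cv_scores.mean())

# Metrics on the held-out 30 % test set.
Y_pred = knn.predict(X_test)
mse = mean_squared_error(Y_test, Y_pred)                  # equation (7)
print("MSE:", mse, "RMSE:", np.sqrt(mse))                 # equation (8)
print("Accuracy:", accuracy_score(Y_test, Y_pred))        # equation (9)
print("Kappa:", cohen_kappa_score(Y_test, Y_pred))
print("Confusion matrix:\n", confusion_matrix(Y_test, Y_pred))
print(classification_report(Y_test, Y_pred))              # precision, recall, F1
# ROC_AUC from the predicted probability of the positive class.
print("ROC_AUC:", roc_auc_score(Y_test, knn.predict_proba(X_test)[:, 1]))
```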
Performance Evaluation
In most research works, the accuracy of the prediction model is taken as one of the common performance metrics while working on a prediction algorithm.42 In this work, the prediction accuracy (that is, whether the COVID-19-infected patient recovered or deceased) of the different machine learning algorithms (logistic regression, k-nearest neighbour, decision tree, support vector machines, and multilayer perceptron) is determined. Each classification model has a different prediction accuracy based on its hyperparameters and a certain level of improvement over other prediction models. This work considers 70% of the dataset for training and 30% of the data samples for testing in the classification algorithms. Each model's accuracy is compared, and the prediction results are summarized in Table 2.

Table 2: Accuracy score of classifiers
S. No  Classifier                      Accuracy (%)  Kappa
1      Logistic Regression (LR)        78.5388       0.4109
2      K Neighbors Classifier (KNN)    80.3653       0.469
3      Decision Tree (DT)              75.3425       0.3043
4      Support Vector Machines (SVM)   78.9954       0.4266
5      Multi-Layer Perceptron (MLP)    77.1689       0.03411

In Table 2, the classification algorithms logistic regression, k-nearest neighbour, decision tree, support vector machines, and multilayer perceptron have prediction accuracies of 78.5388, 80.3653, 75.3425, 78.9954, and 77.1689, respectively. The k-nearest neighbour algorithm predicts the outcome of the COVID-19 cases (based on age and gender) more accurately than the other algorithms. Here, the dataset was tested with several 'k' values, and the 'k' value of 2 separates the dataset into the two outcome groups, recovered and deceased, with reduced errors. It works by calculating the distance between the test data and the training data; each testing data point is then classified based on the distance values. Thus, the KNN algorithm produces a higher classification rate than the other algorithms. Cohen's kappa score for the KNN algorithm is also higher than for the other algorithms; Cohen's kappa score estimates the consistency of the classification algorithm based on its predictions.

Figure 2: Prediction of accuracy.

Figure 2 depicts the accuracy scores of the different classification algorithms. From Figure 2, we can see that the k-nearest neighbour algorithm has the highest accuracy, 80.4%. The KNN algorithm has 1.5 to 3.3% improved accuracy compared to the LR, DT, SVM, and MLP algorithms. The KNN algorithm works by classifying the data points of the COVID-19 dataset based on similarity; closely matching data points are grouped, which increases the accuracy rate of the KNN algorithm.

Table 3 presents the performance error metrics of the various machine learning algorithms. The error metrics mean square error and root mean square error are evaluated for each algorithm.

Table 3: Error metrics of classifiers
S. No  Classifier                      MSE     RMSE
1      Logistic Regression (LR)        0.2146  0.4633
2      K Neighbors Classifier (KNN)    0.1963  0.4431
3      Decision Tree (DT)              0.2466  0.4966
4      Support Vector Machines (SVM)   0.21    0.4583
5      Multi-Layer Perceptron (MLP)    0.2283  0.4778

The logistic regression, k-nearest neighbour, decision tree, support vector machines, and multilayer perceptron algorithms have MSE error rates of 0.2146, 0.1963, 0.2466, 0.21, and 0.2283, respectively. As per Figure 3(a), the KNN classification algorithm produces the lowest error rate, 0.19, on the prediction of accurate COVID-19 cases compared with the other algorithms. The KNN algorithm classifies the testing dataset by calculating the Euclidean distance between the new (testing) instance (xi) and the existing (training) instance (yj); therefore, it results in lower error rates.

Figure 3(a): MSE rates.

Similarly, the KNN's RMSE error rate is also very low (0.44) compared to the error rates of the LR (0.46), DT (0.50), SVM (0.45), and MLP algorithms. As depicted in Figure 3(b), the KNN classification algorithm produces the highest consistency among the evaluated algorithms, 0.47. The SVM algorithm offers the next highest consistency (0.42) in correctly predicting the COVID-19 cases. Moreover, the prediction of the decision tree algorithm has the lowest consistency value, 0.30. The LR and MLP have consistency values of 0.41 and 0.34, respectively.

Figure 3(b): Cohen's kappa scores.
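The comparison in Table 2 and Table 3 could be reproduced with a loop over the five fitted classifiers from the earlier sketches, as outlined below. The printed numbers depend on the split and hyperparameters, so this is an illustrative sketch rather than the authors' exact script.

```python
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score, mean_squared_error

# lr, knn, dt, svm and mlp are the estimators fitted in the earlier sketches.
models = {"LR": lr, "KNN": knn, "DT": dt, "SVM": svm, "MLP": mlp}

for name, model in models.items():
    pred = model.predict(X_test)
    acc = 100.0 * accuracy_score(Y_test, pred)
    kappa = cohen_kappa_score(Y_test, pred)
    mse = mean_squared_error(Y_test, pred)
    print(f"{name}: accuracy={acc:.4f}%  kappa={kappa:.4f}  "
          f"MSE={mse:.4f}  RMSE={np.sqrt(mse):.4f}")
```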


Figure 4(a): Normalized confusion matrix.

Figure 4(b): Confusion matrix (no normalization).

Figure 4(a) illustrates the normalized confusion matrix of the k-nearest neighbour classification algorithm. In all classification algorithms, 30% of the data samples are taken for testing with the 70% training dataset. In Figure 4(a), the x-axis represents the percentage of predicted values, and the y-axis represents the percentage of true values. It can be seen that the KNN algorithm predicts 92% (true positive) of the deceased cases correctly, with 0.077% (false positive) misclassification. Similarly, Figure 4(b) depicts the confusion matrix without normalization, where 160 cases are correctly predicted as deceased cases and 13 cases are misclassified. Further, 31 cases are correctly predicted as deceased cases and 31 cases are misclassified. Also, it correctly predicts 29 patients (true negative) as recovered cases, while 21 cases are misclassified (false negative).

Figure 5: ROC_AUC curve.

Figure 5 is the pictorial representation of the relationship between the false positive rate and the true positive rate in the form of the ROC area under the curve. The k-nearest neighbour classification algorithm produces the highest value, 0.89, as compared with the LR, DT, SVM, and MLP algorithms.

Figure 6: Summary of performance metrics scores of the KNN algorithm.

Figure 6 summarizes the performance metrics such as precision, recall, and the confusion matrix of the k-nearest neighbour classification algorithm. The KNN algorithm produces a precision (true positive rate) value of 0.82 for the recovered cases and 0.72 for the deceased cases. The recall values for the recovered and deceased cases are 0.92 and 0.50, respectively. Further, the F1 scores for the recovered and deceased cases are 0.87 and 0.59, respectively.
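A sketch of how the normalized confusion matrix of Figure 4(a) and the ROC curve of Figure 5 could be produced. matplotlib and scikit-learn are assumed, along with the fitted `knn` estimator from the earlier sketches.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, roc_curve, auc

# Row-normalized confusion matrix: each row sums to 1 (true-class rates).
cm = confusion_matrix(Y_test, knn.predict(X_test), normalize="true")
ConfusionMatrixDisplay(cm, display_labels=["Deceased", "Recovered"]).plot()

# ROC curve from the predicted probability of the positive class.
fpr, tpr, _ = roc_curve(Y_test, knn.predict_proba(X_test)[:, 1])
plt.figure()
plt.plot(fpr, tpr, label=f"KNN (AUC = {auc(fpr, tpr):.2f})")
plt.plot([0, 1], [0, 1], linestyle="--")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```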


CONCLUSION AND FUTURE ENHANCEMENTS

Predictive disease analysis is a major application area. This work has implemented logistic regression, k-nearest neighbour, decision tree, support vector machines, and multilayer perceptron to classify the COVID-19 dataset. The KNN classification algorithm has 1.5 to 3.3% improved accuracy over the other machine learning algorithms reported in this work. Moreover, the KNN classification algorithm produces the lowest error rate, 0.19, on the prediction of accurate COVID-19 cases compared with the other algorithms. To improve the accuracy of predictions, future work will concentrate on predicting COVID-19 cases using a classification and optimization algorithm.

Conflict of interest and Financial support: Nil

Author Contribution:
Prasannavenkatesan Theerthagiri: Conceived and designed the analysis, collected the data;
Jeena Jacob I: Contributed data or analysis tools;
Vamsidhar Yendapalli: Performed the analysis;
Usha Ruby A: Wrote the paper.

ACKNOWLEDGEMENT
The authors acknowledge the immense help received from the scholars whose articles are cited and included in the references of this manuscript. The authors are also grateful to the authors/editors/publishers of all those articles, journals, and books from which the literature for this article has been reviewed and discussed.

REFERENCES
1. Huang C, Wang Y, Li X, Ren L, Zhao J, Hu Y, et al. Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. Lancet 2020;395(10223):497-506.
2. Huang P, Park S, Yan R, Lee J, Chu LC, Lin CT, Hussien A, Rathmell J, Thomas B, Chen C, Hales R. Added value of computer-aided CT image features for early lung cancer diagnosis with small pulmonary nodules: a matched case-control study. Radiology 2018;286(1):286-295.
3. Esteva A, Kuprel B, Novoa RA, Ko J, Swetter SM, Blau HM, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 2017;542(7639):115-118.
4. Xie X, Li X, Wan S, Gong Y. Mining X-ray images of SARS patients. Data Mining: Theory, Methodology, Techniques, and Applications. Nature 2006;23:282-94.
5. Shan F, Gao Y, Wang J, Shi W, Shi N, Han M, Xue Z, Shi Y. Lung infection quantification of COVID-19 in CT images with deep learning. Lancet 2020:200304655.
6. Makridakis S, Spiliotis E, Assimakopoulos V. Statistical and Machine Learning forecasting methods: Concerns and ways forward. PloS One 2018;13(3):e0194889.
7. Bontempi G, Taieb SB, Le Borgne YA. Machine learning strategies for time series forecasting. European business intelligence summer school. PloS One 2012;15:62-77.
8. Harrell Jr FE, Lee KL, Matchar DB, Reichert TA. Regression models for prognostic prediction: advantages, problems, and suggested solutions. Can Treat Rept 1985;69(10):1071-1077.
9. Lapuerta P, Azen SP, Labree L. Use of neural networks in predicting the risk of coronary artery disease. Comput Biomed Res 1995;28(1):38-52.
10. Anderson KM, Odell PM, Wilson PW, Kannel WB. Cardiovascular disease risk profiles. Am Heart J 1991;121(1):293-298.
11. Asri H, Mousannif H, Al Moatassime H, Noel T. Using machine learning algorithms for breast cancer risk prediction and diagnosis. Proc Comput Sci 2016;83:1064-1069.
12. Grasselli G, Pesenti A, Cecconi M. Critical care utilization for the COVID-19 outbreak in Lombardy, Italy: early experience and forecast during an emergency response. J Am Med Assoc 2020;323(16):1545-1546.
13. World Health Organization. Naming the coronavirus disease (COVID-19) and the virus that causes it. Available: https://www.who.int/emergencies/diseases/novelcoronavirus-2019/technical-guidance/naming-the-coronavirus-disease-(covid-2019)-and-the-virus-that-causes-it
14. Novel CP. The epidemiological characteristics of an outbreak of 2019 novel coronavirus diseases (COVID-19) in China. Am Heart J 2020;41(2):145.
15. VanHoek L, Pyrc K, Jebbink MF, Vermeulen-Oost W, Berkhout RJ, Wolthers KC. Identification of a new human coronavirus. Nature Med 2004;10(4):368-337.
16. Van Der Hoek L, Pyrc K, Jebbink MF, Vermeulen-Oost W, Berkhout RJ, Wolthers KC, et al. Identification of a new human coronavirus. Nature Med 2004;10(4):368-373.
17. Gómez-Chova L, Camps-Valls G, Bruzzone L, Calpe-Maravilla J. Mean map kernel methods for semisupervised cloud classification. Trans Geosci Rem Sens 2009;48(1):207-220.
18. Bishop C. Improving the generalization properties of radial basis function neural networks. Neural Comput 1991;3(4):579-588.
19. Lee JS, Grunes MR, Ainsworth TL, Du LJ, Schuler DL, Cloud SR. Unsupervised classification using polarimetric decomposition and the complex Wishart classifier. Trans Geosci Rem Sens 1999;37(5):2249-2258.
20. Pahikkala T, Airola A, Gieseke F, Kramer O. Unsupervised multi-class regularized least-squares classification. In 2012 IEEE 12th International Conference on Data Mining 2012 Dec 10:585-594.
21. Hang J, Zhang J, Cheng M. Application of multi-class fuzzy support vector machine classifier for fault diagnosis of the wind turbine. Fuzzy Sets Syst 2016;297:128-140.
22. Kim KI, Jin CH, Lee YK, Kim KD, Ryu KH. Forecasting wind power generation patterns based on SOM clustering. In 2011 3rd International Conference on Awareness Science and Technology (iCAST) 2011 Sep 27 (pp. 508-511).
23. Tolles J, Meurer WJ. Logistic regression: relating patient characteristics to outcomes. J Am Med Assoc 2016;316(5):533-534.
24. Tjur T. Coefficients of determination in logistic regression models—A new proposal: The coefficient of discrimination. Am Stat 2009;63(4):366-372.
25. Lee SJ, Hou CL. An ART-based construction of RBF networks. Neu Net 2002;13(6):1308-1321.
26. Cybenko G. Approximation by superpositions of a sigmoidal function. Math Cont Sig Syst 1989;2(4):303-314.
27. Durbin R, Rumelhart DE. Product units: A computationally powerful and biologically plausible extension to backpropagation networks. Neu Comp 1989;1(1):133-142.


28. Buchla O, Klimek M, Sick B. Evolutionary optimization of radial basis function classifiers for data mining applications. Trans Syst Man Cyb Part B (Cybernetics) 2005;35(5):928-947.
29. Yao X. Evolving artificial neural networks. Neu Comp 1999 Sep;87(9):1423-1447.
30. Gutiérrez PA, López-Granados F, Peña-Barragán JM, Jurado-Expósito M, Gómez-Casero MT, Hervás-Martínez C. Mapping sunflower yield as affected by Ridolfia segetum patches and elevation by applying evolutionary product unit neural networks to remote sensed data. Compt Electr Agri 2008;60(2):122-132.
31. Boser BE, Guyon IM, Vapnik VN. A training algorithm for optimal margin classifiers. In Proceedings of the fifth annual workshop on Computational learning theory 1992 Jul 1:144-152.
32. Cortes C, Vapnik V. Support-vector networks. Mach Learn 1995;20(3):273-297.
33. Salcedo-Sanz S, Rojo-Álvarez JL, Martínez-Ramón M, Camps-Valls G. Support vector machines in engineering: an overview. Mini Know Disc 2014;4(3):234-267.
34. Hsu CW, Lin CJ. A comparison of methods for multiclass support vector machines. Neu Net 2002;13(2):415-425.
35. Lu X, Zhang L, Du H, Zhang J, Li YY, Qu J, Zhang W, Wang Y, Bao S, Li Y, Wu C. SARS-CoV-2 infection in children. New Eng J Med 2020;382(17):1663-1665.
36. Russell B, Moss C, Rigg A, Hopkins C, Papa S, Van Hemelrijck M. Anosmia and ageusia are emerging as symptoms in patients with COVID-19: What does the current evidence say. New Eng J Med 2020;14.
37. Li L, Yang Z, Dang Z, Meng C, Huang J, Meng H, Wang D, et al. Propagation analysis and prediction of the COVID-19. Infec Dis Mod 2020;5:282-292.
38. Chetty N, Vaisla KS, Patil N. An improved method for disease prediction using fuzzy approach. In 2015 Second International Conference on Advances in Computing and Communication Engineering 2015 May 1:568-572.
39. COVID-19 Dataset. Retrieved from: https://www.kaggle.com/imdevskp/covid19-corona-virus-india-dataset
40. Chiang WY, Zhang D, Zhou L. Predicting and explaining patronage behaviour toward web and traditional stores using neural networks: a comparative analysis with logistic regression. Dec Supp Syst 2006;41(2):514-531.
41. Altay O, Ulas M. Prediction of the autism spectrum disorder diagnosis with linear discriminant analysis classifier and K-nearest neighbour in children. In 2018 6th International Symposium on Digital Forensic and Security (ISDFS) 2018 Mar 22 (pp. 1-4). IEEE.
42. Elson J, Tailor A, Banerjee S, Salim R, Hillaby K, Jurkovic D. Expectant management of tubal ectopic pregnancy: prediction of successful outcome using decision tree analysis. Ultrasound Obstet Gynecol 2004;23(6):552-556.
43. Schölkopf B, Smola AJ, Bach F. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press; 2002.
44. Kumar GR, Ramachandra GA, Nagamani K. An efficient feature selection system to integrating SVM with genetic algorithm for large medical datasets. Int J 2014;4(2):272-277.
45. Pal A, Singh JP, Dutta P. Path length prediction in MANET under AODV routing: Comparative analysis of ARIMA and MLP model. Egy Infor J 2015;16(1):103-111.
46. Wu H, Yang S, Huang Z, He J, Wang X. Type 2 diabetes mellitus prediction model based on data mining. Inform Med Unloc 2018;10:100-107.
47. Theerthagiri P. FUCEM: futuristic cooperation evaluation model using Markov process for evaluating node reliability and link stability in mobile ad hoc network. Inform Med Unloc 2020;26(6):4173-4188.
48. Prasannavenkatesan T. CoFEE: Context-aware futuristic energy estimation model for sensor nodes using Markov model and autoregression. Int J Commun Syst 2019:e4248.

