
2022 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING, AUG. 22–25, 2022, XI’AN, CHINA

MACHINE LEARNING-BASED HEART DISEASE PREDICTION: A STUDY FOR HOME PERSONALIZED CARE

Goutam Kumar Sahoo∗,1, Keerthana Kanike∗,2, Santos Kumar Das∗,3 and Poonam Singh∗,4

Department of Electronics and Communication Engineering,
National Institute of Technology Rourkela, India
Email: 1 goutamkrsahoo@[Link], 2 keerthana.kanike112@[Link], 3 dassk@[Link], 4 psingh@[Link]

ABSTRACT

This study develops a framework for personalized care to tackle heart disease risk using an at-home system. The machine learning models used to predict heart disease are Logistic Regression, K-Nearest Neighbor, Support Vector Machine, Naive Bayes, Decision Tree, Random Forest and XG Boost. Timely and efficient detection of heart disease plays an important role in health care. It is essential to detect cardiovascular disease (CVD) as early as possible, consult a specialist doctor before the disease becomes severe, and start medication. The performance of the proposed model was assessed using the Cleveland Heart Disease dataset from the UCI Machine Learning Repository. Compared to all the other machine learning algorithms, the Random Forest algorithm shows the best accuracy score, 90.16%. The best model may evaluate patient fitness in place of routine hospital visits. The proposed work will reduce the burden on hospitals and help hospitals reach only critical patients.

Index Terms: Cardio Vascular Disease (CVD), Data Mining, Healthcare, Machine Learning, Heart Disease Prediction.

978-1-6654-8547-0/22/$31.00 © 2022 IEEE

1. INTRODUCTION

Heart disease is a type of disease that affects the heart. According to the World Health Organization (WHO), every year more than 17.5 million people die of heart disease [1]. In this state, the heart is usually unable to pump the required amount of blood to various parts of the human body to carry out the routine functions of the body, resulting in irreversible heart failure. Symptoms and signs of heart disease include chest pain, chest discomfort (angina), shortness of breath, and increased blood pressure. High cholesterol, high blood pressure or diabetes can also increase the risk of heart disease. There are many prevention methods to combat this disease, such as stopping smoking, maintaining a healthy weight, adopting a healthy diet and practicing sports regularly. Prediction of heart disease before occurrence is as important as dealing with the disease. This can be done using different machine learning (ML) prediction techniques. The best model is decided based on the accuracy obtained on the test data. The most important consideration in heart disease prediction is to reduce failures to identify patients with heart disease, which means reducing false negatives [2].

Exploratory Data Analysis (EDA) is the key step to identify important and relevant features to be used in the further process of modeling [3]. EDA includes univariate and bivariate analysis of features in graphical and tabular representations. Univariate analysis, such as a histogram, box plot or count plot, gives a clear idea of each feature, whereas bivariate analysis, such as a scatter plot, bar plot, two-way table or correlation, compares two features and mainly compares the features with the target to expose their relationship. Performance metrics like recall, accuracy, etc. play an important role in deciding the correct prediction model. The work by Bhatt et al. [4] used the data mining tool Weka to predict heart disease using two classification techniques with two different datasets. The J48 technique was applied to the Hungarian dataset, and Naive Bayes (NB) was applied to the echocardiogram database [5]. A classification accuracy of 82.3% was achieved using the Hungarian dataset with all features, with selected features outperforming an accuracy of 65.64%. The performance of the model depends on the variance and bias of the dataset [6].

According to a survey of machine learning for the prediction of heart disease [6], NB, with low variance and high bias, performs well compared to K-Nearest Neighbour (KNN), which has high variance and low bias. KNN performs worse because, with high variance and low bias, it suffers from overfitting. Arpaia et al. [7] reported an e-healthcare technology for home care that measures the risk of a cardiac patient using data acquired from the patient's body. These data were used with a random forest (RF) classifier to classify cardiac risk, and the classification accuracy reported was 80%. M. A. Khan [8] proposed an IoT framework based on a modified deep convolutional neural network (MDCNN) to
evaluate heart disease more accurately. The blood pressure and electrocardiogram (ECG) were acquired using wearable devices such as a smartwatch and a heart monitor. The acquired data were then transferred to a long range (LoRa) cloud using a LoRa gateway. This work predicts heart disease using machine learning techniques intended to be used in an embedded system for at-home personalized care as an early diagnosis for CVD patients.

The rest of the paper is organized as follows: Section 2 presents the problem formulation and methodology. Section 3 explains the experimental results and analysis. The conclusion of the work is given in Section 4.

2. PROBLEM FORMULATION AND METHODOLOGY

This study develops a framework for personalized care aimed at tackling the risk of CVD using an at-home system. The best machine learning models will be used to evaluate the patient’s fitness instead of making regular visits to the hospital. This will reduce the burden on hospitals and help hospitals reach only critical patients. A system for early heart disease screening is shown in Fig. 1. A heart patient can predict heart disease at home using an ML-based prediction system. If the system predicts that the patient has heart disease, a computer-based alert will be generated advising the patient to consult a doctor for diagnosis, and the system will also store the estimated abnormal data related to CVD for future requirements.

[Figure 1 block labels: Patient; Heart Data Acquisition System; ML-based Prediction Model; Disease Predicted? (Yes); Data Storage; Consult Doctor for Diagnosis]
Fig. 1: Preliminary Heart Disease Investigation Framework

The main task is the methodology for the detection and prediction of heart disease. According to literature studies, this task can be framed as a binary classification problem: deciding whether a person has heart disease or not. The task requires heart patient data to predict heart disease. For experimental work, the publicly available Cleveland heart disease dataset from the UCI Machine Learning Repository [9] is used. The dataset has 303 rows and 75 attributes; however, all published experiments refer to using a subset of 14 of them. The patients in the database are both male and female. The patients' ages range between 29 and 77 years. Most patients have chest pain typical of the angina type and the heart rate is between 71-262 bpm. This dataset is divided into two splits, namely a train split and a test split, using the Python library scikit-learn. The step-wise methodology adopted for developing the heart disease prediction model is presented in Fig. 2 and discussed in detail.

Fig. 2: Heart Disease Prediction Methodology

2.1. Dataset Collection

Data is an essential requirement for the prediction of heart disease. The performance of various ML algorithms will be evaluated using a publicly available dataset, the Cleveland Heart Disease Dataset from the UCI Machine Learning Repository. A brief description of each of the features is given in Table 1 for a better understanding. There are 14 features, of which 13 are considered independent variables/features, and the remaining one, the Target feature, is the dependent variable/feature.

Table 1: Dataset details for heart disease prediction [9].

Sl. No.  Feature   Description
1        Age       Age in years
2        Sex       Male = 1; female = 0
3        Cp        Type of chest pain, categorized into 4 values
4        Trestbps  Level of blood pressure at rest (in mm Hg)
5        Chol      Serum cholesterol in mg/dl
6        Fbs       Fasting blood sugar level > 120 mg/dl (1 = true; 0 = false)
7        Restecg   Results of an electrocardiogram while at rest
8        Thalach   Maximum heart rate achieved
9        Exang     Angina induced by exercise (1 = yes; 0 = no)
10       OldPeak   Exercise-induced ST depression in comparison to rest
11       Slope     Slope of the ST segment
12       Ca        Number of major vessels (0 to 3) coloured by fluoroscopy
13       Thal      Status of the heart (normal = 3; fixed defect = 6; reversible defect = 7)
14       Target    Heart disease diagnosis (healthy = 0; diseased = 1)

2.2. Data Pre-Processing

Medical records are not always complete and may contain missing and unwanted data. Data pre-processing is used to reduce the discrepancies associated with the data, remove duplicate records, normalize values, account for missing data, etc. The primary step in this data pre-processing is to check for null values and treat them by filling them in or dropping them. After importing the dataset using the Python library pandas, common data pre-processing methods such as data cleaning, data transformation, efficient processing, and classification are performed. No special method of data processing is used in this work.

2.3. Exploratory Data Analysis (EDA)

Exploratory data analysis (EDA) provides the detailed features of the dataset through different graphs and tables. For example, Fig. 3 shows the relation between the target and the categorical variable “cp”, which has four different categories, namely “asymptomatic”, “non-anginal pain”, “atypical angina” and “typical angina”. This plot mainly represents the count of people with “cp” divisions according
to target values. We can observe that people suffering from heart disease are mostly from the “cp” type of “non-anginal pain” and people without heart disease are from “typical angina”.

Fig. 3: Distribution of people according to “cp” categories.

2.4. Modeling

In a prediction problem, system modeling is the process where the actual training happens using data mining and machine learning algorithms. A study using a publicly available dataset is made to explore the feasibility of predictive models for the early prediction of heart disease. The machine learning algorithms used for prediction are: logistic regression (LR), K-Nearest Neighbour (KNN), support vector machine (SVM), decision tree (DT), random forest (RF), Naive Bayes (NB) and eXtreme Gradient Boosting (XG Boost). The step-wise approach of heart disease prediction modeling is presented in Algorithm 1.

Algorithm 1: Algorithm for Heart Disease Prediction
1: Load the Cleveland heart disease dataset.
2: Apply the data pre-processing techniques.
3: Perform exploratory data analysis (EDA).
4: Divide the total data into feature variables and the target variable.
5: Split the dataset into training and testing samples.
6: Build the model using a machine learning algorithm.
7: Validate the model for prediction of the healthy and diseased classes.
8: Calculate the performance measurement parameters.
9: If (CVD detected), then alert the patient to consult a doctor and store the patient data locally.

2.5. Evaluation Metrics

After the modeling process, the main task is to evaluate the model on the test data to check whether a particular machine learning algorithm correctly predicts the persons with and without heart disease. This can be done using the evaluation metrics of the Python library scikit-learn. Using the confusion matrix, we can visualize the performance of computational intelligence techniques. In the confusion matrix, there are four classification performance indices, i.e., TP = True Positive (correctly identified), TN = True Negative (correctly rejected), FP = False Positive (incorrectly identified), FN = False Negative (incorrectly rejected). The expressions used to evaluate the various performance parameters are given in Equations (1) to (4).

\[ \text{F1-Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{1} \]

\[ \text{Precision} = \frac{TP}{TP + FP} \tag{2} \]

\[ \text{Recall} = \frac{TP}{TP + FN} \tag{3} \]

\[ \text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \tag{4} \]

3. EXPERIMENTAL RESULTS AND ANALYSIS

3.1. Experimental Setup

The experiment was implemented in Python 3.8 on a single computer (Acer Aspire A515-54G, Intel(R) Core(TM) i5-8265U CPU @ 1.60 GHz, 8 GB RAM) running Windows 10.

3.2. Result Analysis

The methodology adopted in training on the dataset using various ML algorithms provides the predictive performance for heart disease in terms of evaluation metrics. The three best-performing models by accuracy are analyzed in detail with the help of the model performance learning curve, classification report and confusion matrix. The confusion matrix is generated by taking 20% of the real data (i.e., the test split) and running it on a model trained with 80% of the real data (i.e., the train split). A confusion matrix is a form of table that represents the actual true and false values as well as the estimated true and false values. The training accuracy is generated using the train data, the test data, and the epochs, which tells us the accuracy for each epoch we run. A model learning curve is used to analyze the training process, which gives an idea of the model performance. Similarly, the training loss and testing loss curves show the error between the predicted and actual values during training.

3.2.1. Random Forest Classifier Classification Report

The confusion matrix and training accuracy performance can be seen in Fig. 4. From the confusion matrix in Fig. 4a, the algorithm predicts 25 healthy individuals, whereas originally,
it was 31. The true positive rate (or recall) is 0.81, and the precision is 1.00. Similarly, for the diseased class, the algorithm predicts 36 diseased individuals, whereas originally it was 30. The true positive rate (or recall) is 1.00, and the precision is 0.83. The overall prediction accuracy has been found to be 0.9016 or 90.16%. Fig. 4b shows that the training score for the random forest classifier remains constant throughout training, while the cross-validation score finishes below its starting score. Similarly, Fig. 4c represents the training loss performance curves.

Fig. 4: Random Forest Model Performance. Panels: (a) Confusion Matrix, (b) Training Accuracy, (c) Training Loss.

3.2.2. Logistic Regression Classification Report

The confusion matrix and the training accuracy performance can be seen in Fig. 5. From the confusion matrix in Fig. 5a, the algorithm predicts 26 healthy persons, whereas originally it was 31. The true positive rate is 0.81, and the precision is 0.96. Similarly, for the diseased class, the algorithm predicts 35 diseased persons, whereas originally it was 30. The true positive rate is 0.97, and the precision is 0.83. The overall prediction accuracy is 88.52%. Fig. 5b contains two individual graphs, one representing the accuracy score on training data and the other on validation data for an increasing number of training instances. One can see that the training score for Logistic Regression gradually decreases, whereas the cross-validation score gradually increases. Similarly, Fig. 5c represents the training loss performance curves.

3.2.3. XG Boost Classification Report

The confusion matrix and training accuracy performance can be seen in Fig. 6. From the confusion matrix in Fig. 6a, the algorithm predicts 28 healthy individuals, whereas originally it was 31. The true positive rate is 0.77 and the precision is 0.86. Similarly, for the diseased class, the algorithm predicts 33 diseased individuals, whereas originally it was 30. The true positive rate is 0.87 and the precision is 0.79. The overall prediction accuracy has been found to be 88.52%. The learning curve in Fig. 6b for the XG Boost classifier shows that the training score gradually increases, while the cross-validation score does not increase steadily but ends higher than its initial value. Similarly, Fig. 6c represents the training loss performance curves.

3.3. Performance Evaluation and Result Comparison

The evaluation metrics are calculated and presented in Table 2. The metrics considered are test accuracy, precision, recall, and F1-score, which are the most important metrics for deciding the better model. From the table, we can conclude that Random Forest has the maximum test accuracy of 90.16%.

Table 2: Performance Evaluation of Different ML Models.

Algorithm   Test Accuracy (%)   Precision (%)   Recall (%)   F1-Score (%)
LR          88.52               82.85           96.66        89.23
KNN         83.6                81.25           86.66        83.87
SVM         83.6                76.31           96.66        85.29
NB          75.4                67.44           96.66        79.45
DT          78.68               77.41           80           78.68
RF          90.16               83.33           100          90.9
XG Boost    88.52               84.84           93.33        88.88
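
As a cross-check, the Random Forest row of Table 2 can be recomputed from the counts reported in Section 3.2.1 using Equations (1) to (4). The sketch below assumes the diseased class is the positive class and derives TP = 30, FN = 0, FP = 6, TN = 25 from the reported figures (25 of the 31 healthy test samples and all 30 diseased test samples predicted correctly). The function name classification_metrics is illustrative and does not come from the paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Precision, recall, F1-score and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)                            # Eq. (2)
    recall = tp / (tp + fn)                               # Eq. (3)
    f1 = 2 * precision * recall / (precision + recall)    # Eq. (1)
    accuracy = (tp + tn) / (tp + tn + fp + fn)            # Eq. (4)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Assumed counts, derived from the Random Forest figures in Section 3.2.1
# with "diseased" taken as the positive class.
rf = classification_metrics(tp=30, tn=25, fp=6, fn=0)

for name, value in rf.items():
    print(f"{name}: {100 * value:.2f}%")
# precision: 83.33%
# recall: 100.00%
# f1: 90.91%
# accuracy: 90.16%
```

Under these assumptions the values agree with the RF row of Table 2, where the F1-score appears as 90.9.
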
Fig. 5: Logistic Regression Performance Measures. Panels: (a) Confusion Matrix, (b) Training Accuracy, (c) Training Loss.

Fig. 6: XG Boost Model Performance. Panels: (a) Confusion Matrix, (b) Training Accuracy, (c) Training Loss.
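
The (b) panels of Figs. 4 to 6 appear to plot training and cross-validation accuracy against the number of training instances. The paper does not say how these curves were produced, so the following is only a sketch of one way to compute them, using scikit-learn's learning_curve. It makes several assumptions: a synthetic dataset from make_classification (303 samples, 13 features) stands in for the Cleveland data, whose file path is not given in the paper, and the fold count, forest size and training-set fractions are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic stand-in with the same shape as the 13-feature Cleveland subset (assumption).
X, y = make_classification(n_samples=303, n_features=13, n_informative=6, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0)

# Accuracy on the training folds and on the held-out folds for growing training sets.
train_sizes, train_scores, cv_scores = learning_curve(
    model, X, y,
    cv=5,                                   # 5-fold cross-validation (assumption)
    scoring="accuracy",
    train_sizes=np.linspace(0.2, 1.0, 5),   # fractions of the available training data
)

mean_train = train_scores.mean(axis=1)      # average over folds
mean_cv = cv_scores.mean(axis=1)

for n, tr, cv in zip(train_sizes, mean_train, mean_cv):
    print(f"{n:4d} samples: train={tr:.3f}  cv={cv:.3f}")
```

A random forest usually fits its own training data almost perfectly, which would be consistent with the flat training score described for Fig. 4b. Plotting mean_train and mean_cv against train_sizes would give a figure of the same kind.
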


Cross-validation (CV) is a technique for evaluating ML models. In K-fold CV the data is split into k equal or nearly equal folds. A 10-fold cross-validation is performed on each model created using the different machine learning algorithms. The accuracy and standard deviation (std) of the error are obtained and can be seen in Table 3. A low standard deviation means that the fold accuracies lie close to their average, so the model is reliable.

Table 3: 10-fold cross-validation accuracy and standard deviation (std).

ML algorithm   Accuracy (%)   std
LR             82.183         0.067
KNN            80.15          0.058
SVM            83.03          0.07
DT             78.05          0.08
XG Boost       80.09          0.081
RF             77.61          0.085

3.4. Comparative Study

Table 4 compares the accuracy of state-of-the-art techniques with that of our experimental study on the Cleveland dataset. The LR technique in [10, 8, 13, 14] achieved accuracies of 85%,
87.8%, 83.5%, and 83.3%, respectively, which are broadly similar to one another and lower than the LR accuracy of our study (88.52%). The studies presented in [11, 12, 14, 15] used random forest and achieved accuracies of 86.84%, 89.41%, 80.30% and 89.97%, respectively. However, our study on RF provides a better accuracy of 90.16%. The works by Ayon et al. and Divya et al. [12, 15] report higher LR accuracy than ours but lower RF accuracy. The present work does not show superiority over all the results of prior studies. However, this work achieved superior performance on the RF algorithm, i.e., the maximum accuracy of 90.16%, and showed comparable performance on the LR and XG Boost algorithms. The XG Boost accuracy obtained is 88.52% and is, to our knowledge, the best on this dataset.

Table 4: Comparison of our study with the existing methods.

Reference            Algorithm   Accuracy (%)
A. K. Dwivedi [10]   LR          85
Shah et al. [11]     RF          86.84
M. A. Khan [8]       LR          87.8
Ayon et al. [12]     RF          89.41
Ayon et al. [12]     LR          92.41
Tougui et al. [13]   LR          83.5
Bharti et al. [14]   LR          83.3
Bharti et al. [14]   RF          80.3
Divya et al. [15]    RF          89.97
Divya et al. [15]    LR          92.01
Our Study            LR          88.52
Our Study            RF          90.16
Our Study            XG Boost    88.52

4. CONCLUSIONS AND FUTURE WORKS

This work presented a novel framework for heart disease prediction by applying ML techniques. The machine learning models used to predict heart disease are LR, KNN, SVM, NB, DT, RF, and XG Boost. Compared with all the other ML algorithms, the RF algorithm shows the best accuracy, 90.16%. The best machine learning models can be deployed to evaluate the patient’s fitness instead of making regular visits to the hospital. This will reduce the burden on hospitals and help hospitals reach only critical patients. Such predictions are essential for notifying the doctor before the disease becomes serious and for starting medication. This study develops a framework for personalized care aimed at tackling the risk of CVD using an at-home system. In the future, pre-trained deep learning models can be explored to improve the prediction accuracy for low-cost embedded system applications.

5. REFERENCES

[1] Mariano Sanz et al., “Periodontitis and cardiovascular diseases: Consensus report,” J. Clinical Periodontology, vol. 47, no. 3, pp. 268–288, 2020.

[2] Ramzi A Haraty et al., “An enhanced k-means clustering algorithm for pattern discovery in healthcare data,” Int. Journal Dist. Sensor Networks, vol. 11, no. 6, pp. 615740, 2015.

[3] Charu C Aggarwal, Data mining: the textbook, Springer, 2015.

[4] Anurag Bhatt et al., “Data mining approach to predict and analyze the cardiovascular disease,” in Proc. 5th Int. Conf. Frontiers in Intelli. Comput. Theory Appl. Springer, 2017, pp. 117–126.

[5] Dheeru Dua and C Graff, “UCI Machine Learning Repository. University of California, School of Information and Computer Science, Irvine, CA (2019),” 2019.

[6] Himanshu Sharma and MA Rizvi, “Prediction of heart disease using machine learning algorithms: A survey,” Int. Journal Recent Innov. Trends Comput. Commun., vol. 5, no. 8, pp. 99–104, 2017.

[7] Pasquale Arpaia et al., “Conceptual design of a machine learning-based wearable soft sensor for non-invasive cardiovascular risk assessment,” Measurement, vol. 169, pp. 108551, 2021.

[8] Mohammad Ayoub Khan, “An IoT framework for heart disease prediction based on MDCNN classifier,” IEEE Access, vol. 8, pp. 34717–34727, 2020.

[9] UCI Machine Learning Repository, “Heart Disease Dataset.” [Online]. Available: [Link] [Link]/ml/datasets/Heart+Disease.

[10] Ashok Kumar Dwivedi, “Performance evaluation of different machine learning techniques for prediction of heart disease,” Neural Computing Appl., vol. 29, no. 10, pp. 685–693, 2018.

[11] Devansh Shah et al., “Heart disease prediction using machine learning techniques,” SN Computer Science, vol. 1, no. 6, pp. 1–6, 2020.

[12] Safial Islam Ayon et al., “Coronary artery heart disease prediction: a comparative study of computational intelligence techniques,” IETE J. Res., pp. 1–20, 2020.

[13] Ilias Tougui et al., “Heart disease classification using data mining tools and machine learning techniques,” Health Technol., vol. 10, pp. 1137–1144, 2020.

[14] Rohit Bharti et al., “Prediction of heart disease using a combination of machine learning and deep learning,” Comput. Intell. Neuroscience, vol. 2021, 2021.

[15] K Divya et al., “An IoMT assisted heart disease diagnostic system using machine learning techniques,” in Cogn. Internet Med. Things Smart Healthcare, pp. 145–161. Springer, 2021.
