CHAPTER 2
INTRODUCTION
According to the World Health Organization, heart disease is the leading cause of
death in high-income countries and the second leading cause of death in low-income
countries. It has remained the leading cause of death globally for the last 20 years. This paper aims to analyze
several data mining techniques implemented in recent years for diagnosing heart disease. At
present, there are plenty of algorithms available that could detect and predict heart anomalies
from clinical reports. However, in this project, the focus is more on discovering and
extracting patterns from Electrocardiogram (ECG or EKG) image reports. By digitizing ECG
records, the need for time-consuming manual interpretation of the report can be eliminated,
and diagnosis and analysis can be automated more quickly. Many papers on cardiovascular
prediction have focused on features such as diet, age, gender, and other dimensions, and then
predicted cardiovascular disease from those features. Our work instead predicts disease by
providing the ECG chart to our model. Cardiovascular disease (CVD) refers to a class of diseases that
involve the heart and blood vessels. It is a broad term that encompasses various conditions,
including coronary artery disease, heart failure, stroke, and peripheral artery disease, among
others. Cardiovascular diseases are among the leading causes of death worldwide.
The main underlying cause of cardiovascular disease is atherosclerosis, a condition
characterized by the buildup of plaque within the arteries. Plaque consists of cholesterol, fat,
calcium, and other substances, which gradually accumulate and narrow the arteries,
restricting blood flow to the heart, brain, or other parts of the body. Several risk factors
contribute to the development of cardiovascular disease, including:
High blood pressure: Prolonged high blood pressure puts stress on the blood vessels
and can lead to their damage.
High cholesterol levels: Elevated levels of cholesterol in the blood can contribute to
the formation of plaque in the arteries.
Smoking: Tobacco use damages the blood vessels, increases blood pressure, and
lowers good cholesterol (HDL).
Diabetes: People with diabetes are at an increased risk of developing cardiovascular
disease.
Obesity: Excess weight puts strain on the heart and is associated with other risk
factors such as high blood pressure and diabetes.
Lack of physical activity: Sedentary lifestyles contribute to the risk of developing
cardiovascular disease.
Unhealthy diet: Diets high in saturated and trans fats, cholesterol, salt, and sugar can
increase the risk of CVD.
Family history: Having a close relative with cardiovascular disease increases the
likelihood of developing it.
Age and gender: The risk of cardiovascular disease increases with age, and men are
generally at a higher risk than pre-menopausal women. However, the risk for women
increases after menopause.
Prevention and management of cardiovascular disease involve various approaches,
including lifestyle modifications and medical interventions. These may include:
Healthy lifestyle: Regular physical activity, maintaining a balanced diet, managing
weight, quitting smoking, and limiting alcohol intake.
Medications: Doctors may prescribe medications to control blood pressure, lower
cholesterol levels, prevent blood clots, or manage other underlying conditions.
Medical procedures: In certain cases, surgical interventions like angioplasty, bypass
surgery, or stenting may be necessary to restore blood flow to the heart or other
affected areas.
Cardiac rehabilitation: This program involves supervised exercise, education, and
counseling to help individuals recover from a heart attack, surgery, or manage their
cardiovascular disease.
Ongoing medical care: Regular check-ups, monitoring of blood pressure, cholesterol
levels, and other risk factors are important in managing and preventing the
progression of CVD.
It's worth noting that cardiovascular disease is a complex and multifactorial condition.
If you have concerns about your cardiovascular health, it is important to consult with a
healthcare professional for a personalized assessment and guidance.
2.1. ARTIFICIAL INTELLIGENCE
Artificial intelligence (AI) is the ability of a computer program or a machine to think
and learn. It is also a field of study that tries to make computers "smart". As machines
become increasingly capable, mental faculties once thought to require intelligence are
removed from the definition. AI is an area of computer science that emphasizes the creation
of intelligent machines that work and react like humans. Some of the activities computers
with artificial intelligence are designed for include face recognition, learning, planning,
and decision making.
Artificial intelligence is the use of computer science programming to imitate human
thought and action by analysing data and surroundings, solving or anticipating problems and
learning or self-teaching to adapt to a variety of tasks.
2.2. MACHINE LEARNING
Machine learning is a growing technology which enables computers to learn
automatically from past data. Machine learning uses various algorithms for building
mathematical models and making predictions using historical data or information. Currently,
it is being used for various tasks such as image recognition, speech recognition, email
filtering, Facebook auto-tagging, recommender system, and many more.
Machine learning is said to be a subset of artificial intelligence that is mainly concerned
with the development of algorithms that allow a computer to learn on its own from data and
past experiences. The term machine learning was first introduced by Arthur
Samuel in 1959. We can define it in a summarized way as: “Machine learning enables a
machine to automatically learn from data, improve performance from experiences, and
predict things without being explicitly programmed”.
A Machine Learning system learns from historical data, builds the prediction models, and
whenever it receives new data, predicts the output for it. The accuracy of the predicted
output depends upon the amount of data: a larger amount of data helps to build a better
model, which predicts the output more accurately.
Suppose we have a complex problem for which we need to make predictions. Instead of
writing code for it, we just feed the data to generic algorithms; with the help of these
algorithms, the machine builds the logic from the data and predicts the output. Machine
learning has changed our way of thinking about such problems. The sketch below gives a
minimal example of this workflow, and the block diagram that follows it explains the
working of a machine learning algorithm:
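As a concrete illustration of the feed-data-then-predict idea above, here is a minimal sketch using scikit-learn; the synthetic dataset and the random forest model are illustrative assumptions, not this project's actual ECG pipeline.

    # Minimal train-then-predict sketch (assumed: scikit-learn installed).
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Synthetic "historical" data standing in for real labeled records.
    X, y = make_classification(n_samples=500, n_features=10, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    model = RandomForestClassifier(random_state=42)
    model.fit(X_train, y_train)              # build the logic from past data
    y_pred = model.predict(X_test)           # predict the output for new data
    print("Accuracy:", accuracy_score(y_test, y_pred))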
2.2.1. Features of Machine Learning:
Machine learning uses data to detect various patterns in a given dataset.
It can learn from past data and improve automatically.
It is a data-driven technology.
Machine learning is similar to data mining in that both deal with huge amounts
of data.
2.2.2. Classification of Machine Learning
At a broad level, machine learning can be classified into three types:
1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning
1) Supervised Learning
Supervised learning is a machine learning method in which we provide labeled sample
data to the machine learning system in order to train it, and on that basis, it predicts
the output.
The system creates a model using the labeled data to understand the datasets and learn
about each example; once training and processing are done, we test the model on sample
data to check whether it predicts the correct output.
The goal of supervised learning is to map input data to output data. Supervised
learning is based on supervision, much as a student learns things under the supervision
of a teacher. An example of supervised learning is spam filtering.
Supervised learning can be grouped further into two categories of algorithms (a short
regression sketch follows this list):
Classification
Regression
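As a minimal sketch of the regression flavour of supervised learning, the snippet below fits a linear model to a tiny labeled dataset; the hours-vs-score values and scikit-learn's availability are illustrative assumptions.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Labeled data: hours studied (input) -> exam score (label). Invented values.
    X_train = np.array([[1], [2], [3], [4], [5]])
    y_train = np.array([52, 57, 61, 68, 74])

    model = LinearRegression()
    model.fit(X_train, y_train)            # training under the "supervision" of labels
    print(model.predict(np.array([[6]])))  # predicted score for 6 hours of study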
2) Unsupervised Learning
Unsupervised learning is a learning method in which a machine learns without any
supervision. The training data given to the machine have not been labeled, classified, or
categorized, and the algorithm must act on the data without any supervision. The goal of
unsupervised learning is to restructure the input data into new features or into groups of
objects with similar patterns.
In unsupervised learning, we do not have a predetermined result; the machine tries to
find useful insights in large amounts of data.
It can be further classified into two categories of algorithms (a short clustering sketch
follows this list):
Clustering
Association
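A minimal clustering sketch follows, using scikit-learn's KMeans on unlabeled points; the two-group data are an illustrative assumption.

    import numpy as np
    from sklearn.cluster import KMeans

    # Unlabeled data: two visually obvious groups of 2-D points.
    X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
    labels = kmeans.fit_predict(X)   # groups discovered without any labels
    print(labels)                    # e.g. [0 0 0 1 1 1]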
2.3. LOGISTIC REGRESSION ALGORITHM
Logistic regression is one of the most popular Machine Learning algorithms, which
comes under the Supervised Learning technique. It is used for predicting the categorical
dependent variable using a given set of independent variables. Logistic regression predicts
the output of a categorical dependent variable. Therefore, the outcome must be a categorical
or discrete value: Yes or No, 0 or 1, True or False, etc. Instead of returning exactly 0 or 1,
however, it returns probabilistic values that lie between 0 and 1.
Logistic regression is much like linear regression, except in how it is used. Linear
regression is used for solving regression problems, whereas logistic regression is used for
solving classification problems. In logistic regression, instead of fitting a straight
regression line, we fit an "S"-shaped logistic function whose output is bounded by two
extreme values (0 and 1). The curve of the logistic function indicates the likelihood of
something, such as whether cells are cancerous or whether a mouse is obese based on its
weight.
Logistic regression is a significant machine learning algorithm because it can provide
probabilities and classify new data using both continuous and discrete datasets. It can
classify observations using different types of data and can easily determine the most
effective variables for the classification. The image below shows the logistic function.
2.3.1. Logistic Function (Sigmoid Function):
The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
It maps any real value into another value within a range of 0 and 1.
The output of logistic regression must lie between 0 and 1 and cannot go beyond
this limit, so it forms an "S"-shaped curve. The S-form curve is called the
sigmoid function or the logistic function.
In logistic regression, we use the concept of a threshold value, which decides
between the classes 0 and 1: values above the threshold tend to 1, and values
below the threshold tend to 0. A short sketch of the sigmoid and a thresholded
logistic classifier follows this list.
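The sketch below shows the sigmoid function sigmoid(z) = 1 / (1 + e^(-z)) and a logistic classifier with a 0.5 threshold; the toy one-feature dataset is an illustrative assumption.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def sigmoid(z):
        # Maps any real value into the range (0, 1).
        return 1.0 / (1.0 + np.exp(-z))

    print(sigmoid(0.0))   # 0.5, exactly on the decision boundary

    # Toy binary data: one feature and a 0/1 class label (invented values).
    X = np.array([[1], [2], [3], [6], [7], [8]])
    y = np.array([0, 0, 0, 1, 1, 1])

    clf = LogisticRegression().fit(X, y)
    p = clf.predict_proba([[4.5]])[0, 1]  # probability of class 1
    print(p, int(p >= 0.5))               # apply the 0.5 threshold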
2.4. XGBoost (Extreme Gradient Boosting)
XGBoost is an implementation of gradient-boosted decision trees. The library is
written in C++ and was designed primarily to improve speed and model performance. It has
recently been dominant in applied machine learning, and XGBoost models dominate many
Kaggle competitions.
In this algorithm, decision trees are created sequentially. Weights play an
important role in XGBoost: weights are assigned to the training observations, which are
then fed into a decision tree that predicts results. The weights of observations the tree
predicts wrongly are increased, and these observations are then fed to the next decision
tree. These individual classifiers/predictors are then ensembled to give a stronger and
more precise model. XGBoost can handle regression, classification, ranking, and
user-defined prediction problems.
XGBoost Features
The library is laser-focused on computational speed and model performance; as such, there
are few frills.
Model Features
Three main forms of gradient boosting are supported:
Gradient Boosting
Stochastic Gradient Boosting
Regularized Gradient Boosting
System Features
For use in a range of computing environments, this library provides the following (a short
training sketch follows this list):
Parallelization of tree construction
Distributed Computing for training very large models
Cache Optimization of data structures and algorithm
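A minimal training sketch with the xgboost Python package follows; the synthetic data and the hyperparameter values shown are illustrative assumptions.

    from xgboost import XGBClassifier
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    # Trees are built sequentially; each new tree corrects its predecessors.
    model = XGBClassifier(n_estimators=100, max_depth=3, learning_rate=0.1,
                          n_jobs=-1)   # n_jobs=-1: parallel tree construction
    model.fit(X_train, y_train)
    print("Test accuracy:", model.score(X_test, y_test))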
2.5. KNN (K-Nearest Neighbor)
K-Nearest Neighbors (KNN) is a simple and popular machine learning algorithm used
for both classification and regression tasks. It is a non-parametric method, meaning it
doesn't make any assumptions about the underlying data distribution. In KNN, the
class or value of a data point is determined by the majority vote or average of the
values of its K nearest neighbors. Here's how the KNN algorithm works:
o Data Preparation: First, you need to have a labeled dataset with features
(independent variables) and corresponding labels (dependent variables) for
classification, or with features and continuous values for regression.
o Choose the value of K: K is a hyperparameter that represents the number of
nearest neighbors to consider when making predictions. You need to select an
appropriate value for K, which may depend on the nature of your dataset and
the problem you are trying to solve.
o Calculate Distance: KNN uses distance metrics, such as Euclidean distance or
Manhattan distance, to measure the similarity between data points. The
distance is calculated based on the features of the data points.
o Find K Nearest Neighbors: The algorithm identifies the K nearest neighbors of
a given data point based on the calculated distances. These neighbors are the
data points with the smallest distances to the target point.
o Majority Vote or Average: For classification tasks, the class label of the new
data point is determined by majority vote among its K nearest neighbors. The
class with the highest count becomes the predicted class. For regression tasks,
the predicted value is the average of the values of its K nearest neighbors.
o Make Predictions: Once the majority vote or average is calculated, it is
assigned as the predicted label or value for the new data point.
o Evaluate Performance: Finally, you can evaluate the performance of the KNN
algorithm using appropriate evaluation metrics such as accuracy, precision,
recall, F1-score, or mean squared error (MSE), depending on the problem
type.
It's important to note that KNN has some limitations. It can be computationally
expensive for large datasets, since it requires calculating distances for every data
point. Additionally, it does not perform well when the feature space is high-dimensional
or when the dataset has imbalanced classes.
Overall, KNN is a simple yet effective algorithm that can be used for various
classification and regression tasks, especially for small to medium-sized datasets with
relatively low-dimensional feature spaces. A short sketch follows.
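The snippet below is a minimal KNN classification sketch with scikit-learn, using the bundled Iris dataset; K = 5 and the Euclidean metric are illustrative choices.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    # K = 5 nearest neighbors, Euclidean distance, majority vote.
    knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
    knn.fit(X_train, y_train)
    print("Test accuracy:", knn.score(X_test, y_test))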
SVM Algorithm
The Support Vector Machine (SVM) algorithm is a supervised learning algorithm used for
classification and regression tasks. It is a powerful and versatile algorithm that can handle
both linear and nonlinear problems. The main idea behind SVM is to find an optimal
hyperplane that separates different classes or predicts the values of a target variable. In the
case of classification, the hyperplane aims to maximize the margin between the classes,
which is the distance between the hyperplane and the nearest data points from each class. The
intuition is that by maximizing the margin, the classifier becomes more robust to new, unseen
data. Here are the key components and steps involved in the SVM algorithm:
Data Preparation: SVM requires labeled training data, where each data point is
represented by a feature vector and assigned a class label.
Feature Transformation: If the data is not linearly separable, you can apply kernel
functions to transform the feature space into a higher-dimensional space, where linear
separation might be possible.
Training: SVM finds the optimal hyperplane by solving an optimization problem. The
goal is to find the hyperplane that maximizes the margin and minimizes the
classification error. This involves solving a quadratic programming problem to
identify the support vectors, which are the data points closest to the hyperplane.
Kernel Selection: SVM supports various types of kernel functions, such as linear,
polynomial, radial basis function (RBF), and sigmoid. The choice of the kernel
depends on the problem's characteristics and the data distribution.
Regularization: SVM allows for regularization parameters to control the trade-off
between maximizing the margin and minimizing the classification error. These
parameters help to prevent overfitting and generalize well to unseen data.
Prediction: Once the SVM model is trained, it can be used to predict the class labels
or regression values for new, unseen data points.
SVM has several advantages, including its ability to handle high-dimensional data, its
effectiveness in dealing with small datasets, and its versatility in classification and
regression tasks. However, SVM can be computationally expensive, especially when
dealing with large datasets, and it may require careful selection and tuning of
parameters to achieve optimal results.
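To make the kernel and regularization ideas concrete, here is a minimal SVM sketch with scikit-learn on nonlinearly separable data; the RBF kernel and the C value are illustrative choices.

    from sklearn.datasets import make_moons
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Two interleaving half-moons: not separable by a straight line.
    X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    svm = make_pipeline(StandardScaler(),
                        SVC(kernel="rbf", C=1.0, gamma="scale"))  # C: margin/error trade-off
    svm.fit(X_train, y_train)
    print("Test accuracy:", svm.score(X_test, y_test))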
2.6. LITERATURE REVIEW
[1] Title: Detection of Cardiovascular Diseases in ECG Images Using Machine Learning and
Deep Learning Methods
Authors: Mohammed B. Abubaker; Bilal Babayiğit
Description:
Cardiovascular diseases (heart diseases) are the leading cause of death worldwide.
The earlier they can be predicted and classified, the more lives can be saved. The
electrocardiogram (ECG) is a common, inexpensive, and noninvasive tool for measuring the
electrical activity of the heart and is used to detect cardiovascular disease. In this work,
the power of deep learning techniques was used to predict four major cardiac abnormality
classes: abnormal heartbeat, myocardial infarction, history of myocardial infarction, and
normal person, using the public ECG images dataset of cardiac patients. First, the transfer
learning approach was investigated using the low-scale pretrained deep neural networks
SqueezeNet and AlexNet. Second, a new Convolutional Neural Network (CNN) architecture
was proposed for cardiac abnormality prediction. Third, the aforementioned pretrained
models and the proposed CNN model were used as feature extraction tools for traditional
machine learning algorithms, namely Support Vector Machine (SVM), K-Nearest Neighbors
(K-NN), Decision Tree (DT), Random Forest (RF), and Naïve Bayes (NB). According to the
experimental results, the proposed CNN model outperforms existing works: it achieves
98.23% accuracy, 98.22% recall, 98.31% precision, and a 98.21% F1 score. Moreover, when
the proposed CNN model is used for feature extraction, it achieves the best score of 99.79%
using the NB algorithm.
[2] Title: Improving Disease Prediction by Machine Learning
Authors: Smriti Mukesh Singh, Dr. Dinesh B. Hanchate – 2018
Description:
These days the use of Big Data is expanding in the biomedical and healthcare
communities, and accurate analysis of medical data benefits early disease detection, patient
care, and community services. Incomplete medical data reduces analysis accuracy. Machine
learning algorithms are proposed for effective prediction of chronic disease. To overcome
the difficulty of incomplete data, a genetic algorithm is used to reconstruct the missing
data. The dataset consists of structured data and unstructured data. To extract features
from the unstructured data, an RNN algorithm is used. The framework proposes the SVM
algorithm and the Naive Bayesian algorithm for disease prediction using unstructured and
structured hospital data, respectively. A Community Question Answering (CQA) system is
additionally proposed, which predicts questions and answers and gives appropriate responses
to users. For this, two algorithms are proposed, KNN and SVM: the KNN algorithm performs
classification on the questions and the SVM algorithm performs classification on the
answers. This helps users find the best questions and answers related to diseases.
[3] Title: Heart Disease Prediction System Using Data Mining Techniques
Authors: Abhishiek Taneja – 2015
Description:
In today's modern world, cardiovascular disease is among the most lethal. The disease
can strike so suddenly that there is hardly any time to treat it, so diagnosing patients
correctly and in a timely fashion is one of the most challenging tasks for the medical
fraternity. A wrong diagnosis earns a hospital a bad name and costs it its reputation, and
at the same time the cost of treating the disease is quite high and unaffordable for many
patients, particularly in India. The purpose of this paper is to develop a cost-effective
treatment using data mining technologies to facilitate a database decision support system.
Almost all hospitals use some hospital management system to manage healthcare for patients.
Unfortunately, most of these systems rarely use the huge amounts of clinical data in which
vital information is hidden; the systems create large volumes of data in varied forms, but
this data is seldom visited and remains untapped. Considerable effort is therefore required
in this direction to enable intelligent decisions. Diagnosing this disease from its various
features or symptoms is a complex activity. In this paper, using varied data mining
technologies, an attempt is made to assist in the diagnosis of the disease in question.
[4] Title: Improved Study of Heart Disease Prediction System using Data Mining
Classification Techniques
Authors: Chaitrali S. Dangare, Sulabha S. Apte - 2016
Description:
The healthcare industry is generally “information rich”, but unfortunately not all of
the data required for discovering hidden patterns and effective decision making are mined.
Advanced data mining techniques are used to discover knowledge in databases and for
medical research, particularly in heart disease prediction. This paper analysed prediction
systems for heart disease using a greater number of input attributes. The system uses 13
medical attributes, such as sex, blood pressure, and cholesterol, to predict the likelihood
of a patient getting heart disease. Until now, 13 attributes had been used for prediction;
this research paper adds two more attributes, obesity and smoking. The data mining
classification techniques Decision Trees, Naive Bayes, and Neural Networks are analyzed on
the heart disease database, and their performance is compared based on accuracy. As per the
reported results, the accuracies of Neural Networks, Decision Trees, and Naive Bayes are
100%, 99.62%, and 90.74% respectively; of these three classification models, Neural
Networks predicts heart disease with the highest accuracy.