SKR ENGINEERING COLLEGE
DEPARTMENT OF BIOMEDICAL ENGINEERING
EARLY DETECTION OF HEART DISEASES USING
MACHINE LEARNING
TEAM MEMBERS:
POORVIKA.R 212020121016
RAMADEVI.R.M 212020121018
SUBHA.K 212020121023
SUPRIYA.D 212020121024
ABSTRACT
• Efficient identification of heart disease plays a key role in healthcare, particularly
in the field of cardiology.
• The system is built on classification algorithms, including Logistic
Regression, K-nearest neighbor, Random Forest, and the XGB Classifier.
• We also propose a novel fast conditional mutual information feature selection
algorithm to solve the feature selection problem.
OBJECTIVE
Heart disease is a complex disease, and many people around the world
suffer from it.
We propose an efficient and accurate system to diagnose heart disease,
based on machine learning techniques.
LITERATURE SURVEY
Reference no.: 1
Authors: S. Kapoor, L. Kasar, A. Mandole, J. Mahajan
Publication: 7th International Conference on Computing in Engineering &
Technology (ICCET 2022)
Review: This paper demonstrates the prediction of heart disease using multiple
machine learning classification algorithms such as Naive Bayes, Random Forest,
SVM etc., and compares their accuracy scores.
LITERATURE SURVEY
Reference no.: 2
Authors: Chala Bayen et al. [12]
Year of publication: 2018
Review: Prediction and analysis of the occurrence of heart disease using data
mining techniques: J48, Naive Bayes, and Support Vector Machine. It gives results
in a short time, which helps improve quality of service and reduce cost to
individuals.
EXISTING SYSTEM
The existing system is a heart disease classification system built with machine
learning classification techniques; it achieved 77% accuracy.
The Cleveland dataset was used with a global evolutionary method and a
feature selection method.
A DBP algorithm was combined with an FS algorithm, and the performance was not good.
DISADVANTAGES
• The existing system's accuracy is low.
• It is computationally complex.
• More execution time is required to generate results.
PROPOSED SYSTEM
• Our proposed system uses Logistic Regression, XGBoost, Random Forest, and KNN
classifiers, which are trained on the dataset.
• Preventing heart disease has become a necessity. Good data-driven systems
for predicting heart disease can improve the entire research and
prevention process, helping more people live healthy lives.
• This is where machine learning comes into play: it helps predict heart
disease, and the predictions can be quite accurate.
• The project involved analysis of the heart disease patient dataset with proper data
processing. Different models were then trained, and predictions were made with
different algorithms: KNN, Decision Tree, Random Forest, Logistic Regression,
etc.
ADVANTAGES
• SVM achieved high accuracy
• Computationally less complex
• High accuracy due to the selection of appropriate features for training and testing
the model
SYSTEM ARCHITECTURE
[Figure: system architecture. Blocks shown: Dataset Collection, Pre-Processing,
Training, Testing, Cross validation, classifiers (Random Forest, KNN, XGBoost,
Logistic Regression), Result Computation, Best Classifier.]
SYSTEM MODULES
• Module 1: Dataset Collection and pre-processing
• Module 2: EDA concept
• Module 3: Train the model
• Module 4: Evaluation
• Module 5: Comparison of existing model
• Module 6: Performance analysis
DATASET COLLECTION
• Data collection is the process of gathering and measuring
information from many different sources.
• To use the collected data to develop practical artificial
intelligence (AI) and machine learning solutions, it must be collected
and stored in a way that makes sense for the problem at
hand.
PRE-PROCESSING
• Data preprocessing is the process of preparing the raw data and making
it suitable for a machine learning model. It is the first and a crucial step
in creating a machine learning model.
• Real-world data generally contains noise and missing values, and
may be in a format that cannot be used directly by
machine learning models. Data preprocessing is required to
clean the data and make it suitable for a machine learning model,
which also increases the accuracy and efficiency of the
model.
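The cleaning described above can be sketched in a few lines. This is only an illustration, assuming missing values are filled with the column mean and features are rescaled to [0, 1]; the slides do not say which techniques were actually used, and the cholesterol readings are made up.

```python
# Illustrative pre-processing: mean imputation followed by min-max scaling.
# The techniques and the sample values are assumptions, not the project's pipeline.

def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

def min_max_scale(column):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

cholesterol = [200.0, None, 240.0, 180.0]  # hypothetical readings with one gap
filled = impute_mean(cholesterol)
scaled = min_max_scale(filled)
print(filled)
print(scaled)
```

Scaling matters most for distance-based models such as K-NN, where a feature with a large numeric range would otherwise dominate the distance.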
Train the model
In this stage, the models are fitted to the input data. For our
study, we train four types of machine learning algorithm
to predict heart disease.
K-nearest neighbor
The K-nearest neighbors (K-NN) algorithm is a supervised classification method and a
type of instance-based learning. It uses a set of labeled points to decide how to
label a new point: the new point is assigned the class most common among its K
nearest neighbors. The distance between a point and its neighbors is measured
using Euclidean distance. Because K-NN groups data by similarity, it can also be
used to fill in missing values. Once the missing values are
filled, various prediction techniques can be applied to the data set. It is possible to gain
better accuracy by using various combinations of these algorithms.
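A minimal sketch of K-NN classification with Euclidean distance is given below. The function names and the two-feature sample points are hypothetical and chosen only to illustrate the voting step; this is not the project's code.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_points, train_labels, query, k=3):
    """Label the query with the majority class of its k nearest training points."""
    ranked = sorted(zip(train_points, train_labels),
                    key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical, already scaled features; label 1 = disease, 0 = no disease.
train_points = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15),
                (0.8, 0.9), (0.9, 0.8), (0.85, 0.85)]
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_points, train_labels, (0.2, 0.2)))
print(knn_predict(train_points, train_labels, (0.9, 0.9)))
```

An odd value of k avoids tied votes in a two-class problem.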
Random forest algorithm
Random forest is a supervised classification technique. In this
algorithm, several decision trees form a forest. Each individual tree outputs a
class prediction, and the class with the most votes becomes the model's prediction.
Adding more trees generally improves accuracy, up to a point of diminishing returns.
The three common methodologies are:
• Forest-RI (random input selection);
• Forest-RC (random linear combinations of inputs);
• Combination of Forest-RI and Forest-RC.
It can be used for classification as well as regression tasks, does well on
classification tasks, and can cope with missing values. Its drawbacks are that
predictions can be slow when the forest has many trees or the data set is large,
and that the results are hard to interpret.
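The voting step can be sketched as follows. The three "trees" here are hand-written threshold rules with hypothetical feature names and cut-offs; a real random forest learns each tree from a bootstrap sample of the data with a random subset of features.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

def forest_predict(trees, sample):
    """Collect one prediction per tree and combine them by majority vote."""
    return majority_vote([tree(sample) for tree in trees])

# Stand-ins for trained trees: simple rules on hypothetical features.
trees = [
    lambda s: 1 if s["age"] > 55 else 0,
    lambda s: 1 if s["chol"] > 240 else 0,
    lambda s: 1 if s["max_hr"] < 140 else 0,
]

patient = {"age": 61, "chol": 230, "max_hr": 130}
print(forest_predict(trees, patient))  # two of the three rules vote for class 1
```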
Logistic regression algorithm
Logistic regression is one of the most popular machine learning algorithms and
belongs to the supervised learning techniques. It predicts a
categorical dependent variable from a given set of independent variables, so the
outcome must be a categorical or discrete value such as Yes or No, 0 or 1, True
or False. Instead of giving the exact value 0 or 1, it gives a probability
between 0 and 1.
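The probability output comes from the logistic (sigmoid) function applied to a weighted sum of the features. The sketch below uses made-up weights rather than fitted ones, and the 0.5 decision threshold is an assumption.

```python
import math

def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Probability of the positive class for one sample."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def predict_label(weights, bias, features, threshold=0.5):
    """Turn the probability into a 0/1 decision using a threshold."""
    return 1 if predict_proba(weights, bias, features) >= threshold else 0

weights = [0.8, -0.5]  # hypothetical coefficients, not fitted to any data
bias = -0.2
p = predict_proba(weights, bias, [1.5, 0.4])
print(p, predict_label(weights, bias, [1.5, 0.4]))
```

Lowering the threshold trades more false alarms for fewer missed cases, which can be preferable in a screening setting.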
XGB Classifier
XGBoost is an optimized distributed gradient boosting library designed to be highly
efficient, flexible and portable. It implements machine learning algorithms under the
Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as
GBDT, GBM) that solves many data science problems in a fast and accurate way. The
same code runs on major distributed environments (Hadoop, SGE, MPI) and can solve
problems beyond billions of examples.
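The idea behind gradient boosting can be shown with a toy sketch: each round fits a small tree (here a one-split stump) to the residuals of the current prediction and adds a damped copy of it to the model. This is plain gradient boosting with squared loss on a single made-up feature, not XGBoost itself, which adds regularization, second-order gradient information and parallel tree construction.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue  # a split must put samples on both sides
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    return best[1:]  # (threshold, left value, right value)

def boost(xs, ys, n_rounds=10, learning_rate=0.5):
    """Start from the mean and repeatedly add stumps fitted to the residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lm, rm = fit_stump(xs, residuals)
        stumps.append((t, lm, rm))
        preds = [p + learning_rate * (lm if x <= t else rm)
                 for x, p in zip(xs, preds)]
    return base, stumps, preds

def mse(ys, preds):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

xs = [1, 2, 3, 4, 5, 6]  # made-up feature values
ys = [0, 0, 0, 1, 1, 1]  # made-up targets
base, stumps, preds = boost(xs, ys)
print(mse(ys, [base] * len(ys)), mse(ys, preds))  # error before and after boosting
```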
Evaluation
We split the data set into a training set and a test set. After splitting, the models are
first trained on the training set and then tested on the test set using the XGBoost
classifier, Logistic Regression, KNN, and Random Forest.
We initialize the classifier model and set two hyperparameters, max_depth
and n_estimators. These are set on the lower side to reduce overfitting.
We then perform cross-validation on the training set to check the accuracy. The
accuracy can be further improved by hyperparameter tuning. The XGBoost Python
model also reports an importance score (F-score) for each feature; here the list was
limited to the 7 most important features. Increasing this number shows how the
remaining features rank by F-score. Unimportant features can
also be removed before retesting the model.
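The split and cross-validation steps can be sketched with index lists alone. The 20-sample size, the 80/20 split, the 4 folds and the helper names are assumptions for illustration; the slides do not state the values used in the project.

```python
import random

def train_test_split(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and hold out a fraction of them for testing."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]

def k_fold_indices(indices, k=5):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

train_idx, test_idx = train_test_split(20)
for training, validation in k_fold_indices(train_idx, k=4):
    # A model would be fitted on `training` and scored on `validation` here.
    print(len(training), len(validation))
```

The test indices are never touched during cross-validation, so the final test score remains an unbiased check of the chosen model.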
Comparison of Existing model
This module compares the accuracy of the existing system's algorithms with the accuracy
of our proposed model. Our aim is to improve the accuracy score, so we changed the models
used in our project.
The existing system was implemented with Naive Bayes, Logistic Regression, Decision Tree,
and KNN, and gave 85% accuracy. In
comparison, our project achieved higher accuracy than the existing system.
Performance Analysis
The next stage is to predict the results using the classifier. The best method is the
one that gives the best classification accuracy and recall on both the training and
test data sets. The best classifier, here Random Forest, is then forwarded to the next
stage to predict the disease that may further lead to a cardiac arrest.
The results are compared using a confusion matrix, a table that shows the
classifier's predictions against the true labels over a set of known test
data.
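Accuracy and recall are both derived from four counts: true positives, false positives, false negatives and true negatives. The sketch below computes them for made-up labels; the values are not results from this project.

```python
def confusion_matrix(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)

def recall(tp, fn):
    """Fraction of actual positive cases that the classifier found."""
    return tp / (tp + fn) if (tp + fn) else 0.0

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # made-up ground truth
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # made-up predictions
tp, fp, fn, tn = confusion_matrix(y_true, y_pred)
print(tp, fp, fn, tn)
print(accuracy(tp, fp, fn, tn), recall(tp, fn))
```

Recall is worth reporting alongside accuracy in a diagnostic setting because a false negative (a missed patient) is usually costlier than a false positive.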
QUERIES
THANK YOU