MEDICAL DATA MINING AND ANALYSIS FOR HEART
DISEASE DATASET USING CLASSIFICATION
TECHNIQUES
Ranganatha S.1, Pooja Raj H.R.2, Anusha C.3, Vinay S.K.4
1,2,3 Govt. Engineering College, Hassan, INDIA
4 PES Institute of Technology, Bangalore, INDIA
1 [email protected], 2 [email protected], 3 [email protected], 4 [email protected]

Keywords: data mining, classification, entropy, gain, health informatics, ID3, Naïve Bayesian.

Abstract

Modern medicine generates a great deal of information, which is stored in medical databases. Extracting useful knowledge from these databases and making scientific decisions for the diagnosis and treatment of disease is becoming increasingly necessary. Data mining in medicine can deal with this problem. It can also improve the management of hospital information and promote the development of telemedicine and community medicine. The medical field is primarily directed at patient care and only secondarily serves as a research resource; the only justification for collecting medical data is to benefit the individual patient. The main theme of this paper is to store the medical information of patients who are hospitalized for heart disease, run algorithms on that information, and present the results as user-understandable words and graphs. When very large data sets are present, data mining algorithms (here, only ID3 and Naïve Bayesian) are used. ID3 outputs its result as a decision tree, which can be easily understood. Naïve Bayesian predicts the chances of heart disease based on the given conditions.

1 Introduction

Information technologies in health care have made it possible to create patient records from the monitoring of patient visits. This information includes the type of disease, patient information, lab results, etc. Health records are private information, yet the use of these private documents may help in treating deadly diseases [1]. Before the data mining process begins, healthcare organizations must formulate a clear policy concerning the privacy and security of patient records. This policy must be fully implemented in order to ensure patient privacy. The problem of prediction in the medical domain can be divided into two phases [6]: a learning phase and a decision-making phase. In the learning phase, a large data set is transformed into a reduced (simplified) data set. The number of features and objects in this new set is much smaller than in the original set. The rules generated in this phase are used later to make accurate decisions. The newly formed data set is used by the predictive algorithm to make predictions when new instances with unknown outcomes occur. A huge number of medical records are stored in databases and data warehouses. Such databases and applications differ from one another. With the evolution of machines, we have found that some time-consuming and complex mathematical calculations can be done using calculators. Using current machines, specific information in a large data set can be found quickly and easily. We also use machines to store information, to remind us of appointments, and so on. As the size of the data increases, the computer storage required also increases. Because of the vast amount of data that has been created, algorithms were invented that produce results once a query is supplied.

2 Existing System and Motivation

The AIS algorithm, named after its authors Agrawal, Imielinski and Swami, is the first published algorithm to produce all large item-sets in a transaction database [2]. It should not be confused with the Abbreviated Injury Scale (also abbreviated AIS), an anatomically based coding system created by the Association for the Advancement of Automotive Medicine to classify and describe the severity of specific individual injuries [7][8]. The AIS algorithm was designed to discover qualitative rules. The technique is limited to only one item in the consequent, and it makes multiple passes over the entire database.

In AIS, the frequent item sets are generated by scanning the database several times. The support count of each individual item is accumulated during the first pass over the database. Items whose support count is less than the minimum support count are eliminated from the list of items. Candidate 2-item sets are generated by extending frequent 1-item sets with other items in the transaction. During the second pass over the database, the support counts of these candidate 2-item sets are accumulated and checked against the support threshold.
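The two passes described above can be illustrated with a minimal Python sketch. It is a simplification for illustration, not the published AIS implementation: the function name `frequent_itemsets_two_pass`, the item names and the threshold are hypothetical, and a candidate pair is counted whenever at least one of its items is frequent.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets_two_pass(transactions, min_support):
    """Return frequent 1-item and 2-item sets with their support counts.

    Hypothetical helper: a simplified two-pass count, not the original AIS code.
    """
    # Pass 1: accumulate the support count of every individual item.
    item_counts = Counter(item for t in transactions for item in set(t))
    frequent_1 = {i: n for i, n in item_counts.items() if n >= min_support}

    # Pass 2: extend frequent 1-item sets with other items of the same
    # transaction and accumulate the support of the resulting candidates.
    pair_counts = Counter()
    for t in transactions:
        for a, b in combinations(sorted(set(t)), 2):
            if a in frequent_1 or b in frequent_1:
                pair_counts[(a, b)] += 1

    # Keep only the candidates that reach the support threshold.
    frequent_2 = {p: n for p, n in pair_counts.items() if n >= min_support}
    return frequent_1, frequent_2


# Illustrative transactions (invented for this sketch).
transactions = [
    ["chest_pain", "high_bp", "smoker"],
    ["chest_pain", "high_bp"],
    ["high_bp", "diabetes"],
    ["chest_pain", "smoker"],
]
frequent_1, frequent_2 = frequent_itemsets_two_pass(transactions, min_support=2)
```

In this toy run the pairs (high_bp, smoker) and (diabetes, high_bp) are generated and counted but then fall below the threshold, which is the kind of wasted candidate counting that motivates looking at other algorithms.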
Data security and privacy are important issues when health-related data is considered. Health informatics deals with biomedical information, data and knowledge, and with their storage, retrieval and optimal use for problem solving and decision making [3]. One of the unique characteristics of medical data mining is that the result can be obtained as a description in words, as pictures, or in graphical formats such as bar charts and pie charts.

In data mining, the data set plays a very important role. A data set can be taken for any particular disease or group of diseases, but the proposed system considers a data set only for heart disease and concentrates on this single set to the maximum extent to produce useful results.

We selected heart disease because, in this fast-moving world, people want to live a luxurious life and work to earn a lot of money and live comfortably. In this race, people forget to take care of themselves, which changes their food habits and their entire lifestyle. They are under more stress, which leads to high blood pressure and high blood sugar at a very young age. They do not get sufficient rest, and most of them eat unhealthy, non-nutritious food. They do not even bother when they get sick, and they resort to self-medication. These small acts of negligence together lead to the major threat called heart disease. It is well known that the heart is the most essential organ in the human body; if it is affected, other vital parts of the body are affected as well, and overall health may suffer. Therefore, it is very important for people to undergo heart disease diagnosis.

Data mining basically consists of 4 types of techniques. They are:
1. Classification
2. Association
3. Sequencing
4. Clustering
In this paper we make use of classification techniques [4]. Classification is a data mining (machine learning) technique used to predict group membership for data instances. Classification predicts categorical labels, whereas prediction models continuous-valued functions. Classification is the task of generalizing known structure to apply to new data.
Classification routines in data mining use a variety of algorithms, and the particular algorithm used can affect the way records are classified. A common approach for classifiers is to use decision trees to partition and segment records. New records can be classified by traversing the tree from the root through branches and nodes to a leaf representing a class. The path a record takes through a decision tree can then be represented as a rule.

3 Methodology

Initially, the user must log in to enter the patient information. After login, the patient information page is displayed, where the user fills in the patient history form; this form is given as input to the algorithms (ID3 and Naïve Bayesian). The algorithms are executed to give the result as a decision tree in the case of ID3 and as a probability in the case of Naïve Bayesian. For better understanding, the result is also shown in the form of charts. The AIS algorithm results in unnecessary generation and counting of candidate item sets that turn out to be small, and it requires too many passes over the whole database. The ID3 and Naïve Bayesian algorithms are used to overcome this disadvantage.

The different components of the system are connected as shown in Figure 1. The flow of the system starts with the collection of raw data, which is used for data mining. This data is first preprocessed and converted into formats understood by the different tools used in the mining process. Missing values can be handled either in the preprocessing stage or by using a separate tool. The training part of the cleaned data is first passed to the data mining algorithms, which extract similarities in the data. Once extracted, these similarities are called patterns or rules. Based on these patterns and rules, classification of the testing data set takes place.

Figure 1: System Architecture
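The preprocessing and training/testing flow of the Methodology can be sketched in Python as follows. This is only an illustration under stated assumptions: the paper does not say how missing values are treated or how the data is divided, so filling with the most common value, the 25% test fraction, and the helper names `fill_missing` and `train_test_split` are all hypothetical.

```python
import random
from collections import Counter


def fill_missing(records, attributes, missing=None):
    """Replace missing attribute values with that attribute's most common value.

    Assumption: mode imputation; the source does not specify the method.
    """
    cleaned = [dict(r) for r in records]  # copy so the raw data is untouched
    for a in attributes:
        observed = [r[a] for r in cleaned if r.get(a) is not missing]
        if not observed:
            continue  # nothing to impute from
        mode = Counter(observed).most_common(1)[0][0]
        for r in cleaned:
            if r.get(a) is missing:
                r[a] = mode
    return cleaned


def train_test_split(records, test_fraction=0.25, seed=0):
    """Shuffle the records reproducibly and split them into training and testing parts."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Illustrative raw records (invented for this sketch).
raw = [{"fbs": "High"}, {"fbs": None}, {"fbs": "High"}, {"fbs": "Normal"}]
cleaned = fill_missing(raw, ["fbs"])
train, test = train_test_split(cleaned)
```

The training part would then be passed to the mining algorithms, and the extracted patterns applied to the testing part.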
3.1 Dataset Description

As explained earlier, the proposed system considers a dataset only for heart disease, and the reason for choosing this particular dataset is explained in the earlier section. The data can be gathered from various sources, such as conversations with patients, laboratory results, and the review and interpretation of doctors' prescriptions.

The attributes of the heart disease dataset include:
1. Name
2. Age
3. Gender
4. Chest Pain Type
5. Rest ECG
6. CA (Coronary Angioplasty)
7. Exang
8. Slope
9. FBS (Fasting Blood Sugar)
10. Diagnosis of Heart Disease

Name: Name of the patient who comes for hospitalization.
Age: Age of the patient.
Gender: Male or Female.
Chest pain: Physical complaint that requires immediate diagnosis and evaluation. There are 4 instances of Chest pain:
1. Typical type1 angina
2. Typical type angina
3. Non angina pain
4. Asymptomatic
Rest ECG: An electrocardiogram (ECG) is a test that checks for problems with the electrical activity of the heart. There are 3 instances of Rest ECG:
1. ST-T wave abnormality
2. Left ventricular hypertrophy
3. Normal
Coronary Angioplasty (CA): CA is the technique of mechanically widening narrowed or obstructed arteries. The instances of CA are:
1. Reversible defect
2. Fixed defect
3. Normal
Exang: Exercise induced angina (Yes or No).
Slope: The slope of the peak exercise ST segment.
Fasting Blood Sugar (FBS): Measurement of the blood sugar level while fasting. FBS levels between 100 and 126 mg/dl.
Diagnosis of heart disease: Predicting the chances of heart disease (less or more).

While considering the ID3 algorithm, all attributes are given equal importance. In the Naïve Bayesian algorithm, age, FBS and weight are given more importance because it filters the attributes based on given conditions.

3.2 ID3 algorithm

The decision tree is one of the most popular classification algorithms in current use in data mining. Decision tree algorithms include ID3, C4.5, C5, J48 and CART. In this paper, the ID3 and Naive Bayesian algorithms are used because they are best suited to the heart disease dataset. The basic idea of the Iterative Dichotomiser 3 (ID3) algorithm is to construct the decision tree by employing a top-down, greedy search through the given sets to test each attribute at every tree node.

The main concepts used in the ID3 algorithm are Information Gain and Entropy, which select the attribute that is most useful for classifying the given sets. Entropy is defined as the measure of the amount of uncertainty in the (data) set S (i.e. entropy characterizes the data set). Entropy is calculated by

$$\mathrm{Entropy}(S) = \sum_{I} -p(I)\,\log_2 p(I)$$

where p(I) is the proportion of S belonging to class I; in our dataset there are two classes, {Yes, No}.
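As a small worked illustration of the entropy formula of Section 3.2, the Python sketch below computes Entropy(S) from a list of class labels. The function name `entropy` and the sample label counts are assumptions made for this example, not values from the paper's dataset.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy(S) = sum over classes I of -p(I) * log2 p(I)."""
    total = len(labels)
    counts = Counter(labels)
    # p(I) is the proportion of records in S that belong to class I.
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical set S with 9 "Yes" and 5 "No" diagnoses.
sample = ["Yes"] * 9 + ["No"] * 5
sample_entropy = entropy(sample)   # about 0.940 bits
pure_entropy = entropy(["Yes"] * 4)  # a set with a single class has zero entropy
```

A perfectly balanced two-class set gives an entropy of 1, and a set containing only one class gives 0, which is the stopping condition ID3 relies on.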
Information gain IG(A) measures the difference between the entropy of S before and after a split on an attribute A; in other words, how much the uncertainty in S is reduced by splitting S on attribute A. Information Gain is calculated by

$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$

where S is the total collection of records,
A is the attribute for which gain is calculated,
v ranges over all the possible values of the attribute A,
S_v is the subset of S for which attribute A has value v, and |S_v| is its number of elements,
∑ is the summation of (|S_v|/|S|) x Entropy(S_v) over all values v.

The steps involved in the ID3 algorithm are:
1. Calculate the entropy of every attribute using the dataset.
2. Split the set into subsets using the attribute for which entropy is minimum (or, equivalently, information gain is maximum).
3. Make a decision tree node containing the attribute.
4. Recurse on the subsets using the remaining attributes.

3.3 Naive Bayesian algorithm

The Naive Bayesian classifier is a probabilistic statistical classifier based on Bayes' theorem. The technique is particularly suited when the dimensionality of the inputs is high. Despite its simplicity, Naive Bayes can often outperform more sophisticated classification methods. The Naive Bayesian model identifies the characteristics of patients with heart disease. It shows the probability of each input attribute for the predictable state. Bayes' rule is the basis for many machine-learning and data mining methods. The rule (algorithm) is used to create models with predictive capabilities. It provides new ways of exploring and understanding data.

The Naive Bayesian algorithm is used:
- When the amount of data is large.
- When the attributes are independent of each other.
- When we want more efficient output compared to the output of other methods.

The main concept used in the Naive Bayesian algorithm is probability. Probability (or likelihood) is a measure of how likely it is that some event will occur; a number expressing the ratio of favorable cases to the whole number of cases possible. Probabilities are given a value between 0 (0% chance, or will not happen) and 1 (100% chance, or will happen) [5].

The steps of the Naive Bayesian algorithm are:
1. When the dataset is large, split the dataset based on the given condition.
2. Calculate the probability of each attribute in the reduced dataset.
3. Summation of these probabilities is used to predict the result.

The input to the algorithms can be given in the form of discrete or continuous values. ID3 checks every attribute and calculates its entropy and gain. It iterates over the attributes until the entropy becomes zero and the gain becomes maximum.

As ID3 checks every attribute, it takes a long time to compute. Naive Bayesian filters the dataset for the given condition, so its results are less effective and accurate than those of ID3.

4 Results

When the user logs in, the patient information form (Figure 2) is displayed. All the fields are filled in with proper parameters.

Figure 2: Patient information form

The ID3 algorithm generates the decision tree as output. At each iteration, it calculates the entropy and the gain as discussed above. When the entropy becomes zero, iteration stops and the result is obtained as shown in Figure 3. On each iteration, a graph of attributes versus the calculated gain is plotted, as shown in Figure 4.
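The per-attribute gain that ID3 computes at each iteration, and the tree that results, can be sketched in Python as below. This is a minimal illustration under assumptions, not the system's implementation: the record layout (one dictionary per patient), the attribute values and the helper names `information_gain` and `id3` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Entropy(S) = sum over classes I of -p(I) * log2 p(I)."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(labels).values())


def information_gain(records, attribute, target):
    """Gain(S, A) = Entropy(S) - sum over v of |S_v|/|S| * Entropy(S_v)."""
    subsets = defaultdict(list)
    for r in records:
        subsets[r[attribute]].append(r[target])  # class labels of S_v
    remainder = sum(len(s) / len(records) * entropy(s)
                    for s in subsets.values())
    return entropy([r[target] for r in records]) - remainder


def id3(records, attributes, target):
    """Build a nested-dict decision tree; leaves are class labels."""
    labels = [r[target] for r in records]
    # Stop when the subset is pure (entropy zero) or no attribute is left.
    if entropy(labels) == 0 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Split on the attribute with maximum information gain.
    best = max(attributes, key=lambda a: information_gain(records, a, target))
    remaining = [a for a in attributes if a != best]
    return {best: {value: id3([r for r in records if r[best] == value],
                              remaining, target)
                   for value in sorted({r[best] for r in records})}}


# Illustrative records (invented for this sketch).
records = [
    {"exang": "Yes", "fbs": "High", "diagnosis": "More"},
    {"exang": "Yes", "fbs": "Normal", "diagnosis": "More"},
    {"exang": "No", "fbs": "High", "diagnosis": "Less"},
    {"exang": "No", "fbs": "Normal", "diagnosis": "Less"},
]
gains = {a: information_gain(records, a, "diagnosis") for a in ["exang", "fbs"]}
tree = id3(records, ["exang", "fbs"], "diagnosis")
```

In this toy data the gain of exang is 1 and the gain of fbs is 0, so the tree splits on exang once and stops, since both resulting subsets have zero entropy. A dictionary such as `gains` holds the kind of attribute-versus-gain values that a bar chart would plot.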
In the Naive Bayesian algorithm, the probability is calculated based on the given condition and the prediction of heart disease is made (Figure 5). Figure 6 is the graph plotted between probability values (nvalues) and attributes.
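A minimal Python sketch of a Naive Bayesian prediction over categorical attributes follows. It uses the standard textbook formulation, in which the class prior is multiplied by the conditional probability of each attribute value (equivalently, the logarithms of these probabilities are summed); whether the system combines probabilities in exactly this way is not stated in the paper. The add-one smoothing, the record layout and the names `train_naive_bayes` and `predict_proba` are assumptions for illustration.

```python
from collections import Counter, defaultdict


def train_naive_bayes(records, attributes, target):
    """Count classes and per-class attribute values from the training records."""
    class_counts = Counter(r[target] for r in records)
    value_counts = defaultdict(Counter)  # (attribute, class) -> value counts
    values = defaultdict(set)            # attribute -> distinct values seen
    for r in records:
        for a in attributes:
            value_counts[(a, r[target])][r[a]] += 1
            values[a].add(r[a])
    return class_counts, value_counts, values


def predict_proba(model, attributes, patient):
    """Return the normalized probability of each class for one patient."""
    class_counts, value_counts, values = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total  # prior P(class)
        for a in attributes:
            # P(value | class) with add-one smoothing (an assumption here)
            # so that an unseen value does not force the product to zero.
            seen = value_counts[(a, c)][patient[a]]
            score *= (seen + 1) / (n_c + len(values[a]))
        scores[c] = score
    norm = sum(scores.values())
    return {c: s / norm for c, s in scores.items()}


# Illustrative records (invented for this sketch).
records = [
    {"exang": "Yes", "fbs": "High", "diagnosis": "More"},
    {"exang": "Yes", "fbs": "Normal", "diagnosis": "More"},
    {"exang": "No", "fbs": "High", "diagnosis": "Less"},
    {"exang": "No", "fbs": "Normal", "diagnosis": "Less"},
]
attributes = ["exang", "fbs"]
model = train_naive_bayes(records, attributes, "diagnosis")
probs = predict_proba(model, attributes, {"exang": "Yes", "fbs": "High"})
```

For the hypothetical patient above, the sketch yields a probability of 0.75 for "More" and 0.25 for "Less", that is, a higher chance of heart disease.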
Figure 3: Decision tree output from ID3 algorithm

Figure 4: ID3 Graph

Figure 5: Naïve Bayesian output

Figure 6: Naïve Bayesian Graph

5 Conclusions

The number of people falling sick and being admitted to clinics and hospitals is increasing. The growing number of patients in turn increases the amount of data that must be stored. As the size of the data increases, the computer storage required also increases. Because of the vast amount of data that has been created, humans invented algorithms that produce results once a query is supplied. The goals achieved by the developed system are:
1. Simplified and reduced manual work.
2. Large volumes of data can be stored.
3. It provides a smooth workflow.

References

[1] Canlas, R. D., "Data Mining in Healthcare: Current Applications and Issues", Carnegie Mellon University, Australia, 2009.
[2] R. Agrawal, T. Imielinski, A. Swami, "Mining Associations between Sets of Items in Massive Databases", Proc. of the ACM-SIGMOD Intl Conference on Management of Data, Washington D.C., 1993.
[3] Shortliffe, E.H., Perrault, L.E. (Eds.), "Medical informatics: Computer applications in health care and biomedicine" (2nd Edition), New York: Springer, 2000.
[4] Han, J., Kamber, M., "Data mining: Concepts and Techniques", New York: Morgan-Kaufman, 2000.
[5] Pang-Ning Tan, Michael Steinbach, Vipin Kumar, "Introduction to Data Mining", Pearson Education, 2007.
[6] Sunita Soni, O.P. Vyas, "Using Associative Classifiers for Predictive Analysis in Health Care Data Mining", International Journal of Computer Application (IJCA, 0975 – 8887), Volume 4, No. 5, July 2010, pages 33-34.
[7] "TRAUMA.ORG: Abbreviated Injury Scale". Archived from the original on 6 January 2011. Retrieved 2011-01-23.
[8] Lesko MM, Woodford M, White L, O'Brien SJ, Childs C, Lecky FE (2010). "Using Abbreviated Injury Scale (AIS) codes to classify Computed Tomography (CT) features in the Marshall System", BMC Med Res Methodol 10: 72. doi:10.1186/1471-2288-10-72. PMC 2927606. PMID 20691038.