Final Diabetes Prediction Documentation
Project Report
On
Submitted to
BHILAI
in partial fulfillment
of
Bachelor of Engineering
in
by
We, the undersigned, solemnly declare that the report of the project work entitled Diabetes Prediction
Using Data Mining is based on our own work carried out during the course of our study under the
We assert that the statements made and conclusions drawn are an outcome of the project work. We
further declare that, to the best of our knowledge and belief, the report does not contain any part of
any work which has been submitted for the award of any other degree/diploma/certificate in this
_________________ _________________
Kankasha Naij Shilpa Shahu
300102218320 300102218315
AS1108 AS0988
CERTIFICATE
This is to certify that the report of the project submitted is an outcome of the project work entitled Diabetes
Prediction Using Data Mining carried out by Shilpa Shahu (300102218315, AS0988) and Kankasha Naij
(300102218320, AS1108)
under my guidance and supervision for the award of the degree of Bachelor of Engineering in Computer Science
from Chhattisgarh Swami Vivekanand Technical University, Bhilai (C.G).
Prof. Sumit Sar Prof. Shiv Dutta Mishra Prof. Ashok Kumar Behera
Computer Sc. & Engg Computer Sc. & Engg. Computer Sc. & Engg.
The Project work as mentioned above is hereby being recommended and forwarded for examination and evaluation.
Submitted by
Have been examined by the undersigned as a part of the examination for the award of Bachelor of Engineering
degree in Computer Science and Engineering from Chhattisgarh Swami Vivekanand Technical University, Bhilai
(C.G)
Name: Name:
Date: Date:
ACKNOWLEDGEMENT
We have great pleasure in submitting this project report entitled Diabetes Prediction Using Data
Mining in partial fulfillment of the degree of Bachelor of Engineering (CSE). While submitting this project report, we take
this opportunity to thank those directly or indirectly related to the project work.
We would like to thank our guide, Prof. Ashok Kumar Behera, who provided the opportunity and
organized the project for us. Without this active co-operation and guidance, it would have been very difficult to
complete the task in time.
We would like to express sincere thanks and gratitude to Dr. M.K. Gupta, Principal of the Institution, and Dr.
(Mrs.) M. V. Padmavati, Head of the Department of Computer Science & Engineering, for their encouragement and
cordial support.
On submission of the project, we would also like to thank Prof. Sumit Sar and Prof. Shiv Dutta
Mishra, Project Coordinator, the faculty, and all the staff of the Department of Computer Science & Engineering, Bhilai
Institute of Technology, Durg, for their continuous help and guidance throughout the course of the project.
Acknowledgement is due to our parents, family members, friends, and all those who have helped us
directly or indirectly in the successful completion of the project work.
300102218315,300102218320
AS0988, AS1108
Contents
1. INTRODUCTION
1.1 OBJECTIVE
1.2 PROJECT DESCRIPTION
2. SYSTEM STUDY
2.1 EXISTING AND PROPOSED SYSTEM
2.2 FEASIBILITY STUDY
2.3 TOOLS AND TECHNOLOGIES USED
2.4 HARDWARE AND SOFTWARE REQUIREMENTS
3. SOFTWARE REQUIREMENTS SPECIFICATION
3.1 USERS
3.2 FUNCTIONAL REQUIREMENTS
3.3 NON-FUNCTIONAL REQUIREMENTS
4. SYSTEM ANALYSIS AND DESIGN
4.1 SYSTEM PERSPECTIVE
4.2 DATABASE DESIGN (ER and/or Conceptual schema)
4.3 CONTEXT DIAGRAM (DFD)
4.4 USE CASE DIAGRAM
4.5 SEQUENCE DIAGRAMS
4.6 ACTIVITY DIAGRAM
5. IMPLEMENTATION
5.1 SCREENSHOTS
6. SOFTWARE TESTING
7. CONCLUSION
8. FUTURE ENHANCEMENTS
BIBLIOGRAPHY
Chapter 1
Introduction
1.1 Objective
To predict diabetes in the healthcare industry using data mining.
To predict and categorize the state of health.
To identify appropriate factors that affect health conditions.
To design an artificial neural network that can be used to predict health performance based on certain
pre-defined data for a particular health condition.
Diabetes is a chronic disease that occurs when the pancreas fails to produce enough insulin, or when the body
cannot effectively use the insulin it produces. Insulin is a hormone that regulates the level of sugar in the blood.
Hyperglycemia, or raised blood sugar, is a common result of uncontrolled diabetes and, over time, causes severe damage to
many organs, particularly nerves and blood vessels.
In 2015, 8.5% of adults aged 17 years or older had diabetes. In 2013, diabetes was the cause of 1.5 million deaths, and
high blood glucose caused 2.3 million deaths. The number of diabetes patients worldwide has doubled in the last ten
years: more than 200 million people are affected, and the prevalence of diabetes increases by about seven percent
annually. People have long suffered from diseases that, in some cases, could be diagnosed and treated; unfortunately,
when symptoms go undiagnosed for a long time, the patient's life may be threatened. Many studies have therefore
addressed the prediction of diseases, to the extent that decision support models and intelligent methods are now
used for prediction. One application of decision support models is in the medical field, in the
diagnosis of illnesses such as diabetes [1, 2]. Delay in the diagnosis and prediction of diabetes, through insufficient
control of blood glucose, increases the risk of macrovascular and capillary complications, eye disease, and kidney failure
[1, 2].
Data Mining is an analytic process designed to explore data in search of consistent patterns and systematic
relationships between variables, and then to validate the results by applying the patterns found to a new subset of data.
Data mining is often described as the process of discovering patterns, correlations, trends or relationships by searching
through a large amount of data stored in repositories, databases, and data warehouses. Diabetes, often referred to by
doctors as diabetes mellitus, describes a group of metabolic diseases in which the person has high blood glucose
(blood sugar), either because insulin production is insufficient, or because the body's cells do not respond properly to
insulin, or both. This project helps identify whether a person has diabetes: if the person is predicted to be diabetic,
the project suggests measures for maintaining normal health, and if not, it predicts the risk of developing diabetes.
In this project a classification algorithm was used to classify the Pima Indian diabetes dataset. Results are obtained
through an Android application.
Data mining, the extraction of hidden predictive information from large databases [1, 12], is a powerful
technology with great potential to help companies focus on the most important information in their data warehouses.
Diabetes has become one of the most common diseases in today's world, so it is important for every individual to take
the precautionary measure of checking whether they are at risk of developing diabetes. For this purpose we use data
mining techniques to predict whether a person is diabetic or not. The approach is attractive because the results are
obtained through an Android application installed on a mobile device. The main reason for the accuracy of the results
is that only the most significant attributes associated with diabetes are considered in the analysis.
Data mining tools predict future trends and behaviours, allowing businesses to make proactive, knowledge-driven
decisions. The automated, prospective analyses offered by data mining move beyond the analyses of past events
provided by retrospective tools typical of decision support systems. They scour databases for hidden patterns, finding
predictive information that experts may miss because it lies outside their expectations.
Diabetes (diabetes mellitus) [24] is classed as a metabolism disorder. Metabolism refers to the way our bodies use
digested food for energy and growth. Most of what we eat is broken down into glucose. Glucose is a form of sugar in
the blood; it is the principal source of fuel for our bodies.
A person with diabetes has a condition in which the quantity of glucose in the blood is too elevated (hyperglycemia).
This is because the body either does not produce enough insulin, produces no insulin, or has cells that do not respond
properly to the insulin the pancreas produces. This results in too much glucose building up in the blood. This excess
blood glucose eventually passes out of the body in urine. So, even though the blood has plenty of glucose,
the cells are not getting it for their essential energy and growth requirements.
We therefore propose a model to predict diabetes that can be useful to doctors and practitioners. In this
research, we used the following attributes: Gender Details, Age category (Under 35 or Above), Existing sufferance of
diabetes, Thirsty level, Excess hunger, How often patient feel excreta, Weight loss, Genetic existence of diabetes,
High blood glucose, Blurry vision, High blood pressure, Consumption of tobacco products or smoking, Consumption
of vegetables and fruits, Physical Activity, waist circumference, User height, User weight.
According to Diabetes Research Center reports, the incidence of diabetes has doubled in the last ten years worldwide;
more than 200 million people are affected, and the prevalence of diabetes increases by about seven percent annually
worldwide.
Since diabetes is a chronic disease that causes permanent damage to the limbs and vital organs of the body, using
artificial intelligence tools can enhance detection methods and disease control, which will be of great help to
physicians. According to the Diabetes Research Center, early diagnosis of patients at risk can
prevent or defer 80 percent of the lasting complications of type II diabetes.
There are two types of diabetes, type I and type II. Type I diabetes is also called insulin-dependent diabetes, and
type II diabetes is characterized by relative insulin deficiency. Long-term complications of diabetes fall mainly into two
categories: vascular and non-vascular complications.
Vascular complications include microvascular (eye disease, neuropathy, nephropathy) and macrovascular
complications (coronary artery disease, peripheral vascular disease, cerebrovascular disease). Non-vascular
complications include gastroparesis, sexual dysfunction, and skin changes.
Diabetes is one of the major international health problems. World Health Organization reports say that around 422
million people have diabetes worldwide. In the current pandemic, people with diabetes also face a higher chance
of experiencing serious complications from COVID-19. In general, people with diabetes are more likely to
experience severe symptoms and complications when infected with a virus.
By studying the literature closely and drawing on the experience of human experts on pathological conditions, a
number of factors have been identified that influence patients' condition in the subsequent period.
These factors were carefully studied and mapped to appropriate numeric codes for use within the
ANN modelling environment. The factors were categorized as input variables and output variables that reflect some
possible levels of disease status in terms of the assessment system. The data were entered into the JNN tool
environment, the importance of each variable was determined using JNN (to find the most influential factor on
diabetes), and the data were then used for training, validation, and testing.
Classification is one of the most important decision-making techniques in many real-world problems. In this work, the
main objective is to classify the data as diabetic or non-diabetic and to improve the classification accuracy. In many
classification problems, choosing a larger number of samples does not necessarily lead to higher classification accuracy.
In many cases, an algorithm performs well in terms of speed but its classification accuracy is low.
The main objective of our model is to achieve high accuracy. Classification accuracy can be increased if most
of the data set is used for training and a smaller part for testing. This survey has analyzed various classification
techniques for classifying diabetic and non-diabetic data. It is observed that techniques such as Support Vector Machine,
Logistic Regression, and Artificial Neural Network are the most suitable for implementing the diabetes prediction system.
This section describes the flow of the project work. The first step is the collection of the data needed for the
work; the dataset used here is the Pima Indian diabetes dataset. The next step is
preprocessing of the data, in which we convert the raw data into an understandable format. The preprocessed
data is then used to build a decision tree that predicts whether a person is diabetic or not. The
user enters his or her details into an Android app installed on a mobile device to obtain the result. The
attributes entered by the user are compared against the decision tree and the results are generated.
In this project a diabetes data set is considered. The data set was obtained by consulting a specialist doctor. The data
set consists of 13 attributes. The class label is binary and takes two values:
Tested positive (1), which means diabetic
Tested negative (0), which means non-diabetic
Chapter 2
System Study
Although data mining has been around for more than two decades, its potential is only being realized now. Data
mining combines statistical analysis, machine learning and database technology to extract hidden patterns and
relationships from large databases. The two most common modelling objectives are classification and prediction.
Classification models predict categorical labels (discrete, unordered) while prediction models predict continuous-
valued functions. Decision Trees and Neural Networks use classification algorithms while Regression, Association
Rules and Clustering use prediction algorithms.
Data preprocessing and data mining algorithms are used in the rest of the project. A data preprocessing
technique, data transformation, is applied to the data set before the data mining algorithms are applied. Decision tree
and regression models are then built and used to predict the final binary target variable.
After running the different types of models, a model comparison is needed to select the best algorithm. The best
algorithm and model are selected on the basis of the highest accuracy rate.
This project helps identify whether a person has diabetes: if the person is predicted to be diabetic, the project
suggests measures for maintaining normal health, and if not, it predicts the risk of developing diabetes. In this project
a classification algorithm was used to classify the Indian diabetes dataset. Results are obtained through an Android
application.
There is also a need to automate the overall process of diabetes prediction. Automating the diabetic database helps
identify the impact of diabetes on various human organs.
R. Ali, M. H. Siddiqi, M. Idris, B. H. Kang and S. Lee used data mining to develop a model for
classifying diabetic patient control level based on historical medical records. The authors were motivated by the deaths
caused by diabetes worldwide and the resulting need to avoid complications of the disease. They developed a new
predictive model using data mining techniques that classifies diabetic patient control level based on historical
medical records. The research used three data mining techniques, Naïve Bayes, Logistic and
J48, and was implemented using the WEKA application. The results showed that the Logistic algorithm
gave an average precision of 0.73, recall of 0.744, F-measure of 0.653 and accuracy of 74.4%. Naïve Bayes gave an
average precision of 0.717, recall of 0.742, F-measure of 0.653 and accuracy of 74.2%. J48 gave an average precision
of 0.54, recall of 0.735, F-measure of 0.623 and accuracy of 73.5%. These results indicate that the Logistic algorithm was
more accurate than the other two. The research was limited in that only type 2 diabetes was considered. The authors also
did not look into discovering appropriate features with minimal effort or validating the discovered features.
S. Abu Naser, I. Zaqout, M. A. Ghosh, R. Atallah and E. Alajrami developed a prediction model for
type II diabetes treatment plans using data mining. The authors were motivated by the highly dangerous
complications of the chronic disease, including complications that require amputation of a limb. They
developed a new model for classifying type 2 diabetes treatment plans, which could help control the blood glucose
level of diabetic patients. They used the J48 algorithm in an experiment on 318 medical records
collected from JABER ABN ABU ALIZ clinic centre for diabetes in Sudan. The basic control information
showed that 59.1% of the records were assigned to Oral Hypoglycemic, 35.5% to Insulin and 5.3% to Diet. The
evaluation was done using the WEKA application. The research did not consider type 1 diabetes patients, who
could have been included with additional attributes. Nutrition and exercise could also have been included
to increase the accuracy of the system.
A. Elzamly, S. S. Abu Naser, B. Hussin and M. Doheir predicted diabetes mellitus using
boosting ensemble modelling. They were motivated by the aim of helping diabetes patients carry on with their
normal activities of life by predicting their condition early and tackling it. They intended to predict the diabetes type of
patients from physical and clinical information using a boosting ensemble technique
that internally uses a random committee classifier. The architecture
integrated data management, learning, and prediction components. The evaluation of the technique
showed a weighted average TP rate of 0.81, FP rate of 0.198, Precision of 0.81, Recall of 0.81,
F-measure of 0.82 and ROC area of 0.82 for diabetes types 1 and 2. The authors intend to extend the work in
future by integrating it into a cloud-based clinical decision support system for chronic diseases and by including a
feedback mechanism to increase user satisfaction.
2.2 Feasibility Study
Diabetes, or diabetes mellitus, is a metabolic disorder. The disease either destroys the body's ability to
produce insulin or causes the body to develop resistance to insulin, so that the insulin produced
cannot do its normal job. The main role of insulin is to decrease blood sugar through several
mechanisms. There are two main types of diabetes. In type I diabetes, destruction of pancreatic beta cells impairs
insulin production; in type II, there is progressive insulin resistance in the body, which may ultimately lead to
the destruction of pancreatic beta cells and defects in insulin production. In type II diabetes, genetic
factors, obesity and lack of physical activity are known to play an important part.
Although the precise cause of type I diabetes is unknown, factors that may indicate a greater risk include the
following:
Family history. A person's risk increases if a parent or sibling has a history of type I diabetes.
Environmental factors. Circumstances such as exposure to a viral illness probably play some role in type I
diabetes.
The presence of damaging immune system cells. Family members of a person with type I diabetes
are sometimes tested for the presence of diabetes autoantibodies. A person who has these autoantibodies has
an increased risk of developing type I diabetes. However, not every person who has these
autoantibodies develops diabetes.
Geography. Some countries, such as Sweden, have higher rates of type I diabetes.
Researchers do not fully understand why certain people develop pre-diabetes and type II diabetes and others
do not. It is clear, however, that some factors increase the risk:
Weight. The more fatty tissue a person has, the more resistant that person's cells become to insulin.
Inactivity. The less active a person is, the greater the risk. Physical activity helps a person control
his or her weight, uses up glucose as energy and makes the cells more sensitive to insulin.
Family history. A person's risk increases if a parent or sibling has a history of type II diabetes.
Age. A person's risk increases with age. This may be because people tend to exercise less,
lose muscle mass and gain weight as they get older. However, type II diabetes is also increasing
among children, adolescents and adults.
Gestational diabetes. If a woman developed gestational diabetes when she was pregnant, her risk of later developing
pre-diabetes and type II diabetes increases. If she gave birth to a baby weighing more than 4
kilograms, she is also at risk of type II diabetes.
Polycystic ovary syndrome. For women, having polycystic ovary syndrome increases the risk of
diabetes.
High blood pressure. Blood pressure above 140/90 millimeters of mercury (mm Hg) is linked
to an increased risk of type II diabetes.
Abnormal cholesterol and triglyceride levels. If a person has low levels of high-density lipoprotein, or good
cholesterol, his or her risk of type II diabetes is higher. Triglycerides are another type of fat carried
in the blood. A person with high levels of triglycerides has an increased risk of type II diabetes.
A practical approach to this type of problem is regression analysis, in which past data are
fitted to some function. The result is an equation in which each input $x_j$ is multiplied by a weight $w_j$; the sum of
all these products plus a constant (written here as $b$) gives the output $y = \sum_{j=0}^{n} w_j x_j + b$.
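The weighted sum described above can be illustrated with a minimal sketch. The function name `linear_output`, the name `bias` for the constant term, and the weights and inputs are all hypothetical and chosen only to show the arithmetic; they are not values fitted in this project.

```python
def linear_output(weights, inputs, bias=0.0):
    """Return y = sum_j w_j * x_j + bias for two equal-length sequences."""
    if len(weights) != len(inputs):
        raise ValueError("weights and inputs must have the same length")
    return sum(w * x for w, x in zip(weights, inputs)) + bias


# Hypothetical weights and inputs, for illustration only.
example_y = linear_output([0.5, -1.0, 2.0], [2.0, 1.0, 0.5], bias=1.0)
print(example_y)
```

In a real regression model the weights and the constant would be estimated from past data rather than written by hand.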
The difficulty lies in choosing an appropriate function that fits all the collected data and adjusts the output
automatically as more information becomes available, because the outcome is governed by a number of
factors and this relationship does not follow any clear regression model.
The artificial neural network, which emulates human thinking in solving a problem, is a more general approach
that can address this type of problem. Hence the attempt to develop an adaptive system, such as an artificial neural
network, to predict and classify the condition based on these factors.
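As a rough illustration of how such a network turns input factors into a prediction, the following sketch computes the forward pass of a small network with one hidden layer. The sigmoid activation, the layer sizes, and the untrained placeholder weights are assumptions made for this example; the report does not specify the network architecture.

```python
import math


def sigmoid(z):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights, biases):
    """Output of one fully connected layer: one weight row and one bias per neuron."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def predict_probability(inputs, hidden_w, hidden_b, output_w, output_b):
    """Forward pass: inputs -> hidden layer -> single output neuron."""
    hidden = layer(inputs, hidden_w, hidden_b)
    return layer(hidden, [output_w], [output_b])[0]


# Hypothetical, untrained weights (all zero), used only to show the data flow.
p = predict_probability(
    [1.0, 0.0, 1.0],
    hidden_w=[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    hidden_b=[0.0, 0.0],
    output_w=[0.0, 0.0],
    output_b=0.0,
)
print(p)
```

Training would adjust the weights and biases so that the output approaches 1 for diabetic cases and 0 for non-diabetic cases; that step is omitted here.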
Diabetes is affected by various factors such as height, weight, heredity and insulin, but sugar concentration is
considered the major factor. Early identification is the only way to avoid the
complications. Many researchers are conducting experiments on diagnosing the disease using various machine learning
classification algorithms such as J48, SVM, Naive Bayes, Decision Tree and Decision Table, since
research has shown that machine learning algorithms work well in diagnosing different diseases. Data mining
and machine learning algorithms gain their strength from the ability to manage large amounts of data, to combine
data from several different sources and to integrate background information into the study. This research work
focuses on pregnant women suffering from diabetes. In this work, the Naive Bayes, SVM, and Decision Tree machine
learning classification algorithms are used and evaluated on the PIDD dataset to predict diabetes in a
patient. The experimental performance of the three algorithms is compared on various measures, and good
accuracy is achieved. The rest of the discussion is organized as follows: Section-II reviews related work on various
classification techniques for the prediction of diabetes, Section-III describes the methodology and briefly discusses the
dataset used, Section-IV discusses the evaluated results, and Section-V gives the conclusion of the research work.
Sernyak used logistic regression analysis to calculate the odds ratio between atypical neuroleptic prescription and a
diagnosis of diabetes in each age group, controlling for the effects of demographics and diagnosis. Thirugnanam improved
diabetes prediction using fuzzy neural networks [10]. Hamid and others proposed hybrid intelligent systems for the
detection of microalbuminuria in patients with type 2 diabetes without measuring urinary albumin. Javad and
others proposed a method based on automatic learning to regulate blood sugar in type II diabetes.
2.3 Tools And Technologies Used
The proposed procedure is summarized in Figure-1 below in the form of a model diagram. The figure shows the flow of the
research conducted in constructing the model.
                      Classified as A   Classified as B
A-Tested Negative           500                 0
B-Tested Positive           268                 0
Naive Bayes Classifier: Naive Bayes is a classification technique based on the assumption that all features are
independent of and unrelated to each other. It assumes that the status of a specific feature in a class does not affect
the status of another feature. Since it is based on conditional probability, it is considered a powerful algorithm for
classification. It works well for data with class imbalance and missing values. Naive Bayes [24] is a
machine learning classifier which employs Bayes' theorem. Using Bayes' theorem, the posterior probability P(C|X) can
be calculated from P(C), P(X) and P(X|C): P(C|X) = (P(X|C) P(C)) / P(X), where P(C|X) is the posterior probability of
the target class C given the predictor X, P(X|C) is the probability of the predictor given the class, P(C) is the prior
probability of class C, and P(X) is the prior probability of the predictor. The evaluated performance of the Naive
Bayes algorithm, shown as a confusion matrix, is as
follows:
                      Classified as A   Classified as B
A-Tested Negative           422                78
B-Tested Positive           104               164
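The Bayes' theorem calculation described above can be sketched as follows. The class priors are taken from the class counts in the confusion matrix (500 tested negative, 268 tested positive); the per-feature likelihoods are hypothetical values invented for this example, and the function names are illustrative only.

```python
def posterior(prior_c, likelihood_x_given_c, evidence_x):
    """Bayes' theorem: P(C|X) = P(X|C) * P(C) / P(X)."""
    if evidence_x <= 0:
        raise ValueError("P(X) must be positive")
    return likelihood_x_given_c * prior_c / evidence_x


def naive_likelihood(feature_likelihoods):
    """Naive independence assumption: P(X|C) is the product of per-feature terms."""
    result = 1.0
    for p in feature_likelihoods:
        result *= p
    return result


# Class priors from the class counts in the confusion matrix.
prior_pos = 268 / 768
prior_neg = 500 / 768

# Hypothetical per-feature likelihoods for one patient (two features).
lik_pos = naive_likelihood([0.6, 0.7])
lik_neg = naive_likelihood([0.3, 0.4])

# P(X) by the law of total probability over the two classes.
evidence = lik_pos * prior_pos + lik_neg * prior_neg

post_pos = posterior(prior_pos, lik_pos, evidence)
post_neg = posterior(prior_neg, lik_neg, evidence)
predicted = "Tested Positive" if post_pos > post_neg else "Tested Negative"
print(predicted, post_pos, post_neg)
```

A trained classifier would estimate the per-feature likelihoods from the training data instead of using fixed numbers.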
Decision Tree Classifier: Decision Tree is a supervised machine learning algorithm used to solve classification
problems. The main objective of using a Decision Tree in this research work is to predict the target class using
decision rules derived from prior data. It uses nodes and internodes for prediction and classification. The root and
internal nodes test the instances on different features and can have two or more branches, while the leaf nodes represent
the classification. At every stage, the Decision Tree chooses the node by selecting the attribute with the highest
information gain among all the attributes. The evaluated performance of the Decision Tree technique, shown as a
confusion matrix, is as follows:
                      Classified as A   Classified as B
A-Tested Negative           407                93
B-Tested Positive           108               160
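The attribute selection step described above can be sketched with plain entropy-based information gain. This is a minimal illustration on a tiny made-up data set with nominal attributes, not the WEKA implementation; C4.5-style learners typically use a normalized variant (gain ratio), whereas this sketch follows the plain information gain named in the text.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(rows, labels, attribute_index):
    """Entropy of the labels minus the weighted entropy after splitting on one attribute."""
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute_index], []).append(label)
    remainder = sum(
        len(part) / total * entropy(part) for part in partitions.values()
    )
    return entropy(labels) - remainder


def best_attribute(rows, labels):
    """Index of the attribute with the highest information gain."""
    return max(
        range(len(rows[0])),
        key=lambda i: information_gain(rows, labels, i),
    )


# Made-up toy data: (high blood glucose, blurry vision) -> class label.
toy_rows = [("yes", "yes"), ("yes", "no"), ("no", "yes"), ("no", "no")]
toy_labels = [1, 1, 0, 0]

gain_glucose = information_gain(toy_rows, toy_labels, 0)
gain_vision = information_gain(toy_rows, toy_labels, 1)
chosen = best_attribute(toy_rows, toy_labels)
print(gain_glucose, gain_vision, chosen)
```

In the toy data the first attribute separates the classes perfectly, so it is the one a tree would split on first.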
Dataset Used: In this work the WEKA tool is used for performing the experiment. WEKA is software developed
at the University of Waikato, New Zealand, and includes a collection of machine learning
methods for data classification, clustering, regression, visualization, etc. One of the biggest advantages of WEKA
is that it can be customized according to the requirements. The main aim of this study is to predict whether a patient
is affected by diabetes, using the WEKA tool and the medical database PIDD.
Accuracy Measures: The Naive Bayes, SVM and Decision Tree algorithms are used in this research work. Experiments are
performed using 10-fold internal cross-validation. Accuracy, F-Measure, Recall, Precision and ROC (Receiver
Operating Characteristic) measures are used to evaluate the classifiers in this work.
1. Accuracy (A): the proportion of instances that the algorithm predicts correctly. A = (TP + TN) / (Total no. of samples)
2. Precision (P): measures the classifier's correctness. P = TP / (TP + FP)
3. Recall (R): measures the classifier's completeness, or sensitivity. R = TP / (TP + FN)
4. F-Measure (F): the harmonic mean of precision and recall. F = 2 * (P * R) / (P + R)
5. ROC: ROC (Receiver Operating Characteristic) curves are used to compare the
usefulness of tests.
The classifiers' performance in terms of Accuracy, Precision, F-measure, Recall and ROC values is listed in
Table-7, and their performance on the basis of classified instances is shown in Table-8. Here TP denotes True
Positive, TN denotes True Negative, FP denotes False Positive and FN denotes False Negative.
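As a worked example of these formulas, the sketch below computes the measures from the Naive Bayes confusion matrix reported earlier in this section, treating "Tested Positive" as the positive class (TP = 164, FN = 104, FP = 78, TN = 422). The function name is hypothetical. Note that WEKA usually reports class-weighted averages, so its figures need not match these single-class values.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F-measure from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Counts read from the Naive Bayes confusion matrix (positive class = Tested Positive).
nb_metrics = classification_metrics(tp=164, tn=422, fp=78, fn=104)
for name, value in nb_metrics.items():
    print(f"{name}: {value:.3f}")
```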
3.1 Users
We convert the raw data into an understandable format. The preprocessed data is then used to build a decision tree that
predicts whether a person is diabetic or not. The user enters his or her details
into an Android app installed on a mobile device to obtain the result. The attributes entered by the user are compared
against the decision tree and the results are generated.
User:
Gender Details.
Age category(Under 35 or Above)
Existing sufferance of diabetes
Thirsty level
Excess hunger
How often patient feel excreta.
Weight loss
Genetic existence of diabetes
High blood glucose
Blurry vision
High blood pressure
Consumption of tobacco products or smoking
Consumption of vegetables and fruits
Physical Activity
Input of waist circumference
Input of height
Input of weight
We have combined three classification algorithms through a voting mechanism to increase the accuracy of the
model. If one algorithm predicts incorrectly, the final prediction is not affected as long as the other two
algorithms agree on the correct class, because the system returns the majority decision. This is intended to give
higher accuracy than a single algorithm.
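The voting mechanism can be sketched as a simple majority vote over the three predictions. The dictionary keys name the three algorithms discussed in this report, but the prediction values are hypothetical and the function is an illustration, not the project's actual code.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the class label predicted by most classifiers."""
    label, _count = Counter(predictions).most_common(1)[0]
    return label


# Hypothetical predictions for one patient (1 = diabetic, 0 = non-diabetic).
votes = {"J48": 1, "NaiveBayes": 0, "SMO": 1}
final_prediction = majority_vote(list(votes.values()))
print(final_prediction)
```

With three classifiers and two classes a tie cannot occur, so the majority label is always well defined.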
A decision tree is a tree structure (Han and Kamber, 2006) in the form of a flowchart. It can be used
as a method for classification and prediction, represented using nodes and internodes. The root and internal nodes
are the test cases. Leaf nodes represent the class variables. Figure 2 shows a sample decision tree structure.
Among classification data mining methods, decision tree algorithms provide powerful techniques for prediction.
Among the ID3, C4.5, C5, J48 and CHAID decision tree algorithms, we selected the J48 algorithm to develop our
model. It is a Java-based algorithm and works as follows. In order to classify a new item, it first creates a decision
tree based on the attribute values of the available training data set. Every node of the decision tree is generated by
selecting the attribute with the highest information gain. If an attribute gives an unambiguous end result (explicit
classification of the class attribute), the branch of that attribute is terminated and the target value is assigned to it.
We have used the 12-fold cross-validation technique to build the model using this algorithm. It works as follows.
Break the data into 12 sets of size n/12.
Train on 11 sets and test on 1.
Repeat 12 times and take the mean accuracy.
In 12-fold cross-validation, the original sample is randomly partitioned into 12 equal-sized subsamples. Of the 12
subsamples, a single subsample is retained as the validation data for testing the model, and the remaining (12 − 1)
subsamples are used as training data.
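The partitioning described above can be sketched as follows. This is a minimal illustration rather than the WEKA procedure; the function names, the fixed random seed, and the `evaluate` callback (which would train on one index list and return the accuracy on the other) are hypothetical.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # random partition
    folds = [indices[i::k] for i in range(k)]     # k nearly equal subsamples
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


def cross_validate(n_samples, k, evaluate):
    """Mean of evaluate(train_indices, test_indices) over the k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


# 600 instances split into 12 folds: 550 for training and 50 for testing each time.
splits = list(k_fold_indices(600, 12))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each instance appears in exactly one test fold, so every record is used for testing once and for training eleven times.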
We used a 70:30 percentage split to build the model with the Naïve Bayes algorithm: 70 percent of the data set was
used to train the model and the remaining 30 percent was used to test it.
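A 70:30 percentage split of this kind might look like the sketch below. The function name `percentage_split` and the fixed random seed are assumptions made for the example.

```python
import random


def percentage_split(rows, train_fraction=0.7, seed=0):
    """Shuffle the rows and split them into training and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Row indices stand in for the patient records in this illustration.
train_rows, test_rows = percentage_split(range(600))
print(len(train_rows), len(test_rows))  # prints "420 180"
```

Unlike cross-validation, every record is used exactly once, either for training or for testing, so the reported accuracy depends on which records fall into the 30 percent.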
Sequential Minimal Optimization (SMO) is commonly used for solving the quadratic programming problem that arises
during the training of Support Vector Machines (SVM). SMO uses heuristics to partition the training problem into
smaller problems that can be solved analytically. The implementation we used replaces all missing values and
transforms nominal attributes into binary ones. It also normalizes all attributes by default, which helps to speed up the
training process. We used a 70:30 percentage split to train and test this model. In choosing the algorithm we considered
not only accuracy but also the ability to handle missing values. Because missing values are replaced before training,
the model can be applied to incomplete records, and this is the main reason we selected this algorithm.
3.3 Non-Functional Requirements
This section explains the overall design of the system and the process it follows to produce a prediction.
Dataset Used:
The data set we used is a benchmark data set, which allows the accuracy and efficiency of our model to be compared
with other work. The data were obtained from the Pima Indians Diabetes Database, National Institute of Diabetes and
Digestive and Kidney Diseases.
Number of Instances: 600
Number of Attributes: 13 + (1 class attribute).
For Each Attribute: (all numeric-valued).
Inputs:
Gender
Age category (under 35 or above)
Whether the patient already suffers from diabetes
Thirst level
Excess hunger
How often the patient needs to pass urine
Weight loss
Family history of diabetes
High blood glucose
Blurry vision
High blood pressure
Consumption of tobacco products or smoking
Consumption of vegetables and fruits
Physical Activity
Input of waist circumference
Input of height
Input of weight
Procedure:
The user enters data into the system to find out whether he or she has the disease.
Build a model using the J48 decision tree algorithm and train it on the data set.
Build a model using the Naïve Bayes algorithm and train it on the data set.
Build a model using the SMO support vector machine algorithm and train it on the data set.
Test the three models on the data set.
Get the evaluation results.
Finally, collect the predicted votes from all classifiers and give the diagnostic result.
An artificial neural network (ANN) is loosely modelled on the natural neural network of a brain. ANNs typically
consist of multiple layers or a cube design, and the signal path traverses from front to back. Backpropagation
propagates the output error backward through the network to adjust the weights of the earlier neural units, and it is used
in training where the correct result is known. More modern networks are freer flowing in terms of stimulation and
inhibition, with connections interacting in a more chaotic and complex fashion. Dynamic neural networks are the most
advanced, in that they can, based on rules, form new connections and even new neural units while disabling others.
In general, an artificial neural network consists of layers and a network function. The layers are the input layer, the
hidden layer and the output layer. The input neurons define all the input attribute values for the data mining model. In
our work the number of input neurons is 7, since each item in our data set has 7 attributes: Glucose, Blood Pressure,
Skin Thickness, Insulin, BMI, Diabetes Pedigree Function and Age. In the hidden layer, hidden neurons receive inputs
from the input neurons and provide outputs to the output neurons. The hidden layer is where the inputs are assigned
weights; a weight describes the relevance or importance of a particular input to the hidden neuron. Mathematically, a
neuron's network function f(x) is defined as a composition of other functions gi(x), which can themselves be defined as
compositions of other functions. An important characteristic of the activation function is that it provides a smooth
transition as input values change, so that a small change in input produces a small change in output.
The tasks to which artificial neural networks are applied tend to fall within a few broad categories. Application areas
include system identification and control (vehicle control, trajectory prediction, process control, natural resources
management), quantum chemistry, game-playing and decision making (backgammon, chess, poker), pattern
recognition (radar systems, face identification, object recognition and more), sequence recognition (gesture, speech,
handwritten text recognition), medical diagnosis, financial applications (e.g. automated trading systems), data mining
(or knowledge discovery in databases, "KDD"), visualization and e-mail spam filtering.
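The layered structure described above can be illustrated with a minimal forward pass. This is a sketch only: the sigmoid activation, the three hidden neurons and all weight and input values are assumptions chosen for the example, not the trained network of this project.

```python
import math


def sigmoid(z):
    """Smooth activation: small changes in z give small changes in output."""
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights, biases):
    """Each neuron outputs sigmoid(weighted sum of its inputs + bias)."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Input layer -> hidden layer -> single output neuron."""
    hidden = layer(inputs, hidden_w, hidden_b)
    return layer(hidden, output_w, output_b)[0]


# Seven hypothetical input values, scaled to [0, 1], in the order:
# glucose, blood pressure, skin thickness, insulin, BMI, pedigree, age.
patient = [0.74, 0.59, 0.35, 0.0, 0.50, 0.23, 0.48]

# Hypothetical weights: 3 hidden neurons with 7 inputs each, 1 output neuron.
hidden_w = [[0.2] * 7, [-0.1] * 7, [0.05] * 7]
hidden_b = [0.0, 0.1, -0.1]
output_w = [[0.5, -0.3, 0.8]]
output_b = [0.0]

score = forward(patient, hidden_w, hidden_b, output_w, output_b)
print(0.0 < score < 1.0)  # prints "True": the output is a value between 0 and 1
```

The network function here is a composition of functions in the sense used in the text: the output neuron applies its activation to a weighted sum of the hidden neurons, each of which applies the same kind of function to the inputs.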
The Support Vector Machine (SVM) was first proposed by Vapnik. SVMs are a set of related supervised learning
methods, often used in medical diagnosis, for classification and regression. An SVM simultaneously minimizes the
empirical classification error and maximizes the geometric margin, so SVMs are also called maximum margin
classifiers. SVM is a general algorithm based on the guaranteed risk bounds of statistical learning theory, the so-called
structural risk minimization principle. SVMs can efficiently perform nonlinear classification using the kernel trick,
implicitly mapping their inputs into high-dimensional feature spaces. The kernel trick allows the classifier to be
constructed without explicitly computing the feature space. Figure 3 shows the structure of an SVM.
Recently, SVM has attracted a high degree of interest in the machine learning research community. Several recent
studies have reported that support vector machines are generally capable of delivering higher classification accuracy
than other data classification algorithms. SVM is a technique suited to binary classification tasks, so we chose it to
predict diabetes. SVM is well known for its discriminative power in classification, especially where a large number of
features are involved; in our case the feature dimension is 7.
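The kernel trick mentioned above can be checked numerically on a small case. The sketch below uses the degree-2 polynomial kernel k(x, z) = (x · z)^2 as an assumed example; the text does not state which kernel the project used. The kernel value equals an inner product in an explicit feature space made of all pairwise products, yet it is computed without ever building that space.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def poly_kernel(x, z):
    """Degree-2 polynomial kernel, computed in the original input space."""
    return dot(x, z) ** 2


def explicit_features(x):
    """Explicit feature map: all products x_i * x_j (d*d dimensions)."""
    return [xi * xj for xi in x for xj in x]


x = [1, 2, 3]
z = [4, 5, 6]

implicit = poly_kernel(x, z)
explicit = dot(explicit_features(x), explicit_features(z))
print(implicit, explicit)  # prints "1024 1024"
```

For the 7-dimensional inputs of this project the explicit map would already have 49 components, while the kernel still needs only one 7-term inner product.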
Logistic Regression
In statistics, logistic regression is a regression model in which the dependent variable is categorical. Here we consider
the binary case, where the dependent variable can take only two values, "0" and "1", which represent outcomes such as
pass/fail, win/lose, alive/dead or healthy/sick. Logistic regression is used in various fields, including machine learning,
most medical fields and the social sciences. For example, the Trauma and Injury Severity Score (TRISS), which is
widely used to predict mortality in injured patients, was originally developed using logistic regression. Many other
medical scales used to assess the severity of a patient's condition have been developed using logistic regression. The
technique can also be used in engineering, especially for predicting the probability of failure of a given process, system
or product. It is also used in marketing applications, such as predicting a customer's propensity to purchase a product
or cancel a subscription. In economics it can be used to predict the likelihood of a person choosing to be in the labor
force, and a business application is predicting the likelihood of a homeowner defaulting on a mortgage. Conditional
random fields, an extension of logistic regression to sequential data, are used in natural language processing. In this
project, logistic regression was used to predict whether a patient suffers from diabetes, based on seven observed
characteristics of the patient.
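How a logistic regression model turns patient characteristics into a 0/1 prediction can be sketched as follows. The function names, the learning rate and the toy numbers are assumptions made for the illustration; the update shown is one plain gradient-descent step on the log-loss, which may differ from the fitting method used in the project.

```python
import math


def predict_probability(features, weights, bias):
    """Probability of class "1": logistic function of a weighted sum."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(features, weights, bias, threshold=0.5):
    """Map the probability to the binary outcome 0 or 1."""
    return 1 if predict_probability(features, weights, bias) >= threshold else 0


def gradient_step(features, label, weights, bias, learning_rate=0.1):
    """One gradient-descent update on the log-loss for a single example."""
    error = predict_probability(features, weights, bias) - label
    new_weights = [w - learning_rate * error * x
                   for w, x in zip(weights, features)]
    new_bias = bias - learning_rate * error
    return new_weights, new_bias


# Hypothetical two-feature example with true label 1 (diabetic).
weights, bias = [0.0, 0.0], 0.0
before = predict_probability([1.0, 2.0], weights, bias)
weights, bias = gradient_step([1.0, 2.0], 1, weights, bias)
after = predict_probability([1.0, 2.0], weights, bias)
print(before < after)  # prints "True": the step moves the estimate toward 1
```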
Chapter 4.
System Analysis And Design
Figure 4.2.1 below shows the detailed data flow diagram.
Fig 4.2.1 Data Flow Diagram (level-1)
4.3 Database Design
Figure 4.3 below shows the entities in the system and the relationships among them.
[Figure 4.3: entity diagram; labels: User, Test Diabetes, Manage Diabetes, View List Of Diabetologist]
Figure 4.6.1 below shows the activity in which diabetes is detected and the result is produced.
[Figure 4.6.1: activity diagram; label: Diabetes Patient Data]
Chapter 5
Implementation
5.1 Screenshots
Fig 6.1.2 shows the gender check in the test diabetes module.
Fig 6.1.3 shows the symptoms check in the test diabetes module.
Fig 6.2.1 shows the diabetes suggestion module describing type-1 diabetes.
Fig 6.2.2 shows the symptoms of type-2 diabetes.
Fig 6.2.3 shows a moderate level of diabetes as the output after symptom detection.
Fig 6.2.4 shows a low level of diabetes as the output.
Fig 6.2.5 shows a high risk level of diabetes as the output of the system's prediction.
Chapter 7.
Conclusion
An application based on a comparison of data mining classification algorithms has been developed to predict the
occurrence or recurrence of diabetes risk. The results show that the prediction system is capable of predicting diabetes
effectively, efficiently and, most importantly, in a timely manner. The application can therefore help a physician make
decisions about patient health risks. It generates results that are close to real-life situations. This makes data mining
more helpful in the health sector and shows that it is necessary for knowledge discovery in healthcare. Beyond large
savings in medical expenses, lost working time and use of critical medical facilities, the Naïve Bayes classifier based
system is very useful for the diagnosis of diabetes. The system can make good predictions with little error, and this
technique could be an important tool for supporting medical doctors in performing expert diagnosis. In this method
the efficiency of forecasting was found to be around 95%.
This application would be a valuable asset for doctors, who would have structured, specific information about their
patients and others, so that they can check that their diagnoses or inferences are correct and professional. Finally, the
appreciation received from doctors for such software shows that, in places where diseases are on the rise, such
applications should be developed to cover the entire state. The common person also stands to benefit from doctors
having such a tool, because he or she can become better informed about personal health and wellbeing.
The discovery of knowledge from data sets is important for making an effective diagnosis. The aim of data mining is
to extract information stored in a data set and generate clear and understandable patterns. This study aims at the
discovery of a decision tree model for the prediction of diabetes. Pre-processing is used to improve the quality of the
data. During pre-processing, the significant attributes of the data set are selected for the prediction of diabetes, which
is an important consideration. The decision tree algorithm used for classification also produced the highest accuracy
when compared with the other classification algorithms. Finally, the results of the system are presented in an Android
application, which is very useful for the present generation.
Chapter 8.
Future Enhancements
In future, this system can be designed to predict other diseases such as cancer, thyroid disorders and lung diseases; an
Android application for predicting such diseases would be of great use in the near future. Another enhancement would
be to reduce the number of attributes considered for prediction: producing more accurate results from fewer attributes
is needed as an enhancement to the existing system.
The accuracy of the prediction can also be improved by increasing the amount of training data. Performance can be
further improved by identifying and incorporating other parameters and by increasing the size of the training set.
Bibliography
[1] P. Yasodha, M. Kannan, “Analysis of a Population of Diabetic Patient Databases in Weka Tool”, International
Journal of Scientific & Engineering Research Volume 2, Issue 5, May-2011
[3] Han, J., Kamber, M., "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers (2000), pp. 132-133
[4] Gloria L. A. Beckles and Patricia E. Thompson-Reidy, "Diabetes and Women's Health Across the Life Stages".
[5] Jiawei Han, Micheline Kamber, Jian Pei, “Data Mining Concepts and Techniques” Third edition . pp 125-129
[6] Folorunso O. and Ogunde A. O. (2004), "Data Mining as a Technique for Knowledge Management in Business
Process Redesign", The Electronic Journal of Knowledge Management, Volume 2, Issue 1, pp. 33-44
[8] P.Yashoda, M.Kanan, Analysis of a population of diabetic patients databases in WEKA tool, IJSER, vol2, issue5,
may 2011 pp 21-72
[9] Mukesh kumari, Dr. Rajan Vohra ,Anshul arora Prediction of Diabetes Using Bayesian Network (IJCSIT)
International Journal of Computer Science and Information Technologies, Vol. 5 (4) , 2014, 5174-5178
[10] M. Khajehei, F. Etemady, "Data Mining and Medical Research Studies," cimsim, pp.119-122, 2010 Second
International Conference on Computational Intelligence, Modelling and Simulation, 2010 pp 109-110
[11] Kaur H, Wasan SK,” Empirical Study on Applications of Data Mining Techniques in Healthcare”, Journal of
Computer Science,2(2):194-200,2006
[12] "Analysis of a Population of Diabetic Patients Databases with Classifiers using C4.5 Algorithm", World Academy
of Science, Engineering and Technology, International Journal of Medical, Pharmaceutical Science and Engineering,
Vol. 7, No. 8, 2013, pp. 1115-1223
[14] P. Radha, Dr. B. Srinivasan, "Predicting Diabetes by cosequencing the various Data Mining Classification
Techniques", IJISET - International Journal of Innovative Science, Engineering & Technology, Vol. 1 Issue 6, August
2014, pp. 124-139
[16] E. Knorr and R. Ng, "Algorithms for mining distance-based outliers in large datasets", in Proceedings of the 1998
International Conference on Very Large Data Bases (VLDB'98), pp. 392-403, New York, 1998.
[15] Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", Elsevier Inc., San Francisco, CA,
2006, pp. 234-276
[16] U. M. Fayyad, G. Piatetsky-Shapiro and P. Smyth, "From Data Mining to Knowledge Discovery: An Overview",
1996, pp. 1-36
[17] S. C. Liao & M. Embrechts, "Data Mining techniques applied to medical information", Med. Inform., 2000,
pp. 81-102.
[18] L. Breiman, J. Friedman, R. Olshen, C. Stone, "Classification and Regression Trees", Chapman & Hall, 1984,
pp. 122-134.
[20] P. Szakacs-Simon, S. A. Moraru, L. Perniu (Dept. of Autom., "Transilvania" Univ., Brasov, Romania), "Android
application developed to extend health monitoring device range and real-time patient tracking", International Journal
of Advanced Research in Computer Science and Software Engineering, pp. 34-67
[23] en.wikipedia.org/wiki/Diabetes_mellitus