Chapter 4
4.2.1 Importance of Case Based Reasoning
Case-based reasoning is closely related to human reasoning. In many
situations the problems humans encounter are solved with a human equivalent of case-based
reasoning. When a person encounters a previously unexperienced situation or
problem, he refers to a past experience of a similar problem. This similar, previous
experience may be one he himself or someone else has experienced. If it is
someone else's experience, the case will have been added to the reasoner's memory via an oral
or written account of that experience [10].
In medical diagnosis, when a patient comes to the doctor, the doctor examines the
patient and immediately recollects similar past cases. Because similar cases have
similar answers, the doctor looks for similar past cases to treat the present case.
The retrieved similar case may be fully or partially like the new case. If it
is fully similar, the corresponding solution may be used directly; otherwise the
solution is modified to fit the new case. In our research, when an input
case arrives, the case-based reasoning system retrieves the nearby similar cases and
gives them as training samples to the neural network. Neural network classification
performance depends fully on the quality of the training samples used during
training. The training samples should be valid and related ones. Even
if the number of training samples is large, unrelated samples will
hurt the classification performance of the neural network. One of the main
challenges of data mining applications is data growth. A medical dataset may have
a large number of patient records. Instead of supplying a large volume of less meaningful
data, it is always better to supply meaningful and related data for training. This helps the
neural network to train properly and classify the user input more accurately.
CBR System
Case-based reasoning is a methodology for solving problems by utilizing
previous experiences. It involves retaining a memory of previous problems and their
solutions and using it to solve new problems. When presented with a problem, a case-based
reasoner searches its memory of past cases and attempts to find a case that has
the same problem specification as the current case. If the reasoner cannot find an
identical case in its case base, it attempts to find the case or cases in the case base
that most closely match the current query case.
In the situation where a previous identical case is retrieved, presuming its
solution was successful, it can be returned as the current problem’s solution. In the
more likely case that the retrieved case is not identical to the current case, an
adaptation phase occurs. In adaptation, the differences between the current case and
the retrieved case must first be identified and then the solution associated with the
retrieved case must be modified taking into account these differences. The solution
returned in response to the current problem specification may then be tried in the
appropriate domain setting.
A case-based reasoning (CBR) system incorporates the reasoning mechanism
and external facets such as the input specification, the output suggested solution, and
the memory of past cases referenced by the reasoning mechanism. This is
represented in Figure 4.1.
[Figures 4.1 and 4.2: the problem case enters the Case Retriever, which consults the Case Base; the retrieved cases pass to the Case Reasoner, which produces the Derived Solution]
A CBR system has an internal structure divided into two major parts, the
case retriever and the case reasoner, as shown in Figure 4.2. The case retriever's job is
to find the appropriate cases in the case base for the given input case. The case
reasoner uses the retrieved cases to find a solution to the given input case. This
reasoning generally involves both determining the differences between the retrieved
cases and the current input case and modifying the retrieved solution appropriately,
reflecting these differences. The case reasoner may or may not consult the case base
for additional related cases while deriving the solution.
A case can take the form of a record containing all the related information
of a previous experience or problem. The information recorded about this past
experience depends on the domain of the reasoner and the purpose to which the case
will be put. In the instance of a problem solving CBR system, the details will usually
include the specification of the problem and the relevant attributes of the environment
that are the circumstances of the problem. The other important part of the case is the
solution that was applied in the previous situation. Depending on how the CBR
system reasons with cases, this solution may include only the facts of the solution, or,
additionally, the steps or processes involved in obtaining the solution. It is also
important to include the achieved measure of success in the case description if the
cases in the case base have achieved different degrees of success or failure.
If the problem domain has a fundamental model, does not have many exceptions or
novel cases, or cases of the same kind are likely to occur very frequently, then there is a
greater chance that the case-based reasoning technique can be used. Reducing the
knowledge acquisition task, avoiding repetition of past mistakes, graceful degradation
of performance, the ability to reason in a domain with a small body of knowledge, and
the ability to learn over time are some of the important reasons for which case-based
reasoning is used.
Case Representation
Cases in a case base can represent many different types of knowledge and
store it in many different representational formats. The objective of a system will
greatly influence what is stored. A case based reasoning system may be aimed at the
creation of a new design or plan, the diagnosis of a new problem, or the argument of a
point of view with precedents. In each type of system, a case may represent something
different. The cases could be people, things or objects, situations, diagnoses, designs,
plans or rulings among others.
In many practical CBR applications, cases are usually represented as two
unstructured sets of attribute-value pairs, i.e. the problem and solution features.
However, deciding what to represent can be one of the more difficult decisions to
make. For example, in a medical CBR system that diagnoses a patient, a
case could represent an individual's entire case history or be limited to a single visit to
a doctor. In the latter situation the case may be a set of symptoms along with a diagnosis;
it may also include a treatment. If a case is a person, then a more
complete model is being used, as this can incorporate the change of symptoms from
one visit to the next. It is, however, harder to use cases in this format to search
for a particular set of symptoms in a current problem and obtain a diagnosis or treatment.
Alternatively, if a case is simply a single visit to the doctor, involving the symptoms at
the time of that visit and the diagnosis of those symptoms, the changes in symptoms
that might be a useful key in solving a problem may be missed. Cases may need to be
broken down and consist of sub-cases. For example, a case could be a person’s
medical history and could include all visits made by them to the doctor as sub cases.
A sample case structure is represented in Figure 4.3.
[Figure 4.3: a Patient case (Age, Height, Weight) with Visit sub-cases; Visit 1 contains Symptom 1, Symptom 2, Diagnosis and Treatment, followed by Visit 2 and Visit 3]
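The case/sub-case structure above can be sketched with Python dataclasses. This is only an illustrative sketch, not the representation used in this work; the class and field names (`PatientCase`, `Visit`, `height_cm`, etc.) are our own assumptions.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Visit:
    """One doctor visit stored as a sub-case (hypothetical fields)."""
    symptoms: List[str]
    diagnosis: str
    treatment: str

@dataclass
class PatientCase:
    """A patient case whose visits are sub-cases (hypothetical fields)."""
    age: int
    height_cm: float
    weight_kg: float
    visits: List[Visit] = field(default_factory=list)

# A patient case holding one visit as a sub-case
case = PatientCase(age=45, height_cm=162.0, weight_kg=70.5)
case.visits.append(Visit(symptoms=["thirst", "fatigue"],
                         diagnosis="diabetes", treatment="insulin"))
```

Searching such nested cases for a set of symptoms means walking the visit sub-cases, which illustrates why a flat single-visit case is easier to retrieve against.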
No matter what the case actually represents as a whole, the features of it have
to be represented in some format. One of the advantages of case-based reasoning is
the flexibility it has in this regard. Depending on what types of features have to be
represented, an appropriate implementation platform can be chosen. Ranging from
simple Boolean, numeric and textual data to binary files, time dependent data, and
relationships between data, CBR can be made to reason with all of them.
No matter what is stored, or the format it is represented in, a case must store
the information that is relevant to the purpose of the system and that will ensure
that the most appropriate case is retrieved in each new situation. Thus the cases have
to include those features that will ensure the case is retrieved in the most
appropriate contexts.
In many CBR systems, not all existing cases need to be stored. In these
systems criteria are needed to decide which cases will be stored and which will be
discarded. In the situation where two or more cases are very similar, only one case
may need to be stored. Alternatively, it may be possible to create an artificial case that
is a generalization of two or more actual incidents or problems. By creating
generalized cases the most important aspects of a case need only be stored once.
When choosing a representation format for a case, there are many choices and
many factors to consider. Some examples of representation formats that may be used
include data base formats, frames, objects, and semantic networks.
Whatever format the cases are represented in, the collection of cases itself has
to be structured in some way to facilitate retrieval of the appropriate case when
queried. Numerous approaches have been used for this. A flat case base is a common
structure; in this method indices are chosen to represent the important aspects of the
case, and retrieval involves comparing the current case's features to each case in the
case base. In our work the diabetes dataset is stored in the form of a flat file. It
contains nine fields to store the patient input and output parameters. Another
common case base structure is a hierarchical structure that stores the cases in
groups to reduce the number of cases that have to be searched.
Case Indexing
Case indexing refers to assigning indices to cases for future retrieval and
comparisons. This choice of indices is important to being able to retrieve the right
case at the right time. This is because the indices of a case will determine in which
context it will be retrieved in the future. These are some suggestions for choosing indices.
Indices must be predictive, and predictive in a useful manner. This means that
they should reflect the important aspects of the case, the attributes that influenced the
outcome of the case, and also those which describe the circumstances in which it
is expected the case should be retrieved in the future. Indices should be abstract
enough to allow for the case's retrieval in all the circumstances in which the case will
be useful, but not too abstract. When a case's indices are too abstract, the case may be
retrieved in too many situations, or too much processing may be required to match
cases.
Case Retrieval
Case retrieval is the process of finding within the case base those cases that are
the closest to the current case. To carry out case retrieval there must be criteria that
determine how a case is judged to be appropriate for retrieval and a mechanism to
control how the case base is searched. The selection criteria are necessary to decide
which case is the best one to retrieve, that is, to determine how close the current and
stored cases are.
These criteria depend in part on what the case retriever is searching for. Most
often the case retriever searches for an entire case, whose features will be
compared to the current query case. There are, however, times when only a portion of a case
is required. This may be because no full case exists and a solution is being built
by selecting portions of multiple cases, or because a retrieved case is being modified
by adopting a portion of another case in the case base. The actual processes involved
in retrieving a case from the case base depend very much on the memory model and
indexing procedures used.
d = √((x₁ − x₂)² + (y₁ − y₂)²)
We considered all the parameters as having the same, equal weight. When a
new input case comes, we retrieve all the nearby past cases based on the distance value,
which is calculated using the Euclidean distance. In our research we use a fixed distance
value (e.g., 1.5), and all the cases whose distance values are less than or equal to this
fixed value are retrieved for training the artificial neural network.
Instead of retrieving all the past cases, only the cases which lie inside the
fixed distance boundary are retrieved and sent as the training samples to the
feed-forward backpropagation neural network.
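The threshold-based retrieval step described above can be sketched as follows. This is a minimal illustration, not the actual implementation used in this work; the function names and the case representation as `(feature_vector, label)` pairs are our own assumptions, and features are assumed normalized so a fixed threshold such as 1.5 is meaningful.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def retrieve_nearby(query, case_base, threshold=1.5):
    """Return all past cases whose distance to the query is <= threshold.

    Each case is a (feature_vector, label) pair; all features carry
    equal weight, as in the text above.
    """
    return [(features, label) for features, label in case_base
            if euclidean(query, features) <= threshold]

# Toy case base: the retrieved subset would be the NN training samples
cases = [([0.1, 0.2], 1), ([0.9, 0.8], 0), ([0.15, 0.25], 1)]
training_samples = retrieve_nearby([0.1, 0.2], cases, threshold=0.5)
```

Only the cases inside the distance boundary survive, which is the filtering that keeps unrelated records out of the neural network's training set.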
Table 4.1 Michie, Spiegelhalter and Taylor Classification Result on Pima
Indian Diabetes Dataset
Among the 22 different machine learning algorithms used, Logdisc was the most
impressive one, giving 77.7% correct classification and 22.3% misclassification.
K-NN gave the lowest performance, with 67.6% correct classification and 32.4%
misclassification. Two important artificial neural network algorithms,
namely Backprop and RBF, gave 75.2% and 75.7% correct classification and 24.8%
and 24.3% misclassification respectively.
A lot of research has been done with artificial neural networks to improve
classification accuracy on the Pima Indian diabetes dataset. Jeatrakul and Wong have
done a comparative study of the performance of different neural networks on the Pima
Indian Diabetes dataset [17]. They used 5 different types of neural network
architectures, namely the Backpropagation Neural Network (BPNN), Radial Basis
Function Neural Network (RBFNN), General Regression Neural Network (GRNN),
Probabilistic Neural Network (PNN) and Complementary Neural Network (CMTNN).
Table 4.2 shows the classification performance of the BPNN, RBFNN, GRNN,
PNN and CMTNN.
Table 4.2 Jeatrakul and Wong Classification Result on the Pima Indian
Diabetes Dataset
Estebanez, Alter and Valls used genetic programming based data projections for
classification tasks [20]. They used the Pima Indian diabetes dataset in their research
and reduced the input dimension from 8 to 3. They applied the Support Vector
Machine (SVM), Simple Logistics and Multilayer Perceptron algorithms to the Pima
Indian Diabetes data for classification. Their results are shown in Table 4.3.
Table 4.3 Estebanez, Alter and Valls Classification Result on Pima Indian
Diabetes Dataset
Sr. No. | Algorithm             | Classification Performance (%)
1       | SVM                   | 77.21
2       | Simple Logistics      | 77.86
3       | Multilayer Perceptron | 76.69
decision support system has given an average 79% classification performance. Bylander
used naïve Bayes, decision trees and two types of belief networks on the Pima Indian
diabetes dataset [19]. Table 4.4 shows the various classification methods and the
classification performance obtained by Bylander.
Misra and Dehuri, in their research paper "Functional Link Artificial Neural Network
for Classification Task in Data Mining", created a Functional Link Artificial Neural
Network and compared its classification performance with other machine learning
algorithms [16]. Their FLANN gave 78.13% classification performance, while the
MLP gave 75.2%. Table 4.5 gives the classification performance of different machine
learning algorithms on the Pima Indian Diabetes dataset. K-NN from case-based
reasoning can be used to retrieve similar past cases and remove outliers; through this,
neural network classification performance can be improved [48].
Sr. No. | System Name | Classification Accuracy (%)
1       | NN          | 65.1
2       | KNN         | 69.7
3       | FSS         | 73.6
4       | BSS         | 67.7
5       | MFS1        | 68.5
6       | MFS2        | 72.5
7       | CART        | 74.5
8       | C4.5        | 74.7
9       | FID3.1      | 75.9
10      | MLP         | 75.2
11      | FLANN       | 78.13
4.5 Proposed Model and its Functioning
The Algorithm
1. Do the preprocessing step for handling the missing values in the Pima Indian
   Diabetes Dataset.
   i) Replace the missing values by the column's mean value, computed for each
      output class separately.
2. Divide the preprocessed dataset 80%/20% into a Training Dataset T1 and a
   Testing Dataset T2.
3. Train the Artificial Neural Network using the Backpropagation Algorithm on
   Training Dataset T1.
4. Train the Case Based Reasoning System using the K-Nearest Neighbor Algorithm
   on Training Dataset T1.
5. The Ensemble System calculates the combined mean value from the outputs of the
   ANN and CBR systems. Based on the calculated mean value it displays the
   output, either "Positive for Diabetes" or "Negative for Diabetes".
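Step 1, the class-wise mean imputation, can be sketched as below. This is only an illustrative sketch of the described preprocessing, not the actual code used in this work; the function name `impute_class_means` and the convention that missing values are encoded as 0.0 (common for the Pima dataset's physiological fields) are our own assumptions.

```python
import numpy as np

def impute_class_means(X, y, missing=0.0):
    """Replace `missing` entries in each column by that column's mean,
    computed separately for each output class (step 1 of the algorithm)."""
    X = X.astype(float).copy()
    for cls in np.unique(y):
        rows = (y == cls)
        for col in range(X.shape[1]):
            vals = X[rows, col]
            known = vals[vals != missing]
            if known.size:  # mean over the non-missing values of this class
                vals[vals == missing] = known.mean()
                X[rows, col] = vals
    return X

# Toy data: the missing entry in class 1 is filled with that class's mean
X = np.array([[1.0, 0.0], [3.0, 4.0], [0.0, 6.0]])
y = np.array([1, 1, 0])
X_imputed = impute_class_means(X, y)
```

Imputing per class rather than globally keeps each class's typical values intact, which matches the separation by output class described in the algorithm.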
[Figure 4.4 block diagram: the input test data is passed to the trained Artificial Neural Network and to the Case Based Reasoning System (trained using the K-NN algorithm); the ANN calculated output and the CBR calculated output probability are combined by the Ensemble Method (using the mean method), which produces the calculated output for the test data]
Figure 4.4 Proposed Hybrid Machine Learning System for Medical Diagnosis
The block diagram has two important trained machine learning systems. The first
is the Artificial Neural Network system, which used the Backpropagation algorithm for its
training. The second is the Case Based Reasoning system, which used the K-Nearest
Neighbor algorithm for its training. The total dataset contains 768 patient records, out of
which 614 records (80%) are used for training and the remaining 154 records (20%)
for testing. At testing time, the new test data is passed through the trained ANN and
CBR systems. Both systems give their calculated output values to the Ensemble
Module. These values lie between 0 and 1.
The Ensemble Module uses the mean method to combine the values from the ANN and
CBR systems. Based on the calculated output value, the result will be either positive or
negative for diabetes. We used a cutoff value of 0.5: if the Ensemble Module's
calculated value is greater than or equal to 0.5, the output is "Positive for
Diabetes"; otherwise the output is "Negative for Diabetes".
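The mean-combination rule with the 0.5 cutoff can be sketched in a few lines. The function name `ensemble_decision` is our own; the logic follows the rule stated above.

```python
def ensemble_decision(ann_prob, cbr_prob, cutoff=0.5):
    """Combine the two class-1 probabilities by their mean and apply
    the 0.5 cutoff described above."""
    mean_prob = (ann_prob + cbr_prob) / 2.0
    if mean_prob >= cutoff:
        return "Positive for Diabetes"
    return "Negative for Diabetes"
```

For example, an ANN probability of 0.7 and a CBR probability of 0.4 average to 0.55, which meets the cutoff and yields a positive result.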
We have conducted 10 different test cases from the Pima Indian Diabetes Dataset
based on different random samplings. The ANN and K-NN below are constructed for the
first test case. The following paragraphs explain the structure, training and testing of
the ANN, the K-NN and the hybrid method.
The total Pima Indian Diabetes Dataset contains 768 patients' records. We used an 8-5-1
ANN architecture for training and testing on the Pima Indian Diabetes Dataset. The table
below shows the partition of the training and testing datasets and their sizes. We used
random sampling with the random seed 12345 to select the samples for training and
testing. 80% of the data is assigned to the training dataset and the remaining 20% is
assigned to the testing dataset. The ANN architecture is represented in Figure 4.5.
Table 4.6 shows the dataset partition for training and testing purposes.
Data
  Data source: Sheet1!$A$2:$I$769
  Selected variables: preg, Pg, dbp, skin, insulin, bmi, pedig, age, class
Partitioning
  Method: Randomly chosen
  Random seed: 12345
  # training rows: 614
  # validation rows: 154
Data
  Training data used for building the model: ['preprocessed_mean_separate_class.xls']'Data_Partition1'!$C$19:$J$632
  # Records in the training data: 614
  Validation data: ['preprocessed_mean_separate_class.xls']'Data_Partition1'!$C$633:$J$786
  # Records in the validation data: 154
  Input variables normalized: Yes
ANN Network Parameters
The dataset has 8 input fields and 1 output field. The hidden layer has 5 nodes and the
output layer has one node. We used the squared error function as the cost function.
The standard sigmoid function is used in the hidden and output layers as the activation
function. We used 200 epochs to train the network. Table 4.7 shows the ANN network
training functions and parameters.
Variables
  # Input variables: 8
  Input variables: preg, Pg, dbp, skin, insulin, bmi, pedig, age
  Output variable: class
Parameters/Options
  # Hidden layers: 1
  # Nodes in hidden layer 1: 5
  Cost function: Squared error
  Hidden layer sigmoid: Standard
  Output layer sigmoid: Standard
  # Epochs: 200
  Step size for gradient descent: 0.1
  Weight change momentum: 0.6
  Error tolerance: 0.01
  Weight decay: 0
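As a rough illustration of the parameters in Table 4.7, a network of this shape can be trained with batch gradient descent in NumPy. This is a sketch under stated assumptions, not the xlminer implementation actually used in this work: the function name `train_mlp` and the Gaussian weight initialization are our own choices; only the layer sizes, sigmoid activations, squared-error cost, step size 0.1, momentum 0.6 and 200 epochs come from the table.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_mlp(X, y, hidden=5, epochs=200, lr=0.1, momentum=0.6, seed=12345):
    """Train an 8-5-1 style MLP: sigmoid activations, squared-error cost,
    gradient descent step size 0.1 with momentum 0.6 (cf. Table 4.7)."""
    rng = np.random.default_rng(seed)
    n_in = X.shape[1]
    W1 = rng.normal(0, 0.5, (n_in, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.5, (hidden, 1));    b2 = np.zeros(1)
    vW1 = np.zeros_like(W1); vb1 = np.zeros_like(b1)
    vW2 = np.zeros_like(W2); vb2 = np.zeros_like(b2)
    for _ in range(epochs):
        h = sigmoid(X @ W1 + b1)              # hidden layer activations
        out = sigmoid(h @ W2 + b2)            # output layer activation
        err = out - y.reshape(-1, 1)
        d_out = err * out * (1 - out)         # squared-error + sigmoid gradient
        d_hid = (d_out @ W2.T) * h * (1 - h)  # backpropagated hidden gradient
        # momentum-smoothed gradient descent updates
        vW2 = momentum * vW2 - lr * (h.T @ d_out) / len(X)
        vb2 = momentum * vb2 - lr * d_out.mean(axis=0)
        vW1 = momentum * vW1 - lr * (X.T @ d_hid) / len(X)
        vb1 = momentum * vb1 - lr * d_hid.mean(axis=0)
        W2 += vW2; b2 += vb2; W1 += vW1; b1 += vb1
    def predict(Xq):
        return sigmoid(sigmoid(Xq @ W1 + b1) @ W2 + b2).ravel()
    return predict
```

Thresholding the returned probabilities at 0.5 gives the class label, matching the output convention used elsewhere in this chapter.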
Inter-layer Node Connection Weights
The ANN has node connections between the input and hidden layers and between the
hidden and output layers, and each connection has a weight. Table 4.8 shows these
inter-layer connection weights.
Input Layer
Hidden Layer # 1 | preg | Pg | dbp | skin | insulin | bmi | pedig | age | Bias Node
Node # 1 -4.36 -3.23 -0.18 -4.12 -5.17 -2.31 -2.59 3.57 -0.70
Node # 2 -1.85 -5.71 1.94 -0.36 -6.08 -1.25 -4.95 2.22 -2.51
Node # 3 -0.87 -0.35 0.27 0.08 2.68 0.00 -0.21 -4.84 -4.40
Node # 4 1.92 -1.25 0.25 5.21 -14.06 0.73 0.06 1.35 -1.72
Node # 5 -1.98 -0.29 1.14 -10.26 -1.52 -3.28 3.00 -3.61 -3.19
Hidden Layer # 1
Output Layer Node | Node # 1 | Node # 2 | Node # 3 | Node # 4 | Node # 5 | Bias Node
1 | -2.96828 | -2.69314 | -6.6264 | -5.84048 | -3.81711 | 6.81824
0 | 2.96833 | 2.69316 | 6.62652 | 5.84058 | 3.81716 | -6.81835
We used 200 epochs to train the ANN. Initially the training error rate was 36.6,
and at the final epoch the error rate was 10.4. Figure 4.6 shows the error rate
against the epoch number in the form of a chart.
Table 4.9 shows the classification confusion matrix and the error report
for the training dataset, which contains 614 cases. Class 1 denotes "positive for diabetes"
and class 0 denotes "negative for diabetes". The ANN system classifies 552 of the
614 cases correctly and misclassifies 62 cases. The ANN correct classification
rate is 89.90% and the misclassification error rate is 10.10%.
Error Report
Class # Cases # Errors % Error
1 225 26 11.56
0 389 36 9.25
Overall 614 62 10.10
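The error reports above can be computed directly from the true and predicted labels. This is an illustrative helper, not code from this work; the function name `error_report` and the tuple layout are our own assumptions.

```python
def error_report(y_true, y_pred, classes=(1, 0)):
    """Per-class and overall error counts and percentages, in the same
    layout as the Error Report tables: (class, # cases, # errors, % error)."""
    rows = []
    for cls in classes:
        idx = [i for i, t in enumerate(y_true) if t == cls]
        errors = sum(1 for i in idx if y_pred[i] != cls)
        rows.append((cls, len(idx), errors, 100.0 * errors / len(idx)))
    total = len(y_true)
    total_err = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    rows.append(("Overall", total, total_err, 100.0 * total_err / total))
    return rows
```

Feeding it the 614 training labels and predictions would reproduce the per-class counts, error counts and percentages in Table 4.9.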
Table 4.10 shows the classification confusion matrix and error report for
the validation (testing) dataset, which contains 154 cases. Class 1 denotes "positive for
diabetes" and class 0 denotes "negative for diabetes". The ANN system classifies 133
of the 154 cases correctly and misclassifies 21 cases. The ANN correct classification
rate is 86.36% and the misclassification error rate is 13.64%.
ANN Validation Data – Classification Error Report
Error Report
Class # Cases # Errors % Error
1 43 7 16.28
0 111 14 12.61
Overall 154 21 13.64
ANN Total Correct Classification and Misclassification Counts for 10
Different Test Cases
Table 4.11 reports the correct classification and misclassification counts for the
10 different test cases. Among the 10 cases, test case 1 gave the minimum
misclassification count of 21 out of 154 test cases. Test cases 4, 7 and 9 gave the
maximum misclassification count of 28 out of 154 test cases. The average correct
classification count was 128.9 out of 154 and the average misclassification count
was 25.1.
ANN Total Correct Classification Accuracy and Misclassification Error Rate
for 10 Different Test Cases
Test case 1 gave the minimum misclassification error rate of 13.64%, while test
cases 4, 7 and 9 gave the maximum of 18.18%. Overall, the average correct
classification accuracy was 83.70% and the misclassification error rate was 16.30%.
Table 4.12 ANN Classification Performance Accuracy for 10 different test cases
In the literature survey we found that 4 different researchers constructed
artificial neural networks using the Backpropagation algorithm on the Pima
Indian Diabetes Dataset. Their classification performance was less than 77%. The
dataset contains two different classes, namely Class 1 and Class 0; Class 1 means the
patient is positive for diabetes and Class 0 means the patient is negative for diabetes.
We separated the dataset by the two classes and replaced the missing
values with the corresponding field's mean value. The Pima Indian Diabetes Dataset has
a lot of missing values, and handling them plays a major role in the
classification performance. Training the neural network is also an important step in
classification performance, because a neural network has many local minima and finding
the best solution is a trial-and-error process. Because of our effective handling of
missing values and our effective training, we reached a classification performance of
83.70%. Our neural network structure is 8-5-1.
We used 5 hidden nodes in the hidden layer. Table 4.13 and Figure 4.7 show
the effective classification performance of our neural network.
Sr. No. | ANN System Reference             | Correct Classification Accuracy (%) | Misclassification Error Rate (%)
1       | Michie, Spiegelhalter and Taylor | 75.20                               | 24.80
2       | Jeatrakul and Wong               | 76.17                               | 23.83
3       | Estebanez, Alter and Valls       | 76.69                               | 23.31
4       | Misra and Dehuri                 | 75.20                               | 24.80
5       | Our Method                       | 83.70                               | 16.30
Comparison Chart
KNN Training Information
We have carried out 10 different test cases using different random samples of the
Pima Indian Diabetes Dataset. The table below shows the training of the KNN system and
the best K value for test case 1.
Test Case 1 training and testing information:
Best K Value Selection
During training, the system substitutes various values of k and calculates
the corresponding training and validation (testing) error rates. We gave the
input value k = 20 to the system, and it generated the training and testing errors for
20 different k values. The system automatically finds the best k value based on the
testing error value. For the k values 13 and 20 the error rate is 10.39, which is the
minimum among the set of 20 k values; the system chose the smaller of the two,
k = 13, as the best k value. Table 4.14 shows the training and testing
error values for the different k values and the selection of the best k = 13.
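The best-k search described above can be sketched in plain Python. This is only an illustration of the selection rule, not xlminer's implementation; the function names `knn_predict` and `best_k` are our own, and ties in the validation error are broken toward the smaller k, matching the choice of 13 over 20 in the text.

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k nearest training cases (Euclidean)."""
    order = sorted(range(len(train_X)),
                   key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def best_k(train_X, train_y, val_X, val_y, k_max=20):
    """Try k = 1..k_max and return the k with the minimum validation
    error rate; strict comparison keeps the smallest k on ties."""
    best = (float("inf"), None)
    for k in range(1, k_max + 1):
        errors = sum(knn_predict(train_X, train_y, x, k) != y
                     for x, y in zip(val_X, val_y))
        rate = 100.0 * errors / len(val_y)
        if rate < best[0]:
            best = (rate, k)
    return best[1]
```

Running this over the 614 training and 154 validation records with k_max = 20 would mimic the search that selected k = 13.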
Table 4.14 KNN Training and Testing Errors for Different K values
Figure 4.8 shows the training error curve for the different values of K during
training. For k values from 10 up to 20 the training error is zero; k = 9 gave the
maximum training error, 15.80.
Figure 4.9 KNN Validation Error Curve
Figure 4.9 above shows the validation error for the different values of K. For the k
values 13 and 20 it gave the minimum testing error value of 10.39. The value 13 is
selected as the best k value and used at validation time.
Classification Performance on Training Data
Table 4.15 shows the classification confusion matrix for the training
dataset, which contains 614 cases. Class 1 denotes "positive for diabetes" and class 0
denotes "negative for diabetes". The KNN system classifies 517 of the 614 cases
correctly and misclassifies 97 cases. The KNN correct classification rate is 84.20%
and the misclassification error rate is 15.80%.
Error Report
Class # Cases # Errors % Error
1 225 55 24.44
0 389 42 10.80
Overall 614 97 15.80
Table 4.16 shows the classification confusion matrix for the validation
(testing) dataset, which contains 154 cases. Class 1 denotes "positive for diabetes" and
class 0 denotes "negative for diabetes". The KNN system classifies 138 of the 154
cases correctly and misclassifies 16 cases. The KNN correct classification rate is
89.61% and the misclassification error rate is 10.39%.
Table 4.16 KNN validation Data- Performance and Error Report
Error Report
Class # Cases # Errors % Error
1 43 7 16.28
0 111 9 8.11
Overall 154 16 10.39
Table 4.17 reports the correct classification and misclassification counts for the
10 different test cases. Among the 10 cases, test case 1 gave the minimum
misclassification count of 16 out of 154 test cases, while test case 9 gave the
maximum of 32 out of 154. The average correct classification count was 129.8 out of
154 and the average misclassification count was 24.2.
We converted the correct classification and misclassification counts into
percentages. The table below contains the correct classification accuracy and
the misclassification error rate in percent. Test case 1 gives the minimum
misclassification error rate, 10.39%, and test case 9 gave the maximum, 20.78%.
Overall, the average correct classification accuracy is 84.29% and the
misclassification error rate is 15.71%.
In the literature survey we found that 2 different researchers constructed
case-based reasoning systems using the K-Nearest Neighbor algorithm on the
Pima Indian Diabetes Dataset. Their classification performance was less than 70%.
We separated the dataset by the two classes and replaced the missing
values with the corresponding field's mean value. The Pima Indian Diabetes Dataset
has a lot of missing values, and handling them plays a major role in the
classification performance. In the K-NN method, finding the best value of K also plays
a major role in the classification performance. It is a trial-and-error process, but the
xlminer software has an option to find the best k value based on the training and
testing datasets, and we used it to find the best k value. Because of our effective
handling of missing values and finding the best k value through the xlminer software,
we reached a classification performance of 84.29%. Table 4.18 and Figure 4.10 show
the effective classification performance of our case-based reasoning system compared
with the earlier case-based reasoning systems on the Pima Indian Diabetes Dataset.
Hybrid System Performance and Results
Note:
PM-MCN – Proposed Model Misclassification Count
Below we give the detailed result calculations for the first 5 tests of
table 4.19, in the form of tables 4.20 to 4.24. The remaining test results are attached in
Appendix C.
We used the cutoff (threshold) value 0.5 for the combined ensemble method. We add
the ANN probability of success for Class 1 (positive for diabetes) and the KNN
(CBR) probability of success for Class 1 and take the average probability value
for Class 1. If the average probability value is greater than or equal to 0.5 (the cutoff
value), the result is "Positive for Diabetes"; otherwise the result is "Negative
for Diabetes".
Note:
Yes – Positive for Diabetes Disease
No – Negative for Diabetes Disease
PM Result – Proposed Model Result
Result Calculation for Test No. 2
Table 4.21 Output Calculation for Test No. 2
Result Calculation for Test No. 3
Table 4.22 Output Calculation for Test No. 3
Result Calculation for Test No. 4
Result Calculation for Test No. 5
4.7 Comparing Proposed Model Results with Earlier Results
Comparison Chart
The above table 4.25 and figure 4.11 show the comparison between the individual
machine learning methods' classification rates and the proposed ensemble model.
The proposed model gives a correct classification performance of 84.16%, while among
the other machine learning algorithms LogDisc gives at most 77.7%.
Table 4.26 Comparing proposed Model Result with Table-4.2 Result
Sr. No. | Model Type | Classification Performance (%)
1 BPNN 76.17
2 GRNN 75.26
3 RBFNN 76.56
4 PNN 75.26
5 CMTNN 76.49
6 Proposed Model 84.16
Figure 4.12 Chart shows comparison between PM Result and Table-4.2 Result
Comparing Proposed Model Results with Table 4.3 Results
Table 4.3 lists 3 machine learning methods, SVM, Simple Logistics and Multilayer
Perceptron, and their classification performance on the Pima Indian Diabetes dataset.
Among the 3 algorithms, Simple Logistics gives the maximum classification performance,
77.86%. Our proposed ensemble model gave 84.16% classification performance.
This is represented in Table 4.27 and Figure 4.13.
Algorithm             | Classification Performance (%)
SVM                   | 77.21
Simple Logistics      | 77.86
Multilayer Perceptron | 76.69
PM (Ensemble Method)  | 84.16
Figure 4.13 Chart shows the Comparison between PM result and Table-4.3
result
Comparing our Proposed Model Result with Bylander's Classification Results (Table 4.4)
Method                   | Accuracy (%)
Belief Network (Laplace) | 72.50
Belief Network           | 72.30
Decision Tree            | 72.00
Naïve Bayes              | 71.50
Proposed Model           | 84.16
Bylander applied 4 machine learning methods to the Pima Indian Diabetes dataset
and obtained a maximum classification performance of 72.50%, using the belief
network with Laplace estimates. Our proposed method gave 84.16% classification
performance on the diabetes dataset, nearly 12% more than the Bylander result. This is
represented in Table 4.28 and Figure 4.14.
Comparison Chart
Comparing Proposed Model Results with Misra and Dehuri Table 4.5 Results
Misra and Dehuri constructed the Functional Link Artificial Neural Network. FLANN
gave 78.13% classification accuracy on the Pima Indian Diabetes dataset. Our
proposed model gave 84.16% classification accuracy, nearly 6% more than the
FLANN classification performance. This is represented in Table 4.29 and
Figure 4.15.
Comparison Chart
4.8 Chapter Summary
Ours is the first hybrid method for classifying the Pima Indian Diabetes
Dataset. Following our proposed algorithm, we developed an Artificial Neural
Network that gave 83.70% classification performance, nearly 7% more than the
previous ANN systems' classification performance.