CHAPTER – IV

HYBRID MACHINE LEARNING FOR IMPROVING


CLASSIFICATION ACCURACY

4.1.1 Hybrid Machine Learning


Machine learning algorithms are used in data mining applications to retrieve hidden information that can support good decision-making. Machine learning comprises various techniques such as rule-based learning, case-based reasoning, artificial neural networks and decision trees. Every technique has its own advantages and disadvantages. In the past, many hybrid machine learning systems were developed to bring out the best of two different machine learning methods. For example, Rohit and Kumkum created a hybrid machine learning system based on genetic algorithms and support vector machines for stock market prediction [8]. Nerijis, Ignas and Vida developed a hybrid machine learning approach for text categorization using decision trees and artificial neural networks [8]. Sankar developed an integrated data mining approach for maintenance scheduling using case-based reasoning and artificial neural networks [9], and Mammone developed a hybrid machine learning system combining a neural network and a decision tree. In our research we use an artificial neural network together with the case-based reasoning technique. Artificial neural networks give good classification accuracy compared to other machine learning techniques such as rule-based systems and decision trees. Through this hybrid machine learning we want to improve the classification accuracy of the neural network system so that it may be used for medical diagnosis.

4.1.2 Improving Classification Accuracy


For a machine learning system to be useful in solving medical diagnostic tasks it has to satisfy some desired features: good performance, transparency of diagnostic knowledge, the ability to explain decisions, and the ability of the algorithm to reduce the number of tests needed to obtain a reliable diagnosis. To give good performance, the classification system has to make its diagnostic accuracy on new cases as high as possible. Nowadays, to get the best performance, several learning algorithms are tested on the available dataset and the best one or two are selected for diagnosis. Although artificial neural networks already give good performance in medical diagnosis, it is still desirable to improve their accuracy. To achieve this we combine them with the case-based reasoning approach.

4.2.1 Importance of Case Based Reasoning
Case-based reasoning is closely related to human reasoning. Many of the problems humans encounter are solved with the human equivalent of case-based reasoning. When a person encounters a previously unexperienced situation or problem, he relates it to a past experience of a similar problem. This similar previous experience may be one that he himself or someone else has had. If it is someone else's experience, the case is added to the reasoner's memory via an oral or written account of that experience [10].

In medical diagnosis, when a patient comes to the doctor, the doctor examines the patient and immediately recollects similar past cases. Because similar cases have similar answers, the doctor looks for similar past cases to treat the present case. The retrieved similar case may match the new case fully or only partially. If it is a full match, the corresponding solution may be used directly; otherwise the solution is modified to fit the new case. In our research, when an input case arrives, the case-based reasoning system retrieves the nearby similar cases and gives them as training samples to the neural network. A neural network's classification performance depends heavily on the quality of the training samples used at training time. The training samples should be valid and relevant: even if the number of training samples is large, unrelated samples will hurt the classification performance of the neural network. One of the main challenges of data mining applications is data growth, and a medical dataset may hold a great many patient records. Instead of feeding in a large volume of unrelated data, it is always better to supply meaningful, related data for training. This helps the neural network train properly and classify the user input more accurately.

4.2.2 Case Based Reasoning Approach

CBR System
Case-based reasoning is a methodology for solving problems by utilizing previous experiences. It involves retaining a memory of previous problems and their solutions and using it to solve new problems. When presented with a problem, a case-based reasoner searches its memory of past cases and attempts to find a case with the same problem specification as the current case. If the reasoner cannot find an identical case in its case base, it attempts to find the case or cases that most closely match the current query case.

When a previous identical case is retrieved, and presuming its solution was successful, it can be returned as the current problem's solution. In the more likely case that the retrieved case is not identical to the current case, an adaptation phase occurs. In adaptation, the differences between the current case and the retrieved case are first identified, and then the solution associated with the retrieved case is modified to take these differences into account. The solution returned in response to the current problem specification may then be tried in the appropriate domain setting.

A case-based reasoning (CBR) system incorporates the reasoning mechanism and its external facets: the input specification (the problem case), the suggested output solution, and the memory of past cases referenced by the reasoning mechanism. This is represented in Figure 4.1.

(Figure: a Problem Case enters the Case-Based Reasoning Mechanism, which consults the Case Base and returns a Derived Solution.)

Figure 4.1 CBR System [10]

CBR System Internal Structure

(Figure: a Problem Case passes through the Case Retriever and then the Case Reasoner, both of which consult the Case Base, producing a Derived Solution.)

Figure 4.2 Two Major Components of a CBR System

A CBR system has an internal structure divided into two major parts, the case retriever and the case reasoner, as shown in Figure 4.2. The case retriever's job is to find the appropriate cases in the case base for the given input case. The case reasoner uses the retrieved cases to find a solution to the given input case. This reasoning generally involves both determining the differences between the retrieved cases and the current input case, and modifying the retrieved solution to reflect those differences. The case reasoner may or may not consult the case base for further relevant cases while forming the solution.
A case can take the form of a record containing all the relevant information about a previous experience or problem. The information recorded about this past experience depends on the domain of the reasoner and the purpose to which the case will be put. For a problem-solving CBR system, the details will usually include the specification of the problem and the relevant attributes of the environment, that is, the circumstances of the problem. The other important part of the case is the solution that was applied in the previous situation. Depending on how the CBR system reasons with cases, this solution may include only the facts of the solution or, additionally, the steps or processes involved in obtaining it. If the cases in the case base have achieved different degrees of success or failure, it is also important to include the achieved measure of success in the case description.

Case-based reasoning has a better chance of being applicable when the problem domain lacks a fundamental model, when exceptions or novel cases occur, or when similar cases are likely to recur frequently. Reducing the knowledge-acquisition task, avoiding the repetition of past mistakes, degrading gracefully in performance, reasoning in a domain with a small body of knowledge, and learning over time are some of the important reasons for which case-based reasoning is used.

Case Representation
Cases in a case base can represent many different types of knowledge and store it in many different representational formats. The objective of the system will greatly influence what is stored. A case-based reasoning system may be aimed at creating a new design or plan, diagnosing a new problem, or arguing a point of view with precedents. In each type of system a case may represent something different: cases could be people, things or objects, situations, diagnoses, designs, plans or rulings, among others.

In many practical CBR applications, cases are usually represented as two unstructured sets of attribute-value pairs, i.e. the problem features and the solution features. However, deciding what to represent can be one of the hardest decisions to make. For example, in a medical CBR system that diagnoses a patient, a case could represent an individual's entire case history or be limited to a single visit to a doctor. In the latter situation the case may be a set of symptoms along with the diagnosis; it may also include the treatment. If a case is a whole person, a more complete model is being used, since it can capture the change of symptoms from one visit to the next; it is, however, harder to search cases in this format for a particular set of symptoms and obtain a diagnosis or treatment. Alternatively, if a case is simply a single visit to the doctor, covering the symptoms at the time of that visit and their diagnosis, then changes in symptoms that might be a useful key in solving a problem may be missed. Cases may therefore need to be broken down into sub-cases; for example, a case could be a person's medical history that includes all of their visits to the doctor as sub-cases. A sample case structure is represented in Figure 4.3.

Patient
    Age
    Height
    Weight
    Visit 1
        Symptom 1
        Symptom 2
        Diagnosis
        Treatment
    Visit 2
    Visit 3

Figure 4.3 A Patient Case Record

No matter what the case actually represents as a whole, its features have to be represented in some format. One of the advantages of case-based reasoning is its flexibility in this regard: depending on what types of features have to be represented, an appropriate implementation platform can be chosen. Ranging from simple Boolean, numeric and textual data to binary files, time-dependent data and relationships between data, CBR can be made to reason with all of them.

Whatever is stored, and whatever format it is represented in, a case must store the information that is relevant to the purpose of the system and that will ensure the most appropriate case is retrieved in each new situation. Thus the cases have to include those features that will ensure the case is retrieved in the most appropriate contexts.

In many CBR systems, not every encountered case needs to be stored; criteria are then needed to decide which cases will be stored and which discarded. Where two or more cases are very similar, only one of them may need to be stored. Alternatively, it may be possible to create an artificial case that is a generalization of two or more actual incidents or problems; by creating generalized cases, the most important aspects of a case need only be stored once.

When choosing a representation format for a case there are many choices and many factors to consider. Some examples of representation formats that may be used include database formats, frames, objects and semantic networks.

Whatever format the cases are represented in, the collection of cases itself has to be structured in some way that facilitates retrieval of the appropriate case when queried. Numerous approaches have been used for this. A flat case base is a common structure: indices are chosen to represent the important aspects of each case, and retrieval involves comparing the current case's features to each case in the case base. In our work the diabetes dataset is stored in the form of a flat file containing nine fields for the patient input and output parameters. Another common case-base structure is a hierarchy that stores the cases in groups, reducing the number of cases that have to be searched.

Case Indexing
Case indexing refers to assigning indices to cases for future retrieval and comparison. The choice of indices is important for retrieving the right case at the right time, because the indices of a case determine the contexts in which it will be retrieved in the future. Some suggestions for choosing indices follow. Indices must be predictive in a useful manner: they should reflect the important aspects of the case, the attributes that influenced its outcome, and the circumstances in which it is expected to be retrieved in the future. Indices should be abstract enough to allow a case's retrieval in all the circumstances in which it will be useful, but not too abstract; when a case's indices are too abstract, the case may be retrieved in too many situations, or too much processing may be required to match cases.

Case Retrieval
Case retrieval is the process of finding, within the case base, those cases that are closest to the current case. To carry out case retrieval there must be criteria that determine how a case is judged appropriate for retrieval, and a mechanism to control how the case base is searched. The selection criteria are necessary to decide which case is the best one to retrieve, that is, to determine how close the current and stored cases are.

These criteria depend in part on what the case retriever is searching for. Most often the case retriever searches for an entire case whose features are compared to those of the current query case. There are, however, times when only a portion of a case is required: perhaps because no suitable full case exists and a solution is being built from portions of multiple cases, or because a retrieved case is being modified by adapting a portion of another case in the case base. The actual processes involved in retrieving a case from the case base depend very much on the memory model and indexing procedures used.

Nearest Neighbor Retrieval based on Euclidean distance


Euclidean distance is used to retrieve all nearby cases similar to the current user case. If u = (x1, y1) and v = (x2, y2), then the Euclidean distance between u and v is

    d(u, v) = √((x1 - x2)² + (y1 - y2)²)

We considered all the parameters as having equal weight. When a new input case arrives, we retrieve all the nearby past cases based on the distance value calculated with the Euclidean distance. In our research we use a fixed distance value (e.g. 1.5), and all the cases whose distance is less than or equal to this fixed value are retrieved for training the artificial neural network. Instead of retrieving the whole past case base, only the cases that fall inside the fixed-distance boundary are retrieved and sent as training samples to the feed-forward backpropagation neural network.
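The retrieval step described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the toy case base, the two-dimensional features and the helper names are hypothetical, and only the fixed radius of 1.5 comes from the text.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def retrieve_similar_cases(case_base, query, radius=1.5):
    """Return all past cases whose distance to the query is <= radius.

    `case_base` is a list of (features, label) pairs; `radius` mirrors
    the fixed distance value (e.g. 1.5) mentioned in the text.
    """
    return [(features, label)
            for features, label in case_base
            if euclidean(features, query) <= radius]

# Toy example: three stored cases, one query case.
cases = [([1.0, 2.0], 0), ([1.2, 2.1], 1), ([5.0, 9.0], 0)]
similar = retrieve_similar_cases(cases, [1.1, 2.0], radius=1.5)
# Only the two nearby cases fall inside the radius; the distant one is excluded.
```

In practice the retrieved subset, rather than the full case base, is what gets handed to the neural network as its training sample.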

4.3 Introduction to the Pima Indian Diabetes dataset


This dataset was originally donated by Vincent Sigillito, Applied Physics Laboratory, Johns Hopkins University, Laurel, MD 20707. It was selected from a larger database held by the National Institute of Diabetes and Digestive and Kidney Diseases, and it is publicly available in the UCI machine learning repository. All patients represented in this dataset are females at least 21 years old, of Pima Indian heritage, living near Phoenix, Arizona, USA. The dataset contains 8 input variables and a single output variable called class. A class value of 1 means the patient tested positive for diabetes and 0 means the patient tested negative.
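A minimal loader for this nine-column layout might look as follows. The file name, the CSV format and the short column names are assumptions for illustration; the text only fixes eight input variables plus a binary class output.

```python
import csv

# Hypothetical column names for the eight inputs plus the binary `class`
# output (1 = tested positive for diabetes, 0 = tested negative).
COLUMNS = ["preg", "pg", "dbp", "skin", "insulin", "bmi", "pedig", "age", "class"]

def load_pima(path):
    """Read the dataset into a list of dicts: float inputs, int class."""
    rows = []
    with open(path, newline="") as f:
        for record in csv.reader(f):
            values = [float(x) for x in record]
            row = dict(zip(COLUMNS, values))
            row["class"] = int(row["class"])
            rows.append(row)
    return rows
```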

4.4 Literature Background


The Pima Indian Diabetes dataset is very difficult to classify, and a great deal of research has been done on it to improve classification accuracy. Michie, Spiegelhalter and Taylor used different machine learning methods to classify the Pima Indian Diabetes dataset [15]. Table 4.1 shows the machine learning algorithms applied and the classification accuracies obtained on the Pima Indian diabetes dataset.

Table 4.1 Michie, Spiegelhalter and Taylor Classification Result on Pima
Indian Diabetes Dataset

Sr. No   Algorithm   Correct Classification (%)   Misclassification (Error Rate %)
1 Discrim 77.5 22.5
2 Quadisc 73.8 26.2
3 Logdisc 77.7 22.3
4 SMART 76.8 23.2
5 ALLOC80 69.9 30.1
6 K-NN 67.6 32.4
7 CASTLE 74.2 25.8
8 CART 74.5 25.5
9 IndCART 72.9 27.1
10 NewID 71.1 28.9
11 AC2 72.4 27.6
12 Baytree 72.9 27.1
13 NaiveBay 73.8 26.2
14 CN2 71.1 28.9
15 C4.5 73 27
16 Itrule 75.5 24.5
17 Cal5 75 25
18 Kohonen 72.7 27.3
19 DIPOL92 77.6 22.4
20 Backprop 75.2 24.8
21 RBF 75.7 24.3
22 LVQ 72.8 27.2
Average 73.80 26.20

Among the 22 machine learning algorithms used, Logdisc was the most impressive: it gave 77.7% correct classification and 22.3% misclassification. K-NN gave the worst performance, with 67.6% correct classification and 32.4% misclassification. The two artificial neural network algorithms, Backprop and RBF, gave 75.2% and 75.7% correct classification, with 24.8% and 24.3% misclassification respectively.

A great deal of research has also been done with artificial neural networks to improve classification accuracy on the Pima Indian diabetes dataset. Jeatrakul and Wong carried out a comparative study of the performance of different neural networks on the Pima Indian Diabetes dataset [17]. They used 5 different types of neural network architectures, namely the Back-Propagation Neural Network (BPNN), Radial Basis Function Neural Network (RBFNN), General Regression Neural Network (GRNN), Probabilistic Neural Network (PNN) and Complementary Neural Network (CMTNN). Table 4.2 shows the classification performance of the BPNN, RBFNN, GRNN, PNN and CMTNN.

Table 4.2 Jeatrakul and Wong Classification Result on the Pima Indian
Diabetes Dataset

Test No. BPNN GRNN RBFNN PNN CMTNN


1 77.27 74.68 79.22 74.68 77.92
2 76.62 79.87 79.22 79.87 76.62
3 70.13 70.13 74.03 70.13 72.08
4 85.71 81.82 79.22 81.82 83.77
5 75.97 75.97 77.27 75.97 75.32
6 70.78 70.13 72.08 70.13 72.08
7 75.32 72.73 76.62 72.73 75.97
8 79.22 78.57 77.27 78.57 79.22
9 74.68 74.68 76.62 74.68 75.32
10 75.97 74.03 74.03 74.03 76.62
Average 76.17 75.26 76.56 75.26 76.49

Estebanez, Alter and Valls used genetic-programming-based data projections for classification tasks [20]. They used the Pima Indian diabetes dataset in their research and reduced the input dimension from 8 to 3. They applied the Support Vector Machine (SVM), Simple Logistics and Multilayer Perceptron algorithms to the Pima Indian Diabetes data for classification. Their results are shown in Table 4.3.

Table 4.3 Estebanez, Alter and Valls Classification Result on Pima Indian
Diabetes Dataset

Sr. No.   Algorithm   Classification Performance (%)
1 SVM 77.21
2 Simple Logistics 77.86
3 Multilayer Perceptron 76.69

The Multilayer Perceptron from the artificial neural network family gave 76.69% classification performance; Simple Logistics gave the maximum performance of 77.86%. Lena Kallin Westin, in her paper on missing data and the preprocessing perceptron, discussed different preprocessing methods for handling the missing data in the Pima Indian Diabetes dataset [18]. She developed a preprocessing perceptron to train a decision support system on the diabetes dataset; the trained decision support system gave an average classification performance of 79%. Bylander used naïve Bayes, decision trees and two types of belief networks on the Pima Indian diabetes dataset [19]. Table 4.4 shows the classification methods and the classification performance obtained by Bylander.

Table 4.4 Bylander Classification Performance on Pima Indian Diabetes Dataset

Sr. No. Method Accuracy


1 Belief Network(Laplace) 72.50%
2 Belief Network 72.30%
3 Decision Tree 72.00%
4 Naïve Bayes 71.50%

Misra and Dehuri, in their research paper "Functional Link Artificial Neural Network for Classification Task in Data Mining", created a Functional Link Artificial Neural Network (FLANN) and compared its classification performance with other machine learning algorithms [16]. Their FLANN gave 78.13% classification performance, while the MLP gave 75.2%. Table 4.5 gives the classification performance of different machine learning algorithms on the Pima Indian Diabetes dataset. K-NN from case-based reasoning can be used to retrieve similar past cases and remove the outliers, through which the neural network's classification performance can be improved [48].

Table 4.5 Misra and Dehuri Classification Performance on Pima Indian Diabetes Dataset

Sr. No.   System Name   Classification Accuracy (%)
1 NN 65.1
2 KNN 69.7
3 FSS 73.6
4 BSS 67.7
5 MFS1 68.5
6 MFS2 72.5
7 CART 74.5
8 C4.5 74.7
9 FID3.1 75.9
10 MLP 75.2
11 FLANN 78.13

4.5 Proposed Model and its Functioning

To increase the classification accuracy we have proposed a hybrid machine learning algorithm using the Multilayer Perceptron from artificial neural networks and K-NN from case-based reasoning.

The Algorithm

1. Perform the preprocessing step for handling the missing values in the Pima Indian Diabetes Dataset:

   i) Replace each missing value by its column's mean value, computed for each output class separately.

2. Divide the preprocessed dataset 80% / 20% into the Training Dataset T1 and the Testing Dataset T2.

3. Train the Artificial Neural Network system with the Backpropagation algorithm using Training Dataset T1.

4. Train the Case-Based Reasoning system with the K-Nearest Neighbor algorithm using Training Dataset T1.

5. The Ensemble system calculates the combined mean value from the outputs of the ANN and CBR systems. Based on the calculated mean value it displays the output, either "Positive for Diabetes" or "Negative for Diabetes".

6. Use the Testing Dataset T2 to calculate the classification performance of the proposed system.
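Steps 1, 2 and 5 above can be sketched as follows. This is a hedged outline rather than the exact thesis code: it assumes missing values are recorded as 0 in the input columns, and the helper names are hypothetical.

```python
import random
import statistics

def preprocess(rows, input_keys, missing=0.0):
    """Step 1: replace missing values (assumed to be recorded as 0)
    with the column mean, computed separately for each output class."""
    for cls in (0, 1):
        group = [r for r in rows if r["class"] == cls]
        for key in input_keys:
            known = [r[key] for r in group if r[key] != missing]
            if not known:
                continue
            mean = statistics.mean(known)
            for r in group:
                if r[key] == missing:
                    r[key] = mean
    return rows

def split_80_20(rows, seed=12345):
    """Step 2: random 80/20 split into training set T1 and testing set T2."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]

def ensemble_predict(ann_output, cbr_output, cutoff=0.5):
    """Step 5: average the two output probabilities and threshold at 0.5."""
    mean = (ann_output + cbr_output) / 2
    return "Positive for Diabetes" if mean >= cutoff else "Negative for Diabetes"
```

Steps 3, 4 and 6 plug into this skeleton: the two trained systems each map a test record to a probability, which `ensemble_predict` combines.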

The block diagram of the proposed model is displayed in Figure 4.4.

(Figure: the input test data is passed in parallel to two trained systems: the Artificial Neural Network system, trained with the Backpropagation algorithm, and the Case-Based Reasoning system, trained with the K-NN algorithm. Each produces a calculated output probability; the Ensemble Method combines them using the mean method to give the calculated output for the test data.)

Figure 4.4 Proposed Hybrid Machine Learning System for Medical Diagnosis

The block diagram has two important trained machine learning systems. The first is the Artificial Neural Network system, which used the Backpropagation algorithm for its training. The second is the Case-Based Reasoning system, which used the K-Nearest Neighbor algorithm for its training. The total dataset contains 768 patient records, of which 614 records (80%) are used for training and the remaining 154 records (20%) for testing. At testing time the new test data is passed through the trained ANN and CBR systems. Both systems give their calculated output values, each between 0 and 1, to the Ensemble module.

The Ensemble module uses the mean method to combine the values from the ANN and CBR systems. Based on the calculated output value, the result will be either positive or negative for diabetes. We used a cut-off value of 0.5: if the value calculated by the Ensemble module is greater than or equal to 0.5 then the output is "Positive for Diabetes"; otherwise the output is "Negative for Diabetes".

Created Artificial Neural Network Structure


The artificial neural network used the multi-layer feed-forward network architecture and was trained with the Backpropagation algorithm. The network has an input layer, a single hidden layer and an output layer. The input layer has 8 input nodes, the hidden layer has 5 neurons and the output layer has a single neuron. The sigmoid function is used in both the hidden layer and the output layer. Squared error is used as the cost function to adjust the network weights.
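The forward pass of this 8-5-1 network can be sketched as follows. The weight values are random placeholders; only the architecture, the sigmoid activations and the squared-error cost come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# 8-5-1 architecture: 8 input nodes, 5 hidden sigmoid neurons,
# 1 sigmoid output neuron. Weights here are random placeholders.
W1, b1 = rng.normal(size=(8, 5)), np.zeros(5)   # input  -> hidden
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)   # hidden -> output

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    """One forward pass; the result lies in (0, 1) and is read as the
    probability that the patient is positive for diabetes."""
    hidden = sigmoid(x @ W1 + b1)
    return float(sigmoid(hidden @ W2 + b2)[0])

def squared_error(target, output):
    """Squared-error cost used to adjust the network weights."""
    return 0.5 * (target - output) ** 2

p = forward(rng.uniform(size=8))   # one normalized 8-field patient record
```

Backpropagation then differentiates `squared_error` with respect to `W1`, `b1`, `W2` and `b2` and updates them against the gradient.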

Figure 4.5 Multi-Layer Feed-Forward Neural Network (input layer → hidden layer → output layer)
4.6 Experiment Results and Their Explanation

We have conducted 10 different test cases from the Pima Indian Diabetes Dataset based on different random samplings. The ANN and K-NN below are constructed for the first test case. The following paragraphs explain the structure, training and testing of the ANN, the K-NN and the hybrid method.

ANN Training Information

The total Pima Indian Diabetes Dataset contains 768 patients' records. We used an 8-5-1 ANN architecture for training and testing on the Pima Indian Diabetes Dataset. We used random sampling with the random seed 12345 to select the samples for training and testing: 80% of the data is assigned to the training dataset and the remaining 20% to the testing dataset. The ANN architecture is represented in Figure 4.5, and Table 4.6 shows the dataset partition for training and testing.

Table 4.6 ANN Training and Testing Dataset

Data source: Sheet1!$A$2:$I$769
Selected variables: preg, Pg, dbp, skin, insulin, bmi, pedig, age, class
Partitioning method: randomly chosen
Random seed: 12345
# training rows: 614
# validation rows: 154
Training data used for building the model: ['preprocessed_mean_separate_class.xls']'Data_Partition1'!$C$19:$J$632
# records in the training data: 614
Validation data: ['preprocessed_mean_separate_class.xls']'Data_Partition1'!$C$633:$J$786
# records in the validation data: 154
Input variables normalized: Yes

ANN Network Parameters
The dataset has 8 input fields and 1 output field. The hidden layer has 5 nodes and the output layer has one node. We used the squared-error function as the cost function, and the standard sigmoid function as the activation function in the hidden and output layers. We trained the network for 200 epochs. Table 4.7 shows the ANN network training functions and parameters.

Table 4.7 ANN Network Parameters and Activation Functions

Variables
# Input Variables 8
Input variables preg Pg dbp skin insulin bmi pedig age
Output variable class

Parameters/Options
# Hidden layers 1
# Nodes in HiddenLayer-1 5
CostFunctions Squared error
Hidden layer sigmoid Standard
Output layer sigmoid Standard
# Epochs 200
Step size for gradient descent 0.1
Weight change momentum 0.6
Error tolerance 0.01
Weight decay 0
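Given the step size (0.1) and momentum (0.6) in Table 4.7, a single backpropagation weight update follows the usual gradient-descent-with-momentum rule. The sketch below shows one scalar weight; it is illustrative only and not the exact update code used in the experiments.

```python
def momentum_update(weight, velocity, gradient, lr=0.1, momentum=0.6):
    """One gradient-descent step with momentum: the new change mixes
    0.6 of the previous change with a 0.1-scaled gradient step."""
    velocity = momentum * velocity - lr * gradient
    return weight + velocity, velocity

w, v = 1.0, 0.0
w, v = momentum_update(w, v, gradient=0.5)   # w: 1.0 -> 0.95
w, v = momentum_update(w, v, gradient=0.5)   # momentum enlarges the step
```

The momentum term smooths the descent and helps the weights move past shallow local minima, which matters here since the error surface of an MLP has many of them.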
Inter-Layer Node Connection Weights

The ANN has node connections between the input, hidden and output layers, and each connection has a weight. Table 4.8 shows the inter-layer connection weights between the input layer and the hidden layer, and between the hidden layer and the output layer.

Table 4.8 ANN Inter Layer Connection Weights

Input Layer

Hidden Bias
Layer # 1 preg Pg dbp skin insulin bmi pedig age Node
Node # 1 -4.36 -3.23 -0.18 -4.12 -5.17 -2.31 -2.59 3.57 -0.70
Node # 2 -1.85 -5.71 1.94 -0.36 -6.08 -1.25 -4.95 2.22 -2.51
Node # 3 -0.87 -0.35 0.27 0.08 2.68 0.00 -0.21 -4.84 -4.40
Node # 4 1.92 -1.25 0.25 5.21 -14.06 0.73 0.06 1.35 -1.72
Node # 5 -1.98 -0.29 1.14 -10.26 -1.52 -3.28 3.00 -3.61 -3.19

Output Layer Node   Hidden Node # 1   Hidden Node # 2   Hidden Node # 3   Hidden Node # 4   Hidden Node # 5   Bias Node
1                   -2.96828          -2.69314          -6.6264           -5.84048          -3.81711           6.81824
0                    2.96833           2.69316           6.62652           5.84058           3.81716          -6.81835

ANN Training Curve

We trained the ANN for 200 epochs. Initially the training error rate was 36.6, and at the final epoch the error rate was 10.4. Figure 4.6 plots the error rate against the epoch number.

Figure 4.6 ANN Training Error Curve

Table 4.9 shows the classification confusion matrix and the error report for the training dataset, which contains 614 cases. Class 1 denotes "positive for diabetes" and class 0 denotes "negative for diabetes". The ANN system classifies 552 of the 614 cases correctly and misclassifies 62. The ANN's correct classification rate is 89.90% and the misclassification error rate is 10.10%.

ANN Training Data – Performance Report

Table 4.9 ANN Training Data Performance and Error Report

Classification Confusion Matrix


Predicted Class
Actual Class 1 0
1 199 26
0 36 353

ANN Training Data – Error Report

Error Report
Class # Cases # Errors % Error
1 225 26 11.56
0 389 36 9.25
Overall 614 62 10.10

Table 4.10 shows the classification confusion matrix and error report for the validation (testing) dataset, which contains 154 cases. Class 1 denotes "positive for diabetes" and class 0 denotes "negative for diabetes". The ANN system classifies 133 of the 154 cases correctly and misclassifies 21. The ANN's correct classification rate is 86.36% and the misclassification error rate is 13.64%.

ANN Validation Data – Classification Performance Report

Table 4.10 ANN Validation Data- Performance and Error Report

Classification Confusion Matrix


Predicted Class
Actual Class 1 0
1 36 7
0 14 97

ANN Validation Data – Classification Error Report

Error Report
Class # Cases # Errors % Error
1 43 7 16.28
0 111 14 12.61
Overall 154 21 13.64
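The per-class error rates above follow directly from the confusion matrix; the short check below reproduces the 16.28%, 12.61% and 13.64% figures from Table 4.10.

```python
# Validation confusion matrix from Table 4.10 (actual class -> counts).
counts = {1: {"correct": 36, "errors": 7},    # 43 actual positives
          0: {"correct": 97, "errors": 14}}   # 111 actual negatives

def error_rate(correct, errors):
    """Percentage of misclassified cases, rounded to two decimals."""
    return round(100 * errors / (correct + errors), 2)

rate_1 = error_rate(**counts[1])     # 7 / 43  -> 16.28
rate_0 = error_rate(**counts[0])     # 14 / 111 -> 12.61
total_errors = sum(c["errors"] for c in counts.values())
total_cases = sum(c["correct"] + c["errors"] for c in counts.values())
overall = round(100 * total_errors / total_cases, 2)   # 21 / 154 -> 13.64
```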

ANN Correct Classification and Misclassification Counts for 10 Different Test Cases

Table 4.11 reports the correct and missed classification counts for the 10 different test cases. Among the 10 cases, test case 1 gave the minimum misclassification count of 21 out of 154 test cases, while test cases 4, 7 and 9 gave the maximum misclassification count of 28 out of 154. The average correct classification count was 128.9 out of 154 and the average misclassification count was 25.1.

Table 4.11 ANN Classification Performance for 10 different test cases

Test No.   Correct Classification Count   Misclassification (Error) Count
1 133 21
2 132 22
3 131 23
4 126 28
5 131 23
6 129 25
7 126 28
8 128 26
9 126 28
10 127 27
Average 128.9 25.1

ANN Correct Classification Accuracy and Misclassification Error Rate for 10 Different Test Cases

We converted the correct classification and misclassification counts into percentages. Table 4.12 shows the correct classification accuracy and the misclassification error rate in percent (%). Test case 1 gives the minimum misclassification error rate of 13.64%, and test cases 4, 7 and 9 give the maximum of 18.18%. Overall, the average correct classification accuracy was 83.70% and the average misclassification error rate was 16.30%.

Table 4.12 ANN Classification Performance Accuracy for 10 different test cases

Test No.  Correct Classification (%)  Misclassification (%)
1 86.36 13.64
2 85.71 14.29
3 85.06 14.94
4 81.82 18.18
5 85.06 14.94
6 83.77 16.23
7 81.82 18.18
8 83.12 16.88
9 81.82 18.18
10 82.47 17.53
Average 83.70 16.30

Performance Comparison of Earlier ANN Systems and Our ANN System

In the literature survey we found that four different research groups constructed artificial neural networks using the backpropagation algorithm on the Pima Indian Diabetes dataset. Their classification performance was less than 77%. The dataset contains two classes: class 1 means the patient is "positive for diabetes" and class 0 means the patient is "negative for diabetes". The Pima Indian Diabetes dataset has a lot of missing values, and handling them plays a major role in classification performance. We separated the dataset by class and replaced each missing value with the corresponding field's mean for that class. Training the neural network is an equally important step, because the error surface has many local minima and finding a good solution is a trial-and-error process. Because of our effective handling of missing values and effective training, we reached a classification performance of 83.70%. Our neural network structure is 8-5-1.
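The 8-5-1 structure can be sketched as a single forward pass. This is a minimal illustration with placeholder weights (our assumption, purely to show the shapes), not the trained network from this chapter:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, w_hidden, b_hidden, w_out, b_out):
    """8 inputs -> 5 sigmoid hidden nodes -> 1 sigmoid output, P(class 1)."""
    hidden = [sigmoid(sum(w * x for w, x in zip(w_row, inputs)) + b)
              for w_row, b in zip(w_hidden, b_hidden)]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)

# Placeholder parameters, only to show the 8-5-1 shapes; in the actual
# system they come from backpropagation training.
w_hidden = [[0.1] * 8 for _ in range(5)]  # 5 hidden nodes x 8 inputs
b_hidden = [0.0] * 5
w_out = [0.2] * 5                         # 1 output node x 5 hidden nodes
b_out = -0.5

p = forward([1, 0, 1, 0, 1, 0, 1, 0], w_hidden, b_hidden, w_out, b_out)
label = 1 if p >= 0.5 else 0  # the 0.5 cutoff used throughout this chapter
```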

We used 5 hidden nodes in the hidden layer. Table 4.13 and Figure 4.7 show the effective classification performance of our neural network.

Table 4.13 Performance comparisons between Different ANN Systems

Sr. No.  ANN System Reference  Correct Classification Accuracy  Misclassification Error Rate
1 Michie, Spiegelhalter and Taylor 75.20 24.80
2 Jeatrakul and Wong 76.17 23.83
3 Estebanez, Alter and Valls 76.69 23.31
4 Misra and Dehuri 75.20 24.80
5 Our Method 83.70 16.30

Comparison Chart

Figure 4.7 Various ANN Classifiers Performance Comparisons

KNN Training Information

We carried out 10 different test cases using different random samples of the Pima Indian Diabetes dataset. The table below shows the training of the KNN system and the best k value for test case 1.

Test case 1 training and testing information:

Best k value selection

During training, the system substitutes various values of k and calculates the corresponding training and validation (testing) error rates. We gave the system the input value k = 20, and it generated training and testing errors for 20 different values of k. The system automatically selects the best k based on the testing error. For k = 13 and k = 20 the error rate is 10.39, the minimum among the 20 values, so the system chose the smaller of the two, k = 13, as the best k. The following table 4.14 shows the training and testing error values for the different k values and the selection of the best k = 13.

Table 4.14 KNN Training and Testing Errors for Different K values

Validation error log for different k and the Best K Value = 13


% Error
Value of k % Error Training
Validation(Testing)
1 0.00 15.58
2 11.40 21.43
3 11.07 12.34
4 13.68 16.88
5 13.68 13.64
6 14.50 14.94
7 14.17 15.58
8 14.82 13.64
9 15.80 14.29
10 14.01 11.04
11 15.64 11.04
12 14.82 12.34
13 15.80 10.39 <--- Best k
14 15.31 11.69
15 16.61 11.69
16 16.12 12.34
17 16.45 11.69
18 15.96 11.69
19 16.45 11.04
20 16.12 10.39
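The best-k selection shown in Table 4.14 can be sketched as follows. This is our assumption about the rule XLMiner applies, consistent with the table: take the smallest k whose validation error equals the minimum over all candidate k values.

```python
def best_k(validation_errors):
    """validation_errors: dict mapping k -> % validation (testing) error."""
    min_err = min(validation_errors.values())
    # Ties are broken in favor of the smaller k, matching the choice of 13 over 20.
    return min(k for k, err in validation_errors.items() if err == min_err)

# Validation errors for k = 10..20, taken from Table 4.14
errors = {10: 11.04, 11: 11.04, 12: 12.34, 13: 10.39, 14: 11.69, 15: 11.69,
          16: 12.34, 17: 11.69, 18: 11.69, 19: 11.04, 20: 10.39}
chosen = best_k(errors)  # both k = 13 and k = 20 reach 10.39; the smaller wins
```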

Figure 4.8 below shows the training error curve for the different values of k. For k = 1 the training error is zero, since each training case is its own nearest neighbor; the training error then generally rises with k, reaching its maximum of 16.61 at k = 15.

Figure 4. 8 K-NN Training Curve

Figure 4. 9 KNN Validation Error Curve

Figure 4.9 above shows the validation error for the different values of k. For k = 13 and k = 20 it gave the minimum testing error value, 10.39. The value 13 was selected as the best k and used at validation time.

Classification Performance on Training Data

The following table 4.15 shows the classification confusion matrix for the training dataset, which contains 614 cases. Class 1 denotes "positive for diabetes" and class 0 "negative for diabetes". The KNN system classifies 517 of the 614 cases correctly and misclassifies 97. The correct classification rate is therefore 84.20% and the misclassification error rate is 15.80%.

Training Data scoring - Summary Report (for k=13)

Table 4.15 KNN Training Data- performance and Error Report

Classification Confusion Matrix


Predicted Class
Actual Class 1 0
1 170 55
0 42 347

Error Report
Class # Cases # Errors % Error
1 225 55 24.44
0 389 42 10.80
Overall 614 97 15.80

Classification Performance on Validation (Testing) Data

The following table 4.16 shows the classification confusion matrix for the validation (testing) dataset, which contains 154 cases. Class 1 denotes "positive for diabetes" and class 0 "negative for diabetes". The KNN system classifies 138 of the 154 cases correctly and misclassifies 16. The correct classification rate is therefore 89.61% and the misclassification error rate is 10.39%.

Table 4.16 KNN validation Data- Performance and Error Report

Classification Confusion Matrix


Predicted Class
Actual Class 1 0
1 36 7
0 9 102

Error Report
Class # Cases # Errors % Error
1 43 7 16.28
0 111 9 8.11
Overall 154 16 10.39

KNN Classifier Correct and Misclassification Counts for the 10 Different Test Cases

Table 4.17 below reports the correct and misclassification counts for the 10 different test cases. Among the 10 cases, test case 1 gave the minimum misclassification count, 16 out of 154 test cases, and test case 9 gave the maximum, 32 out of 154. The average correct classification count was 129.8 out of 154 and the average misclassification count was 24.2.

Table 4.17 KNN Classification performance for 10 Test Cases

KNN (CBR) Classifier's Correct and Misclassification Counts for 154 Test Cases

Test No.  Correct Classification Number  Misclassification Number
1 138 16
2 129 25
3 127 27
4 133 21
5 128 26
6 136 18
7 125 29
8 129 25
9 122 32
10 131 23
Average 129.8 24.2

We converted the correct and misclassification counts into percentages. Table 4.17 below gives the correct classification accuracy and misclassification error rate in percent (%). Test case 1 gave the minimum misclassification error rate, 10.39%, and test case 9 gave the maximum, 20.78%. Overall, the average correct classification accuracy was 84.29% and the average misclassification error rate was 15.71%.

Table 4.17 KNN Classification performance Accuracy for 10 Test Cases

KNN (CBR) Classifier's Correct Classification Accuracy and Misclassification Error Rate for 154 Test Cases

Test No.  Correct Classification Accuracy (%)  Misclassification Error Rate (%)
1 89.61 10.39
2 83.77 16.23
3 82.47 17.53
4 86.36 13.64
5 83.12 16.88
6 88.31 11.69
7 81.17 18.83
8 83.77 16.23
9 79.22 20.78
10 85.06 14.94
Average 84.29 15.71

Performance Comparison of Earlier KNN Systems and Our KNN System

In the literature survey we found that two different research groups constructed case-based reasoning systems using the k-nearest neighbor algorithm on the Pima Indian Diabetes dataset. Their classification performance was less than 70%. We separated the dataset by class and replaced each missing value with the corresponding field's mean for that class; as noted above, handling the many missing values in this dataset plays a major role in classification performance. In the k-NN method, finding the best value of k is equally important, and it is normally a trial-and-error process. The XLMiner software, however, offers an option to find the best k value from the training and testing datasets, and we used it to do so. Because of our effective handling of missing values and the best k value found through XLMiner, we reached a classification performance of 84.29%. Table 4.18 and Figure 4.10 below compare the classification performance of our case-based reasoning system with the earlier case-based reasoning systems on the Pima Indian Diabetes dataset.

Table 4.18 Different KNN System Performances on the Pima Indian Diabetes Dataset

Sr. No.  KNN System Reference  Correct Classification Accuracy  Misclassification Error Rate
1 Michie_Spiegelhalter 67.6 32.4
2 Jeatrakul_Wong 69.7 30.3
3 Our_Method 84.29 15.71

Figure 4.10 Different KNN Classifiers' Performance on the Pima Indian Diabetes Dataset

Hybrid System Performance and Results

Following our proposed algorithm, we first constructed an artificial neural network and a case-based reasoning system separately for the Pima Indian Diabetes dataset. We divided the total 768 cases into a training dataset of 614 and a testing dataset of 154. The same 614 training cases were used to train both the ANN and the KNN system, and the same testing dataset was used to test the classification performance of both. For each test case we took the average of the class 1 probabilities produced by the ANN and KNN systems, with a cutoff probability of 0.5 for class 1: if the averaged probability for a test case is greater than or equal to 0.5, it is assigned to class 1; otherwise it is assigned to class 0. We conducted 10 different tests of the proposed model; the results are presented in table 4.19 below.

Table 4.19 Proposed Hybrid Machine Learning Model Classification Result

Test No.  PM-MCN  Correct Classification (%)  Misclassification Error (%)
1 18 88.31 11.69
2 21 86.36 13.64
3 25 83.77 16.23
4 28 81.82 18.18
5 24 84.42 15.58
6 21 86.36 13.64
7 26 83.12 16.88
8 26 83.12 16.88
9 31 79.87 20.13
10 24 84.42 15.58
Average 24 84.16 15.84

Note:
PM-MCN – Proposed Model Misclassification Number

The detailed result calculations for the first 5 tests of table 4.19 are given below in tables 4.20 to 4.24. The remaining test results are attached in Appendix C.

Threshold / Cutoff Value for the Hybrid Model

We used a cutoff (threshold) value of 0.5 for the combined ensemble method. We add the ANN probability of success for class 1 ("positive for diabetes") and the KNN (CBR) probability of success for class 1 and take the average as the class 1 probability. If the average probability is greater than or equal to 0.5 (the cutoff value), the result is "positive for diabetes"; otherwise the result is "negative for diabetes".

Result Calculation for Test. No 1

Table 4.20 Output Calculation for Test. No.1

Row Id.  Actual Class  ANN Prob. for 1 (success)  CBR Prob. for 1 (success)  Ensemble Average  PM Result
2 1 0.99 0.85 0.92 Yes
7 0 0.01 0.31 0.16 Yes
8 0 0.00 0.23 0.12 Yes
13 1 0.18 0.23 0.20 No
22 0 0.00 0.15 0.08 Yes
24 0 0.09 0.77 0.43 Yes
25 0 0.15 0.85 0.50 No
28 1 1.00 0.85 0.92 Yes
33 0 0.01 0.15 0.08 Yes
40 0 0.00 0.08 0.04 Yes
46 0 0.01 0.38 0.20 Yes
59 1 0.36 0.62 0.49 No
61 0 0.32 0.15 0.24 Yes
70 1 1.00 0.62 0.81 Yes
79 0 0.12 0.46 0.29 Yes
80 0 0.85 0.77 0.81 No
83 1 0.55 0.46 0.50 Yes
86 0 0.00 0.00 0.00 Yes
93 0 0.01 0.31 0.16 Yes
103 1 0.00 0.00 0.00 No
109 0 0.00 0.15 0.08 Yes
117 0 0.01 0.31 0.16 Yes
118 0 0.72 0.69 0.71 No
120 0 0.01 0.08 0.04 Yes

Note:
Yes – case classified correctly by the proposed model
No – case misclassified by the proposed model
PM Result – Proposed Model Result
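The ensemble rule applied in Table 4.20 can be sketched directly; the function name and structure below are ours (an illustration, not XLMiner's API), checked against two rows of the table:

```python
CUTOFF = 0.5

def ensemble_predict(ann_prob, cbr_prob, cutoff=CUTOFF):
    """Average the ANN and KNN (CBR) class-1 probabilities, apply the cutoff."""
    avg = (ann_prob + cbr_prob) / 2.0
    label = 1 if avg >= cutoff else 0  # 1 = "positive for diabetes"
    return avg, label

# Row 2 of Table 4.20: 0.99 and 0.85 average to 0.92 -> positive
avg1, label1 = ensemble_predict(0.99, 0.85)
# Row 25: 0.15 and 0.85 average to exactly 0.50 -> positive, since >= cutoff
avg2, label2 = ensemble_predict(0.15, 0.85)
```

Row 25 illustrates why the cutoff is "greater than or equal to": an average of exactly 0.50 is classified as positive, which is why that case (actual class 0) is counted as misclassified in the table.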

Result Calculation for Test. No 2
Table 4.21 Output Calculation for Test. No.2

Row Id.  Actual Class  ANN Prob. for 1 (success)  CBR Prob. for 1 (success)  Ensemble Average  PM Result
3 1 0.97 0.93 0.95 Yes
7 0 0.08 0.36 0.22 Yes
9 0 0.31 0.57 0.44 Yes
11 1 0.96 0.71 0.84 Yes
33 0 0.36 0.07 0.21 Yes
34 0 0.33 0.07 0.20 Yes
38 1 0.81 0.50 0.66 Yes
45 1 0.88 0.86 0.87 Yes
50 1 1.00 0.71 0.85 Yes
53 0 0.54 0.57 0.56 No
54 1 1.00 1.00 1.00 Yes
62 0 0.27 0.07 0.17 Yes
63 0 0.00 0.00 0.00 Yes
72 1 0.99 0.71 0.85 Yes
74 0 0.52 0.29 0.40 Yes
83 1 0.54 0.57 0.56 Yes
91 0 0.01 0.07 0.04 Yes
100 1 0.79 0.86 0.83 Yes
101 1 0.99 0.64 0.82 Yes
102 1 0.99 0.64 0.82 Yes
105 1 0.65 0.71 0.68 Yes
112 0 0.42 0.43 0.43 Yes
115 0 0.21 0.71 0.46 Yes
122 0 0.25 0.43 0.34 Yes

Result Calculation for Test. No 3
Table 4.22 Output Calculation for Test. No.3

Row Id.  Actual Class  ANN Prob. for 1 (success)  CBR Prob. for 1 (success)  Ensemble Average  PM Result
4 0 0.99 0.78 0.88 No
17 0 0.66 0.56 0.61 No
19 1 0.13 0.56 0.34 No
21 0 0.01 0.00 0.00 Yes
33 0 0.09 0.00 0.04 Yes
37 1 1.00 0.78 0.89 Yes
38 1 0.02 0.44 0.23 No
43 1 0.97 0.67 0.82 Yes
59 1 0.69 0.56 0.62 Yes
85 0 0.13 0.33 0.23 Yes
91 0 0.00 0.00 0.00 Yes
93 0 0.04 0.22 0.13 Yes
99 1 0.84 0.78 0.81 Yes
100 1 0.66 1.00 0.83 Yes
101 1 1.00 0.67 0.83 Yes
112 0 0.01 0.22 0.12 Yes
115 0 0.17 0.56 0.36 Yes
118 0 0.07 0.67 0.37 Yes
119 0 0.01 0.22 0.12 Yes
132 1 0.96 0.56 0.76 Yes
134 1 0.99 0.44 0.72 Yes
139 0 0.00 0.11 0.06 Yes
140 1 0.85 0.11 0.48 No
141 0 0.02 0.11 0.07 Yes

Result Calculation for Test. No 4

Table 4.23 Output Calculation for Test. No.4

Row Id.  Actual Class  ANN Prob. for 1 (success)  CBR Prob. for 1 (success)  Ensemble Average  PM Result
2 1 0.99 0.80 0.90 Yes
11 1 1.00 0.80 0.90 Yes
15 1 1.00 0.40 0.70 Yes
23 1 1.00 0.60 0.80 Yes
25 0 1.00 0.70 0.85 No
45 1 0.97 0.90 0.94 Yes
46 0 0.21 0.20 0.20 Yes
67 0 0.00 0.10 0.05 Yes
70 1 0.98 0.60 0.79 Yes
81 1 1.00 0.90 0.95 Yes
82 1 1.00 0.80 0.90 Yes
84 0 0.07 0.00 0.03 Yes
86 0 0.00 0.00 0.00 Yes
87 0 0.00 0.00 0.00 Yes
88 0 0.01 0.00 0.00 Yes
91 0 0.00 0.00 0.00 Yes
102 1 1.00 0.70 0.85 Yes
104 1 1.00 0.80 0.90 Yes
108 1 0.05 0.00 0.03 No
110 1 1.00 0.70 0.85 Yes
113 1 1.00 1.00 1.00 Yes
117 0 0.08 0.20 0.14 Yes
127 1 1.00 1.00 1.00 Yes
134 1 0.77 0.60 0.69 Yes

Result Calculation for Test. No 5

Table 4.24 Output Calculation for Test. No.5

Row Id.  Actual Class  ANN Prob. for 1 (success)  CBR Prob. for 1 (success)  Ensemble Average  PM Result
8 0 0.00 0.08 0.04 Yes
18 1 0.53 0.67 0.60 Yes
20 0 0.00 0.00 0.00 Yes
30 0 0.68 0.42 0.55 No
40 0 0.01 0.00 0.00 Yes
46 0 0.04 0.33 0.19 Yes
50 1 1.00 0.67 0.83 Yes
51 1 0.88 0.92 0.90 Yes
58 0 0.43 0.42 0.42 Yes
62 0 0.94 0.08 0.51 No
65 0 0.06 0.42 0.24 Yes
73 1 0.85 0.75 0.80 Yes
77 0 0.86 0.92 0.89 No
82 1 0.90 0.83 0.87 Yes
86 0 0.00 0.00 0.00 Yes
100 1 0.19 0.67 0.43 No
101 1 0.94 0.75 0.84 Yes
103 1 0.00 0.00 0.00 No
106 0 0.09 0.00 0.04 Yes
107 1 0.94 0.67 0.80 Yes
109 0 0.02 0.08 0.05 Yes
112 0 0.62 0.50 0.56 No
125 1 0.87 0.42 0.64 Yes
129 1 0.41 0.33 0.37 No

4.7 Comparing Proposed Model Results with Earlier Results

Table 4.25 Comparing Proposed Model Results with Table-4.1 Result

Sr. No.  Algorithm  Correct Classification (%)  Misclassification (Error Rate) (%)
1 Discrim 77.5 22.5
2 Quadisc 73.8 26.2
3 Logdisc 77.7 22.3
4 SMART 76.8 23.2
5 ALLOC80 69.9 30.1
6 K-NN 67.6 32.4
7 CASTLE 74.2 25.8
8 CART 74.5 25.5
9 IndCART 72.9 27.1
10 NewID 71.1 28.9
11 AC2 72.4 27.6
12 Baytree 72.9 27.1
13 NaiveBay 73.8 26.2
14 CN2 71.1 28.9
15 C4.5 73 27
16 Itrule 75.5 24.5
17 Cal5 75 25
18 Kohonen 72.7 27.3
19 DIPOL92 77.6 22.4
20 Backprop 75.2 24.8
21 RBF 75.7 24.3
22 LVQ 72.8 27.2
PM(Ensemble
23 Method) 84.16 15.84

Comparison Chart

Figure 4.11 Comparing PM Result with Table -4.1 Results

Table 4.25 and figure 4.11 above compare the classification rates of the individual machine learning methods with the proposed ensemble model. The proposed model gives a correct classification performance of 84.16%, whereas the best of the other machine learning algorithms, Logdisc, reaches only 77.7%.

Comparing Proposed Model Results with Table-4.2 Result

Table-4.2 lists the classification performance of 5 different artificial neural network architectures on the Pima Indian Diabetes dataset. The Complementary Neural Network (CMTNN) gives the highest of these, 76.49%, while our proposed system gives 84.16% on the same dataset. The comparison is presented in table-4.26 and figure 4.12.

Table 4.26 Comparing proposed Model Result with Table-4.2 Result

Sr. No.  Model Type  Classification Performance (%)
1 BPNN 76.17
2 GRNN 75.26
3 RBFNN 76.56
4 PNN 75.26
5 CMTNN 76.49
6 Proposed Model 84.16

Comparison Chart

Figure 4.12 Chart shows comparison between PM Result and Table-4.2 Result

Comparing Proposed Model Results with Table-4.3 Result

Table-4.3 lists three machine learning methods, SVM, simple logistics and multilayer perceptron, and their classification performance on the Pima Indian Diabetes dataset. Among the three, simple logistics gives the maximum classification performance, 77.86%, whereas our proposed ensemble model gives 84.16%. The comparison is presented in Table-4.27 and figure 4.13.

Table 4.27 Comparing Proposed Model Result with Table-4.3 Result

Algorithm  Classification Performance (%)
SVM 77.21
Simple Logistics 77.86
Multilayer Perceptron 76.69
PM (Ensemble Method) 84.16

Comparisons Chart

Figure 4.13 Chart shows the Comparison between PM result and Table-4.3
result

Comparing Proposed Model Results with Bylander's Classification Table-4.4 Result

Table 4.28 Comparing Proposed Model Result with Table-4.4 Result

Method Accuracy
Belief Network(Laplace) 72.50%
Belief Network 72.30%
Decision Tree 72%
Naïve Bayes 71.50%
Proposed Model 84.16%

Bylander applied four machine learning methods to the Pima Indian Diabetes dataset, obtaining a maximum classification performance of 72.50% with the belief network (Laplace). Our proposed method gives 84.16% on the diabetes dataset, nearly 12% more than Bylander's best result. The comparison is presented in table 4.28 and figure 4.14.

Comparisons Chart

Figure 4.14 Comparing PM Classification Performances with Bylander result

Comparing Proposed Model result with Misra and Dehuri Table-4.5 result

Table 4.29 Comparing Proposed Model Result with Table-4.5 Result

Classification System Name  Classification Accuracy (%)
NN 65.1
KNN 69.7
FSS 73.6
BSS 67.7
MFS1 68.5
MFS2 72.5
CART 74.5
C4.5 74.7
FID3.1 75.9
MLP 75.2
FLANN 78.13
Proposed Model 84.16

Misra and Dehuri constructed a functional link artificial neural network (FLANN), which gave 78.13% classification accuracy on the Pima Indian Diabetes dataset. Our proposed model gives 84.16% classification accuracy, nearly 6% more than the FLANN performance. The comparison is presented in table 4.29 and figure 4.15.
Comparisons Chart

Figure 4.15 Performance Comparisons between FLANN and Proposed Model

4. 8 Chapter Summary

Diabetes is the fourth biggest cause of death worldwide, particularly in the industrial and developing countries. A lot of research has been done to classify the diabetes dataset and to improve classification performance using individual machine learning methods. In this chapter we performed data pre-processing to handle missing data: the Pima Indian Diabetes dataset has a lot of missing values, and we replaced each of them with the corresponding field's mean value, computed separately for each class.
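The class-wise mean imputation described above can be sketched as follows; the field names and values here are illustrative (our assumption), not taken from the actual dataset:

```python
def impute_by_class_mean(records, class_key="class"):
    """Replace each None with the mean of that field within the same class."""
    by_class = {}
    for r in records:
        by_class.setdefault(r[class_key], []).append(r)
    for group in by_class.values():
        fields = {f for r in group for f in r if f != class_key}
        for f in fields:
            present = [r[f] for r in group if r[f] is not None]
            if not present:
                continue  # all values missing in this class; leave the gaps
            mean = sum(present) / len(present)
            for r in group:
                if r[f] is None:
                    r[f] = mean
    return records

data = [
    {"class": 1, "glucose": 148, "bmi": None},
    {"class": 1, "glucose": None, "bmi": 30.0},
    {"class": 1, "glucose": 100, "bmi": 34.0},
    {"class": 0, "glucose": 85, "bmi": 26.6},
]
impute_by_class_mean(data)
# The class-1 record missing glucose gets (148 + 100) / 2 = 124.0;
# the class-0 record is untouched because only class-1 values feed its means.
```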

Ours is the first hybrid method for classifying the Pima Indian Diabetes dataset. Following our proposed algorithm, we first developed an artificial neural network. It gave a classification performance of 83.70%, nearly 7% more than the previous ANN systems.

Next we developed a case-based reasoning system based on the k-nearest neighbor method. It gave a classification performance of 84.29%, nearly 14% more than the previous KNN classifier systems. The hybrid system combined the outputs of both the ANN and KNN systems by averaging their output-class probabilities. Over the 10 test cases conducted on the Pima Indian Diabetes dataset, our proposed ensemble method gave a classification accuracy of 84.16%, nearly 7% more than the earlier methods. Through hybrid machine learning we improved both the classification performance and the reliability of the system, which were the two prime objectives of our research.

