
V. Krishnaiah et al. / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 4 (1), 2013, 39-45

Diagnosis of Lung Cancer Prediction System
Using Data Mining Classification Techniques

V. Krishnaiah #1, Dr. G. Narsimha *2, Dr. N. Subhash Chandra #3
#1 Associate Professor, Dept. of CSE,
CVR College of Engineering, Hyderabad, India
*2 Assistant Professor, Dept. of CSE,
JNTUH College of Engineering, Kondagattu, Andhra Pradesh, India
#3 Professor of CSE & Principal, Holy Mary Institute of Technology and Science,
Hyderabad, India

Abstract— Cancer is the most important cause of death for both men and women. The early detection of cancer can be helpful in curing the disease completely, so the requirement for techniques to detect the occurrence of cancer nodules at an early stage is increasing. A disease that is commonly misdiagnosed is lung cancer. Earlier diagnosis of lung cancer saves enormous numbers of lives; failing this may lead to other severe problems causing a sudden fatal end. Its cure rate and prediction depend mainly on the early detection and diagnosis of the disease. One of the most common forms of medical malpractice globally is an error in diagnosis. Knowledge discovery and data mining have found numerous applications in business and scientific domains. Valuable knowledge can be discovered from the application of data mining techniques in the healthcare system. In this study, we briefly examine the potential use of classification-based data mining techniques such as Rule based, Decision tree, Naïve Bayes and Artificial Neural Network on massive volumes of healthcare data. The healthcare industry collects huge amounts of healthcare data which, unfortunately, are not "mined" to discover hidden information. For data preprocessing and effective decision making, the One Dependency Augmented Naïve Bayes classifier (ODANB) and the Naive Credal Classifier 2 (NCC2) are used. The latter is an extension of naïve Bayes to imprecise probabilities that aims at delivering robust classifications even when dealing with small or incomplete data sets. Discovery of hidden patterns and relationships often goes unexploited. Diagnosis of lung cancer disease can answer complex "what if" queries which traditional decision support systems cannot. Using generic lung cancer symptoms such as age, sex, wheezing, shortness of breath, and pain in the shoulder, chest or arm, it can predict the likelihood of patients getting lung cancer disease. The aim of the paper is to propose a model for early detection and correct diagnosis of the disease which will help the doctor in saving the life of the patient.

Keywords—Lung cancer, Naive Bayes, ODANB, NCC2, Data Mining, Classification.

I. INTRODUCTION

Lung cancer is one of the leading causes of cancer deaths in both women and men. Manifestation of lung cancer in the body of the patient reveals itself through early symptoms in most of the cases [1]. Treatment and prognosis depend on the histological type of cancer, the stage (degree of spread), and the patient's performance status. Possible treatments include surgery, chemotherapy, and radiotherapy. Survival depends on stage, overall health, and other factors, but overall only 14% of people diagnosed with lung cancer survive five years after the diagnosis. Symptoms that may suggest lung cancer include:
• dyspnea (shortness of breath with activity),
• hemoptysis (coughing up blood),
• chronic coughing or change in regular coughing pattern,
• wheezing,
• chest pain or pain in the abdomen,
• cachexia (weight loss, fatigue, and loss of appetite),
• dysphonia (hoarse voice),
• clubbing of the fingernails (uncommon),
• dysphagia (difficulty swallowing),
• pain in the shoulder, chest or arm,
• bronchitis or pneumonia,
• decline in health and unexplained weight loss.

Mortality and morbidity due to tobacco use is very high. Usually lung cancer develops within the wall or epithelium of the bronchial tree, but it can start anywhere in the lungs and affect any part of the respiratory system. Lung cancer mostly affects people between the ages of 55 and 65 and often takes many years to develop [2].

There are two major types of lung cancer: non-small cell lung cancer (NSCLC) and small cell lung cancer (SCLC), also called oat cell cancer. Each type of lung cancer grows and spreads in different ways, and is treated differently. If the cancer has features of both types, it is called mixed small cell/large cell cancer.

Non-small cell lung cancer is more common than SCLC and it generally grows and spreads more slowly. SCLC is almost always related to smoking; it grows more quickly and forms large tumors that can spread widely through the body. These tumors often start in the bronchi near the center of the chest. The lung cancer death rate is related to the total amount of cigarettes smoked [3].

Smoking cessation, diet modification, and chemoprevention are primary prevention activities. Screening is a form of secondary prevention. Our method of finding possible lung cancer patients is based on the systematic study of symptoms and risk factors. Non-clinical symptoms and risk factors are some of the generic indicators of cancer diseases. Environmental factors have an important role in human cancer. Many carcinogens are present in the air we breathe, the food we eat, and the water we drink. The constant and sometimes unavoidable exposure to environmental carcinogens complicates the investigation of cancer causes in human beings. The complexity of human cancer causes is especially challenging for cancers with long latency, which are associated with exposure to ubiquitous environmental carcinogens.


Pre-diagnosis techniques
Pre-diagnosis helps to identify or narrow down the possibility of screening for lung cancer disease. Symptoms and risk factors (smoking, alcohol consumption, obesity, and insulin resistance) had a statistically significant effect in the pre-diagnosis stage [4]. The lung cancer diagnostic and prognostic problems are mainly in the scope of the widely discussed classification problems. These problems have attracted many researchers in the computational intelligence, data mining, and statistics fields.

Cancer research is generally clinical and/or biological in nature; data-driven statistical research has become a common complement. Predicting the outcome of a disease is one of the most interesting and challenging tasks for which to develop data mining applications. With the use of computers powered with automated tools, large volumes of medical data are being collected and made available to medical research groups. As a result, Knowledge Discovery in Databases (KDD), which includes data mining techniques, has become a popular research tool for medical researchers to identify and exploit patterns and relationships among large numbers of variables, and enables them to predict the outcome of a disease using the historical cases stored within datasets. The objective of this study is to summarize various review and technical articles on diagnosis of lung cancer. It gives an overview of the current research being carried out on various lung cancer datasets using data mining techniques to enhance lung cancer diagnosis.

II. LITERATURE FOR LUNG CANCER
The approach that is being followed here for the prediction technique is based on a systematic study of the statistical factors, symptoms and risk factors associated with lung cancer. Non-clinical symptoms and risk factors are some of the generic indicators of cancer diseases. Initially the parameters for the pre-diagnosis are collected by interacting with pathological, clinical and medical oncologists (domain experts).

A. Statistical Incidence Factors:
i. Age-adjusted rate (AAR)
ii. Primary histology
iii. Area-related incidence chance
iv. Crude incidence rate

B. Lung cancer symptoms:
The following are the generic lung cancer symptoms [14].
i. A cough that does not go away and gets worse over time
ii. Coughing up blood (haemoptysis) or bloody mucus
iii. Chest, shoulder, or back pain that doesn't go away and often is made worse by deep breathing
iv. Weight loss and loss of appetite
v. Increase in volume of sputum
vi. Wheezing
vii. Shortness of breath
viii. Repeated respiratory infections, such as bronchitis or pneumonia
ix. Repeated problems with pneumonia or bronchitis
x. Fatigue and weakness
xi. New onset of wheezing
xii. Swelling of the neck and face
xiii. Clubbing of the fingers and toes; the nails appear to bulge out more than normal
xiv. Paraneoplastic syndromes, which are caused by biologically active substances secreted by the tumor
xv. Fever
xvi. Hoarseness of voice
xvii. Puffiness of the face
xviii. Loss of appetite
xix. Nausea and vomiting

C. Lung cancer risk factors:
a. Smoking:
   i. Beedi
   ii. Cigarette
   iii. Hukka
b. Second-hand smoke
c. High dose of ionizing radiation
d. Radon exposure
e. Occupational exposure to mustard gas, chloromethyl ether, inorganic arsenic, chromium, nickel, vinyl chloride, radon, asbestos
f. Air pollution
g. Insufficient consumption of fruits & vegetables
h. Suffering from other types of malignancy

III. KNOWLEDGE DISCOVERY AND DATA MINING
This section provides an introduction to knowledge discovery and data mining. We list the various analysis tasks that can be goals of a discovery process, and the methods and research areas that are promising in solving these analysis tasks.

A. Knowledge Discovery Process
The terms Knowledge Discovery in Databases (KDD) and Data Mining are often used interchangeably. KDD is the process of turning low-level data into high-level knowledge. Hence, KDD refers to the nontrivial extraction of implicit, previously unknown and potentially useful information from data in databases. While data mining and KDD are often treated as equivalent terms, in reality data mining is an important step in the KDD process. Figure 1 shows data mining as a step in an iterative knowledge discovery process.

Figure 1. Steps in KDD


The Knowledge Discovery in Databases process comprises a few steps leading from raw data collections to some form of new knowledge [5]. The iterative process consists of the following steps:
(1) Data cleaning: also known as data cleansing, it is a phase in which noisy data and irrelevant data are removed from the collection.
(2) Data integration: at this stage, multiple data sources, often heterogeneous, may be combined in a common source.
(3) Data selection: at this step, the data relevant to the analysis is decided on and retrieved from the data collection.
(4) Data transformation: also known as data consolidation, it is a phase in which the selected data is transformed into forms appropriate for the mining procedure.
(5) Data mining: it is the crucial step in which clever techniques are applied to extract potentially useful patterns.
(6) Pattern evaluation: in this step, strictly interesting patterns representing knowledge are identified based on given measures.
(7) Knowledge representation: the final phase in which the discovered knowledge is visually represented to the user. In this step visualization techniques are used to help users understand and interpret the data mining results.

B. Data Mining Process
In the KDD process, the data mining methods are used for extracting patterns from data. The patterns that can be discovered depend upon the data mining tasks applied. Generally, there are two types of data mining tasks: descriptive data mining tasks that describe the general properties of the existing data, and predictive data mining tasks that attempt to make predictions based on available data. Data mining can be done on data which are in quantitative, textual, or multimedia forms.
Data mining applications can use different kinds of parameters to examine the data. They include association (patterns where one event is connected to another event), sequence or path analysis (patterns where one event leads to another event), classification (identification of new patterns with predefined targets) and clustering (grouping of identical or similar objects). Data mining involves some of the following key steps [6]:
(1) Problem definition: The first step is to identify goals. Based on the defined goal, the correct series of tools can be applied to the data to build the corresponding behavioral model.
(2) Data exploration: If the quality of data is not suitable for an accurate model, then recommendations on future data collection and storage strategies can be made at this stage. For analysis, all data needs to be consolidated so that it can be treated consistently.
(3) Data preparation: The purpose of this step is to clean and transform the data so that missing and invalid values are treated and all known valid values are made consistent for more robust analysis.
(4) Modeling: Based on the data and the desired outcomes, a data mining algorithm or combination of algorithms is selected for analysis. These algorithms include classical techniques such as statistics, neighborhoods and clustering, but also next-generation techniques such as decision trees, networks and rule-based algorithms. The specific algorithm is selected based on the particular objective to be achieved and the quality of the data to be analyzed.

Figure 2. Data Mining Process Representation

(5) Evaluation and Deployment: Based on the results of the data mining algorithms, an analysis is conducted to determine key conclusions from the analysis and create a series of recommendations for consideration.

IV. DATA MINING CLASSIFICATION METHODS
Data mining consists of various methods. Different methods serve different purposes, each method offering its own advantages and disadvantages. In data mining, classification is one of the most important tasks. It maps the data into predefined targets. It is supervised learning, as the targets are predefined.
The aim of classification is to build a classifier based on some cases with some attributes to describe the objects, or one attribute to describe the group of the objects. Then, the classifier is used to predict the group attribute of new cases from the domain based on the values of the other attributes. The most used classification algorithms exploited in microarray analysis belong to four categories: IF-THEN Rule, Decision tree, Bayesian classifiers and Neural networks.

IF-THEN Rule:
Rule induction is the process of extracting useful "if-then" rules from data based on statistical significance. A rule-based system constructs a set of if-then rules. The knowledge represented has the form

IF conditions THEN conclusion

This kind of rule consists of two parts. The rule antecedent (the IF part) contains one or more conditions about values of predictor attributes, whereas the rule consequent (the THEN part) contains a prediction about the value of a goal attribute. An accurate prediction of the value of a goal attribute will improve the decision-making process. IF-THEN prediction rules are very popular in data mining; they represent discovered knowledge at a high level of abstraction. The Rule Induction Method has the potential to use retrieved cases for predictions [7].
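The rule formalism above can be made concrete with a short sketch. The following Python fragment is illustrative only: the attribute names, thresholds and risk labels are assumptions, not rules induced from the lung cancer data used in this paper. It shows how a small set of if-then rules over symptom attributes could be evaluated in order, with a default class when no rule fires.

# Illustrative if-then rule classifier; rules and thresholds are hypothetical,
# not induced from the paper's lung cancer data.
def classify(patient, default="low_risk"):
    rules = [
        (lambda p: p["smoker"] and p["age"] > 55 and p["chronic_cough"], "high_risk"),
        (lambda p: p["hemoptysis"] or p["unexplained_weight_loss"], "high_risk"),
        (lambda p: p["wheezing"] and p["chest_pain"], "medium_risk"),
    ]
    for condition, conclusion in rules:  # the first matching rule wins
        if condition(patient):
            return conclusion
    return default                       # default class when no rule is satisfied

print(classify({"smoker": True, "age": 62, "chronic_cough": True,
                "hemoptysis": False, "unexplained_weight_loss": False,
                "wheezing": False, "chest_pain": False}))  # -> high_risk

The "first matching rule, otherwise a default class" behaviour mirrors the rule-list formalism described for C4.5 in the next section.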


Decision Tree:
The decision tree derives from the simple divide-and-conquer algorithm. In these tree structures, leaves represent classes and branches represent conjunctions of features that lead to those classes. At each node of the tree, the attribute that most effectively splits samples into different classes is chosen. To predict the class label of an input, a path from the root to a leaf is found depending on the value of the predicate at each node that is visited. The most common decision tree algorithms are ID3 [8] and C4.5 [9]. An evolution of the decision tree exploited for microarray data analysis is the random forest [10], which uses an ensemble of classification trees. [11] showed the good performance of random forest for noisy and multi-class microarray data.

Bayesian classifiers and Naive Bayesian:
From a Bayesian viewpoint, a classification problem can be written as the problem of finding the class with maximum probability given a set of observed attribute values. Such probability is seen as the posterior probability of the class given the data, and is usually computed using the Bayes theorem. Estimating this probability distribution from a training dataset is a difficult problem, because it may require a very large dataset to significantly explore all the possible combinations. Conversely, Naive Bayes is a simple probabilistic classifier based on the Bayes theorem with the (naive) independence assumption. Based on that rule, using the joint probabilities of sample observations and classes, the algorithm attempts to estimate the conditional probabilities of classes given an observation. Despite its simplicity, the Naive Bayes classifier is known to be a robust method, which shows on average good performance in terms of classification accuracy, even when the independence assumption does not hold [12].

Artificial Neural Networks (ANN):
An artificial neural network is a mathematical model based on biological neural networks. It consists of an interconnected group of artificial neurons and processes information using a connectionist approach to computation. Neurons are organized into layers. The input layer consists simply of the original data, while the output layer nodes represent the classes. Then, there may be several hidden layers. A key feature of neural networks is an iterative learning process in which data samples are presented to the network one at a time, and the weights are adjusted in order to predict the correct class label. Advantages of neural networks include their high tolerance to noisy data, as well as their ability to classify patterns on which they have not been trained. In [13] a review of advantages and disadvantages of neural networks in the context of microarray analysis is presented.

V. DATA MINING CLASSIFICATION METHODS
There are various data mining techniques available, with their suitability dependent on the domain application. Statistics provide a strong fundamental background for quantification and evaluation of results. However, algorithms based on statistics need to be modified and scaled before they are applied to data mining. We now describe a few classification data mining techniques with illustrations of their applications to healthcare.

A. Rule set classifiers
Complex decision trees can be difficult to understand, for instance because information about one class is usually distributed throughout the tree. C4.5 introduced an alternative formalism consisting of a list of rules of the form "if A and B and C and ... then class X", where rules for each class are grouped together. A case is classified by finding the first rule whose conditions are satisfied by the case; if no rule is satisfied, the case is assigned to a default class.

IF conditions THEN conclusion

This kind of rule consists of two parts. The rule antecedent (the IF part) contains one or more conditions about values of predictor attributes, whereas the rule consequent (the THEN part) contains a prediction about the value of a goal attribute. An accurate prediction of the value of a goal attribute will improve the decision-making process. IF-THEN prediction rules are very popular in data mining; they represent discovered knowledge at a high level of abstraction.

In the health care system it can be applied as follows:
(Symptoms) (Previous history) → (Cause of disease).

Example 1: If-then rule induced in the diagnosis of the level of alcohol in blood.
IF Sex = MALE AND Unit = 8.9 AND Meal = FULL THEN Diagnosis = Blood_alcohol_content_HIGH.

B. Decision Tree algorithm
It is a knowledge representation structure consisting of nodes and branches organized in the form of a tree such that every internal non-leaf node is labeled with values of the attributes. The branches coming out from an internal node are labeled with values of the attributes in that node. Every leaf node is labeled with a class (a value of the goal attribute). Tree-based models, which include classification and regression trees, are the common implementation of induction modeling [15]. Decision tree models are well suited for data mining. They are inexpensive to construct, easy to interpret, easy to integrate with database systems, and they have comparable or better accuracy in many applications. There are many decision tree algorithms, such as HUNT'S algorithm (one of the earliest), CART, ID3, C4.5 (a later version of the ID3 algorithm), SLIQ and SPRINT [15].
The decision tree shown in Fig. 3 is built from the very small training set (Table 1). In this table each row corresponds to a patient record. We will refer to a row as a data instance. The data set contains three predictor attributes, namely Age, Gender and Intensity of symptoms, and one goal attribute, namely Disease, whose value (to be predicted from the symptoms) indicates whether the corresponding patient has a certain disease or not.
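As a minimal sketch of this setup (assuming scikit-learn and the integer encodings chosen here purely for illustration; the paper itself does not specify an implementation), the nine Table 1 records can be used to fit a small tree and classify an unseen instance such as the one discussed in the walk-through below.

# Sketch: fit a decision tree on the Table 1 records and classify one new case.
# The library (scikit-learn) and the encodings are assumptions made for illustration.
from sklearn.tree import DecisionTreeClassifier

# (Age, Gender, Intensity of symptoms) -> Disease (goal), taken from Table 1
records = [
    (25, "Male", "medium", "yes"), (32, "Male", "high", "yes"),
    (24, "Female", "medium", "yes"), (44, "Female", "high", "yes"),
    (30, "Female", "low", "no"), (21, "Male", "low", "no"),
    (18, "Female", "low", "no"), (34, "Male", "medium", "no"),
    (55, "Male", "medium", "no"),
]
gender = {"Male": 0, "Female": 1}
intensity = {"low": 0, "medium": 1, "high": 2}  # ordinal encoding
X = [[age, gender[g], intensity[s]] for age, g, s, _ in records]
y = [r[3] for r in records]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
# The instance (Age=23, Gender=female, Intensity=medium) from the walk-through:
print(tree.predict([[23, gender["Female"], intensity["medium"]]]))  # typically ['yes']

The learned tree is not guaranteed to match Fig. 3 split for split; the sketch only illustrates how such a tree is fitted and queried.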


Figure 3. A decision tree built from the data in Table 1

Table 1: Data set used to build the decision tree of Fig. 3

Age | Gender | Intensity of symptoms | Disease (goal)
25  | Male   | medium                | yes
32  | Male   | high                  | yes
24  | Female | medium                | yes
44  | Female | high                  | yes
30  | Female | low                   | no
21  | Male   | low                   | no
18  | Female | low                   | no
34  | Male   | medium                | no
55  | Male   | medium                | no

A decision tree can be used to classify an unknown-class data instance with the help of the data set given in Table 1. The idea is to push the instance down the tree, following the branches whose attribute values match the instance's attribute values, until the instance reaches a leaf node, whose class label is then assigned to the instance [15]. For example, the data instance to be classified is described by the tuple (Age=23, Gender=female, Intensity of symptoms=medium, Goal=?), where "?" denotes the unknown value of the goal attribute. In this example, the Gender attribute is irrelevant to this particular classification task. The tree first tests the intensity of symptoms value in the instance. As the answer is medium, the instance is pushed down through the corresponding branch and reaches the Age node. Then the tree tests the Age value in the instance. As the answer is 23, the instance is again pushed down through the corresponding branch. Now the instance reaches the leaf node, where it is classified as yes.

C. Neural Network Architecture
The neural network approach has been widely adopted in recent years. The neural network has several advantages, including its nonparametric nature, arbitrary decision boundary capability, easy adaptation to different types of data and input structures, fuzzy output values, and generalization for use with multiple images. Neural networks are of particular interest because they offer a means of efficiently modeling large and complex problems in which there may be hundreds of predictor variables that have many interactions. (Actual biological neural networks are incomparably more complex.) Neural nets may be used in classification problems (where the output is a categorical variable) or for regressions (where the output variable is continuous).
The architecture of the neural network shown in Figure 4 consists of three layers: an input layer, a hidden layer and an output layer. The nodes in the input layer are linked with a number of nodes in the hidden layer. Each input node is joined to each node in the hidden layer. The nodes in the hidden layer may connect to nodes in another hidden layer, or to an output layer. The output layer consists of one or more response variables [16].

Figure 4. A neural network with one hidden layer.

A main concern of the training phase is the interior weights of the neural network, which are adjusted according to the transactions used in the learning process. For each training transaction, the neural network additionally receives the expected output [17]. This drives the modification of the interior weights, while the trained neural network is later used to classify new cases.

D. Bayesian Network Structure Discoveries
A conditional probability is the likelihood of some conclusion, C, given some evidence/observation, E, where a dependence relationship exists between C and E. This probability is denoted as P(C | E), where

P(C | E) = P(C ∩ E) / P(E)                              (1)

Bayes' theorem is the method of finding the converse probability of the conditional:

P(C | E) = P(E | C) P(C) / P(E)                         (2)

This conditional relationship allows an investigator to gain probability information about either C or E with the known outcome of the other.
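As a small numeric illustration of equations (1) and (2), with made-up numbers chosen only for the arithmetic (they are not taken from the paper's data): suppose the prior P(C) = 0.10, P(E | C) = 0.80 and P(E | not C) = 0.20. Expanding P(E) by total probability and applying equation (2) gives the posterior P(C | E) of roughly 0.31.

# Worked example of Bayes' theorem; all numbers are assumed for illustration.
p_c = 0.10           # prior probability of the conclusion C
p_e_given_c = 0.80   # likelihood of the evidence E when C holds
p_e_given_not_c = 0.20

# total probability of the evidence: P(E) = P(E|C)P(C) + P(E|not C)P(not C)
p_e = p_e_given_c * p_c + p_e_given_not_c * (1 - p_c)

# equation (2): P(C|E) = P(E|C) P(C) / P(E)
p_c_given_e = p_e_given_c * p_c / p_e
print(round(p_c_given_e, 3))  # 0.308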


Now consider a complex problem with n binary variables, where the relationships among them are not clear for predicting a single class output variable (e.g., node 1 in Figure 5). If all variables were related using a single joint distribution, the equivalent of all nodes being first-level parents, the number of possible combinations of variables would be equal to (2^n - 1). This results in the need for a very large amount of data [18, 19]. If dependence relationships between these variables could be determined, resulting in independent variables being removed, fewer nodes would be adjacent to the node of interest. This parent node removal leads to a significant reduction in the number of variable combinations, thereby reducing the amount of needed data. Furthermore, variables that are directly conditional, not on the node of interest but on the parents of the node of interest (as nodes 4 and 5 are with respect to node 1 in Figure 5), can be related, which allows for a more robust system when dealing with missing data points. This property of requiring less information, based on pre-existing understanding of the system's variable dependencies, is a major benefit of Bayesian Networks [20]. Some further theoretical underpinnings of the Bayesian approach for classification have been addressed in [21] and [22]. A Bayesian Network (BN) is a relatively new tool that identifies probabilistic correlations in order to make predictions or assessments of class membership.

Figure 5. Basic Bayesian Network Structure and Terminology

While the independence assumption may seem a simplifying one and would therefore be expected to lead to less accurate classification, this has not been true in many applications. For instance, several datasets are classified in [23] using the naïve Bayesian classifier, decision tree induction, instance-based learning, and rule induction. These methods are compared, showing the naïve classifier as the overall best method. To use a Bayesian Network as a classifier, one must first assume that data correlation is equivalent to statistical dependence.

1) Bayesian Network Type
The kind of Bayesian Network (BN) retrieved by the algorithm is also called an Augmented Naïve BN, characterized mainly by the points below.
• All attributes have a certain influence on the class.
• The conditional dependency assumption is relaxed (certain attributes have been given an additional parent).

2) Pre-Processing Techniques
The following data pre-processing techniques are applied to the data before running the ODANB [24] algorithm.
Replace Missing Values: This filter will scan all (or selected) nominal and numerical attributes and replace missing values with the modes and means.
Discretization: This filter is designed to convert numerical attributes into nominal ones; however, the unsupervised version does not take class information into account when grouping instances together. There is always a risk that distinctions between the different instances in relation to the class can be wiped out when using such a filter.

E. Some Implementation Details
JNCC2 loads data from ARFF files; this is a plain text format, originally developed for WEKA (Witten and Frank, 2005). A large number of ARFF data sets, including the data sets from the UCI repository, are available from http://www.cs.waikato.ac.nz/ml/weka/index_datasets.html. As a pre-processing step, JNCC2 [25] discretizes all the numerical features, using the supervised discretization algorithm of Fayyad and Irani (1993). The discretization intervals are computed on the training set, and then applied unchanged on the test set. NCC2 [25] is implemented exploiting a computationally efficient procedure.

Algorithm 1: Pseudo code for validation via testing file.
ValidateTestFile()
/* loads training and test file; reads list of non-MAR features; discretizes features */
parseArffFile();
parseArffTestingFile();
parseNonMar();
discretizeNumFeatures();
/* learns and validates NBC */
nbc = new NaiveBayes(trainingSet);
nbc.classifyInstances(testSet);
/* learns and validates NCC2; the list of non-MAR features in training and testing is required */
ncc2 = new NaiveCredalClassifier2(trainingSet, nonMarTraining, nonMarTesting);
ncc2.classifyInstances(testingSet);
/* writes output files */
writePerfIndicators();
writePredictions();

JNCC2 can perform three kinds of experiments: training and testing, cross-validation, and classification of instances of the test set whose class is unknown. The pseudo code of the experiment with training and testing is described by Algorithm 1.
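For readers who want to reproduce this style of experiment outside JNCC2, the following Python sketch mirrors the training-and-testing validation of Algorithm 1 under stated assumptions: scikit-learn replaces the Java implementation, synthetic data stands in for an ARFF file, and only the naive Bayes part is shown, since the credal NCC2 classifier has no drop-in equivalent in common Python libraries. As in JNCC2, the discretization intervals are learned on the training set and applied unchanged to the test set.

# Sketch of the train/test validation flow of Algorithm 1 (naive Bayes part only).
# scikit-learn and the synthetic data are assumptions; the original JNCC2 tool is Java.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.naive_bayes import CategoricalNB
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))              # stand-in for numeric patient features
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # stand-in for the disease label

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# discretize numeric features; intervals are computed on the training set only
disc = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform")
X_train_d = disc.fit_transform(X_train).astype(int)
X_test_d = disc.transform(X_test).astype(int)   # applied unchanged to the test set

nbc = CategoricalNB(min_categories=5).fit(X_train_d, y_train)  # learn and validate NBC
print(accuracy_score(y_test, nbc.predict(X_test_d)))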


The ODANB has been compared with other existing methods that improve on Naïve Bayes, and with Naïve Bayes itself. The results of the comparison show that the ODANB outperforms the other methods for disease prediction not related to lung cancer. The comparison criterion that has been introduced is the accuracy of prediction (measures defined from the confusion matrix outputs). Table 2 below recaps the benchmarked algorithms' accuracy for each dataset considered; in each row the best performing algorithm is shown in bold:

TABLE 2. COMPARISON OF RESULTS

DATASETS            | ODANB | NB
LUNG CANCER-C       | 80.46 | 84.14
LUNG CANCER-H       | 79.66 | 84.05
LUNG CANCER-STATLOG | 80.00 | 83.70

We focus on the results, which clearly state that TAN (Tree Augmented Naïve Bayes) [25] works efficiently over ODANB and Naïve Bayes for data sets of general and regular things like vehicles and anneal (metallurgy), but for diagnosis of lung cancer disease Naïve Bayes obtains better results.

VI. CONCLUSION
A prototype lung cancer disease prediction system is developed using data mining classification techniques. The system extracts hidden knowledge from a historical lung cancer disease database. The most effective model to predict patients with lung cancer disease appears to be Naïve Bayes, followed by IF-THEN rules, Decision Trees and Neural Networks. Decision Tree results are easier to read and interpret. The drill-through feature to access detailed patients' profiles is only available in Decision Trees. Naïve Bayes fared better than Decision Trees as it could identify all the significant medical predictors. The relationships between attributes produced by a Neural Network are more difficult to understand.
In some cases, even advanced-stage lung cancer patients do not show the symptoms associated with lung cancer.
Prevalence of lung cancer disease is high in India, especially in rural India, and it often does not get noticed at the early stage because of the lack of awareness. Also, it is not possible for the voluntary agencies to carry out screening for all the people. The emphasis of this work is to find the target group of people who need further screening for lung cancer disease, so that the prevalence and mortality rate could be brought down.
The lung cancer prediction system can be further enhanced and expanded. It can also incorporate other data mining techniques, e.g., Time Series, Clustering and Association Rules. Continuous data can also be used instead of just categorical data. Another area is to use Text Mining to mine the vast amount of unstructured data available in healthcare databases. Another challenge would be to integrate data mining and text mining [26].

ACKNOWLEDGMENT
The authors would like to thank CVR College of Engineering, Hyderabad, for providing its amenities.

REFERENCES
[1] Sang Min Park, Min Kyung Lim, Soon Ae Shin & Young Ho Yun, 2006. Impact of prediagnosis smoking, alcohol, obesity and insulin resistance on survival in male cancer patients: National Health Insurance Corporation study. Journal of Clinical Oncology, Vol. 24, No. 31, November 2006.
[2] Yongqian Qiang, Youmin Guo, Xue Li, Qiuping Wang, Hao Chen & Duwu Cui, 2007. The diagnostic rules of peripheral lung cancer: preliminary study based on data mining technique. Journal of Nanjing Medical University, 21(3):190-195.
[3] Murat Karabatak, M. Cevdet Ince, 2008. Expert system for detection of breast cancer based on association rules and neural network. Expert Systems with Applications.
[4] ICMR Report 2006. Cancer Research in ICMR: Achievements in Nineties.
[5] Osmar R. Zaïane, Principles of Knowledge Discovery in Databases. [Online]. Available: webdocs.cs.ualberta.ca/~zaiane/courses/cmput690/notes/Chapter1/ch1.pdf.
[6] The Data Mining Process. [Online]. Available: http://publib.boulder.ibm.com/infocenter/db2luw/v9/index.jsp?topic=/com.ibm.im.easy.doc/c_dm_process.html; Shelly Gupta et al., Indian Journal of Computer Science and Engineering (IJCSE).
[7] Harleen Kaur and Siri Krishan Wasan, Empirical Study on Applications of Data Mining Techniques in Healthcare, Journal of Computer Science 2(2):194-200, 2006. ISSN 1549-3636.
[8] J.R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81-106, 1986.
[9] J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[10] L. Breiman. Random forests. Machine Learning, 45(1):5-32, 2001.
[11] R. Díaz-Uriarte and A. de Andrés. Gene selection and classification of microarray data using random forest. BMC Bioinformatics, 7(1):3, 2006.
[12] R.S. Michalski and K. Kaufman. Learning patterns in noisy data: The AQ approach. Machine Learning and its Applications, Springer-Verlag, pages 22-38, 2001.
[13] R. Linder, T. Richards, and M. Wagner. Microarray data classified by artificial neural networks. Methods in Molecular Biology, 382:345, 2007.
[14] Murat Karabatak, M. Cevdet Ince, 2008. Expert system for detection of breast cancer based on association rules and neural network. Expert Systems with Applications.
[15] Han, J. and M. Kamber, 2001. Data Mining: Concepts and Techniques. San Francisco, Morgan Kaufmann Publishers.
[16] Two Crows Corporation, Introduction to Data Mining and Knowledge Discovery, Third Edition, 2005. ISBN 1-892095-02-5, pages 10-11.
[17] Maria-Luiza Antonie, Osmar R. Zaïane, Alexandru Coman, Application of Data Mining Techniques for Medical Image Classification, page 97.
[18] Heckerman, D., A Tutorial on Learning with Bayesian Networks. 1995, Microsoft Research.
[19] Neapolitan, R., Learning Bayesian Networks. 2004, London: Pearson Prentice Hall.
[20] Neapolitan, R., Learning Bayesian Networks. 2004, London: Pearson Prentice Hall.
[21] Krishnapuram, B., et al., A Bayesian approach to joint feature selection and classifier design. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2004, 6(9): pp. 1105-1111.
[22] Shantakumar B. Patil, Y.S. Kumaraswamy, Intelligent and Effective Heart Attack Prediction System Using Data Mining and Artificial Neural Network, European Journal of Scientific Research, ISSN 1450-216X, Vol. 31, No. 4 (2009), pp. 642-656, EuroJournals Publishing, Inc., 2009.
[23] Sellappan Palaniappan, Rafiah Awang, Intelligent Heart Disease Prediction System Using Data Mining Techniques, 978-1-4244-1968-5/08/$25.00 ©2008 IEEE.
[24] Juan Bernabé Moreno, One Dependence Augmented Naive Bayes, University of Granada, Department of Computer Science and Artificial Intelligence.
[25] Juan Bernabé Moreno, One Dependence Augmented Naive Bayes, University of Granada, Department of Computer Science and Artificial Intelligence.
[26] Weiguo, F., Wallace, L., Rich, S., Zhongju, Z.: "Tapping the Power of Text Mining", Communications of the ACM, 49(9):77-82, 2006.

