Diagnosis of Lung Cancer Prediction System Using Data Mining Classification Techniques
Abstract— Cancer is one of the most important causes of death for both men and women. The early detection of cancer can be helpful in curing the disease completely, so the need for techniques to detect the occurrence of a cancer nodule at an early stage is increasing. A disease that is commonly misdiagnosed is lung cancer. Early diagnosis of lung cancer saves enormous numbers of lives; failing that, it may lead to other severe problems causing a sudden fatal end. Its cure rate and prediction depend mainly on the early detection and diagnosis of the disease. One of the most common forms of medical malpractice globally is an error in diagnosis. Knowledge discovery and data mining have found numerous applications in the business and scientific domains. Valuable knowledge can be discovered from the application of data mining techniques in the healthcare system. In this study, we briefly examine the potential use of classification-based data mining techniques, such as Rule-based, Decision tree, Naïve Bayes and Artificial Neural Network classifiers, on the massive volume of healthcare data. The healthcare industry collects huge amounts of healthcare data which, unfortunately, are not "mined" to discover hidden information. For data preprocessing and effective decision making, the One Dependency Augmented Naïve Bayes classifier (ODANB) and the Naive Credal Classifier 2 (NCC2) are used. The latter is an extension of naïve Bayes to imprecise probabilities that aims at delivering robust classifications even when dealing with small or incomplete data sets. Discovery of hidden patterns and relationships otherwise often goes unexploited. Diagnosis of lung cancer disease can answer complex "what if" queries which traditional decision support systems cannot. Using generic lung cancer symptoms such as age, sex, wheezing, shortness of breath, and pain in the shoulder, chest or arm, the system can predict the likelihood of patients getting lung cancer disease. The aim of the paper is to propose a model for early detection and correct diagnosis of the disease which will help the doctor to save the life of the patient.

Keywords—Lung cancer, Naive Bayes, ODANB, NCC2, Data Mining, Classification.

I. INTRODUCTION

Lung cancer is one of the leading causes of cancer deaths in both women and men. Manifestation of lung cancer in the body of the patient is revealed through early symptoms in most of the cases [1]. Treatment and prognosis depend on the histological type of cancer, the stage (degree of spread), and the patient's performance status. Possible treatments include surgery, chemotherapy, and radiotherapy. Survival depends on stage, overall health, and other factors, but overall only 14% of people diagnosed with lung cancer survive five years after the diagnosis. Symptoms that may suggest lung cancer include:
dyspnea (shortness of breath with activity),
hemoptysis (coughing up blood),
chronic coughing or a change in regular coughing pattern,
wheezing,
chest pain or pain in the abdomen,
cachexia (weight loss, fatigue, and loss of appetite),
dysphonia (hoarse voice),
clubbing of the fingernails (uncommon),
dysphagia (difficulty swallowing),
pain in the shoulder, chest or arm,
bronchitis or pneumonia,
decline in health and unexplained weight loss.
Mortality and morbidity due to tobacco use are very high. Usually lung cancer develops within the wall or epithelium of the bronchial tree, but it can start anywhere in the lungs and affect any part of the respiratory system. Lung cancer mostly affects people between the ages of 55 and 65 and often takes many years to develop [2].
There are two major types of lung cancer: non-small cell lung cancer (NSCLC) and small cell lung cancer (SCLC), also called oat cell cancer. Each type of lung cancer grows and spreads in different ways, and is treated differently. If the cancer has features of both types, it is called mixed small cell/large cell cancer.
Non-small cell lung cancer is more common than SCLC, and it generally grows and spreads more slowly. SCLC is almost always related to smoking; it grows more quickly and forms large tumors that can spread widely through the body. These tumors often start in the bronchi near the center of the chest. The lung cancer death rate is related to the total amount of cigarettes smoked [3].
Smoking cessation, diet modification, and chemoprevention are primary prevention activities. Screening is a form of secondary prevention. Our method of finding possible lung cancer patients is based on the systematic study of symptoms and risk factors. Non-clinical symptoms and risk factors are some of the generic indicators of cancer diseases. Environmental factors have an important role in human cancer. Many carcinogens are present in the air we breathe, the food we eat, and the water we drink. The constant and sometimes unavoidable exposure to environmental carcinogens complicates the investigation of cancer causes in human beings. The complexity of human cancer causes is especially challenging for cancers with long latency, which are
The Knowledge Discovery in Databases process comprises a few steps leading from raw data collections to some form of new knowledge [5]. The iterative process
consists of the following steps:
(1) Data cleaning: also known as data cleansing, it is a phase in which noisy and irrelevant data are removed from the collection.
(2) Data integration: at this stage, multiple data sources,
often heterogeneous, may be combined in a common source.
(3) Data selection: at this step, the data relevant to the
analysis is decided on and retrieved from the data collection.
(4) Data transformation: also known as data
consolidation, it is a phase in which the selected data is
transformed into forms appropriate for the mining
procedure.
(5) Data mining: it is the crucial step in which clever techniques are applied to extract potentially useful patterns.
(6) Pattern evaluation: in this step, strictly interesting patterns representing knowledge are identified based on given measures.
(7) Knowledge representation: it is the final phase in which the discovered knowledge is visually represented to the user. In this step, visualization techniques are used to help users understand and interpret the data mining results.

B. Data Mining Process
In the KDD process, the data mining methods are used for extracting patterns from data. The patterns that can be discovered depend upon the data mining tasks applied. Generally, there are two types of data mining tasks: descriptive data mining tasks that describe the general properties of the existing data, and predictive data mining tasks that attempt to make predictions based on available data. Data mining can be done on data which are in quantitative, textual, or multimedia forms.
Data mining applications can use different kinds of parameters to examine the data. They include association (patterns where one event is connected to another event), sequence or path analysis (patterns where one event leads to another event), classification (identification of new patterns with predefined targets) and clustering (grouping of identical or similar objects). Data mining involves some of the following key steps [6]:
(1) Problem definition: The first step is to identify goals. Based on the defined goal, the correct series of tools can be applied to the data to build the corresponding behavioral model.
(2) Data exploration: If the quality of data is not suitable for an accurate model, then recommendations on future data collection and storage strategies can be made at this stage. For analysis, all data needs to be consolidated so that it can be treated consistently.
(3) Data preparation: The purpose of this step is to clean and transform the data so that missing and invalid values are treated and all known valid values are made consistent for more robust analysis.
(4) Modeling: Based on the data and the desired outcomes, a data mining algorithm or combination of algorithms is selected for analysis. These algorithms include classical techniques such as statistics, neighborhoods and clustering, but also next-generation techniques such as decision trees, networks and rule-based algorithms. The specific algorithm is selected based on the particular objective to be achieved and the quality of the data to be analyzed.

Figure 2. Data Mining Process Representation

(5) Evaluation and Deployment: Based on the results of the data mining algorithms, an analysis is conducted to determine key conclusions from the analysis and create a series of recommendations for consideration.
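To make the flow of these steps concrete, the short sketch below chains cleaning, transformation, mining and evaluation into one pipeline. It is only an illustration of the process outlined above, not the system described in this paper; the records, attribute names and the trivial majority-class "model" are hypothetical placeholders.

from collections import Counter

# Hypothetical raw records; "?" marks a missing value.
RAW = [
    {"age": "67", "wheezing": "yes", "short_breath": "yes", "diagnosis": "cancer"},
    {"age": "52", "wheezing": "no",  "short_breath": "?",   "diagnosis": "normal"},
    {"age": "61", "wheezing": "yes", "short_breath": "no",  "diagnosis": "?"},
]

def clean(records):                       # data cleaning: drop rows with a missing class label
    return [r for r in records if r["diagnosis"] != "?"]

def transform(records):                   # data transformation: encode symptoms as 0/1, age as int
    out = []
    for r in records:
        out.append({"age": int(r["age"]),
                    "wheezing": 1 if r["wheezing"] == "yes" else 0,
                    "short_breath": 1 if r["short_breath"] == "yes" else 0,
                    "diagnosis": r["diagnosis"]})
    return out

def mine(records):                        # data mining: a trivial majority-class "model"
    majority = Counter(r["diagnosis"] for r in records).most_common(1)[0][0]
    return lambda record: majority

def evaluate(model, records):             # pattern evaluation: accuracy of the mined model
    return sum(model(r) == r["diagnosis"] for r in records) / len(records)

data = transform(clean(RAW))
model = mine(data)
print("training accuracy:", evaluate(model, data))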
IV. DATA MINING CLASSIFICATION METHODS

Data mining consists of various methods. Different methods serve different purposes, each method offering its own advantages and disadvantages. In data mining, classification is one of the most important tasks. It maps the data into predefined targets. It is supervised learning, as the targets are predefined.
The aim of classification is to build a classifier based on some cases with some attributes that describe the objects and one attribute that describes the group of the objects. Then, the classifier is used to predict the group attribute of new cases from the domain based on the values of the other attributes. The most used classification algorithms exploited in microarray analysis belong to four categories: IF-THEN Rule, Decision tree, Bayesian classifiers and Neural networks.

IF-THEN Rule:
Rule induction is the process of extracting useful 'if-then' rules from data based on statistical significance. A rule-based system constructs a set of if-then rules. Knowledge is represented in the form
IF conditions THEN conclusion.
This kind of rule consists of two parts. The rule antecedent (the IF part) contains one or more conditions about the values of predictor attributes, whereas the rule consequent (the THEN part) contains a prediction about the value of a goal attribute. An accurate prediction of the value of a goal attribute will improve the decision-making process. IF-THEN prediction rules are very popular in data mining; they represent discovered knowledge at a high level of abstraction. The Rule Induction Method has the potential to use retrieved cases for predictions [7].
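As a toy illustration of such IF-THEN prediction rules, the following sketch encodes two hand-written rules over hypothetical symptom attributes and returns the conclusion of the first rule whose antecedent is satisfied; the attributes, thresholds and risk labels are invented for illustration and are not taken from the paper's data.

# Minimal if-then rule classifier sketch; rules and attributes are hypothetical.
RULES = [
    # (conditions on predictor attributes, predicted goal attribute value)
    (lambda p: p["age"] > 55 and p["smoker"] and p["hemoptysis"], "high risk"),
    (lambda p: p["chronic_cough"] and p["wheezing"],              "medium risk"),
]

def classify(patient, default="low risk"):
    """Return the conclusion of the first rule whose IF part is satisfied."""
    for condition, conclusion in RULES:
        if condition(patient):
            return conclusion
    return default

patient = {"age": 62, "smoker": True, "hemoptysis": True,
           "chronic_cough": False, "wheezing": True}
print(classify(patient))        # -> "high risk"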
were related using a single joint distribution, the equivalent of all nodes being first-level parents, the number of possible combinations of variables would be equal to (2^n - 1). This results in the need for a very large amount of data [18, 19]. If dependence relationships between these variables could be determined, resulting in independent variables being removed, fewer nodes would be adjacent to the node of interest. This parent-node removal leads to a significant reduction in the number of variable combinations, thereby reducing the amount of needed data. Furthermore, variables that are directly conditional, not on the node of interest but on the parents of the node of interest (as nodes 4 and 5 are with respect to node 1 in Figure 5), can be related, which allows for a more robust system when dealing with missing data points. This property of requiring less information, based on pre-existing understanding of the system's variable dependencies, is a major benefit of Bayesian Networks [20]. Some further theoretical underpinnings of the Bayesian approach to classification have been addressed in [21] and [22]. A Bayesian Network (BN) is a relatively new tool that identifies probabilistic correlations in order to make predictions or assessments of class membership.

Figure 5. Basic Bayesian Network Structure and Terminology

While the independence assumption may seem a simplifying one that would therefore lead to less accurate classification, this has not been true in many applications. For instance, several datasets are classified in [23] using the naïve Bayesian classifier, decision tree induction, instance-based learning, and rule induction. These methods are compared, showing the naïve classifier as the overall best method. To use a Bayesian Network as a classifier, one must first assume that data correlation is equivalent to statistical dependence.

1) Bayesian Network Type
The kind of Bayesian Network (BN) retrieved by the algorithm is also called an Augmented Naïve BN, characterized mainly by the points below.
All attributes have a certain influence on the class.
The conditional independence assumption is relaxed (certain attributes have been given an additional parent).

2) Pre-Processing Techniques
The following data pre-processing techniques are applied to the data before running the ODANB [24] algorithm.
Replace Missing Values: This filter scans all (or selected) nominal and numerical attributes and replaces missing values with the mode and mean, respectively.
Discretization: This filter is designed to convert numerical attributes into nominal ones; however, the unsupervised version does not take class information into account when grouping instances together. There is always a risk that distinctions between the different instances in relation to the class can be wiped out when using such a filter.
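A minimal sketch of these two filters, assuming a single numeric column and an arbitrary choice of three equal-width bins, is given below; it mirrors the idea of the WEKA-style filters rather than their actual implementation. Because the binning never consults the class attribute, class distinctions can indeed be washed out, as noted above.

from statistics import mean, mode

def replace_missing(values, nominal=False):
    """Fill None entries with the mode (nominal) or mean (numeric) of the rest."""
    present = [v for v in values if v is not None]
    fill = mode(present) if nominal else mean(present)
    return [fill if v is None else v for v in values]

def discretize(values, bins=3):
    """Unsupervised equal-width discretization: ignores the class attribute."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0
    return [min(int((v - lo) / width), bins - 1) for v in values]

ages = replace_missing([63, None, 58, 71, 49])   # hypothetical 'age' column
print(discretize(ages))                          # -> [1, 1, 1, 2, 0]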
E. Some Implementation Details
JNCC2 loads data from ARFF files, a plain text format originally developed for WEKA (Witten and Frank, 2005). A large number of ARFF data sets, including the data sets from the UCI repository, are available from http://www.cs.waikato.ac.nz/ml/weka/index_datasets.html. As a pre-processing step, JNCC2 [25] discretizes all the numerical features using the supervised discretization algorithm of Fayyad and Irani (1993). The discretization intervals are computed on the training set and then applied unchanged to the test set. NCC2 [25] is implemented exploiting a computationally efficient procedure.

Algorithm 1: Pseudo code for validation via a testing file.
ValidateTestFile()
/* loads training and test file; reads list of non-MAR features; discretizes features */
parseArffFile();
parseArffTestingFile();
parseNonMar();
discretizeNumFeatures();
/* learns and validates NBC */
nbc = new NaiveBayes(trainingSet);
nbc.classifyInstances(testSet);
/* learns and validates NCC2; the list of non-MAR features in training and testing is required */
ncc2 = new NaiveCredalClassifier2(trainingSet, nonMarTraining, nonMarTesting);
ncc2.classifyInstances(testingSet);
/* writes output files */
writePerfIndicators();
writePredictions();

JNCC2 can perform three kinds of experiments: training and testing, cross-validation, and classification of instances of the test set whose class is unknown. The pseudo code of the experiment with training and testing is described by Algorithm 1.
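The training-and-testing experiment of Algorithm 1 can be mirrored in spirit by the sketch below, which learns a classifier on a training set and then classifies every instance of a test set. scikit-learn's categorical naïve Bayes is used purely as a stand-in for the NBC learned by JNCC2 (which is a Java tool), and the tiny discretized feature matrices are made up for illustration.

# Illustrative analogue of the train-and-test run in Algorithm 1 (not the JNCC2 API).
from sklearn.naive_bayes import CategoricalNB      # stand-in for the learned NBC
from sklearn.metrics import accuracy_score

# Hypothetical, already-discretized symptom features (rows) and class labels.
X_train = [[1, 0, 2], [0, 1, 1], [2, 2, 0], [1, 1, 2], [0, 0, 0]]
y_train = ["cancer", "normal", "cancer", "cancer", "normal"]
X_test  = [[1, 0, 2], [0, 0, 1]]
y_test  = ["cancer", "normal"]

nbc = CategoricalNB()                  # "learns and validates NBC"
nbc.fit(X_train, y_train)
predictions = nbc.predict(X_test)      # "classifyInstances(testSet)"
print(predictions, accuracy_score(y_test, predictions))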
The ODANB has been compared with other existing methods that improve the Naïve Bayes, and with the Naïve Bayes itself. The results of the comparison show that the ODANB outperforms the other methods for disease prediction not related to lung cancer.
The comparison criterion that has been introduced is:
• Accuracy of prediction (measures defined from the confusion matrix outputs). Table 2 below recaps the benchmarked algorithms' accuracy for each dataset considered; in each row, the best performing algorithm is the one with the highest value.

TABLE 2. COMPARISON OF RESULTS
DATASETS               ODANB    NB
LUNG CANCER-C          80.46    84.14
LUNG CANCER-H          79.66    84.05
LUNG CANCER-STATLOG    80.00    83.70
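Since the comparison criterion above is accuracy derived from the confusion-matrix outputs, the small helper below shows the usual computation; the example counts are invented and do not correspond to the figures in Table 2.

def accuracy(tp, tn, fp, fn):
    """Accuracy from confusion-matrix counts: correct predictions over all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical confusion-matrix counts for one classifier on one data set.
print(round(100 * accuracy(tp=52, tn=31, fp=9, fn=8), 2))   # -> 83.0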
We focus on the results, which clearly state that TAN (Tree Augmented Naïve Bayes) [25] works efficiently, compared with ODANB and Naïve Bayes, on data sets of general and regular things such as vehicles and anneal (metallurgy), but for diagnosis of Lung Cancer Disease, Naïve Bayes shows better results.

VI. CONCLUSION
A prototype lung cancer disease prediction system is developed using data mining classification techniques. The system extracts hidden knowledge from a historical lung cancer disease database. The most effective model to predict patients with Lung cancer disease appears to be Naïve Bayes, followed by IF-THEN rules, Decision Trees and Neural Networks. Decision Tree results are easier to read and interpret. The drill-through feature to access detailed patients' profiles is only available in Decision Trees. Naïve Bayes fared better than Decision Trees as it could identify all the significant medical predictors. The relationship between attributes produced by a Neural Network is more difficult to understand.
In some cases, even advanced-stage Lung cancer patients do not show the symptoms associated with Lung cancer.
Prevalence of Lung cancer disease is high in India, especially in rural India, and it often does not get noticed at an early stage because of the lack of awareness. Also, it is not possible for the voluntary agencies to carry out screening for all the people. The emphasis of this work is to find the target group of people who need further screening for Lung cancer disease, so that the prevalence and mortality rate could be brought down.
The Lung cancer prediction system can be further enhanced and expanded. It can also incorporate other data mining techniques, e.g., Time Series, Clustering and Association Rules. Continuous data can also be used instead of just categorical data. Another area is to use Text Mining to mine the vast amount of unstructured data available in healthcare databases. Another challenge would be to integrate data mining and text mining [26].

ACKNOWLEDGMENT
The authors would like to thank CVR College of Engineering, Hyderabad, for providing its amenities.

REFERENCES
[1] Sang Min Park, Min Kyung Lim, Soon Ae Shin & Young Ho Yun, 2006. Impact of prediagnosis smoking, alcohol, obesity and insulin resistance on survival in male cancer patients: National Health Insurance Corporation study. Journal of Clinical Oncology, Vol. 24, No. 31, November 2006.
[2] Yongqian Qiang, Youmin Guo, Xue Li, Qiuping Wang, Hao Chen & Duwu Cuic, 2007. The diagnostic rules of peripheral lung cancer: preliminary study based on data mining technique. Journal of Nanjing Medical University, 21(3):190-195.
[3] Murat Karabhatak, M. Cevdet Ince, 2008. Expert system for detection of breast cancer based on association rules and neural network. Expert Systems with Applications.
[4] ICMR Report 2006. Cancer Research in ICMR: Achievements in Nineties.
[5] Osmar R. Zaïane, Principles of Knowledge Discovery in Databases. [Online]. Available: webdocs.cs.ualberta.ca/~zaiane/courses/cmput690/notes/Chapter1/ch1.pdf.
[6] The Data Mining Process. [Online]. Available: http://publib.boulder.ibm.com/infocenter/db2luw/v9/index.jsp?topic=/com.ibm.im.easy.doc/c_dm_process.html; Shelly Gupta et al., Indian Journal of Computer Science and Engineering (IJCSE).
[7] Harleen Kaur and Siri Krishan Wasan, Empirical Study on Applications of Data Mining Techniques in Healthcare, Journal of Computer Science 2(2): 194-200, 2006, ISSN 1549-3636.
[8] J.R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81-106, 1986.
[9] J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[10] L. Breiman. Random forests. Machine Learning, 45(1):5-32, 2001.
[11] R. Díaz-Uriarte and S. Alvarez de Andrés. Gene selection and classification of microarray data using random forest. BMC Bioinformatics, 7(1):3, 2006.
[12] R.S. Michalski and K. Kaufman. Learning patterns in noisy data: The AQ approach. Machine Learning and its Applications, Springer-Verlag, pages 22-38, 2001.
[13] R. Linder, T. Richards, and M. Wagner. Microarray data classified by artificial neural networks. Methods in Molecular Biology, 382:345, 2007.
[14] Murat Karabhatak, M. Cevdet Ince, 2008. Expert system for detection of breast cancer based on association rules and neural network. Expert Systems with Applications.
[15] Han, J. and M. Kamber, 2001. Data Mining: Concepts and Techniques. San Francisco, Morgan Kaufmann Publishers.
[16] Two Crows Corporation, Introduction to Data Mining and Knowledge Discovery, Third Edition, 2005. ISBN: 1-892095-02-5, pages 10, 11.
[17] Maria-Luiza Antonie, Osmar R. Zaïane, Alexandru Coman. Application of Data Mining Techniques for Medical Image Classification, page 97.
[18] Heckerman, D., A Tutorial on Learning with Bayesian Networks. 1995, Microsoft Research.
[19] Neapolitan, R., Learning Bayesian Networks. 2004, London: Pearson Prentice Hall.
[20] Neapolitan, R., Learning Bayesian Networks. 2004, London: Pearson Prentice Hall.
[21] Krishnapuram, B., et al., A Bayesian approach to joint feature selection and classifier design. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2004, 26(9): pp. 1105-1111.
[22] Shantakumar B. Patil, Y.S. Kumaraswamy, Intelligent and Effective Heart Attack Prediction System Using Data Mining and Artificial Neural Network, European Journal of Scientific Research, ISSN 1450-216X, Vol. 31, No. 4 (2009), pp. 642-656, EuroJournals Publishing, Inc., 2009.
[23] Sellappan Palaniappan, Rafiah Awang, Intelligent Heart Disease Prediction System Using Data Mining Techniques, 978-1-4244-1968-5/08/$25.00, 2008 IEEE.
[24] Juan Bernabé Moreno, One Dependence Augmented Naive Bayes, University of Granada, Department of Computer Science and Artificial Intelligence.
[25] Juan Bernabé Moreno, One Dependence Augmented Naive Bayes, University of Granada, Department of Computer Science and Artificial Intelligence.
[26] Weiguo, F., Wallace, L., Rich, S., Zhongju, Z.: "Tapping the Power of Text Mining", Communications of the ACM, 49(9), 77-82, 2006.