Diagnosis of Lung Cancer Prediction System Using Data Mining Classification Techniques
Abstract— Cancer is one of the most important causes of death for both men and women. The early detection of cancer can be helpful in curing the disease completely, so the need for techniques to detect the occurrence of a cancer nodule at an early stage is increasing. A disease that is commonly misdiagnosed is lung cancer. Early diagnosis of lung cancer saves enormous numbers of lives; failing that, it may lead to other severe problems causing a sudden fatal end. Its cure rate and prediction depend mainly on the early detection and diagnosis of the disease. One of the most common forms of medical malpractice globally is an error in diagnosis. Knowledge discovery and data mining have found numerous applications in the business and scientific domains. Valuable knowledge can be discovered from the application of data mining techniques in the healthcare system. In this study, we briefly examine the potential use of classification-based data mining techniques, such as Rule-based, Decision tree, Naïve Bayes and Artificial Neural Network classifiers, on the massive volume of healthcare data. The healthcare industry collects huge amounts of healthcare data which, unfortunately, are not "mined" to discover hidden information. For data preprocessing and effective decision making, the One Dependency Augmented Naïve Bayes classifier (ODANB) and the Naive Credal Classifier 2 (NCC2) are used. The latter is an extension of naïve Bayes to imprecise probabilities that aims at delivering robust classifications even when dealing with small or incomplete data sets. Discovery of hidden patterns and relationships otherwise often goes unexploited. Diagnosis of lung cancer disease can answer complex "what if" queries which traditional decision support systems cannot. Using generic lung cancer symptoms such as age, sex, wheezing, shortness of breath, and pain in the shoulder, chest or arm, the system can predict the likelihood of patients getting lung cancer disease. The aim of the paper is to propose a model for early detection and correct diagnosis of the disease which will help the doctor to save the life of the patient.

Keywords—Lung cancer, Naive Bayes, ODANB, NCC2, Data Mining, Classification.

I. INTRODUCTION

Lung cancer is one of the leading causes of cancer deaths in both women and men. Manifestation of lung cancer in the body of the patient is revealed through early symptoms in most of the cases [1]. Treatment and prognosis depend on the histological type of cancer, the stage (degree of spread), and the patient's performance status. Possible treatments include surgery, chemotherapy, and radiotherapy. Survival depends on stage, overall health, and other factors, but overall only 14% of people diagnosed with lung cancer survive five years after the diagnosis. Symptoms that may suggest lung cancer include:
dyspnea (shortness of breath with activity),
hemoptysis (coughing up blood),
chronic coughing or a change in regular coughing pattern,
wheezing,
chest pain or pain in the abdomen,
cachexia (weight loss, fatigue, and loss of appetite),
dysphonia (hoarse voice),
clubbing of the fingernails (uncommon),
dysphagia (difficulty swallowing),
pain in the shoulder, chest or arm,
bronchitis or pneumonia,
decline in health and unexplained weight loss.
Mortality and morbidity due to tobacco use are very high. Usually lung cancer develops within the wall or epithelium of the bronchial tree, but it can start anywhere in the lungs and affect any part of the respiratory system. Lung cancer mostly affects people between the ages of 55 and 65 and often takes many years to develop [2].
There are two major types of lung cancer: non-small cell lung cancer (NSCLC) and small cell lung cancer (SCLC), also called oat cell cancer. Each type of lung cancer grows and spreads in different ways, and is treated differently. If the cancer has features of both types, it is called mixed small cell/large cell cancer.
Non-small cell lung cancer is more common than SCLC, and it generally grows and spreads more slowly. SCLC is almost always related to smoking; it grows more quickly and forms large tumors that can spread widely through the body. These tumors often start in the bronchi near the center of the chest. The lung cancer death rate is related to the total amount of cigarettes smoked [3].
Smoking cessation, diet modification, and chemoprevention are primary prevention activities. Screening is a form of secondary prevention. Our method of finding possible lung cancer patients is based on the systematic study of symptoms and risk factors. Non-clinical symptoms and risk factors are some of the generic indicators of cancer diseases. Environmental factors have an important role in human cancer. Many carcinogens are present in the air we breathe, the food we eat, and the water we drink. The constant and sometimes unavoidable exposure to environmental carcinogens complicates the investigation of cancer causes in human beings. The complexity of human cancer causes is especially challenging for cancers with long latency, which are
The Knowledge Discovery in Databases process comprises a few steps leading from raw data collections to some form of new knowledge [5]. The iterative process
consists of the following steps:
(1) Data cleaning: also known as data cleansing, it is a phase in which noisy and irrelevant data are removed from the collection.
(2) Data integration: at this stage, multiple data sources,
often heterogeneous, may be combined in a common source.
(3) Data selection: at this step, the data relevant to the
analysis is decided on and retrieved from the data collection.
(4) Data transformation: also known as data
consolidation, it is a phase in which the selected data is
transformed into forms appropriate for the mining
procedure.
(5) Data mining: it is the crucial step in which clever techniques are applied to extract potentially useful patterns.
(6) Pattern evaluation: in this step, strictly interesting patterns representing knowledge are identified based on given measures.
(7) Knowledge representation: it is the final phase in which the discovered knowledge is visually represented to the user. In this step, visualization techniques are used to help users understand and interpret the data mining results.

B. Data Mining Process
In the KDD process, the data mining methods are used for extracting patterns from data. The patterns that can be discovered depend upon the data mining tasks applied. Generally, there are two types of data mining tasks: descriptive data mining tasks that describe the general properties of the existing data, and predictive data mining tasks that attempt to make predictions based on available data. Data mining can be done on data which are in quantitative, textual, or multimedia forms.
Data mining applications can use different kinds of parameters to examine the data. They include association (patterns where one event is connected to another event), sequence or path analysis (patterns where one event leads to another event), classification (identification of new patterns with predefined targets) and clustering (grouping of identical or similar objects). Data mining involves some of the following key steps [6]:
(1) Problem definition: The first step is to identify goals. Based on the defined goal, the correct series of tools can be applied to the data to build the corresponding behavioral model.
(2) Data exploration: If the quality of data is not suitable for an accurate model, then recommendations on future data collection and storage strategies can be made at this stage. For analysis, all data needs to be consolidated so that it can be treated consistently.
(3) Data preparation: The purpose of this step is to clean and transform the data so that missing and invalid values are treated and all known valid values are made consistent for more robust analysis.
(4) Modeling: Based on the data and the desired outcomes, a data mining algorithm or combination of algorithms is selected for analysis. These algorithms include classical techniques such as statistics, neighborhoods and clustering, but also next-generation techniques such as decision trees, networks and rule-based algorithms. The specific algorithm is selected based on the particular objective to be achieved and the quality of the data to be analyzed.

Figure 2. Data Mining Process Representation

(5) Evaluation and Deployment: Based on the results of the data mining algorithms, an analysis is conducted to determine key conclusions from the analysis and create a series of recommendations for consideration.
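To make the flow of these steps concrete, the short sketch below chains cleaning, transformation, mining and evaluation into one pipeline. It is only an illustration of the process outlined above, not the system described in this paper; the records, attribute names and the trivial majority-class "model" are hypothetical placeholders.

from collections import Counter

# Hypothetical raw records; "?" marks a missing value.
RAW = [
    {"age": "67", "wheezing": "yes", "short_breath": "yes", "diagnosis": "cancer"},
    {"age": "52", "wheezing": "no",  "short_breath": "?",   "diagnosis": "normal"},
    {"age": "61", "wheezing": "yes", "short_breath": "no",  "diagnosis": "?"},
]

def clean(records):                       # data cleaning: drop rows with a missing class label
    return [r for r in records if r["diagnosis"] != "?"]

def transform(records):                   # data transformation: encode symptoms as 0/1, age as int
    out = []
    for r in records:
        out.append({"age": int(r["age"]),
                    "wheezing": 1 if r["wheezing"] == "yes" else 0,
                    "short_breath": 1 if r["short_breath"] == "yes" else 0,
                    "diagnosis": r["diagnosis"]})
    return out

def mine(records):                        # data mining: a trivial majority-class "model"
    majority = Counter(r["diagnosis"] for r in records).most_common(1)[0][0]
    return lambda record: majority

def evaluate(model, records):             # pattern evaluation: accuracy of the mined model
    return sum(model(r) == r["diagnosis"] for r in records) / len(records)

data = transform(clean(RAW))
model = mine(data)
print("training accuracy:", evaluate(model, data))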
IV. DATA MINING CLASSIFICATION METHODS

Data mining consists of various methods. Different methods serve different purposes, each method offering its own advantages and disadvantages. In data mining, classification is one of the most important tasks. It maps the data into predefined targets. It is supervised learning, as the targets are predefined.
The aim of classification is to build a classifier based on some cases with some attributes that describe the objects and one attribute that describes the group of the objects. Then, the classifier is used to predict the group attribute of new cases from the domain based on the values of the other attributes. The most used classification algorithms exploited in microarray analysis belong to four categories: IF-THEN Rule, Decision tree, Bayesian classifiers and Neural networks.

IF-THEN Rule:
Rule induction is the process of extracting useful 'if-then' rules from data based on statistical significance. A rule-based system constructs a set of if-then rules. Knowledge is represented in the form
IF conditions THEN conclusion.
This kind of rule consists of two parts. The rule antecedent (the IF part) contains one or more conditions about the values of predictor attributes, whereas the rule consequent (the THEN part) contains a prediction about the value of a goal attribute. An accurate prediction of the value of a goal attribute will improve the decision-making process. IF-THEN prediction rules are very popular in data mining; they represent discovered knowledge at a high level of abstraction. The Rule Induction Method has the potential to use retrieved cases for predictions [7].
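As a toy illustration of such IF-THEN prediction rules, the following sketch encodes two hand-written rules over hypothetical symptom attributes and returns the conclusion of the first rule whose antecedent is satisfied; the attributes, thresholds and risk labels are invented for illustration and are not taken from the paper's data.

# Minimal if-then rule classifier sketch; rules and attributes are hypothetical.
RULES = [
    # (conditions on predictor attributes, predicted goal attribute value)
    (lambda p: p["age"] > 55 and p["smoker"] and p["hemoptysis"], "high risk"),
    (lambda p: p["chronic_cough"] and p["wheezing"],              "medium risk"),
]

def classify(patient, default="low risk"):
    """Return the conclusion of the first rule whose IF part is satisfied."""
    for condition, conclusion in RULES:
        if condition(patient):
            return conclusion
    return default

patient = {"age": 62, "smoker": True, "hemoptysis": True,
           "chronic_cough": False, "wheezing": True}
print(classify(patient))        # -> "high risk"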
were related using a single joint distribution, the equivalent of all nodes being first-level parents, the number of possible combinations of variables would be equal to (2^n - 1). This results in the need for a very large amount of data [18, 19]. If dependence relationships between these variables could be determined, resulting in independent variables being removed, fewer nodes would be adjacent to the node of interest. This parent-node removal leads to a significant reduction in the number of variable combinations, thereby reducing the amount of needed data. Furthermore, variables that are directly conditional, not on the node of interest but on the parents of the node of interest (as nodes 4 and 5 are with respect to node 1 in Figure 5), can be related, which allows for a more robust system when dealing with missing data points. This property of requiring less information, based on pre-existing understanding of the system's variable dependencies, is a major benefit of Bayesian Networks [20]. Some further theoretical underpinnings of the Bayesian approach to classification have been addressed in [21] and [22]. A Bayesian Network (BN) is a relatively new tool that identifies probabilistic correlations in order to make predictions or assessments of class membership.

Figure 5. Basic Bayesian Network Structure and Terminology

While the independence assumption may seem a simplifying one that would therefore lead to less accurate classification, this has not been true in many applications. For instance, several datasets are classified in [23] using the naïve Bayesian classifier, decision tree induction, instance-based learning, and rule induction. These methods are compared, showing the naïve classifier as the overall best method. To use a Bayesian Network as a classifier, one must first assume that data correlation is equivalent to statistical dependence.

1) Bayesian Network Type
The kind of Bayesian Network (BN) retrieved by the algorithm is also called an Augmented Naïve BN, characterized mainly by the points below.
All attributes have a certain influence on the class.
The conditional independence assumption is relaxed (certain attributes have been given an additional parent).

2) Pre-Processing Techniques
The following data pre-processing techniques are applied to the data before running the ODANB [24] algorithm.
Replace Missing Values: This filter scans all (or selected) nominal and numerical attributes and replaces missing values with the mode and mean, respectively.
Discretization: This filter is designed to convert numerical attributes into nominal ones; however, the unsupervised version does not take class information into account when grouping instances together. There is always a risk that distinctions between the different instances in relation to the class can be wiped out when using such a filter.
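A minimal sketch of these two filters, assuming a single numeric column and an arbitrary choice of three equal-width bins, is given below; it mirrors the idea of the WEKA-style filters rather than their actual implementation. Because the binning never consults the class attribute, class distinctions can indeed be washed out, as noted above.

from statistics import mean, mode

def replace_missing(values, nominal=False):
    """Fill None entries with the mode (nominal) or mean (numeric) of the rest."""
    present = [v for v in values if v is not None]
    fill = mode(present) if nominal else mean(present)
    return [fill if v is None else v for v in values]

def discretize(values, bins=3):
    """Unsupervised equal-width discretization: ignores the class attribute."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0
    return [min(int((v - lo) / width), bins - 1) for v in values]

ages = replace_missing([63, None, 58, 71, 49])   # hypothetical 'age' column
print(discretize(ages))                          # -> [1, 1, 1, 2, 0]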
E. Some Implementation Details
JNCC2 loads data from ARFF files, a plain text format originally developed for WEKA (Witten and Frank, 2005). A large number of ARFF data sets, including the data sets from the UCI repository, are available from http://www.cs.waikato.ac.nz/ml/weka/index_datasets.html. As a pre-processing step, JNCC2 [25] discretizes all the numerical features using the supervised discretization algorithm of Fayyad and Irani (1993). The discretization intervals are computed on the training set and then applied unchanged to the test set. NCC2 [25] is implemented exploiting a computationally efficient procedure.

Algorithm 1: Pseudo code for validation via a testing file.
ValidateTestFile()
/* loads training and test file; reads list of non-MAR features; discretizes features */
parseArffFile();
parseArffTestingFile();
parseNonMar();
discretizeNumFeatures();
/* learns and validates NBC */
nbc = new NaiveBayes(trainingSet);
nbc.classifyInstances(testSet);
/* learns and validates NCC2; the list of non-MAR features in training and testing is required */
ncc2 = new NaiveCredalClassifier2(trainingSet, nonMarTraining, nonMarTesting);
ncc2.classifyInstances(testingSet);
/* writes output files */
writePerfIndicators();
writePredictions();

JNCC2 can perform three kinds of experiments: training and testing, cross-validation, and classification of instances of the test set whose class is unknown. The pseudo code of the experiment with training and testing is described by Algorithm 1.
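The training-and-testing experiment of Algorithm 1 can be mirrored in spirit by the sketch below, which learns a classifier on a training set and then classifies every instance of a test set. scikit-learn's categorical naïve Bayes is used purely as a stand-in for the NBC learned by JNCC2 (which is a Java tool), and the tiny discretized feature matrices are made up for illustration.

# Illustrative analogue of the train-and-test run in Algorithm 1 (not the JNCC2 API).
from sklearn.naive_bayes import CategoricalNB      # stand-in for the learned NBC
from sklearn.metrics import accuracy_score

# Hypothetical, already-discretized symptom features (rows) and class labels.
X_train = [[1, 0, 2], [0, 1, 1], [2, 2, 0], [1, 1, 2], [0, 0, 0]]
y_train = ["cancer", "normal", "cancer", "cancer", "normal"]
X_test  = [[1, 0, 2], [0, 0, 1]]
y_test  = ["cancer", "normal"]

nbc = CategoricalNB()                  # "learns and validates NBC"
nbc.fit(X_train, y_train)
predictions = nbc.predict(X_test)      # "classifyInstances(testSet)"
print(predictions, accuracy_score(y_test, predictions))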
The ODANB has been compared with other existing methods that improve the Naïve Bayes, and with the Naïve Bayes itself. The results of the comparison show that the ODANB outperforms the other methods for disease prediction not related to lung cancer.
The comparison criterion that has been introduced is:
• Accuracy of prediction (measures defined from the confusion matrix outputs). Table 2 below recaps the benchmarked algorithms' accuracy for each dataset considered; in each row, the best performing algorithm is the one with the highest value.

TABLE 2. COMPARISON OF RESULTS
DATASETS               ODANB    NB
LUNG CANCER-C          80.46    84.14
LUNG CANCER-H          79.66    84.05
LUNG CANCER-STATLOG    80.00    83.70
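Since the comparison criterion above is accuracy derived from the confusion-matrix outputs, the small helper below shows the usual computation; the example counts are invented and do not correspond to the figures in Table 2.

def accuracy(tp, tn, fp, fn):
    """Accuracy from confusion-matrix counts: correct predictions over all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical confusion-matrix counts for one classifier on one data set.
print(round(100 * accuracy(tp=52, tn=31, fp=9, fn=8), 2))   # -> 83.0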
We focus on the results, which clearly state that TAN (Tree Augmented Naïve Bayes) [25] works efficiently, compared with ODANB and Naïve Bayes, on data sets of general and regular things such as vehicles and anneal (metallurgy), but for diagnosis of Lung Cancer Disease, Naïve Bayes shows better results.

VI. CONCLUSION
A prototype lung cancer disease prediction system is developed using data mining classification techniques. The system extracts hidden knowledge from a historical lung cancer disease database. The most effective model to predict patients with Lung cancer disease appears to be Naïve Bayes, followed by IF-THEN rules, Decision Trees and Neural Networks. Decision Tree results are easier to read and interpret. The drill-through feature to access detailed patients' profiles is only available in Decision Trees. Naïve Bayes fared better than Decision Trees as it could identify all the significant medical predictors. The relationship between attributes produced by a Neural Network is more difficult to understand.
In some cases, even advanced-stage Lung cancer patients do not show the symptoms associated with Lung cancer.
Prevalence of Lung cancer disease is high in India, especially in rural India, and it often does not get noticed at an early stage because of the lack of awareness. Also, it is not possible for the voluntary agencies to carry out screening for all the people. The emphasis of this work is to find the target group of people who need further screening for Lung cancer disease, so that the prevalence and mortality rate could be brought down.
The Lung cancer prediction system can be further enhanced and expanded. It can also incorporate other data mining techniques, e.g., Time Series, Clustering and Association Rules. Continuous data can also be used instead of just categorical data. Another area is to use Text Mining to mine the vast amount of unstructured data available in healthcare databases. Another challenge would be to integrate data mining and text mining [26].

ACKNOWLEDGMENT
The authors would like to thank CVR College of Engineering, Hyderabad, for providing its amenities.

REFERENCES
[1] Sang Min Park, Min Kyung Lim, Soon Ae Shin & Young Ho Yun, 2006. Impact of prediagnosis smoking, alcohol, obesity and insulin resistance on survival in male cancer patients: National Health Insurance Corporation study. Journal of Clinical Oncology, Vol. 24, No. 31, November 2006.
[2] Yongqian Qiang, Youmin Guo, Xue Li, Qiuping Wang, Hao Chen & Duwu Cuic, 2007. The diagnostic rules of peripheral lung cancer: preliminary study based on data mining technique. Journal of Nanjing Medical University, 21(3):190-195.
[3] Murat Karabhatak, M. Cevdet Ince, 2008. Expert system for detection of breast cancer based on association rules and neural network. Expert Systems with Applications.
[4] ICMR Report 2006. Cancer Research in ICMR: Achievements in Nineties.
[5] Osmar R. Zaïane, Principles of Knowledge Discovery in Databases. [Online]. Available: webdocs.cs.ualberta.ca/~zaiane/courses/cmput690/notes/Chapter1/ch1.pdf.
[6] The Data Mining Process. [Online]. Available: http://publib.boulder.ibm.com/infocenter/db2luw/v9/index.jsp?topic=/com.ibm.im.easy.doc/c_dm_process.html; Shelly Gupta et al., Indian Journal of Computer Science and Engineering (IJCSE).
[7] Harleen Kaur and Siri Krishan Wasan, Empirical Study on Applications of Data Mining Techniques in Healthcare, Journal of Computer Science 2(2): 194-200, 2006, ISSN 1549-3636.
[8] J.R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81-106, 1986.
[9] J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[10] L. Breiman. Random forests. Machine Learning, 45(1):5-32, 2001.
[11] R. Díaz-Uriarte and S. Alvarez de Andrés. Gene selection and classification of microarray data using random forest. BMC Bioinformatics, 7(1):3, 2006.
[12] R.S. Michalski and K. Kaufman. Learning patterns in noisy data: The AQ approach. Machine Learning and its Applications, Springer-Verlag, pages 22-38, 2001.
[13] R. Linder, T. Richards, and M. Wagner. Microarray data classified by artificial neural networks. Methods in Molecular Biology, 382:345, 2007.
[14] Murat Karabhatak, M. Cevdet Ince, 2008. Expert system for detection of breast cancer based on association rules and neural network. Expert Systems with Applications.
[15] Han, J. and M. Kamber, 2001. Data Mining: Concepts and Techniques. San Francisco, Morgan Kaufmann Publishers.
[16] Two Crows Corporation, Introduction to Data Mining and Knowledge Discovery, Third Edition, 2005. ISBN: 1-892095-02-5, pages 10, 11.
[17] Maria-Luiza Antonie, Osmar R. Zaïane, Alexandru Coman. Application of Data Mining Techniques for Medical Image Classification, page 97.
[18] Heckerman, D., A Tutorial on Learning with Bayesian Networks. 1995, Microsoft Research.
[19] Neapolitan, R., Learning Bayesian Networks. 2004, London: Pearson Prentice Hall.
[20] Neapolitan, R., Learning Bayesian Networks. 2004, London: Pearson Prentice Hall.
[21] Krishnapuram, B., et al., A Bayesian approach to joint feature selection and classifier design. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2004, 26(9): pp. 1105-1111.
[22] Shantakumar B. Patil, Y.S. Kumaraswamy, Intelligent and Effective Heart Attack Prediction System Using Data Mining and Artificial Neural Network, European Journal of Scientific Research, ISSN 1450-216X, Vol. 31, No. 4 (2009), pp. 642-656, EuroJournals Publishing, Inc., 2009.
[23] Sellappan Palaniappan, Rafiah Awang, Intelligent Heart Disease Prediction System Using Data Mining Techniques, 978-1-4244-1968-5/08/$25.00, 2008 IEEE.
[24] Juan Bernabé Moreno, One Dependence Augmented Naive Bayes, University of Granada, Department of Computer Science and Artificial Intelligence.
[25] Juan Bernabé Moreno, One Dependence Augmented Naive Bayes, University of Granada, Department of Computer Science and Artificial Intelligence.
[26] Weiguo, F., Wallace, L., Rich, S., Zhongju, Z.: "Tapping the Power of Text Mining", Communications of the ACM, 49(9), 77-82, 2006.