Ijcns 2022111614325160
1. Introduction
Huge amounts of data, running into terabytes, are now collected and stored in
databases across the globe, and the volume keeps increasing year after year. The
healthcare industry, one of the largest industries in the world, holds large
amounts of health-related and medical data. It includes thousands of hospitals,
clinics and other facilities that provide primary, secondary and tertiary levels
of care. The delivery of healthcare services is the most visible part of any
healthcare system, both to users and to the general public.
Accurate, early and error-free diagnosis and treatment given to patients has been
DOI: 10.4236/ijcns.2022.159011 Sep. 30, 2022 149 Int. J. Communications, Network and System Sciences
E. N. Ekwonwune et al.
Statement of Problem
Data mining, as an analytic process, is designed to extract useful data, patterns
and trends from large amounts of data (typically business, medical or
market-related data) using techniques such as clustering, classification,
association and regression. The ultimate goal of data mining is prediction.
Research has shown that, for most healthcare-related diseases, the data mining
techniques in use do not produce optimal results.
The following problems are associated with the use of the wrong data mining
techniques:
1) Low accuracy in decisions.
2) Shortage of expertise.
3) Difficulty in updating knowledge.
4) Time-dependent performance (very time-consuming).
Because of these problems, data mining needs to be deployed as an assistance
mechanism in the diagnosis procedure. The conclusion is clear: humans and their
statistical methods cannot analyze complex data ad hoc without errors. In
medicine and healthcare, where safety is critical, this matters if data mining
techniques are to be widely accepted in clinical practice [6].
The goal of the process is to take medical data that contain many attributes and
determine which ones are actually relevant to the diagnosis, symptoms and result
of heart disease. Without automatic methods for extracting this information,
mining it is practically impossible, given that the data run into terabytes.
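One common way to judge which attributes are relevant to a diagnosis is to rank them by information gain. The sketch below is only an illustration under stated assumptions: the paper does not name a selection method, and the records and attribute names (chest_pain, smoker, heart_disease) are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(records, attribute, target):
    """Reduction in entropy of `target` after splitting on `attribute`."""
    base = entropy([r[target] for r in records])
    remainder = 0.0
    for value in {r[attribute] for r in records}:
        subset = [r[target] for r in records if r[attribute] == value]
        remainder += len(subset) / len(records) * entropy(subset)
    return base - remainder


# Hypothetical patient records, invented purely for illustration.
records = [
    {"chest_pain": "yes", "smoker": "yes", "heart_disease": "yes"},
    {"chest_pain": "yes", "smoker": "no", "heart_disease": "yes"},
    {"chest_pain": "no", "smoker": "yes", "heart_disease": "no"},
    {"chest_pain": "no", "smoker": "no", "heart_disease": "no"},
]

# Rank attributes from most to least informative about the target.
ranking = sorted(
    ["chest_pain", "smoker"],
    key=lambda a: information_gain(records, a, "heart_disease"),
    reverse=True,
)
print(ranking)
```

In this toy data set chest_pain separates the classes completely while smoker carries no information, so an automatic ranking of this kind would keep the former and discard the latter.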
2. Literature Review
2.1. Conceptual Framework
The healthcare industry deals with millions of digitally recorded data items and
patterns, collected at enormous speed because of the widespread use of powerful
computing devices [7]. The data collected are mostly unorganized and have not
been used properly for appropriate applications, which poses new challenges for
their management, including modeling, storage and retrieval. There is often
interesting knowledge in the data that is not readily evident. The spread of
electronic patient records, with their computer-readable entries such as
Magnetic Resonance Imaging (MRI), signals such as ECG (electrocardiography), and
clinical information such as blood sugar, blood pressure and cholesterol levels,
together with the physician’s interpretation, is opening new possibilities for
medical data mining and a world of virtual research [8].
Knowledge Discovery in Databases (KDD) and Data Mining (DM) provide a
solution to the information flood problem by extracting valid, novel, potentially
useful and ultimately understandable patterns from data [9]. Patterns are
compact, semantically rich representations of raw data [10]: compact in that
they summarize, to some degree, the information contained in the original raw
data, and semantically rich in that they reveal new knowledge hidden in the
abundance of raw data.
Different data mining tasks yield different insights into the data:
classification captures the class of existing data or of a new item, clusters
reveal natural groups in data, decision trees detect characteristics that
predict (with respect to a given class attribute) the behavior of future
records, and so on [11]. Unorganized data must be processed to generate
meaningful and useful information from large databases. Large amounts of data
are usually organized with a Database Management System (DBMS) such as Oracle
or SQL Server. These systems use SQL, a specialized query language, to retrieve
data from a database. However, SQL is not always adequate to meet end-user
requirements for specialized and sophisticated information from a large,
unorganized data bank. Database researchers pay more attention to issues related
to the volume of data and are also concerned with the effective use of available
database techniques, such as efficient data retrieval mechanisms. It is
therefore necessary to look for alternative techniques to retrieve information
from large and mostly unorganized sources of data.
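The limitation described above can be illustrated with a small query. The sketch below uses Python's built-in sqlite3 module with a hypothetical patients table; the table name, columns and values are assumptions made for illustration, and SQLite stands in for the commercial DBMS products named in the text.

```python
import sqlite3

# In-memory database with a hypothetical patients table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patients (id INTEGER PRIMARY KEY, age INTEGER, systolic INTEGER)"
)
conn.executemany(
    "INSERT INTO patients (age, systolic) VALUES (?, ?)",
    [(45, 143), (63, 108), (83, 115)],
)

# A query returns exactly the rows that satisfy a condition stated in advance.
rows = conn.execute(
    "SELECT id, age FROM patients WHERE systolic > ? ORDER BY id", (110,)
).fetchall()
conn.close()

print(rows)
```

The query answers only the question that was asked (which patients exceed a given threshold). It does not suggest which thresholds or attribute combinations are worth asking about, which is the gap that data mining techniques are meant to fill.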
Data stored in medical databases are growing increasingly rapidly. Analyzing
these data is crucial for medical decision-making and management [12]. It has
been widely recognized that medical data analysis can enhance health care by
improving the performance of patient management tasks. Two main aspects define
the need for medical data analysis:
1) Support of specific knowledge-based problem-solving activities through the
analysis of patients’ raw data collected in monitoring.
2) Discovery of new knowledge that can be extracted through the analysis of
representative collections of example cases, described by symbolic or numeric
descriptors.
For these purposes, the increase in database size makes traditional manual data
analysis insufficient. To fill this gap, new research fields such as knowledge
discovery in databases (KDD) have grown rapidly in recent years. KDD is
concerned with the efficient computer-aided acquisition of useful knowledge
from large sets of data.
It also includes the choice of encoding schemes, preprocessing, sampling, and
projections of the data prior to the data mining step.
Data mining refers to the application of algorithms for extracting patterns
from data without the additional steps of the KDD process [12]. The KDD
process is often described as nontrivial; however, we take the larger view that
KDD is an all-encompassing concept. KDD is a process that involves many
different steps. The input to this process is the data, and the output is the
useful information desired by the users. However, the objective may be unclear
or inexact. The process itself is interactive and may take considerable time. To
ensure the usefulness and accuracy of the results, interaction with both domain
experts and technical experts may be needed throughout the process.
Data mining is the step in the KDD process that takes predominantly cleaned,
transformed data as input, searches the data using algorithms, and outputs
patterns and relationships of interest, in a particular representational form
or set of such forms (classification rules or trees, regression, clustering),
to the interpretation/evaluation step of the KDD process. This definition
implies that what data mining (in this view) discovers are hypotheses about
patterns and relationships. Those patterns and relationships are then subject
to interpretation and evaluation before they can be called knowledge.
A simple data mining process model includes the following steps [13]:
1) Select a target data set.
2) Data preprocessing.
3) Data transformation.
4) Data mining.
5) Interpretation/evaluation.
6) Presentation.
7) Documentation: the results are documented and reported to interested
parties at this last step.
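The first six steps listed above can be sketched as a chain of small functions. This is a minimal illustration, not the process model of [13]: the function names, the age banding used as a transformation, and the frequency count used as the mining step are all assumptions, and the records are hypothetical.

```python
from collections import Counter


def select_target(records, fields):
    """Step 1: keep only the fields of interest."""
    return [{f: r.get(f) for f in fields} for r in records]


def preprocess(records):
    """Step 2: drop records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def transform(records):
    """Step 3: discretize age into two bands (an arbitrary choice)."""
    return [
        {"age_band": "60+" if r["age"] >= 60 else "<60", "outcome": r["outcome"]}
        for r in records
    ]


def mine(records):
    """Step 4: count how often each (age band, outcome) pair occurs."""
    return Counter((r["age_band"], r["outcome"]) for r in records)


def evaluate(patterns, min_count):
    """Step 5: keep only patterns seen at least `min_count` times."""
    return {p: n for p, n in patterns.items() if n >= min_count}


# Hypothetical raw records, including one with a missing age.
raw = [
    {"age": 62, "outcome": "yes", "ward": "A"},
    {"age": 71, "outcome": "yes", "ward": "B"},
    {"age": 45, "outcome": "no", "ward": "A"},
    {"age": None, "outcome": "no", "ward": "C"},
]

selected = select_target(raw, ["age", "outcome"])
patterns = evaluate(mine(transform(preprocess(selected))), min_count=2)

# Step 6: presentation.
for (age_band, outcome), count in patterns.items():
    print(f"age {age_band} -> outcome {outcome}: {count} records")
```

Each function consumes the output of the previous one, which mirrors the point made earlier that the mining step itself receives data that have already been cleaned and transformed.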
Whereas in unsupervised learning no training set is used. Each data mining
technique serves a different purpose depending on the modeling objective. The
two most common modeling objectives are classification and prediction. Classi-
fication models predict categorical labels (discrete, unordered) while prediction
The whole process of data mining consists of three main phases, as shown in
Figure 2:
1) Data pre-processing: data cleaning, integration, selection and
transformation take place.
2) Data extraction: the actual data mining takes place.
3) Data evaluation and presentation: results are analyzed and presented.
• Government sources.
• Company or healthcare Databases.
• Old data can be used to develop new knowledge.
• New knowledge can be used to improve services or products.
• Improvements lead to:
  • Bigger profits.
  • More efficient service.
2.4.1. Classification
According to [1], classification is the process of predicting an output based on
given input data. The goal of classification is to accurately predict the target
class for each case in the data [17]. To do so, the algorithm processes a
training set and a predictive set. It first learns relationships between the
attributes of the training data set. It is then given the predictive data set,
which contains the same attributes but different data values. It analyzes these
data and produces a prediction by placing the records in different classes
based on the relationships between attributes [18] [19]. For example, in a
medical database the training set would hold relevant patient information from
previous records, whereas the prediction attribute is whether the patient is at
risk of a heart attack, as shown in Table 1 and Table 2.
Age  Heart rate  Blood pressure  Heart problem
45   96          143/69          ?
63   54          108/73          ?
83   95          115/68          ?
Classification uses predictive rules expressed as IF-THEN rules, where the first
part (the IF part) consists of a conjunction of conditions and the second part
(the THEN part) gives the value of the prediction attribute for records that
satisfy the first part. Using the above example, a rule predicting the first row
in the training set may be represented as follows: IF (age = 62 and heart rate >
72) or (age > 60 and blood pressure > 140/70) THEN Heart problem = yes. This
technique provides an 80% prediction rate, but the optimal solution is a rule
with a 100% prediction rate, which is very hard to achieve. The following are
the classification techniques used in health care.
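The IF-THEN rule quoted in this section can be written directly as a predicate. The sketch below is an illustration only. In particular, reading "blood pressure > 140/70" as "systolic above 140 and diastolic above 70" is an assumption, since the text does not say how two blood pressure readings are compared.

```python
def heart_problem(age, heart_rate, systolic, diastolic):
    """Apply the IF-THEN rule from the text to one patient record."""
    # Assumption: "blood pressure > 140/70" means both readings exceed the limits.
    high_blood_pressure = systolic > 140 and diastolic > 70
    return (age == 62 and heart_rate > 72) or (age > 60 and high_blood_pressure)


# Records to classify: age, heart rate, systolic, diastolic.
# The first three echo the unlabeled rows printed in this section;
# the last one is hypothetical.
predictive_set = [
    (45, 96, 143, 69),
    (63, 54, 108, 73),
    (83, 95, 115, 68),
    (62, 80, 120, 60),
]

for record in predictive_set:
    label = "yes" if heart_problem(*record) else "no"
    print(record, "-> heart problem =", label)
```

A rule of this kind is easy to read and audit, which is part of its appeal in clinical settings, but as the text notes, a single hand-written rule rarely classifies every record correctly.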
2.4.5. Regression
Regression is a data mining technique that identifies functions describing the
correlation among different variables. It is a mathematical tool and can be
easily constructed using training data sets.
Regression can be classified into linear and non-linear based on certain count of
Table 3. Usage history of classification techniques in HealthCare Sector. Source: Kamna Solanki et al.
No. | Author | Technique | Purpose
1. | Hu et al. [12] | SVM, decision tree, bagging and boosting | To analyze microarray data
2. | Huang et al. [13] | Hybrid SVM-based diagnosis model | For breast cancer
3. | Khan et al. [14] | Decision tree | For breast cancer
4. | Chang et al. [15] | Integrated decision tree model | For skin diseases in adults and children
6. | Moon et al. [17] | Decision tree algorithm | To characterize the smoking behaviour among smokers by assessing their psychological health conditions and consumption of alcohol
7. | Chien et al. [18] | Hybrid decision tree classifier | For chronic disease
2.4.6. Clustering
Clustering is an unsupervised learning technique, in contrast to classification,
which is a supervised learning method. It is best suited to large amounts of
data. It works by observing independent variables. The main task is to form
clusters from large databases on the basis of a similarity measure. Different
types of clustering algorithms are defined in Table 5, and various clustering
algorithms used in health care are described in Table 6.
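As an illustration of forming clusters on the basis of a similarity measure, the sketch below implements a basic k-means procedure, one well-known partitional method. The paper does not prescribe this algorithm. The use of Euclidean distance, the choice of the first k points as starting centroids, and the data points themselves are assumptions made for the example.

```python
from math import dist


def k_means(points, k, iterations=10):
    """Group points into k clusters by repeatedly relocating them to the nearest centroid."""
    centroids = list(points[:k])  # assumption: first k points seed the centroids
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(
                    tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
                )
            else:
                new_centroids.append(centroids[i])
        if new_centroids == centroids:
            break  # assignments have stabilized
        centroids = new_centroids
    return centroids, clusters


# Hypothetical two-dimensional measurements forming two visible groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (9.0, 8.0), (2.0, 1.0), (8.0, 9.0)]
centroids, clusters = k_means(points, k=2)
print(centroids)
print(clusters)
```

No class labels are supplied at any point, which is what distinguishes this from the supervised classification described earlier: the groups emerge from the distances between records alone.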
Table 4. Usage history of regression techniques in HealthCare sector. Source: Kamna Solanki et al.
No. | Author | Technique | Purpose
2. | Xie et al. [23] | Regression decision tree algorithm | To study number of hospitalization days
3. | Alapont et al. [24] | Linear regression | For effective utilization of hospital resources
Table 5. Types of clustering algorithms in Healthcare sector. Source: Kamna Solanki et al.
No. | Technique | Description
1. | Partitioned Clustering | From “n” data points, up to “k” clusters are obtained by relocating objects among the “k” clusters.
2. | Hierarchical Clustering | Data points are partitioned in tree form, either top-down or bottom-up.
3. | Density-based Clustering | It can handle clusters of arbitrary shape, whereas the above two can handle only spherical clusters.
Table 6. Usage history of clustering techniques in Healthcare sector. Source: Kamna Solanki et al.
the quality of the decision-making process in the pharmaceutical industry. One
of the major problems with pharmaceutical data is the lack of information.
Predicting drug behaviour is essential to find out whether a treatment helps
patients or worsens their health status. Data mining can help experts in
healthcare management [29] to make decisions in customer relationship
management. Patients will receive better and more affordable healthcare services
if large amounts of data on other patients’ satisfaction with the medical sector
are analyzed and adequately interpreted. Biological databases may be considered
the raw material for multi-relational data mining techniques [30], owing to
their wide variety of data types, often with complex relational structure.
At the University of Alabama [31], a surveillance system was implemented that
uses data mining techniques (association rules) to identify new and interesting
patterns in infection control data. Data collected over one year (1996) were
analyzed in three separate analyses, each using a different data partition
size.
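Association rules are usually scored by support (how often the items occur together) and confidence (how often the consequent holds when the antecedent does). The sketch below computes both measures. It is not the system described in [31], and the item names and records are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated probability of `consequent` given `antecedent`."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical infection-control records, one set of observed items per case.
transactions = [
    {"ICU", "organism_A"},
    {"ICU", "organism_A", "resistant"},
    {"ward", "organism_B"},
    {"ICU", "organism_A", "resistant"},
]

rule_support = support(transactions, {"ICU", "organism_A"})
rule_confidence = confidence(transactions, {"ICU", "organism_A"}, {"resistant"})
print(rule_support, rule_confidence)
```

A surveillance application would typically flag rules whose support or confidence changes noticeably from one data partition to the next, rather than rules that are merely frequent.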
The research in [32] presents the case study of American Healthways, which
provides diabetes management services to hospitals and health plans in order to
enhance the quality and lower the cost of treating patients with diabetes.
The authors of the present article focus their research on applying data mining
techniques to classify patients with thyroid disorders. In the existing
literature on the diagnosis of thyroid diseases, the authors have identified the
following data mining algorithms: decision trees, artificial neural networks,
support vector machines, expert systems, etc. For example, the diagnosis of
thyroid disorders using ANNs is discussed in [33] [34] [35]. In [33], the
authors used data from the UCI site, collected in 1992 by James Cook University,
Townsville, Australia. The total number of laboratory samples was 215. The data
mining algorithm used five attributes as predictors and one attribute as the
target. With one hidden layer of 6 neurons and the Logsig activation function
for that layer, the classification accuracy was 98.6% for thyroid disease. The
software used for testing the model was MATLAB 2012. In [34], the authors
present their work on three ANN algorithms for the diagnosis of thyroid disease:
the back-propagation algorithm (BPA), radial basis function (RBF) networks and
learning vector quantization (LVQ) networks. After model evaluation, the LVQ
network had the best accuracy rate, i.e. 98%.
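The network shape reported for [33] (five predictors, one hidden layer of 6 neurons with the Logsig activation) can be sketched as a single forward pass. This is not the model from [33]: the weights are arbitrary placeholders, the input values are hypothetical, and applying Logsig to the output neuron as well is an assumption.

```python
import math


def logsig(x):
    """Log-sigmoid activation: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """One forward pass: inputs -> hidden Logsig layer -> single output neuron."""
    hidden = [
        logsig(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    # Assumption: the output neuron also uses Logsig.
    return logsig(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)


# Placeholder parameters for a 5-input, 6-hidden-neuron network (untrained).
hidden_weights = [[0.1] * 5 for _ in range(6)]
hidden_biases = [0.0] * 6
output_weights = [0.2] * 6
output_bias = -0.5

sample = [0.2, 0.4, 0.1, 0.3, 0.5]  # hypothetical normalized predictor values
score = forward(sample, hidden_weights, hidden_biases, output_weights, output_bias)
print(score)
```

Training would adjust the weights and biases so that the output score separates the target classes. Only the inference step is shown here.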
The classification of thyroid nodules was performed with support vector
machines in [36]. A comparison study of data mining classification algorithms
(C4.5, C5.0) for thyroid cancer is presented in [37]. The authors of [37] used a
database with 400 records and 29 attributes, extracted from the UCI thyroid
database. The study indicated that the confidence level for the rule set
generated by C5.0 was higher than 95%. In [37], the C4.5 approach was
implemented on the Java platform using Eclipse and the XP operating system. A
diagnosis expert system based on fuzzy rules is described in [38], while a
three-stage expert system
Conflicts of Interest
The authors declare no conflicts of interest regarding the publication of this
paper.
References
[1] Kamma, S., Parul, B., Sandeep, D. and Sudluv (2016) Analysis of Application of
Data Mining Techniques in Healthcare. International Journal of Computer
Applications, 148, 16-21. https://doi.org/10.5120/ijca2016911011
[2] Koh, C.H. and Tan, G. (2011) Data Mining Applications in Healthcare. Journal of
Healthcare Information Management, 19, 64-72.
[3] Kaur, H. and Wasan, S.K. (2006) Empirical Study on Applications of Data Mining
https://doi.org/10.5120/21307-4126
[21] Durairaj, M. and Ranjani, V. (2013) Data Mining Applications in Healthcare Sector:
A Study. International Journal of Scientific and Technology Research, 2, 29-35.
[22] Vapnik, V. (1998) The Support Vector Method of Function Estimation. In: Suykens,
J.A.K. and Vandewalle, J., Eds., Nonlinear Modeling, Springer, Berlin, 55-85.
https://doi.org/10.1007/978-1-4615-5703-6_3
[23] Burges, C.J.C. (1998) A Tutorial on Support Vector Machines for Pattern
Recognition. Data Mining and Knowledge Discovery, 2, 121-167.
https://doi.org/10.1023/A:1009715923555
[24] Robertson, J. (2012) Data Mining in Doctor’s Office Helps Solve Medical Mysteries.
Vol. 1, Wal-Mart or Western Union United Healthcare Corp., New York.
[25] Ionita, I. and Ionita, L. (2016) Applying Data Mining Techniques in Healthcare.
Studies in Informatics and Control, 25, 385-394.
https://doi.org/10.24846/v25i3y201612
[26] Canlas Jr., R.D. (2015) Data Mining in Healthcare: Current Applications and Issues.
[27] Ranjan, J. (2007) Application of Data Mining Techniques in Pharmaceutical
Industry. Journal of Theoretical and Applied Information Technology, 3, 61-67.
[28] Diwani, S., Mishol, S., Kayange, D.S., Machuve, D. and Sam, A. (2013) Overview
Applications of Data Mining in Health Care: The Case Study of Arusha Region.
International Journal of Computational Engineering Research, 3, 73-77.
[29] Desikan, P., Hsu, K.W. and Srivastava, J. (2011) Data Mining for Healthcare
Management. SIAM International Conference on Data Mining, Arizona.
[30] Page, D. and Craven, M. (2016) Biological Applications of Multi-Relational Data
Mining. http://www.kdd.org/exploration_files/Page.pdf
[31] Brossette, S.E., Sprague, A.P., Hardin, M.K., Waites, B., Jones, W.T. and Moser, S.A.
(1998) Association Rules and Data Mining in Hospital Infection Control and Public
Health Surveillance. Journal of the American Medical Informatics Association, 5,
373-381. https://doi.org/10.1136/jamia.1998.0050373
[32] Ridinger, M. (2002) American Healthways Uses SAS to Improve Patient Care. DM
Review, 12, Article No. 139.
[33] Gharehchopogh, F.S., Molany, M. and Mokri, F.D. (2013) Using Artificial Neural
Network in Diagnosis of Thyroid Disease: A Case Study. International Journal on
Computational Sciences & Applications, 3, 49-61.
[34] Shukla, A. and Kaur, P. (2009) Diagnosis of Thyroid Disorders Using Artificial
Neural Networks. IEEE International Advance Computing Conference, Patiala, 6-7
March 2009, 1016-1020. https://doi.org/10.1109/IADCC.2009.4809154
[35] Prerana, E., Sehgal, P. and Taneja, K. (2015) Predictive Data Mining for Diagnosis
of Thyroid Disease Using Neural Network. International Journal of Research in
Management, Science & Technology, 3, 75-80.
[36] Chang, C.Y., Tsai, M.F. and Chen, S.J. (2008) Classification of the Thyroid Nodules
Using Support Vector Machines. International Joint Conference on Neural
Networks, Hong Kong, 1-8 June 2008, 3093-3098.
https://doi.org/10.1109/IJCNN.2008.4634235
[37] Upadhayay, A., Shukla, S. and Kumar, S. (2013) Empirical Comparison by Data
Mining Classification Algorithms (C 4.5 & C 5.0) for Thyroid Cancer Data Set.
International Journal of Computer Science & Communication Networks, 3, 64-68.
[38] Keleş, A. and Keleş, A. (2008) ESTDD: Expert System for Thyroid Diseases
Diagnosis. Expert Systems with Applications, 34, 242-246.
https://doi.org/10.1016/j.eswa.2006.09.028
[39] Chen, H.L., Yang, B., Wang, G., Liu, J., Chen, Y.D. and Liu, D.Y. (2012) A Three
Stage Expert System Based on Support Vector Machines for Thyroid Disease
Diagnosis. Journal of Medical Systems, 36, 1953-1963.
https://doi.org/10.1007/s10916-011-9655-8
[40] UCI Machine Learning Repository.
https://archive.ics.uci.edu/ml/machinelearning-databases/thyroid-disease