
(IJACSA) International Journal of Advanced Computer Science and Applications,
Vol. 12, No. 3, 2021

Deep Learning Hybrid with Binary Dragonfly Feature Selection for the Wisconsin Breast Cancer Dataset

Marian Mamdouh Ibrahim (1*), Dina Ahmed Salem (2), Rania Ahmed Abdel Azeem Abul Seoud (3)

(1) PhD Researcher, Faculty of Engineering, Fayoum University, Fayoum, Egypt
(2) Assistant Professor, Computer Department, Faculty of Engineering, Misr University for Science and Technology (MUST), Cairo, Egypt
(3) Professor of Digital Signals, Department of Electrical Engineering, Faculty of Engineering, Fayoum University, Fayoum, Egypt

Abstract—Breast cancer is the world’s top cancer affecting women, and its risk factors vary with place, lifestyle, and diet. Treatment procedures after the discovery of a confirmed cancer case can reduce the risk of the disease. Unfortunately, breast cancers that arise in low- and middle-income countries are diagnosed at a very late stage, at which the chances of survival are impeded and reduced. Early detection is therefore required, not only to improve the accuracy of discovering breast cancer but also to increase the chances of making the right decision on a successful treatment plan. There have been several studies that build software models utilizing machine learning and soft computing techniques for cancer detection. This research aims to build a model scheme to facilitate the detection of breast cancer and to provide an exact diagnosis; improving the accuracy of the proposed model has therefore been one of the key fields of study. The model is based on deep learning and intends to develop a framework that accurately separates benign and malignant breast tumors. This study optimizes the learning algorithm by applying the Dragonfly algorithm to select the best features and the best parameter values of the deep learning model. Moreover, it compares the deep learning results against those of the support vector machine (SVM), random forest (RF), and k-nearest neighbor (KNN) classifiers, which are chosen because they are the most reliable algorithms, with a solid fingerprint in the field of clinical data classification. Consequently, the hybrid model of deep learning combined with the binary dragonfly algorithm accurately classified benign and malignant breast tumors with fewer features. Besides, the deep learning model achieved better accuracy in classifying the Wisconsin Breast Cancer Database using all available features.

Keywords—Breast cancer; Wisconsin data set; classifiers; deep learning; feature selection; dragonfly

I. INTRODUCTION

Breast cancer is the most common cancer in women and, overall, the second leading cause of cancer death. In 2019, women were diagnosed with an estimated 268,600 new cases of invasive breast cancer, and approximately 2,670 cases were diagnosed in men [1]. An accurate diagnosis for the various sorts of cancer plays a great role for doctors, assisting them in determining and choosing the proper treatment. Lately, the application of various artificial intelligence (AI) classification methods has been proven to aid doctors in their decision-making process [2]. Recently, the use of AI classification techniques in the medical field generally, and in cancer detection particularly, has grabbed researchers’ attention. AI is beneficial in reducing medical human errors that might occur due to unskilled doctors [3].

More research is being done on breast cancer diagnosis using the Wisconsin Breast Cancer Database (WBCD) [4]. Many methods have been developed to achieve accurate and efficient diagnosis results, and several experiments were performed on the WBCD using multiple classifiers and feature selection techniques. Many of them show good classification accuracy. For example, in [5] supervised learning classifiers such as Naïve Bayes (NB), the Support Vector Machine with RBF kernel (SVM-RBF), and neural networks (NN) are compared to find the best classifier on the WBCD, and SVM-RBF has the best outcome, achieving 96.84%. The least squares Support Vector Machine (SVM) obtained a classification accuracy of 98.53% [6]. In [7], Linear Regression achieved an average training accuracy of 96.093%, whereas the Multilayer Perceptron (MLP) reached 99.038% and Softmax Regression an average training accuracy of 97.366573%; in [8], the accuracy obtained by SVM (97.13%) is better than the accuracy obtained by KNN. The prediction accuracy of the SVM (linear kernel) in [9] reaches 97.14%, against an accuracy of 95.71% using the RBF kernel and 97.14% using an RF classifier for breast cancer detection. The accuracy obtained in [10] from a system which combines rough set theory with a backpropagation neural network is 98.6% on the breast cancer dataset; the first stage handles missing values to obtain a smooth data set and selects appropriate attributes from the clinical dataset by the indiscernibility relation method, and the second stage is classification using a backpropagation neural network. The KNN algorithm is used in [11] with several different types of distances and classification rules for the diagnosis and classification of cancer, with experiments conducted on the WBCD. The results advocate the use of the KNN algorithm with both the Euclidean and Manhattan distances, which give the best results (98.70% for the Euclidean distance and 98.48% for Manhattan with k = 1); these values are not significantly affected even when k is increased from 1 to 50. SVM and KNN used individually in [12] achieved accuracies of 98.57% and 97.14%, respectively. This work aims to automatically design and modify the parameters of a deep learning model hybridized with the Dragonfly algorithm for breast cancer diagnosis.
*Corresponding Author


II. MATERIALS AND METHODS

A. Machine Intelligence Library

The software developed for this study is written using Spyder, an interactive development environment capable of advanced editing, interactive testing, debugging, and introspection for the Python programming language (version 3.7 was used). The Keras [13] neural network API was used for deep learning in the developed method; it is a high-level neural network API supporting Python that produces results rapidly and is highly modular, minimalist, and extensible. Keras with the Google TensorFlow backend is used to implement the deep learning algorithms in this study, with the aid of other scientific computing libraries: matplotlib [14], a comprehensive library for creating interactive and animated visualizations in Python; NumPy [15], a library for the Python programming language that adds support for large multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays; and scikit-learn [16], a free machine learning library for Python that provides several classification, regression, and clustering algorithms, including support vector machines, k-means, and random forests.

B. Dataset Description

The WBCD [4] was downloaded from the UCI machine learning repository for breast cancer classification [17]. This dataset is commonly used by researchers who apply machine learning methods to the classification of breast cancer. It is composed of numerical attributes that were assessed by fine needle aspiration (FNA) of human breast tissue. The WBCD has 699 instances and 10 attributes, including the class attribute. Each instance belongs to one of two possible classes: malignant (M) or benign (B). Every attribute is represented as an integer between 1 and 10. The attributes are: uniformity of cell size, clump thickness, uniformity of cell shape, single epithelial cell size, marginal adhesion, bare nuclei, normal nucleoli, bland chromatin, and mitosis.
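
As an illustration, a minimal loading sketch is shown below. It is an assumption-laden example rather than the authors' code: the file name, column order, and the "?" missing-value marker follow the UCI documentation for the Breast Cancer Wisconsin (Original) data set.

```python
import pandas as pd

# Hypothetical loading sketch for the UCI "breast-cancer-wisconsin.data" file;
# missing values are coded as "?" and the class column uses 2 = benign, 4 = malignant.
columns = ["id", "clump_thickness", "cell_size_uniformity", "cell_shape_uniformity",
           "marginal_adhesion", "single_epithelial_cell_size", "bare_nuclei",
           "bland_chromatin", "normal_nucleoli", "mitoses", "class"]
wbcd = pd.read_csv("breast-cancer-wisconsin.data", header=None,
                   names=columns, na_values="?")
X = wbcd.drop(columns=["id", "class"]).to_numpy()
y = (wbcd["class"] == 4).astype(int).to_numpy()  # 1 = malignant, 0 = benign
```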

C. Data Preprocessing

Preparing data for use in a machine learning (ML) framework is significant: data preparation requires at least 80 percent of the total time expected to create an ML system. Data preparation has three main phases: cleaning; normalizing and encoding; and splitting. Each of the three phases has several steps. Equation (1) is used to normalize the dataset attributes:

Z = (X − µ) / σ   (1)

where X represents the dataset attributes, µ represents the mean value of each dataset attribute x(i), and σ represents the corresponding standard deviation. This normalization technique was implemented using the StandardScaler of scikit-learn.
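
A minimal sketch of this standardization, using a toy attribute matrix in place of the WBCD, is:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])  # toy attribute matrix

scaler = StandardScaler()             # implements Z = (X - mu) / sigma per column
Z = scaler.fit_transform(X)
print(Z.mean(axis=0), Z.std(axis=0))  # ~0 means and unit standard deviations
```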

D. Principal Component Analysis

Principal Component Analysis (PCA) [18] is a dimension reduction method that combines related features. Dimensionality reduction [19] is a process used in data mining in which the number of random variables under consideration is reduced; it is an essential step in the efficient analysis of large, high-dimensional data sets. PCA performs dimensionality reduction whilst retaining as much as feasible of the variation present in the high-dimensional space. PCA is probably the oldest and certainly the most popular technique for computing lower-dimensional representations of multivariate data. The technique is linear in the sense that the components are linear combinations of the original variables (features), but non-linearity in the data is preserved for effective visualization. PCA transforms the initial set of variables into a different set of linear combinations, referred to as the principal components (PC), with variance-specific properties; this condenses the system's dimensionality while retaining information about the relationships among the variables. The analysis is carried out by calculating the covariance matrix of the data set and analyzing its eigenvalues, together with their respective eigenvectors, arranged in descending order.
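
A minimal sketch of PCA in scikit-learn follows; the random matrix and the choice of two components are illustrative assumptions, since the paper does not fix a component count here.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 9))          # stand-in for the nine WBCD attributes

pca = PCA(n_components=2)              # keep the two highest-variance components
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_)   # fraction of total variance per component
```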

E. Classification Techniques

Classification aims to develop a set of models that can correctly determine the class of different objects. There are three types of inputs to such models: (a) a set of objects described as training data, (b) the dependent variables, and (c) classes, which may be a group of variables describing various characteristics of the objects. Once a classification model is built, it can be utilized to classify objects whose class information is unknown [20]. Numerous sorts of classifiers have been utilized for cancer diagnosis; among them are NN, SVM, KNN, NB, and RF. They are used to classify cancer datasets into malignant and benign tumors.

1) Support vector machine: The support vector machine (SVM) classifier is a type of supervised machine learning classification algorithm. It is applied in classifying cancer because it is a non-probabilistic, binary, and nonlinear statistical tool which works by separating the space into two regions by a straight line, or by a hyperplane in higher dimensions. It examines the data, recognizes the patterns, and classifies the data based on common attributes by using kernel tricks. A kernel is a numerical function used by the SVM: it takes data as input and converts it into the required form. Various kinds of kernel functions are utilized by the SVM algorithm, for example linear, nonlinear, polynomial, radial basis function (RBF), and sigmoid kernels.

2) Naïve Bayes: Naïve Bayes (NB) is a probabilistic classifier based on Bayes' theorem. Rather than predictions, it produces probability estimates: for each class value, it estimates the probability that a given instance belongs to that class. An advantage of the NB classifier is that it requires only a small amount of training data in order to estimate the parameters that are mandatory for classification.

3) Artificial Neural Network: An Artificial Neural Network (ANN) is a numerical model based on biological neural networks. It comprises an interconnected group of artificial neurons and processes information employing a connectionist approach to computation. In most cases, an ANN is a robust framework that changes its structure based on external or internal data that flows through the network during the learning phase. One of the fundamental advantages of ANN over conventional methods is its ability to capture the complex and nonlinear interactions between prognostic markers and the outcome to be anticipated.


4) Random forest (RF): The random forest (RF) algorithm is a supervised classification algorithm that creates a forest with several trees. It is a flexible, easy-to-use machine learning algorithm that mostly produces a great result; due to its simplicity, it is also one of the most used algorithms. In general, the more trees in the forest, the more robust the forest appears; likewise, in the random forest classifier, a higher number of trees in the forest indicates higher accuracy results.

5) K-nearest Neighbors: K-nearest Neighbors (KNN) is one of the most used algorithms in machine learning. It is an instance-based learning method that does not require a separate learning phase: the model is the training sample itself, together with a distance function and a choice function for the class based on the classes of the nearest neighbors. Before classifying a new element, it must be compared with the other elements using a similarity measure; its k nearest neighbors are considered, and the class that appears most often among the neighbors is assigned to the element being classified. The appropriate functioning of the method relies on the choice of some parameters, such as the parameter k, which represents the number of neighbors chosen when assigning the new element to a class, and the distance measure used.
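
For concreteness, a sketch of these four traditional classifiers in scikit-learn is given below. The hyperparameters (RBF kernel, 10 trees, k = 3) mirror settings later reported in Table I, while the data split and the use of scikit-learn's built-in (Diagnostic) breast cancer data as a stand-in for the WBCD are assumptions.

```python
from sklearn.datasets import load_breast_cancer   # stand-in for the WBCD
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "SVM (RBF)": SVC(kernel="rbf"),
    "RF (10 trees)": RandomForestClassifier(n_estimators=10, random_state=0),
    "KNN (k=3)": KNeighborsClassifier(n_neighbors=3),
    "NB": GaussianNB(),
}
for name, model in models.items():
    print(name, model.fit(X_tr, y_tr).score(X_te, y_te))
```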

III. DEEP LEARNING

Deep learning (DL) is one of the numerous strategies found within machine learning (ML), as shown in Figure 1, where ML [21] is a discipline of artificial intelligence that enables software to estimate results with better accuracy without the need to write explicit code for the task. DL methods are utilized in ML for quick learning and for handling large and complex data. DL is widely utilized in numerous software disciplines, for example computer vision, speech and sound processing, bioinformatics, computer games, search engines, manufacturing, online advertising, and finance. DL provides highly successful results in estimation and classification processes.

Fig. 1. The Relationship between Artificial Intelligence, Machine Learning, Deep Learning, and Artificial Neural Networks.

DL describes a family of computational models composed of many layers of data processing, which make it conceivable to learn by representing data through several levels of abstraction [22]. From a large amount of training data, these models discover recurrent structures by automatically refining their interior parameters via a backpropagation algorithm, as shown in Figure 2. Each layer of the network transforms the signal nonlinearly to increase the selectivity and invariance of the representation. With a sufficient number of layers, the network can generate a hierarchy of representations that makes the model both sensitive to very small details and insensitive to large variations. The classification problem is an important component of deep learning, since it is focused on judging which predefined category a new sample belongs to, according to a training set containing a certain number of known samples. The classification problem is also called supervised classification, since all samples in the training set are labeled and all categories are predefined.

Fig. 2. The Analogy between an Artificial Neuron and a Biological Neuron. X Represents the Inputs; the Bias b, the Activation Function ϕ, and the Weights w are Adjusted Automatically by the Network.

The output of a single artificial neuron is defined by the formula in (2):

Y = f(Σ_j w_j x_j + b)   (2)

where w_j are the network weights, b is a bias term, and f is a specified activation function. As Figure 3 shows, a natural extension of this simple model is attained by combining multiple neurons to form a so-called hidden layer.

Fig. 3. Representation of Layers of Deep Learning.
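
As a concrete illustration of Eq. (2), a minimal NumPy sketch of a single neuron is given below; the input, weight, and bias values are hypothetical.

```python
import numpy as np

def neuron(x, w, b, f):
    """Single artificial neuron: Y = f(sum_j w_j * x_j + b), as in Eq. (2)."""
    return f(np.dot(w, x) + b)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
x = np.array([0.5, 0.2, 0.9])      # hypothetical inputs
w = np.array([0.4, -0.6, 0.1])     # hypothetical weights
print(neuron(x, w, b=0.05, f=sigmoid))
```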


IV. FEATURE SELECTION

Feature selection (FS) is a pre-processing method that has been demonstrated to significantly affect the performance of data mining techniques [23] (e.g. classification), in terms of either the quality of the extracted patterns or the running time required to analyze the complete dataset. It reduces the number of features; removes irrelevant, redundant, or noisy features; and brings palpable benefits for applications: speeding up a data mining algorithm, improving learning accuracy, and leading to better model comprehensibility. FS methods are arranged into filters and wrappers [24]. While a learning algorithm (e.g. a classifier) is employed by a wrapper to evaluate each feature subset, filters rely on the data itself to evaluate the feature subset using designated measures (e.g. information gain) [25].

Searching for an ideal subset of features is a major challenge when solving feature selection problems. The primary target when selecting features is to find a set of M features from an original set of N, where M < N, without loss of information. An impractical approach to this problem is to create all possible subsets: if the dataset includes N features, then there are 2^N subsets to be generated and evaluated, which is a computationally expensive task [23]. This paper introduces the Dragonfly algorithm as a feature selector and studies its effect on accuracy.

A. Dragonfly Algorithm (DA)

Dragonfly is an open-source Python library for scalable Bayesian optimization, which is used for optimizing black-box functions whose evaluations are usually expensive. Beyond vanilla optimization techniques, it provides an array of tools to scale Bayesian optimization up to expensive, large-scale problems. These include features that are especially suited to high-dimensional optimization; parallel evaluations in synchronous or asynchronous settings, meaning that multiple evaluations are conducted in parallel; multi-fidelity optimization, which uses cheap approximations to speed up the optimization process; and multi-objective optimization, which optimizes multiple functions simultaneously. It is compatible with Python 2 (>= 2.7) and Python 3 (>= 3.5) and has been tested on Linux, macOS, and Windows platforms.

DA is a recently well-established population-based optimizer proposed by Mirjalili in 2016 [26]. The hunting and migration strategies of dragonflies are the basis of the DA algorithm. The hunting method is known as a static swarm (feeding), in which all members of a swarm fly in small clusters over a limited space to discover food sources. Dynamic swarming is considered the migration strategy of dragonflies (migratory): in this phase, the dragonflies take off in bigger clusters, and as a result the swarm can migrate. Dynamic and static groups are shown in Figure 4. Moreover, as in other swarm-based methods, the operators of DA realize two main concepts: intensification, encouraged by the dynamic swarming activities, and diversification, motivated by the static swarming activities.

Fig. 4. Dynamic and Static Dragonflies. (a) Static Swarm. (b) Dynamic Swarm.

Five behaviors characterize DA, where X is the position vector, X_j is the j-th neighbor of X, and N denotes the neighborhood size:

Separation: dragonflies use this strategy to separate themselves from other agents. This procedure is formulated as (3):

S_i = Σ_{j=1..N} (X − X_j)   (3)

Alignment: shows how an agent sets its velocity to the velocity vector of adjacent dragonflies. This concept is modeled as in (4), where V_j indicates the velocity vector of the j-th neighbor:

A_i = (Σ_{j=1..N} V_j) / N   (4)

Cohesion: shows the members' inclination to move toward the nearest mass center. This step is formulated as in (5):

C_i = (Σ_{j=1..N} X_j) / N − X   (5)

Attraction: illustrates the propensity of members to step toward the food source. The attraction tendency between the food source and the i-th agent is computed as in (6), where F_loc is the food source's location:

F_i = F_loc − X   (6)

Distraction: illustrates the proclivity of dragonflies to keep themselves away from a conflicting enemy. The distraction between the enemy and the i-th dragonfly is computed as in (7), where E_loc is the enemy's location:

E_i = E_loc + X   (7)

In DA, the fitness of the food source and its position vector are updated based on the fittest agent found so far, while the fitness value and position of the enemy are taken from the worst dragonfly. This helps DA converge towards more promising regions of the solution space and, in turn, avoid non-promising areas. The position vectors of dragonflies are updated using two rules: the step vector (ΔX) and the position vector (X). The step vector indicates the dragonflies' direction of motion and is calculated as in (8):

ΔX_{t+1} = (s S_i + a A_i + c C_i + f F_i + e E_i) + w ΔX_t   (8)

where s, a, c, f, e, and w are the weighting factors of the different components. The position vector of each member is then calculated as in (9), where t is the iteration:

X_{t+1} = X_t + ΔX_{t+1}   (9)
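
A minimal NumPy sketch of these five behaviors and the updates of Eqs. (8) and (9) follows; the weighting constants are hypothetical, and the neighborhood is taken to be the whole swarm for brevity.

```python
import numpy as np

def da_step(X, dX, food, enemy, s=0.1, a=0.1, c=0.7, f=1.0, e=1.0, w=0.9):
    """One Dragonfly Algorithm update for a swarm X (n agents x d dimensions).

    Computes separation S, alignment A, cohesion C, attraction F to the food
    source, and distraction E from the enemy (Eqs. 3-7), then applies the
    step update (8) and the position update (9).
    """
    n = len(X)
    S = np.array([np.sum(X[i] - X, axis=0) for i in range(n)])  # Eq. (3)
    A = np.tile(dX.mean(axis=0), (n, 1))   # Eq. (4); step vectors stand in for velocities
    C = X.mean(axis=0) - X                 # Eq. (5)
    F = food - X                           # Eq. (6)
    E = enemy + X                          # Eq. (7)
    dX_next = s * S + a * A + c * C + f * F + e * E + w * dX    # Eq. (8)
    return X + dX_next, dX_next                                 # Eq. (9)

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(5, 3))
dX = np.zeros_like(X)
X, dX = da_step(X, dX, food=np.zeros(3), enemy=np.ones(3))
```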


The pseudo-code of the DA algorithm is given in Algorithm 1 as follows:

Algorithm 1: Dragonfly Algorithm
  Initialize the population X_i (i = 1, 2, ..., n)
  Initialize the step vectors ΔX_i (i = 1, 2, ..., n)
  while (t < Max_Iteration)
    Evaluate each dragonfly
    Update the food source (F) and the enemy (E)
    Update the main coefficients (i.e., w, s, a, c, f, and e)
    Calculate S, A, C, F, and E using Eqs. (3) to (7)
    Update the step vectors ΔX using Eq. (8)
    Update the positions X_{t+1} using Eq. (9)
  end while
  Return the best agent

V. EXPERIMENTAL DISCUSSION

The data is split into a training set (80%), a testing set (10%), and a validation set (10%) several times, and a Cross-Validation (CV) approach with five folds was utilized to evaluate the accuracy, sensitivity, and specificity of each of the classifiers.

A. Model

The proposed model in this study is represented by the block diagram shown in Figure 5, which explains the process conducted within the model. It is planned in three phases: the first utilizes traditional classifiers, for example SVM, NB, RF, and KNN; the second applies deep learning to enhance the accuracy of breast cancer detection; and the third applies feature selection using DA with deep learning classification to improve the performance further. A sketch of how the DA-driven phase can be wired to the deep learning model is given below.

Fig. 5. Block Diagram of the Model (read the WBCD dataset → imputation → normalization → dataset split → evaluation function based on the DL model → accuracy calculation → best parameters using DA).
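
To make the third phase concrete, a wrapper-style fitness function is sketched below; the architecture, epoch count, and interfaces are illustrative assumptions drawn from the settings reported later (9 neurons per layer, batch size 16), not the authors' exact code.

```python
import numpy as np
from tensorflow import keras

def build_model(n_features):
    # Hypothetical architecture; the paper varies layer counts and activations (Table II).
    model = keras.Sequential([
        keras.layers.Dense(9, activation="relu", input_shape=(n_features,)),
        keras.layers.Dense(9, activation="sigmoid"),
        keras.layers.Dense(2, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model

def fitness(mask, X_train, y_train, X_val, y_val):
    """Fitness of a binary feature mask = validation accuracy of the DL model."""
    selected = np.flatnonzero(mask)     # indices of the features whose bit is 1
    if selected.size == 0:
        return 0.0                      # discard empty feature subsets
    model = build_model(selected.size)
    model.fit(X_train[:, selected], y_train, epochs=100, batch_size=16, verbose=0)
    return model.evaluate(X_val[:, selected], y_val, verbose=0)[1]
```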

B. Missing Dataset Technique

Training a model on a dataset containing missing values may significantly degrade the quality of the deep learning model. For this reason, the usability of the WBCD for training was ensured by correcting the 16 incomplete records found by a statistical missing-value analysis. There are two techniques for handling the missing data: the Mean Imputation technique and the Missing Data Ignoring technique. Mean Imputation works by calculating the mean of the readily available values in a column and then substituting it for the missing values, each column being treated independently of the others [27]. The Missing Data Ignoring technique simply deletes the cases that contain missing data.

C. Normalization of Dataset

A normalization process to the range 0-1 was applied to the data set to eliminate the long learning time caused by the size of the data set. The MinMaxScaler method of scikit-learn was used in this process.

D. Splitting the Dataset

The data used in the experiments are separated into training and testing groups using the train_test_split function in Python. The allocation of the available data among these sets is vital for the objectivity of the evaluation. Across the different tests, the data set in the suggested model was allocated as 80% (559 samples) for training and 20% (140 samples) for testing and validation; the allocation process is shown in Figure 6. The cross-validation method was utilized in the implementation of this process [28].

Fig. 6. Splitting the Dataset using Cross-validation.
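
A minimal sketch of this split and the five-fold cross-validation, with X and y as prepared in the loading sketch of Section II-B (an assumption), is:

```python
from sklearn.model_selection import train_test_split, StratifiedKFold

# 80% / 20% split as described above; stratify preserves the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Five-fold cross-validation over the training portion, as in Section V.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in cv.split(X_train, y_train):
    pass  # fit on X_train[train_idx], validate on X_train[val_idx]
```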


E. Neural Network Model

Any neural network (NN) model has multiple parameters that control its performance, for example the number of hidden layers, the number of nodes in each layer, the type of activation function, the number of epochs, and the batch size.

F. Epoch

The NN learns the patterns of the input data by reading the input dataset and applying various calculations to it. However, it does not do so only once: it learns over and over, utilizing the input dataset and the learning outcomes of the previous trials. An epoch is one such pass of learning over the input dataset. Expanding the number of epochs does not generally imply that the network will give better results, and it may cause overfitting. Using the trial-and-error method, the number of epochs was chosen such that the outcomes stayed the same over the last few cycles.

G. Activation Functions of the Neural Network

The activation functions used in the layers of the created neural network are described as follows:

1) Rectified Linear Unit (ReLU): The ReLU activation function is utilized in the input and hidden layers of the neural network. ReLU, shown in Figure 7, is an activation function that recently gained popularity for its practicality in deep learning; it enables the neural network to learn faster [21]. The numerical expression of the function is provided in (10):

f(x) = max(0, x)   (10)

Fig. 7. ReLU Activation Function.

2) Sigmoid activation function: The sigmoid activation function is used in the output layer of the neural network. It is a function whose output lies in the range (0, 1), as seen in Figure 8. The numerical expression of the function is given in (11):

σ(x) = 1 / (1 + exp(−x))   (11)

Fig. 8. Sigmoid Activation Function.

3) Softmax activation function: The Softmax is a function that turns a vector of K real values into a vector of K real values that sum to 1. The input values can be positive, negative, zero, or greater than one, but the Softmax transforms them into values in the range 0 to 1, as shown in Figure 9, so that they can be interpreted as probabilities. Large multi-layer neural networks end in a penultimate layer that outputs real-valued scores that are not conveniently scaled, which makes working with them complicated. In the current study, the Softmax is very helpful as it turns the scores into a normalized probability distribution; consequently, it is natural to append a Softmax function as the final layer of the neural network.

Fig. 9. Softmax Activation Function.

H. Dropout

Dropout is one of the methods utilized to prevent memorization. In each iteration, it randomly removes some neurons from a layer at a specified rate; the process of dropout is shown in Figure 10, where the crossed units have been dropped out of the network.

Fig. 10. Dropout Neural Network Model. (a) A Standard Neural Network. (b) The Same Network after Dropout is Applied; Dotted Lines Indicate a Dropped Node.
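
Putting these pieces together, a hedged sketch of one of the two-hidden-layer configurations from Table II (sigmoid hidden layers, softmax output, dropout 0.5; the exact layer widths are an assumption based on the 9 neurons per layer noted in Section VI) is:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Sketch of a 2-hidden-layer configuration; 9 inputs match the nine WBCD attributes.
model = keras.Sequential([
    layers.Dense(9, activation="sigmoid", input_shape=(9,)),
    layers.Dropout(0.5),                    # randomly drops half the units each step
    layers.Dense(9, activation="sigmoid"),
    layers.Dropout(0.5),
    layers.Dense(2, activation="softmax"),  # class probabilities (benign vs. malignant)
])
```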

I. Optimization

Optimization is a basic issue in the learning process of deep learning applications. Optimization techniques are utilized to find the optimal values when solving non-linear problems; common algorithms include RMSprop, Adagrad, Adadelta, Adam, and Adamax, and these algorithms differ in terms of performance and speed. In this study, the Adaptive Moment Estimation (Adam) optimization algorithm was applied.

J. Loss Function

The loss function is a function that measures both the error rate and the performance of a designed model. In DL, the loss function is defined at the last layer of the NN. In DL applications, the function calculates the dissimilarity between the estimate of the designed model and the required real value; when a model with good estimation capability is designed, the difference between the real value and the estimated value is low, whereas a higher loss value indicates that the designed model contains defects. In the literature, there are various loss functions such as mean squared error, mean absolute percentage error, mean squared logarithmic error, hinge, log-cosh, sparse categorical cross-entropy, binary cross-entropy, Kullback-Leibler divergence, and Poisson, among many others. In this study, the sparse categorical cross-entropy loss function was utilized.
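
A small self-contained illustration of this loss on hypothetical predictions is shown below; in Keras the pairing above is typically selected with model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"]).

```python
import numpy as np
from tensorflow.keras.losses import SparseCategoricalCrossentropy

loss_fn = SparseCategoricalCrossentropy()
y_true = np.array([0, 1, 1])                              # integer class labels
y_pred = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])   # softmax outputs
print(float(loss_fn(y_true, y_pred)))                     # mean cross-entropy
```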


K. Early Stopping

In models trained iteratively on data, learning must be terminated at the right time. Otherwise, if training is not stopped, all of the samples in the training data set will be memorized by the system, which decreases the capability of estimating unknown samples; the same outcome also arises in the case of over-training. Conversely, if training is terminated too early, the performance of the system declines because it could not fully analyze the data. To guard against the possibility of overfitting, an early-stopping parameter was defined: the training is stopped, regardless of the number of iterations, once the monitored performance stops improving.
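
A minimal sketch of this mechanism with the Keras EarlyStopping callback (the patience value is an assumed setting) is:

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop once the validation loss stops improving; patience is an assumption.
early_stop = EarlyStopping(monitor="val_loss", patience=10,
                           restore_best_weights=True)
# Usage with the model sketched earlier:
# model.fit(X_train, y_train, validation_split=0.1,
#           epochs=2000, batch_size=16, callbacks=[early_stop])
```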

VI. RESULTS AND DISCUSSION

In this section, the performance evaluation is discussed using accuracy, defined as the percentage of correct predictions. Table I shows a comparative study of the different classifiers, covering RF, KNN, NB, and SVM. Some experiments handled missing data with the Mean Imputation technique and others with the Missing Data Ignoring technique. The table shows that RF gives a better result when using 10 trees, and that KNN performs well with 3 neighbors, which reduces the complexity of the model and consumes less processing time. PCA+SVM with the RBF kernel, when using the missing-data-ignoring technique, is considered the best classifier compared to the others, achieving 99%.

TABLE I. RESULTS OF TRADITIONAL CLASSIFIERS

Classifier | Missing values | PCA | Accuracy (%)
RF (100) | Mean | — | 97
RF (10) | Mean | — | 95
RF (10) | Mean | 1 | 98
RF (100) | Mean | 1 | 98
RF (10) | Remove | — | 98
RF (100) | Remove | — | 95
KNN (10) | Mean | 1 | 97
KNN (3) | Mean | 1 | 98
KNN (3) | Remove | — | 96
NB | Mean | 1 | 97
SVM (RBF) | Remove | 1 | 99
SVM (RBF) | Mean | — | 96
SVM (RBF) | Remove | — | 97
SVM (RBF) | Mean | 1 | 98
SVM (linear) | Mean | 1 | 97

A. Deep Learning Usage with the Dataset

The proposed model utilizes two layers at the start and then experiments with more layers; it has been observed that the convergence time is larger for deeper networks. Many parameters control the deep learning model. One of them is the number of hidden layers: if the data is less complex and has fewer features, then a neural network with 1 to 2 hidden layers will work, but if the data has many features then 3 to 5 hidden layers can be used to reach an optimum solution. It should be noticed that increasing the number of hidden layers also increases the complexity of the model, which may sometimes lead to overfitting. Another parameter is the number of hidden neurons: it should lie between the size of the input layer and the size of the output layer; it may be 2/3 the size of the input layer plus the size of the output layer, and it should be less than twice the size of the input layer [29]. The experiments use a batch size of 16 and 9 neurons in each layer; the results are shown in Table II.

TABLE II. DEEP LEARNING RESULTS

Number of layers | Epochs | Activation functions | Dropout | Accuracy (%)
2 | 250 | Sigmoid, Softmax | 0.5 | 99
2 | 100 | Sigmoid, Softmax | 0.3 | 98.54
2 | 100 | Sigmoid, Softmax | 0.5 | 97.85
2 | 2000 | ReLU, Sigmoid | — | 99.3
3 | 150 | Sigmoid, Sigmoid, Softmax | 0.3 | 98
3 | 150 | ReLU, ReLU, Softmax | 0.3 | 97
3 | 250 | ReLU, ReLU, Softmax | 0.3 | 97.08
3 | 1000 | ReLU, ReLU, Softmax | 0.3 | 97.08
3 | 100 | ReLU, Sigmoid, Softmax | 0.5 | 98.54
3 | 100 | ReLU, Sigmoid, Softmax | 0.3 | 97.8
4 | 250 | Sigmoid, Sigmoid, Sigmoid, Softmax | 0.5 | 99
4 | 1000 | Sigmoid, Sigmoid, Sigmoid, Softmax | 0.3 | 98.5
4 | 100 | Sigmoid, Sigmoid, Sigmoid, Softmax | 0.3 | 99.3
4 | 150 | Sigmoid, Sigmoid, Sigmoid, Softmax | 0.3 | 97.08
4 | 150 | Sigmoid, Softmax, Softmax, Softmax | 0.3 | 98.54
5 | 15 | Sigmoid, …, Softmax | — | 93.3
5 | 15 | Softmax, …, Softmax | — | 94.3
5 | 20 | Softmax, …, Softmax | — | 95.2
5 | 250 | Sigmoid, …, Softmax | 0.25 | 96

As shown in Table II, the best accuracy achieved is 99.3% with 2 hidden layers and 2000 epochs, while the accuracy reduces to 99% with only 250 epochs. More epochs mean more iterations and more consumption of time and resources; however, the difference in accuracy is not significant enough to justify the extra time consumption. Also, the same accuracy level of 99.3% is attained using 4 hidden layers and only 100 epochs. Plots of the characteristics of the 4-hidden-layer model are shown in Figure 11. In graph (a) the training accuracy visibly increases over time until it reaches nearly 95%, while the validation accuracy reaches a plateau in the range of 98-99.3% after 21 epochs. Moreover, the validation loss, presented in graph (b), reaches its minimum after 50 epochs and then halts, while the training loss keeps decreasing exponentially until it drops to nearly 0.

Fig. 11. (a) Training vs Validation Loss, (b) Training vs Validation Accuracy of the 4-Hidden-Layer Model with 100 Epochs.


The characteristics of the 2-hidden-layer model are shown in Figure 12, where graph (a) explains the behavior of the training and validation loss and graph (b) presents the accuracy. The validation loss clearly reaches its minimum after 200 epochs and then halts, while the training loss keeps decreasing exponentially, reaching a minimum value of nearly 0 after 250 epochs and then remaining in a steady state. Also, the training and validation accuracy increase linearly until 30 epochs, after which they reach nearly 100%.

Fig. 12. Two-Hidden-Layer Model with 2000 Epochs: (a) Training vs Validation Loss, (b) Training vs Validation Accuracy.

After applying feature selection using DA with deep learning, the results are as shown in Table III. As noticed in Table III, the best result, 97.907%, is obtained when choosing a population of 20, 100 iterations, and 100 epochs; this run recommended five attributes as the most important: uniformity of cell size, uniformity of cell shape, bare nuclei, bland chromatin, and mitosis. These ML methods have been chosen because the results obtained from them have proven more accurate than the traditional classifiers; in addition, applying these ML techniques to bigger data in the future will be faster. The main focus is to choose the most suitable classifier model for obtaining the highest accuracy and to improve on similar previous works on the same database.
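
The "feature mask" column of Table III is a 9-bit selection mask over the dataset attributes of Section II-B; a small decoding sketch is given below, where the attribute ordering is an assumption taken from the dataset description rather than from the paper.

```python
ATTRIBUTES = ["clump thickness", "uniformity of cell size",
              "uniformity of cell shape", "marginal adhesion",
              "single epithelial cell size", "bare nuclei",
              "bland chromatin", "normal nucleoli", "mitosis"]

def decode_mask(mask):
    """Map a 9-bit selection mask, e.g. '101001011', to attribute names."""
    return [name for bit, name in zip(mask, ATTRIBUTES) if bit == "1"]

print(decode_mask("101001011"))  # five selected attributes under this ordering
```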

TABLE III. DEEP LEARNING MIXED WITH DA RESULTS

Folds | Population | Iterations | Epochs | Feature mask | No. of features | Activation functions | Accuracy (%)
5 | 10 | 100 | 100 | 110011001 | 5 | ReLU, Sigmoid | 96.52
5 | 20 | 100 | 100 | 011001010 | 4 | ReLU, Sigmoid | 97.43
10 | 10 | 100 | 100 | 110011000 | 4 | ReLU, Sigmoid | 96.69
10 | 20 | 100 | 100 | 111001000 | 4 | ReLU, Sigmoid, Softmax | 97.619
10 | 30 | 150 | 100 | 111011001 | 6 | ReLU, Sigmoid | 97.62
10 | 20 | 100 | 1000 | 110001001 | 4 | ReLU, Sigmoid, Sigmoid, Softmax | 97.818
5 | 20 | 100 | 200 | 101001111 | 6 | Sigmoid, Softmax, Softmax, Softmax (dropout = 0.25) | 97.256
10 | 20 | 100 | 100 | 101001011 | 5 | Softmax, Softmax, Softmax, Softmax | 97.907
10 | 20 | 100 | 100 | 001001001 | 3 | Sigmoid, Sigmoid, Softmax (dropout = 0.5) | 96.71
10 | 20 | 100 | 100 | 110101011 | 6 | ReLU, Sigmoid | 96.89
10 | 20 | 100 | 1000 | 100111011 | 6 | ReLU, Sigmoid, Sigmoid | 97.49

VII. CONCLUSION

Breast cancer prediction is very significant in the areas of healthcare and biomedicine. This study aims to enhance the accuracy of the diagnosis of breast cancer with the deep learning method. Analysis of the WBCD with traditional classifiers such as NB, SVM, KNN, and RF achieved high accuracy. A model that predicts breast cancer was proposed based on a deep investigation of the performance of different deep networks on this dataset.


The model was implemented in Python and proved effective in classifying the diagnostic data set into the two classes, which matters given the seriousness of cancer; the accuracy of the proposed model ranges between 93.5% and 99.3%. In the case of the two-hidden-layer model, the highest outcomes, with 250 and 2000 epochs, are 99% and 99.3% respectively, and the same result can be obtained with the four-hidden-layer model and only 100 epochs. It is noticed that DL hybridized with DA as a feature selection model achieved an accuracy of 97.907%. Such a comparative analysis of breast cancer classification provides insights into efficient approaches for the detection of cancer.

VIII. FUTURE WORK

The proposed model is applied to numerical data only. It would be interesting to see its behavior when applied to the different types of data available in the medical field, such as mammograms. In the future, the research may be carried out on screening features to diagnose breast cancer tumors.
Analysis for Dimension Reduction in Massive Distributed Data Sets,”
REFERENCES

[1] S. Chopra and E. L. Davies, "Breast cancer," Medicine (United Kingdom), vol. 48, no. 2, pp. 113–118, 2020, doi: 10.1016/j.mpmed.2019.11.009.
[2] B. Sahu, S. Mohanty, and S. Rout, "A Hybrid Approach for Breast Cancer Classification and Diagnosis," ICST Trans. Scalable Inf. Syst., vol. 0, no. 0, p. 156086, 2018, doi: 10.4108/eai.19-12-2018.156086.
[3] M. Paredes, "Can Artificial Intelligence help reduce human medical errors? Two examples from ICUs in the US and Peru," pp. 1–12, 2018. [Online]. Available: https://techpolicyinstitute.org/wp-content/uploads/2018/02/Paredes-Can-Artificial-Intelligence-help-reduce-human-medical-errors-DRAFT.pdf.
[4] W. H. Wolberg, "UCI Machine Learning Repository: Breast Cancer Wisconsin (Original) Data Set." https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29 (accessed Dec. 16, 2020).
[5] S. Aruna, S. P. Rajagopalan, and L. V. Nandakishore, "Knowledge Based Analysis of Various Statistical Tools in Detecting Breast Cancer," Comput. Sci. Inf. Technol., vol. 2, pp. 37–45, 2011, doi: 10.5121/csit.2011.1205.
[6] K. Polat and S. Güneş, "Breast cancer diagnosis using least square support vector machine," Digit. Signal Process., vol. 17, no. 4, pp. 694–701, Jul. 2007, doi: 10.1016/j.dsp.2006.10.008.
[7] A. F. M. Agarap, "On breast cancer detection: An application of machine learning algorithms on the Wisconsin diagnostic dataset," ACM Int. Conf. Proceeding Ser., no. 1, pp. 5–9, 2018, doi: 10.1145/3184066.3184080.
[8] H. Asri, H. Mousannif, H. Al Moatassime, and T. Noel, "Using Machine Learning Algorithms for Breast Cancer Risk Prediction and Diagnosis," Procedia Comput. Sci., vol. 83, pp. 1064–1069, 2016, doi: 10.1016/j.procs.2016.04.224.
[9] P. S. Kohli, in 2020 IEEE 5th International Conference on Computing Communication and Automation (ICCCA 2020), pp. 1–4, 2020.
[10] K. B. Nahato, K. N. Harichandran, and K. Arputharaj, "Knowledge mining from clinical datasets using rough sets and backpropagation neural network," Comput. Math. Methods Med., vol. 2015, 2015, doi: 10.1155/2015/460189.
[11] S. Ahmed Medjahed, T. Ait Saadi, and A. Benyettou, "Breast Cancer Diagnosis by using k-Nearest Neighbor with Different Distances and Classification Rules," Int. J. Comput. Appl., vol. 62, no. 1, pp. 1–5, 2013, doi: 10.5120/10041-4635.
[12] M. M. Islam, H. Iqbal, M. R. Haque, and M. K. Hasan, "Prediction of breast cancer using support vector machine and K-Nearest neighbors," 5th IEEE Reg. 10 Humanit. Technol. Conf. (R10-HTC 2017), pp. 226–229, 2018, doi: 10.1109/R10-HTC.2017.8288944.
[13] H. Singh, Practical Machine Learning with AWS. 2021.
[14] J. D. Hunter, "Matplotlib: A 2D graphics environment," Comput. Sci. Eng., vol. 9, no. 3, pp. 90–95, 2007, doi: 10.1109/MCSE.2007.55.
[15] S. van der Walt, S. C. Colbert, and G. Varoquaux, "The NumPy array: A structure for efficient numerical computation," Comput. Sci. Eng., vol. 13, no. 2, pp. 22–30, 2011, doi: 10.1109/MCSE.2011.37.
[16] H. Li and D. Phung, "Journal of Machine Learning Research: Preface," J. Mach. Learn. Res., vol. 39, pp. i–ii, 2014.
[17] L. Vig, "Comparative Analysis of Different Classifiers for the Wisconsin Breast Cancer Dataset," OALib, vol. 01, no. 06, pp. 1–7, 2014, doi: 10.4236/oalib.1100660.
[18] Y. Qu, G. Ostrouchov, N. Samatova, and A. Geist, "Principal Component Analysis for Dimension Reduction in Massive Distributed Data Sets," Workshop on High Performance Data Mining, Second SIAM Int. Conf. on Data Mining, pp. 4–9, 2002.
[19] N. Varghese, "A Survey of Dimensionality Reduction and Classification Methods," Int. J. Comput. Sci. Eng. Surv., vol. 3, no. 3, pp. 45–54, 2012, doi: 10.5121/ijcses.2012.3304.
[20] V. Saravanan and R. Mallika, "An effective classification model for cancer diagnosis using micro array gene expression data," Proc. 2009 Int. Conf. Comput. Eng. Technol. (ICCET 2009), vol. 1, pp. 137–141, 2009, doi: 10.1109/ICCET.2009.38.
[21] İ. Yıldız and A. T. Karadeniz, "Enhancement of Breast Cancer Diagnosis Accuracy with Deep Learning," Eur. J. Sci. Technol., pp. 452–462, 2019, doi: 10.31590/ejosat.638428.
[22] Y. Bengio, Learning Deep Architectures for AI, vol. 2, no. 1. 2009.
[23] M. M. Mafarja, D. Eleyan, I. Jaber, A. Hammouri, and S. Mirjalili, "Binary Dragonfly Algorithm for Feature Selection," Proc. 2017 Int. Conf. New Trends in Computing Sciences (ICTCS 2017), pp. 12–17, 2017, doi: 10.1109/ICTCS.2017.43.
[24] H. Liu, H. Motoda, R. Setiono, and Z. Zhao, "Feature Selection: An Ever Evolving Frontier in Data Mining," J. Mach. Learn. Res. Workshop and Conf. Proc. 10: Fourth Workshop on Feature Selection in Data Mining, pp. 4–13, 2010.
[25] C. S. Yang, L. Y. Chuang, Y. J. Chen, and C. H. Yang, "Feature selection using memetic algorithms," Proc. 3rd Int. Conf. Convergence and Hybrid Information Technology (ICCIT 2008), vol. 1, pp. 416–423, 2008, doi: 10.1109/ICCIT.2008.81.
[26] M. Mafarja, A. A. Heidari, H. Faris, S. Mirjalili, and I. Aljarah, "Dragonfly Algorithm: Theory, Literature Review, and Application in Feature Selection," vol. 811. Springer International Publishing, 2020.
[27] Q. Song and M. Shepperd, "Missing data imputation techniques," Int. J. Bus. Intell. Data Min., vol. 2, no. 3, pp. 261–291, 2007, doi: 10.1504/IJBIDM.2007.015485.
[28] D. Berrar, "Cross-validation," Encyclopedia of Bioinformatics and Computational Biology, vol. 1–3, pp. 542–545, 2018, doi: 10.1016/B978-0-12-809633-8.20349-X.
[29] F. S. Panchal and M. Panchal, "Review on Methods of Selecting Number of Hidden Nodes in Artificial Neural Network," Int. J. Comput. Sci. Mob. Comput., vol. 3, no. 11, pp. 455–464, 2014. [Online]. Available: www.ijcsmc.com.
