(IJCST-V12I3P13) :thachayani M, Chaitanya Sai Jangam, Kalyan T, SriManjunadh Maddukuri, Sangadi Manikanta

International Journal of Computer Science Trends and Technology (IJCST) – Volume 12 Issue 3, May - Jun 2024

RESEARCH ARTICLE OPEN ACCESS

RF-LR Ensemble Classifier for Breast Cancer Detection


Thachayani M [1], Chaitanya Sai Jangam, Kalyan T, SriManjunadh Maddukuri,
Sangadi Manikanta
Dept. of Electronics and Communication Engineering, Puducherry Technological University, Puducherry, India
ABSTRACT
An ensemble learning based classifier to aid in the early diagnosis of breast cancer is presented in this paper. Four machine
learning algorithms are investigated and the random forest classifier is selected as the base model based on the performance. An
ensemble model is created using bagging and boosting techniques employing the base classifier. Logistic regression is applied
as the meta classifier for stacking. The developed ensemble model resulted in an improved accuracy of 96.49% compared to the
92.55% accuracy of the baseline model.
Keywords — Breast cancer detection, Ensemble learning, Exploratory analysis, Logistic regression, Random forest

I. INTRODUCTION
Breast cancer is a major threat, particularly for the female population, and ranks as the second leading cause of cancer-related deaths in women [1]. The survival rate is higher when the disease is detected during the early stages, while it is still localized and has not yet spread to other parts of the body. The motivation for this work arises from the potential for machine learning to assist in early and accurate diagnosis of breast cancer, which could significantly improve recovery rates and reduce fatalities.
Several studies that use machine learning models to enhance diagnostic accuracy are reported in the literature. An extensive review of the literature on machine learning based breast cancer detection is given in [2]. Some closely related work on the application of machine learning, and particularly ensemble learning techniques, to breast cancer detection is discussed here. Support Vector Machine (SVM), Artificial Neural Network (ANN) and Naïve Bayes algorithms are investigated for their suitability in detecting breast cancer in [3], where the SVM classifier outperformed the other two with an accuracy of 96.72%. In [4], a stacking classifier is implemented using K-Nearest Neighbor (KNN), SVM, and Random Forest (RF) as base classifiers and Logistic Regression as meta classifier, considering 20 features of the breast cancer data, and achieved an accuracy of 97.20%. An ensemble model created with a Bayesian network and a Radial Basis Function is employed in [5] and achieved a prediction accuracy of 97%. In [6], biased results due to class imbalance in the dataset are observed in the decision tree classifier, and adaptive boosting is employed to address this issue; a significant improvement in accuracy is reported with boosting. In [7], t-distributed stochastic neighbour embedding (t-SNE) is applied for cost optimization and dimension reduction, and a snapshot ensemble technique is then used to combine the predictions from the base models, achieving an accuracy of 86.6%. In [8], the authors investigated the averaged perceptron model for breast cancer detection and recorded an accuracy score of 0.984 with zero false negatives. A cross-validation approach is utilized for tuning the hyperparameters in [9], where its impact on the accuracy of random forest, extra tree (ET), and support vector machine classifiers is analysed; it is observed that the tuned SVM classifier outperformed the other two with an accuracy of 97.78%. From the survey, it is observed that variants of SVM and RF outperformed several other base classifier models for breast cancer detection. Further, it is inferred that factors such as the choice and number of features, and processes such as cross-validation, boosting, parameter tuning and ensembling, have a significant impact on the overall performance. In this paper, an ensemble model based on bagged and boosted RF base classifiers combined using a Logistic Regression (LR) meta classifier is investigated for breast cancer detection. The following section describes the methodology used, Section III presents the key results, and Section IV concludes the paper.

II. METHODOLOGY
Fig. 1 shows the key steps involved in training and testing the proposed classifier system. The process starts with pre-processing of the data, which involves cleansing and other preliminary processing aimed at validating the data set for completeness and integrity. The data set is then split into training and test sets. Exploratory analysis is carried out on the training data set to visualize the relevance of the various predictors to the diagnosis. The correlations between the predictors are plotted using different visualization tools to identify the relevant predictors. The ensemble model is trained with the selected predictors and cross-validated to ensure that the model is not over-fitted.
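The workflow described in this section could be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it assumes scikit-learn, uses the library's bundled copy of the Wisconsin diagnostic data as a stand-in, replaces the visual exploratory analysis with a simple correlation threshold (the value 0.2 is hypothetical), and wires the bagged and boosted RF models into an LR meta classifier through `StackingClassifier`, which is only one plausible reading of the ensemble. The estimator counts and `random_state` values are placeholders chosen to keep the run short.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled data set stands in for the paper's data;
# its diagnosis target is already label-encoded as 0/1.
X, y = load_breast_cancer(return_X_y=True)

# Random 70/30 split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

# Exploratory step, reduced to a number: absolute correlation of each
# predictor with the target, computed on the training set only.
corr_with_target = np.array(
    [abs(np.corrcoef(X_train[:, j], y_train)[0, 1]) for j in range(X_train.shape[1])]
)
selected = np.flatnonzero(corr_with_target >= 0.2)  # hypothetical threshold

# Bagged and boosted random forests, combined by a logistic regression
# meta classifier. Small estimator counts are illustrative only.
rf = RandomForestClassifier(n_estimators=10, random_state=0)
ensemble = StackingClassifier(
    estimators=[
        ("bagged_rf", BaggingClassifier(rf, n_estimators=5, random_state=0)),
        ("boosted_rf", AdaBoostClassifier(rf, n_estimators=5, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
# Scaling inside the pipeline keeps each validation fold out of the scaler fit.
model = make_pipeline(StandardScaler(), ensemble)

# Cross-validate on the training data, then fit once and score on the test set.
cv_scores = cross_val_score(model, X_train[:, selected], y_train, cv=5)
model.fit(X_train[:, selected], y_train)
test_accuracy = model.score(X_test[:, selected], y_test)

print(f"CV accuracy:   {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
print(f"Test accuracy: {test_accuracy:.3f}")
```

A large gap between the cross-validated accuracy and the test accuracy would be one sign of the over-fitting that the cross-validation step is meant to guard against.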

ISSN: 2347-8578 www.ijcstjournal.org Page 62


Fig. 1 Methodology

The testing of the trained model is done using the test data, and the performance of the proposed classifier is assessed in terms of accuracy, precision, recall and F1-score.

III. RESULTS AND DISCUSSION
The Wisconsin breast cancer data set, which consists of 569 samples with 30 attributes derived from ten main properties of breast cell nuclei, is used to train and test the classifiers [10]. These attributes characterize the cell nuclei and are acquired from a digitized image of a fine needle aspirate (FNA) of a breast mass. Label encoding of the diagnosis parameter is done during pre-processing. Standard scaling of the data is done for uniformity. The data set is then split into training and test sets comprising randomly selected 70% and 30% of the original data samples, respectively. Exploratory analysis utilized various data visualization tools to explore the relationship between the predictors (features) and the target parameter. It also involves analyzing the correlation between the predictors. Multiple levels of analysis are carried out to identify the relevant predictors in order to achieve optimum performance. From the first level of analysis and the literature, the ten mean attributes are chosen out of the thirty attributes as the predictors and are analyzed further. The correlation matrix plot for the first-level predictors and the target diagnosis is shown in Fig. 2. From this figure, it can be observed that the fractal_dimension feature is only weakly correlated with the target compared to the other nine attributes. Hence the nine attributes other than fractal_dimension are analyzed further. Pair-wise correlation plots are plotted. Some of these plots are shown in Fig. 3 as samples, depicting the attributes radius_mean, texture_mean, perimeter_mean and area_mean. From this figure, the relevance of the texture and the area or perimeter features to the classification is evident. Further, the strong correlation between the radius and both the area and the perimeter is also showcased.

Fig. 2 Correlation matrix plot

Fig. 3 Sample pair-wise plots

Grid-based ten-fold cross-validation is done for the base classifiers. Hyperparameters such as model complexity and training rate are optimized to enhance the performance. The models considered are LR, RF, SVM and KNN. Each model is individually trained and then tested. The F1 score and accuracy score obtained are listed in Table I. This table reveals that the RF classifier outperforms the other classifiers considered and hence is selected as the base model for creating the ensemble classifier.
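The base-classifier comparison could be reproduced in outline as shown below. This is a hedged sketch under stated assumptions, not the authors' code: it uses scikit-learn's `GridSearchCV` with ten folds, takes the nine mean attributes from the library's bundled copy of the data set (where they are named `mean radius`, `mean texture`, and so on rather than `radius_mean`), and searches small placeholder grids, because the paper does not list the hyperparameter values it explored. Scores obtained this way need not match those reported in Table I.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = load_breast_cancer()

# Keep the nine mean attributes, dropping the weakly correlated fractal dimension.
mean_cols = [
    i for i, name in enumerate(data.feature_names)
    if name.startswith("mean ") and name != "mean fractal dimension"
]
X, y = data.data[:, mean_cols], data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

# Placeholder grids (assumptions); keys follow make_pipeline's step naming.
candidates = {
    "LR": (LogisticRegression(max_iter=1000),
           {"logisticregression__C": [0.1, 1.0, 10.0]}),
    "RF": (RandomForestClassifier(n_estimators=50, random_state=0),
           {"randomforestclassifier__max_depth": [3, None]}),
    "SVM": (SVC(),
            {"svc__C": [0.1, 1.0, 10.0]}),
    "KNN": (KNeighborsClassifier(),
            {"kneighborsclassifier__n_neighbors": [3, 5, 7]}),
}

results = {}
for name, (estimator, grid) in candidates.items():
    pipeline = make_pipeline(StandardScaler(), estimator)
    # Ten-fold cross-validated grid search on the training data only.
    search = GridSearchCV(pipeline, grid, cv=10, scoring="accuracy")
    search.fit(X_train, y_train)
    y_pred = search.best_estimator_.predict(X_test)
    # In scikit-learn's copy of the data, malignant is encoded as 0.
    results[name] = (
        f1_score(y_test, y_pred, pos_label=0),
        accuracy_score(y_test, y_pred),
    )

for name, (f1, acc) in results.items():
    print(f"{name:>3}  F1 = {f1:.4f}  accuracy = {acc:.4f}")
```

Selecting the model with the highest test accuracy from `results` mirrors the way the base model is chosen here, although ranking by the cross-validated score (`search.best_score_`) would avoid using the test set for model selection.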

TABLE I
PERFORMANCE OF THE BASIC MODELS

Sl. no   Model   F1 Score    Accuracy Score
0        LR      0.916010    0.909574
1        RF      0.992126    0.925532
2        SVM     1.000000    0.909574
3        KNN     0.923885    0.914894

In bagging, multiple models of the same base classifier are trained with different data subsets generated using bootstrap sampling, and the final decision is based on the aggregate of the individual decisions, formed by voting. This process helps to improve the diversity of the classifier, leading to more robust performance. Bagging and boosting techniques are used to form an ensemble of the RF model, and aggregation is done using an LR meta classifier. The performance of the ensemble classifier is evaluated in terms of accuracy, precision, recall and F1-score.

Fig. 4 Receiver operating characteristics (RoC)

Fig. 5 Results of the performance evaluation

The receiver operating characteristic of the ensemble model is shown in Fig. 4, and the output screenshot showing the confusion matrix and performance metrics is presented in Fig. 5. From these figures it is evident that the proposed ensemble classifier based on the RF and LR classifiers performs significantly better than the base RF classifier, with an improved accuracy score of 0.965 compared to the original score of 0.9255.

IV. CONCLUSIONS
This paper presents an ensemble learning classifier with random forest as the base model for assisting in the accurate diagnosis of breast cancer. Using exploratory analysis, nine out of the thirty predictors are chosen for classification. Four base classifiers are investigated and, based on the performance, the random forest is chosen as the base model and an ensemble model is created using bagging and boosting techniques. A logistic regression model is used for aggregating and forming the final prediction. The ensemble classifier exhibited an improved accuracy of 96.49% compared to the 92.55% accuracy of the base model.

REFERENCES
[1] Breast Cancer Statistics, available at https://www.cdc.gov/breastcancer/statistics.
[2] A. Jafari, “Machine-learning methods in detecting breast cancer and related therapeutic issues: A review,” Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, vol. 12, 2024.
[3] Md. I. H. Showrov, M. T. Islam, Md. D. Hossain, and Md. S. Ahmed, “Performance comparison of three classifiers for the classification of breast cancer dataset,” in Proc. 4th Int. Conf. Electr. Inf. Commun. Technol. (EICT), Dec. 2019, pp. 1–5.
[4] M. R. Basunia, I. A. Pervin, M. Al Mahmud, S. Saha, and M. Arifuzzaman, “On predicting and analyzing breast cancer using data mining approach,” in Proc. IEEE Region 10 Symp. (TENSYMP), Jun. 2020, pp. 1257–1260.
[5] M. A. Jabbar, “Breast cancer data classification using ensemble machine learning,” Eng. Appl. Sci. Res., vol. 48, no. 1, pp. 65–72, 2021.
[6] T. A. Assegie, R. L. Tulasi, and N. K. Kumar, “Breast cancer prediction model with decision tree and adaptive boosting,” IAES Int. J. Artif. Intell., vol. 10, no. 1, p. 184, 2021.
[7] N. Sharma, K. P. Sharma, M. Mangla, and R. Rani, “Breast cancer classification using snapshot ensemble deep learning model and t-distributed stochastic neighbor embedding,” Multimedia Tools Appl., vol. 82, no. 3, pp. 4011–4029, Jan. 2023.
[8] V. Birchha and B. Nigam, “Performance analysis of averaged perceptron machine learning classifier for breast cancer detection,” Proc. Comput. Sci., vol. 218, pp. 2181–2190, 2023.
[9] N. Mohd Ali, R. Besar, and N. A. A. Aziz, “A case study of microarray breast cancer classification using machine learning algorithms with grid search cross validation,” Bull. Electr. Eng. Informat., vol. 12, no. 2, pp. 1047–1054, Apr. 2023.
[10] Wisconsin Breast Cancer Dataset from UCI Machine Learning Repository, available at https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29.
