
Capstone Project Report

The document presents a thesis titled 'Automated Breast Cancer Recognition Based on Machine Learning,' submitted by Aenish Prasain and Seimon Moktan Tamang for a Bachelor of Technology in Computer Science and Engineering. It outlines the motivation, objectives, and proposed methodologies for utilizing machine learning algorithms to improve breast cancer detection and diagnosis. The thesis includes a literature survey of existing models, a proposed system framework, and an analysis of various machine learning techniques for effective breast cancer recognition.


Automated Breast Cancer Recognition Based on Machine Learning

Submitted in partial fulfillment of the requirements for the degree of

Bachelor of Technology
in
Computer Science and Engineering

by
AENISH PRASAIN
19BCE2522

SEIMON MOKTAN TAMANG


19BCE2527

Under the guidance of


Prof. / Dr. Lijo V.P
SCOPE
VIT, Vellore.

May, 2023
DECLARATION

I hereby declare that the thesis entitled “Automated Breast Cancer Recognition
Based on Machine Learning” submitted by me, for the award of the degree of
Bachelor of Technology in Computer Science and Engineering to VIT, is a record
of bonafide work carried out by me under the supervision of Prof. / Dr. Lijo
V.P.
I further declare that the work reported in this thesis has not been submitted
and will not be submitted, either in part or in full, for the award of any other degree or
diploma in this institute or any other institute or university.

Place : Vellore
Date :
Signature of the Candidate
CERTIFICATE

This is to certify that the thesis entitled “Automated Breast Cancer
Recognition Based on Machine Learning” submitted by Seimon Moktan Tamang
(19BCE2527), SCOPE, VIT, for the award of the degree of Bachelor of Technology
in Computer Science and Engineering, is a record of bonafide work carried out by
him / her under my supervision during the period 01.07.2022 to 30.04.2023, as
per the VIT code of academic and research ethics.

The contents of this report have not been submitted and will not be submitted
either in part or in full, for the award of any other degree or diploma in this institute or
any other institute or university. The thesis fulfills the requirements and regulations of
the University and in my opinion meets the necessary standards for submission.

Place : Vellore
Date : Signature of the Guide

Internal Examiner External Examiner

Head of the Department

Programme
ACKNOWLEDGEMENTS

Student Name
Executive Summary

Summary of the thesis


CONTENTS

Acknowledgement
Executive Summary
Table of Contents
List of Figures
List of Tables
Abbreviations
Symbols and Notations

1. Introduction
1.1. Theoretical Background
1.2. Motivation
1.3. Aim of the Proposed Work
1.4. Objective(s) of the Proposed Work
2. Literature Survey
2.1. Survey of the Existing Models/Work
2.2. Summary/Gaps identified in the Survey
3. Overview of the Proposed System
3.1. Introduction and Related Concepts
3.2. Framework, Architecture or Module for the Proposed System (with explanation)
3.3. Proposed System Model (ER Diagram/UML Diagram/Mathematical Modeling)
4. Proposed System Analysis and Design
4.1. Introduction
4.2. Requirement Analysis
4.2.1. Functional Requirements
4.2.1.1. Product Perspective
4.2.1.2. Product Features
4.2.1.3. User Characteristics
4.2.1.4. Assumption & Dependencies
4.2.1.5. Domain Requirements
4.2.1.6. User Requirements
4.2.2. Non Functional Requirements
4.2.2.1. Product Requirements
4.2.2.1.1. Efficiency (in terms of Time and Space)
4.2.2.1.2. Reliability
4.2.2.1.3. Portability
4.2.2.1.4. Usability
4.2.2.2. Organizational Requirements
4.2.2.2.1. Implementation Requirements (in terms of deployment)
4.2.2.2.2. Engineering Standard Requirements
4.2.2.3. Operational Requirements (Explain the applicability for your work
w.r.t. the following operational requirements)
• Economic
• Environmental
• Social
• Political
• Ethical
• Health and Safety
• Sustainability
• Legality
4.2.3. System Requirements
4.2.3.1. H/W Requirements (details about Application Specific Hardware)
4.2.3.2. S/W Requirements (details about Application Specific Software)
5. Results and Discussion
6. References
APPENDIX A
List of Figures

Figure No. Title Page No.


2.1 Figure caption 13
2.2 Figure caption 15

List of Tables

Table No. Title Page No.


2.1 Table caption 28
List of Abbreviations

3GPP Third Generation Partnership Project


2G Second Generation
3G Third Generation
4G Fourth Generation
AWGN Additive White Gaussian Noise
Symbols and Notations

f CFO
 NCFO



1. INTRODUCTION

1.1. THEORETICAL BACKGROUND

Breast cancer is the most commonly reported cancer among women worldwide, in
both developed and developing countries, and it has the second-highest female
death rate of any cancer type. Although early identification can reduce the
chance of dying from breast cancer, in low- and middle-income nations the
disease is often diagnosed at a late stage. Early detection remains the best
strategy for improving breast cancer survival. Thus, computational models that
identify breast cancer more accurately and help explain its pathophysiological
underpinnings can help to reduce the societal burden of the disease. Many
earlier studies have applied machine learning to this problem, using algorithms
such as decision trees, KNN, SVM, and naive Bayes, each of which performs well
in particular settings. This report reviews these algorithms along with their
efficiency and limitations.

1.2. MOTIVATION

Data science is important in many fields, and healthcare is one of its major
applications. Breast cancer is the most common kind of cancer in women and
claims many lives. This high death rate calls for a focus on early
identification and prevention. Machine learning has several applications in
predicting breast cancer and can contribute to the development of cutting-edge
diagnostic technology.

1.3. AIM OF THE PROPOSED WORK

To analyze in depth various machine learning algorithms, such as Logistic
Regression, Random Forest Classifier, AdaBoost, XGBoost Classifier, and Neural
Network, to evaluate them on accuracy, precision, F1 score, and ROC curve, and
to find the most feasible algorithm for the recognition of breast cancer.
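
The evaluation criteria named in the aim can be made concrete with a small
example. The following is a minimal sketch, not the evaluation code used in this
work: the function names `classification_metrics` and `roc_auc`, the toy labels,
and the toy scores are assumptions introduced for illustration, with label 1
standing for malignant and 0 for benign.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = malignant)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def roc_auc(y_true, scores):
    """Area under the ROC curve, computed as the probability that a randomly
    chosen positive case receives a higher score than a negative one."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical ground truth and model scores for eight cases.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # threshold at 0.5

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```

Accuracy alone can hide missed malignant cases when classes are imbalanced,
which is why precision, recall, F1 and the threshold-independent ROC AUC are
reported together.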

1.4. OBJECTIVE

• Introduce the use of interpretability methods to better understand the
underlying mechanisms of breast cancer.

• Propose a framework for enabling the wide adoption of ML techniques in
breast cancer diagnosis by explaining and justifying the prediction of the
model to the physician.

• Ensure the model's ability to adapt properly to new, previously unseen data
drawn from the same distribution as the data used to create the model.

• Establish a method for optimizing the performance of ML algorithms, such
as RF, by feature selection guided by interpretability methods.

• Obtain close-to-accurate estimates of standard deviations, categorical
variables, and confidence intervals.

• Facilitate more sophisticated and accurate data analysis and modeling.

2. LITERATURE SURVEY

2.1. SURVEY OF THE EXISTING MODELS/WORK

Noreen Fatima [1] examined several data mining, machine learning, and deep
learning algorithms in search of one that may more accurately predict the growth
of breast cancer, and then reviewed the most important machine learning,
ensemble, and deep learning approaches. These techniques greatly extend the set
of algorithms used to forecast breast cancer. They make information processing
easier, reduce cost, and are simple, fast, and efficient to use. Their
limitations are that the implemented methods were prone to errors and were
complex to implement in real time.

Dejun Zhang [2] integrated principal component analysis and an autoencoder
neural network to create an unsupervised feature learning framework that can
extract various traits from gene expression patterns. To predict clinical
outcomes in breast cancer, an ensemble classifier using the AdaBoost algorithm
(PCA-AE-Ada) was built on the learned features. The suggested approach
demonstrates stronger prediction capabilities, and the deep learning-based
classifier outperforms the alternatives. The features that the neural network
automatically extracted generalized efficiently and explicitly enhanced the
performance of outcome prediction. However, it was difficult to obtain further
performance gains, and the approach could not be implemented in real time.

Guoqing Bao [3] developed a framework for pan-cancer prognostic analysis that
incorporated statistical power, biological justification, and machine learning
algorithms. From TCGA training studies (n = 1,878), the framework uncovered a
5-lncRNA signature (ENSG00000206567, PCAT29, ENSG00000257989, LOC388282, and
LINC00339). The discovered lncRNAs are substantially related to overall
survival (OS) in the TCGA cohort (n = 4,231) (all P ≤ 1.48E-11). The signature
stratified the population into low- and high-risk groups with statistically
different survival outcomes (median OS of 9.84 years versus 4.37 years,
log-rank P = 1.48E-38) and produced a time-dependent ROC/AUC of 0.66 at 5
years. To correlate the 5-lncRNA profile with patient prognoses for all types
of cancer, an indexing system was created. In silico functional analysis showed
that the lncRNAs are connected to common biological mechanisms that underlie
human malignancies. The framework took minimal time and memory and can deliver
high-quality results with improved reliability and resilience. However, it
required additional configuration and was quite heavyweight.

Chen Peng [4] suggests a deep learning approach for the identification of
breast cancer-related genes, Capsule Network-based Modeling of Multi-omics Data
(CapsNetMMD). CapsNetMMD uses well-known breast cancer-related genes to recast
gene identification as a supervised classification problem. By modeling
features with the capsule network, it detects breast cancer-related genes
substantially more effectively than previous machine learning techniques. It is
fast and efficient while being as accurate as state-of-the-art algorithms, it
solves the problem end to end, and the corresponding time cost is greatly
reduced. Its limitation is that it has not been investigated thoroughly.

Tariq Mahmood [5] and his fellow researchers searched various well-known
databases using the keyword "breast cancer" to present a thorough survey of
current diagnostic approaches. This opens up new research challenges for
radiologists and researchers to get involved as early as possible in the
development of an effective and reliable system for predicting breast cancer
using well-known deep learning techniques. The analysis of multiple medical
imaging modalities in this study made it possible to systematically compare the
strengths, limitations, and performance of recent DL and ML schemes. It showed
that DL approaches improve the segmentation and classification of breast
abnormalities, which significantly benefits radiologists and researchers.

Meriem Sebai [6] proposed a partially supervised deep learning framework to
automatically find mitotic figures on breast cancer histopathology slides. To
segment the stained slides, the authors use a semantic segmentation system with
two-stream fully convolutional networks (FCNs). The first branch is trained
with weak labels, whereas the second branch is trained with strong labels. To
achieve more precise mitosis detection, the framework merges the predicted
score maps of the two FCNs. The authors also developed a weight transfer
function that moves semantic information from the weak segmentation branch to
the strong segmentation branch, which eases the problem of training the strong
segmentation branch on a dataset without pixel-level annotations. The framework
is fast and efficient while being as accurate as state-of-the-art algorithms,
and it performs well across various circumstances and environments. Its
limitations are large payloads and long running times.

Beibit Abdikenov [7] proposes a robust prognostic modeling strategy that yields
a Pareto-optimal collection of deep neural networks (DNNs) with comparable
performance metrics. Through this method, categorical characteristics can be
represented as vectors in a multidimensional space with improved
interpretability. Evidence is provided to support the claim that DNNs optimized
using evolutionary techniques perform better than the other classifiers
considered in that study. The proposed strategy is quick and efficient to use
and performs well across various circumstances and environments. However, it is
unsuitable for large-scale scenarios and hence cannot meet current network
business demands.

Kai Zheng [8] suggested a prediction approach (CGMDA) based on chaos game
representation that takes into account both linear sequence information and
nonlinear interactions at the same time. Most sequence comparison methods,
including k-mer, can only quantify nonlinear sequence associations, but gene
expression is connected to linear sequence information. When compared to other
classifiers and prediction algorithms, CGMDA performed well. It can improve
worst-case performance, greatly reduces the corresponding time cost, and lowers
the complexity threshold. Its limitations are that it is heavyweight and too
complex to be implemented in real time.

Md Tauhidul Islam [9] devised a non-invasive, poroelastography-based approach
for imaging the normalized solid stress inside tumors in vivo. The technique is
based on an analytical model of solid stress distribution, which shows that the
solid stress inside the tumor has the same spatial distribution as the
compression-induced stress created in the tumor during a creep compression
experiment. Because solid stress is a significant component of the cancer
microenvironment, the methodology may yield new insights into cancer
mechanopathology and, ultimately, lead to improved cancer detection and
treatment strategies. It was highly effective on complex problems, can improve
worst-case performance, and has built-in error handling. However, it is
heavyweight, unsuitable for large-scale scenarios, and cannot be implemented in
real time.

Muhammad Nauman [10] proposed applying the CP-Nets formalism in conjunction
with state space analysis. Formal verification is a technique used to validate
the correctness of complicated software systems and processes. In the proposed
approach, the decision tree method is employed to extract prognostic rules from
breast cancer datasets. To establish accuracy and eliminate labeling mistakes,
the prognostic rules are turned into CP-Nets, whose formalism can assure the
correctness of the obtained rules. The resulting prediction rules can offer
clinicians information about the patient's diagnosis, allowing them to detect
dangerous illness and save lives. Only supervised machine learning with the
CP-Nets formalism was employed for the prognostic procedure in this study. The
approach achieved a well-balanced tradeoff among various parameters and can
improve worst-case performance. Its limitations are that it may take
considerable time and economic cost to construct, its solutions have proved
ineffective, and the system is opportunistic and uncontrollable.

Ziyu Ning [11] proposed topologically inferred miRNA-mediated subpathway
activity profiles. The authors constructed the Global Directed Pathway Network
(GDPN) with genes as nodes and identified a set of miRNA-mediated subpathways
that are survival-related risk markers using 3 BRCA datasets and 23 other
cancer datasets. Large activity values, such as those for hsa-miR-107 and
hsa-miR-142-3p, were linked with poor prognosis in the BRCA datasets. By
combining pathway topological information, the technique may significantly
reduce noise from sequencing mistakes and sample variability, break down the
boundaries of pathways, and give a new metric for finding survival-related
markers. It enhances correlation strength with finer and more compact
information and performed well across various circumstances and environments.
Its limitations were poor application performance, tedious message updating,
and the need for additional configuration.

Hasan Nasir Khan [12] proposed a multi-view feature fusion (MVFF) system
trained on four views of mammograms. The authors conducted tests using publicly
available datasets such as CBIS-DDSM (Curated Breast Imaging Subset of DDSM)
and the mini-MIAS mammography database. The MVFF-based system outperformed the
single-view systems reported in the literature for mammography classification.
It attained an AUC of 0.932 for mass and calcification and 0.84 for malignant
and benign, which is greater than all single-view methods; for normal and
abnormal categorization, the AUC is 0.93. The study concluded that a CADx
system based on multi-view feature fusion is more efficient than a single-view
system. Its limitations were critical design challenges, a time-consuming
approach, and a computational burden that may limit its application in real
scenarios.

Md Tauhidul Islam [13] used poroelastography-based techniques to quantify the
fluid pressure, fluid velocity, axial strain TC, and fluid flow inside a tumor
that result from externally applied compression. The authors analytically
demonstrated that these calculated parameters are connected to the underlying
IFP, IFV, fluid flow, and associated factors within tumors. As a result, the
adoption of non-invasive poroelastography technologies may be of enormous
clinical significance for cancer diagnosis, prognosis, and treatment. The
outcomes were found to be attractive, offered increased flexibility, and
provided integrity and non-transferability. The drawbacks were tedious message
updating, a difficult method, and the inability to be implemented in real time.

K. Sabeena Beevi [14] presents a methodology for segmenting and classifying
mitotic nuclei in breast histopathology images that is both effective and
accurate. To lessen the complexity of segmenting accurate nuclei boundaries in
large clinical images, the approach employs a stain normalization procedure. To
discover mitotic candidates from the contour-segmented nuclei areas, a
multi-classifier based on a deep belief network is used. The method tolerated
variations and was as accurate as state-of-the-art algorithms, but it was
unsuitable for large-scale scenarios and its solutions proved ineffective.

Ali Bou Nassif [15] presents a comprehensive review of the current state of
research on the use of artificial intelligence (AI) techniques for breast cancer
detection. The authors systematically review the literature on the various AI
methods used, their performance, and the challenges that remain in the field. The
main focus of the review is on deep learning, computer-aided diagnosis (CAD),
machine learning, natural language processing, and decision tree-based algorithms.
The studies reviewed used these AI techniques to analyze different types of data
such as mammograms, breast ultrasound images, and pathology images. The
authors found that AI techniques have shown great promise for improving the
accuracy of breast cancer detection; however, several challenges must be
addressed before they can be widely adopted for breast cancer detection. The
research provides valuable insights into the current state of research
on AI-based breast cancer detection and highlights the need for further research to
address the challenges that remain.

Viswanatha Reddy Allugunti [16] collected a dataset of thermographic images of
the breast and applied various machine learning and deep learning algorithms to
classify the images as normal or abnormal. The algorithms included Support
Vector Machine (SVM), Random Forest (RF), and Convolutional Neural Network
(CNN). The authors evaluated the performance of these algorithms using metrics
such as accuracy, precision, recall, and F1-score. The results showed that the
CNN outperformed the traditional machine learning algorithms (SVM and RF) on
all evaluation metrics, including accuracy in detecting breast cancer. The
authors also found that the deep learning algorithm performed well even when
the dataset was small, suggesting that it may be a useful tool for detecting
breast cancer in resource-limited settings.

Abien Fred M. Agarap [17] presents an analysis of the Wisconsin diagnostic
dataset using different machine learning algorithms for breast cancer
detection. The author applies algorithms such as k-Nearest Neighbors, Decision
Tree, Random Forest, and Artificial Neural Network to the dataset and compares
their performance in terms of accuracy, precision, recall, and F1-score. The
article also provides a brief introduction to the dataset and the preprocessing
steps applied to the data. The results showed that the Artificial Neural
Network performed best among the algorithms tested, with an accuracy of 96.53%.
The author also found that the combination of the Artificial Neural Network and
Random Forest algorithms resulted in the highest F1-score of 0.9647, which is a
measure of the balance between precision and recall.

Siham A. Mohammed [18] and Sadeq Darrab investigate the performance of several
machine learning algorithms in classifying mammographic images as normal or
abnormal (indicating the presence of breast cancer). The authors collected a
dataset of mammographic images and applied algorithms such as K-Nearest
Neighbors (KNN), Support Vector Machine (SVM), Random Forest (RF), Naive Bayes
(NB), and Neural Network (NN) to classify the images. They evaluated the
performance of these algorithms using metrics such as accuracy, precision,
recall, and F1-score. The results showed that the Neural Network had the
highest accuracy in detecting breast cancer compared to the other machine
learning algorithms. The authors also found that the Neural Network performed
well even when the dataset was small, suggesting that it could be a useful tool
for detecting breast cancer in resource-limited settings.

Nikita Rane [19] proposed to use six different machine learning algorithms
(Naive Bayes, Random Forest, Artificial Neural Networks, Nearest Neighbour,
Support Vector Machine, and Decision Tree) on the Wisconsin Diagnostic Breast
Cancer (WDBC) dataset, which is derived from digitized images of MRI scans. The
dataset was partitioned into training and testing sets, and the algorithm that
yielded the best result was used to classify the cancer as benign or malignant
on a website. The authors used SQLAlchemy in their implementation.

Dana Bazazeh [20] conducts a comparison of different machine learning
algorithms and evaluates their performance in detecting and diagnosing breast
cancer. The paper uses data visualization techniques to present the results of
the analysis. The goal of this research is to identify the most effective
algorithm for detecting and diagnosing breast cancer and to provide insights
into the advantages and limitations of each.

Muhammet Fatih Ak [21] collected a dataset of mammographic images and applied
various machine learning algorithms such as K-Nearest Neighbors (KNN),
Decision Trees (DT), Support Vector Machine (SVM), Naive Bayes (NB),
Random Forest (RF) and Neural Network (NN) to classify the images as normal or
abnormal (indicating the presence of breast cancer). They also used data
visualization techniques such as histograms, scatter plots, and heatmaps to
represent the data and identify patterns in the images. They performed experiments
and evaluated the performance of these algorithms and techniques using different
evaluation metrics such as accuracy, precision, recall, and F1-score. The paper
concludes that the use of data visualization and machine learning techniques for
early detection of breast cancer using mammographic images is promising and
could be a useful tool in clinical practice.

Amit Kumar Jaiswal [22] evaluates the performance of different semi-supervised
learning algorithms in detecting lymph node metastases in order to identify the
most effective algorithm for the task. The authors use a combination of
supervised and unsupervised learning techniques to train the models, which
allows them to leverage large amounts of unlabeled data in the analysis.

Saul Calderon-Ramirez [23] evaluates the performance of different
semi-supervised learning algorithms in classifying mammograms in order to
identify the most effective algorithm for the task. The authors also propose a
method to improve the uncertainty estimations by taking into account the
uncertainty of the classifier and the similarity between the instance and the
nearest training examples. The research contributes to the field of mammogram
classification by providing insights into the use of semi-supervised learning
techniques to improve the accuracy and efficiency of the classification process
and the uncertainty estimations.

Ronald CK Chan [24] conducted a systematic search on PubMed and identified 664
research papers related to the use of AI in breast cancer pathology. They grouped
the papers into six major tasks performed by pathologists: molecular and hormonal
analysis, grading, mitotic figure counting, ki-67 indexing, tumour-infiltrating
lymphocyte assessment, and lymph node metastases identification. For each task,
the authors also listed open-source datasets that can be used to build AI tools. The
paper concludes that many AI tools have shown promise and feasibility in the
automation of routine pathology investigations. The authors expect continued
growth in the use of AI in this field as new algorithms mature. The paper also
includes a graphical abstract that summarizes the major uses of AI in breast cancer
histopathology.

2.2. SUMMARY/GAPS IDENTIFIED IN THE SURVEY

The literature reviewed above supports the following conclusions. In [16], CNN
outperformed the other systems now in use in terms of accuracy, precision, and
the amount of data required, with an accuracy of 89.67%. In [18], decision
trees and sequential minimal optimization (SMO) gave the best results, both
achieving 75.52% on the Breast Cancer dataset; the accuracy then rises to
86.20% with decision trees on the Breast Cancer dataset and 89.56% with SMO
after applying preprocessing approaches. In [17], the L2-SVM reached a test
accuracy of 96.09%, outperforming a previously reported SVM with Gaussian RBF
kernel (test accuracy of 89.28%); however, it was trained on 10% more data
than the previous study (70% vs. 60%). In [15], the highest accuracy using
imaging data for multiclass differentiation or breast cancer subtype
classification was 90% in a paper employing ML. When comparing hybrid and
standalone deep learning models for gene expression, it was found that the
accuracy of the standalone deep learning models is consistently greater.

3. OVERVIEW OF THE PROPOSED SYSTEM

3.1. INTRODUCTION AND RELATED CONCEPTS

Breast cancer (BC) is the second leading cause of cancer death in women and has
a relatively high mortality rate. Early detection can minimize its impact:
early diagnosis significantly improves the prognosis and likelihood of
recovery, since it encourages patients to receive prompt surgical care.
Therefore, it is essential to have a method that enables the medical field to
diagnose breast cancer swiftly and accurately. Because it can model the
detection of vital features from complicated breast cancer datasets, machine
learning is frequently used to classify breast cancer patterns. We propose a
system based on an ensemble of classifiers for the automatic diagnosis and
prognosis of BC.

Machine learning and deep learning-based classifiers have demonstrated
extraordinary potential to improve classification and prediction accuracy.
Additionally, a number of ensembles of various ML-based classifiers were
evaluated for the classification of breast cancer.

3.2. ARCHITECTURE FOR THE PROPOSED SYSTEM


Fig:

Description:

• Import the necessary libraries, such as pandas, numpy, matplotlib,
scikit-learn, and TensorFlow.

• Collect the dataset, which contains features of breast cells.

• Preprocess the data by removing unnecessary columns, converting the diagnosis
values from M and B to the binary values 1 and 0, and removing any outliers.

• Visualize the data using various plots, such as bar charts, histograms, and
box plots.

• Calculate the correlation between the features and identify the features that
are highly correlated with the target variable.

• Train and evaluate different machine learning models, such as KNN, logistic
regression, random forest, and AdaBoost, using the preprocessed data.

• Combine these models in an ensemble using a voting classifier.

• Train a neural network model using TensorFlow to classify the breast cells as
malignant or benign.
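
The preprocessing, training, and voting steps described above can be sketched
as follows. This is an illustrative sketch under stated assumptions, not the
project's actual code: it uses the copy of the Wisconsin Diagnostic dataset
bundled with scikit-learn as a stand-in for the Kaggle file, and the 80/20
split, the hyperparameters, and the choice of soft voting are all assumptions.
Outlier removal and the TensorFlow network are omitted for brevity.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (AdaBoostClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset (assumption): scikit-learn encodes malignant as 0, so we
# first rebuild an M/B diagnosis column like the one described in the text.
data = load_breast_cancer()
X = data.data
diagnosis = np.where(data.target == 0, "M", "B")

# Preprocessing step: convert M and B to the binary values 1 and 0.
y = (diagnosis == "M").astype(int)

# Hold out 20% of the samples for evaluation (split ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Base models named in the text; KNN and logistic regression are scaled
# because they are sensitive to feature magnitudes.
models = [
    ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))),
    ("logreg", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
    ("ada", AdaBoostClassifier(n_estimators=50, random_state=42)),
]

# Voting classifier: soft voting averages the predicted class probabilities.
ensemble = VotingClassifier(estimators=models, voting="soft")
ensemble.fit(X_train, y_train)

accuracy = ensemble.score(X_test, y_test)
print(f"Voting ensemble test accuracy: {accuracy:.3f}")
```

Soft voting is used here because with four base models a hard majority vote can
tie; averaging probabilities avoids that case.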
3.3. PROPOSED SYSTEM MODEL
????????????????????????????????????????

4. PROPOSED SYSTEM ANALYSIS AND DESIGN

4.1. INTRODUCTION

The Proposed System has the following modules:

• Data Extraction

This is the process of obtaining data from the source, which is then used to
create structured data. The data is extracted from a database, which is Kaggle
in our case. In this step, we analyze and refine the data and convert it into
information for further use.

• Exploratory Data Analysis (EDA)

The dataset is explored in detail by inspecting it manually and applying
statistical operations. The purpose of data preprocessing is to provide a
refined input to the classifiers to achieve the best possible output. Missing
values, categorical features, variable scale, and high dimensionality can all
affect the performance of the classifier. During data exploration, the
correlation between the features of each dataset is examined.

EDA is an essential step in the analysis of data because it helps in finding
patterns, connections, and anomalies in the data, which can then be used to
guide further analyses or decisions.

• Feature Selection

We use methods such as a correlation heatmap to identify the variables that
are highly correlated with the target.
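
The correlation check behind such a heatmap can be sketched as below. The example
is illustrative only: it uses scikit-learn's bundled copy of the Wisconsin Diagnostic
dataset as a stand-in for the Kaggle file, and the 0.7 cut-off is a hypothetical
threshold, not a value taken from this project.

```python
from sklearn.datasets import load_breast_cancer

# Load the features and target as one DataFrame (30 features plus "target").
df = load_breast_cancer(as_frame=True).frame

# scikit-learn encodes malignant as 0 and benign as 1; flip it so that
# malignant = 1 and benign = 0, matching the M -> 1, B -> 0 convention.
df = df.assign(diagnosis=1 - df["target"]).drop(columns="target")

# Pearson correlation of every feature with the diagnosis column.
corr_with_target = df.corr()["diagnosis"].drop("diagnosis")

# Keep the features whose absolute correlation exceeds the assumed threshold.
THRESHOLD = 0.7  # hypothetical cut-off
selected = corr_with_target[corr_with_target.abs() > THRESHOLD]
selected = selected.sort_values(ascending=False)

print(selected)
```

The full matrix returned by `df.corr()` is what a heatmap would display; the
filtering step only makes the selection rule explicit.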

• Principal Component Analysis (PCA)

PCA is a statistical technique that seeks to reduce the dimensionality
(number of features) of a dataset while still retaining most of the
information needed to identify patterns.
It identifies the directions in which the data has maximum variance, called
principal components, and projects the data onto these directions.
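
A minimal sketch of this projection is shown below. It assumes scikit-learn's
bundled Wisconsin Diagnostic dataset in place of the Kaggle file, and the choice
of two components is only for illustration; the report does not state how many
components were kept.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)

# PCA is sensitive to feature scale, so standardize each feature first.
X_scaled = StandardScaler().fit_transform(X)

# Find the two directions of maximum variance and project the data onto them.
pca = PCA(n_components=2)  # assumed number of components
X_2d = pca.fit_transform(X_scaled)

# Fraction of the total variance captured by each principal component.
explained = pca.explained_variance_ratio_
print(X_2d.shape, explained)
```

Inspecting `explained` is a common way to decide how many components are needed
before most of the variance is retained.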

• Model Creation and Prediction

Different models are selected from a large pool of machine learning
models. For our project we have chosen KNN, logistic regression, random
forest, AdaBoost, and XGBoost due to their high performance. We
train and evaluate these models using the preprocessed data.

An ensemble learning method is then used, which combines the trained models
and improves predictive performance by combining their predictions.
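
One possible form of such an ensemble is sketched below. Several details are
assumptions made for the sake of a runnable example: scikit-learn's bundled
Wisconsin Diagnostic dataset replaces the Kaggle file, hard (majority) voting is
used, the hyperparameters are defaults or arbitrary, and XGBoost is left out to
avoid an extra dependency.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (AdaBoostClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# 70/30 train/test split, stratified so both sets keep the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Scale-sensitive models (KNN, logistic regression) get their own scaler.
ensemble = VotingClassifier(
    estimators=[
        ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
        ("logreg", make_pipeline(StandardScaler(),
                                 LogisticRegression(max_iter=1000))),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("ada", AdaBoostClassifier(random_state=42)),
    ],
    voting="hard",  # assumed: each model casts one vote for a class
)
ensemble.fit(X_train, y_train)

accuracy = ensemble.score(X_test, y_test)
print(f"Ensemble test accuracy: {accuracy:.3f}")
```

Soft voting (averaging predicted probabilities) is an alternative when all base
models provide calibrated probabilities.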

4.2. REQUIREMENT ANALYSIS

4.2.1. FUNCTIONAL REQUIREMENTS

4.2.1.1. PRODUCT PERSPECTIVE

The automated breast cancer recognition system based on machine
learning is an advanced software tool created to help healthcare
workers identify breast cancer accurately and effectively utilizing
medical imaging data. It is meant to be incorporated into current
clinical procedures, enhancing the knowledge of pathologists and
radiologists.

On a technical level, the product is based on reliable machine learning
models that have been trained on large, complex and annotated
datasets. The key patterns and traits suggestive of breast cancer are
captured and interpreted by these models using advanced image
processing and feature extraction techniques. In order to guarantee
accuracy and generalizability, the models go through rigorous training
and validation processes.

4.2.1.2. PRODUCT FEATURES

The automated breast cancer recognition product based on machine
learning using datasets offers a range of features to support accurate
and efficient breast cancer detection. It provides features for
organizing, annotating, and importing datasets to ensure effective data
management. The product uses data preprocessing methods such
as noise reduction, image resizing, pixel value normalization, and class
imbalance correction. Advanced feature extraction methods are also
included to capture key breast tissue properties in medical datasets.
Model training uses machine learning methods such as Logistic
Regression, Random Forest Classifier, AdaBoost, and XGBoost Classifier,
together with a Neural Network. Various criteria are
used to evaluate and validate the trained models, and model selection
and optimization approaches are used to determine which models
perform the best overall.

ROC, F1 score, recall, and other evaluation metrics are used in the
product's processes for assessing and verifying the trained models.
This helps confirm the models' functionality and generalizability, which
in turn supports reliable and accurate identification of breast cancer.
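
The relationship between these metrics and the confusion matrix can be made
concrete with a short sketch. The function name and the example counts below are
hypothetical and are not results from this project.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,  # share of predicted malignant cases that are malignant
        "recall": recall,        # share of malignant cases that were detected
        "f1": f1,                # harmonic mean of precision and recall
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=60, fp=3, fn=4, tn=104)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In a screening context recall deserves particular attention, because a false
negative corresponds to a missed malignant case.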

4.2.1.3. USER CHARACTERISTICS

Healthcare Professionals:
Healthcare professionals with expertise in the detection and treatment
of breast cancer are the main users of automated breast cancer
recognition systems based on machine learning. This includes
radiologists, pathologists, oncologists, and other medical professionals
who make clinical judgments while interpreting medical images.

Technical Proficiency:
To interact with the automated breast cancer recognition system
successfully, users must have a specific level of technical competence.
They have to be experienced in operating computer systems, navigating
software user interfaces, and comprehending the fundamental ideas
behind machine learning and medical imaging.

Domain Expertise:
Users should have in-depth knowledge of breast cancer, its pathology,
and the specifics of breast imaging procedures. For evaluating the
system's predictions, validating the results, and making decisions
based on the automated analysis, this domain expertise is essential.

Ability to Validate and Interpret Results:
Rather than replacing healthcare experts, automated breast cancer
recognition systems are created to support them. Users should be able
to verify and understand the system's findings in light of the patient's
clinical background, symptoms, and other diagnostic data. Critical
thinking, clinical reasoning, and fusing the automated analysis with
their own knowledge are required for this.

Users should have a thorough awareness of the ethical issues related to
employing automated methods for detecting breast cancer. They
should follow laws governing patient privacy and confidentiality, use
technology impartially and fairly, and engage with patients in a way
that fosters openness and confidence.

By taking into account these essential user characteristics, automated
breast cancer recognition systems can be developed and deployed to
specifically suit the requirements of medical professionals, promote
teamwork, and enhance the overall diagnostic procedure.

4.2.1.4. ASSUMPTION & DEPENDENCIES

Several crucial assumptions must hold in order for machine
learning-based automated breast cancer identification to work. First,
it is assumed that a sufficient and representative dataset is
available, one that correctly reflects the variety of breast cancer cases.
To ensure that the models are capable of generalizing adequately to
varied circumstances, this dataset should include a variety of breast
tissue types and cancer symptoms. For the purpose of extracting significant
characteristics and building precise machine learning models, the
quality and dependability of the medical datasets are essential.

It is further assumed that the characteristics taken from the medical
dataset are relevant and indicative of the presence or absence of breast
cancer. It is anticipated that the chosen feature extraction algorithms
will successfully capture the distinctive qualities of malignant
tumors and distinguish them from healthy breast tissue. Finally, it is
assumed that the annotation and labeling of the medical images in the
dataset are consistent and properly represent the ground truth in supervised
learning settings. The machine learning models are trained using exact
labels thanks to the precise and trustworthy annotations provided by
expert radiologists or pathologists.

Developing automated breast cancer identification systems based on
machine learning requires close cooperation with healthcare experts,
such as radiologists, oncologists, and medical researchers. Their
knowledge and input are crucial for confirming the system's
effectiveness and assuring compliance with clinical procedures. Large
datasets, sophisticated computations, and machine learning model
optimization all require access to sufficient computational resources,
such as high-performance computing infrastructure and powerful GPUs.
The credibility, efficiency, and practical application of automated
breast cancer identification systems are boosted by these resources.

4.2.1.5. DOMAIN REQUIREMENTS

When addressing various breast tissue types, tumor sizes, and cancer
stage variations, the system should demonstrate robustness to this variety.
When processing vast amounts of medical image data, it should
operate efficiently and scale. Additionally, the system should be
interpretable and explainable so that medical practitioners can
understand the variables influencing its predictions. In order to ensure
the secure processing of patient data, privacy and security are
essential. Last but not least, user-friendliness and efficient utilization
within clinical settings require seamless integration with the clinical
process.

4.2.1.6. USER REQUIREMENTS

High accuracy in detecting and classifying breast cancer cases,
usability through an intuitive interface integrated into clinical
workflows, interpretability to understand the system's decisions,
alignment with accepted clinical practices, scalability to handle large
datasets, and adherence to privacy and security standards are among
the user requirements for an automated breast cancer recognition
system based on machine learning. By fulfilling these criteria, the
system will be able to assist medical professionals in the diagnosis and
treatment of breast cancer, ultimately leading to improved patient care
and outcomes.

4.2.2. NON FUNCTIONAL REQUIREMENTS

4.2.2.1. PRODUCT REQUIREMENTS

4.2.2.1.1. EFFICIENCY (IN TERMS OF TIME AND SPACE)

- Reading the dataset takes O(n) time, where n is the total number of
rows in the dataset.

- Depending on the complexity of the procedures carried
out, data preprocessing and visualization take either O(n)
or O(n^2) time.

- Training and evaluating the machine learning models involves
several models; the time required is estimated as O(n),
although the exact cost depends on the algorithm used.

4.2.2.1.2. RELIABILITY

The method is intended to be highly reliable, consistently providing
correct predictions while reducing false positives and false
negatives. It should be robust and adaptable to changes in medical
imaging quality, breast tissue types, and cancer symptoms.
The system should also function consistently across multiple
datasets and contexts, indicating its ability to generalize.

4.2.2.1.3. PORTABILITY
To be easily installed and used across diverse healthcare
settings and platforms, the system must be portable.
Healthcare professionals can access and use the system
anywhere, regardless of the specific technical infrastructure
that may be in place. Computers, laptops, and mobile devices
with a range of operating systems and hardware should all
function with the system. In order to ensure seamless
integration and effective adoption, the system should also be
able to interact with currently used healthcare systems and
practices. By putting portability first, the automated breast
cancer recognition system becomes more usable and
adaptable, enabling healthcare providers to use it in a range of
clinical settings and improve patient care.

4.2.2.1.4. USABILITY

The system is intended to be user-friendly and intuitive,
allowing healthcare practitioners to quickly browse and
engage with its features. A simple and well-designed user
interface makes it easier to input medical images, retrieve
predictions, and access important details. Furthermore, the
technology integrates seamlessly into existing clinical
operations, ensuring minimal disruption and maximum
productivity.

4.2.2.2. ORGANIZATIONAL REQUIREMENTS


4.2.2.2.1. IMPLEMENTATION REQUIREMENTS (IN TERMS OF
DEPLOYMENT)

- Hardware Infrastructure: Ensure that the automated breast
cancer detection system can be deployed with the support
of an appropriate hardware infrastructure. This could
comprise networking hardware, servers, storage devices, and
GPUs or TPUs (Tensor Processing Units) for accelerated
computations.

- Software Frameworks and Libraries: Ensure that the required
libraries, such as TensorFlow and scikit-learn, are present
on the user's system. Make sure that the frameworks are
compatible with the hardware infrastructure and, where
possible, use frameworks that offer deployment optimizations.

4.2.2.2.2. ENGINEERING STANDARD REQUIREMENTS

Model Construction and Validation: Implement best practices
for machine learning model construction, such as appropriate
model selection, feature engineering, and hyperparameter
tuning. Utilize benchmark datasets and relevant
evaluation measures to validate the models. The model
development process should be documented, along with the
algorithms employed, version control, and performance
testing.

Preparing Training Data: Prepare the training data in
accordance with legal and ethical standards. In order to
safeguard privacy, make sure that patient data is properly
anonymized and de-identified. Use data preprocessing
methods such as normalization, augmentation, or noise
reduction to improve the quality and diversity of the
training dataset.

Performance Measures and Validation Protocols: Select
appropriate performance measures to evaluate the automated breast
cancer detection system's accuracy, sensitivity, specificity,
precision, and recall. To enable fair evaluation and
comparison of various algorithms and models, establish
validation processes.

4.2.2.3. OPERATIONAL REQUIREMENTS

• ECONOMIC
The Automated Breast Cancer Detection System would
incur costs based on the hardware required.

• ENVIRONMENTAL
It is implemented on computer systems that do not
require high specifications and therefore poses no
environmental threat.

• SOCIAL
It will create a sense of reliability among doctors and
medical practitioners, as it helps them achieve better results
and treatment.

• POLITICAL
The system does not carry any political bias.

• ETHICAL
Our system makes sure that data privacy laws are strictly
followed and that patient data confidentiality is upheld.
The patient's informed consent must be obtained before collecting
and using their medical information for breast cancer
detection.

• HEALTH AND SAFETY
We take responsibility for the accuracy and reliability of
the automated system, but any diagnosis made from the
predicted result rests solely with the doctor consulted
by the patient.

• SUSTAINABILITY
This system does not require a large amount of power and
can run on very little energy, which means it can be
sustained for a long period of time.

• LEGALITY
The dataset used by our automated system is freely
available to everyone, so it is legally sourced.

4.2.3. SYSTEM REQUIREMENTS

4.2.3.1. H/W REQUIREMENTS

• Processor: Minimum i3 Dual Core
• Ethernet connection (LAN) or Wi-Fi
• Hard Drive: Recommended 100 GB or more
• Memory (RAM): Minimum 8 GB

4.2.3.2. S/W REQUIREMENTS

• Python
• Anaconda
• Jupyter Notebook
• TensorFlow

5. RESULTS AND DISCUSSION

We aimed to develop a project that uses various ML models to predict breast cancer
based on various clinical factors. Initially, the project started with training multiple
machine learning models such as logistic regression, XGBoost, Random Forest, and other
ensemble learning techniques. After analyzing these models, we explored neural networks,
which the referenced paper suggested could reach a higher level of accuracy.

The dataset used in this project contains various parameters such as
concave points_mean, fractal_dimension_mean, and radius_worst. We
started with a detailed exploratory data analysis to visualize our feature
attributes and the dependencies among them using various charts and visualization
techniques. After the exploratory data analysis, our dataset was divided into training
and test sets in a 70:30 ratio. The performance of each of these models was evaluated
using metrics such as accuracy, precision, ROC, F1 score, and recall.

The results of this project showed that all the classical machine learning models
predicted breast cancer with a high level of accuracy. The custom neural network that
we developed for this project performed slightly better than all of them. Among the
classical models, logistic regression and XGBoost performed best, with an accuracy of
98.2% and an F1-score of 97%, outperforming the other models that we used.

After this, we trained and tested our custom neural network model, tuning its
hyperparameters and running it for a number of iterations (epochs=256) so that the
model could learn and increase its accuracy. After tuning, we obtained an accuracy of
98.9% (approximately 99%) and an F1-score of 98%, higher than that of any of the other
models that were trained. This shows that a neural network can achieve a higher level
of accuracy and more promising results than the other models that we explored in this
project.
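
A comparable experiment can be sketched without TensorFlow, as below. Note the
substitutions: scikit-learn's `MLPClassifier` stands in for the custom TensorFlow
network, the bundled Wisconsin Diagnostic dataset stands in for the Kaggle file,
and the layer sizes are arbitrary. Only the 70:30 split and the limit of 256
training iterations mirror the description above, so the scores it prints are
not expected to reproduce the reported figures.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
y = 1 - y  # flip labels so that malignant = 1 and benign = 0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Stand-in network: two hidden layers (assumed sizes), at most 256 epochs.
net = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=256, random_state=0),
)
net.fit(X_train, y_train)

y_pred = net.predict(X_test)
nn_accuracy = accuracy_score(y_test, y_pred)
nn_f1 = f1_score(y_test, y_pred)  # F1 for the malignant class
print(f"accuracy={nn_accuracy:.3f}  f1={nn_f1:.3f}")
```

Because the test set is small, differences of about one percentage point between
models correspond to only one or two samples, so repeated splits or
cross-validation would give a more stable comparison.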

The evaluation metrics corresponding to each of the models that we trained and tested
are listed below.

Analysing the Confusion Matrix of some Models.

Fig: Random Forest Classifier


Based on the confusion matrices generated, we can see false positives drop from 1.75%
to 0% and false negatives drop from 2.34% to 1.17%. With fewer false negatives and false
positives, the share of true positives and true negatives increases slightly,
which shows that the neural network outperformed the other models. This promising
result can be explored further for this specific problem by increasing the number of
feature attributes and dependencies so that the model becomes more medically accurate.
With more features, early detection of breast cancer can help medical professionals
take precautions and early measures to reduce late detection of cancer, and can help
health institutes provide proper medication and care in the future.
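
The percentages quoted above can be related to sample counts with a small
calculation. The test-set size of 171 and the individual counts are assumptions
(30% of a 569-row dataset); the report itself gives only percentages.

```python
TEST_SET_SIZE = 171  # assumption: 30% of a 569-row dataset


def percent_of_test_set(count, total=TEST_SET_SIZE):
    """Express an error count as a percentage of the whole test set."""
    return round(100 * count / total, 2)


# Hypothetical error counts before and after switching to the neural network.
before = {"false_positives": 3, "false_negatives": 4}
after = {"false_positives": 0, "false_negatives": 2}

before_rates = {name: percent_of_test_set(n) for name, n in before.items()}
after_rates = {name: percent_of_test_set(n) for name, n in after.items()}

print(before_rates)
print(after_rates)
```

Under these assumptions the improvement amounts to five fewer misclassified
samples, which is worth keeping in mind when comparing the models.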
