Breast Cancer Detection

“Evaluating Machine Learning Algorithms for Breast Cancer Detection in Developing Countries”

Table of Contents

1.Introduction
2.Proposed Methodology
3.Dataset Description
   3.1 Sample of Dataset
   3.2 Description Table
   3.3 Data Table
4.Preprocessing
   4.1 Label Encoding Using Diagnosis Column
   4.2 Converting int to float of diagnosis column
   4.3 Converting int to float of ID column
   4.4 Missing Value Checking
5.Implementation
   5.1 Data Visualization
   5.2 Machine Learning Algorithm
6.Result
7.Conclusion

1.Introduction
Every day, cancer affects people all over the world in a variety of ways. Breast
cancer is the most common type of cancer after skin cancer. Human body tissues
are made up of tiny cells. Uncontrolled cell proliferation in the breast can
result in lumps known as tumors. In breast cancer, these tumor cells proliferate
abnormally and become malignant. Both men and women can develop breast cancer;
women, however, are far more likely to get the disease. Breast cancer is
becoming more common by the day. Reported risk factors include increasing age,
shorter periods, delayed first childbirth, shorter nursing duration, family
history, prior breast cancer or tumor, abnormally big breasts, hormone
treatment, prior breast radiation, obesity, and high alcohol intake. Early
detection can reduce the number of breast cancer deaths, and predictions can be
made from specific signs and symptoms. Here are a few examples of symptoms:
• A breast lump or thickened tissue that feels different from the surrounding
  tissue.
• A noticeable change in the size, shape, or look of the breast.
• Noticeable changes in the breast skin, such as lumpiness.
• Redness or pitting of the breast skin.
A patient who shows any of these symptoms should see a doctor straight away.
Statistics reveal that many women are diagnosed with breast cancer even though
they have no symptoms. In such cases the cancer can spread unnoticed, which
increases the likelihood of death. This is why regular breast cancer screening
is required.
According to recent research, the survival rate for women with breast cancer is
91% five years after diagnosis, 86% after ten years, and 80% after 15 years.

Breast cancer is classified into stages and grades. The stage describes how far
the cancer has spread in the body, and the grade describes how quickly it is
likely to grow. If cancer is found at an early stage, it is far more treatable.
However, once cancer spreads, the danger of death rises sharply. Machine
learning algorithms can help identify cancer more accurately. In this research,
we employed four classification techniques: Support Vector Machine (SVM),
Decision Tree (DT), Random Forest (RF), and K Nearest Neighbors (KNN). In our
experiments, SVM outperforms the other three algorithms in terms of accuracy.

2.Proposed Methodology
We obtained the data for this paper online (kaggle.com). Of the 569 patients in
the dataset, 357 are benign and 212 are malignant. After collecting the data,
the influencing factors and features that serve as input variables are
identified. The block diagram of the proposed work is:

Figure: Block Diagram of the Proposed Methodology

To carry out the idea, a dataset is necessary. A total of 569 data points were
collected for pre-processing, and the dataset has 32 columns.
In this dataset, "Diagnosis" is the target attribute.

The machine learning algorithms used for classification are listed below:
• RF
• DT
• SVM
• KNN

3.Dataset Description
Our dataset was acquired from the kaggle.com website. There are 32 columns and
569 rows in this data set. The diagnosis column is the target attribute, and the
remaining 31 columns are feature attributes. As data visualization shows, the
target class of the data set comprises two classes of breast tumor: benign and
malignant.
After gathering the dataset, each column is converted to a numeric format, and
the diagnosis column is encoded as the target class. Following that, any
remaining non-numeric (object) column is handled, integer values are converted
to floats, and the data is checked for missing values. The machine learning
algorithms are then applied, and their outcomes are reported in the Result
section.
The columns cover ten features (radius, texture, perimeter, area, smoothness,
compactness, concavity, concave points, symmetry, and fractal dimension), each
reported as three statistics: mean, standard error, and worst.
The dataset has an "id" column, a "diagnosis" column (with values 'M' for
malignant and 'B' for benign), and numerical columns representing the features
extracted from breast cancer biopsies.
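
The structure described above can be checked with a short sketch. It assumes the
Kaggle file is the Wisconsin Diagnostic breast cancer data, and it uses the copy
bundled with scikit-learn as a stand-in for the CSV; that copy has no "id"
column, so it has 31 columns instead of 32.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Assumption: scikit-learn's bundled Wisconsin Diagnostic data stands in
# for the Kaggle CSV used in this report.
data = load_breast_cancer(as_frame=True)
df = data.frame.copy()

# scikit-learn codes the target as 0 = malignant, 1 = benign;
# map it to the 'M'/'B' labels used in the diagnosis column.
df["diagnosis"] = df["target"].map({0: "M", 1: "B"})
df = df.drop(columns="target")

counts = df["diagnosis"].value_counts()
print(df.shape)           # 30 feature columns plus diagnosis (no id column here)
print(counts.to_dict())   # number of benign and malignant rows
```

With the real CSV, the same counts would come from reading the file with
pandas and calling value_counts() on its diagnosis column.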

3.1 Sample of Dataset
Here is a sample of our breast cancer detection CSV dataset:

3.2 Description Table
In the implementation phase, the columns rm, tm, pm, am, sm, cm, cnm, cpm, sym,
fdm, rs, ts, ps, as, ss, cms, cs, cps, sys, fds, rw, tw, p_w, aw, sw, cmw, cnw,
cpw are used as feature attributes. One column (diagnosis, abbreviated dia) is
the target attribute. Train data accounts for 70% of the total and test data
for 30%. The train values (x_train, y_train) are the input to each model. For a
method that requires feature scaling, the scaled train values (xtrain1,
ytrain1) are passed instead, and the model then returns the predicted output.

Figure : Total Output of Train Test sample.
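
A minimal sketch of the 70/30 split and the scaled copies might look as follows.
The names x_train, y_train, xtrain1 and ytrain1 follow the text; the
scikit-learn copy of the data, the random_state value and the stratified split
are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled scikit-learn data stands in for the Kaggle CSV.
x, y = load_breast_cancer(return_X_y=True)

# 70% train / 30% test; random_state and stratify are illustrative choices.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.30, random_state=42, stratify=y
)

# Scaled copies for methods that need feature scaling.
scaler = StandardScaler().fit(x_train)   # statistics come from the train split only
xtrain1 = scaler.transform(x_train)
xtest1 = scaler.transform(x_test)
ytrain1, ytest1 = y_train, y_test        # labels are not scaled

print(x_train.shape, x_test.shape)
```

Fitting the scaler on the train split only keeps information from the test
rows out of the training step.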

3.3 Data Table


In this dataset, "Diagnosis" is the target attribute.

4.Preprocessing
Preprocessing refers to the tasks and techniques used to clean, transform, and
prepare data before it is used for analysis or machine learning.
The methods selected for preprocessing are:

• Label encoding of diagnosis column
• Converting int to float of diagnosis column
• Converting int to float of ID column
• Missing value checking
• Scaling (Z-score normalization)

Label encoding of diagnosis column

Label encoding converts categorical labels or text data into numerical
representations. We apply the label encoding transformation to the 'diagnosis'
column in the DataFrame 'df'. The fit_transform method of the LabelEncoder is
used, which both fits the encoder to the unique values in the 'diagnosis' column
and transforms the labels into numerical values. The encoded values are then
assigned back to the 'diagnosis' column in the DataFrame.
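
The step described above could be written roughly as below. The five-row
DataFrame is a hypothetical stand-in for the real data; only the 'diagnosis'
column matters here.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with hypothetical rows, standing in for the real DataFrame 'df'.
df = pd.DataFrame({"diagnosis": ["M", "B", "B", "M", "B"]})

encoder = LabelEncoder()
# fit_transform learns the unique labels and returns their integer codes.
df["diagnosis"] = encoder.fit_transform(df["diagnosis"])

print(list(encoder.classes_))     # classes are sorted, so 'B' -> 0 and 'M' -> 1
print(df["diagnosis"].tolist())
```

Because LabelEncoder sorts the labels, benign is coded 0 and malignant is
coded 1.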

Converting int to float of diagnosis column

This step changes the data type of the values in the selected column: the
values in the 'diagnosis' column are converted to the float data type. After
label encoding, the 'diagnosis' column holds integer codes in place of the
original categorical values ('M' for malignant and 'B' for benign); the
conversion ensures that these codes are represented as floating-point numbers.

Converting int to float of ID column

The same kind of conversion is applied to the 'id' column: its values are
converted to the float data type, so that the numerical identifiers are
represented as floating-point numbers.
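
Both conversions can be sketched with pandas astype. This is an assumption
about how the conversion was done, since the report does not name the call, and
the three rows below are hypothetical.

```python
import pandas as pd

# Hypothetical rows: integer ids and label-encoded diagnosis codes.
df = pd.DataFrame({"id": [101, 102, 103], "diagnosis": [1, 0, 0]})

# astype(float) returns a converted copy, so assign it back to the column.
df["diagnosis"] = df["diagnosis"].astype(float)
df["id"] = df["id"].astype(float)

print(df.dtypes)
```

The values themselves do not change; only their storage type does.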

Missing value checking

We checked for missing values and did not find any in our dataset.
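
One common way to run this check is to count null entries per column. The
sketch below assumes the scikit-learn copy of the data as a stand-in for the
Kaggle CSV.

```python
from sklearn.datasets import load_breast_cancer

# Assumption: the bundled scikit-learn data stands in for the Kaggle CSV.
df = load_breast_cancer(as_frame=True).frame

missing_per_column = df.isnull().sum()          # null count for each column
total_missing = int(missing_per_column.sum())   # null count for the whole frame

print(total_missing)
```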

Scaling

Scaling is used in machine learning so that all features contribute comparably
to the model training process. Feature scaling normalizes the range of the
independent variables (features) of the data, which keeps the variables in
balance. We measure each algorithm's accuracy both with and without scaling,
because the effect depends on the method. For instance, SVM reaches its highest
accuracy with scaling; without scaling, its accuracy is not as good. The results
of DT and random forest are good even without scaling.
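
As a rough illustration, Z-score normalization replaces each value x with
(x - mean) / standard deviation of its column. The sketch below checks that
formula against scikit-learn's StandardScaler and fits an SVM with and without
scaling. The bundled data, the split settings and the default SVC parameters
are assumptions, so the accuracies it prints need not match the ones reported
here.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

x, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.30, random_state=42, stratify=y
)

# Z-score by hand: subtract the column mean, divide by the column std.
manual = (x_train - x_train.mean(axis=0)) / x_train.std(axis=0)

# The same transformation with StandardScaler, fitted on the train split.
scaler = StandardScaler().fit(x_train)
x_train_s = scaler.transform(x_train)
x_test_s = scaler.transform(x_test)

# SVM accuracy on the test split, without and with scaling.
acc_raw = SVC().fit(x_train, y_train).score(x_test, y_test)
acc_scaled = SVC().fit(x_train_s, y_train).score(x_test_s, y_test)

print(f"SVM without scaling: {acc_raw:.3f}")
print(f"SVM with scaling:    {acc_scaled:.3f}")
```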

5.Implementation
We obtained our data from the kaggle.com platform. The dataset contains 32
columns and 569 rows. The diagnosis attribute is the target attribute, and the
remaining 31 columns are feature attributes. Data visualization shows that the
target class comprises two classes: benign and malignant.

After gathering the dataset, each column is converted to a numeric format, and
the diagnosis column is encoded as the target class. Following that, any
remaining non-numeric (object) column is handled, integer values are converted
to floats, and the data is checked for missing values. The machine learning
algorithms are then applied, and their outcomes are reported in the Result
section.

5.1 Data Visualization

We also use several data visualization methods:

Pair plot- We use the pair plot to explore relationships between multiple
variables in the dataset. The Seaborn library visualizes the pairwise
relationships between variables in a DataFrame. We specifically target the
DataFrame df, selecting the columns at index 1 to 6 (inclusive) and using the
'diagnosis' column as the hue variable.

Figure : Data Visualization using pairplot

Correlation- Here we use correlation, a statistical measure that quantifies the
strength and direction of the relationship between two variables. The resulting
correlation matrix is a square matrix with dimensions (12 x 12), where each
element is the correlation coefficient between the corresponding pair of
columns. This matrix provides insight into the relationships between the
selected features in the data frame.

Figure : Correlation
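
A 12 x 12 correlation matrix can be computed with pandas as sketched below.
Which twelve columns the report selected is not stated, so the choice here (the
diagnosis code plus the first eleven features of the scikit-learn copy) is an
assumption.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

# Assumption: the bundled scikit-learn data stands in for the Kaggle CSV.
# Its target code is 0 = malignant, 1 = benign.
frame = load_breast_cancer(as_frame=True).frame.rename(
    columns={"target": "diagnosis"}
)

# Illustrative selection: diagnosis plus the first eleven feature columns.
cols = ["diagnosis"] + list(frame.columns[:11])
corr = frame[cols].corr()   # Pearson correlation by default

print(corr.shape)
print(corr.loc["mean radius", "mean perimeter"])
```

Each diagonal entry is 1, since every column is perfectly correlated with
itself.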

Heatmap- For our project we used heatmaps, which visualize the distribution and
intensity of data values, making it easier to spot patterns and trends that
might not be apparent from other visualizations. The color-coding scheme helps
identify areas of high or low values, allowing for quick visual analysis and
interpretation.

Figure : Heatmap
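
One way to draw such a heatmap is Seaborn's heatmap function applied to a
correlation matrix. The data source, the column selection and the color map
below are illustrative assumptions, not the settings used for the figure above.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer

# Assumption: the bundled scikit-learn data stands in for the Kaggle CSV.
frame = load_breast_cancer(as_frame=True).frame
corr = frame.iloc[:, :12].corr()   # first twelve feature columns

fig, ax = plt.subplots(figsize=(10, 8))
# annot=True writes each coefficient into its cell; vmin/vmax fix the color scale.
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation heatmap")
# fig.savefig("heatmap.png") would write the figure to disk.
```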

5.2 Machine learning algorithms


Our goal for these experiments is to find the best model for breast cancer
detection. Models are created using the following four ML algorithms: SVM, KNN,
Random Forest, and Decision Tree. The best results obtained after
implementation are reported in the Result section.

Figure: Visualization of SVM algorithm

Figure: Visualization of KNN algorithm

Figure: Visualization of RandomForest algorithm

Figure: Visualization of Decision Tree algorithm
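
The comparison of the four classifiers can be sketched as below. The report
does not state its hyperparameters, so scikit-learn defaults are assumed, SVM
and KNN are given scaled inputs, and the bundled data and split settings are
stand-ins. The printed accuracies may therefore differ from the reported ones.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

x, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.30, random_state=42, stratify=y
)

# Default hyperparameters (an assumption); the pipelines scale inputs first.
models = {
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Random Forest": RandomForestClassifier(random_state=42),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "DT": DecisionTreeClassifier(random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    train_acc = model.score(x_train, y_train)   # accuracy on the train split
    test_acc = model.score(x_test, y_test)      # accuracy on the held-out split
    results[name] = (train_acc, test_acc)
    print(f"{name}: train {train_acc:.3f}, test {test_acc:.3f}")
```

Comparing the train and test columns shows how well each model generalizes; a
large gap between them points to overfitting.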

6.Result

This research compares ML methods for breast cancer detection and diagnosis.
Four popular supervised ML techniques, namely support vector machine (SVM),
decision tree (DT), random forest (RF), and K-nearest neighbor (KNN), were used
for classification. On the test data, SVM provides 97% accuracy, Random Forest
96%, KNN 94%, and Decision Tree 92%. Based on our findings, SVM has the highest
test accuracy, 97%, which is the best accuracy obtained in this research for
detecting breast cancer.
Algorithms       Accuracy with train data   Accuracy with test data
SVM              98%                        97%
Random Forest    100%                       96%
KNN              95%                        94%
DT               100%                       92%

7.Conclusion:

The study's findings support the use of machine learning algorithms for the
early identification of breast cancer. An online dataset for breast cancer
detection, sourced from kaggle.com, was used to build and evaluate this system.
On the basis of this dataset, the proposed models were created, refined, and
put into use. On test data, SVM performs best among the four machine learning
algorithms, with 97% test accuracy and 98% train accuracy. Random Forest, KNN,
and DT also offer high accuracy: 96%, 94%, and 92% on test data, and 100%, 95%,
and 100% on train data, respectively. Our dataset has just 569 records; with
many more, the models may achieve higher accuracy.
