Breast Cancer Detection
Table of Contents
1.Introduction
2.Proposed Methodology
3.Dataset Description
3.1 Sample of Dataset
4.Preprocessing
4.1 Label Encoding Using Diagnosis Column
5.Implementation
5.1 Data Visualization
6.Result
7.Conclusion
Title: “Evaluating Machine Learning Algorithms for
Breast Cancer Detection in Developing Countries”
1.Introduction
Every day, cancer affects people all over the world in a variety of ways. Breast
cancer is the most common type of cancer after skin cancer. Human tissues are
made up of tiny cells. Uncontrolled cell proliferation in the breast can result
in lumps known as tumors. In breast cancer, these tumor cells proliferate
abnormally and become malignant. Both men and women can develop breast cancer,
although women are far more likely to get the disease. Breast cancer is becoming
more common by the day. Risk generally increases with age, shorter periods,
delayed first childbirth, shorter nursing duration, family history, prior breast
cancer or tumor, abnormally big breasts, hormone treatment, prior breast
radiation, obesity, and high alcohol intake. The number of breast cancer deaths
can be reduced through early detection, which relies on recognizing specific
signs and symptoms. Here are a few examples of symptoms:
• A breast lump or thickened tissue that feels different from the surrounding
tissue.
• A noticeable change in the size, shape, or appearance of the breast.
• Noticeable changes in the breast skin, such as lumpiness.
• Redness or pitting of the skin of the breast.
If a patient shows any of these symptoms, they should see a doctor straight
away. Statistics also show that many women are diagnosed with breast cancer even
though they have no symptoms. In such cases the cancer can spread before it is
noticed, increasing the likelihood of death. This is why regular breast cancer
screening is required.
According to recent research, the survival rate for women with breast cancer is
91% five years after diagnosis, 86% after ten years, and 80% after 15 years.
Breast cancer is classified into stages and grades. The stage and grade describe
how far the cancer has spread and how quickly it is developing in the human
body. If cancer is found at an early stage, it is much easier to treat. However,
once cancer spreads, the risk of death rises sharply. Machine learning
algorithms can help identify cancer more accurately. In this research, we
employed four classification techniques: Support Vector Machine (SVM), Decision
Tree (DT), Random Forest (RF), and K-Nearest Neighbors (KNN). SVM outperforms
the other three algorithms in terms of test accuracy.
2.Proposed Methodology
We obtained the data for this paper online from kaggle.com. Of the 569 patients
in the dataset, 357 have benign tumors and 212 have malignant tumors. The
influencing factors and features serve as the input variables. The block
diagram of the proposed work is:
To carry out the idea, a dataset is necessary. A total of 569 data points were
collected for pre-processing. The dataset has 32 columns.
In this dataset, "Diagnosis" is the target attribute.
The machine learning algorithms used for classification are listed below:
• RF
• DT
• SVM
• KNN
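As a rough illustration of how the four classifiers listed above can be compared, the following sketch uses scikit-learn. It assumes that the copy of the Wisconsin diagnostic data bundled with scikit-learn (`load_breast_cancer`, 569 samples) can stand in for the Kaggle CSV. The hyperparameters, the random seed, and the `models` and `results` names are illustrative assumptions, not the settings used in this study, so the printed accuracies need not match the reported ones.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled dataset stands in for the Kaggle CSV.
X, y = load_breast_cancer(return_X_y=True)

# 70% train / 30% test, stratified so both splits keep the class balance.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0, stratify=y
)

# SVM and KNN are distance based, so they are wrapped with feature scaling.
models = {
    "RF": RandomForestClassifier(random_state=0),
    "DT": DecisionTreeClassifier(random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

# Store (train accuracy, test accuracy) for each model.
results = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    results[name] = (model.score(x_train, y_train), model.score(x_test, y_test))

for name, (train_acc, test_acc) in results.items():
    print(f"{name}: train={train_acc:.2f} test={test_acc:.2f}")
```

Reporting train and test accuracy side by side makes it easier to spot models that fit the training data perfectly but generalize less well.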
3.Dataset Description
Our dataset was acquired from the kaggle.com website. There are 32 columns and
569 rows in this dataset. The diagnosis column is the target attribute, and the
remaining 31 columns are feature attributes. As data visualization shows, the
target class comprises two classes of breast tumor: benign and malignant.
After gathering the dataset, each column is converted to a numeric format, and
the diagnosis is encoded as the target class. Following that, any remaining
object-type column is handled, integer values are converted to floats, and
missing data is checked. The machine learning algorithms are then applied, and
the results are reported.
The columns describe ten features: radius, texture, perimeter, area, smoothness,
compactness, concavity, concave points, symmetry, and fractal dimension. Each
feature is reported in three forms (mean, standard error, and worst), giving 30
numerical columns.
The dataset also has an "id" column and a "diagnosis" column (with values
'M' for malignant and 'B' for benign). The numerical columns represent
features extracted from breast cancer biopsies.
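The column layout described above can be checked with a short sketch. The ten base features and three statistics come from the text; the `_mean`, `_se`, and `_worst` suffixes and the exact spelling of each name are assumptions about how the CSV header is written and may differ from the actual file.

```python
# Ten base measurements named in the dataset description.
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]

# Three summary statistics per measurement (suffix spelling is assumed).
STATISTICS = ["mean", "se", "worst"]


def build_column_names():
    """Return the expected header: id, diagnosis, then 10 x 3 feature columns."""
    columns = ["id", "diagnosis"]
    for stat in STATISTICS:
        for feature in BASE_FEATURES:
            columns.append(f"{feature}_{stat}")
    return columns


columns = build_column_names()
print(len(columns))  # 2 + 10 * 3 = 32
```

This matches the 32 columns reported for the dataset: one target column and 31 others.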
3.1 Sample of Dataset
Here is a sample of our breast cancer detection CSV dataset:
3.2 Description Table
In the implementation phase, the dataset columns are abbreviated as dia, rm, tm,
pm, am, sm, cm, cnm, cpm, sym, fdm, rs, ts, ps, as, ss, cms, cs, cps, sys, fds,
rw, tw, p_w, aw, sw, cmw, cnm, cpw. All of them except the diagnosis column are
used as feature attributes. One column (diagnosis) is the target attribute.
The data is split into 70% training data and 30% test data. The training values
(x_train, y_train) are the input to each model. For methods that require feature
scaling, the scaled training values (xtrain1, ytrain1) are used instead, and the
trained model then returns the predicted output.
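The 70/30 split described above can be sketched with the standard library alone. The function name `split_train_test`, the seed, and the use of row indices as stand-in data are assumptions for illustration; a library helper such as scikit-learn's `train_test_split` serves the same purpose.

```python
import random


def split_train_test(rows, test_fraction=0.30, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for the 569 dataset rows: only the row indices are used here.
rows = list(range(569))
train, test = split_train_test(rows)
print(len(train), len(test))  # 398 171
```

With 569 rows, a 30% test fraction gives 171 test rows and 398 training rows.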
4.Preprocessing
Preprocessing in the context of datasets refers to the tasks and techniques used
to clean, transform, and prepare the data before it is used for analysis or
machine learning.
The methods selected for preprocessing are:
Converting int to float of diagnosis column
This step changes the data type of the values in the selected column. Here, the
values in the 'diagnosis' column are converted to the float data type. The
column originally holds categorical values ('M' for malignant and 'B' for
benign), so after label encoding the conversion ensures that each diagnosis is
represented as a floating-point number.
The same method is applied to the 'id' column, which contains numerical
identifiers, so that they are also represented as floating-point numbers.
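A minimal sketch of this encoding step is shown here. The mapping of 'B' to 0.0 and 'M' to 1.0 is an assumption, since the text does not state which number each class receives, and the sample labels and ids are made-up illustrative values rather than records from the dataset.

```python
# Assumed mapping: benign -> 0.0, malignant -> 1.0.
LABEL_TO_FLOAT = {"B": 0.0, "M": 1.0}


def encode_diagnosis(labels):
    """Map 'B'/'M' strings to floats and reject any unexpected label."""
    encoded = []
    for label in labels:
        if label not in LABEL_TO_FLOAT:
            raise ValueError(f"unexpected diagnosis label: {label!r}")
        encoded.append(LABEL_TO_FLOAT[label])
    return encoded


# Illustrative values only, not real records.
sample_labels = ["M", "B", "B", "M"]
sample_ids = ["101", "102", "103", "104"]

encoded_labels = encode_diagnosis(sample_labels)
float_ids = [float(value) for value in sample_ids]  # id column as floats

print(encoded_labels)  # [1.0, 0.0, 0.0, 1.0]
print(float_ids)       # [101.0, 102.0, 103.0, 104.0]
```

Raising an error on unknown labels avoids silently encoding a typo as a valid class.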
Missing value checking
We checked for missing values and found none in our dataset.
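One way to run such a check is sketched below, assuming the rows are available as dictionaries (for example from `csv.DictReader`). The function name `count_missing`, the set of tokens treated as missing, and the two sample rows are assumptions for illustration only.

```python
# Assumed spellings of a missing cell in the raw CSV text.
MISSING_TOKENS = {"", "na", "nan"}


def count_missing(rows):
    """Return {column: number of missing cells} for a list of dict rows."""
    missing = {}
    for row in rows:
        for column, value in row.items():
            is_missing = value is None or str(value).strip().lower() in MISSING_TOKENS
            missing[column] = missing.get(column, 0) + int(is_missing)
    return missing


# Illustrative rows only; the second row has an empty 'rm' cell.
sample_rows = [
    {"diagnosis": "M", "rm": "1.0"},
    {"diagnosis": "B", "rm": ""},
]
missing_counts = count_missing(sample_rows)
print(missing_counts)  # {'diagnosis': 0, 'rm': 1}
```

A dataset with no missing values, as reported here, would produce a count of zero for every column.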
Scaling
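The text does not say which scaling method was used. The sketch below assumes standardization (subtract the mean, divide by the standard deviation), a common choice for SVM and KNN. The function names and sample values are hypothetical. The mean and standard deviation are estimated on the training column only and then reused for the test column, so that no information from the test data leaks into training.

```python
import statistics


def fit_scaler(train_values):
    """Return (mean, std) estimated from the training values only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)  # population standard deviation
    return mean, std


def apply_scaler(values, mean, std):
    """Standardize values with a previously fitted mean and std."""
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(value - mean) / std for value in values]


# Illustrative numbers only, not taken from the dataset.
train_column = [10.0, 12.0, 14.0, 16.0]
test_column = [13.0, 18.0]

mean, std = fit_scaler(train_column)
scaled_train = apply_scaler(train_column, mean, std)
scaled_test = apply_scaler(test_column, mean, std)
print(scaled_train)
print(scaled_test)
```

After scaling, the training column has mean 0 and standard deviation 1, so features measured on very different ranges contribute comparably to distance-based models.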
5.Implementation
We obtained our data from the kaggle.com platform. The dataset contains 32
columns and 569 rows. Of these 32 columns, the diagnosis column is the target
attribute, and the remaining 31 columns are feature attributes. Based on data
visualization, the target class consists of two classes of breast tumor: benign
and malignant.
Figure: Data Visualization using pairplot
Figure: Correlation
Figure: Heatmap
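A correlation heatmap is drawn from a matrix of pairwise correlation coefficients. The sketch below assumes the Pearson coefficient, which the text does not confirm, and computes it with the standard library. The function names and the three toy columns are hypothetical and are not values from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return a nested dict of pairwise correlations for named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Toy columns: 'b' rises with 'a', 'c' falls as 'a' rises.
toy_columns = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [3.0, 5.0, 7.0, 9.0],
    "c": [4.0, 3.0, 2.0, 1.0],
}
matrix = correlation_matrix(toy_columns)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

Values near 1 or -1 indicate strongly related features, which is what the darkest and lightest cells of a heatmap highlight.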
Figure: Visualization of KNN algorithm
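To make the KNN idea concrete, the following sketch classifies a query point by majority vote among its k nearest training points, using Euclidean distance. The value k=3, the function name `knn_predict`, and the two-dimensional toy points are assumptions for illustration; they are not the settings or data used in this study.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of query by majority vote of its k nearest neighbors."""
    # Pair each training point's distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Count the labels among the k closest points and return the most common.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy 2-D points: two near (1, 1) labeled 'B', two near (5, 5) labeled 'M'.
toy_points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.5, 4.5)]
toy_labels = ["B", "B", "M", "M"]

print(knn_predict(toy_points, toy_labels, (1.1, 0.9)))  # B
print(knn_predict(toy_points, toy_labels, (5.2, 4.9)))  # M
```

Because the prediction depends directly on distances, KNN is one of the methods for which feature scaling matters.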
6.Result
7.Conclusion
The study's findings support the use of machine learning algorithms for the
early identification of breast cancer. An online breast cancer detection
dataset, sourced from kaggle.com, was used to build and evaluate this system. On
the basis of this dataset, a model was created, refined, and put into use. On
test data, SVM performs best among the algorithms considered, with 97% test
accuracy and 98% train accuracy. Random Forest, KNN, and DT also offer high
accuracy: 96%, 94%, and 92% on test data, and 100%, 95%, and 100% on train data,
respectively. Our dataset has only 569 records; with considerably more data,
the models could likely achieve higher accuracy.