0% found this document useful (0 votes)

97 views

Breast Cancer Classification Using Python

The document describes building a machine learning model to classify breast cancer tumors as benign or malignant based on their characteristics. It discusses obtaining breast cancer data, exploring the data through visualizations like violin plots and correlation maps, and selecting relevant features for the classification model. The goal is to accurately identify malignant tumors through an analysis of tumor shape, geometry and other specifications.

Uploaded by

Davidson De Oliveira Lima

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

97 views

Breast Cancer Classification Using Python

Uploaded by

Davidson De Oliveira Lima

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 26

18/06/2021 Breast Cancer Classification Using Python | by Mugdha Paithankar | The Startup | Medium

Breast Cancer Classification Using

Python
A guide to EDA and classification

Mugdha Paithankar
Nov 8, 2020 · 13 min read

Photo by Peter Boccia on Unsplash

https://medium.com/swlh/breast-cancer-classification-using-python-e83719e5f97d 1/26
18/06/2021 Breast Cancer Classification Using Python | by Mugdha Paithankar | The Startup | Medium

Breast cancer (BC) is one of the most common cancers among women in
the world today.

Currently, the average risk of a woman in the United States developing

breast cancer sometime in her life is about 13%, which means there is a 1
in 8 chance she will develop breast cancer!

An early diagnosis of BC can greatly improve the prognosis and chance of

survival for patients. Thus an accurate identification of malignant tumors is
of paramount importance.

In this article I will also go over all the steps needed to make a Data Science
project complete in itself, and with the use of machine learning algorithms,
ultimately build a model which accurately classifies tumors as Benign or
Malignant based on the tumor shape and its geometry.

Step 1: Get the data!

I got the dataset from Kaggle. It contains 596 rows and 32 columns of
tumor shape and specifications. The tumor is classified as benign or
malignant based on its geometry and shape. Features are computed from a
digitized image of a fine needle aspirate (FNA) of a breast mass, which is
type of biopsy procedure. They describe characteristics of the cell nuclei
present in the image.

The features of the dataset include:

1. tumor radius (mean of distances from center to points on the

perimeter)

2. texture (standard deviation of gray-scale values)

3. perimeter

https://medium.com/swlh/breast-cancer-classification-using-python-e83719e5f97d 2/26
18/06/2021 Breast Cancer Classification Using Python | by Mugdha Paithankar | The Startup | Medium

4. area

5. smoothness (local variation in radius lengths)

6. compactness (perimeter² / area — 1.0)

7. concavity (severity of concave portions of the contour)

8. concave points (number of concave portions of the contour)

9. symmetry

10. fractal dimension

The mean, standard error and “worst” or largest (mean of the three largest
values) of these features were computed for each image, resulting in 30
features.

Step 2: Exploratory Data Analysis (EDA)

#make a dataframe

df = pd.read_csv(‘data.csv’)

#examine the shape of the data

df.shape()

#get the column names

df.columns

The dataset has 569 rows and 33 columns. There are two extra columns
“id” and “Unnamed: 32”. We drop Unnamed: 32 which has all Nan
values.

#Drop the column with all missing values (na, NAN, NaN)

#NOTE: This drops the column Unnamed: 32 column

https://medium.com/swlh/breast-cancer-classification-using-python-e83719e5f97d 3/26
18/06/2021 Breast Cancer Classification Using Python | by Mugdha Paithankar | The Startup | Medium

df = df.dropna(axis=1)

#Get a count of the number of 'M' & 'B' cells

df['diagnosis'].value_counts()

#Visualize this count

sns.countplot(df['diagnosis'],label="Count")

212 Malignant and 357 Benign tumors

There are now 30 features we can visualize. I decided to plot 10 features

at a time. This led to 3 plots containing 10 features each. The means of all
the features were plotted together, so were the standard errors and worst
dimensions.

Violin plots are like density plots and unlike bar graphs with means and
error bars, violin plots contain all data points which make them an excellent
tool to visualize samples of small sizes.

I made violin plots and commented, based on their distribution whether

that feature will be good for classification. To make violin plots for this
dataset, first separate the data labels ‘M’ or ‘B’ (into y) and features (into
X). Then visualize 10 features at a time.

https://medium.com/swlh/breast-cancer-classification-using-python-e83719e5f97d 4/26
18/06/2021 Breast Cancer Classification Using Python | by Mugdha Paithankar | The Startup | Medium

# y includes diagnosis column with M or B values

y = df.diagnosis

# drop the column 'id' as it is does not convey any

useful info

# drop diagnosis since we are separating labels and

features

list = [‘id’,’diagnosis’]

# X includes our features

X = df.drop(list,axis = 1)

# get the first ten features

data_dia = y

data = X

data_std = (data — data.mean()) / (data.std()) #

standardization

# get the first 10 features

data = pd.concat([y,data_std.iloc[:,0:10]],axis=1)

data = pd.melt(data,id_vars=”diagnosis”,

var_name=”features”,

value_name=’value’)

# make a violin plot

plt.figure(figsize=(10,10))

sns.violinplot(x=”features”, y=”value”,
hue=”diagnosis”, data=data,split=True, inner=”quart”)

plt.xticks(rotation=90)

https://medium.com/swlh/breast-cancer-classification-using-python-e83719e5f97d 5/26
18/06/2021 Breast Cancer Classification Using Python | by Mugdha Paithankar | The Startup | Medium

Violin plot displaying all the mean features

The median of texture_mean for Malignant and Benign looks separated,

so it might be a good feature for classification. For
fractal_dimension_mean, the medians of the Malignant and Benign
groups are very close to each other.

https://medium.com/swlh/breast-cancer-classification-using-python-e83719e5f97d 6/26
18/06/2021 Breast Cancer Classification Using Python | by Mugdha Paithankar | The Startup | Medium

Violin plot displaying all the standard error features

The medians for almost all Malignant or Benign don’t vary much for the
standard error features above, except for concave points_se and
concavity_se. smoothness_se or symmetry_se have a very similar
distribution which could make classification using this feature difficult. The
shape of the violin plot for area_se looks warped and the distribution of
data points for benign and malignant very different!

https://medium.com/swlh/breast-cancer-classification-using-python-e83719e5f97d 7/26
18/06/2021 Breast Cancer Classification Using Python | by Mugdha Paithankar | The Startup | Medium

Violin plot displaying all the worst dimension features

area_worst look well separated, so it might be easier to use this feature for
classification! Variance seems highest for fractal_dimension_worst.
concavity_worst and concave_points_worst seem to have a similar data
distribution.

In order to check the correlation between the features, I plotted a correlation

matrix. It is effective in summarizing a large amount of data where the goal
is to see patterns.

#correlation map

f,ax = plt.subplots(figsize=(18, 18))

matrix = np.triu(X.corr())

sns.heatmap(X.corr(), annot=True, linewidths=.5, fmt=

‘.1f’,ax=ax, mask=matrix)

https://medium.com/swlh/breast-cancer-classification-using-python-e83719e5f97d 8/26
18/06/2021 Breast Cancer Classification Using Python | by Mugdha Paithankar | The Startup | Medium

Correlation heatmap of all the features

The means, std errors and worst dimension lengths of compactness,

concavity and concave points of tumors are highly correlated amongst
each other (correlation > 0.8). The mean, std errors and worst dimensions
of radius, perimeter and area of tumors have a correlation of 1!

https://medium.com/swlh/breast-cancer-classification-using-python-e83719e5f97d 9/26
18/06/2021 Breast Cancer Classification Using Python | by Mugdha Paithankar | The Startup | Medium

texture_mean and texture_worst have a correlation of 0.9. area_worst and

area_mean have a correlation of 1.

By now we have a rough idea that many of the features are highly
correlated amongst each other. But what about correlation between the
benign and malignant groups for each feature? In order to understand if
there is a difference between the data distribution for malignant and
benign groups, I visualized some features via box plots and performed a t
test to detect statistical significance.

Box plots succinctly compare multiple distributions and are a great way to
visualize the IQR.

# create boxplots for texture mean vs diagnosis of

tumor

plot = sns.boxplot(x=’diagnosis’, y=’texture_mean’,

data=df, showfliers=False)

plot.set_title(“Graph of texture mean vs diagnosis of

tumor”)

Comparing mean features for M and B groups

Texture means, for malignant and benign tumors vary by about 3 units.

https://medium.com/swlh/breast-cancer-classification-using-python-e83719e5f97d 10/26
18/06/2021 Breast Cancer Classification Using Python | by Mugdha Paithankar | The Startup | Medium

The distribution looks similar for both the groups. Malignant tumors tend
to have a higher texture mean compared to benign.

Fractal dimension means are almost the same for malignant and benign
tumors. The IQR is wider for malignant tumors.

Comparing se features for M and B groups

Malignant groups have a distinctly wider range of values for area se. The
distribution range is very narrow for benign groups. This might be a good
feature for classification.

Standard error (se) of concave points has a higher mean and IQR for
malignant tumors. The distribution looks somewhat similar for both tumor
types.

https://medium.com/swlh/breast-cancer-classification-using-python-e83719e5f97d 11/26
18/06/2021 Breast Cancer Classification Using Python | by Mugdha Paithankar | The Startup | Medium

Comparing worst dimension features for M and B groups

Malignant groups have a wider range of values for radius worst compared
to benign groups. The IQR is wider for the same. Malignant tumors have a
higher radius worst compared to benign groups.

Similar to area_se, area_worst has a very different data distribution for

malignant and benign tumors. Malignant tumors tend to have a higher
value of mean and wider IQR range. Because of noticeable differences
between B and M tumors, this could be a good feature for classification.

Box plots indicated a difference in means for most of the features

visualized above. But are these differences statistically significant? One
way to check for this is by a t test.

t test tells us he t test tells you how significant the differences between groups
are; In other words it lets you know if those differences (measured in means)
could have happened by chance.

# make a new dataframe with only the desired feature

for t test

new = pd.DataFrame(data=df[[‘area_worst’,
‘diagnosis’]])

new = new.set_index(‘diagnosis’)

stats.ttest_ind(new_d.loc[‘M’], new_d.loc['B'])

https://medium.com/swlh/breast-cancer-classification-using-python-e83719e5f97d 12/26
18/06/2021 Breast Cancer Classification Using Python | by Mugdha Paithankar | The Startup | Medium

t test results for some features from the dataset

Except for fractal dimension mean, the p value and t statistic is

statistically significant for all the features in the table above. For fractal
dimension mean the null hypothesis stands true, meaning there is no
difference in means for the fractal dimension mean of M and B tumors.

From the correlation matrix we saw earlier, it was clear that there are quite
a few features with very high correlations. So I dropped one of the
features, from each of the feature pairs which had a correlation greater
than 0.95. ‘perimeter_mean’, ‘area_mean’, ‘perimeter_se’, ‘area_se’,
‘radius_worst’, ‘perimeter_worst’, ‘area_worst’ were amongst the features
that were dropped.

# Create correlation matrix

corr_matrix = X.corr().abs()

# Select upper triangle of correlation matrix

upper =
corr_matrix.where(np.triu(np.ones(corr_matrix.shape),
k=1).astype(np.bool))

https://medium.com/swlh/breast-cancer-classification-using-python-e83719e5f97d 13/26
18/06/2021 Breast Cancer Classification Using Python | by Mugdha Paithankar | The Startup | Medium

# Find index of feature columns with correlation

greater than 0.95

to_drop = [column for column in upper.columns if

any(upper[column] > 0.95)]

# Drop features

X = X.drop(X[to_drop], axis=1)

X.columns

Step 3: Machine Learning

We want to build a model which classifies tumors as benign or malignant. I
used sklearn’s Logistic Regression, Support Vector Classifier, Decision Tree
and Random Forest for this purpose.

But first, transform the categorical variable column (diagnosis) to a

numeric type. I used sklearn’s LabelEncoder for this purpose. The M and
B variables were changed to 1 and 0 by the label encoder.

Transform categorical variables

#Encoding categorical data values

from sklearn.preprocessing import LabelEncoder

labelencoder_y = LabelEncoder()

y= labelencoder_y.fit_transform(y)

print(labelencoder_y.fit_transform(y))

Train Test Split the data

40% of the data was reserved for testing purposes. The dataset was
stratified in order to preserve the proportion of target as in the original
dataset, in the train and test datasets as well.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,

https://medium.com/swlh/breast-cancer-classification-using-python-e83719e5f97d 14/26
18/06/2021 Breast Cancer Classification Using Python | by Mugdha Paithankar | The Startup | Medium

y, test_size = 0.40, stratify=y, random_state = 17)

Scale the features

sklearn’s Robust Scaler was used to scale the features of the dataset. The
centering and scaling statistics of this scaler are based on percentiles and
are therefore not influenced by a few number of very large marginal
outliers.

#Feature Scaling

from sklearn.preprocessing import RobustScaler

sc = RobustScaler()

X_train = sc.fit_transform(X_train)

X_test = sc.transform(X_test)

Train the data

# Define a function which trains models

def models(X_train,y_train):

#Using Logistic Regression

from sklearn.linear_model import LogisticRegression

log = LogisticRegression(random_state = 0)

log.fit(X_train, y_train)

#Using SVC linear

from sklearn.svm import SVC

svc_lin = SVC(kernel = 'linear', random_state = 0)

svc_lin.fit(X_train, y_train)

#Using SVC rbf

from sklearn.svm import SVC

svc_rbf = SVC(kernel = 'rbf', random_state = 0)

svc_rbf.fit(X_train, y_train)

#Using DecisionTreeClassifier

from sklearn.tree import DecisionTreeClassifier

https://medium.com/swlh/breast-cancer-classification-using-python-e83719e5f97d 15/26
18/06/2021 Breast Cancer Classification Using Python | by Mugdha Paithankar | The Startup | Medium

tree = DecisionTreeClassifier(criterion =
'entropy', random_state = 0)

tree.fit(X_train, y_train)

#Using Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators = 10,

criterion = 'entropy', random_state = 0)

forest.fit(X_train, y_train)

#print model accuracy on the training data.

print('[0]Logistic Regression Training Accuracy:',

log.score(X_train, y_train))

print('[1]Support Vector Machine (Linear

Classifier) Training Accuracy:', svc_lin.score(X_train,
y_train))

print('[2]Support Vector Machine (RBF Classifier)

Training Accuracy:', svc_rbf.score(X_train, y_train))

print('[3]Decision Tree Classifier Training

Accuracy:', tree.score(X_train, y_train))

print('[4]Random Forest Classifier Training

Accuracy:', forest.score(X_train, y_train))

return log, svc_lin, svc_rbf, tree, forest

#get the training results

model = models(X_train,y_train)

[0]Logistic Regression Training Accuracy:

0.9794721407624634

[1]Support Vector Machine (Linear Classifier) Training

Accuracy: 0.9794721407624634

[2]Support Vector Machine (RBF Classifier) Training

Accuracy: 0.9824046920821115

[3]Decision Tree Classifier Training Accuracy: 1.0

[4]Random Forest Classifier Training Accuracy:

0.9912023460410557

Confusion matrix

from sklearn.metrics import confusion_matrix

for i in range(len(model)):

https://medium.com/swlh/breast-cancer-classification-using-python-e83719e5f97d 16/26
18/06/2021 Breast Cancer Classification Using Python | by Mugdha Paithankar | The Startup | Medium

cm = confusion_matrix(y_test,
model[i].predict(X_test))

TN = cm[0][0]

TP = cm[1][1]

FN = cm[1][0]

FP = cm[0][1]

print(cm)

print(‘Model[{}] Testing Accuracy = “{}”’.format(i,

(TP + TN) / (TP + TN + FN + FP)))

print()# Print a new line

[[142 1]

[ 2 83]]

Model[0] Testing Accuracy = "0.9868421052631579"

[[141 2]

[ 4 81]]

Model[1] Testing Accuracy = "0.9736842105263158"

[[141 2]

[ 3 82]]

Model[2] Testing Accuracy = "0.9780701754385965"

[[129 14]

[ 5 80]]

Model[3] Testing Accuracy = "0.9166666666666666"

[[139 4]

[ 6 79]]

Model[4] Testing Accuracy = "0.956140350877193"

Classification Report

from sklearn.metrics import classification_report

from sklearn.metrics import accuracy_score

for i in range(len(model)):

print(‘Model ‘,i)

#Check precision, recall

recall, f1-score

print(classification_report(y_test,

https://medium.com/swlh/breast-cancer-classification-using-python-e83719e5f97d 17/26
18/06/2021 Breast Cancer Classification Using Python | by Mugdha Paithankar | The Startup | Medium

model[i].predict(X_test)))

#Another way to get the models accuracy on the test

data

print(accuracy_score(y_test,
model[i].predict(X_test)))

print()#Print a new line

Model 0

precision recall f1-score support

0 0.99 0.99 0.99 143

1 0.99 0.98 0.98 85

accuracy 0.99 228

macro avg 0.99 0.98 0.99 228

weighted avg 0.99 0.99 0.99 228

0.9868421052631579

Model 1

precision recall f1-score support

0 0.97 0.99 0.98 143

1 0.98 0.95 0.96 85

accuracy 0.97 228

macro avg 0.97 0.97 0.97 228

weighted avg 0.97 0.97 0.97 228

0.9736842105263158

Model 2

precision recall f1-score support

0 0.98 0.99 0.98 143

1 0.98 0.96 0.97 85

accuracy 0.98 228

macro avg 0.98 0.98 0.98 228

weighted avg 0.98 0.98 0.98 228

0.9780701754385965

Model 3

precision recall f1-score support

0 0.96 0.90 0.93 143

https://medium.com/swlh/breast-cancer-classification-using-python-e83719e5f97d 18/26
18/06/2021 Breast Cancer Classification Using Python | by Mugdha Paithankar | The Startup | Medium

1 0.85 0.94 0.89 85

accuracy 0.92 228

macro avg 0.91 0.92 0.91 228

weighted avg 0.92 0.92 0.92 228

0.9166666666666666

Model 4

precision recall f1-score support

0 0.96 0.97 0.97 143

1 0.95 0.93 0.94 85

accuracy 0.96 228

macro avg 0.96 0.95 0.95 228

weighted avg 0.96 0.96 0.96 228

0.956140350877193

Hyper parameter tuning

Hyperparameters are crucial as they control the overall behavior of a

machine learning model.

In the context of cancer classification, my goal was to minimize the

misclassifications for the positive class (ie when the tumor is malignant
‘M’). But misclassifications include False Positives (FP) and False Negatives
(FN). I was focused more on reducing the FN because tumors which are
malignant should never be classified as benign even if this means the
model might classify a few benign tumors as malignant! Therefore I used
the sklearn’s fbeta_score as the scoring function with GridSearchCV. A
beta > 1 makes fbeta_score favor recall over precision.

from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import GridSearchCV

https://medium.com/swlh/breast-cancer-classification-using-python-e83719e5f97d 19/26
18/06/2021 Breast Cancer Classification Using Python | by Mugdha Paithankar | The Startup | Medium

#make the scoring function with a beta = 2

from sklearn.metrics import fbeta_score, make_scorer

ftwo_scorer = make_scorer(fbeta_score, beta=2)

# Create logistic regression

logistic = LogisticRegression()

# Create regularization penalty space

penalty = [‘l1’, ‘l2’]

# Create regularization hyperparameter space

C = np.arange(0, 1, 0.001)

# Create hyperparameter options

hyperparameters = dict(C=C, penalty=penalty)

# Create grid search using 5-fold cross validation

clf = GridSearchCV(logistic, hyperparameters, cv=5,

scoring=ftwo_scorer, verbose=0)

# Fit grid search

best_model = clf.fit(X_train, y_train)

# View best hyperparameters

print('Best Penalty:',
best_model.best_estimator_.get_params()['penalty'])

print('Best C:',
best_model.best_estimator_.get_params()['C'])

Best Penalty: l2

Best C: 0.591

predictions = best_model.predict(X_test)

print("Accuracy score %f" % accuracy_score(y_test,

predictions))

print(classification_report(y_test, predictions))

print(confusion_matrix(y_test, predictions))

Accuracy score 0.986742

precision recall f1-score support

0 0.99 0.99 0.99 143

1 0.99 0.98 0.98 85

accuracy 0.99 228

macro avg 0.99 0.98 0.99 228

https://medium.com/swlh/breast-cancer-classification-using-python-e83719e5f97d 20/26
18/06/2021 Breast Cancer Classification Using Python | by Mugdha Paithankar | The Startup | Medium

weighted avg 0.99 0.99 0.99 228

[[142 1]

[ 2 83]]

After grid searching the accuracy improved a little but the FNs are still 2.

Grid searching was done on SVC and Random Forest models too but the
recall was best for logistic regression which is why I am discussing logistic
regression in this post.

Custom Threshold to increase recall

The default threshold for interpreting probabilities to class labels is 0.5, and
tuning this hyperparameter is called threshold moving.

y_scores = best_model.predict_proba(X_test)[:, 1]

from sklearn.metrics import precision_recall_curve

recall

p, r, thresholds = precision_recall_curve(y_test,
recall
y_scores)

def adjusted_classes(y_scores, t):

#This function adjusts class predictions based on the

prediction threshold (t).Works only for binary
classification problems.

return [1 if y >= t else 0 for y in y_scores]

def precision_recall_threshold(p,
recall r, thresholds,
t=0.5):

#plots the precision recall curve and shows the

current value for each by identifying the
classifier's threshold (t).

# generate new class predictions based on the

adjusted classes

function above and view the resulting confusion

https://medium.com/swlh/breast-cancer-classification-using-python-e83719e5f97d 21/26
18/06/2021 Breast Cancer Classification Using Python | by Mugdha Paithankar | The Startup | Medium

matrix.

y_pred_adj = adjusted_classes(y_scores, t)

print(pd.DataFrame(confusion_matrix(y_test,
y_pred_adj),

columns=['pred_neg',
'pred_pos'],

index=['neg', 'pos']))

print(classification_report(y_test, y_pred_adj))

precision_recall_threshold(p,
recall r, thresholds, 0.42)

pred_neg pred_pos

neg 141 2

pos 1 84

precision recall f1-score support

0 0.99 0.99 0.99 143

1 0.98 0.99 0.98 85

accuracy 0.99 228

macro avg 0.98 0.99 0.99 228

weighted avg 0.99 0.99 0.99 228

Finally the FNs reduced to 1, after manually setting a decision threshold of

0.42!

Graph of recall and precision VS threshold

def plot_precision_recall_vs_threshold(precisions,
recall
recalls, thresholds):

recall

plt.figure(figsize=(8, 8))

plt.title(“Recall
Recall Scores as a function of the decision
threshold”)

plt.plot(thresholds, precisions[:-1], “b — “,
label=”Precision”)

plt.plot(thresholds, recalls[:-1],
recall “g-”,
label=”Recall”)
Recall

plt.axvline(x=.42, color=’black’)

plt.text(.39,.50,’Optimal Threshold for best

https://medium.com/swlh/breast-cancer-classification-using-python-e83719e5f97d 22/26
18/06/2021 Breast Cancer Classification Using Python | by Mugdha Paithankar | The Startup | Medium

Recall’,rotation=90)

Recall
plt.ylabel(“Recall
Recall Score”)

plt.xlabel(“Decision Threshold”)

plt.legend(loc=’best’)

# use the same p, r, thresholds that were previously

calculated

plot_precision_recall_vs_threshold(p,
recall r, thresholds)

Graph of recall and precision scores VS thresholds

The line for optimal decision threshold indicates the point of maximum
recall which could be achieved without compromising a lot on precision.
After that point the precision starts to drop more.

https://medium.com/swlh/breast-cancer-classification-using-python-e83719e5f97d 23/26
18/06/2021 Breast Cancer Classification Using Python | by Mugdha Paithankar | The Startup | Medium

from sklearn import metrics

from sklearn.metrics import roc_curve

# Compute predicted probabilities: y_pred_prob

y_pred_prob = best_model.predict_proba(X_test)[:,1]

# Generate ROC curve values: fpr, tpr, thresholds

fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

print(metrics.auc(fpr, tpr))

# Plot ROC curve

plt.plot([0, 1], [0, 1], ‘k — ‘)

plt.plot(fpr, tpr)

plt.xlabel(‘False Positive Rate’)

plt.ylabel(‘True Positive Rate’)

plt.title(‘ROC Curve for Logistic Regression’)

plt.show()

AUC score is 0.9979432332373509

ROC Curve for Logistic Regression model

The AUC score for this model is 0.9979.

AUC score tells us how good our model is at distinguishing between classes,
in this case, predicting benign tumors as benign and malignant tumors as
malignant.

https://medium.com/swlh/breast-cancer-classification-using-python-e83719e5f97d 24/26
18/06/2021 Breast Cancer Classification Using Python | by Mugdha Paithankar | The Startup | Medium

The ROC curve is plotted with TPR against the FPR where TPR is on y-axis
and FPR is on the x-axis. ROC curve looks almost ideal.

When the TPR and FPR don’t overlap at all, it means model has an ideal
measure of separability ie it is able to correctly classify positives as positives
and negatives as negatives.

To conclude this post, I have discussed a few EDA, statistical analysis and
machine learning techniques as applied to breast cancer classification
dataset. Complete code of this project can be found on Github.

The breast cancer classification dataset is good to get started with making a
complete Data Science project before you move on to more advanced
datasets and techniques.

Hope you guys found this post helpful and learnt something new too!
Follow Mugdha Paithankar for more stories. Please clap this article if you
like it!

Sign up for Top 10 Stories

By The Startup

Get smarter at building your thing. Subscribe to receive The Startup's top 10 most
read stories — delivered straight into your inbox, once a week. Take a look.

Emails will be sent to [email protected].

Get this newsletter Not you?

https://medium.com/swlh/breast-cancer-classification-using-python-e83719e5f97d 25/26
18/06/2021 Breast Cancer Classification Using Python | by Mugdha Paithankar | The Startup | Medium

Exploratory Data Analysis Machine Learning Classification Algorithms Recall

Hyperparameter Tuning

About Help Legal

Get the Medium app

https://medium.com/swlh/breast-cancer-classification-using-python-e83719e5f97d 26/26

Generalised Linear Models and Bayesian Statistics
No ratings yet
Generalised Linear Models and Bayesian Statistics
35 pages
Building A Simple Machine Learning Model On Breast Cancer Data
No ratings yet
Building A Simple Machine Learning Model On Breast Cancer Data
12 pages
Breast Cancer Classification
No ratings yet
Breast Cancer Classification
18 pages
IJERT Developing A Web Based System For
No ratings yet
IJERT Developing A Web Based System For
5 pages
Breast Cancer Detection Using SVM Classifier With Grid Search Technique
No ratings yet
Breast Cancer Detection Using SVM Classifier With Grid Search Technique
6 pages
Breast Cancer Detection and Prediction: Created by
No ratings yet
Breast Cancer Detection and Prediction: Created by
20 pages
Breast Cancer Detection Algo Comparison
No ratings yet
Breast Cancer Detection Algo Comparison
15 pages
Breast Cancer Detection
No ratings yet
Breast Cancer Detection
15 pages
Breast Cancer Research Paper Dip
No ratings yet
Breast Cancer Research Paper Dip
12 pages
Expert System With Application
No ratings yet
Expert System With Application
12 pages
1-s2.0-S1959031820301858-main
No ratings yet
1-s2.0-S1959031820301858-main
13 pages
2023 Ieeee
No ratings yet
2023 Ieeee
6 pages
Predicting Breast Cancer Using Logistic Regression - by Mo Kaiser - The Startup - Medium
No ratings yet
Predicting Breast Cancer Using Logistic Regression - by Mo Kaiser - The Startup - Medium
15 pages
Breast Cancer Detection With Machine Learning
No ratings yet
Breast Cancer Detection With Machine Learning
7 pages
BreakHis Data Paper
No ratings yet
BreakHis Data Paper
16 pages
Effective Feature Selection Using Multi-Objective Improved Ant Colony Optimization For Breast Cancer Classification
No ratings yet
Effective Feature Selection Using Multi-Objective Improved Ant Colony Optimization For Breast Cancer Classification
10 pages
Grdjev06i010003 PDF
No ratings yet
Grdjev06i010003 PDF
4 pages
Tài liệu không có tiêu đề-2
No ratings yet
Tài liệu không có tiêu đề-2
19 pages
IRJMETS51200105224
No ratings yet
IRJMETS51200105224
5 pages
02 Journal Ovarium
No ratings yet
02 Journal Ovarium
9 pages
Breast Cancer Prediction Using Machine Learning
No ratings yet
Breast Cancer Prediction Using Machine Learning
11 pages
Project Final
No ratings yet
Project Final
15 pages
Brain Tumor Detection
0% (1)
Brain Tumor Detection
16 pages
BREAST CANCER VIJAY & ARAVIND PROJECT 2024-06-28 RECREATE
No ratings yet
BREAST CANCER VIJAY & ARAVIND PROJECT 2024-06-28 RECREATE
14 pages
AReviewandComputationalAnalysisofBreastCancerUsing(Autosaved)
No ratings yet
AReviewandComputationalAnalysisofBreastCancerUsing(Autosaved)
8 pages
Final Breast Cancer
100% (1)
Final Breast Cancer
23 pages
Biomedicines 11 01536 v2 Done
No ratings yet
Biomedicines 11 01536 v2 Done
12 pages
(IJCST-V11I3P3) :DR M Narendra, A Nandini, T Kamal Raj, V Sai Sowmya, CH Brahma Reddy
No ratings yet
(IJCST-V11I3P3) :DR M Narendra, A Nandini, T Kamal Raj, V Sai Sowmya, CH Brahma Reddy
3 pages
Breast Cancer Fully Completed Paper (2) - Abinav Batch
No ratings yet
Breast Cancer Fully Completed Paper (2) - Abinav Batch
16 pages
A Novel Breast Tumor Classification Algorithm Using Neutrosophic Score Features
No ratings yet
A Novel Breast Tumor Classification Algorithm Using Neutrosophic Score Features
11 pages
Breast Invasive Ductal Carcinoma Diagnosis Using Machine Learning Models and Gabor Filter Method of Histology Images
No ratings yet
Breast Invasive Ductal Carcinoma Diagnosis Using Machine Learning Models and Gabor Filter Method of Histology Images
10 pages
Breast Cancer
No ratings yet
Breast Cancer
12 pages
Machine Learning based Intelligent System for Breast Cancer Prediction (MLISBCP)
No ratings yet
Machine Learning based Intelligent System for Breast Cancer Prediction (MLISBCP)
13 pages
Polynomial Regression
No ratings yet
Polynomial Regression
5 pages
Brain Tumor Segmentation Thesis
100% (3)
Brain Tumor Segmentation Thesis
7 pages
Cancer Detection Using Biclustering
No ratings yet
Cancer Detection Using Biclustering
5 pages
(IJCST-V7I4P8) : Nitasha
No ratings yet
(IJCST-V7I4P8) : Nitasha
4 pages
Kavita Dip Report
No ratings yet
Kavita Dip Report
12 pages
Vikas Venkat Sigatapu - ECE
No ratings yet
Vikas Venkat Sigatapu - ECE
8 pages
An Adaptive K-Means Clustering Algorithm For Breast Image Segmentation
No ratings yet
An Adaptive K-Means Clustering Algorithm For Breast Image Segmentation
4 pages
IEEE PROJECT PHASE 2 (1)
No ratings yet
IEEE PROJECT PHASE 2 (1)
8 pages
Thesis On Mammogram Classification
100% (3)
Thesis On Mammogram Classification
4 pages
PMNet A Probability Map Based Scaled Network - 2021 - Computerized Medical Imag
No ratings yet
PMNet A Probability Map Based Scaled Network - 2021 - Computerized Medical Imag
7 pages
Breast Cancer Prediction System Breast Cancer Prediction System
No ratings yet
Breast Cancer Prediction System Breast Cancer Prediction System
6 pages
Etasr 5115
No ratings yet
Etasr 5115
7 pages
Breast Cancer
No ratings yet
Breast Cancer
3 pages
Linear Discriminant Analysis and Support Vector Machines For Classifying Breast Cancer
No ratings yet
Linear Discriminant Analysis and Support Vector Machines For Classifying Breast Cancer
4 pages
Project Title and Abstract
No ratings yet
Project Title and Abstract
17 pages
Breast Cancer Classification and Prediction Using Machine Learning IJERTV9IS020280
No ratings yet
Breast Cancer Classification and Prediction Using Machine Learning IJERTV9IS020280
5 pages
Ankita Patra
No ratings yet
Ankita Patra
17 pages
An Enhancement of Mammogram Images For Breast Cancer Classification Using Artificial Neural Networks
No ratings yet
An Enhancement of Mammogram Images For Breast Cancer Classification Using Artificial Neural Networks
14 pages
Logistic Regression For Malignancy Prediction in Cancer - by Luca Zammataro - Towards Data Science
No ratings yet
Logistic Regression For Malignancy Prediction in Cancer - by Luca Zammataro - Towards Data Science
32 pages
IDS Project Group 11
No ratings yet
IDS Project Group 11
35 pages
Machine Learning For Breast Cancer Diagnosis A Proof of Concept
No ratings yet
Machine Learning For Breast Cancer Diagnosis A Proof of Concept
27 pages
A Computer-Aided Detection of The Architectural Distortion in Digital Mammograms Using The Fractal Dimension Measurements of BEMD PDF
No ratings yet
A Computer-Aided Detection of The Architectural Distortion in Digital Mammograms Using The Fractal Dimension Measurements of BEMD PDF
17 pages
Comparison of Decision Tree Methods For Breast Cancer Diagnosis
No ratings yet
Comparison of Decision Tree Methods For Breast Cancer Diagnosis
7 pages
Deep Learning Applied For Histological Diagnosis of Breast Cancer
No ratings yet
Deep Learning Applied For Histological Diagnosis of Breast Cancer
17 pages
Mla - 2 (Cia - 1) - 20221013
No ratings yet
Mla - 2 (Cia - 1) - 20221013
14 pages
Jurnal Q4
No ratings yet
Jurnal Q4
5 pages
01 Journal
No ratings yet
01 Journal
9 pages
Pedestrian Detection: Please, suggest a subtitle for a book with title 'Pedestrian Detection' within the realm of 'Computer Vision'. The suggested subtitle should not have ':'.
From Everand
Pedestrian Detection: Please, suggest a subtitle for a book with title 'Pedestrian Detection' within the realm of 'Computer Vision'. The suggested subtitle should not have ':'.
Fouad Sabry
No ratings yet
CV-En Davidson de Oliveira Lima 2024 v2
No ratings yet
CV-En Davidson de Oliveira Lima 2024 v2
4 pages
Stromeyer - Eidetic Memory
No ratings yet
Stromeyer - Eidetic Memory
4 pages
Flash and Decanter Aspen Plus
No ratings yet
Flash and Decanter Aspen Plus
11 pages
Combinations Stranded II
No ratings yet
Combinations Stranded II
2 pages
Forecasting in Operation Management
No ratings yet
Forecasting in Operation Management
54 pages
Phillips Perron
100% (1)
Phillips Perron
4 pages
Scikit Learn
No ratings yet
Scikit Learn
1 page
Lesson 5 Measures of Dispersion (Rhea)
No ratings yet
Lesson 5 Measures of Dispersion (Rhea)
25 pages
Quaid I Azam University Qau Thesis Template
No ratings yet
Quaid I Azam University Qau Thesis Template
20 pages
Mindmap Business Intelligence (BI)
No ratings yet
Mindmap Business Intelligence (BI)
2 pages
010 - STAT - SAMPLING Distribution of The Sample Mean
100% (1)
010 - STAT - SAMPLING Distribution of The Sample Mean
9 pages
L3 Numerical Summary Measures
No ratings yet
L3 Numerical Summary Measures
44 pages
Statistical Data Analysis - 2 - Step by Step Guide To SPSS & MINITAB - Nodrm
No ratings yet
Statistical Data Analysis - 2 - Step by Step Guide To SPSS & MINITAB - Nodrm
83 pages
Structural Safety: Irmela Zentner
No ratings yet
Structural Safety: Irmela Zentner
8 pages
Question Set of Statistics P7 Deb Sir
No ratings yet
Question Set of Statistics P7 Deb Sir
3 pages
Sampling and Estimation
No ratings yet
Sampling and Estimation
12 pages
Correlation-Exam Questions With Answers
100% (3)
Correlation-Exam Questions With Answers
7 pages
R Notes For Data Analysis and Statistical Inference
No ratings yet
R Notes For Data Analysis and Statistical Inference
10 pages
Multiple Regression Further Issues
No ratings yet
Multiple Regression Further Issues
40 pages
Chapter 3: Survival Distributions and Life Tables: Flea:)
No ratings yet
Chapter 3: Survival Distributions and Life Tables: Flea:)
1 page
Data Mining Models and Evaluation Techniques
No ratings yet
Data Mining Models and Evaluation Techniques
59 pages
Introduction To Econometrics - Stock & Watson - CH 13 Slides
No ratings yet
Introduction To Econometrics - Stock & Watson - CH 13 Slides
38 pages
ML Q
No ratings yet
ML Q
40 pages
Journal of International Financial Markets, Institutions & Money
No ratings yet
Journal of International Financial Markets, Institutions & Money
27 pages
Operation Mangement Chapter 3
100% (1)
Operation Mangement Chapter 3
105 pages
P Syllabus PDF
No ratings yet
P Syllabus PDF
5 pages
Chapter 5) Spreadsheet Engineering: 1) Design
No ratings yet
Chapter 5) Spreadsheet Engineering: 1) Design
4 pages
LGC Proficiency Testing Catalogue 201
No ratings yet
LGC Proficiency Testing Catalogue 201
1 page
DOE (Fractional Factorial Design-Revised)
No ratings yet
DOE (Fractional Factorial Design-Revised)
80 pages
STEP SPSS ANALYSIS COHEN KAPPA and ICC
No ratings yet
STEP SPSS ANALYSIS COHEN KAPPA and ICC
5 pages
Chapter 4 Two-Sample Tests
No ratings yet
Chapter 4 Two-Sample Tests
48 pages
Moderation Reporting-Results
No ratings yet
Moderation Reporting-Results
5 pages
TATA Steels Sales Forecast
No ratings yet
TATA Steels Sales Forecast
30 pages