Breast Cancer Classification Using Python
Mugdha Paithankar
Nov 8, 2020 · 13 min read
Breast cancer (BC) is one of the most common cancers among women in
the world today.
In this article I will go over all the steps needed to make a Data Science
project complete in itself and, with the use of machine learning algorithms,
ultimately build a model that accurately classifies tumors as benign or
malignant based on the tumor's shape and geometry.
Ten real-valued features are computed for each cell nucleus:
1. radius
2. texture
3. perimeter
4. area
5. smoothness
6. compactness
7. concavity
8. concave points
9. symmetry
10. fractal dimension
The mean, standard error and “worst” or largest (mean of the three largest
values) of these features were computed for each image, resulting in 30
features.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#make a dataframe
df = pd.read_csv('data.csv')
df.shape
df.columns
The dataset has 569 rows and 33 columns. There are two extra columns,
“id” and “Unnamed: 32”. We drop “Unnamed: 32”, which has all NaN
values.
#Drop the column with all missing values (na, NAN, NaN)
df = df.dropna(axis=1)
df['diagnosis'].value_counts()
sns.countplot(df['diagnosis'],label="Count")
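As a quick sanity check (not part of the original post), the 30 remaining feature columns can be grouped by their suffix to confirm the mean / standard error / worst structure described earlier:

# Group feature columns by suffix: 10 base features x {mean, se, worst} = 30 columns
feature_cols = [c for c in df.columns if c not in ('id', 'diagnosis')]
for suffix in ('_mean', '_se', '_worst'):
    group = [c for c in feature_cols if c.endswith(suffix)]
    print(suffix, len(group))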
Violin plots are like density plots; unlike bar graphs with means and
error bars, violin plots show all the data points, which makes them an excellent
tool for visualizing samples of small sizes.
y = df.diagnosis
drop_list = ['id', 'diagnosis']
X = df.drop(drop_list, axis=1)

# Standardize the features so they share a common scale for plotting
data_std = (X - X.mean()) / X.std()

# Violin plots for the first 10 (mean) features
data = pd.concat([y, data_std.iloc[:, 0:10]], axis=1)
data = pd.melt(data, id_vars="diagnosis",
               var_name="features",
               value_name="value")
plt.figure(figsize=(10, 10))
sns.violinplot(x="features", y="value",
               hue="diagnosis", data=data, split=True, inner="quart")
plt.xticks(rotation=90)
For almost all of the standard error features above, the medians of the
malignant and benign groups do not differ much, except for concave points_se
and concavity_se. smoothness_se and symmetry_se have very similar
distributions in the two groups, which could make classification using these
features difficult. The shape of the violin plot for area_se, however, looks
warped, and the distributions of the benign and malignant data points are very different!
area_worst looks well separated, so it might be easier to use this feature for
classification! Variance seems highest for fractal_dimension_worst.
concavity_worst and concave_points_worst seem to have a similar data
distribution.
#correlation map (upper triangle masked)
matrix = np.triu(X.corr())
# draw the heatmap (this call is not shown in the excerpt)
plt.figure(figsize=(15, 10))
sns.heatmap(X.corr(), annot=True, fmt='.1f', mask=matrix)
By now we have a rough idea that many of the features are highly
correlated with each other. But what about differences between the
benign and malignant groups for each feature? To understand whether
the data distributions of the malignant and benign groups differ, I
visualized some features via box plots and performed a t-test to check
for statistical significance.
Box plots succinctly compare multiple distributions and are a great way to
visualize the IQR.
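Since the plotting code for these box plots is not included in the excerpt, here is a minimal sketch of how one of them could be drawn with seaborn, using the df loaded earlier (the figure size and title are assumptions):

# Box plot of texture_mean split by diagnosis, one of the features discussed below
plt.figure(figsize=(6, 6))
sns.boxplot(x='diagnosis', y='texture_mean', data=df)
plt.title('texture_mean by diagnosis')
plt.show()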
Texture means for malignant and benign tumors vary by about 3 units.
The distribution looks similar for both the groups. Malignant tumors tend
to have a higher texture mean compared to benign.
Fractal dimension means are almost the same for malignant and benign
tumors. The IQR is wider for malignant tumors.
Malignant groups have a distinctly wider range of values for area se. The
distribution range is very narrow for benign groups. This might be a good
feature for classification.
Standard error (se) of concave points has a higher mean and IQR for
malignant tumors. The distribution looks somewhat similar for both tumor
types.
Malignant groups have a wider range of values for radius worst compared
to benign groups, and a wider IQR as well. Malignant tumors have a
higher radius worst than benign tumors.
The t-test tells you how significant the differences between groups are;
in other words, it lets you know if those differences (measured in means)
could have happened by chance.
from scipy import stats

new = pd.DataFrame(data=df[['area_worst', 'diagnosis']])
new = new.set_index('diagnosis')
stats.ttest_ind(new.loc['M'], new.loc['B'])
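The call above returns a t statistic and a p-value; a small follow-up (not shown in the original post) makes the interpretation explicit:

# area_worst values for malignant (M) vs benign (B) tumors
t_stat, p_value = stats.ttest_ind(new.loc['M', 'area_worst'],
                                  new.loc['B', 'area_worst'])
# A p-value below 0.05 suggests the difference in means is statistically significant
print(f"t = {t_stat:.2f}, p = {p_value:.3g}, significant at 0.05: {p_value < 0.05}")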
From the correlation matrix we saw earlier, it was clear that there are quite
a few features with very high correlations. So I dropped one feature from
each pair with a correlation greater than 0.95. ‘perimeter_mean’,
‘area_mean’, ‘perimeter_se’, ‘area_se’, ‘radius_worst’, ‘perimeter_worst’
and ‘area_worst’ were among the features that were dropped.
corr_matrix = X.corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Columns with any correlation above 0.95 in the upper triangle
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]

# Drop features
X = X.drop(to_drop, axis=1)
X.columns
from sklearn.preprocessing import LabelEncoder

# Encode the target: B -> 0, M -> 1
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
print(y)
40% of the data was reserved for testing purposes. The split was
stratified in order to preserve, in both the train and test sets, the same
proportion of the target classes as in the original dataset.
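The split itself is not shown in the excerpt; a sketch consistent with the description above would be:

from sklearn.model_selection import train_test_split

# 60/40 train/test split, stratified on the target to preserve class proportions
# (random_state is an assumed value, for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)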
sklearn’s RobustScaler was used to scale the features of the dataset. The
centering and scaling statistics of this scaler are based on percentiles and
are therefore not influenced by a small number of very large marginal
outliers.
#Feature Scaling
from sklearn.preprocessing import RobustScaler

sc = RobustScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

def models(X_train,y_train):
    #Using Logistic Regression
    log = LogisticRegression(random_state = 0)
    log.fit(X_train, y_train)
    #Using SVC with a linear kernel (definition reconstructed; only the fit call appears in the excerpt)
    svc_lin = SVC(kernel = 'linear', random_state = 0)
    svc_lin.fit(X_train, y_train)
    #Using SVC with an RBF kernel (definition reconstructed)
    svc_rbf = SVC(kernel = 'rbf', random_state = 0)
    svc_rbf.fit(X_train, y_train)
    #Using DecisionTreeClassifier
    tree = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
    tree.fit(X_train, y_train)
    #Using RandomForestClassifier (definition reconstructed)
    forest = RandomForestClassifier(criterion = 'entropy', random_state = 0)
    forest.fit(X_train, y_train)
    return log, svc_lin, svc_rbf, tree, forest

model = models(X_train,y_train)
Confusion matrix
#Model order: 0 = Logistic Regression, 1 = SVC (linear), 2 = SVC (rbf),
#3 = Decision Tree, 4 = Random Forest
for i in range(len(model)):
    cm = confusion_matrix(y_test, model[i].predict(X_test))
    TN = cm[0][0]
    TP = cm[1][1]
    FN = cm[1][0]
    FP = cm[0][1]
    print(cm)
[[142 1]
[ 2 83]]
[[141 2]
[ 4 81]]
[[141 2]
[ 3 82]]
[[129 14]
[ 5 80]]
[[139 4]
[ 6 79]]
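The loop unpacks TN, TP, FN and FP but does not use them further; as a small extension (not in the original post), those counts translate directly into sensitivity and specificity:

# TN, TP, FN, FP here hold the counts from the last model in the loop above
sensitivity = TP / (TP + FN)   # fraction of malignant tumors correctly detected
specificity = TN / (TN + FP)   # fraction of benign tumors correctly cleared
print(f"Sensitivity: {sensitivity:.3f}, Specificity: {specificity:.3f}")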
Classification Report
for i in range(len(model)):
    print('Model ', i)
    print(classification_report(y_test, model[i].predict(X_test)))
    print(accuracy_score(y_test, model[i].predict(X_test)))
Model 0
0.9868421052631579
Model 1
0.9736842105263158
Model 2
0.9780701754385965
Model 3
0.9166666666666666
Model 4
0.956140350877193
from sklearn.model_selection import GridSearchCV

logistic = LogisticRegression(solver = 'liblinear')  # solver assumed, so both l1 and l2 can be searched
C = np.arange(0.001, 1, 0.001)  # C must be strictly positive
# Grid search setup reconstructed; the excerpt shows only the grid and the results
best_model = GridSearchCV(logistic, dict(C = C, penalty = ['l1', 'l2']), cv = 5)
best_model.fit(X_train, y_train)

print('Best Penalty:',
      best_model.best_estimator_.get_params()['penalty'])
print('Best C:',
      best_model.best_estimator_.get_params()['C'])
Best Penalty: l2
Best C: 0.591
predictions = best_model.predict(X_test)
print(classification_report(y_test, predictions))
print(confusion_matrix(y_test, predictions))
[[142 1]
[ 2 83]]
After grid searching, the accuracy improved a little, but there are still 2
false negatives. Grid searching was also done on the SVC and Random Forest
models, but recall was best for logistic regression, which is why I am
discussing logistic regression in this post.
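For reference, the grid search for the random forest could be set up along these lines (a sketch; the parameter grid and scoring choice are assumptions, not the author's exact settings):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Example parameter grid; the post does not list the values that were actually searched
rf_grid = {'n_estimators': [50, 100, 200], 'max_depth': [None, 5, 10]}
rf_search = GridSearchCV(RandomForestClassifier(random_state=0),
                         rf_grid, cv=5, scoring='recall')
rf_search.fit(X_train, y_train)
print(rf_search.best_params_, rf_search.best_score_)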
The default threshold for converting predicted probabilities into class labels
is 0.5, and tuning this hyperparameter is called threshold moving.
y_scores = best_model.predict_proba(X_test)[:, 1]

p, r, thresholds = precision_recall_curve(y_test, y_scores)

def adjusted_classes(y_scores, t):
    # Helper used below (its definition is not shown in the excerpt):
    # assign class 1 when the predicted probability is at or above threshold t
    return [1 if y >= t else 0 for y in y_scores]

def precision_recall_threshold(p, r, thresholds, t=0.5):
    # Recompute the predictions at threshold t and print the confusion matrix
    y_pred_adj = adjusted_classes(y_scores, t)
    print(pd.DataFrame(confusion_matrix(y_test, y_pred_adj),
                       columns=['pred_neg', 'pred_pos'],
                       index=['neg', 'pos']))
    print(classification_report(y_test, y_pred_adj))

precision_recall_threshold(p, r, thresholds, 0.42)
pred_neg pred_pos
neg 141 2
pos 1 84
def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.figure(figsize=(8, 8))
    plt.title("Precision and Recall Scores as a function of the decision threshold")
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    # Mark the chosen decision threshold of 0.42
    plt.axvline(x=.42, color='black')
    plt.ylabel("Score")
    plt.xlabel("Decision Threshold")
    plt.legend(loc='best')

plot_precision_recall_vs_threshold(p, r, thresholds)
The vertical line marks the decision threshold that gives the maximum
recall achievable without compromising much on precision. Beyond that
point, precision starts to drop more sharply.
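One way to locate such a point programmatically (a sketch, not part of the original post) is to scan the precision and recall arrays returned by precision_recall_curve:

# Scan the curve for the highest recall whose precision stays above a chosen floor
min_precision = 0.95   # illustrative floor, not a value from the post
candidates = [(thr, rec) for prec, rec, thr in zip(p[:-1], r[:-1], thresholds)
              if prec >= min_precision]
best_threshold, best_recall = max(candidates, key=lambda item: item[1])
print(best_threshold, best_recall)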
from sklearn import metrics

y_pred_prob = best_model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob)
print(metrics.auc(fpr, tpr))
plt.plot(fpr, tpr)
plt.show()
The AUC score tells us how good our model is at distinguishing between
classes, in this case, predicting benign tumors as benign and malignant
tumors as malignant.
The ROC curve plots the TPR against the FPR, with TPR on the y-axis
and FPR on the x-axis. Our ROC curve looks almost ideal.
When the score distributions of the two classes don't overlap at all, the
model has an ideal measure of separability, i.e. it is able to correctly
classify positives as positives and negatives as negatives.
To conclude this post, I have discussed a few EDA, statistical analysis and
machine learning techniques as applied to a breast cancer classification
dataset. The complete code for this project can be found on GitHub.
The breast cancer classification dataset is good to get started with making a
complete Data Science project before you move on to more advanced
datasets and techniques.
Hope you guys found this post helpful and learnt something new too!
Follow Mugdha Paithankar for more stories. Please clap this article if you
like it!