DATA MINING PROJECT PAVITHRAA GOVINDARAJAN 24 OCT 2021 Jupyter Notebook PDF

This Jupyter Notebook works through two datasets. Part 1 reads a CSV of banking customer data (210 customers, 7 numeric variables), checks for null values and duplicates, reviews descriptive statistics, visualizes the variables with a pairplot, a correlation heatmap and univariate distribution/box plots, scales the data, and segments the customers with hierarchical clustering (Ward and average linkage) and K-Means (elbow and silhouette analysis). Part 2 reads an insurance dataset (3,000 records, 10 variables), explores and encodes it, splits it into train and test sets, and tunes CART, Random Forest and neural network classifiers with GridSearchCV, comparing them on accuracy, AUC, recall, precision, F1 score and ROC curves.



In [1]: import pandas as pd


import numpy as np
import seaborn as sns
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
%matplotlib inline

In [2]: df = pd.read_csv('bank_marketing_part1_Data-1.csv')

In [3]: df.head()

Out[3]:
spending advance_payments probability_of_full_payment current_balance credit_limit min_paym

0 19.94 16.92 0.8752 6.675 3.763

1 15.99 14.89 0.9064 5.363 3.582

2 18.95 16.42 0.8829 6.248 3.755

3 10.83 12.96 0.8099 5.278 2.641

4 17.99 15.86 0.8992 5.890 3.694

In [4]: df.shape

Out[4]: (210, 7)

In [5]: df.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 210 entries, 0 to 209

Data columns (total 7 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 spending 210 non-null float64

1 advance_payments 210 non-null float64

2 probability_of_full_payment 210 non-null float64

3 current_balance 210 non-null float64

4 credit_limit 210 non-null float64

5 min_payment_amt 210 non-null float64

6 max_spent_in_single_shopping 210 non-null float64

dtypes: float64(7)

memory usage: 11.6 KB


In [6]: df.describe().T

Out[6]:
count mean std min 25% 50% 75%

spending 210.0 14.847524 2.909699 10.5900 12.27000 14.35500 17.305000

advance_payments 210.0 14.559286 1.305959 12.4100 13.45000 14.32000 15.715000

probability_of_full_payment 210.0 0.870999 0.023629 0.8081 0.85690 0.87345 0.887775

current_balance 210.0 5.628533 0.443063 4.8990 5.26225 5.52350 5.979750

credit_limit 210.0 3.258605 0.377714 2.6300 2.94400 3.23700 3.561750

min_payment_amt 210.0 3.700201 1.503557 0.7651 2.56150 3.59900 4.768750

max_spent_in_single_shopping 210.0 5.408071 0.491480 4.5190 5.04500 5.22300 5.877000

In [7]: df.isna().sum()

Out[7]: spending 0

advance_payments 0

probability_of_full_payment 0

current_balance 0

credit_limit 0

min_payment_amt 0

max_spent_in_single_shopping 0

dtype: int64

In [8]: df.duplicated().sum()

Out[8]: 0


In [9]: sns.pairplot(df)

Out[9]: <seaborn.axisgrid.PairGrid at 0x185f5605df0>


In [10]: plt.figure(figsize=(7,6))
sns.heatmap(df.corr(),annot=True,fmt=".2f");


In [11]: plt.figure(figsize =(15,14))


sns.boxplot(data = df.describe())

Out[11]: <matplotlib.axes._subplots.AxesSubplot at 0x185f7379c70>


In [12]: fig, ax =plt.subplots(1,2)


fig.set_size_inches(15,6)
sns.distplot(df['spending'], kde=True, ax=ax[0])
sns.boxplot(df['spending'], ax=ax[1])
fig.show()


In [13]: fig, ax =plt.subplots(1,2)


fig.set_size_inches(15,6)
sns.distplot(df['advance_payments'], kde=True, ax=ax[0])
sns.boxplot(df['advance_payments'], ax=ax[1])
fig.show()



In [14]: fig, ax =plt.subplots(1,2)


fig.set_size_inches(15,6)
sns.distplot(df['probability_of_full_payment'], kde=True, ax=ax[0])
sns.boxplot(df['probability_of_full_payment'], ax=ax[1])
fig.show()


In [15]: fig, ax =plt.subplots(1,2)


fig.set_size_inches(15,6)
sns.distplot(df['current_balance'], kde=True, ax=ax[0])
sns.boxplot(df['current_balance'], ax=ax[1])
fig.show()



In [16]: fig, ax =plt.subplots(1,2)


fig.set_size_inches(15,6)
sns.distplot(df['credit_limit'], kde=True, ax=ax[0])
sns.boxplot(df['credit_limit'], ax=ax[1])
fig.show()


In [17]: fig, ax =plt.subplots(1,2)


fig.set_size_inches(15,6)
sns.distplot(df['min_payment_amt'], kde=True, ax=ax[0])
sns.boxplot(df['min_payment_amt'], ax=ax[1])
fig.show()



In [18]: fig, ax =plt.subplots(1,2)


fig.set_size_inches(15,6)
sns.distplot(df['max_spent_in_single_shopping'], kde=True, ax=ax[0])
sns.boxplot(df['max_spent_in_single_shopping'], ax=ax[1])
fig.show()


In [19]: from sklearn.preprocessing import StandardScaler


X = StandardScaler()
scaled_df = X.fit_transform(df)
scaled_df

Out[19]: array([[ 1.75435461, 1.81196782, 0.17822987, ..., 1.33857863,

-0.29880602, 2.3289982 ],

[ 0.39358228, 0.25383997, 1.501773 , ..., 0.85823561,

-0.24280501, -0.53858174],

[ 1.41330028, 1.42819249, 0.50487353, ..., 1.317348 ,

-0.22147129, 1.50910692],

...,

[-0.2816364 , -0.30647202, 0.36488339, ..., -0.15287318,

-1.3221578 , -0.83023461],

[ 0.43836719, 0.33827054, 1.23027698, ..., 0.60081421,

-0.95348449, 0.07123789],

[ 0.24889256, 0.45340314, -0.77624835, ..., -0.07325831,

-0.70681338, 0.96047321]])


In [20]: scaled_df = pd.DataFrame(scaled_df, index=df.index, columns=df.columns)


scaled_df.head()

Out[20]:
spending advance_payments probability_of_full_payment current_balance credit_limit min_pay

0 1.754355 1.811968 0.178230 2.367533 1.338579

1 0.393582 0.253840 1.501773 -0.600744 0.858236

2 1.413300 1.428192 0.504874 1.401485 1.317348

3 -1.384034 -1.227533 -2.591878 -0.793049 -1.639017

4 1.082581 0.998364 1.196340 0.591544 1.155464

In [21]: from scipy.cluster.hierarchy import dendrogram, linkage

In [22]: HClust = linkage(scaled_df, method = 'ward')

In [23]: dend = dendrogram(HClust)


In [24]: dend = dendrogram(HClust,


truncate_mode='lastp',
p = 15,
)

In [25]: from scipy.cluster.hierarchy import fcluster

In [26]: H_clusters = fcluster(HClust, 3, criterion='maxclust')


H_clusters

Out[26]: array([1, 3, 1, 2, 1, 2, 2, 3, 1, 2, 1, 3, 2, 1, 3, 2, 3, 2, 3, 2, 2, 2,

1, 2, 3, 1, 3, 2, 2, 2, 3, 2, 2, 3, 2, 2, 2, 2, 2, 1, 1, 3, 1, 1,

2, 2, 3, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 1, 3, 2, 2, 3, 3, 1,

1, 3, 1, 2, 3, 2, 1, 1, 2, 1, 3, 2, 1, 3, 3, 3, 3, 1, 2, 3, 3, 1,

1, 2, 3, 1, 3, 2, 2, 1, 1, 1, 2, 1, 2, 1, 3, 1, 3, 1, 1, 2, 2, 1,

3, 3, 1, 2, 2, 1, 3, 3, 2, 1, 3, 2, 2, 2, 3, 3, 1, 2, 3, 3, 2, 3,

3, 1, 2, 1, 1, 2, 1, 3, 3, 3, 2, 2, 3, 2, 1, 2, 3, 2, 3, 2, 3, 3,

3, 3, 3, 2, 3, 1, 1, 2, 1, 1, 1, 2, 1, 3, 3, 3, 3, 2, 3, 1, 1, 1,

3, 3, 1, 2, 3, 3, 3, 3, 1, 1, 3, 3, 3, 2, 3, 3, 2, 1, 3, 1, 1, 2,

1, 2, 3, 1, 3, 2, 1, 3, 1, 3, 1, 3], dtype=int32)

In [27]: df['H_clusters'] = H_clusters

In [28]: df.head()

Out[28]:
spending advance_payments probability_of_full_payment current_balance credit_limit min_paym

0 19.94 16.92 0.8752 6.675 3.763

1 15.99 14.89 0.9064 5.363 3.582

2 18.95 16.42 0.8829 6.248 3.755

3 10.83 12.96 0.8099 5.278 2.641

4 17.99 15.86 0.8992 5.890 3.694


In [29]: df.H_clusters.value_counts().sort_index()

Out[29]: 1 70

2 67

3 73

Name: H_clusters, dtype: int64

In [30]: agg_data = df.groupby('H_clusters').mean()


agg_data['Freq'] = df.H_clusters.value_counts().sort_index()
agg_data

Out[30]:
spending advance_payments probability_of_full_payment current_balance credit_limit

H_clusters

1 18.371429 16.145429 0.884400 6.158171 3.684629

2 11.872388 13.257015 0.848072 5.238940 2.848537

3 14.199041 14.233562 0.879190 5.478233 3.226452

In [31]: wardlink_1 = linkage(scaled_df, method = 'average')

In [32]: dend_1 = dendrogram(wardlink_1)


In [33]: dend_1 = dendrogram(wardlink_1,


truncate_mode='lastp',
p = 25,
)

In [34]: Avg_clusters = fcluster(wardlink_1, 3, criterion='maxclust')


Avg_clusters

Out[34]: array([1, 3, 1, 2, 1, 3, 2, 2, 1, 2, 1, 1, 2, 1, 3, 3, 3, 2, 2, 2, 2, 2,

1, 2, 3, 1, 3, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 1, 1, 3, 1, 1,

2, 2, 3, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 1, 3, 2, 2, 1, 3, 1,

1, 3, 1, 2, 3, 2, 1, 1, 2, 1, 3, 2, 1, 3, 3, 3, 3, 1, 2, 1, 1, 1,

1, 3, 3, 1, 3, 2, 2, 1, 1, 1, 2, 1, 3, 1, 3, 1, 3, 1, 1, 2, 3, 1,

1, 3, 1, 2, 2, 1, 3, 3, 2, 1, 3, 2, 2, 2, 3, 3, 1, 2, 3, 3, 2, 3,

3, 1, 2, 1, 1, 2, 1, 3, 3, 3, 2, 2, 2, 2, 1, 2, 3, 2, 3, 2, 3, 1,

3, 3, 2, 2, 3, 1, 1, 2, 1, 1, 1, 2, 1, 3, 3, 2, 3, 2, 3, 1, 1, 1,

3, 2, 3, 2, 3, 2, 3, 3, 1, 1, 3, 1, 3, 2, 3, 3, 2, 1, 3, 1, 1, 2,

1, 2, 3, 3, 3, 2, 1, 3, 1, 3, 3, 1], dtype=int32)

In [35]: df['Avg_clusters'] = Avg_clusters


In [36]: df.head()

Out[36]:
spending advance_payments probability_of_full_payment current_balance credit_limit min_paym

0 19.94 16.92 0.8752 6.675 3.763

1 15.99 14.89 0.9064 5.363 3.582

2 18.95 16.42 0.8829 6.248 3.755

3 10.83 12.96 0.8099 5.278 2.641

4 17.99 15.86 0.8992 5.890 3.694

In [37]: df.Avg_clusters.value_counts().sort_index()

Out[37]: 1 75

2 70

3 65

Name: Avg_clusters, dtype: int64

In [38]: agg_data_1 = df.groupby('Avg_clusters').mean()


agg_data_1['Freq'] = df.Avg_clusters.value_counts().sort_index()
agg_data_1

Out[38]:
spending advance_payments probability_of_full_payment current_balance credit_lim

Avg_clusters

1 18.129200 16.058000 0.881595 6.135747 3.64812

2 11.916857 13.291000 0.846766 5.258300 2.84600

3 14.217077 14.195846 0.884869 5.442000 3.25350


In [39]: from sklearn.cluster import KMeans

In [40]: K_Means = KMeans(n_clusters=3, random_state=1)

In [41]: K_Means.fit(scaled_df)

Out[41]: KMeans(n_clusters=3, random_state=1)

In [42]: labels_3 = K_Means.labels_


In [43]: K_Means.labels_

Out[43]: array([2, 0, 2, 1, 2, 1, 1, 0, 2, 1, 2, 0, 1, 2, 0, 1, 0, 1, 1, 1, 1, 1,

2, 1, 0, 2, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 2, 2, 0, 2, 2,

1, 1, 0, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 1, 1, 2, 0, 1, 1, 0, 0, 2,

2, 0, 2, 1, 0, 1, 2, 2, 1, 2, 0, 1, 2, 0, 0, 0, 0, 2, 1, 0, 2, 0,

2, 1, 0, 2, 0, 1, 1, 2, 2, 2, 1, 2, 0, 2, 0, 2, 0, 2, 2, 1, 1, 2,

0, 0, 2, 1, 1, 2, 0, 0, 1, 2, 0, 1, 1, 1, 0, 0, 2, 1, 0, 0, 1, 0,

0, 2, 1, 2, 2, 1, 2, 0, 0, 0, 1, 1, 0, 1, 2, 1, 0, 1, 0, 1, 0, 0,

1, 0, 0, 1, 0, 2, 2, 1, 2, 2, 2, 1, 0, 0, 0, 1, 0, 1, 0, 2, 2, 2,

0, 1, 0, 1, 0, 0, 0, 0, 2, 2, 1, 0, 0, 1, 1, 0, 1, 2, 0, 2, 2, 1,

2, 1, 0, 2, 0, 1, 2, 0, 2, 0, 0, 0])

In [44]: wss = []

In [45]: for i in range(1,11):


KM = KMeans(n_clusters=i,random_state=1)
KM.fit(scaled_df)
wss.append(KM.inertia_)

In [46]: wss

Out[46]: [1469.9999999999995,

659.1717544870411,

430.65897315130064,

371.38509060801107,

327.2127816566134,

289.315995389595,

262.98186570162267,

241.8189465608603,

223.91254221002728,

206.3961218478669]


In [47]: plt.plot(range(1,11),wss)
plt.grid()
plt.xlabel('No. of clusters')
plt.ylabel('WCSS')
plt.show()

In [48]: from sklearn.metrics import silhouette_samples, silhouette_score

In [49]: silhouette_score(scaled_df,labels_3,random_state=1)

Out[49]: 0.40072705527512986

In [50]: K_Means2 = KMeans(n_clusters=2, random_state=1)


K_Means2.fit(scaled_df)
labels_2 = K_Means2.labels_
silhouette_score(scaled_df,labels_2,random_state=1)

Out[50]: 0.46577247686580914

In [51]: K_Means4 = KMeans(n_clusters=4, random_state=1)


K_Means4.fit(scaled_df)
labels_4 = K_Means4.labels_
silhouette_score(scaled_df,labels_4,random_state=1)

Out[51]: 0.3276547677266192
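Note: cells In [49]–In [51] above compute the silhouette score for k = 3, 2 and 4 one cell at a time. A minimal sketch of the same comparison as a single loop, assuming scaled_df, KMeans and silhouette_score are in scope as imported above:

for k in range(2, 6):
    km = KMeans(n_clusters=k, random_state=1)       # same settings as the cells above
    labels = km.fit_predict(scaled_df)
    print(k, silhouette_score(scaled_df, labels))   # higher = better-separated clusters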


In [52]: df["Clus_kmeans"] = labels_3


df.head(5)

Out[52]:
spending advance_payments probability_of_full_payment current_balance credit_limit min_paym

0 19.94 16.92 0.8752 6.675 3.763

1 15.99 14.89 0.9064 5.363 3.582

2 18.95 16.42 0.8829 6.248 3.755

3 10.83 12.96 0.8099 5.278 2.641

4 17.99 15.86 0.8992 5.890 3.694

In [53]: sil_width = silhouette_samples(scaled_df,labels_3)

In [54]: df["sil_width"] = sil_width


df.head(5)

Out[54]:
spending advance_payments probability_of_full_payment current_balance credit_limit min_paym

0 19.94 16.92 0.8752 6.675 3.763

1 15.99 14.89 0.9064 5.363 3.582

2 18.95 16.42 0.8829 6.248 3.755

3 10.83 12.96 0.8099 5.278 2.641

4 17.99 15.86 0.8992 5.890 3.694

In [55]: df.Clus_kmeans.value_counts().sort_index()

Out[55]: 0 71

1 72

2 67

Name: Clus_kmeans, dtype: int64


In [56]: agg_data_2 = df.groupby('Clus_kmeans').mean()


agg_data_2['Freq'] = df.Clus_kmeans.value_counts().sort_index()
agg_data_2

Out[56]:
spending advance_payments probability_of_full_payment current_balance credit_lim

Clus_kmeans

0 14.437887 14.337746 0.881597 5.514577 3.25922

1 11.856944 13.247778 0.848253 5.231750 2.84954

2 18.495373 16.203433 0.884210 6.175687 3.69753

In [57]: from sklearn import tree


from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import roc_auc_score,roc_curve,classification_report,confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

In [58]: df1 = pd.read_csv('insurance_part2_data-2.csv')

In [59]: df1.head()

Out[59]:
   Age Agency_Code           Type Claimed  Commision Channel  Duration  Sales       Product Name Destination
0   48         C2B       Airlines      No       0.70  Online         7   2.51    Customised Plan        ASIA
1   36         EPX  Travel Agency      No       0.00  Online        34  20.00    Customised Plan        ASIA
2   39         CWT  Travel Agency      No       5.94  Online         3   9.90    Customised Plan    Americas
3   36         EPX  Travel Agency      No       0.00  Online         4  26.00  Cancellation Plan        ASIA
4   33         JZI       Airlines      No       6.30  Online        53  18.00        Bronze Plan        ASIA

In [60]: df1.shape

Out[60]: (3000, 10)


In [61]: df1.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 3000 entries, 0 to 2999

Data columns (total 10 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 Age 3000 non-null int64

1 Agency_Code 3000 non-null object

2 Type 3000 non-null object

3 Claimed 3000 non-null object

4 Commision 3000 non-null float64

5 Channel 3000 non-null object

6 Duration 3000 non-null int64

7 Sales 3000 non-null float64

8 Product Name 3000 non-null object

9 Destination 3000 non-null object

dtypes: float64(2), int64(2), object(6)

memory usage: 234.5+ KB

In [62]: df1.isna().sum()

Out[62]: Age 0

Agency_Code 0

Type 0

Claimed 0

Commision 0

Channel 0

Duration 0

Sales 0

Product Name 0

Destination 0

dtype: int64

In [63]: df1.describe().T

Out[63]:
count mean std min 25% 50% 75% max

Age 3000.0 38.091000 10.463518 8.0 32.0 36.00 42.000 84.00

Commision 3000.0 14.529203 25.481455 0.0 0.0 4.63 17.235 210.21

Duration 3000.0 70.001333 134.053313 -1.0 11.0 26.50 63.000 4580.00

Sales 3000.0 60.249913 70.733954 0.0 20.0 33.00 69.000 539.00
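Note: the summary above shows Duration has a minimum of -1, which cannot be a valid trip duration; the notebook does not treat this value separately. A minimal sketch of a quick check, assuming df1 as loaded above:

(df1['Duration'] < 0).sum()   # count of rows with an impossible negative Duration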


In [64]: df1.describe(include='all').T

Out[64]:
               count unique              top  freq       mean         std  min   25%   50%     75%
Age           3000.0    NaN              NaN   NaN     38.091   10.463518  8.0  32.0  36.0    42.0
Agency_Code     3000      4              EPX  1365        NaN         NaN  NaN   NaN   NaN     NaN
Type            3000      2    Travel Agency  1837        NaN         NaN  NaN   NaN   NaN     NaN
Claimed         3000      2               No  2076        NaN         NaN  NaN   NaN   NaN     NaN
Commision     3000.0    NaN              NaN   NaN  14.529203   25.481455  0.0   0.0  4.63  17.235
Channel         3000      2           Online  2954        NaN         NaN  NaN   NaN   NaN     NaN
Duration      3000.0    NaN              NaN   NaN  70.001333  134.053313 -1.0  11.0  26.5    63.0
Sales         3000.0    NaN              NaN   NaN  60.249913   70.733954  0.0  20.0  33.0    69.0
Product Name    3000      5  Customised Plan  1136        NaN         NaN  NaN   NaN   NaN     NaN
Destination     3000      3             ASIA  2465        NaN         NaN  NaN   NaN   NaN     NaN

In [65]: df1.duplicated().sum()

Out[65]: 139
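Note: 139 duplicate rows are reported here and kept for the rest of the analysis. If removing them were preferred, a minimal sketch (illustrative only, not the notebook's workflow):

df1_dedup = df1.drop_duplicates().reset_index(drop=True)   # hypothetical copy without the 139 duplicate rows
df1_dedup.shape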


In [66]: for column in df1[['Agency_Code','Type','Claimed','Channel','Product Name','Destination']]:


print(column.upper(),': ',df1[column].nunique())
print(df1[column].value_counts().sort_values())
print('\n')

AGENCY_CODE : 4

JZI 239

CWT 472

C2B 924

EPX 1365

Name: Agency_Code, dtype: int64

TYPE : 2

Airlines 1163

Travel Agency 1837

Name: Type, dtype: int64

CLAIMED : 2

Yes 924

No 2076

Name: Claimed, dtype: int64

CHANNEL : 2

Offline 46

Online 2954

Name: Channel, dtype: int64

PRODUCT NAME : 5

Gold Plan 109

Silver Plan 427

Bronze Plan 650

Cancellation Plan 678

Customised Plan 1136

Name: Product Name, dtype: int64

DESTINATION : 3

EUROPE 215

Americas 320

ASIA 2465

Name: Destination, dtype: int64


In [67]: plt.figure(figsize=(10,5))
df1[['Age','Commision','Duration','Sales']].boxplot()

Out[67]: <matplotlib.axes._subplots.AxesSubplot at 0x185fb0d3dc0>


In [68]: fig, ax =plt.subplots(1,2)


fig.set_size_inches(15,6)
sns.distplot(df1['Age'], kde=True, ax=ax[0])
sns.boxplot(df1['Age'], ax=ax[1])
fig.show()


In [69]: fig, ax =plt.subplots(1,2)


fig.set_size_inches(15,6)
sns.distplot(df1['Commision'], kde=True, ax=ax[0])
sns.boxplot(df1['Commision'], ax=ax[1])
fig.show()



In [70]: fig, ax =plt.subplots(1,2)


fig.set_size_inches(15,6)
sns.distplot(df1['Duration'], kde=True, ax=ax[0])
sns.boxplot(df1['Duration'], ax=ax[1])
fig.show()



In [71]: fig, ax =plt.subplots(1,2)


fig.set_size_inches(15,6)
sns.distplot(df1['Sales'], kde=True, ax=ax[0])
sns.boxplot(df1['Sales'], ax=ax[1])
fig.show()


In [72]: fig, ax =plt.subplots(1,2)


fig.set_size_inches(15,6)
sns.countplot(x ='Agency_Code', data = df1, ax=ax[0])
sns.boxplot(data = df1, x='Agency_Code',y='Sales', hue='Claimed', ax=ax[1])
fig.show()



In [73]: fig, ax =plt.subplots(1,2)


fig.set_size_inches(15,6)
sns.countplot(x ='Type', data = df1, ax=ax[0])
sns.boxplot(data = df1, x='Type',y='Sales', hue='Claimed', ax=ax[1])
fig.show()



In [74]: fig, ax =plt.subplots(1,2)


fig.set_size_inches(15,6)
sns.countplot(data = df1, x = 'Channel', ax=ax[0])
sns.boxplot(data = df1, x='Channel',y='Sales', hue='Claimed', ax=ax[1])
fig.show()


In [75]: fig, ax =plt.subplots(1,2)


fig.set_size_inches(15,6)
sns.countplot(data = df1, x = 'Product Name', ax=ax[0])
sns.boxplot(data = df1, x='Product Name',y='Sales', hue='Claimed', ax=ax[1])
fig.show()



In [76]: fig, ax =plt.subplots(1,2)


fig.set_size_inches(15,6)
sns.countplot(data = df1, x = 'Destination', ax=ax[0])
sns.boxplot(data = df1, x='Destination',y='Sales', hue='Claimed', ax=ax[1])
fig.show()



In [77]: sns.pairplot(df1)

Out[77]: <seaborn.axisgrid.PairGrid at 0x185f95045b0>


In [78]: plt.figure(figsize=(7,6))
sns.heatmap(df1.corr(),annot=True,fmt=".2f");


In [79]: for feature in df1.columns:


if df1[feature].dtype == 'object':
print('\n')
print('feature:',feature)
print(pd.Categorical(df1[feature].unique()))
print(pd.Categorical(df1[feature].unique()).codes)
df1[feature] = pd.Categorical(df1[feature]).codes

feature: Agency_Code

['C2B', 'EPX', 'CWT', 'JZI']

Categories (4, object): ['C2B', 'CWT', 'EPX', 'JZI']

[0 2 1 3]

feature: Type

['Airlines', 'Travel Agency']

Categories (2, object): ['Airlines', 'Travel Agency']

[0 1]

feature: Claimed

['No', 'Yes']

Categories (2, object): ['No', 'Yes']

[0 1]

feature: Channel

['Online', 'Offline']

Categories (2, object): ['Offline', 'Online']

[1 0]

feature: Product Name

['Customised Plan', 'Cancellation Plan', 'Bronze Plan', 'Silver Plan', 'Gold Plan']

Categories (5, object): ['Bronze Plan', 'Cancellation Plan', 'Customised Plan', 'Gold Plan', 'Silver Plan']

[2 1 0 4 3]

feature: Destination

['ASIA', 'Americas', 'EUROPE']

Categories (3, object): ['ASIA', 'Americas', 'EUROPE']

[0 1 2]
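Note: as the listings above show, pd.Categorical(...).codes numbers categories in sorted (alphabetical) order rather than order of appearance, e.g. Agency_Code maps to C2B→0, CWT→1, EPX→2, JZI→3. A minimal sketch of how the code-to-label mapping could be recovered:

cats = pd.Categorical(['C2B', 'EPX', 'CWT', 'JZI'])
dict(zip(cats.categories, range(len(cats.categories))))   # {'C2B': 0, 'CWT': 1, 'EPX': 2, 'JZI': 3}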


In [80]: df1.head()

Out[80]:
   Age  Agency_Code  Type  Claimed  Commision  Channel  Duration  Sales  Product Name  Destination
0   48            0     0        0       0.70        1         7   2.51             2            0
1   36            2     1        0       0.00        1        34  20.00             2            0
2   39            1     1        0       5.94        1         3   9.90             2            1
3   36            2     1        0       0.00        1         4  26.00             1            0
4   33            3     0        0       6.30        1        53  18.00             0            0

In [81]: df1.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 3000 entries, 0 to 2999

Data columns (total 10 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 Age 3000 non-null int64

1 Agency_Code 3000 non-null int8

2 Type 3000 non-null int8

3 Claimed 3000 non-null int8

4 Commision 3000 non-null float64

5 Channel 3000 non-null int8

6 Duration 3000 non-null int64

7 Sales 3000 non-null float64

8 Product Name 3000 non-null int8

9 Destination 3000 non-null int8

dtypes: float64(2), int64(2), int8(6)

memory usage: 111.5 KB

In [82]: df1.Claimed.value_counts()

Out[82]: 0 2076

1 924

Name: Claimed, dtype: int64


In [83]: X = df1.drop("Claimed", axis=1)


y = df1.pop("Claimed")
X.head()

Out[83]:
Age Agency_Code Type Commision Channel Duration Sales Product Name Destination

0 48 0 0 0.70 1 7 2.51 2 0

1 36 2 1 0.00 1 34 20.00 2 0

2 39 1 1 5.94 1 3 9.90 2 1

3 36 2 1 0.00 1 4 26.00 1 0

4 33 3 0 6.30 1 53 18.00 0 0

In [84]: y.head()

Out[84]: 0 0

1 0

2 0

3 0

4 0

Name: Claimed, dtype: int8

In [85]: from sklearn.model_selection import train_test_split



X_train, X_test, train_labels, test_labels = train_test_split(X, y, test_size=.30)

In [86]: print('X_train',X_train.shape)
print('X_test',X_test.shape)
print('train_labels',train_labels.shape)
print('test_labels',test_labels.shape)

X_train (2100, 9)

X_test (900, 9)

train_labels (2100,)

test_labels (900,)

In [87]: from sklearn.preprocessing import StandardScaler


std_scale = StandardScaler()
X_train = std_scale.fit_transform(X_train)
X_test = std_scale.transform(X_test)

In [88]: dt_model = DecisionTreeClassifier(criterion = 'gini')

In [89]: dt_model.fit(X_train, train_labels)

Out[89]: DecisionTreeClassifier()


In [90]: import numpy as np


import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples,silhouette_score
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
# Import stats from scipy
from scipy import stats

In [91]: param_grid_dtcl = {
'criterion': ['gini'],
'max_depth': [3, 5, 7, 10,12],
'min_samples_leaf': [20,30,40,50,60],
'min_samples_split': [150,300,450],
}

dtcl = DecisionTreeClassifier(random_state=1)

grid_search_dtcl = GridSearchCV(estimator = dtcl, param_grid = param_grid_dtcl, c

In [92]: grid_search_dtcl.fit(X_train, train_labels)


print(grid_search_dtcl.best_params_)
best_grid_dtcl = grid_search_dtcl.best_estimator_
best_grid_dtcl

{'criterion': 'gini', 'max_depth': 7, 'min_samples_leaf': 20, 'min_samples_split': 150}

Out[92]: DecisionTreeClassifier(max_depth=7, min_samples_leaf=20, min_samples_split=150,

random_state=1)

In [93]: param_grid_dtcl = {
'criterion': ['gini'],
'max_depth': [3.5,4.0,4.5, 5.0,5.5],
'min_samples_leaf': [40, 42, 44,46,48,50,52,54],
'min_samples_split': [250, 270, 280, 290, 300,310],
}

dtcl = DecisionTreeClassifier(random_state=1)

grid_search_dtcl = GridSearchCV(estimator = dtcl, param_grid = param_grid_dtcl, c


In [94]: grid_search_dtcl.fit(X_train, train_labels)


print(grid_search_dtcl.best_params_)
best_grid_dtcl = grid_search_dtcl.best_estimator_
best_grid_dtcl

{'criterion': 'gini', 'max_depth': 4.0, 'min_samples_leaf': 46, 'min_samples_split': 280}

Out[94]: DecisionTreeClassifier(max_depth=4.0, min_samples_leaf=46,

min_samples_split=280, random_state=1)

In [95]: ytrain_predict_dtcl = best_grid_dtcl.predict(X_train)


ytest_predict_dtcl = best_grid_dtcl.predict(X_test)

In [96]: ytest_predict_dtcl
ytest_predict_prob_dtcl=best_grid_dtcl.predict_proba(X_test)
ytest_predict_prob_dtcl
pd.DataFrame(ytest_predict_prob_dtcl).head()

Out[96]:
0 1

0 0.887805 0.112195

1 0.432432 0.567568

2 0.432432 0.567568

3 0.208163 0.791837

4 0.937143 0.062857


In [97]: # predict probabilities


probs_cart = best_grid_dtcl.predict_proba(X_train)
# keep probabilities for the positive outcome only
probs_cart = probs_cart[:, 1]
# calculate AUC
cart_train_auc = roc_auc_score(train_labels, probs_cart)
print('AUC: %.3f' % cart_train_auc)
# calculate roc curve
cart_train_fpr, cart_train_tpr, cart_train_thresholds = roc_curve(train_labels, probs_cart)
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(cart_train_fpr, cart_train_tpr)

AUC: 0.825

Out[97]: [<matplotlib.lines.Line2D at 0x185f93f8430>]


In [98]: # predict probabilities


probs_cart = best_grid_dtcl.predict_proba(X_test)
# keep probabilities for the positive outcome only
probs_cart = probs_cart[:, 1]
# calculate AUC
cart_test_auc = roc_auc_score(test_labels, probs_cart)
print('AUC: %.3f' % cart_test_auc)
# calculate roc curve
cart_test_fpr, cart_test_tpr, cart_test_thresholds = roc_curve(test_labels, probs_cart)
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(cart_test_fpr, cart_test_tpr)

AUC: 0.792

Out[98]: [<matplotlib.lines.Line2D at 0x185fb295670>]

In [99]: confusion_matrix(train_labels, ytrain_predict_dtcl)

Out[99]: array([[1289, 182],

[ 259, 370]], dtype=int64)

In [102]: param_grid_rfcl = {
'max_depth': [6],#20,30,40
'max_features': [4],## 7,8,9
'min_samples_leaf': [8],## 50,100
'min_samples_split': [45], ## 60,70
'n_estimators': [100] ## 100,200
}

rfcl = RandomForestClassifier(random_state=1)

grid_search_rfcl = GridSearchCV(estimator = rfcl, param_grid = param_grid_rfcl, c


In [103]: grid_search_rfcl.fit(X_train, train_labels)


print(grid_search_rfcl.best_params_)
best_grid_rfcl = grid_search_rfcl.best_estimator_
best_grid_rfcl

{'max_depth': 6, 'max_features': 4, 'min_samples_leaf': 8, 'min_samples_split': 45, 'n_estimators': 100}

Out[103]: RandomForestClassifier(max_depth=6, max_features=4, min_samples_leaf=8,

min_samples_split=45, random_state=1)

In [104]: rf_train_acc=best_grid_rfcl.score(X_train,train_labels)
rf_train_acc

Out[104]: 0.81

In [105]: cart_train_acc=best_grid_dtcl.score(X_train,train_labels)
cart_train_acc

Out[105]: 0.79

In [106]: print(classification_report(train_labels, ytrain_predict_dtcl))

precision recall f1-score support

0 0.83 0.88 0.85 1471

1 0.67 0.59 0.63 629

accuracy 0.79 2100

macro avg 0.75 0.73 0.74 2100

weighted avg 0.78 0.79 0.79 2100

In [107]: param_grid_rfcl = {
'max_depth': [6],#20,30,40
'max_features': [4],## 7,8,9
'min_samples_leaf': [8],## 50,100
'min_samples_split': [45], ## 60,70
'n_estimators': [100] ## 100,200
}

rfcl = RandomForestClassifier(random_state=1)

grid_search_rfcl = GridSearchCV(estimator = rfcl, param_grid = param_grid_rfcl, c


In [108]: grid_search_rfcl.fit(X_train, train_labels)


print(grid_search_rfcl.best_params_)
best_grid_rfcl = grid_search_rfcl.best_estimator_
best_grid_rfcl

{'max_depth': 6, 'max_features': 4, 'min_samples_leaf': 8, 'min_samples_split': 45, 'n_estimators': 100}

Out[108]: RandomForestClassifier(max_depth=6, max_features=4, min_samples_leaf=8,

min_samples_split=45, random_state=1)

In [109]: grid_search_rfcl.best_params_

Out[109]: {'max_depth': 6,

'max_features': 4,

'min_samples_leaf': 8,

'min_samples_split': 45,

'n_estimators': 100}

In [110]: ytrain_predict_rfcl = best_grid_rfcl.predict(X_train)


ytest_predict_rfcl = best_grid_rfcl.predict(X_test)

In [111]: ytest_predict_rfcl
ytest_predict_prob_rfcl=best_grid_rfcl.predict_proba(X_test)
ytest_predict_prob_rfcl
pd.DataFrame(ytest_predict_prob_rfcl).head()

Out[111]:
0 1

0 0.732980 0.267020

1 0.493807 0.506193

2 0.448772 0.551228

3 0.258665 0.741335

4 0.926248 0.073752


In [112]: rf_train_fpr, rf_train_tpr,_=roc_curve(train_labels,best_grid_rfcl.predict_proba(X_train)[:,1])


plt.plot(rf_train_fpr,rf_train_tpr,color='green')
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
rf_train_auc=roc_auc_score(train_labels,best_grid_rfcl.predict_proba(X_train)[:,1])
print('Area under Curve is', rf_train_auc)

Area under Curve is 0.8563623806955674


In [113]: rf_test_fpr, rf_test_tpr,_=roc_curve(test_labels,best_grid_rfcl.predict_proba(X_test)[:,1])


plt.plot(rf_test_fpr,rf_test_tpr,color='green')
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
rf_test_auc=roc_auc_score(test_labels,best_grid_rfcl.predict_proba(X_test)[:,1])
print('Area under Curve is', rf_test_auc)

Area under Curve is 0.8177083625157584

In [114]: rf_metrics=classification_report(train_labels, ytrain_predict_rfcl,output_dict=True)


df=pd.DataFrame(rf_metrics).transpose()
rf_train_precision=round(df.loc["1"][0],2)
rf_train_recall=round(df.loc["1"][1],2)
rf_train_f1=round(df.loc["1"][2],2)
print ('rf_train_precision ',rf_train_precision)
print ('rf_train_recall ',rf_train_recall)
print ('rf_train_f1 ',rf_train_f1)

rf_train_precision 0.72

rf_train_recall 0.6

rf_train_f1 0.65

In [115]: confusion_matrix(train_labels,ytrain_predict_rfcl)

Out[115]: array([[1326, 145],

[ 254, 375]], dtype=int64)


In [116]: print(classification_report(train_labels,ytrain_predict_rfcl))

precision recall f1-score support

0 0.84 0.90 0.87 1471

1 0.72 0.60 0.65 629

accuracy 0.81 2100

macro avg 0.78 0.75 0.76 2100

weighted avg 0.80 0.81 0.80 2100

In [117]: rf_metrics=classification_report(train_labels, ytrain_predict_rfcl,output_dict=True)


df=pd.DataFrame(rf_metrics).transpose()
rf_train_precision=round(df.loc["1"][0],2)
rf_train_recall=round(df.loc["1"][1],2)
rf_train_f1=round(df.loc["1"][2],2)
print ('rf_train_precision ',rf_train_precision)
print ('rf_train_recall ',rf_train_recall)
print ('rf_train_f1 ',rf_train_f1)

rf_train_precision 0.72

rf_train_recall 0.6

rf_train_f1 0.65

In [118]: confusion_matrix(test_labels,ytest_predict_rfcl)

Out[118]: array([[551, 54],

[150, 145]], dtype=int64)

In [119]: rf_test_acc=best_grid_rfcl.score(X_test,test_labels)
rf_test_acc

Out[119]: 0.7733333333333333

In [120]: print(classification_report(test_labels,ytest_predict_rfcl))

precision recall f1-score support

0 0.79 0.91 0.84 605

1 0.73 0.49 0.59 295

accuracy 0.77 900

macro avg 0.76 0.70 0.72 900

weighted avg 0.77 0.77 0.76 900


In [121]: rf_metrics=classification_report(test_labels, ytest_predict_rfcl,output_dict=True)


df=pd.DataFrame(rf_metrics).transpose()
rf_test_precision=round(df.loc["1"][0],2)
rf_test_recall=round(df.loc["1"][1],2)
rf_test_f1=round(df.loc["1"][2],2)
print ('rf_test_precision ',rf_test_precision)
print ('rf_test_recall ',rf_test_recall)
print ('rf_test_f1 ',rf_test_f1)

rf_test_precision 0.73

rf_test_recall 0.49

rf_test_f1 0.59

In [122]: param_grid_nncl = {
'hidden_layer_sizes': [50,100,200], # 50, 200
'max_iter': [2500,3000,4000], #5000,2500
'solver': ['adam'], #sgd
'tol': [0.01],
}

nncl = MLPClassifier(random_state=1)

grid_search_nncl = GridSearchCV(estimator = nncl, param_grid = param_grid_nncl, c

In [123]: grid_search_nncl.fit(X_train, train_labels)


grid_search_nncl.best_params_
best_grid_nncl = grid_search_nncl.best_estimator_
best_grid_nncl

Out[123]: MLPClassifier(hidden_layer_sizes=200, max_iter=2500, random_state=1, tol=0.01)

In [124]: grid_search_nncl.best_params_

Out[124]: {'hidden_layer_sizes': 200, 'max_iter': 2500, 'solver': 'adam', 'tol': 0.01}

In [125]: ytrain_predict_nncl = best_grid_nncl.predict(X_train)


ytest_predict_nncl = best_grid_nncl.predict(X_test)

In [126]: ytest_predict_nncl
ytest_predict_prob_nncl=best_grid_nncl.predict_proba(X_test)
ytest_predict_prob_nncl
pd.DataFrame(ytest_predict_prob_nncl).head()

Out[126]:
0 1

0 0.828364 0.171636

1 0.627123 0.372877

2 0.526596 0.473404

3 0.327278 0.672722

4 0.924043 0.075957


In [127]: confusion_matrix(train_labels, ytrain_predict_dtcl)

Out[127]: array([[1289, 182],

[ 259, 370]], dtype=int64)

In [128]: cart_train_acc=best_grid_dtcl.score(X_train,train_labels)
cart_train_acc

Out[128]: 0.79

In [129]: print(classification_report(train_labels, ytrain_predict_dtcl))

precision recall f1-score support

0 0.83 0.88 0.85 1471

1 0.67 0.59 0.63 629

accuracy 0.79 2100

macro avg 0.75 0.73 0.74 2100

weighted avg 0.78 0.79 0.79 2100

In [130]: nn_metrics=classification_report(train_labels, ytrain_predict_nncl,output_dict=True)


df=pd.DataFrame(nn_metrics).transpose()
nn_train_precision=round(df.loc["1"][0],2)
nn_train_recall=round(df.loc["1"][1],2)
nn_train_f1=round(df.loc["1"][2],2)
print ('nn_train_precision ',nn_train_precision)
print ('nn_train_recall ',nn_train_recall)
print ('nn_train_f1 ',nn_train_f1)

nn_train_precision 0.67

nn_train_recall 0.51

nn_train_f1 0.57

In [131]: nn_metrics=classification_report(test_labels, ytest_predict_nncl,output_dict=True)


df=pd.DataFrame(nn_metrics).transpose()
nn_test_precision=round(df.loc["1"][0],2)
nn_test_recall=round(df.loc["1"][1],2)
nn_test_f1=round(df.loc["1"][2],2)
print ('nn_test_precision ',nn_test_precision)
print ('nn_test_recall ',nn_test_recall)
print ('nn_test_f1 ',nn_test_f1)

nn_test_precision 0.72

nn_test_recall 0.43

nn_test_f1 0.54


In [132]: cart_metrics=classification_report(train_labels, ytrain_predict_dtcl,output_dict=True)


df=pd.DataFrame(cart_metrics).transpose()
cart_train_f1=round(df.loc["1"][2],2)
cart_train_recall=round(df.loc["1"][1],2)
cart_train_precision=round(df.loc["1"][0],2)
print ('cart_train_precision ',cart_train_precision)
print ('cart_train_recall ',cart_train_recall)
print ('cart_train_f1 ',cart_train_f1)

cart_train_precision 0.67

cart_train_recall 0.59

cart_train_f1 0.63

In [133]: nn_train_acc=best_grid_nncl.score(X_train,train_labels)
nn_train_acc

Out[133]: 0.7757142857142857

In [134]: confusion_matrix(test_labels, ytest_predict_dtcl)

Out[134]: array([[546, 59],

[150, 145]], dtype=int64)

In [135]: cart_test_acc=best_grid_dtcl.score(X_test,test_labels)
cart_test_acc

Out[135]: 0.7677777777777778

In [136]: nn_test_acc=best_grid_nncl.score(X_test,test_labels)
nn_test_acc

Out[136]: 0.76

In [137]: print(classification_report(test_labels, ytest_predict_dtcl))

precision recall f1-score support

0 0.78 0.90 0.84 605

1 0.71 0.49 0.58 295

accuracy 0.77 900

macro avg 0.75 0.70 0.71 900

weighted avg 0.76 0.77 0.75 900


In [138]: cart_metrics=classification_report(test_labels, ytest_predict_dtcl,output_dict=True)


df=pd.DataFrame(cart_metrics).transpose()
cart_test_precision=round(df.loc["1"][0],2)
cart_test_recall=round(df.loc["1"][1],2)
cart_test_f1=round(df.loc["1"][2],2)
print ('cart_test_precision ',cart_test_precision)
print ('cart_test_recall ',cart_test_recall)
print ('cart_test_f1 ',cart_test_f1)

cart_test_precision 0.71

cart_test_recall 0.49

cart_test_f1 0.58

In [139]: nn_train_fpr, nn_train_tpr,_=roc_curve(train_labels,best_grid_nncl.predict_proba(X_train)[:,1])


plt.plot(nn_train_fpr,nn_train_tpr,color='black')
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
nn_train_auc=roc_auc_score(train_labels,best_grid_nncl.predict_proba(X_train)[:,1])
print('Area under Curve is', nn_train_auc)

Area under Curve is 0.8182460262477858


In [140]: nn_test_fpr, nn_test_tpr,_=roc_curve(test_labels,best_grid_nncl.predict_proba(X_test)[:,1])


plt.plot(nn_test_fpr,nn_test_tpr,color='black')
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
nn_test_auc=roc_auc_score(test_labels,best_grid_nncl.predict_proba(X_test)[:,1])
print('Area under Curve is', nn_test_auc)

Area under Curve is 0.8038464770976328


In [141]: index=['Accuracy', 'AUC', 'Recall','Precision','F1 Score']

data = pd.DataFrame({'CART Train':[cart_train_acc,cart_train_auc,cart_train_recall,cart_train_precision,cart_train_f1],
          'CART Test':[cart_test_acc,cart_test_auc,cart_test_recall,cart_test_precision,cart_test_f1],
          'Random Forest Train':[rf_train_acc,rf_train_auc,rf_train_recall,rf_train_precision,rf_train_f1],
          'Random Forest Test':[rf_test_acc,rf_test_auc,rf_test_recall,rf_test_precision,rf_test_f1],
          'Neural Network Train':[nn_train_acc,nn_train_auc,nn_train_recall,nn_train_precision,nn_train_f1],
          'Neural Network Test':[nn_test_acc,nn_test_auc,nn_test_recall,nn_test_precision,nn_test_f1]}, index=index)
round(data,2)

Out[141]:
           CART Train  CART Test  Random Forest Train  Random Forest Test  Neural Network Train  Neural Network Test
Accuracy         0.79       0.77                 0.81                0.77                  0.78                 0.76
AUC              0.83       0.79                 0.86                0.82                  0.82                 0.80
Recall           0.59       0.49                 0.60                0.49                  0.51                 0.43
Precision        0.67       0.71                 0.72                0.73                  0.67                 0.72
F1 Score         0.63       0.58                 0.65                0.59                  0.57                 0.54

In [142]: plt.plot([0, 1], [0, 1], linestyle='--')


plt.plot(cart_train_fpr, cart_train_tpr,color='red',label="CART")
plt.plot(rf_train_fpr,rf_train_tpr,color='green',label="RF")
plt.plot(nn_train_fpr,nn_train_tpr,color='black',label="NN")
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.legend(bbox_to_anchor=(0., 1.02, 1., .102), loc='lower right')

Out[142]: <matplotlib.legend.Legend at 0x185fc569ca0>


In [143]: plt.plot([0, 1], [0, 1], linestyle='--')


plt.plot(cart_test_fpr, cart_test_tpr,color='red',label="CART")
plt.plot(rf_test_fpr,rf_test_tpr,color='green',label="RF")
plt.plot(nn_test_fpr,nn_test_tpr,color='black',label="NN")
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.legend(bbox_to_anchor=(0., 1.02, 1., .102), loc='lower right')

Out[143]: <matplotlib.legend.Legend at 0x185fc3619a0>

