BUSINESS REPORT
Problem 2: CART-RF-ANN
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value
condition check, write an inference on it.
import pandas as pd
from PIL import Image
import numpy as np
#from scipy.cluster.hierarchy import dendrogram, linkage,fcluster
import scipy.linalg as la
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
#from sklearn.cluster import KMeans
#from sklearn.metrics import silhouette_samples, silhouette_score
#from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
import matplotlib.pyplot as plt
import seaborn as sns
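The ingestion step named in 2.1 (read the data, describe it, check for nulls) can be sketched as follows. This is a minimal sketch on a small hypothetical frame: the values and the target name "Claimed" are assumptions for illustration, and in practice the frame would come from reading the actual data file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the insurance data (values are made up).
df = pd.DataFrame({
    "Age": [36, 48, 39, 33, 45],
    "Type": ["Airlines", "Travel Agency", "Travel Agency", "Airlines", np.nan],
    "Sales": [2.51, 20.00, 9.90, 26.00, 18.00],
    "Claimed": ["No", "No", "Yes", "No", "Yes"],  # assumed target column name
})

# Descriptive statistics for numeric and non-numeric columns alike.
summary = df.describe(include="all")
print(summary)

# Null-value check: number of missing entries per column.
null_counts = df.isnull().sum()
print(null_counts)

# Count of fully duplicated rows, a common companion check.
n_duplicates = int(df.duplicated().sum())
print("Duplicate rows:", n_duplicates)
```

A column with a non-zero entry in null_counts needs imputation or row removal before modelling.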
2.2 Data Split: Split the data into test and train, build classification model CART, Random
Forest, Artificial Neural Network.
# A decision tree in Python can take only numerical / categorical columns. It
# cannot take string / object types.
# The following code loops through each column and, if the column type is
# object, converts that column into a categorical type, with each distinct
# value becoming a category code.
# Capture the target column ("default") in separate vectors for the training set and test set.
# Split the data into training and test sets for the independent attributes.
X_train, X_test, train_labels, test_labels = train_test_split(X, y, test_size=.30,
random_state=1)
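The encoding loop and the split described in the comments above might look roughly like this. The toy frame and the target name "Claimed" are assumptions for illustration; only the 70/30 split and random_state=1 are taken from the report.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame; "Claimed" is an assumed name for the target column.
df = pd.DataFrame({
    "Age": [36, 48, 39, 33, 45, 52, 28, 61, 40, 37],
    "Type": ["Airlines", "Travel Agency"] * 5,
    "Channel": ["Online"] * 8 + ["Offline"] * 2,
    "Sales": [2.5, 20.0, 9.9, 26.0, 18.0, 45.0, 12.0, 7.5, 30.0, 15.0],
    "Claimed": ["No", "No", "Yes", "No", "Yes", "No", "No", "Yes", "No", "No"],
})

# Replace every non-numeric (string / object) column with integer category codes.
for col in df.columns:
    if not pd.api.types.is_numeric_dtype(df[col]):
        df[col] = pd.Categorical(df[col]).codes

# Separate the independent attributes from the target.
X = df.drop("Claimed", axis=1)
y = df["Claimed"]

# 70/30 split, as in the report.
X_train, X_test, train_labels, test_labels = train_test_split(
    X, y, test_size=0.30, random_state=1
)
print(X_train.shape, X_test.shape)
```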
print(pd.DataFrame(dt_model.feature_importances_, columns=["Imp"], index=X_train.columns))
Age 0.175142
Agency_Code 0.195045
Type 0.003095
Commision 0.082596
Channel 0.007262
Duration 0.266131
Sales 0.211101
Product Name 0.039937
Destination 0.019691
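For reference, a table like the one above comes from a fitted tree's feature_importances_ attribute. The sketch below uses synthetic data from make_classification with illustrative column names, so its numbers will not match the report.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the column names are illustrative only.
X_arr, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=1
)
X_train = pd.DataFrame(X_arr, columns=["Age", "Duration", "Sales", "Commision"])

# Fit an unpruned CART model with the Gini criterion.
dt_model = DecisionTreeClassifier(criterion="gini", random_state=1)
dt_model.fit(X_train, y)

# One importance value per feature; the values sum to 1.
importances = pd.DataFrame(
    dt_model.feature_importances_, columns=["Imp"], index=X_train.columns
)
print(importances.sort_values("Imp", ascending=False))
```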
Random Forest
rfcl = RandomForestClassifier(n_estimators = 501,random_state=1)
rfcl = rfcl.fit(X_train, train_labels)
              precision    recall  f1-score   support
           0       0.80      0.91      0.86      1471
           1       0.70      0.48      0.57       629
    accuracy                           0.78      2100
   macro avg       0.75      0.70      0.71      2100
weighted avg       0.77      0.78      0.77      2100
Precision is adequate: 0.80 for class 0 and 0.70 for class 1, with a macro average of 0.75 and a
weighted average of 0.77. The records are distributed between the two classes in roughly a 70:30 ratio.
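The macro and weighted averages in the Random Forest report follow directly from the per-class precision and support. A small arithmetic check (the variable names are illustrative, not from the original notebook):

```python
# Per-class precision and support, taken from the Random Forest report.
precision = {0: 0.80, 1: 0.70}
support = {0: 1471, 1: 629}

total = sum(support.values())

# Macro average: plain mean of the per-class values.
macro_avg = sum(precision.values()) / len(precision)

# Weighted average: per-class values weighted by class support.
weighted_avg = sum(precision[c] * support[c] for c in precision) / total

# Share of each class among the 2100 records.
class_share = {c: support[c] / total for c in support}

print(round(macro_avg, 2), round(weighted_avg, 2))  # prints 0.75 0.77
print({c: round(s, 2) for c, s in class_share.items()})
```

The weighted average sits closer to the class 0 value because class 0 holds about 70% of the records.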
ANN
              precision    recall  f1-score   support
           0       0.81      0.91      0.86      1471
           1       0.70      0.51      0.59       629
    accuracy                           0.79      2100
   macro avg       0.76      0.71      0.73      2100
weighted avg       0.78      0.79      0.78      2100
Precision is adequate: 0.81 for class 0 and 0.70 for class 1, with a macro average of 0.76 and a
weighted average of 0.78. The records are distributed between the two classes in roughly a 70:30 ratio:
of the 3000 customers in the data set, the 2100 in the training set are divided into 1471 and 629 customers.
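The report does not show how the ANN was fitted. One plausible set-up, sketched on synthetic data, is below; the hidden layer size, max_iter and the use of MinMaxScaler before the network are all assumptions, not the settings behind the figures above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler

# Synthetic data with a roughly 70:30 class balance (illustrative only).
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.7, 0.3], random_state=1
)
X_train, X_test, train_labels, test_labels = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Fit the scaler on the training data only, then apply it to both sets.
scaler = MinMaxScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Assumed architecture: one hidden layer of 50 units.
ann = MLPClassifier(hidden_layer_sizes=(50,), max_iter=2000, random_state=1)
ann.fit(X_train_s, train_labels)

# Text report in the same layout as the tables in this section.
train_report = classification_report(train_labels, ann.predict(X_train_s))
train_accuracy = accuracy_score(train_labels, ann.predict(X_train_s))
print(train_report)
```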
2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets
using Accuracy (1.5 pts), Confusion Matrix (2 pts), Plot ROC curve and get ROC_AUC
score for each model (2 pts), Write inferences on each model (2 pts).
# Decision Tree
# AUC and ROC for the train data
reg_dt_model = DecisionTreeClassifier(criterion='gini', max_depth=7,
                                      min_samples_leaf=10, min_samples_split=30)
reg_dt_model.fit(X_train, train_labels)
# Raw string so the backslashes in the Windows path are not treated as escapes
insu_tree_regularized = open(r'C:\Users\Anu\Downloads\insu_tree_regularized.dot', 'w')
dot_data = tree.export_graphviz(reg_dt_model, out_file=insu_tree_regularized,
                                feature_names=list(X_train), class_names=list(train_char_label))
insu_tree_regularized.close()
print(pd.DataFrame(dt_model.feature_importances_, columns=["Imp"], index=X_train.columns))
ytrain_predict = reg_dt_model.predict(X_train)
ytest_predict = reg_dt_model.predict(X_test)
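Once the train and test predictions exist, accuracy and the confusion matrix follow in a few lines. The sketch below reuses the regularization settings from the report but runs on synthetic data, so the numbers are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded insurance data.
X, y = make_classification(
    n_samples=1000, n_features=9, weights=[0.7, 0.3], random_state=1
)
X_train, X_test, train_labels, test_labels = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Same pruning parameters as the regularized tree in the report.
reg_dt_model = DecisionTreeClassifier(
    criterion="gini", max_depth=7, min_samples_leaf=10,
    min_samples_split=30, random_state=1,
)
reg_dt_model.fit(X_train, train_labels)

ytrain_predict = reg_dt_model.predict(X_train)
ytest_predict = reg_dt_model.predict(X_test)

# Accuracy: share of correct predictions on each set.
train_acc = accuracy_score(train_labels, ytrain_predict)
test_acc = accuracy_score(test_labels, ytest_predict)

# Confusion matrix: rows are true classes, columns are predicted classes.
train_cm = confusion_matrix(train_labels, ytrain_predict)
test_cm = confusion_matrix(test_labels, ytest_predict)

print("Train accuracy:", round(train_acc, 3))
print(train_cm)
print("Test accuracy:", round(test_acc, 3))
print(test_cm)
```

A large gap between train and test accuracy would suggest the tree is still overfitting despite the pruning parameters.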
# Random Forest
The same set-up is needed for Random Forest: computing the AUC and the ROC curve on the training data
and on the test data helps us understand how the model performs on each.
# Decision Tree
# AUC and ROC for the training data
# Predict probabilities to get the ROC curve for the training and test data sets separately.
The AUC values for Random Forest, Decision Tree and ANN are very close to one another and to the
precision figures above; for example, an AUC of 0.824 (82.4%) is reported for the random forest method.
AUC: 0.824
AUC: 0.864
AUC: 0.817
# AUC and ROC for the test data
AUC: 0.793
AUC: 0.793
AUC: 0.798
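AUC values of this kind are computed from predicted class-1 probabilities rather than from hard predictions. A minimal sketch on synthetic data follows; n_estimators is reduced to 100 here for speed, which is an assumption and differs from the 501 trees used in the report.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (illustrative only).
X, y = make_classification(
    n_samples=1000, n_features=9, weights=[0.7, 0.3], random_state=1
)
X_train, X_test, train_labels, test_labels = train_test_split(
    X, y, test_size=0.30, random_state=1
)

rfcl = RandomForestClassifier(n_estimators=100, random_state=1)
rfcl.fit(X_train, train_labels)

# Probability of the positive class (column 1) for each record.
probs_train = rfcl.predict_proba(X_train)[:, 1]
probs_test = rfcl.predict_proba(X_test)[:, 1]

# Area under the ROC curve on each set.
auc_train = roc_auc_score(train_labels, probs_train)
auc_test = roc_auc_score(test_labels, probs_test)
print("AUC: %.3f" % auc_train)
print("AUC: %.3f" % auc_test)

# Points of the test ROC curve: false positive rate against true positive rate.
fpr, tpr, thresholds = roc_curve(test_labels, probs_test)
```

The fpr and tpr arrays can be passed to plt.plot(fpr, tpr) to draw the curve, with a diagonal line from (0, 0) to (1, 1) as the no-skill reference.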
Train Data
              precision    recall  f1-score   support
           0       0.84      0.89      0.86      1471
           1       0.70      0.61      0.65       629
    accuracy                           0.81      2100
   macro avg       0.77      0.75      0.76      2100
weighted avg       0.80      0.81      0.80      2100
Test Data
              precision    recall  f1-score   support
           0       0.78      0.89      0.83       605
           1       0.68      0.50      0.58       295
    accuracy                           0.76       900
   macro avg       0.73      0.69      0.70       900
weighted avg       0.75      0.76      0.75       900
2.4 Final Model - Compare all models on the basis of the performance metrics in a
structured tabular manner (3 pts). Describe which model is best/optimized (2 pts).
Train Data
              precision    recall  f1-score   support
           0       0.84      0.89      0.86      1471
           1       0.70      0.61      0.65       629
    accuracy                           0.81      2100
   macro avg       0.77      0.75      0.76      2100
weighted avg       0.80      0.81      0.80      2100
Test Data
              precision    recall  f1-score   support
           0       0.78      0.89      0.83       605
           1       0.68      0.50      0.58       295
    accuracy                           0.76       900
   macro avg       0.73      0.69      0.70       900
weighted avg       0.75      0.76      0.75       900
Both models are good and can be used for data mining, but ANN is better optimized for solving the
business problem.
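A structured comparison of the three models can be assembled by computing the same metrics for each one on the train and test sets. The sketch below shows one way to do this on synthetic data; the evaluate helper, the ANN settings and the reduced forest size are hypothetical, and the resulting numbers are not the report's.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (illustrative only).
X, y = make_classification(
    n_samples=1000, n_features=9, weights=[0.7, 0.3], random_state=1
)
X_train, X_test, train_labels, test_labels = train_test_split(
    X, y, test_size=0.30, random_state=1
)

models = {
    "CART": DecisionTreeClassifier(max_depth=7, min_samples_leaf=10,
                                   min_samples_split=30, random_state=1),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=1),
    "ANN": MLPClassifier(hidden_layer_sizes=(50,), max_iter=2000, random_state=1),
}

def evaluate(model, X_part, y_part):
    """Hypothetical helper: metrics for the positive class (1) on one data set."""
    pred = model.predict(X_part)
    prob = model.predict_proba(X_part)[:, 1]
    return {
        "Accuracy": accuracy_score(y_part, pred),
        "AUC": roc_auc_score(y_part, prob),
        "Recall": recall_score(y_part, pred),
        "Precision": precision_score(y_part, pred),
        "F1": f1_score(y_part, pred),
    }

# One column per model and data set, one row per metric.
columns = {}
for name, model in models.items():
    model.fit(X_train, train_labels)
    columns[name + " Train"] = evaluate(model, X_train, train_labels)
    columns[name + " Test"] = evaluate(model, X_test, test_labels)

comparison = pd.DataFrame(columns).round(2)
print(comparison)
```

Reading across a row shows which model holds its score best from train to test, which is the basis for choosing the final model.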
2.5 Based on your analysis and working on the business problem, detail out appropriate 5
insights and recommendations to help the management solve the business objective.
We are working on a wholesale customer segmentation problem. The data is hosted on
the UCI Machine Learning Repository. The aim of this problem is to segment the clients to
provide travel distributions based on their travel across diverse product categories,
destinations, etc.
Our aim is to form groups from this data that bring similar clients together. We will, of
course, use ANN for this problem.
But before applying ANN, we have to normalize the data so that the scale of each variable is
the same (Random Forest, being tree-based, is much less sensitive to scale). Why is this
important? Well, if the scale of the variables is not the same, the model might become biased
towards the variables with a higher magnitude, like Claimed Commission.
First, normalize the data and bring all the variables to the same scale.
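As an illustration of the normalization step, MinMaxScaler rescales every column to the range 0 to 1 using that column's own minimum and maximum. The frame below is hypothetical; only the column names echo the report.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical values on very different scales.
df = pd.DataFrame({
    "Age": [25, 40, 55],
    "Sales": [10.0, 250.0, 500.0],
    "Commision": [0.0, 50.0, 200.0],
})

# Each column is mapped to (x - min) / (max - min).
scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(scaled)
```

In a train/test workflow the scaler would normally be fitted on the training set only and then applied to the test set, so that no test information leaks into training.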
1. In classification problems, the Random Forest algorithm reduces the risk of overfitting
compared with a single decision tree.
2. The same Random Forest algorithm can be used for both classification and regression tasks.