Decision Trees
Pros
• Decision trees are easy to interpret and visualize.
• They can easily capture non-linear patterns.
• They require less data preprocessing from the user; for example, there
is no need to normalize columns.
• They can be used for feature engineering, such as predicting missing
values, and are well suited to variable selection (see the sketch after this list).
• Decision trees make no assumptions about the data's distribution because of
the non-parametric nature of the algorithm.
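To illustrate the variable-selection point, here is a minimal sketch that ranks features with a fitted tree's impurity-based importances; the Iris dataset and parameters are illustrative assumptions, not part of the original tutorial.

# Rank features by impurity-based importance from a fitted tree
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # illustrative dataset (assumption)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(tree.feature_importances_)  # higher scores suggest more useful features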
Cons
• Decision trees are sensitive to noisy data and can easily overfit it.
• A small variation in the data can result in a very different tree. This
variance can be reduced by bagging and boosting algorithms (see the sketch
after this list).
• Decision trees are biased toward the majority class on an imbalanced
dataset, so balance the dataset before training the tree.
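As a concrete illustration of the bagging point, the following is a minimal sketch that wraps a single tree in scikit-learn's BaggingClassifier; the dataset and parameters are illustrative assumptions, not part of the original tutorial.

# Compare a single high-variance tree with a bagged ensemble of trees
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # illustrative dataset (assumption)
single_tree = DecisionTreeClassifier(random_state=0)
bagged_trees = BaggingClassifier(single_tree, n_estimators=100, random_state=0)
print("single tree:", cross_val_score(single_tree, X, y, cv=5).mean())
print("bagged trees:", cross_val_score(bagged_trees, X, y, cv=5).mean())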
# Load libraries
import pandas as pd
# Import the Decision Tree classifier
from sklearn.tree import DecisionTreeClassifier
# Import the train_test_split function
from sklearn.model_selection import train_test_split
# Import the scikit-learn metrics module for accuracy calculation
from sklearn import metrics
Loading Data
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
# load dataset
pima = pd.read_csv("pima-indians-diabetes.csv", header=None, names=col_names)
pima.head()
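The steps between loading the data and the first accuracy score (feature selection, the train/test split, and a baseline tree) are not shown above. Here is a minimal sketch that fills that gap; the chosen feature columns, the 70/30 split, and the random_state are assumptions, so the exact accuracy reported below may not be reproduced bit-for-bit.

# Split the dataset into features and target variable (feature list assumed)
feature_cols = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age']
X = pima[feature_cols]  # features
y = pima.label          # target variable

# Split the dataset: 70% for training, 30% for testing (assumed split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Train an unpruned (default) decision tree as a baseline
clf = DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)

# Predict on the test set and report accuracy
y_pred = clf.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))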
Accuracy: 0.6753246753246753
Optimizing Decision Tree Performance
# Create a Decision Tree classifier object, using entropy as the split
# criterion and pruning the tree to a maximum depth of 3
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3)

# Train the Decision Tree classifier
clf = clf.fit(X_train, y_train)

# Predict on the test set and report accuracy
y_pred = clf.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.7705627705627706
Visualizing Decision Trees
from io import StringIO  # sklearn.externals.six was removed; use the standard library
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus

# Export the fitted tree to Graphviz DOT format
dot_data = StringIO()
export_graphviz(clf, out_file=dot_data,
                filled=True, rounded=True, special_characters=True,
                feature_names=feature_cols, class_names=['0', '1'])

# Render the DOT data to a PNG and display it
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('diabetes.png')
Image(graph.create_png())
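The pydotplus route above also requires the Graphviz binaries to be installed on the system. If they are unavailable, a minimal sketch of a Graphviz-free alternative is scikit-learn's built-in plot_tree; note this renders with matplotlib and is a substitute technique, not the original tutorial's method.

# Render the same fitted tree with matplotlib via sklearn.tree.plot_tree
import matplotlib.pyplot as plt
from sklearn import tree

fig, ax = plt.subplots(figsize=(12, 8))
tree.plot_tree(clf, feature_names=feature_cols, class_names=['0', '1'],
               filled=True, rounded=True, ax=ax)
fig.savefig('diabetes.png')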