
Decision Trees

Decision trees are a type of predictive model that can be used for both classification and regression problems. They have several advantages including being easy to interpret, requiring little data preprocessing, and making no assumptions about data distributions. However, they also have some disadvantages such as being sensitive to noise in data and potentially overfitting. Various techniques like bagging and boosting can help reduce overfitting. Decision trees also tend to perform poorly on imbalanced datasets, so balancing classes is important.


Decision Trees

Pros
• Decision trees are easy to interpret and visualize.
• They can capture non-linear patterns.
• They require little data preprocessing; for example, there is no need to normalize columns.
• They can be used for feature engineering, such as imputing missing values, and are suitable for variable selection.
• Decision trees make no assumptions about the data distribution because the algorithm is non-parametric.
Cons
• Sensitive to noisy data; they can easily overfit it.
• A small variation in the data can result in a completely different tree. This instability can be reduced by bagging and boosting algorithms.
• Decision trees are biased toward the majority class on imbalanced datasets, so balance the dataset before building the tree.
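As a minimal sketch of the bagging idea mentioned above (using a synthetic dataset rather than the Pima data loaded later in these slides), averaging many trees trained on bootstrap samples typically stabilizes the high-variance predictions of a single deep tree. `BaggingClassifier` uses a decision tree as its base estimator by default:

```python
# Sketch: bagging many decision trees to reduce the variance of a single tree.
# Synthetic data is used here only for illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

# One fully grown tree: low bias, high variance.
single = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

# 100 trees on bootstrap samples; the default base estimator is a decision tree.
bagged = BaggingClassifier(n_estimators=100, random_state=1).fit(X_train, y_train)

acc_single = accuracy_score(y_test, single.predict(X_test))
acc_bagged = accuracy_score(y_test, bagged.predict(X_test))
print("single tree:", acc_single, "bagged:", acc_bagged)
```

On most random seeds the bagged ensemble scores at least as well as the single tree, though the exact numbers depend on the data.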
# Load libraries
import pandas as pd
# Import Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier
# Import train_test_split function
from sklearn.model_selection import train_test_split
# Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics
Loading Data

col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
# load dataset
pima = pd.read_csv("pima-indians-diabetes.csv", header=None, names=col_names)
pima.head()

   pregnant  glucose  bp  skin  insulin   bmi  pedigree  age  label
0         6      148  72    35        0  33.6     0.627   50      1
1         1       85  66    29        0  26.6     0.351   31      0
2         8      183  64     0        0  23.3     0.672   32      1
3         1       89  66    23       94  28.1     0.167   21      0
4         0      137  40    35      168  43.1     2.288   33      1
Feature Selection & Splitting
# Split dataset into features and target variable
feature_cols = ['pregnant', 'insulin', 'bmi', 'age', 'glucose', 'bp', 'pedigree']
X = pima[feature_cols]  # Features
y = pima.label  # Target variable

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)  # 70% training and 30% test
Building Decision Tree Model
# Create Decision Tree classifier object
clf = DecisionTreeClassifier()

# Train Decision Tree classifier
clf = clf.fit(X_train, y_train)

# Predict the response for the test dataset
y_pred = clf.predict(X_test)
Evaluating Model
# Model accuracy: how often is the classifier correct?
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.6753246753246753
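Accuracy alone can be misleading on imbalanced data (one of the cons listed earlier). The same `metrics` module also provides a confusion matrix and a per-class report. A small sketch with hypothetical labels, just to illustrate the API (in the slides, `y_test` and `y_pred` come from the Pima split above):

```python
from sklearn import metrics

# Hypothetical true labels and predictions for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 0, 1, 0, 0]
y_hat = [0, 0, 0, 0, 1, 0, 0, 1, 0, 1]

# Rows are true classes, columns are predicted classes.
cm = metrics.confusion_matrix(y_true, y_hat)
print(cm)

# Precision, recall, and F1 per class.
print(metrics.classification_report(y_true, y_hat))
```

Here the matrix shows one false positive and one false negative, which per-class recall surfaces even when overall accuracy looks acceptable.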
Optimizing Decision Tree Performance
# Create Decision Tree classifier object
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3)

# Train Decision Tree classifier
clf = clf.fit(X_train, y_train)

# Predict the response for the test dataset
y_pred = clf.predict(X_test)

# Model accuracy: how often is the classifier correct?
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.7705627705627706
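The `criterion="entropy"` and `max_depth=3` values above were picked by hand. Rather than guessing, a grid search with cross-validation can try the combinations systematically. A sketch on synthetic data (parameter values chosen for illustration):

```python
# Sketch: tuning criterion and max_depth with cross-validated grid search.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=1)

# Candidate hyperparameters; None means the tree grows until leaves are pure.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 3, 5, None],
}

search = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```

`best_params_` then holds the combination with the highest mean cross-validation score, which can be plugged back into the classifier used on the test set.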
Visualizing Decision Trees
# sklearn.externals.six was removed in scikit-learn 0.23; use io.StringIO instead
from io import StringIO
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus

dot_data = StringIO()
export_graphviz(clf, out_file=dot_data,
                filled=True, rounded=True, special_characters=True,
                feature_names=feature_cols, class_names=['0', '1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('diabetes.png')
Image(graph.create_png())
