Mrs. S. L. JOTHI LAKSHMI,
ASSISTANT PROFESSOR / CSE,
AMRITA COLLEGE OF ENGINEERING AND
TECHNOLOGY, NAGERCOIL.
Agenda
Why supervised learning
What is classification
How classification works
What are decision trees
Advantages of decision trees
Important Terminology related to Decision Trees
How do Decision Trees work?
Attribute Selection Measure
How does ID3 decide which attribute is the best
Implementation of decision tree
Supervised Learning
Algorithms learn from labeled data.
The algorithm determines which label
should be given to new, unseen data.
There are two types of supervised
learning problems:
1. CLASSIFICATION
2. REGRESSION
Supervised Learning model
Steps in Supervised Learning
Determine the type of training examples
(what kind of data will be used)
Gather a training set
Determine the input feature representation of the
learned function
Determine the structure of the learned function and
corresponding learning algorithm
Complete the design
Evaluate the accuracy of the learned function
Classification
Categorize data into a given number of classes.
Identify the category/class under which new data will
fall.
A classification algorithm is a function that weighs the input
features so that the output separates one class into positive
values and the other into negative values.
Terminologies encountered in classification
Classifier
Classification model
Feature
Binary Classification
Multi-class classification
Multi-label classification
The following are the steps involved in building a
classification model (a short code sketch follows the list):
Initialize the classifier to be used.
Train the classifier
Predict the target
Evaluate the model
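A minimal sketch of these four steps, assuming scikit-learn with a DecisionTreeClassifier and its built-in Iris dataset (both are illustrative choices, not prescribed by the slides):
# Minimal sketch of the four model-building steps (illustrative choices)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier()          # 1. Initialize the classifier
clf.fit(X_train, y_train)               # 2. Train the classifier
y_pred = clf.predict(X_test)            # 3. Predict the target
print(accuracy_score(y_test, y_pred))   # 4. Evaluate the model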
Classification process
Classification is a two-step process
Learning step
Prediction step
Types of Classification Algorithm
Logistic Regression
Naïve Bayes
Stochastic Gradient Descent
K-Nearest Neighbors
Decision Tree
Random Forest
Support Vector Machine
Decision Tree
Decision Tree algorithm belongs to the family of
supervised learning algorithms.
It learns simple decision rules inferred from prior
data (the training data).
Prediction starts from the root of the tree.
Advantages of decision trees
Simple to understand and to interpret.
Trees can be visualized.
Requires little data preparation.
The cost of using the tree (i.e., predicting data) is logarithmic
in the number of data points used to train the tree.
Able to handle both numerical and categorical data.
Able to handle multi-output problems.
Uses a white box model.
Possible to validate a model using statistical tests.
Performs well even if its assumptions are somewhat violated
by the true model from which the data were generated
Types of decision trees
Types of decision trees are based on the type of target
variable
Categorical Variable Decision Tree:
has a categorical target variable.
Continuous Variable Decision Tree:
has a continuous target variable.
Important Terminology related to Decision Trees
Root Node
Splitting
Decision Node
Leaf / Terminal Node
Pruning
Branch / Sub-Tree
Parent and Child Node
Assumptions while creating Decision Tree
The whole training set is considered as the root.
Feature values are preferred to be categorical.
Records are distributed recursively on the basis of attribute values.
The order of placing attributes as root or internal nodes is decided by a statistical approach.
Attributes selection measures
These measures identify which attribute to consider as the
root node and at each level of the tree.
Entropy
Information gain
Gini index
Gain Ratio
Reduction in Variance
Chi-Square
Entropy
Entropy is a measure of the impurity or randomness
in the information being processed. The higher the
entropy, the harder it is to draw any conclusions from
that information.
Contd…
Mathematically, entropy for one attribute is represented as:
E(S) = - sum over i of p_i * log2(p_i)
where S → current state, and p_i → probability of an event i of
state S, or the percentage of class i in a node of state S.
Mathematically, entropy for multiple attributes is represented as:
E(T, X) = sum over c in X of P(c) * E(c)
where T → current state and X → the selected attribute.
Information Gain
Information gain (IG) is a statistical property that measures
how well a given attribute separates the training examples
according to their target classification.
Information Gain(T,X)=Entropy(T)-Entropy(T,X)
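A minimal Python sketch of these two measures, computed from class labels; the function names and the toy 9-"Yes" / 5-"No" split (which mirrors the PlayGolf example used later) are illustrative:
# Minimal sketch: entropy and information gain from label lists
from math import log2
from collections import Counter

def entropy(labels):
    # E(S) = -sum(p_i * log2(p_i)) over the classes in S
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(labels, groups):
    # Gain(T, X) = Entropy(T) - sum(|group|/|T| * Entropy(group))
    total = len(labels)
    weighted = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - weighted

# Example: 9 "Yes" and 5 "No" labels split by a hypothetical attribute
labels = ["Yes"] * 9 + ["No"] * 5
groups = [["Yes"] * 4, ["Yes"] * 3 + ["No"] * 2, ["Yes"] * 2 + ["No"] * 3]
print(round(entropy(labels), 3))                   # ~0.94
print(round(information_gain(labels, groups), 3))  # ~0.247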
Gini Index
Cost function used to evaluate splits in the dataset.
It is calculated by subtracting the sum of the squared
probabilities of each class from one.
Gini Index works with the categorical target variable
“Success” or “Failure”. It performs only Binary splits.
The lower the value of the Gini index, the higher the
homogeneity (a Gini index of 0 means the node is pure).
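A minimal sketch of the Gini calculation from a node's class labels (the function name and toy labels are illustrative):
# Minimal sketch: Gini index of a node from its class labels
from collections import Counter

def gini_index(labels):
    # Gini = 1 - sum(p_i^2) over the classes in the node
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

print(gini_index(["Success"] * 10))                   # 0.0 -> pure node
print(gini_index(["Success"] * 5 + ["Failure"] * 5))  # 0.5 -> maximally impure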
Gain ratio & Reduction in variance
Gain ratio is a modification of information gain that
reduces its bias toward attributes with many values.
Gain ratio takes the number and size of branches into
account when choosing an attribute.
Reduction in variance is an algorithm used for
continuous target variables (regression problems).
Chi-Square(Chi-squared Automatic Interaction Detector)
It finds out the statistical significance of the
differences between sub-nodes and the parent node.
Mathematically, chi-square is represented as:
Chi-square = sum over classes of (Actual - Expected)^2 / Expected
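As an illustration (not the slides' own code), SciPy's chi-square test can be applied to the class counts of the sub-nodes to judge whether a split is statistically significant; the table values below are made up:
# Chi-square test of a split: rows = sub-nodes, columns = class counts (Yes / No)
from scipy.stats import chi2_contingency

observed = [[8, 2],
            [3, 7]]
chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value)  # large chi2 / small p-value -> split is significant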
How do Decision Trees work?
Algorithms for classification
ID3
C4.5
CART
ID3(Iterative Dichotomiser 3)
The ID3 algorithm builds decision trees using a top-down
greedy search approach.
Steps in ID3 algorithm:
Calculate entropy for the dataset.
For each attribute/feature
Calculate entropy for all its categorical values.
Calculate information gain for the feature.
Find the feature with maximum information gain.
Repeat it until we get the desired tree.
Example
Decide whether the weather is amenable to
playing GOLF. Over the course of 2 weeks, data
is collected to help ID3 build a decision tree.
The weather attributes are outlook, temperature,
humidity, and wind speed.
They can have the following values:
outlook = { sunny, overcast, rain }
temperature = {hot, mild, cool }
humidity = { high, normal }
wind = {weak, strong }
Play Golf
Consider the table below. It represents factors that affect whether John
would go out to play golf or not. Using the data in the table, build a
decision tree model that can be used to predict whether John will play
golf or not.
Algorithm for Building Decision Trees – The
ID3 Algorithm
Begin
Load the learning set and create the decision tree root node (rootNode); add learning
set S into the root node as its subset
For rootNode, compute Entropy(rootNode.subset) first
If Entropy(rootNode.subset) == 0 (subset is homogeneous)
return a leaf node
If Entropy(rootNode.subset) != 0 (subset is not homogeneous)
compute Information Gain for each attribute left (not yet used for
splitting)
Find attribute A with Maximum(Gain(S, A))
Create child nodes for this root node and add them to rootNode in the decision
tree
For each child of the rootNode
Apply ID3(S, A, V)
Continue until a node with Entropy of 0 or a leaf node is reached
End
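A minimal recursive ID3 sketch in Python, assuming the data is a list of dictionaries and the target is one of the keys; the structure and names are illustrative, not the slides' implementation:
# Minimal recursive ID3 sketch (illustrative data layout)
from math import log2
from collections import Counter

def entropy(rows, target):
    total = len(rows)
    counts = Counter(row[target] for row in rows)
    return -sum((n / total) * log2(n / total) for n in counts.values())

def information_gain(rows, attribute, target):
    total = len(rows)
    remainder = 0.0
    for value in set(row[attribute] for row in rows):
        subset = [row for row in rows if row[attribute] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(rows, target) - remainder

def id3(rows, attributes, target):
    labels = [row[target] for row in rows]
    # Homogeneous subset (entropy 0) or no attributes left -> leaf node
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Pick the attribute with maximum information gain
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    tree = {best: {}}
    for value in set(row[best] for row in rows):
        subset = [row for row in rows if row[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = id3(subset, remaining, target)
    return tree

# Tiny illustrative run
data = [
    {"Outlook": "sunny",    "Windy": "false", "PlayGolf": "no"},
    {"Outlook": "sunny",    "Windy": "true",  "PlayGolf": "no"},
    {"Outlook": "overcast", "Windy": "false", "PlayGolf": "yes"},
    {"Outlook": "rainy",    "Windy": "false", "PlayGolf": "yes"},
    {"Outlook": "rainy",    "Windy": "true",  "PlayGolf": "no"},
]
print(id3(data, ["Outlook", "Windy"], "PlayGolf"))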
Step by Step Procedure
Step 1: Determine the Root of the Tree
Step 2: Calculate Entropy for The Classes
Step 3: Calculate Entropy After Split for Each Attribute
Step 4: Calculate Information Gain for each split
Step 5: Perform the Split
Step 6: Perform Further Splits
Step 7: Complete the Decision Tree
Determine the Root of the Tree
What is a good attribute?
A good attribute splits the data so that each successor
node is as pure as possible,
i.e., it has a high degree of "order".
Entropy
Information gain
Calculate Entropy for Other Attributes
After Split
E(PlayGolf, Outlook)
E(PlayGolf, Temperature)
E(PlayGolf, Humidity)
E(PlayGolf, Windy)
To build a decision tree, we need to calculate two types
of entropy using frequency tables as follows
1.Entropy using the frequency table of one attribute
Contd..
Entropy using the frequency table of two attributes:
E(PlayGolf, Temperature) Calculation
E(PlayGolf, Humidity) Calculation
E(PlayGolf, Windy) Calculation
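The per-attribute calculations above are shown as frequency tables on the slides; as a quick numeric check, the Outlook case works out as follows, assuming the standard golf counts (5 sunny, 4 overcast, 5 rainy rows, with overcast all "Yes" and the other two values split 3-vs-2):
# Quick numeric check of E(PlayGolf, Outlook) under the standard golf counts
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

e_sunny    = entropy([3, 2])   # ~0.971
e_overcast = entropy([4, 0])   # 0.0
e_rainy    = entropy([3, 2])   # ~0.971

e_split = 5/14 * e_sunny + 4/14 * e_overcast + 5/14 * e_rainy
print(e_split)  # ~0.6935, the 0.693 used on the next slide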
Calculating Information Gain for Each Split
The information gain is calculated using the formula:
Gain(S,T) = Entropy(S) – Entropy(S,T)
1. Gain(PlayGolf, Outlook) = Entropy(PlayGolf) – Entropy(PlayGolf, Outlook)
   = 0.94 – 0.693 = 0.247
2. Gain(PlayGolf, Temperature) = Entropy(PlayGolf) – Entropy(PlayGolf, Temperature)
   = 0.94 – 0.911 = 0.029
3. Gain(PlayGolf, Humidity) = Entropy(PlayGolf) – Entropy(PlayGolf, Humidity)
   = 0.94 – 0.788 = 0.152
4. Gain(PlayGolf, Windy) = Entropy(PlayGolf) – Entropy(PlayGolf, Windy)
   = 0.94 – 0.892 = 0.048
Perform the First Split
Initial Split using Outlook
Perform Further Splits
Complete the Decision Tree
Demo
Scikit-learn is a free software machine learning
library for the Python programming language,
designed to interoperate with the Python
numerical and scientific libraries NumPy and SciPy.
Graphviz (graph-drawing software)
Pydotplus (renders the decision tree graph)
Matplotlib is a plotting library for the Python
programming language and its numerical
mathematics extension NumPy.
Install packages
!apt install -y graphviz
!pip install graphviz
## import dependencies
from sklearn import tree #For our Decision Tree
import pandas as pd # For our DataFrame
import pydotplus
from IPython.display import Image
import matplotlib.pyplot as plt
import numpy as np
from graphviz import Digraph
from math import log
Create the dataset
# Create the dataset
# create an empty data frame
golf_df = pd.DataFrame()

# add outlook
golf_df['Outlook'] = ['rainy', 'rainy', 'overcast', 'sunny', 'sunny', 'sunny',
                      'overcast', 'rainy', 'rainy', 'sunny', 'rainy', 'overcast',
                      'overcast', 'sunny']
Dataset
Calculate information entropy
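The entropy calculation itself is not reproduced in the text above; a minimal pandas-based sketch of the same computation, assuming a 'PlayGolf' target column with the usual 9 'yes' / 5 'no' labels, might look like this:
# Entropy of the target column (the PlayGolf series here is illustrative)
import pandas as pd
from math import log2

def entropy_of_column(series):
    probabilities = series.value_counts(normalize=True)  # class proportions
    return -sum(p * log2(p) for p in probabilities)

play_golf = pd.Series(['yes'] * 9 + ['no'] * 5)  # 9 'yes', 5 'no' overall
print(entropy_of_column(play_golf))  # ~0.94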
ROOT NODE
NEXT BRANCH
NEXT BRANCH
NEXT BRANCH
FINAL DECISION TREE
Decision tree implementation using Python Sklearn
DecisionTreeClassifier is a class capable of performing multi-class classification on a
dataset.
DecisionTreeClassifier parameters
ccp_alpha
class_weight
criterion
max_depth
max_features
max_leaf_nodes
min_impurity_decrease
min_impurity_split
min_samples_leaf
min_samples_split
min_weight_fraction_leaf
presort
random_state
splitter='best'
Iris dataset
Iris dataset
Decision tree for Iris dataset
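The notebook code for this part is not reproduced in the text; one plausible sketch, fitting a DecisionTreeClassifier on the Iris dataset and rendering the tree with graphviz/pydotplus as imported earlier, is:
# Fit a DecisionTreeClassifier on Iris and render the tree (one plausible version)
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz
import pydotplus
from IPython.display import Image

iris = load_iris()
clf = DecisionTreeClassifier(criterion='entropy', random_state=0)
clf.fit(iris.data, iris.target)

dot_data = export_graphviz(clf, out_file=None,
                           feature_names=iris.feature_names,
                           class_names=iris.target_names,
                           filled=True, rounded=True)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())  # renders the tree in a notebook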
Accuracy
True Positive: you predicted positive and it's true.
True Negative: you predicted negative and it's true.
False Positive: you predicted positive and it's false.
False Negative: you predicted negative and it's false.
Recall: out of all the actual positive instances, how many we
predicted correctly. It should be as high as possible.
Recall = TP / (TP + FN)
Precision: out of all the instances we predicted as positive,
how many are actually positive.
Precision = TP / (TP + FP)
F-measure = 2 * Recall * Precision / (Recall + Precision)
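A minimal sketch of these metrics using sklearn.metrics; the label vectors are illustrative:
# Confusion matrix, accuracy, recall, precision and F1 on toy labels
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))   # [[TN, FP], [FN, TP]]
print(accuracy_score(y_true, y_pred))     # (TP + TN) / total
print(recall_score(y_true, y_pred))       # TP / (TP + FN)
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(f1_score(y_true, y_pred))           # 2 * P * R / (P + R)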
DISADVANTAGES
Overfitting.
Decision trees can be unstable
The problem of learning an optimal decision tree is
known to be NP-complete
Some concepts are hard to learn because decision trees
do not express them easily, such as XOR, parity or
multiplexer problems.
Decision tree learners create biased trees if some
classes dominate.
Overfitting
Two ways to reduce overfitting:
Pruning Decision Trees.
Random Forest
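A minimal sketch of both remedies on the Iris data: cost-complexity pruning via ccp_alpha (one of the DecisionTreeClassifier parameters listed earlier) and a RandomForestClassifier; the parameter values are illustrative:
# Two common remedies for overfitting: a pruned tree and a random forest
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pruned_tree = DecisionTreeClassifier(ccp_alpha=0.02, max_depth=3,
                                     random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100,
                                random_state=0).fit(X_train, y_train)

print(pruned_tree.score(X_test, y_test))  # accuracy of the pruned tree
print(forest.score(X_test, y_test))       # accuracy of the ensemble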
APPLICATIONS OF DECISION TREES
Business Management
Customer Relationship Management
Fraudulent Statement Detection
Engineering
Energy Consumption
Thank you