
Data Science

Lecture # 22
Decision Tree
• Let's look at the decision tree model, a popular method used for classification
• By the end of this lecture, you should be able to:
• Explain how a decision tree is used for classification
• Describe the process of constructing a decision tree for classification
• Interpret how a decision tree arrives at a classification decision

Note: All Images are taken from edx.org



Decision Tree Overview
• The idea behind a decision tree is to split the data into subsets where each subset belongs to only one class
• This is accomplished by dividing the input space into pure regions
• i.e. regions with samples from only one class
• With real data, completely pure subsets may not be possible, so we divide the data into subsets that are as pure as possible
• A decision tree makes its classification decisions based on these decision boundaries



Classification Using Decision Tree
• The root and internal nodes have test conditions
• Each leaf node has a class label associated with it
• A decision is made by traversing the decision tree
• At each node, the answer to the test condition determines which branch to traverse
• When a leaf node is reached, the class label at that leaf determines the decision (a minimal traversal sketch follows below)
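To make the traversal concrete, here is a minimal sketch in Python. The nested-dictionary encoding, the feature names (income, debt), the thresholds, and the labels are all hypothetical choices for illustration; they are not taken from the slides.

```python
# A hand-built decision tree, encoded as nested dictionaries.
# Internal nodes hold a test condition; leaf nodes hold a class label.
# (Feature names, thresholds, and labels here are hypothetical.)
tree = {
    "feature": "income", "threshold": 50_000,
    "left":  {"label": "not likely to repay"},        # income <= 50,000
    "right": {                                        # income > 50,000
        "feature": "debt", "threshold": 20_000,
        "left":  {"label": "likely to repay"},        # debt <= 20,000
        "right": {"label": "not likely to repay"},    # debt > 20,000
    },
}

def classify(node, sample):
    """Traverse from the root, following the branch chosen by each test
    condition, until a leaf is reached; the leaf's label is the decision."""
    while "label" not in node:
        branch = "right" if sample[node["feature"]] > node["threshold"] else "left"
        node = node[branch]
    return node["label"]

print(classify(tree, {"income": 80_000, "debt": 5_000}))   # -> likely to repay
print(classify(tree, {"income": 30_000, "debt": 5_000}))   # -> not likely to repay
```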



Classification Using Decision Tree
• The depth of a node is the number of edges from the root to that node
• The depth of the root node is zero
• The depth of a tree is the number of edges on the longest path from the root to a leaf
• The size of a tree is the number of nodes in the tree (a short sketch of both computations follows below)
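Using the same hypothetical dictionary encoding as the traversal sketch above, depth and size can be computed recursively; a minimal sketch:

```python
def tree_depth(node):
    """Depth of a tree: number of edges on the longest root-to-leaf path
    (a tree consisting of a single leaf has depth 0)."""
    if "label" in node:          # leaf node
        return 0
    return 1 + max(tree_depth(node["left"]), tree_depth(node["right"]))

def tree_size(node):
    """Size of a tree: total number of nodes (internal nodes plus leaves)."""
    if "label" in node:
        return 1
    return 1 + tree_size(node["left"]) + tree_size(node["right"])

# For the example tree in the previous sketch: depth 2, size 5.
```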



Example Decision Tree

• This decision tree is used to classify an animal as a mammal or not a mammal


Constructing a Decision Tree
• Constructing a decision tree consists of the following steps:
• Start with all samples at a node
• i.e. start with all samples at the root node
• Additional nodes are added as the data is split into subsets
• Partition the samples based on the input to create the purest subsets
• i.e. each subset contains as many samples as possible belonging to just one class
• Repeat to partition the data into successively purer subsets
• Continue this process until the stopping criteria are satisfied
• An algorithm for constructing a decision tree model is called an induction algorithm (a minimal sketch follows below)
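Below is a small, self-contained sketch of one possible greedy induction algorithm in plain Python. It is not the exact algorithm behind the slides; it simply illustrates the steps above by choosing the split with the lowest weighted Gini impurity, recursing on each subset, and stopping when a node is pure, too small, or too deep. The helper names and the tiny dataset are invented.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a node: 1 - sum over classes of p_k^2."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(X, y):
    """Test every variable and every candidate threshold; return the split
    (weighted impurity, feature index, threshold) with the lowest impurity."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left  = [label for row, label in zip(X, y) if row[f] <= t]
            right = [label for row, label in zip(X, y) if row[f] >  t]
            if not left or not right:
                continue   # skip splits that leave one side empty
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def build_tree(X, y, depth=0, max_depth=3, min_samples=2):
    """Greedy induction: split on the best condition, then recurse."""
    split = best_split(X, y)
    # Stopping criteria: pure node, too few samples, max depth, or no usable split.
    if gini(y) == 0.0 or len(y) < min_samples or depth == max_depth or split is None:
        return {"label": Counter(y).most_common(1)[0][0]}   # majority-class leaf
    _, f, t = split
    left  = [i for i, row in enumerate(X) if row[f] <= t]
    right = [i for i, row in enumerate(X) if row[f] >  t]
    return {
        "feature": f,
        "threshold": t,
        "left":  build_tree([X[i] for i in left],  [y[i] for i in left],  depth + 1, max_depth, min_samples),
        "right": build_tree([X[i] for i in right], [y[i] for i in right], depth + 1, max_depth, min_samples),
    }

# Tiny made-up dataset: columns are (income, debt); labels are repay / default.
X = [(20, 5), (30, 4), (60, 1), (80, 2), (75, 9), (90, 8)]
y = ["default", "default", "repay", "repay", "default", "default"]
print(build_tree(X, y))
```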


Greedy Approach
• At each split, the induction algorithm considers only the best way to split the particular portion of the data
• This is referred to as a greedy approach



How to Determine Best Split?
• Again, the goal is to partition the data into subsets that are as pure as possible
• In this example, the partition on the right produces more homogeneous subsets, since they contain more samples belonging to a single class



Impurity Measure
• Therefore, we need a way to measure the purity of a split
• The impurity measure of a node specifies how mixed the resulting subsets are
• We want the split that minimizes the impurity measure
• Besides the Gini index, other impurity measures include entropy and the misclassification rate (a short comparison is sketched below)
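As a quick illustration, the sketch below evaluates the Gini index, entropy, and misclassification rate for a single node; the class counts are made up.

```python
from math import log2

def impurities(labels):
    """Return (Gini index, entropy, misclassification rate) for one node's labels."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    gini = 1.0 - sum(p ** 2 for p in probs)
    entropy = -sum(p * log2(p) for p in probs if p > 0)
    misclassification = 1.0 - max(probs)
    return gini, entropy, misclassification

# A hypothetical mixed node with 6 samples of class A and 2 of class B.
node = ["A"] * 6 + ["B"] * 2
print(impurities(node))   # -> roughly (0.375, 0.811, 0.25); all three are 0 for a pure node
```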



What Variable to Split On?
• The other factor in determining the best way to partition a node is which variable to split on
• The decision tree will test all variables to determine the best way to split the node, using a purity measure such as the Gini index to compare the various possibilities (a small comparison is sketched below)
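To illustrate how a purity measure is used to compare candidates, the sketch below scores one hypothetical split on variable A against one on variable B using weighted Gini impurity; the label counts are invented.

```python
def gini(labels):
    """Gini impurity of one child node."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def weighted_gini(left, right):
    """Impurity of a split = size-weighted average of the children's Gini."""
    n = len(left) + len(right)
    return (len(left) * gini(left) + len(right) * gini(right)) / n

# Hypothetical node with 8 samples: compare splitting on variable A vs variable B.
split_on_A = weighted_gini(["red"] * 4, ["blue"] * 3 + ["red"])       # nearly pure children
split_on_B = weighted_gini(["red", "blue"] * 2, ["red", "blue"] * 2)  # mixed children
print(split_on_A, split_on_B)   # the split on A has lower impurity, so it would be chosen
```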



When to Stop Splitting a Node?
• Recall that the tree induction algorithm repeatedly splits nodes to obtain more and more homogeneous subsets
• So when does this process of building subsets stop? Common stopping criteria (sketched with scikit-learn below) include:
• All (or x% of) samples have the same class label
• The number of samples in the node reaches a minimum value
• The change in the impurity measure is smaller than a threshold
• The maximum tree depth is reached
• Others… (not discussed here)
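In practice these criteria show up as hyperparameters of library implementations. For example, scikit-learn's DecisionTreeClassifier (assuming scikit-learn is available) exposes the following; the specific values below are arbitrary.

```python
from sklearn.tree import DecisionTreeClassifier

# Each hyperparameter corresponds to one of the stopping criteria above.
clf = DecisionTreeClassifier(
    criterion="gini",            # impurity measure used to compare splits
    max_depth=5,                 # stop when the maximum tree depth is reached
    min_samples_split=10,        # do not split nodes with fewer samples than this
    min_samples_leaf=4,          # each leaf must keep at least this many samples
    min_impurity_decrease=0.01,  # do not split when the impurity improvement is too small
)
# clf.fit(X_train, y_train) would then grow the tree under these constraints.
```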



Tree Induction Example: Split 1

• Let's say we want to classify loan applicants as being likely to repay a loan, or not likely to repay a loan, based on their income and amount of debt they have



Tree Induction Example: Split 1
• Building a decision tree for this classification problem could proceed as follows
• Consider the input space of this problem, as shown in the left figure
• One way to split this dataset into more homogeneous subsets is to consider the decision boundary where income equals t1
• To the right of this decision boundary are mostly red samples
• The subsets are not completely homogeneous, but this is the best way to split the original dataset based on the variable income


Tree Induction Example: Split 2
• Income > t1 is represented at the root node
• This is the condition used to split the original dataset
• Samples with income > t1 are placed in the right subset and samples with income < t1 in the left subset
• Because the right subset is almost pure, it is now labeled as RED
Tree Induction Example: Split 2
• RED means loan applicants likely to repay the loan
• The second step, then, is to determine how to split the region outlined in red
• The best way to split this data is specified by the second decision boundary, where debt equals t2
• This is represented in the decision tree on the right by adding a node with the condition debt > t2
• This region contains all blue samples, meaning that the loan applicant is not likely to repay the loan (a scikit-learn sketch of this example follows below)
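Here is a minimal scikit-learn sketch of this two-split story. The income/debt values and labels are invented (and chosen so that, like the lecture's figure, the learned tree happens to split on income first and then on debt); the learned thresholds play the roles of t1 and t2 but are not the lecture's values.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented loan data: columns are (income, debt) in thousands of dollars.
X = [(90, 10), (85, 55), (70, 40), (95, 25),          # high income, repaid
     (30, 50), (25, 45), (40, 60), (45, 55),          # low income, high debt, defaulted
     (35, 10), (20, 15), (38, 18)]                    # low income, low debt, mixed
y = ["repay", "repay", "repay", "repay",
     "default", "default", "default", "default",
     "repay", "default", "default"]

clf = DecisionTreeClassifier(criterion="gini", random_state=0)
clf.fit(X, y)

# Print the learned test conditions: the root splits on income (playing the role of t1),
# and the low-income branch then splits on debt (playing the role of t2).
print(export_text(clf, feature_names=["income", "debt"]))
```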



Decision Boundaries
• The final decision tree implements the decision boundaries shown as dashed lines in the left diagram
• The label for each region is determined by the label of the majority of the samples
• These labels are reflected in the leaf nodes of the decision tree shown on the right



Decision Boundaries
• Notice that the decision boundaries are parallel to the axes; such boundaries are referred to as rectilinear
• The boundaries are rectilinear because each split considers only a single variable
• Some algorithms can consider more than one variable per split
• However, each such split has to consider combinations of variables
• Such induction algorithms are more computationally intensive



Decision Tree for Classification
• There are a few important things to note about the decision tree classifier
• The resulting tree is often simple and easy to understand
• Induction is computationally inexpensive, so training a decision tree for classification can be relatively fast
• The greedy approach does not guarantee the best solution
• Decision boundaries are rectilinear, which means the classifier may not be able to solve complicated classification problems that require complex decision boundaries
• Discuss Week 7 notebooks (an end-to-end example is sketched below)
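To wrap up, a minimal end-to-end sketch of training and evaluating a decision tree classifier with scikit-learn; the iris dataset is just a convenient stand-in, not part of the lecture.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load a small benchmark dataset and hold out a test set.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Induction is fast; limiting the depth keeps the tree simple and easy to interpret.
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)

# Classify the held-out samples by traversing the learned tree.
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```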

