
LINFO2262: Decision Trees + Random Forests

Pierre Dupont

ICTEAM Institute
Université catholique de Louvain – Belgium



Outline

1 Decision Tree Representation
2 ID3 learning algorithm
    ID3 Attribute Selection
    ID3 Inductive Bias
3 From ID3 to C4.5
    Avoiding Overfitting the Data
    Incorporating Continuous-Valued Attributes
    Alternative Measures for Selecting Attributes
    Handling Missing Values
4 The CART algorithm
5 Random Forests




Decision Tree Representation

The PlayTennis problem


Each example is represented by discrete attribute-value pairs
Training examples include a yes/no class label
Outlook Temperature Humidity Wind PlayTennis
Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No

Questions
What are the general rules that correctly classify these examples?
How can we predict the class of new examples?
Decision Tree Representation

Decision Tree for PlayTennis

Outlook
  Sunny:    Humidity
              High:   No
              Normal: Yes
  Overcast: Yes
  Rain:     Wind
              Strong: No
              Weak:   Yes

Illustration from Machine Learning, T. Mitchell, McGraw Hill, 1997.


Decision Tree Representation

Decision tree representation


Same tree as on the previous slide, read as rules:

IF (Outlook = Sunny) ∧ (Humidity = High)   THEN PlayTennis = No
IF (Outlook = Sunny) ∧ (Humidity = Normal) THEN PlayTennis = Yes
...

Each internal node tests an attribute
Each branch corresponds to an attribute value
Each leaf node assigns a classification
Each path from root to leaf represents a conjunction of attribute values
The whole tree represents a set of mutually exclusive logical rules
Classification of a previously unseen example: follow the path corresponding to its observed attribute values


ID3 learning algorithm

ID3: Top-Down Induction of Decision Trees


Algorithm ID3
Input: A labeled training dataset S = {(x_1, y_1), ..., (x_n, y_n)}   // y_i is the class of x_i
Input: node, the root node of the current (sub-)tree                  // Initially, T is a single-node tree
Output: A classification tree T

if S contains several distinct class labels then          // Current node is not pure
    Assign best attribute x_j to node                      // Attribute selection at current node
    foreach value v of x_j do
        Add new node_v as a child of node in T
        Attach S_v^j = {x ∈ S | x_j = v} to node_v         // Split S according to x_j values
        ID3(S_v^j, node_v)
else
    Assign the unique class label to node
return T

Notes: attributes are also called features or input variables
Multi-class case: there can be more than 2 distinct class labels (e.g. y_i = low, medium, or high)
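A minimal, self-contained Python sketch of this recursive procedure (not from the slides): the nested-dict tree representation is an illustrative choice, attribute selection uses the information gain introduced on the next slides, and a majority-label fallback is added for the case where no attribute is left.

# Minimal ID3 sketch: discrete attributes, nested dict as tree representation.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, j):
    g = entropy(labels)
    for v in set(r[j] for r in rows):
        sub = [y for r, y in zip(rows, labels) if r[j] == v]
        g -= len(sub) / len(labels) * entropy(sub)
    return g

def id3(rows, labels, attrs):
    if len(set(labels)) == 1:                 # pure node: unique class label
        return labels[0]
    if not attrs:                             # no attribute left: majority label
        return Counter(labels).most_common(1)[0][0]
    j = max(attrs, key=lambda a: info_gain(rows, labels, a))   # best attribute
    tree = {}
    for v in set(r[j] for r in rows):         # one child per value of x_j
        keep = [i for i, r in enumerate(rows) if r[j] == v]
        tree[(j, v)] = id3([rows[i] for i in keep],
                           [labels[i] for i in keep],
                           [a for a in attrs if a != j])
    return tree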


ID3 learning algorithm

Training Examples

Day Outlook Temperature Humidity Wind PlayTennis


D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No



ID3 learning algorithm

Which attribute is best?



ID3 learning algorithm

Entropy
[Plot: Entropy(S) as a function of p⊕, equal to 0 for p⊕ ∈ {0, 1} and maximal (1.0) at p⊕ = 0.5]

S is a sample of training examples
p⊕ is the proportion of positive examples in S
p⊖ is the proportion of negative examples in S

Entropy(S) ≡ −p⊕ log2 p⊕ − p⊖ log2 p⊖

A measure of the impurity (or disorder) in a sample (= at a specific tree node)
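As a quick check of this definition on the PlayTennis sample used on the next slides ([9+,5−], i.e. 9 positive and 5 negative examples):

Entropy(S) = −(9/14) log2(9/14) − (5/14) log2(5/14) ≈ 0.940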


ID3 learning algorithm ID3 Attribute Selection

Information Gain
Gain(S, x_j) = expected reduction in entropy due to sorting on attribute x_j

Gain(S, x_j) ≡ Entropy(S) − Σ_{v ∈ Values(x_j)} (|S_v^j| / |S|) Entropy(S_v^j)

where S_v^j = {x ∈ S | x_j = v}

Gain(S, x_j) is the information provided about the target function value, given the value of x_j


ID3 learning algorithm ID3 Attribute Selection

Selecting the Next Attribute


Candidate split on Humidity (S: [9+,5−], E = 0.940):
  High:   [3+,4−], E = 0.985
  Normal: [6+,1−], E = 0.592
  Gain(S, Humidity) = 0.940 − (7/14)·0.985 − (7/14)·0.592 = 0.151

Candidate split on Wind (S: [9+,5−], E = 0.940):
  Weak:   [6+,2−], E = 0.811
  Strong: [3+,3−], E = 1.00
  Gain(S, Wind) = 0.940 − (8/14)·0.811 − (6/14)·1.00 = 0.048

Similar computations:
Gain(S, Outlook) = 0.246 and Gain(S, Temperature) = 0.029
⇒ Select Outlook
Illustration from Machine Learning, T. Mitchell, McGraw Hill, 1997.

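A small self-contained check (not from the slides) that reproduces these gains on the PlayTennis table; all names are illustrative.

from collections import Counter
from math import log2

# (Outlook, Temperature, Humidity, Wind, PlayTennis), examples D1..D14
data = [
    ("Sunny", "Hot", "High", "Weak", "No"), ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"), ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"), ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"), ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(rows, j):
    g = entropy([r[-1] for r in rows])
    for v in set(r[j] for r in rows):
        sub = [r[-1] for r in rows if r[j] == v]
        g -= len(sub) / len(rows) * entropy(sub)
    return g

for name, j in [("Outlook", 0), ("Temperature", 1), ("Humidity", 2), ("Wind", 3)]:
    print(f"Gain(S, {name}) = {gain(data, j):.3f}")
# Prints 0.246, 0.029, 0.151 and 0.048, matching the slides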


ID3 learning algorithm ID3 Attribute Selection

Growing the Tree


Root sample {D1, D2, ..., D14}: [9+,5−], split on Outlook:
  Sunny:    {D1,D2,D8,D9,D11}    [2+,3−]  → ?
  Overcast: {D3,D7,D12,D13}      [4+,0−]  → Yes
  Rain:     {D4,D5,D6,D10,D14}   [3+,2−]  → ?

Which attribute should be tested here?

Ssunny = {D1,D2,D8,D9,D11}

Gain (Ssunny , Humidity) = .970 − (3/5) 0.0 − (2/5) 0.0 = .970

Gain (Ssunny , Temperature) = .970 − (2/5) 0.0 − (2/5) 1.0 − (1/5) 0.0 = .570

Gain (Ssunny , Wind) = .970 − (2/5) 1.0 − (3/5) .918 = .019

Illustration from Machine Learning, T. Mitchell, McGraw Hill, 1997.



ID3 learning algorithm ID3 Attribute Selection

Final Tree

Outlook
  Sunny:    Humidity
              High:   No
              Normal: Yes
  Overcast: Yes
  Rain:     Wind
              Strong: No
              Weak:   Yes

Illustration from Machine Learning, T. Mitchell, McGraw Hill, 1997.


ID3 learning algorithm ID3 Attribute Selection

Hypothesis Space Search by ID3

[Figure: ID3 searches the space of decision trees from simple to complex, greedily refining a single candidate tree by adding one attribute test at a time]

Illustration from Machine Learning, T. Mitchell, McGraw Hill, 1997.


ID3 learning algorithm ID3 Attribute Selection

Hypothesis Space Search by ID3

Hypothesis space is complete!
  - Any discrete-valued function of the available attributes can be represented
  - The target function is surely in there
Space actually searched is partial
  - Hill-climbing search, no backtracking
  - Risk of local minima
  - Efficient
Outputs a single hypothesis
  - We don't know how many alternative trees are consistent with the training data
Statistically-based search choices
  - Robust to noisy data


ID3 learning algorithm ID3 Attribute Selection

Additional Properties of ID3

Is ID3 guaranteed to perfectly classify all training examples?
⇒ Yes, provided the training sample is consistent
What happens if the training sample is inconsistent?
⇒ ID3 returns a tree discriminating at best between the examples
⇒ some leaves of the final tree contain examples from several classes
Can a leaf contain no training examples?
⇒ Yes, if no training example has its specific conjunction of attribute values
⇒ In this case, the leaf can predict:
  - either an UNKNOWN class
  - or the most likely class of the parent node


ID3 learning algorithm ID3 Inductive Bias

The need for an Inductive Bias


An inductive bias is the set of restrictions we put on the hypothesis space: the class of possible models we consider
For instance, if the data is represented by vectors in R^d, we can restrict our attention to Gaussian models or linear discriminants

Theorem
Generalization, the ability to correctly classify new examples, is impossible without an inductive bias

Proof sketch:
If the model class is too rich, it is always possible to construct 2 models that perfectly agree on the training data but predict exactly the opposite for any new data

Here we are interested in data represented by sets of discrete attribute values. What is the inductive bias of ID3?
ID3 learning algorithm ID3 Inductive Bias

Inductive Bias in ID3

The hypothesis space H is complete: any discrete-valued function of the available attributes can be represented
⇒ is ID3 unbiased?
Not really...
ID3 selects the first acceptable tree in a simple-to-complex search
  - preference for short trees, and for those with high-information-gain attributes near the root
H is complete but the search is incomplete!
Bias is a preference for some hypotheses, rather than a restriction of the hypothesis space H
Ockham's razor: prefer the shortest (i.e. the simplest) hypothesis that fits the data

In short, ID3's inductive bias follows from its search strategy




From ID3 to C4.5 Avoiding Overfitting the Data

Overfitting in Decision Trees


Consider adding noisy training example #15:

⟨Sunny, Hot, Normal, Strong, PlayTennis = No⟩

What effect on earlier tree?


Outlook
  Sunny:    Humidity
              High:   No
              Normal: Yes
  Overcast: Yes
  Rain:     Wind
              Strong: No
              Weak:   Yes

Illustration from Machine Learning, T. Mitchell, McGraw Hill, 1997.


From ID3 to C4.5 Avoiding Overfitting the Data

Predicting class label from a partial tree


Root sample {D1, D2, ..., D14}: [9+,5−], split on Outlook:
  Sunny:    {D1,D2,D8,D9,D11}    [2+,3−]  → No  (majority label, node still impure)
  Overcast: {D3,D7,D12,D13}      [4+,0−]  → Yes
  Rain:     {D4,D5,D6,D10,D14}   [3+,2−]  → Yes (majority label, node still impure)

A partial tree does not predict the training labels with 100% accuracy

Illustration adapted from Machine Learning, T. Mitchell, McGraw Hill, 1997.
From ID3 to C4.5 Avoiding Overfitting the Data

Overfitting

Consider the error of a hypothesis h over
  the training data: error_train(h)
  the entire distribution D of the data: error_D(h)

Hypothesis h ∈ H overfits the training data if there is another hypothesis h′ ∈ H such that

  error_train(h) < error_train(h′)   and   error_D(h) > error_D(h′)


From ID3 to C4.5 Avoiding Overfitting the Data

Overfitting in Decision Tree Learning

[Plot: accuracy versus tree size (number of nodes); accuracy on the training data keeps increasing as the tree grows, while accuracy on the validation data peaks and then decreases]

Illustration from Machine Learning, T. Mitchell, McGraw Hill, 1997.


From ID3 to C4.5 Avoiding Overfitting the Data

Limit overfitting

Pre-pruning: stop growing the tree when the current node has too few examples or is nearly pure
  - the local decision might not be globally optimal

Post-pruning: grow the full tree, then prune parts of it
  1 Split the data into a training set and a validation set
  2 Do until further pruning is harmful:
      1 Evaluate the impact on the validation set of pruning each possible node (together with all nodes below it)
      2 Greedily remove the node whose pruning most improves validation set accuracy


From ID3 to C4.5 Avoiding Overfitting the Data

Effect of Post-Pruning

[Plot: accuracy versus tree size, with three curves: on the training data, on the validation data, and on the validation data during pruning; post-pruning improves the accuracy measured on the validation data]

Illustration from Machine Learning, T. Mitchell, McGraw Hill, 1997.


From ID3 to C4.5 Incorporating Continuous-Valued Attributes

Incorporating Continuous-Valued Attributes

Create a discrete attribute to test a continuous value
  Example: Temperature = 22.5
  (Temperature > θ) ∈ {true, false}, where θ is some threshold
What threshold value should be chosen for θ?

Temperature: 5   9   17  24  29  35
PlayTennis:  No  No  Yes Yes Yes No

Ask an expert (e.g. in a medical domain), or
  Select candidate thresholds midway between instances with distinct classifications (e.g. 13 and 32), as in the sketch below
  Compute the information gain associated with each candidate threshold
  Choose the threshold maximizing the information gain
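A minimal sketch of this threshold selection (not from the slides; function and variable names are illustrative).

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Try thresholds midway between consecutive examples with distinct classes."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    e_all = entropy([y for _, y in pairs])
    best = None
    for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
        if y1 == y2:
            continue                               # only class boundaries matter
        theta = (v1 + v2) / 2
        left = [y for v, y in pairs if v <= theta]
        right = [y for v, y in pairs if v > theta]
        g = e_all - len(left) / n * entropy(left) - len(right) / n * entropy(right)
        if best is None or g > best[1]:
            best = (theta, g)
    return best

# Temperature example from the slide: candidates 13.0 and 32.0; 13.0 wins
print(best_threshold([5, 9, 17, 24, 29, 35], ["No", "No", "Yes", "Yes", "Yes", "No"]))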


From ID3 to C4.5 Alternative Measures for Selecting Attributes

Attributes with Many Values


Problem
If an attribute has many values, Gain will tend to select it
Imagine using Date = Jun 3 1996 as an attribute
⇒ Date will have the highest Gain, because Date alone perfectly predicts the classification on the training sample

Solution: use GainRatio instead of Gain

GainRatio(S, x_j) ≡ Gain(S, x_j) / SplitInfo(S, x_j)
with SplitInfo(S, x_j) ≡ − Σ_v (|S_v^j| / |S|) log2(|S_v^j| / |S|)
where S_v^j = {x ∈ S | x_j = v} is the subset of S for which x_j has value v

SplitInfo(S, x_j) is the entropy of S with respect to the values of x_j
Entropy(S) is the entropy of S with respect to the class label (⊕ or ⊖)
SplitInfo(S, x_j) discourages the selection of attributes with many values (high entropy)
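A worked check (not on the slides) on the PlayTennis data: Outlook takes the values Sunny (5 examples), Overcast (4) and Rain (5), so

SplitInfo(S, Outlook) = −(5/14) log2(5/14) − (4/14) log2(4/14) − (5/14) log2(5/14) ≈ 1.58

and GainRatio(S, Outlook) ≈ 0.246 / 1.58 ≈ 0.156.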
From ID3 to C4.5 Handling Missing Values

Unknown Attribute Values

What if some examples miss some values of x_j (e.g. a missing Humidity value)?

Node sample [6+,4−], test on Humidity:
  High:   [1+,3−]
  Normal: [5+,1−]

Two simple strategies:
  - If the node tests x_j, assign the most common value of x_j among the other examples sorted to that node ⇒ Normal
  - Or assign the most common value of x_j among the other examples with the same target value (here ⊖) ⇒ High


From ID3 to C4.5 Handling Missing Values

Same node as before (sample [6+,4−], test on Humidity: High → [1+,3−], Normal → [5+,1−])

A third strategy: fraction the examples
  - assign a probability p_v to each possible value v of x_j, estimated from the examples with a known value (here p_High = 4/10 = 0.4 and p_Normal = 6/10 = 0.6)
  - assign a fraction p_v of the example to each descendant in the tree

Node sample [6+,5−] (including one negative example with a missing Humidity value), test on Humidity:
  High   (weight 0.4): [1+, 3.4−]
  Normal (weight 0.6): [5+, 1.6−]

  - classify new examples by summing the weights of the fragments at the leaf nodes




The CART algorithm

The CART Algorithm [Breiman et al. 84]


Algorithm CART
Input: A labeled training dataset S = {(x_1, y_1), ..., (x_n, y_n)}   // y_i is the class of x_i
Input: node, the root node of the current (sub-)tree                  // Initially, T is a single-node tree
Input: n_0    // A number of examples below which splitting is stopped
Input: imp_0  // An impurity level below which splitting is stopped
Output: A classification tree T

if |S| ≥ n_0 AND Gini(S) ≥ imp_0 then                      // Pre-pruning
    Assign the binarized attribute x_j maximizing Drop(S, x_j) to node   // Attribute selection at current node
    foreach value v of x_j do                               // Only 2 possible values
        Add new node_v as a child of node in T
        Attach S_v^j = {x ∈ S | x_j = v} to node_v          // Split S according to x_j values
        CART(S_v^j, node_v)
else
    Assign the majority class label to node
return T


The CART algorithm

Gini impurity function


A node is pure if all examples assigned to it have the same class label
Gini measures the impurity of a sample S having C different classes:

  Gini(S) = 1 − Σ_{y=1}^{C} p_S(y)²

where p_S(y) denotes the proportion of examples labeled y in S

2 classes:
  Gini(S) = 1 − (p⊕² + p⊖²) = 2 p⊕ p⊖ = 2 p⊕ (1 − p⊕)

Splitting sample S with respect to attribute x_j results in a drop of impurity:

  Drop(S, x_j) = Gini(S) − Σ_{v ∈ Values(x_j)} (|S_v^j| / |S|) Gini(S_v^j)

where S_v^j = {x ∈ S | x_j = v}
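A small self-contained check (not from the slides) of these formulas on the PlayTennis root split used earlier (S: [9+,5−]; Outlook gives Sunny [2+,3−], Overcast [4+,0−], Rain [3+,2−]).

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def drop(parent_counts, child_counts_list):
    n = sum(parent_counts)
    return gini(parent_counts) - sum(
        sum(child) / n * gini(child) for child in child_counts_list
    )

print(gini([9, 5]))                             # ~0.459
print(drop([9, 5], [[2, 3], [4, 0], [3, 2]]))   # ~0.116 for the split on Outlook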
The CART algorithm

Gini versus Entropy

When entropy is used, the impurity of S is interpreted as the uncertainty in S
When entropy is used, the drop of impurity is equal to the information gain
In all cases, the attribute that maximizes the drop of impurity is selected

[Plot: Gini impurity and entropy as functions of p⊕; both vanish for pure samples (p⊕ = 0 or 1) and are maximal at p⊕ = 0.5]


The CART algorithm

Attribute Binarization
CART constructs binary decision trees

Continuous attributes
  - Compare to a threshold: x < θ?  ⇒ {TRUE, FALSE}
  - Choose θ maximizing the drop of impurity
Discrete attributes having more than 2 values
  - Compare one value with respect to all the others, and iterate recursively (see the diagram below)
  - Pick the value such that splitting maximizes the drop of impurity

[Diagram: a 4-valued attribute x is binarized by testing one of its values against the remaining ones; the branch holding the remaining values can be split again further down the tree]


The CART algorithm

CART post-pruning

For each Node compute

  Diff(Node) = Δ_train(Node) / (n_Node − 1)

  - n_Node is the number of nodes of the subtree tree_Node rooted at Node
  - Δ_train(Node) = (error_train(Node) − error_train(tree_Node)) / |S_Node|

error_train(Node) denotes the number of misclassifications in the training sample S_Node when the tree is pruned at Node (i.e. Node becomes a leaf)
error_train(tree_Node) denotes the number of misclassifications in the training sample S_Node by the subtree tree_Node (before pruning)

Prune the node for which Diff(Node) is minimal
Prune iteratively from the full tree down to a single-node tree and select the tree whose accuracy over an independent validation set is maximal
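Not part of the slides, but closely related in practice: scikit-learn's CART-style trees implement minimal cost-complexity pruning, whose "weakest link" criterion is very similar to Diff(Node) above. A hedged, self-contained sketch (dataset and parameter values are illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

# Compute the sequence of increasingly pruned trees (one per effective alpha)
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
trees = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
         for a in path.ccp_alphas]

# Select the pruned tree with maximal accuracy on an independent validation set
best = max(trees, key=lambda t: t.score(X_val, y_val))
print(best.get_n_leaves(), best.score(X_val, y_val))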




Random Forests

Bagging [Breiman 1996]

Algorithm BAGGING
Input: A labeled training dataset S = {(x_1, y_1), ..., (x_n, y_n)}   // y_i is the class of x_i
Input: A total number B of bagging rounds
Input: A learning algorithm Algo returning a binary classifier X → {−1, +1}
Output: A combined classifier

for b ← 1 to B do
    S_b = Resample(S)     // Randomly sample S with replacement
    f_b(x) = Algo(S_b)    // Build a classifier on S_b using learning algorithm Algo
return sign( Σ_{b=1}^{B} f_b(x) )

The final result is a combined classifier voting with uniform weights (the extension to multi-class is straightforward)
Bagging reduces overfitting to the dataset S and generally significantly outperforms the learning algorithm Algo applied once on S
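A minimal sketch of this procedure in Python (not from the slides): X and y are assumed to be NumPy arrays, and fit_algo is an assumed callable returning any object with a predict method producing −1/+1 labels.

import numpy as np

def bagging_fit(X, y, fit_algo, B=25, seed=0):
    """Train B classifiers on bootstrap resamples of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    models = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)         # resample with replacement
        models.append(fit_algo(X[idx], y[idx]))  # f_b = Algo(S_b)
    return models

def bagging_predict(models, X):
    """Combine the B classifiers by a uniform-weight vote."""
    votes = np.sum([m.predict(X) for m in models], axis=0)
    return np.sign(votes)                        # sign of the summed -1/+1 votes (ties map to 0)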


Random Forests

Random Forests [Breiman, 2001]


Bagging + decision trees + (partially) random attribute selection
The general approach is like bagging: build models on successive resamplings (with replacement) of the training data and take a majority vote to form the combined classifier
Decision trees are built with no post-pruning, but splitting of a node is stopped when it contains too few examples (typically fewer than n_0 = 5 examples)
While growing the tree, the attribute selected at each node maximizes the drop of impurity (Gini index) among a random selection of υ attributes out of the d possible attributes
  - when υ = 1, a single attribute is randomly selected and there is no need to compute the impurity
  - when υ = d, standard decision trees are built, since the best attribute among all d is selected
  - typical values are υ = ⌈log2 d⌉ or υ = ⌈√d⌉
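Not from the slides, but as a practical pointer: scikit-learn's RandomForestClassifier exposes exactly these choices; a minimal hedged example (dataset and parameter values are illustrative).

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=25, random_state=0)

# max_features="sqrt": roughly v = sqrt(d) attributes tried at each node
# min_samples_split=5: stop splitting nodes with fewer than n_0 = 5 examples
# bootstrap resampling and unpruned trees are the defaults, as described above
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            min_samples_split=5, random_state=0).fit(X, y)
print(rf.score(X, y))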


Conclusion

Decision tree learning is a powerful method

Pros
Simple-to-complex search bias is an efficient incomplete search of a complete
hypothesis space
Robustness to noisy data
Multiclass case built in
Can deal with binary, n-ary, and continuous-valued attributes, as well as missing values
Rules can be derived from trees (learned knowledge can be made explicit)

Cons
Overfitting is an issue
Splitting at each node reduces the statistical significance of subsequent tests
Random Forests address both issues but with a higher computational cost and a
reduced interpretability



Conclusion

Further Reading I

Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2):123–140.

Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.

Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1):81–106.

Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.


Conclusion

Further Reading II

Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and Regression Trees. Wadsworth International Group.

Duda, R., Hart, P., and Stork, D. (2001). Pattern Classification, chapter 8, pages 394–413. Wiley-Interscience, 2nd edition.

Mitchell, T. (1997). Machine Learning, chapter 3. McGraw Hill.
