LINFO2262: Decision Trees + Random Forests
Pierre Dupont
ICTEAM Institute
Université catholique de Louvain – Belgium
Questions
What are the general rules for classifying these examples correctly?
How can we predict the class of new examples?
Decision Tree Representation
[Figure: a decision tree for the PlayTennis task; the root node tests Outlook and the leaves predict No or Yes]
Illustration from Machine Learning, T. Mitchell, McGraw Hill, 1997.
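To make the representation concrete, such a tree can also be read as nested if-then-else tests. The sketch below assumes the standard PlayTennis tree from Mitchell (1997), with the Sunny branch testing Humidity, Overcast being a Yes leaf, and the Rain branch testing Wind; the function name and the string encoding of attribute values are illustrative, not part of the original slides.

def play_tennis(outlook, humidity, wind):
    # Each internal node tests a single attribute;
    # each root-to-leaf path corresponds to a classification rule.
    if outlook == "Sunny":
        return "No" if humidity == "High" else "Yes"
    elif outlook == "Overcast":
        return "Yes"
    else:  # outlook == "Rain"
        return "No" if wind == "Strong" else "Yes"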
else    // the node is pure: all its examples share the same class label
    Assign this unique class label to the node
    return T
Notes: attributes are also called features or input variables
multi-class case: there can be more than 2 distinct class labels
Ex: yi = low, medium, or high value
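As a rough Python transcription of the induction loop sketched above (not the exact pseudocode of the slides), here is a minimal recursive tree builder for categorical attributes; the choose_attribute argument stands for the selection heuristic (e.g. the information gain defined later), and all names are illustrative.

from collections import Counter

def majority_class(labels):
    """Most frequent class label among the examples reaching a node."""
    return Counter(labels).most_common(1)[0][0]

def build_tree(examples, labels, attributes, choose_attribute):
    """Recursively grow a decision tree.
    examples: list of dicts mapping attribute name -> value
    labels:   list of class labels (two or more distinct classes)
    choose_attribute(examples, labels, attributes) -> attribute to split on
    """
    if len(set(labels)) == 1:      # pure node: assign its unique class label
        return labels[0]
    if not attributes:             # no attribute left to test
        return majority_class(labels)
    best = choose_attribute(examples, labels, attributes)
    subtree = {best: {}}
    for v in set(ex[best] for ex in examples):
        keep = [i for i, ex in enumerate(examples) if ex[best] == v]
        subtree[best][v] = build_tree([examples[i] for i in keep],
                                      [labels[i] for i in keep],
                                      [a for a in attributes if a != best],
                                      choose_attribute)
    return subtree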
Training Examples
[Table: the PlayTennis training examples D1, D2, ... from Mitchell (1997), each described by the attributes Outlook, Temperature, Humidity and Wind, with class label PlayTennis ∈ {Yes, No}]
Entropy
$\mathrm{Entropy}(S) \equiv -p_{\oplus} \log_2 p_{\oplus} - p_{\ominus} \log_2 p_{\ominus}$, where $p_{\oplus}$ ($p_{\ominus}$) is the proportion of positive (negative) examples in $S$
[Figure: $\mathrm{Entropy}(S)$ as a function of $p_{\oplus}$, equal to 0 for a pure sample and reaching its maximum of 1.0 at $p_{\oplus} = 0.5$]
Information Gain
$\mathrm{Gain}(S, x_j)$ = expected reduction in entropy due to sorting on attribute $x_j$

$\mathrm{Gain}(S, x_j) \equiv \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(x_j)} \frac{|S_v|}{|S|}\, \mathrm{Entropy}(S_v)$
[Figure: candidate splits of S = [9+,5−] on Humidity and Wind, with Gain(S, Humidity) = 0.151 and Gain(S, Wind) = 0.048]
Similar computations:
Gain(S, Outlook) = 0.246 and Gain(S, Temperature) = 0.029
⇒ Select Outlook
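A minimal Python sketch of these two quantities, assuming examples are dictionaries mapping attribute names to values (the function and variable names are illustrative):

import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum_c p_c log2 p_c over the class proportions in S."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(examples, labels, attribute):
    """Gain(S, x_j) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    n = len(labels)
    remainder = 0.0
    for v in set(ex[attribute] for ex in examples):
        s_v = [y for ex, y in zip(examples, labels) if ex[attribute] == v]
        remainder += len(s_v) / n * entropy(s_v)
    return entropy(labels) - remainder

Choosing, at each node, the attribute that maximizes this gain turns the earlier build_tree sketch into a greedy information-gain learner and reproduces the selection above (Outlook wins with Gain(S, Outlook) = 0.246).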
Illustration from Machine Learning, T. Mitchell, McGraw Hill, 1997.
[Figure: partially learned tree: Outlook at the root; the Overcast branch is a Yes leaf, while the Sunny and Rain branches (marked ?) remain to be expanded]
S_sunny = {D1, D2, D8, D9, D11}
Gain(S_sunny, Temperature) = 0.970 − (2/5)·0.0 − (2/5)·1.0 − (1/5)·0.0 = 0.570
Final Tree
[Figure: the final decision tree: Outlook at the root; the Sunny branch tests Humidity (High → No, Normal → Yes), the Overcast branch is a Yes leaf, and the Rain branch tests Wind (Strong → No, Weak → Yes)]
[Figure: hypothesis space search: the greedy simple-to-complex search grows candidate trees by successively adding tests on attributes x1, x2, x3, x4]
[Figure: a partially grown tree rooted at Outlook, with some branches already ending in No/Yes leaves and at least one branch (marked ?) still to be expanded]
A partial tree does not predict the training labels with 100% accuracy
Which attribute should be tested here?
Illustration adapted from Machine Learning, T. Mitchell, McGraw Hill, 1997.
S_sunny = {D1, D2, D8, D9, D11}
Overfitting
[Figure: accuracy as a function of tree size (number of nodes): accuracy on the training data keeps increasing as the tree grows, while accuracy on unseen data eventually degrades]
Limit overfitting
pre-pruning: stop growing the tree when the current node has too few examples or is nearly pure (a minimal stopping test is sketched below)
▸ a local decision to stop might not be globally optimal
post-pruning: grow the full tree, then prune it afterwards (see the following slides)
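A pre-pruning rule of the kind described above can be written as a simple stopping test; this is a minimal sketch, and the thresholds min_examples and min_purity are illustrative parameters rather than values from the course.

from collections import Counter

def should_stop(labels, min_examples=5, min_purity=0.95):
    """Pre-pruning test: stop expanding a node with too few examples
    or whose majority class already dominates (nearly pure node)."""
    if len(labels) < min_examples:
        return True
    majority_count = Counter(labels).most_common(1)[0][1]
    return majority_count / len(labels) >= min_purity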
Effect of Post-Pruning
[Figure: accuracy vs. tree size (number of nodes): post-pruning the fully grown tree restores accuracy on unseen data]
[Figure: a node with class counts [6+,4−] split on Humidity; the High branch receives [1+,3−] and the Normal branch [5+,1−], i.e. branch proportions 0.4 and 0.6]
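As a concrete check on the split shown above, and assuming entropy is used as the impurity measure (the slides may use another impurity function), the drop of impurity works out to roughly:

\[
\begin{aligned}
\mathrm{Entropy}([6{+},4{-}]) &= -0.6\log_2 0.6 - 0.4\log_2 0.4 \approx 0.971\\
\mathrm{Entropy}([1{+},3{-}]) &\approx 0.811, \qquad \mathrm{Entropy}([5{+},1{-}]) \approx 0.650\\
\text{drop of impurity} &= 0.971 - 0.4 \cdot 0.811 - 0.6 \cdot 0.650 \approx 0.257
\end{aligned}
\]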
else    // stopping criterion met while the node is still impure
    Assign the majority class label among the node's examples
    return T
Attribute Binarization
CART constructs binary decision trees
Continuous attributes
▸ Compare to a threshold: x < θ? ⇒ {TRUE, FALSE}
▸ Choose θ maximizing the drop of impurity (a minimal sketch follows the diagram below)
Discrete attributes with more than 2 values
▸ Compare one value against all the others, and iterate recursively
▸ Pick the value whose split maximizes the drop of impurity
[Diagram: a four-valued attribute x (values v1, v2, v3, v4) is binarized into the test x = v2 vs. x ∈ {v1, v3, v4}; the latter branch can then be binarized again into x = v4 vs. x ∈ {v1, v3}]
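For the continuous case above, a common choice is to scan the midpoints between consecutive distinct sorted values; the sketch below assumes entropy as the impurity measure and illustrative function names, so it is one possible implementation rather than CART's exact procedure.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (theta, drop) maximizing the drop of impurity for the test value < theta,
    where theta ranges over midpoints between consecutive distinct sorted values."""
    parent = entropy(labels)
    pairs = sorted(zip(values, labels))
    best_theta, best_drop = None, 0.0
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                      # no boundary between equal values
        theta = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for v, y in pairs if v < theta]
        right = [y for v, y in pairs if v >= theta]
        drop = parent - (len(left) / len(pairs)) * entropy(left) \
                      - (len(right) / len(pairs)) * entropy(right)
        if drop > best_drop:
            best_theta, best_drop = theta, drop
    return best_theta, best_drop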
CART post-pruning
$\mathrm{Diff}(\mathrm{Node}) = \dfrac{\Delta_{\mathrm{train}}(\mathrm{Node})}{n_{\mathrm{Node}} - 1}$

▸ $n_{\mathrm{Node}}$ is the number of nodes of the subtree $\mathrm{tree}_{\mathrm{Node}}$ rooted in Node
▸ $\Delta_{\mathrm{train}}(\mathrm{Node}) = \dfrac{\mathrm{error}_{\mathrm{train}}(\mathrm{Node}) - \mathrm{error}_{\mathrm{train}}(\mathrm{tree}_{\mathrm{Node}})}{|S_{\mathrm{Node}}|}$
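Read literally, the criterion can be computed from four node statistics; the helper below is a sketch with illustrative argument names, assuming the error counts are available at each node.

def diff(node_errors, subtree_errors, n_examples, n_subtree_nodes):
    """Diff(Node) = Delta_train(Node) / (n_Node - 1), with
    Delta_train(Node) = (error_train(Node) - error_train(tree_Node)) / |S_Node|.

    node_errors:     training errors made if Node is replaced by a leaf
    subtree_errors:  training errors of the subtree tree_Node rooted in Node
    n_examples:      |S_Node|, number of training examples reaching Node
    n_subtree_nodes: n_Node, number of nodes of tree_Node (must be > 1)
    """
    delta = (node_errors - subtree_errors) / n_examples
    return delta / (n_subtree_nodes - 1)

Subtrees with a small Diff value gain little training accuracy per extra node, which suggests they are natural pruning candidates.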
Algorithm BAGGING
Input: A labeled training dataset S = {(x_1, y_1), . . . , (x_n, y_n)}   // y_i is the class of x_i
Input: A total number B of bagging rounds
Input: A learning algorithm Algo returning a binary classifier X → {−1, +1}
Output: A combined classifier
for b ← 1 to B do
    S_b = Resample(S)    // randomly sample S with replacement
    f_b(x) = Algo(S_b)   // build a classifier on S_b using the learning algorithm Algo
return $\mathrm{sign}\!\left(\sum_{b=1}^{B} f_b(x)\right)$
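A direct Python transcription of the BAGGING procedure, assuming a binary base learner returning values in {−1, +1}; the resampling helper and the classifier interface are written for illustration only.

import random

def bagging(S, B, algo, seed=0):
    """S: list of (x, y) pairs with y in {-1, +1}
    B: number of bagging rounds
    algo: learning algorithm mapping a labeled sample to a classifier f(x) in {-1, +1}
    Returns the combined classifier sign(sum_b f_b(x))."""
    rng = random.Random(seed)
    classifiers = []
    for _ in range(B):
        S_b = [rng.choice(S) for _ in range(len(S))]  # resample S with replacement
        classifiers.append(algo(S_b))                 # f_b = Algo(S_b)
    def combined(x):
        votes = sum(f(x) for f in classifiers)
        return 1 if votes >= 0 else -1                # sign of the aggregated vote
    return combined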
Pros
Simple-to-complex search bias: an efficient, though incomplete, search of a complete hypothesis space
Robustness to noisy data
Multi-class case built in
Can deal with binary, n-ary, continuous-valued, and missing attributes
Rules can be derived from trees (learned knowledge can be made explicit)
Cons
Overfitting is an issue
Splitting at each node reduces the statistical significance of subsequent tests
Random Forests address both issues, but at a higher computational cost and with reduced interpretability (a minimal usage sketch follows)
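For reference, a Random Forest is essentially bagging of decision trees in which each split considers only a random subset of the attributes (Breiman, 2001). The snippet below is a minimal usage sketch with scikit-learn; the Iris dataset and the hyper-parameter values are arbitrary choices, not part of the course material.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Bagged trees where each split only considers a random subset of the features
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_tr, y_tr)
print("test accuracy:", forest.score(X_te, y_te))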
Further Reading
Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2):123–140.

Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.

Quinlan, J. (1986). Induction of decision trees. Machine Learning, 1(1):81–106.

Quinlan, J. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.