Decision Trees and Boosting: Helge Voss (MPI-K, Heidelberg) TMVA Workshop
Boosted Decision Trees
Decision Tree: sequential application of cuts splits the data into nodes, where the final nodes (leaves) classify an event as signal or background (see the short sketch below).
Decision trees have been used for a long time in general "data-mining" applications, but are less well known in (High Energy) Physics.
Similar to "simple Cuts": each leaf node corresponds to a set of cuts, i.e. many boxes in phase space, each attributed either to signal or to background.
Independent of monotonic variable transformations, immune against outliers.
Weak variables are ignored (and do not deteriorate the performance much).
Disadvantage: very sensitive to statistical fluctuations in the training data.
Multiple splits per node would fragment the data too quickly; a multiple split is in any case equivalent to a series of binary node splits, and is time consuming.
Axis-parallel cuts handle linear correlations between variables only with many splits; other methods are better adapted for such correlations.
We will see later that for "boosted" decision trees weak (dull) classifiers are often better anyway.
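To make the "sequential application of cuts" concrete, here is a minimal sketch (not TMVA code) of how a trained tree classifies an event: the event is passed down the tree, one cut per node, until it reaches a leaf labelled signal or background. The tree structure and variable names are invented for the illustration.

```python
# Minimal sketch (not TMVA): classify one event by following the tree's cuts to a leaf.
# The tree below is a hypothetical example with two variables, "var0" and "var1".

tree = {
    "var": "var0", "cut": 0.5,            # node: is var0 > 0.5 ?
    "pass": {"label": "signal"},          # leaf reached if the cut is passed
    "fail": {
        "var": "var1", "cut": -1.2,       # otherwise apply the next cut
        "pass": {"label": "background"},
        "fail": {"label": "signal"},
    },
}

def classify(event, node=tree):
    """Walk down the tree, applying one cut per node, until a leaf is reached."""
    while "label" not in node:
        node = node["pass"] if event[node["var"]] > node["cut"] else node["fail"]
    return node["label"]

print(classify({"var0": 0.7, "var1": 0.0}))   # -> signal
print(classify({"var0": 0.1, "var1": 0.0}))   # -> background
```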
Separation Gain
The node-splitting ("separation") indices, as a function of the node purity p = S/(S+B):
misidentification: 1 - max(p, 1-p)
Gini index: p (1-p)
cross entropy: -(p ln p + (1-p) ln(1-p))
The differences between the various indices are small; the Gini index is the most commonly used.
A split is chosen to maximise the separation gain, i.e. the decrease of the index from the parent node to the (event-weighted) daughter nodes.
[Figure: the three indices (misidentification, Gini index, cross entropy) plotted versus the node purity.]
There are cases where the simple misclassification error does not have any optimum at all!
Example: a node with S=400, B=400 can be split either into (S=300, B=100) and (S=100, B=300), or into (S=200, B=0) and (S=200, B=400). The two splits are equal in terms of misclassification error, but Gini index and cross entropy favour the latter, as the calculation below illustrates.
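A quick numerical check of the example above (a sketch, not TMVA code): the separation gain of a split is the parent's index minus the event-weighted sum of the daughters' indices; for the two candidate splits of the S=400, B=400 node the misclassification error cannot distinguish them, while Gini index and cross entropy prefer the split that produces a pure node.

```python
import math

def misclass(p):       # misidentification: fraction lost by a majority vote
    return min(p, 1.0 - p)

def gini(p):           # Gini index
    return p * (1.0 - p)

def cross_entropy(p):  # -(p ln p + (1-p) ln(1-p)), with 0*ln(0) := 0
    return -sum(q * math.log(q) for q in (p, 1.0 - p) if q > 0.0)

def gain(index, parent, daughters):
    """Separation gain: index(parent) - event-weighted sum of index(daughter)."""
    n_par = sum(parent)
    purity = lambda s, b: s / (s + b)
    return index(purity(*parent)) - sum(
        (s + b) / n_par * index(purity(s, b)) for s, b in daughters)

parent = (400, 400)
split_1 = [(300, 100), (100, 300)]
split_2 = [(200, 0), (200, 400)]

for index in (misclass, gini, cross_entropy):
    print(index.__name__, gain(index, parent, split_1), gain(index, parent, split_2))
# misclassification gain: 0.25 for both splits (no preference);
# Gini and cross-entropy gain: larger for split_2, which contains the pure (S=200, B=0) node
```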
Decision Tree Pruning
One can continue node splitting until all leaf nodes are basically pure (on the training sample).
Obviously, that is overtraining.
Two possibilities:
stop growing earlier: generally not a good idea, since even (apparently) useless splits might open up subsequent useful splits;
grow the tree to the end and "cut back" nodes that seem statistically dominated by fluctuations: pruning (a minimal sketch follows below).
[Figure: the same decision tree before pruning and after pruning.]
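A minimal sketch of the "grow fully, then cut back" idea (an illustration, not TMVA's pruning algorithms): a subtree is replaced by a leaf when keeping it improves the training error by less than some threshold, since such marginal splits are the ones most likely to be driven by statistical fluctuations. The dictionary-based tree format and the node counts are hypothetical.

```python
# Minimal pruning sketch: cut back subtrees whose improvement of the training error
# is below a threshold. Each node carries its signal/background training counts;
# children (if any) sit under "pass"/"fail".

def n_misclassified(node):
    """Training events misclassified below this node (majority vote in each leaf)."""
    if "pass" not in node:                       # leaf
        return min(node["n_sig"], node["n_bkg"])
    return n_misclassified(node["pass"]) + n_misclassified(node["fail"])

def prune(node, threshold=0.01):
    """Replace a subtree by a leaf if keeping it gains less than `threshold`
    in misclassified fraction (relative to this node's events)."""
    if "pass" not in node:
        return node
    node["pass"] = prune(node["pass"], threshold)
    node["fail"] = prune(node["fail"], threshold)
    n = node["n_sig"] + node["n_bkg"]
    err_as_leaf = min(node["n_sig"], node["n_bkg"]) / n
    err_as_subtree = n_misclassified(node) / n
    if err_as_leaf - err_as_subtree < threshold:
        return {"n_sig": node["n_sig"], "n_bkg": node["n_bkg"]}   # cut back to a leaf
    return node

# A node with 300 signal / 100 background whose split barely helps gets pruned away:
node = {"n_sig": 300, "n_bkg": 100,
        "pass": {"n_sig": 150, "n_bkg": 48},
        "fail": {"n_sig": 150, "n_bkg": 52}}
print("pass" in prune(node, threshold=0.01))     # -> False (the subtree was removed)
```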
Boosting
[Schematic: the training sample is used to train a first classifier C(0)(x); the events are then re-weighted and a classifier C(1)(x) is trained on the weighted sample; re-weighting and training are repeated for C(2)(x), C(3)(x), ..., C(m)(x).]
The boosted classifiers are combined into
y(x) = \sum_{i}^{N_{\mathrm{Classifier}}} w_i \, C^{(i)}(x)
Adaptive Boosting (AdaBoost)
Events misclassified by the previous classifier are re-weighted by the factor (1 - f_err)/f_err, where f_err is that classifier's misclassified (weighted) event fraction, before the next classifier is trained on the re-weighted sample. The classifiers are then combined, weighted by the same quantity:
y(x) = \sum_{i}^{N_{\mathrm{Classifier}}} \log\!\left(\frac{1 - f_{\mathrm{err}}^{(i)}}{f_{\mathrm{err}}^{(i)}}\right) C^{(i)}(x)
A compact sketch of this procedure follows below.
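A compact sketch of AdaBoost (an illustration, not TMVA's implementation), using single-cut decision "stumps" as the weak classifiers, labels y0 = +/-1, and the (1 - f_err)/f_err re-weighting from above; the toy sample anticipates the "three bump" example used later in these slides.

```python
import numpy as np

def train_stump(x, y, w):
    """Weak classifier: the single cut 'x[:, var] > cut -> sign' with the smallest
    weighted error; cut values are scanned on a coarse grid (cf. TMVA's nCuts)."""
    best = None
    for var in range(x.shape[1]):
        for cut in np.linspace(-1.0, 1.0, 21):          # toy: variables live in [-1, 1]
            for sign in (+1, -1):
                pred = np.where(x[:, var] > cut, sign, -sign)
                f_err = np.sum(w[pred != y]) / np.sum(w)
                if best is None or f_err < best[0]:
                    best = (f_err, var, cut, sign)
    return best

def adaboost(x, y, n_classifier=20):
    """AdaBoost: re-weight the misclassified events by (1 - f_err)/f_err and
    combine the classifiers with weights log((1 - f_err)/f_err)."""
    w = np.ones(len(y), dtype=float)
    stumps = []
    for _ in range(n_classifier):
        f_err, var, cut, sign = train_stump(x, y, w)
        f_err = min(max(f_err, 1e-6), 1.0 - 1e-6)        # guard against f_err = 0 or 1
        alpha = (1.0 - f_err) / f_err
        pred = np.where(x[:, var] > cut, sign, -sign)
        w[pred != y] *= alpha                            # boost the misclassified events
        stumps.append((np.log(alpha), var, cut, sign))
    return stumps

def y_boosted(x, stumps):
    """y(x) = sum_i log((1 - f_err_i)/f_err_i) * C_i(x)"""
    return sum(wgt * np.where(x[:, var] > cut, sign, -sign)
               for wgt, var, cut, sign in stumps)

# Toy "three bump" sample in one variable: signal at |Var0| > 0.5, background in between.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=(2000, 1))
y = np.where(np.abs(x[:, 0]) > 0.5, 1, -1)
stumps = adaboost(x, y)
print(np.mean(np.sign(y_boosted(x, stumps)) == y))       # close to 1.0; one stump alone: ~0.75
```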
Boosted Decision Trees
The result of ONE decision tree for a test event is either "Signal" or "Background": y(B) = 0, y(S) = 1.
A single tree therefore gives one fixed signal efficiency and background rejection; the weighted sum over the boosted trees instead yields a continuous classifier output.
AdaBoost in Pictures
Boosted Decision Trees – Control Plots
AdaBoost: A simple demonstration
Two reasonable cuts:
a) Var0 > 0.5: εsignal = 66%, εbkg ≈ 0%, misclassified events in total: 16.5%
or
b) Var0 < -0.5: εsignal = 33%, εbkg ≈ 0%, misclassified events in total: 33%
The training of a single decision tree "stump" will find cut a).
[Figure: the training sample with the two candidate cuts a) and b) indicated.]
AdaBoost: A simple demonstration
The first "tree", choosing cut a), will give an error fraction err = 0.165.
Before building the next "tree", the wrongly classified training events are re-weighted by (1 - err)/err ≈ 5.
The next "tree" then essentially sees the following data sample (a quick check of these numbers is given below):
[Figure: the re-weighted training sample.]
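A quick check of these numbers (assuming, as the quoted efficiencies suggest, a 50:50 signal/background mix with two thirds of the signal above Var0 = 0.5 and one third below Var0 = -0.5):

```python
# Fraction of all events misclassified by cut a): the third of the signal below Var0 = -0.5
err = 0.5 * (1 / 3)                  # ~0.167, quoted as ~16.5% on the slide
boost = (1 - err) / err              # ~5, the re-weighting factor for those events
print(err, boost)                    # ~0.167  ~5.0

# Weighted error of the two cuts on the re-weighted sample:
w_total = (1 - err) + err * boost    # correctly classified events keep weight 1
err_a = err * boost / w_total        # cut a) still misses the (now heavy) left bump
err_b = 0.5 * (2 / 3) / w_total      # cut b) misses the right bump (original weight)
print(err_a, err_b)                  # ~0.5  ~0.2  -> the second stump picks cut b)
```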
AdaBoost: A simple demonstration
[Figure: classifier response with only 1 tree "stump" versus 2 tree "stumps" with AdaBoost.]
"A Statistical View of Boosting" (Friedman et al., 1998)
• AdaBoost corresponds to minimising the "exponential loss function" exp(-y0 y(α,x)), where y0 = -1 (bkg) and y0 = +1 (signal).
Gradient Boost
The binomial log-likelihood loss ln(1 + exp(-2 y0 y(α,x))) is a better-behaved loss function (the corresponding "GradientBoost" is implemented in TMVA); a small comparison of the two losses is sketched below.
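A small sketch of why the binomial log-likelihood is considered better behaved: for a badly misclassified event (large negative y0·y) the exponential loss explodes, while the log-likelihood loss grows only linearly, so outliers or mislabelled events pull less on the fit. The margin values below are arbitrary.

```python
import math

def exp_loss(margin):        # AdaBoost: exp(-y0 * y)
    return math.exp(-margin)

def binomial_loss(margin):   # GradientBoost: ln(1 + exp(-2 * y0 * y))
    return math.log1p(math.exp(-2.0 * margin))

for margin in (2.0, 0.0, -2.0, -5.0):   # y0*y: well classified ... badly misclassified
    print(f"{margin:5.1f}  exp: {exp_loss(margin):8.2f}  binomial: {binomial_loss(margin):6.2f}")
# at margin -5 the exponential loss is ~148 while the binomial loss is only ~10
```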
Bagging and Randomised Trees
These combined classifiers work surprisingly well; they are very stable and almost perfect "out of the box" classifiers (a minimal bagging sketch follows below).
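For contrast with the boosting loop above: in bagging each classifier is trained on a bootstrap-resampled copy of the training sample (randomised trees / Random Forests additionally use a random subset of the variables at each split), and the outputs are simply averaged. A minimal sketch (not TMVA code); the single-cut base classifier and the toy data are invented for the illustration.

```python
import numpy as np

def train_cut(x, y):
    """Stand-in base classifier (hypothetical): a single cut halfway between the
    class means of the first variable; returns a function mapping events to +/-1."""
    cut = 0.5 * (x[y > 0, 0].mean() + x[y < 0, 0].mean())
    return lambda x_test: np.where(x_test[:, 0] > cut, 1.0, -1.0)

def bagging(x, y, train, n_classifier=50, seed=0):
    """Bagging: train each classifier on a bootstrap resample of the training sample
    (drawn with replacement) and average the outputs -- no event re-weighting."""
    rng = np.random.default_rng(seed)
    classifiers = []
    for _ in range(n_classifier):
        idx = rng.integers(0, len(y), size=len(y))   # bootstrap resample
        classifiers.append(train(x[idx], y[idx]))
    # A "randomised tree" / Random Forest would in addition pick a random subset
    # of the variables at each split of each tree.
    return lambda x_test: np.mean([c(x_test) for c in classifiers], axis=0)

# Toy usage: the label depends on the first of two variables, with a little noise.
rng = np.random.default_rng(2)
x = rng.normal(size=(1000, 2))
y = np.where(x[:, 0] + 0.3 * rng.normal(size=1000) > 0, 1, -1)
y_bag = bagging(x, y, train_cut)
print(np.mean(np.sign(y_bag(x)) == y))   # accuracy of the averaged ("bagged") response
```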
AdaBoost vs Bagging and Randomised Forests
Sometimes "boosting" is presented as nothing else than "smearing", done merely to make the decision trees more stable w.r.t. statistical fluctuations in the training.
Clever "boosting", however, can do more than, for example, Random Forests or Bagging: compare the previous example of the "three bumps".
Surprisingly, using smaller trees (weaker classifiers) in AdaBoost and other clever boosting algorithms (e.g. gradient boost) often seems to give significantly better overall performance!
Boosting at Work
Boosting seems to work best on "weak" classifiers (i.e. small, dumb trees).
The tuning of the (tree-building) parameter settings is important.
For good out-of-the-box performance: large numbers of very small trees (a typical option set is sketched below).
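As a concrete illustration of this advice, a typical TMVA BDT option string (a sketch only: the listed option names are standard TMVA BDT options, but the values are just plausible starting points and the defaults differ between TMVA versions).

```python
# Assembling a typical TMVA BDT option string: many small ("weak") trees.
bdt_options = ":".join([
    "NTrees=850",                # a large number of trees ...
    "MaxDepth=3",                # ... each of them small and weak
    "BoostType=AdaBoost",
    "AdaBoostBeta=0.5",          # slower, usually more robust learning
    "SeparationType=GiniIndex",  # node-splitting index (cf. the Separation Gain slide)
    "nCuts=20",                  # coarse grid of candidate cut values per variable
])
print(bdt_options)  # pass this as the option string when booking TMVA::Types::kBDT
```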
Generalised Classifier Boosting
Principle (just as in the BDT): multiple training cycles; each time, wrongly classified events get a higher event weight.
[Schematic: as for the BDT, a classifier C(0)(x) is trained on the training sample, the events are re-weighted, C(1)(x) is trained on the weighted sample, the events are re-weighted again, and so on up to C(m)(x).]
The boosted classifiers are combined as before:
y(x) = \sum_{i}^{N_{\mathrm{Classifier}}} \log\!\left(\frac{1 - f_{\mathrm{err}}^{(i)}}{f_{\mathrm{err}}^{(i)}}\right) C^{(i)}(x)
Boosting might be especially interesting for simple (weak) methods like Cuts, Linear Discriminants, or simple (small, few-node) MLPs.
AdaBoost On a linear Classifier (e.g. Fisher)
Oops… there is still a problem in TMVA's generalised boosting; this example does not work yet!
Boosting a Fisher Discriminant in TMVA…
100 boosts of a "Fisher Discriminant", in two variants:
as a multivariate tree split (yes, it is in TMVA, although I argued against it earlier; I hoped to cope better with linear correlations that way…),
and via generalised boosting of the Fisher classifier.
Learning with Rule Ensembles
Following the RuleFit approach by Friedman and Popescu (Friedman-Popescu, Tech. Rep., Statistics Dept., Stanford U., 2003):
y_{\mathrm{RF}}(x) = a_0 + \sum_{m=1}^{M_R} a_m r_m(\hat{x}) + \sum_{k=1}^{n_R} b_k \hat{x}_k
i.e. a linear combination of rules r_m(\hat{x}) (products of cuts, obtained from an ensemble of decision trees) plus a linear term in the input variables \hat{x}_k.
Regression Trees
Use this to model ANY function for which you have no analytic form but "training data", e.g. the energy in your calorimeter as a function of the shower parameters, with training data from a test beam.
Regression Trees
Leaf nodes: one output value (instead of a signal/background classification).
[Figure: the regression-tree output, with a zoomed view.]
Regression trees seem to need, despite boosting, larger trees (a minimal sketch of a regression split follows below).
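For regression the splitting criterion changes accordingly: a common choice (an illustration, not necessarily TMVA's exact implementation) is to minimise the summed variance of the target values in the two daughter nodes, with each leaf returning the mean target of its training events. A minimal sketch, using a made-up "calorimeter" toy where the target energy depends mainly on the first input variable.

```python
import numpy as np

def best_split(x, t, n_cuts=20):
    """Find the single cut (variable, value) that minimises the summed variance of
    the regression target t in the two daughter nodes; each leaf would then return
    the mean target of its training events."""
    best = None
    for var in range(x.shape[1]):
        for cut in np.linspace(x[:, var].min(), x[:, var].max(), n_cuts)[1:-1]:
            mask = x[:, var] > cut
            if mask.all() or not mask.any():
                continue
            cost = t[mask].var() * mask.sum() + t[~mask].var() * (~mask).sum()
            if best is None or cost < best[0]:
                best = (cost, var, cut, t[mask].mean(), t[~mask].mean())
    return best

# Toy "calorimeter" example: the target (true energy) depends mainly on variable 0.
rng = np.random.default_rng(3)
x = rng.uniform(0, 1, size=(500, 2))
t = 10.0 * x[:, 0] + rng.normal(scale=0.5, size=500)
cost, var, cut, mean_pass, mean_fail = best_split(x, t)
print(var, round(cut, 2), round(mean_pass, 1), round(mean_fail, 1))
# -> splits on variable 0 near the middle; the two leaf values are roughly 7.5 and 2.5
```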
Summary