

2013 International Symposium on Biometrics and Security Technologies

Comparison among Methods of Ensemble Learning

Shaohua Wan
School of Information and Safety Engineering
Zhongnan University of Economics and Law
Wuhan, China
E-mail: [email protected]

Hua Yang*1,2
1 School of Mathematics and Computer Science, Guizhou Normal University, Guiyang, China
2 College of Chinese Language and Literature, Wuhan University, Wuhan, China
*Corresponding author. E-mail: [email protected]

Abstract—Ensemble learning refers to a collection of methods that learn a target function by training a number of individual learners and combining their predictions. We explore four popular methods of combining learner outputs (bagging, boosting, stacking and random forest) for classification and regression problems, and compare their accuracy and training time. Following this, experimental evaluations are performed on UCI datasets.

Keywords - Bagging, Boosting, Stacking, Random Forest

I. INTRODUCTION

Ensemble methods, or classifier combination methods, aggregate the predictions of multiple classifiers into a single learning model. Several classifier models (called "weak" or "base" learners) are trained and their results are usually combined through a voting or averaging process. The idea behind ensemble methods can be compared to situations in real life: when critical decisions have to be taken, the opinions of several experts are often taken into account rather than relying on a single judgment. Ensembles have been shown to be more accurate in many cases than the individual classifiers, but it is not always meaningful to combine models. Ideal ensembles consist of classifiers with high accuracy which differ as much as possible [6-9]. If each classifier makes different mistakes, the total error will be reduced; if the classifiers are identical, a combination is useless since the results remain unchanged. Combining multiple models depends on the level of disagreement between the classifiers and only helps when these models are significantly different from each other. Here, the literature in general is reviewed, with, where possible, an emphasis on both theory and practical advice; then the taxonomy from Jain, Duin and Mao [4] is provided; and finally four ensemble methods are focused on: bagging, boosting, stacked generalization and the random forest.

II. COMMONLY USED ENSEMBLE LEARNING ALGORITHMS

A. Bagging

Bagging is a method for generating a diverse ensemble for model combination. For unstable classifiers such as decision trees, whose individual performances are similar, we draw n samples from the training data with replacement, train a classifier on each sample, repeat this for a number of iterations, and finally combine all learned classifiers. The random forest has a very similar structure and also uses bootstrap sampling.

Bagging, which stands for bootstrap aggregating, is one of the earliest, most intuitive and perhaps the simplest ensemble-based algorithms, with a surprisingly good performance [3]. Diversity of classifiers in bagging is obtained by using bootstrapped replicas of the training data. That is, different training data subsets are randomly drawn – with replacement – from the entire training dataset. Each training data subset is used to train a different classifier of the same type. Individual classifiers are then combined by taking a simple majority vote of their decisions. For any given instance, the class chosen by the largest number of classifiers is the ensemble decision. Since the training datasets may overlap substantially, additional measures can be used to increase diversity, such as using a subset of the training data for training each classifier, or using relatively weak classifiers (such as decision stumps). The pseudo code of bagging is provided in Algorithm 1.

In the bagging algorithm, each member of the ensemble is constructed from a different training dataset, and the predictions are combined either by uniform averaging or by voting over class labels. Each dataset is generated by sampling from the full set of N data examples, choosing N items uniformly at random with replacement. Each such sample is known as a bootstrap; the name bagging is an acronym derived from Bootstrap AGGregatING. Since a bootstrap samples N items uniformly at random with replacement, the probability of any individual data item not being selected is p = (1 - 1/N)^N, which approaches e^(-1) for large N. Therefore, with large N, a single bootstrap is expected to contain approximately 63.2% of the original set, while 36.8% of the original items are not selected.

Like many ensemble methods, bagging works best with unstable models, that is, those that produce differing generalization behavior with small changes to the training data. These are also known as high-variance models, examples of which are decision trees and neural networks. Bagging therefore tends not to work well with very simple models. In effect, bagging samples randomly from the space of possible models to make up the ensemble; with very simple models the sampling produces almost identical (low diversity) predictions.

Despite its apparent capability for variance reduction, situations have been demonstrated where bagging can converge without affecting variance. Several other explanations have been proposed for bagging's success, including links to Bayesian Model Averaging. In summary, it seems that, several years after its introduction and despite its apparent simplicity, bagging is still not fully understood.
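As a quick numerical check of the 63.2% / 36.8% figures above, the short Python snippet below (our illustration, not part of the original paper) estimates the fraction of distinct original items appearing in a bootstrap sample and compares it with the closed-form value 1 - (1 - 1/N)^N.

import math
import random

def bootstrap_unique_fraction(n_items, n_trials=1000, seed=0):
    """Estimate the expected fraction of distinct original items in a bootstrap sample."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        sample = [rng.randrange(n_items) for _ in range(n_items)]  # N draws with replacement
        total += len(set(sample)) / n_items
    return total / n_trials

N = 1000
print("simulated fraction of distinct items:", round(bootstrap_unique_fraction(N), 4))
print("closed form 1 - (1 - 1/N)^N:", round(1 - (1 - 1 / N) ** N, 4))
print("limit 1 - 1/e:", round(1 - math.exp(-1), 4))

All three printed values should be close to 0.632, consistent with the claim that roughly 36.8% of the items are left out of any single bootstrap.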
Algorithm 1 Bagging
Input: training set S = {(x1, y1), (x2, y2), ..., (xn, yn)}
for t = 1 to T do
    Build a dataset St by sampling N items randomly with replacement from S.
    Train a model ht using St, and add it to the ensemble.
end for
For a new testing point (x', y'):
    If model outputs are continuous, combine them by averaging.
    If model outputs are class labels, combine them by voting.
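The following Python sketch mirrors Algorithm 1; it is our illustration rather than the authors' code, and it assumes numpy arrays with integer-coded class labels and scikit-learn decision trees as the base learners.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, T=25, seed=0):
    """Train T base models, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    ensemble = []
    for _ in range(T):
        idx = rng.integers(0, n, size=n)  # N items sampled with replacement
        ensemble.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return ensemble

def bagging_predict(ensemble, X):
    """Combine class-label outputs by a simple majority vote."""
    votes = np.array([model.predict(X) for model in ensemble])       # shape (T, n_points)
    return np.array([np.bincount(col).argmax() for col in votes.T])  # per-point majority

For regression-style (continuous) outputs, the voting step would simply be replaced by averaging the per-model predictions, as the last lines of Algorithm 1 state.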

B. Boosting

Boosting is another method for generating a diverse ensemble [5, 10]. Boosting combines weak learners to obtain a strong learner. The most widely used boosting algorithm is AdaBoost, which stands for Adaptive Boosting. At each step, AdaBoost produces a weak learner and updates the weights of the training data (increasing the weights of the misclassified examples); finally, it combines these weak learners linearly to form a strong learner. Here, a weak learner is one that can classify better than random guessing under any weighting of the training data.

Similar to bagging, boosting also creates an ensemble of classifiers by resampling the data, which are then combined by majority voting. However, in boosting, resampling is strategically geared to provide the most informative training data for each consecutive classifier. In essence, each iteration of boosting creates three weak classifiers: the first classifier C1 is trained with a random subset of the available training data. The training data subset for the second classifier C2 is chosen as the most informative subset, given C1. Specifically, C2 is trained on a training set only half of which is correctly classified by C1, while the other half is misclassified. The third classifier C3 is trained with instances on which C1 and C2 disagree. The three classifiers are combined through a three-way majority vote. The pseudo code and implementation detail of boosting are shown in Algorithm 2.

Schapire [11] showed that the error of this algorithm has an upper bound: if the algorithm A used to create the classifiers C1, C2, C3 has an error of ε (as computed on S), then the error of the ensemble is bounded above by f(ε) = 3ε^2 - 2ε^3. Note that f(ε) ≤ ε for ε < 1/2. That is, as long as the original algorithm A can do at least better than random guessing, the boosting ensemble that combines three classifiers generated by A on the three distributions of S described above will always outperform A. Also, the ensemble error is a training error bound. Hence, a stronger classifier is generated from three weaker classifiers. A strong classifier in the strict PAC learning sense can then be created by recursive applications of boosting. A particular limitation of this scheme is that it applies only to binary classification problems. This limitation is removed with the AdaBoost algorithm.

Algorithm 2 Boosting
Input: training set (x1, y1), (x2, y2), ..., (xm, ym), where yi in {-1, +1} is the correct label of instance xi in X
for t = 1 to T
    construct a distribution Dt on {1, ..., m}
    find a weak classifier ("rule of thumb") ht: X -> {-1, +1} with small error εt on Dt:
        εt = Pr_Dt[ht(xi) ≠ yi]
Output the final classifier Hfinal
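To make the reweighting step of Algorithm 2 concrete, here is a minimal discrete AdaBoost sketch in Python. It is an illustration under the standard AdaBoost update rule rather than code from the paper, and it assumes labels in {-1, +1} with decision stumps as the weak learners.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """Discrete AdaBoost with depth-1 trees; y must contain only -1 and +1."""
    n = len(X)
    D = np.full(n, 1.0 / n)                      # distribution D_t over training examples
    learners, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = stump.predict(X)
        eps = D[pred != y].sum()                 # weighted error eps_t = Pr_D[h_t(x) != y]
        if eps == 0.0 or eps >= 0.5:             # perfect, or no better than chance: stop
            if eps == 0.0:
                learners.append(stump)
                alphas.append(1.0)
            break
        alpha = 0.5 * np.log((1.0 - eps) / eps)  # weight of this weak learner
        D *= np.exp(-alpha * y * pred)           # emphasize the misclassified examples
        D /= D.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, np.array(alphas)

def adaboost_predict(learners, alphas, X):
    """H_final(x) = sign(sum_t alpha_t * h_t(x))."""
    scores = sum(a * h.predict(X) for a, h in zip(alphas, learners))
    return np.sign(scores)

The final prediction is the sign of the weighted vote, matching the Hfinal line in Algorithm 2.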
C. Stacking

In Wolpert's stacked generalization (or stacking) [1], an ensemble of classifiers is first trained using bootstrapped samples of the training data, creating Tier 1 classifiers, whose outputs are then used to train a Tier 2 classifier (meta-classifier). The underlying idea is to learn whether the training data have been properly learned. For example, if a particular classifier incorrectly learned a certain region of the feature space, and hence consistently misclassifies instances coming from that region, then the Tier 2 classifier may be able to learn this behavior and, along with the learned behaviors of the other classifiers, correct such improper training. A cross-validation type of selection is typically used for training the Tier 1 classifiers: the entire training dataset is divided into T blocks, and each Tier 1 classifier is first trained on (a different set of) T-1 blocks of the training data. Each classifier is then evaluated on the T-th (pseudo-test) block, not seen during training. The outputs of these classifiers on their pseudo-test blocks, along with the actual correct labels for those blocks, constitute the training dataset for the Tier 2 classifier.

Unlike bagging and boosting, stacking may be (and normally is) used to combine models of different types. The procedure is as follows:
1. Split the training set into two disjoint sets.
2. Train several base learners on the first part.
3. Test the base learners on the second part.
4. Using the predictions from step 3 as the inputs and the correct responses as the outputs, train a higher-level learner.
Note that steps 1 to 3 are the same as cross-validation, but instead of using a winner-takes-all approach, the base learners are combined, possibly non-linearly.
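A minimal Python sketch of this two-level procedure is given below; it is our illustration, not the authors' implementation, and the particular base learners, the single disjoint split, and the logistic-regression meta-learner are assumptions made for the example (integer-coded labels are assumed as well).

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

def stacking_fit(X, y, seed=0):
    """Two-level stacking built on one disjoint split (steps 1-4 above)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    part1, part2 = idx[: len(X) // 2], idx[len(X) // 2 :]          # step 1: two disjoint sets
    base = [DecisionTreeClassifier(), GaussianNB()]                # heterogeneous Tier 1 models
    for model in base:
        model.fit(X[part1], y[part1])                              # step 2: train on the first part
    meta_X = np.column_stack([m.predict(X[part2]) for m in base])  # step 3: held-out predictions
    meta = LogisticRegression(max_iter=1000).fit(meta_X, y[part2]) # step 4: Tier 2 (meta) learner
    return base, meta

def stacking_predict(base, meta, X):
    meta_X = np.column_stack([m.predict(X) for m in base])
    return meta.predict(meta_X)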
D. Random Forest

A random forest [2] is a special modification of bagging that mixes the bagging approach with a random sub-sampling method. While bagging works with any algorithm as the weak learner, random forests are ensembles of unpruned classification or regression trees. The commonly used growing algorithm for the single decision trees within the random forest is CART. Just like bagging, random forests also sample attributes (without replacement) for each tree.

The trees are grown to maximal depth (no pruning) and each tree performs an independent classification or regression. Each tree then assigns a vector of attributes or features to a class, and the forest chooses the class having the most votes over all trees (using a majority vote or averaging).

Each tree is grown as follows: if the number of cases in the training set is N, sample N cases at random with replacement (i.e., the size of the sample is equal to the size of the training set, but some instances of the training set may be missing from the sample while other instances may appear multiple times in it). This sample is the training set for growing the tree. If there are M input variables, a number m << M is specified such that at each node m variables are selected at random out of the M, and the best split on these m attributes is used to split the node. The value of m is held constant during the forest growing. Each tree is grown to the largest extent possible; there is no pruning. The standard algorithm is shown in the pseudo code in Algorithm 3.
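A compact Python sketch of this growing procedure is shown below (see also Algorithm 3 that follows). It is our illustration rather than the authors' implementation; scikit-learn's max_features option stands in for the per-node choice of m out of M variables, and integer-coded labels are assumed.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest_fit(X, y, k=100, m=None, seed=0):
    """Grow k unpruned trees; each sees a bootstrap sample of size N and
    considers m randomly chosen features (out of M) at every node."""
    rng = np.random.default_rng(seed)
    n, M = X.shape
    m = m or max(1, int(np.sqrt(M)))            # a common default choice for m
    forest = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)        # bootstrap sample, same size as the training set
        tree = DecisionTreeClassifier(
            max_features=m,                     # per-node random feature subset
            random_state=int(rng.integers(1 << 31)))
        forest.append(tree.fit(X[idx], y[idx])) # grown to full depth, no pruning
    return forest

def random_forest_predict(forest, X):
    votes = np.array([tree.predict(X) for tree in forest])
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])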
Algorithm 3 Random Forest
Input: a training set S
for i = 1 to k do:
    Build subset Si by sampling with replacement from S
    Learn tree Ti from Si
        At each node: choose the best split from a random subset of F features
    Each tree grows to the largest extent; there is no pruning
Make predictions according to the majority vote of the set of k trees.

III. CONCLUSIONS

We use the UCI database as experimental data: 31 standard data sets are used, and the detailed experimental results with the corresponding instance classification accuracy are shown in Table 1 and Figure 1. The experimental analysis is as follows.

These are all different approaches to improving the performance of a model (so-called meta-algorithms). Bagging (which stands for Bootstrap Aggregation) is a way to decrease the variance of our prediction by generating additional training data from the original dataset, using combinations with repetitions to produce multi-sets of the same cardinality/size as the original data. By increasing the size of the training set in this way we cannot improve the model's predictive force, but we can decrease the variance, narrowly tuning the prediction to the expected outcome.

Boosting is an approach that calculates the output using several different models and then averages the result using a weighted-average approach. By combining the advantages and pitfalls of these models through varying the weighting formula, we can obtain good predictive force for a wider range of input data, using different narrowly tuned models.

Stacking is similar to boosting: we also apply several models to our original data. The difference, however, is that we do not have just an empirical formula for the weight function; rather, we introduce a meta-level and use another model or approach that takes the input together with the outputs of every model to estimate the weights or, in other words, to determine which models perform well and which perform badly given these input data.

Random forests are a combination of tree predictors. They are unexcelled in accuracy among current algorithms, run efficiently on large databases, and can handle thousands of input variables without variable deletion. Compared with boosting, a random forest is more robust, faster to train (there is no reweighting, and each split is made on a small subset of the data and features), and easier to extend to an online version.

As we see, these are all different approaches to combining several models into a better one, and there is no single winner here: everything depends upon our domain and what we are going to do. We can still treat stacking as a sort of more advanced boosting; however, the difficulty of finding a good approach for the meta-level makes it hard to apply in practice. The random forest (as is true for many machine learning algorithms) is an example of a tool that is useful in analyses of scientific data. But the cleverest algorithms are no substitute for human intelligence and knowledge of the data in the problem. Take the output of random forests not as absolute truth, but as smart computer-generated guesses that may be helpful in leading to a deeper understanding of the problem.

ACKNOWLEDGMENT

This paper is supported by the Natural Science Foundation Project (61070243), the Guizhou High-level Talent Research Project (TZJF-2010-048), the Guizhou Normal University PhD Start-up Research Project (11904-05032110011), the Governor Special Fund Grant of Guizhou Province for Prominent Science and Technology Talents (identification serial number (2012)155), and the Fundamental Research Funds for the Central Universities.

REFERENCES
[1] D. H. Wolpert, Stacked Generalization. Neural Networks 5(2), pages 241-259, 1992.
[2] Breiman, L., Random forests. Machine Learning 45(1) (2001) 5-32.
[3] Breiman, L., Bagging predictors. Machine Learning 24(2) (1996) 123-140.
[4] Jain, A. K., Duin, R. P. W., and Mao, J., Statistical Pattern Recognition: A Review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1), 4-37.
[5] B. Efron (1979), Bootstrap methods: another look at the jackknife. The Annals of Statistics, vol. 7, no. 1, pp. 1-26.
[6] R. Polikar, Bootstrap inspired techniques in computational intelligence: ensemble of classifiers, incremental learning, data fusion and missing features. IEEE Signal Processing Magazine, vol. 24, no. 4, pp. 59-72, 2007.
[7] M. Muhlbaier, A. Topalis, R. Polikar, NC: Combining Ensemble of Classifiers with Dynamically Weighted Consult-and-Vote for Efficient Incremental Learning of New Classes. IEEE Transactions on Neural Networks, in press, 2008.
[8] Bauer, E., Kohavi, R., An empirical comparison of voting classification algorithms: Bagging, Boosting, and variants. Machine Learning 36(1-2) (1999) 105-139.
[9] Bernard Zenko, Ljupco Todorovski, and Saso Dzeroski, A Comparison of Stacking with Meta Decision Trees to Bagging, Boosting, and Stacking with other Methods. In IEEE International Conference on Data Mining, p. 669. IEEE Computer Society, 2001.
[10] Shen, C. and Li, H. (2010), On the dual formulation of boosting algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(12), 2216-2231.
[11] R. E. Schapire, The Strength of Weak Learnability. Machine Learning, vol. 5, no. 2, pp. 197-227, 1990.

Table 1. Experimental Comparisons of Different Algorithms Using UCI Dataset


Dataset Examples Boosting(%) Bagging(%) Stacking(%) RandomForest(%)
anneal 898 83.63 98.22 76.17 99.33
autos 205 44.88 69.76 32.68 83.41
audiology 226 46.46 76.55 25.22 76.99
balance-scale 625 72.32 82.88 45.76 80.48
breast-cancer 286 70.28 67.83 70.28 69.23
breast-w 699 94.85 95.57 65.52 96.14
colic 368 81.25 85.33 64.04 86.14
credit-rating 690 84.64 85.07 55.05 85.07
german_credit 1000 69.5 74.4 70 72.5
pima_diabetes 768 74.35 74.61 65.1 73.83
Glass 214 44.86 69.63 35.51 72.9
heart-c 303 82.18 82.18 54.46 81.52
heart-h 294 77.89 78.57 63.95 77.89
heart-statlog 270 80 79.26 55.56 78.15
hepatitis 155 82.58 84.52 79.35 82.58
hypothyroid 3772 93.21 99.55 92.29 99.1
ionosphere 351 90.88 90.88 64.1 92.88
iris 150 95.33 94 33.33 95.33
kr-vs-kp 3196 93.84 99.12 52.22 98.81
labor 57 87.72 85.96 64.91 87.72
lymph 148 74.32 78.38 54.73 81.08
mushroom 8124 96.2 100 51.8 100
primary-tumor 339 28.91 45.13 24.78 42.48
segment 2310 28.57 96.97 14.29 97.66
sick 3772 97.19 98.49 93.88 98.38
sonar 208 71.63 77.4 53.37 80.77
soybean 683 27.96 86.82 13.18 91.65
vehicle 846 39.95 72.7 25.65 77.07
vote 435 95.4 95.86 61.38 95.86
vowel 990 17.37 85.76 9.09 96.06
zoo 101 60.4 42.57 40.59 89.11

Figure 1. Classification Accuracy Comparison of Different Algorithms.
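The comparison reported in Table 1 and Figure 1 could be approximated with off-the-shelf implementations. The snippet below is a hedged sketch only: the dataset (iris, one of the sets listed in Table 1), the base learners, the parameter values, and the use of 10-fold cross-validation are our assumptions rather than settings taken from the paper, so the printed accuracies will not match the table exactly.

from sklearn.datasets import load_iris
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # iris appears in Table 1 (150 examples)

models = {
    "Boosting": AdaBoostClassifier(n_estimators=50),
    "Bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50),
    "Stacking": StackingClassifier(
        estimators=[("tree", DecisionTreeClassifier()), ("nb", GaussianNB())],
        final_estimator=LogisticRegression(max_iter=1000)),
    "RandomForest": RandomForestClassifier(n_estimators=100),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)  # 10-fold cross-validated accuracy
    print(f"{name:12s} {100 * scores.mean():.2f}%")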
