Ensemble-Learning-Bagging: Case Study with Python
Introduction: Ensemble learning is a method in machine learning where
multiple models, often referred to as "base learners," are combined to
improve overall performance. Bagging, or Bootstrap Aggregating, is an
ensemble technique that reduces variance and
enhances accuracy by training multiple models on different subsets of the
training data. Each subset is created through random sampling with
replacement, known as bootstrap sampling. The core idea is to aggregate the
predictions of multiple models to form a stronger overall prediction. In
regression, the predictions are averaged, whereas in classification, a majority
vote is used to determine the outcome.
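As a toy illustration of these two aggregation rules, the short Python snippet below averages regression outputs and takes a majority vote over class labels; the prediction arrays are made-up values for illustration, not outputs from this case study.

    import numpy as np

    # Hypothetical predictions from three base learners on two inputs
    reg_preds = np.array([[2.1, 0.9],
                          [1.8, 1.1],
                          [2.4, 1.0]])
    print(reg_preds.mean(axis=0))  # regression: average -> [2.1 1. ]

    clf_preds = np.array([[1, 0],
                          [1, 1],
                          [0, 1]])  # binary class labels
    # classification: majority vote (mean >= 0.5 for 0/1 labels)
    print((clf_preds.mean(axis=0) >= 0.5).astype(int))  # -> [1 1]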
Objective: In this case study, we applied the Bagging method using decision
trees as the base learners. Our goal was to evaluate how Bagging impacts the
accuracy of a model compared to a single decision tree classifier. The Breast
Cancer Wisconsin dataset was used for this purpose.
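The dataset ships with scikit-learn, so it can be loaded directly. A minimal loading sketch follows; the 80/20 split and the random seed are illustrative assumptions, not choices stated in the study.

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split

    # Load the Breast Cancer Wisconsin dataset bundled with scikit-learn
    X, y = load_breast_cancer(return_X_y=True)

    # Hold out a test set (the 80/20 split and seed are illustrative)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)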
[Figure: Bagging visualization]
Methodology:
1. Bootstrap Sampling: Bagging generates multiple training sets by
randomly sampling with replacement from the original training data.
These bootstrap samples are then used to train separate decision tree
models (see the from-scratch sketch after this list). Because each tree
sees a different subset of the data, the aggregated model is less prone
to overfitting.
2. Training Base Learners: Decision trees were selected as the base
model because they are prone to overfitting, especially when trained on
smaller datasets. This makes them ideal candidates for Bagging, which
aims to reduce overfitting through averaging.
3. Aggregation: After the models are trained on their different subsets,
their predictions are aggregated. For classification, majority voting is
used to combine the predictions of the individual decision trees, which
smooths out much of the variance in individual tree predictions and
leads to a more generalizable model.
4. Evaluation and Comparison: The performance of the Bagging
classifier was compared with that of a single decision tree model.
Evaluated on the held-out test data, the Bagging classifier achieved
higher prediction accuracy, demonstrating how ensemble learning reduces
the variance and overfitting that single decision trees tend to suffer
from.
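Steps 1-3 can be condensed into a short from-scratch sketch. This is an illustrative implementation under stated assumptions, not the study's exact code: it assumes binary 0/1 labels (as in the Breast Cancer Wisconsin dataset) and uses scikit-learn's DecisionTreeClassifier as the base learner.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def bagging_fit(X, y, n_estimators=50, random_state=42):
        """Train n_estimators decision trees, each on a bootstrap sample."""
        rng = np.random.default_rng(random_state)
        n = len(X)
        trees = []
        for _ in range(n_estimators):
            # Step 1: bootstrap sample -- draw n indices with replacement
            idx = rng.integers(0, n, size=n)
            # Step 2: train a base learner on this bootstrap sample
            trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        return trees

    def bagging_predict(trees, X):
        """Step 3: aggregate the trees' predictions by majority vote."""
        votes = np.stack([tree.predict(X) for tree in trees])
        # For 0/1 labels the majority vote is the rounded mean
        # (ties are broken in favor of class 1)
        return (votes.mean(axis=0) >= 0.5).astype(int)

    # Usage with the X_train/X_test split from the Objective section:
    # trees = bagging_fit(X_train, y_train)
    # y_pred = bagging_predict(trees, X_test)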
Results: The Bagging classifier, built from 50 decision trees, achieved
higher accuracy than a single decision tree. This improvement highlights
Bagging's effectiveness in stabilizing decision trees and reducing
overfitting: while the single decision tree lost accuracy to
overfitting, the Bagging model generalized better to unseen data. The
comparison can be reproduced with the sketch below.
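A minimal sketch of this comparison uses scikit-learn's built-in BaggingClassifier, whose default base learner is a decision tree; the split and seeds are illustrative assumptions, so the exact accuracy figures will vary.

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import BaggingClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    # Baseline: a single, fully grown decision tree
    tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

    # Bagging ensemble of 50 trees (decision tree is the default base learner)
    bagging = BaggingClassifier(
        n_estimators=50, random_state=42).fit(X_train, y_train)

    print("Single tree accuracy:", accuracy_score(y_test, tree.predict(X_test)))
    print("Bagging accuracy:", accuracy_score(y_test, bagging.predict(X_test)))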
Conclusion: Bagging is a highly effective technique for improving the
performance of decision trees. By training multiple trees on different samples
of the data and aggregating their predictions, Bagging significantly reduces
the variance and improves the overall accuracy of the model. In our case
study, the Bagging classifier outperformed a single decision tree, confirming
that ensemble methods are valuable in scenarios where base models are prone
to overfitting.