Ensemble Learning with SVM and Decision Trees
Last Updated: 10 Mar, 2024
Ensemble learning is a machine learning technique that combines multiple individual models to improve predictive performance. Two popular algorithms used in ensemble learning are Support Vector Machines (SVMs) and Decision Trees.
What is Ensemble Learning?
Ensemble learning is a machine learning approach that merges many individual models (also referred to as "base learners" or "weak learners") into a single, stronger model called an "ensemble model." The premise is that, by aggregating the predictions of numerous models, an ensemble can frequently outperform any individual model it contains.
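To make "aggregating the predictions" concrete, here is a minimal sketch of hard (majority) voting; the three prediction arrays are hypothetical values chosen for illustration, not outputs of real models.
Python3
import numpy as np

# Hypothetical class predictions from three base learners on five samples
preds = np.array([
    [1, 0, 1, 1, 0],  # model A
    [1, 1, 1, 0, 0],  # model B
    [0, 0, 1, 1, 0],  # model C
])

# Majority vote: a sample is labeled 1 if at least two of the three models say 1
ensemble_pred = (preds.sum(axis=0) >= 2).astype(int)
print(ensemble_pred)  # [1 0 1 1 0]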
What are decision trees?
A decision tree is a tree-like structure where:
- Each internal node represents a "test" on an attribute (e.g., whether a feature is greater than a certain threshold).
- Each branch represents the outcome of the test.
- Each leaf node represents a class label (in classification) or a continuous value (in regression).
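As a quick illustration of these three parts, the sketch below trains a shallow tree and prints its tests, branches, and leaves as text; the iris dataset and max_depth=2 are illustrative choices, not requirements.
Python3
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed structure readable
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(iris.data, iris.target)

# Each printed line is an internal node's test or a leaf's predicted class
print(export_text(tree, feature_names=iris.feature_names))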
What are support vector machines?
Support Vector Machines (SVMs) are supervised learning models used for classification and regression tasks. In classification, SVMs find the hyperplane that best separates different classes in the feature space. This hyperplane is chosen to maximize the margin, which is the distance between the hyperplane and the nearest data points from each class; these nearest points are known as support vectors.
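A minimal sketch of this with scikit-learn's SVC follows; the linear kernel and the toy two-class dataset from make_blobs are assumptions made for illustration. The margin-defining points are exposed via the fitted model's support_vectors_ attribute.
Python3
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# A toy, roughly linearly separable dataset with two classes
X, y = make_blobs(n_samples=100, centers=2, random_state=42)

# A linear SVM finds the maximum-margin hyperplane between the classes
clf = SVC(kernel='linear')
clf.fit(X, y)

# The training points closest to the hyperplane (the support vectors)
print(clf.support_vectors_)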
How to combine Support Vector Machines (SVM) and Decision Trees?
Here are some common approaches to combining Support Vector Machines (SVMs) and Decision Trees; code sketches for bagging and stacking follow the list:
- Bagging (Bootstrap Aggregating): This involves training multiple SVMs or Decision Trees on different subsets of the training data and then combining their predictions. This can reduce overfitting and improve generalization.
- Boosting: Algorithms like AdaBoost can be used to combine multiple SVMs or Decision Trees sequentially, with each subsequent model focusing on the mistakes of the previous ones. This can improve the overall performance of the combined model.
- Random Forests: This ensemble method combines multiple Decision Trees trained on random subsets of the features, and optionally, the samples. It can be effective for both classification and regression tasks.
- Cascade SVM: This approach involves using a Decision Tree to pre-select samples that are then fed into separate SVM classifiers. This can be useful when the dataset is large and the SVM training is computationally expensive.
- SVM as feature selector for Decision Trees: Use the SVM to select the most relevant features from the dataset, and then train a Decision Tree on the selected features. This can help improve the interpretability of the Decision Tree and reduce the impact of irrelevant features.
- Stacking: Train multiple SVMs and Decision Trees separately on the dataset and then use another model (e.g., a linear regression or another Decision Tree) to combine their predictions. This can often lead to better performance than any individual model.
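As a sketch of two of these approaches, the code below bags SVMs with scikit-learn's BaggingClassifier and stacks an SVM with a Decision Tree using StackingClassifier. The breast cancer dataset, the hyperparameters, and the logistic regression meta-learner are illustrative choices; also note the estimator= keyword assumes scikit-learn 1.2 or newer (older versions use base_estimator=).
Python3
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Bagging: train 10 SVMs on bootstrap samples and combine their votes
bagged_svm = BaggingClassifier(estimator=SVC(), n_estimators=10, random_state=42)

# Stacking: a logistic regression combines the SVM and tree outputs
stacked = StackingClassifier(
    estimators=[('svm', SVC()), ('dt', DecisionTreeClassifier(random_state=42))],
    final_estimator=LogisticRegression(max_iter=1000),
)

# Compare the two ensembles with 5-fold cross-validation
for name, model in [('bagged SVM', bagged_svm), ('stacked SVM+DT', stacked)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f'{name}: {scores.mean():.3f}')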
Implementation of an SVM and Decision Tree Ensemble
In this implementation, we use a VotingClassifier with a Support Vector Machine (SVM) and a Decision Tree (DT) as base estimators on the breast cancer dataset.
Importing Necessary Libraries
Python3
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
Loading and splitting the dataset
Python3
# Load the breast cancer dataset
breast_cancer = load_breast_cancer()
X_bc, y_bc = breast_cancer.data, breast_cancer.target
# Split the dataset into training and test sets
X_train_bc, X_test_bc, y_train_bc, y_test_bc = train_test_split(X_bc, y_bc, test_size=0.2, random_state=42)
Creating Base Estimators
- SVC (Support Vector Classifier): The probability=True parameter allows the model to predict probabilities for each class, which is necessary for soft voting in the VotingClassifier.
- DecisionTreeClassifier: This classifier creates a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. Each internal node represents a "test" on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label. In the context of the VotingClassifier, the decision tree serves as another base estimator for voting.
Python3
# Create the base estimators
svm_bc = SVC(probability=True)
dt_bc = DecisionTreeClassifier()
Ensemble Learning
- VotingClassifier creation: The VotingClassifier is created with estimators=[('svm', svm_bc), ('dt', dt_bc)], specifying the list of base estimators to be used for voting. The voting='soft' parameter indicates that the classifier will use soft voting: it predicts the class label based on the argmax of the sums of the predicted probabilities.
- Training the voting classifier: The fit method is called on the voting_clf_bc object with the training data X_train_bc and y_train_bc to train the classifier on the breast cancer dataset.
Python3
# Create the voting classifier
voting_clf_bc = VotingClassifier(estimators=[('svm', svm_bc), ('dt', dt_bc)], voting='soft')
# Train the voting classifier
voting_clf_bc.fit(X_train_bc, y_train_bc)
Evaluation of the Model
- Making predictions: The predict method is called on the voting_clf_bc object with the test data X_test_bc to make predictions for the breast cancer dataset.
- Evaluating accuracy: The accuracy_score function is used to compare the predicted labels y_pred_bc with the actual labels y_test_bc from the test set. The accuracy is then printed to the console using f-string formatting.
Python3
# Make predictions
y_pred_bc = voting_clf_bc.predict(X_test_bc)
# Evaluate the accuracy
accuracy_bc = accuracy_score(y_test_bc, y_pred_bc)
print(f'Accuracy on breast cancer dataset: {accuracy_bc}')
Output:
Accuracy on breast cancer dataset: 0.9385964912280702