
Tree-Based Models for Classification in Python


Tree-based models are a cornerstone of machine learning, offering powerful and interpretable methods for both classification and regression tasks. This article will cover the most prominent tree-based models used for classification, including Decision Tree Classifier, Random Forest Classifier, Gradient Boosting Classifier, XGBoost Classifier, LightGBM Classifier, CatBoost Classifier, Extra Trees Classifier, HistGradientBoostingClassifier, and AdaBoost Classifier.


We'll delve into how each model works and provide Python code examples for implementation.

Tree-Based Models: The Core Idea

Tree-based models are a family of machine learning algorithms that use a tree-like structure to make decisions. The tree starts with a single node (the root) and branches out into multiple nodes, where each node represents a decision based on a feature. The final nodes (leaves) represent the predicted class labels.
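
Conceptually, a fitted classification tree is nothing more than a set of nested feature tests ending in class labels. A hand-written sketch for Iris-like data (the thresholds here are illustrative guesses, not learned values):

Python
# Illustrative only: a tiny hand-written "tree" for Iris-like measurements.
# The thresholds are assumptions for the sketch, not fitted values.
def tiny_tree_predict(petal_length, petal_width):
    if petal_length <= 2.45:       # root node: test one feature
        return "setosa"            # leaf: predicted class
    elif petal_width <= 1.75:      # internal node: a second test
        return "versicolor"
    else:
        return "virginica"

print(tiny_tree_predict(1.4, 0.2))  # -> setosa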

Key Advantages:

  • Interpretability: Tree-based models are relatively easy to understand and visualize, making them a great choice when you need to explain the reasoning behind predictions.
  • Handles Various Data Types: They can handle both categorical and numerical features without extensive preprocessing.
  • Nonlinear Relationships: Unlike linear models, tree-based models can capture complex nonlinear relationships between features and the target variable.
  • Feature Importance: They provide insights into which features are most important for prediction (see the sketch just after this list).
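
As a quick illustration of that last point, every fitted scikit-learn tree exposes a feature_importances_ attribute. A minimal, self-contained sketch on the Iris dataset used throughout this article:

Python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Fit a tree and inspect which features drive its splits
iris = load_iris()
tree = DecisionTreeClassifier(random_state=42).fit(iris.data, iris.target)
for name, importance in zip(iris.feature_names, tree.feature_importances_):
    print(f"{name}: {importance:.3f}")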

Classification Tree-Based Algorithms in Machine Learning

Here are some of the most popular tree-based classification algorithms; each is implemented step by step later in this article:

  1. Decision Tree Classifier: Creates a tree model that splits data into branches to predict outcomes.
  2. Random Forest Classifier: Combines multiple decision trees to improve accuracy and prevent overfitting.
  3. Gradient Boosting Classifier: Builds trees sequentially to correct errors of previous trees for better performance.
  4. XGBoost Classifier: An optimized and efficient implementation of gradient boosting with additional regularization.
  5. LightGBM Classifier: Uses gradient-based one-side sampling and exclusive feature bundling to speed up the training process.
  6. CatBoost Classifier: Handles categorical features automatically and reduces overfitting with minimal parameter tuning.
  7. Extra Trees Classifier: Uses random splits and averaging of multiple trees to improve robustness and accuracy.
  8. HistGradientBoostingClassifier: A variant of gradient boosting that uses histograms to speed up training on large datasets.
  9. AdaBoost Classifier: Boosts the performance of weak classifiers by focusing on misclassified instances iteratively.

Implementing Tree-Based Models in Python

Step 1: Install Required Libraries

Run the following command in a terminal (it is a shell command, not Python code):

pip install scikit-learn xgboost lightgbm catboost matplotlib

Step 2: Import Libraries and Dataset

  • Import necessary libraries for data handling, model training, and evaluation.
  • Load the Iris dataset, which includes features and target labels.
  • Split the dataset into training and testing sets with a 70-30 split.
Python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import (
    RandomForestClassifier,
    GradientBoostingClassifier,
    ExtraTreesClassifier,
    HistGradientBoostingClassifier,
    AdaBoostClassifier
)
import xgboost as xgb
import lightgbm as lgb
import catboost as cb
import matplotlib.pyplot as plt

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Step 3: Build and Train Models

1. Decision Tree Classifier

A Decision Tree Classifier splits the data into subsets based on the value of input features, creating a tree-like model of decisions. Each internal node represents a "test" on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label.

Building a Decision Tree Classifier in Python

To build a decision tree in Python, we can use the DecisionTreeClassifier class from the Scikit-learn library. 

  • Create a DecisionTreeClassifier instance.
  • Train the model using fit on the training data.
  • Make predictions on the test data.
  • Calculate and print the accuracy.
  • Visualize the decision tree using plot_tree.
Python
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Decision Tree Accuracy: {accuracy * 100:.2f}%")
plt.figure(figsize=(12, 8))
plot_tree(clf, filled=True, feature_names=iris.feature_names, class_names=iris.target_names)
plt.title("Decision Tree Visualization")
plt.show()

Output:

Decision Tree Accuracy: 100.00%
Decision Tree Visualization
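
Iris is easy enough that a fully grown tree scores perfectly here, but on noisier data an unconstrained tree tends to overfit. One common remedy is capping the depth; a small sketch reusing the Step 2 setup, with max_depth=2 chosen purely for illustration:

Python
# A shallower tree: less expressive, but less prone to overfitting on noisy data
shallow_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
shallow_clf.fit(X_train, y_train)
print(f"Depth-2 Tree Accuracy: {accuracy_score(y_test, shallow_clf.predict(X_test)) * 100:.2f}%")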

2. Random Forest Classifier

Random Forest Classifier is an ensemble of decision trees, typically trained with the "bagging" method. It builds multiple decision trees and merges them together to get a more accurate and stable prediction.

Building a Random Forest Classifier in Python

To build a random forest, we can use the RandomForestClassifier class from Scikit-learn:

  • Create a RandomForestClassifier instance with 100 trees.
  • Train the model using fit on the training data.
  • Make predictions on the test data.
  • Calculate and print the accuracy.
  • Visualize one of the trees from the Random Forest.
Python
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)
y_pred_rf = rf_clf.predict(X_test)
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f"Random Forest Accuracy: {accuracy_rf * 100:.2f}%")
plt.figure(figsize=(12, 8))
plot_tree(rf_clf.estimators_[0], filled=True, feature_names=iris.feature_names, class_names=iris.target_names)
plt.title("Random Forest Tree Visualization")
plt.show()

Output:

Random Forest Accuracy: 100.00%
Random Forest Tree Visualization
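
Because bagging trains each tree on a bootstrap sample, the rows a tree never saw ("out-of-bag" rows) provide a built-in validation estimate. A sketch using the oob_score option, again reusing the Step 2 setup:

Python
# oob_score=True scores each tree on the rows it never saw during training
rf_oob = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rf_oob.fit(X_train, y_train)
print(f"Out-of-bag score: {rf_oob.oob_score_:.3f}")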

3. Gradient Boosting Classifier

Gradient Boosting builds trees sequentially, each one correcting the errors of the previous ones. It uses gradient descent to minimize a loss function.

Building a Gradient Boosting Classifier

  • Create a GradientBoostingClassifier instance with 100 estimators, a learning rate of 1.0, and max depth of 1.
  • Train the model using fit on the training data.
  • Make predictions on the test data.
  • Calculate and print the accuracy.
  • Visualize one of the boosting iterations.
Python
gb_clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=42)
gb_clf.fit(X_train, y_train)
y_pred_gb = gb_clf.predict(X_test)
accuracy_gb = accuracy_score(y_test, y_pred_gb)
print(f"Gradient Boosting Accuracy: {accuracy_gb * 100:.2f}%")
plt.figure(figsize=(12, 8))
plot_tree(gb_clf.estimators_[0][0], filled=True, feature_names=iris.feature_names, class_names=iris.target_names)
plt.title("Gradient Boosting Tree Visualization")
plt.show()

Output:

Gradient Boosting Accuracy: 95.56%
Gradient Boosting Tree Visualization
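
To watch the sequential correction happen, staged_predict yields the ensemble's predictions after each boosting iteration. A short sketch on the model just trained:

Python
# Accuracy after 1, 10, and 100 boosting stages
staged = list(gb_clf.staged_predict(X_test))
for n in (1, 10, 100):
    print(f"After {n:>3} trees: {accuracy_score(y_test, staged[n - 1]) * 100:.2f}%")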

4. XGBoost Classifier

XGBoost (Extreme Gradient Boosting) is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable. It uses a more regularized model formalization to control overfitting.

Python Implementation for XGBoost Classifier

  • Create an XGBClassifier instance.
  • Train the model using fit on the training data.
  • Make predictions on the test data.
  • Calculate and print the accuracy.
Python
# use_label_encoder was deprecated and later removed in recent XGBoost releases
xgb_clf = xgb.XGBClassifier(eval_metric='mlogloss')
xgb_clf.fit(X_train, y_train)
y_pred_xgb = xgb_clf.predict(X_test)
accuracy_xgb = accuracy_score(y_test, y_pred_xgb)
print(f"XGBoost Accuracy: {accuracy_xgb * 100:.2f}%")

Output:

XGBoost Accuracy: 100.00%
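
The extra regularization mentioned above is exposed through parameters such as reg_lambda (an L2 penalty) and reg_alpha (an L1 penalty) on leaf weights. A sketch with illustrative values, reusing the Step 2 setup:

Python
# reg_lambda / reg_alpha penalize large leaf weights; max_depth limits tree size
xgb_reg = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=3,
    reg_lambda=1.0,   # L2 penalty on leaf weights (XGBoost's default is 1)
    reg_alpha=0.1,    # L1 penalty, encourages sparse leaf weights
    eval_metric='mlogloss',
    random_state=42,
)
xgb_reg.fit(X_train, y_train)
print(f"Regularized XGBoost Accuracy: {accuracy_score(y_test, xgb_reg.predict(X_test)) * 100:.2f}%")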

5. LightGBM Classifier

LightGBM (Light Gradient Boosting Machine) is a gradient boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient with the following advantages: faster training speed, lower memory usage, better accuracy, and support for parallel and GPU learning.

Python Implementation for LightGBM Classifier

  • Create a LGBMClassifier instance.
  • Train the model using fit on the training data.
  • Make predictions on the test data.
  • Calculate and print the accuracy.
Python
lgb_clf = lgb.LGBMClassifier()
lgb_clf.fit(X_train, y_train)
y_pred_lgb = lgb_clf.predict(X_test)
accuracy_lgb = accuracy_score(y_test, y_pred_lgb)
print(f"LightGBM Accuracy: {accuracy_lgb * 100:.2f}%")

Output:

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000102 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 86
[LightGBM] [Info] Number of data points in the train set: 105, number of used features: 4
[LightGBM] [Info] Start training from score -1.219973
[LightGBM] [Info] Start training from score -1.043042
[LightGBM] [Info] Start training from score -1.043042
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
LightGBM Accuracy: 100.00%
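
LightGBM grows trees leaf-wise, so num_leaves, rather than depth, is its main complexity knob. A sketch with an illustrative value (verbose=-1 just silences the training log shown above):

Python
# num_leaves bounds the leaves per tree; smaller values reduce overfitting
lgb_small = lgb.LGBMClassifier(num_leaves=15, n_estimators=100, random_state=42, verbose=-1)
lgb_small.fit(X_train, y_train)
print(f"LightGBM (num_leaves=15) Accuracy: {accuracy_score(y_test, lgb_small.predict(X_test)) * 100:.2f}%")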

6. CatBoost Classifier

CatBoost (Categorical Boosting) is a gradient boosting algorithm that handles categorical features automatically. It is known for its high performance and ease of use.

CatBoost Classifier Implementation in Python

  • Create a CatBoostClassifier instance with no verbosity.
  • Train the model using fit on the training data.
  • Make predictions on the test data.
  • Calculate and print the accuracy.
Python
cb_clf = cb.CatBoostClassifier(verbose=0)
cb_clf.fit(X_train, y_train)
y_pred_cb = cb_clf.predict(X_test)
accuracy_cb = accuracy_score(y_test, y_pred_cb)
print(f"CatBoost Accuracy: {accuracy_cb * 100:.2f}%")

Output:

CatBoost Accuracy: 100.00%
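
Iris has no categorical columns, so the automatic categorical handling never kicks in above. A sketch on a tiny made-up dataset, where the cat_features argument tells CatBoost which column to encode internally (the data and column names are invented for illustration):

Python
import pandas as pd

# Made-up data: one categorical column ("color") and one numeric column ("size")
toy = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "blue", "green"],
    "size":  [1.0,   2.5,    1.2,   3.1,     2.7,    3.0],
})
toy_y = [0, 1, 0, 1, 1, 1]

# cat_features lets CatBoost encode "color" internally; no manual one-hot needed
toy_clf = cb.CatBoostClassifier(iterations=50, verbose=0)
toy_clf.fit(toy, toy_y, cat_features=["color"])
print(toy_clf.predict(toy))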

7. Extra Trees Classifier

Extra Trees (Extremely Randomized Trees) is an ensemble learning method that aggregates many de-correlated decision trees collected in a "forest", much like a Random Forest. It differs in how splits are computed: candidate thresholds are drawn at random for each feature rather than optimized, and by default each tree is trained on the whole dataset instead of a bootstrap sample.

Python Implementation for Extra Trees Classifier

  • Create an ExtraTreesClassifier instance with 100 trees.
  • Train the model using fit on the training data.
  • Make predictions on the test data.
  • Calculate and print the accuracy.
Python
et_clf = ExtraTreesClassifier(n_estimators=100, random_state=42)
et_clf.fit(X_train, y_train)
y_pred_et = et_clf.predict(X_test)
accuracy_et = accuracy_score(y_test, y_pred_et)
print(f"Extra Trees Accuracy: {accuracy_et * 100:.2f}%")

Output:

Extra Trees Accuracy: 100.00%
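
One way to see what the extra randomization buys is a side-by-side cross-validation against a Random Forest. A sketch using cross_val_score on the full Iris data (the import is added here because Step 2 does not include it):

Python
from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy for both forests
for name, model in [("Random Forest", RandomForestClassifier(n_estimators=100, random_state=42)),
                    ("Extra Trees", ExtraTreesClassifier(n_estimators=100, random_state=42))]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean() * 100:.2f}% (+/- {scores.std() * 100:.2f})")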

8. HistGradientBoostingClassifier

HistGradientBoostingClassifier is a faster variant of Gradient Boosting that uses histograms to bucket continuous feature values into discrete bins, which speeds up the training process.

Implementing HistGradientBoostingClassifier in Python

  • Create a HistGradientBoostingClassifier instance.
  • Train the model using fit on the training data.
  • Make predictions on the test data.
  • Calculate and print the accuracy.
Python
hgb_clf = HistGradientBoostingClassifier()
hgb_clf.fit(X_train, y_train)
y_pred_hgb = hgb_clf.predict(X_test)
accuracy_hgb = accuracy_score(y_test, y_pred_hgb)
print(f"HistGradientBoosting Accuracy: {accuracy_hgb * 100:.2f}%")

Output:

HistGradientBoosting Accuracy: 100.00%
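
Two practical details worth knowing: max_bins controls how finely continuous values are bucketed, and the estimator handles missing values natively, with no imputation step. A sketch that injects a NaN purely for illustration:

Python
import numpy as np

# Copy the training data and blank out one value; no imputation is needed
X_train_nan = X_train.copy()
X_train_nan[0, 0] = np.nan

hgb_nan = HistGradientBoostingClassifier(max_bins=64, random_state=42)
hgb_nan.fit(X_train_nan, y_train)
print(f"HGB with NaN Accuracy: {accuracy_score(y_test, hgb_nan.predict(X_test)) * 100:.2f}%")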

9. AdaBoost Classifier

AdaBoost (Adaptive Boosting) works by combining multiple weak classifiers to create a strong classifier. It adjusts the weights of incorrectly classified instances so that subsequent classifiers focus more on difficult cases.

Implementing AdaBoost Classifier in Python

  • Create an AdaBoostClassifier instance with 100 estimators.
  • Train the model using fit on the training data.
  • Make predictions on the test data.
  • Calculate and print the accuracy.
Python
ada_clf = AdaBoostClassifier(n_estimators=100, random_state=42)
ada_clf.fit(X_train, y_train)
y_pred_ada = ada_clf.predict(X_test)
accuracy_ada = accuracy_score(y_test, y_pred_ada)
print(f"AdaBoost Accuracy: {accuracy_ada * 100:.2f}%")

Output:

AdaBoost Accuracy: 100.00%
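
By default, AdaBoostClassifier boosts depth-1 decision trees ("stumps"). The weak learner can be swapped explicitly; note that the keyword is estimator from scikit-learn 1.2 onward (older releases call it base_estimator):

Python
# Boost slightly deeper trees; depth 2 is an illustrative choice
ada_deep = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=2),  # base_estimator before sklearn 1.2
    n_estimators=100,
    random_state=42,
)
ada_deep.fit(X_train, y_train)
print(f"AdaBoost (depth-2 trees) Accuracy: {accuracy_score(y_test, ada_deep.predict(X_test)) * 100:.2f}%")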

Conclusion

Tree-based models such as Decision Trees, Random Forests, Gradient Boosting, XGBoost, LightGBM, CatBoost, Extra Trees, HistGradientBoosting, and AdaBoost provide powerful and intuitive methods for classification tasks. They handle both numerical and categorical data effectively and can be easily implemented and visualized in Python, allowing for improved understanding and accuracy in various machine learning applications.

