
Tree-Based Models for Classification in Python


Tree-based models are a cornerstone of machine learning, offering powerful and interpretable methods for both classification and regression tasks. This article will cover the most prominent tree-based models used for classification, including Decision Tree Classifier, Random Forest Classifier, Gradient Boosting Classifier, XGBoost Classifier, LightGBM Classifier, CatBoost Classifier, Extra Trees Classifier, HistGradientBoostingClassifier, and AdaBoost Classifier.


We'll delve into how each model works and provide Python code examples for implementation.

Tree-Based Models: The Core Idea

Tree-based models are a family of machine learning algorithms that use a tree-like structure to make decisions. The tree starts with a single node (the root) and branches out into multiple nodes, where each node represents a decision based on a feature. The final nodes (leaves) represent the predicted class labels.
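
Conceptually, a fitted classification tree is nothing more than a set of nested feature tests ending in class labels. A hand-written sketch for Iris-like data (the thresholds here are illustrative guesses, not learned values):

Python
# Illustrative only: a tiny hand-written "tree" for Iris-like measurements.
# The thresholds are assumptions for the sketch, not fitted values.
def tiny_tree_predict(petal_length, petal_width):
    if petal_length <= 2.45:       # root node: test one feature
        return "setosa"            # leaf: predicted class
    elif petal_width <= 1.75:      # internal node: a second test
        return "versicolor"
    else:
        return "virginica"

print(tiny_tree_predict(1.4, 0.2))  # -> setosa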

Key Advantages:

  • Interpretability: Tree-based models are relatively easy to understand and visualize, making them a great choice when you need to explain the reasoning behind predictions.
  • Handles Various Data Types: They can handle both categorical and numerical features without extensive preprocessing.
  • Nonlinear Relationships: Unlike linear models, tree-based models can capture complex nonlinear relationships between features and the target variable.
  • Feature Importance: They provide insights into which features are most important for prediction (see the sketch just after this list).
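
As a quick illustration of that last point, every fitted scikit-learn tree exposes a feature_importances_ attribute. A minimal, self-contained sketch on the Iris dataset used throughout this article:

Python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Fit a tree and inspect which features drive its splits
iris = load_iris()
tree = DecisionTreeClassifier(random_state=42).fit(iris.data, iris.target)
for name, importance in zip(iris.feature_names, tree.feature_importances_):
    print(f"{name}: {importance:.3f}")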

Classification Tree-Based Algorithms in Machine Learning

Here are some of the most popular tree-based classification algorithms; each is implemented step by step later in this article:

  1. Decision Tree Classifier: Creates a tree model that splits data into branches to predict outcomes.
  2. Random Forest Classifier: Combines multiple decision trees to improve accuracy and prevent overfitting.
  3. Gradient Boosting Classifier: Builds trees sequentially to correct errors of previous trees for better performance.
  4. XGBoost Classifier: An optimized and efficient implementation of gradient boosting with additional regularization.
  5. LightGBM Classifier: Uses gradient-based one-side sampling and exclusive feature bundling to speed up the training process.
  6. CatBoost Classifier: Handles categorical features automatically and reduces overfitting with minimal parameter tuning.
  7. Extra Trees Classifier: Uses random splits and averaging of multiple trees to improve robustness and accuracy.
  8. HistGradientBoostingClassifier: A variant of gradient boosting that uses histograms to speed up training on large datasets.
  9. AdaBoost Classifier: Boosts the performance of weak classifiers by focusing on misclassified instances iteratively.

Implementing Tree-Based Models in Python

Step 1: Install Required Libraries

Run the following command in a terminal (it is a shell command, not Python code):

pip install scikit-learn xgboost lightgbm catboost matplotlib

Step 2: Import Libraries and Dataset

  • Import necessary libraries for data handling, model training, and evaluation.
  • Load the Iris dataset, which includes features and target labels.
  • Split the dataset into training and testing sets with a 70-30 split.
Python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import (
    RandomForestClassifier,
    GradientBoostingClassifier,
    ExtraTreesClassifier,
    HistGradientBoostingClassifier,
    AdaBoostClassifier
)
import xgboost as xgb
import lightgbm as lgb
import catboost as cb
import matplotlib.pyplot as plt

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Step 3: Build and Train Models

1. Decision Tree Classifier

A Decision Tree Classifier splits the data into subsets based on the value of input features, creating a tree-like model of decisions. Each internal node represents a "test" on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label.

Building a Decision Tree Classifier in Python

To build a decision tree in Python, we can use the DecisionTreeClassifier class from the Scikit-learn library. 

  • Create a DecisionTreeClassifier instance.
  • Train the model using fit on the training data.
  • Make predictions on the test data.
  • Calculate and print the accuracy.
  • Visualize the decision tree using plot_tree.
Python
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Decision Tree Accuracy: {accuracy * 100:.2f}%")
plt.figure(figsize=(12, 8))
plot_tree(clf, filled=True, feature_names=iris.feature_names, class_names=iris.target_names)
plt.title("Decision Tree Visualization")
plt.show()

Output:

Decision Tree Accuracy: 100.00%
Decision Tree Visualization
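
Iris is easy enough that a fully grown tree scores perfectly here, but on noisier data an unconstrained tree tends to overfit. One common remedy is capping the depth; a small sketch reusing the Step 2 setup, with max_depth=2 chosen purely for illustration:

Python
# A shallower tree: less expressive, but less prone to overfitting on noisy data
shallow_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
shallow_clf.fit(X_train, y_train)
print(f"Depth-2 Tree Accuracy: {accuracy_score(y_test, shallow_clf.predict(X_test)) * 100:.2f}%")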

2. Random Forest Classifier

Random Forest Classifier is an ensemble of decision trees, typically trained with the "bagging" method. It builds multiple decision trees and merges them together to get a more accurate and stable prediction.

Building a Random Forest Classifier in Python

To build a random forest, we can use the RandomForestClassifier class from Scikit-learn:

  • Create a RandomForestClassifier instance with 100 trees.
  • Train the model using fit on the training data.
  • Make predictions on the test data.
  • Calculate and print the accuracy.
  • Visualize one of the trees from the Random Forest.
Python
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)
y_pred_rf = rf_clf.predict(X_test)
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f"Random Forest Accuracy: {accuracy_rf * 100:.2f}%")
plt.figure(figsize=(12, 8))
plot_tree(rf_clf.estimators_[0], filled=True, feature_names=iris.feature_names, class_names=iris.target_names)
plt.title("Random Forest Tree Visualization")
plt.show()

Output:

Random Forest Accuracy: 100.00%
Random Forest Tree Visualization
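
Because bagging trains each tree on a bootstrap sample, the rows a tree never saw ("out-of-bag" rows) provide a built-in validation estimate. A sketch using the oob_score option, again reusing the Step 2 setup:

Python
# oob_score=True scores each tree on the rows it never saw during training
rf_oob = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rf_oob.fit(X_train, y_train)
print(f"Out-of-bag score: {rf_oob.oob_score_:.3f}")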

3. Gradient Boosting Classifier

Gradient Boosting builds trees sequentially, each one correcting the errors of the previous ones. It uses gradient descent to minimize a loss function.

Building a Gradient Boosting Classifier

  • Create a GradientBoostingClassifier instance with 100 estimators, a learning rate of 1.0, and max depth of 1.
  • Train the model using fit on the training data.
  • Make predictions on the test data.
  • Calculate and print the accuracy.
  • Visualize one of the boosting iterations.
Python
gb_clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=42)
gb_clf.fit(X_train, y_train)
y_pred_gb = gb_clf.predict(X_test)
accuracy_gb = accuracy_score(y_test, y_pred_gb)
print(f"Gradient Boosting Accuracy: {accuracy_gb * 100:.2f}%")
plt.figure(figsize=(12, 8))
plot_tree(gb_clf.estimators_[0][0], filled=True, feature_names=iris.feature_names, class_names=iris.target_names)
plt.title("Gradient Boosting Tree Visualization")
plt.show()

Output:

Gradient Boosting Accuracy: 95.56%
Gradient Boosting Tree Visualization
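
To watch the sequential correction happen, staged_predict yields the ensemble's predictions after each boosting iteration. A short sketch on the model just trained:

Python
# Accuracy after 1, 10, and 100 boosting stages
staged = list(gb_clf.staged_predict(X_test))
for n in (1, 10, 100):
    print(f"After {n:>3} trees: {accuracy_score(y_test, staged[n - 1]) * 100:.2f}%")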

4. XGBoost Classifier

XGBoost (Extreme Gradient Boosting) is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable. It uses a more regularized model formalization to control overfitting.

Python Implementation for XGBoost Classifier

  • Create an XGBClassifier instance.
  • Train the model using fit on the training data.
  • Make predictions on the test data.
  • Calculate and print the accuracy.
Python
# use_label_encoder was deprecated and later removed in recent XGBoost releases
xgb_clf = xgb.XGBClassifier(eval_metric='mlogloss')
xgb_clf.fit(X_train, y_train)
y_pred_xgb = xgb_clf.predict(X_test)
accuracy_xgb = accuracy_score(y_test, y_pred_xgb)
print(f"XGBoost Accuracy: {accuracy_xgb * 100:.2f}%")

Output:

XGBoost Accuracy: 100.00%
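
The extra regularization mentioned above is exposed through parameters such as reg_lambda (an L2 penalty) and reg_alpha (an L1 penalty) on leaf weights. A sketch with illustrative values, reusing the Step 2 setup:

Python
# reg_lambda / reg_alpha penalize large leaf weights; max_depth limits tree size
xgb_reg = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=3,
    reg_lambda=1.0,   # L2 penalty on leaf weights (XGBoost's default is 1)
    reg_alpha=0.1,    # L1 penalty, encourages sparse leaf weights
    eval_metric='mlogloss',
    random_state=42,
)
xgb_reg.fit(X_train, y_train)
print(f"Regularized XGBoost Accuracy: {accuracy_score(y_test, xgb_reg.predict(X_test)) * 100:.2f}%")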

5. LightGBM Classifier

LightGBM (Light Gradient Boosting Machine) is a gradient boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient with the following advantages: faster training speed, lower memory usage, better accuracy, and support for parallel and GPU learning.

Python Implementation for LightGBM Classifier

  • Create a LGBMClassifier instance.
  • Train the model using fit on the training data.
  • Make predictions on the test data.
  • Calculate and print the accuracy.
Python
lgb_clf = lgb.LGBMClassifier()
lgb_clf.fit(X_train, y_train)
y_pred_lgb = lgb_clf.predict(X_test)
accuracy_lgb = accuracy_score(y_test, y_pred_lgb)
print(f"LightGBM Accuracy: {accuracy_lgb * 100:.2f}%")

Output:

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000102 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 86
[LightGBM] [Info] Number of data points in the train set: 105, number of used features: 4
[LightGBM] [Info] Start training from score -1.219973
[LightGBM] [Info] Start training from score -1.043042
[LightGBM] [Info] Start training from score -1.043042
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
LightGBM Accuracy: 100.00%
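
LightGBM grows trees leaf-wise, so num_leaves, rather than depth, is its main complexity knob. A sketch with an illustrative value (verbose=-1 just silences the training log shown above):

Python
# num_leaves bounds the leaves per tree; smaller values reduce overfitting
lgb_small = lgb.LGBMClassifier(num_leaves=15, n_estimators=100, random_state=42, verbose=-1)
lgb_small.fit(X_train, y_train)
print(f"LightGBM (num_leaves=15) Accuracy: {accuracy_score(y_test, lgb_small.predict(X_test)) * 100:.2f}%")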

6. CatBoost Classifier

CatBoost (Categorical Boosting) is a gradient boosting algorithm that handles categorical features automatically. It is known for its high performance and ease of use.

CatBoost Classifier Implementation in Python

  • Create a CatBoostClassifier instance with no verbosity.
  • Train the model using fit on the training data.
  • Make predictions on the test data.
  • Calculate and print the accuracy.
Python
cb_clf = cb.CatBoostClassifier(verbose=0)
cb_clf.fit(X_train, y_train)
y_pred_cb = cb_clf.predict(X_test)
accuracy_cb = accuracy_score(y_test, y_pred_cb)
print(f"CatBoost Accuracy: {accuracy_cb * 100:.2f}%")

Output:

CatBoost Accuracy: 100.00%
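
Iris has no categorical columns, so the automatic categorical handling never kicks in above. A sketch on a tiny made-up dataset, where the cat_features argument tells CatBoost which column to encode internally (the data and column names are invented for illustration):

Python
import pandas as pd

# Made-up data: one categorical column ("color") and one numeric column ("size")
toy = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "blue", "green"],
    "size":  [1.0,   2.5,    1.2,   3.1,     2.7,    3.0],
})
toy_y = [0, 1, 0, 1, 1, 1]

# cat_features lets CatBoost encode "color" internally; no manual one-hot needed
toy_clf = cb.CatBoostClassifier(iterations=50, verbose=0)
toy_clf.fit(toy, toy_y, cat_features=["color"])
print(toy_clf.predict(toy))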

7. Extra Trees Classifier

Extra Trees (Extremely Randomized Trees) is an ensemble learning method that aggregates many de-correlated decision trees collected in a "forest", much like a Random Forest. It differs in how splits are computed: candidate thresholds are drawn at random for each feature rather than optimized, and by default each tree is trained on the whole dataset instead of a bootstrap sample.

Python Implementation for Extra Trees Classifier

  • Create an ExtraTreesClassifier instance with 100 trees.
  • Train the model using fit on the training data.
  • Make predictions on the test data.
  • Calculate and print the accuracy.
Python
et_clf = ExtraTreesClassifier(n_estimators=100, random_state=42)
et_clf.fit(X_train, y_train)
y_pred_et = et_clf.predict(X_test)
accuracy_et = accuracy_score(y_test, y_pred_et)
print(f"Extra Trees Accuracy: {accuracy_et * 100:.2f}%")

Output:

Extra Trees Accuracy: 100.00%
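
One way to see what the extra randomization buys is a side-by-side cross-validation against a Random Forest. A sketch using cross_val_score on the full Iris data (the import is added here because Step 2 does not include it):

Python
from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy for both forests
for name, model in [("Random Forest", RandomForestClassifier(n_estimators=100, random_state=42)),
                    ("Extra Trees", ExtraTreesClassifier(n_estimators=100, random_state=42))]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean() * 100:.2f}% (+/- {scores.std() * 100:.2f})")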

8. HistGradientBoostingClassifier

HistGradientBoostingClassifier is a faster variant of Gradient Boosting that uses histograms to bucket continuous feature values into discrete bins, which speeds up the training process.

Implementing HistGradientBoostingClassifier in Python

  • Create a HistGradientBoostingClassifier instance.
  • Train the model using fit on the training data.
  • Make predictions on the test data.
  • Calculate and print the accuracy.
Python
hgb_clf = HistGradientBoostingClassifier()
hgb_clf.fit(X_train, y_train)
y_pred_hgb = hgb_clf.predict(X_test)
accuracy_hgb = accuracy_score(y_test, y_pred_hgb)
print(f"HistGradientBoosting Accuracy: {accuracy_hgb * 100:.2f}%")

Output:

HistGradientBoosting Accuracy: 100.00%
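
Two practical details worth knowing: max_bins controls how finely continuous values are bucketed, and the estimator handles missing values natively, with no imputation step. A sketch that injects a NaN purely for illustration:

Python
import numpy as np

# Copy the training data and blank out one value; no imputation is needed
X_train_nan = X_train.copy()
X_train_nan[0, 0] = np.nan

hgb_nan = HistGradientBoostingClassifier(max_bins=64, random_state=42)
hgb_nan.fit(X_train_nan, y_train)
print(f"HGB with NaN Accuracy: {accuracy_score(y_test, hgb_nan.predict(X_test)) * 100:.2f}%")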

9. AdaBoost Classifier

AdaBoost (Adaptive Boosting) works by combining multiple weak classifiers to create a strong classifier. It adjusts the weights of incorrectly classified instances so that subsequent classifiers focus more on difficult cases.

Implementing AdaBoost Classifier in Python

  • Create an AdaBoostClassifier instance with 100 estimators.
  • Train the model using fit on the training data.
  • Make predictions on the test data.
  • Calculate and print the accuracy.
Python
ada_clf = AdaBoostClassifier(n_estimators=100, random_state=42)
ada_clf.fit(X_train, y_train)
y_pred_ada = ada_clf.predict(X_test)
accuracy_ada = accuracy_score(y_test, y_pred_ada)
print(f"AdaBoost Accuracy: {accuracy_ada * 100:.2f}%")

Output:

AdaBoost Accuracy: 100.00%
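
By default, AdaBoostClassifier boosts depth-1 decision trees ("stumps"). The weak learner can be swapped explicitly; note that the keyword is estimator from scikit-learn 1.2 onward (older releases call it base_estimator):

Python
# Boost slightly deeper trees; depth 2 is an illustrative choice
ada_deep = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=2),  # base_estimator before sklearn 1.2
    n_estimators=100,
    random_state=42,
)
ada_deep.fit(X_train, y_train)
print(f"AdaBoost (depth-2 trees) Accuracy: {accuracy_score(y_test, ada_deep.predict(X_test)) * 100:.2f}%")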

Conclusion

Tree-based models such as Decision Trees, Random Forests, Gradient Boosting, XGBoost, LightGBM, CatBoost, Extra Trees, HistGradientBoosting, and AdaBoost provide powerful and intuitive methods for classification tasks. They handle both numerical and categorical data effectively and can be easily implemented and visualized in Python, allowing for improved understanding and accuracy in various machine learning applications.

