Tree-Based Models for Classification in Python
Last Updated: 04 Jul, 2024
Tree-based models are a cornerstone of machine learning, offering powerful and interpretable methods for both classification and regression tasks. This article will cover the most prominent tree-based models used for classification, including Decision Tree Classifier, Random Forest Classifier, Gradient Boosting Classifier, XGBoost Classifier, LightGBM Classifier, CatBoost Classifier, Extra Trees Classifier, HistGradientBoostingClassifier, and AdaBoost Classifier.
We'll delve into how each model works and provide Python code examples for implementation.
Tree-Based Models: The Core Idea
Tree-based models are a family of machine learning algorithms that use a tree-like structure to make decisions. The tree starts with a single node (the root) and branches out into multiple nodes, where each node represents a decision based on a feature. The final nodes (leaves) represent the predicted class labels.
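To make the idea concrete, here is a toy sketch, hand-written rules rather than a trained model, showing how a fitted tree effectively reduces to nested if/else tests on feature values (the thresholds here are invented for illustration):
Python
# A hand-written "tree" for illustration only: each if/else is an
# internal node testing one feature; each return is a leaf label.
def toy_tree(petal_length, petal_width):
    if petal_length < 2.5:          # root node test
        return "setosa"             # leaf
    if petal_width < 1.8:           # second-level test
        return "versicolor"         # leaf
    return "virginica"              # leaf

print(toy_tree(1.4, 0.2))  # setosa
print(toy_tree(5.1, 2.3))  # virginica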
Key Advantages:
- Interpretability: Tree-based models are relatively easy to understand and visualize, making them a great choice when you need to explain the reasoning behind predictions.
- Handles Various Data Types: They can handle both categorical and numerical features without extensive preprocessing.
- Nonlinear Relationships: Unlike linear models, tree-based models can capture complex nonlinear relationships between features and the target variable.
- Feature Importance: They provide insights into which features are most important for prediction (see the sketch after this list).
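As a quick taste of the last point, here is a minimal, self-contained sketch that reads feature importances from a fitted decision tree on the Iris dataset (the same dataset used throughout this article):
Python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=42).fit(iris.data, iris.target)
# feature_importances_ sums to 1.0; higher means more influence on splits
for name, score in zip(iris.feature_names, clf.feature_importances_):
    print(f"{name}: {score:.3f}")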
Classification Tree-Based Algorithms in Machine Learning
Tree-based models are powerful and flexible machine learning algorithms used for classification tasks, known for their interpretability and high performance. Here are some of the most popular tree-based classification algorithms:
- Decision Tree Classifier: Creates a tree model that splits data into branches to predict outcomes.
- Random Forest Classifier: Combines multiple decision trees to improve accuracy and prevent overfitting.
- Gradient Boosting Classifier: Builds trees sequentially to correct errors of previous trees for better performance.
- XGBoost Classifier: An optimized and efficient implementation of gradient boosting with additional regularization.
- LightGBM Classifier: Uses gradient-based one-side sampling and exclusive feature bundling to speed up the training process.
- CatBoost Classifier: Handles categorical features automatically and reduces overfitting with minimal parameter tuning.
- Extra Trees Classifier: Uses random splits and averaging of multiple trees to improve robustness and accuracy.
- HistGradientBoostingClassifier: A variant of gradient boosting that uses histograms to speed up training on large datasets.
- AdaBoost Classifier: Boosts the performance of weak classifiers by focusing on misclassified instances iteratively.
Implementing Tree-Based Models in Python
Step 1: Install Required Libraries
Run the following command in your terminal:
pip install scikit-learn xgboost lightgbm catboost matplotlib
Step 2: Import Libraries and Dataset
- Import necessary libraries for data handling, model training, and evaluation.
- Load the Iris dataset, which includes features and target labels.
- Split the dataset into training and testing sets with a 70-30 split.
Python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import (
RandomForestClassifier,
GradientBoostingClassifier,
ExtraTreesClassifier,
HistGradientBoostingClassifier,
AdaBoostClassifier
)
import xgboost as xgb
import lightgbm as lgb
import catboost as cb
import matplotlib.pyplot as plt
iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Step 3: Build and Train Models
1. Decision Tree Classifier
A Decision Tree Classifier splits the data into subsets based on the value of input features, creating a tree-like model of decisions. Each internal node represents a "test" on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label.
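How does a tree choose the test at each node? Scikit-learn's default split criterion is Gini impurity; the sketch below is a simplified illustration of the idea, not the library's actual implementation:
Python
import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum(p_i^2); 0 means the node is pure
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini(np.array([0, 0, 1, 1, 1, 2])))  # mixed node -> high impurity
print(gini(np.array([1, 1, 1])))           # pure node -> 0.0
# The tree picks the feature/threshold whose split gives the largest
# weighted decrease in impurity across the two child nodes.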
Building a Decision Tree Classifier in Python
To build a decision tree in Python, we can use the DecisionTreeClassifier class from the Scikit-learn library.
- Create a DecisionTreeClassifier instance.
- Train the model using fit on the training data.
- Make predictions on the test data.
- Calculate and print the accuracy.
- Visualize the decision tree using plot_tree.
Python
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Decision Tree Accuracy: {accuracy * 100:.2f}%")
plt.figure(figsize=(12, 8))
plot_tree(clf, filled=True, feature_names=iris.feature_names, class_names=iris.target_names)
plt.title("Decision Tree Visualization")
plt.show()
Output:
Decision Tree Accuracy: 100.00%
2. Random Forest Classifier
Random Forest Classifier is an ensemble of decision trees, typically trained with the "bagging" method. It builds multiple decision trees and merges them together to get a more accurate and stable prediction.
Building a Random Forest Classifier in Python
To build a random forest, we can use the RandomForestClassifier class from Scikit-learn:
- Create a RandomForestClassifier instance with 100 trees.
- Train the model using fit on the training data.
- Make predictions on the test data.
- Calculate and print the accuracy.
- Visualize one of the trees from the Random Forest.
Python
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)
y_pred_rf = rf_clf.predict(X_test)
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f"Random Forest Accuracy: {accuracy_rf * 100:.2f}%")
plt.figure(figsize=(12, 8))
plot_tree(rf_clf.estimators_[0], filled=True, feature_names=iris.feature_names, class_names=iris.target_names)
plt.title("Random Forest Tree Visualization")
plt.show()
Output:
Random Forest Accuracy: 100.00%
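Because bagging trains each tree on a bootstrap sample, roughly a third of the training rows are left out of any given tree. Scikit-learn exposes this as a built-in validation estimate through the oob_score option; a short sketch reusing the split from Step 2:
Python
rf_oob = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rf_oob.fit(X_train, y_train)
# Accuracy estimated from out-of-bag samples, no separate validation set needed
print(f"OOB Score: {rf_oob.oob_score_ * 100:.2f}%")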
3. Gradient Boosting Classifier
Gradient Boosting builds trees sequentially, each one correcting the errors of the previous ones. It uses gradient descent to minimize a loss function.
Building a Gradient Boosting Classifier
- Create a GradientBoostingClassifier instance with 100 estimators, a learning rate of 1.0, and max depth of 1.
- Train the model using fit on the training data.
- Make predictions on the test data.
- Calculate and print the accuracy.
- Visualize one of the boosting iterations.
Python
gb_clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=42)
gb_clf.fit(X_train, y_train)
y_pred_gb = gb_clf.predict(X_test)
accuracy_gb = accuracy_score(y_test, y_pred_gb)
print(f"Gradient Boosting Accuracy: {accuracy_gb * 100:.2f}%")
plt.figure(figsize=(12, 8))
plot_tree(gb_clf.estimators_[0][0], filled=True, feature_names=iris.feature_names, class_names=iris.target_names)
plt.title("Gradient Boosting Tree Visualization")
plt.show()
Output:
Gradient Boosting Accuracy: 95.56%
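Because boosting is sequential, scikit-learn's staged_predict method lets us watch test accuracy evolve as trees are added, which makes the "correct the previous errors" behavior visible. A brief sketch using the model trained above:
Python
# Test accuracy after 1, 25, 50, 75, and 100 boosting iterations
staged = list(gb_clf.staged_predict(X_test))
for i in [0, 24, 49, 74, 99]:
    acc = accuracy_score(y_test, staged[i])
    print(f"After {i + 1} trees: {acc * 100:.2f}%")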
4. XGBoost Classifier
XGBoost (Extreme Gradient Boosting) is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable. It uses a more regularized model formalization to control overfitting.
Python Implementation for XGBoost Classifier
- Create an XGBClassifier instance.
- Train the model using fit on the training data.
- Make predictions on the test data.
- Calculate and print the accuracy.
Python
xgb_clf = xgb.XGBClassifier(eval_metric='mlogloss')  # use_label_encoder is deprecated in recent XGBoost releases and can be omitted
xgb_clf.fit(X_train, y_train)
y_pred_xgb = xgb_clf.predict(X_test)
accuracy_xgb = accuracy_score(y_test, y_pred_xgb)
print(f"XGBoost Accuracy: {accuracy_xgb * 100:.2f}%")
Output:
XGBoost Accuracy: 100.00%
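The extra regularization mentioned above is exposed through hyperparameters such as reg_alpha (L1) and reg_lambda (L2) in the scikit-learn wrapper. The values below are illustrative, not tuned:
Python
# Hedged example: illustrative regularization settings, not tuned values
xgb_reg = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=3,        # shallower trees also act as regularization
    reg_alpha=0.1,      # L1 penalty on leaf weights
    reg_lambda=1.0,     # L2 penalty on leaf weights (the XGBoost default)
    eval_metric='mlogloss'
)
xgb_reg.fit(X_train, y_train)
print(f"Regularized XGBoost Accuracy: {accuracy_score(y_test, xgb_reg.predict(X_test)) * 100:.2f}%")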
5. LightGBM Classifier
LightGBM (Light Gradient Boosting Machine) is a gradient boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient with the following advantages: faster training speed, lower memory usage, better accuracy, and support for parallel and GPU learning.
Python Implementation for LightGBM Classifier
- Create a LGBMClassifier instance.
- Train the model using fit on the training data.
- Make predictions on the test data.
- Calculate and print the accuracy.
Python
lgb_clf = lgb.LGBMClassifier()
lgb_clf.fit(X_train, y_train)
y_pred_lgb = lgb_clf.predict(X_test)
accuracy_lgb = accuracy_score(y_test, y_pred_lgb)
print(f"LightGBM Accuracy: {accuracy_lgb * 100:.2f}%")
Output:
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000102 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 86
[LightGBM] [Info] Number of data points in the train set: 105, number of used features: 4
[LightGBM] [Info] Start training from score -1.219973
[LightGBM] [Info] Start training from score -1.043042
[LightGBM] [Info] Start training from score -1.043042
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
LightGBM Accuracy: 100.00%
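LightGBM's focus on training speed pairs naturally with early stopping, where training halts once the validation metric stops improving. A minimal sketch using the lgb.early_stopping callback; we reuse the test split as the evaluation set purely for illustration (in practice, use a separate validation split):
Python
lgb_es = lgb.LGBMClassifier(n_estimators=500)
lgb_es.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],  # illustration only; prefer a dedicated validation set
    callbacks=[lgb.early_stopping(stopping_rounds=10)]
)
print(f"Best iteration: {lgb_es.best_iteration_}")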
6. CatBoost Classifier
CatBoost (Categorical Boosting) is a gradient boosting algorithm that handles categorical features automatically. It is known for its high performance and ease of use.
CatBoost Classifier Implementation in Python
- Create a CatBoostClassifier instance with no verbosity.
- Train the model using fit on the training data.
- Make predictions on the test data.
- Calculate and print the accuracy.
Python
cb_clf = cb.CatBoostClassifier(verbose=0)
cb_clf.fit(X_train, y_train)
y_pred_cb = cb_clf.predict(X_test)
accuracy_cb = accuracy_score(y_test, y_pred_cb)
print(f"CatBoost Accuracy: {accuracy_cb * 100:.2f}%")
Output:
CatBoost Accuracy: 100.00%
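The Iris dataset has only numeric features, so the example above does not exercise CatBoost's headline capability. The sketch below uses a tiny made-up categorical dataset (the column names and values are invented for illustration) to show the cat_features argument, which tells CatBoost which columns to encode internally:
Python
import pandas as pd  # assumption: pandas is available

# Tiny invented dataset: one categorical column, one numeric column
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "blue", "green"],
    "size":  [1.2,   3.4,    1.1,   2.2,     3.0,    2.5],
})
labels = [0, 1, 0, 1, 1, 1]

cat_model = cb.CatBoostClassifier(iterations=50, verbose=0)
# cat_features marks the column indices CatBoost should encode itself
cat_model.fit(df, labels, cat_features=[0])
print(cat_model.predict([["red", 1.3]]))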
7. Extra Trees Classifier
Extra Trees (Extremely Randomized Trees) is an ensemble learning method that aggregates the results of multiple de-correlated decision trees collected in a "forest". It differs from Random Forest in the way splits are computed: candidate thresholds are drawn at random for each feature rather than searched for the optimal cut point.
Implementing Extra Trees Classifier in Python
- Create an ExtraTreesClassifier instance with 100 trees.
- Train the model using fit on the training data.
- Make predictions on the test data.
- Calculate and print the accuracy.
Python
et_clf = ExtraTreesClassifier(n_estimators=100, random_state=42)
et_clf.fit(X_train, y_train)
y_pred_et = et_clf.predict(X_test)
accuracy_et = accuracy_score(y_test, y_pred_et)
print(f"Extra Trees Accuracy: {accuracy_et * 100:.2f}%")
Output:
Extra Trees Accuracy: 100.00%
8. HistGradientBoostingClassifier
HistGradientBoostingClassifier is a faster variant of Gradient Boosting that uses histograms to bucket continuous feature values into discrete bins, which speeds up the training process.
Implementing HistGradientBoostingClassifier in Python
- Create a HistGradientBoostingClassifier instance.
- Train the model using fit on the training data.
- Make predictions on the test data.
- Calculate and print the accuracy.
Python
hgb_clf = HistGradientBoostingClassifier()
hgb_clf.fit(X_train, y_train)
y_pred_hgb = hgb_clf.predict(X_test)
accuracy_hgb = accuracy_score(y_test, y_pred_hgb)
print(f"HistGradientBoosting Accuracy: {accuracy_hgb * 100:.2f}%")
Output:
HistGradientBoosting Accuracy: 100.00%
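Two practical details the example above does not show: the number of histogram bins is controlled by max_bins (capped at 255, which is also the default), and HistGradientBoostingClassifier handles missing values natively. A short sketch that injects a NaN purely for illustration:
Python
import numpy as np

X_missing = X_train.copy()
X_missing[0, 0] = np.nan  # simulate a missing value

# max_bins trades resolution for speed; 255 is the maximum allowed
hgb_nan = HistGradientBoostingClassifier(max_bins=128)
hgb_nan.fit(X_missing, y_train)  # no imputation step required
print(f"Accuracy with a missing value present: {accuracy_score(y_test, hgb_nan.predict(X_test)) * 100:.2f}%")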
9. AdaBoost Classifier
AdaBoost (Adaptive Boosting) works by combining multiple weak classifiers to create a strong classifier. It adjusts the weights of incorrectly classified instances so that subsequent classifiers focus more on difficult cases.
Implementing AdaBoost Classifier in Python
- Create an AdaBoostClassifier instance with 100 estimators.
- Train the model using fit on the training data.
- Make predictions on the test data.
- Calculate and print the accuracy.
Python
ada_clf = AdaBoostClassifier(n_estimators=100, random_state=42)
ada_clf.fit(X_train, y_train)
y_pred_ada = ada_clf.predict(X_test)
accuracy_ada = accuracy_score(y_test, y_pred_ada)
print(f"AdaBoost Accuracy: {accuracy_ada * 100:.2f}%")
Output:
AdaBoost Accuracy: 100.00%
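By default, AdaBoostClassifier boosts depth-1 decision stumps. The per-learner voting weights can be inspected through estimator_weights_, and a deeper weak learner can be swapped in. Note that the base-learner argument is named estimator in scikit-learn 1.2 and later (older versions used base_estimator); a hedged sketch:
Python
# Weights of the first few weak learners (higher = more say in the vote)
print(ada_clf.estimator_weights_[:5])

# Depth-2 trees as the weak learner (scikit-learn >= 1.2 API)
ada_deep = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=2),
    n_estimators=100,
    random_state=42,
)
ada_deep.fit(X_train, y_train)
print(f"AdaBoost (depth-2) Accuracy: {accuracy_score(y_test, ada_deep.predict(X_test)) * 100:.2f}%")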
Conclusion
Tree-based models such as Decision Trees, Random Forests, Gradient Boosting, XGBoost, LightGBM, CatBoost, Extra Trees, HistGradientBoosting, and AdaBoost provide powerful and intuitive methods for classification tasks. They handle both numerical and categorical data effectively and can be easily implemented and visualized in Python, allowing for improved understanding and accuracy in various machine learning applications.