How to fit categorical data types for random forest classification?
Last Updated: 08 Apr, 2024
Categorical variables are an essential component of many datasets, representing qualitative characteristics rather than numerical values. While random forest classification is a powerful machine-learning technique, common implementations such as scikit-learn's require numerical input data. Encoding categorical variables into a suitable numeric format is therefore a crucial step in preparing data for random forest classification. In this article, we'll explore different encoding methods and apply each of them to fit categorical data for random forest classification.
Types of Encoding for Random Forest Classification
- Ordinal Encoding: Ordinal encoding is particularly useful when categorical variables have an inherent order or rank. In this method, each category is assigned a unique integer value based on its position in the ordered sequence.
- One-hot Encoding: One-hot encoding is a popular technique for handling categorical variables, especially when the categories are not inherently ordered. In this method, each category is represented by a binary indicator variable (0 or 1).
- Target Encoding: Target encoding, also known as mean encoding, replaces each category with the mean of the target variable for that category. This method is particularly useful when dealing with high-cardinality categorical variables, where one-hot encoding would result in a large number of binary columns. By encoding categories based on their relationship with the target variable, target encoding captures valuable information about the predictive power of each category. However, it's essential to be cautious when using target encoding to avoid overfitting, especially with small or imbalanced datasets.
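The three encodings described above can be illustrated on a tiny example. This is a minimal sketch using only pandas; the toy column `size`, its category order, and the binary `target` are hypothetical and are not part of the dataset used later in the article.

```python
import pandas as pd

# Hypothetical toy data: one categorical feature and a binary target.
toy = pd.DataFrame({
    "size": ["small", "med", "big", "med", "small", "big"],
    "target": [0, 1, 1, 0, 0, 1],
})

# Ordinal encoding: map each category to its rank in a known order.
size_order = {"small": 0, "med": 1, "big": 2}
ordinal = toy["size"].map(size_order)

# One-hot encoding: one 0/1 indicator column per category.
one_hot = pd.get_dummies(toy["size"], prefix="size").astype(int)

# Target (mean) encoding: replace each category with the mean target value.
category_means = toy.groupby("size")["target"].mean()
target_encoded = toy["size"].map(category_means)

print(ordinal.tolist())
print(one_hot.columns.tolist())
print(target_encoded.tolist())
```

Ordinal and target encoding each produce a single numeric column, while one-hot encoding produces one column per category, which is why it grows quickly for high-cardinality features.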
How to fit categorical data types for random forest classification in Python?
Handling categorical data in machine learning involves converting discrete category values into numerical representations suitable for models like random forests. Techniques include Ordinal Encoding, One-Hot Encoding, and Target Encoding, each with its own advantages and trade-offs depending on the nature of the categorical variable and the model requirements. The choice of encoding method affects model performance and should be made based on the data characteristics and modeling goals.
Implementation of fitting categorical data types for random forest classification
Loading the dataset
- We load the dataset into a DataFrame using pandas.
- The call to read_csv() specifies header=None, indicating that the CSV file doesn't contain a header row.
- Lastly, the first few rows of the DataFrame are displayed using the head() method, providing a quick preview of the data.
Python3
import pandas as pd

# Path to the car evaluation CSV file (it has no header row)
data = 'car-evaluation-data-set/car_evaluation.csv'
df = pd.read_csv(data, header=None)
df.head()
Output:
0 1 2 3 4 5 6
0 vhigh vhigh 2 2 small low unacc
1 vhigh vhigh 2 2 small med unacc
2 vhigh vhigh 2 2 small high unacc
3 vhigh vhigh 2 2 med low unacc
4 vhigh vhigh 2 2 med med unacc
Renaming Columns
- col_names is a list containing the desired column names.
- df.columns = col_names assigns the column names from the col_names list to the DataFrame df.
Python3
col_names = ['Cost', 'Maintenance', 'Doors', 'Persons', 'Luggage boot', 'Safety', 'Class']
df.columns = col_names
df.head()
Output:
Cost Maintenance Doors Persons Luggage boot Safety Class
0 vhigh vhigh 2 2 small low unacc
1 vhigh vhigh 2 2 small med unacc
2 vhigh vhigh 2 2 small high unacc
3 vhigh vhigh 2 2 med low unacc
4 vhigh vhigh 2 2 med med unacc
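The loading and renaming steps can also be combined by passing the column names to read_csv() through its names parameter. The sketch below is only an illustration: it reads two rows copied from the preview out of an in-memory string (`sample_csv`, a hypothetical stand-in for the real file), and it adds dtype=str as an assumption so that numeric-looking columns such as 'Doors' are kept as categories.

```python
import io
import pandas as pd

# Two rows in the same layout as the dataset preview, standing in for the CSV file.
sample_csv = io.StringIO(
    "vhigh,vhigh,2,2,small,low,unacc\n"
    "vhigh,vhigh,2,2,small,med,unacc\n"
)

col_names = ['Cost', 'Maintenance', 'Doors', 'Persons', 'Luggage boot', 'Safety', 'Class']

# names= sets the column labels at load time;
# dtype=str keeps 'Doors' and 'Persons' as text categories instead of integers.
sample_df = pd.read_csv(sample_csv, header=None, names=col_names, dtype=str)
print(sample_df.head())
```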
Declaring Feature and Target vector
- The code snippet creates a feature matrix X by dropping the column labeled 'Class' from the DataFrame df, using the drop() method along the columns axis (axis=1).
- It creates a target vector y by selecting only the column labeled 'Class' from the DataFrame df.
Python3
X = df.drop(['Class'], axis=1)
y = df['Class']
Splitting data into Train and Test set
- This code snippet uses the train_test_split function from scikit-learn to split the features (X) and target (y) into training (X_train, y_train) and testing (X_test, y_test) sets, with 80% of the data allocated for training and 20% for testing.
- The random_state=42 parameter ensures reproducibility of the split, meaning that the same random split will be obtained each time the code is run.
Python3
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
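To see what the 80/20 split and the fixed seed do in practice, here is a small sketch on hypothetical stand-in data (`demo_X` and `demo_y` are made up for illustration, not taken from the car evaluation dataset).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 10 rows, one feature column and one label column.
demo_X = pd.DataFrame({"feature": list("abcdefghij")})
demo_y = pd.Series([0, 1] * 5, name="label")

# Run the same split twice with the same seed.
split_a = train_test_split(demo_X, demo_y, test_size=0.2, random_state=42)
split_b = train_test_split(demo_X, demo_y, test_size=0.2, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(len(X_tr), len(X_te))  # sizes of the training and test portions

# With an identical random_state, both runs select the same test rows.
print(X_te.index.equals(split_b[1].index))
```

When the classes are imbalanced, passing stratify=y to train_test_split is an option worth considering, since it keeps the class proportions similar in both portions.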
Ordinal Encoding
- The category_encoders library, a third-party package for encoding categorical variables, is imported as ce.
- An OrdinalEncoder object is initialized, with the columns to encode given by col_names[:-1], that is, every column except the last one ('Class').
- Copies of the training and testing feature sets (X_train and X_test) are created to preserve the original data.
- The fit_transform method of the OrdinalEncoder object is applied to the training feature set (X_train_oe), fitting the encoder and transforming the training data into ordinal encoded format.
- Similarly, the transform method is applied to the testing feature set (X_test_oe) to transform it into the same ordinal encoded format.
- The head() method is used to display the first few rows of the transformed training feature set (X_train_oe).
Python3
# import category encoders
import category_encoders as ce
ordinal_encoder = ce.OrdinalEncoder(cols=col_names[:-1])
X_train_oe = X_train.copy()
X_test_oe = X_test.copy()
X_train_oe = ordinal_encoder.fit_transform(X_train_oe)
X_test_oe = ordinal_encoder.transform(X_test_oe)
X_train_oe.head()
Output:
Cost Maintenance Doors Persons Luggage boot Safety
107 1 1 1 1 1 1
901 2 1 2 2 2 2
1709 3 2 1 3 1 1
706 4 3 3 3 3 2
678 4 3 2 3 3 3
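In the output above, every column of the first row is coded 1, which suggests the encoder numbered the categories in the order they first appeared rather than by their meaning. If the real ranking matters, the order can be stated explicitly. The following is a minimal sketch that swaps in scikit-learn's OrdinalEncoder with its categories parameter; the sample rows are hypothetical, and the category orders are an assumption based on the level names shown in the article.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample using two of the article's columns.
sample = pd.DataFrame({
    "Cost": ["vhigh", "low", "med", "high"],
    "Safety": ["low", "high", "med", "low"],
})

# Explicit category order per column, from lowest to highest (assumed ranking).
cost_order = ["low", "med", "high", "vhigh"]
safety_order = ["low", "med", "high"]

# The position of each category in its list becomes its integer code.
encoder = OrdinalEncoder(categories=[cost_order, safety_order])
encoded = pd.DataFrame(
    encoder.fit_transform(sample), columns=sample.columns, index=sample.index
)
print(encoded)
```

With an explicit order, a split such as "Cost <= 1" has a direct reading (low or med cost), which is harder to guarantee when codes follow order of appearance.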
Random Forest Classification on Ordinal encoded data
- The code snippet imports the RandomForestClassifier class from the sklearn.ensemble module, allowing the implementation of a random forest classifier.
- It also imports the accuracy_score function from the sklearn.metrics module, which will be used to evaluate the classifier's performance.
- An instance of the RandomForestClassifier class is initialized with the random_state=42 parameter, ensuring reproducibility of results by fixing the random number generator seed to 42.
- The code fits the RandomForestClassifier (rf_classifier) to the training data (X_train_oe, y_train) using the fit() method.
- Predictions are made on the ordinal encoded testing feature set (X_test_oe) using the predict() method, resulting in predicted target values (y_pred_oe).
- The accuracy of the model is calculated by comparing the predicted target values (y_pred_oe) with the actual target values from the testing set (y_test) using the accuracy_score() function.
Python3
# import Random Forest classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Initialize and fit RandomForestClassifier
rf_classifier = RandomForestClassifier(random_state=42)
rf_classifier.fit(X_train_oe, y_train)
y_pred_oe = rf_classifier.predict(X_test_oe)
# Calculate accuracy
accuracy_oe = accuracy_score(y_test, y_pred_oe)
print("Ordinal Encoder Accuracy: ", accuracy_oe)
Output:
Ordinal Encoder Accuracy: 0.9566473988439307
One-Hot Encoding
- The code initializes a OneHotEncoder object from the category_encoders library, specifying the columns to encode as the features excluding the last column ('Class') using col_names[:-1].
- Copies of the training and testing feature sets (X_train and X_test) are created to preserve the original data.
- The fit_transform method of the OneHotEncoder object is applied to the training feature set (X_train_oh), fitting the encoder and transforming the training data into one-hot encoded format.
- Similarly, the transform method is applied to the testing feature set (X_test_oh) to transform it into the same one-hot encoded format.
- The head() method is used to display the first few rows of the transformed training feature set (X_train_oh).
Python3
one_hot = ce.OneHotEncoder(cols=col_names[:-1])
X_train_oh = X_train.copy()
X_test_oh = X_test.copy()
X_train_oh = one_hot.fit_transform(X_train_oh)
X_test_oh = one_hot.transform(X_test_oh)
X_train_oh.head()
Output:
Cost_1 Cost_2 Cost_3 Cost_4 Maintenance_1 Maintenance_2 Maintenance_3 Maintenance_4 Doors_1 Doors_2 ... Doors_4 Persons_1 Persons_2 Persons_3 Luggage boot_1 Luggage boot_2 Luggage boot_3 Safety_1 Safety_2 Safety_3
107 1 0 0 0 1 0 0 0 1 0 ... 0 1 0 0 1 0 0 1 0 0
901 0 1 0 0 1 0 0 0 0 1 ... 0 0 1 0 0 1 0 0 1 0
1709 0 0 1 0 0 1 0 0 1 0 ... 0 0 0 1 1 0 0 1 0 0
706 0 0 0 1 0 0 1 0 0 0 ... 0 0 0 1 0 0 1 0 1 0
678 0 0 0 1 0 0 1 0 0 1 ... 0 0 0 1 0 0 1 0 0 1
5 rows × 21 columns
Random Forest Classification on One-hot encoded data
- The code refits the same RandomForestClassifier object (rf_classifier) on the one-hot encoded training data (X_train_oh, y_train) using the fit() method, which replaces the model trained on the ordinal encoded data.
- Predictions are made on the one-hot encoded testing feature set (X_test_oh) using the predict() method, resulting in predicted target values (y_pred_oh).
- The accuracy of the model is calculated by comparing the predicted target values (y_pred_oh) with the actual target values from the testing set (y_test) using the accuracy_score() function.
Python3
rf_classifier.fit(X_train_oh, y_train)
y_pred_oh = rf_classifier.predict(X_test_oh)
# Calculate accuracy
accuracy_oh = accuracy_score(y_test, y_pred_oh)
print("One-Hot Encoder Accuracy: ", accuracy_oh)
Output:
One-Hot Encoder Accuracy: 0.9595375722543352
Target Encoding
- The code initializes a TargetEncoder object from the category_encoders library, specifying the columns to encode as the features excluding the last column ('Class') using col_names[:-1].
- An OrdinalEncoder object (oe) is also initialized to convert the target variable ('Class') to integer codes, because the target encoder computes per-category means and therefore needs a numeric target.
- The training target is encoded using oe to create y_train_oe.
- Copies of the training and testing feature sets (X_train and X_test) are created to preserve the original data.
- The fit_transform method of the TargetEncoder object is applied to the training feature set (X_train_te) together with the encoded training target (y_train_oe), fitting the encoder and transforming the training data into target encoded format.
- The transform method is then applied to the testing feature set (X_test_te) without any labels, so the test features are encoded using only the statistics learned from the training data.
- The head() method is used to display the first few rows of the transformed training feature set (X_train_te).
Python3
target_encoder = ce.TargetEncoder(cols=col_names[:-1])
# Encode the class labels as integers so that category means can be computed
oe = ce.OrdinalEncoder(cols=["Class"])
y_train_oe = oe.fit_transform(y_train)
X_train_te = X_train.copy()
X_test_te = X_test.copy()
# Fit on the training data only; test labels are never passed to the encoder
X_train_te = target_encoder.fit_transform(X_train_te, y_train_oe)
X_test_te = target_encoder.transform(X_test_te)
X_train_te.head()
Output:
Cost Maintenance Doors Persons Luggage boot Safety
107 1.168639 1.159292 1.466667 1.596950 1.522777 1.738462
901 1.521127 1.159292 1.397661 1.627907 1.295896 1.513100
1709 1.684814 1.688623 1.466667 1.000000 1.522777 1.738462
706 1.264706 1.517045 1.450867 1.000000 1.421397 1.513100
678 1.264706 1.517045 1.397661 1.000000 1.421397 1.000000
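The idea behind these values can be sketched by hand: learn the mean target value per category on the training data, then map those means onto new data. This is a simplified illustration with plain means and a global-mean fallback for unseen categories; it omits the smoothing that library encoders typically apply, and the helper names `fit_target_means` and `apply_target_means` as well as the sample values are hypothetical.

```python
import pandas as pd

def fit_target_means(feature: pd.Series, target: pd.Series):
    """Learn per-category target means and the global mean from training data."""
    means = target.groupby(feature).mean()
    return means, target.mean()

def apply_target_means(feature: pd.Series, means: pd.Series, fallback: float) -> pd.Series:
    """Map categories to learned means; unseen categories get the fallback."""
    return feature.map(means).fillna(fallback)

# Hypothetical training data with integer-coded labels.
train_feature = pd.Series(["low", "low", "med", "high", "high", "high"])
train_target = pd.Series([1, 2, 2, 4, 3, 2])

# "vhigh" does not occur in the training data.
test_feature = pd.Series(["med", "high", "vhigh"])

means, global_mean = fit_target_means(train_feature, train_target)
test_encoded = apply_target_means(test_feature, means, global_mean)
print(test_encoded.tolist())
```

Only training labels enter the computation, so the test set is encoded without ever looking at its own targets.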
Random Forest Classification on Target encoded data
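- The code refits the same RandomForestClassifier object (rf_classifier) on the target encoded training data (X_train_te, y_train) using the fit() method. Note that the model is trained on the original class labels (y_train), not on the integer-coded target used by the encoder.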
- Predictions are made on the target encoded testing feature set (X_test_te) using the predict() method, resulting in predicted target values (y_pred_te).
- The accuracy of the model is calculated by comparing the predicted target values (y_pred_te) with the actual target values from the testing set (y_test) using the accuracy_score() function.
Python3
rf_classifier.fit(X_train_te, y_train)
y_pred_te = rf_classifier.predict(X_test_te)
# Calculate accuracy
accuracy_te = accuracy_score(y_test, y_pred_te)
print("Target Encoder Accuracy: ", accuracy_te)
Output:
Target Encoder Accuracy: 0.9739884393063584
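The accuracy figures above each come from a single train/test split. A sketch of a more robust comparison is to wrap each encoder and the classifier in a scikit-learn pipeline and score it with cross-validation. The example below uses scikit-learn's own encoders and hypothetical synthetic data (`demo_X`, `demo_y`) so that it stays self-contained; it is not a rerun of the article's experiment.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical synthetic data standing in for categorical features.
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({
    "Safety": rng.choice(["low", "med", "high"], size=200),
    "Luggage boot": rng.choice(["small", "med", "big"], size=200),
})
# The label depends on Safety only, so both encodings can recover the signal.
demo_y = (demo_X["Safety"] == "high").astype(int)

encoders = {
    "ordinal": OrdinalEncoder(),
    "one-hot": OneHotEncoder(handle_unknown="ignore"),
}

results = {}
for name, encoder in encoders.items():
    # The encoder is refit inside each fold, so it never sees the held-out rows.
    model = make_pipeline(encoder, RandomForestClassifier(n_estimators=50, random_state=42))
    scores = cross_val_score(model, demo_X, demo_y, cv=5)
    results[name] = scores.mean()
    print(name, round(results[name], 3))
```

Averaging over several folds gives a steadier basis for choosing between encodings than one split, where a difference of a few test rows can change the ranking.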
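In conclusion, the choice of encoding technique for categorical variables affects the performance of random forest classification. On this dataset, the test accuracies were about 0.957 with ordinal encoding, 0.960 with one-hot encoding, and 0.974 with target encoding. These figures come from a single train/test split, so the gaps between them should be read with some caution. Ordinal Encoding preserves ordinal relationships when the integer codes follow the true category order, One-Hot Encoding handles unordered categories effectively, and Target Encoding captures predictive information about each category. Understanding these techniques helps data scientists preprocess categorical data effectively and choose the encoding that suits their data.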