
International Journal of Innovative Science and Research Technology
Volume 10, Issue 3, March 2025
ISSN No: 2456-2165    https://doi.org/10.38124/ijisrt/25mar416

Comparative Analysis of Gradient Boosting and Transformer Based Models for Binary Classification in Tabular Data
A Customer Churn Case Study

Jebaraj Vasudevan¹
¹ Visa Inc., Atlanta, GA, USA

Publication Date: 2025/03/20

Abstract: This study compares the classification performance of Gradient Boosting (XGBoost) and a Transformer-based model with multi-head self-attention on tabular data. While the two methods exhibit broadly similar performance, the Transformer model exceeds XGBoost in Recall by about 8%, suggesting that it is better suited to applications such as fraud detection in payment processing and medical diagnostics.

Keywords: Transformer, Gradient Boosting, XGBoost, Tabular Data.

How to Cite: Jebaraj Vasudevan (2025). Comparative Analysis of Gradient Boosting and Transformer Based Models for Binary Classification in Tabular Data. International Journal of Innovative Science and Research Technology, 10(3), 466-470. https://doi.org/10.38124/ijisrt/25mar416

I. INTRODUCTION

Tabular data is ubiquitous in industry because it is inherently structured, easily interpretable, and compatible with a wide range of analytical and reporting tools. Its organization in rows and columns simplifies the process of data storage, retrieval, and manipulation, which is why relational databases, spreadsheets, and data warehouses predominantly use this format.

Industries such as finance, healthcare, retail, telecommunications, and manufacturing rely heavily on tabular data. In finance, for instance, transaction records, market data, and risk assessments are typically stored in structured tables, facilitating quantitative analyses and regulatory reporting. In healthcare, patient records, laboratory results, and treatment histories are maintained in tabular formats to support clinical decision-making and research. Retail and e-commerce sectors use tabular data for inventory management, sales tracking, and customer behavior analysis, while telecommunications companies employ it for billing, service usage, and churn prediction.

The prevalence of tabular data across these sectors highlights its role in enabling robust, data-driven decision-making and operational efficiency. Its simplicity and versatility make it a cornerstone of analytical workflows in both traditional and modern digital enterprises.

This case study presents a comparative analysis of XGBoost [1] and TabTransformer [2], two of the most popular supervised learning algorithms for tabular data. We chose the task of evaluating their performance on a binary churn prediction problem using the Telco Customer Churn data [3]. The algorithms exhibit a similar level of performance on multiple classification metrics, while the TabTransformer outperforms XGBoost on Recall by +8%.

Comparing XGBoost and TabTransformer reveals distinct methodologies that cater to different aspects of tabular data modeling. XGBoost, a gradient boosting framework, is lauded for its efficiency in handling structured data. It builds ensembles of decision trees using gradient statistics and regularization, resulting in robust models that mitigate overfitting and offer clear interpretability. The algorithm has been refined over years and is widely adopted in industry and research due to its computational speed and ease of deployment. In contrast, TabTransformer harnesses the power of transformer architectures originally designed for natural language processing. By applying self-attention mechanisms, TabTransformer captures complex, non-linear interactions among features, providing a deep representation of data relationships. While XGBoost excels in scenarios where model transparency and speed are paramount, TabTransformer demonstrates potential in situations with intricate feature dependencies that require nuanced contextual understanding. The choice between these methods depends on the problem domain, computational resources, and the need for model interpretability versus expressive power. Both approaches offer complementary strengths; combining them might even enhance performance in hybrid systems. Ultimately, their continued development reflects the dynamic evolution of machine learning techniques for structured data analysis. This comparative review highlights the importance of aligning algorithm selection with specific data challenges.


II. SCOPE

 XGBoost
XGBoost is a highly efficient, scalable gradient boosting algorithm that has revolutionized machine learning practices across various domains. It constructs an ensemble of decision trees in a sequential manner, optimizing each new tree based on the residual errors of previous iterations. By employing both first-order and second-order gradient statistics, XGBoost effectively minimizes loss functions while integrating regularization techniques to prevent overfitting. The algorithm is well known for its speed and performance, especially on large and complex datasets. Its implementation supports parallel processing and distributed computing, enabling the analysis of massive datasets with ease. Additionally, XGBoost provides robust handling of missing values and sparse data through innovative approaches such as the weighted quantile sketch. The framework is highly customizable, accommodating various objective functions, including regression, classification, and ranking. As a result, it has become a favored choice in data science competitions and industry applications. With a strong emphasis on interpretability and computational efficiency, XGBoost has significantly contributed to the advancement of predictive analytics and remains a critical tool for researchers and practitioners aiming to extract meaningful insights from data. Furthermore, its design enables seamless integration with various programming languages and data processing libraries, making it a versatile solution for research and industry applications.

 TabTransformer
TabTransformer is an innovative neural architecture designed specifically for tabular data analysis by leveraging the principles of transformer models. It extends the self-attention mechanism, which is central to transformers, to capture intricate relationships among features in structured datasets. Transformers, initially introduced for natural language processing, utilize multi-head self-attention to assess the significance of each input element, regardless of their order. In TabTransformer, categorical features are first transformed into dense embeddings, which are then processed through a series of transformer layers. These layers enable the model to learn complex, non-linear interactions among variables, facilitating superior feature representation. The self-attention mechanism allows the model to dynamically weigh contributions from different features, thus enhancing predictive accuracy and robustness. Moreover, the architecture seamlessly integrates with traditional deep learning frameworks, making it adaptable to various data science tasks. By combining the strengths of transformer architectures with specialized adaptations for tabular data, TabTransformer offers a novel approach to overcome limitations of conventional methods. Its design represents a convergence of ideas from natural language processing and structured data modeling, offering promising potential in fields requiring high interpretability and performance. This approach not only enhances model efficiency but also paves the way for future innovations in data representation.

 Data
The Telco Customer Churn data from Kaggle contains real-world data collected from a telecommunications company, capturing various aspects of customer behavior and account characteristics. The dataset includes demographic details, account information, service subscriptions, billing data, and usage metrics. The primary target variable is a binary indicator representing whether a customer has discontinued their service ("Churn"), making it a popular benchmark for binary classification tasks focused on customer attrition.

The dataset's structure—with a mix of categorical features (e.g., gender, contract type, payment method) and numerical features (e.g., tenure, monthly charges, total charges)—requires robust preprocessing and feature engineering. Researchers and practitioners have leveraged this dataset to test various data transformation and modeling approaches, as its inherent challenges, such as handling missing values and imbalanced classes, reflect real business scenarios.

Due to its practical significance, the Telco Customer Churn dataset is frequently used in both academic studies and industrial applications. It helps organizations develop predictive models aimed at understanding and mitigating churn, ultimately supporting customer retention strategies through data-driven insights.
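To make the preprocessing concrete, the following is a minimal loading sketch. The file name, the handling of the blank TotalCharges entries, and the label encoding are assumptions based on the public Kaggle version of the dataset [3], not steps reported in this paper.

```python
import pandas as pd

# Load the public Telco Customer Churn CSV (file name assumed from the Kaggle dataset [3]).
df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")

# TotalCharges is read as text because a handful of rows contain blanks;
# coerce to numeric and drop the resulting missing values (imputation is an alternative).
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
df = df.dropna(subset=["TotalCharges"])

# Map the binary target to 0/1 and drop the customer identifier.
df["Churn"] = df["Churn"].map({"No": 0, "Yes": 1})
df = df.drop(columns=["customerID"])

categorical_cols = df.select_dtypes(include="object").columns.tolist()
numerical_cols = [c for c in df.columns if c not in categorical_cols + ["Churn"]]
```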
III. IMPLEMENTATION

The TabTransformer [2] is organized into three principal components: a dedicated column embedding layer, a succession of N Transformer layers, and a concluding multilayer perceptron. Each Transformer layer, as described by [4], integrates a multi-head self-attention mechanism that dynamically models inter-feature dependencies, followed by a position-wise feed-forward network that refines the learned representations. This configuration facilitates the extraction of complex interactions within categorical data while seamlessly integrating numerical inputs, ultimately enhancing predictive performance on tabular datasets.
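The reference implementation is described in [2]; the following is a minimal PyTorch sketch of the three components named above (per-column embeddings, a stack of Transformer layers, and a final MLP). The layer sizes and the use of nn.TransformerEncoder are illustrative choices, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class TabTransformerSketch(nn.Module):
    """Column embeddings -> Transformer encoder -> concat with numeric features -> MLP."""

    def __init__(self, cat_cardinalities, num_continuous, embed_dim=32,
                 n_layers=6, n_heads=8, mlp_hidden=64, n_classes=2):
        super().__init__()
        # One embedding table per categorical column.
        self.embeddings = nn.ModuleList(
            [nn.Embedding(card, embed_dim) for card in cat_cardinalities]
        )
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=n_heads,
            dim_feedforward=4 * embed_dim, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        in_features = len(cat_cardinalities) * embed_dim + num_continuous
        self.mlp = nn.Sequential(
            nn.Linear(in_features, mlp_hidden),
            nn.ReLU(),
            nn.Linear(mlp_hidden, n_classes),
        )

    def forward(self, x_cat, x_cont):
        # x_cat: (batch, num_cat) integer codes; x_cont: (batch, num_continuous) floats.
        E = torch.stack(
            [emb(x_cat[:, i]) for i, emb in enumerate(self.embeddings)], dim=1
        )                                   # (batch, num_cat, embed_dim)
        E_ctx = self.encoder(E)             # contextualized categorical embeddings
        flat = E_ctx.flatten(start_dim=1)   # (batch, num_cat * embed_dim)
        return self.mlp(torch.cat([flat, x_cont], dim=1))  # prediction logits
```

Training then minimizes a cross-entropy loss over the embedding, encoder, and MLP parameters with gradient descent, as detailed in the Forward Pass section below.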


Fig 1 Tab Transformer Architecture [2]

 Forward Pass

 Embedding Categorical Inputs
In the forward method, each column of the categorical input x_cat is passed through its corresponding embedding layer. These embeddings are stacked along a new dimension to form a tensor E with shape (batch, num_cat, embed_dim).

 Transformer Encoder
The stacked embeddings E are passed through the transformer encoder. This layer applies multi-head self-attention (explained in detail below), allowing the model to learn complex interdependencies between the different categorical features.

 Concatenation and Prediction
The output E′ from the previous layer is flattened to a vector and concatenated with the numerical features x_cont ∈ ℝ^c, where c is the number of continuous features. The resulting vector is processed by the MLP to yield the final prediction logits.

For our classification task, let C be the cross-entropy loss; we want to minimize the following loss function L(x, y) to learn all the parameters end-to-end with gradient descent. The TabTransformer parameters include φ for the column embedding, θ for the Transformer layers, and ψ for the top MLP layer:

L(x, y) ≡ C( g_ψ( f_θ(E_φ(x_cat)), x_cont ), y )    (1)

 Multi-Head Self-Attention
In the formulation presented by [4], the Transformer architecture is structured around a multi-head self-attention mechanism followed by a position-wise feed-forward network, with both sub-layers augmented by residual connections and layer normalization. The self-attention mechanism operates via three learnable projection matrices—namely, Key, Query, and Value. Each input embedding is projected onto these matrices to produce its corresponding key, query, and value vectors. Formally, let K ∈ ℝ^(m×k), Q ∈ ℝ^(m×k), V ∈ ℝ^(m×v) denote the matrices containing the key, query, and value vectors for m input embeddings, where k and v represent the dimensions of the key and value vectors, respectively. Each embedding then computes attention over all embeddings via an attention head defined by

Attention(K, Q, V) = A · V    (2)

with the attention weights given by

A = softmax( QKᵀ / √k )    (3)

Here, the matrix A ∈ ℝ^(m×m) quantifies the degree to which each embedding attends to every other embedding, thereby producing contextually enriched representations. Following the attention operation, the output—originally of dimension v—is re-projected to the embedding dimension d via a fully connected layer. This is then processed sequentially by two position-wise feed-forward layers, where the first layer expands the dimensionality to four times the original size and the second layer subsequently reduces it back to d.
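As a worked illustration of Eqs. (2) and (3), the single-head sketch below computes the attention weights A and the attended output for m input embeddings. It deliberately omits the multi-head split, the output re-projection, and the residual and normalization steps described above, and the dimensions used are arbitrary.

```python
import torch
import torch.nn.functional as F

def single_head_attention(E, W_q, W_k, W_v):
    """E: (m, d) input embeddings; W_q, W_k: (d, k) projections; W_v: (d, v) projection."""
    Q = E @ W_q                                     # queries, shape (m, k)
    K = E @ W_k                                     # keys,    shape (m, k)
    V = E @ W_v                                     # values,  shape (m, v)
    k_dim = Q.shape[-1]
    A = F.softmax(Q @ K.T / k_dim ** 0.5, dim=-1)   # Eq. (3): (m, m) attention weights
    return A @ V                                    # Eq. (2): contextualized output, (m, v)

# Example: 5 feature embeddings of dimension 16, with k = v = 8.
E = torch.randn(5, 16)
W_q, W_k, W_v = (torch.randn(16, 8) for _ in range(3))
out = single_head_attention(E, W_q, W_k, W_v)       # shape (5, 8)
```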


IV. ANALYSIS

 Feature Engineering
We used several new features to enhance model performance by providing additional context and capturing non-linear relationships within the data. Below is an explanation of the key engineered features and their potential impact:

 Average Monthly Charge

AvgMonthlyCharge = TotalCharges / Tenure

This feature normalizes the total spending by the length of the customer's relationship, highlighting customers who incur higher charges relative to their engagement duration. It may indicate dissatisfaction or financial stress, both of which can correlate with churn.

 Service Count
By summing binary indicators for various service features (e.g., OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, and StreamingMovies), we created a feature:

ServiceCount = Σ_{i=1}^{n} 1(Service_i = "Yes")

This aggregation provides a measure of customer engagement with additional services, which can be a proxy for loyalty. A higher count may imply a deeper investment in the ecosystem, potentially reducing churn risk.

 Tenure Binning
Instead of using the continuous tenure variable directly, we segmented tenure into categorical bins (e.g., 0–12 months, 13–24 months, etc.). This transformation captures non-linear effects, as churn likelihood may change drastically at different stages of a customer's lifecycle.

 Interaction Features
We explored interaction terms such as the product of Monthly Charges and Contract type, which can reveal combined effects where, for example, high charges paired with a month-to-month contract might be a stronger churn signal than either feature in isolation.

These engineered features enrich the dataset by providing more nuanced signals for the learning algorithms. For XGBoost, the additional numerical variables enhance tree-splitting decisions, while for TabTransformer, they offer extra context that complements the embedded representations of categorical data. Overall, these features aim to improve the models' ability to detect subtle patterns and relationships that contribute to customer churn.
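A minimal pandas sketch of these transformations is shown below. Column names follow the Kaggle dataset; the bin edges and the ordinal encoding of Contract used for the interaction term are illustrative assumptions rather than choices reported in this paper.

```python
import pandas as pd

service_cols = ["OnlineSecurity", "OnlineBackup", "DeviceProtection",
                "TechSupport", "StreamingTV", "StreamingMovies"]

def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()

    # Average monthly charge: total spend normalized by tenure (guard against tenure = 0).
    out["AvgMonthlyCharge"] = out["TotalCharges"] / out["tenure"].replace(0, 1)

    # Service count: number of add-on services the customer subscribes to.
    out["ServiceCount"] = (out[service_cols] == "Yes").sum(axis=1)

    # Tenure binning: segment the customer lifecycle (bin edges are illustrative).
    out["TenureBin"] = pd.cut(out["tenure"], bins=[0, 12, 24, 48, 72],
                              labels=["0-12", "13-24", "25-48", "49-72"],
                              include_lowest=True)

    # Interaction: monthly charges weighted by contract length (assumed ordinal encoding).
    contract_rank = {"Month-to-month": 0, "One year": 1, "Two year": 2}
    out["ChargeXContract"] = out["MonthlyCharges"] * out["Contract"].map(contract_rank)

    return out
```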
 Methodology and Metrics
Both models were trained using the same set of features, and training was stopped as soon as the loss on unseen data stopped improving (early stopping); a sketch of this procedure appears after the metric definitions below.

The models are compared using several performance metrics, reported in Table 1, that provide a comprehensive view of their classification abilities. These include:

 Accuracy: Measures the overall proportion of correct predictions.
 Precision: Evaluates the correctness of positive predictions, indicating how many predicted positives are true positives.
 Recall (Sensitivity): Assesses the model's ability to identify all actual positive cases.
 F1 Score: The harmonic mean of precision and recall, offering a balance between them.
 Area Under the ROC Curve (AUC): Captures the trade-off between true positive and false positive rates across different thresholds.
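For the XGBoost side, a hedged sketch of this procedure is shown below, assuming xgboost ≥ 1.6 and the loading and feature-engineering sketches from earlier sections. The split ratio, hyperparameters, and one-hot encoding of categoricals are illustrative; the TabTransformer is trained analogously with early stopping on the same split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)
from xgboost import XGBClassifier

# Continuing from the loading / feature-engineering sketches above.
X = add_engineered_features(df).drop(columns=["Churn"])
y = df["Churn"]
X_enc = pd.get_dummies(X)  # one-hot encode categoricals for XGBoost
X_train, X_valid, y_train, y_valid = train_test_split(
    X_enc, y, test_size=0.2, stratify=y, random_state=42)

model = XGBClassifier(
    n_estimators=500, learning_rate=0.05, max_depth=4,
    eval_metric="logloss", early_stopping_rounds=20)
# Stop adding trees once the held-out log loss stops improving.
model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)

proba = model.predict_proba(X_valid)[:, 1]
pred = (proba >= 0.5).astype(int)
print("Accuracy :", accuracy_score(y_valid, pred))
print("Precision:", precision_score(y_valid, pred))
print("Recall   :", recall_score(y_valid, pred))
print("F1       :", f1_score(y_valid, pred))
print("AUC      :", roc_auc_score(y_valid, proba))
```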

Table 1 Metrics Comparing the Model Performance on Unseen Data

                 Accuracy   Precision   Recall   F1      AUC
XGBoost          79.4%      64.3%       50.2%    56.5%   84.1%
TabTransformer   79.5%      63.1%       54.8%    58.6%   83.6%

As evident from the table shown above, the models have very similar overall performance, in line with what [2] also observed in their results. What we also see, however, is that the Transformer model outperforms the boosting method in Recall on the positive examples by about 8%. This matters in scenarios where the cost of missing a true positive far outweighs the inconvenience or cost of incorrectly flagging a negative instance as positive. For instance, in medical diagnostics—such as screening for cancer or infectious diseases—failing to identify a diseased patient (a false negative) can have severe or even fatal consequences, whereas a false positive might lead to further testing that, while potentially anxiety-inducing and costly, is comparatively less harmful. Similarly, in fraud detection systems, overlooking a fraudulent transaction could result in substantial financial loss, making it preferable to flag more transactions for review even if some are false alarms. In these circumstances, the Transformer-based model can be preferred over Gradient Boosting (XGBoost).




Fig 2 ROC Curve

Additionally, ROC curves, as shown in Fig 2, are plotted to visually analyze the distribution of classification errors and to assess model discrimination capabilities. These combined metrics allow for a detailed scientific comparison between the XGBoost and TabTransformer models, highlighting strengths and potential trade-offs in different aspects of performance.
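A comparison along the lines of Fig 2 can be produced with the sketch below, assuming held-out positive-class probabilities from the two fitted models (here named xgb_proba and tab_proba) and the y_valid labels from the earlier training sketch.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

def plot_roc(y_true, scores_by_model):
    """scores_by_model: dict mapping model name -> predicted positive-class probabilities."""
    for name, scores in scores_by_model.items():
        fpr, tpr, _ = roc_curve(y_true, scores)
        plt.plot(fpr, tpr, label=f"{name} (AUC = {roc_auc_score(y_true, scores):.3f})")
    plt.plot([0, 1], [0, 1], linestyle="--", color="grey")  # chance diagonal
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.legend()
    plt.show()

# xgb_proba / tab_proba: held-out probabilities from the two models (assumed available).
plot_roc(y_valid, {"XGBoost": xgb_proba, "TabTransformer": tab_proba})
```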

V. CONCLUSION

In conclusion, this study provides a comprehensive comparative analysis of Gradient Boosting (XGBoost) and Transformer-based models for binary classification in tabular data. Both models exhibit similar performance across various metrics, with the Transformer model demonstrating a notable advantage in recall. This suggests that the Transformer model may be better suited for applications where the cost of false negatives is high, such as fraud detection and medical diagnostics. The findings underscore the importance of aligning model selection with specific data challenges and application requirements. Future research could explore hybrid approaches that combine the strengths of both models to further enhance performance. Overall, this study contributes valuable insights into the evolving landscape of machine learning techniques for structured data analysis.

REFERENCES

[1]. T. Chen and C. Guestrin, "XGBoost: A Scalable Tree Boosting System," 2016.
[2]. X. Huang, A. Khetan, M. Cvitkovic and Z. Karnin, "TabTransformer: Tabular Data Modeling Using Contextual Embeddings," 2020.
[3]. "Kaggle," [Online]. Available: https://www.kaggle.com/datasets/blastchar/telco-customer-churn/data.
[4]. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones and A. Gomez, "Attention Is All You Need," 2017.

