Keywords: Customer churn; Explainable model; Global explainable; Local explainable; Telecommunication

Abstract: The study addresses customer churn, a major issue in service-oriented sectors like telecommunications, where it refers to the discontinuation of subscriptions. The research emphasizes the importance of recognizing customer satisfaction for retaining clients, focusing specifically on early churn prediction as a key strategy. Previous approaches mainly used generalized classification techniques for churn prediction but often neglected the aspect of interpretability, which is vital for decision-making. This study introduces explainer models to address this gap, providing both local and global explanations of churn predictions. Various classification models, including the standout Gradient Boosting Machine (GBM), were used alongside visualization techniques such as Shapley Additive Explanations plots and scatter plots for enhanced interpretability. The GBM model demonstrated superior performance with an 81% accuracy rate. A Wilcoxon signed-rank test confirmed GBM's effectiveness over the other models, with the p-value indicating significant performance differences. The study concludes that GBM is notably better for churn prediction and that the employed visualization techniques effectively elucidate key churn factors in the telecommunications sector.
1. Introduction

The service-oriented industries, such as telecommunications, face considerable challenges due to customer churn, where valuable customers are lost to competitors. As the world rapidly embraces digitization, the telecommunications sector serves as a crucial backbone. Notably, it represents a significant contributor to national income, particularly in developing countries, where it plays a substantial role in generating revenue (Liao & Lien, 2012). With its substantial business volume, telecommunications is recognized as a key industry, evident in ongoing technical advancements and a growing number of operators. Consequently, fierce competition among service providers persists (Gerpott, Rams, & Schindler, 2001), leading to the introduction of new technologies, services, and strategies aimed at attracting new customers and retaining existing ones. The churn rate in this sector is approximately 2.6% monthly (Hawley, 2003). Comparing the return on investment between acquiring a new customer and retaining an existing one reveals that the latter is less expensive (Reinartz & Kumar, 2003; Yang & Peterson, 2004) and generally easier than upselling (Ascarza, Iyengar, & Schleicher, 2016). Therefore, customer retention is recognized as the most profitable strategy (Qureshi, Rehman, Qamar, Kamal, & Rehman, 2013; Wei & Chiu, 2002) and can positively influence the company's reputation, reducing marketing costs for new customer acquisition (Bolton & Bronkhorst, 1995; Reichheld & Sasser, 1990). Thorough research on customer churn is therefore desirable, and proactive measures taken in response by decision makers can provide a competitive edge in this competition.

The primary goal of churn prediction is to support the creation of client retention plans in a highly competitive market. Churn models are built to predict which customers are likely to quit of their own will and to spot early signs of churn (Wei & Chiu, 2002). For this, companies must leverage their databases as valuable assets to comprehend customer churn behavior (Coussement & Van den Poel, 2008). Fundamentally, these databases contain information on customer service usage, billing details, and satisfaction levels. In addition to predicting customers likely to switch, companies seek to understand churn causes, which aids in profiling churn-prone customers and devising effective retention campaigns (Leung, Pazdor, & Souza, 2021). Effective churn modeling has two important components: (i) predicting whether a specific customer will churn, and (ii) discovering the reasons behind their churn, at either a local or a global level. While
∗ Corresponding author.
E-mail addresses: [email protected] (S.S. Poudel), [email protected] (S. Pokharel), [email protected]
(M. Timilsina).
https://doi.org/10.1016/j.mlwa.2024.100567
Received 9 February 2024; Received in revised form 28 March 2024; Accepted 19 June 2024
Available online 24 June 2024
2666-8270/© 2024 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
S.S. Poudel et al. Machine Learning with Applications 17 (2024) 100567
much of the existing research predominantly focuses on the first aspect, treating churn prediction as a binary classification task and employing various machine learning techniques around it, such as feature extraction (Zhao, Gao, Dong, Dong, & Dong, 2017), feature selection (Umayaparvathi & Iyakutti, 2017), treatment of imbalanced datasets (Fujo et al., 2022), and classifiers like SVC (Cortes & Vapnik, 1995), Logistic Regression (Hosmer, Lemeshow, & Sturdivant, 2013), Random Forest (Breiman, 2001), XGBoost (Friedman, 2001), and Neural networks (Goodfellow, Bengio, & Courville, 2016). However, this alone may not suffice to fully grasp customer behavior, and these approaches ignore the second important component: such models cannot explain the reasons behind churning.

This study aims to close this research gap in churn prediction by focusing not only on forecasting whether a certain customer will churn, but also on the reasons why. For the reasoning, we adapt SHapley Additive exPlanations (SHAP) to explain machine learning predictions by identifying influential customers from the training set (Lundberg & Lee, 2017a). The specific research questions (RQs) investigated are:

• What are the best available off-the-shelf machine learning algorithms for predicting customer churn?
  1. Which classification algorithm performs best for churn prediction in terms of different evaluation metrics?
  2. Is there a significant difference in the predictions made by these classifiers?
• How can we explain the factors responsible for customer churn?
  1. What are the most important predictors, and how do they influence prediction performance?
  2. Is there any interaction between the churn predictors?

Contributions: Our contributions are summarized as follows:

• We rigorously compared state-of-the-art supervised machine learning algorithms for churn prediction.
• We performed statistical tests to find the most significant model for churn prediction.
• We provide explanations for each predictor corresponding to customer churn, highlighting both positive and negative contributions to churn prediction.

To the best of our knowledge, our approach is the first to generate global and/or local explanations for churn prediction. We conducted rigorous experiments to evaluate tabular machine learning algorithms using different evaluation metrics and to choose the most significant model.

The remainder of the paper is organized as follows: related work, problem definition, method description, experiments, and conclusion.

2. Related work

Recently, data mining techniques have emerged to tackle the challenging problems of customer churn in the telecommunication service field (Au, Chan, & Yao, 2003; Hadden, Tiwari, Roy, & Ruta, 2007). As one of the important measures to retain customers, churn prediction has been a concern in the telecommunication industry and research (Bin, Peiji, & Juan, 2007). The majority of research on churn prediction has been dedicated to voice services available over mobile and fixed-line networks. In most cases, the features used for churn prediction in the mobile telecommunication industry include customer demographics, contractual data, customer service logs, call details, complaint data, and bill and payment information (Bin et al., 2007; Hadden et al., 2007). However, the information available to land-line service providers differs from that of mobile services (Bin et al., 2007). Some of this data is missing, less reliable, or incomplete for land-line communication service providers. For instance, customer ages, complaint data, and fault reports are unavailable, and only the call details of a few months are available. Due to business confidentiality and privacy, there are no public datasets for churn prediction (Huang, Kechadi, & Buckley, 2012).

Customer churn prediction models have demonstrated significant value beyond telecommunications, notably within industries like digital marketing, e-commerce, and banking, where understanding and mitigating churn is equally critical. In digital marketing, the application of churn models facilitates the optimization of customer engagement and retention strategies. For instance, Ascarza (2018) delves into how digital marketing efforts can be tailored to retain customers showing signs of churn, offering insights into the effectiveness of targeted interventions. In the banking sector, Miguéis, Van den Poel, Camanho, and e Cunha (2012) apply churn prediction to understand and predict customer churn concerning specific banking products and services. These references collectively highlight the broad applicability of churn prediction models across various industries, emphasizing their potential to inform and refine customer retention strategies in diverse business contexts.

The churn analysis and prediction task is also tackled from a statistical modeling perspective. A very popular approach to model churn is time-to-event prediction (Bhattacharya, 1998; Van den Poel & Lariviere, 2004). In the context of customer attrition, the time to failure links to the churn behavior. Potential churner behavior has also been considered using structural equation modeling (Nguyen & LeBlanc, 1998; Varki & Colgate, 2001). Such techniques can be of great interest for managerial decisions, as they evaluate the effect of suspected influential features on a specific customer decision, such as churn (Geiler, Affeldt, & Nadif, 2022). Variance analysis has also been widely used in marketing and business areas to uncover customer behavior (Maxham, 2001; Mittal & Kamakura, 2001; Zeithaml, Berry, & Parasuraman, 1996). Financial and retail services also rely on classical T-test and Chi-square statistics to forecast customer behavior and perceptions (Hitt & Frei, 2002; Mittal & Lassar, 1998). The churn prediction problem has one important issue of class imbalance (Kong, Kowalczyk, Menzel, & Bäck, 2020), which can bias models towards the negative samples and hinder the training of machine learning models (Zhu, Baesens, & vanden Broucke, 2017). Typically, this problem occurs when the classes in a given dataset are unequally distributed between the minority and majority classes, that is, a low number of "churners" compared to "non-churners". Without considering this problem, effective learning by classification algorithms will be a challenge, since the main goal is the detection of the minority class (Dwiyanti et al., 2016; Sun, Wong, & Kamel, 2009). Popular algorithms like k-nearest neighbors (k-NN) have also been applied to churn-like data; however, studies (Dubey & Pudi, 2013; Tan, 2005) have shown several significant drawbacks. In the context of class imbalance in the churn prediction problem, the Naive Bayes classifier also appears to be sensitive due to the strong bias in the prior estimation (Bermejo, Gámez, & Puerta, 2011). However, Huang et al. (2012) demonstrated reasonable results using the Naive Bayes method.

Earlier studies have provided various customer churn models, analyzing them based on customer behavior data and different data mining techniques (Moayer & Gardner, 2012; Naz, Shoaib, & Shahzad Sarfraz, 2018; Pushpa, 2012). In these studies, churn prediction models were analyzed and the models with the best results were presented. There are various approaches to this: for example, Lazarov and Capota (2007) showed that a model based on the customer's lifetime value analysis is the best way to predict customer churn. Similarly, Naz et al. (2018) and Bandara, Perera, and Alahakoon (2013) analyzed models based on the datasets they used and showed that a big dataset with more features makes model training and evaluation difficult. Hence, this research suggested
focusing on feature selection to reduce the number of features. In terms of machine learning models, the study showed that for true churn rate and false churn rate SVM should be used, and in the case of churn probability, logistic regression should be used. Similarly, Ahmed and Linen (2017) proposed that hybrid models are useful and accurate for churn prediction.

User churn prediction has also been studied from the network science perspective. Recent studies (Ahmad, Jafar, & Aljoumaa, 2019; Huang et al., 2015; Mitrović & De Weerdt, 2020; Xu et al., 2021; Zhang, Zeng, Zhao, Jin, & Li, 2022) showed the effect of social influence on user churn. The techniques to approach this problem are categorized from two perspectives. The first is to model the network structure as a surrogate of social influence. For instance, Ahmad et al. (2019) used social network analysis to extract network-based features for a machine learning model. Similarly, Yang, Shi, Jie, and Han (2018) extracted network features to cluster users into different communities and predict customer churn with a deep learning model. The second is to model the sequential order of churn as a diffusion process and use propagation models such as the inflection and stopping rule (Ji et al., 2021) and spreading propagation activation (Dasgupta et al., 2008) to simulate the diffusion process and give predictions. However, the main caveat is that these approaches fail to capture the causal nature of social influence. There is also a graph-based semi-supervised effort to predict customer churn in telecommunication (Benczúr, Csalogány, Lukács, & Siklósi, 2007). Liu et al. (2018) propose a novel graph-based inductive semi-supervised embedding model that jointly learns the prediction function and the embedding function for user–game interaction to predict user churn from games.

Recent studies have begun to investigate how to use causal information to build better deep learning models (Bonner & Vasile, 2018; Yoon, Jordon, & Van Der Schaar, 2018). This includes applications to eliminate the bias between the observed data and the application scenarios and to learn causal effects for more accurate churn predictions (Johansson, Shalit, & Sontag, 2016). The studies by Umayaparvathi and Iyakutti (2017) demonstrated that deep learning models have performance similar to conventional classifiers such as support vector machines and random forests. Transfer learning, which is very popular in image classification, has also been employed in customer churn prediction (Ahmed et al., 2019). Similarly, Seymen, Dogan, and Hiziroglu (2020) proposed a novel deep learning model which was compared to logistic regression and artificial neural network models. In a similar vein, Momin, Bohra, and Raut (2020) demonstrated that deep learning enables multi-stage models to represent the data at multiple abstraction levels, which reduces the time and effort of feature selection considerably, as it automatically creates useful features for accurate customer churn prediction. In spite of their popularity, deep learning models can still be considered a black box because of their complicated architecture, and there is little visibility into their decision rationale (Colbrook, Antun, & Hansen, 2022). Furthermore, it is also ambitious to recognize problems in a machine learning model, or otherwise find improvements for it, if the model's behavior cannot be understood (Adadi & Berrada, 2018). EXplainable Artificial Intelligence (XAI) (Emmert-Streib, Yli-Harja, & Dehmer, 2020) is a research area that studies how to make models transparent and explainable. Black-box models such as random forests and artificial neural networks require the application of XAI techniques to explain the model recommendation (Leung et al., 2021).

From the studies listed above, we observed that research on customer churn has investigated a wide range of algorithms, from white-box to black-box models. These have good abilities to differentiate between "churn" and "no churn" customers. However, previous studies have not primarily focused on explaining the churn prediction model. Successfully discriminating between these two categories is therefore not the only aspect of utmost importance. For customer churn prediction, understanding the model and its outputs is important as well, in order to target incentives to customers who have a high risk of churning and induce them to stay. Thus, in this work, we exploit the power of XAI to uncover local and global explanations of churn prediction. In particular, these explanations will enable domain experts to understand the machine learning reasoning behind customer churn prediction. From a global explanation, one can learn the most important patterns learned by the machine learning model about the training population, and it helps to understand the interaction between the confounding predictors. From a local explanation, one can follow the reasoning that the model applied to a particular case, to answer very specific questions such as "Why did customer Alex churn?" and "Why has Jane continued to subscribe to the plan?".

3. Solution approach

The overall solution of our approach is illustrated in Fig. 1. The main aim of this study is to assess machine learning classifiers to predict customer churn and to provide local and global explainability for those predictions. In the next section, we explain the methodology of our approach.

4. Methods

Fig. 1 depicts the methodology of the proposed model approach for churn prediction. The step-wise working of the methodology is described below:

• Dataset: The input to the model is the Telecommunication dataset in any tabular format. The dataset used in the paper is from Kaggle. The dataset contains missing data that requires cleaning; for this, the dataset is passed to the data preparation and preprocessing steps.
• Selection Criteria: The Telecommunication dataset consists of data on both churners and non-churners. Some fields might contain missing values as well. Such data should be handled before the data are fed into the model. Thus, in this step, missing values are dropped.
• Feature Engineering and feature selection: The raw datasets need to be handled before being fed to the classifiers. The input datasets contain duplicate columns and unique-value columns as well. Such data do not provide any significance for churn prediction, and thus these columns are dropped.
• Encoding: The Telecommunication dataset consists of both numeric and categorical data. However, not all machine learning models work with categorical data, so numeric conversion of the data needs to be done before applying the ML models. For handling such categorical data, the one-hot encoding technique is implemented in the model. This leads to an increase in the number of columns of the dataset.
• Hyperparameter selection: This step covers the optimization of hyperparameters across the diverse machine learning models deployed for predicting customer churn within the telecommunications sector. These models are characterized by a multitude of hyperparameters, each necessitating precise calibration to enhance model efficacy.
• Training Models: In our methodology, we have incorporated a suite of state-of-the-art classification algorithms to ensure robust and accurate modeling. This includes the utilization of the SVM (Cortes & Vapnik, 1995), known for its effectiveness in high-dimensional spaces, and LR (Hosmer et al., 2013), a staple for binary classification problems. Additionally, we have leveraged the Random Forest Classifier (Breiman, 2001), which excels in handling large datasets with numerous features. The GBM (Friedman, 2001) has been selected for its prowess in predictive accuracy, combining multiple weak prediction models into a strong one. Lastly, Neural Networks (Goodfellow et al., 2016) have been implemented for their unparalleled capacity to learn from complex data patterns through layers of interconnected nodes, making our approach comprehensive and powerful.
Fig. 1. An illustration of the data processing, model training, evaluation and explainer models on the customer churn data.
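The data-preparation stages of this pipeline (dropping rows with missing values, removing identifier columns that carry no signal, and one-hot encoding categorical predictors) can be sketched as below. The toy table and its column names are assumptions standing in for the Kaggle Telco churn file:

```python
# Sketch of the preprocessing steps from the methodology on a tiny
# stand-in for the Telecommunication dataset.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

df = pd.DataFrame({
    "Tenure_Months":   [1, 34, 2, 45, None, 8],
    "Contract":        ["Month-to-month", "One year", "Month-to-month",
                        "Two year", "Month-to-month", "One year"],
    "Monthly_Charges": [29.85, 56.95, 53.85, 42.30, 70.70, 99.65],
    "Customer_ID":     ["a1", "b2", "c3", "d4", "e5", "f6"],  # identifier
    "Churn":           ["No", "No", "Yes", "No", "Yes", "Yes"],
})

# Selection criteria: drop rows with missing values.
df = df.dropna()

# Feature selection: drop categorical columns whose values are all unique
# (identifiers provide no signal for churn prediction).
unique_cols = [c for c in df.select_dtypes("object").columns
               if df[c].nunique() == len(df)]
df = df.drop(columns=unique_cols)

# Encoding: one-hot encode the remaining categoricals; this widens the
# table (e.g. Contract becomes three indicator columns).
y = (df.pop("Churn") == "Yes").astype(int)
X = pd.get_dummies(df)

# Training: any of the assessed classifiers can now consume X.
model = GradientBoostingClassifier(random_state=0).fit(X, y)
print(X.columns.tolist())
```

Restricting the uniqueness check to string columns avoids accidentally discarding continuous numeric features, which are often all-unique as well.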
Table 2
Model hyperparameters.

Model                 Hyperparameter        Tuning range
SVC                   C                     0.001641949 – 464.0812108
Logistic Regression   C                     5.15E−05 – 4534347.358
Random Forest         max-depth             9 – 20
                      n-estimators          14 – 20
GBM                   max-depth             5 – 29
                      min-samples-leaf      5 – 10
                      max-features          auto
                      max-leaf-nodes        3 – 7
Neural networks       hidden-layer-sizes    5 – 9
                      activation            relu
                      solver                adam
AdaBoost              n-estimators          50 – 500
                      learning-rate         0.01 – 1.0
XGBoost               n-estimators          100 – 1000
                      learning-rate         0.01 – 0.3
                      max-depth             3 – 10

Table 3
Summary of the dataset.

Number of samples                   7043
Number of features                  30
% of positive samples (Churn)       26.54%
% of negative samples (Non-Churn)   73.46%
Data source                         Kaggle

total number of services a customer utilizes, including Phone_Service, Multiple_Lines, and Internet_Service, among others. This feature reflects the depth of product penetration and serves as an indicator of potential customer satisfaction. Higher service utilization often suggests that customers find value in a wider range of services, potentially increasing their loyalty and decreasing their likelihood of churn.

6. Results

6.1. Model hyperparameter tuning

Table 2 illustrates the optimization of hyperparameters across the diverse machine learning models deployed for predicting customer churn within the telecommunications sector. These models are characterized by a multitude of hyperparameters, each necessitating precise calibration to enhance model efficacy. Detailed in the table are the hyperparameter tuning ranges, alongside the specific hyperparameters selected for each model, underscoring their pivotal role in refining model performance.

6.2. Experiments

Table 3 presents the specifications of the dataset employed in our studies. The use of detailed telecommunication data poses substantial challenges, primarily due to rigorous privacy regulations and proprietary limitations, which significantly hinder external analytical endeavors and innovative developments. Kaggle¹ mitigates these constraints by providing anonymized datasets, thereby ensuring adherence to privacy standards while simultaneously facilitating the extraction of valuable analytical insights. The platform's dynamic community further promotes a culture of collaboration and knowledge exchange, catalyzing the development of novel solutions for intricate sector-specific issues such as churn prediction. Consequently, our study leverages this publicly accessible data to train our models and derive predictive insights.

To assess the performance of the state-of-the-art classifier models, we utilized a comprehensive set of evaluation metrics, including Accuracy, Precision, Recall, F1-score, Receiver Operating Characteristic (ROC) curve, and Precision–Recall (PR) score. Table 4 summarizes the performance metrics of the various machine learning models used for churn prediction. Each evaluation metric is accompanied by a mean value and a standard deviation (±), indicating the variability of the model's performance. The models are ranked by their Accuracy, with GBM showing the highest Accuracy of 0.81 ± 0.02 and Neural Networks the lowest at 0.74 ± 0.06. The ROC-score follows a similar trend, with GBM having the highest score. The PR-score is also highest for GBM, suggesting its superior performance across various aspects of the churn prediction task in this evaluation. The data presented in the table reveal that the GBM model exhibits superior performance compared to the other models.

We used the Wilcoxon signed-rank test to determine whether there is a significant difference in the predictive power of GBM compared to each of the other models when applied to the same churn prediction task. This allows for a fair assessment of whether GBM's predictive ability is statistically better or not, providing a rigorous validation for model selection.

The test results shown in Table 5 demonstrate that GBM significantly outperforms several other supervised machine learning models in the context of churn prediction for this specific dataset. AdaBoost, with a p-value of 0.05, indicates that its difference in performance compared to GBM is on the threshold of statistical significance, suggesting a competitive but slightly less effective model than GBM in this context. XGBoost's p-value of 0.07, slightly above the conventional threshold for statistical significance, suggests that while it may offer strong predictive capabilities, it does not statistically outperform GBM to a significant degree on this dataset. Both Neural Networks and Logistic Regression, with p-values well below the 0.05 threshold, demonstrate a statistically significant difference in performance compared to GBM, indicating GBM's superior capabilities in churn prediction. The SVC's performance, with a p-value marginally above the threshold, and Random Forest, with a higher p-value, suggest a less significant difference compared to GBM, underscoring GBM's robustness and effectiveness as a churn prediction tool. This comprehensive comparison underscores the importance of selecting the right model based on the dataset's specific characteristics and the predictive task at hand. While GBM shows strong performance, the nuanced differences between models highlight the potential benefits of model ensembling or further hyperparameter tuning to optimize predictive accuracy.

To further understand the effectiveness of the GBM, we utilized a confusion matrix to examine its predictive accuracy and identify the areas where the model may be making errors.

Table 6 presents the confusion matrix for the GBM model, a key tool in our churn prediction analysis. The matrix indicates that the model is highly effective at identifying customers who will remain with the service, as evidenced by the 466 true negatives. However, it also points to a notable challenge in the form of 84 false negatives, which represent customers who were predicted to stay but actually churned. While the model successfully identified 103 actual churners (true positives), it incorrectly flagged 51 loyal customers as likely to churn (false positives), suggesting a need for refinement. The GBM model's strong suit is its ability to recognize stable customers, a vital aspect of preserving a customer base and avoiding the costs associated with unwarranted retention incentives. Yet, its tendency to overlook some churners could lead to substantial customer loss if not addressed. Improving the model's sensitivity, to capture more true churn cases, and its precision, to reduce the mistaken identification of loyal customers as churners, emerges as a critical focus for advancing its utility in practical

¹ https://www.kaggle.com/
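The cell counts of the GBM confusion matrix in Table 6 (TN = 466, FP = 51, FN = 84, TP = 103) map directly onto the headline metrics; a quick arithmetic check reproduces, to two decimals, the GBM row of Table 4:

```python
# Deriving the standard metrics from the Table 6 confusion matrix.
tn, fp, fn, tp = 466, 51, 84, 103

accuracy  = (tp + tn) / (tp + tn + fp + fn)   # fraction of correct calls
precision = tp / (tp + fp)                    # flagged churners who churned
recall    = tp / (tp + fn)                    # actual churners caught
f1        = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, "
      f"recall={recall:.2f}, f1={f1:.2f}")
# -> accuracy=0.81, precision=0.67, recall=0.55, f1=0.60
```

The low recall (0.55) is exactly the "overlooked churners" weakness discussed above: 84 of the 187 actual churners are missed.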
Table 4
Results of the 10-fold cross-validation of the supervised machine learning classification models for churn prediction. The value after ± is the standard deviation.
Models Accuracy Precision Recall F1-score ROC-score PR-score
Neural networks 0.74 ± 0.06 0.58 ± 0.26 0.43 ± 0.31 0.41 ± 0.21 0.83 ± 0.02 0.64 ± 0.03
SVC 0.78 ± 0.01 0.68 ± 0.03 0.34 ± 0.02 0.45 ± 0.02 0.77 ± 0.02 0.57 ± 0.04
Logistic Regression 0.79 ± 0.02 0.64 ± 0.04 0.47 ± 0.06 0.54 ± 0.05 0.81 ± 0.03 0.61 ± 0.04
AdaBoost 0.79 ± 0.01 0.65 ± 0.02 0.50 ± 0.06 0.57 ± 0.03 0.82 ± 0.01 0.63 ± 0.02
XGBoost 0.80 ± 0.03 0.68 ± 0.01 0.55 ± 0.02 0.61 ± 0.03 0.85 ± 0.02 0.67 ± 0.02
Random Forest 0.80 ± 0.02 0.71 ± 0.04 0.43 ± 0.08 0.53 ± 0.07 0.84 ± 0.01 0.64 ± 0.03
GBM 0.81 ± 0.02 0.67 ± 0.04 0.55 ± 0.03 0.60 ± 0.02 0.86 ± 0.01 0.68 ± 0.03
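The paired comparison behind Table 5 can be sketched with scipy: each model's per-fold score is paired with GBM's score on the same fold. The fold accuracies below are illustrative stand-ins, not the paper's actual numbers:

```python
# Sketch: Wilcoxon signed-rank test on paired per-fold accuracies,
# comparing GBM against one baseline (here, Logistic Regression).
from scipy.stats import wilcoxon

gbm_folds = [0.82, 0.80, 0.81, 0.83, 0.79, 0.81, 0.82, 0.80, 0.81, 0.82]
lr_folds  = [0.79, 0.78, 0.80, 0.80, 0.77, 0.79, 0.80, 0.78, 0.79, 0.80]

# The test is paired: fold i of GBM is matched with fold i of the baseline,
# so fold difficulty cancels out and only the score differences are ranked.
stat, p = wilcoxon(gbm_folds, lr_folds)
print(f"statistic={stat}, p={p:.4f}")
if p < 0.05:
    print("GBM's fold scores differ significantly from the baseline's.")
```

Because every fold favors GBM in this toy example, the statistic (the smaller rank sum) is 0 and the p-value falls below 0.05; repeating the test once per baseline yields a table of the same shape as Table 5.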
Table 5
Wilcoxon signed-rank test.

Models                Statistic   p-value
Neural Networks       0.0         0.03125
SVC                   1.0         0.0625
Logistic Regression   0.0         0.03125
AdaBoost              2.0         0.05
XGBoost               1.5         0.07
Random Forest         3.0         0.15625

Table 6
Confusion matrix analysis for GBM.

                              Prediction outcome
                              Non-churners   Churners
Actual value   Non-churners   466            51
               Churners       84             103

business scenarios. These enhancements are imperative for tailoring customer retention strategies more effectively and securing a healthier churn rate, thereby improving the business's financial performance and customer satisfaction.

Qualitative Benchmark with Other State-Of-The-Art Models: In Table 7, we introduce an innovative approach to customer churn prediction, leveraging Gradient Boosting Machines (GBM) to analyze the Kaggle customer churn prediction dataset. Our methodology achieved a ROC-Score of 0.86, positioning it competitively among state-of-the-art methods in churn prediction for the telecommunications industry. Notably, Ebrah et al.'s use of SVM on both the IBM Watson dataset and the cell2cell dataset resulted in ROC-Scores of 0.83 and 0.99, respectively, indicating a high benchmark for model performance in varied contexts (Ebrah & Elnasir, 2019). Similarly, Shrestha et al. demonstrated the efficacy of XGBoost in achieving a ROC-Score of 0.98 with data from a telecom service provider in Nepal (Shrestha & Shakya, 2022), while Saha et al. utilized CNN and ANN models to reach a ROC-Score of 0.99 across datasets from both Southeast Asian and American telecom markets (Saha et al., 2023). These findings underscore the significant advancements in churn prediction methodologies, with SVM, XGBoost, CNN, and ANN models setting high standards for accuracy and reliability. Our GBM-based approach contributes to this evolving landscape not only by achieving a commendable ROC-Score but also by emphasizing the adaptability and effectiveness of GBM models in handling the complexities of customer churn prediction. This comparative analysis highlights our model's potential in bridging the gap between traditional machine learning techniques and the demands of modern-day churn prediction challenges.

6.2.1. Selection of most important predictors

Fig. 2 presents a beeswarm plot generated using SHAP values, which delineates the influence of various features on the GBM model's output. Each point is colored by the corresponding feature's values, with red signifying higher values. The horizontal spread of the dots reflects the magnitude of each feature's SHAP value; points to the right of the central vertical line indicate a feature's propensity to increase the likelihood of churn, while points to the left suggest a decrease. Notably, features such as 'Internet_Service_Fiber optic' and 'Payment_Method_Electronic check' predominantly contribute positively to churn predictions, whereas features like 'Online_Security_No', 'Dependents_Yes', and 'Tech_Support_No' display a mixture of positive and negative effects on the model's predictions. In the next section, we demonstrate the top two features ranked by the GBM, 'Contract_Month-to-month' and 'Tenure_Months', and their interaction with the other features in the data.

6.2.2. Interaction between the churn predictors

Fig. 3 visualizes the relationship between month-to-month contracts and the provision of fiber optic internet service in the context of customer churn. The red dots represent customers who have churned (discontinued their service), and the blue dots represent those who have not churned (continued their service). The x-axis differentiates customers based on their contract type, with a particular focus on month-to-month contracts. The y-axis measures a standardized metric related to churn, such as a probability or a churn score. From the plot, we can observe a higher density of red dots at the higher end of the month-to-month contract axis, indicating that customers with month-to-month contracts and fiber optic internet service are more likely to churn. Conversely, more blue dots are concentrated towards the lower end of the axis, suggesting that customers without fiber optic service or with longer contract terms are less likely to churn. This implies an interaction whereby the likelihood of churn is amplified for customers who have fiber optic service on a month-to-month basis, compared to those without such service or with more extended contracts.

Fig. 4 illustrates the relationship between customer tenure, measured in months on the x-axis, and the amount they are charged monthly, represented by the color intensity of the dots, with magenta indicating higher charges and blue indicating lower charges. The y-axis shows a standardized value metric, which might represent customer satisfaction or likelihood of churn. The pattern suggests that customers with shorter tenure and higher monthly charges (magenta dots) experience a more substantial negative impact on the standardized value metric, which could indicate lower satisfaction or higher churn risk. As tenure increases, the density of magenta dots diminishes, particularly beyond the 20-month mark, suggesting that customers with higher monthly charges either improve in their standardized value metric or possibly churn out of the service, leaving behind those more satisfied or less sensitive to the charge amount. The convergence of magenta and blue dots as tenure increases indicates that the impact of monthly
churn predictions. The plot reveals that the ‘Contract_Month-to-month’, charges on the standardized metric decreases over time. Customers
‘Tenure_Months’, and ‘Monthly Charges’ features exert the most sub- with longer tenure, irrespective of their monthly charges, show similar
stantial impact on the model’s output, with the ‘Contract_Month-to- values of the standardized metric, which could imply that the initial
month’ feature, in particular, strongly pushing predictions towards sensitivity to pricing diminishes, or that the remaining customer base
churn. A gradation from blue to red denotes the range of feature has adapted to or accepted the monthly charges.
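The modeling-and-explanation pipeline discussed above — a gradient boosting classifier whose held-out discrimination is measured by ROC-AUC and whose predictors are then ranked by a global importance measure — can be sketched as below. This is an illustrative sketch on synthetic data, not the authors’ code: the feature names and hyperparameters are placeholders, and the ranking shown uses scikit-learn’s built-in `feature_importances_`; a SHAP beeswarm like Fig. 2 would be produced analogously by feeding the fitted model to the `shap` package’s `TreeExplainer` and `shap.plots.beeswarm`.

```python
# Sketch of a GBM churn pipeline: train, score with ROC-AUC, rank features.
# Synthetic stand-in data; the paper's Kaggle dataset is not reproduced here.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Placeholder names echoing the paper's top predictors (illustrative only).
feature_names = ["Contract_Month-to-month", "Tenure_Months", "Monthly Charges",
                 "Internet_Fiber-optic", "Total Charges"]

# Imbalanced binary target, roughly mimicking a minority churn class.
X, y = make_classification(n_samples=2000, n_features=5, n_informative=3,
                           weights=[0.73, 0.27], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=0)

gbm = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                 max_depth=3, random_state=0)
gbm.fit(X_train, y_train)

# Discrimination on held-out data (the paper reports a ROC-Score of 0.86
# on its own dataset; this synthetic value will differ).
auc = roc_auc_score(y_test, gbm.predict_proba(X_test)[:, 1])

# Global importance ranking; with the shap package this step would be
# shap.TreeExplainer(gbm)(X_test) passed to shap.plots.beeswarm.
ranking = sorted(zip(feature_names, gbm.feature_importances_),
                 key=lambda t: t[1], reverse=True)
print(f"ROC-AUC: {auc:.3f}")
for name, imp in ranking:
    print(f"{name}: {imp:.3f}")
```

The same fitted `gbm` object is what both the performance evaluation and the explanation step consume, which is why SHAP-based plots can be added without retraining.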
S.S. Poudel et al. Machine Learning with Applications 17 (2024) 100567
Table 7
Performance comparison of various models on telecom customer churn prediction, highlighting our GBM approach.

Reference                       | Dataset                                                      | Evaluation metric                 | Model
Yabas, Cankaya, and Ince (2012) | Orange Telecom                                               | ROC-Score (0.653)                 | Random Forest
Ebrah and Elnasir (2019)        | IBM Watson dataset                                           | ROC-Score (0.83)                  | SVM
Ebrah and Elnasir (2019)        | cell2cell                                                    | ROC-Score (0.99)                  | SVM
Shrestha and Shakya (2022)      | Telecom service provider of Nepal                            | ROC-Score (0.98)                  | XGBoost
Saha et al. (2023)              | Southeast Asian telecom industry and American telecom market | ROC-Score (0.99) in both datasets | CNN and ANN
Our approach                    | Kaggle customer churn prediction                             | ROC-Score (0.86)                  | GBM
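The significance claim behind these comparisons — that GBM’s advantage over the other classifiers is not noise — rests on a Wilcoxon signed-rank test over paired performance scores. A minimal sketch of that test with `scipy.stats.wilcoxon` follows; the per-fold scores are made-up placeholders, not the paper’s measurements.

```python
# Paired Wilcoxon signed-rank test comparing two models' cross-validation
# scores, as the paper does for GBM versus each competing classifier.
# The fold scores below are illustrative placeholders.
from scipy.stats import wilcoxon

gbm_scores   = [0.86, 0.84, 0.87, 0.85, 0.86, 0.83, 0.88, 0.85, 0.84, 0.86]
other_scores = [0.81, 0.80, 0.83, 0.79, 0.82, 0.78, 0.84, 0.80, 0.79, 0.81]

# Null hypothesis: the paired score differences are symmetric about zero,
# i.e. neither model systematically outperforms the other.
stat, p_value = wilcoxon(gbm_scores, other_scores)
print(f"statistic={stat}, p-value={p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: the performance difference is statistically significant.")
```

Because the test is paired (both models are scored on the same folds) and rank-based, it makes no normality assumption about the score differences, which is why it is a common choice for comparing classifiers.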
7. Discussion
The combination of GBM and SHAP explanations thus provided a powerful tool for telecom operators. Not only could they accurately predict which customers were at risk of churning, but they could also understand the underlying factors contributing to these predictions. This understanding facilitates the development of targeted strategies to retain specific customer segments, enhancing the efficiency of marketing efforts and potentially improving customer satisfaction. Incorporating these insights into business operations could lead to more nuanced customer segmentation and more effective churn prevention initiatives. For instance, identifying at-risk customers based on their usage patterns and service preferences enables the deployment of tailored communication strategies and personalized offers, thereby fostering customer engagement and loyalty.

Our work’s core contribution lies in enhancing the interpretability of machine learning (ML) models for customer churn prediction, particularly through the use of SHapley Additive exPlanations (SHAP) values. The creation of unique features before data classification indeed presents a valuable avenue for research; however, it poses substantial challenges, including the need for deep domain expertise, limitations posed by data availability and quality, the balance between model complexity and interpretability, and the risk of overfitting. Our study focuses on leveraging existing, well-understood features and enriching the analysis with detailed interpretability to provide actionable insights. This approach not only aids telecom providers in identifying and addressing churn risks but also maintains the model’s generalizability and robustness, carefully navigating the complexities inherent in feature engineering.

8. Conclusion

In the telecom sector, accurately predicting which customers are likely to leave the service is crucial. The ability to identify at-risk customers early on allows companies to intervene with targeted retention strategies. Machine learning models, particularly those that handle tabular data, are key to making these predictions. These models analyze customer data and can effectively forecast who might churn. This predictive power is essential for reducing churn rates, which is a persistent problem for telecom providers. Our research found that the GBM model was especially effective on this data. To confirm GBM’s performance, we compared it with other advanced models using the Wilcoxon signed-rank test. The test results showed that GBM was significantly better at predicting churn. The 𝑝-value from the test helped us understand the strength of this evidence. A lower 𝑝-value indicates a more definitive difference between the models, and in our case, GBM’s lower 𝑝-value confirmed its superior predictive ability. Similarly, we leveraged the SHAP (SHapley Additive exPlanations) values to gain insights into the importance of different features in our predictive model. This information is invaluable for telecom companies looking to pinpoint the factors that most influence customer churn. By utilizing SHAP values, we were able to identify which specific customer attributes, such as call duration, plan type, or contract length, had the most significant impact on the churn prediction. These insights helped telecom providers tailor their retention efforts towards addressing the key factors driving customer attrition. SHAP values provided a transparent and interpretable way to analyze the model’s decision-making process, making it a valuable tool for optimizing customer retention strategies in the telecommunications sector.

Funding

This work received no funding.

Ethical approval

All data used in this work is freely available online. No other aspect of this work causes ethical issues.

CRediT authorship contribution statement

Sumana Sharma Poudel: Conducted experiments, Analysed the results, Prepared the original draft. Suresh Pokharel: Revised the original draft. Mohan Timilsina: Provided the guidance, Revised the manuscript.

Declaration of competing interest

We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.

Data availability

Data will be made available on request.

Acknowledgments

We would like to thank the Data Science Institute, Insight Center for Data Analytics, at University of Galway, Ireland, for their constructive feedback, which improved the manuscript.

References

Adadi, A., & Berrada, M. (2018). Peeking inside the black-box: a survey on explainable artificial intelligence (XAI). IEEE Access, 6, 52138–52160.
Ahmad, A. K., Jafar, A., & Aljoumaa, K. (2019). Customer churn prediction in telecom using machine learning in big data platform. Journal of Big Data, 6(1), 1–24.
Ahmed, U., Khan, A., Khan, S. H., Basit, A., Haq, I. U., & Lee, Y. S. (2019). Transfer learning and meta classification based deep churn prediction system for telecom industry. arXiv preprint arXiv:1901.06091.
Ahmed, A., & Linen, D. M. (2017). A review and analysis of churn prediction methods for customer retention in telecom industries. In 2017 4th international conference on advanced computing and communication systems (pp. 1–7). IEEE.
Ascarza, E. (2018). Retention futility: Targeting high-risk customers might be ineffective. Journal of Marketing Research, 55(1), 80–98.
Ascarza, E., Iyengar, R., & Schleicher, M. (2016). The perils of proactive churn prevention using plan recommendations: Evidence from a field experiment. Journal of Marketing Research, 53(1), 46–60.
Au, W.-H., Chan, K. C., & Yao, X. (2003). A novel evolutionary data mining algorithm with applications to churn prediction. IEEE Transactions on Evolutionary Computation, 7(6), 532–545.
Bandara, W., Perera, A., & Alahakoon, D. (2013). Churn prediction methodologies in the telecommunications sector: A survey. In 2013 international conference on advances in ICT for emerging regions (pp. 172–176). IEEE.
Benczúr, A. A., Csalogány, K., Lukács, L., & Siklósi, D. (2007). Semi-supervised learning: A comparative study for web spam and telephone user churn. In Graph labeling workshop in conjunction with ECML/PKDD. Citeseer.
Bermejo, P., Gámez, J. A., & Puerta, J. M. (2011). Improving the performance of Naive Bayes multinomial in e-mail foldering by introducing distribution-based balance of datasets. Expert Systems with Applications, 38(3), 2072–2080.
Bhattacharya, C. (1998). When customers are members: Customer retention in paid membership contexts. Journal of the Academy of Marketing Science, 26(1), 31–44.
Bin, L., Peiji, S., & Juan, L. (2007). Customer churn prediction based on the decision tree in personal handyphone system service. In 2007 international conference on service systems and service management (pp. 1–5). IEEE.
Bolton, R. N., & Bronkhorst, T. M. (1995). The relationship between customer complaints to the firm and subsequent exit behavior. ACR North American Advances.
Bonner, S., & Vasile, F. (2018). Causal embeddings for recommendation. In Proceedings of the 12th ACM conference on recommender systems (pp. 104–112).
Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
Colbrook, M. J., Antun, V., & Hansen, A. C. (2022). The difficulty of computing stable and accurate neural networks: On the barriers of deep learning and Smale’s 18th problem. Proceedings of the National Academy of Sciences, 119(12), Article e2107151119.
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 273–297.
Coussement, K., & Van den Poel, D. (2008). Churn prediction in subscription services: An application of support vector machines while comparing two parameter-selection techniques. Expert Systems with Applications, 34(1), 313–327.
Dasgupta, K., Singh, R., Viswanathan, B., Chakraborty, D., Mukherjea, S., Nanavati, A. A., et al. (2008). Social ties and their relevance to churn in mobile telecom networks. In Proceedings of the 11th international conference on extending database technology: advances in database technology (pp. 668–677).
Dubey, H., & Pudi, V. (2013). Class based weighted k-nearest neighbor over imbalance dataset. In Pacific-Asia conference on knowledge discovery and data mining (pp. 305–316). Springer.
Dwiyanti, E., Ardiyanti, A., et al. (2016). Handling imbalanced data in churn prediction using rusboost and feature selection (case study: Pt. telekomunikasi Indonesia regional 7). In International conference on soft computing and data mining (pp. 376–385). Springer.
Ebrah, K., & Elnasir, S. (2019). Churn prediction using machine learning and recommendations plans for telecoms. Journal of Computer and Communications, 7(11), 3. http://dx.doi.org/10.4236/jcc.2019.711003.
Emmert-Streib, F., Yli-Harja, O., & Dehmer, M. (2020). Explainable artificial intelligence and machine learning: A reality rooted perspective. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 10(6), Article e1368.
Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics, 1189–1232.
Fujo, S. W., Subramanian, S., Khder, M. A., et al. (2022). Customer churn prediction in telecommunication industry using deep learning. Information Sciences Letters, 11(1), 24.
Geiler, L., Affeldt, S., & Nadif, M. (2022). A survey on machine learning methods for churn prediction. International Journal of Data Science and Analytics, 1–26.
Gerpott, T. J., Rams, W., & Schindler, A. (2001). Customer retention, loyalty, and satisfaction in the German mobile cellular telecommunications market. Telecommunications Policy, 25(4), 249–269.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press.
Hadden, J., Tiwari, A., Roy, R., & Ruta, D. (2007). Computer assisted customer churn management: State-of-the-art and future trends. Computers & Operations Research, 34(10), 2902–2917.
Hawley, D. (2003). International wireless churn management: research and recommendations. Yankee Group report, (June), URL http://www.ams.com/cme/pdfs/yankeechurnstudy.pdf. (Accessed January 2006).
Hitt, L. M., & Frei, F. X. (2002). Do better customers utilize electronic distribution channels? The case of PC banking. Management Science, 48(6), 732–748.
Hosmer, D. W., Jr., Lemeshow, S., & Sturdivant, R. X. (2013). Applied logistic regression: vol. 398, John Wiley & Sons.
Huang, B., Kechadi, M. T., & Buckley, B. (2012). Customer churn prediction in telecommunications. Expert Systems with Applications, 39(1), 1414–1425.
Huang, Y., Zhu, F., Yuan, M., Deng, K., Li, Y., Ni, B., et al. (2015). Telco churn prediction with big data. In Proceedings of the 2015 ACM SIGMOD international conference on management of data (pp. 607–618).
Ji, H., Zhu, J., Wang, X., Shi, C., Wang, B., Tan, X., et al. (2021). Who you would like to share with? a study of share recommendation in social e-commerce. In Proceedings of the AAAI conference on artificial intelligence, vol. 35, no. 1 (pp. 232–239).
Johansson, F., Shalit, U., & Sontag, D. (2016). Learning representations for counterfactual inference. In International conference on machine learning (pp. 3020–3029). PMLR.
Kong, J., Kowalczyk, W., Menzel, S., & Bäck, T. (2020). Improving imbalanced classification by anomaly detection. In International conference on parallel problem solving from nature (pp. 512–523). Springer.
Lazarov, V., & Capota, M. (2007). Churn prediction. Business Analysis Course. TUM Computer Science, 33, 34.
Leung, C. K., Pazdor, A. G., & Souza, J. (2021). Explainable artificial intelligence for data science on customer churn. In 2021 IEEE 8th international conference on data science and advanced analytics (pp. 1–10). IEEE.
Liao, C.-H., & Lien, C.-Y. (2012). Measuring the technology gap of APEC integrated telecommunications operators. Telecommunications Policy, 36(10–11), 989–996.
Liu, X., Xie, M., Wen, X., Chen, R., Ge, Y., Duffield, N., et al. (2018). A semi-supervised and inductive embedding model for churn prediction of large-scale mobile games. In 2018 IEEE international conference on data mining (pp. 277–286). IEEE.
Lundberg, S. M., & Lee, S.-I. (2017a). A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30.
Lundberg, S. M., & Lee, S.-I. (2017b). A unified approach to interpreting model predictions. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in neural information processing systems 30 (pp. 4765–4774). Curran Associates, Inc.
Maxham, J. G., III (2001). Service recovery’s influence on consumer satisfaction, positive word-of-mouth, and purchase intentions. Journal of Business Research, 54(1), 11–24.
Miguéis, V. L., Van den Poel, D., Camanho, A. S., & e Cunha, J. F. (2012). Modeling partial customer churn: On the value of first product-category purchase sequences. Expert Systems with Applications, 39(12), 11250–11256.
Mitrović, S., & De Weerdt, J. (2020). Churn modeling with probabilistic meta paths-based representation learning. Information Processing & Management, 57(2), Article 102052.
Mittal, B., & Lassar, W. M. (1998). Why do customers switch? The dynamics of satisfaction versus loyalty. Journal of Services Marketing, 12(3), 177–194.
Mittal, V., & Kamakura, W. A. (2001). Satisfaction, repurchase intent, and repurchase behavior: Investigating the moderating effect of customer characteristics. Journal of Marketing Research, 38(1), 131–142.
Moayer, S., & Gardner, S. (2012). Integration of data mining within a strategic knowledge management framework. International Journal of Advanced Computer Science and Applications, 3(8).
Momin, S., Bohra, T., & Raut, P. (2020). Prediction of customer churn using machine learning. In EAI international conference on big data innovation for sustainable cognitive computing (pp. 203–212). Springer.
Naz, N. A., Shoaib, U., & Shahzad Sarfraz, M. (2018). A review on customer churn prediction data mining modeling techniques. Indian Journal of Science and Technology, 11(27), 1–27.
Nguyen, N., & LeBlanc, G. (1998). The mediating role of corporate image on customers’ retention decisions: an investigation in financial services. International Journal of Bank Marketing.
Pushpa, S. (2012). An efficient method of building the telecom social network for churn prediction. International Journal of Data Mining & Knowledge Management Process, 2(3), 31–39.
Qureshi, S. A., Rehman, A. S., Qamar, A. M., Kamal, A., & Rehman, A. (2013). Telecommunication subscribers’ churn prediction model using machine learning. In Eighth international conference on digital information management (pp. 131–136). IEEE.
Reichheld, F. F., & Sasser, W. E. (1990). Zero defections: Quality comes to services. Harvard Business Review, 68(5), 105–111.
Reinartz, W. J., & Kumar, V. (2003). The impact of customer relationship characteristics on profitable lifetime duration. Journal of Marketing, 67(1), 77–99.
Saha, L., et al. (2023). Deep churn prediction method for telecommunication industry. Sustainability, 15(5), 4543.
Seymen, O. F., Dogan, O., & Hiziroglu, A. (2020). Customer churn prediction using deep learning. In International conference on soft computing and pattern recognition (pp. 520–529). Springer.
Shrestha, S. M., & Shakya, A. (2022). A customer churn prediction model using XGBoost for the telecommunication industry in Nepal. Procedia Computer Science, 215, 652–661.
Sun, Y., Wong, A. K., & Kamel, M. S. (2009). Classification of imbalanced data: A review. International Journal of Pattern Recognition and Artificial Intelligence, 23(04), 687–719.
Tan, S. (2005). Neighbor-weighted k-nearest neighbor for unbalanced text corpus. Expert Systems with Applications, 28(4), 667–671.
Umayaparvathi, V., & Iyakutti, K. (2017). Automated feature selection and churn prediction using deep learning models. International Research Journal of Engineering and Technology (IRJET), 4(3), 1846–1854.
Van den Poel, D., & Lariviere, B. (2004). Customer attrition analysis for financial services using proportional hazard models. European Journal of Operational Research, 157(1), 196–217.
Varki, S., & Colgate, M. (2001). The role of price perceptions in an integrated model of behavioral intentions. Journal of Service Research, 3(3), 232–240.
Wei, C.-P., & Chiu, I.-T. (2002). Turning telecommunications call details to churn prediction: a data mining approach. Expert Systems with Applications, 23(2), 103–112.
Xu, F., Zhang, G., Yuan, Y., Huang, H., Yang, D., Jin, D., et al. (2021). Understanding the invitation acceptance in agent-initiated social e-commerce. In Proceedings of the international AAAI conference on web and social media, vol. 15 (pp. 820–829).
Yabas, U., Cankaya, H. C., & Ince, T. (2012). Customer churn prediction for telecom services. In 2012 IEEE 36th annual computer software and applications conference (pp. 358–359). http://dx.doi.org/10.1109/COMPSAC.2012.54.
Yang, C., Shi, X., Jie, L., & Han, J. (2018). I know you’ll be back: Interpretable new user clustering and churn prediction on a mobile social application. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining (pp. 914–922).
Yang, Z., & Peterson, R. T. (2004). Customer perceived value, satisfaction, and loyalty: The role of switching costs. Psychology & Marketing, 21(10), 799–822.
Yoon, J., Jordon, J., & Van Der Schaar, M. (2018). GANITE: Estimation of individualized treatment effects using generative adversarial nets. In International conference on learning representations.
Zeithaml, V. A., Berry, L. L., & Parasuraman, A. (1996). The behavioral consequences of service quality. Journal of Marketing, 60(2), 31–46.
Zhang, G., Zeng, J., Zhao, Z., Jin, D., & Li, Y. (2022). A counterfactual modeling framework for churn prediction. In Proceedings of the fifteenth ACM international conference on web search and data mining (pp. 1424–1432).
Zhao, L., Gao, Q., Dong, X., Dong, A., & Dong, X. (2017). K-local maximum margin feature extraction algorithm for churn prediction in telecom. Cluster Computing, 20, 1401–1409.
Zhu, B., Baesens, B., & vanden Broucke, S. K. (2017). An empirical comparison of techniques for the class imbalance problem in churn prediction. Information Sciences, 408, 84–99.