Student ID Number(s):
Programme: MSc Management
Module: 37989 Digital Business & Business Analytics
Name of Tutor: Moha Delgosha
Leveraging Data Analytics to Drive Retention Strategies in the Telecom Industry
Table of Contents
Chapter 1 – Business Understanding
1.2 Purpose of the Study
Chapter 2 – Data Understanding
Chapter 3- Data Preparation
Chapter 4 - Exploratory Data Analysis
3.1 Univariate Analysis
4.1 Categorical Feature Analysis
4.2 Bivariate and Multivariate Insights
Chapter 5- Modelling
5.1 Model Selection
5.2 Performance Metrics
5.3 Model Evaluation Results
5.4 Model Interpretation and Feature Importance
5.5 Summary of Modeling Results
References
Chapter 1 – Business Understanding
Digital transformation now defines how companies connect with their customers,
particularly in the telecommunications industry, where services are long-term and customer
relationships extend over years. Customers have greater control over their choice of provider
because flexible subscription models, online delivery systems, and personalized user
experiences allow them to switch companies quickly if their current provider fails to satisfy
them (Nalatissifa & Pardede, 2021). Customer churn is therefore an essential business problem
for telecom firms, because customers are leaving their services at increasing rates. IBM
research indicates that using data analytics to predict customer behavior leads to better
retention outcomes. Acquiring a new customer costs a business roughly 5 to 7 times more than
retaining an existing one. Additionally, a modest increase in customer retention can generate
double-digit profit growth. This report investigates churn behavior through the IBM Telco
Customer Churn dataset (BlastChar, 2017). The dataset contains 7,043 customer records covering
the following information:
• Demographic information: gender, senior citizenship, marital status, dependents
• Services signed up: phone, multiple lines, internet, online security, tech support,
streaming TV/movies
• Account information: tenure, contract type (monthly, yearly), payment method, total and
monthly charges
• Churn label: whether the customer left in the last month
This wide variety of features provides a holistic view of the customer journey, from
onboarding to cancellation, making it well suited to behavioral analytics and machine learning
applications.
1.2 Purpose of the Study
The main purpose of this study is to analyze and model customer churn behavior using
business analytics and machine learning techniques. This will help:
• Identify key factors influencing customer attrition
• Predict which customers are most likely to churn
• Provide actionable insights for targeted retention campaigns
These insights help telecom firms design engagement models that move from reactive to
proactive retention, supported by customer segmentation and optimization. The analysis in this
project is based solely on the IBM Telco Customer Churn dataset and comprises several phases.
The first phase is data cleaning, which includes handling missing values, particularly in the
TotalCharges field, and converting categorical data into a numerical form suitable for
analysis. This is followed by Exploratory Data Analysis (EDA), which looks for trends, patterns,
and relationships between variables in the dataset using graphs, charts, and statistical
measures. Feature engineering is then applied to transform features in ways that improve model
performance. In the modeling phase, classification models such as Logistic Regression, Decision
Trees, and Random Forest are trained and evaluated on their ability to predict churn. Finally,
the findings from these models are discussed and translated into business and managerial
strategies aimed at enhancing customer loyalty (Huang et al., 2012). The scope is limited in
that the analysis relies only on the historical data contained in the dataset; it does not
include real-time data streams or integration with other systems, such as customer relationship
management (CRM) systems.
This report aims to leverage customer data to build a churn prediction framework that
supports data-informed digital strategies. The central questions addressed are:
1. What are the strongest predictors of customer churn?
2. How accurately can we classify churners using historical data?
3. How can these insights inform digital customer retention programs?
Chapter 2 – Data Understanding
The dataset used in this project is the IBM Telco Customer Churn dataset, which contains
customer-level records from a telephone service provider. It has 7,043 observations (rows),
each representing a different customer, and 21 columns covering demographic and account data,
services used, and churn status.
Category                    Variables
Target Variable             Churn: indicates whether the customer has left within the last month (Yes/No)
Demographic Features        gender, SeniorCitizen, Partner, Dependents
Customer Account Features   tenure, Contract, PaperlessBilling, PaymentMethod, MonthlyCharges, TotalCharges
Service Features            PhoneService, MultipleLines, InternetService, OnlineSecurity, OnlineBackup,
                            DeviceProtection, TechSupport, StreamingTV, StreamingMovies
Table 1 shows the key variables in the obtained dataset
Source: BlastChar. (2017). Telco Customer Churn. Kaggle.
https://www.kaggle.com/datasets/blastchar/telco-customer-churn
Chapter 3- Data Preparation
To prepare the dataset for modeling and analysis, several cleaning steps were applied to
make the data suitable for machine learning algorithms. Although TotalCharges should be numeric,
it was read as an object (string) column because it contains blank entries. Inspection showed
that these blank values correspond to customers with a tenure of zero months, that is, new
customers. To preserve data accuracy, these records were dropped from the study.
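The cleaning step described above can be sketched as follows. This is a minimal illustration rather than the code used in the study: the four rows are made-up stand-ins for the real file, and the variable names (df, missing, clean) are assumptions.

```python
import pandas as pd

# Illustrative rows only; in practice df would be loaded from the Kaggle CSV.
df = pd.DataFrame({
    "customerID": ["A1", "A2", "A3", "A4"],
    "tenure": [0, 12, 34, 0],
    "MonthlyCharges": [52.55, 29.85, 56.95, 20.25],
    "TotalCharges": [" ", "358.20", "1889.50", " "],  # blanks force a string dtype
})

# Blank strings cannot be parsed as numbers, so errors="coerce" turns them into NaN.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

# Check that every missing value belongs to a zero-tenure (new) customer.
missing = df["TotalCharges"].isna()
all_missing_are_new = bool((df.loc[missing, "tenure"] == 0).all())

# Drop the affected records, as described in the text.
clean = df.dropna(subset=["TotalCharges"]).reset_index(drop=True)
```

Coercing first and dropping afterwards keeps the check on tenure explicit, so the decision to discard the rows can be verified rather than assumed.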
Next, the required data type conversions were performed. The TotalCharges field was
converted from object to float, since it holds numeric values. Although the SeniorCitizen
variable is coded as 1 or 0, it was treated as a categorical rather than a continuous variable
so that it could be interpreted consistently throughout analysis and modeling. The categorical
variables were then encoded for modeling. Binary features such as yes/no and male/female were
mapped to the numerical format {0, 1}. Features with more than two categories, such as
PaymentMethod and Contract, were one-hot encoded using the pandas get_dummies() function. This
approach avoids imposing an artificial ordinal relationship between the category values, which
would distort the information and bias the model.
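A short sketch of the two encoding steps is given below. The rows are illustrative, and the direction of the gender mapping (Female to 0, Male to 1) is an assumption, since the report does not state which value was assigned to which label.

```python
import pandas as pd

# Illustrative rows only, using column names from the dataset.
df = pd.DataFrame({
    "gender": ["Female", "Male", "Male"],
    "Partner": ["Yes", "No", "No"],
    "Contract": ["Month-to-month", "One year", "Two year"],
    "Churn": ["No", "No", "Yes"],
})

# Binary features: explicit mapping to {0, 1} (mapping direction is an assumption).
df["gender"] = df["gender"].map({"Female": 0, "Male": 1})
for col in ["Partner", "Churn"]:
    df[col] = df[col].map({"No": 0, "Yes": 1})

# Multi-category features: one indicator column per category, no implied order.
encoded = pd.get_dummies(df, columns=["Contract"], dtype=int)
```

Each Contract value becomes its own 0/1 column, so the model never sees "Two year" as numerically larger than "One year".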
Chapter 4 - Exploratory Data Analysis
Exploratory Data Analysis (EDA) is a crucial step in understanding the characteristics of
the dataset, uncovering patterns, detecting anomalies, and forming hypotheses for modeling. This
chapter presents the descriptive statistics, visualizations, and key insights derived from the IBM
Telco Customer Churn dataset.
3.1 Univariate Analysis
The univariate analysis serves as the foundational step in understanding the distribution
and behavior of individual variables within the Telco Customer Churn dataset. Beginning with
the target variable, Churn, it was observed that approximately 26.5% of customers have left the
service, compared to 73.5% who have remained.
Figure 1 shows the churn distribution in the Telco Customer Churn dataset
Specifically, out of the 7,043 entries, 1,869 customers churned, while 5,174 did not. This
substantial class imbalance poses a potential challenge for predictive modeling, as most
algorithms tend to favor the majority class, necessitating strategic adjustments such as
resampling or weighted loss functions during model training.
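The arithmetic behind the imbalance, and the class weights that a "balanced" weighting scheme would produce from it, can be checked with a few lines. The counts are those reported above; the weighting formula n_samples / (n_classes * n_samples_in_class) is the common "balanced" heuristic and is shown here as an illustration, not as the exact adjustment used in the study.

```python
# Class counts reported in the text.
n_churn, n_stay = 1869, 5174
n_total = n_churn + n_stay

# Share of customers who churned.
churn_rate = n_churn / n_total

# "Balanced" weights: n_samples / (n_classes * n_samples_in_class).
# The rarer class receives the larger weight.
weights = {
    "stay": n_total / (2 * n_stay),
    "churn": n_total / (2 * n_churn),
}

print(f"churn rate: {churn_rate:.1%}")
print({label: round(w, 2) for label, w in weights.items()})
```

Under this scheme each churner counts for almost three times as much as each retained customer during training, which is what offsets the majority-class bias.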
Moving to the tenure variable, which captures how long a customer has been with the
company (ranging from 0 to 72 months), the analysis revealed that a significant proportion of
churned customers had a tenure of fewer than 12 months. This pattern suggests early
dissatisfaction or unmet expectations during the initial service period, underlining the importance
of onboarding experience and early engagement.
Figure 2 shows the relationship between tenure and churned customers
Regarding the financial aspects, Monthly Charges exhibited a right-skewed distribution,
with most customers paying between $20 and $90 per month. While this reflects a wide range of
service packages and customer needs, it also hints at pricing sensitivity. Total Charges, which
accumulates the entire customer billing history, also showed a positively skewed distribution,
with a few outliers paying substantially higher amounts. Notably, customers with lower Total
Charges were more likely to churn, which can be logically attributed to their shorter tenure with
the service.
Figure 3 shows the distribution of monthly charges
These findings highlight critical areas for intervention and provide valuable insights for
segmentation and retention efforts in the later stages of the analysis.
4.1 Categorical Feature Analysis
The bivariate analysis aimed to uncover patterns and associations between customer churn
and various categorical features in the dataset using visual tools such as bar plots and count
plots. A clear and consistent pattern emerged regarding the type of contract a customer holds.
Customers on Month-to-Month contracts exhibited the highest churn rate, reflecting a lack of
long-term commitment and the flexibility to exit the service at any time. In contrast, customers
under One-Year and Two-Year contracts churned significantly less, suggesting that longer
commitments may foster customer retention, possibly due to bundled offers or penalty clauses for
early termination.
Figure 4 shows a box plot of monthly and total charges of churned customers
Another critical factor associated with churn is the type of internet service used. The
analysis showed that customers with fiber optic internet had the highest churn rate. This could be
indicative of higher performance expectations that, when unmet, result in dissatisfaction and
attrition (Huang et al., 2012). Conversely, customers without any internet service displayed the
lowest churn, possibly because they have fewer service expectations or use alternative providers,
and thus are less influenced by digital service performance. In examining payment methods, it
was found that customers who paid via electronic checks were the most likely to churn. This
could be associated with transactional inconvenience or the profile of less digitally engaged
users. In contrast, those using automated payment methods such as bank transfers or credit cards
showed greater retention, perhaps due to the reduced friction in the payment process or an
indication of stronger financial engagement with the service.
Lastly, the presence or absence of add-on services—including online security, tech
support, and device protection—also demonstrated strong predictive value. Customers not
subscribed to these services churned at significantly higher rates, while those who availed
themselves of such add-ons tended to stay longer. These additional services likely enhance the
perceived value of the overall subscription and contribute positively to customer satisfaction,
providing both convenience and peace of mind. Overall, this bivariate analysis highlights
actionable touchpoints for improving customer retention strategies in digital service businesses.
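The churn rates behind comparisons like these can be computed with a grouped mean. The sketch below uses six made-up rows purely to show the mechanics; the resulting figures are not values from the dataset.

```python
import pandas as pd

# Illustrative toy rows, not values from the dataset.
df = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "One year",
                 "Two year", "Month-to-month", "Two year"],
    "Churn": ["Yes", "No", "No", "No", "Yes", "No"],
})

# Boolean churn flag; its mean within a group is that group's churn rate.
churned = df["Churn"].eq("Yes")
churn_rate_by_contract = (
    churned.groupby(df["Contract"]).mean().sort_values(ascending=False)
)
```

The same pattern applies to InternetService, PaymentMethod, or any add-on service column by swapping the grouping column.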
4.2 Bivariate and Multivariate Insights
To deepen our understanding of the relationships between customer attributes and churn
behavior, several multivariate analyses were conducted. The Pearson correlation coefficient was
employed to assess linear relationships between numeric variables. The analysis revealed a
strong positive correlation between TotalCharges and both tenure and MonthlyCharges,
indicating that customers who have been with the company longer or pay more monthly tend to
accumulate higher overall charges. Most notably, tenure showed a negative correlation with
churn, suggesting that customers who have been with the company for a longer time are less
likely to discontinue the service—a vital insight for retention planning.
To further visualize these relationships, a Pareto chart was generated, which clearly
demonstrated that churned customers generally had lower TotalCharges, were more likely to be
on month-to-month contracts, and tended to have higher MonthlyCharges despite shorter tenure.
Figure 5 shows the correlation of churned customers and total charges
These insights point toward potential dissatisfaction among short-term, high-paying
users who do not perceive adequate value in the service. Additionally, a heatmap of the numeric
features was used to visually inspect the strength and direction of correlations between
variables. This visualization confirmed earlier findings, particularly the strong positive
linear relationship between TotalCharges and tenure, and helped confirm that there was no
critical multicollinearity that could destabilize predictive models. Because the continuous
variables are measured on very different scales, they were standardized during the modeling
phase to ensure consistency across features.
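For reference, the Pearson coefficient and the standardization step can be written out directly. The values below are illustrative, and treating TotalCharges as tenure multiplied by the monthly charge is a simplifying assumption made only to reproduce the kind of relationship described above.

```python
from math import sqrt
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    spread = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return cov / spread


def standardize(x):
    """Rescale values to zero mean and unit (population) standard deviation."""
    m, s = mean(x), pstdev(x)
    return [(v - m) / s for v in x]


# Illustrative values, not taken from the dataset.
tenure = [1, 5, 12, 24, 48, 72]                      # months
monthly = [70.0, 85.0, 55.0, 60.0, 90.0, 65.0]       # monthly charge
total = [t * m for t, m in zip(tenure, monthly)]     # rough proxy for TotalCharges

r = pearson(tenure, total)   # strongly positive for these values
z = standardize(monthly)     # same shape of distribution, comparable scale
```

Standardization leaves the correlations unchanged; its purpose is to stop features with large raw values, such as TotalCharges, from dominating scale-sensitive models like Logistic Regression.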
Chapter 5- Modelling
This chapter details the steps taken to build, train, and evaluate predictive models for
customer churn based on the IBM Telco dataset. The process includes data partitioning, model
selection, performance evaluation using key metrics, and comparison of results to determine the
most suitable approach for real-world business application. The goal is to develop a supervised
classification model that predicts whether a customer is likely to churn, using their
demographic, account, and service features. Given the imbalance between churned and non-churned
classes, accuracy alone is insufficient; other metrics such as precision, recall, and AUC-ROC
are emphasized.
To ensure reliable model evaluation and avoid overfitting, the preprocessed dataset was
split into training and testing subsets using the train_test_split() function from Scikit-learn. An
80/20 split ratio was employed, wherein 80% of the data was allocated to training the model, and
the remaining 20% was reserved for testing its performance on unseen data (Huang et al., 2012).
Crucially, stratification was applied based on the target variable, Churn, to preserve the original
distribution of churned and retained customers across both sets. This step was particularly
important due to the inherent class imbalance in the dataset, where only about 26.5% of the
customers had churned.
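A stratified 80/20 split of this kind might look like the sketch below. The feature matrix is random placeholder data with a 26.5% positive rate, and random_state=42 is an assumption, since the report does not state a seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 1,000 rows, 3 random features, 26.5% positive labels.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
y = np.zeros(1000, dtype=int)
y[:265] = 1

# stratify=y keeps the churn share the same in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

train_rate = y_train.mean()
test_rate = y_test.mean()
```

Without stratify, a random 20% sample could by chance contain noticeably more or fewer churners than the training data, which would make recall and precision on the test set harder to interpret.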
5.1 Model Selection
In order to identify the most effective algorithm for predicting customer churn, four
different classification models were selected for training and comparison. These included
Logistic Regression, Decision Tree Classifier, Random Forest Classifier, and optionally, the
XGBoost Classifier—provided the xgboost library was available in the working environment.
Each of these models was chosen based on its compatibility with datasets that contain a mix of
categorical and numerical features, as well as their widespread use and proven performance in
binary classification tasks. Logistic Regression offers interpretability and a probabilistic
perspective (Nalatissifa & Pardede, 2021), while Decision Trees and Random Forests provide
non-linear modeling capabilities with feature importance insights. XGBoost, if utilized, offers
enhanced performance through gradient boosting and regularization, making it a strong candidate
for high-accuracy modeling.
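One way to set up this comparison, including the optional XGBoost model, is sketched below. The hyperparameters (max_iter, n_estimators, random_state) are assumptions chosen for illustration; the report does not list the settings used.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Core models; hyperparameters are illustrative assumptions.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# XGBoost is added only if the library is installed in the environment.
try:
    from xgboost import XGBClassifier

    models["XGBoost"] = XGBClassifier(random_state=42)
except ImportError:
    pass
```

Keeping the models in a dictionary lets the same fit-and-score loop run over all of them, so every model is evaluated on identical training and test data.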
5.2 Performance Metrics
A comprehensive set of evaluation metrics was used to assess and compare the
performance of the models. These metrics include Accuracy, which measures the proportion of
correctly predicted observations out of all predictions; Precision, which is the proportion of
predicted churn cases that were actually churners, important when false positives carry a cost;
and Recall (Sensitivity), which measures how many actual churners were correctly identified,
crucial for minimizing lost customers. The F1-Score, the harmonic mean of precision and recall,
was used to balance the trade-off between these two metrics, especially under class imbalance.
Finally, the ROC-AUC (Receiver Operating Characteristic, Area Under Curve) was used as an
overall indicator of the model’s ability to discriminate between churners and non-churners,
regardless of classification threshold.
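The threshold-based metrics above all derive from the four cells of a confusion matrix, as the following sketch shows. The counts passed in are hypothetical and chosen only to make the arithmetic easy to follow; ROC-AUC is omitted because it requires predicted probabilities rather than a single confusion matrix.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # of predicted churners, how many really churned
    recall = tp / (tp + fn)      # of actual churners, how many were caught
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 100 actual churners and 300 retained customers.
m = classification_metrics(tp=70, fp=30, fn=30, tn=270)
```

In this hypothetical case accuracy is 0.85 even though only 70% of churners are identified, which illustrates why accuracy alone can flatter a model on imbalanced data.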
5.3 Model Evaluation Results
The performance of four classification models—Logistic Regression, Decision Tree,
Random Forest, and XGBoost—was evaluated using key metrics: accuracy, precision, recall, F1-
score, and ROC-AUC. Among the models, XGBoost achieved the highest overall performance,
with an accuracy of 84.1%, precision of 75.0%, recall of 72.8%, F1-score of 73.9%, and an
impressive ROC-AUC of 89.5%. Closely following XGBoost, the Random Forest Classifier also
showed strong results with an accuracy of 83.5%, and a balanced F1-score of 72.7%.
Model                 Accuracy   Precision   Recall   F1-Score   ROC-AUC
Logistic Regression   80.2%      70.1%       64.8%    67.3%      84.7%
Decision Tree         78.6%      66.9%       69.5%    68.2%      80.1%
Random Forest         83.5%      74.3%       71.1%    72.7%      88.9%
XGBoost (optional)    84.1%      75.0%       72.8%    73.9%      89.5%
These two models outperformed Logistic Regression and Decision Tree, particularly in
terms of recall and AUC—crucial metrics for churn prediction where identifying potential
churners is more valuable than overall accuracy alone. Logistic Regression achieved moderate
results with an accuracy of 80.2% and ROC-AUC of 84.7%, reflecting its strength in linearly
separable data (Zhang et al., 2022). Although the Decision Tree model had decent recall at
69.5%, its overall performance was lower than ensemble-based models. These findings suggest
that ensemble models like Random Forest and XGBoost offer the most robust and balanced
approach to predicting customer churn in this dataset.
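As a consistency check, the F1-scores in the results table can be recomputed from the reported precision and recall. The figures below are copied from the table; only the dictionary layout and variable names are assumptions.

```python
# (precision, recall, reported F1) as listed in the results table.
reported = {
    "Logistic Regression": (0.701, 0.648, 0.673),
    "Decision Tree": (0.669, 0.695, 0.682),
    "Random Forest": (0.743, 0.711, 0.727),
    "XGBoost": (0.750, 0.728, 0.739),
}

# F1 is the harmonic mean of precision and recall.
recomputed = {
    name: round(2 * p * r / (p + r), 3) for name, (p, r, _) in reported.items()
}

# True when every recomputed value agrees with the table to rounding precision.
consistent = all(
    abs(recomputed[name] - f1) < 0.0015 for name, (_, _, f1) in reported.items()
)
```

All four recomputed values agree with the table to within rounding, so the reported precision, recall, and F1 figures are internally consistent.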
5.4 Model Interpretation and Feature Importance
To interpret the decision logic behind model predictions, feature importance analysis was
conducted on tree-based models, particularly the Random Forest Classifier. This technique
quantifies how much each input feature contributes to the model’s decisions (Nalatissifa &
Pardede, 2021). Among all features, tenure emerged as the most influential variable, with an
importance score of 0.24, suggesting that how long a customer has been with the company is a
strong predictor of their likelihood to churn. This was followed by MonthlyCharges (0.18) and
Contract type (0.16). Importance scores measure the strength of a feature’s contribution rather
than its direction, but read alongside the exploratory analysis they are consistent with
customers who pay more per month or hold short-term contracts being more likely to leave. Other
important features included InternetService (0.11), PaymentMethod (0.09), and OnlineSecurity
(0.07), all of which offer business-relevant insights. For example, customers with fiber optic
internet or those who use electronic checks tend to churn more, and those not subscribed to
online security services are also at higher risk. These findings reinforce the earlier
exploratory analysis and help prioritize strategic actions for customer retention.
Feature            Importance (RF)
Tenure             0.24
MonthlyCharges     0.18
Contract (type)    0.16
InternetService    0.11
PaymentMethod      0.09
OnlineSecurity     0.07
Table 2 shows the Random Forest importance scores of the most important variables
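The scores in Table 2 can be put in context with a small calculation. It assumes the scores are impurity-based importances that sum to 1 across all features, which is how Random Forest importances are commonly reported but is not stated explicitly in the report.

```python
# Importance scores as listed in Table 2.
importances = {
    "tenure": 0.24,
    "MonthlyCharges": 0.18,
    "Contract": 0.16,
    "InternetService": 0.11,
    "PaymentMethod": 0.09,
    "OnlineSecurity": 0.07,
}

# Features ordered from most to least influential.
ranked = sorted(importances, key=importances.get, reverse=True)

# Share of total importance covered by the six listed features, and the
# remainder spread across all other features (assumes scores sum to 1).
top_share = sum(importances.values())
remaining_share = 1 - top_share
```

Under that assumption, the six listed features account for about 85% of the total importance, leaving roughly 15% shared among all remaining variables.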
Given the notable class imbalance in the target variable, where approximately 26.5% of
customers had churned, measures were taken to ensure fair model performance. For Logistic
Regression, class imbalance was addressed by setting the class_weight='balanced' parameter,
which penalizes misclassification of the minority class more heavily. This improves the model’s
sensitivity to churn cases without overly compromising overall accuracy. While more advanced
sampling techniques such as SMOTE (Synthetic Minority Oversampling Technique) were explored,
they were ultimately deemed non-essential for the current dataset. This decision was supported
by the relatively stable recall and F1-scores observed in models like Random Forest, which
performed robustly without additional sampling.
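The effect of class weighting on recall can be explored with a sketch like the one below. It uses synthetic data from make_classification with roughly the same class ratio as the churn dataset; the sample size, feature count, and seeds are assumptions, and the resulting scores say nothing about the real data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with about 26.5% positives (not the Telco data).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.735, 0.265], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Same model with and without class weighting.
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
balanced = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

# Recall on the minority (positive) class for each variant.
recall_plain = recall_score(y_te, plain.predict(X_te))
recall_balanced = recall_score(y_te, balanced.predict(X_te))
```

Comparing the two recall values on held-out data is a direct way to judge whether weighting is worthwhile, or whether resampling methods such as SMOTE would be needed in addition.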
5.5 Summary of Modeling Results
Among the core models, the Random Forest Classifier demonstrated the best overall
performance, achieving a strong balance between precision, recall, and ROC-AUC; the optional
XGBoost model scored marginally higher on every metric. Random Forest handled the mixed-type
features effectively and performed robustly under class imbalance without additional sampling.
Feature importance analysis revealed that tenure, monthly charges, and contract type were the
most influential predictors of customer churn. These findings carry significant strategic
implications for business decision-making. Specifically, efforts should focus on early
engagement with customers who are on month-to-month contracts, since these individuals are most
at risk of leaving (Huang et al., 2012). Furthermore, the company should consider incentivizing
long-term contracts by bundling them with value-added services like tech support, online
security, or streaming benefits to enhance customer satisfaction and retention.
References
BlastChar. (2017). Telco Customer Churn. Kaggle.
https://www.kaggle.com/datasets/blastchar/telco-customer-churn
Huang, B., Kechadi, M. T., & Buckley, B. (2012). Customer churn prediction in
telecommunications. Expert Systems with Applications, 39(1), 1414–1425.
https://doi.org/10.1016/j.eswa.2011.08.024
Nalatissifa, H., & Pardede, H. F. (2021). Customer Decision Prediction Using Deep Neural
Network on Telco Customer Churn Data. Jurnal Elektronika Dan Telekomunikasi, 21(2),
122–127. https://doi.org/10.14203/jet.v21.122-127
Zhang, T., Moro, S., & Ramos, R. F. (2022). A Data-Driven Approach to Improve Customer
Churn Prediction Based on Telecom Customer Segmentation. Future Internet, 14(3), 94.
https://doi.org/10.3390/fi14030094