Employee Attrition Analysis of Data Driven Models
Assistant professor, Noida Institute of Engineering & Technology, Greater Noida
Abstract
Companies constantly strive to retain their professional employees to minimize the expenses associated with recruiting and
training new staff members. Accurately anticipating whether a particular employee is likely to leave or remain with the
company can empower the organization to take proactive measures. Unlike physical systems, human resource challenges
cannot be encapsulated by precise scientific or analytical formulas. Consequently, machine learning techniques emerge as
the most effective tools for addressing this objective. In this paper, we present a comprehensive approach for predicting
employee attrition using machine learning, ensemble techniques, and deep learning, applied to the IBM Watson dataset.
We employed a diverse set of classifiers, including Logistic regression classifier, K-nearest neighbour (KNN), Decision
Tree, Naïve Bayes, Gradient boosting, AdaBoost, Random Forest, Stacking, XG Boost, “FNN (Feedforward Neural
Network)”, and “CNN (Convolutional Neural Network)” on the dataset. Our most successful model, which harnesses a
deep learning technique known as FNN, achieved superior predictive performance, with the highest accuracy, recall, and F1-score of 97.5%, 83.93%, and 91.26%, respectively.
Copyright © 2023 M. Nandal et al., licensed to EAI. This is an open access article distributed under the terms of the CC BY-NC-SA
4.0, which permits copying, redistributing, remixing, transformation, and building upon the material in any medium so long as the
original work is properly cited.
doi: 10.4108/eetiot.4762
*Corresponding author. Email: [email protected]
Inquiries related to employee attrition and retention have been approached through qualitative and anecdotal methods. Typically, HR personnel conduct exit interviews when an employee tenders their resignation, aiming to uncover the underlying reasons behind their departure. In the current age, marked by the fourth industrial revolution and powered by advanced technologies such as predictive analytics that employ statistical modelling techniques and machine learning, predicting the likelihood of an employee departing from an organization is now within reach. Organizations utilize machine learning algorithms to forecast the probability of employee attrition and proactively implement measures to prevent such occurrences [3].

Machine learning represents a facet of artificial intelligence (AI) technology that equips systems with the capability to autonomously acquire knowledge and refine their performance through experience, mirroring human-like intelligence without the need for explicit programming [4]. Machine learning (ML) stands as one of the most rapidly advancing research fields, showcasing successful development and application across a diverse array of real-world domains. Due to the expenses associated with hiring employees, providing training, and acquiring intellectual property, it becomes paramount to ensure a minimal attrition rate (employee turnover) within organizations [5]. Employee attrition imposes significant financial burdens on a company, encompassing expenses such as business disruption costs, recruitment and onboarding of new employees, and training of newcomers [6]. While recruiting top talent is vital for organizations, it is equally crucial to ensure their satisfaction and retention. Employees have their own criteria for selecting and committing to an organization, and if their expectations are not met, they may choose to resign. This can result in employee attrition, often referred to as the phenomenon of employee churn [7]. Lately, leading companies such as IBM, HCL, TCS, and others have grappled with employee attrition challenges. By gathering employee feedback regarding various aspects, including the company's culture, work environment, workload, job satisfaction, and more, organizations can employ statistical methods to predict attrition status. Hence, attrition must be treated with the utmost importance, and organizations must take measures to prevent it [4].

Consequently, forecasting employee attrition and pinpointing the key factors that contribute to attrition emerge as crucial objectives for organizations seeking to bolster their human resource strategies. This paper delves into the application of classification and clustering techniques for analysing attrition. It conducts a comparative assessment to evaluate the accuracy of different data mining algorithms using Weka, a collection of machine learning algorithms employed for data mining purposes. In this study, we employed the IBM HR Analytics Employee Attrition & Performance dataset, which is publicly accessible through the Kaggle Dataset Repository. This dataset was generated by IBM data scientists for research purposes and comprises four primary components: seniority, employee satisfaction, income, and demographic information. Inside the dataset, numerous attributes impact the predictive variable known as 'Attrition'. It consists of a total of 1,470 instances and encompasses 35 attributes, providing a comprehensive dataset for analysis.

2. Related Work

Employee attrition issues have been studied by researchers from various viewpoints. Researchers have harnessed machine learning techniques to predict employee attrition by analysing data pertaining to the employees. One such investigation involved the utilization of several machine learning methods, including Random Forests (RF), Support Vector Machines (SVM), and K-Nearest Neighbours (KNN), while exploring various parameter configurations [8]. The researchers in [9] chose to utilize Classification Trees and Random Forest for the purpose of predicting employee attrition. Their approach commenced with dataset pre-processing, where they excluded less influential variables based on Pearson correlation analysis.

A study utilizing the IBM HR Employee Attrition & Performance dataset [4] revealed an inherent data imbalance issue. During the data exploration phase, the researchers employed correlation plots and histogram visualizations to assess the relationships among continuous variables in the model. Following this analysis, the "Synthetic Minority Oversampling Technique (SMOTE)" was utilized to rectify the imbalance within the Attrition class [10]. To tackle the challenge of predicting employee turnover, the authors of [10] introduced a novel approach, a weighted quadratic random forest algorithm, which was applied to a dataset of employees gathered from a branch of a telecommunications company located in China. Other researchers presented a comprehensive three-stage framework for predicting attrition: in the first stage, they applied the "max-out" feature selection method to refine the data; in the second stage, a logistic regression model was trained for prediction; and the third stage involved conducting confidence analysis to enhance the reliability of the prediction model. However, it is worth noting that this system faces challenges, including suboptimal accuracy and elevated complexity due to the preprocessing and postprocessing steps [11]. Taylor et al. [7] utilized tree-based models, specifically light gradient boosted trees and random forests, to make predictions regarding employee attrition. These models demonstrated robust performance, with the light gradient boosted trees exhibiting particularly strong results; the study utilized a custom dataset comprising 5,550 samples for its analysis. Machine learning serves a wide array of applications, encompassing tasks from prediction to the classification of various HR data parameters and features [12]; that study focuses on the early prediction of employee turnover, considering variables like absenteeism, tardiness, and employee indifference as significant factors influencing employee performance forecasting. Fallucchi et al. [13] conducted research using a variety of machine learning approaches to identify the circumstances that may cause an employee to leave the organization. The best recall value was provided by the Gaussian Naïve Bayes classifier, which contributes to the classifier's capacity to detect positive occurrences. The study in [14] provided a hybrid model for anticipating client attrition.

3. Dataset Description

The study used a dataset of 1,470 instances, which comprised detailed information about all employees and 35 features, including the target class. When the gender variable was examined, it was determined that 60% of the employees were male and 40% were female. Surprisingly, two critical characteristics examined in our trials were job satisfaction and job involvement: attrition affected approximately 28.29% of employees with low job satisfaction or job involvement. Furthermore, approximately 31.25% of employees with an unfavourable work-life balance left the organization, compared to 17.65% of departing employees with a good work-life balance. Figure 1 shows the correlation between the target variable, i.e., employee attrition, and the other variables. Heatmap analysis reveals a strong correlation between job satisfaction, overtime, job level, monthly income, job involvement and age with respect to attrition in the dataset.
Figure 1: Dataset
4. Data Analysis
Based on the heatmap, the attributes most highly correlated with the target variable, employee attrition, are overtime, job satisfaction, job level, monthly income, age and job involvement. Figure 2 shows bar plots of these correlated variables against the target variable, i.e., employee attrition.

According to the plots, employee attrition is higher when overtime increases, and it is lower when job satisfaction, job level, age and job involvement are higher.
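To make this correlation analysis concrete, the sketch below (a minimal illustration, not the authors' exact code) loads the Kaggle CSV, encodes the binary columns, and ranks the numeric attributes by their Pearson correlation with Attrition; the file name is the usual Kaggle name and is assumed here.

```python
import pandas as pd

# Load the IBM HR Analytics Employee Attrition & Performance data
# (the file name is assumed; point it at your local copy of the Kaggle CSV).
df = pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")

# Encode the binary Yes/No columns as 0/1 so they can enter the correlation matrix.
df["Attrition"] = df["Attrition"].map({"Yes": 1, "No": 0})
df["OverTime"] = df["OverTime"].map({"Yes": 1, "No": 0})

# Pearson correlation of every numeric attribute with the target, strongest first.
corr_with_attrition = (
    df.select_dtypes("number")
      .corr()["Attrition"]
      .drop("Attrition")
      .sort_values(key=abs, ascending=False)
)
print(corr_with_attrition.head(10))
```

A heatmap such as Figure 1 can then be drawn from the same correlation matrix, for example with seaborn.heatmap.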
5. Proposed Methodology
The proposed methodology (Figure 3) consists of five phases: data collection; data preprocessing; classification using ensemble machine learning, machine learning and deep learning algorithms with hyperparameter tuning; performance metric evaluation to assess the effectiveness of the various algorithms; and, finally, selection of the best employee attrition model based on a comparative study of the performance metrics. After applying Principal Component Analysis (PCA) for feature selection, the data is divided into two parts: 75% for training and 25% for testing.
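A minimal sketch of this preprocessing and splitting step is given below, continuing from the DataFrame df of the previous sketch; the retained variance threshold and the stratified split are illustrative assumptions rather than the paper's exact settings.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# One-hot encode the remaining categorical predictors and separate the target.
X = pd.get_dummies(df.drop(columns=["Attrition"]))
y = df["Attrition"]

# PCA is scale-sensitive, so standardize before reducing dimensionality.
# As described above, the reduction is applied before the data is split.
X_scaled = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=0.95).fit_transform(X_scaled)  # keep ~95% of the variance (illustrative)

# 75% of the data for training and 25% for testing, stratified on the imbalanced target.
X_train, X_test, y_train, y_test = train_test_split(
    X_pca, y, test_size=0.25, stratify=y, random_state=42
)
```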
5.5.4 Naive Bayes: Naive Bayes is a probabilistic machine learning technique used for classification, most notably text classification. It is based on Bayes' theorem, which calculates the likelihood of a specific event occurring based on prior knowledge of conditions that may be associated with the event. Naive Bayes computes the probability that a given data point (such as a document or an item) belongs to a specific class or category. It assumes that the features used for categorization are conditionally independent, which means that the presence or absence of one trait has no bearing on the presence or absence of another. This "naive" assumption simplifies the calculations and makes the algorithm more tractable.

5.5.6 Decision Tree: A decision tree is a graphical representation resembling a tree that helps in the decision-making process. Each branch of the tree represents a potential decision, event, or response. Decision trees can be employed for both classification and regression tasks: in classification, they are used to categorize data into discrete classes, whereas in regression, they predict numerical or continuous values.
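As a hedged illustration of how these base classifiers can be fitted on the split produced earlier (X_train, X_test, y_train, y_test), the sketch below uses scikit-learn; the max_depth value is an illustrative choice, not a tuned hyperparameter from the paper.

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Gaussian Naive Bayes: treats the (continuous) features as conditionally independent Gaussians.
nb = GaussianNB().fit(X_train, y_train)

# Decision tree: recursive axis-aligned splits; limiting the depth curbs overfitting.
dt = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_train, y_train)

for name, model in [("Naive Bayes", nb), ("Decision Tree", dt)]:
    print(name, "test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```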
5.6.3 Random Forest: Random Forest constructs an ensemble of decision trees from random subsets of the training dataset. This process is repeated with different random subsets, and a majority vote among the resulting trees determines the final prediction.

5.6.4 Stacking model: The stacking model, also known as stacked generalization or a stacking ensemble, is an advanced machine learning technique used to improve predictive performance. It combines the predictions of multiple base models (often diverse in nature) by training a meta-model, or "stacker," on top of them. Stacking can significantly improve predictive performance compared to the individual base models because it leverages the strengths of different models and combines them to produce a more robust and accurate prediction. It is a powerful technique in machine learning and is often used in competitions and real-world applications where achieving the best possible predictive accuracy is crucial.

5.6.5 XG Boost: XGBoost, which stands for Extreme Gradient Boosting, is a highly popular and powerful machine learning algorithm that is widely used for both regression and classification tasks. It belongs to the ensemble learning family and is specifically designed to improve the accuracy and efficiency of decision tree-based models.
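The sketch below shows one way such a stacking ensemble could be assembled with scikit-learn, using a random forest, gradient boosting, and an XGBoost classifier (from the separate xgboost package) as base learners and logistic regression as the meta-model; this particular combination and its settings are illustrative assumptions, not the authors' reported configuration.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier  # requires the xgboost package

# Base learners whose cross-validated predictions feed the meta-model.
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=200, random_state=42)),
    ("gb", GradientBoostingClassifier(random_state=42)),
    ("xgb", XGBClassifier(n_estimators=200, eval_metric="logloss", random_state=42)),
]

# Logistic regression acts as the "stacker" that combines the base predictions.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)
print("Stacking test accuracy:", stack.score(X_test, y_test))
```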
5.7 Deep Learning Models

5.7.1 Feedforward Neural Network (FNN): A Feedforward Neural Network (FNN), sometimes known as a Multilayer Perceptron (MLP), is a deep learning artificial neural network. Its architecture is distinguished by several layers of neurons, including an input layer, one or more hidden layers, and an output layer. The network is made up of connected layers of artificial neurons and is intended to process and transform incoming data through a sequence of mathematical operations, producing an output at the end.
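A minimal Keras sketch of such a network for the tabular attrition features is shown below; the layer sizes, dropout rate, optimizer, and number of epochs are assumptions for illustration, since the paper does not report the exact FNN configuration.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Two hidden ReLU layers and a sigmoid output for the binary Attrition label.
fnn = keras.Sequential([
    keras.Input(shape=(X_train.shape[1],)),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),  # illustrative regularization
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
fnn.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
fnn.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.1, verbose=0)
print("FNN test accuracy:", fnn.evaluate(X_test, y_test, verbose=0)[1])
```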
5.7.2 Convolutional Neural Network (CNN): A Convolutional Neural Network (CNN) is a deep learning model specifically tailored for tasks involving visual data, characterized by its use of convolutional layers to automatically learn and extract features from images or other grid-like data.
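The paper does not describe how the CNN was adapted to the tabular HR features; one common approach, sketched below purely as an assumption, is to treat each feature vector as a one-dimensional sequence and apply 1-D convolutions.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Add a channel axis so Conv1D can slide a kernel along the feature dimension.
X_train_cnn = np.asarray(X_train, dtype="float32")[..., np.newaxis]
X_test_cnn = np.asarray(X_test, dtype="float32")[..., np.newaxis]

cnn = keras.Sequential([
    keras.Input(shape=(X_train_cnn.shape[1], 1)),
    layers.Conv1D(32, kernel_size=3, activation="relu"),  # illustrative filter count and kernel size
    layers.MaxPooling1D(pool_size=2),
    layers.Flatten(),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
cnn.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
cnn.fit(X_train_cnn, y_train, epochs=30, batch_size=32, validation_split=0.1, verbose=0)
print("CNN test accuracy:", cnn.evaluate(X_test_cnn, y_test, verbose=0)[1])
```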
evaluation metrics were employed namely recall, F1
Accuracy: It is a performance metric used to assess the score, precision, accuracy, area under the ROC and
model's overall effectiveness when all classes carry equal precision-recall curve. The evaluation of results includes
significance. It is calculated as the ratio of correctly the use of performance metrics such as recall
predicted instances to the total number of predictions (Sensitivity), F1-score, precision, accuracy, and AUC,
made. This metric provides a measure of how well the with the corresponding scores detailed in Table 3.
model performs across all classes.
6.1. Machine Learning Models
tp+tn
Accuracy = (1) Table 1 provides an extensive assessment of the machine
tp+tn + fp+fn learning models. Out of the various models evaluated, the
Naïvebayes model demonstrated superior performance,
Recall, also known as sensitivity or true positive rate, achieving an accuracy and F1-Score of 0.541 % and
measures the model's ability to correctly recognize 0.908% respectively. Furthermore, both the Naïve bayes
positive samples. It is derived by dividing the total and Logistic Regression models provide highest precision,
number of positive samples by the number of correctly reaching at 0.769 and 0.727 respectively.
categorized positive samples. A greater recall value
suggests that the model accurately identifies more positive
samples. Table 1: Performance analysis of Machine Learning
𝑡𝑡𝑝𝑝
Recall = (2) model for Employee Attrition
𝑡𝑡𝑝𝑝++𝑓𝑓𝑝𝑝
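All four quantities can be read off the confusion matrix of a fitted classifier; the sketch below computes them both directly from equations (1)-(4) and with scikit-learn's metric functions, using the stacking model from the earlier sketch as an example.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, precision_score, recall_score

y_pred = stack.predict(X_test)  # any fitted classifier from the sketches above

# confusion_matrix returns [[tn, fp], [fn, tp]] for binary 0/1 labels.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

# Equations (1)-(4), computed directly and via scikit-learn for comparison.
print("accuracy :", (tp + tn) / (tp + tn + fp + fn), accuracy_score(y_test, y_pred))
print("recall   :", tp / (tp + fn), recall_score(y_test, y_pred))
print("precision:", tp / (tp + fp), precision_score(y_test, y_pred))
print("f1-score :", 2 * tp / (2 * tp + fp + fn), f1_score(y_test, y_pred))
```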
6. Results

This section provides an analysis of the results obtained from the different machine learning and deep learning classification models. The aim of this study is to evaluate the classification effectiveness of both machine learning and deep learning algorithms when applied to the task of categorizing employee attrition. In this study, a wide array of learning algorithms was utilized and assessed using the employee attrition dataset. The ML algorithms encompassed traditional methods such as SVM, logistic regression, KNN, decision tree, and naive Bayes. Additionally, ensemble machine learning algorithms, including gradient boosting, XGBoost, AdaBoost, random forest, and stacking, were employed. Furthermore, the study also incorporated deep learning techniques, specifically Convolutional Neural Networks (CNN) and Feedforward Neural Networks (FNN). To assess the performance of these models, multiple evaluation metrics were employed, namely recall (sensitivity), F1-score, precision, accuracy, and the area under the ROC and precision-recall curves, with the corresponding scores detailed in Table 3.

6.1. Machine Learning Models

Table 1 provides an extensive assessment of the machine learning models. Out of the various models evaluated, the Naïve Bayes model demonstrated superior performance, achieving an accuracy of 0.908 and an F1-score of 0.541. Furthermore, the Naïve Bayes and Logistic Regression models provided the highest precision, at 0.769 and 0.727 respectively.

Table 1: Performance analysis of Machine Learning models for Employee Attrition

Model                 Accuracy   Precision   F1-Score   Recall
SVM                   0.878      0.636       0.236      0.146
KNN                   0.870      0.500       0.200      0.125
Naïve Bayes           0.908      0.769       0.541      0.417
Decision Tree         0.861      0.440       0.301      0.229
Logistic Regression   0.897      0.727       0.457      0.333

6.2. Ensemble Machine Learning Models

Table 2 provides an in-depth analysis of the ensemble machine learning models. Among these models, the Stacking model achieved the highest recall, accuracy, and F1-score. Additionally, the Gradient Boosting, AdaBoost and XGBoost techniques exhibited good accuracy levels.

The area under the ROC curve reflects how likely the classifier is to rank a randomly selected positive instance above a randomly selected negative instance. The closer the curve approaches the top-left corner, the more effective the classifier is. The ROC curves for each machine learning classifier are shown in Figure 5, for the ensemble machine learning classifiers in Figure 6, and for the deep learning classifiers in Figure 7. Among the machine learning models, logistic regression has the highest ROC score of 0.827. Among the ensemble machine learning methods, gradient boosting does even better, with a ROC score of 0.84. However, in the field of deep learning, the Feedforward Neural Network (FNN) outperforms them all, with the highest ROC score of 0.92.

Figure 6: ROC plot for ensemble models
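ROC curves and AUC scores of this kind can be produced along the lines of the sketch below with scikit-learn and matplotlib; it simply plots the models fitted in the earlier sketches and is not the authors' plotting code.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

# Probability of the positive (attrition) class for models fitted in the earlier sketches.
models = {"Naive Bayes": nb, "Decision Tree": dt, "Stacking": stack}

plt.figure()
for name, clf in models.items():
    scores = clf.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, scores)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {roc_auc_score(y_test, scores):.3f})")

plt.plot([0, 1], [0, 1], linestyle="--", label="chance")  # diagonal reference line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```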
7. Conclusion

This paper explored the impact of voluntary attrition on organizations and underscored the significance of
predictive modeling in addressing this issue. It provided
an overview of various supervised learning classification
algorithms employed to tackle the problem of predicting
employee attrition, using the IBM HR dataset for
evaluation. Initially, five foundational models were
trained and assessed. Subsequently, five ensembles were
created by leveraging various combinations of these five
base models. Two deep learning models were tested. The
findings revealed that linear models outperformed others
in terms of accuracy, recall, and AUC. Furthermore, deep
learning models, particularly the FNN approach, exhibited
exceptional accuracy. In contrast, other machine learning
models displayed a wider range of accuracy, spanning
from 86% to 94%. These results emphasize the potential
of both deep learning and ensemble machine learning
techniques in achieving high classification accuracy. As a
result, the authors recommend employing the FNN
classifier for precise predictions of employee attrition
within an organization. This approach empowers HR to
take proactive measures in retaining employees identified
as being at risk of leaving.