Analysis and Prediction of Employee Turnover Characteristics Based On Machine Learning
Heng Zhang1, Lexi Xu1, Xinzhou Cheng1, Kun Chao1, Xueqing Zhao2
1 Network Technology Research Institute, China United Network Communications Corporation, Beijing, P.R. China
2 College of Engineering, University of Nebraska-Lincoln, Nebraska, USA
1 {zhangheng23, xulx29, chengxz11, chaokun1}@chinaunicom.cn 2 [email protected]
Abstract—Employee turnover refers to staff deciding to leave the company. With the rapid development of the economy and industry, employee turnover has gradually become more common in recent years. On the one hand, staff decide to leave the company for various reasons. On the other hand, staff retention and job stability affect the normal operation of the company. Companies need to grasp the major factors of employee turnover and then take relevant measures to deal with this problem. This paper employs machine learning techniques to sort out the characteristics of employee turnover. Furthermore, we adopt the GBDT algorithm and the LR algorithm to fit the characteristic model that influences employee turnover. Finally, this paper implements employee turnover prediction for real companies, which provides an effective reference for companies to reduce the turnover rate of employees.
Keywords—wireless network; operation maintenance; stability evaluation; factor; calculation
I. INTRODUCTION
In recent years, competition among companies to sustain their business has increased massively. The profits of a company can be improved through company efficiency. Staff retention is more important than the acquisition of new staff [1]-[3].
Employee turnover reflects staff deciding to leave the company. A series of data records useful information about employees [4]-[9]. Because employee turnover has recently been high, it is of great significance for the company to analyze and forecast the characteristics of turnover through machine learning. This can help companies take relevant measures to deal with the employee turnover problem [10]-[13]. Machine learning algorithms have been widely used in data analysis, mining and prediction in industry [14]-[16]. Valuable knowledge is mined through feature analysis and reasoning [17][18]. More specifically, the characteristics of employee turnover are analyzed and predicted by machine learning, which provides a reference for reducing the employee turnover rate.
In this paper, we adopt the supervised learning technique. In supervised learning, the input data serves as "training data", and each training sample has a clear label or result. After the prediction model is established, supervised learning sets up a learning process that compares the prediction results with the actual results of the "training data". Through continuous adjustment of the prediction model, its predictions can reach high accuracy.
The rest of this paper is organized as follows: Section II introduces the data mining techniques and algorithms used in this paper. Section III presents the process of building the employee turnover model in detail, including data understanding, data preprocessing, data modeling, data optimization and data fusion. Section IV gives the conclusion.
II. DATA MINING TECHNOLOGY
A. Overview of data mining
The key to data mining is to deeply investigate and understand the data, including its type, characteristics, typical values, etc. [19][20]. Through in-depth investigation of the data set or database, we can obtain valuable information. Data mining technology can extract knowledge, rules or high-level relationships from the database, including classification rules, clustering rules, association rules, prediction rules and so on [21]-[23]. In addition, data mining techniques can help researchers analyze the data from multiple perspectives, so as to obtain valuable information. The process of data mining is divided into data understanding, data cleaning, data mining and result processing [24]-[27].
B. Key factors of data mining
There are various widely used data mining algorithms, for example C4.5, K-Means, SVM, Apriori, EM, PageRank, AdaBoost, KNN, Naive Bayes and CART [28]-[34]. The four key factors for improving the accuracy of analysis results are as follows:
1) Understanding of the analytical data
2) Analysis and processing of error indices in the data
3) Feature engineering
4) Model fusion
C. The importance of data processing
Through feature extraction, we obtain unprocessed features. These unprocessed features have the following characteristics:
1) Dimensional disunity: features measured on different scales cannot be compared directly. Nondimensionalization can deal with this problem.
2) Information redundancy: for some quantitative features, the useful information lies only in an interval division, such as academic achievement. If only "pass" or "fail" matters, the quantitative test scores need to be converted to "1" or "0", which indicate "pass" and "fail", respectively. Binarization can deal with this problem.
3) Qualitative features cannot be used directly: some machine learning algorithms and models only accept quantitative features as input, so it is necessary to convert qualitative features into quantitative features. Dummy coding is usually used for this conversion. Suppose a feature has N qualitative values; it is then extended to N features. For example, when the original feature value is the first qualitative value, the first extended feature is assigned "1" and the other extended features are assigned "0". Compared with directly assigning a numeric value to each category, dummy coding does not add parameter-tuning work. For a linear model, dummy-coded features can achieve a nonlinear effect.
4) Missing values: missing values need to be filled in.
5) Low information utilization: different machine learning algorithms and models make different use of the information in the data. In a linear model, dummy coding of qualitative features can achieve a nonlinear effect. Similarly, polynomial transformation of quantitative variables can achieve a nonlinear effect.
III. CONSTRUCTING EMPLOYEE TURNOVER CHARACTERISTICS BY DATA MINING
In most cases, the only way for a company to learn the reason for an employee's departure is through an interview at the time of departure. However, the information provided by the employee is usually untrue. Hence, it does not help the company understand the actual reasons for leaving. Therefore, it is necessary for a company to study in detail its own talent composition, work tasks, human resources and compensation system, and the external environment, so as to help the company set appropriate measures to retain staff.
There are many factors affecting employee turnover. According to the analysis of both staff behavior and company experience, employee turnover is affected by a series of factors, for example salary, business trips, job environment satisfaction, work commitment, overtime, promotion and salary increases. In this paper, the employee turnover data comes from around 100 companies.
Based on data mining technology, this paper mainly analyzes basic information, work experience, position, salary, etc. Based on the historical records of employees (especially whether they left or not), the weights of various factors are fitted by logistic regression, AdaBoost, SVM and other algorithms. The key characteristics of employee turnover will be used in the prediction of future turnover. The overall flowchart is shown in Fig.1.
Fig.1. Overall flowchart of employee turnover characteristics construction
A. Data understanding
It is generally known that a series of factors affect employee turnover, such as salary, travel, job environment satisfaction, work engagement, overtime work, position promotion, salary increase ratio, etc. Initially, the factors and separations are compared to investigate the correlation between the various factors and separations. As there are many factors, this paper chooses the key items to analyze, as shown in Fig.2.
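The comparison of factors against separations can be illustrated with a small correlation computation. The following is only a sketch in plain Python: the column names and toy records are hypothetical and are not the paper's data, and Pearson correlation is assumed as the correlation measure, which the paper does not state explicitly.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant column has no defined correlation; treat it as uninformative.
        return 0.0
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy records (assumption, not the paper's data).
records = {
    "overtime":       [1, 1, 0, 0, 1, 0, 1, 0],
    "monthly_income": [3.0, 2.5, 6.0, 5.5, 2.0, 7.0, 3.5, 6.5],
    "standard_hours": [80, 80, 80, 80, 80, 80, 80, 80],
}
left = [1, 1, 0, 0, 1, 0, 0, 0]  # 1 = employee left, 0 = employee stayed

# Correlation of every factor with the turnover label, strongest first.
correlations = {name: pearson(values, left) for name, values in records.items()}
ranked = sorted(correlations.items(), key=lambda kv: abs(kv[1]), reverse=True)
for name, r in ranked:
    print(f"{name:15s} {r:+.3f}")
```

In this toy data the constant column gets a coefficient of 0, which is the kind of factor that can be dropped before modeling.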
The 18th International Symposium on Communications and Information Technologies (ISCIT 2018)
According to the correlation analysis in Fig.2, “Over 18” and “Standard Hours” have no notable relationship with employee turnover. “Employee number”, “age” and “relationship satisfaction” have little to do with separation and can be eliminated when factor analysis is performed.
Meanwhile, correlation analysis among the above-mentioned factors is carried out, and the resulting relationship map is shown in Fig. 3.
In the map of correlation coefficients of all items, the darker the color, the closer the coefficient is to 1, which indicates a stronger correlation. Conversely, a lighter color indicates a weaker correlation.
From the perspective of influencing factor analysis, the more consistently two strongly correlated factors weigh on the results, the less one of them contributes to the accuracy of the evaluation. Therefore, it can be removed. As shown in Fig.3, the relevant results for the above-mentioned factors are as follows:
1) The correlation coefficient ranges from -1 to 1. A value of 1 reflects the strongest positive correlation, -1 the strongest negative correlation, and 0 no linear correlation.
2) The department and the work role have relatively high similarity, reaching 0.82.
3) The degree of negative correlation is the most important factor. After the model analysis, the factors can be selected and eliminated.
B. Data preprocessing
In the data preprocessing stage, missing values, invalid values, categorical values and continuous values should be processed according to the feature. The main methods for missing values include:
1) If the number of missing samples is large, the attribute is discarded directly. If it were added as a feature, it would bring in noise that affects the final results.
2) If the number of missing samples is moderate and the attribute is discrete, such as a class attribute, then NaN (NaN indicates the null value) can be added to the class feature as a new category.
3) If the attribute with missing samples is continuous, we consider setting a step size and discretizing it. NaN is then added to the attribute categories as a type.
4) For categorical data, the attribute can be split into several feature factors according to its content. Each factor is set to 1 if it applies, otherwise it is set to 0.
For data such as income and age, if logistic regression with gradient descent is used, the scale difference between attribute values is too large. In this case, the convergence rate will be greatly affected, and the results may not even converge. So we can use the preprocessing module in scikit-learn to scale these values. Scaling maps features with a larger range of variation into [-1, +1].
A statistical view of all the data is obtained through a descriptive statistics function, as listed in Table I.
TABLE I. DESCRIPTION BORDER FUNCTION LIST
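The preprocessing steps of this subsection can be sketched as follows. The paper relies on scikit-learn's preprocessing module; this sketch instead uses plain Python so that the arithmetic is visible. The function names, the toy values and the choice of min-max scaling onto [-1, +1] are assumptions made for illustration only.

```python
def dummy_encode(values, missing_label="NaN"):
    """One 0/1 column per category; missing entries become their own category."""
    cleaned = [missing_label if v is None else v for v in values]
    categories = sorted(set(cleaned))
    return {f"is_{c}": [1 if v == c else 0 for v in cleaned] for c in categories}

def discretize(values, step, missing_label="NaN"):
    """Bucket a continuous attribute by a fixed step; missing entries stay a separate type."""
    buckets = []
    for v in values:
        if v is None:
            buckets.append(missing_label)
        else:
            low = int(v // step) * step
            buckets.append(f"{low}-{low + step}")
    return buckets

def scale_to_unit_range(values):
    """Linearly map values onto [-1, +1] (min-max scaling)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [2 * (v - lo) / (hi - lo) - 1 for v in values]

# Hypothetical toy columns (assumption, not the paper's data); None marks a missing value.
travel = ["Travel_Frequently", None, "Travel_Rarely", "Non-Travel"]
age = [23, None, 41, 35]
income = [2000, 5000, 8000, 20000]

travel_dummies = dummy_encode(travel)                   # method 2) and 4)
age_dummies = dummy_encode(discretize(age, step=10))    # method 3), then 4)
income_scaled = scale_to_unit_range(income)             # scaling to [-1, +1]

print(travel_dummies)
print(age_dummies)
print(income_scaled)
```

Treating the missing value as its own category keeps the sample in the training set instead of discarding it, which is the point of methods 2) and 3).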
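C. Data modeling
Logistic regression, despite its name, is a linear model for classification rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function. This paper employs the logistic regression implementation in scikit-learn for modeling.
This paper assumes each sample is {x, y}, where y takes the value 0 or 1 to indicate whether turnover happened, and x contains all the factors that affect the result. There are n independent training samples {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} with y_i ∈ {0, 1}, where x_i holds the employee's factors mentioned above. The probability of each observed sample (x_i, y_i) is given by (1).
$$P(y_i, x_i) = P(y_i = 1 \mid x_i)^{y_i} \, \bigl(1 - P(y_i = 1 \mid x_i)\bigr)^{1 - y_i} \tag{1}$$
What we need is the joint probability (the likelihood) of the n independent events:
$$L(\theta) = \prod_{i=1}^{n} P(y_i = 1 \mid x_i)^{y_i} \, \bigl(1 - P(y_i = 1 \mid x_i)\bigr)^{1 - y_i} \tag{2}$$
With $P(y_i = 1 \mid x_i) = \sigma(\theta^{T} x_i) = e^{\theta^{T} x_i} / (1 + e^{\theta^{T} x_i})$, taking the logarithm of (2) and differentiating with respect to $\theta$ gives
$$\frac{\partial \ln L(\theta)}{\partial \theta} = \sum_{i=1}^{n} y_i x_i - \sum_{i=1}^{n} \frac{e^{\theta^{T} x_i}}{1 + e^{\theta^{T} x_i}} \, x_i = \sum_{i=1}^{n} \bigl(y_i - \sigma(\theta^{T} x_i)\bigr) x_i \tag{3}$$
Because the log-likelihood is maximized, the parameters are updated iteratively along the gradient with learning rate $\alpha$:
$$\theta^{t+1} = \theta^{t} + \alpha \frac{\partial \ln L(\theta)}{\partial \theta}\bigg|_{\theta = \theta^{t}} = \theta^{t} + \alpha \sum_{i=1}^{n} \bigl(y_i - \sigma((\theta^{t})^{T} x_i)\bigr) x_i \tag{4}$$
The implementation of logistic regression in scikit-learn can be accessed from class LogisticRegression. This implementation can fit binary, One-vs-Rest, or multinomial logistic regression with optional L2 or L1 regularization.
As an optimization problem, binary class L2 penalized logistic regression minimizes the following cost function:
$$\min_{\omega, c} \; \frac{1}{2} \omega^{T} \omega + C \sum_{i=1}^{n} \log\bigl(\exp\bigl(-y_i (X_i^{T} \omega + c)\bigr) + 1\bigr) \tag{5}$$
Similarly, L1 regularized logistic regression solves the following optimization problem:
$$\min_{\omega, c} \; \|\omega\|_{1} + C \sum_{i=1}^{n} \log\bigl(\exp\bigl(-y_i (X_i^{T} \omega + c)\bigr) + 1\bigr) \tag{6}$$
In (5) and (6), the labels are encoded as $y_i \in \{-1, 1\}$, $\omega$ is the weight vector, $c$ is the intercept and $C$ is the inverse of the regularization strength.
The above formulas constitute the model used in this analysis. By fitting the logistic regression model to all the data, we obtain a model. The details are as follows: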
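The likelihood-based fitting in this subsection can be illustrated with a minimal sketch in plain Python. It performs unregularized gradient ascent on the log-likelihood, using the gradient sum_i (y_i - sigma(theta^T x_i)) x_i from (3). The toy samples, the learning rate and the number of steps are assumptions chosen for illustration; this is not the paper's scikit-learn model and it omits the L1/L2 penalties of (5) and (6).

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(theta, x):
    """P(y = 1 | x) = sigma(theta^T x)."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def log_likelihood(theta, xs, ys):
    """Logarithm of the likelihood of n independent samples."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = min(max(predict(theta, x), 1e-12), 1 - 1e-12)  # guard against log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def gradient(theta, xs, ys):
    """Gradient of the log-likelihood: sum_i (y_i - sigma(theta^T x_i)) x_i."""
    grad = [0.0] * len(theta)
    for x, y in zip(xs, ys):
        err = y - predict(theta, x)
        for j, xj in enumerate(x):
            grad[j] += err * xj
    return grad

def fit(xs, ys, alpha=0.1, steps=500):
    """Gradient ascent: theta <- theta + alpha * gradient."""
    theta = [0.0] * len(xs[0])
    for _ in range(steps):
        g = gradient(theta, xs, ys)
        theta = [t + alpha * gj for t, gj in zip(theta, g)]
    return theta

# Hypothetical samples (assumption): [bias, overtime flag, income scaled to [-1, 1]].
xs = [[1, 1, -0.8], [1, 1, -0.5], [1, 0, 0.6], [1, 0, 0.4], [1, 1, -0.9],
      [1, 0, 0.9], [1, 1, -0.7], [1, 0, 0.7], [1, 0, -0.6], [1, 1, 0.5]]
ys = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]  # 1 = turnover happened

theta = fit(xs, ys)
print("coefficients:", [round(t, 3) for t in theta])
print("P(leave | overtime, low income):", round(predict(theta, [1, 1, -0.9]), 3))
```

For the penalized objectives, scikit-learn's `LogisticRegression` accepts `penalty="l1"` together with `solver="liblinear"`; the regularization strength `C` would still have to be chosen for the data at hand.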
TABLE II. CORRELATION ANALYSIS OF MODEL COEFFICIENTS
The L1 penalty of logistic regression is used here, so the system automatically identifies the features with 0 correlation. The closer a value is to 1, the stronger the correlation is, and the worse the influence degree is on the results. Here we analyze the characteristic behaviors with larger weights, as follows:
1) Frequent travel increases turnover probability, while non-travel significantly reduces turnover probability.
2) The probability of turnover of sales personnel is higher than that of sales staff.
3) Among employees' fields of specialty, those with human resources and technical degrees have a higher turnover probability.
4) The probability of male turnover is higher than that of female turnover, and female employees are relatively stable.
5) Single employees have a higher probability of separation.
6) The turnover rate of employees who often work overtime increases markedly.
scoring rule can simply be regarded as “YES or NO”, which means the items or sub-items get full score when satisfying the criteria and get zero when not. The criteria here are made by network maintenance and optimization experts.
E. Model fusion
Model fusion is a useful tool often used in model prediction. It can usually improve results in a variety of machine learning tasks. Model fusion considers the outputs of different models and merges their results together. It is mainly achieved in several ways: fusion of the submitted result files, stacking, and blending.
If the model has been determined, we can use bagging. Instead of using the whole training set, this paper takes only one subset of the training set at a time. Although the same machine learning algorithm is used, the resulting models are not the same. At the same time, because no subset of the data is complete, even if overfitting occurs, it is overfitting on a sub-training set, not on the whole data. Model fusion is therefore helpful to the final result.
(7)
Therefore, the fusion of the multi-service sampling model is carried out by using the loading Regressor function in Python, and the success rate of the final prediction is 89.32%.
IV. CONCLUSIONS
On the basis of big data, machine learning can be regarded as a process of gradual adjustment and improvement. First, based on a baseline model, the subsequent analysis is improved step by step. In data analysis, it is very important to understand the data. Through the understanding of the data and the analysis and treatment of special points / outliers, a more standardized initial data set is formed in this paper. Future research will take up the processing of feature engineering
(feature engineering), which is as important as the model algorithms.
REFERENCES
[1] H. Zhang, et al. "Big data research on driving behavior model and auto insurance pricing factors based on UBI," in Proc. ICSINC, Sept 2017, Chongqing, China, pp. 1-8.
[2] L. Xu, X. Cheng, et al. "Mobility load balancing aware radio resource allocation scheme for LTE-Advanced cellular networks," in Proc. IEEE ICCT, Hangzhou, China, Oct. 2015, pp. 806-812.
[3] L. Xu, et al. "Telecom big data based user offloading self-optimisation in heterogeneous relay cellular systems," International Journal of Distributed Systems and Technologies, 8(2), pp. 27-46, April 2017.
[4] J. Gao, et al. "An interference management algorithm using big data analytics in LTE cellular networks," in Proc. IEEE ISCIT, Sept 2016, Qingdao, China, pp. 246-251.
[5] L. Xu, et al. "A self-optimizing load balancing scheme for fixed relay cellular networks," in Proc. IET ICCTA, 14-16 Oct. 2011, Beijing, China, pp. 306-311.
[6] L. Xu, et al. "Data mining and evaluation for Base Station deployment," in Proc. ICSINC, 13-15 Sept 2017, Chongqing, China, pp. 356-364.
[7] X. Cheng, et al. "A novel big data based telecom operation architecture," in Proc. ICSINC, Beijing, China, October 2015, pp. 385-396.
[8] K. Chao, et al. "A novel big data based telecom user value evaluation method," in Proc. ICSINC, Oct 2015, Beijing, China.
[9] L. Xu, et al. "WCDMA data based LTE site selection scheme in LTE deployment," in Proc. ICSINC, Beijing, China, October 2015, pp. 249-260.
[10] L. Xu, et al. "Self-organizing load balancing for relay based cellular networks," in Proc. IEEE CIT, 29 June - 1 July 2010, Bradford, United Kingdom, pp. 791-796.
[11] F. Zhang, et al. "A novel evaluation method of WCDMA RNC signaling carrying capacity," in Proc. IEEE ISCIT, 26-28 Sept 2016, Qingdao, China, pp. 352-356.
[12] Y. Wang, et al. "A novel complaint calls handle scheme using big data analytic in mobile networks," in Proc. ICSINC, 17-18 Oct 2015, Beijing, China.
[13] L. Xu, et al. "User-vote assisted self-organizing load balancing for OFDMA cellular systems," in Proc. IEEE PIMRC, Sept. 2011, Toronto, Canada, pp. 217-221.
[14] Y. Liu, et al. "A novel power control mechanism based on interference estimation in LTE cellular networks," in Proc. IEEE ISCIT, Sept 2016, Qingdao, China, pp. 397-401.
[15] L. Xu, et al. "Self-optimised coordinated traffic shifting scheme for LTE cellular systems," in Proc. EAI ICSON, Beijing, China, Jan 2015, vol. 149, pp. 67-75.
[16] T. Zhang, et al. "A novel LTE network deployment scheme using telecom big data," in Proc. ICSINC, Beijing, China, October 2015, pp. 261-270.
[17] G. Shao, et al. "Telecom big data based user analysis and application in telecom industry," in Proc. 5GWN, April 2017, Beijing, China, pp. 99-109.
[18] L. Xu, et al. "User relay assisted traffic shifting in LTE-Advanced systems," in Proc. IEEE VTC, June 2013, Dresden, Germany, pp. 1-7.
[19] L. Xu, et al. "Cooperative load balancing for OFDMA cellular networks," European Wireless, Apr. 2012, Poznan, Poland, pp. 1-7.
[20] K. Chao, et al. "Data mining based modeling and application of mobile video service awareness," in Proc. ICSINC, Sept 2017, Chongqing, China, pp. 389-396.
[21] J. Gao, et al. "A Coverage Self-optimization Algorithm using Big Data Analytics in WCDMA Cellular Networks," in Proc. ICSINC, Oct 2015, Beijing, China.
[22] L. Xu, et al. "Self-optimised joint traffic offloading in heterogeneous cellular networks," in Proc. IEEE ISCIT, Sept 2016, Qingdao, China, pp. 263-267.
[23] J. Guan, et al. "A comprehensive method of evaluation for wireless network operation stability," in Proc. IEEE ISCIT, Sept 2016, Qingdao, China, pp. 342-346.
[24] P. Ren, et al. "A novel big data based problematical sectors detection algorithm in WCDMA networks," in Proc. ICSINC, Oct 2015, Beijing, China.
[25] China Unicom. China Unicom LTE network optimization guide book. Beijing: China Unicom Press. 2013.
[26] Z. Han, et al. 2012. LTE FDD technology principle and network planning. Beijing: China Post and Telecommunications Press.
[27] H. Xing, et al. "On minimizing coding operations in network coding based multicast: an evolutionary algorithm," Applied Intelligence, 41(3), pp. 820-836, 2014.
[28] L. Xu, et al. "Cooperative mobility load balancing in relay cellular networks," in Proc. IEEE ICCC, August 2013, Xi'An, China, pp. 141-146.
[29] H. Xing, et al. "A Modified Artificial Bee Colony Algorithm for Load Balancing in Network Coding Based Multicast," Soft Computing, published online, 2018.
[30] H. Xing, et al. "A Hybrid EDA for Load Balancing in Multicast With Network Coding," Applied Soft Computing, 59, pp. 363-377, 2017.
[31] Y. Cui, et al. "SD-Anti-DDoS: fast and efficient DDoS defense in software-defined networks," Journal of Network and Computer Applications, 67, pp. 65-79, 2016.
[32] Z. Wang, et al. "A Modified Ant Colony Optimization Algorithm for Network Coding Resource Minimization," IEEE Transactions on Evolutionary Computation, 20(3), pp. 325-342, 2016.
[33] W. Wang, et al. "A novel cell-level resource allocation scheme for OFDMA system," in Proc. IEEE CMC, Kunming, China, Jan. 2009, vol. 1, pp. 287-292.
[34] A. Y. Al-Dubai, L. Zhao, et al. "QoS-aware inter-domain multicast for scalable wireless community networks," IEEE Transactions on Parallel and Distributed Systems, 26(11), pp. 3136-3148, November 2015.