Using PySpark to Predict User Churn

Many factors influence whether a particular user churns. In this project I use PySpark to analyse and predict churn, using data similar to that collected by companies such as Spotify and Apple Music.

Why PySpark?

I chose PySpark for its scalability and speed. PySpark is the Python API for Spark, which distributes data across a cluster using abstractions such as Resilient Distributed Datasets (RDDs). Because Spark can keep intermediate data in memory, it is often faster than disk-based frameworks.

Data

For the first part of my analysis, I worked on a small subset of the full 12 GB dataset. Once I finish my analysis and modeling, I will deploy a Spark cluster on a cloud service such as AWS or IBM Cloud, where I can efficiently analyse the full dataset. Looking at the columns, the data consists of user information as well as user activity in the app. Each row represents a single user action, such as playing a song, liking or disliking it, or navigating between pages.

Exploratory Data Analysis

To define what churn means in this dataset, I looked at the distinct values in the page column. I decided that a user who reaches the “Cancellation Confirmation” page is one who has churned.
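The labeling rule above can be sketched in a few lines. This is a plain-Python illustration of the logic rather than the PySpark code from the analysis, so it runs without a cluster. The `userId` column name, the `NextSong` page value, and the sample rows are assumptions made for the example; only the `page` column and the “Cancellation Confirmation” value come from the text.

```python
# Hypothetical event log: one row per user action (column names are assumed).
events = [
    {"userId": "1", "page": "NextSong"},
    {"userId": "1", "page": "Thumbs Up"},
    {"userId": "2", "page": "NextSong"},
    {"userId": "2", "page": "Cancellation Confirmation"},
    {"userId": "3", "page": "Roll Advert"},
]

CHURN_PAGE = "Cancellation Confirmation"


def label_churn(rows):
    """Return {userId: 1 if the user ever reached the churn page, else 0}."""
    labels = {}
    for row in rows:
        hit = 1 if row["page"] == CHURN_PAGE else 0
        # A single churn event marks the user as churned for good.
        labels[row["userId"]] = max(labels.get(row["userId"], 0), hit)
    return labels


churn = label_churn(events)
print(churn)
```

The label is attached to the user, not to the single event row, so every row belonging to a churned user carries the same flag.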

Looking at the demographics of the churned users, we can see that a higher proportion of male users churn than female users (26% vs 19%).

Based on the proportion of page events by churn status, I noticed that Roll Advert and Thumbs Down events correlate with a user churning, while Thumbs Up events correlate with a user not churning. I believe that advertisements are a major factor in a user's decision to keep or leave a service.

Another interesting finding from the visualizations is that more paid users than free users decide to churn. This may be due to dissatisfaction with the service, or to its pricing compared with other streaming services.

Metric

In the small subset there are 360 users, and the churn label is imbalanced: only 14% of the users churned. Because of this imbalance, I decided that the best metric for this classification is the F1 score, since it takes both precision and recall into account.
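To see why accuracy is misleading on data this imbalanced, here is a minimal plain-Python sketch. It assumes 50 churners out of 360 users (roughly the 14% quoted above; the exact count is not given in the text) and scores a trivial model that predicts "no churn" for everyone. It computes the binary F1 for the churn class; which F1 variant the original analysis used is not stated.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Assumed class counts: 50 churners (1) and 310 non-churners (0).
y_true = [1] * 50 + [0] * 310
always_stay = [0] * len(y_true)  # trivial model: nobody churns

accuracy = sum(t == p for t, p in zip(y_true, always_stay)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, always_stay)

print(f"accuracy={accuracy:.2f}  f1={f1:.2f}")
```

The trivial model looks strong on accuracy because most users do not churn, yet it never identifies a single churner, which is exactly what the F1 score exposes.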

Feature Engineering

To create the final training set, I converted the event types in the page column into their own binary features, which keeps scaling simple. I also added the number of songs a user listened to, the number of distinct listening sessions, the lifetime of the user's account, and the target, which is the churn label. Finally, I grouped the data by user so that no user appears more than once in the dataset.
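The per-user aggregation described above might look roughly like this. It is a plain-Python sketch of the logic, not the PySpark code from the analysis. The column names `userId`, `sessionId`, `ts`, and `registration`, the `NextSong` page value, the feature names, and the sample rows are all assumptions for illustration.

```python
from collections import defaultdict

# Page types turned into binary features (values taken from the text).
FLAG_PAGES = ["Thumbs Up", "Thumbs Down", "Roll Advert"]

# Hypothetical event rows; timestamps are arbitrary integers.
events = [
    {"userId": "1", "sessionId": 10, "page": "NextSong", "ts": 100, "registration": 0},
    {"userId": "1", "sessionId": 10, "page": "Thumbs Up", "ts": 150, "registration": 0},
    {"userId": "1", "sessionId": 11, "page": "NextSong", "ts": 400, "registration": 0},
    {"userId": "2", "sessionId": 20, "page": "Roll Advert", "ts": 250, "registration": 50},
    {"userId": "2", "sessionId": 20, "page": "Cancellation Confirmation", "ts": 300, "registration": 50},
]


def build_features(rows):
    """Collapse the event log into exactly one feature record per user."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["userId"]].append(row)

    features = {}
    for user, user_rows in grouped.items():
        pages = {r["page"] for r in user_rows}
        # One 0/1 flag per page type of interest.
        record = {
            "saw_" + p.replace(" ", "_").lower(): int(p in pages)
            for p in FLAG_PAGES
        }
        record["num_songs"] = sum(r["page"] == "NextSong" for r in user_rows)
        record["num_sessions"] = len({r["sessionId"] for r in user_rows})
        # Lifetime: last observed activity minus registration time.
        record["lifetime"] = max(r["ts"] for r in user_rows) - user_rows[0]["registration"]
        record["churn"] = int("Cancellation Confirmation" in pages)
        features[user] = record
    return features


features = build_features(events)
for user, record in features.items():
    print(user, record)
```

Grouping by user first is what guarantees one row per user in the training set, so the model never sees the same user on both sides of a split.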

Modeling

After combining all the features into a single vector, scaling the values, and splitting the data into training and validation sets, I was ready to start training my models. I chose to train a logistic regression classifier, a random forest classifier, gradient-boosted trees, and a support vector machine classifier.
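As a rough illustration of the scaling and splitting steps, here is a plain-Python sketch. The text does not say which scaler or split ratio was used, so standardization (zero mean, unit variance) and an 80/20 split with a fixed seed are assumptions, as are the function names and sample values.

```python
import random
import statistics


def standardize(column):
    """Scale a numeric column to zero mean and unit (population) variance."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(x - mean) / std if std else 0.0 for x in column]


def train_validation_split(rows, validation_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and hold out a fraction for validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_fraction)
    cut = len(shuffled) - n_validation
    return shuffled[:cut], shuffled[cut:]


num_songs = [2.0, 4.0, 6.0, 8.0]  # hypothetical feature column
scaled = standardize(num_songs)

user_ids = list(range(10))  # hypothetical user identifiers
train, validation = train_validation_split(user_ids)

print(scaled)
print(len(train), len(validation))
```

In a stricter setup the scaling statistics would be computed on the training rows only and then applied to the validation rows, so that no information leaks from validation into training.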

Figure: Pros and cons of the selected classifiers.

Of these, the gradient-boosted trees model performed best, with an F1 score of 0.67.

Next, I tuned the maximum-iterations and maximum-depth hyperparameters of the gradient-boosted trees using cross-validation; however, this did not improve the model's performance.
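The tuning procedure can be sketched generically: try every combination in a small grid, score each one with k-fold cross-validation, and keep the best. The plain-Python sketch below shows only that control flow. The grid values, the parameter names `max_iter` and `max_depth`, and especially `toy_score` are placeholders invented for the example; a real run would train the model on the training indices and return its F1 score on the validation indices.

```python
from itertools import product


def k_fold_indices(n_rows, k):
    """Yield (train_idx, validation_idx) pairs for k interleaved folds."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def grid_search(param_grid, n_rows, k, score_fn):
    """Return the parameter combination with the best mean fold score."""
    best_params, best_score = None, float("-inf")
    names = list(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [score_fn(params, tr, va) for tr, va in k_fold_indices(n_rows, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


def toy_score(params, train_idx, validation_idx):
    """Placeholder score, NOT a real model: peaks at max_depth=5, max_iter=20."""
    return 1.0 - 0.1 * abs(params["max_depth"] - 5) - 0.01 * abs(params["max_iter"] - 20)


grid = {"max_iter": [10, 20], "max_depth": [3, 5]}  # assumed grid values
best_params, best_score = grid_search(grid, n_rows=12, k=3, score_fn=toy_score)
print(best_params, best_score)
```

When the best grid point scores no higher than the default settings, as reported above, a wider grid or more data is usually the next thing to try.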

Conclusion

Learning PySpark for this project was a very interesting experience. I was able to use the framework to analyze the data, create new features for the training set, and test different models from Spark MLlib to predict user churn.

Based on the results, some of the causes of user churn may be the number of advertisements that users see and how often they dislike the songs they listen to.

The next steps for my project are to create a Spark cluster on a cloud service and train my model on the full dataset to look for improvements. I believe this analysis could also be improved by finding data on this music application's competitors, to see whether there are more explainable reasons for the churn rate and to gain insight into how we can decrease churn.

Thank you for reading, and check out my full analysis on my GitHub here. Part 2 of my analysis will be coming soon. Please leave a like or comment with any suggestions.

Translated from: https://medium.com/python-in-plain-english/using-pyspark-to-predict-user-churn-6946ec29b6a6
