Mathematical Prediction Model
Given a large real-life data set containing information on the general population and an existing customer base, a major challenge is to target the most promising section of the population for customer acquisition. In this article, a data set containing personal, household, and building information on a representative sample of the German population and on the existing customer base of a mail order company is analyzed [1]. The goal is to create a model that predicts whether a person is likely to respond to a customer acquisition campaign.
The aim of the project is to create models for answering the following questions:
- What are the demographic features of a typical mail order customer?
- What models can be used for identifying a typical customer based on the features provided in a data set?
- Which models perform the best?
The strategy to answer the above questions is:
- to study the data and the nature of the problem: Is it a binary classification problem? Is it an imbalanced data set? Which metric should be chosen? Is there a lot of missing data, and what is the best strategy for handling it? Are the features continuous or categorical?
- to draw inferences from two big datasets without labeled responses (i.e. the general population and the customers of the mail order company) using unsupervised machine learning algorithms (such as principal component analysis for dimensionality reduction and subsequent cluster analysis). The inferences can come from finding clusters in which customers are over- or under-represented, hinting at important features for identifying a potential customer.
- to create a prediction model from the labeled training and test datasets of a mail order campaign, using popular new machine learning algorithms like xgboost as well as classical, proven-in-use algorithms like the gradient boosted machine.
Data Exploration and Wrangling
The data, provided in several spreadsheets, contains information on:
- almost a million persons (891,221) in Germany. The large dataset may take a long time to load on a machine with little memory.
- almost 200,000 customers of a mail-order company.
- Each dataset has at least 366 features (e.g. age group, household data, car ownership information, and others).
- In all the datasets (general population, customers, campaign targets), many features had missing or unknown data. Therefore, the datasets had to be updated to replace the various codes for missing or unknown values (e.g. ['9', 'X', 'XX', …]) with a uniform code (NaN), as sketched below.
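A rough sketch of this replacement step, as a hypothetical pandas helper; the column names and code lists below are illustrative only, the real per-column codes come from the attribute description files:

import numpy as np

# Illustrative mapping of column-specific "missing/unknown" codes to NaN;
# the actual codes per column are taken from the attribute descriptions.
unknown_codes = {
    'CAMEO_DEUG_2015': ['X', '-1'],
    'CAMEO_INTL_2015': ['XX', '-1'],
    'AGER_TYP': [-1, 0],
}

def replace_unknown_with_nan(df):
    """Replace the dataset-specific unknown codes with a uniform NaN marker."""
    df = df.copy()
    for col, codes in unknown_codes.items():
        if col in df.columns:
            df[col] = df[col].replace(codes, np.nan)
    return df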

Since an imputation strategy could introduce significantly false assumptions, rows missing more than 13 feature values and columns with more than 30% missing values were removed from the customer dataset. This data wrangling left the customer dataset with 356 instead of 366 features and 125,912 instead of 191,652 data points.
Thereafter, the customer and general population datasets had 34% and 39% fewer data points, respectively. Due to the sheer size of the dataset, the general population was further downsampled to 50% of its data to keep the execution time of the learning algorithms reasonable.
Furthermore, the following figure shows that the columns of the customer dataset now have at most 14% missing data. Similarly, the columns of the general population dataset now have at most 19% missing data. Apart from 4 features in each dataset, most features are missing at most 2.5% of their values.

The data preprocessing includes the following steps:
Imputation: Different strategies for filling in the missing values were tried out (e.g. most frequent, median, mean, fill with zero). The experiments later during supervised learning showed that filling missing values with zero produced the best results with xgboost. Any other imputation strategy would rest on a stronger assumption. A comparison is sketched below.
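A minimal sketch of that comparison, assuming scikit-learn's SimpleImputer and a toy frame standing in for the cleaned feature matrix:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-in for the cleaned feature matrix with missing values.
X = pd.DataFrame({'ANZ_PERSONEN': [1, 2, np.nan, 4],
                  'FINANZTYP': [1, np.nan, 3, 1]})

# 'constant' with fill_value=0 is the 'fill with zero' strategy that worked
# best with xgboost; 'most_frequent', 'median' and 'mean' were the alternatives.
X_zero = SimpleImputer(strategy='constant', fill_value=0).fit_transform(X)
X_freq = SimpleImputer(strategy='most_frequent').fit_transform(X)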
Encoding: The categorical feature types need to be identified and encoded as dummy variables before applying the learning algorithms. Since the dataset contains many categorical variables and PCA is designed for continuous variables, one needs to identify the categorical variables that are not ordinal in nature to obtain meaningful results. This information is available in ‘DIAS Attributes — Values 2017.xlsx’, which describes the features. There are 41 categorical features, but they are very often ordinal in nature. 18 categorical variables (e.g. ‘FINANZTYP’) were selected after studying the mentioned file. It must be noted that many of the features were not described, and these were not classified as categorical variables.
Scaling: The standard scaler is used. After encoding and scaling, we have 441 features instead of 355. A sketch of the encoding and scaling steps follows.
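A sketch of the encoding and scaling steps; the toy frame and the single categorical column only stand in for the real data and the 18 selected variables:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the imputed data; 'FINANZTYP' represents the selected
# non-ordinal categorical variables.
X_imputed = pd.DataFrame({'ANZ_PERSONEN': [1.0, 2.0, 0.0, 4.0],
                          'FINANZTYP': [1, 2, 3, 1]})

# One-hot encode the categorical variables as dummy columns, then bring all
# features to zero mean and unit variance.
X_encoded = pd.get_dummies(X_imputed, columns=['FINANZTYP'])
X_scaled = StandardScaler().fit_transform(X_encoded)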
Unsupervised Learning: What are the demographic features of a typical mail order customer?
Unsupervised learning is used for customer segmentation, i.e. to find the relationship between the existing customer base and the general population of Germany. This is done by

- reducing dimensionality using principal component analysis (PCA): to ease the interpretation of the data by creating new uncorrelated variables (principal components).
- interpreting the components by looking at the weights of the principal component features (see the sketch below). In the first principal component, the number of family houses in the PLZ8 region and mobility are very important features, e.g. a positive weight on PLZ8_ANTG3 and a negative weight on MOBI_REGIO. In the second principal component, financial character is important, e.g. a high weight on FINANZ_SPARER. In the third principal component, the type of car is important, e.g. a positive weight on KBA13_HERST_BMW_BENZ and a negative weight on KBA13_SITZE_5.
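A minimal sketch of the PCA step and of inspecting component weights; the random matrix and the short feature list only stand in for the scaled data:

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Toy stand-in for the scaled feature matrix (rows: persons, columns: features).
rng = np.random.default_rng(0)
feature_names = ['PLZ8_ANTG3', 'MOBI_REGIO', 'FINANZ_SPARER', 'KBA13_SITZE_5']
X_scaled = rng.normal(size=(200, len(feature_names)))

pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)

# The features with the largest absolute weights characterize a component.
def top_weights(component, n=4):
    w = pd.Series(pca.components_[component], index=feature_names)
    return w.reindex(w.abs().sort_values(ascending=False).index)[:n]

print(top_weights(component=0))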
Clustering: In this step, we first use k-means clustering to partition the general population dataset into k clusters. A major task is to find the optimal number of clusters. Different values of k were tried out and, using the elbow method, k = 8 was chosen as the number of clusters; the elbow computation is sketched below.
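A sketch of the elbow computation; the random matrix stands in for the PCA-transformed general population data and the range of k values is illustrative:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_pca = rng.normal(size=(500, 10))   # stand-in for the PCA-transformed data

# Sum of squared distances to the closest centroid for increasing k;
# the 'elbow' of this curve suggested k = 8 on the real data.
inertia = []
for k in range(2, 13):
    km = KMeans(n_clusters=k, random_state=42, n_init=10).fit(X_pca)
    inertia.append(km.inertia_)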

After fitting and transforming the customer data with the PCA model fitted on the general population, the following figure shows that persons in clusters 3 and 7 have a higher proclivity to becoming customers of the mail order company. Therefore, a targeted campaign could address only the population falling into these clusters. The persons in clusters 1 and 5 are the most under-represented among customers.
By comparing the centers of the most over-represented clusters (i.e. 3 and 7) with those of the most under-represented ones (1 and 5), one can identify the features that differ significantly. We found 50 important features, which can then be focused on when building a prediction model. The following figure shows the distribution of one such feature in both the customer and the general population dataset; the cluster comparison itself is sketched below.
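A rough sketch of the cluster comparison described above. Everything here runs on random stand-in data with placeholder feature names; on the real data, the PCA and k-means models fitted earlier are reused:

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
feature_names = ['FEATURE_%d' % i for i in range(20)]
X_pop = rng.normal(size=(500, 20))    # stand-in: scaled general population
X_cust = rng.normal(size=(200, 20))   # stand-in: scaled customer data

pca = PCA(n_components=10).fit(X_pop)
kmeans = KMeans(n_clusters=8, random_state=42, n_init=10).fit(pca.transform(X_pop))

# Share of each dataset falling into each cluster; ratios > 1 mark clusters in
# which customers are over-represented.
pop_share = pd.Series(kmeans.labels_).value_counts(normalize=True).sort_index()
cust_share = pd.Series(kmeans.predict(pca.transform(X_cust))).value_counts(normalize=True).sort_index()
representation = (cust_share / pop_share).fillna(0)

# Map the centers of the most over- and under-represented clusters back to the
# original feature space and rank the features by how much they differ.
over = pca.inverse_transform(kmeans.cluster_centers_[representation.idxmax()])
under = pca.inverse_transform(kmeans.cluster_centers_[representation.idxmin()])
top_features = pd.Series(over - under, index=feature_names).abs().sort_values(ascending=False)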

Supervised Learning: Building a prediction model
The task here is to create a supervised machine learning model for predicting whether a person would respond to a marketing campaign and become a customer. The interesting part of this exercise was finding out how the model compares to other teams' models in a Kaggle competition. The training and test sets contain more than 80,000 persons who were targets of a customer acquisition campaign. Only half of the data (i.e. 42,962 persons) contains the information whether the person responded or not. The major challenge was to create a model that predicts the response of the other half.
1. ETL: In the first step, we applied the same data wrangling methods of dropping features with a high amount of missing data. However, here we did not drop the rows with missing data. Furthermore, we applied the same imputation, encoding, and scaling strategy as in the preprocessing for unsupervised learning.
2. Training: The training set was then split into train and test subsets for training the prediction model and for cross-validation, respectively.
3. Metric: There are many metrics for binary classification, such as F1 score, precision, recall, AUC-ROC, and accuracy. The given dataset is imbalanced (only 1.2% of the contacted persons responded and almost 99% did not respond in the training set), so one can obtain 99% accuracy simply by labeling all data points as 'not responded'. Accuracy should therefore not be used as a metric. The AUC (Area Under Curve) of the ROC (Receiver Operating Characteristic) curve is an evaluation metric for binary classification problems, i.e. a performance measure that tells how well the model can distinguish between the 'responded' and 'not responded' classes. AUC-ROC = 0.8 means that the classifier has an 80% chance of ranking a randomly chosen responded case above a randomly chosen not-responded case, both of which are important classes. AUC-PR could also have been an alternative; a small example is sketched below.
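A small example of why ROC-AUC is preferred over accuracy here, computed with scikit-learn on toy labels and scores only:

from sklearn.metrics import accuracy_score, roc_auc_score

# Heavily imbalanced toy labels: one responder among ten contacted persons.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
y_score = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.2, 0.1, 0.7]

# Predicting 'not responded' for everyone already gives 90% accuracy here,
# while ROC-AUC rewards ranking the responder above the non-responders.
print(accuracy_score(y_true, [0] * len(y_true)))   # 0.9 without any skill
print(roc_auc_score(y_true, y_score))              # 1.0 for this toy ranking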
4. Learning Model: After several submissions to Kaggle, it was noted that among the many supervised learning classifiers, the xgboost classifier performed much better than the gradient boosting classifier and a simple linear classifier after tuning. The Gradient Boosted Machine (GBM) achieved a better training score at the cost of a longer execution time; however, GBM was prone to overfitting, i.e. it showed a better score on the training set, but on the test set the ROC-AUC score was low, as seen during several Kaggle submissions. Due to parallel processing in xgboost, training is much faster than with GBM. This matters during hyperparameter tuning, when hundreds of parameter settings need to be tried. XGBoost is also robust in handling missing data. It was noted that the imputation strategy of 'fill with zero' was better than 'most frequent' for xgboost. The hyperparameter search space used for tuning the xgboost classifier is shown below.
# Search space for Bayesian hyperparameter optimization with BayesSearchCV:
# tuples define (low, high[, prior]) ranges, lists define categorical choices.
search_spaces = {
    'learning_rate': (0.01, 0.2, 'log-uniform'),
    'max_depth': (2, 5),
    'min_child_weight': (2, 8),
    'subsample': [0.7, 0.8, 0.9, 1.0],
    'colsample_bytree': (0.5, 1.0, 'uniform'),
    'n_estimators': (40, 100),
    'colsample_bylevel': (0.1, 1.0, 'uniform'),
    'scale_pos_weight': [1],
}


5. Optimization: Therefore, the focus thereafter was on tuning xgboost classifier parameters like 'learning_rate', 'max_depth', 'n_estimators', and 'min_child_weight'. An initial, time-consuming optimization using GridSearchCV, together with tuning one parameter at a time, helped to fix the ranges of values for the subsequent Bayesian optimization over the hyperparameters using BayesSearchCV from the skopt library (see the sketch below).
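A sketch of the Bayesian search, assuming the search_spaces dictionary above and a toy training split; the scorer, cross-validation, and iteration count are illustrative:

import numpy as np
from skopt import BayesSearchCV
from xgboost import XGBClassifier

# Toy stand-in for the prepared training split from the ETL/Training steps.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 5))
y_train = rng.integers(0, 2, size=300)

# Bayesian optimization over the xgboost hyperparameters defined above,
# scored by ROC-AUC with cross-validation.
opt = BayesSearchCV(
    estimator=XGBClassifier(),
    search_spaces=search_spaces,
    scoring='roc_auc',
    cv=3,
    n_iter=30,
    random_state=42,
)
opt.fit(X_train, y_train)
print(opt.best_params_, opt.best_score_)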
6. Results: Using the XGBoost classifier, the following score was obtained in Kaggle.

A nice feature of the XGBoost classifier is that it can output a list of features sorted by their importance (see the sketch below). The most important feature found by almost all classifiers was an undocumented feature, 'D19_SOZIALES', which may indicate the social status of a person. Visualizing this feature for the general population and the customers, one can see in the following figure that it has a major influence on predicting customer acquisition.
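A sketch of extracting the ranked importances from a fitted booster. The data below is a random stand-in; in the project, the tuned classifier from the optimization step is used and 'D19_SOZIALES' comes out on top:

import numpy as np
import pandas as pd
from xgboost import XGBClassifier

# Toy stand-in data; the column names echo features mentioned in the article.
rng = np.random.default_rng(0)
feature_names = ['D19_SOZIALES', 'PLZ8_ANTG3', 'MOBI_REGIO', 'FINANZ_SPARER']
X = pd.DataFrame(rng.normal(size=(300, 4)), columns=feature_names)
y = rng.integers(0, 2, size=300)

model = XGBClassifier(n_estimators=50, max_depth=3).fit(X, y)

# Rank the features by the importance the fitted booster assigns to them.
importances = pd.Series(model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))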

Conclusion
The project was a real-life project, where one got to know the problem better over time. The fun part was the addictive nature of the Kaggle competition. Over time, the important breakthrough moments were: the selection of a classifier that is more amenable to hyperparameter tuning (i.e. XGBoost); the realization that a wrong imputation strategy can produce results that look either too good or too bad, depending on the amount of missing values in the dataset (we found that the simple 'fill with zero' strategy was good enough, as XGBoost is quite robust to missing values); and, finally, the selection of an appropriate range of parameters for tuning. Very often, the optimization search found parameters that led to an overfitted solution. For example, setting the upper bound of 'n_estimators' to 100 instead of 500 not only reduced the execution time of the optimization search, but also found a classifier that was not overfitted and whose score improved the ranking by 30 places, from 52 to 22.
Finally, one can say that effort in feature engineering (e.g. handling missing data, removing undocumented features) pays off in developing better models. Parameter tuning of supervised learning models (e.g. xgboost, gradient boosting classification, …) leads to better models only after good feature engineering (e.g. identifying categorical variables for one-hot encoding). An investment in a powerful machine is a must for a data science project. As future work, one could try methods like iterative imputation, better feature engineering (e.g. featuretools), and other models such as Keras deep learning.
[1] The data set is provided by Arvato Financial Services for an Udacity capstone project.
[2] The notebooks and python files are available in Github repo: https://github.com/kross11480/courses/blob/master/capstone/Arvato%20Project%20Workbook.ipynb