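Contents
- (I) Data Exploration
- 1. Read the data files and inspect the rows and columns
- 2. View summary statistics for the numeric columns
- 3. Visualize the value distributions of the numeric columns
- 4. View the values of the text columns
- 5. View the value distributions of the text columns
- 6. Inspect a single row and a single field
- 7. Analyze how education values relate to wage_class counts
- (II) Data Cleaning
- 1. Strip leading and trailing whitespace from all text fields
- 2. Unify the class labels
- 3. Handle '?' fields
- (III) Data Preprocessing
- 1. Experiment with converting a text field to a numeric field
- 2. Convert all text columns to numeric codes
- (IV) Model Training
- 1. Preparation
- 2. Train an XGBoost model and select the best parameters
- 3. Evaluate model performance
- 4. Tune the hyperparameters again
- 5. Find the best point at which to stop training
- 6. Evaluate the final model
- (V) Feature Analysis
- 1. Examine the correlations between features
- 2. Remove strongly correlated, redundant features
- 3. Bin the age feature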
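Read data from the specified source, process it as needed, select suitable features, and build a classification model that predicts whether a person's annual income exceeds 50K.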
Data source: the 1994 US Census database (original download: https://archive.ics.uci.edu/ml/datasets/Adult). The data is stored in the data directory: adult.data holds the training data and adult.test holds the test data.
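Feature columns
age: age, integer
workclass: type of employment, string with a small set of values, e.g. Private, State-gov
education: education level, string with a small set of values, e.g. Bachelors, Masters
education_num: years of education, integer
marital_status: marital status, string with a small set of values, e.g. Never-married, Divorced
occupation: occupation, string with a small set of values, e.g. Sales, Tech-Support
relationship: family relationship, string with a small set of values, e.g. Husband, Wife
race: race, string with a small set of values, e.g. White, Black
sex: sex, string with a small set of values, e.g. Female, Male
capital_gain: capital gains, floating-point number
capital_loss: capital losses, floating-point number
hours_per_week: hours worked per week, floating-point number
native_country: country of origin, with a small set of values, e.g. United-States, Mexico
Label column: income (named wage_class in the code below)
>50K
≤50K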
(I) Data Exploration
1. Read the data files and inspect the rows and columns
import numpy as np
import pandas as pd
train_data_path = 'adult.txt'
test_data_path = 'adult.test'
# Loading code omitted
train_data = ...
test_data = ...
col_labels = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status',
              'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss',
              'hours_per_week', 'native_country', 'wage_class']
train_data.columns = col_labels
test_data.columns = col_labels
train_data.info()  # info() prints its summary directly and returns None
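The loading step itself is omitted above. As a minimal sketch of one way to do it, assuming the files are comma-separated with no header row: the `load_adult` helper and the two inline sample rows are illustrative stand-ins, not the author's code. If a file starts with a non-data line (the UCI copy of adult.test is commonly reported to), passing `skiprows=1` would skip it.

```python
import io
import pandas as pd

COL_LABELS = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status',
              'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss',
              'hours_per_week', 'native_country', 'wage_class']

def load_adult(source, skiprows=0):
    """Hypothetical helper: read a headerless, comma-separated Adult-style file."""
    return pd.read_csv(source, header=None, names=COL_LABELS, skiprows=skiprows)

# Two rows in the raw file layout, standing in for the real file on disk.
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, "
    "Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
)
demo = load_adult(sample)
print(demo.shape)
# Text fields keep the space that follows each comma in the raw file.
print(repr(demo.loc[0, 'workclass']))
```

Passing `names=` at read time has the same effect as assigning `columns` afterwards. The leading space in the text fields is what the cleaning step in part (II) removes.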
2. View summary statistics for the numeric columns
print(train_data.describe())
3. Visualize the value distributions of the numeric columns
import matplotlib.pyplot as plt
%matplotlib inline
numeric_columns = ['age','fnlwgt','education_num','capital_gain','capital_loss','hours_per_week']
plt.figure(figsize=(16,12))
for i in range(len(numeric_columns)):
    # Plotting code omitted
    ...
    plt.ylabel("Frequency")
plt.savefig("one.png")
plt.show()
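The plotting body is omitted above. A minimal sketch of what such a loop could look like, assuming one histogram per numeric column on a 2x3 grid: the grid shape, the bin count, and the synthetic `demo` frame are all assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs outside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for train_data; the real values come from the Adult dataset.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'age': rng.integers(17, 90, size=200),
    'hours_per_week': rng.integers(1, 99, size=200),
})
numeric_columns = ['age', 'hours_per_week']

fig = plt.figure(figsize=(16, 12))
for i, column in enumerate(numeric_columns):
    ax = fig.add_subplot(2, 3, i + 1)  # a 2x3 grid has room for six numeric columns
    ax.hist(demo[column], bins=20)
    ax.set_title(column)
    ax.set_xlabel(column)
    ax.set_ylabel("Frequency")
fig.tight_layout()
```

In a notebook, `plt.savefig(...)` and `plt.show()` would follow as in the original snippet.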
4. View the values of the text columns
print("Training data:")
for column in train_data.columns:
    if train_data[column].dtype == 'object':
        print(column + " values:")
        print(train_data[column].unique())
        print("==========================")
print("=========================================================")
print("Test data:")
for column in test_data.columns:
    if test_data[column].dtype == 'object':
        print(column + " values:")
        print(test_data[column].unique())
        print("==========================")
5. View the value distributions of the text columns
text_columns = ['workclass','education','marital_status','occupation','relationship','race','sex','native_country','wage_class']
...
for i in range(len(text_columns)):
    # Plotting code omitted
    ...
plt.savefig("two.png")
plt.show()
6. Inspect a single row and a single field
print(train_data.iloc[0])
print("==========================")
workclass = train_data.iloc[0, 1]
print(workclass)
print(len(workclass))
7. Analyze how education values relate to wage_class counts
print("Education values:")
print(train_data.education.unique())
result = pd.crosstab(index=train_data['wage_class'], columns=train_data['education'], rownames=['wage_class'])
print(result)
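Raw counts are hard to compare across education levels of very different sizes. As a small sketch on made-up rows (not real census records), `pd.crosstab` with `normalize='columns'` turns each education column into shares that sum to 1:

```python
import pandas as pd

# Toy rows for illustration only.
demo = pd.DataFrame({
    'education': ['Bachelors', 'Bachelors', 'HS-grad', 'HS-grad', 'HS-grad', 'Masters'],
    'wage_class': ['>50K', '<=50K', '<=50K', '<=50K', '>50K', '>50K'],
})

# Absolute counts, as in the snippet above.
counts = pd.crosstab(index=demo['wage_class'], columns=demo['education'])
# Share of each wage class within every education level.
shares = pd.crosstab(index=demo['wage_class'], columns=demo['education'], normalize='columns')

print(counts)
print(shares.round(2))
```

Reading across the `>50K` row of `shares` gives the fraction of high earners per education level directly.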
(II) Data Cleaning
1. Strip leading and trailing whitespace from all text fields
for column_index in train_data.dtypes.index:
    if train_data.dtypes[column_index] == 'object':
        train_data[column_index] = train_data[column_index].str.strip()
for column_index in test_data.dtypes.index:
    if test_data.dtypes[column_index] == 'object':
        test_data[column_index] = test_data[column_index].str.strip()
workclass = train_data.iloc[0, 1]
print(workclass)
print(len(workclass))  # length of workclass after stripping
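The two loops above do the same thing to both frames, so the step can be wrapped in a function. A minimal sketch on toy data, where `strip_text_columns` is a hypothetical helper name:

```python
import pandas as pd
from pandas.api.types import is_string_dtype

def strip_text_columns(df):
    """Hypothetical helper: return a copy with whitespace stripped from every text column."""
    out = df.copy()
    for column in out.columns:
        if is_string_dtype(out[column]):
            out[column] = out[column].str.strip()
    return out

demo = pd.DataFrame({'age': [39, 50], 'workclass': [' State-gov', ' Private']})
cleaned = strip_text_columns(demo)

# Length before and after stripping the leading space.
print(len(demo.loc[0, 'workclass']), len(cleaned.loc[0, 'workclass']))  # 10 9
```

Returning a copy leaves the input untouched, which makes it easy to compare the data before and after cleaning.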
2. Unify the class labels
test_data['wage_class'] = test_data['wage_class'].str.strip('.')
print(test_data['wage_class'].unique())
3. Handle '?' fields
print("Original data:")
print(train_data['workclass'].unique())
train_data = train_data.replace('?', np.nan).dropna()
print("Data after the change:")
print(train_data['workclass'].unique())
print("Test data before the fix:")
print(test_data.loc[:10, ['workclass', 'occupation', 'native_country']])
for column in test_data.columns:
    if test_data[column].dtype == 'object':
        column_most_common_value = test_data[column].value_counts().index[0]
        test_data[column] = test_data[column].replace('?', column_most_common_value)
print("Test data after the fix:")
print(test_data.loc[:10, ['workclass', 'occupation', 'native_country']])
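The snippet above uses two different strategies: rows containing '?' are dropped from the training set, while in the test set '?' is replaced with the column's most frequent value so that no test rows are lost. A small sketch of both on toy data follows. One deliberate difference from the original: the mode here is computed after excluding '?', which guards against the case where '?' is itself the most frequent value.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only.
demo = pd.DataFrame({
    'workclass': ['Private', '?', 'Private', 'State-gov'],
    'occupation': ['Sales', 'Sales', '?', 'Tech-support'],
})

# Strategy used for the training set: treat '?' as missing and drop those rows.
dropped = demo.replace('?', np.nan).dropna()

# Strategy used for the test set: fill '?' with the most frequent known value.
filled = demo.copy()
for column in filled.columns:
    known = filled.loc[filled[column] != '?', column]
    most_common = known.value_counts().index[0]  # mode, ignoring '?' itself
    filled[column] = filled[column].replace('?', most_common)

print(len(dropped), len(filled))  # 2 4
print(filled)
```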
(III) Data Preprocessing
1. Experiment with converting a text field to a numeric field
workclass_categorical = pd.Categorical(train_data['workclass'])
print(workclass_categorical.codes)
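To make the encoding concrete, here is a small sketch on four toy values. `pd.Categorical` sorts the distinct values into `categories`, and each entry of `codes` is that value's index in the sorted list:

```python
import pandas as pd

values = pd.Series(['Private', 'State-gov', 'Private', 'Local-gov'])
cat = pd.Categorical(values)

print(list(cat.categories))  # ['Local-gov', 'Private', 'State-gov']
print(cat.codes.tolist())    # [1, 2, 1, 0]

# Reverse lookup from code back to the original label.
mapping = dict(enumerate(cat.categories))
print(mapping[2])            # State-gov
```

Keeping `mapping` around makes it possible to translate encoded values back into readable labels later.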
2. Convert all text columns to numeric codes
merged_data = pd.concat([train_data, test_data])
for column in merged_data.columns:
    if merged_data[column].dtype == 'object':
        merged_data[column] = pd.Categorical(merged_data[column]).codes
train_data = ...
test_data = ...
print("Training data shape:", train_data.shape)
print("Test data shape:", test_data.shape)
print(test_data.head())
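Encoding the merged frame guarantees that a given label receives the same code in both sets. The step that splits the merged frame back apart is omitted above. One possible way to do it, assuming `pd.concat` kept the training rows first, is to slice by the number of training rows. The toy frames and variable names below are illustrative:

```python
import pandas as pd

train_demo = pd.DataFrame({'workclass': ['Private', 'State-gov'], 'age': [39, 50]})
test_demo = pd.DataFrame({'workclass': ['Local-gov', 'Private'], 'age': [25, 38]})

# Encode on the merged frame so both sets share one label-to-code mapping.
merged = pd.concat([train_demo, test_demo], ignore_index=True)
merged['workclass'] = pd.Categorical(merged['workclass']).codes

# Split by position: the first len(train_demo) rows are the training rows.
n_train = len(train_demo)
train_encoded = merged.iloc[:n_train]
test_encoded = merged.iloc[n_train:].reset_index(drop=True)

print(train_encoded['workclass'].tolist())  # [1, 2]
print(test_encoded['workclass'].tolist())   # [0, 1]
```

'Private' is encoded as 1 in both frames, which would not be guaranteed if each frame were encoded separately.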
(IV) Model Training
1. Preparation
X_train = train_data.iloc[:, :-1]
y_train = train_data['wage_class']
X_test = test_data.iloc[:, :-1]
y_test = test_data['wage_class']
cv_params = {'max_depth': [3, 5, 7], 'min_child_weight': [1, 3, 5]}
ind_params = {'learning_rate': 0.1, 'n_estimators': 1000, 'seed': 0,
              'subsample': 0.8, 'colsample_bytree': 0.8,
              'objective': 'binary:logistic'}
2. Train an XGBoost model and select the best parameters
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
optimized_GBM = GridSearchCV(XGBClassifier(**ind_params), cv_params, scoring='accuracy', ...)  # remaining arguments are up to you (step 4 uses cv=5, n_jobs=-1, verbose=10)
optimized_GBM.fit(X_train, y_train)
print("Best parameters:", optimized_GBM.best_params_)
means = optimized_GBM.cv_results_['mean_test_score']
stds = optimized_GBM.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, optimized_GBM.cv_results_['params']):
    print("%0.5f (+/-%0.05f) for %r" % (mean, std * 2, params))
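The grid above has 3 x 3 = 9 parameter combinations, and each one is trained once per cross-validation fold. The same pattern can be tried quickly on synthetic data. The sketch below deliberately swaps in scikit-learn's `DecisionTreeClassifier` as a lightweight stand-in for XGBoost, so the parameter names and the data are assumptions for illustration only:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# 2 x 2 = 4 combinations, each scored with 5-fold cross-validation.
param_grid = {'max_depth': [2, 4], 'min_samples_leaf': [1, 5]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid,
                      scoring='accuracy', cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
results = search.cv_results_
for mean, std, params in zip(results['mean_test_score'],
                             results['std_test_score'],
                             results['params']):
    print("%0.5f (+/-%0.05f) for %r" % (mean, std * 2, params))
```

`cv_results_['params']` holds one entry per combination, so the loop prints one line of mean accuracy and spread for each.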
3. Evaluate model performance
from sklearn.metrics import classification_report
y_pred = optimized_GBM.predict(X_test)
print(classification_report(y_test, y_pred))
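The numbers in `classification_report` come from the confusion matrix. A small sketch with hand-made labels shows how precision and recall for the positive class are derived. The labels are toy values. In the encoded data, '<=50K' sorts before '>50K', so `pd.Categorical` would be expected to assign them 0 and 1 respectively.

```python
from sklearn.metrics import classification_report, confusion_matrix

y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields tn, fp, fn, tp in that order.
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()

precision_pos = tp / (tp + fp)  # of the predicted positives, how many are correct
recall_pos = tp / (tp + fn)     # of the actual positives, how many were found

print(cm)
print(precision_pos, recall_pos)
print(classification_report(y_true, y_hat))
```

Here precision for class 1 is 2/3 and recall is 1/2, matching the class 1 row of the printed report.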
4. Tune the hyperparameters again
cv_params = {...}
ind_params = {...}
optimized_GBM = GridSearchCV(XGBClassifier(**ind_params), cv_params, scoring='accuracy', cv=5, n_jobs=-1, verbose=10)
optimized_GBM.fit(X_train, y_train)
print("Best parameters:", optimized_GBM.best_params_)
means = optimized_GBM.cv_results_['mean_test_score']
stds = optimized_GBM.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, optimized_GBM.cv_results_['params']):
    print("%0.5f (+/-%0.05f) for %r" % (mean, std * 2, params))
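5. Find the best point at which to stop training
ind_params = {...}
eval_set = [(X_test, y_test)]
# XGBoost 1.6+ takes early_stopping_rounds and eval_metric on the estimator; passing them to fit() was removed in 2.0
model = XGBClassifier(**ind_params, early_stopping_rounds=100, eval_metric="error")
result = model.fit(X_train, y_train, eval_set=eval_set, verbose=20)  # train
print("Best iteration:", result.best_iteration)
6. Evaluate the final model
# ntree_limit was removed in XGBoost 2.0; the end of iteration_range is exclusive, hence the +1
y_pred = model.predict(X_test, iteration_range=(0, result.best_iteration + 1))
print(classification_report(y_test, y_pred))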
(V) Feature Analysis
1. Examine the correlations between features
import seaborn as sns
sns.set(...)
# Plotting code omitted
...
plt.savefig("three.png")
plt.show()
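The plotting code is omitted above, but the underlying computation is a correlation matrix. As a hedged sketch on synthetic columns, where the generic names `x`, `x_scaled`, `z` and the 0.8 threshold are assumptions, strongly correlated pairs can also be listed programmatically:

```python
import numpy as np
import pandas as pd

# Synthetic data: x_scaled is nearly a rescaled copy of x, z is independent noise.
rng = np.random.default_rng(0)
base = rng.normal(size=500)
demo = pd.DataFrame({
    'x': base,
    'x_scaled': base * 2.0 + rng.normal(scale=0.1, size=500),
    'z': rng.normal(size=500),
})

corr = demo.corr()  # pairwise Pearson correlation
threshold = 0.8     # assumed cut-off for "strongly correlated"

# Keep only the upper triangle so each pair is reported once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
strong_pairs = [(row, col, upper.loc[row, col])
                for row in upper.index for col in upper.columns
                if pd.notna(upper.loc[row, col]) and abs(upper.loc[row, col]) > threshold]
print(strong_pairs)
```

A matrix like `corr` is what a call such as `sns.heatmap(corr, annot=True)` would draw.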
2. Remove strongly correlated, redundant features
from xgboost import XGBClassifier
from sklearn.metrics import classification_report
ind_params = {...}
X_train_reduced = X_train.drop(columns=['education_num', 'relationship'])
X_test_reduced = X_test.drop(columns=['education_num', 'relationship'])
eval_set = [(X_test_reduced, y_test)]
# XGBoost 1.6+ takes early_stopping_rounds and eval_metric on the estimator; passing them to fit() was removed in 2.0
model = XGBClassifier(**ind_params, early_stopping_rounds=100, eval_metric="error")
result = model.fit(X_train_reduced, y_train, eval_set=eval_set, verbose=50)
print("Best iteration:", result.best_iteration)
# ntree_limit was removed in XGBoost 2.0; the end of iteration_range is exclusive, hence the +1
y_pred = model.predict(X_test_reduced, iteration_range=(0, result.best_iteration + 1))  # predict on the test set
print(classification_report(y_test, y_pred))
3. Bin the age feature
age_bins = [10, 30, 40, 50, 60, 70]
X_train_reduced['age'] = np.digitize(X_train['age'], bins=age_bins)
X_test_reduced['age'] = np.digitize(X_test['age'], bins=age_bins)
print(X_train_reduced['age'].unique())
eval_set = [(X_test_reduced, y_test)]
ind_params = {...}
# XGBoost 1.6+ takes early_stopping_rounds and eval_metric on the estimator; passing them to fit() was removed in 2.0
model = XGBClassifier(**ind_params, early_stopping_rounds=100, eval_metric="error")
result = model.fit(X_train_reduced, y_train, eval_set=eval_set, verbose=20)
print("Best iteration:", result.best_iteration)
# ntree_limit was removed in XGBoost 2.0; the end of iteration_range is exclusive, hence the +1
y_pred = model.predict(X_test_reduced, iteration_range=(0, result.best_iteration + 1))
print(classification_report(y_test, y_pred))
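To see what `np.digitize` does with the `age_bins` used above, here is a small sketch on a few sample ages chosen for illustration. With the default `right=False`, a value `x` gets index `i` when `bins[i-1] <= x < bins[i]`, and values at or beyond the last edge get `len(bins)`:

```python
import numpy as np

age_bins = [10, 30, 40, 50, 60, 70]
ages = np.array([17, 30, 39, 45, 59, 65, 90])

# Each age is replaced by the index of the interval it falls into.
bin_index = np.digitize(ages, bins=age_bins)
print(bin_index.tolist())  # [1, 2, 2, 3, 4, 5, 6]
```

An age of exactly 30 lands in bin 2 (30 to 39), and everyone aged 70 or older shares bin 6.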
Some of the code has been omitted.