Classification Prediction and Ensemble Learning (Machine Learning)

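This article walks through working with data extracted from the 1994 US Census database: data exploration, cleaning, preprocessing (converting text to numeric values, feature encoding), model training (classification with XGBoost), and feature analysis (correlation checks and feature selection). Together, these steps build a model that predicts whether a person's annual income exceeds 50K.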

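(I) Data Exploration
- 1. Read the data files and inspect rows and columns
- 2. View summary statistics for the numeric columns
- 3. Visualize the distributions of the numeric columns
- 4. List the values of the text columns
- 5. Visualize the value distributions of the text columns
- 6. Inspect a single row and a single field
- 7. Cross-tabulate education against wage_class
(II) Data Cleaning
- 1. Strip leading and trailing whitespace from all text fields
- 2. Unify the class labels
- 3. Handle '?' fields
(III) Data Preprocessing
- 1. Experiment with converting a text field to numeric codes
- 2. Convert all text columns to numeric codes
(IV) Model Training
- 1. Preparation
- 2. Train an XGBoost model and select the best parameters
- 3. Evaluate model performance
- 4. Tune the hyperparameters again
- 5. Find the best point to stop training
- 6. Evaluate the final model
(V) Feature Analysis
- 1. Inspect correlations between features
- 2. Remove strongly correlated, redundant features
- 3. Bin the age feature
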
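Read data from the specified source, process it as needed, select suitable features, and build a classification model that determines whether a person's annual income exceeds 50K.
Data source: the 1994 US Census database (original download: https://archive.ics.uci.edu/ml/datasets/Adult). The data is stored in the data directory: adult.data holds the training data and adult.test holds the test data.
Feature columns
age: age, integer
workclass: type of employment, string with a small set of values, e.g. Private, State-gov
education: education level, string with a small set of values, e.g. Bachelors, Masters
education_num: years of education, integer
marital_status: marital status, string with a small set of values, e.g. Never-married, Divorced
occupation: occupation, string with a small set of values, e.g. Sales, Tech-Support
relationship: family relationship, string with a small set of values, e.g. Husband, Wife
race: race, string with a small set of values, e.g. White, Black
sex: sex, string with a small set of values, e.g. Female, Male
capital_gain: capital gains, floating-point number
capital_loss: capital losses, floating-point number
hours_per_week: hours worked per week, floating-point number
native_country: country of origin, with a small set of values, e.g. United-States, Mexico
Label column: income (named wage_class in the code below)
>50K
≤50K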

(I) Data Exploration

1. Read the data files and inspect rows and columns

import numpy as np
import pandas as pd
train_data_path = 'adult.txt'
test_data_path = 'adult.test'
# Loading step omitted in the original
train_data = ...
test_data = ...
# Column names: 14 feature columns plus the label column (wage_class)
col_labels = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status',
              'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss',
              'hours_per_week', 'native_country', 'wage_class']
train_data.columns = col_labels
test_data.columns = col_labels
# info() prints its summary itself and returns None, so it is not wrapped in print()
train_data.info()
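
The loading step is omitted above. As a rough sketch of one way to do it, assuming the files use the comma-separated, header-less layout of the UCI download, the snippet below reads two sample rows from an in-memory string instead of the real file. `load_adult` and the sample text are illustrative assumptions, not the author's code.

```python
import io

import pandas as pd

col_labels = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status',
              'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss',
              'hours_per_week', 'native_country', 'wage_class']

# Two sample rows in the comma-separated layout (a stand-in for the real file).
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, "
    "Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
)


def load_adult(source, skiprows=0):
    """Hypothetical helper: read a header-less CSV and attach the column labels."""
    return pd.read_csv(source, header=None, names=col_labels, skiprows=skiprows)


train_sample = load_adult(sample)
print(train_sample.shape)
# String fields keep the space that follows each comma.
print(repr(train_sample.loc[0, 'workclass']))
```

Read this way, string fields keep their leading space, which is why the cleaning section strips whitespace. The UCI adult.test file also begins with a non-data line, so passing skiprows=1 for it is a reasonable assumption.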

2. View summary statistics for the numeric columns

print(train_data.describe())

3. Visualize the distributions of the numeric columns

import matplotlib.pyplot as plt
# Jupyter magic: render figures inline (notebook only)
%matplotlib inline
numeric_columns = ['age','fnlwgt','education_num','capital_gain','capital_loss','hours_per_week']
plt.figure(figsize=(16,12))
for i in range(len(numeric_columns)):
    # Plotting code omitted in the original
    ...
    plt.ylabel("Frequency")
plt.savefig("one.png")
plt.show()

4. List the values of the text columns

print("Training data:")
for column in train_data.columns:
    # Text columns are stored with dtype 'object'
    if train_data[column].dtype == 'object':
        print(column + " values:")
        print(train_data[column].unique())
        print("==========================")
print("=========================================================")
print("Test data:")
for column in test_data.columns:
    if test_data[column].dtype == 'object':
        print(column + " values:")
        print(test_data[column].unique())
        print("==========================")

5. Visualize the value distributions of the text columns

text_columns = ['workclass','education','marital_status','occupation','relationship','race','sex','native_country','wage_class']
...
for i in range(len(text_columns)):
    # Plotting code omitted in the original
    ...
plt.savefig("two.png")
plt.show()

6. Inspect a single row and a single field

print(train_data.iloc[0])
print("==========================")
workclass = train_data.iloc[0, 1]
print(workclass)  
print(len(workclass))

7. Cross-tabulate education against wage_class

print("Education values:")
print(train_data.education.unique())
# Count rows for each (wage_class, education) combination
result = pd.crosstab(index=train_data['wage_class'], columns=train_data['education'], rownames=['wage_class'])
print(result)
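
Raw counts are hard to compare across education levels of very different sizes. As a possible follow-up, `pd.crosstab` accepts `normalize="columns"` to return the share of each wage class within every education level. The toy rows below are made up for illustration and are not taken from the census data.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "education": ["Bachelors", "Bachelors", "HS-grad", "HS-grad", "HS-grad", "Masters"],
    "wage_class": [">50K", "<=50K", "<=50K", "<=50K", ">50K", ">50K"],
})

# normalize="columns" makes every education column sum to 1.
share = pd.crosstab(index=toy["wage_class"], columns=toy["education"], normalize="columns")
print(share.round(2))
```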

(II) Data Cleaning

1. Strip leading and trailing whitespace from all text fields

# Strip whitespace from every text (object) column in both data sets
for column_index in train_data.dtypes.index:
    if train_data.dtypes[column_index] == 'object':
        train_data[column_index] = train_data[column_index].str.strip()
for column_index in test_data.dtypes.index:
    if test_data.dtypes[column_index] == 'object':
        test_data[column_index] = test_data[column_index].str.strip()
workclass = train_data.iloc[0, 1]
print(workclass)
print(len(workclass))  # length of workclass after stripping
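
To see why this step matters, the small sketch below (with made-up values) shows that an equality test against a category name finds nothing until the surrounding whitespace has been stripped.

```python
import pandas as pd

# Made-up values with the leading space left over from parsing.
raw = pd.Series([" Private", " State-gov", " Private"])
matches_before = int((raw == "Private").sum())

clean = raw.str.strip()
matches_after = int((clean == "Private").sum())

print(matches_before, matches_after)
```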

2. Unify the class labels

test_data['wage_class'] = test_data['wage_class'].str.strip('.')
print(test_data['wage_class'].unique())
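
In the UCI files the test labels carry a trailing period (for example `>50K.`), which is presumably why this step is needed. A minimal illustration on made-up labels: `str.strip('.')` removes periods at both ends, and `str.rstrip('.')` is a slightly narrower alternative that only touches the end.

```python
import pandas as pd

# Made-up labels in the style of the test file.
labels = pd.Series(["<=50K.", ">50K.", "<=50K."])

stripped = labels.str.strip(".")    # same call as in the step above
rstripped = labels.str.rstrip(".")  # only removes trailing periods

print(stripped.unique().tolist())
```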

3. Handle '?' fields

print("Original data:")
print(train_data['workclass'].unique())
# Treat '?' as missing and drop every training row that contains one
train_data = train_data.replace('?', np.nan).dropna()
print("Data after the change:")
print(train_data['workclass'].unique())

print("Test data before the fix:")
print(test_data.loc[:10, ['workclass', 'occupation', 'native_country']])
# In the test data, replace '?' with the most common value of each text column
for column in test_data.columns:
    if test_data[column].dtype == 'object':
        column_most_common_value = test_data[column].value_counts().index[0]
        test_data[column] = test_data[column].replace('?', column_most_common_value)
print("Test data after the fix:")
print(test_data.loc[:10, ['workclass', 'occupation', 'native_country']])
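
One edge case of the approach above: if '?' itself were the most frequent value in a column, the replacement would do nothing. A hedged variant is sketched below on made-up data; it excludes the placeholder before picking the most common value. `fill_placeholder` is a hypothetical helper name, not part of the original code.

```python
import pandas as pd

# Made-up column in which '?' is the most frequent value.
toy = pd.DataFrame({"workclass": ["Private", "?", "Private", "State-gov", "?", "?"]})


def fill_placeholder(series, placeholder="?"):
    """Hypothetical helper: replace the placeholder with the most common real value."""
    most_common = series[series != placeholder].value_counts().index[0]
    return series.replace(placeholder, most_common)


filled = fill_placeholder(toy["workclass"])
print(filled.tolist())
```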

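(III) Data Preprocessing

1. Experiment with converting a text field to numeric codes

# pd.Categorical assigns each distinct value an integer code
workclass_categorical = pd.Categorical(train_data['workclass'])
print(workclass_categorical.codes)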
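
A small worked example makes the mapping concrete. For plain strings, `pd.Categorical` sorts the distinct values and numbers them from 0; the values below are made up for illustration.

```python
import pandas as pd

values = ["Private", "State-gov", "Private", "Local-gov"]
cat = pd.Categorical(values)

categories = list(cat.categories)  # distinct values in sorted order
codes = cat.codes.tolist()         # position of each value in that order

print(categories)
print(codes)
```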

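2. Convert all text columns to numeric codes

# Encode train and test together so that both share the same value-to-code mapping
merged_data = pd.concat([train_data, test_data])
for column in merged_data.columns:
    if merged_data[column].dtype == 'object':
        merged_data[column] = pd.Categorical(merged_data[column]).codes
# Splitting merged_data back into train and test is omitted in the original
train_data = ...
test_data = ...
print("Training data shape:", train_data.shape)
print("Test data shape:", test_data.shape)
print(test_data.head())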
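
The split back into training and test sets is elided above. One possible way to do it, shown here on made-up frames, is to remember the number of training rows and slice the merged frame by position. This is an assumption about the omitted step, not the author's code.

```python
import pandas as pd

# Made-up stand-ins for the cleaned train and test frames.
train = pd.DataFrame({"workclass": ["Private", "State-gov"], "wage_class": ["<=50K", ">50K"]})
test = pd.DataFrame({"workclass": ["Local-gov", "Private"], "wage_class": [">50K", "<=50K"]})

merged = pd.concat([train, test], ignore_index=True)
for column in ["workclass", "wage_class"]:
    merged[column] = pd.Categorical(merged[column]).codes

# The first len(train) rows came from the training set, the rest from the test set.
n_train = len(train)
train_enc = merged.iloc[:n_train]
test_enc = merged.iloc[n_train:].reset_index(drop=True)

print(train_enc)
print(test_enc)
```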

(IV) Model Training

1. Preparation

X_train = train_data.iloc[:, :-1]
y_train = train_data['wage_class']
X_test = test_data.iloc[:, :-1]
y_test = test_data['wage_class']
cv_params = {'max_depth': [3, 5, 7], 'min_child_weight': [1, 3, 5]}
ind_params = {'learning_rate': 0.1, 'n_estimators': 1000, 'seed': 0,
              'subsample': 0.8, 'colsample_bytree': 0.8,
              'objective': 'binary:logistic'}
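
The `cv_params` dictionary above defines a grid: a grid search tries every combination of the listed values. The sketch below simply enumerates those combinations with `itertools.product` to show how many candidates there are. The fold count of 5 is an assumption, taken from the `cv=5` used later in step 4.

```python
from itertools import product

cv_params = {'max_depth': [3, 5, 7], 'min_child_weight': [1, 3, 5]}

names = list(cv_params)
grid = [dict(zip(names, combo)) for combo in product(*cv_params.values())]

n_folds = 5  # assumed fold count
total_fits = len(grid) * n_folds

print(len(grid), total_fits)
print(grid[0], grid[-1])
```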

2. Train an XGBoost model and select the best parameters

from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
# The original leaves the remaining arguments to the reader's choice;
# cv=5 and n_jobs=-1 here mirror the call used in step 4.
optimized_GBM = GridSearchCV(XGBClassifier(**ind_params), cv_params, scoring='accuracy', cv=5, n_jobs=-1)
optimized_GBM.fit(X_train, y_train)

print("Best parameters:", optimized_GBM.best_params_)
means = optimized_GBM.cv_results_['mean_test_score']
stds = optimized_GBM.cv_results_['std_test_score']
# Mean accuracy and a +/- 2 standard deviation band for each parameter combination
for mean, std, params in zip(means, stds, optimized_GBM.cv_results_['params']):
    print("%0.5f (+/-%0.5f) for %r" % (mean, std * 2, params))

3. Evaluate model performance

from sklearn.metrics import classification_report
y_pred = optimized_GBM.predict(X_test)
print(classification_report(y_test, y_pred))
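
`classification_report` prints precision, recall, and F1 for each class. As a small worked example on made-up labels (not results from this dataset), the same quantities for the positive class can be computed by hand:

```python
import numpy as np

# Made-up labels: 1 stands for >50K, 0 for <=50K.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_hat = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = int(np.sum((y_hat == 1) & (y_true == 1)))  # true positives
fp = int(np.sum((y_hat == 1) & (y_true == 0)))  # false positives
fn = int(np.sum((y_hat == 0) & (y_true == 1)))  # false negatives

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, fn, precision, recall, f1)
```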

4. Tune the hyperparameters again

cv_params = {...}
ind_params = {...}
optimized_GBM = GridSearchCV(XGBClassifier(**ind_params), cv_params, scoring='accuracy', cv=5, n_jobs=-1, verbose=10)
optimized_GBM.fit(X_train, y_train)
print("Best parameters:", optimized_GBM.best_params_)
means = optimized_GBM.cv_results_['mean_test_score']
stds = optimized_GBM.cv_results_['std_test_score']
# Mean accuracy and a +/- 2 standard deviation band for each parameter combination
for mean, std, params in zip(means, stds, optimized_GBM.cv_results_['params']):
    print("%0.5f (+/-%0.5f) for %r" % (mean, std * 2, params))

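5. Find the best point to stop training

ind_params = {...}
eval_set = [(X_test, y_test)]
# XGBoost 1.6 and later take early_stopping_rounds and eval_metric in the constructor;
# older releases took them as fit() arguments, as the original post did.
model = XGBClassifier(**ind_params, early_stopping_rounds=100, eval_metric="error")
result = model.fit(X_train, y_train, eval_set=eval_set, verbose=20)  # train
print("Best iteration:", result.best_iteration)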
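
Early stopping halts training once the evaluation metric has not improved for a given number of rounds. The pure-Python sketch below illustrates that rule on a made-up list of per-round error values. The function name and the numbers are assumptions for illustration, and XGBoost's internal implementation may differ in detail.

```python
def best_round_with_patience(errors, patience):
    """Return the index of the best round, stopping after `patience` rounds without improvement."""
    best_idx, best_err = 0, float("inf")
    for i, err in enumerate(errors):
        if err < best_err:
            best_idx, best_err = i, err
        elif i - best_idx >= patience:
            break  # no improvement for `patience` rounds: stop early
    return best_idx


# Made-up validation errors; the last value is never reached with patience=2.
errors = [0.20, 0.18, 0.17, 0.175, 0.176, 0.16]
stopped_at = best_round_with_patience(errors, patience=2)
full_run = best_round_with_patience(errors, patience=100)

print(stopped_at, full_run)
```

With a small patience the search stops at round 2 and never sees the later, lower error, which is why the patience value (100 rounds in the post) is a trade-off between training time and the risk of stopping too soon.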

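6. Evaluate the final model

# ntree_limit is no longer available in recent XGBoost releases; iteration_range is its replacement.
# best_iteration is zero-based, so the end of the range is best_iteration + 1.
y_pred = model.predict(X_test, iteration_range=(0, result.best_iteration + 1))
print(classification_report(y_test, y_pred))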

(V) Feature Analysis

1. Inspect correlations between features

import seaborn as sns
sns.set(...)
# Plotting code omitted in the original
...
plt.savefig("three.png")
plt.show()
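
Besides reading a heatmap by eye, strongly correlated pairs can be listed directly from the correlation matrix. The sketch below uses a made-up frame and an assumed threshold of 0.9. Neither the data nor the cut-off comes from the original post.

```python
import pandas as pd

# Made-up numeric frame; education_num is constructed to track education exactly.
toy = pd.DataFrame({
    "education": [0, 1, 2, 3, 4, 5],
    "education_num": [1, 2, 3, 4, 5, 6],
    "hours_per_week": [40, 20, 45, 38, 60, 35],
})

corr = toy.corr()  # Pearson correlation between every pair of columns
threshold = 0.9    # assumed cut-off for "strongly correlated"

# Walk the upper triangle so each pair is reported once.
pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(pairs)
```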

2. Remove strongly correlated, redundant features

from xgboost import XGBClassifier
from sklearn.metrics import classification_report
ind_params = {...}
# Drop the two features judged redundant from the correlation analysis
X_train_reduced = X_train.drop(columns = ['education_num','relationship'])
X_test_reduced = X_test.drop(columns = ['education_num','relationship'])
eval_set = [(X_test_reduced, y_test)]
# XGBoost 1.6 and later take early_stopping_rounds and eval_metric in the constructor
model = XGBClassifier(**ind_params, early_stopping_rounds=100, eval_metric="error")
result = model.fit(X_train_reduced, y_train, eval_set=eval_set, verbose=50)
print("Best iteration:", result.best_iteration)
# iteration_range replaces the older ntree_limit argument
y_pred = model.predict(X_test_reduced, iteration_range=(0, result.best_iteration + 1))  # predict on the test set
print(classification_report(y_test, y_pred))

3. Bin the age feature

# Bin edges: np.digitize maps each age to the index of the interval it falls into
age_bins = [10, 30, 40, 50, 60, 70]
X_train_reduced['age'] = np.digitize(X_train['age'], bins=age_bins)
X_test_reduced['age'] = np.digitize(X_test['age'], bins=age_bins)
print(X_train_reduced['age'].unique())
eval_set = [(X_test_reduced, y_test)]
ind_params = {...}
# XGBoost 1.6 and later take early_stopping_rounds and eval_metric in the constructor
model = XGBClassifier(**ind_params, early_stopping_rounds=100, eval_metric="error")
result = model.fit(X_train_reduced, y_train, eval_set=eval_set, verbose=20)
print("Best iteration:", result.best_iteration)
# iteration_range replaces the older ntree_limit argument
y_pred = model.predict(X_test_reduced, iteration_range=(0, result.best_iteration + 1))
print(classification_report(y_test, y_pred))
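
To make the binning concrete, here is a small worked example with the same bin edges and a few made-up ages. With the default `right=False`, `np.digitize` returns the index `i` such that `bins[i-1] <= x < bins[i]`, and values at or above the last edge get `len(bins)`.

```python
import numpy as np

age_bins = [10, 30, 40, 50, 60, 70]
ages = np.array([17, 30, 45, 69, 90])  # made-up ages for illustration

binned = np.digitize(ages, bins=age_bins).tolist()
print(binned)
```

So 17 falls in bin 1 (ages 10 to 29), 30 starts bin 2, and every age of 70 or more shares bin 6.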