In the last post we walked through the theory behind XGBoost and the derivation of its objective function. Today we put it into practice, starting with a quick recap of how XGBoost works:
XGBoost is an ensemble model: it sums the outputs of K trees to produce the final prediction. Training proceeds by adding one tree at a time, each new tree chosen to further reduce the objective function, so that the prediction moves ever closer to the true value.
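In symbols, the additive prediction and the objective described above can be written as follows (standard XGBoost notation, not quoted verbatim from the previous post):

```latex
\hat{y}_i = \sum_{k=1}^{K} f_k(x_i),
\qquad
\mathrm{Obj} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega(f_k)
```

Here each \(f_k\) is one regression tree, \(l\) is the training loss, and \(\Omega\) is the regularization term that penalizes tree complexity.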
Now let's move on to the hands-on training:
1. First, import the required libraries
import xgboost as xgb
import pandas as pd
import numpy as np
import pickle
import sys
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error, make_scorer
from sklearn.preprocessing import StandardScaler
# sklearn.grid_search and sklearn.cross_validation were removed in
# scikit-learn 0.20; the same classes now live in sklearn.model_selection.
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from scipy.sparse import csr_matrix, hstack
from xgboost import XGBRegressor
import warnings
warnings.filterwarnings('ignore')
2. Data preprocessing and log transform of the target
train = pd.read_csv('train.csv')
train['log_loss'] = np.log(train['loss'])
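The target here is typically heavily right-skewed, which is why training on its logarithm helps; predictions are then mapped back with `np.exp`. A minimal sketch of the transform and its inverse, using synthetic data in place of `train.csv`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 'loss' column: a right-skewed positive target.
rng = np.random.default_rng(0)
loss = pd.Series(rng.lognormal(mean=7.0, sigma=1.0, size=1000), name='loss')

log_loss = np.log(loss)       # compress the long right tail
recovered = np.exp(log_loss)  # exact inverse: apply this to model predictions

# The round trip is lossless up to floating point.
print(np.allclose(recovered, loss))
# The log scale is far less skewed than the raw scale.
print(abs(log_loss.skew()) < abs(loss.skew()))
```

A more symmetric target generally makes squared-error-style objectives behave better, since large outlier losses no longer dominate the gradient.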
3. Split the features into continuous and categorical groups
features = [x for x in train.columns if x not in ['id','loss', 'log_loss']]
# Assumes categorical columns are stored as object (string) dtype,
# as in the Kaggle-style data this tutorial follows.
cat_features = [x for x in train.select_dtypes(include=['object']).columns
                if x not in ['id', 'loss', 'log_loss']]
num_features = [x for x in train.select_dtypes(exclude=['object']).columns
                if x not in ['id', 'loss', 'log_loss']]
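Separating columns by dtype is a common way to do this split; here is a minimal self-contained sketch on a tiny synthetic frame (the column names are illustrative, not the real dataset's):

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the competition data: 'cat*' columns hold strings,
# 'cont*' columns hold floats, plus the id/target columns excluded above.
train = pd.DataFrame({
    'id': [1, 2, 3],
    'cat1': ['A', 'B', 'A'],
    'cat2': ['X', 'X', 'Y'],
    'cont1': [0.1, 0.5, 0.9],
    'loss': [100.0, 250.0, 80.0],
})
train['log_loss'] = np.log(train['loss'])

features = [x for x in train.columns if x not in ['id', 'loss', 'log_loss']]
# String columns have object dtype, so select_dtypes cleanly separates them.
cat_features = [x for x in train.select_dtypes(include=['object']).columns
                if x not in ['id', 'loss', 'log_loss']]
num_features = [x for x in train.select_dtypes(exclude=['object']).columns
                if x not in ['id', 'loss', 'log_loss']]

print(cat_features)  # ['cat1', 'cat2']
print(num_features)  # ['cont1']
```

The categorical group will later need encoding (e.g. factorizing or one-hot) before it can be fed to XGBoost, while the continuous group can be used as-is.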