Modeling and Predicting Boston House Prices with LightGBM

License: CC 4.0 BY-SA

Tags:

#data-analysis #python #machine-learning

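Goal: predict the final sale price of each house from its attributes

Workflow:

I. Analyze the data indicators

How different indicators affect the result
Continuous versus discrete values

II. Inspect the data distribution

Whether it follows a normal distribution
Data transformations

III. Data preprocessing

Missing-value imputation
Label conversion

IV. Modeling

LightGBM model

I. Exploratory data analysis

1. Load the data and understand what it means

The data consists of a training set and a test set. The number of rows is small, but there are many variables. Their meanings are as follows: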

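MSSubClass The building class
MSZoning The general zoning classification
LotFrontage Linear feet of street connected to the property
LotArea Lot size in square feet
Street Type of road access
Alley Type of alley access
LotShape General shape of the property
LandContour Flatness of the property
Utilities Type of utilities available
LotConfig Lot configuration
LandSlope Slope of the property
Neighborhood Physical location within the Ames city limits
Condition1 Proximity to a main road or railroad
Condition2 Proximity to a main road or railroad (if a second is present)
BldgType Type of dwelling
HouseStyle Style of dwelling
OverallQual Overall material and finish quality
OverallCond Overall condition rating
YearBuilt Original construction date
YearRemodAdd Remodel date
RoofStyle Type of roof
RoofMatl Roof material
Exterior1st Exterior covering on the house
Exterior2nd Exterior covering on the house (if more than one material)
MasVnrType Masonry veneer type
MasVnrArea Masonry veneer area in square feet
ExterQual Exterior material quality
ExterCond Present condition of the exterior material
Foundation Type of foundation
BsmtQual Height of the basement
BsmtCond General condition of the basement
BsmtExposure Walkout or garden-level basement walls
BsmtFinType1 Quality of the basement finished area
BsmtFinSF1 Type 1 finished area in square feet
BsmtFinType2 Quality of the second finished area (if present)
BsmtFinSF2 Type 2 finished area in square feet
BsmtUnfSF Unfinished basement area in square feet
TotalBsmtSF Total basement area in square feet
Heating Type of heating
HeatingQC Heating quality and condition
CentralAir Whether the house has central air conditioning
Electrical Type of electrical system
1stFlrSF First-floor area in square feet
2ndFlrSF Second-floor area in square feet
LowQualFinSF Low-quality finished area in square feet (all floors)
GrLivArea Above-grade (ground) living area in square feet
BsmtFullBath Number of full bathrooms in the basement
BsmtHalfBath Number of half bathrooms in the basement
FullBath Number of full bathrooms above grade
HalfBath Number of half bathrooms above grade
BedroomAbvGr Number of bedrooms above basement level
KitchenAbvGr Number of kitchens
KitchenQual Kitchen quality
TotRmsAbvGrd Total rooms above grade (bathrooms not included)
Functional Home functionality rating
Fireplaces Number of fireplaces
FireplaceQu Fireplace quality
GarageType Garage location
GarageYrBlt Year the garage was built
GarageFinish Interior finish of the garage
GarageCars Garage size in car capacity
GarageArea Garage size in square feet
GarageQual Garage quality
GarageCond Garage condition
PavedDrive Paved driveway
WoodDeckSF Wood deck area in square feet
OpenPorchSF Open porch area in square feet
EnclosedPorch Enclosed porch area in square feet
3SsnPorch Three-season porch area in square feet
ScreenPorch Screen porch area in square feet
PoolArea Pool area in square feet
PoolQC Pool quality
Fence Fence quality
MiscFeature Miscellaneous feature not covered by other categories
MiscVal Value of the miscellaneous feature
MoSold Month sold
YrSold Year sold
SaleType Type of sale
SaleCondition Condition of sale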
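
The loading step itself is not shown in this post. A minimal sketch, assuming the data ships as two CSV files named train.csv and test.csv (the file names and the load_data helper are assumptions, not taken from the original), could look like this. The snippet first writes two tiny stand-in files with illustrative values so that it runs anywhere:

```python
import os
import tempfile

import pandas as pd


def load_data(data_dir):
    """Read the training and test sets from data_dir (file names are assumed)."""
    train = pd.read_csv(os.path.join(data_dir, "train.csv"))
    test = pd.read_csv(os.path.join(data_dir, "test.csv"))
    return train, test


# Stand-in files with a handful of columns and made-up values,
# written only so that the sketch is runnable on its own.
demo_dir = tempfile.mkdtemp()
pd.DataFrame({
    "Id": [1, 2, 3],
    "GrLivArea": [1500, 1200, 1800],
    "OverallQual": [7, 6, 7],
    "SalePrice": [200000, 170000, 230000],
}).to_csv(os.path.join(demo_dir, "train.csv"), index=False)
pd.DataFrame({
    "Id": [4, 5],
    "GrLivArea": [900, 1300],
    "OverallQual": [5, 6],
}).to_csv(os.path.join(demo_dir, "test.csv"), index=False)

train, test = load_data(demo_dir)
# The test set has the same attributes but no SalePrice column.
print(train.shape, test.shape)
```

With the real files in place, only the call to load_data is needed.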

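Distribution of the target variable

The target variable is roughly normal overall but visibly skewed, so it needs adjusting later. The skewness and kurtosis confirm that the skew is fairly large; it is corrected in the data cleaning step.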
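
Skewness and kurtosis can be read directly from pandas. The sketch below uses a synthetic right-skewed series in place of train['SalePrice']; the synthetic prices are an assumption made only so the code runs on its own:

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed stand-in for train['SalePrice'].
rng = np.random.default_rng(0)
sale_price = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=1460), name="SalePrice")

skewness = sale_price.skew()  # 0 for a symmetric distribution, > 0 for a long right tail
kurtosis = sale_price.kurt()  # excess kurtosis, 0 for a normal distribution

print("Skewness: {:.3f}".format(skewness))
print("Kurtosis: {:.3f}".format(kurtosis))
```

A clearly positive skewness is the signal that a log-type transformation of the target is worth trying.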

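2. How the important attributes affect the target variable

# Living area (square feet). Conclusion: the larger the living area, the higher the price

# Basement area (square feet). Conclusion: the larger the basement area, the higher the price

# Overall material and finish quality. Conclusion: the higher the quality rating, the higher the price

# Original construction date. Conclusion: no obvious relationship between construction date and price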
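
The plots behind these conclusions are not reproduced here. One way to draw such plots, as a sketch on synthetic data (the helper name plot_against_price and the generated values are assumptions, not the author's code):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_against_price(df, numeric_col, ordinal_col, target="SalePrice"):
    """Scatter plot for a continuous attribute, box plot for an ordinal one."""
    fig, (ax_scatter, ax_box) = plt.subplots(1, 2, figsize=(12, 5))
    ax_scatter.scatter(df[numeric_col], df[target], s=8)
    ax_scatter.set_xlabel(numeric_col)
    ax_scatter.set_ylabel(target)
    # One box per level of the ordinal attribute, in ascending order.
    levels = sorted(df[ordinal_col].unique())
    ax_box.boxplot([df.loc[df[ordinal_col] == lv, target] for lv in levels])
    ax_box.set_xticks(range(1, len(levels) + 1))
    ax_box.set_xticklabels(levels)
    ax_box.set_xlabel(ordinal_col)
    ax_box.set_ylabel(target)
    return fig


# Synthetic stand-in for the training set.
rng = np.random.default_rng(0)
n = 300
demo = pd.DataFrame({
    "GrLivArea": rng.uniform(500, 3500, n),
    "OverallQual": rng.integers(1, 11, n),
})
demo["SalePrice"] = (50 * demo["GrLivArea"] + 15000 * demo["OverallQual"]
                     + rng.normal(0, 20000, n))

fig = plot_against_price(demo, "GrLivArea", "OverallQual")
```

In a notebook, dropping the matplotlib.use("Agg") line and calling plt.show() displays the figure inline.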

3. Correlations between the variables, and which variables affect the house price most

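import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix over the numeric columns only (pandas >= 2.0 raises on object columns otherwise)
corr = train.corr(numeric_only=True)
f, ax = plt.subplots(figsize=(14, 8))
sns.heatmap(corr, square=True, cmap='Blues')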

Select the 10 variables most strongly correlated with SalePrice (the list includes SalePrice itself)

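import numpy as np

k = 10
# Names of the k columns with the largest correlation to SalePrice (SalePrice itself comes first)
cols = corr.nlargest(k, 'SalePrice')['SalePrice'].index
# Correlation matrix restricted to those columns
cm = np.corrcoef(train[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10},
                 yticklabels=cols.values, xticklabels=cols.values, cmap='Blues')
plt.show()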

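Scatter plots for the most strongly correlated variables (three of them rather than five, since five would not fit).

These highly correlated variables all show a clear positive relationship with the price.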

sns.set()
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars']
# The `size` argument was renamed to `height` in seaborn 0.9
sns.pairplot(train[cols], height=2.0)
plt.show()

II. Data cleaning

1. Check for missing values

2. Remove outliers
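
The outlier rule used here is not given in the post. A common choice for this kind of data, shown purely as an assumption, is to drop houses with a very large living area but an unusually low price. The thresholds 4000 and 300000 below are illustrative, not the author's:

```python
import pandas as pd


def drop_area_price_outliers(df, area_max=4000, price_min=300000):
    """Drop rows whose living area is huge while the price is low (assumed rule)."""
    is_outlier = (df["GrLivArea"] > area_max) & (df["SalePrice"] < price_min)
    return df.loc[~is_outlier].reset_index(drop=True)


# Tiny synthetic example: the third row is large but cheap.
demo = pd.DataFrame({
    "GrLivArea": [1500, 2200, 5000, 4500],
    "SalePrice": [180000, 260000, 160000, 700000],
})
cleaned = drop_area_price_outliers(demo)
print(len(demo), "->", len(cleaned))
```

The large and expensive house in the last row is kept, because it is consistent with the positive trend between area and price.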

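3. Log-transform the target variable

Earlier we saw that the target variable is only roughly normal. Checking further, the QQ plot shows that the distribution is noticeably skewed, so a transformation is needed to bring it closer to a normal distribution.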

# Stats
from scipy.stats import skew, norm
from scipy.special import boxcox1p
from scipy.stats import boxcox_normmax
from scipy import stats

# Fit a normal distribution to SalePrice
(mu, sigma) = norm.fit(train['SalePrice'])
print('\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

# Distribution plot (sns.distplot is deprecated, so histplot plus the fitted pdf is used)
sns.histplot(train['SalePrice'], stat='density', kde=True)
x = np.linspace(train['SalePrice'].min(), train['SalePrice'].max(), 200)
plt.plot(x, norm.pdf(x, mu, sigma), 'k',
         label=r'Normal dist. ($\mu=${:.2f} and $\sigma=${:.2f})'.format(mu, sigma))
plt.legend(loc='best')
plt.ylabel('Density')
plt.title('SalePrice distribution')

# QQ plot
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()

After the transformation the distribution is much closer to normal:

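# Log transform: log(1 + x)
train['SalePrice'] = np.log1p(train['SalePrice'])
# Parameters of the fitted normal distribution
(mu, sigma) = norm.fit(train['SalePrice'])
print('\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))
# New distribution (histplot replaces the deprecated sns.distplot)
sns.histplot(train['SalePrice'], stat='density', kde=True)
x = np.linspace(train['SalePrice'].min(), train['SalePrice'].max(), 200)
plt.plot(x, norm.pdf(x, mu, sigma), 'k',
         label=r'Normal dist. ($\mu=${:.2f} and $\sigma=${:.2f})'.format(mu, sigma))
plt.legend(loc='best')
plt.ylabel('Density')
plt.title('SalePrice distribution')
# QQ plot
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()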
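
Because the model is trained on log(1 + SalePrice), its predictions are on the log scale as well and have to be mapped back with the inverse transform before they can be read as prices. A small sketch with made-up prices:

```python
import numpy as np

prices = np.array([120000.0, 185000.0, 410000.0])  # made-up prices

log_prices = np.log1p(prices)     # forward transform: log(1 + x)
recovered = np.expm1(log_prices)  # inverse transform: exp(y) - 1

print(log_prices)
print(recovered)
```

np.expm1 is the exact inverse of np.log1p, so the round trip returns the original values up to floating-point error.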

4. Handling missing values

train_labels = train['SalePrice'].reset_index(drop=True)
train_features = train.drop(['SalePrice'],axis=1)
test_features = test

all_features = pd.concat([train_features,test_features]).reset_index(drop=True)
all_features.shape

def percent_missing(df):
    data = pd.DataFrame(df)
    df_cols = list(pd.DataFrame(data))
    dict_x = {}
    for i in range(0,len(df_cols)):
        dict_x.update({df_cols[i]: round(data[df_cols[i]].isnull().mean()*100,2)})
    return dict_x

missing = percent_missing(all_features)
df_miss = sorted(missing.items(),key=lambda x: x[1],reverse=True)
print('Percent of missing data')
df_miss[0:10]

sns.set_style('white')
f,ax = plt.subplots(figsize=(8,7))
sns.set_color_codes(palette='deep')
missing = round(train.isnull().mean()*100,2)
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing.plot.bar(color='b')
#Tweak the visual presentation
ax.xaxis.grid(False)
ax.set(ylabel='Percent of missing values')
ax.set(xlabel='Features')
ax.set(title='Percent missing data by feature')
sns.despine(trim=True,left=True)

# Some of the non-numeric predictors are stored as numbers; convert them into strings
all_features['MSSubClass'] = all_features['MSSubClass'].apply(str)
all_features['YrSold'] = all_features['YrSold'].astype(str)
all_features['MoSold'] = all_features['MoSold'].astype(str)

def handle_missing(features):
    #the data description states that NA refers to typical('Typ') values
    features['Functional'] = features['Functional'].fillna('Typ')
    #Replace the missing values in each of the columns below with their mode
    features['Electrical'] = features['Electrical'].fillna('SBrkr')
    features['KitchenQual'] = features['KitchenQual'].fillna('TA')
    features['Exterior1st'] = features['Exterior1st'].fillna(features['Exterior1st'].mode()[0])
    features['Exterior2nd'] = features['Exterior2nd'].fillna(features['Exterior2nd'].mode()[0])
    features['SaleType'] = features['SaleType'].fillna(features['SaleType'].mode()[0])
    features['MSZoning'] = features.groupby('MSSubClass')['MSZoning'].transform(lambda x: x.fillna(x.mode()[0]))
    
    #the data description states that NA refers to 'No Pool'
    features['PoolQC'] = features['PoolQC'].fillna('None')
    #Replacing the missing values with 0,since no garage = no cars in garage
    for col in ('GarageYrBlt','GarageArea','GarageCars'):
        features[col] = features[col].fillna(0)
    #Replacing the missing values with None
    for col in ['GarageType','GarageFinish','GarageQual','GarageCond']:
        features[col] = features[col].fillna('None')
    #NaN values for these categorical basement features,means there's no basement
    for col in ('BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2'):
        features[col] = features[col].fillna('None')
        
    #Groupby the neighborhoods ,and fill in missing value by the median LotFrontage of the neighborhood
    features['LotFrontage'] = features.groupby('Neighborhood')['LotFrontage'].transform(lambda x: x.fillna(x.median()))
    
    #We have no particular intuition around how to fill in the rest of the categorical features
    #So we replace their missing values with None
    objects = []
    for i in features.columns:
        if features[i].dtype == object:
            objects.append(i)
    features.update(features[objects].fillna('None'))
    
    #And we do the same thing for numerical features,but this time with 0s
    numeric_dtypes = ['int16','int32','int64','float16','float32','float64']
    numeric = []
    for i in features.columns:
        if features[i].dtype in numeric_dtypes:
            numeric.append(i)
    features.update(features[numeric].fillna(0))
    return features

all_features = handle_missing(all_features)

Confirm that all missing values have been handled.
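
A check of this kind could be written as follows. This is a sketch on a small synthetic frame standing in for all_features; the count_missing helper is an assumption, not part of the original code:

```python
import numpy as np
import pandas as pd


def count_missing(df):
    """Return the total number of missing cells and the columns that still contain any."""
    per_column = df.isnull().sum()
    return int(per_column.sum()), per_column[per_column > 0].index.tolist()


# Synthetic frame standing in for all_features before and after imputation.
before = pd.DataFrame({
    "PoolQC": [np.nan, "Gd", np.nan],
    "GarageCars": [2.0, np.nan, 1.0],
})
after = before.fillna({"PoolQC": "None", "GarageCars": 0})

print(count_missing(before))
print(count_missing(after))
```

On the real data, count_missing(all_features) should report zero missing cells once handle_missing has run.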

5. Variable processing

1) Label-encode the categorical variables

all_features['MSSubClass'] = all_features['MSSubClass'].apply(str)
all_features['OverallCond'] = all_features['OverallCond'].astype(str)
all_features['YrSold'] = all_features['YrSold'].astype(str)
all_features['MoSold'] = all_features['MoSold'].astype(str)

from sklearn.preprocessing import LabelEncoder
cols = ('FireplaceQu','BsmtQual','BsmtCond','GarageQual','GarageCond','ExterQual','ExterCond','HeatingQC','PoolQC','KitchenQual',
       'BsmtFinType1','BsmtFinType2','Functional','Fence','BsmtExposure','GarageFinish','LandSlope','LotShape','PavedDrive',
       'Street','Alley','CentralAir','MSSubClass','OverallCond','YrSold','MoSold')
for col in cols:
    lb1 = LabelEncoder()
    lb1.fit(list(all_features[col].values))
    all_features[col] = lb1.transform(list(all_features[col].values))

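2) Box-Cox transformation of the numeric variables

Checking the skewness of the numeric variables shows that many of them are highly skewed, which can hurt the later modeling and prediction, so a further transformation is needed.

Basic idea of the Box-Cox transformation: given a sample of n data points y1, y2, ..., yn, find a suitable power transformation such that the transformed sample is as close to normally distributed as possible. It can be applied with boxcox1p from scipy.special.

The key to the Box-Cox transformation is finding a suitable parameter (lambda); 0.15 is used here as an empirical value. The goal is a simple transformation that brings the data closer to a normal distribution.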
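
The transformation code is not shown in this post. A minimal sketch of how it could be applied, assuming a skewness threshold of 0.5 together with the fixed lambda of 0.15 mentioned in the text (the threshold, the helper name boxcox_skewed and the synthetic columns are assumptions). For a non-zero lambda, boxcox1p(x, lmbda) computes ((1 + x)^lmbda - 1) / lmbda:

```python
import numpy as np
import pandas as pd
from scipy.special import boxcox1p
from scipy.stats import skew


def boxcox_skewed(df, columns, skew_threshold=0.5, lam=0.15):
    """Apply boxcox1p with a fixed lambda to columns whose |skew| exceeds the threshold."""
    out = df.copy()
    skewness = out[columns].apply(lambda s: skew(s))
    skewed_cols = skewness[skewness.abs() > skew_threshold].index
    for col in skewed_cols:
        out[col] = boxcox1p(out[col], lam)
    return out, list(skewed_cols)


# Synthetic columns: one strongly right-skewed, one roughly symmetric.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "LotArea": rng.lognormal(mean=9.0, sigma=1.0, size=1000),
    "YearBuilt": rng.normal(1970, 10, size=1000),
})
transformed, changed = boxcox_skewed(demo, ["LotArea", "YearBuilt"])

print(changed)
print(skew(demo["LotArea"]), "->", skew(transformed["LotArea"]))
```

Only the skewed column is transformed; the roughly symmetric one is left untouched. A per-column lambda could instead be estimated with scipy.stats.boxcox_normmax, which the imports earlier in the post suggest was also considered.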

III. Building the model

Split the combined features back into training and test sets

X = all_features.iloc[:len(train_labels),:]
X_test = all_features.iloc[len(train_labels):,:]
X.shape,train_labels.shape,X_test.shape

Model validation: 12-fold cross-validation

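from sklearn.model_selection import KFold, cross_val_score

# 12-fold cross-validation with shuffling
kf = KFold(n_splits=12, random_state=42, shuffle=True)

def cv_rmse(model, X=X):
    # cross_val_score returns negated MSE, so negate it before taking the square root
    rmse = np.sqrt(-cross_val_score(model, X, train_labels, scoring='neg_mean_squared_error', cv=kf))
    return rmse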
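
To see what such a helper returns, here is a self-contained sketch on synthetic data. Ridge regression stands in for the real model only to keep the example light; the data, the stand-in model and the name cv_rmse_demo are all assumptions:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem standing in for (X, train_labels).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(240, 5))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=240)

kf_demo = KFold(n_splits=12, random_state=42, shuffle=True)


def cv_rmse_demo(model, X, y, cv):
    """Per-fold RMSE: scikit-learn returns negated MSE, so flip the sign before the root."""
    neg_mse = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=cv)
    return np.sqrt(-neg_mse)


scores = cv_rmse_demo(Ridge(alpha=1.0), X_demo, y_demo, kf_demo)
print("{:.4f} ({:.4f})".format(scores.mean(), scores.std()))
```

The result is one RMSE value per fold, so the mean summarizes the expected error and the standard deviation shows how much it varies between folds.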

Here we use LightGBM for modeling and prediction

from lightgbm import LGBMRegressor

lightgbm = LGBMRegressor(objective='regression',num_leaves=6,learning_rate=0.01,n_estimators=7000,max_bin=200,bagging_fraction=0.8,
                       bagging_freq=4,bagging_seed=8,feature_fraction=0.2,feature_fraction_seed=8,min_sum_hessian_in_leaf=11,
                        verbose=-1,random_state=42)
score = cv_rmse(lightgbm)
print('lightgbm: {:.4f}({:.4f})'.format(score.mean(),score.std()))

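Final result: score.mean = 0.1155, score.std = 0.0161 (cross-validated RMSE on the log-transformed SalePrice)

Summary:

This project walks through the whole modeling workflow, from obtaining the data, through exploratory data analysis, data cleaning and data transformation, to the final modeling. There is still plenty of room for improvement, and more study is needed.

The house prices were predicted on a dataset from abroad using the LightGBM algorithm, and the overall result is reasonably good.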