[Machine Learning Notes] Decision Trees, Random Forests, and XGBoost

Original post, last modified 2023-09-19 20:38:35

License: CC 4.0 BY-SA

Tags: #machine-learning #notes #decision-trees

First published 2023-09-18 22:44:53

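This post walks through how a decision tree is built, covering purity, the entropy function, computing information gain, choosing features, one-hot encoding, and handling continuous values. It also discusses how techniques such as random forests and XGBoost improve model robustness and prediction accuracy, and how to prevent overfitting.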

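1. Decision trees

A decision tree sorts structured data samples into branches and leaves according to their features. The example in the figure below tells cats from dogs: three features (ear shape, face shape, and whether the animal has whiskers) are used to decide whether the current sample is a cat.
[Figure: cat vs. dog decision tree example]
With this example in mind, there are a few questions to consider:

On what basis should a branch be split?
When there are several features, which one should be chosen first?
When should the tree stop growing?
How robust is it?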

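2. Splitting

Let's stick with the cat/dog example. For the best possible split, one branch should contain all of one class and the other none of it, as in G1 in the figure below (a perfect split). In practice, alas, we can rarely find a feature that splits this cleanly: in G2 and G3 neither branch is pure. This is where purity comes in. So what is purity?
[Figure: splits G1, G2 and G3]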

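2.1 Purity

Purity is the fraction of objects of class a in a sample set K (in G1 above all 5 samples are cats, so the purity of 🐱 is 100%).
Entropy is a standard way to measure how impure a set is: it is 0 for a perfectly pure set and largest for a 50/50 mix. For the two-class case (is $p_1$ / is not $p_1$) the entropy function is $H(p_1) = -p_1\log_2(p_1) - (1-p_1)\log_2(1-p_1)$, with the convention $0\log_2 0 = 0$. We take $\log_2$ rather than $\ln$ because the peak value is then easier to interpret (it equals 1). Its graph looks like this:
[Figure: plot of the entropy function $H(p_1)$]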

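Suppose a node holds six samples. Different class distributions give different entropy values:

3 🐱, 3 🐶: the purity of cats is $p_1 = \frac{3}{6}$, and the entropy is $H(3/6) = 1$
4 🐱, 2 🐶: the purity of cats is $p_1 = \frac{4}{6}$, and the entropy is $H(4/6) \approx 0.92$
5 🐱, 1 🐶: the purity of cats is $p_1 = \frac{5}{6}$, and the entropy is $H(5/6) \approx 0.65$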
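
The three cases above can be checked with a few lines of Python. This is only a minimal sketch: the function name `entropy` is chosen here for illustration, and it evaluates the exact fractions 4/6 and 5/6, so the last digit can differ from values computed with a rounded $p_1$ such as 0.67 or 0.83.

```python
import math

def entropy(p1):
    """Binary entropy H(p1) in bits; taken to be 0 at p1 = 0 and p1 = 1."""
    if p1 in (0.0, 1.0):
        return 0.0
    return -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)

# The three six-sample nodes from the list above, as (cats, dogs)
for cats, dogs in [(3, 3), (4, 2), (5, 1)]:
    p1 = cats / (cats + dogs)
    print(f"{cats} cats, {dogs} dogs: p1 = {p1:.2f}, H = {entropy(p1):.2f}")
```

The purer the node, the lower the entropy, which is exactly the behaviour we want from an impurity measure.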

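3. Choosing a feature to split on

With purity and entropy in hand, we can talk about how to branch. Information gain (also called the reduction in entropy) is the criterion for deciding whether a feature should be chosen for the split at the current step. It is defined as $info\_gain = H(p_1^{root}) - (w^{left}H(p_1^{left}) + w^{right}H(p_1^{right}))$
where $p_1^{root}$ is the purity of $p_1$ at the current node,
$w^{left}$ / $w^{right}$ are the fractions of the current node's samples that go to the left and right subtrees,
$p_1^{left}$ / $p_1^{right}$ are the purities of $p_1$ in the left and right subtrees.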

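Still a bit lost? No problem, let's walk through the selection process with an example and some pictures.
[Figure]

The feature chosen in G2 is ear shape: samples are split by whether the ears are pointy or floppy. Here $p_1^{root} = 0.5,\quad p_1^{left} = 4/5,\quad p_1^{right} = 1/5$, and since both branches hold the same number of samples the weights are $w^{left} = w^{right} = 0.5$. Plugging these into the formula and the entropy function gives an information gain of 0.28 for G2.
[Figure]

The feature chosen in G3 is face shape: samples are split by whether the face is round or pointed. Here $p_1^{root} = 0.5,\quad p_1^{left} = 4/7,\quad p_1^{right} = 1/3$, with weights $w^{left} = 7/10,\quad w^{right} = 3/10$ (the left branch holds 7 of the 10 samples). Plugging these in gives an information gain of 0.03 for G3.
[Figure]
The feature chosen in G4 is whether the animal has whiskers: samples are split into "has" and "has not". Here $p_1^{root} = 0.5,\quad p_1^{left} = 3/4,\quad p_1^{right} = 2/6$, with weights $w^{left} = 4/10,\quad w^{right} = 6/10$. Plugging these in gives an information gain of 0.12 for G4.

Suppose these three features are the only candidates. Having computed the three information gains, we pick the feature with the largest gain as the one to branch on; here that is ear shape (0.28 > 0.12 > 0.03).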
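
To make the arithmetic concrete, here is a small sketch that recomputes the three information gains from the counts quoted above. The helper names (`entropy`, `information_gain`) and the dictionary keys are illustrative choices, and each branch is written as a (number of cats, number of samples) pair.

```python
import math

def entropy(p1):
    """Binary entropy in bits, with H(0) = H(1) = 0."""
    if p1 in (0.0, 1.0):
        return 0.0
    return -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)

def information_gain(p_root, left, right):
    """left and right are (cats, total) pairs for the two branches."""
    n = left[1] + right[1]
    w_left, w_right = left[1] / n, right[1] / n
    h_left = entropy(left[0] / left[1])
    h_right = entropy(right[0] / right[1])
    return entropy(p_root) - (w_left * h_left + w_right * h_right)

# Branch counts taken from the three candidate splits described above
splits = {
    "ear shape": ((4, 5), (1, 5)),
    "face shape": ((4, 7), (1, 3)),
    "whiskers": ((3, 4), (2, 6)),
}
gains = {name: information_gain(0.5, l, r) for name, (l, r) in splits.items()}
best = max(gains, key=gains.get)

for name, gain in gains.items():
    print(f"{name}: {gain:.2f}")
print("split on:", best)
```

Note that the weight of a branch is simply its share of the node's samples, so the branch with 7 of the 10 samples gets weight 7/10.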

3.1 Code: computing information gain and choosing the feature that maximizes it

The entropy formula above translates into code as follows:

import numpy as np

def compute_entropy(y):
    """
    Computes the entropy at a node
    
    Args:
       y (ndarray): Numpy array of binary labels for the examples at a node
           (`1` for the positive class, `0` for the negative class)
       
    Returns:
        entropy (float): Entropy at that node
        
    """
    entropy = 0.
    
    # An empty node has no impurity; returning early also avoids dividing by zero
    if len(y) == 0:
        return entropy
    
    p = np.sum(y) / len(y)  # fraction of examples at the node labeled 1
    
    # By convention H(0) = H(1) = 0, and log2(0) must not be evaluated
    if p == 0 or p == 1:
        return entropy
    
    entropy = -p * np.log2(p) - (1 - p) * np.log2(1 - p)
    
    return entropy

The split itself can be written as:

def split_dataset(X, node_indices, feature):
    """
    Splits the data at the given node into
    left and right branches
    
    Args:
        X (ndarray):             Data matrix of shape(n_samples, n_features)
        node_indices (ndarray):  List containing the active indices. I.e, the samples being considered at this step.
        feature (int):           Index of feature to split on
    
    Returns:
        left_indices (ndarray): Indices with feature value == 1
        right_indices (ndarray): Indices with feature value == 0
    """
    
    left_indices = []
    right_indices = []
    pick_feature = X[:, feature]  # column of values for the chosen feature
    
    for i in node_indices:
        if pick_feature[i] == 1:
            left_indices.append(i)
        else:
            right_indices.append(i)
            
    return left_indices, right_indices  # index lists for the left and right branches

The information gain is computed as follows:

def compute_information_gain(X, y, node_indices, feature):
    
    """
    Compute the information gain of splitting the node on a given feature
    
    Args:
        X (ndarray):            Data matrix of shape(n_samples, n_features)
        y (array like):         list or ndarray with n_samples containing the target variable
        node_indices (ndarray): List containing the active indices. I.e, the samples being considered in this step.
        feature (int):          Index of feature to split on
   
    Returns:
        information_gain (float): Information gain of the split
    
    """    
    # Split dataset
    left_indices, right_indices = split_dataset(X, node_indices, feature)
    
    # Some useful variables
    X_node, y_node = X[node_indices], y[node_indices]
    X_left, y_left = X[left_indices], y[left_indices]
    X_right, y_right = X[right_indices], y[right_indices]
    
    # Weights
    w_left = len(X_left) / len(X_node)
    w_right = len(X_right) / len(X_node)
        
    #Weighted entropy
    w_entropy = w_left*compute_entropy(y_left) + w_right*compute_entropy(y_right)
    #Information gain                                                   
    information_gain = compute_entropy(y_node) - w_entropy
    
    return information_gain

Finally, pick the feature with the largest information gain:

def get_best_split(X, y, node_indices):   
    """
    Returns the index of the best feature
    to split the node data on
    
    Args:
        X (ndarray):            Data matrix of shape(n_samples, n_features)
        y (array like):         list or ndarray with n_samples containing the target variable
        node_indices (ndarray): List containing the active indices. I.e, the samples being considered in this step.

    Returns:
        best_feature (int):     The index of the best feature to split
    """    
    
    # Some useful variables
    num_features = X.shape[1]
    
    best_feature = -1   # -1 means no split improves on the current node
    max_info_gain = 0.
    
    for i in range(num_features):
        information_gain = compute_information_gain(X, y, node_indices=node_indices, feature=i)
        # Require a strictly positive gain: a node that is already 100% pure
        # (a single class) should not be split, so -1 is returned for it
        if information_gain > max_info_gain:
            max_info_gain = information_gain
            best_feature = i
   
    return best_feature

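3.2 One-hot encoding

So far every example has used 0/1 variables, i.e. each feature takes only two values. In practice a feature may take n different values, so what then? This is where one-hot encoding comes in. Put simply, a feature can hardly take several values at once (how could my face be both round and square 🙃), so each value is recorded separately and every object has exactly one of them (marked 1, all others 0), as shown below.
[Figure: one-hot encoding example]
Each object occupies exactly one of the columns, so a feature with k possible values becomes k 0/1 variables, one per value. Now every feature can be expressed with 0/1 variables.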
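
As a rough illustration of the idea, the sketch below one-hot encodes a single categorical feature in plain Python. The function `one_hot` and the example values ("pointy", "floppy", "oval") are made up for this illustration; in practice a library routine would normally do this job.

```python
def one_hot(values):
    """Return (categories, rows); each row contains exactly one 1."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

# Hypothetical ear-shape feature with k = 3 possible values
ear_shapes = ["pointy", "floppy", "oval", "pointy"]
categories, encoded = one_hot(ear_shapes)

print(categories)
for shape, row in zip(ear_shapes, encoded):
    print(shape, row)
```

A feature with k values turns into k columns, and every row sums to 1 because each object has exactly one value.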

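3.3 Continuous values

That takes care of features with several discrete values by turning them into 0/1 variables. But what about continuous values? Take height or weight: are we supposed to create one variable per centimetre? Obviously not.
Since everything so far has been 0/1 variables, the question is how to turn height, weight and the like into 0/1 variables as well.

The answer is to use information gain to find the best threshold: values above the threshold become 1, values at or below it become 0. A small example: assuming men and women differ noticeably in weight, weight can serve as a feature for splitting the samples.
We picked three candidate thresholds (more could of course be tested; three keeps the explanation simple) and computed the information gain for each, treating the threshold as a root node whose two sides are the left and right subtrees and $p_1$ as the fraction of men. Case S2 gives the largest information gain, so we use weight > 68 as the threshold for converting the samples into a 0/1 variable.
[Figure: candidate weight thresholds]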
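
The threshold search can be sketched as follows. Everything here is an assumption made for illustration: the data is invented, the function name `best_threshold` is hypothetical, and trying the midpoints between neighbouring distinct values is one common way of generating candidates, not necessarily the one used in the figure.

```python
import math

def entropy(p1):
    """Binary entropy in bits, with H(0) = H(1) = 0."""
    if p1 in (0.0, 1.0):
        return 0.0
    return -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)

def best_threshold(values, labels):
    """Try midpoints between sorted distinct values; return (threshold, gain)."""
    n = len(values)
    h_root = entropy(sum(labels) / n)
    distinct = sorted(set(values))
    best = (None, 0.0)
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]   # becomes 0
        right = [y for x, y in zip(values, labels) if x > t]   # becomes 1
        weighted = (len(left) / n * entropy(sum(left) / len(left))
                    + len(right) / n * entropy(sum(right) / len(right)))
        gain = h_root - weighted
        if gain > best[1]:
            best = (t, gain)
    return best

# Invented data: weight in kg and whether the person is male (1) or not (0)
weights = [50, 55, 60, 65, 70, 75, 80, 85, 90]
is_male = [0, 0, 0, 0, 1, 0, 1, 1, 1]

threshold, gain = best_threshold(weights, is_male)
print(f"best split: weight > {threshold}, information gain = {gain:.2f}")
```

Once the threshold is fixed, the continuous feature is just another 0/1 variable and the rest of the tree-building code does not need to change.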

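4. When to stop?

A decision tree cannot keep growing forever; there should be explicit rules that keep it from sprouting too many branches (which leads to overfitting).

The simplest rule: stop splitting a node once it is 100% pure
Limit the maximum depth: with a limit of 3, for instance, the nodes at the third level become the final leaf nodes
Stop when the entropy reduction obtained from the information gain calculation is below a threshold
Stop when the number of samples at a node has dropped below a set threshold (splitting further is pointless)

These four rules constrain the model and help avoid overfitting!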
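
The four rules can be folded into one check that runs before each split. This is a hedged sketch: the function name `should_stop`, its parameters and the default thresholds are all made up here and would need tuning for real data.

```python
def should_stop(labels, depth, best_gain, max_depth=3, min_gain=0.01, min_samples=2):
    """Return True if the node should become a leaf (all defaults are illustrative)."""
    if len(set(labels)) <= 1:      # rule 1: the node is 100% pure
        return True
    if depth >= max_depth:         # rule 2: maximum depth reached
        return True
    if best_gain < min_gain:       # rule 3: the best split barely reduces entropy
        return True
    if len(labels) < min_samples:  # rule 4: too few samples to split further
        return True
    return False

print(should_stop([1, 1, 1], depth=1, best_gain=0.0))        # pure node
print(should_stop([1, 0, 1, 0], depth=1, best_gain=0.28))    # keep splitting
```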

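5. Summary: how to build a decision tree

Collect the sample data and determine all the features
Compute the information gain of each feature and choose the one with the largest gain for the current node (a feature that has already been used is not chosen again further down the same branch)
Split on the chosen feature, then repeat step 2 on the left and right subtrees until a stopping condition is met

The teacher's cat/dog example demonstrates the whole process:
[Figure: building the cat/dog decision tree]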

5.1 Code: building the tree

# Collects (depth, branch name, feature, indices) for every split node
tree = []

def build_tree_recursive(X, y, node_indices, branch_name, max_depth, current_depth):
    """
    Build a tree using the recursive algorithm that split the dataset into 2 subgroups at each node.
    This function just prints the tree.
    
    Args:
        X (ndarray):            Data matrix of shape(n_samples, n_features)
        y (array like):         list or ndarray with n_samples containing the target variable
        node_indices (ndarray): List containing the active indices. I.e, the samples being considered in this step.
        branch_name (string):   Name of the branch. ['Root', 'Left', 'Right']
        max_depth (int):        Max depth of the resulting tree. 
        current_depth (int):    Current depth. Parameter used during recursive call.
   
    """ 

    # Maximum depth reached - stop splitting
    if current_depth == max_depth:
        formatting = " "*current_depth + "-"*current_depth
        print(formatting, "%s leaf node with indices" % branch_name, node_indices)
        return
   
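    # Otherwise, get the best feature to split on at this node
    best_feature = get_best_split(X, y, node_indices)
    
    # get_best_split returns -1 when no split has a positive information gain
    # (for example, the node is already pure), so treat the node as a leaf
    if best_feature == -1:
        formatting = " "*current_depth + "-"*current_depth
        print(formatting, "%s leaf node with indices" % branch_name, node_indices)
        return
    
    tree.append((current_depth, branch_name, best_feature, node_indices))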
    
    formatting = "-"*current_depth
    print("%s Depth %d, %s: Split on feature: %d" % (formatting, current_depth, branch_name, best_feature))
    
    # Split the dataset at the best feature
    left_indices, right_indices = split_dataset(X, node_indices, best_feature)
    
    # continue splitting the left and the right child. Increment current depth
    build_tree_recursive(X, y, left_indices, "Left", max_depth, current_depth+1)
    build_tree_recursive(X, y, right_indices, "Right", max_depth, current_depth+1)

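6. Regression trees

Suppose that instead of adding weight as a feature and splitting on a threshold as in 3.3, we want to predict weight from the other features. How would that work?
Regression trees solve exactly this kind of problem. Again we use G1, G2 and G3 as the example.
[Figure]
Suppose we know the weight of every sample. We can then compute the variance of the weights over all samples at the node and, for each feature, over the samples in the resulting left and right subtrees: $variance\_reduction = V^{root} - (w^{left}V^{left} + w^{right}V^{right})$, where $V$ denotes the variance of the weights at the corresponding node. Look familiar? It is the information gain from before with variance in place of the entropy function (I was lazy and used the teacher's sample data 😋). As before, we choose the split with the largest variance reduction.
[Figure]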
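
Here is a minimal sketch of the variance-reduction calculation. The weights are invented, the function name `variance_reduction` is illustrative, and the population variance (`statistics.pvariance`) is used as an assumption; the sample variance would work the same way.

```python
from statistics import pvariance

def variance_reduction(root, left, right):
    """Variance at the node minus the weighted variance of its two branches."""
    n = len(root)
    weighted = (len(left) / n * pvariance(left)
                + len(right) / n * pvariance(right))
    return pvariance(root) - weighted

# Invented weights for the samples sent to each branch by some feature
left = [7.0, 8.0, 9.0]
right = [15.0, 17.0, 19.0]
reduction = variance_reduction(left + right, left, right)
print(f"variance reduction = {reduction:.2f}")
```

A leaf of a regression tree would then typically predict the average of the target values that ended up in it.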

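7. Using multiple decision trees (tree ensembles)

A single decision tree is not very robust; put differently, it is sensitive to the sample data. Building several decision trees mitigates this effectively. The final prediction is then inferred from the predictions of all the trees in the ensemble (trees trained on different data often give different results).

7.1 Sampling with replacement

We can sample the original data with replacement to obtain other, different data sets (very similar to the original and of the same size, but possibly containing duplicates), and then build a decision tree on each of them. This is a key step in constructing a tree ensemble.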
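
Sampling with replacement takes only a couple of lines. In this sketch the function name `bootstrap_sample` and the fixed seed are illustrative choices; the indices stand in for the rows of the training set.

```python
import random

def bootstrap_sample(indices, rng):
    """Draw len(indices) items from indices, with replacement."""
    return [rng.choice(indices) for _ in indices]

rng = random.Random(0)          # fixed seed only to make the run repeatable
original = list(range(10))      # row indices of a 10-sample training set
sample = bootstrap_sample(original, rng)

print(sorted(sample))           # same size as the original, duplicates allowed
```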

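7.2 The random forest algorithm

Put the sample data into a black box and repeatedly draw from it at random, with replacement, to form a new data set that differs from the original
Build a decision tree on it. Note that when choosing a feature to split on, we usually do not offer the full set of $n$ features; instead we randomly pick a subset of $k$ features (a common choice is $k = \sqrt{n}$), compute the information gain over that subset, and branch on the best one

This keeps the trees in the ensemble from being too similar to each other, which would defeat the purpose of the randomization, and it improves the accuracy of the final prediction

Repeat the steps above until the number of iterations reaches a given value (usually no more than 100); the resulting tree ensemble can then be used for prediction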
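
Two pieces of the procedure above are easy to sketch in isolation: drawing the random feature subset at a split, and combining the trees' predictions. Both helpers (`candidate_features`, `majority_vote`) are hypothetical names, and combining by majority vote is an assumption made here for the classification case.

```python
import math
import random
from collections import Counter

def candidate_features(n_features, rng):
    """Pick about sqrt(n_features) distinct feature indices for one split."""
    k = max(1, round(math.sqrt(n_features)))
    return rng.sample(range(n_features), k)

def majority_vote(predictions):
    """Combine the per-tree class predictions into one answer."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)                    # fixed seed for repeatability
subset = candidate_features(16, rng)      # 4 of the 16 feature indices
print("features considered at this split:", sorted(subset))

tree_predictions = [1, 0, 1, 1, 0]        # invented outputs of five trees
print("ensemble prediction:", majority_vote(tree_predictions))
```

Information gain is then computed only for the features in `subset`, so different trees (and different splits) end up looking at different features.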

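8. XGBoost

An open-source package that implements boosted trees
Boosted trees improve on the random forest above: each time a decision tree has been built, the whole data set is run through the trees built so far and the misclassified examples are marked. The next time sample data is drawn at random, those misclassified examples are more likely to be picked (deliberately studying the mistakes makes our decision trees "smarter" and "smarter")
Efficient, powerful, and easy to use
Has good built-in criteria for choosing splits and for deciding when to stop growing the trees
Uses regularization to prevent overfitting

Usage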

# Classification (X_train, y_train and X_test are assumed to be prepared already)
from xgboost import XGBClassifier

model = XGBClassifier()
model.fit(X_train,y_train)
y_pred = model.predict(X_test)


# Regression
from xgboost import XGBRegressor

model = XGBRegressor()
model.fit(X_train,y_train)
y_pred = model.predict(X_test)