Ensembles
Preface
This article introduces ensemble learning in machine learning, based on the UoA lecture slides.
We will cover:
Averaging, Random Forests, AdaBoost, XGBoost
The English involved is fairly basic, so to save time (I am not studying full-time and have three other courses, so time is tight) I only add explanations where I think they are needed.
This article is not intended for any commercial use; it is simply my personal study notes and reflections. Any infringing content will be removed upon request.
Ensembles
Ensemble learning is a machine learning technique that combines the predictions of multiple models to obtain a more accurate and more stable result. It can be applied to all kinds of machine learning tasks, including classification, regression, and clustering.
Ensemble learning improves a model's stability and generalization ability, reduces overfitting, and makes the model more robust to noisy data.
Averaging
Averaging is a simple ensemble method based on the Bagging idea. Its basic principle is to obtain the final prediction by taking the average, or a weighted average, of the predictions of several base classifiers.
The method first uses random sampling with replacement to generate multiple subsets of the original dataset, and each subset is used to train an independent base classifier. At test time every base classifier predicts the test sample, and the final prediction is the average (or weighted average) of all the base classifiers' predictions.
The advantages of averaging are that it is simple to implement, easy to parallelize, and effective at reducing the model's variance, which improves stability and generalization. It is applicable to many machine learning tasks; in particular, when training data is limited, generating multiple subsets lets the ensemble make fuller use of the available samples and can improve performance.
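As a concrete sketch of the bootstrap-and-average procedure described above (the dataset, the base classifier, and the number of rounds below are placeholders chosen for illustration, not something specified in the slides):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data, standing in for whatever dataset you actually use
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_models = 10
probas = []

for _ in range(n_models):
    # Bootstrap: draw training indices with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    clf = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
    probas.append(clf.predict_proba(X_test))

# Average the predicted class probabilities, then take the most probable class
avg_proba = np.mean(probas, axis=0)
y_pred = avg_proba.argmax(axis=1)
print("ensemble accuracy:", (y_pred == y_test).mean())
```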
Stacking
Stacking is an ensemble method that combines several base classifiers, exploiting the strengths of each to obtain a more accurate final prediction. The basic idea is to feed the predictions of the base classifiers, as new features, into another classifier that is trained on them.
Stacking first splits the original data into two parts: a training set and a validation set. Several base classifiers of different types are trained on the training set. Each base classifier then produces predictions, and these predictions are used as new features, merged with the original data to form a new dataset. Finally, another classifier (called the meta-classifier) is trained using the new dataset and the validation set, producing the final prediction.
The advantage of stacking is that it can exploit the strengths of several base classifiers, improving the performance and stability of the model while avoiding the weaknesses of any single classifier. However, stacking requires more computation and time, because multiple base classifiers plus a meta-classifier must be trained. It is also more prone to overfitting, so the base classifiers and the meta-classifier must be chosen carefully.
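A minimal stacking sketch using scikit-learn's StackingClassifier. One difference from the description above: scikit-learn generates the base-classifier predictions with internal cross-validation rather than a single train/validation split, but the idea is the same. The particular base classifiers and meta-classifier here are just illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Heterogeneous base classifiers; their predictions become the meta-features
base_learners = [
    ("tree", DecisionTreeClassifier(max_depth=5)),
    ("knn", KNeighborsClassifier()),
]

# Logistic regression as the meta-classifier trained on the base predictions
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(),
                           cv=5)

print("stacking CV accuracy:", cross_val_score(stack, X, y, cv=5).mean())
```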
Why does averaging work?
This part is straightforward, so there is not much to add.
How to understand: "In practice errors won't be completely independent due to noise in the labels"
In the averaging method, every base classifier is trained on a different random subset, so in principle they are independent of each other. In practice, however, the dataset may contain noisy data or label errors, so the errors of the base classifiers are not completely independent.
For example, if the dataset contains mislabeled examples, sampling with replacement can place the same erroneous examples in several different subsets, which introduces correlation between the base classifiers. This correlation makes their predictions more similar, so the variance of the ensemble is not reduced as much as hoped and averaging does not deliver its full benefit.
To mitigate this, techniques that reduce label noise or increase the amount of data can be used, such as data cleaning and data augmentation. One can also turn to more sophisticated ensemble methods, such as Bagging and Boosting, which change the relationship between the base classifiers so as to reduce the correlation of their errors and improve performance.
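A standard way to quantify this effect (this is the usual textbook derivation, not necessarily how the slides present it): suppose the M base classifiers make errors ε_m with equal variance σ² and pairwise correlation ρ. The variance of the averaged error is then

```latex
\operatorname{Var}\!\left(\frac{1}{M}\sum_{m=1}^{M}\epsilon_m\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{M}\,\sigma^{2}.
```

With ρ = 0 the variance falls like σ²/M as more models are added, but any positive correlation leaves a floor of ρσ², which is exactly why label noise shared across the bootstrap subsets limits how much averaging can help.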
Random Forests
Random Forest is a decision-tree-based ensemble method. Using Bagging together with random feature selection, it builds many decision trees and combines them to obtain a more accurate prediction.
Does averaging work if you use trees with the same parameters?
Yes, averaging can work even if you use trees with the same parameters, but the performance improvement may not be as significant as when using trees with different parameters.
When trees have the same parameters and are trained on the same dataset, they are likely to have similar biases and errors, which can limit the diversity of the ensemble. As a result, averaging the predictions of these trees may not lead to significant improvements in accuracy compared to a single decision tree.
However, averaging can still be beneficial in some cases. For example, if the dataset is noisy, averaging can help to reduce the effect of random errors in individual trees and improve the overall performance of the ensemble. In addition, if the dataset is small, averaging can help to stabilize the predictions and reduce the variance of the model.
In practice, it is often beneficial to use trees with different parameters and/or different subsets of features when constructing a decision tree ensemble. This can increase the diversity of the ensemble and lead to better performance.
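One way to check this on a particular dataset is to measure how often identically-parameterized trees, trained on different bootstrap samples, actually disagree, and whether their vote beats a single tree. The sketch below is illustrative only; the dataset and the number of trees are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
preds = []
for _ in range(20):
    idx = rng.integers(0, len(X_train), size=len(X_train))  # bootstrap sample
    tree = DecisionTreeClassifier()      # identical parameters for every tree
    preds.append(tree.fit(X_train[idx], y_train[idx]).predict(X_test))
preds = np.array(preds)

# Disagreement between the first tree and the rest: a rough diversity measure
print("disagreement rate:", np.mean(preds[1:] != preds[0]))

# Majority vote of all trees vs. the first tree alone
vote = (preds.mean(axis=0) > 0.5).astype(int)
print("single tree accuracy  :", (preds[0] == y_test).mean())
print("majority vote accuracy:", (vote == y_test).mean())
```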
Bootstrap Sampling
Bootstrapping: Bootstrapping is a statistical technique that involves sampling with replacement from the original dataset to create multiple new datasets. In the context of Random Forests, bootstrapping is used to generate multiple subsets of the original dataset, each of which is used to train a decision tree. This process is called Bagging (Bootstrap Aggregating).
The bootstrapping process introduces randomness into the dataset, which helps to reduce the correlation between the decision trees and improve the accuracy of the ensemble. By generating multiple subsets of the data, the algorithm can capture different aspects of the data and reduce the risk of overfitting.
Bagging and Random Forests are both ensemble learning techniques that use multiple models to improve the accuracy and robustness of predictions. The main difference between the two lies in the way the individual models are trained and combined.
Bagging is a technique that involves randomly sampling subsets of the original dataset with replacement, and training a separate model on each subset. Each model is trained independently of the others, and the final prediction is made by averaging the predictions of all the models.
Random Forests, on the other hand, use a modified form of decision trees known as “randomized decision trees”, which introduce additional randomness into the modeling process. In addition to randomly sampling subsets of the data, Random Forests also randomly select a subset of features at each split in the decision tree. By introducing this additional level of randomness, Random Forests are able to produce more diverse models, which can lead to improved accuracy and robustness.
So, while Bagging is primarily a data sampling technique, Random Forests incorporate both data sampling and feature selection techniques to improve model performance.
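In scikit-learn terms the distinction is roughly BaggingClassifier (bootstrap sampling only) versus RandomForestClassifier (bootstrap sampling plus a random subset of features at every split, controlled by max_features). A minimal comparison on an arbitrary toy dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

# Bagging: each tree sees a bootstrap sample but may use all features at a split
# (the keyword is `base_estimator` in scikit-learn versions before 1.2)
bagging = BaggingClassifier(estimator=DecisionTreeClassifier(),
                            n_estimators=100, random_state=0)

# Random Forest: bootstrap samples *and* a random feature subset at each split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0)

print("bagging accuracy:", cross_val_score(bagging, X, y, cv=5).mean())
print("forest accuracy :", cross_val_score(forest, X, y, cv=5).mean())
```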
This percentage comes from the mathematics of sampling with replacement. When n examples are drawn with replacement from a dataset of size n, the probability that a particular example is never selected is (1 - 1/n)^n, which converges to 1/e ≈ 36.8% as n grows. Each bootstrap subset therefore contains, on average, about 63.2% of the distinct examples in the original dataset; the remaining ~36.8% are the so-called out-of-bag samples.
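The limit is easy to check numerically; this short simulation (the dataset size is arbitrary) counts the distinct examples that end up in each bootstrap sample:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000                       # size of the original dataset (arbitrary)

fractions = []
for _ in range(100):
    sample = rng.integers(0, n, size=n)          # n draws with replacement
    fractions.append(len(np.unique(sample)) / n) # fraction of distinct examples

print("mean fraction of distinct examples:", np.mean(fractions))  # ~0.632
print("theoretical limit 1 - 1/e         :", 1 - np.exp(-1))
```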
Random Trees
Random trees: In addition to bootstrapping, Random Forests also uses randomization in the construction of individual decision trees. Specifically, at each node of the decision tree, a random subset of the available features is selected to determine the best split. This ensures that each tree is constructed using a different set of features, which further increases the diversity of the ensemble.
The combination of bootstrapping and random trees results in a powerful algorithm that is robust to overfitting and can handle high-dimensional datasets with complex relationships between the features and the target variable. Random Forests have been widely used in many applications, such as classification, regression, and feature selection.
AdaBoost
AdaBoost (Adaptive Boosting) is a machine learning algorithm that is used for classification and regression problems. It works by combining multiple “weak” models into a “strong” model. In AdaBoost, each weak model is trained on a weighted version of the training data, where the weights are adjusted based on the performance of the previous weak models. The final prediction is made by combining the predictions of all the weak models.
The key idea behind AdaBoost is to iteratively improve the performance of the weak models by focusing on the examples that are misclassified by the previous models. In each iteration, the weights of the training examples are adjusted so that the misclassified examples are given a higher weight. This means that the next weak model will focus more on these examples, and hopefully be able to classify them correctly.
In AdaBoost, the final prediction is made by taking a weighted vote of the predictions of the individual base classifiers. The weights assigned to each base classifier are based on their accuracy on the training data.
Specifically, after each round of boosting, the weights of the training examples are adjusted such that the examples that were misclassified by the previous round of classifiers are given higher weights. The base classifiers are then retrained on the updated weights.
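A minimal sketch of this weight-update loop, using decision stumps as the weak learners and the standard discrete-AdaBoost update for labels in {-1, +1}. The dataset and the number of rounds are arbitrary choices, and in practice scikit-learn's AdaBoostClassifier provides a ready-made implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
y = 2 * y - 1                                   # relabel classes as -1 / +1

n_rounds = 20
w = np.full(len(X), 1 / len(X))                 # start with uniform example weights
stumps, alphas = [], []

for _ in range(n_rounds):
    # Weak learner: a depth-1 decision tree trained on the weighted data
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=w)
    pred = stump.predict(X)

    err = np.sum(w * (pred != y)) / np.sum(w)   # weighted training error
    alpha = 0.5 * np.log((1 - err) / (err + 1e-10))

    # Increase the weights of misclassified examples, decrease the rest
    w *= np.exp(-alpha * y * pred)
    w /= w.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: weighted vote of all weak learners
scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
y_pred = np.sign(scores)
print("training accuracy:", (y_pred == y).mean())
```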