Decision Trees
An important task of decision trees is to extract the knowledge hidden in data: given an unfamiliar dataset, a decision tree can distill a set of rules from it, and this process of a machine deriving rules from a dataset is exactly the machine-learning process.

This section uses the `ID3` algorithm to split the dataset. The algorithm addresses two questions: how to split a dataset, and when to stop splitting.

The guiding principle for splitting a dataset: make disordered data more ordered.
Entropy
Entropy is defined as the expected value of information. Before this definition makes sense, we need the definition of information itself.

If the items to be classified may fall into multiple classes, then the information of symbol $x_i$ is defined as
$$l(x_i) = -\log_2 p(x_i)$$
where $p(x_i)$ is the probability of choosing that class.

The expected value of the information over all possible class values, that is, the entropy, is:
$$H = -\sum_{i=1}^{n} p(x_i)\log_2 p(x_i)$$
where $n$ is the number of classes.
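For example, a set of 5 samples in which 2 are labeled `yes` and 3 are labeled `no` has entropy

$$H = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} \approx 0.971$$

A set in which every sample carries the same label has entropy 0, and the entropy grows as the labels become more mixed.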
Code to compute the Shannon entropy of a given dataset:
```python
from math import log   # needed for log(prob, 2)

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:                 # count the occurrences of each class label
        currentLabel = featVec[-1]          # the class label is the last column
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)   # log base 2
    return shannonEnt
```
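As a quick sanity check, here is a tiny sample dataset (a construction of my own for illustration, chosen to be consistent with the example tree at the end of this section: two binary features, `no surfacing` and `flippers`, and a yes/no class label):

```python
def createDataSet():
    # hypothetical toy data: [no surfacing?, has flippers?, is a fish?]
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels

myDat, labels = createDataSet()
print(calcShannonEnt(myDat))   # 2 'yes' vs 3 'no' -> about 0.971, matching the hand calculation above
```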
Splitting the Dataset
Besides knowing how to measure information entropy, a classification algorithm also needs to split the dataset and measure the entropy of the resulting splits, in order to judge whether the current split is a good one.

We can compute the information entropy once for the result of splitting on each feature, and then decide which feature gives the best split.

Code to split the dataset on a given feature:
```python
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]           # chop out the axis used for splitting
            reducedFeatVec.extend(featVec[axis + 1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
```
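With the sample dataset above, splitting on feature 0 keeps only the matching rows and removes that column:

```python
myDat, labels = createDataSet()
print(splitDataSet(myDat, 0, 1))   # [[1, 'yes'], [1, 'yes'], [0, 'no']]
print(splitDataSet(myDat, 0, 0))   # [[1, 'no'], [1, 'no']]
```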
Choosing the best feature to split the dataset on:
```python
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1          # the last column is the class label
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):               # iterate over all features
        featList = [example[i] for example in dataSet]   # all values of this feature
        uniqueVals = set(featList)             # the unique values of this feature
        newEntropy = 0.0
        for value in uniqueVals:               # entropy after the split, weighted by subset size
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy    # information gain, i.e. the reduction in entropy
        if infoGain > bestInfoGain:            # keep the feature with the largest gain so far
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature                         # index of the best feature
```
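On the sample dataset, splitting on feature 0 (`no surfacing`) gives an information gain of about 0.420, versus about 0.171 for `flippers`, so feature 0 is chosen:

```python
myDat, labels = createDataSet()
print(chooseBestFeatureToSplit(myDat))   # 0, i.e. 'no surfacing'
```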
Building the Tree Recursively
The algorithm for building a decision tree from a dataset: take the original dataset and split it on the best feature; since a feature may take more than two values, a split may produce more than two branches.

The recursion stops when the program has used up all the features it can split on, or when every instance under a branch has the same class. If all instances share the same class, we get a leaf node, or terminating block; any data that reaches a leaf node necessarily belongs to that leaf's class.
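The `createTree` function below also calls a helper `majorityCnt` that is not listed in this section: when all features are used up but a branch still contains mixed classes, the branch is labeled with the majority class. A minimal sketch of such a helper (tie-breaking left unspecified):

```python
def majorityCnt(classList):
    # Count each class label and return the most frequent one.
    classCount = {}
    for vote in classList:
        classCount[vote] = classCount.get(vote, 0) + 1
    return max(classCount, key=classCount.get)
```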
```python
def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]            # stop splitting when all classes are equal
    if len(dataSet[0]) == 1:           # no features left to split on
        return majorityCnt(classList)  # fall back to the majority class
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del labels[bestFeat]               # this feature is consumed at this node
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:           # one subtree per value of the best feature
        subLabels = labels[:]          # copy labels so recursion doesn't clobber them
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
```
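Putting the pieces together on the sample dataset from earlier (note that `createTree` deletes entries from the `labels` list it is given, so pass a copy if you still need the original list):

```python
myDat, labels = createDataSet()
myTree = createTree(myDat, labels)
print(myTree)
```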
Result:

```python
myTree: {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
```