05. Decision Tree Formula Derivations

ID3 Decision Tree

Information entropy is the most commonly used measure of the purity of a sample set. It is defined as
$$Ent(D) = -\sum_{k=1}^{|\mathcal{Y}|}p_{k}\log_{2}p_{k}$$
where $D=\left\{(x_{1},y_{1}),(x_{2},y_{2}),\cdots,(x_{n},y_{n})\right\}$ is the sample set, $|\mathcal{Y}|$ is the number of classes (2 for binary classification), and $p_{k}$ is the proportion of samples belonging to class $k$, with $0 \leq p_{k} \leq 1$ and $\sum_{k=1}^{|\mathcal{Y}|}p_{k} = 1$. The smaller $Ent(D)$ is, the higher the purity.
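To make the definition concrete, here is a minimal Python helper (my own sketch, not part of the original derivation); the probabilities would come from the class frequencies in $D$:

```python
import math

def entropy(probs):
    # Ent(D) in bits; by convention 0 * log2(0) = 0,
    # so zero probabilities are skipped.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A binary sample set with 5 positive and 3 negative samples:
print(entropy([5 / 8, 3 / 8]))  # ≈ 0.954
```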

Proof that $0 \leq Ent(D) \leq \log_{2}|\mathcal{Y}|$:

Maximum of $Ent(D)$: let $|\mathcal{Y}|=n$ and $p_{k}=x_{k}$, i.e. an $n$-class problem. The entropy $Ent(D)$ can then be viewed as an $n$-variable real-valued function
$$Ent(D) = f(x_{1},x_{2},\cdots,x_{n}) = -\sum_{k=1}^{n}x_{k}\log_{2}x_{k}$$
subject to $0 \leq x_{k} \leq 1$ and $\sum_{k=1}^{n}x_{k} = 1$. We now find the extrema of this multivariate function.

If we ignore the constraint $0 \leq x_{k} \leq 1$ and keep only $\sum_{k=1}^{n}x_{k} = 1$, maximizing $f(x_{1},x_{2},\cdots,x_{n})$ is equivalent to the following minimization problem:
$$\min \sum_{k=1}^{n}x_{k}\log_{2}x_{k} \quad \text{s.t.} \quad \sum_{k=1}^{n}x_{k} = 1$$
The objective $\sum_{k=1}^{n}x_{k}\log_{2}x_{k}$ can be seen as a sum of $n$ copies of $x\log_{2}x$.

Consider one of these terms on its own and write $f(x) = x\log_{2}x$. Then
$$f'(x) = \log_{2}x + x\cdot \frac{1}{x\ln 2} = \log_{2}x + \frac{1}{\ln 2}, \qquad f''(x) = \frac{1}{x\ln 2}$$
For $0 < x \leq 1$ we have $f''(x)>0$, so $f(x)$ is convex, and the function $\sum_{k=1}^{n}x_{k}\log_{2}x_{k}$, a sum of $n$ such convex terms, is also convex.

0≤xk≤10 \leq x_{k}\leq 10xk1时,此问题为凸优化问题,而对于凸优化问题来说,满足KKT条件的点即为最优解,由于此最小化问题仅含等式约束,那么令其拉格朗日函数的一阶偏导数等于0的点即为满足KKT条件的点

By the method of Lagrange multipliers, the Lagrangian of this optimization problem is
$$L(x_{1},\cdots,x_{n},\lambda) = \sum_{k=1}^{n}x_{k}\log_{2}x_{k} + \lambda\left(\sum_{k=1}^{n}x_{k} - 1\right)$$
Take the first-order partial derivatives of the Lagrangian with respect to $x_{1},\cdots,x_{n},\lambda$ and set them to 0.

First, set the partial derivative with respect to $x_{1}$ to 0:
$$\begin{aligned} \frac{\partial L(x_{1},\cdots,x_{n},\lambda)}{\partial x_{1}} &= \frac{\partial}{\partial x_{1}}\left[\sum_{k=1}^{n}x_{k}\log_{2}x_{k} + \lambda\left(\sum_{k=1}^{n}x_{k} - 1\right)\right] \\ &= \log_{2}x_{1} + x_{1}\cdot \frac{1}{x_{1}\ln 2} + \lambda \\ &= \log_{2}x_{1} + \frac{1}{\ln 2} + \lambda = 0 \end{aligned}$$
which gives
$$\lambda = -\log_{2}x_{1} - \frac{1}{\ln 2}$$
Taking the partial derivatives with respect to $x_{2},\cdots,x_{n}$ in the same way gives
$$\lambda = -\log_{2}x_{1} - \frac{1}{\ln 2} = -\log_{2}x_{2} - \frac{1}{\ln 2} = \cdots = -\log_{2}x_{n} - \frac{1}{\ln 2}$$
λ\lambdaλ求偏导
∂L(x1,⋯ ,xn,λ)∂λ=∂∂λ[∑k=1nxklog⁡2xk+λ(∑k=1nxk−1)]=∑k=1nxk−1 \frac{\partial L(x_{1},\cdots,x_{n},\lambda )}{\partial \lambda} = \frac{\partial }{\partial \lambda}\left [ \sum_{k=1}^{n}x_{k}\log_{2}x_{k} + \lambda(\sum_{k=1}^{n}x_{k} - 1) \right ] = \sum_{k=1}^{n}x_{k} - 1 λL(x1,,xn,λ)=λ[k=1nxklog2xk+λ(k=1nxk1)]=k=1nxk1
令其等于0得
∑k=1nxk=1 \sum_{k=1}^{n}x_{k} = 1 k=1nxk=1
Solving, we get $x_{1} = x_{2} = \cdots = x_{n} = \frac{1}{n}$ (since $x_{1} = x_{2} = \cdots = x_{n}$ and $\sum_{k=1}^{n}x_{k} = 1$).

Since $x_{k}$ must also satisfy $0 \leq x_{k} \leq 1$, and clearly $0 \leq \frac{1}{n} \leq 1$, the point $x_{1} = x_{2} = \cdots = x_{n} = \frac{1}{n}$ satisfies all constraints. It is the minimizer of the minimization problem, and therefore the maximizer of $f(x_{1},x_{2},\cdots,x_{n})$. Substituting $x_{1} = x_{2} = \cdots = x_{n} = \frac{1}{n}$ into $f(x_{1},x_{2},\cdots,x_{n})$ gives
$$f\left(\frac{1}{n},\frac{1}{n},\cdots,\frac{1}{n}\right) = -\sum_{k=1}^{n}\frac{1}{n}\log_{2}\frac{1}{n} = -n\cdot \frac{1}{n}\cdot \log_{2}\frac{1}{n} = \log_{2}n$$
So the maximum of $f(x_{1},x_{2},\cdots,x_{n})$ under the constraints $0 \leq x_{k} \leq 1$ and $\sum_{k=1}^{n}x_{k} = 1$ is $\log_{2}n$.
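This maximum can be checked numerically (a small sketch of my own; `entropy` is a hypothetical helper implementing the definition above):

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# The uniform distribution p_k = 1/n attains the maximum log2(n).
for n in (2, 4, 8):
    assert abs(entropy([1 / n] * n) - math.log2(n)) < 1e-12
print("maximum log2(n) attained at the uniform distribution")
```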

Minimum of $Ent(D)$:

If we ignore $\sum_{k=1}^{n}x_{k} = 1$ and keep only $0 \leq x_{k} \leq 1$, then $f(x_{1},x_{2},\cdots,x_{n})$ can be seen as a sum of $n$ independent one-variable functions:
$$f(x_{1},x_{2},\cdots,x_{n}) = \sum_{k=1}^{n}g(x_{k})$$
where $g(x_{k}) = -x_{k}\log_{2}x_{k}$ with $0 \leq x_{k} \leq 1$. When each of $g(x_{1}),g(x_{2}),\cdots,g(x_{n})$ attains its minimum, $f(x_{1},x_{2},\cdots,x_{n})$ attains its minimum as well. Since the $g(x_{k})$ all share the same domain and expression, finding the minimum of $g(x_{1})$ also gives the minimum of $g(x_{2}),\cdots,g(x_{n})$. We now find the minimum of $g(x_{1})$.

First compute the first and second derivatives of $g(x_{1})$ with respect to $x_{1}$:
$$g'(x_{1}) = -\log_{2}x_{1} - x_{1}\cdot \frac{1}{x_{1}\ln 2} = -\log_{2}x_{1} - \frac{1}{\ln 2}, \qquad g''(x_{1}) = -\frac{1}{x_{1}\ln 2}$$
For $0 < x_{1} \leq 1$, $g''(x_{1}) = -\frac{1}{x_{1}\ln 2}$ is always negative, so $g(x_{1})$ is concave (opens downward) on its domain, and its minimum must be attained on the boundary. Evaluating $g(x_{1})$ at $x_{1} = 0$ and $x_{1} = 1$ (with the convention $0\log_{2}0 = 0$):
$$g(0) = -0\log_{2}0 = 0, \qquad g(1) = -\log_{2}1 = 0$$
So the minimum of $g(x_{1})$ is 0, and likewise the minima of $g(x_{2}),\cdots,g(x_{n})$ are 0, hence the minimum of $f(x_{1},x_{2},\cdots,x_{n})$ is 0. However, this was obtained under $0 \leq x_{k} \leq 1$ alone; once the constraint $\sum_{k=1}^{n}x_{k} = 1$ is added, the minimum of $f(x_{1},x_{2},\cdots,x_{n})$ is at least 0. If we set some $x_{k}=1$, the constraint $\sum_{k=1}^{n}x_{k} = 1$ forces $x_{1} = x_{2} = \cdots = x_{k-1} = x_{k+1} = \cdots = x_{n} = 0$. Substituting into $f(x_{1},x_{2},\cdots,x_{n})$:
$$f(0,\cdots,0,1,0,\cdots,0) = -0\log_{2}0 - \cdots - 1\log_{2}1 - \cdots - 0\log_{2}0 = 0$$
So $x_{k} = 1$ with all other $x_{j} = 0$ attains the minimum of $f(x_{1},x_{2},\cdots,x_{n})$ under the constraints $0 \leq x_{k} \leq 1$ and $\sum_{k=1}^{n}x_{k} = 1$, and that minimum is 0.
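A quick check of the lower bound (again a sketch of my own, with the hypothetical `entropy` helper from before):

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A one-hot distribution (a pure sample set) attains the minimum, 0.
assert entropy([1.0, 0.0, 0.0]) == 0.0
print("minimum 0 attained at a one-hot distribution")
```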

Conditional entropy: a measure of the purity of a sample set given the value of an attribute $a$:
$$H(D|a) = \sum_{v=1}^{V}\frac{|D^v|}{|D|}Ent(D^v)$$
where $a$ is an attribute of the samples with $V$ possible values $\left\{a^1,a^2,\cdots,a^V\right\}$, $D^v$ denotes the samples in $D$ whose value on attribute $a$ is $a^v$, and $Ent(D^v)$ is the information entropy of $D^v$. The smaller $H(D|a)$ is, the higher the purity.
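A minimal implementation of $H(D|a)$ (the function names and toy data are my own, not from the text):

```python
import math
from collections import Counter

def entropy(labels):
    # Ent(D) computed from class frequencies in a list of labels.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(attr_values, labels):
    # H(D|a) = sum_v |D^v|/|D| * Ent(D^v)
    n = len(labels)
    h = 0.0
    for v in set(attr_values):
        subset = [y for x, y in zip(attr_values, labels) if x == v]
        h += len(subset) / n * entropy(subset)
    return h

# Toy data: one attribute value and one class label per sample.
a = ['s', 's', 'l', 'l']
y = [0, 0, 1, 1]
print(conditional_entropy(a, y))  # attribute separates classes perfectly → 0.0
```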

The ID3 decision tree selects the splitting attribute by the information-gain criterion. The information gain is
$$\begin{aligned} Gain(D,a) &= Ent(D) - \sum_{v=1}^{V}\frac{|D^v|}{|D|}Ent(D^v) \\ &= Ent(D) - H(D|a) \end{aligned}$$
The attribute with the largest information gain is chosen as the splitting attribute, because a larger information gain means a larger "purity improvement" obtained by splitting on that attribute.

With information gain as the splitting criterion, the ID3 decision tree is biased toward attributes with many possible values:
$$\begin{aligned} Gain(D,a) &= Ent(D) - \sum_{v=1}^{V}\frac{|D^v|}{|D|}Ent(D^v) \\ &= Ent(D) - \sum_{v=1}^{V}\frac{|D^v|}{|D|}\left(-\sum_{k=1}^{|\mathcal{Y}|}p_{k}\log_{2}p_{k}\right) \\ &= Ent(D) - \sum_{v=1}^{V}\frac{|D^v|}{|D|}\left(-\sum_{k=1}^{|\mathcal{Y}|}\frac{|D_{k}^{v}|}{|D^v|}\log_{2}\frac{|D_{k}^{v}|}{|D^v|}\right) \end{aligned}$$
where $D_{k}^{v}$ denotes the samples in $D$ whose value on attribute $a$ is $a^{v}$ and whose class is $k$, so that within $D^v$ we have $p_{k} = \frac{|D_{k}^{v}|}{|D^v|}$. In the extreme case where every sample takes a distinct value of $a$, each $D^v$ contains a single sample and is perfectly pure, so the gain is maximal even though such a split generalizes poorly.
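This bias can be demonstrated directly (a sketch of my own; `gain` is a hypothetical helper computing $Gain(D,a)$ from the formulas above):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(attr_values, labels):
    # Gain(D,a) = Ent(D) - sum_v |D^v|/|D| * Ent(D^v)
    n = len(labels)
    h = sum(
        len(sub) / n * entropy(sub)
        for v in set(attr_values)
        for sub in [[y for x, y in zip(attr_values, labels) if x == v]]
    )
    return entropy(labels) - h

y = [0, 1, 0, 1]
id_attr = [1, 2, 3, 4]          # unique value per sample (like an ID column)
coarse  = ['a', 'a', 'b', 'b']  # two values, uninformative for y here
print(gain(id_attr, y))  # 1.0: maximal, despite being useless for prediction
print(gain(coarse, y))   # 0.0
```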

C4.5 Decision Tree

The C4.5 decision tree selects the splitting attribute by the gain-ratio criterion. The gain ratio is
$$Gain\_ratio(D,a) = \frac{Gain(D,a)}{IV(a)}$$
where
$$IV(a) = -\sum_{v=1}^{V}\frac{|D^v|}{|D|}\log_{2}\frac{|D^v|}{|D|}$$
is the intrinsic value of attribute $a$; it grows as the number of values $V$ grows, which counteracts the bias toward many-valued attributes.
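A small sketch of $IV(a)$ (my own helper name, not from the text) shows how it penalizes many-valued attributes:

```python
import math
from collections import Counter

def iv(attr_values):
    # Intrinsic value IV(a) = -sum_v |D^v|/|D| * log2(|D^v|/|D|)
    n = len(attr_values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(attr_values).values())

# IV grows with the number of distinct attribute values:
print(iv([1, 2, 3, 4]))          # 2.0  (four values)
print(iv(['a', 'a', 'b', 'b']))  # 1.0  (two values)
```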

CART Decision Tree

The CART decision tree selects the splitting attribute by the Gini-index criterion.

Gini value:
$$Gini(D) = \sum_{k=1}^{|\mathcal{Y}|}\sum_{k'\neq k}p_{k}p_{k'} = \sum_{k=1}^{|\mathcal{Y}|}p_{k}\sum_{k'\neq k}p_{k'} = \sum_{k=1}^{|\mathcal{Y}|}p_{k}(1-p_{k}) = 1-\sum_{k=1}^{|\mathcal{Y}|}p_{k}^2$$
Gini index:
$$Gini\_index(D,a) = \sum_{v=1}^{V}\frac{|D^v|}{|D|}Gini(D^v)$$
The smaller the Gini value and the Gini index, the higher the purity of the sample set.
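A minimal implementation of both quantities (function names and toy data are my own):

```python
from collections import Counter

def gini(labels):
    # Gini value Gini(D) = 1 - sum_k p_k^2
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_index(attr_values, labels):
    # Gini_index(D,a) = sum_v |D^v|/|D| * Gini(D^v)
    n = len(labels)
    return sum(
        len(sub) / n * gini(sub)
        for v in set(attr_values)
        for sub in [[y for x, y in zip(attr_values, labels) if x == v]]
    )

print(gini([0, 0, 1, 1]))  # 0.5 (maximally mixed binary set)
print(gini([0, 0, 0, 0]))  # 0.0 (pure set)
```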

CART classification algorithm

  1. Using the Gini-index formula $Gini\_index(D,a) = \sum_{v=1}^{V}\frac{|D^v|}{|D|}Gini(D^v)$, find the attribute $a_{*}$ with the smallest Gini index.
  2. For each possible value of $a_{*}$, compute the Gini value $Gini(D^v)$, $v = 1,2,\cdots,V$; choose the value $a_{*}^{v}$ with the smallest Gini value as the split point, and partition $D$ into two sets (nodes) $D_{1}$ and $D_{2}$, where $D_{1}$ contains the samples with $a_{*}=a_{*}^{v}$ and $D_{2}$ the samples with $a_{*}\neq a_{*}^{v}$.
  3. Repeat steps 1 and 2 on $D_{1}$ and $D_{2}$ until a stopping condition is met.
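The steps above can be sketched as follows for a single attribute (a minimal sketch of my own, assuming the `gini` helper defined earlier; a full tree would repeat this over all attributes and recurse):

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_binary_split(attr_values, labels):
    # Try each value v as the split point (D1: a == v, D2: a != v)
    # and return the v with the lowest weighted Gini, per steps 1-2.
    n = len(labels)
    best = None
    for v in set(attr_values):
        d1 = [y for x, y in zip(attr_values, labels) if x == v]
        d2 = [y for x, y in zip(attr_values, labels) if x != v]
        score = len(d1) / n * gini(d1) + len(d2) / n * gini(d2)
        if best is None or score < best[0]:
            best = (score, v)
    return best  # (weighted Gini, split value)

a = ['red', 'red', 'blue', 'green']
y = [1, 1, 0, 0]
print(best_binary_split(a, y))  # 'red' separates the classes → (0.0, 'red')
```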

CART regression algorithm

  1. Find the optimal splitting attribute $a_{*}$ and the optimal split point $a_{*}^v$ from
    $$a_{*},a_{*}^v = \underset{a,a^v}{\arg\min}\left[\min_{c_{1}}\sum_{x_{i}\in D_{1}(a,a^v)}(y_{i}-c_{1})^2 + \min_{c_{2}}\sum_{x_{i}\in D_{2}(a,a^v)}(y_{i}-c_{2})^2\right]$$
    where $D_{1}(a,a^v)$ is the set of samples whose value on attribute $a$ is at most $a^v$, $D_{2}(a,a^v)$ is the set of samples whose value on attribute $a$ is greater than $a^v$, and the minimizing $c_{1}$ and $c_{2}$ are the means of the sample outputs in $D_{1}$ and $D_{2}$, respectively.

  2. Using the split point $a_{*}^v$, partition $D$ into two sets (nodes) $D_{1}$ and $D_{2}$.

  3. Repeat steps 1 and 2 on $D_{1}$ and $D_{2}$ until a stopping condition is met.
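Step 1 can be sketched for a single numeric attribute (my own minimal sketch; a full tree would also scan over attributes and recurse into the children):

```python
def best_regression_split(xs, ys):
    # Scan candidate thresholds and minimize the summed squared error
    # of the two children; the optimal c1, c2 are the child means.
    best = None
    for t in sorted(set(xs))[:-1]:  # split as x <= t vs x > t
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        c1 = sum(left) / len(left)
        c2 = sum(right) / len(right)
        sse = sum((y - c1) ** 2 for y in left) + sum((y - c2) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t)
    return best  # (total squared error, threshold)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 1.1, 3.0, 3.1]
print(best_regression_split(xs, ys))  # threshold 2.0 splits the two plateaus
```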
