ID3 Decision Tree
Information entropy is the most commonly used measure of the purity of a sample set. It is defined as
$$Ent(D) = -\sum_{k=1}^{|\mathcal{Y}|}p_{k}\log_{2}p_{k}$$
where $D=\left \{ (x_{1},y_{1}),(x_{2},y_{2}),\cdots,(x_{n},y_{n}) \right \}$ is the sample set, $|\mathcal{Y}|$ is the number of classes ($2$ for binary classification), and $p_{k}$ is the proportion of samples belonging to class $k$, with $0 \leq p_{k}\leq 1$ and $\sum_{k=1}^{|\mathcal{Y}|}p_{k} = 1$. The smaller $Ent(D)$ is, the higher the purity of $D$.
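As a quick numerical companion to this definition, here is a minimal sketch (the `entropy` helper name is mine, not from any library) that computes $Ent(D)$ from a list of class labels, using the convention $0\log_{2}0 = 0$:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Ent(D) = -sum_k p_k * log2(p_k) over the class proportions p_k.

    Empty classes contribute nothing, matching the 0*log2(0) = 0 convention.
    """
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

# A balanced binary set is maximally impure: Ent(D) = log2(2) = 1
print(entropy([1, 1, 1, 0, 0, 0]))  # 1.0
```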
Proof that $0 \leq Ent(D) \leq \log_{2}|\mathcal{Y}|$:
Finding the maximum of $Ent(D)$: let $|\mathcal{Y}|=n$ and $p_{k}=x_{k}$, i.e. an $n$-class problem. Then the information entropy $Ent(D)$ can be viewed as an $n$-variable real-valued function:
$$Ent(D) = f(x_{1},x_{2},\cdots,x_{n}) = -\sum_{k=1}^{n}x_{k}\log_{2}x_{k}$$
where $0 \leq x_{k}\leq 1$ and $\sum_{k=1}^{n}x_{k} = 1$. We now find the extrema of this multivariate function.
If we ignore the constraint $0 \leq x_{k}\leq 1$ and keep only $\sum_{k=1}^{n}x_{k} = 1$, then maximizing $f(x_{1},x_{2},\cdots,x_{n})$ is equivalent to the following minimization problem:
$$\begin{aligned}\min \quad & \sum_{k=1}^{n}x_{k}\log_{2}x_{k}\\ \text{s.t.} \quad & \sum_{k=1}^{n}x_{k} = 1\end{aligned}$$
$\sum_{k=1}^{n}x_{k}\log_{2}x_{k}$ can be viewed as the sum of $n$ copies of $x\log_{2}x$. Consider one of these terms alone and let $f(x) = x\log_{2}x$; then
$$f'(x) = \log_{2}x + x\cdot \frac{1}{x\ln 2} = \log_{2}x + \frac{1}{\ln 2},\qquad f''(x) = \frac{1}{x\ln 2}$$
For $0 < x \leq 1$ we have $f''(x)>0$, so $f(x)$ is convex, and the sum $\sum_{k=1}^{n}x_{k}\log_{2}x_{k}$ of $n$ such convex terms is also convex.
When $0 \leq x_{k}\leq 1$, this is therefore a convex optimization problem, and for a convex problem any point satisfying the KKT conditions is a global optimum. Since this minimization has only an equality constraint, the KKT points are exactly the points where the first-order partial derivatives of its Lagrangian vanish. By the method of Lagrange multipliers, the Lagrangian of the problem is
$$L(x_{1},\cdots,x_{n},\lambda ) = \sum_{k=1}^{n}x_{k}\log_{2}x_{k} + \lambda\left(\sum_{k=1}^{n}x_{k} - 1\right)$$
Take the first-order partial derivatives of the Lagrangian with respect to $x_{1},\cdots,x_{n},\lambda$ and set them equal to zero. First, for $x_{1}$:
$$\begin{aligned}\frac{\partial L(x_{1},\cdots,x_{n},\lambda )}{\partial x_{1}} &= \frac{\partial }{\partial x_{1}}\left [ \sum_{k=1}^{n}x_{k}\log_{2}x_{k} + \lambda\left(\sum_{k=1}^{n}x_{k} - 1\right) \right ] \\&= \log_{2}x_{1} + x_{1}\cdot \frac{1}{x_{1}\ln 2} + \lambda \\&= \log_{2}x_{1} + \frac{1}{\ln 2} + \lambda =0\end{aligned}$$
which gives
$$\lambda = -\log_{2}x_{1} - \frac{1}{\ln 2}$$
Taking the partial derivatives with respect to $x_{2},\cdots,x_{n}$ in the same way yields
$$\lambda = -\log_{2}x_{1} - \frac{1}{\ln 2} = -\log_{2}x_{2} - \frac{1}{\ln 2} = \cdots = -\log_{2}x_{n} - \frac{1}{\ln 2}$$
Taking the partial derivative with respect to $\lambda$:
$$\frac{\partial L(x_{1},\cdots,x_{n},\lambda )}{\partial \lambda} = \frac{\partial }{\partial \lambda}\left [ \sum_{k=1}^{n}x_{k}\log_{2}x_{k} + \lambda\left(\sum_{k=1}^{n}x_{k} - 1\right) \right ] = \sum_{k=1}^{n}x_{k} - 1$$
Setting it to zero gives
$$\sum_{k=1}^{n}x_{k} = 1$$
Since $x_{1} = x_{2} = \cdots = x_{n}$ and $\sum_{k=1}^{n}x_{k} = 1$, we can solve $x_{1} = x_{2} = \cdots = x_{n} = \frac{1}{n}$.
Each $x_{k}$ must also satisfy $0 \leq x_{k}\leq 1$, and clearly $0 \leq \frac{1}{n} \leq 1$, so $x_{1} = x_{2} = \cdots = x_{n} = \frac{1}{n}$ satisfies every constraint and is the minimizer of the minimization problem, hence the maximizer of $f(x_{1},x_{2},\cdots,x_{n})$. Substituting it into $f(x_{1},x_{2},\cdots,x_{n})$ gives
$$f\left(\frac{1}{n},\frac{1}{n},\cdots,\frac{1}{n}\right) = -\sum_{k=1}^{n}\frac{1}{n}\log_{2}\frac{1}{n} = -n\cdot \frac{1}{n} \cdot \log_{2}\frac{1}{n} = \log_{2}n$$
So under the constraints $0 \leq x_{k}\leq 1$ and $\sum_{k=1}^{n}x_{k} = 1$, the maximum of $f(x_{1},x_{2},\cdots,x_{n})$ is $\log_{2}n$.
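This maximum can be double-checked numerically: the sketch below (illustrative helper names, not a library API) confirms that the uniform distribution attains $\log_{2}n$ and that randomly drawn distributions never exceed it:

```python
import random
from math import log2

def entropy_of(p):
    # 0 * log2(0) is taken as 0, as in the derivation above
    return sum(-x * log2(x) for x in p if x > 0)

n = 4
uniform = entropy_of([1 / n] * n)
assert abs(uniform - log2(n)) < 1e-12  # log2(4) = 2

# No random distribution on the simplex exceeds the uniform one's entropy:
random.seed(0)
for _ in range(1000):
    w = [random.random() for _ in range(n)]
    s = sum(w)
    assert entropy_of([x / s for x in w]) <= uniform + 1e-12
print("uniform distribution attains the maximum log2(n)")
```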
Finding the minimum of $Ent(D)$:
If we drop the constraint $\sum_{k=1}^{n}x_{k} = 1$ and keep only $0 \leq x_{k}\leq 1$, then $f(x_{1},x_{2},\cdots,x_{n})$ can be viewed as the sum of $n$ mutually independent one-variable functions:
$$f(x_{1},x_{2},\cdots,x_{n}) =\sum_{k=1}^{n}g(x_{k})$$
where $g(x_{k}) = -x_{k}\log_{2}x_{k}$ with $0 \leq x_{k}\leq 1$. When each of $g(x_{1}),g(x_{2}),\cdots,g(x_{n})$ attains its minimum, $f(x_{1},x_{2},\cdots,x_{n})$ attains its minimum as well. Since $g(x_{1}),g(x_{2}),\cdots,g(x_{n})$ all have the same domain and the same expression, minimizing $g(x_{1})$ minimizes all of them, so we only need to minimize $g(x_{1})$. First take its first and second derivatives with respect to $x_{1}$:
$$g'(x_{1}) = -\log_{2}x_{1} - x_{1}\cdot \frac{1}{x_{1}\ln 2} = -\log_{2}x_{1} - \frac{1}{\ln 2},\qquad g''(x_{1}) = -\frac{1}{x_{1}\ln 2}$$
Clearly $g''(x_{1}) = -\frac{1}{x_{1}\ln 2} < 0$ for $0 < x_{1} \leq 1$, so $g(x_{1})$ is concave (opening downward) on its domain, and its minimum must be attained at a boundary. Substituting $x_{1} = 0$ and $x_{1} = 1$ into $g(x_{1})$, with the convention $0\log_{2}0 = 0$, gives
$$g(0) = -0\log_{2}0 = 0,\qquad g(1) = -1\cdot\log_{2}1 = 0$$
So the minimum of $g(x_{1})$ is 0, and by the same argument the minima of $g(x_{2}),\cdots,g(x_{n})$ are also 0, so the minimum of $f(x_{1},x_{2},\cdots,x_{n})$ is 0. However, this minimum was obtained under $0 \leq x_{k}\leq 1$ alone; adding the constraint $\sum_{k=1}^{n}x_{k} = 1$ can only make the minimum of $f$ greater than or equal to 0. If we set some $x_{k}=1$, then the constraint $\sum_{k=1}^{n}x_{k} = 1$ forces $x_{1} = x_{2} = \cdots = x_{k-1} = x_{k+1} = \cdots = x_{n} = 0$, and substituting into $f(x_{1},x_{2},\cdots,x_{n})$ gives
$$f(0,\cdots,0,1,0,\cdots,0) = -0\log_{2}0 - \cdots - 1\cdot\log_{2}1 - \cdots - 0\log_{2}0 = 0$$
So $x_{k} = 1$, $x_{1} = x_{2} = \cdots = x_{k-1} = x_{k+1} = \cdots = x_{n} = 0$ attains the minimum of $f(x_{1},x_{2},\cdots,x_{n})$ under the constraints $0 \leq x_{k}\leq 1$ and $\sum_{k=1}^{n}x_{k} = 1$, and that minimum is 0.
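The one-hot minimum is also easy to check numerically (a small sketch with an illustrative helper name):

```python
from math import log2

def entropy_of(p):
    # 0 * log2(0) is taken as 0, matching the limit convention in the text
    return sum(-x * log2(x) for x in p if x > 0)

# All probability mass on one class: entropy attains its minimum, 0
print(entropy_of([0.0, 1.0, 0.0]))  # 0.0
```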
Conditional entropy: a measure of the purity of the sample set given the value of attribute $a$:
$$H(D|a) = \sum_{v=1}^{V}\frac{|D^v|}{|D|}Ent(D^v)$$
where $a$ is an attribute of the samples with $V$ possible values $\left \{ a^1,a^2,\cdots,a^V\right \}$, $D^v$ denotes the subset of $D$ whose samples take value $a^v$ on attribute $a$, and $Ent(D^v)$ is the information entropy of $D^v$. The smaller $H(D|a)$, the higher the purity.
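A minimal sketch of $H(D|a)$ (the helper names `entropy` and `conditional_entropy` are mine): group the samples by their value of $a$, take each group's entropy, and weight by group size:

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def conditional_entropy(values, labels):
    """H(D|a) = sum_v |D^v|/|D| * Ent(D^v)."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    n = len(labels)
    return sum(len(g) / n * entropy(g) for g in groups.values())

# The attribute value determines the class exactly, so H(D|a) = 0 (max purity)
a = ['x', 'x', 'y', 'y']
y = [1, 1, 0, 0]
print(conditional_entropy(a, y))  # 0.0
```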
An ID3 decision tree selects the splitting attribute by information gain, whose formula is
$$\begin{aligned}Gain(D,a) &= Ent(D) - \sum_{v=1}^{V}\frac{|D^v|}{|D|}Ent(D^v) \\&= Ent(D) - H(D|a)\end{aligned}$$
The attribute with the largest information gain is chosen as the splitting attribute, because a larger gain means a larger "purity improvement" from splitting on that attribute.
However, an ID3 tree built on this criterion is biased toward attributes with many possible values:
$$\begin{aligned}Gain(D,a) &= Ent(D) - \sum_{v=1}^{V}\frac{|D^v|}{|D|}Ent(D^v) \\&= Ent(D) - \sum_{v=1}^{V}\frac{|D^v|}{|D|}\left(-\sum_{k=1}^{|\mathcal{Y}|}p_{k}\log_{2}p_{k}\right) \\&= Ent(D) - \sum_{v=1}^{V}\frac{|D^v|}{|D|}\left(-\sum_{k=1}^{|\mathcal{Y}|}\frac{|D_{k}^{v}|}{|D^v|}\log_{2}\frac{|D_{k}^{v}|}{|D^v|}\right)\end{aligned}$$
where $D_{k}^{v}$ denotes the set of samples in $D$ that take value $a^{v}$ on attribute $a$ and belong to class $k$, so that within $D^{v}$ the class proportion is $p_{k} = \frac{|D_{k}^{v}|}{|D^{v}|}$. The more values $a$ has, the smaller and purer each $D^{v}$ tends to be, which inflates the gain.
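The many-valued bias is easy to reproduce numerically: an ID-like attribute with a unique value per sample makes every $D^v$ pure, driving $H(D|a)$ to 0 and the gain to its maximum. A sketch with illustrative helper names:

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """Gain(D, a) = Ent(D) - H(D|a)."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    n = len(labels)
    cond = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - cond

y = [1, 1, 0, 0]
print(info_gain(['x', 'x', 'y', 'y'], y))      # 1.0: a perfect two-valued split
print(info_gain(['s1', 's2', 's3', 's4'], y))  # 1.0: an ID column gets maximal gain too
```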
C4.5 Decision Tree
A C4.5 decision tree selects the splitting attribute by the gain ratio:
$$Gain\_ratio(D,a) = \frac{Gain(D,a)}{IV(a)}$$
where
$$IV(a) = -\sum_{v=1}^{V}\frac{|D^v|}{|D|}\log_{2}\frac{|D^v|}{|D|}$$
CART Decision Tree
A CART decision tree selects the splitting attribute by the Gini index.
Gini value:
$$Gini(D) = \sum_{k=1}^{|\mathcal{Y}|}\sum_{k'\neq k}p_{k}p_{k'} = \sum_{k=1}^{|\mathcal{Y}|}p_{k}\sum_{k'\neq k}p_{k'} = \sum_{k=1}^{|\mathcal{Y}|}p_{k}(1-p_{k}) = 1-\sum_{k=1}^{|\mathcal{Y}|}p_{k}^2$$
Gini index:
$$Gini\_index(D,a) =\sum_{v=1}^{V}\frac{|D^v|}{|D|}Gini(D^v)$$
The smaller the Gini value and the Gini index, the higher the purity of the sample set.
CART classification algorithm
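The Gini value behaves qualitatively like entropy; a minimal sketch (the `gini` helper name is mine, not a library API):

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum_k p_k^2 over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini([1, 1, 0, 0]))  # 0.5: maximally impure binary set
print(gini([1, 1, 1, 1]))  # 0.0: pure set
```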
- Using the Gini index formula $Gini\_index(D,a) =\sum_{v=1}^{V}\frac{|D^v|}{|D|}Gini(D^v)$, find the attribute $a_{*}$ with the smallest Gini index.
- Compute the Gini value $Gini(D^v)$ for every possible value of $a_{*}$, $v=1,2,\cdots,V$, and choose the value $a_{*}^{v}$ with the smallest Gini value as the split point. Split $D$ into two sets (nodes) $D_{1}$ and $D_{2}$, where $D_{1}$ contains the samples with $a_{*}=a_{*}^{v}$ and $D_{2}$ the samples with $a_{*}\neq a_{*}^{v}$.
- Repeat steps 1 and 2 on $D_{1}$ and $D_{2}$ until a stopping condition is met.
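The split-point selection in the steps above can be sketched for a single attribute as follows (an illustrative `best_binary_split` helper under the assumptions of this note; production CART implementations differ in detail):

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_binary_split(values, labels):
    """Try each value a^v as the split D1 = {a = a^v}, D2 = {a != a^v};
    return (value, weighted Gini) for the smallest weighted Gini."""
    n = len(labels)
    best = None
    for v in set(values):
        d1 = [y for x, y in zip(values, labels) if x == v]
        d2 = [y for x, y in zip(values, labels) if x != v]
        score = len(d1) / n * gini(d1) + len(d2) / n * gini(d2)
        if best is None or score < best[1]:
            best = (v, score)
    return best

# Either value splits this set perfectly, so the best weighted Gini is 0
print(best_binary_split(['x', 'x', 'y', 'y'], [1, 1, 0, 0])[1])  # 0.0
```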
CART regression algorithm
- Find the optimal splitting attribute $a_{*}$ and optimal split point $a_{*}^{v}$ by
$$a_{*},a_{*}^{v} = \underset{a,a^v}{\arg\min}\left [\min_{c_{1}} \sum_{x_{i} \in D_{1}(a,a^v)}(y_{i}-c_{1})^2+\min_{c_{2}} \sum_{x_{i} \in D_{2}(a,a^v)}(y_{i}-c_{2})^2 \right ]$$
where $D_{1}(a,a^v)$ denotes the samples whose value on attribute $a$ is at most $a^v$, $D_{2}(a,a^v)$ the samples whose value is greater than $a^v$, and $c_{1}$, $c_{2}$ the mean outputs of $D_{1}$ and $D_{2}$ respectively.
- Split $D$ into the two sets (nodes) $D_{1}$ and $D_{2}$ at the split point $a_{*}^{v}$.
- Repeat steps 1 and 2 on $D_{1}$ and $D_{2}$ until a stopping condition is met.
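The regression criterion above can be sketched for one numeric feature (illustrative helper names; a sketch under the assumption that candidate thresholds are the observed feature values):

```python
def squared_error(ys):
    """min_c sum (y_i - c)^2 is attained at c = mean(ys)."""
    if not ys:
        return 0.0
    c = sum(ys) / len(ys)
    return sum((y - c) ** 2 for y in ys)

def best_regression_split(xs, ys):
    """Scan thresholds t: D1 = {x <= t}, D2 = {x > t};
    return (t, loss) minimizing the summed squared error of both sides."""
    best = None
    for t in sorted(set(xs))[:-1]:  # splitting above the maximum leaves D2 empty
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        loss = squared_error(left) + squared_error(right)
        if best is None or loss < best[1]:
            best = (t, loss)
    return best

xs = [1.0, 2.0, 3.0, 4.0]
ys = [0.0, 0.0, 10.0, 10.0]
print(best_regression_split(xs, ys))  # (2.0, 0.0): splits between the two plateaus
```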