Reference: https://www.cnblogs.com/pinard/
PCA: projection z = W^T x; the columns of W are eigenvectors of the (centered) scatter matrix: X X^T w = lambda w; kernel PCA replaces x with a feature map phi: phi(X) phi(X)^T w = lambda w
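A minimal numpy sketch of the PCA line above: eigendecompose the sample covariance, keep the top-k eigenvectors W, and project z = W^T x (the toy data matrix `X` is made up for illustration).

```python
import numpy as np

def pca(X, k):
    """Project rows of X onto the top-k principal components (z = W^T x)."""
    Xc = X - X.mean(axis=0)              # center the data first
    cov = Xc.T @ Xc / (len(Xc) - 1)      # sample covariance matrix
    vals, vecs = np.linalg.eigh(cov)     # eigh: for symmetric matrices
    order = np.argsort(vals)[::-1][:k]   # indices of the top-k eigenvalues
    W = vecs[:, order]                   # eigenvectors as columns
    return Xc @ W                        # z = W^T x, applied row-wise

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])
Z = pca(X, 1)   # 6 samples projected onto 1 component
```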
SVD: from the eigenproblem A x = lambda x, A W = W Sigma => A = W Sigma W^-1; if A is real symmetric the eigenvectors are orthonormal, so A = W Sigma W^T; for a general (possibly rectangular) A this generalizes to the SVD A = U Sigma V^T
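The decomposition A = U Sigma V^T can be checked directly with numpy (the matrix `A` is an arbitrary example):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])
# Reduced SVD: U is 3x2, s holds the singular values, Vt is 2x2.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_rec = U @ np.diag(s) @ Vt   # reconstruct A = U * Sigma * V^T
```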
Topic models - latent semantic indexing (LSI): A = U Sigma V^T (truncated SVD of the term-document matrix A)
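A small LSI sketch: truncate the SVD of a toy term-document matrix to k topics and read off document coordinates in topic space (the matrix and k=2 are made up for illustration).

```python
import numpy as np

# Toy term-document matrix A: rows are terms, columns are documents.
A = np.array([[1., 1., 0., 0.],
              [1., 0., 0., 0.],
              [0., 1., 0., 0.],
              [0., 0., 1., 1.],
              [0., 0., 1., 0.]])
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
# Keep only the top-k singular triplets: documents in k-dim topic space.
doc_topics = np.diag(s[:k]) @ Vt[:k]
```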
Topic models - non-negative matrix factorization (NMF): A ≈ W H with W, H >= 0; loss = argmin_{W,H} 1/2*||A - WH||_F^2 + alpha*rho*(||W||_1 + ||H||_1) + alpha*(1-rho)/2*(||W||_F^2 + ||H||_F^2) (elastic-net style regularization, rho the L1 ratio)
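A minimal NMF sketch using Lee-Seung multiplicative updates for the plain Frobenius loss; the L1/L2 penalty terms in the loss above are omitted here for brevity, and the test matrix is random.

```python
import numpy as np

def nmf(A, k, iters=200, seed=0):
    """Factor A ≈ W H (W, H >= 0) by multiplicative updates.
    Plain Frobenius objective only; regularization terms omitted."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k)) + 0.1   # positive init keeps iterates >= 0
    H = rng.random((k, n)) + 0.1
    eps = 1e-9                     # avoid division by zero
    for _ in range(iters):
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

A = np.abs(np.random.default_rng(1).random((6, 5)))
W, H = nmf(A, 2)
```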
Word segmentation: pick the segmentation r = argmax_i P(A_{i1}, A_{i2}, ..., A_{in_i}); the Markov assumption P(A_{ij} | A_{i1}, A_{i2}, ..., A_{i(j-1)}) = P(A_{ij} | A_{i(j-1)}) gives a 2-gram model, P(A_{i1} A_{i2} ... A_{in}) = P(A_{i1}) * P(A_{i2}|A_{i1}) * P(A_{i3}|A_{i2}) * ... * P(A_{in}|A_{i(n-1)}); solve with the Viterbi algorithm; estimate P(w2|w1) = freq(w1, w2) / freq(w1)
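A toy 2-gram Viterbi segmenter under the estimate P(w2|w1) = freq(w1,w2)/freq(w1); the `freq`/`bigram` counts are invented for illustration, and for simplicity it keeps one best path per character position (a full bigram Viterbi would key the state on the last word as well).

```python
import math

# Hypothetical toy counts; "<s>" marks sentence start.
freq = {"研究": 10, "生命": 5, "研究生": 4, "命": 2, "起源": 8, "<s>": 20}
bigram = {("<s>", "研究"): 6, ("研究", "生命"): 3, ("生命", "起源"): 2,
          ("<s>", "研究生"): 2, ("研究生", "命"): 1, ("命", "起源"): 1}

def p(w1, w2):
    """P(w2|w1) = freq(w1,w2)/freq(w1), with a small floor for unseen pairs."""
    return bigram.get((w1, w2), 0.01) / freq.get(w1, 100)

def segment(text, vocab):
    """Maximize sum of log P(w_j | w_{j-1}) over all segmentations."""
    n = len(text)
    best = {0: (0.0, ["<s>"])}              # end position -> (log-prob, words)
    for j in range(1, n + 1):
        for i in range(max(0, j - 4), j):   # candidate words up to length 4
            w = text[i:j]
            if w in vocab and i in best:
                score = best[i][0] + math.log(p(best[i][1][-1], w))
                if j not in best or score > best[j][0]:
                    best[j] = (score, best[i][1] + [w])
    return best[n][1][1:] if n in best else [text]

words = segment("研究生命起源", set(freq) - {"<s>"})
```

With these counts the path 研究 / 生命 / 起源 scores higher than 研究生 / 命 / 起源, so Viterbi picks it.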
TF-IDF: IDF(x) = log((N+1)/(N(x)+1)) + 1 (smoothed; N = number of documents, N(x) = documents containing x); tf-idf(x) = TF(x) * IDF(x)
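A self-contained sketch of this smoothed TF-IDF on tokenized documents (no normalization step; the toy corpus is made up):

```python
import math

def tfidf(docs):
    """TF-IDF with smoothed IDF: idf(x) = log((N+1)/(N(x)+1)) + 1."""
    N = len(docs)
    df = {}                                   # document frequency N(x)
    for d in docs:
        for w in set(d):
            df[w] = df.get(w, 0) + 1
    out = []
    for d in docs:
        tf = {w: d.count(w) / len(d) for w in set(d)}   # term frequency
        out.append({w: tf[w] * (math.log((N + 1) / (df[w] + 1)) + 1)
                    for w in tf})
    return out

docs = [["a", "b", "a"], ["b", "c"]]
scores = tfidf(docs)
```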
Bag of Words model: word counts; Set of Words model: word presence/absence; Hash Trick: phi'(j) = sum_{i: h(i)=j} phi(i); signed variant phi'(j) = sum_{i: h(i)=j} xi(i)*phi(i) with xi(i) = +/-1, so hash collisions cancel in expectation
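A sketch of both hash-trick variants, using `zlib.crc32` as a stand-in for a fixed hash function and one hash bit as xi(i):

```python
from zlib import crc32

def hash_features(tokens, m):
    """Hash trick into m buckets: j = h(token) mod m.
    unsigned[j] = sum of counts phi(i) with h(i) = j;
    signed[j] additionally multiplies by xi(i) in {+1, -1}
    (taken from a hash bit) so collisions cancel in expectation."""
    unsigned = [0] * m
    signed = [0] * m
    for t in tokens:
        h = crc32(t.encode("utf-8"))
        j = h % m
        xi = 1 if (h >> 31) & 1 == 0 else -1
        unsigned[j] += 1
        signed[j] += xi
    return unsigned, signed

unsigned, signed = hash_features(["cat", "dog", "cat"], 8)
```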
Chinese text mining pipeline: collect data, strip non-text content, handle Chinese encodings, perform word segmentation, remove stop words, feature engineering, build the analysis model.
English text mining pipeline: collect data, strip non-text content, spell check, stemming and lemmatization, lowercase, remove stop words, feature engineering, build the analysis model.
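The English preprocessing steps can be sketched with the standard library alone; spell checking and real stemming/lemmatization (e.g. with NLTK) are left out, and the tiny stopword set is illustrative only.

```python
import re

STOPWORDS = {"the", "a", "an", "is", "are", "of", "and"}  # toy stopword list

def preprocess(text):
    """Strip non-text characters, lowercase, remove stop words."""
    text = re.sub(r"[^A-Za-z\s]", " ", text)   # keep letters and whitespace
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOPWORDS]

tokens = preprocess("The cats are running!")
```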
word2vec:
lda:
Gradient descent: h(x) = X*theta; J = 1/2 * (X*theta - Y)^T * (X*theta - Y); update theta = theta - alpha * dJ/dtheta, where dJ/dtheta = X^T * (X*theta - Y)
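A batch gradient-descent sketch for this J; the 1/m scaling of the gradient is my addition (not in the note's J) to keep the step size stable, and the data is a toy exact-fit line y = 1 + x.

```python
import numpy as np

def gradient_descent(X, Y, alpha=0.1, iters=2000):
    """Minimize J = 1/2 (X theta - Y)^T (X theta - Y) by
    theta <- theta - alpha * dJ/dtheta, dJ/dtheta = X^T (X theta - Y)."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        # Divide by the sample count for a step size independent of m
        # (an assumption for stability, not part of the note's J).
        theta -= alpha * X.T @ (X @ theta - Y) / len(Y)
    return theta

X = np.array([[1., 1.], [1., 2.], [1., 3.]])   # first column is the intercept
Y = np.array([2., 3., 4.])                     # exactly Y = 1 + x
theta = gradient_descent(X, Y)
```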
Least squares: J = 1/2 * (X*theta - Y)^T * (X*theta - Y); set dJ/dtheta = X^T * (X*theta - Y) = 0 => theta = (X^T X)^{-1} X^T Y
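The normal equation is a one-liner in numpy (same toy data, y = 1 + x; `solve` is used instead of an explicit inverse):

```python
import numpy as np

X = np.array([[1., 1.], [1., 2.], [1., 3.]])   # intercept column + feature
Y = np.array([2., 3., 4.])
# Solve (X^T X) theta = X^T Y, i.e. theta = (X^T X)^{-1} X^T Y.
theta = np.linalg.solve(X.T @ X, X.T @ Y)
```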
Linear regression: h_theta(X) = X*theta; J = 1/2*(X*theta - Y)^T*(X*theta - Y); GD method: theta = theta - alpha * X^T*(X*theta - Y); LS method: theta = (X^T X)^{-1} X^T Y. Polynomial regression: map (x1, x2) -> (1, x1, x2, x1^2, x2^2, x1*x2). Generalized linear regression: ln Y = X*theta, or generally g(Y) = X*theta, i.e. Y = g^{-1}(X*theta). Regularization: add alpha*||theta||_1 (lasso) or (alpha/2)*||theta||^2 (ridge) to J; ridge closed form theta = (X^T X + alpha*E)^{-1} X^T Y
Naive Bayes:
knn:
k-means/ k-means++: