Analyzing and Running the Open-Source Library libfm
The previous post introduced the basic concepts and formulas of FM, three advantages of using FM for recommendation, and the three methods for training an FM model. In [1], [2], and [3], Rendle et al. developed libfm, an open-source C++ software library for Unix systems built specifically for FM.
First, clone the whole repository and enter its directory:
git clone https://github.com/srendle/libfm.git
cd libfm
Next, compile all the source code:
make all
This produces three executables in the bin directory:
convert: converts text input into the binary input format
transpose: a tool for transposing a binary design matrix
libFM: the main libFM tool
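As a usage sketch for the first two tools (the flag names follow my recollection of the binary-format example in the libFM manual, so check them against each tool's --help output):
./convert --ifile train.libfm --ofilex train.x --ofiley train.y
./transpose --ifile train.x --ofile train.xt
convert splits a libfm-format text file into a binary design matrix (.x) and a target vector (.y); transpose then produces the transposed design matrix, which, if I recall the manual correctly, the ALS/MCMC solvers expect.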
Training is generally done with libFM. Running the following command:
./libFM -help
prints all of its command-line options.
Using the MovieLens dataset, converting it with the Perl script ./scripts/triple_format_to_libfm.pl that ships with libfm, and splitting the result into a training set and a test set, FM training can then be run with:
./libFM -task r -train train.csv.libfm -test test.csv.libfm -dim '1,1,8'
This trains the FM: -task r selects regression, and -dim '1,1,8' enables the global bias term, the first-order weights, and k = 8 latent factors for the pairwise interactions.
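For completeness, the conversion step might look like the following, adapted from the MovieLens example in the libFM manual (the separator and column indices depend on the layout of the ratings file, so treat this as a sketch):
./scripts/triple_format_to_libfm.pl -in ratings.dat -target 2 -delete_column 3 -separator "::"
Here column 2 (the rating) becomes the target, and column 3 (the timestamp) is dropped; each user::item::rating line is rewritten into libfm's sparse text format.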
Code Analysis
libfm integrates three training methods (SGD, ALS, and MCMC), so let us dissect its core code.
First, ./src/fm_core defines three files:
fm_data.h: defines the floating-point data types used throughout (e.g., DATA_FLOAT)
fm_model.h: defines the FM model; its API includes init(), predict(sparse_row<FM_FLOAT>& x), saveModel(), and loadModel()
fm_sgd.h: defines the concrete SGD update step
This folder mainly contains the core FM model itself.
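The heart of fm_model.h is the prediction equation, which can be evaluated in O(k·nnz(x)) time thanks to Rendle's reformulation of the pairwise term (Lemma 3.1 in the FM paper). Below is a minimal, self-contained sketch of that trick; the names SparseEntry and fm_predict are mine, not libfm's:
#include <cstddef>
#include <vector>

// Sketch of FM prediction in linear time:
// y(x) = w0 + sum_i w_i x_i
//      + sum_f 0.5 * ((sum_i v_{f,i} x_i)^2 - sum_i v_{f,i}^2 x_i^2)
struct SparseEntry { std::size_t id; double value; };

double fm_predict(double w0,
                  const std::vector<double>& w,               // first-order weights
                  const std::vector<std::vector<double>>& v,  // v[f][i]: factor f of feature i
                  const std::vector<SparseEntry>& x) {
  double result = w0;
  for (const SparseEntry& e : x) result += w[e.id] * e.value;  // linear terms
  for (std::size_t f = 0; f < v.size(); ++f) {                 // pairwise terms, per factor
    double sum = 0.0, sum_sqr = 0.0;
    for (const SparseEntry& e : x) {
      const double d = v[f][e.id] * e.value;
      sum += d;
      sum_sqr += d * d;
    }
    result += 0.5 * (sum * sum - sum_sqr);
  }
  return result;
}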
In my view, the most central part is in ./src/libfm/src, where the series of training models is defined.
The SGD Algorithm
This lives mainly in ./src/libfm/src/fm_learn_sgd.h. The function below takes a sparse feature vector and delegates to the fm_SGD routine:
void fm_learn_sgd::SGD(sparse_row<DATA_FLOAT> &x, const double multiplier, DVector<double> &sum) {
fm_SGD(fm, learn_rate, x, multiplier, sum);
}
// implementation of algorithm 1
void fm_SGD(fm_model* fm, const double& learn_rate, sparse_row<DATA_FLOAT> &x, const double multiplier, DVector<double> &sum) {
// 0th-order parameter: the global bias term w0
if (fm->k0) {
double& w0 = fm->w0;
w0 -= learn_rate * (multiplier + fm->reg0 * w0);
}
// 1st-order parameters: one weight per individual feature
if (fm->k1) {
for (uint i = 0; i < x.size; i++) {
double& w = fm->w(x.data[i].id);
w -= learn_rate * (multiplier * x.data[i].value + fm->regw * w);
}
}
// 2nd-order parameters: factor vectors modeling pairwise feature interactions.
// sum(f) holds the precomputed sum_j v_{f,j} * x_j, so grad below is
// dy/dv(f,i) = x_i * sum(f) - v(f,i) * x_i^2
for (int f = 0; f < fm->num_factor; f++) {
for (uint i = 0; i < x.size; i++) {
double& v = fm->v(f,x.data[i].id);
double grad = sum(f) * x.data[i].value - v * x.data[i].value * x.data[i].value;
v -= learn_rate * (multiplier * grad + fm->regv * v);
}
}
}
As the code shows, fm_SGD implements the gradient-update part of Algorithm 1.
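Note that the loss function enters only through the multiplier argument: it is the derivative of the loss with respect to the raw prediction, computed by the calling learner (fm_learn_sgd_element.h in libfm, if I read the source correctly) before fm_SGD is invoked. A minimal sketch of what that computation can look like; the helper loss_multiplier is my own, not libfm's exact code:
#include <cmath>

// Derivative of the loss w.r.t. the model prediction p.
// Regression (squared error):   d/dp 0.5*(p - y)^2 = p - y
// Classification (logistic loss, y in {-1,+1}):
//   d/dp log(1 + exp(-y*p)) = -y * sigma(-y*p)
double loss_multiplier(bool regression, double p, double y) {
  if (regression) {
    return p - y;
  }
  return -y * (1.0 / (1.0 + std::exp(y * p)));  // -y * sigma(-y*p)
}
For the prediction step, libfm's implementation is as follows: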
void fm_learn_sgd::predict(Data& data, DVector<double>& out) {
assert(data.data->getNumRows() == out.dim);
for (data.data->begin(); !data.data->end(); data.data->next()) {
double p = predict_case(data);
if (task == TASK_REGRESSION) {
// clamp regression predictions to the observed target range
p = std::min(max_target, p);
p = std::max(min_target, p);
} else if (task == TASK_CLASSIFICATION) {
// squash the raw score into a probability via the logistic function
p = 1.0/(1.0 + exp(-p));
} else {
throw "task not supported";
}
out(data.data->getRowIndex()) = p;
}
}
As this shows, libfm mainly supports prediction for regression and classification tasks (selected with -task r and -task c, respectively).
The ALS and MCMC Methods
The main implementation is ./src/libfm/src/fm_learn_mcmc.h: MCMC- and ALS-based FM learning. This file contains the sampler that draws a full sample of all model and prior parameters. Its training procedure is implemented as follows:
void fm_learn_mcmc::learn(Data& train, Data& test) {
pred_sum_all.setSize(test.num_cases);
pred_sum_all_but5.setSize(test.num_cases);
pred_this.setSize(test.num_cases);
pred_sum_all.init(0.0);
pred_sum_all_but5.init(0.0);
pred_this.init(0.0);
// init caches data structure
MemoryLog::getInstance().logNew("e_q_term", sizeof(e_q_term), train.num_cases);
cache = new e_q_term[train.num_cases];
MemoryLog::getInstance().logNew("e_q_term", sizeof(e_q_term), test.num_cases);
cache_test = new e_q_term[test.num_cases];
rel_cache.setSize(train.relation.dim);
for (uint r = 0; r < train.relation.dim; r++) {
MemoryLog::getInstance().logNew("relation_cache", sizeof(relation_cache), train.relation(r).data->num_cases);
rel_cache(r) = new relation_cache[train.relation(r).data->num_cases];
for (uint c = 0; c < train.relation(r).data->num_cases; c++) {
rel_cache(r)[c].wnum = 0;
}
}
// calculate #^R
for (uint r = 0; r < train.relation.dim; r++) {
for (uint c = 0; c < train.relation(r).data_row_to_relation_row.dim; c++) {
rel_cache(r)[train.relation(r).data_row_to_relation_row(c)].wnum += 1.0;
}
}
_learn(train, test);
// free data structures
for (uint i = 0; i < train.relation.dim; i++) {
MemoryLog::getInstance().logFree("relation_cache", sizeof(relation_cache), train.relation(i).data->num_cases);
delete[] rel_cache(i);
}
MemoryLog::getInstance().logFree("e_q_term", sizeof(e_q_term), test.num_cases);
delete[] cache_test;
MemoryLog::getInstance().logFree("e_q_term", sizeof(e_q_term), train.num_cases);
delete[] cache;
}
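The cache and cache_test arrays hold one e_q_term per training/test case. As far as I can recall from fm_learn_mcmc.h, it is a small struct along these lines (treat the exact definition as an assumption rather than a quote of the source):
struct e_q_term {
double e; // error/residual term of this case
double q; // per-factor partial sum, recomputed while sampling each factor
};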
As the code shows, training here mainly maintains and updates the residual terms $e_i := y_i - \hat{y}(\mathbf{x}_i \mid \Theta)$, which corresponds to Algorithm 2 of the ALS method.
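Why residuals are enough: the FM prediction is linear in every single parameter $\theta$, so it can be written as $\hat{y}(\mathbf{x}) = g_{(\theta)}(\mathbf{x}) + \theta\, h_{(\theta)}(\mathbf{x})$. Under squared loss with regularization $\lambda_\theta$, the coordinate-wise optimum is then expressible through the $e_i$ alone. A sketch of the update in my own notation, following the derivation in Rendle et al.'s ALS paper:

$$\theta^{*} = \frac{\theta \sum_{i} h_{(\theta)}(\mathbf{x}_i)^{2} + \sum_{i} e_i\, h_{(\theta)}(\mathbf{x}_i)}{\sum_{i} h_{(\theta)}(\mathbf{x}_i)^{2} + \lambda_\theta}$$

MCMC replaces this deterministic assignment by drawing $\theta$ from a normal distribution whose mean has a similar form, which is why both methods share this code path.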
Here, the _learn function is virtual and is implemented in the subclass fm_learn_mcmc_simultaneous; the core part of that code is as follows:
double acc_train = 0.0;
double rmse_train = 0.0;
if (task == TASK_REGRESSION) {
// evaluate test and store it
for (uint c = 0; c < test.num_cases; c++) {
double p = cache_test[c].e;
pred_this(c) = p;
p = std::min(max_target, p);
p = std::max(min_target, p);
pred_sum_all(c) += p;
if (i >= 5) { // i is the sampling iteration; the first 5 draws are discarded as burn-in
pred_sum_all_but5(c) += p;
}
}
// Evaluate the training dataset and update the e-terms
for (uint c = 0; c < train.num_cases; c++) {
double p = cache[c].e;
p = std::min(max_target, p);
p = std::max(min_target, p);
double err = p - train.target(c);
rmse_train += err*err;
cache[c].e = cache[c].e - train.target(c); // turn the stored prediction into the error term (prediction - target)
}
rmse_train = std::sqrt(rmse_train/train.num_cases);
} else if (task == TASK_CLASSIFICATION) {
// evaluate test and store it
for (uint c = 0; c < test.num_cases; c++) {
double p = cache_test[c].e;
p = cdf_gaussian(p);
pred_this(c) = p;
pred_sum_all(c) += p;
if (i >= 5) {
pred_sum_all_but5(c) += p;
}
}
// Evaluate the training dataset and update the e-terms
uint _acc_train = 0;
for (uint c = 0; c < train.num_cases; c++) {
double p = cache[c].e;
p = cdf_gaussian(p);
if (((p >= 0.5) && (train.target(c) > 0.0)) || ((p < 0.5) && (train.target(c) < 0.0))) {
_acc_train++;
}
double sampled_target;
if (train.target(c) >= 0.0) {
if (do_sample) {
sampled_target = ran_left_tgaussian(0.0, cache[c].e, 1.0);
} else {
// the target is the expected value of the truncated normal
double mu = cache[c].e;
double phi_minus_mu = exp(-mu*mu/2.0) / sqrt(3.141*2);
double Phi_minus_mu = cdf_gaussian(-mu);
sampled_target = mu + phi_minus_mu / (1-Phi_minus_mu);
}
} else {
if (do_sample) {
sampled_target = ran_right_tgaussian(0.0, cache[c].e, 1.0);
} else {
// the target is the expected value of the truncated normal
double mu = cache[c].e;
double phi_minus_mu = exp(-mu*mu/2.0) / sqrt(3.141*2);
double Phi_minus_mu = cdf_gaussian(-mu);
sampled_target = mu - phi_minus_mu / Phi_minus_mu;
}
}
cache[c].e = cache[c].e - sampled_target;
}
acc_train = (double) _acc_train / train.num_cases;
}
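For context: the classification branch uses a probit-style data augmentation, replacing the unobserved real-valued target with a draw from (or, when do_sample is off, the mean of) a truncated normal. The closed-form means used above are the standard ones for $z \sim \mathcal{N}(\mu, 1)$, with $\varphi$ and $\Phi$ the standard normal pdf and cdf (note the code approximates $\pi$ by 3.141):

$$\mathbb{E}[z \mid z \ge 0] = \mu + \frac{\varphi(\mu)}{\Phi(\mu)}, \qquad \mathbb{E}[z \mid z < 0] = \mu - \frac{\varphi(\mu)}{1 - \Phi(\mu)}$$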
In addition, regression and classification tasks are evaluated with different accuracy metrics; the code is as follows:
void fm_learn_mcmc_simultaneous::_evaluate(DVector<double>& pred, DVector<DATA_FLOAT>& target, double normalizer, double& rmse, double& mae, uint from_case, uint to_case) {
assert(pred.dim == target.dim);
double _rmse = 0;
double _mae = 0;
uint num_cases = 0;
for (uint c = std::max((uint) 0, from_case); c < std::min((uint)pred.dim, to_case); c++) {
double p = pred(c) * normalizer; // normalizer rescales accumulated predictions (e.g. 1/#draws when pred is a sum over samples)
p = std::min(max_target, p);
p = std::max(min_target, p);
double err = p - target(c);
_rmse += err*err;
_mae += std::abs((double)err);
num_cases++;
}
rmse = std::sqrt(_rmse/num_cases);
mae = _mae/num_cases;
}
void fm_learn_mcmc_simultaneous::_evaluate_class(DVector<double>& pred, DVector<DATA_FLOAT>& target, double normalizer, double& accuracy, double& loglikelihood, uint from_case, uint to_case) {
double _loglikelihood = 0.0;
uint _accuracy = 0;
uint num_cases = 0;
for (uint c = std::max((uint) 0, from_case); c < std::min((uint)pred.dim, to_case); c++) {
double p = pred(c) * normalizer;
if (((p >= 0.5) && (target(c) > 0.0)) || ((p < 0.5) && (target(c) < 0.0))) {
_accuracy++;
}
double m = (target(c)+1.0)*0.5; // map target from {-1,+1} to {0,1}
double pll = p;
if (pll > 0.99) { pll = 0.99; }
if (pll < 0.01) { pll = 0.01; }
_loglikelihood -= m*log10(pll) + (1-m)*log10(1-pll);
num_cases++;
}
loglikelihood = _loglikelihood/num_cases;
accuracy = (double) _accuracy / num_cases;
}
As we can see, regression is evaluated with RMSE (and MAE), while classification is evaluated with accuracy and a cross-entropy-style negative log-likelihood.
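Written out, the classification metric computed by _evaluate_class is (transcribing the code directly, with $m_i = (y_i+1)/2$ mapping targets to $\{0,1\}$ and $p_i$ clipped to $[0.01, 0.99]$):

$$-\frac{1}{N} \sum_{i=1}^{N} \left[ m_i \log_{10} p_i + (1 - m_i) \log_{10} (1 - p_i) \right]$$

Note that the implementation uses $\log_{10}$ rather than the natural logarithm, so the value is a rescaled cross-entropy.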