Analyzing and Running the Open-Source Library libfm
The previous post introduced the basic concepts and formulas of FM, three advantages of using FM for recommendation, and the three methods for training an FM model. In [1], [2], and [3], Rendle et al. developed libfm, an open-source C++ software library for Unix systems built specifically for FM.
First, clone the whole repository and enter its directory:
git clone https://github.com/srendle/libfm.git
cd libfm
Next, compile all the source code:
make all
This produces three executables in the bin directory:
convert: converts text input into the binary input format
transpose: a tool for transposing a binary design matrix
libFM: the main libFM tool
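As a usage sketch for the first two tools (the flag names follow my recollection of the binary-format example in the libFM manual, so check them against each tool's --help output):
./convert --ifile train.libfm --ofilex train.x --ofiley train.y
./transpose --ifile train.x --ofile train.xt
convert splits a libfm-format text file into a binary design matrix (.x) and a target vector (.y); transpose then produces the transposed design matrix, which, if I recall the manual correctly, the ALS/MCMC solvers expect.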
Training is generally done with libFM. Running the following command:
./libFM -help
prints all of its command-line options.
Using the MovieLens dataset, converting it with the Perl script ./scripts/triple_format_to_libfm.pl that ships with libfm, and splitting the result into a training set and a test set, FM training can then be run with:
./libFM -task r -train train.csv.libfm -test test.csv.libfm -dim '1,1,8'
This trains the FM: -task r selects regression, and -dim '1,1,8' enables the global bias term, the first-order weights, and k = 8 latent factors for the pairwise interactions.
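For completeness, the conversion step might look like the following, adapted from the MovieLens example in the libFM manual (the separator and column indices depend on the layout of the ratings file, so treat this as a sketch):
./scripts/triple_format_to_libfm.pl -in ratings.dat -target 2 -delete_column 3 -separator "::"
Here column 2 (the rating) becomes the target, and column 3 (the timestamp) is dropped; each user::item::rating line is rewritten into libfm's sparse text format.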
Code Analysis
libfm integrates three training methods (SGD, ALS, and MCMC), so let us dissect its core code.
First, ./src/fm_core defines three files:
fm_data.h: defines the floating-point data types used throughout (e.g., DATA_FLOAT)
fm_model.h: defines the FM model; its API includes init(), predict(sparse_row<FM_FLOAT>& x), saveModel(), and loadModel()
fm_sgd.h: defines the concrete SGD update step
This folder mainly contains the core FM model itself.
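The heart of fm_model.h is the prediction equation, which can be evaluated in O(k·nnz(x)) time thanks to Rendle's reformulation of the pairwise term (Lemma 3.1 in the FM paper). Below is a minimal, self-contained sketch of that trick; the names SparseEntry and fm_predict are mine, not libfm's:
#include <cstddef>
#include <vector>

// Sketch of FM prediction in linear time:
// y(x) = w0 + sum_i w_i x_i
//      + sum_f 0.5 * ((sum_i v_{f,i} x_i)^2 - sum_i v_{f,i}^2 x_i^2)
struct SparseEntry { std::size_t id; double value; };

double fm_predict(double w0,
                  const std::vector<double>& w,               // first-order weights
                  const std::vector<std::vector<double>>& v,  // v[f][i]: factor f of feature i
                  const std::vector<SparseEntry>& x) {
  double result = w0;
  for (const SparseEntry& e : x) result += w[e.id] * e.value;  // linear terms
  for (std::size_t f = 0; f < v.size(); ++f) {                 // pairwise terms, per factor
    double sum = 0.0, sum_sqr = 0.0;
    for (const SparseEntry& e : x) {
      const double d = v[f][e.id] * e.value;
      sum += d;
      sum_sqr += d * d;
    }
    result += 0.5 * (sum * sum - sum_sqr);
  }
  return result;
}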
In my view, the most central part is in ./src/libfm/src, where the series of training models is defined.
The SGD Algorithm
This lives mainly in ./src/libfm/src/fm_learn_sgd.h. The function below takes a sparse feature vector and delegates to the fm_SGD routine:
void fm_learn_sgd::SGD(sparse_row<DATA_FLOAT> &x, const double multiplier, DVector<double> &sum) {
fm_SGD(fm, learn_rate, x, multiplier, sum);
}
// implementation of algorithm 1
void fm_SGD(fm_model* fm, const double& learn_rate, sparse_row<DATA_FLOAT> &x, const double multiplier, DVector<double> &sum) {
// 0th-order parameter: the global bias term w0
if (fm->k0) {
double& w0 = fm->w0;
w0 -= learn_rate * (multiplier + fm->reg0 * w0);
}
// 1st-order parameters: one weight per individual feature
if (fm->k1) {
for (uint i = 0; i < x.size; i++) {
double& w = fm->w(x.data[i].id);
w -= learn_rate * (multiplier * x.data[i].value + fm->regw * w);
}
}
// 2nd-order parameters: factor vectors modeling pairwise feature interactions.
// sum(f) holds the precomputed sum_j v_{f,j} * x_j, so grad below is
// dy/dv(f,i) = x_i * sum(f) - v(f,i) * x_i^2
for (int f = 0; f < fm->num_factor; f++) {
for (uint i = 0; i < x.size; i++) {
double& v = fm->v(f,x.data[i].id);
double grad = sum(f) * x.data[i].value - v * x.data[i].value * x.data[i].value;
v -= learn_rate * (multiplier * grad + fm->regv * v);
}
}
}
As the code shows, fm_SGD implements the gradient-update part of Algorithm 1.
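Note that the loss function enters only through the multiplier argument: it is the derivative of the loss with respect to the raw prediction, computed by the calling learner (fm_learn_sgd_element.h in libfm, if I read the source correctly) before fm_SGD is invoked. A minimal sketch of what that computation can look like; the helper loss_multiplier is my own, not libfm's exact code:
#include <cmath>

// Derivative of the loss w.r.t. the model prediction p.
// Regression (squared error):   d/dp 0.5*(p - y)^2 = p - y
// Classification (logistic loss, y in {-1,+1}):
//   d/dp log(1 + exp(-y*p)) = -y * sigma(-y*p)
double loss_multiplier(bool regression, double p, double y) {
  if (regression) {
    return p - y;
  }
  return -y * (1.0 / (1.0 + std::exp(y * p)));  // -y * sigma(-y*p)
}
For the prediction step, libfm's implementation is as follows: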
void fm_learn_sgd::predict(Data& data, DVector<double>& out) {
assert(data.data->getNumRows() == out.dim);
for (data.data->begin(); !data.data->end(); data.data->next()) {
double p = predict_case(data);
if (task == TASK_REGRESSION) {
// clamp regression predictions to the observed target range
p = std::min(max_target, p);
p = std::max(min_target, p);
} else if (task == TASK_CLASSIFICATION) {
// squash the raw score into a probability via the logistic function
p = 1.0/(1.0 + exp(-p));
} else {
throw "task not supported";
}
out(data.data->getRowIndex()) = p;
}
}
As this shows, libfm mainly supports prediction for regression and classification tasks (selected with -task r and -task c, respectively).
The ALS and MCMC Methods
The main implementation is ./src/libfm/src/fm_learn_mcmc.h: MCMC- and ALS-based FM learning. This file contains the sampler that draws a full sample of all model and prior parameters. Its training procedure is implemented as follows:
void fm_learn_mcmc::learn(Data& train, Data& test) {
pred_sum_all.setSize(test.num_cases);
pred_sum_all_but5.setSize(test.num_cases);
pred_this.setSize(test.num_cases);
pred_sum_all.init(0.0);
pred_sum_all_but5.init(0.0);
pred_this.init(0.0);
// init caches data structure
MemoryLog::getInstance().logNew("e_q_term", sizeof(e_q_term), train.num_cases);
cache = new e_q_term[train.num_cases];
MemoryLog::getInstance().logNew("e_q_term", sizeof(e_q_term), test.num_cases);
cache_test = new e_q_term[test.num_cases];
rel_cache.setSize(train.relation.dim);
for (uint r = 0; r < train.relation.dim; r++) {
MemoryLog::getInstance().logNew("relation_cache", sizeof(relation_cache), train.relation(r).data->num_cases);
rel_cache(r) = new relation_cache[train.relation(r).data->num_cases];
for (uint c = 0; c < train.relation(r).data->num_cases; c++) {
rel_cache(r)[c].wnum = 0;
}
}
// calculate #^R
for (uint r = 0; r < train.relation.dim; r++) {
for (uint c = 0; c < train.relation(r).data_row_to_relation_row.dim; c++) {
rel_cache(r)[train.relation(r).data_row_to_relation_row(c)].wnum += 1.0;
}
}
_learn(train, test);
// free data structures
for (uint i = 0; i < train.relation.dim; i++) {
MemoryLog::getInstance().logFree("relation_cache", sizeof(relation_cache), train.relation(i).data->num_cases);
delete[] rel_cache(i);
}
MemoryLog::getInstance().logFree("e_q_term", sizeof(e_q_term), test.num_cases);
delete[] cache_test;
MemoryLog::getInstance().logFree("e_q_term", sizeof(e_q_term), train.num_cases);
delete[] cache;
}
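The cache and cache_test arrays hold one e_q_term per training/test case. As far as I can recall from fm_learn_mcmc.h, it is a small struct along these lines (treat the exact definition as an assumption rather than a quote of the source):
struct e_q_term {
double e; // error/residual term of this case
double q; // per-factor partial sum, recomputed while sampling each factor
};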
As the code shows, training here mainly maintains and updates the residual terms $e_i := y_i - \hat{y}(\mathbf{x}_i \mid \Theta)$, which corresponds to Algorithm 2 of the ALS method.
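Why residuals are enough: the FM prediction is linear in every single parameter $\theta$, so it can be written as $\hat{y}(\mathbf{x}) = g_{(\theta)}(\mathbf{x}) + \theta\, h_{(\theta)}(\mathbf{x})$. Under squared loss with regularization $\lambda_\theta$, the coordinate-wise optimum is then expressible through the $e_i$ alone. A sketch of the update in my own notation, following the derivation in Rendle et al.'s ALS paper:

$$\theta^{*} = \frac{\theta \sum_{i} h_{(\theta)}(\mathbf{x}_i)^{2} + \sum_{i} e_i\, h_{(\theta)}(\mathbf{x}_i)}{\sum_{i} h_{(\theta)}(\mathbf{x}_i)^{2} + \lambda_\theta}$$

MCMC replaces this deterministic assignment by drawing $\theta$ from a normal distribution whose mean has a similar form, which is why both methods share this code path.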
Here, the _learn function is virtual and is implemented in the subclass fm_learn_mcmc_simultaneous; the core part of that code is as follows:
double acc_train = 0.0;
double rmse_train = 0.0;
if (task == TASK_REGRESSION) {
// evaluate test and store it
for (uint c = 0; c < test.num_cases; c++) {
double p = cache_test[c].e;
pred_this(c) = p;
p = std::min(max_target, p);
p = std::max(min_target, p);
pred_sum_all(c) += p;
if (i >= 5) { // i is the sampling iteration; the first 5 draws are discarded as burn-in
pred_sum_all_but5(c) += p;
}
}
// Evaluate the training dataset and update the e-terms
for (uint c = 0; c < train.num_cases; c++) {
double p = cache[c].e;
p = std::min(max_target, p);
p = std::max(min_target, p);
double err = p - train.target(c);
rmse_train += err*err;
cache[c].e = cache[c].e - train.target(c); // turn the stored prediction into the error term (prediction - target)
}
rmse_train = std::sqrt(rmse_train/train.num_cases);
} else if (task == TASK_CLASSIFICATION) {
// evaluate test and store it
for (uint c = 0; c < test.num_cases; c++) {
double p = cache_test[c].e;
p = cdf_gaussian(p);
pred_this(c) = p;
pred_sum_all(c) += p;
if (i >= 5) {
pred_sum_all_but5(c) += p;
}
}
// Evaluate the training dataset and update the e-terms
uint _acc_train = 0;
for (uint c = 0; c < train.num_cases; c++) {
double p = cache[c].e;
p = cdf_gaussian(p);
if (((p >= 0.5) && (train.target(c) > 0.0)) || ((p < 0.5) && (train.target(c) < 0.0))) {
_acc_train++;
}
double sampled_target;
if (train.target(c) >= 0.0) {
if (do_sample) {
sampled_target = ran_left_tgaussian(0.0, cache[c].e, 1.0);
} else {
// the target is the expected value of the truncated normal
double mu = cache[c].e;
double phi_minus_mu = exp(-mu*mu/2.0) / sqrt(3.141*2);
double Phi_minus_mu = cdf_gaussian(-mu);
sampled_target = mu + phi_minus_mu / (1-Phi_minus_mu);
}
} else {
if (do_sample) {
sampled_target = ran_right_tgaussian(0.0, cache[c].e, 1.0);
} else {
// the target is the expected value of the truncated normal
double mu = cache[c].e;
double phi_minus_mu = exp(-mu*mu/2.0) / sqrt(3.141*2);
double Phi_minus_mu = cdf_gaussian(-mu);
sampled_target = mu - phi_minus_mu / Phi_minus_mu;
}
}
cache[c].e = cache[c].e - sampled_target;
}
acc_train = (double) _acc_train / train.num_cases;
}
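For context: the classification branch uses a probit-style data augmentation, replacing the unobserved real-valued target with a draw from (or, when do_sample is off, the mean of) a truncated normal. The closed-form means used above are the standard ones for $z \sim \mathcal{N}(\mu, 1)$, with $\varphi$ and $\Phi$ the standard normal pdf and cdf (note the code approximates $\pi$ by 3.141):

$$\mathbb{E}[z \mid z \ge 0] = \mu + \frac{\varphi(\mu)}{\Phi(\mu)}, \qquad \mathbb{E}[z \mid z < 0] = \mu - \frac{\varphi(\mu)}{1 - \Phi(\mu)}$$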
In addition, regression and classification tasks are evaluated with different accuracy metrics; the code is as follows:
void fm_learn_mcmc_simultaneous::_evaluate(DVector<double>& pred, DVector<DATA_FLOAT>& target, double normalizer, double& rmse, double& mae, uint from_case, uint to_case) {
assert(pred.dim == target.dim);
double _rmse = 0;
double _mae = 0;
uint num_cases = 0;
for (uint c = std::max((uint) 0, from_case); c < std::min((uint)pred.dim, to_case); c++) {
double p = pred(c) * normalizer; // normalizer rescales accumulated predictions (e.g. 1/#draws when pred is a sum over samples)
p = std::min(max_target, p);
p = std::max(min_target, p);
double err = p - target(c);
_rmse += err*err;
_mae += std::abs((double)err);
num_cases++;
}
rmse = std::sqrt(_rmse/num_cases);
mae = _mae/num_cases;
}
void fm_learn_mcmc_simultaneous::_evaluate_class(DVector<double>& pred, DVector<DATA_FLOAT>& target, double normalizer, double& accuracy, double& loglikelihood, uint from_case, uint to_case) {
double _loglikelihood = 0.0;
uint _accuracy = 0;
uint num_cases = 0;
for (uint c = std::max((uint) 0, from_case); c < std::min((uint)pred.dim, to_case); c++) {
double p = pred(c) * normalizer;
if (((p >= 0.5) && (target(c) > 0.0)) || ((p < 0.5) && (target(c) < 0.0))) {
_accuracy++;
}
double m = (target(c)+1.0)*0.5; // map target from {-1,+1} to {0,1}
double pll = p;
if (pll > 0.99) { pll = 0.99; }
if (pll < 0.01) { pll = 0.01; }
_loglikelihood -= m*log10(pll) + (1-m)*log10(1-pll);
num_cases++;
}
loglikelihood = _loglikelihood/num_cases;
accuracy = (double) _accuracy / num_cases;
}
As we can see, regression is evaluated with RMSE (and MAE), while classification is evaluated with accuracy and a cross-entropy-style negative log-likelihood.
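Written out, the classification metric computed by _evaluate_class is (transcribing the code directly, with $m_i = (y_i+1)/2$ mapping targets to $\{0,1\}$ and $p_i$ clipped to $[0.01, 0.99]$):

$$-\frac{1}{N} \sum_{i=1}^{N} \left[ m_i \log_{10} p_i + (1 - m_i) \log_{10} (1 - p_i) \right]$$

Note that the implementation uses $\log_{10}$ rather than the natural logarithm, so the value is a rescaled cross-entropy.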