Bayesian Probabilistic Matrix Factorization
Ruixuan Xu
Department of Computer Science and Engineering
The Chinese University of Hong Kong
[email protected]
Xiangxiang Weng
Abstract
We study Bayesian inference for probabilistic matrix factorization in collaborative filtering. Gaussian priors are placed on the user and item latent vectors, and a sigmoid-transformed Gaussian likelihood keeps predicted ratings within the valid range, at the cost of making the posterior intractable. We approximate the posterior with two methods, Markov Chain Monte Carlo (MCMC) and Variational Inference (VI), implement both on the MovieLens-small dataset, and compare them on convergence speed, predictive accuracy, and computational efficiency. VI converges in fewer epochs and runs orders of magnitude faster, while MCMC attains a lower prediction error, exposing a fundamental trade-off between fidelity and efficiency.
1 Introduction
Collaborative filtering is an essential technique in recommendation systems, where the goal is to pre-
dict user preferences based on sparse observed ratings. Matrix factorization has been widely adopted
due to its ability to model user-item interactions effectively. However, traditional matrix factorization
relies on point estimates, which may lead to overfitting and lack of uncertainty quantification.
Probabilistic Matrix Factorization (PMF) [1] mitigates this issue by considering a probabilistic
approach, where user and item latent matrices are treated as random variables with prior distributions.
The challenge in PMF is computing the posterior distribution of latent matrices, which is intractable.
To approximate the posterior, we explore two Bayesian inference methods:
1. Markov Chain Monte Carlo (MCMC) [2]: A sampling-based approach that provides asymptotically
exact posterior estimates.
2. Variational Inference (VI) [3]: An optimization-based approach that approximates the posterior
using a parameterized distribution.
In this report, we implement both methods on the MovieLens dataset and compare their performance.
2 Problem Setting
2.1 Matrix Factorization Model
In mathematics, a sparse matrix refers to a matrix in which most of the elements are zero. A sparse
rating matrix is a common data structure in recommendation systems, typically used to represent
users’ rating data for items. Collaborative filtering relies on user behavior data such as ratings to
predict items a user might like. Due to the high sparsity of rating matrices, directly storing and
computing them would waste a lot of memory and computational resources. Matrix factorization
addresses this by decomposing the user-item rating matrix into two low-dimensional matrices, and
then predicting ratings via their dot product, thereby achieving collaborative filtering.
Given an N × M sparse rating matrix R, suppose there are L observed ratings (L ≪ N × M). Predicting all unrated entries directly would require estimating N × M − L ≈ N × M parameters. Instead, R can be approximated by the product of an N × K matrix U and a K × M matrix (the transpose of an M × K matrix V):
\[
R \approx U V^{\mathsf{T}},
\]
which reduces the number of parameters to (N + M)K, with K ≪ min(N, M).
The latent matrices are obtained by minimizing the squared error over the observed entries:
\[
\min_{U, V} \; \sum_{(i,j) \in O} \left( r_{ij} - u_i v_j^{\mathsf{T}} \right)^2,
\]
where:
- O is the set of observed (user, item) rating pairs;
- r_{ij} is the true rating given by user i to item j;
- u_i v_j^{\mathsf{T}} is the predicted rating, with u_i the i-th row of U and v_j the j-th row of V.
Gradient descent update rules:
\[
u_i \leftarrow u_i + \alpha \sum_{j \in O_i} \left( r_{ij} - u_i v_j^{\mathsf{T}} \right) v_j,
\qquad
v_j \leftarrow v_j + \alpha \sum_{i \in O_j} \left( r_{ij} - u_i v_j^{\mathsf{T}} \right) u_i,
\]
where O_i is the set of items rated by user i, O_j is the set of users who rated item j, and α is the learning rate.
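For concreteness, here is a minimal NumPy sketch of these updates, as a per-rating stochastic variant; the function name, initialization scale, and hyperparameters are our own illustrative choices, not anything specified by the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_mf(ratings, N, M, K=10, alpha=0.01, epochs=50):
    """Plain matrix factorization by stochastic gradient descent on the
    observed squared-error objective. `ratings` is a list of (i, j, r_ij)."""
    U = 0.1 * rng.standard_normal((N, K))
    V = 0.1 * rng.standard_normal((M, K))
    for _ in range(epochs):
        for i, j, r in ratings:
            err = r - U[i] @ V[j]        # r_ij - u_i v_j^T
            U[i] += alpha * err * V[j]   # update u_i
            V[j] += alpha * err * U[i]   # update v_j (in place, as is common for SGD)
    return U, V
```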
Instead of estimating the matrices U and V deterministically and computing each rating r_{ij} from them, one can use Bayesian inference. Consider each row of U and V, i.e., U_i and V_j, as a multivariate random variable; U_i and V_j are assumed to be standard normal random vectors:
\[
U_{ik} \sim \mathcal{N}(0, 1), \qquad V_{jk} \sim \mathcal{N}(0, 1) \qquad \text{for any } 1 \le k \le K.
\]
The prior distributions of U_i and V_j have the following PDFs:
\[
f_{U_i}(u_i) = \frac{1}{(2\pi)^{K/2}} \exp\!\left( -\tfrac{1}{2} u_i u_i^{\mathsf{T}} \right),
\qquad
f_{V_j}(v_j) = \frac{1}{(2\pi)^{K/2}} \exp\!\left( -\tfrac{1}{2} v_j v_j^{\mathsf{T}} \right).
\]
Note: since the sigmoid function maps into [0, 1], during training we first normalize the ratings via
\[
r_{ij} \leftarrow \frac{r_{ij} - 1}{R - 1},
\]
where the original ratings take values in {1, 2, . . . , R}, and σ² is a hyperparameter.
The PDF of the likelihood is
\[
f_{R_{ij} \mid U_i, V_j}(r_{ij} \mid u_i, v_j)
= \frac{1}{\sqrt{2\pi\sigma^2}}
\exp\!\left( -\frac{\left( r_{ij} - \operatorname{sigmoid}(u_i v_j^{\mathsf{T}}) \right)^2}{2\sigma^2} \right).
\]
Based on conditional independence and Bayes' rule, the posterior of U and V can be inferred as
\[
\begin{aligned}
f_{U,V \mid \{R_{ij}\}}(U, V \mid \{r_{ij}\})
&\propto f_{\{R_{ij}\} \mid U,V}(\{r_{ij}\} \mid U, V)\, f_{U,V}(U, V) \\
&\propto \prod_{i=1}^{N} \prod_{j=1}^{M} \left[ f_{R_{ij} \mid U_i, V_j}(r_{ij} \mid u_i, v_j) \right]^{I_{ij}}
\prod_{i=1}^{N} \exp\!\left( -\tfrac{1}{2} u_i u_i^{\mathsf{T}} \right)
\prod_{j=1}^{M} \exp\!\left( -\tfrac{1}{2} v_j v_j^{\mathsf{T}} \right),
\end{aligned}
\]
where I_{ij} is an indicator that equals 1 if user i rated movie j, and 0 otherwise.
If the likelihood were a simple Gaussian, one could still minimize the negative log posterior (the loss function) by gradient descent to obtain the most likely U and V, and then multiply them directly to generate the prediction matrix, just as researchers do in PMF [1]. However, we introduce nonlinearity by applying a sigmoid function to the mean of the Gaussian likelihood, in order to simulate the complex likelihoods that are likely to occur in real-world applications. In addition, a simple Gaussian likelihood may produce predicted ratings outside the valid range (e.g., from 1 to 5), while the sigmoid compresses predictions into the valid range.
In this case, plain gradient descent becomes infeasible. Therefore, we adopt a Bayesian framework, retain the posterior distributions of U and V, and compute the predictive distribution over the unrated data.
To make a prediction on an unobserved rating rab :
\[
\begin{aligned}
f_{R_{ab} \mid \{R_{ij}\}}(r_{ab} \mid \{r_{ij}\})
&= \int_{u_a \in \mathbb{R}^K} \int_{v_b \in \mathbb{R}^K}
f_{R_{ab} \mid U_a, V_b}(r_{ab} \mid u_a, v_b)\,
f_{U_a, V_b \mid \{R_{ij}\}}(u_a, v_b \mid \{r_{ij}\})\, \mathrm{d}u_a\, \mathrm{d}v_b \\
&= \mathbb{E}_{U_a, V_b \mid \{R_{ij}\}}\!\left[ f_{R_{ab} \mid U_a, V_b}(r_{ab} \mid u_a, v_b) \right].
\end{aligned}
\]
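In practice this expectation is approximated with posterior samples. Since the Gaussian noise has mean sigmoid(u_a v_b^T), the posterior-mean prediction reduces to averaging that quantity over samples. A minimal sketch, assuming we already hold samples {(U^(s), V^(s))} drawn by MCMC (the function and variable names are ours):

```python
import numpy as np

def predict_rating(samples_U, samples_V, a, b):
    """Posterior-mean prediction for user a and item b: average the
    likelihood mean sigmoid(u_a v_b^T) over the posterior samples."""
    preds = [1.0 / (1.0 + np.exp(-(U[a] @ V[b])))
             for U, V in zip(samples_U, samples_V)]
    return float(np.mean(preds))
```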
3 Bayesian Inference Methods
3.1 Markov Chain Monte Carlo (MCMC)
Suppose z is a multivariate random variable, and we are interested in evaluating the expectation
\[
\mathbb{E}_{Z}[h(\mathbf{z})] = \int h(\mathbf{z}) f(\mathbf{z})\, \mathrm{d}\mathbf{z},
\]
where
\[
f(\mathbf{z}) = \frac{g(\mathbf{z})}{Z}
\]
and Z is a (typically unknown) normalizing constant.
Our objective is to draw independent samples {z1 , . . . , zn } from f (z) to approximate EZ [h(z)]:
\[
\mathbb{E}_{Z}[h(\mathbf{z})] \approx \frac{1}{n} \sum_{i=1}^{n} h(\mathbf{z}_i).
\]
In high-dimensional settings, Markov Chain Monte Carlo (MCMC) is widely used for sampling. The
Metropolis-Hastings algorithm is a commonly used MCMC method.
Core idea: Construct a proposal distribution q(z′ |z) to generate candidate samples and use an
acceptance rate to decide whether to accept the sample.
Detailed steps: Let the target distribution be f(z) = g(z)/Z.
1: Initialize z_1 arbitrarily
2: Set t ← 1
3: while t < n do
4: Sample a candidate z′ ∼ q(z′ | z_t)
5: Compute the acceptance rate, in which the unknown constant Z cancels:
6: α(z_t, z′) = min( 1, [ g(z′) q(z_t | z′) ] / [ g(z_t) q(z′ | z_t) ] )
7: Sample u ∼ Uniform(0, 1)
8: if u < α(z_t, z′) then
9: Accept: set z_{t+1} = z′
10: else
11: Reject: set z_{t+1} = z_t
12: end if
13: Update t ← t + 1
14: end while
15: return {z_1, z_2, . . . , z_n}
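A minimal runnable sketch of this algorithm, using a symmetric Gaussian random-walk proposal so that the q-ratio in the acceptance rate cancels (the function name and step size are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis_hastings(log_g, z0, n_samples, step=0.1):
    """Random-walk Metropolis-Hastings targeting f(z) proportional to g(z).
    log_g: unnormalized log-density; z0: initial state (1-D array)."""
    z = np.asarray(z0, dtype=float)
    samples = []
    for _ in range(n_samples):
        z_prop = z + step * rng.standard_normal(z.shape)  # propose z' ~ q(z'|z)
        log_alpha = log_g(z_prop) - log_g(z)              # log acceptance ratio
        if np.log(rng.uniform()) < log_alpha:             # accept w.p. min(1, alpha)
            z = z_prop
        samples.append(z.copy())
    return np.array(samples)

# Usage: sample from a standard normal via its unnormalized log-density.
draws = metropolis_hastings(lambda z: -0.5 * np.sum(z**2), np.zeros(2), 5000)
```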
3.2 Variational Inference (VI)
Due to the intractability of the exact posterior P (U, V | {rij }), we employ Variational Inference (VI)
to approximate it. We define a variational distribution Q(U, V) under the mean-field assumption:
\[
Q(U, V) = \prod_{i=1}^{N} Q_i(u_i) \prod_{j=1}^{M} Q_j(v_j),
\]
and optimize it to minimize the KL divergence between the true posterior and the variational
distribution. This leads to the maximization of the Evidence Lower Bound (ELBO):
\[
\log P(\{r_{ij}\}) \ge \mathcal{L}(Q)
= \mathbb{E}_{Q}\!\left[\log P(\{r_{ij}\}, U, V)\right] - \mathbb{E}_{Q}\!\left[\log Q(U, V)\right].
\]
We derive the update rules using Coordinate Ascent Variational Inference (CAVI), iteratively optimiz-
ing each variational factor while keeping others fixed.
The specific iterative process follows the standard CAVI update: each factor is set via log Q_j = E_{i≠j}[log P(data, latents)] + Constant, cycling through the factors until the ELBO stabilizes; the full derivation is given in the appendix.
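While the coordinate updates themselves depend on the model, the bound being maximized can be estimated directly, which is useful for monitoring convergence. Below is a minimal Monte Carlo sketch of the ELBO for our model, assuming fully factorized Gaussian factors with means mu_U, mu_V and standard deviations s_U, s_V; these parameter names, and the use of sampling rather than the closed-form coordinate updates, are our own illustration, not the VI.py implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def elbo_estimate(mu_U, s_U, mu_V, s_V, ratings, sigma2=0.1, n_mc=20):
    """Monte Carlo estimate of L(Q) = E_Q[log P(data | U, V)] - KL(Q || prior)
    for factorized Gaussian Q and standard normal priors (a sketch)."""
    # KL( N(mu, s^2) || N(0, 1) ), summed over every latent coordinate.
    kl = 0.5 * (np.sum(s_U**2 + mu_U**2 - 1.0 - 2.0 * np.log(s_U))
                + np.sum(s_V**2 + mu_V**2 - 1.0 - 2.0 * np.log(s_V)))
    # Monte Carlo estimate of the expected log-likelihood under Q.
    ll = 0.0
    for _ in range(n_mc):
        U = mu_U + s_U * rng.standard_normal(mu_U.shape)
        V = mu_V + s_V * rng.standard_normal(mu_V.shape)
        for i, j, r in ratings:
            m = sigmoid(U[i] @ V[j])
            ll += -0.5 * np.log(2 * np.pi * sigma2) - (r - m) ** 2 / (2 * sigma2)
    return ll / n_mc - kl
```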
4 Dataset Processing
We use the MovieLens-small dataset, which consists of 100,836 ratings from 610 users on 9,724
movies. The rating data is stored in the ratings.csv file. We implemented two Python scripts:
MCMC.py and VI.py, which perform Bayesian matrix completion using the MCMC and VI methods,
respectively, on the data from the CSV file.
The data in ratings.csv is stored in four columns: userId, movieId, rating, and timestamp. We ignore the timestamp column and import the first three columns into the Python scripts for further processing. The dataset is preprocessed by:
1. Normalizing ratings to the range [0, 1].
2. Splitting the data into three groups: a 60% training set, a 20% validation set, and a 20% test set (a minimal sketch of this preprocessing follows).
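A pandas sketch of this pipeline; the file name comes from the paper, while the split logic and variable names are ours. Note also that MovieLens-small contains half-star ratings, so the actual scripts may clip or rescale slightly differently:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Load the first three columns of ratings.csv; the timestamp is ignored.
df = pd.read_csv("ratings.csv", usecols=["userId", "movieId", "rating"])

# Normalize ratings into [0, 1] via r <- (r - 1) / (R - 1) with R = 5,
# clipping in case of half-star ratings below 1.
df["rating"] = ((df["rating"] - 1.0) / 4.0).clip(lower=0.0)

# Shuffle once, then split 60% / 20% / 20% into train / validation / test.
perm = rng.permutation(len(df))
n_train, n_val = int(0.6 * len(df)), int(0.2 * len(df))
train = df.iloc[perm[:n_train]]
val = df.iloc[perm[n_train:n_train + n_val]]
test = df.iloc[perm[n_train + n_val:]]
```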
5 Experimental Evaluation
We evaluate MCMC and VI on the MovieLens dataset based on:
1. Convergence Speed;
2. Predictive Accuracy;
3. Computational Efficiency.
5.1 Convergence Speed
Controlling other parameters, such as the latent vector dimension and the variance, to be consistent, we ran VI for 300 epochs and found that it generally began to stabilize between 150 and 200 epochs; we ran MCMC for 1000 epochs and found that it generally began to stabilize between 600 and 700 epochs. This indicates that VI requires fewer epochs to converge than MCMC does.
5.2 Predictive Accuracy
Predictive accuracy is measured using RMSE on the held-out test set: VI achieves 1.2277, while MCMC achieves 1.1836.
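For reference, with \(\mathcal{T}\) denoting the test pairs and \(\hat{r}_{ij}\) the predicted rating (our notation), the metric is
\[
\mathrm{RMSE} = \sqrt{ \frac{1}{|\mathcal{T}|} \sum_{(i,j) \in \mathcal{T}} \left( r_{ij} - \hat{r}_{ij} \right)^2 }.
\]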
5.3 Computational Efficiency
We used an NVIDIA GeForce RTX 4060 Laptop GPU for computation. The execution time of VI.py was approximately 6 seconds, while MCMC.py took about 6 hours to run; VI thus runs roughly 3,600× faster than MCMC, owing to its optimization-based approach.
5.4 Discussion
Taken together, the results exhibit the expected trade-off: MCMC's posterior sampling yields the better RMSE, while VI's deterministic optimization converges in far fewer epochs and in a tiny fraction of the runtime. We elaborate on this trade-off in the conclusion.
6 Conclusion
In this report, we proposed a Bayesian approach to matrix factorization, leveraging two prominent
inference methods—Markov Chain Monte Carlo (MCMC) and Variational Inference (VI)—to address
the intractability of the posterior distribution over latent user and item features in collaborative filtering.
We formulated the probabilistic model by introducing Gaussian priors on user and item latent vectors
and a sigmoid-transformed Gaussian likelihood to ensure bounded rating predictions.
We implemented both inference techniques and evaluated their performance on the MovieLens
dataset, focusing on three key dimensions: convergence speed, predictive accuracy, and computa-
tional efficiency. Our experiments demonstrated that VI converges considerably faster than MCMC
and achieves remarkable computational efficiency due to its deterministic optimization framework.
However, MCMC, by virtue of sampling from the true posterior, offers more accurate predictions,
though at the cost of significantly higher runtime.
These results highlight a fundamental trade-off in Bayesian matrix factorization: MCMC yields
higher fidelity at the expense of time, while VI offers scalability and speed with slightly reduced
accuracy. Therefore, the choice of inference method should be guided by the specific constraints and
requirements of the target application.
In future work, we plan to explore hybrid approaches that combine the strengths of MCMC and VI,
such as initializing MCMC with variational parameters or employing amortized inference techniques.
Additionally, extending the model to incorporate content-based features or temporal dynamics could
further enhance recommendation accuracy and applicability in real-world systems.
References
[1] Mnih, A., & Salakhutdinov, R. R. (2007). Probabilistic matrix factorization. Advances in Neural Information Processing Systems, 20.
[2] Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1), 97–109. https://doi.org/10.1093/biomet/57.1.97
[3] Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518), 859–877. https://doi.org/10.1080/01621459.2017.1285773
[4] Salakhutdinov, R., & Mnih, A. (2008). Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of the 25th International Conference on Machine Learning (ICML '08), 880–887. https://doi.org/10.1145/1390156.1390267
[5] Chen, G., Zhu, F., & Heng, P. A. (2018). Large-scale Bayesian probabilistic matrix factorization with memo-free distributed variational inference. ACM Transactions on Knowledge Discovery from Data, 12(3), Article 31. https://doi.org/10.1145/3161886
Appendix
1. Mathematical Details of MCMC
A set of random variables {z1 , z2 , . . . , zn } forms a first-order Markov chain if the following condi-
tional independence holds:
P (zk+1 |z1 , z2 , . . . , zk ) = P (zk+1 |zk ) for k ∈ {1, . . . , n − 1}.
This means that the current state depends only on the previous state and not on earlier states.
For continuous variables, it becomes:
f (zk+1 |z1 , z2 , . . . , zk ) = f (zk+1 |zk ) for k ∈ {1, . . . , n − 1}.
To generate a Markov chain, we need:
1. Define the initial state distribution P (z1 ).
2. Construct the transition kernels:
Tk (zk+1 ← zk ) = P (zk+1 |zk ), k ∈ {1, . . . , n − 1}.
For continuous variables, the transition kernel is written as:
Tk (zk+1 ← zk ) = f (zk+1 |zk ), k ∈ {1, . . . , n − 1}.
A Markov chain is called homogeneous if the transition kernels are the same for all k.
The marginal probability of a specific state can be computed via the product and sum rules, i.e., the law of total probability:
\[
P(\mathbf{z}_{k+1}) = \sum_{\mathbf{z}_k} T(\mathbf{z}_{k+1} \leftarrow \mathbf{z}_k)\, P(\mathbf{z}_k).
\]
A sufficient (but not necessary) condition for ensuring P ∗ (·) is stationary or invariant is to choose a
transition kernel that satisfies the property of detailed balance, defined by:
T (z′ ← z)P ∗ (z) = T (z ← z′ )P ∗ (z′ ), ∀z, z′ .
If detailed balance holds, the chain is said to be reversible.
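To see why detailed balance implies stationarity, sum both sides over z and use the normalization of the transition kernel:
\[
\sum_{\mathbf{z}} T(\mathbf{z}' \leftarrow \mathbf{z})\, P^{*}(\mathbf{z})
= \sum_{\mathbf{z}} T(\mathbf{z} \leftarrow \mathbf{z}')\, P^{*}(\mathbf{z}')
= P^{*}(\mathbf{z}'),
\]
so one step of the chain leaves P*(·) unchanged.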
The core idea of using Markov Chain Monte Carlo (MCMC) methods for sampling:
1. Construct a Markov chain whose stationary distribution is f (z);
2. Simulate the chain to generate samples;
3. After a sufficient number of steps, the samples approximately follow the distribution f (z).
When the Markov chain runs long enough, even if the initial state is not sampled from f (z), the
eventually generated samples will converge to this distribution, thus allowing approximate sampling.
It should be noted that, to guarantee convergence to f (z), the Markov chain needs to be ergodic.
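As a toy illustration of this convergence (our own example, not from the paper), simulating a small two-state homogeneous chain shows the empirical state frequencies approaching the stationary distribution, here (0.75, 0.25):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-state homogeneous Markov chain; row s of T is the distribution of the
# next state given the current state s. Its stationary distribution solves
# pi T = pi, which for this kernel gives pi = (0.75, 0.25).
T = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# Simulate the chain from an arbitrary initial state and count visits.
state, counts = 0, np.zeros(2)
for _ in range(100_000):
    state = rng.choice(2, p=T[state])
    counts[state] += 1

print(counts / counts.sum())  # empirical occupancy, approximately [0.75, 0.25]
```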
2. Mathematical Details of VI
Considering the general case, by Bayes' theorem we have
\[
\log P(X) = \log P(X, Z) - \log P(Z \mid X).
\]
Dividing both terms inside the logarithms on the right-hand side by Q(Z) leaves the equation unchanged:
\[
\log P(X) = \log \frac{P(X, Z)}{Q(Z)} - \log \frac{P(Z \mid X)}{Q(Z)}.
\]
Multiplying both sides of the equation by Q(Z) and integrating over Z yields, on the left-hand side,
\[
\int_{Z} \log P(X)\, Q(Z)\, \mathrm{d}Z = \log P(X),
\]
and on the right-hand side,
\[
\int_{Z} Q(Z) \log \frac{P(X, Z)}{Q(Z)}\, \mathrm{d}Z
- \int_{Z} Q(Z) \log \frac{P(Z \mid X)}{Q(Z)}\, \mathrm{d}Z
= \mathcal{L}(Q) + \mathrm{KL}(Q \,\|\, P).
\]
In the above, the first integral is denoted L(Q), and the second integral, taken with its minus sign, is the KL divergence, which measures the distance between the posterior P(Z | X) and the distribution Q(Z). Therefore, we have:
\[
\log P(X) = \mathcal{L}(Q) + \mathrm{KL}(Q \,\|\, P).
\]
Since KL(Q ∥ P) ≥ 0 and log P(X) does not depend on Q, maximizing L(Q) is equivalent to minimizing the KL divergence.
Assume, under the mean-field theory, that Q(Z) factorizes over the components of Z (let Z have M components), i.e.,
\[
Q(Z) = \prod_{i=1}^{M} Q_i(Z_i).
\]
We first transform the first term of L(Q):
\[
\begin{aligned}
\int_{Z} \prod_{i=1}^{M} Q_i(Z_i) \log P(X, Z)\, \mathrm{d}Z
&= \int Q_j(Z_j) \left[ \int \prod_{i \ne j} Q_i(Z_i) \log P(X, Z) \prod_{i \ne j} \mathrm{d}Z_i \right] \mathrm{d}Z_j \\
&= \int Q_j(Z_j)\, \mathbb{E}_{Q_i(Z_i),\, i \ne j}\!\left[ \log P(X, Z) \right] \mathrm{d}Z_j.
\end{aligned}
\]
Now let us transform the second term on the right side:
\[
\begin{aligned}
\int_{Z} \prod_{i=1}^{M} Q_i(Z_i) \sum_{i=1}^{M} \log Q_i(Z_i)\, \mathrm{d}Z
&= \int_{Z} \left( \prod_{i=1}^{M} Q_i(z_i) \right) \left( \sum_{i=1}^{M} \log Q_i(z_i) \right) \mathrm{d}Z \\
&= \int_{Z} \left( \prod_{i=1}^{M} Q_i(z_i) \right)
\left[ \log Q_1(z_1) + \log Q_2(z_2) + \cdots + \log Q_M(z_M) \right] \mathrm{d}Z.
\end{aligned}
\]
Now we attempt to isolate one of the terms to discover the pattern:
\[
\begin{aligned}
\int_{Z} \left( \prod_{i=1}^{M} Q_i(z_i) \right) \log Q_1(z_1)\, \mathrm{d}Z
&= \int_{Z} Q_1 Q_2 \cdots Q_M \log Q_1\, \mathrm{d}Z \\
&= \int_{z_1, z_2, \ldots, z_M} Q_1 Q_2 \cdots Q_M \log Q_1\, \mathrm{d}z_1\, \mathrm{d}z_2 \cdots \mathrm{d}z_M \\
&= \int_{z_1} Q_1 \log Q_1\, \mathrm{d}z_1 \int_{z_2} Q_2\, \mathrm{d}z_2 \cdots \int_{z_M} Q_M\, \mathrm{d}z_M \\
&= \int_{z_1} Q_1 \log Q_1\, \mathrm{d}z_1,
\end{aligned}
\]
since each factor \( \int_{z_i} Q_i\, \mathrm{d}z_i \) integrates to 1.
Back to the expression we care about:
\[
\begin{aligned}
\int_{Z} \left( \prod_{i=1}^{M} Q_i(z_i) \right)
\left[ \log Q_1(z_1) + \log Q_2(z_2) + \cdots + \log Q_M(z_M) \right] \mathrm{d}Z
&= \sum_{i=1}^{M} \int_{z_i} Q_i(z_i) \log Q_i(z_i)\, \mathrm{d}z_i \\
&= \int_{z_j} Q_j(z_j) \log Q_j(z_j)\, \mathrm{d}z_j + \text{Constant},
\end{aligned}
\]
where the constant collects the terms with i ≠ j, which do not depend on Q_j.
We transform the expectation to another form:
\[
\mathbb{E}_{Q_i(Z_i),\, i \ne j}\!\left[ \log P(X, Z) \right] = \log \tilde{P}(X, Z_j).
\]
Here we adopt Coordinate Ascent Variational Inference (CAVI): when updating Q_j(Z_j), the other factors Q_i(Z_i) for i ≠ j are kept fixed, thus
\[
\begin{aligned}
\mathcal{L}(Q)
&= \int_{z_j} Q_j(z_j) \log \frac{\tilde{P}(X, z_j)}{Q_j(z_j)}\, \mathrm{d}z_j + \text{Constant} \\
&= -\,\mathrm{KL}\!\left( Q_j \,\|\, \tilde{P}(X, z_j) \right) + \text{Constant}.
\end{aligned}
\]
Hence L(Q) is maximized with respect to Q_j when the KL term vanishes, i.e., when
\[
\log Q_j^{*}(Z_j) = \mathbb{E}_{Q_i(Z_i),\, i \ne j}\!\left[ \log P(X, Z) \right] + \text{Constant},
\]
which is exactly the coordinate update applied iteratively in Section 3.2.