Bayesian Probabilistic Matrix Factorization
Ruixuan Xu
Department of Computer Science and Engineering
The Chinese University of Hong Kong
[email protected]
Xiangxiang Weng
Abstract
We study Bayesian inference for probabilistic matrix factorization in collaborative filtering. Gaussian priors are placed on the user and item latent vectors, and a sigmoid-transformed Gaussian likelihood keeps predicted ratings within the valid range, at the cost of making the posterior intractable. We approximate the posterior with two methods, Markov Chain Monte Carlo (MCMC) and Variational Inference (VI), implement both on the MovieLens-small dataset, and compare them on convergence speed, predictive accuracy, and computational efficiency. VI converges in fewer epochs and runs orders of magnitude faster, while MCMC attains a lower prediction error, exposing a fundamental trade-off between fidelity and efficiency.
1 Introduction
Collaborative filtering is an essential technique in recommendation systems, where the goal is to pre-
dict user preferences based on sparse observed ratings. Matrix factorization has been widely adopted
due to its ability to model user-item interactions effectively. However, traditional matrix factorization
relies on point estimates, which may lead to overfitting and lack of uncertainty quantification.
Probabilistic Matrix Factorization (PMF) [1] mitigates this issue by considering a probabilistic
approach, where user and item latent matrices are treated as random variables with prior distributions.
The challenge in PMF is computing the posterior distribution of latent matrices, which is intractable.
To approximate the posterior, we explore two Bayesian inference methods:
1. Markov Chain Monte Carlo (MCMC) [2]: A sampling-based approach that provides asymptotically
exact posterior estimates.
2. Variational Inference (VI) [3]: An optimization-based approach that approximates the posterior
using a parameterized distribution.
In this report, we implement both methods on the MovieLens dataset and compare their performance.
2 Problem Setting
2.1 Matrix Factorization Model
In mathematics, a sparse matrix refers to a matrix in which most of the elements are zero. A sparse
rating matrix is a common data structure in recommendation systems, typically used to represent
users’ rating data for items. Collaborative filtering relies on user behavior data such as ratings to
predict items a user might like. Due to the high sparsity of rating matrices, directly storing and
computing them would waste a lot of memory and computational resources. Matrix factorization
addresses this by decomposing the user-item rating matrix into two low-dimensional matrices, and
then predicting ratings via their dot product, thereby achieving collaborative filtering.
Given an N × M sparse rating matrix R, suppose there are L observed ratings (L ≪ N × M). Predicting all unrated entries directly would require estimating N × M − L ≈ N × M parameters. Instead, R can be approximated by the product of an N × K matrix U and a K × M matrix (the transpose of an M × K matrix V):
\[
R \approx U V^{\mathsf{T}},
\]
which reduces the number of parameters to (N + M)K, with K ≪ min(N, M).
The latent matrices are obtained by minimizing the squared error over the observed entries:
\[
\min_{U, V} \; \sum_{(i,j) \in O} \left( r_{ij} - u_i v_j^{\mathsf{T}} \right)^2,
\]
where:
- O is the set of observed (user, item) rating pairs;
- r_{ij} is the true rating given by user i to item j;
- u_i v_j^{\mathsf{T}} is the predicted rating, with u_i the i-th row of U and v_j the j-th row of V.
Gradient descent update rules:
\[
u_i \leftarrow u_i + \alpha \sum_{j \in O_i} \left( r_{ij} - u_i v_j^{\mathsf{T}} \right) v_j,
\qquad
v_j \leftarrow v_j + \alpha \sum_{i \in O_j} \left( r_{ij} - u_i v_j^{\mathsf{T}} \right) u_i,
\]
where O_i is the set of items rated by user i, O_j is the set of users who rated item j, and α is the learning rate.
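For concreteness, here is a minimal NumPy sketch of these updates, as a per-rating stochastic variant; the function name, initialization scale, and hyperparameters are our own illustrative choices, not anything specified by the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_mf(ratings, N, M, K=10, alpha=0.01, epochs=50):
    """Plain matrix factorization by stochastic gradient descent on the
    observed squared-error objective. `ratings` is a list of (i, j, r_ij)."""
    U = 0.1 * rng.standard_normal((N, K))
    V = 0.1 * rng.standard_normal((M, K))
    for _ in range(epochs):
        for i, j, r in ratings:
            err = r - U[i] @ V[j]        # r_ij - u_i v_j^T
            U[i] += alpha * err * V[j]   # update u_i
            V[j] += alpha * err * U[i]   # update v_j (in place, as is common for SGD)
    return U, V
```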
Instead of estimating the matrices U and V deterministically and computing each rating r_{ij} from them, one can use Bayesian inference. Consider each row of U and V, i.e., U_i and V_j, as a multivariate random variable; U_i and V_j are assumed to be standard normal random vectors:
\[
U_{ik} \sim \mathcal{N}(0, 1), \qquad V_{jk} \sim \mathcal{N}(0, 1) \qquad \text{for any } 1 \le k \le K.
\]
The prior distributions of U_i and V_j have the following PDFs:
\[
f_{U_i}(u_i) = \frac{1}{(2\pi)^{K/2}} \exp\!\left( -\tfrac{1}{2} u_i u_i^{\mathsf{T}} \right),
\qquad
f_{V_j}(v_j) = \frac{1}{(2\pi)^{K/2}} \exp\!\left( -\tfrac{1}{2} v_j v_j^{\mathsf{T}} \right).
\]
Note: since the sigmoid function maps into [0, 1], during training we first normalize the ratings via
\[
r_{ij} \leftarrow \frac{r_{ij} - 1}{R - 1},
\]
where the original ratings take values in {1, 2, . . . , R}, and σ² is a hyperparameter.
The PDF of the likelihood is
\[
f_{R_{ij} \mid U_i, V_j}(r_{ij} \mid u_i, v_j)
= \frac{1}{\sqrt{2\pi\sigma^2}}
\exp\!\left( -\frac{\left( r_{ij} - \operatorname{sigmoid}(u_i v_j^{\mathsf{T}}) \right)^2}{2\sigma^2} \right).
\]
Based on conditional independence and Bayes' rule, the posterior of U and V can be inferred as
\[
\begin{aligned}
f_{U,V \mid \{R_{ij}\}}(U, V \mid \{r_{ij}\})
&\propto f_{\{R_{ij}\} \mid U,V}(\{r_{ij}\} \mid U, V)\, f_{U,V}(U, V) \\
&\propto \prod_{i=1}^{N} \prod_{j=1}^{M} \left[ f_{R_{ij} \mid U_i, V_j}(r_{ij} \mid u_i, v_j) \right]^{I_{ij}}
\prod_{i=1}^{N} \exp\!\left( -\tfrac{1}{2} u_i u_i^{\mathsf{T}} \right)
\prod_{j=1}^{M} \exp\!\left( -\tfrac{1}{2} v_j v_j^{\mathsf{T}} \right),
\end{aligned}
\]
where I_{ij} is an indicator that equals 1 if user i rated movie j, and 0 otherwise.
If the likelihood were a simple Gaussian, one could still minimize the negative log posterior (the loss function) by gradient descent to obtain the most likely U and V, and then multiply them directly to generate the prediction matrix, just as researchers do in PMF [1]. However, we introduce nonlinearity by applying a sigmoid function to the mean of the Gaussian likelihood, in order to simulate the complex likelihoods that are likely to occur in real-world applications. In addition, a simple Gaussian likelihood may produce predicted ratings outside the valid range (e.g., from 1 to 5), while the sigmoid compresses predictions into the valid range.
In this case, plain gradient descent becomes infeasible. Therefore, we adopt a Bayesian framework, retain the posterior distributions of U and V, and compute the predictive distribution over the unrated data.
To make a prediction on an unobserved rating rab :
\[
\begin{aligned}
f_{R_{ab} \mid \{R_{ij}\}}(r_{ab} \mid \{r_{ij}\})
&= \int_{u_a \in \mathbb{R}^K} \int_{v_b \in \mathbb{R}^K}
f_{R_{ab} \mid U_a, V_b}(r_{ab} \mid u_a, v_b)\,
f_{U_a, V_b \mid \{R_{ij}\}}(u_a, v_b \mid \{r_{ij}\})\, \mathrm{d}u_a\, \mathrm{d}v_b \\
&= \mathbb{E}_{U_a, V_b \mid \{R_{ij}\}}\!\left[ f_{R_{ab} \mid U_a, V_b}(r_{ab} \mid u_a, v_b) \right].
\end{aligned}
\]
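In practice this expectation is approximated with posterior samples. Since the Gaussian noise has mean sigmoid(u_a v_b^T), the posterior-mean prediction reduces to averaging that quantity over samples. A minimal sketch, assuming we already hold samples {(U^(s), V^(s))} drawn by MCMC (the function and variable names are ours):

```python
import numpy as np

def predict_rating(samples_U, samples_V, a, b):
    """Posterior-mean prediction for user a and item b: average the
    likelihood mean sigmoid(u_a v_b^T) over the posterior samples."""
    preds = [1.0 / (1.0 + np.exp(-(U[a] @ V[b])))
             for U, V in zip(samples_U, samples_V)]
    return float(np.mean(preds))
```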
3 Bayesian Inference Methods
3.1 Markov Chain Monte Carlo (MCMC)
Suppose z is a multivariate random variable, and we are interested in evaluating the expectation
\[
\mathbb{E}_{Z}[h(\mathbf{z})] = \int h(\mathbf{z}) f(\mathbf{z})\, \mathrm{d}\mathbf{z},
\]
where
\[
f(\mathbf{z}) = \frac{g(\mathbf{z})}{Z}
\]
and Z is a (typically unknown) normalizing constant.
Our objective is to draw independent samples {z1 , . . . , zn } from f (z) to approximate EZ [h(z)]:
\[
\mathbb{E}_{Z}[h(\mathbf{z})] \approx \frac{1}{n} \sum_{i=1}^{n} h(\mathbf{z}_i).
\]
In high-dimensional settings, Markov Chain Monte Carlo (MCMC) is widely used for sampling. The
Metropolis-Hastings algorithm is a commonly used MCMC method.
Core idea: Construct a proposal distribution q(z′ |z) to generate candidate samples and use an
acceptance rate to decide whether to accept the sample.
Detailed steps: Let the target distribution be f(z) = g(z)/Z.
1: Initialize z_1 arbitrarily
2: Set t ← 1
3: while t < n do
4: Sample a candidate z′ ∼ q(z′ | z_t)
5: Compute the acceptance rate, in which the unknown constant Z cancels:
6: α(z_t, z′) = min( 1, [ g(z′) q(z_t | z′) ] / [ g(z_t) q(z′ | z_t) ] )
7: Sample u ∼ Uniform(0, 1)
8: if u < α(z_t, z′) then
9: Accept: set z_{t+1} = z′
10: else
11: Reject: set z_{t+1} = z_t
12: end if
13: Update t ← t + 1
14: end while
15: return {z_1, z_2, . . . , z_n}
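A minimal runnable sketch of this algorithm, using a symmetric Gaussian random-walk proposal so that the q-ratio in the acceptance rate cancels (the function name and step size are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis_hastings(log_g, z0, n_samples, step=0.1):
    """Random-walk Metropolis-Hastings targeting f(z) proportional to g(z).
    log_g: unnormalized log-density; z0: initial state (1-D array)."""
    z = np.asarray(z0, dtype=float)
    samples = []
    for _ in range(n_samples):
        z_prop = z + step * rng.standard_normal(z.shape)  # propose z' ~ q(z'|z)
        log_alpha = log_g(z_prop) - log_g(z)              # log acceptance ratio
        if np.log(rng.uniform()) < log_alpha:             # accept w.p. min(1, alpha)
            z = z_prop
        samples.append(z.copy())
    return np.array(samples)

# Usage: sample from a standard normal via its unnormalized log-density.
draws = metropolis_hastings(lambda z: -0.5 * np.sum(z**2), np.zeros(2), 5000)
```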
3.2 Variational Inference (VI)
Due to the intractability of the exact posterior P (U, V | {rij }), we employ Variational Inference (VI)
to approximate it. We define a variational distribution Q(U, V) under the mean-field assumption:
\[
Q(U, V) = \prod_{i=1}^{N} Q_i(u_i) \prod_{j=1}^{M} Q_j(v_j),
\]
and optimize it to minimize the KL divergence between the true posterior and the variational
distribution. This leads to the maximization of the Evidence Lower Bound (ELBO):
\[
\log P(\{r_{ij}\}) \ge \mathcal{L}(Q)
= \mathbb{E}_{Q}\!\left[\log P(\{r_{ij}\}, U, V)\right] - \mathbb{E}_{Q}\!\left[\log Q(U, V)\right].
\]
We derive the update rules using Coordinate Ascent Variational Inference (CAVI), iteratively optimiz-
ing each variational factor while keeping others fixed.
The specific iterative process follows the standard CAVI update: each factor is set via log Q_j = E_{i≠j}[log P(data, latents)] + Constant, cycling through the factors until the ELBO stabilizes; the full derivation is given in the appendix.
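While the coordinate updates themselves depend on the model, the bound being maximized can be estimated directly, which is useful for monitoring convergence. Below is a minimal Monte Carlo sketch of the ELBO for our model, assuming fully factorized Gaussian factors with means mu_U, mu_V and standard deviations s_U, s_V; these parameter names, and the use of sampling rather than the closed-form coordinate updates, are our own illustration, not the VI.py implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def elbo_estimate(mu_U, s_U, mu_V, s_V, ratings, sigma2=0.1, n_mc=20):
    """Monte Carlo estimate of L(Q) = E_Q[log P(data | U, V)] - KL(Q || prior)
    for factorized Gaussian Q and standard normal priors (a sketch)."""
    # KL( N(mu, s^2) || N(0, 1) ), summed over every latent coordinate.
    kl = 0.5 * (np.sum(s_U**2 + mu_U**2 - 1.0 - 2.0 * np.log(s_U))
                + np.sum(s_V**2 + mu_V**2 - 1.0 - 2.0 * np.log(s_V)))
    # Monte Carlo estimate of the expected log-likelihood under Q.
    ll = 0.0
    for _ in range(n_mc):
        U = mu_U + s_U * rng.standard_normal(mu_U.shape)
        V = mu_V + s_V * rng.standard_normal(mu_V.shape)
        for i, j, r in ratings:
            m = sigmoid(U[i] @ V[j])
            ll += -0.5 * np.log(2 * np.pi * sigma2) - (r - m) ** 2 / (2 * sigma2)
    return ll / n_mc - kl
```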
4 Dataset Processing
We use the MovieLens-small dataset, which consists of 100,836 ratings from 610 users on 9,724
movies. The rating data is stored in the ratings.csv file. We implemented two Python scripts:
MCMC.py and VI.py, which perform Bayesian matrix completion using the MCMC and VI methods,
respectively, on the data from the CSV file.
The data in ratings.csv is stored in four columns: userId, movieId, rating, and timestamp. We ignore the timestamp column and import the first three columns into the Python scripts for further processing. The dataset is preprocessed by:
1. Normalizing ratings to the range [0, 1].
2. Splitting the data into three groups: a 60% training set, a 20% validation set, and a 20% test set (a minimal sketch of this preprocessing follows).
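A pandas sketch of this pipeline; the file name comes from the paper, while the split logic and variable names are ours. Note also that MovieLens-small contains half-star ratings, so the actual scripts may clip or rescale slightly differently:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Load the first three columns of ratings.csv; the timestamp is ignored.
df = pd.read_csv("ratings.csv", usecols=["userId", "movieId", "rating"])

# Normalize ratings into [0, 1] via r <- (r - 1) / (R - 1) with R = 5,
# clipping in case of half-star ratings below 1.
df["rating"] = ((df["rating"] - 1.0) / 4.0).clip(lower=0.0)

# Shuffle once, then split 60% / 20% / 20% into train / validation / test.
perm = rng.permutation(len(df))
n_train, n_val = int(0.6 * len(df)), int(0.2 * len(df))
train = df.iloc[perm[:n_train]]
val = df.iloc[perm[n_train:n_train + n_val]]
test = df.iloc[perm[n_train + n_val:]]
```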
5 Experimental Evaluation
We evaluate MCMC and VI on the MovieLens dataset based on:
1. Convergence Speed;
2. Predictive Accuracy;
3. Computational Efficiency.
5.1 Convergence Speed
Controlling other parameters, such as the latent vector dimension and the variance, to be consistent, we ran VI for 300 epochs and found that it generally began to stabilize between 150 and 200 epochs; we ran MCMC for 1000 epochs and found that it generally began to stabilize between 600 and 700 epochs. This indicates that VI requires fewer epochs to converge than MCMC does.
5.2 Predictive Accuracy
Predictive accuracy is measured using RMSE on the held-out test set: VI achieves 1.2277, while MCMC achieves 1.1836.
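For reference, with \(\mathcal{T}\) denoting the test pairs and \(\hat{r}_{ij}\) the predicted rating (our notation), the metric is
\[
\mathrm{RMSE} = \sqrt{ \frac{1}{|\mathcal{T}|} \sum_{(i,j) \in \mathcal{T}} \left( r_{ij} - \hat{r}_{ij} \right)^2 }.
\]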
5.3 Computational Efficiency
We used an NVIDIA GeForce RTX 4060 Laptop GPU for computation. The execution time of VI.py was approximately 6 seconds, while MCMC.py took about 6 hours to run; VI thus runs roughly 3,600× faster than MCMC, owing to its optimization-based approach.
5.4 Discussion
Taken together, the results exhibit the expected trade-off: MCMC's posterior sampling yields the better RMSE, while VI's deterministic optimization converges in far fewer epochs and in a tiny fraction of the runtime. We elaborate on this trade-off in the conclusion.
6 Conclusion
In this report, we proposed a Bayesian approach to matrix factorization, leveraging two prominent
inference methods—Markov Chain Monte Carlo (MCMC) and Variational Inference (VI)—to address
the intractability of the posterior distribution over latent user and item features in collaborative filtering.
We formulated the probabilistic model by introducing Gaussian priors on user and item latent vectors
and a sigmoid-transformed Gaussian likelihood to ensure bounded rating predictions.
We implemented both inference techniques and evaluated their performance on the MovieLens
dataset, focusing on three key dimensions: convergence speed, predictive accuracy, and computa-
tional efficiency. Our experiments demonstrated that VI converges considerably faster than MCMC
and achieves remarkable computational efficiency due to its deterministic optimization framework.
However, MCMC, by virtue of sampling from the true posterior, offers more accurate predictions,
though at the cost of significantly higher runtime.
These results highlight a fundamental trade-off in Bayesian matrix factorization: MCMC yields
higher fidelity at the expense of time, while VI offers scalability and speed with slightly reduced
accuracy. Therefore, the choice of inference method should be guided by the specific constraints and
requirements of the target application.
In future work, we plan to explore hybrid approaches that combine the strengths of MCMC and VI,
such as initializing MCMC with variational parameters or employing amortized inference techniques.
Additionally, extending the model to incorporate content-based features or temporal dynamics could
further enhance recommendation accuracy and applicability in real-world systems.
References
[1] Mnih, A., & Salakhutdinov, R. R. (2007). Probabilistic matrix factorization. Advances in Neural Information Processing Systems, 20.
[2] Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1), 97–109. https://doi.org/10.1093/biomet/57.1.97
[3] Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518), 859–877. https://doi.org/10.1080/01621459.2017.1285773
[4] Salakhutdinov, R., & Mnih, A. (2008). Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of the 25th International Conference on Machine Learning (ICML '08), 880–887. https://doi.org/10.1145/1390156.1390267
[5] Chen, G., Zhu, F., & Heng, P. A. (2018). Large-scale Bayesian probabilistic matrix factorization with memo-free distributed variational inference. ACM Transactions on Knowledge Discovery from Data, 12(3), Article 31. https://doi.org/10.1145/3161886
Appendix
1. Mathematical Details of MCMC
A set of random variables {z1 , z2 , . . . , zn } forms a first-order Markov chain if the following condi-
tional independence holds:
P (zk+1 |z1 , z2 , . . . , zk ) = P (zk+1 |zk ) for k ∈ {1, . . . , n − 1}.
This means that the current state depends only on the previous state and not on earlier states.
For continuous variables, it becomes:
f (zk+1 |z1 , z2 , . . . , zk ) = f (zk+1 |zk ) for k ∈ {1, . . . , n − 1}.
To generate a Markov chain, we need:
1. Define the initial state distribution P (z1 ).
2. Construct the transition kernels:
Tk (zk+1 ← zk ) = P (zk+1 |zk ), k ∈ {1, . . . , n − 1}.
For continuous variables, the transition kernel is written as:
Tk (zk+1 ← zk ) = f (zk+1 |zk ), k ∈ {1, . . . , n − 1}.
A Markov chain is called homogeneous if the transition kernels are the same for all k.
The marginal probability of a specific state can be computed via the product and sum rules, i.e., the law of total probability:
\[
P(\mathbf{z}_{k+1}) = \sum_{\mathbf{z}_k} T(\mathbf{z}_{k+1} \leftarrow \mathbf{z}_k)\, P(\mathbf{z}_k).
\]
A sufficient (but not necessary) condition for ensuring P ∗ (·) is stationary or invariant is to choose a
transition kernel that satisfies the property of detailed balance, defined by:
T (z′ ← z)P ∗ (z) = T (z ← z′ )P ∗ (z′ ), ∀z, z′ .
If detailed balance holds, the chain is said to be reversible.
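To see why detailed balance implies stationarity, sum both sides over z and use the normalization of the transition kernel:
\[
\sum_{\mathbf{z}} T(\mathbf{z}' \leftarrow \mathbf{z})\, P^{*}(\mathbf{z})
= \sum_{\mathbf{z}} T(\mathbf{z} \leftarrow \mathbf{z}')\, P^{*}(\mathbf{z}')
= P^{*}(\mathbf{z}'),
\]
so one step of the chain leaves P*(·) unchanged.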
The core idea of using Markov Chain Monte Carlo (MCMC) methods for sampling:
1. Construct a Markov chain whose stationary distribution is f (z);
2. Simulate the chain to generate samples;
3. After a sufficient number of steps, the samples approximately follow the distribution f (z).
When the Markov chain runs long enough, even if the initial state is not sampled from f (z), the
eventually generated samples will converge to this distribution, thus allowing approximate sampling.
It should be noted that, to guarantee convergence to f (z), the Markov chain needs to be ergodic.
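As a toy illustration of this convergence (our own example, not from the paper), simulating a small two-state homogeneous chain shows the empirical state frequencies approaching the stationary distribution, here (0.75, 0.25):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-state homogeneous Markov chain; row s of T is the distribution of the
# next state given the current state s. Its stationary distribution solves
# pi T = pi, which for this kernel gives pi = (0.75, 0.25).
T = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# Simulate the chain from an arbitrary initial state and count visits.
state, counts = 0, np.zeros(2)
for _ in range(100_000):
    state = rng.choice(2, p=T[state])
    counts[state] += 1

print(counts / counts.sum())  # empirical occupancy, approximately [0.75, 0.25]
```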
2. Mathematical Details of VI
Considering the general case, by Bayes' theorem we have
\[
\log P(X) = \log P(X, Z) - \log P(Z \mid X).
\]
Dividing both terms inside the logarithms on the right-hand side by Q(Z) leaves the equation unchanged:
\[
\log P(X) = \log \frac{P(X, Z)}{Q(Z)} - \log \frac{P(Z \mid X)}{Q(Z)}.
\]
Multiplying both sides of the equation by Q(Z) and integrating over Z yields, on the left-hand side,
\[
\int_{Z} \log P(X)\, Q(Z)\, \mathrm{d}Z = \log P(X),
\]
and on the right-hand side,
\[
\int_{Z} Q(Z) \log \frac{P(X, Z)}{Q(Z)}\, \mathrm{d}Z
- \int_{Z} Q(Z) \log \frac{P(Z \mid X)}{Q(Z)}\, \mathrm{d}Z
= \mathcal{L}(Q) + \mathrm{KL}(Q \,\|\, P).
\]
In the above, the first integral is denoted L(Q), and the second integral, taken with its minus sign, is the KL divergence, which measures the distance between the posterior P(Z | X) and the distribution Q(Z). Therefore, we have:
\[
\log P(X) = \mathcal{L}(Q) + \mathrm{KL}(Q \,\|\, P).
\]
Since KL(Q ∥ P) ≥ 0 and log P(X) does not depend on Q, maximizing L(Q) is equivalent to minimizing the KL divergence.
Assume, under the mean-field theory, that Q(Z) factorizes over the components of Z (let Z have M components), i.e.,
\[
Q(Z) = \prod_{i=1}^{M} Q_i(Z_i).
\]
We first transform the first term of L(Q):
\[
\begin{aligned}
\int_{Z} \prod_{i=1}^{M} Q_i(Z_i) \log P(X, Z)\, \mathrm{d}Z
&= \int Q_j(Z_j) \left[ \int \prod_{i \ne j} Q_i(Z_i) \log P(X, Z) \prod_{i \ne j} \mathrm{d}Z_i \right] \mathrm{d}Z_j \\
&= \int Q_j(Z_j)\, \mathbb{E}_{Q_i(Z_i),\, i \ne j}\!\left[ \log P(X, Z) \right] \mathrm{d}Z_j.
\end{aligned}
\]
Now let us transform the second term on the right side:
\[
\begin{aligned}
\int_{Z} \prod_{i=1}^{M} Q_i(Z_i) \sum_{i=1}^{M} \log Q_i(Z_i)\, \mathrm{d}Z
&= \int_{Z} \left( \prod_{i=1}^{M} Q_i(z_i) \right) \left( \sum_{i=1}^{M} \log Q_i(z_i) \right) \mathrm{d}Z \\
&= \int_{Z} \left( \prod_{i=1}^{M} Q_i(z_i) \right)
\left[ \log Q_1(z_1) + \log Q_2(z_2) + \cdots + \log Q_M(z_M) \right] \mathrm{d}Z.
\end{aligned}
\]
Now we attempt to isolate one of the terms to discover the pattern:
\[
\begin{aligned}
\int_{Z} \left( \prod_{i=1}^{M} Q_i(z_i) \right) \log Q_1(z_1)\, \mathrm{d}Z
&= \int_{Z} Q_1 Q_2 \cdots Q_M \log Q_1\, \mathrm{d}Z \\
&= \int_{z_1, z_2, \ldots, z_M} Q_1 Q_2 \cdots Q_M \log Q_1\, \mathrm{d}z_1\, \mathrm{d}z_2 \cdots \mathrm{d}z_M \\
&= \int_{z_1} Q_1 \log Q_1\, \mathrm{d}z_1 \int_{z_2} Q_2\, \mathrm{d}z_2 \cdots \int_{z_M} Q_M\, \mathrm{d}z_M \\
&= \int_{z_1} Q_1 \log Q_1\, \mathrm{d}z_1,
\end{aligned}
\]
since each factor \( \int_{z_i} Q_i\, \mathrm{d}z_i \) integrates to 1.
Back to the expression we care about:
\[
\begin{aligned}
\int_{Z} \left( \prod_{i=1}^{M} Q_i(z_i) \right)
\left[ \log Q_1(z_1) + \log Q_2(z_2) + \cdots + \log Q_M(z_M) \right] \mathrm{d}Z
&= \sum_{i=1}^{M} \int_{z_i} Q_i(z_i) \log Q_i(z_i)\, \mathrm{d}z_i \\
&= \int_{z_j} Q_j(z_j) \log Q_j(z_j)\, \mathrm{d}z_j + \text{Constant},
\end{aligned}
\]
where the constant collects the terms with i ≠ j, which do not depend on Q_j.
We transform the expectation to another form:
\[
\mathbb{E}_{Q_i(Z_i),\, i \ne j}\!\left[ \log P(X, Z) \right] = \log \tilde{P}(X, Z_j).
\]
Here we adopt Coordinate Ascent Variational Inference (CAVI): when updating Q_j(Z_j), the other factors Q_i(Z_i) for i ≠ j are kept fixed, thus
\[
\begin{aligned}
\mathcal{L}(Q)
&= \int_{z_j} Q_j(z_j) \log \frac{\tilde{P}(X, z_j)}{Q_j(z_j)}\, \mathrm{d}z_j + \text{Constant} \\
&= -\,\mathrm{KL}\!\left( Q_j \,\|\, \tilde{P}(X, z_j) \right) + \text{Constant}.
\end{aligned}
\]
Hence L(Q) is maximized with respect to Q_j when the KL term vanishes, i.e., when
\[
\log Q_j^{*}(Z_j) = \mathbb{E}_{Q_i(Z_i),\, i \ne j}\!\left[ \log P(X, Z) \right] + \text{Constant},
\]
which is exactly the coordinate update applied iteratively in Section 3.2.