Bayesian Probabilistic Matrix Factorization

This document discusses Bayesian Probabilistic Matrix Factorization (BPMF) as an enhancement to traditional matrix factorization for recommendation systems, focusing on the incorporation of uncertainty through probabilistic methods. It evaluates two Bayesian inference techniques, Markov Chain Monte Carlo (MCMC) and Variational Inference (VI), on the MovieLens dataset, highlighting that VI achieves faster convergence while MCMC provides more accurate posterior estimates. The document also details the mathematical foundations of BPMF and the implementation of both inference methods.

Ruixuan Xu
Department of Computer Science and Engineering
The Chinese University of Hong Kong
[email protected]

Xiangxiang Weng
Department of Computer Science and Engineering
The Chinese University of Hong Kong
[email protected]

arXiv:2506.09928v1 [cs.LG] 11 Jun 2025

Abstract

Matrix factorization is a widely used technique in recommendation systems. Probabilistic Matrix Factorization (PMF) [1] extends traditional matrix factorization by incorporating probability distributions over latent factors, allowing for uncertainty quantification. However, computing the posterior distribution is intractable due to the high-dimensional integral. To address this, we employ two Bayesian inference methods, Markov Chain Monte Carlo (MCMC) [2] and Variational Inference (VI) [3], to approximate the posterior. We evaluate their performance on the MovieLens dataset and compare their convergence speed, predictive accuracy, and computational efficiency. Experimental results demonstrate that VI offers faster convergence, while MCMC provides more accurate posterior estimates.

1 Introduction
Collaborative filtering is an essential technique in recommendation systems, where the goal is to predict user preferences based on sparse observed ratings. Matrix factorization has been widely adopted due to its ability to model user-item interactions effectively. However, traditional matrix factorization relies on point estimates, which may lead to overfitting and provides no uncertainty quantification.
Probabilistic Matrix Factorization (PMF) [1] mitigates this issue by taking a probabilistic approach, where the user and item latent matrices are treated as random variables with prior distributions. The challenge in PMF is computing the posterior distribution of the latent matrices, which is intractable.
To approximate the posterior, we explore two Bayesian inference methods:
1. Markov Chain Monte Carlo (MCMC) [2]: A sampling-based approach that provides asymptotically exact posterior estimates.
2. Variational Inference (VI) [3]: An optimization-based approach that approximates the posterior using a parameterized distribution.
In this report, we implement both methods on the MovieLens dataset and compare their performance.

2 Problem Setting
2.1 Matrix Factorization Model

In mathematics, a sparse matrix refers to a matrix in which most of the elements are zero. A sparse
rating matrix is a common data structure in recommendation systems, typically used to represent
users’ rating data for items. Collaborative filtering relies on user behavior data such as ratings to
predict items a user might like. Due to the high sparsity of rating matrices, directly storing and
computing them would waste a lot of memory and computational resources. Matrix factorization
addresses this by decomposing the user-item rating matrix into two low-dimensional matrices, and
then predicting ratings via their dot product, thereby achieving collaborative filtering.
Given an N × M sparse rating matrix R:

Figure 1: Sparse Rating Matrix

Suppose there are L observed ratings (L ≪ N × M). To predict all unrated entries directly, we would need to estimate N × M − L ≈ N × M parameters. Instead, an N × M rating matrix R can be approximated by the product of an N × K matrix and a K × M matrix.

Figure 2: Matrix Factorization

Here, K is the feature dimension, $U \in \mathbb{R}^{N \times K}$ is the user feature matrix, and $V \in \mathbb{R}^{M \times K}$ is the item feature matrix. With matrix factorization, rather than estimating N × M − L ≈ N × M parameters, we only need to estimate (N + M) × K parameters, where K ≪ min(N, M). Our goal is to find matrices U and V such that $U V^T$ accurately represents the original rating matrix. In this way, the number of parameters to be estimated is greatly reduced, and the entire prediction matrix can be obtained simply by matrix multiplication, making computation efficient. Finding the target parameter matrices U and V involves optimization methods based on gradient descent, for example minimizing the following loss function:

$$\min_{U,V} \sum_{(i,j) \in O} \left( r_{ij} - u_i v_j^T \right)^2$$

where:
- $O$ is the set of observed (user, item) pairs, i.e., all rated entries.
- $r_{ij}$ is the true rating given by user i to item j.
- $u_i v_j^T$ is the predicted rating, where $u_i$ is the i-th row of U and $v_j$ is the j-th row of V.
Gradient descent update rules:

$$u_i \leftarrow u_i + \alpha \sum_{j \in O_i} \left( r_{ij} - u_i v_j^T \right) v_j$$

$$v_j \leftarrow v_j + \alpha \sum_{i \in O_j} \left( r_{ij} - u_i v_j^T \right) u_i$$

where α is the learning rate, $O_i$ is the set of items rated by user i, and $O_j$ is the set of users who rated item j. The constant factor 2 from differentiating the square is absorbed into α.
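The update rules above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `factorize` and `squared_error`, the toy ratings, and the hyperparameter values (`k`, `lr`, `epochs`) are all assumptions chosen for the example.

```python
import numpy as np

def factorize(ratings, n_users, n_items, k=2, lr=0.01, epochs=500, seed=0):
    """Full-batch gradient descent on the squared-error loss.

    `ratings` is a list of (i, j, r_ij) triples, i.e. the observed set O.
    Returns U (n_users x k) and V (n_items x k).
    """
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n_users, k))
    V = 0.1 * rng.standard_normal((n_items, k))
    rows = np.array([i for i, _, _ in ratings])
    cols = np.array([j for _, j, _ in ratings])
    vals = np.array([r for _, _, r in ratings], dtype=float)
    for _ in range(epochs):
        err = vals - np.sum(U[rows] * V[cols], axis=1)   # r_ij - u_i v_j^T
        grad_U = np.zeros_like(U)
        grad_V = np.zeros_like(V)
        np.add.at(grad_U, rows, err[:, None] * V[cols])  # sum over j in O_i
        np.add.at(grad_V, cols, err[:, None] * U[rows])  # sum over i in O_j
        U += lr * grad_U
        V += lr * grad_V
    return U, V

def squared_error(ratings, U, V):
    """Loss value: sum of squared residuals over the observed entries."""
    return sum((r - U[i] @ V[j]) ** 2 for i, j, r in ratings)

# Hypothetical toy data: 4 users, 3 items, 8 observed ratings.
toy = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0),
       (2, 1, 2.0), (2, 2, 5.0), (3, 0, 1.0), (3, 2, 4.0)]
U0, V0 = factorize(toy, 4, 3, epochs=0)   # initial point only
U1, V1 = factorize(toy, 4, 3, epochs=500)
```

Comparing `squared_error(toy, U0, V0)` with `squared_error(toy, U1, V1)` shows how far the loss has dropped; the product `U1 @ V1.T` then fills in the unobserved entries.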


However, traditional matrix factorization has many limitations, such as poor generalization ability,
inability to capture the uncertainty of latent vectors and lack of probabilistic interpretation. Therefore,
we introduce Bayesian Probabilistic Matrix Factorization (BPMF) on the basis of matrix factorization.

2.2 Bayesian Probabilistic Matrix Factorization

Instead of estimating matrices U and V to compute each rating $r_{ij}$ deterministically, one can use Bayesian inference. Consider each row of U and V, i.e., $U_i$ and $V_j$, as a multivariate random variable. $U_i$ and $V_j$ are assumed to be standard normal random vectors:

$$U_{ik} \sim \mathcal{N}(0, 1) \quad \text{for any } 1 \le k \le K$$

$$V_{jk} \sim \mathcal{N}(0, 1) \quad \text{for any } 1 \le k \le K$$

The prior distributions of $U_i$ and $V_j$ have the following PDFs:

$$f_{U_i}(u_i) = \frac{1}{(2\pi)^{K/2}} \exp\left( -\frac{1}{2} u_i u_i^T \right)$$

$$f_{V_j}(v_j) = \frac{1}{(2\pi)^{K/2}} \exp\left( -\frac{1}{2} v_j v_j^T \right)$$

The likelihood of each observed rating $r_{ij}$ is defined as a normal distribution:

$$R_{ij} \mid U_i, V_j \sim \mathcal{N}\left( \mathrm{sigmoid}(u_i v_j^T), \sigma^2 \right)$$

Note: since the sigmoid function returns a value in (0, 1), during training we first normalize the ratings via $r_{ij} \leftarrow \frac{r_{ij} - 1}{R - 1}$, where the original ratings take values in {1, 2, . . . , R}, and $\sigma^2$ is a hyperparameter.
The PDF of the likelihood is

$$f_{R_{ij} \mid U_i, V_j}(r_{ij} \mid u_i, v_j) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{\left( r_{ij} - \mathrm{sigmoid}(u_i v_j^T) \right)^2}{2\sigma^2} \right).$$
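As a small sketch of the rating normalization and the sigmoid-Gaussian likelihood just defined, the following NumPy functions evaluate both. The function names, the default `max_rating=5`, and the default `sigma2=0.25` are assumptions made for illustration; the source does not state the value of σ² it used.

```python
import numpy as np

def normalize_rating(r, max_rating=5):
    """Map a rating in {1, ..., R} to [0, 1] via (r - 1) / (R - 1)."""
    return (r - 1.0) / (max_rating - 1.0)

def denormalize_rating(r_norm, max_rating=5):
    """Inverse map back to the original scale: (R - 1) * r + 1."""
    return (max_rating - 1.0) * r_norm + 1.0

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def log_likelihood(r_norm, u_i, v_j, sigma2=0.25):
    """Log of the Gaussian PDF with mean sigmoid(u_i v_j^T) and variance sigma2."""
    mean = sigmoid(np.dot(u_i, v_j))
    return -0.5 * np.log(2.0 * np.pi * sigma2) - (r_norm - mean) ** 2 / (2.0 * sigma2)
```

Working with the log of the PDF avoids numerical underflow when many observed ratings are multiplied together.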
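Based on conditional independence and Bayes' rule, the posterior of U and V can be inferred as

$$f_{U,V \mid \{R_{ij}\}}(U, V \mid \{r_{ij}\}) \propto f_{\{R_{ij}\} \mid U,V}(\{r_{ij}\} \mid U, V) \, f_{U,V}(U, V)$$

$$\propto \prod_{i=1}^{N} \prod_{j=1}^{M} \left[ f_{R_{ij} \mid U_i, V_j}(r_{ij} \mid u_i, v_j) \right]^{I_{ij}} \exp\left( -\frac{1}{2} u_i u_i^T \right) \exp\left( -\frac{1}{2} v_j v_j^T \right),$$

where $I_{ij}$ is an indicator function that is equal to 1 if user i rated movie j, otherwise 0.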
If the likelihood were a simple Gaussian, one could still minimize the negative log-posterior (the loss function) by gradient descent to obtain the most likely U and V, and then multiply them to generate the prediction matrix, as is done in PMF [1]. However, we introduce nonlinearity by applying a sigmoid function to the mean of the Gaussian likelihood, in order to simulate the complex likelihoods that are likely to occur in real-world applications. In addition, a simple Gaussian likelihood may produce predicted ratings outside the valid range (e.g., from 1 to 5), while the sigmoid compresses the predicted ratings into the valid range.
In this case, gradient descent becomes infeasible. Therefore, we adopt a Bayesian framework, retain the posterior distributions of U and V, and optimize the predictive distribution over unrated data.
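To make a prediction on an unobserved rating $r_{ab}$:

$$f_{R_{ab} \mid \{R_{ij}\}}(r_{ab} \mid \{r_{ij}\}) = \int_{u_a \in \mathbb{R}^K} \int_{v_b \in \mathbb{R}^K} f_{R_{ab} \mid U_a, V_b}(r_{ab} \mid u_a, v_b) \, f_{U_a, V_b \mid \{R_{ij}\}}(u_a, v_b \mid \{r_{ij}\}) \, du_a \, dv_b$$

$$= E_{U_a, V_b \mid \{R_{ij}\}} \left[ f_{R_{ab} \mid U_a, V_b}(r_{ab} \mid u_a, v_b) \right].$$

In general, we can use MAP to generate a predicted rating in [0, 1]:

$$r_{ab}^{*} = \arg\max_{r_{ab}} f_{R_{ab} \mid \{R_{ij}\}}(r_{ab} \mid \{r_{ij}\}).$$

Then map back to the original rating scale using $r = (R - 1) r_{ab}^{*} + 1$.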


However, we still cannot compute fRab |{Rij } (rab |{rij }), because there are two challenges:
1. Although we have derived the parameter posterior distribution fUa ,Vb |{Rij } (ua , vb |{rij }), we
cannot accurately compute the normalization constant, since integrating a high-dimensional Gaussian
distribution with a sigmoid is difficult even for computers;
2. Even if we obtain the parameter posterior, the integral involved in the predictive distribution itself
is high-dimensional and difficult to compute.
To address this, we introduce some Bayesian inference methods [4], [5].

3 Bayesian Inference Methods

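Consider Bayesian inference:

$$f_{\Theta \mid X}(\theta \mid x) = \frac{f_{\Theta}(\theta) \, f_{X \mid \Theta}(x \mid \theta)}{Z(x)}.$$

- $f_{\Theta \mid X}(\theta \mid x)$: Posterior.
- $f_{\Theta}(\theta)$: Prior.
- $f_{X \mid \Theta}(x \mid \theta)$: Likelihood.
- $Z(x)$: Normalization constant, where $Z(x) = \int_{\theta} f_{\Theta}(\theta) \, f_{X \mid \Theta}(x \mid \theta) \, d\theta$.
For multiple observations, let $D = \{X_1, \ldots, X_n\}$ be the joint random variables and $d = \{x_1, \ldots, x_n\}$ their values. Bayesian inference can be written as:

$$f_{\Theta \mid D}(\theta \mid d) = \frac{f_{\Theta}(\theta) \, f_{D \mid \Theta}(d \mid \theta)}{Z(d)}.$$

Bayesian prediction:

$$f_{X \mid D}(x^{*} \mid d) = \int_{-\infty}^{+\infty} f_{X \mid \Theta}(x^{*} \mid \theta) \, f_{\Theta \mid D}(\theta \mid d) \, d\theta = E_{\Theta \mid D = d}\left[ f_{X \mid \Theta}(x^{*} \mid \theta) \right].$$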

Two major computational challenges:


1. Computing Z(d) requires computing high dimensional integrals as θ is multivariate in practice;
2. The integral involved in the predictive distribution itself is difficult to compute.
We propose two solutions:
1. Use MCMC [2] to sample from fΘ|D (θ|d) and use the sample mean to approximate the expectation.
2. Use VI [3] to approximate fΘ|D (θ|d) by q(θ), which is easy to integrate.
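To make the first solution concrete, the sketch below approximates the predictive density of Section 2.2 by a sample mean over posterior draws, picks the maximizing rating on a grid, and maps it back to the original scale. It assumes the draws of $u_a$ and $v_b$ are already available (for example from an MCMC run); the function names, the grid resolution, and the default `sigma2=0.05` are hypothetical choices for illustration.

```python
import numpy as np

def predictive_density(r_grid, u_samples, v_samples, sigma2):
    """Sample-mean estimate of f(r_ab | data) on a grid of normalized ratings.

    u_samples, v_samples: arrays of shape (n, K), one posterior draw per row.
    """
    logits = np.sum(u_samples * v_samples, axis=1)        # u_a v_b^T per draw
    means = 1.0 / (1.0 + np.exp(-logits))                 # sigmoid
    diff = r_grid[:, None] - means[None, :]
    dens = np.exp(-diff ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)
    return dens.mean(axis=1)                              # average over draws

def map_rating(u_samples, v_samples, sigma2=0.05, max_rating=5, grid_size=201):
    """Grid arg-max of the estimated predictive density, mapped back to 1..R."""
    r_grid = np.linspace(0.0, 1.0, grid_size)
    dens = predictive_density(r_grid, u_samples, v_samples, sigma2)
    r_star = r_grid[np.argmax(dens)]
    return (max_rating - 1.0) * r_star + 1.0
```

Because the estimate is a mixture of Gaussians centered at the per-draw means, a grid search is a simple way to locate its mode without further analysis.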

3.1 Markov Chain Monte Carlo (MCMC)

Suppose z is a multivariate random variable, and we are interested in evaluating the expectation:

$$E_Z[h(z)] = \int_{z} h(z) f(z) \, dz,$$

where

$$f(z) = \frac{g(z)}{Z},$$

g(z) is the unnormalized target density, and Z is the normalization constant.
Our objective is to draw samples $\{z_1, \ldots, z_n\}$ from f(z) to approximate $E_Z[h(z)]$:

$$E_Z[h(z)] \approx \frac{1}{n} \sum_{i=1}^{n} h(z_i).$$

In high-dimensional settings, Markov Chain Monte Carlo (MCMC) is widely used for sampling. The Metropolis-Hastings algorithm is a commonly used MCMC method.
Core idea: Construct a proposal distribution $q(z' \mid z)$ to generate candidate samples and use an acceptance rate to decide whether to accept the sample.
Detailed steps: Let the target distribution be $f(z) = \frac{g(z)}{Z}$.
1. Construct a proposal distribution $q(z' \mid z)$.
2. Choose an initial state $z_0$. Set the initial time t = 0.
3. Sample a candidate $z'$ from $q(z' \mid z_t)$.
4. Compute the acceptance rate:

$$\alpha(z_t, z') = \min\left( 1, \frac{g(z') \, q(z_t \mid z')}{g(z_t) \, q(z' \mid z_t)} \right).$$

5. Accept the new sample with probability α. If accepted, set $z_{t+1} = z'$; otherwise, set $z_{t+1} = z_t$.
6. Update time t ← t + 1.
7. Repeat steps 3–6 until the samples meet the requirements.
The specific iterative process is as follows:

Algorithm 1 Metropolis-Hastings Algorithm for MCMC

Input: Unnormalized target density g(z), proposal distribution q(z′ | z)
Output: Samples {z1, z2, . . . , zn} approximating f(z)
Initialize: Initial state z0, set t = 0
while t < n do
    Sample a candidate z′ ∼ q(z′ | zt)
    Compute acceptance rate:
        α(zt, z′) = min(1, [g(z′) q(zt | z′)] / [g(zt) q(z′ | zt)])
    Sample u ∼ Uniform(0, 1)
    if u < α(zt, z′) then
        Accept: set zt+1 = z′
    else
        Reject: set zt+1 = zt
    end if
    Update t ← t + 1
end while
return {z1, z2, . . . , zn}

For mathematical details, see the Appendix.
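A minimal Python sketch of Algorithm 1 follows. It assumes a Gaussian random-walk proposal, which is symmetric, so the ratio $q(z_t \mid z') / q(z' \mid z_t)$ equals 1 and drops out of the acceptance rate; it also works with $\log g$ for numerical stability. The function name, the `step` parameter, and the toy target are illustrative assumptions, not the contents of the authors' MCMC.py.

```python
import numpy as np

def metropolis_hastings(log_g, z0, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis-Hastings.

    log_g: function returning the log of the unnormalized target density g(z).
    Returns an array of shape (n_samples, dim) holding z_1, ..., z_n.
    """
    rng = np.random.default_rng(seed)
    z = np.atleast_1d(np.asarray(z0, dtype=float))
    log_gz = log_g(z)
    samples = np.empty((n_samples, z.size))
    for t in range(n_samples):
        z_new = z + step * rng.standard_normal(z.size)   # candidate z'
        log_g_new = log_g(z_new)
        # Symmetric proposal: alpha = min(1, g(z') / g(z_t)).
        if np.log(rng.uniform()) < log_g_new - log_gz:
            z, log_gz = z_new, log_g_new                 # accept
        samples[t] = z                                   # on reject, z_t is repeated
    return samples

# Toy target (an assumption for the demo): unnormalized N(2, 1) in one dimension.
chain = metropolis_hastings(lambda z: -0.5 * np.sum((z - 2.0) ** 2),
                            z0=0.0, n_samples=20000)
kept = chain[2000:]   # discard an initial burn-in segment
```

The sample mean and variance of `kept` should be close to those of the target; for an asymmetric proposal the log ratio of the two proposal densities would have to be added to the acceptance test.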

3.2 Variational Inference (VI)

Due to the intractability of the exact posterior $P(U, V \mid \{r_{ij}\})$, we employ Variational Inference (VI) to approximate it. We define a variational distribution Q(U, V) under the mean-field assumption:

$$Q(U, V) = \prod_{i=1}^{N} Q_i(u_i) \prod_{j=1}^{M} Q_j(v_j),$$

and optimize it to minimize the KL divergence between the true posterior and the variational distribution. This leads to the maximization of the Evidence Lower Bound (ELBO):

$$\log P(\{r_{ij}\}) \ge L(Q) = E_Q\left[ \log P(\{r_{ij}\}, U, V) \right] - E_Q\left[ \log Q(U, V) \right].$$

We derive the update rules using Coordinate Ascent Variational Inference (CAVI), iteratively optimizing each variational factor while keeping the others fixed.
The specific iterative process is as follows:

Algorithm 2 Coordinate Ascent Variational Inference (CAVI)

Input: A model p(x, z), a data set x
Output: A variational density Q(z) = ∏_{j=1}^{m} Qj(zj)
Initialize: Variational factors Qj(zj)
while the ELBO has not converged do
    for j ∈ {1, . . . , m} do
        Set Qj(zj) ∝ exp{E−j[log p(zj | z−j, x)]}
    end for
    Compute ELBO(Q) = E[log p(z, x)] − E[log Q(z)]
end while
return Q(z)

For mathematical details, see the Appendix.
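The coordinate-ascent loop of Algorithm 2 can be illustrated on a case where every update has a closed form. The sketch below is not the BPMF model: it takes a correlated Gaussian target $\mathcal{N}(\mu, \Lambda^{-1})$ as a stand-in, for which each mean-field factor $Q_j$ is Gaussian with variance $1/\Lambda_{jj}$ and mean $m_j = \mu_j - \Lambda_{jj}^{-1} \sum_{k \neq j} \Lambda_{jk}(m_k - \mu_k)$. It also stops when the factor means stop changing rather than monitoring the ELBO. The function name and the example numbers are assumptions.

```python
import numpy as np

def cavi_gaussian(mu, Lam, n_iters=100, tol=1e-10):
    """CAVI for a mean-field approximation of N(mu, Lam^{-1}).

    Returns the means and variances of the factors Q_j.
    """
    mu = np.asarray(mu, dtype=float)
    Lam = np.asarray(Lam, dtype=float)
    d = len(mu)
    m = np.zeros(d)                        # initial factor means
    for _ in range(n_iters):
        m_old = m.copy()
        for j in range(d):                 # update Q_j with the others fixed
            others = [k for k in range(d) if k != j]
            m[j] = mu[j] - Lam[j, others] @ (m[others] - mu[others]) / Lam[j, j]
        if np.max(np.abs(m - m_old)) < tol:
            break                          # factor means have stabilized
    variances = 1.0 / np.diag(Lam)
    return m, variances

# Hypothetical 2-D target with correlated components.
mu_true = np.array([1.0, -2.0])
Lam_true = np.array([[2.0, 0.8],
                     [0.8, 1.0]])
m_hat, var_hat = cavi_gaussian(mu_true, Lam_true)
```

In this example the factor means converge to the true mean, while the factor variances are smaller than the true marginal variances, a known tendency of mean-field approximations.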

4 Dataset Processing
We use the MovieLens-small dataset, which consists of 100,836 ratings from 610 users on 9,724
movies. The rating data is stored in the ratings.csv file. We implemented two Python scripts:
MCMC.py and VI.py, which perform Bayesian matrix completion using the MCMC and VI methods,
respectively, on the data from the CSV file.
The data in ratings.csv is stored in the following format:

userId   movieId   rating   timestamp
1        1         4        964982703
1        3         4        964981247
1        6         4        964982224
1        47        5        964983815
1        50        5        964982931
...      ...       ...      ...

We ignore the timestamp column and import the first three columns into the Python scripts for further processing. The dataset is preprocessed by:
1. Normalizing ratings to the range [0, 1].
2. Splitting the data into three groups: 60% training set, 20% validation set, and 20% test set.
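The two preprocessing steps might look like the following NumPy sketch. It assumes ratings on a 1 to `max_rating` scale, as in the normalization formula of Section 2.2, and a random shuffle before splitting; the source does not say how the split was drawn, so the shuffle, the seed, and the function name are assumptions.

```python
import numpy as np

def normalize_and_split(ratings, max_rating=5.0, seed=0):
    """Normalize the rating column and split rows 60/20/20.

    ratings: array of shape (L, 3) with columns userId, movieId, rating.
    Returns (train, val, test) arrays with ratings rescaled to [0, 1].
    """
    data = np.asarray(ratings, dtype=float).copy()
    data[:, 2] = (data[:, 2] - 1.0) / (max_rating - 1.0)
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(data))          # random row order
    n_train = (6 * len(data)) // 10
    n_val = (2 * len(data)) // 10
    train = data[perm[:n_train]]
    val = data[perm[n_train:n_train + n_val]]
    test = data[perm[n_train + n_val:]]        # remaining rows
    return train, val, test

# Loading the first three columns of the CSV could be done with, for example:
# raw = np.loadtxt("ratings.csv", delimiter=",", skiprows=1, usecols=(0, 1, 2))
```

Integer arithmetic is used for the split sizes so that the three parts always cover every row exactly once.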

5 Experimental Evaluation
We evaluate MCMC and VI on the MovieLens dataset based on:
1. Convergence Speed;
2. Predictive Accuracy;
3. Computational Efficiency.

5.1 Convergence Speed

Controlling other parameters such as latent vector dimension and variance to be consistent, we set
300 epochs for VI and found that it generally began to stabilize between 150 and 200 epochs; we set
1000 epochs for MCMC and found that it generally began to stabilize between 600 and 700 epochs.
This indicates that VI requires fewer epochs to converge compared to MCMC.

Figure 3: Loss-Epoch Plot of VI and MCMC

5.2 Predictive Accuracy

Predictive accuracy is measured using the root-mean-square error (RMSE), where lower is better: VI achieves 1.2277 and MCMC achieves 1.1836.
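For reference, RMSE over a set of held-out ratings can be computed as in the short sketch below. The function name is hypothetical, and the source does not state whether the reported values were computed on the normalized or the original rating scale, so the scale is left to the caller.

```python
import numpy as np

def rmse(predicted, actual):
    """Root-mean-square error between predicted and true ratings."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))
```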

Figure 4: RMSE of VI and MCMC

5.3 Computational Efficiency

We used an NVIDIA GeForce RTX 4060 Laptop GPU for computation. The execution time of VI.py
was approximately 6 seconds, while MCMC.py took about 6 hours to run. This means VI runs
approximately 3,600x faster than MCMC due to its optimization-based approach.

5.4 Discussion

1. MCMC is more accurate, since its samples asymptotically follow the true posterior.
2. VI is computationally efficient and suitable for large-scale data.
3. Trade-off: If accuracy is paramount, use MCMC; if speed is critical, use VI.

6 Conclusion
In this report, we proposed a Bayesian approach to matrix factorization, leveraging two prominent inference methods, Markov Chain Monte Carlo (MCMC) and Variational Inference (VI), to address the intractability of the posterior distribution over latent user and item features in collaborative filtering. We formulated the probabilistic model by introducing Gaussian priors on user and item latent vectors and a sigmoid-transformed Gaussian likelihood to ensure bounded rating predictions.
We implemented both inference techniques and evaluated their performance on the MovieLens dataset, focusing on three key dimensions: convergence speed, predictive accuracy, and computational efficiency. Our experiments demonstrated that VI converges considerably faster than MCMC and achieves remarkable computational efficiency due to its deterministic optimization framework. However, MCMC, by virtue of sampling from the true posterior, offers more accurate predictions, though at the cost of significantly higher runtime.
These results highlight a fundamental trade-off in Bayesian matrix factorization: MCMC yields
higher fidelity at the expense of time, while VI offers scalability and speed with slightly reduced
accuracy. Therefore, the choice of inference method should be guided by the specific constraints and
requirements of the target application.
In future work, we plan to explore hybrid approaches that combine the strengths of MCMC and VI,
such as initializing MCMC with variational parameters or employing amortized inference techniques.
Additionally, extending the model to incorporate content-based features or temporal dynamics could
further enhance recommendation accuracy and applicability in real-world systems.

References
[1] Mnih, A., & Salakhutdinov, R. R. (2007). Probabilistic matrix factorization. Advances in Neural Information Processing Systems, 20.
[2] Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1), 97–109. https://doi.org/10.1093/biomet/57.1.97
[3] Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). Variational Inference: A Review for Statisticians. Journal of the American Statistical Association, 112(518), 859–877. https://doi.org/10.1080/01621459.2017.1285773
[4] Salakhutdinov, R., & Mnih, A. (2008). Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of the 25th International Conference on Machine Learning (ICML '08). Association for Computing Machinery, New York, NY, USA, 880–887. https://doi.org/10.1145/1390156.1390267
[5] Chen, G., Zhu, F., & Heng, P. A. (2018). Large-Scale Bayesian Probabilistic Matrix Factorization with Memo-Free Distributed Variational Inference. ACM Transactions on Knowledge Discovery from Data, 12(3), Article 31, 24 pages. https://doi.org/10.1145/3161886

Appendix
1. Mathematical Details of MCMC
A set of random variables $\{z_1, z_2, \ldots, z_n\}$ forms a first-order Markov chain if the following conditional independence holds:

$$P(z_{k+1} \mid z_1, z_2, \ldots, z_k) = P(z_{k+1} \mid z_k) \quad \text{for } k \in \{1, \ldots, n-1\}.$$

This means that each state depends only on the immediately preceding state and not on earlier states.
For continuous variables, it becomes:

$$f(z_{k+1} \mid z_1, z_2, \ldots, z_k) = f(z_{k+1} \mid z_k) \quad \text{for } k \in \{1, \ldots, n-1\}.$$

To generate a Markov chain, we need to:
1. Define the initial state distribution $P(z_1)$.
2. Construct the transition kernels:

$$T_k(z_{k+1} \leftarrow z_k) = P(z_{k+1} \mid z_k), \quad k \in \{1, \ldots, n-1\}.$$

For continuous variables, the transition kernel is written as:

$$T_k(z_{k+1} \leftarrow z_k) = f(z_{k+1} \mid z_k), \quad k \in \{1, \ldots, n-1\}.$$

A Markov chain is called homogeneous if the transition kernels are the same for all k.
The marginal probability of a specific state can be computed via the product and sum rules, i.e., the law of total probability:

$$P(z_{k+1}) = \sum_{z_k} T(z_{k+1} \leftarrow z_k) P(z_k),$$

or for continuous variables, expressed in terms of PDFs:

$$f(z_{k+1}) = \int_{z_k} T(z_{k+1} \leftarrow z_k) f(z_k) \, dz_k.$$

A distribution $P^{*}(\cdot)$ is said to be stationary or invariant with respect to a Markov chain if each step in the chain leaves $P^{*}(\cdot)$ invariant:

$$P^{*}(z) = \sum_{z'} T(z \leftarrow z') P^{*}(z'), \quad \forall z.$$

A sufficient (but not necessary) condition for ensuring $P^{*}(\cdot)$ is stationary or invariant is to choose a transition kernel that satisfies the property of detailed balance, defined by:

$$T(z' \leftarrow z) P^{*}(z) = T(z \leftarrow z') P^{*}(z'), \quad \forall z, z'.$$

If detailed balance holds, the chain is said to be reversible.
The core idea of using Markov Chain Monte Carlo (MCMC) methods for sampling:
1. Construct a Markov chain whose stationary distribution is f (z);
2. Simulate the chain to generate samples;
3. After a sufficient number of steps, the samples approximately follow the distribution f (z).
When the Markov chain runs long enough, even if the initial state is not sampled from f (z), the
eventually generated samples will converge to this distribution, thus allowing approximate sampling.
It should be noted that, to guarantee convergence to f (z), the Markov chain needs to be ergodic.
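These properties can be checked numerically on a small discrete example. The sketch below builds the Metropolis-Hastings transition matrix for a four-state target with a uniform (hence symmetric) proposal, so that detailed balance and stationarity can be verified directly. The state space, the weights `g_toy`, and the function name are assumptions chosen for the demonstration.

```python
import numpy as np

def mh_transition_matrix(g):
    """Column-stochastic matrix with T[z_new, z] = T(z_new <- z).

    Uses a uniform proposal over all n states and the Metropolis-Hastings
    acceptance rate min(1, g[z_new] / g[z]).
    """
    g = np.asarray(g, dtype=float)
    n = len(g)
    T = np.zeros((n, n))
    for z in range(n):
        for z_new in range(n):
            if z_new != z:
                T[z_new, z] = (1.0 / n) * min(1.0, g[z_new] / g[z])
        T[z, z] = 1.0 - T[:, z].sum()      # self-proposals and rejections
    return T

g_toy = np.array([1.0, 2.0, 3.0, 4.0])     # unnormalized target g(z)
p_star = g_toy / g_toy.sum()               # normalized target f(z)
T = mh_transition_matrix(g_toy)

flow = T * p_star[None, :]                 # flow[z_new, z] = T(z_new <- z) P*(z)
start = np.array([1.0, 0.0, 0.0, 0.0])     # chain started in state 0
after_200 = np.linalg.matrix_power(T, 200) @ start
```

Here `flow` is symmetric (detailed balance), `T @ p_star` reproduces `p_star` (stationarity), and `after_200` is close to `p_star` even though the chain did not start from the target distribution.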

2. Mathematical Details of VI
Considering the general case, by Bayes' theorem, we have

$$\log P(X) = \log P(X, Z) - \log P(Z \mid X).$$

Divide both terms inside the logarithms on the right-hand side by Q(Z). The equation remains equivalent:

$$\log P(X) = \log \frac{P(X, Z)}{Q(Z)} - \log \frac{P(Z \mid X)}{Q(Z)}.$$

Multiply both sides of the equation by Q(Z) and integrate over Z, yielding:
Left-hand side:

$$\int_{Z} \log P(X) \, Q(Z) \, dZ = \log P(X).$$

Right-hand side:

$$\int_{Z} Q(Z) \log \frac{P(X, Z)}{Q(Z)} \, dZ - \int_{Z} Q(Z) \log \frac{P(Z \mid X)}{Q(Z)} \, dZ$$

$$= \int_{Z} Q(Z) \log \frac{P(X, Z)}{Q(Z)} \, dZ + \int_{Z} Q(Z) \log \frac{Q(Z)}{P(Z \mid X)} \, dZ$$

$$= L(Q) + \mathrm{KL}(Q \,\|\, P).$$

In the above, the first term is denoted as L(Q), and the second term (with a negative sign) represents
the KL divergence, indicating the distance between the posterior distribution P (Z | X) and the
distribution Q(Z). Therefore, we have:

log P (X) = L(Q) + KL(Q∥P ).
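The identity can be verified numerically on a toy model with a discrete latent variable, where the integrals become sums. The joint probabilities and the variational distribution below are made-up numbers for illustration, and the function name is hypothetical.

```python
import numpy as np

def elbo_and_kl(joint, q):
    """Return (L(Q), KL(Q || P(Z|X)), log P(X)) for a discrete latent Z.

    joint[k] = P(X = x, Z = k) for the observed x; q[k] = Q(Z = k).
    """
    joint = np.asarray(joint, dtype=float)
    q = np.asarray(q, dtype=float)
    evidence = joint.sum()                       # P(X) = sum_Z P(X, Z)
    posterior = joint / evidence                 # P(Z | X)
    elbo = np.sum(q * np.log(joint / q))         # L(Q)
    kl = np.sum(q * np.log(q / posterior))       # KL(Q || P)
    return elbo, kl, np.log(evidence)

joint_toy = np.array([0.10, 0.25, 0.05])         # assumed values of P(X = x, Z)
q_toy = np.array([0.2, 0.5, 0.3])                # an arbitrary Q(Z)
elbo, kl, log_px = elbo_and_kl(joint_toy, q_toy)
```

The sum `elbo + kl` equals `log_px` for any valid `q_toy`; since the KL term is non-negative, `elbo` is a lower bound on the log evidence, and the bound is tight when Q equals the posterior.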

Assume, under the mean-field assumption, that Q(Z) factorizes over the components of Z (let Z have M components), i.e.,

$$Q(Z) = \prod_{i=1}^{M} Q_i(Z_i).$$

From the previous derivation, we know:

$$L(Q) = \int_{Z} Q(Z) \log \frac{P(X, Z)}{Q(Z)} \, dZ$$

$$= \int_{Z} \prod_{i=1}^{M} Q_i(Z_i) \log P(X, Z) \, dZ - \int_{Z} \prod_{i=1}^{M} Q_i(Z_i) \log \prod_{i=1}^{M} Q_i(Z_i) \, dZ$$

$$= \int_{Z} \prod_{i=1}^{M} Q_i(Z_i) \log P(X, Z) \, dZ - \int_{Z} \prod_{i=1}^{M} Q_i(Z_i) \sum_{i=1}^{M} \log Q_i(Z_i) \, dZ.$$

We first transform the first term on the right-hand side:

$$\int_{Z} \prod_{i=1}^{M} Q_i(Z_i) \log P(X, Z) \, dZ$$

$$= \int_{Z_j} Q_j(Z_j) \left[ \int \prod_{i \neq j} Q_i(Z_i) \log P(X, Z) \prod_{i \neq j} dZ_i \right] dZ_j$$

$$= \int_{Z_j} Q_j(Z_j) \, E_{Q_i(Z_i), i \neq j}\left[ \log P(X, Z) \right] dZ_j.$$

Now we transform the second term on the right-hand side:

$$\int_{Z} \prod_{i=1}^{M} Q_i(Z_i) \sum_{i=1}^{M} \log Q_i(Z_i) \, dZ$$

$$= \int_{Z} \left( \prod_{i=1}^{M} Q_i(z_i) \right) \left( \sum_{i=1}^{M} \log Q_i(z_i) \right) dZ$$

$$= \int_{Z} \left( \prod_{i=1}^{M} Q_i(z_i) \right) \left[ \log Q_1(z_1) + \log Q_2(z_2) + \cdots + \log Q_M(z_M) \right] dZ.$$
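Now we isolate one of the terms to discover the pattern:

$$\int_{Z} \left( \prod_{i=1}^{M} Q_i(z_i) \right) \log Q_1(z_1) \, dZ$$

$$= \int_{Z} Q_1 Q_2 \cdots Q_M \log Q_1 \, dZ$$

$$= \int_{z_1, z_2, \ldots, z_M} Q_1 Q_2 \cdots Q_M \log Q_1 \, dz_1 \, dz_2 \cdots dz_M$$

$$= \left( \int_{z_1} Q_1 \log Q_1 \, dz_1 \right) \left( \int_{z_2} Q_2 \, dz_2 \right) \cdots \left( \int_{z_M} Q_M \, dz_M \right)$$

$$= \int_{z_1} Q_1 \log Q_1 \, dz_1.$$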
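Back to the expression we care about:

$$\int_{Z} \left( \prod_{i=1}^{M} Q_i(z_i) \right) \left[ \log Q_1(z_1) + \log Q_2(z_2) + \cdots + \log Q_M(z_M) \right] dZ$$

$$= \sum_{i=1}^{M} \left( \int_{z_i} Q_i(z_i) \log Q_i(z_i) \, dz_i \right)$$

$$= \int_{z_j} Q_j(z_j) \log Q_j(z_j) \, dz_j + \text{Constant}.$$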
We write the expectation in another form by defining $\tilde{P}(X, Z_j)$ through

$$E_{Q_i(Z_i), i \neq j}\left[ \log P(X, Z) \right] = \log \tilde{P}(X, Z_j).$$

Here we adopt Coordinate Ascent Variational Inference (CAVI). The idea of CAVI is that when updating $Q_j(Z_j)$, the other $Q_i(Z_i)$ for $i \neq j$ are kept fixed, thus

$$L(Q) = \int_{z_j} Q_j(z_j) \log \frac{\tilde{P}(X, z_j)}{Q_j(z_j)} \, dz_j + \text{Constant}$$

$$= -\mathrm{KL}\left( Q_j \,\|\, \tilde{P}(X, z_j) \right) + \text{Constant}.$$

To maximize L(Q), we need to set $Q_j(Z_j) = \tilde{P}(X, Z_j)$, that is,

$$Q_j^{*}(Z_j) = \exp\left( E_{Q_i(Z_i), i \neq j}\left[ \log P(X, Z) \right] \right).$$

To ensure that $\int Q_j^{*}(Z_j) \, dZ_j = 1$, we normalize the above expression, obtaining

$$Q_j^{*}(Z_j) = \frac{\exp\left( E_{Q_i(Z_i), i \neq j}\left[ \log P(X, Z) \right] \right)}{\int \exp\left( E_{Q_i(Z_i), i \neq j}\left[ \log P(X, Z) \right] \right) dZ_j}.$$
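In this project, Z can be regarded as [U; V] and X can be regarded as $\{r_{ij}\}$:

$$P(X, Z) = f(U, V, \{r_{ij}\})$$

$$\propto \prod_{i=1}^{N} \prod_{j=1}^{M} \mathcal{N}\left( r_{ij} \mid \mathrm{sigmoid}(u_i^{\top} v_j), \sigma^2 \right)^{I_{ij}} \exp\left( -\frac{1}{2} u_i^{\top} u_i \right) \exp\left( -\frac{1}{2} v_j^{\top} v_j \right).$$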
