Scalable Bayesian Matrix Factorization
Avijit Saha1, Rishabh Misra2, Balaraman Ravindran1
1Department of CSE, Indian Institute of Technology Madras, India
2Department of CSE, Thapar University, India
Presented by: Ayan Acharya
1 / 27
Outline
Motivation
Introduction
Model
Inference
Experiments
Datasets
Baseline
Experimental Setup
Results
Conclusion and Future Work
References
2 / 27
Motivation
Matrix factorization (MF) [1] is ubiquitous in many scientific domains.
MF is a dimensionality reduction technique used to fill in the missing entries of a matrix.
[Figure: a sparse I × J rating matrix with missing entries approximated by the product of two low-rank matrices.]
Figure 1: Example of MF.
3 / 27
Motivation
One common approach to solve MF is using stochastic
gradient descent (SGD) [1].
SGD is scalable and enjoys a local convergence guarantee [2].
However, it often overfits the data and requires manual tuning
of the learning rate and regularization parameters.
Fully Bayesian methods for MF, such as Bayesian Probabilistic
Matrix Factorization (BPMF) [3], avoid these problems and
provide state-of-the-art performance.
However, BPMF has cubic time complexity with respect to
the target rank.
To alleviate this problem of BPMF, we propose the Scalable
Bayesian Matrix Factorization (SBMF).
4 / 27
Introduction
MF is the simplest factor-based model.
Consider a user-movie matrix R ∈ R^{I×J} where cell r_ij holds
the rating given to the jth movie by the ith user.
MF decomposes the matrix R into two low-rank matrices
U = [u_1, u_2, ..., u_I]^T ∈ R^{I×K} and V = [v_1, v_2, ..., v_J]^T ∈ R^{J×K}:

R ∼ U V^T.  (1)

(A small code sketch of the resulting prediction follows this slide.)
5 / 27
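Not from the slides, but to make Eq. (1) concrete: once U and V are learned, a missing cell of R is filled with the inner product of the corresponding latent vectors. A minimal sketch, with an illustrative row-major layout and names:

```cpp
#include <cstddef>
#include <vector>

// Row-major factor matrices: U is I x K, V is J x K (illustrative layout).
struct Factors {
    std::size_t K;
    std::vector<double> U;   // user latent vectors; u_i = U[i*K .. i*K + K - 1]
    std::vector<double> V;   // item latent vectors; v_j = V[j*K .. j*K + K - 1]
};

// Fill a missing cell of R with the prediction r_hat(i, j) = u_i^T v_j.
double predict(const Factors& f, std::size_t i, std::size_t j) {
    double dot = 0.0;
    for (std::size_t k = 0; k < f.K; ++k)
        dot += f.U[i * f.K + k] * f.V[j * f.K + k];
    return dot;
}
```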
Introduction
Probabilistic Matrix Factorization (PMF) [4] provides a
probabilistic interpretation of MF. PMF defines the likelihood as:

p(R \mid U, V, \tau^{-1}) = \prod_{(i,j) \in \Omega} \mathcal{N}(r_{ij} \mid u_i^T v_j, \tau^{-1}).  (2)

In PMF, inferring the posterior distribution is intractable.
PMF handles this intractability with a maximum a posteriori
(MAP) estimate, which is equivalent to minimizing the
regularized squared error:

\sum_{(i,j) \in \Omega} \left( r_{ij} - u_i^T v_j \right)^2 + \lambda \left( \|U\|_F^2 + \|V\|_F^2 \right).  (3)
6 / 27
Introduction
The optimization problem in Eq. (3) is solved using SGD (a
sketch of one such update follows this slide).
Hence, the MAP estimation of PMF suffers from the same
problems as SGD.
On the other hand, the fully Bayesian method BPMF directly
approximates the posterior distribution using Gibbs sampling
and provides state-of-the-art performance.
7 / 27
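The slides do not spell out the SGD update for Eq. (3); the following is a minimal sketch of one standard form, where `lr` (learning rate) and `lambda` (regularization weight) are exactly the quantities that, as noted above, require manual tuning. Names are illustrative, not taken from the SBMF code.

```cpp
#include <cstddef>
#include <vector>

// One SGD step on a single observed rating r_ij for the objective in Eq. (3):
// (r_ij - u_i^T v_j)^2 + lambda * (||u_i||^2 + ||v_j||^2).
void sgd_step(std::vector<double>& u_i, std::vector<double>& v_j,
              double r_ij, double lr, double lambda) {
    const std::size_t K = u_i.size();
    double pred = 0.0;
    for (std::size_t k = 0; k < K; ++k) pred += u_i[k] * v_j[k];
    const double err = r_ij - pred;
    for (std::size_t k = 0; k < K; ++k) {
        const double u_old = u_i[k];                      // keep old value for v_j's update
        u_i[k] += lr * (err * v_j[k] - lambda * u_i[k]);
        v_j[k] += lr * (err * u_old  - lambda * v_j[k]);
    }
}
```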
Introduction
However, BPMF has cubic time complexity with respect to
the target rank due to its multivariate Gaussian priors.
Hence, we develop SBMF, which uses Gibbs sampling for
inference.
Because we assume univariate Gaussian priors, SBMF has linear
time complexity with respect to the target rank and linear space
complexity with respect to the number of non-zero observations.
8 / 27
Model
[Figure: plate diagram; r_ij depends on µ, α_i, β_j, u_ik, v_jk and the precision τ (with prior parameters a_0, b_0); the hyperparameter pairs (µ_α, σ_α), (µ_β, σ_β), (µ_{u_k}, σ_{u_k}), (µ_{v_k}, σ_{v_k}) carry Normal-Gamma priors with parameters (µ_0, ν_0, α_0, β_0); plates over i = 1..I, j = 1..J, and k = 1..K.]
Figure 2: Graphical model for SBMF.
9 / 27
Model
The likelihood term of SBMF is as follows:

p(R \mid \Theta) = \prod_{(i,j) \in \Omega} \mathcal{N}(r_{ij} \mid \mu + \alpha_i + \beta_j + u_i^T v_j, \tau^{-1}),  (4)

where µ is the global bias,
α_i is the bias of the ith user,
β_j is the bias of the jth item,
u_i is the latent factor vector of the ith user,
v_j is the latent factor vector of the jth item,
τ is the model precision,
Ω is the set of all observations, and
Θ is the set of all the model parameters.
10 / 27
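A small sketch (names illustrative, not the authors' code) of the mean of Eq. (4) and of drawing one rating from the likelihood with precision τ:

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Mean of the likelihood in Eq. (4): mu + alpha_i + beta_j + u_i^T v_j.
double sbmf_mean(double mu, double alpha_i, double beta_j,
                 const std::vector<double>& u_i, const std::vector<double>& v_j) {
    double dot = 0.0;
    for (std::size_t k = 0; k < u_i.size(); ++k) dot += u_i[k] * v_j[k];
    return mu + alpha_i + beta_j + dot;
}

// One draw of r_ij around that mean with model precision tau (variance 1/tau).
double sample_rating(double mean, double tau, std::mt19937& rng) {
    std::normal_distribution<double> noise(mean, 1.0 / std::sqrt(tau));
    return noise(rng);
}
```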
Model
We place independent univariate priors on all the model
parameters in Θ as:

p(\mu) = \mathcal{N}(\mu \mid \mu_g, \sigma_g^{-1}),  (5)

p(\alpha_i) = \mathcal{N}(\alpha_i \mid \mu_\alpha, \sigma_\alpha^{-1}),  (6)

p(\beta_j) = \mathcal{N}(\beta_j \mid \mu_\beta, \sigma_\beta^{-1}),  (7)

p(U) = \prod_{i=1}^{I} \prod_{k=1}^{K} \mathcal{N}(u_{ik} \mid \mu_{u_k}, \sigma_{u_k}^{-1}),  (8)

p(V) = \prod_{j=1}^{J} \prod_{k=1}^{K} \mathcal{N}(v_{jk} \mid \mu_{v_k}, \sigma_{v_k}^{-1}),  (9)
p(\tau) = \mathrm{Gamma}(\tau \mid a_0, b_0).  (10)
11 / 27
Model
We further place Normal-Gamma priors on all the hyperparameters
Θ_H = {µ_α, σ_α, µ_β, σ_β, {µ_{u_k}, σ_{u_k}}, {µ_{v_k}, σ_{v_k}}} as:

p(\mu_\alpha, \sigma_\alpha) = \mathcal{NG}(\mu_\alpha, \sigma_\alpha \mid \mu_0, \nu_0, \alpha_0, \beta_0),  (11)

p(\mu_\beta, \sigma_\beta) = \mathcal{NG}(\mu_\beta, \sigma_\beta \mid \mu_0, \nu_0, \alpha_0, \beta_0),  (12)

p(\mu_{u_k}, \sigma_{u_k}) = \mathcal{NG}(\mu_{u_k}, \sigma_{u_k} \mid \mu_0, \nu_0, \alpha_0, \beta_0),  (13)

p(\mu_{v_k}, \sigma_{v_k}) = \mathcal{NG}(\mu_{v_k}, \sigma_{v_k} \mid \mu_0, \nu_0, \alpha_0, \beta_0).  (14)
12 / 27
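The slides do not show the conditional updates for the hyperparameters in Θ_H. Under the Normal-Gamma priors of Eqs. (11)-(14) they follow the standard conjugate form; below is a sketch of that textbook update, assuming the usual parameterization in which σ is a precision, σ ~ Gamma(α_0, β_0) and µ | σ ~ N(µ_0, (ν_0 σ)^{-1}), applied for example to (µ_{u_k}, σ_{u_k}) given the kth column of U. All names are illustrative.

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

struct NormalGamma { double mu0, nu0, alpha0, beta0; };   // prior parameters

// Textbook Normal-Gamma conjugate update: given observations x (e.g. the kth
// column of U), sample the mean/precision pair (mu, sigma) from its posterior.
std::pair<double, double> sample_mean_precision(const std::vector<double>& x,
                                                const NormalGamma& p,
                                                std::mt19937& rng) {
    const double n = static_cast<double>(x.size());   // assumes at least one observation
    double sum = 0.0;
    for (double v : x) sum += v;
    const double xbar = sum / n;
    double ss = 0.0;                                   // sum of squared deviations from xbar
    for (double v : x) ss += (v - xbar) * (v - xbar);

    const double nu_n    = p.nu0 + n;
    const double mu_n    = (p.nu0 * p.mu0 + n * xbar) / nu_n;
    const double alpha_n = p.alpha0 + 0.5 * n;
    const double beta_n  = p.beta0 + 0.5 * ss
                         + 0.5 * p.nu0 * n * (xbar - p.mu0) * (xbar - p.mu0) / nu_n;

    std::gamma_distribution<double> gam(alpha_n, 1.0 / beta_n);   // shape, scale = 1/rate
    const double sigma = gam(rng);                                // sampled precision
    std::normal_distribution<double> nor(mu_n, 1.0 / std::sqrt(nu_n * sigma));
    return {nor(rng), sigma};                                     // (mu, sigma)
}
```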
Model
The joint distribution of the observations and the hidden variables
can be written as:

p(R, \Theta, \Theta_H \mid \Theta_0) = p(R \mid \Theta)\, p(\mu) \prod_{i=1}^{I} p(\alpha_i) \prod_{j=1}^{J} p(\beta_j)\, p(U)\, p(V)\, p(\mu_\alpha, \sigma_\alpha)\, p(\mu_\beta, \sigma_\beta) \prod_{k=1}^{K} p(\mu_{u_k}, \sigma_{u_k})\, p(\mu_{v_k}, \sigma_{v_k}),  (15)

where Θ_0 = {a_0, b_0, µ_g, σ_g, µ_0, ν_0, α_0, β_0}.
13 / 27
Inference
Exact posterior inference from the joint distribution in Eq. (15) is intractable.
However, all the model parameters are conditionally
conjugate [5].
So we develop a Gibbs sampler with closed-form updates.
Substituting Eqs. (4)-(14) into Eq. (15), the sampling distribution
of u_ik can be written as follows (a code sketch of this update
follows the slide):

p(u_{ik} \mid -) \sim \mathcal{N}(u_{ik} \mid \mu^*, \sigma^*),  (16)

\sigma^* = \left( \sigma_{u_k} + \tau \sum_{j \in \Omega_i} v_{jk}^2 \right)^{-1},  (17)

\mu^* = \sigma^* \left[ \sigma_{u_k} \mu_{u_k} + \tau \sum_{j \in \Omega_i} v_{jk} \left( r_{ij} - \Big( \mu + \alpha_i + \beta_j + \sum_{l=1,\, l \neq k}^{K} u_{il} v_{jl} \Big) \right) \right].  (18)
14 / 27
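A sketch of the conditional update in Eqs. (16)-(18) for a single u_ik. The data layout, struct, and function names are illustrative assumptions, not the authors' implementation; Ω_i is the set of items rated by user i.

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

struct Rating { std::size_t j; double r; };   // one observed (item, rating) pair of user i

// Draw u_ik from its full conditional, Eqs. (16)-(18).
double sample_uik(std::size_t i, std::size_t k,
                  const std::vector<Rating>& omega_i,            // Omega_i: ratings of user i
                  const std::vector<std::vector<double>>& U,     // I x K user factors
                  const std::vector<std::vector<double>>& V,     // J x K item factors
                  double mu,
                  const std::vector<double>& alpha,              // user biases
                  const std::vector<double>& beta,               // item biases
                  double mu_uk, double sigma_uk,                 // prior mean and precision of u_.k
                  double tau, std::mt19937& rng) {
    const std::size_t K = U[i].size();
    double prec = sigma_uk;                  // inverse of sigma* in Eq. (17)
    double num  = sigma_uk * mu_uk;          // bracketed term of Eq. (18)
    for (const Rating& obs : omega_i) {
        const std::size_t j = obs.j;
        double rest = mu + alpha[i] + beta[j];           // everything except the u_ik * v_jk term
        for (std::size_t l = 0; l < K; ++l)
            if (l != k) rest += U[i][l] * V[j][l];
        prec += tau * V[j][k] * V[j][k];
        num  += tau * V[j][k] * (obs.r - rest);
    }
    const double var  = 1.0 / prec;          // sigma*, Eq. (17)
    const double mean = var * num;           // mu*,   Eq. (18)
    std::normal_distribution<double> post(mean, std::sqrt(var));
    return post(rng);
}
```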
Inference
Now, directly sampling u_ik from Eq. (16) requires O(K|Ω_i|)
time.
However, precomputing e_ij = r_ij − (µ + α_i + β_j + u_i^T v_j) for all
(i, j) ∈ Ω reduces the complexity to O(|Ω_i|) (see the sketch
after this slide).
We sample the model parameters in parallel using these Gibbs
sampling equations.
Table 1: Complexity comparison.
Method | Time Complexity        | Space Complexity
SBMF   | O(|Ω|K)                | O((I + J)K)
BPMF   | O(|Ω|K^2 + (I + J)K^3) | O((I + J)K)
15 / 27
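A sketch of the residual trick, continuing the previous sketch (same headers and Rating struct, names still illustrative): with e_ij cached, the term r_ij − (µ + α_i + β_j + Σ_{l≠k} u_il v_jl) needed in Eq. (18) is just e_ij + u_ik v_jk, and the cache is patched after u_ik changes, so the update touches only |Ω_i| entries.

```cpp
// O(|Omega_i|) variant using cached residuals
// e_ij = r_ij - (mu + alpha_i + beta_j + u_i^T v_j), stored per observation.
double sample_uik_cached(std::size_t i, std::size_t k,
                         const std::vector<Rating>& omega_i,
                         std::vector<double>& e_i,                // residuals aligned with omega_i
                         std::vector<std::vector<double>>& U,
                         const std::vector<std::vector<double>>& V,
                         double mu_uk, double sigma_uk,
                         double tau, std::mt19937& rng) {
    const double old = U[i][k];
    double prec = sigma_uk;
    double num  = sigma_uk * mu_uk;
    for (std::size_t t = 0; t < omega_i.size(); ++t) {
        const std::size_t j = omega_i[t].j;
        const double vjk = V[j][k];
        // r_ij minus everything except the u_ik * v_jk term (the inner sum in Eq. (18)):
        const double partial = e_i[t] + old * vjk;
        prec += tau * vjk * vjk;
        num  += tau * vjk * partial;
    }
    const double var = 1.0 / prec;
    std::normal_distribution<double> post(var * num, std::sqrt(var));
    const double fresh = post(rng);
    for (std::size_t t = 0; t < omega_i.size(); ++t)              // patch the residual cache
        e_i[t] += (old - fresh) * V[omega_i[t].j][k];
    U[i][k] = fresh;
    return fresh;
}
```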
Datasets
Table 2: Dataset description.
Dataset       | No. of users | No. of movies | No. of ratings
Movielens 10m | 71,567       | 10,681        | 10m
Movielens 20m | 138,493      | 27,278        | 20m
Netflix       | 480,189      | 17,770        | 100m
16 / 27
Baseline
We compare SBMF with BPMF.
A parallel implementation of BPMF is not available.
Hence, we compare the serial version of BPMF with both the
serial and the parallel implementations of SBMF, called SBMF-S
and SBMF-P, respectively.
17 / 27
Experimental Setup
We use Intel Core i5 machines with 16 GB RAM.
BPMF is initialized with its standard parameter settings.
We use 50 burn-in iterations followed by 100 collection
iterations for all the experiments (see the sketch after this slide).
18 / 27
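How burn-in and collection iterations are combined is not detailed in the slides; below is a hypothetical driver loop, assuming a sampler object exposing sweep() and predict_test() (both assumed names, not real API), that discards the first 50 sweeps and averages test-set predictions over the next 100.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical driver: discard burn-in sweeps, then average the test-set
// predictions of each collection sample (a Monte Carlo posterior mean).
template <typename Sampler>
std::vector<double> run_gibbs(Sampler& sampler, std::size_t n_test,
                              int burn_in = 50, int collection = 100) {
    std::vector<double> avg(n_test, 0.0);
    for (int it = 0; it < burn_in; ++it)
        sampler.sweep();                                  // samples discarded
    for (int it = 0; it < collection; ++it) {
        sampler.sweep();
        const std::vector<double> pred = sampler.predict_test();  // r_hat on the test set
        for (std::size_t t = 0; t < n_test; ++t)
            avg[t] += pred[t] / collection;
    }
    return avg;
}
```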
Results
[Figure: panels (a), (b), (c).]
Figure 3: Graphs from left to right show results on the Movielens 10M
dataset for K = 50, 100, and 200.
19 / 27
Results
[Figure: panels (a), (b), (c).]
Figure 4: Graphs from left to right show results on the Movielens 20M
dataset for K = 50, 100, and 200.
20 / 27
Results
[Figure: panels (a), (b), (c).]
Figure 5: Graphs from left to right show results on the Netflix dataset for
K = 50, 100, and 200.
21 / 27
Results
SBMF-P takes much less time than BPMF.
A similar trend exists for SBMF-S (except for K = 50 and 100
on the Netflix dataset).
We believe that for K = 50 and 100 on the Netflix dataset,
BPMF takes less time than SBMF-S because BPMF is implemented
in Matlab, where matrix operations are efficient, whereas
SBMF uses unoptimized C++ code. The effect is compounded
on the Netflix dataset, as its large number of observations
requires more matrix operations.
22 / 27
Results
Table 3 shows a detailed comparison of the results.
Table 3: Results comparison.

Dataset       | Method | RMSE (K=50) | Time, hr (K=50) | RMSE (K=100) | Time, hr (K=100) | RMSE (K=200) | Time, hr (K=200)
Movielens 10m | BPMF   | 0.8629 | 1.317  | 0.8638 | 3.517  | 0.8651 | 22.058
Movielens 10m | SBMF-S | 0.8655 | 1.091  | 0.8667 | 2.316  | 0.8654 | 5.205
Movielens 10m | SBMF-P | 0.8646 | 0.462  | 0.8659 | 0.990  | 0.8657 | 2.214
Movielens 20m | BPMF   | 0.7534 | 2.683  | 0.7513 | 6.761  | 0.7508 | 45.355
Movielens 20m | SBMF-S | 0.7553 | 2.364  | 0.7545 | 5.073  | 0.7549 | 11.378
Movielens 20m | SBMF-P | 0.7553 | 1.142  | 0.7545 | 2.427  | 0.7551 | 5.321
Netflix       | BPMF   | 0.9057 | 11.739 | 0.9021 | 28.797 | 0.8997 | 150.026
Netflix       | SBMF-S | 0.9048 | 17.973 | 0.9028 | 40.287 | 0.9017 | 89.809
Netflix       | SBMF-P | 0.9047 | 7.902  | 0.9026 | 16.477 | 0.9017 | 34.934
23 / 27
Results
For K = 200, SBMF provides a significant speed-up.
The total time difference between both variants of SBMF
and BPMF increases with the dimension of the latent factor
vector.
Both SBMF-P and SBMF-S suffer a small loss in performance
compared to BPMF.
Increasing the latent space dimension reduces the RMSE
on the Netflix dataset.
24 / 27
Conclusion and Future Work
We have developed SBMF.
SBMF has linear time complexity in the target rank and linear
space complexity in the number of observations.
SBMF gives competitive performance in less time than BPMF,
as validated empirically.
BPMF is preferred for lower latent space dimensions, but SBMF
should be used for higher ones.
Future work is to optimize the SBMF code and to apply
SBMF with side-information.
25 / 27
References
[1] Y. Koren, R. Bell, and C. Volinsky, “Matrix factorization techniques for
recommender systems,” Computer, vol. 42, pp. 30–37, Aug. 2009.
[2] M.-A. Sato, “Online model selection based on the variational Bayes,” Neural
Computation, vol. 13, no. 7, pp. 1649–1681, 2001.
[3] R. Salakhutdinov and A. Mnih, “Bayesian probabilistic matrix factorization using
Markov chain Monte Carlo,” in Proc. of ICML, pp. 880–887, 2008.
[4] R. Salakhutdinov and A. Mnih, “Probabilistic matrix factorization,” in Proc. of
NIPS, 2007.
[5] M. Hoffman, D. Blei, C. Wang, and J. Paisley, “Stochastic variational
inference,” JMLR, vol. 14, pp. 1303–1347, May 2013.
26 / 27
Questions
Thanks!
27 / 27
