Abstract
The problem of estimating a matrix based on a set of observed entries is commonly referred to as the matrix completion problem. In this work, we specifically address the scenario of binary observations, often termed 1-bit matrix completion. While numerous studies have explored Bayesian and frequentist methods for real-valued matrix completion, there has been a lack of theoretical exploration of Bayesian approaches to 1-bit matrix completion. We tackle this gap by considering a general, non-uniform sampling scheme and providing theoretical assurances on the efficacy of the fractional posterior. Our contributions include obtaining concentration results for the fractional posterior and demonstrating its effectiveness in recovering the underlying parameter matrix. We accomplish this using two distinct types of prior distributions: low-rank factorization priors and a spectral scaled Student prior, with the latter requiring fewer assumptions. Importantly, our results exhibit an adaptive nature by not mandating prior knowledge of the rank of the parameter matrix. Our findings are comparable to those in the frequentist literature, yet demand fewer restrictive assumptions.
1 Introduction
Matrix completion has been extensively explored in the fields of machine learning and statistics, attracting considerable attention in recent years due to its relevance to various contemporary applications such as recommendation systems (Bobadilla et al., 2013; Koren et al., 2009), including the notable Netflix challenge (Bennett and Lanning, 2007), image processing (Ji et al., 2010; Han et al., 2014), genotype imputation (Chi et al., 2013; Jiang et al., 2016), and quantum statistics (Gross, 2011). Although completing a matrix in general is often deemed infeasible, seminal works by Candès and Tao (2010), Candes and Plan (2010), Candès and Recht (2009) have demonstrated its potential feasibility under the assumption of a low-rank structure. This assumption aligns naturally with practical scenarios, particularly in recommendation systems, where it implies the presence of a limited number of latent features that capture user preferences. Various theoretical and computational approaches to matrix completion have been proposed and investigated, as in Tsybakov et al. (2011), Lim and Teh (2007), Salakhutdinov and Mnih (2008), Recht and Ré (2013), Chatterjee (2015), Mai and Alquier (2015), Alquier and Ridgway (2020), Chen et al. (2019).
The previously mentioned studies primarily focused on matrices with real-numbered elements. However, in many practical situations, the observed elements are often binary, taking values from the set \(\{-1, 1\}\). This type of data is prevalent in diverse contexts, such as voting or rating data, where responses typically involve binary distinctions like “yes/no”, “like/dislike”, or “true/false”. Tackling the challenge of reconstructing a matrix from incomplete binary observations, known as 1-bit matrix completion, was initially investigated in Davenport et al. (2014). Subsequent studies in this field have been conducted by various researchers Cai and Zhou (2013), Klopp et al. (2015), Hsieh et al. (2015), Cottet and Alquier (2018), Herbster et al. (2016), Alquier et al. (2019), most of whom have taken a frequentist approach. However, there remains a gap in the literature concerning the theoretical assessment of Bayesian methodologies in this domain.
In this study, we aim to address this gap by focusing on a generalized Bayesian approach, where we utilize a fractional power of the likelihood. This leads to what is commonly referred to as fractional posteriors or tempered posteriors, as elucidated in Bhattacharya et al. (2019), Alquier and Ridgway (2020). It is noteworthy that generalized Bayesian methods, in which the likelihood is replaced by a fractional power of itself or by a notion of risk, have garnered increased attention in recent years, as demonstrated by various works such as Hammer et al. (2023), Jewson and Rossell (2022), Yonekura and Sugasawa (2023), Mai and Alquier (2017), Matsubara et al. (2022), Medina et al. (2022), Grünwald and Van Ommen (2017), Bissiri et al. (2016), Yang et al. (2020), Lyddon et al. (2019), Syring and Martin (2019), Knoblauch et al. (2022), Mai (2023b), Hong and Martin (2020). It has been observed that employing fractional posteriors can simplify computation for some Bayesian models (Friel & Pettitt, 2008). Moreover, fractional posteriors have demonstrated robustness to model misspecification in comparison to the standard posterior, as evidenced in Grünwald and Van Ommen (2017) and Alquier and Ridgway (2020).
We tackle the 1-bit matrix completion problem by considering a general, non-uniform sampling scheme. While a general sampling scheme for 1-bit matrix completion has also been examined in Klopp et al. (2015), our requirements are less stringent than theirs.
Initially, we present results concerning the employment of a widely used low-rank factorized prior distribution. Such priors have demonstrated practical efficacy, as evidenced in works such as Cottet and Alquier (2018), Lim and Teh (2007), Salakhutdinov and Mnih (2008). However, due to the typically large dimensionalities of matrix completion problems, employing low-rank factorized priors necessitates intricate Markov Chain Monte Carlo (MCMC) adaptations, which can be computationally expensive and lack scalability. Consequently, in practical applications, variational inference is often favored for such priors, as discussed in works like Cottet and Alquier (2018), Lim and Teh (2007), Babacan et al. (2012).
We derive novel results regarding the consistency and concentration properties of the fractional posterior. Specifically, we establish concentration results for the recovering distribution within the \(\alpha\)-Rényi divergence framework. Consequently, as particular instances, we derive concentration outcomes relative to metrics such as the Hellinger metric. Furthermore, we broaden our investigation to establish concentration rates for parameter estimation utilizing specific distance measures such as the Frobenius norm. Our findings are comparable to those in the frequentist literature as documented in Davenport et al. (2014), Cai and Zhou (2013), and Klopp et al. (2015).
In addition to the aforementioned type of prior, we also undertake theoretical examination utilizing a spectral scaled Student prior. This prior, introduced by Dalalyan (2020), shares conceptual similarities with a hierarchical prior discussed in Yang et al. (2018). The spectral scaled Student prior enables posterior sampling through Langevin Monte Carlo, a gradient-based sampling technique that has recently garnered considerable attention in various high-dimensional problems, as observed in Durmus and Moulines (2017), Durmus and Moulines (2019), Dalalyan (2017), Dalalyan and Karagulyan (2019). We demonstrate that by employing this prior, it is possible to achieve concentration results for the fractional posterior without necessitating a boundedness assumption, as is typically required for low-rank factorization priors.
The remainder of the paper is structured as follows. In Sect. 2, we introduce the notations essential for our work and discuss the problem of 1-bit matrix completion. We also present the fractional posterior along with the low-rank factorization prior in this section. Section 3 presents the results pertaining to the low-rank factorization prior, while Sect. 4 is dedicated to the outcomes obtained using the spectral scaled Student prior. All technical proofs are consolidated in Appendix 6. Some concluding remarks are given in Sect. 5.
2 Notations and method
2.1 Notations
For any integer m, let \([m]=\{1,\dots ,m\}\). Given integers m and k, and a matrix \(M \in \mathbb {R}^{m\times k}\), we write \(\Vert M \Vert _\infty : = \max _{(i,j)\in [m]\times [k]} |M_{ij}|\). For a matrix M, its spectral norm is denoted by \(\Vert M \Vert\), its Frobenius norm is denoted by \(\Vert M \Vert _F = \sqrt{\sum _{ij}M^2_{ij}}\), and its nuclear norm is denoted by \(\Vert M \Vert _*\) (the sum of the singular values).
Let \(\alpha \in (0,1)\). The \(\alpha\)-Rényi divergence between two probability distributions Q and R is defined by
$$ D_{\alpha}\left(Q \Vert R\right) = \frac{1}{\alpha -1} \log \int \left( \frac{\mathrm{d}Q}{\mathrm{d}\mu} \right)^{\alpha} \left( \frac{\mathrm{d}R}{\mathrm{d}\mu} \right)^{1-\alpha} \mathrm{d}\mu , $$
where \(\mu\) is any measure such that \(Q \ll \mu\) and \(R\ll \mu\). The Kullback–Leibler divergence is defined by
$$ KL\left(Q \Vert R\right) = \int \log \left( \frac{\mathrm{d}Q}{\mathrm{d}R} \right) \mathrm{d}Q $$
if \(Q \ll R\), and \(KL\left(Q \Vert R\right) = +\infty\) otherwise.
2.2 1-Bit matrix completion
We assume that the observed data \((\omega_1,Y_1), \ldots , (\omega_n,Y_n)\) are i.i.d. (independent and identically distributed) random variables drawn from a joint distribution characterized by a matrix \(M^* \in \mathbb {R}^{d_1\times d_2}\), denoted by \(P_{M^*}\). Additionally, we assume that the indices \((\omega _s)_{s=1}^n \in ([d_1]\times [d_2])^n\) are i.i.d. and denote by \(\Pi\) their marginal distribution. The corresponding observations \((Y_s)_{s=1}^n \in \{-1,+1 \}^n\) are distributed as
$$ \mathbb{P}\left( Y_s = 1 \mid \omega_s \right) = f\left( M^*_{\omega_s} \right) , $$
where f is the logistic link function \(f(x) = \frac{\exp (x)}{1+\exp (x)}\). This model is similar to that of Klopp et al. (2015). The likelihood of the observations is \(L_n (M):= \prod _{i=1}^{n}f(M_{\omega _i})^{1_{\left[ Y_i =1 \right] }} (1-f(M_{\omega _i}))^{1_{\left[ Y_i =-1 \right] }}\).
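To make the model concrete, the following minimal NumPy sketch simulates 1-bit observations and evaluates the log-likelihood \(\log L_n(M)\). The uniform sampling of the indices, the dimensions, and all function names are our own illustrative choices, not part of the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # logistic link f(x) = exp(x) / (1 + exp(x)), in a numerically stable form
    return 1.0 / (1.0 + np.exp(-x))

d1, d2, r, n = 30, 40, 2, 500
# a rank-r ground-truth parameter matrix M*
M_star = rng.normal(size=(d1, r)) @ rng.normal(size=(r, d2))

# index pairs omega_s (drawn uniformly here, for illustration) and 1-bit labels Y_s
rows = rng.integers(0, d1, size=n)
cols = rng.integers(0, d2, size=n)
Y = np.where(rng.random(n) < f(M_star[rows, cols]), 1, -1)

def log_likelihood(M):
    # log L_n(M) = sum_s 1[Y_s=1] log f(M_{omega_s}) + 1[Y_s=-1] log(1 - f(M_{omega_s}))
    p = f(M[rows, cols])
    return np.sum(np.where(Y == 1, np.log(p), np.log1p(-p)))
```

At the zero matrix each entry has probability 1/2, so the log-likelihood equals \(n\log(1/2)\), a convenient sanity check.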
In this study, we operate under the assumption that the rank of \(M^*\), denoted as r, is substantially smaller than its dimensions, specifically \(r \ll \min (d_1, d_2)\). This is a prevalent assumption in 1-bit matrix completion research (Davenport et al., 2014; Cai & Zhou, 2013; Cottet & Alquier, 2018; Klopp et al., 2015; Alquier et al., 2019).
We concentrate on the fractional posterior for \(\alpha \in (0,1)\), as discussed in Bhattacharya et al. (2019); Alquier and Ridgway (2020), which is formulated as follows:
$$ \pi_{n,\alpha} (M) \propto L_n (M)^{\alpha} \, \pi (M) , \quad\quad (1) $$
where \(\pi\) denotes the prior distribution.
In the case \(\alpha =1\), one recovers the traditional posterior distribution.
We define the mean estimator as
$$ \hat{M} = \int M \, \pi_{n,\alpha}(\mathrm{d}M) . \quad\quad (2) $$
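As a toy illustration of how the fractional likelihood \(L_n(M)^\alpha\) reweights candidates, the sketch below normalizes tempered weights over a finite set of candidate matrices; the numerical log-likelihood and log-prior values are purely hypothetical.

```python
import numpy as np

def fractional_posterior_weights(log_lik, log_prior, alpha):
    # pi_{n,alpha}(M_k) is proportional to L_n(M_k)^alpha * pi(M_k);
    # normalise over the candidates in log-space for numerical stability
    logw = alpha * np.asarray(log_lik) + np.asarray(log_prior)
    logw = logw - logw.max()
    w = np.exp(logw)
    return w / w.sum()

# hypothetical log-likelihood / log-prior values for three candidate matrices
log_lik = [-40.0, -42.0, -55.0]
log_prior = [-3.0, -1.0, -1.0]
w = fractional_posterior_weights(log_lik, log_prior, alpha=0.5)
# the mean estimator above would then be sum_k w[k] * M_k
```

Setting \(\alpha=1\) recovers standard posterior weights, while \(\alpha \to 0\) makes the weights follow the prior alone.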
2.2.1 Low-rank factorization prior
In Bayesian matrix completion methodologies (Babacan et al., 2012; Salakhutdinov and Mnih, 2008; Lim & Teh, 2007), a prevalent concept involves decomposing a matrix into two matrices in order to establish a prior distribution on low-rank matrices. It is commonly acknowledged that any matrix with a rank of r can be decomposed as follows: \(M=LR^\top , \, L\in \mathbb {R}^{d_1 \times r}, \, R \in \mathbb {R}^{d_2 \times r}.\) This approach is grounded in the assumption that the underlying matrix \(M^*\) exhibits a low rank, or is at least well approximated by a low-rank matrix.
However, in practical scenarios, the rank of the matrix is typically unknown. Thus, for a fixed \(K \in \{1, \ldots , \min (d_1,d_2) \}\), one can express \(M=LR^\top\) with \(L\in \mathbb {R}^{d_1 \times K}\), \(R \in \mathbb {R}^{d_2 \times K}\). Potential ranks \(r\in [K]\) are then accommodated by shrinking some columns of L and R towards zero. To this end, the reference Cottet and Alquier (2018) considers the following hierarchical model: for \(k \in [K]\),
$$ \gamma_k \sim \pi^{\gamma}, \qquad L_{\cdot ,k} \mid \gamma_k \sim \mathcal{N}\left( 0, \gamma_k I_{d_1} \right), \qquad R_{\cdot ,k} \mid \gamma_k \sim \mathcal{N}\left( 0, \gamma_k I_{d_2} \right) . $$
The prior distribution on the variances \(\pi ^\gamma\) plays a crucial role in controlling the shrinkage of the columns of L and R towards zero. It is common for \(\pi ^\gamma\) to follow an inverse-Gamma distribution (Salakhutdinov and Mnih, 2008). This hierarchical prior distribution bears resemblance to the Bayesian Lasso proposed in Park and Casella (2008), and particularly resembles the Bayesian Group Lasso (Kyung et al., 2010), where the variance term follows a Gamma distribution.
It is worth noting that variational Bayesian inference for this prior was developed in Section 5 of Cottet and Alquier (2018); however, it was not accompanied by a theoretical study.
The paper Cottet and Alquier (2018) shows that the Gamma distribution is also a viable alternative in matrix completion, both for theoretical results and for practical considerations. Thus, all results in this paper are stated under the assumption that \(\pi ^\gamma\) is either a Gamma or an inverse-Gamma distribution: \(\pi ^\gamma = \Gamma (a,b)\) or \(\pi ^\gamma = \Gamma ^{-1}(a,b)\). In this study, we regard a as a fixed constant, while b is a small parameter requiring adjustment.
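A possible way to draw from this hierarchical prior in NumPy is sketched below; the function name, the default values of (a, b), and the treatment of b as a scale parameter are our own illustrative choices (parameterisation conventions for the Gamma family vary).

```python
import numpy as np

def sample_lowrank_prior(d1, d2, K, a=1.0, b=1e-3, inv_gamma=True, rng=None):
    """Draw M = L R^T from the hierarchical low-rank prior (a sketch).

    Each column pair (L[:, k], R[:, k]) shares a variance gamma_k drawn from a
    Gamma(a, b) or inverse-Gamma(a, b) distribution, with b used as a scale here.
    """
    rng = np.random.default_rng() if rng is None else rng
    if inv_gamma:
        # if X ~ Gamma(shape=a, rate=b), then 1/X ~ inverse-Gamma(a, b)
        gamma = 1.0 / rng.gamma(shape=a, scale=1.0 / b, size=K)
    else:
        gamma = rng.gamma(shape=a, scale=b, size=K)
    L = rng.normal(scale=np.sqrt(gamma), size=(d1, K))  # L[:, k] ~ N(0, gamma_k I)
    R = rng.normal(scale=np.sqrt(gamma), size=(d2, K))
    return L @ R.T

M = sample_lowrank_prior(30, 40, K=5, rng=np.random.default_rng(0))
```

By construction a draw has rank at most K, and a small b shrinks the column variances \(\gamma_k\), so most columns of L and R stay close to zero and the draw is approximately low-rank.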
3 Main results
3.1 Assumptions
For \(r\ge 1\) and \(B>0\), we define \(\mathcal {M}(r,B)\) as the set of pairs of matrices \((\bar{U},\bar{V})\), with dimensions \(d_1 \times K\) and \(d_2\times K\) respectively, satisfying \(\Vert \bar{U}\Vert _{\infty } \le B\), \(\Vert \bar{V}\Vert _{\infty } \le B\), \(\bar{U}_{i,\ell }=0\) for \(\ell >r\), and \(\bar{V}_{j,\ell }=0\) for \(\ell >r\). Similar to Cottet and Alquier (2018); Alquier and Ridgway (2020), we make the following assumption on the true parameter matrix.
Assumption 3.1
We assume that \(M^* = \bar{U}\bar{V}^\top\) for \((\bar{U},\bar{V})\in \mathcal {M}(r,B)\).
Assumption 3.2
We assume that there exists a constant \(C_1>0\) such that
$$ \min_{(i,j) \in [d_1]\times [d_2]} \Pi\left( \omega_1 = (i,j) \right) \ge C_1 . $$
Assumption 3.3
We assume that \(\Vert M^*\Vert _\infty \le \kappa < \infty\) and that there exists a constant \(C_\kappa>0\) such that
$$ \sup_{|x| \le \kappa} \frac{ f(x)\left( 1-f(x) \right) }{ \left( f'(x) \right)^{2} } \le C_\kappa , $$
where \(f(x) = e^x/(1+e^x)\).
Assumption 3.1 imposes a boundedness condition on the true matrix \(M^*\) for low-rank factorization priors. Similar boundedness assumptions have been used in Klopp et al. (2015); Cottet and Alquier (2018). In Sect. 4, we demonstrate that this assumption can be relaxed by employing a spectral scaled Student’s distribution prior.
Our framework accommodates a general sampling distribution, which is only required to satisfy Assumption 3.2. Assumption 3.2 guarantees that each coefficient has a non-zero probability of being observed. For instance, with the uniform distribution, we can express it as \(C_1 = 1/(d_1d_2)\). This assumption was initially introduced in Klopp (2014) within the classical unquantized (continuous) matrix completion setting. It is also used for 1-bit matrix completion under a general sampling distribution, as demonstrated in Klopp et al. (2015). It is noteworthy that, unlike in Klopp et al. (2015), we do not assume that no column or row is sampled with excessively high probability.
To derive results concerning the parameter matrix, we need Assumption 3.3, which is essential for obtaining estimation-error bounds in 1-bit matrix completion. It was first introduced in Davenport et al. (2014). While this may be a strong assumption, it has served as a fundamental premise in various prior works, including Cai and Zhou (2013) and Klopp et al. (2015).
3.2 Main results
The following theorem presents the first consistency result for the fractional posterior in 1-bit matrix completion with low-rank factorization Gaussian priors, which are frequently employed in practical applications.
Theorem 3.1
Assume that Assumption 3.1 holds. Then, there is a small enough \(b>0\) such that
$$ \mathbb{E}\left[ \int D_{\alpha}\left( P_{M} \Vert P_{M^*} \right) \pi_{n,\alpha}(\mathrm{d}M) \right] \le \frac{1+\alpha}{1-\alpha}\, \varepsilon_n , $$
where
$$ \varepsilon_n = C_{a,B}\, \frac{ r(d_1+d_2)\log (nd_1d_2) }{ n } $$
for some universal constant \(C_{a,B}\) depending only on a and B. Specifically, the result remains valid for the selection \(b= B^2/[512(nd_1d_2)^4 K^2 \max ^2(d_1,d_2)]\).
Recall that all technical proofs are deferred to Appendix 6. The main argument is based on a general scheme for fractional posteriors derived in Bhattacharya et al. (2019); Alquier and Ridgway (2020).
In practical applications, it is noted that \(b= B^2/[512(nd_1d_2)^4 K^2 \max ^2(d_1,d_2)]\) may not be the best choice; rather, Alquier et al. (2014); Alquier and Ridgway (2020) suggest employing cross-validation to select b. Ensuring a small b is crucial in practical situations to guarantee a reliable approximation of low-rank matrices (Alquier et al., 2014; Alquier & Ridgway, 2020).
The next theorem introduces a completely novel concentration result for the fractional posterior in 1-bit matrix completion.
Theorem 3.2
Assume that Assumption 3.1 holds. Then, for a sufficiently small \(b> 0\), such as \(b = \frac{B^2}{512(nd_1d_2)^4 K^2 \max ^2(d_1,d_2)}\), it holds that
$$ \mathbb{P}\left[ \int D_{\alpha}\left( P_{M} \Vert P_{M^*} \right) \pi_{n,\alpha}(\mathrm{d}M) \le \frac{2(\alpha +1)}{1-\alpha}\, \varepsilon_n \right] \ge 1- \frac{2}{n\varepsilon_n} , $$
where
$$ \varepsilon_n = C_{a,B}\, \frac{ r(d_1+d_2)\log (nd_1d_2) }{ n } , $$
for some universal constant \(C_{a,B}\) depending only on a and B.
According to Theorem 3.2, there is a probability at least \(1-2/[C_{a,B}r(d_1+d_2)\log (nd_1d_2)]\) that the fractional posterior \(\pi _{n,\alpha }\) will concentrate around the true model at the rate \(\varepsilon _n\), measured by the \(\alpha\)-Rényi divergence.
Remark 1
It is important to note that our results are formulated without prior knowledge of r, the rank of the true underlying parameter matrix. This aspect highlights the adaptive nature of our results, indicating their ability to adjust and perform effectively regardless of the specific rank of the true underlying parameter matrix.
Put
$$ H^2 (P,Q) = \frac{1}{2} \int \left( \sqrt{\mathrm{d}P} - \sqrt{\mathrm{d}Q} \right)^{2} , $$
the squared Hellinger distance between two probability distributions P and Q.
Utilizing the findings from Van Erven and Harremos (2014) on the relationship between the Hellinger distance and the \(\alpha\)-Rényi divergence, we derive the following results. This enables us to draw comparisons with frequentist literature, including works such as Klopp et al. (2015) and Davenport et al. (2014).
Corollary 1
As a special case, Theorem 3.2 leads to a concentration result in terms of the classical Hellinger distance: for \(\alpha \in [1/2, 1)\), with probability at least \(1- 2/(n\varepsilon_n)\),
$$ \int H^2 \left( P_{M}, P_{M^*} \right) \pi_{n,\alpha}(\mathrm{d}M) \le \frac{2(\alpha +1)}{1-\alpha}\, \varepsilon_n . \quad\quad (3) $$
Remark 2
The rate specified in (3), of the order \(r(d_1+d_2)\log (n)/n\), resembles that observed in prior studies within the frequentist literature, such as Klopp et al. (2015), when examining a general sampling framework. However, we only require the minimal boundedness condition stated in Assumption 3.1, and we obtain results with respect to the joint distribution \(P_M\) rather than f(M), the conditional density.
To elaborate further, Theorem 1 and Lemma 9 in Klopp et al. (2015) address the recovery of the distribution f(M) with respect to the Hellinger distance. However, they necessitate stricter assumptions, such as boundedness of f(M) and its derivatives, as well as the requirement that no column or row is sampled with too high a probability (see Assumptions H1 and H3 in Klopp et al. (2015)). In comparison, the results in Klopp et al. (2015) exhibit a faster rate than those in Davenport et al. (2014), whose rate is of order \(\sqrt{r(d_1+d_2)\log (\max (d_1,d_2))/n}\).
The squared Hellinger metric and the \(\alpha\)-Rényi divergence measure closeness between distributions only, and thus by themselves do not yield any claim about the closeness of M and \(M^*\) in a Euclidean-type distance. However, by leveraging Assumption 3.2 and Assumption 3.3, such results can be attained. We now proceed to state our primary results concerning the recovery of the parameter matrix.
Theorem 3.3
Under the same assumptions as in Theorem 3.2, and additionally assuming that Assumptions 3.2 and 3.3 hold, we have, with probability at least \(1- 2/(n\varepsilon_n)\), that
$$ \int \frac{\Vert M - M^*\Vert_F^2}{d_1 d_2} \, \pi_{n,\alpha}(\mathrm{d}M) \le \frac{c\, C_\kappa}{C_1 d_1 d_2} \cdot \frac{2(\alpha +1)}{1-\alpha}\, \varepsilon_n \quad\quad (4) $$
and
$$ \frac{\Vert \hat{M} - M^*\Vert_F^2}{d_1 d_2} \le \frac{c\, C_\kappa}{C_1 d_1 d_2} \cdot \frac{2(\alpha +1)}{1-\alpha}\, \varepsilon_n \quad\quad (5) $$
for a universal constant \(c>0\).
Theorem 3.3 states that, with probability at least \(1-2/[C_{a,B}r(d_1+d_2)\log (nd_1d_2)]\), the estimation error, in Frobenius norm, of a parameter matrix drawn from the fractional posterior \(\pi _{n,\alpha }\) is of order \(\varepsilon _n\).
The result stated in (5) is achieved by applying Jensen’s inequality to the mean. By employing similar methods, one can readily derive outcomes for other estimators derived from the fractional posterior, such as the posterior median, drawing upon insights provided in Merkle (2005).
Remark 3
Up to a logarithmic factor, the error rate for the mean estimator in the squared Frobenius norm, given in (5), is of order \(r(d_1+d_2)/n\) which is minimax-optimal according to Theorem 3 in Klopp et al. (2015). Compared to the results for estimation error in Corollary 2 of Klopp et al. (2015), our result is obtained with a better probability. Specifically, Corollary 2 in Klopp et al. (2015) is stated with a probability of at least \(1-3/(d_1+d_2)\), whereas our result in (5) holds with a probability of at least \(1-2/[C_{a,B}r(d_1+d_2)\log (nd_1d_2)]\).
Remark 4
Under the uniform sampling assumption and the condition \(\Vert M \Vert _{\infty }\le \gamma\), Theorem 1 in Davenport et al. (2014) presented results of order \(\sqrt{r(d_1+d_2)/n }\). A similar result using max-norm minimization was also obtained in Cai and Zhou (2013). The paper Klopp et al. (2015) proves an estimation error rate similar to ours, but with a different logarithmic term: \(r(d_1+d_2)\log (d_1+d_2)/n\).
A comparable result to Klopp et al. (2015) is also established in Alquier et al. (2019), but under a uniform sampling assumption. Subsequently, this rate has been recently enhanced to \(r(d_1+d_2)/n\), without the presence of a logarithmic term, in Alaya and Klopp (2019) (refer to Theorem 7). Consequently, the work presented in Alaya and Klopp (2019) attains the precise minimax estimation rate of convergence for 1-bit matrix completion.
Remark 5
It is noteworthy that our findings are established within a general sampling framework. In contrast to the requirements set forth in Klopp et al. (2015), our approach necessitates only that the probability of observing any entry is strictly positive, without imposing additional assumptions such as that no column or row is sampled with too high a probability. This aspect further enhances the robustness of employing a fractional posterior.
Remark 6
As previously noted, Assumption 3.3 is crucial for deriving results on estimation error in our analysis, as well as in prior work within the frequentist literature Davenport et al. (2014), Cai and Zhou (2013), Klopp et al. (2015). Although this assumption may be stringent, relaxing it presents an intriguing avenue for future research.
4 Results with a spectral scaled Student prior
We have opted to initially present results in Sect. 3 with factorization-type priors, as they are widely favored in the matrix completion literature for utilization with MCMC or Variational Bayes (VB) methods. However, the spectral scaled Student prior has garnered particular interest due to its promising outcomes, whether employed with VB (Yang et al., 2018) or with Langevin Monte Carlo, a gradient-based sampling method (Dalalyan, 2020). This prior has previously been applied in different problems involving matrix parameters (Mai, 2023a, b).
With \(\tau>0\), we consider the following spectral scaled Student prior, given as
$$ \pi_{st} (M) \propto \det\left( \tau^2 I_{d_1} + M M^\top \right)^{-(d_1+d_2+2)/2} . \quad\quad (6) $$
This prior possesses the capability to introduce approximate low-rankness in matrices M. This is evident from the fact that \(\pi _{st} (M) \propto \prod _{j=1}^{d_1} (\tau ^2 + s_j(M)^2 )^{- (d_1+d_2+2)/2 },\) where \(s_j(M)\) represents the \(j^{th}\) largest singular value of M. Consequently, the distribution follows a scaled Student’s t-distribution evaluated at \(s_j(M)\), which induces approximate sparsity on \(s_j(M)\), as discussed in Dalalyan and Tsybakov (2012b, 2012a). Thus, under this prior distribution, the majority of \(s_j(M)\) tend to be close to 0, suggesting that M is approximately low-rank.
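Because Langevin Monte Carlo only requires the gradient of the log-density, this prior is convenient to sample from. Below is a sketch, under our own naming, of the log-prior (up to an additive constant) and its gradient \(-(d_1+d_2+2)(\tau^2 I + MM^\top)^{-1}M\), together with one unadjusted Langevin step; it is an illustration, not the paper's implementation.

```python
import numpy as np

def log_prior_st(M, tau):
    # log pi_st(M) = -((d1 + d2 + 2)/2) * sum_j log(tau^2 + s_j(M)^2) + const
    # (factors with s_j = 0 are constant in M and dropped here)
    d1, d2 = M.shape
    s = np.linalg.svd(M, compute_uv=False)
    return -0.5 * (d1 + d2 + 2) * np.sum(np.log(tau**2 + s**2))

def grad_log_prior_st(M, tau):
    # gradient of the above: -(d1 + d2 + 2) * (tau^2 I + M M^T)^{-1} M
    d1, d2 = M.shape
    A = tau**2 * np.eye(d1) + M @ M.T
    return -(d1 + d2 + 2) * np.linalg.solve(A, M)

def langevin_step(M, grad_log_post, h, rng):
    # one unadjusted Langevin move: M <- M + h * grad + sqrt(2h) * Gaussian noise
    return M + h * grad_log_post(M) + np.sqrt(2.0 * h) * rng.normal(size=M.shape)
```

In a full sampler, `grad_log_post` would add \(\alpha\) times the gradient of the log-likelihood to `grad_log_prior_st`, targeting the fractional posterior.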
We now present a consistency result using the spectral scaled Student prior.
Theorem 4.1
For \(\tau = 1/n\), we have that
$$ \mathbb{E}\left[ \int D_{\alpha}\left( P_{M} \Vert P_{M^*} \right) \pi_{n,\alpha}(\mathrm{d}M) \right] \le \frac{1+\alpha}{1-\alpha}\, \varepsilon_n , $$
where
$$ \varepsilon_n = C\, \frac{ r (d_1+d_2+2) \log \left( 1+ \frac{ n \Vert M^*\Vert_F }{ \sqrt{r} } \right) }{ n } , $$
with \(r = \textrm{rank}(M^*)\), for some universal constant \(C>0\).
The proofs of this section can be found in Appendix 6.2. It is noted that in the rate \(\varepsilon _n\) outlined in Theorem 4.1 and Theorem 4.2 below, the condition \(r = \textrm{rank} (M^* ) \ne 0\) is not necessary. This is because we interpret \(0\log (1+0/0)\) as 0 in the scenario where \(r = 0\) and \(M^* = 0\).
The next theorem presents a concentration result for the fractional posterior.
Theorem 4.2
For \(\tau = 1/n\), we have that
$$ \mathbb{P}\left[ \int D_{\alpha}\left( P_{M} \Vert P_{M^*} \right) \pi_{n,\alpha}(\mathrm{d}M) \le \frac{2(\alpha +1)}{1-\alpha}\, \varepsilon_n \right] \ge 1- \frac{2}{n\varepsilon_n} , $$
where \(\varepsilon_n\) is given in Theorem 4.1.
Remark 7
We do not assert that \(\tau = 1/n\), in both Theorem 4.1 and 4.2, represents the optimal selection. In practical applications, users can utilize cross-validation to fine-tune the value of \(\tau\).
Remark 8
It is interesting to observe that by utilizing the spectral scaled Student prior described in (6), we are not required to impose a boundedness assumption on \(M^*\), as was necessary in the previous section with low-rank factorized priors or in other previous works such as Klopp et al. (2015); Alquier and Ridgway (2020). Furthermore, the additional logarithmic factor in Theorem 4.1 and Theorem 4.2 can be further simplified. This can be achieved by employing the inequality \(\Vert M^* \Vert _F \le \Vert M^* \Vert \sqrt{r}\), resulting in \(\log (1+ n\Vert M^* \Vert )\).
Similar to Theorem 3.3, with the inclusion of additional assumptions, we can derive concentration results for recovering the underlying matrix parameter as well as results for the mean estimator defined in (2).
Theorem 4.3
Under the same assumptions as in Theorem 4.2, and additionally assuming that Assumptions 3.2 and 3.3 hold, we have, with probability at least \(1-2/(n\varepsilon_n)\), that
$$ \int \frac{\Vert M - M^*\Vert_F^2}{d_1 d_2} \, \pi_{n,\alpha}(\mathrm{d}M) \le \frac{c\, C_\kappa}{C_1 d_1 d_2} \cdot \frac{2(\alpha +1)}{1-\alpha}\, \varepsilon_n $$
and
$$ \frac{\Vert \hat{M} - M^*\Vert_F^2}{d_1 d_2} \le \frac{c\, C_\kappa}{C_1 d_1 d_2} \cdot \frac{2(\alpha +1)}{1-\alpha}\, \varepsilon_n $$
for a universal constant \(c>0\), where \(\varepsilon_n\) is given in Theorem 4.1.
Remark 9
Similar to the outcomes detailed in Sect. 3, the results presented in this section for the spectral scaled Student prior do not necessitate prior knowledge of r, the rank of the true underlying parameter matrix. This underscores the adaptive nature of our results, demonstrating their capacity to adjust and perform effectively, regardless of the rank of the true underlying parameter matrix.
5 Concluding remarks
This paper presents an in-depth theoretical examination of Bayesian 1-bit matrix completion, addressing a gap in the machine learning literature. Our study considers a general, non-uniform sampling scheme and offers theoretical assurances for the effectiveness of the fractional posterior. We derive concentration results for the fractional posterior and validate its ability to recover the true parameter matrix. Our approach utilizes two types of priors: low-rank factorization priors and a spectral scaled Student’s t-distribution prior, with the latter needing fewer assumptions. Crucially, our results adapt to the rank of the matrix without requiring it to be known in advance. Our findings match those in the frequentist literature, but with fewer restrictive assumptions.
While our work yields promising theoretical results for Bayesian 1-bit matrix completion, there remain several avenues for future research. One potential extension involves integrating additional covariate information into our methodology. Another critical area necessitating further investigation in practical applications is the tuning of the learning rate \(\alpha\). Although cross-validation can be employed for this purpose, it incurs significant computational costs. The optimal tuning of this parameter presents a challenging problem in practice and constitutes an open research question that has garnered considerable attention within the framework of generalized Bayesian inference, as highlighted by Wu and Martin (2023) and related literature. While our study considers a general sampling setting, it has certain limitations: with a fairly high probability, some entries may be sampled multiple times. It would be more practical to assume that entries are sampled without replacement. Future research should address this limitation.
References
Alquier, P., Cottet, V., Chopin, N., Rousseau, J. (2014). Bayesian matrix completion: prior specification and consistency. arXiv preprint arXiv:1406.1440.
Alquier, P., Cottet, V., & Lecué, G. (2019). Estimation bounds and sharp oracle inequalities of regularized procedures with Lipschitz loss functions. The Annals of Statistics, 47(4), 2117–2144.
Alquier, P., & Ridgway, J. (2020). Concentration of tempered posteriors and of their variational approximations. The Annals of Statistics, 48(3), 1475–1497.
Babacan, S. D., Luessi, M., Molina, R., & Katsaggelos, A. K. (2012). Sparse Bayesian methods for low-rank matrix estimation. IEEE Transactions on Signal Processing, 60(8), 3964–3977.
Bennett, J., & Lanning, S. (2007). The Netflix prize. In Proceedings of KDD Cup and Workshop, vol. 2007, p. 35.
Bhattacharya, A., Pati, D., & Yang, Y. (2019). Bayesian fractional posteriors. Annals of Statistics, 47(1), 39–66.
Bissiri, P. G., Holmes, C. C., & Walker, S. G. (2016). A general framework for updating belief distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5), 1103–1130.
Bobadilla, J., Ortega, F., Hernando, A., & Gutiérrez, A. (2013). Recommender systems survey. Knowledge-Based Systems, 46, 109–132.
Cai, T., & Zhou, W.-X. (2013). A max-norm constrained minimization approach to 1-bit matrix completion. Journal of Machine Learning Research, 14(1), 3619–3647.
Candes, E. J., & Plan, Y. (2010). Matrix completion with noise. Proceedings of the IEEE, 98(6), 925–936.
Candès, E. J., & Recht, B. (2009). Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6), 717–772. https://doi.org/10.1007/s10208-009-9045-5
Candès, E. J., & Tao, T. (2010). The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5), 2053–2080.
Chatterjee, S. (2015). Matrix estimation by universal singular value thresholding. The Annals of Statistics, 43(1), 177–214.
Chen, Y., Fan, J., Ma, C., & Yan, Y. (2019). Inference and uncertainty quantification for noisy matrix completion. Proceedings of the National Academy of Sciences, 116(46), 22931–22937.
Chi, E. C., Zhou, H., Chen, G. K., Del Vecchyo, D. O., & Lange, K. (2013). Genotype imputation via matrix completion. Genome research, 23(3), 509–518.
Cottet, V., & Alquier, P. (2018). 1-bit matrix completion: PAC-Bayesian analysis of a variational approximation. Machine Learning, 107(3), 579–603.
Dalalyan, A. S. (2017). Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B Statistical Methodology, 79(3), 651–676.
Dalalyan, A. S. (2020). Exponential weights in multivariate regression and a low-rankness favoring prior. Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, 56(2), 1465–1483.
Dalalyan, A. S., & Karagulyan, A. (2019). User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. Stochastic Processes and their Applications, 129(12), 5278–5311.
Dalalyan, A. S., & Tsybakov, A. (2012). Mirror averaging with sparsity priors. Bernoulli, 18(3), 914–944.
Dalalyan, A. S., & Tsybakov, A. B. (2012). Sparse regression learning by aggregation and Langevin Monte-Carlo. Journal of Computer and System Sciences, 78(5), 1423–1443.
Davenport, M. A., Plan, Y., Van Den Berg, E., & Wootters, M. (2014). 1-Bit matrix completion. Information and Inference: A Journal of the IMA, 3(3), 189–223.
Durmus, A., & Moulines, E. (2017). Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. The Annals of Applied Probability, 27(3), 1551–1587.
Durmus, A., & Moulines, E. (2019). High-dimensional Bayesian inference via the unadjusted Langevin algorithm. Bernoulli, 25(4A), 2854–2882.
Friel, N., & Pettitt, A. N. (2008). Marginal likelihood estimation via power posteriors. Journal of the Royal Statistical Society Series B: Statistical Methodology, 70(3), 589–607.
Gross, D. (2011). Recovering low-rank matrices from few coefficients in any basis. IEEE Transactions on Information Theory, 57(3), 1548–1566.
Grünwald, P., & Van Ommen, T. (2017). Inconsistency of Bayesian inference for misspecified linear models, and a proposal for repairing it. Bayesian Analysis, 12(4), 1069–1103.
Hammer, H. L., Riegler, M. A., & Tjelmeland, H. (2023). Approximate Bayesian inference based on expected evaluation. Bayesian Analysis, 19(3), 677–698.
Han, X., Wu, J., Wang, L., Chen, Y., Senhadji, L., & Shu, H. (2014). Linear total variation approximate regularized nuclear norm optimization for matrix completion. Abstract and Applied Analysis, 2014. Hindawi.
Herbster, M., Pasteris, S., & Pontil, M. (2016). Mistake bounds for binary matrix completion. Advances in Neural Information Processing Systems, 29.
Hong, L., & Martin, R. (2020). Model misspecification, Bayesian versus credibility estimation, and Gibbs posteriors. Scandinavian Actuarial Journal, 2020(7), 634–649.
Hsieh, C.-J., Natarajan, N., & Dhillon, I. (2015). PU learning for matrix completion. In International Conference on Machine Learning (pp. 2445–2453). PMLR.
Jewson, J., & Rossell, D. (2022). General Bayesian loss function selection and the use of improper models. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(5), 1640–1665.
Ji, H., Liu, C., Shen, Z., & Xu, Y. (2010). Robust video denoising using low rank matrix completion. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (pp. 1791–1798). IEEE.
Jiang, B., Ma, S., Causey, J., Qiao, L., Hardin, M. P., Bitts, I., Johnson, D., Zhang, S., & Huang, X. (2016). Sparrec: An effective matrix completion framework of missing data imputation for gwas. Scientific reports, 6(1), 35534.
Klopp, O. (2014). Noisy low-rank matrix completion with general sampling distribution. Bernoulli, 20(1), 282–303. https://doi.org/10.3150/12-BEJ486
Klopp, O., Lafond, J., Moulines, É., & Salmon, J. (2015). Adaptive multinomial matrix completion. Electronic Journal of Statistics, 9, 2950–2975.
Knoblauch, J., Jewson, J., & Damoulas, T. (2022). An optimization-centric view on Bayes' rule: Reviewing and generalizing variational inference. Journal of Machine Learning Research, 23(132), 1–109.
Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix factorization techniques for recommender systems. Computer, 42(8), 30–37.
Kyung, M., Gill, J., Ghosh, M., & Casella, G. (2010). Penalized regression, standard errors, and Bayesian lassos. Bayesian Analysis, 5(2), 369–412.
Lim, Y. J., & Teh, Y. W. (2007). Variational Bayesian approach to movie rating prediction. Proceedings of KDD Cup and Workshop, 7, 15–21.
Lyddon, S. P., Holmes, C., & Walker, S. (2019). General Bayesian updating and the loss-likelihood bootstrap. Biometrika, 106(2), 465–478.
Mai, T. T. (2023). From bilinear regression to inductive matrix completion: A Quasi-Bayesian analysis. Entropy, 25(2), 333.
Mai, T. T. (2023). A reduced-rank approach to predicting multiple binary responses through machine learning. Statistics and Computing, 33(6), 136.
Mai, T. T., & Alquier, P. (2015). A Bayesian approach for noisy matrix completion: Optimal rate under general sampling distribution. Electronic Journal of Statistics, 9(1), 823–841.
Mai, T. T., & Alquier, P. (2017). Pseudo-Bayesian quantum tomography with rank-adaptation. Journal of Statistical Planning and Inference, 184, 62–76.
Matsubara, T., Knoblauch, J., Briol, F.-X., & Oates, C. J. (2022). Robust generalised Bayesian inference for intractable likelihoods. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(3), 997–1022.
Medina, M. A., Olea, J. L. M., Rush, C., & Velez, A. (2022). On the robustness to misspecification of \(\alpha\)-posteriors and their variational approximations. Journal of Machine Learning Research, 23(147), 1–51.
Merkle, M. (2005). Jensen's inequality for medians. Statistics & Probability Letters, 71(3), 277–281.
Park, T., & Casella, G. (2008). The Bayesian lasso. Journal of the American Statistical Association, 103(482), 681–686.
Recht, B., & Ré, C. (2013). Parallel stochastic gradient algorithms for large-scale matrix completion. Mathematical Programming Computation, 5(2), 201–226.
Salakhutdinov, R., & Mnih, A. (2008). Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of the 25th International Conference on Machine Learning (pp. 880–887). ACM.
Syring, N., & Martin, R. (2019). Calibrating general posterior credible regions. Biometrika, 106(2), 479–486.
Tsybakov, A. B., Koltchinskii, V., & Lounici, K. (2011). Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. Annals of Statistics, 39(5), 2302–2329.
Van Erven, T., & Harremos, P. (2014). Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 60(7), 3797–3820.
Wu, P.-S., & Martin, R. (2023). A comparison of learning rate selection methods in generalized Bayesian inference. Bayesian Analysis, 18(1), 105–132.
Yang, L., Fang, J., Duan, H., Li, H., & Zeng, B. (2018). Fast low-rank Bayesian matrix completion with hierarchical gaussian prior models. IEEE Transactions on Signal Processing, 66(11), 2804–2817.
Yang, Y., Pati, D., & Bhattacharya, A. (2020). \(\alpha\)-variational inference with statistical guarantees. Annals of Statistics, 48(2), 886–905.
Yonekura, S., & Sugasawa, S. (2023). Adaptation of the tuning parameter in general bayesian inference with robust divergence. Statistics and Computing, 33(2), 39.
Acknowledgements
The author was supported by the Norwegian Research Council, grant number 309960, through the Centre for Geophysical Forecasting at NTNU. The author expresses gratitude to three anonymous reviewers who generously reviewed the earlier version of this paper, providing valuable suggestions and insightful comments that significantly improved its presentation.
Funding
Open access funding provided by NTNU Norwegian University of Science and Technology (incl St. Olavs Hospital - Trondheim University Hospital).
Ethics declarations
Conflict of interest
The author declares no potential conflict of interest.
Additional information
Editors: Kee-Eung Kim, Shou-De Lin.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Proofs
1.1 Proofs for Sect. 3
Proof of Theorem 3.1
As the logistic loss is 1-Lipschitz, the log-likelihood satisfies that \(\left| \log f(x)-\log f(y) \right| \le |x-y|\).
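This Lipschitz property can be verified directly. For the logistic function \(f(x)=1/(1+e^{-x})\),

```latex
\frac{\mathrm{d}}{\mathrm{d}x}\log f(x)
  = \frac{e^{-x}}{1+e^{-x}} = 1-f(x) \in (0,1),
\qquad
\frac{\mathrm{d}}{\mathrm{d}x}\log\bigl(1-f(x)\bigr)
  = -f(x) \in (-1,0),
```

so by the mean value theorem \(|\log f(x)-\log f(y)|\le |x-y|\), and the same bound holds for \(\log(1-f)\).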
One has that
where \(\Pi _{ij}\le 1\) is the probability of observing the (i, j)-th entry. For any (U, V) in the support of \(\rho _n\), given in (13), one has that
Therefore,
For \(\delta =B/[8(nd_1 d_2)^2]\) that satisfies \(0<\delta <B\), we have that
and
Now, from Lemma 1, we have that
We now can apply Theorem 2.6 in Alquier and Ridgway (2020) with \(\rho _n\) in (13) and
to obtain the result. The proof is completed.
Proof of Theorem 3.2
As the logistic loss is 1-Lipschitz, the log-likelihood satisfies that \(\left| \log f(x)-\log f(y) \right| \le |x-y|\). Thus, we can deduce that
where \(\Pi _{ij}\le 1\) is the probability of observing the (i, j)-th entry. From (9), we have that
and
For any (U, V) in the support of \(\rho _n\) given in (13), taking \(\delta =\frac{B}{8(n d_1 d_2)^2}\), which satisfies \(0< \delta < B\), and using equation (10), we can deduce that
and
Now, from Lemma 1, we have that
We now can apply Corollary 2.5 and Theorem 2.4 in Alquier and Ridgway (2020) with \(\rho _n\) in (13) and
to obtain the result. The proof is completed.
Proof of Corollary 1
From Van Erven and Harremos (2014), we have that
for \(\alpha \in [0.5,1)\). In addition, we also have that
for \(\alpha \in (0, 0.5)\).
Thus, using the definition of \(c_\alpha\) and Theorem 3.2, we obtain the results.
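For the reader's convenience, the facts from Van Erven and Harremos (2014) presumably invoked here are the monotonicity of the Rényi divergence \(D_\alpha\) in \(\alpha\), its skew-symmetry, and its order-\(1/2\) relation to the squared Hellinger distance \(H^2(P,Q)=\int (\sqrt{\mathrm{d}P}-\sqrt{\mathrm{d}Q})^2\); a sketch under these assumptions:

```latex
% For \alpha \in [1/2, 1): monotonicity in \alpha and -2\log(1-x/2) \ge x give
H^2(P,Q) \;\le\; -2\log\Bigl(1-\tfrac{H^2(P,Q)}{2}\Bigr)
         \;=\; D_{1/2}(P\Vert Q) \;\le\; D_\alpha(P\Vert Q).
% For \alpha \in (0, 1/2): skew-symmetry (1-\alpha)D_\alpha(P\Vert Q)=\alpha D_{1-\alpha}(Q\Vert P) yields
D_\alpha(P\Vert Q) \;=\; \tfrac{\alpha}{1-\alpha}\,D_{1-\alpha}(Q\Vert P)
                   \;\ge\; \tfrac{\alpha}{1-\alpha}\,H^2(P,Q).
```

This is consistent with a constant \(c_\alpha\) that degrades as \(\alpha \rightarrow 0\), as in the two regimes above.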
Proof of Theorem 3.3
From (3), we have that
from Lemma 2, one has that
thus, we obtain (4). To obtain (5), one can apply Jensen's inequality for a convex function, that
This completes the proof.
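The Jensen step can be sketched as follows, assuming the point estimator is the mean of the fractional posterior, \(\hat{M} := \int M \,\hat{\rho}_{n,\alpha}(\mathrm{d}M)\) (the notation here is illustrative): for any convex function \(\varphi\),

```latex
\varphi(\hat{M})
 = \varphi\Bigl(\int M \,\hat{\rho}_{n,\alpha}(\mathrm{d}M)\Bigr)
 \;\le\; \int \varphi(M)\,\hat{\rho}_{n,\alpha}(\mathrm{d}M),
```

so a concentration bound on the posterior integral, as in (4), transfers to the point estimator, as in (5).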
1.2 Proofs for Sect. 4
Proof of Theorem 4.1
From (9), we have that
When integrating with respect to \(\rho _n:= \rho _{0}\) given in (14), we have that
where we have used Hölder's inequality and Lemma 3 to obtain the result. Now, from Lemma 4, we have that
Taking \(\tau = 1/n\), we obtain that
We now can apply Theorem 2.6 in Alquier and Ridgway (2020) with
to obtain the result. The proof is completed.
Proof of Theorem 4.2
From (9), we have that
When integrating with respect to \(\rho _n:= \rho _{0}\) given in (14), and from (12), we have that
Now, from Lemma 4, we have that
Moreover, from (11), one has that
and when integrating with respect to \(\rho _n:= \rho _{0}\) given in (14), it leads to
where we have used a change of variable and Lemma 3 to obtain the result.
Now, by taking \(\tau = 1/n\), we obtain that
We now can apply Theorem 2.4 and Corollary 2.5 in Alquier and Ridgway (2020) with
to obtain the result. The proof is completed.
Proof of Theorem 4.3
From Theorem 4.2, using a bound for Hellinger distance as in Corollary 1, we have that
from Lemma 2, it follows that
thus, we obtain (7). To obtain (8), one can apply Jensen's inequality for a convex function, that
and combine it with the result in (7). This completes the proof.
1.3 Lemma
Definition 1
Fix \(B>0\), \(r\ge 1\). For any pair \((\bar{U},\bar{V})\in \mathcal {M}(r,B)\), we define, for \(\delta \in (0,B)\) to be chosen later,
Lemma 1
Put \(C_a:= \log (8\sqrt{\pi }\Gamma (a)2^{10a+1})+3\). For \(\delta =B/[8(nd_1 d_2)^2]\) that satisfies \(0<\delta <B\), and with \(b = B^2/[512(nd_1d_2)^4 K^2 \max ^2(d_1,d_2)]\), we have for \(\rho _n\) in (13) that
Proof of Lemma 1
This result can be found, for example, in the proof of Theorem 4.1 in Alquier and Ridgway (2020).
Lemma 2
For any matrix \(A \in \mathbb {R}^{d_1 \times d_2}\) and \(B \in \mathbb {R}^{d_1 \times d_2}\) satisfying that \(\Vert A\Vert _\infty \le \kappa\) and \(\Vert B\Vert _\infty \le \kappa\), under Assumption 3.3 and Assumption 3.2, one has that
Proof of Lemma 2
This is Lemma A.2 in Davenport et al. (2014). With \(d^2_H (p,q):= (\sqrt{p} -\sqrt{q} )^2 + (\sqrt{1-p} -\sqrt{1-q} )^2\) for two numbers \(p,q \in [0,1]\), it is worth noting that under Assumption 3.2,
where \(\Pi _{ij}\) is the probability of observing the (i, j)-th entry. Now, from Lemma A.2 in Davenport et al. (2014), under Assumption 3.3, one has that
The argument is also similar to Lemma 9 and Lemma 11 in Klopp et al. (2015). This completes the proof.
Finally, we will frequently use the following distributions, defined as translations of the prior \(\pi _{st}\) in (6). We introduce the following notation.
Definition 2
Let us define
The following technical lemmas will be useful in the proofs.
Lemma 3
(Lemma 1 in Dalalyan (2020)) We have
Lemma 4
(Lemma 2 in Dalalyan (2020)) We have
with the convention \(0\log (1+0/0)=0\).
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Mai, T.T. Concentration properties of fractional posterior in 1-bit matrix completion. Mach Learn 114, 7 (2025). https://doi.org/10.1007/s10994-024-06691-z