1 Introduction

Matrix completion has been extensively explored in the fields of machine learning and statistics, attracting considerable attention in recent years due to its relevance to various contemporary applications such as recommendation systems (Bobadilla et al., 2013; Koren et al., 2009), including the notable Netflix challenge (Bennett and Lanning, 2007), image processing (Ji et al., 2010; Han et al., 2014), genotype imputation (Chi et al., 2013; Jiang et al., 2016), and quantum statistics (Gross, 2011). Although completing a matrix in general is often deemed infeasible, seminal works by Candès and Tao (2010), Candès and Plan (2010), Candès and Recht (2009) have demonstrated its potential feasibility under the assumption of a low-rank structure. This assumption aligns naturally with practical scenarios, particularly in recommendation systems, where it implies the presence of a limited number of latent features that capture user preferences. Various theoretical and computational approaches to matrix completion have been proposed and investigated, as in Tsybakov et al. (2011), Lim and Teh (2007), Salakhutdinov and Mnih (2008), Recht and Ré (2013), Chatterjee (2015), Mai and Alquier (2015), Alquier and Ridgway (2020), Chen et al. (2019).

The previously mentioned studies primarily focused on matrices with real-valued entries. However, in many practical situations, the observed entries are binary, taking values in the set \(\{-1, 1\}\). This type of data is prevalent in diverse contexts, such as voting or rating data, where responses typically involve binary distinctions like “yes/no”, “like/dislike”, or “true/false”. The challenge of reconstructing a matrix from incomplete binary observations, known as 1-bit matrix completion, was initially investigated in Davenport et al. (2014). Subsequent studies in this field have been conducted by various researchers (Cai and Zhou, 2013; Klopp et al., 2015; Hsieh et al., 2015; Cottet and Alquier, 2018; Herbster et al., 2016; Alquier et al., 2019), most of whom have taken a frequentist approach. However, there remains a gap in the literature concerning the theoretical assessment of Bayesian methodologies in this domain.

In this study, we aim to address this gap by focusing on a generalized Bayesian approach in which a fractional power of the likelihood is used. This leads to what is commonly referred to as fractional posteriors or tempered posteriors, as elucidated in Bhattacharya et al. (2019), Alquier and Ridgway (2020). It is worth emphasizing that generalized Bayesian methods, in which the likelihood is replaced by its fractional power or by a notion of risk, have garnered increased attention in recent years, as demonstrated by various works such as Hammer et al. (2023), Jewson and Rossell (2022), Yonekura and Sugasawa (2023), Mai and Alquier (2017), Matsubara et al. (2022), Medina et al. (2022), Grünwald and Van Ommen (2017), Bissiri et al. (2016), Yang et al. (2020), Lyddon et al. (2019), Syring and Martin (2019), Knoblauch et al. (2022), Mai (2023b), Hong and Martin (2020). Employing fractional posteriors can simplify computation for some Bayesian models (Friel and Pettitt, 2008). Moreover, fractional posteriors have demonstrated robustness to model misspecification in comparison to the standard posterior, as evidenced in Grünwald and Van Ommen (2017) and Alquier and Ridgway (2020).

We tackle the 1-bit matrix completion problem by considering a general, non-uniform sampling scheme. While a general sampling scheme for 1-bit matrix completion has also been examined in Klopp et al. (2015), our requirements are less stringent than theirs.

Initially, we present results concerning the employment of a widely used low-rank factorized prior distribution. Such priors have demonstrated practical efficacy, as evidenced in works such as Cottet and Alquier (2018), Lim and Teh (2007), Salakhutdinov and Mnih (2008). However, due to the typically large dimensionalities of matrix completion problems, employing low-rank factorized priors necessitates intricate Markov Chain Monte Carlo (MCMC) adaptations, which can be computationally expensive and lack scalability. Consequently, in practical applications, variational inference is often favored for such priors, as discussed in works like Cottet and Alquier (2018), Lim and Teh (2007), Babacan et al. (2012).

We derive novel results regarding the consistency and concentration properties of the fractional posterior. Specifically, we establish concentration results for the recovering distribution within the \(\alpha\)-Rényi divergence framework. Consequently, as particular instances, we derive concentration outcomes relative to metrics such as the Hellinger metric. Furthermore, we broaden our investigation to establish concentration rates for parameter estimation utilizing specific distance measures such as the Frobenius norm. Our findings are comparable to those in the frequentist literature as documented in Davenport et al. (2014), Cai and Zhou (2013), and Klopp et al. (2015).

In addition to the aforementioned type of prior, we also undertake theoretical examination utilizing a spectral scaled Student prior. This prior, introduced by Dalalyan (2020), shares conceptual similarities with a hierarchical prior discussed in Yang et al. (2018). The spectral scaled Student prior enables posterior sampling through Langevin Monte Carlo, a gradient-based sampling technique that has recently garnered considerable attention in various high-dimensional problems, as observed in Durmus and Moulines (2017), Durmus and Moulines (2019), Dalalyan (2017), Dalalyan and Karagulyan (2019). We demonstrate that by employing this prior, it is possible to achieve concentration results for the fractional posterior without necessitating a boundedness assumption, as is typically required for low-rank factorization priors.

The remainder of the paper is structured as follows. In Sect. 2, we introduce the notations essential for our work and discuss the problem of 1-bit matrix completion. We also present the fractional posterior along with the low-rank factorization prior in this section. Section 3 presents the results pertaining to the low-rank factorization prior, while Sect. 4 is dedicated to the outcomes obtained using the spectral scaled Student prior. All technical proofs are consolidated in Appendix 6. Some concluding remarks are given in Sect. 5.

2 Notations and method

2.1 Notations

For any integer m, let \([m]=\{1,\dots ,m\}\). Given integers m and k, and a matrix \(M \in \mathbb {R}^{m\times k}\), we write \(\Vert M \Vert _\infty : = \max _{(i,j)\in [m]\times [k]} |M_{ij}|\). For a matrix M, its spectral norm is denoted by \(\Vert M \Vert\), its Frobenius norm by \(\Vert M \Vert _F = \sqrt{\sum _{ij}M^2_{ij}}\), and its nuclear norm by \(\Vert M \Vert _*\) (the sum of its singular values).

For \(\alpha \in (0,1)\), the \(\alpha\)-Rényi divergence between two probability distributions Q and R is defined by

$$\begin{aligned} D_{\alpha }(Q ,R) = \frac{1}{\alpha -1} \log \int \left( \frac{\textrm{d}Q }{\textrm{d}\mu }\right) ^\alpha \left( \frac{\textrm{d}R}{\textrm{d}\mu }\right) ^{1-\alpha } \textrm{d}\mu , \end{aligned}$$

where \(\mu\) is any measure such that \(Q \ll \mu\) and \(R\ll \mu\). The Kullback–Leibler divergence is defined by

$$\begin{aligned} \mathcal {K}(Q ,R) = \int \log \left( \frac{\textrm{d}Q }{\textrm{d}R} \right) \textrm{d}Q \text { if } Q \ll R \text {, } + \infty \text { otherwise}. \end{aligned}$$
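For intuition (this is an illustration, not part of the paper's development), both divergences admit closed forms when Q and R are Bernoulli distributions — the relevant case for a single binary observation. The following minimal Python sketch, with arbitrary illustrative values of p, q, and \(\alpha\), also checks the standard fact that \(D_\alpha\) is nondecreasing in \(\alpha\) and approaches the Kullback–Leibler divergence as \(\alpha \rightarrow 1\).

```python
import math

def renyi(p, q, alpha):
    """alpha-Renyi divergence D_alpha(Ber(p), Ber(q)) for alpha in (0, 1)."""
    s = p**alpha * q**(1 - alpha) + (1 - p)**alpha * (1 - q)**(1 - alpha)
    return math.log(s) / (alpha - 1)

def kl(p, q):
    """Kullback-Leibler divergence K(Ber(p), Ber(q))."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

p, q = 0.8, 0.5  # arbitrary illustrative values
# D_alpha is nondecreasing in alpha and bounded above by the KL divergence
assert 0 < renyi(p, q, 0.5) < renyi(p, q, 0.9) < kl(p, q)
# as alpha -> 1, D_alpha approaches the KL divergence
assert abs(renyi(p, q, 0.9999) - kl(p, q)) < 1e-3
```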

2.2 1-Bit matrix completion

We assume that the observed data \((\omega_1,Y_1), \ldots , (\omega_n,Y_n)\) are i.i.d. (independent and identically distributed) random variables drawn from a joint distribution \(P_{M^*}\) characterized by a matrix \(M^* \in \mathbb {R}^{d_1\times d_2}\). Additionally, we assume that the indices \((\omega _s)_{s=1}^n \in ([d_1]\times [d_2])^n\) are i.i.d., and we denote by \(\Pi\) their marginal distribution. The corresponding observations \((Y_s)_{s=1}^n \in \{-1,+1 \}^n\) are distributed as follows:

$$\begin{aligned} Y| \omega = {\left\{ \begin{array}{ll} 1 & \text { with probability } f(M^*_\omega ), \\ -1 & \text { with probability } 1- f(M^*_\omega ), \end{array}\right. } \end{aligned}$$
(1)

where f is the logistic link function \(f(x) = \frac{\exp (x)}{1+\exp (x)}\). This model is similar to that of Klopp et al. (2015). In this model, the likelihood of the observations is \(L_n (M):= \prod _{i=1}^{n}f(M_{\omega _i})^{1_{\left[ Y_i =1 \right] }} (1-f(M_{\omega _i}))^{1_{\left[ Y_i =-1 \right] }}\).
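To make the sampling model concrete, the following minimal Python sketch simulates observations from model (1) and evaluates \(\log L_n(M)\). The dimensions, the rank of the hypothetical ground truth, and the uniform choice of \(\Pi\) are all arbitrary illustrative assumptions, not quantities fixed by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, n = 30, 20, 500  # hypothetical dimensions and sample size

# hypothetical ground truth: a rank-2 matrix M*
M_star = rng.normal(size=(d1, 2)) @ rng.normal(size=(2, d2))

f = lambda x: 1.0 / (1.0 + np.exp(-x))  # logistic link

# draw indices omega_s i.i.d. from Pi (uniform here, for illustration)
rows = rng.integers(0, d1, size=n)
cols = rng.integers(0, d2, size=n)
# Y_s = +1 with probability f(M*_{omega_s}), -1 otherwise, as in model (1)
Y = np.where(rng.random(n) < f(M_star[rows, cols]), 1, -1)

def log_lik(M):
    """log L_n(M): Bernoulli log-likelihood of the +/-1 observations."""
    p = f(M[rows, cols])
    return np.sum(np.where(Y == 1, np.log(p), np.log1p(-p)))

# sanity check: the truth explains the data better than the zero matrix
assert log_lik(M_star) > log_lik(np.zeros((d1, d2)))
```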

In this study, we operate under the assumption that the rank of \(M^*\), denoted as r, is substantially smaller than its dimensions, specifically \(r \ll \min (d_1, d_2)\). This is a prevalent assumption in 1-bit matrix completion research (Davenport et al., 2014; Cai & Zhou, 2013; Cottet & Alquier, 2018; Klopp et al., 2015; Alquier et al., 2019).

We concentrate on the fractional posterior for \(\alpha \in (0,1)\), as discussed in Bhattacharya et al. (2019); Alquier and Ridgway (2020), which is formulated as follows:

$$\begin{aligned} \pi _{n,\alpha }(M) \propto L_n^\alpha (M) \pi (M) . \end{aligned}$$

In the case \(\alpha =1\), one recovers the traditional posterior distribution.

We define the mean estimator as

$$\begin{aligned} \hat{M}:= \int M \pi _{n,\alpha }(\textrm{d}M). \end{aligned}$$
(2)

2.2.1 Low-rank factorization prior

In Bayesian matrix completion methodologies (Babacan et al., 2012; Salakhutdinov and Mnih, 2008; Lim & Teh, 2007), a prevalent concept involves decomposing a matrix into two matrices in order to establish a prior distribution on low-rank matrices. It is commonly acknowledged that any matrix with a rank of r can be decomposed as follows: \(M=LR^\top , \, L\in \mathbb {R}^{d_1 \times r}, \, R \in \mathbb {R}^{d_2 \times r}.\) This approach is grounded in the assumption that the underlying matrix \(M^*\) exhibits a low rank, or is at least well approximated by a low-rank matrix.

However, in practical scenarios, the rank of the matrix is typically unknown. Thus, for a fixed \(K \in \{1, \ldots , \min (d_1,d_2) \}\), one can express \(M=LR^\top\) with \(L\in \mathbb {R}^{d_1 \times K}\), \(R \in \mathbb {R}^{d_2 \times K}\). Potential ranks \(r\in [K]\) are then accommodated by shrinking some columns of L and R to zero. To this end, the reference Cottet and Alquier (2018) considers the following hierarchical model:

$$\begin{aligned} \gamma _k&{\mathop {\sim }\limits ^{iid}}\pi ^\gamma , \, \forall k \in [K] , \\ L_{i,\cdot },R_{j,\cdot }|\gamma&{\mathop {\sim }\limits ^{iid}}\mathcal {N}(0,\text {diag}(\gamma )), \,\, \forall (i,j) \in [d_1] \times [d_2]. \end{aligned}$$

The prior distribution on the variances \(\pi ^\gamma\) plays a crucial role in controlling the shrinkage of the columns of L and R towards zero. It is common for \(\pi ^\gamma\) to follow an inverse-Gamma distribution (Salakhutdinov and Mnih, 2008). This hierarchical prior distribution bears resemblance to the Bayesian Lasso proposed in Park and Casella (2008), and particularly resembles the Bayesian Group Lasso (Kyung et al., 2010), where the variance term follows a Gamma distribution.

It is worth noting that a Variational Bayesian inference for this prior has been conducted in Section 5 of Cottet and Alquier (2018). However, there is no theoretical study accompanying it.

The paper Cottet and Alquier (2018) shows that the Gamma distribution is also a viable alternative in matrix completion, both for theoretical results and practical considerations. Thus, all results in this paper are stated under the assumption that \(\pi ^\gamma\) is either the Gamma or the inverse-Gamma distribution: \(\pi ^\gamma = \Gamma (a,b)\) or \(\pi ^\gamma = \Gamma ^{-1}(a,b)\). In this study, we regard a as a fixed constant, while b is a small parameter requiring adjustment.
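To make the hierarchical prior concrete, here is a minimal sketch that draws from it with the inverse-Gamma choice \(\pi^\gamma = \Gamma^{-1}(a,b)\); the dimensions and hyperparameter values are hypothetical. An inverse-Gamma draw is obtained as the reciprocal of a Gamma(a, rate = b) draw. With a small b, most variance components \(\gamma_k\) are tiny, so the corresponding columns of L and R are shrunk toward zero:

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, K = 30, 20, 5   # hypothetical dimensions and number of factors
a, b = 1.0, 1e-4        # a fixed, b a small tuning parameter, as in the text

# gamma_k ~ inverse-Gamma(a, b): reciprocal of a Gamma(shape=a, rate=b) draw
gamma = 1.0 / rng.gamma(shape=a, scale=1.0 / b, size=K)

# rows of L and R are i.i.d. N(0, diag(gamma)) given gamma
L = rng.normal(size=(d1, K)) * np.sqrt(gamma)
R = rng.normal(size=(d2, K)) * np.sqrt(gamma)
M = L @ R.T

# M = L R^T has rank at most K by construction
s = np.linalg.svd(M, compute_uv=False)
assert np.all(s[K:] <= 1e-8 * max(s[0], 1.0))
```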

3 Main results

3.1 Assumptions

For \(r\ge 1\) and \(B>0\), we define \(\mathcal {M}(r,B)\) as the set of pairs of matrices \((\bar{U},\bar{V})\), with dimensions \(d_1 \times K\) and \(d_2\times K\) respectively, satisfying \(\Vert \bar{U}\Vert _{\infty } \le B\), \(\Vert \bar{V}\Vert _{\infty } \le B\), and \(\bar{U}_{i,\ell }= \bar{V}_{j,\ell }=0\) for \(\ell >r\). Similar to Cottet and Alquier (2018); Alquier and Ridgway (2020), we make the following assumption on the true parameter matrix.

Assumption 3.1

We assume that \(M^* = \bar{U}\bar{V}^\top\) for \((\bar{U},\bar{V})\in \mathcal {M}(r,B)\).

Assumption 3.2

We assume that there exists a constant \(C_1>0\) such that

$$\begin{aligned} \min _{ i\in [d_1], j\in [d_2]} \mathbb {P} (\omega _1 =(i,j) ) \ge C_1. \end{aligned}$$

Assumption 3.3

We assume that \(\Vert M^*\Vert _\infty \le \kappa < \infty\), and we define the constant \(C_\kappa>0\) by

$$\begin{aligned} C_\kappa = \inf _{|x|\le \kappa } \frac{f'(x)^2}{8f(x)(1-f(x))}, \end{aligned}$$

where \(f(x) = e^x/(1+e^x)\).

Assumption 3.1 imposes a boundedness condition on the true matrix \(M^*\) for low-rank factorization priors. Similar boundedness assumptions have been used in Klopp et al. (2015); Cottet and Alquier (2018). In Sect. 4, we demonstrate that this assumption can be relaxed by employing a spectral scaled Student’s distribution prior.

Our framework accommodates a general sampling distribution, which is only required to satisfy Assumption 3.2. Assumption 3.2 guarantees that each coefficient has a non-zero probability of being observed. For instance, with the uniform distribution, we can express it as \(C_1 = 1/(d_1d_2)\). This assumption was initially introduced in Klopp (2014) within the classical unquantized (continuous) matrix completion setting. It is also used for 1-bit matrix completion under a general sampling distribution, as demonstrated in Klopp et al. (2015). It is noteworthy that, unlike in Klopp et al. (2015), we do not assume that no column or row is sampled with excessively high probability.

To derive results concerning the parameter matrix, we need to use Assumption 3.3. Assumption 3.3 stands as a cornerstone requirement essential for deriving insights into estimation errors for 1-bit matrix completion. It was first introduced in Davenport et al. (2014). While this may be a strong assumption, it has served as a fundamental premise in various prior works, including Cai and Zhou (2013) and Klopp et al. (2015).
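For the logistic link, \(f'(x) = f(x)(1-f(x))\), so the infimum in Assumption 3.3 simplifies to \(C_\kappa = f(\kappa)(1-f(\kappa))/8\), attained at \(x = \pm\kappa\). The following sketch (with an arbitrary illustrative value of \(\kappa\)) verifies this numerically:

```python
import numpy as np

f = lambda x: 1.0 / (1.0 + np.exp(-x))  # logistic link
fprime = lambda x: f(x) * (1 - f(x))    # derivative of the logistic link

def C_kappa(kappa, grid=10_001):
    """Numerical infimum over |x| <= kappa of f'(x)^2 / (8 f(x)(1 - f(x)))."""
    x = np.linspace(-kappa, kappa, grid)
    return np.min(fprime(x)**2 / (8 * f(x) * (1 - f(x))))

kappa = 1.0  # illustrative value
# for the logistic link the infimum equals f(kappa)(1 - f(kappa))/8
assert abs(C_kappa(kappa) - f(kappa) * (1 - f(kappa)) / 8) < 1e-10
```

Note that \(C_\kappa\) decreases as \(\kappa\) grows, which is why larger bounds on \(\Vert M^*\Vert_\infty\) weaken the estimation guarantees below.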

3.2 Main results

The following theorem presents the first consistency result for the fractional posterior in 1-bit matrix completion with the low-rank factorized Gaussian priors that are frequently employed in practical applications.

Theorem 3.1

Assume that Assumption 3.1 holds. Then, there is a small enough \(b>0\) such that

$$\begin{aligned} \mathbb {E} \left[ \int D_{\alpha }( P_{M}, P_{M^*}) \pi _{n,\alpha }(\textrm{d} M ) \right] \le \frac{1+\alpha }{1-\alpha }\varepsilon _n, \end{aligned}$$

where

$$\varepsilon _n = C_{a,B}\frac{ r(d_1+d_2)\log (nd_1d_2)}{n},$$

for some universal constant \(C_{a,B}\) depending only on a and B. Specifically, the result remains valid for the choice \(b= B^2/[512(nd_1d_2)^4 K^2 \max ^2(d_1,d_2)]\).

Recall that all technical proofs are deferred to Appendix 6. The main argument is based on a general scheme for fractional posteriors derived in Bhattacharya et al. (2019); Alquier and Ridgway (2020).

In practical applications, \(b= B^2/[512(nd_1d_2)^4 K^2 \max ^2(d_1,d_2)]\) may not be the best choice; rather, Alquier et al. (2014) and Alquier and Ridgway (2020) suggest employing cross-validation to select b. Ensuring a small b is crucial in practical situations to guarantee a reliable approximation of low-rank matrices (Alquier et al., 2014; Alquier & Ridgway, 2020).

The next theorem presents a novel concentration result for the fractional posterior in 1-bit matrix completion.

Theorem 3.2

Assume that Assumption 3.1 holds. Then, for a sufficiently small \(b> 0\), such as \(b = \frac{B^2}{512(nd_1d_2)^4 K^2 \max ^2(d_1,d_2)}\), it holds that

$$\mathbb {P}\left[ \int D_{\alpha }(P_{M}, P_{M^*}) \pi _{n,\alpha }(\textrm{d}M ) \le \frac{2(\alpha +1)}{1-\alpha } \varepsilon _n \right] \ge 1-\frac{2}{n\varepsilon _n}$$

where

$$\varepsilon _n = C_{a,B}\frac{ r(d_1+d_2)\log (nd_1d_2)}{n},$$

for some universal constant \(C_{a,B}\) depending only on a and B.

According to Theorem 3.2, with probability at least \(1-2/[C_{a,B}r(d_1+d_2)\log (nd_1d_2)]\), the fractional posterior \(\pi _{n,\alpha }\) concentrates around the true model at the rate \(\varepsilon _n\), measured in the \(\alpha\)-Rényi divergence.

Remark 1

It is important to note that our results are formulated without prior knowledge of r, the rank of the true underlying parameter matrix. This aspect highlights the adaptive nature of our results, indicating their ability to adjust and perform effectively regardless of the specific rank of the true underlying parameter matrix.

Put

$$\begin{aligned} c_\alpha = {\left\{ \begin{array}{ll} \frac{2(\alpha +1)}{1-\alpha }, \alpha \in [0.5,1), \\ \frac{2(\alpha +1)}{\alpha }, \alpha \in (0, 0.5). \end{array}\right. } \end{aligned}$$

Utilizing the findings from Van Erven and Harremos (2014) on the relationship between the Hellinger distance and the \(\alpha\)-Rényi divergence, we derive the following results. This enables us to draw comparisons with frequentist literature, including works such as Klopp et al. (2015) and Davenport et al. (2014).

Corollary 1

As a special case, Theorem 3.2 leads to a concentration result in terms of the classical Hellinger distance

$$\begin{aligned} \mathbb {P}\left[ \int H^2(P_{M}, P_{M^*} ) \pi _{n,\alpha }(\textrm{d} M ) \le c_\alpha \varepsilon _n \right] \ge 1-\frac{2}{n\varepsilon _n}. \end{aligned}$$
(3)

Remark 2

The rate specified in (3), of order \(r(d_1+d_2)\log (n)/n\), resembles that observed in prior studies within the frequentist literature, such as Klopp et al. (2015), when examining a general sampling framework. However, we only require the minimal boundedness condition stated in Assumption 3.1, and we obtain results with respect to the joint distribution \(P_M\) rather than f(M), the conditional densities.

To elaborate further, Theorem 1 and Lemma 9 in Klopp et al. (2015) address the recovery of the distribution f(M) with respect to the Hellinger distance. However, they require stricter assumptions, such as boundedness conditions on f(M) and its derivatives, as well as the condition that no column or row is sampled with too high a probability (see Assumptions H1 and H3 in Klopp et al. (2015)). In comparison, the results in Klopp et al. (2015) exhibit a faster rate than those outlined in Davenport et al. (2014), whose rate is of order \(\sqrt{r(d_1+d_2)\log (\max (d_1,d_2))/n}\).

The squared Hellinger metric and the \(\alpha\)-Rényi divergence do not, by themselves, permit any claim about the closeness of M and \(M^*\) in a Euclidean-type distance. However, by leveraging Assumption 3.2 and Assumption 3.3, such results can be attained. We now proceed to state our primary results concerning the recovery of the parameter matrix.

Theorem 3.3

Under the same assumptions as in Theorem 3.2, and additionally assuming that Assumption 3.2 and Assumption 3.3 hold, we have that

$$\begin{aligned} \mathbb {P}\left[ \int \frac{ \Vert M - M^* \Vert ^2_F }{d_1 d_2} \pi _{n,\alpha }(\textrm{d} M ) \le \frac{c_\alpha }{C_1C_\kappa } \varepsilon _n \right] \ge 1-\frac{2}{n\varepsilon _n}. \end{aligned}$$
(4)

and

$$\begin{aligned} \mathbb {P}\left[ \frac{ \Vert \hat{M} - M^* \Vert ^2_F }{d_1 d_2} \le \frac{c_\alpha }{C_1C_\kappa } \varepsilon _n \right] \ge 1-\frac{2}{n\varepsilon _n}. \end{aligned}$$
(5)

Theorem 3.3 states that, with probability at least \(1-2/[C_{a,B}r(d_1+d_2)\log (nd_1d_2)]\), an estimator drawn from the fractional posterior \(\pi _{n,\alpha }\) is close to the true underlying parameter matrix \(M^*\) in normalized Frobenius norm, at the rate \(\varepsilon _n\).

The result stated in (5) is obtained by applying Jensen's inequality to the mean. By similar methods, one can readily derive results for other estimators derived from the fractional posterior, such as the median, drawing upon insights provided in Merkle (2005).

Remark 3

Up to a logarithmic factor, the error rate for the mean estimator in the squared Frobenius norm, given in (5), is of order \(r(d_1+d_2)/n\), which is minimax-optimal according to Theorem 3 in Klopp et al. (2015). Compared to the results for estimation error in Corollary 2 of Klopp et al. (2015), our result holds with a better probability. Specifically, Corollary 2 in Klopp et al. (2015) is stated with a probability of at least \(1-3/(d_1+d_2)\), whereas our result in (5) holds with a probability of at least \(1-2/[C_{a,B}r(d_1+d_2)\log (nd_1d_2)]\).

Remark 4

Under the uniform sampling assumption and the condition \(\Vert M \Vert _{\infty }\le \gamma\), Theorem 1 in Davenport et al. (2014) presented results of order \(\sqrt{r(d_1+d_2)/n }\). A similar result using max-norm minimization was also obtained in Cai and Zhou (2013). The paper Klopp et al. (2015) proves an estimation error rate similar to ours, but with a different logarithmic term, namely \(r(d_1+d_2)\log (d_1+d_2)/n\).

A result comparable to Klopp et al. (2015) was also established in Alquier et al. (2019), but under a uniform sampling assumption. Subsequently, this rate was improved to \(r(d_1+d_2)/n\), without a logarithmic term, in Alaya and Klopp (2019) (see Theorem 7). Consequently, the work presented in Alaya and Klopp (2019) attains the exact minimax estimation rate of convergence for 1-bit matrix completion.

Remark 5

It is noteworthy that our findings are established within a general sampling framework. In contrast to the requirements set forth in Klopp et al. (2015), our approach requires only that the probability of observing each entry is strictly positive, without imposing additional assumptions such as that no column or row is sampled with too high a probability. This aspect further enhances the robustness of employing a fractional posterior.

Remark 6

As previously noted, Assumption 3.3 is crucial for deriving results on estimation error in our analysis, as well as in prior work within the frequentist literature (Davenport et al., 2014; Cai and Zhou, 2013; Klopp et al., 2015). Although this assumption may be stringent, relaxing it presents an intriguing avenue for future research.

4 Results with a spectral scaled student prior

We have opted to first present results, in Sect. 3, with factorization-type priors, as they are widely favored in the matrix completion literature for use with MCMC or Variational Bayes (VB) methods. However, the spectral scaled Student prior has garnered particular interest due to its promising outcomes, whether employed with VB (Yang et al., 2018) or with Langevin Monte Carlo, a gradient-based sampling method (Dalalyan, 2020). This prior has previously been applied to different problems involving matrix parameters (Mai, 2023a, b).

For \(\tau>0\), we consider the following spectral scaled Student prior:

$$\begin{aligned} \pi _{st} (M) \propto \det (\tau ^2 \textbf{I}_{d_1} + MM^\intercal )^{-(d_1+d_2+2)/2}. \end{aligned}$$
(6)

This prior is capable of inducing approximate low-rankness in matrices M. This is evident from the fact that \(\pi _{st} (M) \propto \prod _{j=1}^{d_1} (\tau ^2 + s_j(M)^2 )^{- (d_1+d_2+2)/2 },\) where \(s_j(M)\) denotes the \(j^{th}\) largest singular value of M. Consequently, each \(s_j(M)\) effectively follows a scaled Student's t-distribution, which induces approximate sparsity on the \(s_j(M)\), as discussed in Dalalyan and Tsybakov (2012b, 2012a). Thus, under this prior, most of the \(s_j(M)\) tend to be close to 0, suggesting that M is approximately low-rank.
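Since Langevin Monte Carlo is the sampler suggested for this prior, it may help to note that the only ingredient it requires from the prior is the gradient \(\nabla_M \log \pi_{st}(M) = -(d_1+d_2+2)\,(\tau^2 \textbf{I}_{d_1} + MM^\top)^{-1} M\). The sketch below, with arbitrary small dimensions and an arbitrary \(\tau\), implements the log-density up to a constant and verifies this gradient numerically; it is a sanity check, not the paper's implementation.

```python
import numpy as np

def log_prior(M, tau):
    """log pi_st(M) up to an additive constant (spectral scaled Student prior)."""
    d1, d2 = M.shape
    _, logdet = np.linalg.slogdet(tau**2 * np.eye(d1) + M @ M.T)
    return -(d1 + d2 + 2) / 2 * logdet

def grad_log_prior(M, tau):
    """Gradient of log pi_st(M): -(d1+d2+2) (tau^2 I + M M^T)^{-1} M."""
    d1, d2 = M.shape
    S = tau**2 * np.eye(d1) + M @ M.T
    return -(d1 + d2 + 2) * np.linalg.solve(S, M)

# finite-difference sanity check of the gradient at a random point
rng = np.random.default_rng(1)
M = rng.normal(size=(4, 3))  # arbitrary small dimensions
tau, eps = 0.5, 1e-6
E = np.zeros_like(M)
E[0, 0] = eps
fd = (log_prior(M + E, tau) - log_prior(M - E, tau)) / (2 * eps)
assert abs(fd - grad_log_prior(M, tau)[0, 0]) < 1e-4
```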

We now present a consistency result using the spectral scaled Student prior.

Theorem 4.1

For \(\tau = 1/n\), we have that

$$\mathbb {E}\left[ \int D_{\alpha }(P_M,P_{M^*}) \pi _{n,\alpha }(\textrm{d} M ) \right] \le \frac{1+\alpha }{1-\alpha }\varepsilon _n$$

where

$$\varepsilon _n = \frac{ 2 r (d_1 + d_2 +2) \log \left( 1+ \frac{ n \Vert M^* \Vert _F}{ \sqrt{2r }} \right) }{n}.$$

The proofs for this section can be found in Appendix 6.2. Note that in the rate \(\varepsilon _n\) outlined in Theorem 4.1 and Theorem 4.2 below, the condition \(r = \textrm{rank} (M^* ) \ne 0\) is not necessary: we interpret \(0\log (1+0/0)\) as 0 in the scenario where \(r = 0\) and \(M^* = 0\).

The next theorem presents a concentration result for the fractional posterior.

Theorem 4.2

For \(\tau = 1/n\), we have that

$$\mathbb {P}\left[ \int D_{\alpha }(P_M,P_{M^*}) \pi _{n,\alpha }(\textrm{d}M ) \le \frac{2(\alpha +1)}{1-\alpha } \varepsilon _n \right] \ge 1-\frac{2}{n\varepsilon _n}$$

where \(\varepsilon _n\) is given in Theorem 4.1.

Remark 7

We do not assert that \(\tau = 1/n\), in Theorems 4.1 and 4.2, represents the optimal choice. In practical applications, users can employ cross-validation to fine-tune the value of \(\tau\).

Remark 8

It is interesting to observe that, by utilizing the spectral scaled Student prior described in (6), we are not required to impose a boundedness assumption on \(M^*\), as was necessary in the previous section with low-rank factorized priors and in other previous works such as Klopp et al. (2015); Alquier and Ridgway (2020). Furthermore, the additional logarithmic factor in Theorem 4.1 and Theorem 4.2 can be simplified further: employing the inequality \(\Vert M^* \Vert _F \le \Vert M^* \Vert \sqrt{r}\) yields the bound \(\log (1+ n\Vert M^* \Vert )\).
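To spell out the simplification mentioned in Remark 8: since \(\Vert M^* \Vert _F \le \Vert M^* \Vert \sqrt{r}\),

$$\begin{aligned} \log \left( 1+ \frac{ n \Vert M^* \Vert _F}{ \sqrt{2r}} \right) \le \log \left( 1+ \frac{ n \sqrt{r}\, \Vert M^* \Vert }{ \sqrt{2r}} \right) = \log \left( 1+ \frac{ n \Vert M^* \Vert }{ \sqrt{2}} \right) \le \log \left( 1+ n \Vert M^* \Vert \right) , \end{aligned}$$

so the logarithmic factor in \(\varepsilon _n\) no longer depends on r.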

Similar to Theorem 3.3, with the inclusion of additional assumptions, we can derive concentration results for recovering the underlying matrix parameter as well as results for the mean estimator defined in (2).

Theorem 4.3

Under the same assumptions as in Theorem 4.2, and additionally assuming that Assumption 3.2 and Assumption 3.3 hold, we have that

$$\begin{aligned} \mathbb {P}\left[ \int \frac{ \Vert M - M^* \Vert ^2_F }{d_1 d_2} \pi _{n,\alpha }(\textrm{d} M ) \le \frac{c_\alpha }{C_1C_\kappa } \varepsilon _n \right] \ge 1-\frac{2}{n\varepsilon _n}. \end{aligned}$$
(7)

and

$$\begin{aligned} \mathbb {P}\left[ \frac{ \Vert \hat{M} - M^* \Vert ^2_F }{d_1 d_2} \le \frac{c_\alpha }{C_1C_\kappa } \varepsilon _n \right] \ge 1-\frac{2}{n\varepsilon _n}. \end{aligned}$$
(8)

Remark 9

Similar to the outcomes detailed in Sect. 3, the results presented in this section for the spectral scaled Student prior do not necessitate prior knowledge of r, the rank of the true underlying parameter matrix. This underscores the adaptive nature of our results, demonstrating their capacity to adjust and perform effectively, regardless of the rank of the true underlying parameter matrix.

5 Concluding remarks

This paper presents an in-depth theoretical examination of Bayesian 1-bit matrix completion, addressing a gap in the machine learning literature. Our study considers a general, non-uniform sampling scheme and offers theoretical guarantees for the effectiveness of the fractional posterior. We derive concentration results for the fractional posterior and validate its ability to recover the true parameter matrix. Our approach utilizes two types of priors: low-rank factorization priors and a spectral scaled Student's t-distribution prior, with the latter requiring fewer assumptions. Crucially, our results adapt to the matrix rank without requiring it to be known in advance. Our findings match those in the frequentist literature, but with fewer restrictive assumptions.

While our work yields promising theoretical results for Bayesian 1-bit matrix completion, there remain several avenues for future research. One potential extension involves integrating additional covariate information into our methodology. Another critical area necessitating further investigation in practical applications is the tuning of the learning rate \(\alpha\). Although cross-validation can be employed for this purpose, it incurs significant computational costs. The optimal tuning of this parameter presents a challenging problem in practice and constitutes an open research question that has garnered considerable attention within the framework of generalized Bayesian inference, as highlighted by Wu and Martin (2023) and related literature. While our study considers a general sampling setting, it has certain limitations: with a fairly high probability, some entries may be sampled multiple times. It would be more practical to assume that entries are sampled without replacement. Future research should address this limitation.