Concentration properties of fractional posterior in 1-bit matrix completion

Mai, The Tien

doi:10.1007/s10994-024-06691-z

Concentration properties of fractional posterior in 1-bit matrix completion

Open access
Published: 14 January 2025

Volume 114, article number 7, (2025)
Cite this article

You have full access to this open access article

Download PDF

Save article

View saved research

Machine Learning Aims and scope Submit manuscript

Concentration properties of fractional posterior in 1-bit matrix completion

Download PDF

The Tien Mai¹

851 Accesses
5 Citations
2 Altmetric
Explore all metrics

Abstract

The problem of estimating a matrix based on a set of observed entries is commonly referred to as the matrix completion problem. In this work, we specifically address the scenario of binary observations, often termed as 1-bit matrix completion. While numerous studies have explored Bayesian and frequentist methods for real-value matrix completion, there has been a lack of theoretical exploration regarding Bayesian approaches in 1-bit matrix completion. We tackle this gap by considering a general, non-uniform sampling scheme and providing theoretical assurances on the efficacy of the fractional posterior. Our contributions include obtaining concentration results for the fractional posterior and demonstrating its effectiveness in recovering the underlying parameter matrix. We accomplish this using two distinct types of prior distributions: low-rank factorization priors and a spectral scaled Student prior, with the latter requiring fewer assumptions. Importantly, our results exhibit an adaptive nature by not mandating prior knowledge of the rank of the parameter matrix. Our findings are comparable to those found in the frequentist literature, yet demand fewer restrictive assumptions.

Deep Bayesian Matrix Factorization

A Low-Complexity Approach to Computation of the Discrete Fractional Fourier Transform

Article Open access 04 February 2017

Bayesian Estimation of the Precision Matrix with Monotone Missing Data

Article 17 September 2020

1 Introduction

Matrix completion has been extensively explored in the fields of machine learning and statistics, attracting considerable attention in recent years due to its relevance to various contemporary applications such as recommendation systems (Bobadilla et al., 2013; Koren et al., 2009), including the notable Netflix challenge (Bennett and Lanning, 2007), image processing (Ji et al., 2010; Han et al., 2014), genotype imputation (Chi et al., 2013; Jiang et al., 2016), and quantum statistics (Gross, 2011). Although completing a matrix in general is often deemed infeasible, seminal works by Candès and Tao (2010), Candes and Plan (2010), Candès and Recht (2009) have demonstrated its potential feasibility under the assumption of a low-rank structure. This assumption aligns naturally with practical scenarios, particularly in recommendation systems, where it implies the presence of a limited number of latent features that capture user preferences. Various theoretical and computational approaches to matrix completion have been proposed and investigated, as in Tsybakov et al. (2011), Lim and Teh (2007), Salakhutdinov and Mnih (2008), Recht and Ré (2013), Chatterjee (2015), Mai and Alquier (2015), Alquier and Ridgway (2020), Chen et al. (2019).

The previously mentioned studies primarily focused on matrices with real-numbered elements. However, in many practical situations, the observed elements are often binary, taking values from the set $\{-1, 1\}$. This type of data is prevalent in diverse contexts, such as voting or rating data, where responses typically involve binary distinctions like “yes/no", “like/dislike", or “true/false". Tackling the challenge of reconstructing a matrix from incomplete binary observations, known as 1-bit matrix completion, was initially investigated in Davenport et al. (2014). Subsequent studies in this field have been conducted by various researchers Cai and Zhou (2013), Klopp et al. (2015), Hsieh et al. (2015), Cottet and Alquier (2018), Herbster et al. (2016), Alquier et al. (2019), most of whom have taken a frequentist approach. However, there remains a gap in the literature concerning the theoretical assessment of Bayesian methodologies in this domain.

In this study, we aim to address this gap by focusing on a generalized Bayesian approach, where we utilize a fractional power of the likelihood. This leads to what is commonly referred to as fractional posteriors or tempered posteriors, as elucidated in Bhattacharya et al. (2019), Alquier and Ridgway (2020). It is noteworthy to emphasize that generalized Bayesian methods, where the likelihood is substituted by its fractional power or by a concept of risk, has garnered increased attention in recent years, as demonstrated by various works such as Hammer et al. (2023), Jewson and Rossell (2022), Yonekura and Sugasawa (2023), Mai and Alquier (2017), Matsubara et al. (2022), Medina et al. (2022), Grünwald and Van Ommen (2017), Bissiri et al. (2016), Yang et al. (2020), Lyddon et al. (2019), Syring and Martin (2019), Knoblauch et al. (2022), Mai (2023b), Hong and Martin (2020). It is observed that employing fractional posteriors can simplify computation for some Bayesian models Friel and Pettitt (2008). Moreover, fractional posteriors have demonstrated robustness to model misspecification in comparison to the standard posterior, as evidenced in Grünwald and Van Ommen (2017) and Alquier and Ridgway (2020).

We tackle the 1-bit matrix completion problem by considering a general, non-uniform sampling scheme. While a general sampling scheme for 1-bit matrix completion has also been examined in Klopp et al. (2015), our requirements are less stringent than theirs.

Initially, we present results concerning the employment of a widely used low-rank factorized prior distribution. Such priors have demonstrated practical efficacy, as evidenced in works such as Cottet and Alquier (2018), Lim and Teh (2007), Salakhutdinov and Mnih (2008). However, due to the typically large dimensionalities of matrix completion problems, employing low-rank factorized priors necessitates intricate Markov Chain Monte Carlo (MCMC) adaptations, which can be computationally expensive and lack scalability. Consequently, in practical applications, variational inference is often favored for such priors, as discussed in works like Cottet and Alquier (2018), Lim and Teh (2007), Babacan et al. (2012).

We derive novel results regarding the consistency and concentration properties of the fractional posterior. Specifically, we establish concentration results for the recovering distribution within the $\alpha$-Rényi divergence framework. Consequently, as particular instances, we derive concentration outcomes relative to metrics such as the Hellinger metric. Furthermore, we broaden our investigation to establish concentration rates for parameter estimation utilizing specific distance measures such as the Frobenius norm. Our findings are comparable to those in the frequentist literature as documented in Davenport et al. (2014), Cai and Zhou (2013), and Klopp et al. (2015).

In addition to the aforementioned type of prior, we also undertake theoretical examination utilizing a spectral scaled Student prior. This prior, introduced by Dalalyan (2020), shares conceptual similarities with a hierarchical prior discussed in Yang et al. (2018). The spectral scaled Student prior enables posterior sampling through Langevin Monte Carlo, a gradient-based sampling technique that has recently garnered considerable attention in various high-dimensional problems, as observed in Durmus and Moulines (2017), Durmus and Moulines (2019), Dalalyan (2017), Dalalyan and Karagulyan (2019). We demonstrate that by employing this prior, it is possible to achieve concentration results for the fractional posterior without necessitating a boundedness assumption, as is typically required for low-rank factorization priors.

The remainder of the paper is structured as follows. In Sect. 2, we introduce the notations essential for our work and discuss the problem of 1-bit matrix completion. We also present the fractional posterior along with the low-rank factorization prior in this section. Section 3 presents the results pertaining to the low-rank factorization prior, while Sect. 4 is dedicated to the outcomes obtained using the spectral scaled Student prior. All technical proofs are consolidated in Appendix 6. Some concluding remarks are given in Sect. 5.

2 Notations and method

2.1 Notations

For any integer m, let $[m]=\{1,\dots ,m\}$. Given integers m and k, and a matrix $M \in \mathbb {R}^{m\times k}$, we write $\Vert M \Vert _\infty : = \max _{(i,j)\in [m]\times [k]} |M_{ij}|$. For a matrix M, its spectral norm is denoted by $\Vert M \Vert$, its Fobenius norm is denoted by $\Vert M \Vert _F = \sqrt{\sum _{ij}M^2_{ij}}$, and its nuclear norm is denoted by $\Vert M \Vert _*$ (the sum of the singular values).

Let $\alpha \in (0,1)$, the $\alpha$-Rényi divergence between two probability distributions Q and R is defined by

$$\begin{aligned} D_{\alpha }(Q ,R) = \frac{1}{\alpha -1} \log \int \left( \frac{\textrm{d}Q }{\textrm{d}\mu }\right) ^\alpha \left( \frac{\textrm{d}R}{\textrm{d}\mu }\right) ^{1-\alpha } \textrm{d}\mu , \end{aligned}$$

where $\mu$ is any measure such that $Q \ll \mu$ and $R\ll \mu$. The Kullback–Leibler divergence is defined by

$$\begin{aligned} \mathcal {K}(Q ,R) = \int \log \left( \frac{\textrm{d}Q }{\textrm{d}R} \right) \textrm{d}Q \text { if } Q \ll R \text {, } + \infty \text { otherwise}. \end{aligned}$$

2.2 1-Bit matrix completion

We assume that the observed data $(\omega_1,Y_1), \ldots , (\omega_n,Y_n)$ are i.i.d. (independent and identically distributed) random variables drawn from a joint distribution characterized by a matrix $M^* \in \mathbb {R}^{d_1\times d_2}$, denoted by $P_{M^*}$. Additionally, we assume that $(\omega _s)_{s=1}^n \in ([d_1]\times [d_2])^n$ are i.i.d. and denoted by $\Pi$ its marginal distribution. These indices correspond to observations, denoted by $(Y_s)_{s=1}^n \in \{-1,+1 \}^n$, distributed accordingly:

$$\begin{aligned} Y| \omega = {\left\{ \begin{array}{ll} 1 & \text { with probability } f(M^*_\omega ), \\ -1 & \text { with probability } 1- f(M^*_\omega ), \end{array}\right. } \end{aligned}$$

(1)

where f is the logistic link function $f(x) = \frac{\exp (x)}{1+\exp (x)}$. This model is similar to Klopp et al. (2015). In this model, we have the likelihood of the observations as $L_n (M):= \prod _{i=1}^{n}f(M_{\omega _i})^{1_{\left[ Y_i =1 \right] }} (1-f(M_{\omega _i}))^{1_{\left[ Y_i =-1 \right] }}$.

In this study, we operate under the assumption that the rank of $M^*$, denoted as r, is substantially smaller than its dimensions, specifically $r \ll \min (d_1, d_2)$. This is a prevalent assumption in 1-bit matrix completion research, (Davenport et al., 2014; Cai & Zhou, 2013; Cottet & Alquier, 2018; Klopp et al., 2015; Alquier et al., 2019).

We concentrate on the fractional posterior for $\alpha \in (0,1)$, as discussed in Bhattacharya et al. (2019); Alquier and Ridgway (2020), which is formulated as follows:

$$\begin{aligned} \pi _{n,\alpha }(M) \propto L_n^\alpha (M) \pi (M) . \end{aligned}$$

In the case $\alpha =1$, one recovers the traditional posterior distribution.

We define the mean estimator as

$$\begin{aligned} \hat{M}:= \int M \pi _{n,\alpha }(\textrm{d}M). \end{aligned}$$

(2)

2.2.1 Low-rank factorization prior

In Bayesian matrix completion methodologies (Babacan et al., 2012; Salakhutdinov and Mnih, 2008; Lim & Teh, 2007), a prevalent concept involves decomposing a matrix into two matrices in order to establish a prior distribution on low-rank matrices. It is commonly acknowledged that any matrix with a rank of r can be decomposed as follows: $M=LR^\top , \, L\in \mathbb {R}^{d_1 \times r}, \, R \in \mathbb {R}^{d_2 \times r}.$ This approach is grounded in the assumption that the underlying matrix $M^*$ exhibits a low rank, or is at least well approximated by a low-rank matrix.

However, in practical scenarios, the rank of the matrix is typically unknown. Thus, for a fixed $K \in \{1, \ldots , \min (d_1,d_2) \}$, one can express $M=LR^\top$ with $L\in \mathbb {R}^{d_1 \times K}$, $R \in \mathbb {R}^{d_2 \times K}$. Subsequently, potential ranks $r\in [K]$ are adjusted by diminishing certain columns of L and R to zero. To address this, the reference Cottet and Alquier (2018) considers the following hierarchical model:

$$\begin{aligned} \gamma _k&{\mathop {\sim }\limits ^{iid}}\pi ^\gamma , \, \forall k \in [K] , \\ L_{i,\cdot },R_{j,\cdot }|\gamma&{\mathop {\sim }\limits ^{iid}}\mathcal {N}(0,\text {diag}(\gamma )), \,\, \forall (i,j) \in [d_1] \times [d_2]. \end{aligned}$$

The prior distribution on the variances $\pi ^\gamma$ plays a crucial role in controlling the shrinkage of the columns of L and R towards zero. It is common for $\pi ^\gamma$ to follow an inverse-Gamma distribution (Salakhutdinov and Mnih, 2008). This hierarchical prior distribution bears resemblance to the Bayesian Lasso proposed in Park and Casella (2008), and particularly resembles the Bayesian Group Lasso (Kyung et al., 2010), where the variance term follows a Gamma distribution.

It is worth noting that a Variational Bayesian inference for this prior has been conducted in Section 5 of Cottet and Alquier (2018). However, there is no theoretical study accompanying it.

The paper Cottet and Alquier (2018) shows that Gamma distribution is also a possible alternative in matrix completion, both for theoretical results and practical considerations. Thus all the results in this paper are stated under the assumption that $\pi ^\gamma$ is either the Gamma or the inverse-Gamma distribution: $\pi ^\gamma = \Gamma (a,b)$, or $\pi ^\gamma = \Gamma ^{-1}(a,b)$. In this study, we regard a as a fixed constant, while b is seen as a small parameter requiring adjustment.

3 Main results

3.1 Assumptions

For $r\ge 1$ and $B>0$, we define $\mathcal {M}(r,B)$ as the set of pairs of matrices $(\bar{U},\bar{V})$, with dimensions $d_1 \times K$ and $d_2\times K$ respectively, satisfying that: $\Vert \bar{U}\Vert _{\infty } \le B$, $\Vert \bar{V}\Vert _{\infty } \le B$ and $\bar{U}_{i,\ell }=0$ for $i>r$ and $\bar{V}_{j,\ell }=0$ for $j>r$. Similar to Cottet and Alquier (2018); Alquier and Ridgway (2020), we make the following assumption on the true parameter matrix.

Assumption 3.1

We assume that $M^* = \bar{U}\bar{V}^\top$ for $(\bar{U},\bar{V})\in \mathcal {M}(r,B)$.

Assumption 3.2

We assume that there exist a constant $C_1>0$, such that,

$$\begin{aligned} \min _{ i\in [d_1], j\in [d_2]} \mathbb {P} (\omega _1 =(i,j) ) \ge C_1. \end{aligned}$$

Assumption 3.3

We assume that $\Vert M^*\Vert _\infty \le \kappa < \infty$ and there exist a constant $C_\kappa>0$ such that

$$\begin{aligned} C_\kappa = \inf _{|x|\le \kappa } \frac{f'(x)^2}{8f(x)(1-f(x))}, \end{aligned}$$

where $f(x) = e^x/(1+e^x)$.

Assumption 3.1 imposes a boundedness condition on the true matrix $M^*$ for low-rank factorization priors. Similar boundedness assumptions have been used in Klopp et al. (2015); Cottet and Alquier (2018). In Sect. 4, we demonstrate that this assumption can be relaxed by employing a spectral scaled Student’s distribution prior.

Our framework accommodates a general sampling distribution, which is only required to satisfy Assumption 3.2. Assumption 3.2 guarantees that each coefficient has a non-zero probability of being observed. For instance, with the uniform distribution, we can express it as $C_1 = 1/(d_1d_2)$. This assumption was initially introduced in Klopp (2014) within the classical unquantized (continuous) matrix completion setting. It is also used for 1-bit matrix completion under a general sampling distribution, as demonstrated in Klopp et al. (2015). It is noteworthy that, unlike in Klopp et al. (2015), we do not assume that no column or row is sampled with excessively high probability.

To derive results concerning the parameter matrix, we need to use Assumption 3.3. Assumption 3.3 stands as a cornerstone requirement essential for deriving insights into estimation errors for 1-bit matrix completion. It was first introduced in Davenport et al. (2014). While this may be a strong assumption, it has served as a fundamental premise in various prior works, including Cai and Zhou (2013) and Klopp et al. (2015).

3.2 Main results

The following theorem presents the first consistency result for the fractional posterior in 1-bit matrix completion with low-rank factorization Gaussian priors which are frequently employed in practical applications.

Theorem 3.1

Assume that Assumption 3.1 holds. Then, there is a small enough $b>0$ such that

$$\begin{aligned} \mathbb {E} \left[ \int D_{\alpha }( P_{M}, P_{M^*}) \pi _{n,\alpha }(\textrm{d} M ) \right] \le \frac{1+\alpha }{1-\alpha }\varepsilon _n. \end{aligned}$$

where

$$\varepsilon _n = C_{a,B}\frac{ r(d_1+d_2)\log (nd_1d_2)}{n},$$

for some universal constant $C_{a,B}$ depending only on a, B. Specifically, the result remains valid for the selection $b= B^2/[512(nd_1d_2)^4 K^2 \max ^2(d_1,d_2)]$.

It is reminded that all technical proofs are postponed to Appendix 6. The main argument is based on a general scheme for fractional posteriors derived in Bhattacharya et al. (2019); Alquier and Ridgway (2020).

In practical applications, it is noted that $b= B^2/[512(nd_1d_2)^4 K^2 \max ^2(d_1,d_2)]$ may not be the best choice; rather, Alquier et al. (2014); Alquier and Ridgway (2020) suggests employing cross-validation to select b. Ensuring a small b is crucial in practical situations to guarantee a reliable approximation of low-rank matrices (Alquier et al., 2014; Alquier & Ridgway, 2020).

The next theorem introduces a completely novel concentration results for the fractional posterior in 1-bit matrix completion.

Theorem 3.2

Assume that Assumption 3.1 holds. Then, for a sufficiently small $b> 0$, such as $b = \frac{B^2}{512(nd_1d_2)^4 K^2 \max ^2(d_1,d_2)}$, it holds that

$$\mathbb {P}\left[ \int \mathcal {D}_{\alpha }(P_{M}, P_{M^*}) \pi _{n,\alpha }(\textrm{d}M ) \le \frac{2(\alpha +1)}{1-\alpha } \varepsilon _n \right] \ge 1-\frac{2}{n\varepsilon _n}$$

where,

$$\varepsilon _n = C_{a,B}\frac{ r(d_1+d_2)\log (nd_1d_2)}{n},$$

for some universal constant $C_{a,B}$ depending only on a and B.

According to Theorem 3.2, there is a probability at least $1-2/[C_{a,B}r(d_1+d_2)\log (nd_1d_2)]$ that the fractional posterior $\pi _{n,\alpha }$ will concentrate around the true model at the rate $\varepsilon _n$, measured by the $\alpha$-Rényi divergence.

Remark 1

It is important to note that our results are formulated without prior knowledge of r, the rank of the true underlying parameter matrix. This aspect highlights the adaptive nature of our results, indicating their ability to adjust and perform effectively regardless of the specific rank of the true underlying parameter matrix.

Put

$$\begin{aligned} c_\alpha = {\left\{ \begin{array}{ll} \frac{2(\alpha +1)}{1-\alpha }, \alpha \in [0.5,1), \\ \frac{2(\alpha +1)}{\alpha }, \alpha \in (0, 0.5). \end{array}\right. } \end{aligned}$$

Utilizing the findings from Van Erven and Harremos (2014) on the relationship between the Hellinger distance and the $\alpha$-Rényi divergence, we derive the following results. This enables us to draw comparisons with frequentist literature, including works such as Klopp et al. (2015) and Davenport et al. (2014).

Corollary 1

As a special case, Theorem 3.2 leads to a concentration result in terms of the classical Hellinger distance

$$\begin{aligned} \mathbb {P}\left[ \int H^2(P_{M}, P_{M^*} ) \pi _{n,\alpha }(\textrm{d} M ) \le c_\alpha \varepsilon _n \right] \ge 1-\frac{2}{n\varepsilon _n}. \end{aligned}$$

(3)

Remark 2

The rate specified in (3), of the order $r(d_1+d_2)\log (n)/n$, resembles that observed in prior studies within the frequentist literature, such as Klopp et al. (2015), when examining a general sampling framework. However, we only consider a minimal boundedness assumption as stated in 3.1, and obtain result with respect to the joint distribution $P_M$ rather than f(M) , the conditional densities.

To elaborate further, Theorem 1 and Lemma 9 in Klopp et al. (2015) delve into the recovery of the distribution f(M) with respect to Hellinger distance. However, they necessitate stricter assumptions such as boundedness assumptions on f(M) and its derivatives as well as no column nor row is sampled with too high probability (see Assumption H1 and H3 in Klopp et al. (2015)). In comparison, the results in Klopp et al. (2015) demonstrate a faster rate than those outlined in Davenport et al. (2014) where their results is of order $\sqrt{r(d_1+d_2)\log (\max (d_1,d_2))/n}$.

The squared Hellinger metric and the $\alpha$-Rényi divergence inherently involve ambiguity, thus preventing any claims about the closeness of M and $M^*$ within a Euclidean-type distance framework. However, by leveraging Assumption 3.2 and Assumption 3.3, such results can be attained. We now proceed to state our primary results concerning the recovery of the parameter matrix.

Theorem 3.3

Under the same assumption as in Theorem 3.2 and additionally assuming that Assumption 3.2 and Assumption 3.3 hold. We have that

$$\begin{aligned} \mathbb {P}\left[ \int \frac{ \Vert M - M^* \Vert ^2_F }{d_1 d_2} \pi _{n,\alpha }(\textrm{d} M ) \le \frac{c_\alpha }{C_1C_\kappa } \varepsilon _n \right] \ge 1-\frac{2}{n\varepsilon _n}. \end{aligned}$$

(4)

and

$$\begin{aligned} \mathbb {P}\left[ \frac{ \Vert \hat{M} - M^* \Vert ^2_F }{d_1 d_2} \le \frac{c_\alpha }{C_1C_\kappa } \varepsilon _n \right] \ge 1-\frac{2}{n\varepsilon _n}. \end{aligned}$$

(5)

Theorem 3.3 states that, with a high probability of at least $1-2/[C_{a,B}r(d_1+d_2)\log (nd_1d_2)]$, the estimation error of an estimator drawn from the fractional posterior $\pi _{n,\alpha }$ will converge in Frobenius norm to the true underlying parameter matrix $M^*$. The rate of this convergence is $\varepsilon _n$.

The result stated in (5) is achieved by applying Jensen’s inequality to the mean. By employing similar methods, one can readily derive outcomes for other estimator derived from the fractional posterior, such as the median, drawing upon insights provided in Merkle (2005).

Remark 3

Up to a logarithmic factor, the error rate for the mean estimator in the squared Frobenius norm, given in (5), is of order $r(d_1+d_2)/n$ which is minimax-optimal according to Theorem 3 in Klopp et al. (2015). Compared to the results for estimation error in Corollary 2 of Klopp et al. (2015), our result is obtained with a better probability. Specifically, Corollary 2 in Klopp et al. (2015) is stated with a probability of at least $1-3/(d_1+d_2)$, whereas our result in (5) holds with a probability of at least $1-2/[C_{a,B}r(d_1+d_2)\log (nd_1d_2)]$.

Remark 4

Under the uniform sampling assumption and that $\Vert M \Vert _{\infty }\le \gamma$, Theorem 1 in Davenport et al. (2014) presented results of order $\sqrt{r(d_1+d_2)/n }$. A similar result using max-norm minimization was also obtained in Cai and Zhou (2013). The paper Klopp et al. (2015) proves an estimation error rate similar to ours, but with different log-term, as $r(d_1+d_2)\log (d_1+d_2)/n$.

A comparable result to Klopp et al. (2015) is also established in Alquier et al. (2019) but under a uniformly sampling assumption. Subsequently, this rate has been recently enhanced to $r(d_1+d_2)/n$, without the presence of a logarithmic term, in Alaya and Klopp (2019) (refer to Theorem 7). Consequently, the work presented in Alaya and Klopp (2019) attains the precise minimax estimation rate of convergence for 1-bit matrix completion.

Remark 5

It is noteworthy that our findings are established within a general sampling framework. In contrast to the requirements set forth in Klopp et al. (2015), our approach necessitates only that the probability of observing any entries is strictly positive, without imposing additional assumptions such as no column nor row is sampled with too high probability. This aspect further enhances the robustness of employing a fractional posterior.

Remark 6

As previously noted, Assumption 3.3 is crucial for deriving results on estimation error in our analysis, as well as in prior work within the frequentist literature Davenport et al. (2014), Cai and Zhou (2013), Klopp et al. (2015). Although this assumption may be stringent, relaxing it presents an intriguing avenue for future research.

4 Results with a spectral scaled student prior

We have opted to initially present results in Sect. 3 with factorization-type priors, as they are widely favored in the matrix completion literature for utilization with MCMC or Variational Bayes (VB) methods. However, another spectral scaled Student prior has garnered particular interest due to its promising outcomes, whether employed with VB (Yang et al., 2018) or Langevin Monte Carlo, a gradient-based sampling method (Dalalyan, 2020). This prior has previously been applied in different problems involving matrix parameters (Mai, 2023a, b).

With $\tau>0$, we consider the following spectral scaled Student prior, given as

$$\begin{aligned} \pi _{st} (M) \propto \det (\tau ^2 \textbf{I}_{d_1} + MM^\intercal )^{-(d_1+d_2+2)/2}. \end{aligned}$$

(6)

This prior possesses the capability to introduce approximate low-rankness in matrices M. This is evident from the fact that $\pi _{st} (M) \propto \prod _{j=1}^{d_1} (\tau ^2 + s_j(M)^2 )^{- (d_1+d_2+2)/2 },$ where $s_j(M)$ represents the $j^{th}$ largest singular value of M. Consequently, the distribution follows a scaled Student’s t-distribution evaluated at $s_j(M)$, which induces approximate sparsity on $s_j(M)$, as discussed in Dalalyan and Tsybakov (2012b, 2012a). Thus, under this prior distribution, the majority of $s_j(M)$ tend to be close to 0, suggesting that M is approximately low-rank.

We now present a consistency result using the spectral scaled Student prior.

Theorem 4.1

For $\tau = 1/n$, we have that

$$\mathbb {E}\left[ \int \mathcal {D}_{\alpha }(P_M,P_{M^*}) \pi _{n,\alpha }(\textrm{d} M ) \right] \le \frac{1+\alpha }{1-\alpha }\varepsilon _n$$

where

$$\varepsilon _n = \frac{ 2 r (d_1 + d_2 +2) \log \left( 1+ \frac{ n \Vert M^* \Vert _F}{ \sqrt{2r }} \right) }{n}.$$

The proofs of this section can be found in Appendix 6.2. It is noted that in the rate $\varepsilon _n$ outlined in Theorem 4.1 and Theorem 4.2 below, the condition $r = \textrm{rank} (M^* ) \ne 0$ is not necessary. This is because we interpret $0\log (1+0/0)$ as 0 for the scenario where $r^* = 0$ and $M^* = 0$.

The next theorem presents a concentration result for the fractional posterior.

Theorem 4.2

For $\tau = 1/n$, we have that

$$\mathbb {P}\left[ \int \mathcal {D}_{\alpha }(P_M,P_{M^*}) \pi _{n,\alpha }(\textrm{d}M ) \le \frac{2(\alpha +1)}{1-\alpha } \varepsilon _n \right] \ge 1-\frac{2}{n\varepsilon _n}$$

where $\varepsilon _n$ is given in Theorem 4.1.

Remark 7

We do not assert that $\tau = 1/n$, in both Theorem 4.1 and 4.2, represents the optimal selection. In practical applications, users can utilize cross-validation to fine-tune the value of $\tau$.

Remark 8

It is interesting to observe that by utilizing the spectral scaled Student prior described in (6), we are not required to impose a boundedness assumption on $M^*$, as was necessary in the previous section with low-rank factorized priors or in other previous works such as Klopp et al. (2015); Alquier and Ridgway (2020). Furthermore, the additional logarithmic factor in Theorem 4.1 and Theorem 4.2 can be further simplified. This can be achieved by employing the inequality $\Vert M^* \Vert _F \le \Vert M^* \Vert \sqrt{r}$, resulting in $\log (1+ n\Vert M^* \Vert )$.

Similar to Theorem 3.3, with the inclusion of additional assumptions, we can derive concentration results for recovering the underlying matrix parameter as well as results for the mean estimator defined in (2).

Theorem 4.3

Under the same assumption as in Theorem 4.2 and additional assume that Assumption 3.2 and Assumption 3.3 hold. We have that

$$\begin{aligned} \mathbb {P}\left[ \int \frac{ \Vert M - M^* \Vert ^2_F }{d_1 d_2} \pi _{n,\alpha }(\textrm{d} M ) \le \frac{c_\alpha }{C_1C_\kappa } \varepsilon _n \right] \ge 1-\frac{2}{n\varepsilon _n}. \end{aligned}$$

(7)

and

$$\begin{aligned} \mathbb {P}\left[ \frac{ \Vert \hat{M} - M^* \Vert ^2_F }{d_1 d_2} \le \frac{c_\alpha }{C_1C_\kappa } \varepsilon _n \right] \ge 1-\frac{2}{n\varepsilon _n}. \end{aligned}$$

(8)

Remark 9

Similar to the outcomes detailed in Sect. 3, the results presented in this section for the spectral scaled Student prior do not necessitate prior knowledge of r, the rank of the true underlying parameter matrix. This underscores the adaptive nature of our results, demonstrating their capacity to adjust and perform effectively, regardless of the rank of the true underlying parameter matrix.

5 Concluding remarks

This paper presents an in-depth theoretical examination of Bayesian 1-bit matrix completion, addressing a gap in the machine learning literature. Our study takes into account a general, non-uniform sampling scheme and offers theoretical assurances for the effectiveness of the fractional posterior. We derive concentration results for the fractional posterior and validate its ability to recover the true parameter matrix. Our approach utilizes two types of priors: low-rank factorization priors and a spectral scaled Student’s t-distribution prior, with the latter needing fewer assumptions. Crucially, our results adaptively handle the matrix rank without requiring to be given. Our findings match those in the frequentist literature but with fewer restrictive assumptions.

While our work yields promising theoretical results for Bayesian 1-bit matrix completion, there remain several avenues for future research. One potential extension involves integrating additional covariate information into our methodology. Another critical area necessitating further investigation in practical applications is the tuning of the learning rate $\alpha$. Although cross-validation can be employed for this purpose, it incurs significant computational costs. The optimal tuning of this parameter presents a challenging problem in practice and constitutes an open research question that has garnered considerable attention within the framework of generalized Bayesian inference, as highlighted by Wu and Martin (2023) and related literature. While our study considers a general sampling setting, it has certain limitations: with a fairly high probability, some entries may be sampled multiple times. It would be more practical to assume that entries are sampled without replacement. Future research should address this limitation.

References

Alquier, P., Cottet, V., Chopin, N., Rousseau, J. (2014). Bayesian matrix completion: prior specification and consistency. arXiv preprint arXiv:1406.1440.
Alquier, P., Cottet, V., Chopin, N., Rousseau, J. (2014). Bayesian matrix completion: prior specification and consistency. arXiv preprint arXiv:1406.1440.
Alquier, P., Cottet, V., & Lecué, G. (2019). Estimation bounds and sharp oracle inequalities of regularized procedures with lipschitz loss functions. The Annals of Statistics, 47(4), 2117–2144.
MathSciNet MATH Google Scholar
Alquier, P., & Ridgway, J. (2020). Concentration of tempered posteriors and of their variational approximations. The Annals of Statistics, 48(3), 1475–1497.
MathSciNet MATH Google Scholar
Babacan, S. D., Luessi, M., Molina, R., & Katsaggelos, A. K. (2012). Sparse Bayesian methods for low-rank matrix estimation. IEEE Transactions on Signal Processing, 60(8), 3964–3977.
MathSciNet MATH Google Scholar
Bennett, J., & Lanning, S. (2007). The netflix prize. In: Proceedings of KDD Cup and Workshop, vol. 2007, p. 35.
Bhattacharya, A., Pati, D., & Yang, Y. (2019). Bayesian fractional posteriors. Annals of Statistics, 47(1), 39–66.
MathSciNet MATH Google Scholar
Bissiri, P. G., Holmes, C. C., & Walker, S. G. (2016). A general framework for updating belief distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5), 1103–1130.
MathSciNet MATH Google Scholar
Bobadilla, J., Ortega, F., Hernando, A., & Gutiérrez, A. (2013). Recommender systems survey. Knowledge-based systems,46, 109–132.
Cai, T., & Zhou, W.-X. (2013). A max-norm constrained minimization approach to 1-bit matrix completion. Journal of Machine Learning Research, 14(1), 3619–3647.
MathSciNet MATH Google Scholar
Candes, E. J., & Plan, Y. (2010). Matrix completion with noise. Proceedings of the IEEE, 98(6), 925–936.
MATH Google Scholar
Candès, E. J., & Recht, B. (2009). Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6), 717–772. https://doi.org/10.1007/s10208-009-9045-5
Article MathSciNet MATH Google Scholar
Candès, E. J., & Tao, T. (2010). The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5), 2053–2080.
MathSciNet MATH Google Scholar
Chatterjee, S. (2015). Matrix estimation by universal singular value thresholding. The Annals of Statistics, 43(1), 177–214.
MathSciNet MATH Google Scholar
Chen, Y., Fan, J., Ma, C., & Yan, Y. (2019). Inference and uncertainty quantification for noisy matrix completion. Proceedings of the National Academy of Sciences, 116(46), 22931–22937.
MathSciNet MATH Google Scholar
Chi, E. C., Zhou, H., Chen, G. K., Del Vecchyo, D. O., & Lange, K. (2013). Genotype imputation via matrix completion. Genome research, 23(3), 509–518.
Google Scholar
Cottet, V., & Alquier, P. (2018). 1-bit matrix completion: Pac-bayesian analysis of a variational approximation. Machine Learning, 107(3), 579–603.
MathSciNet MATH Google Scholar
Dalalyan, A. S. (2017). Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B Statistical Methodology, 79(3), 651–676.
MathSciNet MATH Google Scholar
Dalalyan, A. S. (2020). Exponential weights in multivariate regression and a low-rankness favoring prior. Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, 56(2), 1465–1483.
MathSciNet MATH Google Scholar
Dalalyan, A. S., & Karagulyan, A. (2019). User-friendly guarantees for the langevin monte carlo with inaccurate gradient. Stochastic Processes and their Applications, 129(12), 5278–5311.
MathSciNet MATH Google Scholar
Dalalyan, A. S., & Tsybakov, A. (2012). Mirror averaging with sparsity priors. Bernoulli, 18(3), 914–944.
MathSciNet MATH Google Scholar
Dalalyan, A. S., & Tsybakov, A. B. (2012). Sparse regression learning by aggregation and langevin monte-carlo. Journal of Computer and System Sciences, 78(5), 1423–1443.
MathSciNet MATH Google Scholar
Davenport, M. A., Plan, Y., Van Den Berg, E., & Wootters, M. (2014). 1-Bit matrix completion. Information and Inference: A Journal of the IMA, 3(3), 189–223.
MathSciNet MATH Google Scholar
Durmus, A., & Moulines, E. (2017). Nonasymptotic convergence analysis for the unadjusted langevin algorithm. The Annals of Applied Probability, 27(3), 1551–1587.
MathSciNet MATH Google Scholar
Durmus, A., & Moulines, E. (2019). High-dimensional Bayesian inference via the unadjusted langevin algorithm. Bernoulli, 25(4A), 2854–2882.
MathSciNet MATH Google Scholar
Friel, N., & Pettitt, A. N. (2008). Marginal likelihood estimation via power posteriors. Journal of the Royal Statistical Society Series B: Statistical Methodology, 70(3), 589–607.
MathSciNet MATH Google Scholar
Gross, D. (2011). Recovering low-rank matrices from few coefficients in any basis. IEEE Transactions on Information Theory, 57(3), 1548–1566.
MathSciNet MATH Google Scholar
Grünwald, P., & Van Ommen, T. (2017). Inconsistency of Bayesian inference for misspecified linear models, and a proposal for repairing it. Bayesian Analysis, 12(4), 1069–1103.
MathSciNet MATH Google Scholar
Hammer, H. L., Riegler, M. A., & Tjelmeland, H. (2023). Approximate bayesian inference based on expected evaluation. Bayesian Analysis, 19(3), 677–98.
MathSciNet MATH Google Scholar
Han, X., Wu, J., Wang, L., Chen, Y., Senhadji, L., Shu, H.: Linear total variation approximate regularized nuclear norm optimization for matrix completion. In: Abstract and Applied Analysis, vol, 2014 (2014). Hindawi
Herbster, M., Pasteris, S., Pontil, M.: Mistake bounds for binary matrix completion. Advances in Neural Information Processing Systems, 29 (2016)
Hong, L., & Martin, R. (2020). Model misspecification, Bayesian versus credibility estimation, and Gibbs posteriors. Scandinavian Actuarial Journal, 2020(7), 634–649.
MathSciNet MATH Google Scholar
Hsieh, C.-J., Natarajan, N., & Dhillon, I. (2015) Pu learning for matrix completion. In: International Conference on Machine Learning, pp. 2445–2453. PMLR
Jewson, J., & Rossell, D. (2022). General bayesian loss function selection and the use of improper models. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(5), 1640–1665.
MathSciNet MATH Google Scholar
Ji, H., Liu, C., Shen, Z., Xu, Y.: Robust video denoising using low rank matrix completion. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1791–1798 (2010). IEEE
Jiang, B., Ma, S., Causey, J., Qiao, L., Hardin, M. P., Bitts, I., Johnson, D., Zhang, S., & Huang, X. (2016). Sparrec: An effective matrix completion framework of missing data imputation for gwas. Scientific reports, 6(1), 35534.
Google Scholar
Klopp, O. (2014). Noisy low-rank matrix completion with general sampling distribution. Bernoulli, 20(1), 282–303. https://doi.org/10.3150/12-BEJ486
Article MathSciNet MATH Google Scholar
Klopp, O., Lafond, J., Moulines, É., & Salmon, J. (2015). Adaptive multinomial matrix completion. Electronic Journal of Statistics, 9, 2950–2975.
MathSciNet MATH Google Scholar
Knoblauch, J., Jewson, J., & Damoulas, T. (2022). An optimization-centric view on bayes’ rule: Reviewing and generalizing variational inference. Journal of Machine Learning Research, 23(132), 1–109.
MathSciNet MATH Google Scholar
Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix factorization techniques for recommender systems. Computer, 42(8), 30–37.
MATH Google Scholar
Kyung, M., Gill, J., Ghosh, M., & Casella, G. (2010). Penalized regression, standard errors, and bayesian lassos. Bayesian Analysis, 5(2), 369–412.
MathSciNet MATH Google Scholar
Lim, Y. J., & Teh, Y. W. (2007). Variational bayesian approach to movie rating prediction. Proceedings of KDD cup and workshop, 7, 15–21.
MATH Google Scholar
Lyddon, S. P., Holmes, C., & Walker, S. (2019). General bayesian updating and the loss-likelihood bootstrap. Biometrika, 106(2), 465–478.
MathSciNet MATH Google Scholar
Mai, T. T. (2023). From bilinear regression to inductive matrix completion: A Quasi-Bayesian analysis. Entropy, 25(2), 333.
MathSciNet MATH Google Scholar
Mai, T. T. (2023). A reduced-rank approach to predicting multiple binary responses through machine learning. Statistics and Computing, 33(6), 136.
MathSciNet MATH Google Scholar
Mai, T. T., & Alquier, P. (2015). A bayesian approach for noisy matrix completion: Optimal rate under general sampling distribution. Electron. J. Statist., 9(1), 823–841.
MathSciNet MATH Google Scholar
Mai, T. T., & Alquier, P. (2017). Pseudo-Bayesian quantum tomography with rank-adaptation. Journal of Statistical Planning and Inference, 184, 62–76.
MathSciNet MATH Google Scholar
Matsubara, T., Knoblauch, J., Briol, F.-X., & Oates, C. J. (2022). Robust generalised bayesian inference for intractable likelihoods. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(3), 997–1022.
MathSciNet MATH Google Scholar
Medina, M. A., Olea, J. L. M., Rush, C., & Velez, A. (2022). On the robustness to misspecification of $\alpha$-posteriors and their variational approximations. Journal of Machine Learning Research, 23(147), 1–51.
MathSciNet Google Scholar
Merkle, M. (2005). Jensen’s inequality for medians. Statistics & probability letters, 71(3), 277–281.
MathSciNet MATH Google Scholar
Park, T., & Casella, G. (2008). The Bayesian lasso. Journal of the american statistical association, 103(482), 681–686.
MathSciNet MATH Google Scholar
Recht, B., & Ré, C. (2013). Parallel stochastic gradient algorithms for large-scale matrix completion. Mathematical Programming Computation, 5(2), 201–226.
MathSciNet MATH Google Scholar
Salakhutdinov, R., Mnih, A.: Bayesian probabilistic matrix factorization using markov chain monte carlo. In: Proceedings of the 25th International Conference on Machine Learning, pp. 880–887 (2008). ACM
Syring, N., & Martin, R. (2019). Calibrating general posterior credible regions. Biometrika, 106(2), 479–486.
MathSciNet MATH Google Scholar
Tsybakov, A. B., Koltchinskii, V., & Lounici, K. (2011). Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. Annals of Statistics, 39(5), 2302–2329.
MathSciNet MATH Google Scholar
Van Erven, T., & Harremos, P. (2014). Rényi divergence and kullback-leibler divergence. IEEE Transactions on Information Theory, 60(7), 3797–3820.
MathSciNet MATH Google Scholar
Wu, P.-S., & Martin, R. (2023). A comparison of learning rate selection methods in generalized Bayesian inference. Bayesian Analysis, 18(1), 105–132.
MathSciNet MATH Google Scholar
Yang, L., Fang, J., Duan, H., Li, H., & Zeng, B. (2018). Fast low-rank Bayesian matrix completion with hierarchical gaussian prior models. IEEE Transactions on Signal Processing, 66(11), 2804–2817.
MathSciNet MATH Google Scholar
Yang, Y., Pati, D., & Bhattacharya, A. (2020). $\alpha$-variational inference with statistical guarantees. Annals of Statistics, 48(2), 886–905.
MathSciNet MATH Google Scholar
Yonekura, S., & Sugasawa, S. (2023). Adaptation of the tuning parameter in general bayesian inference with robust divergence. Statistics and Computing, 33(2), 39.
MathSciNet MATH Google Scholar

Download references

Acknowledgements

The author was supported by the Norwegian Research Council, grant number 309960, through the Centre for Geophysical Forecasting at NTNU. The author expresses gratitude to three anonymous reviewers who generously reviewed the earlier version of this paper, providing valuable suggestions and insightful comments that significantly improved its presentation.

Funding

Open access funding provided by NTNU Norwegian University of Science and Technology (incl St. Olavs Hospital - Trondheim University Hospital).

Author information

Authors and Affiliations

Department of Mathematical Sciences, Norwegian University of Science and Technology, 7034, Trondheim, Norway
The Tien Mai

Authors

The Tien Mai
View author publications
Search author on:PubMed Google Scholar

Corresponding author

Correspondence to The Tien Mai.

Ethics declarations

Conflict of interest

The author declares no potential Conflict of interest.

Additional information

Editors: Kee-Eung Kim, Shou-De Lin.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Proofs

1.1 Proofs for Sect. 3

Proof of Theorem 3.1

As the logistic loss is 1-Lipschitz, the log-likelihood satisfies that

$$\begin{aligned} \left| \log f(x)-\log f(y) \right| \le |x-y| . \end{aligned}$$

One has that

$$\begin{aligned} \mathcal {K} (P_{M^*}, P_M)&\le \frac{1}{d_1 d_2} \sum _{i\in [d_1]} \sum _{j\in [d_2]} \Pi _{ij} |M^*_{ij} - M_{ij} | \\&\le \frac{1}{d_1 d_2} \sum _{i\in [d_1]} \sum _{j\in [d_2]} |M^*_{ij} - M_{ij} | \\&\le \frac{\Vert M^* -M \Vert _F }{ \sqrt{ d_1 d_2} } , \end{aligned}$$

where $\Pi _{ij}\le 1$ is the probability to observe the (i, j) -th entry. For any (U, V) in the support of $\rho _n$, given in (13), one has that

$$\begin{aligned} \Vert M^* - UV^\top \Vert _F&= \Vert \bar{U}\bar{V}^\top -\bar{U}V^\top +\bar{U}V^\top -UV^\top \Vert _F \\&\le \Vert \bar{U}(\bar{V}^\top -V^\top )\Vert _F+\Vert (\bar{U}-U)V^\top \Vert _F \\&\le \Vert \bar{U}\Vert _F \Vert \bar{V}-V\Vert _F+\Vert \bar{U}-U\Vert _F \Vert V^\top \Vert _F \\&\le d_1 d_2 \Vert \bar{U}\Vert _{\infty }^{1/2} \Vert \bar{V}-V\Vert _\infty ^{1/2} +d_1 d_2 \Vert V\Vert _{\infty }^{1/2} \Vert \bar{U}-U\Vert _\infty ^{1/2} \\&\le d_1 d_2 \delta ^{1/2} [B^{1/2} + (B+\delta )^{1/2} ] \\&\le 2 d_1 d_2 \delta ^{1/2} (B+\delta )^{1/2} \\&\le 2^{3/2} d_1 d_2 \delta ^{1/2} B^{1/2} . \end{aligned}$$

Therefore,

$$\begin{aligned} \mathcal {K} (P_{M^*}, P_M) \le \frac{\Vert M^* -M \Vert _F }{ \sqrt{ d_1 d_2} } \le \sqrt{ \delta 2^{3} d_1 d_2 B} . \end{aligned}$$

(9)

For $\delta =B/[8(nd_1 d_2)^2]$ that satisfies $0<\delta <B$, we have that

$$\begin{aligned} \Vert M^* - UV^\top \Vert _F \le B/n \end{aligned}$$

(10)

and

$$\begin{aligned} \int \mathcal {K} (P_{M^*}, P_M) \rho _n(dM) \le \frac{B}{n\sqrt{d_1d_2}} . \end{aligned}$$

Now, from Lemma 1, we have that

$$\begin{aligned} \frac{1}{n}\mathcal {K}(\rho _n,\pi ) \le \frac{ 2(1+2a) r(d_1 + d_2) \left[ \log (nd_1 d_2) + C_a \right] }{n}. \end{aligned}$$

We now can apply Theorem 2.6 in Alquier and Ridgway (2020) with $\rho _n$ in (13) and

$$\begin{aligned} \varepsilon _n = C_{a,B}\frac{ r(d_1+d_2)\log (nd_1d_2)}{n} \end{aligned}$$

to obtain the result. The proof is completed.

Proof of Theorem 3.2

As the logistic loss is 1-Lipschitz, the log-likelihood satisfies that $\left| \log f(x)-\log f(y) \right| \le |x-y|$. Thus, we can deduce that

$$\begin{aligned} \mathbb {E}\left[ \log \left( \frac{p_M}{ p_{M^*} } \right) ^2 \right]&\le \frac{1}{d_1 d_2} \sum _{i\in [d_1]} \sum _{j\in [d_2]} \Pi _{ij} \left( \log f(M^*_{ij}) -\log f(M_{ij})\right) ^2 \nonumber \\&\le \frac{1}{d_1 d_2} \sum _{i\in [d_1]} \sum _{j\in [d_2]} \left( \log f(M^*_{ij}) -\log f(M_{ij})\right) ^2 \nonumber \\&\le \frac{1}{d_1 d_2} \sum _{i\in [d_1]} \sum _{j\in [d_2]} \left( M^*_{ij}- M_{ij}\right) ^2 \nonumber \\&\le \frac{1}{d_1 d_2} \Vert M^* -M \Vert ^2_F , \end{aligned}$$

(11)

where $\Pi _{ij}\le 1$ is the probability to observe the (i, j) -th entry. From (9), we have that

$$\begin{aligned} \mathcal {K} (P_{M^*}, P_M) \le \frac{\Vert M^* -M \Vert _F }{ \sqrt{ d_1 d_2} } \le \sqrt{ \delta 2^{3} d_1 d_2 B} , \end{aligned}$$

and

$$\begin{aligned} \mathbb {E}\left[ \log \left( \frac{p_M}{ p_{M^*} } \right) ^2 \right] \le \frac{\Vert M^* -M \Vert ^2_F}{d_1 d_2} \le \delta 2^{3} d_1 d_2 B . \end{aligned}$$

For any (U, V) in the support of $\rho _n$, given in (13), we observe that for $\delta =\frac{B}{8(n d_1 d_2)^2}$, where $\delta$ satisfies $0< \delta < B$, and from equation (10) we can deduce that

$$\begin{aligned} \int \mathcal {K} (P_{M^*}, P_M) \rho _n(dM) \le \frac{B}{n\sqrt{ d_1 d_2}}, \end{aligned}$$

and

$$\begin{aligned} \int \mathbb {E}\left[ \log ^2 \left( \frac{p_M}{ p_{M^*} }\right) \right] \rho _n(dM) \le \frac{B^2}{n^2 d_1 d_2} . \end{aligned}$$

Now, from Lemma 1, we have that

$$\begin{aligned} \frac{1}{n}\mathcal {K}(\rho _n,\pi ) \le \frac{ 2(1+2a) r(d_1 + d_2) \left[ \log (nd_1 d_2) + C_a \right] }{n}. \end{aligned}$$

We now can apply Corollary 2.5 and Theorem 2.4 in Alquier and Ridgway (2020) with $\rho _n$ in (13) and

$$\begin{aligned} \varepsilon _n = C_{a,B}\frac{ r(d_1+d_2)\log (nd_1d_2)}{n} \end{aligned}$$

to obtain the result. The proof is completed.

Proof of Corollary 1

From Van Erven and Harremos (2014), we have that

$$H^2(P,Q) \le D_{1/2}(P,Q) \le D_{\alpha }(P,Q),$$

for $\alpha \in [0.5,1)$. In addition, we also have that

$$D_{1/2}(P,Q) \le \frac{(1-\alpha )1/2}{\alpha (1-1/2)} D_{\alpha }(P,Q) = \frac{(1-\alpha )}{\alpha } D_{\alpha }(P,Q),$$

for $\alpha \in (0, 0.5)$.

Thus, using definition of $c_\alpha$ and Theorem 3.2, we obtain the results.

Proof of Theorem 3.3

From (3), we have that

$$\begin{aligned} \mathbb {P}\left[ \int H^2(P_M ,P_{M^*} ) \pi _{n,\alpha }(\textrm{d} M ) \le c_\alpha \varepsilon _n \right] \ge 1-\frac{2}{n\varepsilon _n} , \end{aligned}$$

from Lemma 2, one has that

$$\begin{aligned} \mathbb {P}\left[ \int \frac{C_1 C_\kappa \Vert M - M^* \Vert ^2_F }{d_1 d_2} \pi _{n,\alpha }(\textrm{d} M ) \le c_\alpha \varepsilon _n \right] \ge 1-\frac{2}{n\varepsilon _n} , \end{aligned}$$

thus, we obtain (4). To obtain (5), one can apple Jensen’s inequality for a convex function, that

$$\begin{aligned} \Vert \hat{M} - M^* \Vert ^2_F = \bigg \Vert \int M \pi _{n,\alpha }(\textrm{d}M) - M^* \bigg \Vert ^2_F \le \int \Vert M- M^* \Vert ^2_F \pi _{n,\alpha }(\textrm{d}M). \end{aligned}$$

This completes the proof.

1.2 Proofs for Sect. 4

Proof of Theorem 3.3

From (9), we have that

$$\begin{aligned} \mathcal {K} (P_{M^*}, P_M) \le \frac{\Vert M^* -M \Vert _F }{ \sqrt{ d_1 d_2} } . \end{aligned}$$

When integrating with respect to $\rho _n:= \rho _{0}$ given in (14), we have that

$$\begin{aligned} \int \mathcal {K} (P_{M^*}, P_M) \rho _n(dM) \nonumber&\le \int \frac{\Vert M^* -M \Vert _F }{ \sqrt{ d_1 d_2} } \rho _0(dM) \nonumber \\&= \int \frac{\Vert M^* -M \Vert _F }{ \sqrt{ d_1 d_2} } \pi _{st} (M - M^*)dM \nonumber \\&= \frac{1}{ \sqrt{ d_1 d_2} } \int \Vert M \Vert _F \pi _{st} (M)dM \nonumber \\&\le \frac{1}{ \sqrt{ d_1 d_2} } \left( \int \Vert M \Vert _F^2 \pi _{st}(M)dM \right) ^{1/2} \nonumber \\&\le \frac{1}{ \sqrt{ d_1 d_2} } \sqrt{d_1 d_2 \tau ^2} = \tau , \end{aligned}$$

(12)

where we have used Holder’s inequality and Lemma 3 to obtain the result. Now, from Lemma 4, we have that

$$\begin{aligned} \frac{1}{n}\mathcal {K}(\rho _n,\pi _{st}) \le \frac{ 2 r (d_1 + d_2 +2) \log \left( 1+ \frac{\Vert M^* \Vert _F}{\tau \sqrt{2r }} \right) }{n}. \end{aligned}$$

Taking $\tau = 1/n$, we obtain that

$$\begin{aligned}&\int \mathcal {K} (P_{M^*}, P_M) \rho _n(dM) \le \frac{1}{n} , \\&\frac{1}{n}\mathcal {K}(\rho _n,\pi _{st}) \le \frac{ 2 r (d_1 + d_2 +2) \log \left( 1+ \frac{ n \Vert M^* \Vert _F}{ \sqrt{2r }} \right) }{n} . \end{aligned}$$

We now can apply Theorem 2.6 in Alquier and Ridgway (2020) with

$$\varepsilon _n = \frac{ 2 r (d_1 + d_2 +2) \log \left( 1+ \frac{ n \Vert M^* \Vert _F}{\sqrt{2r }} \right) }{n}$$

to obtain the result. The proof is completed.

Proof of Theorem 4.2

From (9), we have that

$$\begin{aligned} \mathcal {K} (P_{M^*}, P_M) \le \frac{\Vert M^* -M \Vert _F }{ \sqrt{ d_1 d_2} } . \end{aligned}$$

When integrating with respect to $\rho _n:= \rho _{0}$ given in (14), and from (12), we have that

$$\begin{aligned} \int \mathcal {K} (P_{M^*}, P_M) \rho _n(dM) \le \tau . \end{aligned}$$

Now, from Lemma 4, we have that

$$\begin{aligned} \frac{1}{n}\mathcal {K}(\rho _n,\pi _{st}) \le \frac{ 2 r (d_1 + d_2 +2) \log \left( 1+ \frac{\Vert M^* \Vert _F}{\tau \sqrt{2r }} \right) }{n}. \end{aligned}$$

Moreover, from (11), one has that

$$\begin{aligned} \mathbb {E}\left[ \log \left( \frac{p_M}{ p_{M^*} } \right) ^2 \right] \le \frac{\Vert M -M^* \Vert ^2_F}{d_1 d_2} , \end{aligned}$$

and when integrating with respect to $\rho _n:= \rho _{0}$ given in (14), it leads to

$$\begin{aligned} \int \mathbb {E}\left[ \log \left( \frac{p_M}{ p_{M^*} } \right) ^2 \right] \rho _n(dM)&\le \int \frac{\Vert M -M^* \Vert ^2_F}{d_1 d_2} \rho _n(dM) \\&= \int \frac{\Vert M-M^* \Vert ^2_F}{d_1 d_2} \pi _{st} (M - M^*)dM \\&= \frac{1}{d_1 d_2} \int \Vert M \Vert _F^2 \pi _{st}(M)dM \\&\le \frac{d_1 d_2 \tau ^2}{d_1 d_2} = \tau ^2 , \end{aligned}$$

where we have used a change of variable and Lemma 3 to obtain the result.

Now, by taking $\tau = 1/n$, we obtain that

$$\begin{aligned} \int \mathcal {K} (P_{M^*}, P_M) \rho _n(dM)&\le \frac{1}{n}, \\ \int \mathbb {E}\left[ \log \left( \frac{p_M}{ p_{M^*} } \right) ^2 \right] \rho _n(dM)&\le \frac{1}{n^2}, \\ \frac{1}{n}\mathcal {K}(\rho _n,\pi _{st})&\le \frac{ 2 r (d_1 + d_2 +2) \log \left( 1+ \frac{ n \Vert M^* \Vert _F}{ \sqrt{2r }} \right) }{n} . \end{aligned}$$

We now can apply Theorem 2.4 and Corollary 2.5 in Alquier and Ridgway (2020) with

$$\varepsilon _n = \frac{ 2 r (d_1 + d_2 +2) \log \left( 1+ \frac{ n \Vert M^* \Vert _F}{ \sqrt{2r }} \right) }{n}$$

to obtain the result. The proof is completed.

Proof of Theorem 4.3

From Theorem 4.2, using a bound for Hellinger distance as in Corollary 1, we have that

$$\begin{aligned} \mathbb {P}\left[ \int H^2(P_M ,P_{M^*} ) \pi _{n,\alpha }(\textrm{d} M ) \le c_\alpha \varepsilon _n \right] \ge 1-\frac{2}{n\varepsilon _n} , \end{aligned}$$

from Lemma 2, it yields that

$$\begin{aligned} \mathbb {P}\left[ \int \frac{C_1 C_\kappa \Vert M - M^* \Vert ^2_F }{d_1 d_2} \pi _{n,\alpha }(\textrm{d} M ) \le c_\alpha \varepsilon _n \right] \ge 1-\frac{2}{n\varepsilon _n} , \end{aligned}$$

thus, we obtain (7). To obtain (8), one can apple Jensen’s inequality for a convex function, that

$$\begin{aligned} \Vert \hat{M} - M^* \Vert ^2_F = \bigg \Vert \int M \pi _{n,\alpha }(\textrm{d}M) - M^* \bigg \Vert ^2_F \le \int \Vert M- M^* \Vert ^2_F \pi _{n,\alpha }(\textrm{d}M), \end{aligned}$$

and combine with result in (7). This completes the proof.

1.3 Lemma

Definition 1

Fix $B>0$, $r\ge 1$. For any pair $(\bar{U},\bar{V})\in \mathcal {M}(r,B)$, we define for $\delta \in (0,B)$ that will be chosen later,

$$\begin{aligned} \rho _n(\textrm{d}U,\textrm{d}V,\textrm{d}\gamma ) \propto \textbf{1}_{(\Vert U-\bar{U}\Vert _{\infty } \le \delta ,\Vert U-\bar{U}\Vert _{\infty } \le \delta )} \pi (\textrm{d}U,\textrm{d}V,\textrm{d}\gamma ). \end{aligned}$$

(13)

Lemma 1

Put $C_a:= \log (8\sqrt{\pi }\Gamma (a)2^{10a+1})+3$. For $\delta =B/[8(nd_1 d_2)^2]$ that satisfies $0<\delta <B$, and with $b = B^2/[512(nd_1d_2)^4 K^2 \max ^2(d_1,d_2)]$, we have for $\rho _n$ in (13) that

$$\begin{aligned} \mathcal {K}(\rho _n,\pi ) \le 2(1+2a) r(d_1 + d_2) \left[ \log (nd_1 d_2) + C_a \right] . \end{aligned}$$

Proof of Lemma 1

This result can found, for example, in the proof of Theorem 4.1 in Alquier and Ridgway (2020).

Lemma 2

For any matrix $A \in \mathbb {R}^{d_1 \times d_2}$ and $B \in \mathbb {R}^{d_1 \times d_2}$ satisfying that $\Vert A\Vert _\infty \le \kappa$ and $\Vert B\Vert _\infty \le \kappa$, under Assumption 3.3 and Assumption 3.2, one has that

$$\begin{aligned} \frac{ \Vert A - B \Vert ^2_F }{d_1 d_2} \le \frac{ H^2(P_{A},P_{B}) }{C_1 C_\kappa }. \end{aligned}$$

Proof of Lemma 2

This is Lemma A.2 in Davenport et al. (2014). With $d^2_H (p,q):= (\sqrt{p} -\sqrt{q} )^2 + (\sqrt{1-p} -\sqrt{1-q} )^2$ for two number $p,q \in [0,1]$, it is noting under Assumption 3.2 that

$$\begin{aligned} H^2(P_{A},P_{B})&= \frac{1}{d_1 d_2} \sum _{i\in [d_1]} \sum _{j\in [d_2]} \Pi _{ij} d^2_H (f(A_{ij}),f(B_{ij})) \\&\ge \frac{C_1}{d_1 d_2} \sum _{i\in [d_1]} \sum _{j\in [d_2]} d^2_H (f(A_{ij}),f(B_{ij})) \\&\ge C_1 H^2(f(A),f(B)) . \end{aligned}$$

where $\Pi _{ij}$ is the probability to observe the (i, j) -th entry. Now from Lemma A.2 in Davenport et al. (2014), under Assumption 3.3, one has that

$$\begin{aligned} H^2(f(A),f(B)) \ge C_\kappa \frac{ \Vert A - B \Vert ^2_F }{d_1 d_2}. \end{aligned}$$

The argument is also similar to Lemma 9 and Lemma 11 in Klopp et al. (2015). This completes the proof.

Finally, we will use quite often the following distribution that will be defined as translations of the prior $\pi _{st}$ in (6). We introduce the following notation.

Definition 2

Let’s define

$$\begin{aligned} \rho _{0}(M) = \pi _{st} (M - M^*). \end{aligned}$$

(14)

The following technical lemmas will be useful in the proofs.

Lemma 3

(Lemma 1 in Dalalyan (2020)) We have

$$\int \Vert M \Vert _{F}^2 \pi _{st} ( M )\textrm{d} M \le d_1 d_2 \tau ^2.$$

Lemma 4

(Lemma 2 in Dalalyan (2020)) We have

$$\mathcal {K}( \rho _{0 }, \pi _{st}) \le 2 r (d_1 + d_2 +2) \log \left( 1+ \frac{\Vert M^* \Vert _F}{\tau \sqrt{2r }} \right) ,$$

with the convention $0\log (1+0/0)=0$.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Mai, T.T. Concentration properties of fractional posterior in 1-bit matrix completion. Mach Learn 114, 7 (2025). https://doi.org/10.1007/s10994-024-06691-z

Download citation

Received: 15 April 2024
Revised: 06 August 2024
Accepted: 12 December 2024
Published: 14 January 2025
Version of record: 14 January 2025
DOI: https://doi.org/10.1007/s10994-024-06691-z

Concentration properties of fractional posterior in 1-bit matrix completion

Abstract

Similar content being viewed by others

Deep Bayesian Matrix Factorization

A Low-Complexity Approach to Computation of the Discrete Fractional Fourier Transform

Bayesian Estimation of the Precision Matrix with Monotone Missing Data

Explore related subjects

1 Introduction

2 Notations and method

2.1 Notations

2.2 1-Bit matrix completion

2.2.1 Low-rank factorization prior

3 Main results

3.1 Assumptions

Assumption 3.1

Assumption 3.2

Assumption 3.3

3.2 Main results

Theorem 3.1

Theorem 3.2

Remark 1

Corollary 1

Remark 2

Theorem 3.3

Remark 3

Remark 4

Remark 5

Remark 6

4 Results with a spectral scaled student prior

Theorem 4.1

Theorem 4.2

Remark 7

Remark 8

Theorem 4.3

Remark 9

5 Concluding remarks

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Proofs

Proofs

1.1 Proofs for Sect. 3

Proof of Theorem 3.1

Proof of Theorem 3.2

Proof of Corollary 1

Proof of Theorem 3.3

1.2 Proofs for Sect. 4

Proof of Theorem 3.3

Proof of Theorem 4.2

Proof of Theorem 4.3

1.3 Lemma

Definition 1

Lemma 1

Proof of Lemma 1

Lemma 2

Proof of Lemma 2

Definition 2

Lemma 3

Lemma 4

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Profiles