A Bayesian Approach for Noisy Matrix Completion: Optimal Rate under General Sampling Distribution

Bayesian methods for low-rank matrix completion with noise have been shown to be very efficient computationally. While the behaviour of penalized minimization methods for this problem is well understood from both the theoretical and the computational points of view, the theoretical optimality of Bayesian estimators has not been explored yet. In this paper, we propose a Bayesian estimator for matrix completion under a general sampling distribution. We also provide an oracle inequality for this estimator. This inequality shows that, whatever the rank of the matrix to be estimated, our estimator reaches the minimax-optimal rate of convergence (up to a logarithmic factor). We end the paper with a short simulation study.


Introduction
The "Netflix Prize" [5] generated significant interest in the matrix completion problem. The Netflix data can be represented as a sparse matrix made up of ratings given by users (rows) to movies (columns). Inferring the missing entries is thus very helpful for proposing sensible recommendations and improving sales. However, it is impossible to recover an incomplete matrix without any assumption. A suitable condition, popular in practice for this problem, is that the matrix has low rank or approximately low rank [1,3,7,8,9,15,16]. For the Netflix problem, this assumption is sensible, as it means that many movies (or users) have similar profiles.
Let M^0 be an unknown m × p matrix (expected to be low-rank) and (X_1, Y_1), ..., (X_n, Y_n) be i.i.d. random variables such that

Y_i = (M^0)_{X_i} + E_i,   i = 1, ..., n.   (1)

The noise variables E_i are assumed to be independent with E(E_i) = 0. The variables X_i are i.i.d. copies of a random variable X having distribution Π on the set X = {1, ..., m} × {1, ..., p}. Then, the problem of estimating M^0 with n < mp is called the noisy matrix completion problem under a general sampling distribution.
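As a concrete illustration of the observation model (1), the following sketch simulates data from it. The rank of M^0, the noise level and the uniform choice of Π below are arbitrary illustrations, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

m, p, n = 100, 120, 2000

# A low-rank "true" matrix M0 (rank 2 here, purely for illustration)
M0 = rng.normal(size=(m, 2)) @ rng.normal(size=(2, p))

# Sampling distribution Pi over the mp entries; uniform is just one choice
Pi = np.full((m, p), 1.0 / (m * p))

# Draw n entry positions X_i ~ Pi and noisy observations Y_i = (M0)_{X_i} + E_i
flat_idx = rng.choice(m * p, size=n, p=Pi.ravel())
rows, cols = np.unravel_index(flat_idx, (m, p))
E = rng.normal(scale=0.5, size=n)  # centered (here sub-Gaussian) noise
Y = M0[rows, cols] + E
```

Any distribution Π over the entries can be plugged in by changing the array `Pi`; nothing in the model requires it to be uniform.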
A special setting of this problem assumes that the sampling distribution Π is uniform; this assumption is made for example in [3,7,8,9,16]. Clearly, in practice, the observed entries are not uniformly distributed: for example, some movies are more popular than others, and thus receive many more ratings. More importantly, the sampling distribution is not known in practice. More general sampling schemes than the uniform distribution have already been studied, see e.g. [14,15,21], but these papers still impose some assumptions on Π. Here, we do not impose any restriction on Π. From now on, Π_ij = P(X = (i, j)) will denote the probability of observing the (i, j)-th entry.
For any m × p matrix A, let ||A||_F denote the Frobenius norm, i.e., ||A||_F^2 = Tr(A^T A). We define a "generalized Frobenius norm" as follows:

||A||_{F,Π}^2 = Σ_{i,j} Π_{ij} A_{ij}^2 = E(A_X^2).

Note that when the sampling distribution Π is uniform, ||A||_{F,Π}^2 = (1/mp) ||A||_F^2. For any m × p matrix M, we define the empirical risk

r(M) = (1/n) Σ_{i=1}^n (Y_i − M_{X_i})^2

and the prediction risk

R(M) = E[(Y − M_X)^2].

In this paper, the prediction problem is considered, i.e., the objective is to define an estimator M̂ such that ||M̂ − M^0||_{F,Π}^2 is small; note that R(M) − R(M^0) = ||M − M^0||_{F,Π}^2 for any M (using the Pythagorean theorem). When dealing with this problem, most of the recent methods are based on minimizing a criterion of fit to the observations, such as r(M), penalized by the nuclear norm or the rank of the matrix. First results can be found in Candès and Recht [8] and Candès and Tao [9] for exact matrix completion (the noiseless case, i.e. E_i = 0). These results were then extended to the noisy case [7,16]. Efficient algorithms have also been proposed, see for example [22].
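The two norms and the empirical risk above translate directly into code. The following minimal sketch implements them with NumPy (the function names are our own, chosen for illustration):

```python
import numpy as np

def frobenius_pi_sq(A, Pi):
    """Generalized (Pi-weighted) squared Frobenius norm: sum_ij Pi_ij * A_ij^2."""
    return float(np.sum(Pi * A**2))

def empirical_risk(M, rows, cols, Y):
    """r(M) = (1/n) sum_i (Y_i - M_{X_i})^2 over the n observed entries."""
    return float(np.mean((Y - M[rows, cols]) ** 2))
```

With a uniform Pi, `frobenius_pi_sq(A, Pi)` reduces to `(1/(m*p)) * ||A||_F^2`, as stated in the text.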
Recently, some authors have studied a more general problem, the so-called trace regression problem [15,16]. This problem includes matrix completion, together with other well-known problems (linear regression, reduced-rank regression and multi-task learning), as special cases. They proposed nuclear-norm penalized estimators and gave reconstruction error bounds for their methods. They also proved that these errors are minimax-optimal (up to a logarithmic factor). Note that the average quadratic error on the entries of a rank-r matrix of size m × p from n observations cannot be better than r max(m, p)/n [16].
On the other hand, Bayesian methods have also been considered [3,17,18,23,26]. Most Bayesian estimators are based on conjugate priors, which allow the use of Gibbs sampling [3,23] or variational Bayes methods [18]. These priors are discussed in detail in [3]. These algorithms are fast enough to deal with large datasets like Netflix or MovieLens, and are actually tested on these datasets in those papers. However, the theoretical understanding of Bayesian algorithms is not satisfying. To our knowledge, [3] contains the first theoretical result on the Bayesian estimator using conjugate priors. Yet, it only proves the consistency of the estimator and fails to establish the optimal rate. The optimality of this estimator is, to our knowledge, an open question.
In this paper, we design a new prior and prove a minimax-optimal oracle bound for the corresponding Bayesian estimator. This is presented in Section 2. In Section 3, some experiments comparing our estimator to the one based on conjugate priors are conducted on small simulated datasets. However, the feasibility of this method for large datasets is still an open question. The proof of the main result and the details of our algorithm are provided in the appendices.

Main Result
Before we introduce our estimator, let us formulate some assumptions.
Assumption 1. There is a known L such that sup_{i,j} |M^0_{ij}| ≤ L.

This is a mild assumption. Consider the Netflix data as an example: the ratings belong to the set {1, 2, 3, 4, 5}, so we can take L = 5.
Assumption 2. The noise variables E_1, ..., E_n are independent and independent of X_1, ..., X_n. There exist two known constants σ > 0 and ξ > 0 such that E(E_i^2) ≤ σ^2 and, for all integers k ≥ 3, E(|E_i|^k) ≤ (k!/2) σ^2 ξ^{k−2}.

Assumption 2 states that the noise is sub-exponential; it includes the cases where the noise is bounded or sub-Gaussian (and of course Gaussian), see e.g. Chapter 2 in [6].
We now describe a prior π on m × p matrices M as follows. Let K = min(m, p) and let Γ be a random variable taking values in the set {Γ_1, ..., Γ_K}, where the last K − k entries of Γ_k are equal to 0, for some constant τ ∈ (0, 1) and k ∈ {1, ..., K}. Conditionally on Γ = Γ_k, a matrix M is drawn as M = U_{m×K} (V_{p×K})^T, where the entries of U and V are drawn uniformly on [−√(2L/K), √(2L/K)] (the columns corresponding to the zero entries of Γ_k being set to 0). Note that, in this case, the entries of M almost surely satisfy sup_{i,j} |M_{ij}| ≤ 2L.
We are now ready to define our estimator. For any λ > 0, we consider the conditional probability measure ρ̂_λ given by its density w.r.t. the probability measure π:

dρ̂_λ/dπ (M) = e^{−λ r(M)} / ∫ e^{−λ r} dπ.
The aggregate M̂_λ is defined as the mean of ρ̂_λ:

M̂_λ = ∫ M ρ̂_λ(dM).

Note that, for λ = n/(2σ^2), this corresponds exactly to the Bayesian estimator that would be obtained for Gaussian noise E_i ~ N(0, σ^2). However, a slightly different choice for λ, denoted by λ* below, allows us to obtain the optimality of the estimator under a wider class of noises: λ* = n/(2C), where C = C_{σ,L} ∨ w is a constant, depending only on σ, L and ξ, that is made explicit in Appendix 1. Hereafter, the main result is presented: we provide an oracle bound for our estimator M̂_{λ*}.
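To make the definitions of ρ̂_λ and M̂_λ concrete, the following toy sketch replaces the continuous prior π by a finite sample of candidate matrices, so that the density e^{−λ r(M)} reduces to softmax-type weights and the aggregate is a weighted average. The candidate set, dimensions, noise level and inclusion of the true matrix below are illustrative assumptions, not the construction of the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
m, p, n = 8, 8, 30

M0 = rng.normal(size=(m, 1)) @ rng.normal(size=(1, p))  # toy rank-1 truth
rows = rng.integers(0, m, n); cols = rng.integers(0, p, n)
Y = M0[rows, cols] + rng.normal(scale=0.1, size=n)

def r(M):  # empirical risk, as defined in the text
    return np.mean((Y - M[rows, cols]) ** 2)

# A finite stand-in for the prior: candidate matrices drawn as rank-1 products
candidates = [rng.normal(size=(m, 1)) @ rng.normal(size=(1, p)) for _ in range(500)]
candidates.append(M0)  # include one good candidate so the example is informative

lam = n / (2 * 0.1**2)  # lambda = n/(2 sigma^2), the Gaussian-posterior choice
risks = np.array([r(M) for M in candidates])
w = np.exp(-lam * (risks - risks.min()))  # subtract the min for numerical stability
w /= w.sum()                              # weights of rho_lambda on the candidates

M_hat = sum(wi * Mi for wi, Mi in zip(w, candidates))  # the aggregate (mean of rho_lambda)
```

The weights concentrate exponentially fast on candidates with small empirical risk, which is exactly the mechanism the oracle inequality exploits.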
The proof of this theorem, given in Appendix 1, follows an argument known as a "PAC-Bayesian inequality". PAC-Bayesian inequalities were introduced in [24,20] in order to provide empirical bounds on the prediction risk of Bayesian-type estimators. However, our proof is closer to Catoni's work [10,11,12], where it is shown how to derive powerful oracle inequalities from PAC-Bayesian bounds. This approach has been used many times since then to prove oracle inequalities in dimension-reduction problems such as sparse regression estimation [13,4,2] or reduced-rank regression [1].
The choice λ = λ* comes from the proof of this theorem, when optimizing an upper bound on the risk R, see (14) in Appendix 1. However, in practice, this choice may not be the best one. For example, in the experiments of Section 3 with Gaussian noise E_i ~ N(0, σ^2), we take λ = n/(4σ^2), which was shown in [13] to behave very well in regression problems. Also, in practice, taking K smaller than min(m, p) significantly improves the speed of the algorithm with little effect on the performance of the estimator [3].
The rate (m + p) rank(M^0) log(K)/n is minimax-optimal, or at least almost minimax-optimal: a lower bound for this problem, provided by Theorems 5 and 7 in [16], is (m + p) rank(M^0)/n. The optimality of the log term is, to our knowledge, an open question. Note however that the upper bound in [16] is (m + p) rank(M^0) log(m + p)/n. So, our bound represents a slight improvement in the case where min(m, p) ≪ max(m, p).
Remark 2. When the sampling distribution Π is uniform in Theorem 1, we obtain a corresponding oracle bound for the (unweighted) Frobenius norm, using the relation ||A||_{F,Π}^2 = (1/mp) ||A||_F^2. Finally, we want to mention that the rate of [16] is also reached, in a work parallel to ours, by Suzuki [25], in a Bayesian framework. The main difference is that, while [25] provides a rate of convergence in a more general low-rank tensor estimation problem, his work does not give an oracle inequality like Theorem 1, which can be used when M^0 is not exactly low-rank but can be well approximated by a low-rank matrix. Moreover, our result holds under any sampling distribution Π.

Simulation Study

The estimator based on conjugate priors is more popular in practice, as it leads to a fast algorithm. In order to compare both estimators, a series of experiments was conducted on simulated data. The data are simulated as in [7,3]. More precisely, a rank-2 matrix M^0 of size m × m (so m = p) is created as the product of two rank-2 matrices, M^0 = U^0_{m×2} (V^0_{m×2})^T, where the entries of U^0 and V^0 are i.i.d. N(0, 20/√m). Only 20% of the entries of M^0 are observed (using a uniform sampling). This sampled set is then corrupted by noise as in (1), where the E_i are i.i.d. N(0, 1). We update sequentially the coefficients of U and V using a Metropolis-Hastings kernel; the proposal is uniform on the parameter space. The details are given in Appendix 2. The behavior of our estimator M̂_λ is assessed through the root-mean-squared error (RMSE) per entry,

RMSE = [ (1/(mp)) Σ_{i,j} (M̂_{ij} − M^0_{ij})^2 ]^{1/2}.

The estimator with conjugate priors is computed using the Gibbs sampler from [3]. In Table 1, the results are averaged over 20 repetitions of the following test: we run our algorithm and the Gibbs sampling algorithm for 1000 iterations, with a burn-in period of 100, for m = 100, m = 200 and K = 5.

                    m = 100           m = 200
our estimator       0.59 (±0.027)     0.38 (±0.012)
Gibbs sampler       0.57 (±0.003)     0.36 (±0.001)

Table 1: RMSE of both estimators (mean ± standard deviation over the 20 repetitions).

Note that in [3], three priors are proposed on the singular values of M; here we used the inverse Gamma prior, i.e.
the inverses of the singular values are Ga(a = 1, b = 1/10) distributed. A first conclusion is that the two estimators have comparable performance; however, the one from [3] is more stable, which might be due to a faster convergence of the MCMC algorithm. Moreover, the Gibbs sampler performs slightly better, which points out the importance of the question of the optimality of the corresponding estimator based on unbounded priors.
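The simulation setup above can be reproduced along the following lines. Whether 20/√m is the variance or the standard deviation of the factor entries is not specified in the text, so the choice below is an assumption, as is the sampling of the observed entries without replacement.

```python
import numpy as np

rng = np.random.default_rng(3)
m = 100

# Rank-2 truth M0 = U0 V0^T; the text specifies i.i.d. N(0, 20/sqrt(m)) entries
# (taken here as the variance -- an assumption)
s = np.sqrt(20 / np.sqrt(m))
M0 = rng.normal(scale=s, size=(m, 2)) @ rng.normal(scale=s, size=(2, m))

# Observe 20% of the entries uniformly at random, corrupted by N(0, 1) noise
n = int(0.2 * m * m)
idx = rng.choice(m * m, size=n, replace=False)
rows, cols = np.unravel_index(idx, (m, m))
Y = M0[rows, cols] + rng.normal(size=n)

def rmse(M_hat, M0):
    """Root-mean-squared error per entry, as reported in Table 1."""
    return float(np.sqrt(np.mean((M_hat - M0) ** 2)))
```

Any estimator M̂ produced from `(rows, cols, Y)` can then be scored with `rmse(M_hat, M0)`.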
To illustrate the convergence of our Metropolis-Hastings scheme, the values of some entries of M̂_λ along the simulations are plotted in Figure 1. For comparison, the same entries are also plotted for the Gibbs sampler in Figure 2. Clearly, our Metropolis-Hastings update procedure is much slower than the Gibbs sampler. In particular, when dealing with large datasets, our procedure cannot be used at the moment. Figure 3 also illustrates this fact: it shows the convergence of the RMSE under both algorithms.
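For completeness, here is a minimal sketch of the kind of symmetric random-walk Metropolis sweep described above, targeting a density proportional to e^{−λ r(U V^T)} on a bounded parameter space. The proposal width, the bound on the entries and λ are illustrative choices, not the exact tuning of the experiments.

```python
import numpy as np

rng = np.random.default_rng(2)

def metropolis_sweep(U, V, rows, cols, Y, lam, width=0.2, bound=1.0):
    """One sweep of single-entry Metropolis updates on U then V, targeting
    a density proportional to exp(-lam * r(U V^T)) on |entries| <= bound."""
    def risk(U, V):
        M = U @ V.T
        return np.mean((Y - M[rows, cols]) ** 2)
    cur = risk(U, V)
    for W in (U, V):
        for idx in np.ndindex(W.shape):
            old = W[idx]
            W[idx] = old + rng.uniform(-width, width)  # symmetric proposal
            if abs(W[idx]) > bound or rng.random() >= np.exp(-lam * (risk(U, V) - cur)):
                W[idx] = old                           # reject (outside support or MH test)
            else:
                cur = risk(U, V)                       # accept
    return cur

# Tiny demo: a 6x6 rank-1 target observed on 20 entries
m = 6
M0 = rng.uniform(-0.5, 0.5, (m, 1)) @ rng.uniform(-0.5, 0.5, (1, m))
rows = rng.integers(0, m, 20); cols = rng.integers(0, m, 20)
Y = M0[rows, cols] + rng.normal(scale=0.05, size=20)

U = rng.uniform(-0.5, 0.5, (m, 2)); V = rng.uniform(-0.5, 0.5, (m, 2))
for _ in range(20):
    final_risk = metropolis_sweep(U, V, rows, cols, Y, lam=50.0)
```

Since the proposal is symmetric, the acceptance probability reduces to min(1, exp(−λ(r_new − r_old))), which is why the text calls this a Metropolis (rather than Metropolis-Hastings) algorithm; the single-entry updates also make clear why the scheme is slow for large m and p.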
From the above points, we see that implementing our estimator via a Metropolis-Hastings kernel leads to results similar to the state of the art. However, the algorithm is too slow for large datasets. Hence, a faster algorithm for our estimator is a very important open question.

Conclusion
This paper proposes a Bayesian estimator for the noisy matrix completion problem under a general sampling distribution. This estimator satisfies an optimal oracle inequality under any sampling scheme. Based on simulations, it is also clear that this estimator performs well in practice; however, a fast algorithm for large datasets is still an open issue.

Appendix 1: Proof of Theorem 1
First, we state a version of Bernstein's inequality that is useful in the proof of Theorem 1. This version is taken from [19] (Inequality 2.21 in the proof of Proposition 2.9, page 24).

Lemma 2. Let T_1, ..., T_n be independent real-valued random variables. Let us assume that there are two constants v and w such that Σ_{i=1}^n E(T_i^2) ≤ v and, for all integers k ≥ 3, Σ_{i=1}^n E[(T_i)_+^k] ≤ (k!/2) v w^{k−2}. Then, for any ζ ∈ (0, 1/w),

E exp[ ζ Σ_{i=1}^n (T_i − E(T_i)) ] ≤ exp( v ζ^2 / (2(1 − wζ)) ).
Now, we are ready to present the proof of Theorem 1.
Proof of Theorem 1: the proof is divided into two steps. In the first step, we establish a general PAC-Bayesian inequality for matrix completion, in the style of [11,13]. In the second step, we derive the oracle inequality from this PAC-Bayesian inequality.
Step 1: Let us define, for any matrix M ∈ M(2L) (the set of matrices with entries bounded by 2L in absolute value), the random variables

T_i = (Y_i − M_{X_i})^2 − (Y_i − (M^0)_{X_i})^2,   i = 1, ..., n.

Note that these variables are independent. We first check that the variables T_i satisfy the assumptions of Lemma 2, in order to apply this lemma: a direct computation bounds Σ_i E(T_i^2), and then, for any integer k ≥ 3, the moments Σ_i E[(T_i)_+^k], with w := 12L(2ξ + 3L).
Next, for any λ ∈ (0, n/w), applying Lemma 2 with ζ = λ/n gives an exponential moment inequality. For the sake of simplicity, let us introduce the constant α. In order to understand what follows, keep in mind that w is a constant and that our optimal estimator comes with λ = λ* = n/(2C), so that α is of order n. For any ε > 0, the last display yields a deviation inequality. Integrating w.r.t. the probability distribution π(·), then applying Fubini's theorem and Jensen's inequality, we obtain a bound involving K(ρ, π), the Kullback-Leibler divergence of ρ from π. Now, using the basic inequality exp(x) ≥ 1_{R_+}(x) and Jensen's inequality again, and combining the last two displays, we obtain, by Donsker and Varadhan's variational inequality (Lemma 1.1.3 in Catoni [12]), a bound holding uniformly over all ρ ∈ M^1_+(M), where M^1_+(M) is the set of all probability measures over the set of m × p matrices equipped with the Borel σ-algebra.
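For reference, the variational formula invoked here (Lemma 1.1.3 in Catoni [12]) can be stated as follows; this is the standard form, written with the notation of this proof.

```latex
% Donsker--Varadhan variational formula: for any probability measure \pi on a
% measurable space and any measurable function h bounded from above,
\log \int e^{h} \, d\pi
  \;=\; \sup_{\rho \in \mathcal{M}^1_+} \left\{ \int h \, d\rho \;-\; \mathcal{K}(\rho, \pi) \right\},
% where the supremum is over all probability measures \rho and
% \mathcal{K}(\rho, \pi) is the Kullback--Leibler divergence of \rho from \pi.
```

Applied with h = −λ(r(M) − r(M^0)) (up to the deviation terms), it turns the integrated exponential moment bound into a bound that holds uniformly over all data-dependent measures ρ.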
We now want to bound r(M) − r(M^0) from above by R(M) − R(M^0). We can use Lemma 2 again, applied to the variables −T_i, and similar computations yield, for any (data-dependent) ρ, a bound involving a constant β. Here again, in the same spirit as α in (3), β is also of order n. Combining (6) and (4) with a union bound argument gives the general PAC-Bayesian bound (7).

Step 2: In the second step, we derive an explicit form for the upper bound in (7). The idea is that, if we restrict the infimum in the upper bound in (7) to a small set of measures ρ, we can provide an explicit bound for this infimum. This trick was introduced in [11].
Let M ∈ M(L), meaning that M = U V^T with |U_{ij}| ≤ √(L/K) and |V_{ij}| ≤ √(L/K). Let us take, for any c such that κ ≤ c < (√2 − 1)√(L/K), the probability distribution ρ_{U,V,c}. Note that, as c < (√2 − 1)√(L/K), we have supp(ρ_{U,V,c}) ⊂ supp(π) and so K(ρ_{U,V,c}, π) < ∞. Thus, (7) becomes the explicit bound (8). Let us fix c, U and V. The end of the proof consists of calculations to derive an upper bound for the two terms in (8). First, note that we use the notation ⟨A, B⟩_{F,Π} = Σ_{i,j} A_{ij} B_{ij} Π_{ij}. As ∫ μ ρ_{U,V,c}(dμ) = U and ∫ ν ρ_{U,V,c}(dν) = V, it can be seen that the integrals of the three scalar products in the previous equation vanish. This gives an upper bound for the first term in (8). We now deal with the Kullback-Leibler term. Note that, up to a reordering of the columns of U and V, we can assume that the nonzero columns come first; by symmetry, the bounds for U and V are identical. Plugging (11) and (12) into (10), we finally obtain our upper bound for the Kullback-Leibler term (13). Finally, substituting (9) and (13) into (8), let us put c = √((m + p)L/(18nK)). Note that, as n ≥ max(m, p), we have (m + p)/(3n) < 1 and thus the condition c < (√2 − 1)√(L/K) is satisfied. So we have the following inequality with probability at least 1 − ε, where α and β have been replaced by their definitions, see (3) and (5). Taking now λ = λ* = n/(2C) with C = C_{σ,L} ∨ w in the last display gives the result (note that C ≥ w ensures λ* < n/w, as required by Lemma 2).

Appendix 2: Details of the Algorithm

We run K chains M_t^{(1)}, ..., M_t^{(K)}. For each chain, U_i and V_j are updated successively using the uniform proposal U(−√(2L/K), √(2L/K)) (as the proposal is symmetric, this is actually a Metropolis algorithm). At