Nonparametric estimation of low rank matrix valued function

Let A : [0, 1] → H_m (the space of Hermitian matrices) be a matrix valued function which is low rank with entries in the Hölder class Σ(β, L). The goal of this paper is to study statistical estimation of A based on the regression model E(Y_j | τ_j, X_j) = ⟨A(τ_j), X_j⟩, where τ_j are i.i.d. uniformly distributed in [0, 1], X_j are i.i.d. matrix completion sampling matrices, and Y_j are independent bounded responses. We propose an innovative nuclear norm penalized local polynomial estimator and establish an upper bound on its pointwise risk measured by the Frobenius norm. Then we extend this estimator globally and prove an upper bound on its integrated risk measured by the L_2-norm. We also propose another new estimator based on bias-reducing kernels to study the case when A is not necessarily low rank and establish an upper bound on its risk measured by the L_∞-norm. We show that the obtained rates are all optimal up to some logarithmic factor in the minimax sense. Finally, we propose an adaptive estimation procedure based on Lepskii's method and model selection with a data splitting technique, which is computationally efficient and can be easily implemented and parallelized on distributed systems.


Introduction
Let A : [0, 1] → H_m (the space of Hermitian matrices) be a matrix valued function. The goal of this paper is to study the problem of statistical estimation of a matrix valued function A based on the regression model

E(Y_j | τ_j, X_j) = ⟨A(τ_j), X_j⟩, j = 1, ..., n, (1.1)

where τ_j are i.i.d. random variables uniformly distributed on [0, 1], X_j are i.i.d. matrix completion sampling matrices, and Y_j are independent bounded random variables. In section 3, we propose a nuclear norm penalized local polynomial estimator S_h, where the S_h^i are the blocks on the diagonal of S_h and ℓ = ⌊β⌋. We prove that under mild conditions, the pointwise risk m^{-2}‖S_h(t_0) − A(t_0)‖_2^2 of S_h(t_0) over the Hölder class Σ(β, L) satisfies the upper bound (1.4), where r is the low rank parameter and ‖·‖_2 denotes the Frobenius norm of a matrix.
In section 4, we propose a new global estimator Â based on local polynomial smoothing and prove that the integrated risk of Â measured by the L_2-norm satisfies the upper bound (1.5). Then we study another naive kernel estimator Ã which can be used to estimate matrix valued functions that are not necessarily low rank. This estimator is associated with another popular approach to low rank recovery, namely singular value thresholding, see [6, 20, 10]. We prove that the sup-norm risk of Ã satisfies the following upper bound:

sup_{t∈[h, 1−h]} m^{−2} ‖Ã(t) − A(t)‖² = O_p( (m log n / n)^{2β/(2β+1)} ), (1.6)

where ‖·‖ denotes the matrix spectral norm. Note that these rates coincide with those of the classical matrix recovery setting when the smoothness parameter β goes to infinity. An immediate question is whether the above rates are optimal. In section 5, we prove that the rates in (1.4), (1.5) and (1.6) are all optimal up to a logarithmic factor in the minimax sense, which essentially verifies the effectiveness of our methodology.
As one may have noticed, there is an adaptation issue involved in (1.3). Namely, one needs to choose a proper bandwidth h and a proper order ℓ of the polynomials. Both parameters are closely related to the smoothness parameter β of A, which is unknown to us in advance. In section 6, we propose a model selection procedure based on Lepskii's method [25] and the work of [3] and [37]. We prove that this procedure adaptively selects an estimator Â* such that the integrated risk of Â* measured by the L_2-norm satisfies an upper bound which is still near optimal. More importantly, such a procedure is computationally efficient, feasible when m is comparable to n, and can be easily parallelized.
The major contribution of this paper is on the theory front. We generalize recent developments in low rank matrix completion theory to the nonparametric estimation setting by proposing a new, provably optimal estimation procedure. To the best of our knowledge, such problems have not previously been studied thoroughly from a theoretical point of view.

Preliminaries
In this section, we introduce some important definitions, basic facts, and notations for the convenience of presentation.

Notations
For any Hermitian matrices A, B ∈ H_m, we denote by ⟨A, B⟩ := tr(AB) the Hilbert-Schmidt inner product. Denote ⟨A, B⟩_{L_2(Π)} := E⟨A, X⟩⟨B, X⟩, where Π denotes the distribution of X; the corresponding norm is given by ‖A‖²_{L_2(Π)} = E⟨A, X⟩². We use ‖·‖_2 to denote the Hilbert-Schmidt norm (Frobenius norm or Schatten 2-norm) induced by the inner product ⟨·, ·⟩; ‖·‖ to denote the operator norm (spectral norm) of a matrix, i.e. its largest singular value; ‖·‖_1 to denote the trace norm (Schatten 1-norm or nuclear norm), i.e. the sum of the singular values; and |A| to denote the nonnegative matrix with entries |A_ij| corresponding to A.
Given X_1, ..., X_n, i.i.d. copies of the random measurement matrix X, we write U_X for the L_∞-norm of the random variable ‖X‖.
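For concreteness, the norms and the Hilbert-Schmidt inner product introduced above can be checked with a few lines of numpy; the snippet below is purely illustrative and uses an arbitrary symmetric matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = (B + B.T) / 2                         # a Hermitian (here real symmetric) matrix
fro = np.linalg.norm(A, "fro")            # ||A||_2, Schatten 2-norm
spec = np.linalg.norm(A, 2)               # ||A||, largest singular value
nuc = np.linalg.norm(A, "nuc")            # ||A||_1, sum of singular values
inner = np.trace(A @ A)                   # <A, A> = tr(A A) = ||A||_2**2
assert np.isclose(inner, fro ** 2)
```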

Matrix completion and statistical learning setting
The matrix completion setting refers to the case in which the random sampling matrices X_j are i.i.d. uniformly distributed on the following orthonormal basis 𝒳 of H_m:

𝒳 = {E_kk : 1 ≤ k ≤ m} ∪ {E_jk : 1 ≤ k < j ≤ m} ∪ {E_kj : 1 ≤ k < j ≤ m},

where E_kk := e_k ⊗ e_k, k = 1, ..., m; E_jk := (1/√2)(e_k ⊗ e_j + e_j ⊗ e_k), 1 ≤ k < j ≤ m; and E_kj := (i/√2)(e_k ⊗ e_j − e_j ⊗ e_k), 1 ≤ k < j ≤ m, with {e_j}_{j=1}^m being the canonical basis of R^m. Several identities are easy to check under the matrix completion setting; in particular, for any A ∈ H_m, ‖A‖²_{L_2(Π)} = E⟨A, X⟩² = m^{−2}‖A‖²_2. The statistical learning setting refers to the bounded response case: there exists a constant a such that

max_{j=1,...,n} |Y_j| ≤ a, a.s. (2.2)

In this paper, we will consider model (1.1) under both the matrix completion and the statistical learning setting.
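As an illustration of this basis and of the sampling identity above, the following Python sketch constructs 𝒳 and checks the m^{-2} normalization numerically; the helper name completion_basis is ours, not the paper's.

```python
import numpy as np

def completion_basis(m):
    """Orthonormal basis of Hermitian matrices used in the matrix completion
    setting: diagonal E_kk, symmetric E_jk, and i-weighted antisymmetric E_kj."""
    E, eye = [], np.eye(m)
    for k in range(m):
        E.append(np.outer(eye[k], eye[k]))
    for j in range(m):
        for k in range(j):
            E.append((np.outer(eye[k], eye[j]) + np.outer(eye[j], eye[k])) / np.sqrt(2))
            E.append(1j * (np.outer(eye[k], eye[j]) - np.outer(eye[j], eye[k])) / np.sqrt(2))
    return E   # m**2 matrices, orthonormal w.r.t. <A, B> = tr(A B)

# uniform sampling over the basis gives E<A, X>^2 = ||A||_2^2 / m**2
m = 4
basis = completion_basis(m)
rng = np.random.default_rng(1)
G = rng.standard_normal((m, m)) + 1j * rng.standard_normal((m, m))
A = (G + G.conj().T) / 2
vals = [np.trace(A @ X).real for X in basis]
assert np.isclose(np.mean(np.square(vals)), np.linalg.norm(A, "fro") ** 2 / m ** 2)
```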

Matrix valued function
Let A : [0, 1] → H_m be a matrix valued function. One should notice that we consider the image space to be the Hermitian matrix space for the convenience of presentation. Our methods and results can be readily extended to general rectangular matrix spaces. Now we define the rank of a matrix valued function. Let rank_A(t) := rank(A(t)), ∀ t ∈ [0, 1].

Definition 1. Let β and L be two positive real numbers and set ℓ = ⌊β⌋. The Hölder class Σ(β, L) on [0, 1] is the set of ℓ times differentiable functions f on [0, 1] whose ℓ-th derivative satisfies |f^{(ℓ)}(x) − f^{(ℓ)}(x′)| ≤ L|x − x′|^{β−ℓ} for all x, x′ ∈ [0, 1].
The parameters β and L characterize the smoothness of the Hölder class Σ(β, L). They are the most important parameters in our problem, just like the matrix dimension m and the sample size n. Throughout this paper, we only consider the case when ℓ = ⌊β⌋ is a fixed constant, or in other words ℓ ≪ m. The reason is that in the asymptotic theory of low rank matrix recovery, the size of m is often considered to be comparable to the sample size n, say m = O(n). If ℓ were also comparable to m, then our theory in this paper could be problematic.
In particular, we are interested in matrix valued functions satisfying the following assumptions:

A1. Given a measurement matrix X, |⟨A(t), X⟩| ≤ a_1 for all t ∈ [0, 1] and some constant a_1;
A2. Given a measurement matrix X, the derivative matrices A′, ..., A^{(ℓ)} satisfy |⟨A^{(k)}(t), X⟩| ≤ a_2 for all t ∈ [0, 1], k = 1, ..., ℓ, and some constant a_2;
A3. The ranks of A, A′, ..., A^{(ℓ)} are uniformly bounded by a constant r;
A4. For all i, j, the entry A_ij is in the Hölder class Σ(β, L).

A local polynomial Lasso estimator
In this section, we study the pointwise estimation of a low rank matrix valued function A in Σ(β, L) with ℓ = ⌊β⌋. The construction of our estimator is inspired by local polynomial smoothing and nuclear norm penalization. The intuition for the localization technique originates from classical local polynomial estimators, see [13]. The intuition behind nuclear norm penalization is that, whereas the rank function counts the number of non-vanishing singular values, the nuclear norm sums their amplitudes. The theoretical foundations for using the nuclear norm heuristic for rank minimization were established by [30]. Instead of using the trivial basis {1, t, t², ..., t^ℓ} to generate an estimator, we use orthogonal polynomials, for technical reasons that we will specify in the proof of Theorem 3.1. Let {p_i(t)}_{i=0}^∞ be a sequence of polynomials orthogonal with respect to a nonnegative weight function K compactly supported on [−1, 1], so that ∫_{−1}^{1} p_i(t) p_j(t) K(t) dt = δ_ij, with δ_ij = 1{i = j} and 1{·} being the indicator function. It is easy to see that there exists an invertible linear transformation T ∈ R^{(ℓ+1)×(ℓ+1)} such that (p_0(t), p_1(t), ..., p_ℓ(t))^⊤ = T (1, t, ..., t^ℓ)^⊤. Apparently, T is lower triangular. With R(T) a constant determined by T, let D be the set of block diagonal matrices S = diag(S_0, ..., S_ℓ) with blocks S_k ∈ H_m satisfying |S_ij| ≤ R(T)a. With observations (τ_j, X_j, Y_j), j = 1, ..., n, from model (1.1), we define S_h as the solution of the nuclear norm penalized local polynomial least squares problem (3.1).

Remark 1. Note that one can rewrite (3.1) in an equivalent form; S_h naturally induces a local polynomial estimator of order ℓ around t_0, given in (3.3), and the point estimate of A at t_0 is given by (3.4).

Remark 2. Note that (3.1) only guarantees that each S_h^i is approximately low rank and may not exactly recover the rank of A^{(i)}(t_0). However, under our assumption A3, as long as ℓ is small compared with the matrix size m, S_h(t_0) is still approximately low rank.
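To make the construction concrete, the following sketch sets up a nuclear norm penalized local polynomial fit with cvxpy. It is a simplified stand-in for (3.1) under illustrative assumptions: Legendre polynomials (orthogonal for a uniform weight on [−1, 1]) play the role of {p_i}, the function name is hypothetical, and the entrywise box constraint defining D is omitted.

```python
import numpy as np
import cvxpy as cp

def local_poly_nuclear_lasso(tau, X, Y, t0, h, ell, eps):
    """Simplified sketch of a nuclear-norm-penalized local polynomial fit at t0
    (an illustrative stand-in for (3.1)); X has shape (n, m, m), Y shape (n,)."""
    m = X.shape[1]
    u = (tau - t0) / h
    keep = np.abs(u) <= 1                        # kernel support [-1, 1] localizes the fit
    u, Xl, Yl = u[keep], X[keep], Y[keep]
    # Legendre polynomials: orthogonal on [-1, 1] with respect to a uniform weight
    P = np.stack([np.polynomial.legendre.Legendre.basis(i)(u)
                  for i in range(ell + 1)], axis=1)
    S = [cp.Variable((m, m), symmetric=True) for _ in range(ell + 1)]
    fitted = cp.hstack([
        sum(P[j, i] * cp.trace(Xl[j] @ S[i]) for i in range(ell + 1))
        for j in range(len(Yl))
    ])
    obj = cp.sum_squares(Yl - fitted) / max(len(Yl), 1) \
        + eps * sum(cp.normNuc(S[i]) for i in range(ell + 1))
    cp.Problem(cp.Minimize(obj)).solve()
    return S[0].value                            # block S_0 is the point estimate of A(t0)
```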
In the following theorem, we establish an upper bound on the pointwise risk of S_h(t_0) when A(t) is in the Hölder class Σ(β, L) with ℓ = ⌊β⌋. The proof of Theorem 3.1 can be found in section 8.1.

Theorem 3.1. Assume that X is uniformly distributed on 𝒳, τ is uniformly distributed on [0, 1], X and τ are independent, and |Y| ≤ a a.s. for some constant a > 0. Let A be a matrix valued function satisfying A1, A2, A3, and A4.
Choose the regularization parameter ε and the bandwidth h_n appropriately, depending on numerical constants C_1 and D. Then, for any h_n ≤ t_0 ≤ 1 − h_n, the bound (3.5) holds with probability at least 1 − n^{−mr}, where C_1(a, Φ, ℓ, L) is a constant depending on a, Φ, ℓ, and L.
Remark 3. One should notice that when β → ∞, bound (3.5) coincides with a similar result in classical matrix completion, where the rate is O_p(mr log m / n), see [20]. As long as n is of polynomial order in m, which is typical in practice, log n and log m differ only by a constant factor. In section 5, we prove that bound (3.5) is minimax optimal up to a logarithmic factor. The logarithmic factor in bound (3.5) and in the bound for classical matrix completion is introduced by the matrix Bernstein inequality, see [34]. In the case of nonparametric estimation of a real valued function, such a factor is unnecessary, see [35]. It remains an open problem whether this logarithmic factor is necessary for our problem as well as for the classical low rank matrix estimation problem.

Global estimators and upper bounds on integrated risk
In this section, we propose two global estimators and study their integrated risk measured by L 2 -norm and L ∞ -norm.

From localization to globalization
Firstly, we construct a global estimator based on (3.3). Take a grid of points {t_k} on [0, 1]; without loss of generality, assume that the number of grid points M is even. Denote by S_h^k(t) the local polynomial estimator around t_{2k−1} as in (3.3), constructed with the orthogonal polynomials, for k = 1, ..., M/2. Piecing these local estimators together with indicator functions 1{·} yields the global estimator Â defined in (4.1). In Theorem 4.1 we prove that the integrated risk of Â measured by the L_2-norm satisfies the bound (4.2) with high probability, where C_2(a, Φ, ℓ, L) is a constant depending on a, Φ, ℓ, and L.

Remark 4.
When the dimension m degenerates to 1, bound (4.2) matches the nonparametric minimax rate O(n^{−2β/(2β+1)}) for real valued functions over the Hölder class (see [35]) up to some logarithmic factor, which again is introduced by the matrix Bernstein inequality, see [34]. In section 5, we show that bound (4.2) is minimax optimal up to a logarithmic factor.
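A purely schematic version of the localization-to-globalization step described above is the following sketch; the grid handling and the tie-breaking at interval boundaries are illustrative choices of ours, not the exact definition in (4.1).

```python
import numpy as np

def globalize(local_fits, centers):
    """Stitch local polynomial fits (one per grid center) into a global,
    piecewise-defined estimator; a schematic stand-in for (4.1)."""
    centers = np.asarray(centers)
    edges = np.concatenate(([0.0], (centers[:-1] + centers[1:]) / 2, [1.0]))
    def A_hat(t):
        k = int(np.clip(np.searchsorted(edges, t, side="right") - 1, 0, len(centers) - 1))
        return local_fits[k](t)          # evaluate the k-th local fit at t
    return A_hat
```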

Bias reduction through higher order kernels
If A(t) is not necessarily low rank, we propose an estimator which is easy to implement and prove an upper bound on its risk measured by the L_∞-norm. Such estimators are related to another popular approach, parallel to local polynomial estimation, for bias reduction, namely the use of higher order kernels. They can also be combined with another important technique for low rank estimation or approximation, singular value thresholding, see [6] and [10]. The estimator obtained through nuclear norm penalization is shown by [20] to be equivalent to soft singular value thresholding of estimators of this type. The kernels we are interested in are Lipschitz kernels of order ℓ. Note that when K ≥ 0, the estimator Ã defined in (4.3) is the solution to a convex optimization problem. In the following theorem, we prove an upper bound on its global performance measured by the L_∞-norm over Σ(β, L). Such bounds are much harder to obtain even for classical matrix lasso problems.
The resulting bound (4.6) holds with high probability, where C*(K) and c*(K) are constants depending on K.

Remark 5.
When m degenerates to 1, bound (4.6) coincides with that of the real valued case over the Hölder class, which is O((log n / n)^{2β/(2β+1)}), see [35]. Note that the logarithmic factor under such a metric for the real valued Hölder class is necessary. In section 5, we show that bound (4.6) is minimax optimal up to a logarithmic factor when m ≳ log n.
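One natural form of the kernel estimator discussed in this subsection is sketched below; the m² normalization (coming from uniform sampling over the m² basis matrices) and the final symmetrization are assumptions made for illustration, not the displayed definition (4.3).

```python
import numpy as np

def kernel_estimate(tau, X, Y, t, h, kernel):
    """Sketch of a kernel-weighted estimator in the spirit of (4.3);
    X has shape (n, m, m), Y shape (n,), kernel is a (higher order) kernel."""
    n, m, _ = X.shape
    w = kernel((tau - t) / h)                              # kernel weights
    A_tilde = (m ** 2 / (n * h)) * np.einsum("j,j,jkl->kl", w, Y, X)
    return (A_tilde + A_tilde.conj().T) / 2                # keep the estimate Hermitian

def epanechnikov(u):
    # an order-2 example; a genuine higher order kernel is needed when ell >= 2
    return 0.75 * np.clip(1.0 - u ** 2, 0.0, None)
```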

Lower bounds under matrix completion setting
In this section, we prove the minimax lower bounds corresponding to the estimators (3.4), (4.1) and (4.3). In the realm of classical low rank matrix estimation, [29] studied the optimality issue measured by the Frobenius norm on classes defined in terms of a "spikiness index" of the true matrix; [31] derived optimal rates in noisy matrix completion on different classes of matrices for the empirical prediction error; [20] established the minimax rates of noisy matrix completion problems up to a logarithmic factor measured by the Frobenius norm. Based on the ideas of [20], standard methods for proving minimax lower bounds in real valued nonparametric estimation in [35], and some fundamental results in coding theory, we establish the minimax lower bounds corresponding to (3.5), (4.2) and (4.6), which essentially show that the upper bounds we obtained are all optimal up to some logarithmic factor. For the convenience of presentation, we denote by inf_Â the infimum over all estimators of A. We denote by A(r, a) the set of matrix valued functions satisfying A1, A2, A3, and A4. We denote by P(r, a) the class of distributions of the random triplet (τ, X, Y) that satisfies model (1.1) with any A ∈ A(r, a).
In the following theorem, we show the minimax lower bound on the pointwise risk. The proof of Theorem 5.1 can be found in section 8.4.

Theorem 5.1. Assume that X is uniformly distributed on 𝒳, τ is uniformly distributed on [0, 1], X and τ are independent, and |Y| ≤ a a.s. for some constant a > 0; let A be any matrix valued function in A(r, a). Then there exists an absolute constant η ∈ (0, 1) such that the lower bound (5.1) holds, where C := C(β, L, a) is a constant depending on β, L and a.

Remark 6.
Note that compared with the upper bound (3.5), the lower bound (5.1) matches it up to a logarithmic factor. As a consequence, the estimator (3.4) achieves a near optimal minimax rate of pointwise estimation. Although the result of Theorem 5.1 is stated under the bounded response condition, it can be readily extended to the case when the noise in (1.2) is Gaussian.
In the following theorem, we show the minimax lower bound on the integrated risk measured by the L_2-norm. The proof of Theorem 5.2 can be found in section 8.5.

Theorem 5.2. Assume that X is uniformly distributed on 𝒳, τ is uniformly distributed on [0, 1], X and τ are independent, and |Y| ≤ a a.s. for some constant a > 0; let A be any matrix valued function in A(r, a). Then there exists an absolute constant η ∈ (0, 1) such that the bound (5.2) holds, where C̃ := C̃(β, L, a) is a constant depending on L, β and a.

Remark 7. The lower bound in (5.2) matches the upper bound obtained in (4.2) up to a logarithmic factor. Therefore, the estimator (4.1) achieves a near optimal minimax rate for the integrated risk measured by the L_2-norm. The result of Theorem 5.2 can be readily extended to the case when the noise in (1.2) is Gaussian.

Now we consider the minimax lower bound on the integrated risk measured by the L_∞-norm for general matrix valued functions without any rank information. Denote by A(a) the analogous class of matrix valued functions without the rank constraint. We denote by P(a) the class of distributions of the random triplet (τ, X, Y) that satisfies model (1.1) with any A ∈ A(a).
In the following theorem, we show the minimax lower bound over P(a) and A(a) measured by the L_∞-norm. The proof of Theorem 5.3 can be found in section 8.6.

Theorem 5.3. Assume that X is uniformly distributed on 𝒳, τ is uniformly distributed on [0, 1], X and τ are independent, and |Y| ≤ a a.s. for some constant a > 0; let A be any matrix valued function in A(a). Then there exists an absolute constant η ∈ (0, 1) such that the bound (5.3) holds, where the constant in the bound depends on β, L and a.

Remark 8.
Recall that in the real valued case, the minimax lower bound measured by the L_∞-norm over the Hölder class is O((log n / n)^{2β/(2β+1)}), see [35]. According to bound (5.3), if the dimension m degenerates to 1, we recover the same result as in the real valued case, and it is optimal. When the dimension m is large enough so that m ≳ log n, the lower bound (5.3) shows that the estimator (4.3) achieves a near optimal minimax rate up to a logarithmic factor.

Model selection
Despite the fact that the estimators (3.4) and (4.1) achieve near optimal minimax rates in theory with a properly chosen bandwidth h and polynomial order ℓ, these parameters depend on quantities like β and L which are unknown to us in advance. In this section, we propose an adaptive estimation procedure to choose h and ℓ adaptively.
Two popular methods to address such problems have been proposed over the past few decades: one is Lepskii's method, and the other is the aggregation method. In the 1990s, many data-driven procedures for selecting the smoothing parameter h emerged. Among them, a series of papers stood out and shaped what is now called Lepskii's method. This method has been described in its general form and in great detail in [25]. Later, [24] proposed a bandwidth selection procedure based on pointwise adaptation of a kernel estimator that achieves the optimal minimax rate of pointwise estimation over the Hölder class, and [23] proposed a new bandwidth selector that achieves optimal rates of convergence over Besov classes with spatially inhomogeneous smoothness. The basic idea of Lepskii's method is to choose a bandwidth from a geometric grid such that the resulting estimator is not very different from those indexed by smaller bandwidths on the grid. Although Lepskii's method is shown to give optimal rates in pointwise estimation over the Hölder class in [24], it has a major drawback when applied to our problem: the procedure already requires a huge computational cost once real valued functions are replaced by matrix valued functions. Indeed, with Lepskii's method, in order to obtain a good bandwidth, one needs to compare all candidates indexed by smaller bandwidths with the target one, which leads to a dramatically growing computational cost. Moreover, we have an extra parameter ℓ that needs to be fitted jointly with h. As a result, we turn to the aggregation method to choose a bandwidth from the geometric grid introduced by Lepskii's method, which is more computationally efficient for our problem. The idea of the aggregation method can be briefly summarized as follows: one splits the data set into two parts; the first is used to build all candidate estimators, and the second is used either to aggregate the estimates into a new one (aggregation) or to select one (model selection) which is at least as good as the best among all candidates.
The model selection procedure we use was initially introduced by [3] in classical nonparametric estimation with bounded response. [37] generalized this method to the case where the noise can be unbounded but with a finite p-th moment for some p > 2. One can find a more detailed review on such penalization methods in [16].
Firstly, we introduce the geometric grid created by [24] on which to conduct our model selection procedure. Assume that the bandwidth falls into the range [h_min, h_max]. Recall that the theoretically optimal bandwidth h_n depends on the unknown parameters β and L. Assume that [β_*, β^*] and [L_*, L^*] are the ranges of β and L to be considered, respectively. Then h_max and h_min can be chosen accordingly, with ℓ_* = ⌊β_*⌋ and ℓ^* = ⌊β^*⌋. When these ranges are not given, a natural upper bound for h_max is 1, and a typical choice of h_min is n^{−1/2}. The bandwidths {h_k} on the grid H defined in (6.2) form a decreasing sequence which becomes denser as k grows. Now we consider possible choices of ℓ. A trivial candidate set is the set of all integers between ℓ_* and ℓ^*. If the size of this set is large, one can shrink it through the correspondence (6.1): the more data, the narrower the range. We denote the candidate set for ℓ by L. The set H̃ := H × L then indexes a countable set of candidate estimators.
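A minimal sketch of such a grid is given below; the geometric ratio a and the default h_min = n^{−1/2} are illustrative choices. Note that the resulting grid has O(log n) points, which matches Card(H) = O(log n) used later.

```python
import numpy as np

def geometric_grid(h_max=1.0, h_min=None, n=1000, a=1.2):
    """Geometric bandwidth grid in the spirit of (6.2): h_k = h_max * a**(-k),
    decreasing and becoming denser (on a log scale) as k grows."""
    if h_min is None:
        h_min = n ** -0.5
    k_max = int(np.ceil(np.log(h_max / h_min) / np.log(a)))
    return h_max * a ** -np.arange(k_max + 1)
```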

Remark 9.
In general, selecting h is considered to be more challenging and important than selecting ℓ and ε. On the one hand, one needs to select h from an interval, which is an uncountable set, compared with selecting ℓ from only a finite set of integers. On the other hand, the performance of the estimator is much more sensitive to different choices of h; namely, a very small change of h can lead to a huge performance degradation. We shall see this in our simulation study in section 7.2. Once h and ℓ are chosen, one obtains ε_i by plugging in the value of (h_i, ℓ_i) to get the corresponding ε_i = (ℓ_i + 1) R(T) Φ √(log(2m) / (n m h_i)).
Now we introduce our model selection procedure based on H̃. We split the data (τ_j, X_j, Y_j), j = 1, ..., 2n, into two parts of equal size. The first part of the observations, {(τ_j, X_j, Y_j) : j ∈ 𝒯_n}, contains n data points drawn randomly without replacement from the original data set. We construct a sequence of estimators Â_k, k = 1, 2, ..., based on the training data set 𝒯_n through (4.1), one for each pair in H̃. Our main goal is to select an estimator Â* among {Â_k} which is as good as the one that has the smallest mean squared error. We introduce a quantity π_k associated with each estimator Â_k which serves as a penalty term. We use the remaining part of the data set, {(τ_j, X_j, Y_j) : j ∈ 𝒯_n^†}, to perform the selection procedure:

Algorithm 1: Model selection procedure.
1. Construct the grid H̃ of candidate pairs (h, ℓ);
2. Equally split the data set (τ_j, X_j, Y_j), j = 1, ..., N, into two parts 𝒯_n and 𝒯_n^† by randomly drawing without replacement;
3. For each pair in H̃, construct an estimator Â_k defined in (4.1) using the data in 𝒯_n;
4. Perform the selection procedure in (6.3) using the data in 𝒯_n^†.
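The selection step can be sketched as follows. Since the exact criterion (6.3) is not reproduced above, the snippet uses hold-out empirical squared loss plus a penalty proportional to π_k as a stand-in; the helper name select_model and the penalty scaling are hypothetical.

```python
import numpy as np

def select_model(candidates, penalties, tau_hold, X_hold, Y_hold, eps=1.0):
    """Data-splitting selection sketch: evaluate each candidate estimator on the
    hold-out half and pick the one minimizing loss + eps * pi_k / n."""
    n = len(Y_hold)
    scores = []
    for A_hat, pi_k in zip(candidates, penalties):
        preds = np.array([np.trace(A_hat(t) @ Xj).real
                          for t, Xj in zip(tau_hold, X_hold)])
        scores.append(np.mean((Y_hold - preds) ** 2) + eps * pi_k / n)
    k_star = int(np.argmin(scores))
    return k_star, np.array(scores)
```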
The selection procedure described in Algorithm 1 has several advantages: firstly, it chooses a global bandwidth instead of a local one; secondly, since the selection step (6.3) is based only on computing entries of Â_k, no matrix computation is involved in the last step, which saves considerable computational cost when m is large; finally, steps 3 and 4 can be easily parallelized on distributed platforms.
The following theorem shows that the integrated risk of Â* measured by the L_2-norm can be bounded by the smallest one among all candidates plus an extra term of order O(n^{−1}), which is negligible. The proof of Theorem 6.1 can be found in section 8.7.

Theorem 6.1. Assume that X is uniformly distributed on 𝒳, τ is uniformly distributed on [0, 1], X and τ are independent, and |Y| ≤ a a.s. for some constant a > 0; let A be a matrix valued function satisfying A1, A2, A3, and A4; let {Â_k} be a sequence of estimators constructed from H̃; and let Â* be the adaptive estimator selected through Algorithm 1. Then, with probability at least 1 − n^{−(mr−1)}, the corresponding oracle inequality holds.

Recall that Card(H) = O(log n), so one can take π_k = kmr. Then π_k ≤ c_1 mr log n uniformly over all k for some numerical constant c_1. Since, by Lepskii's method, at least one candidate in H gives the optimal bandwidth associated with the unknown smoothness parameter β, the following corollary is a direct consequence of Theorems 4.1 and 6.1 and shows that Â* is adaptive.

Corollary 6.1. Assume that the conditions of Theorem 6.1 hold with π_k = kmr and n > mr log n. Then, with probability at least 1 − n^{−(mr−1)}, the adaptive risk bound holds, where C(a, ℓ, L) is a constant depending on a, ℓ, and L.

Numerical simulation
In this section, we present numerical simulation results of the estimators (3.1) and (4.1) to validate the theoretical bounds in (3.5), (4.2), (5.1), and (5.2). Then we present the simulation results of the model selection procedure shown in Algorithm 1. Recall that the key optimization problem we need to solve is (3.1). We develop a solver based on the well known alternating direction method of multipliers (ADMM) algorithm [5] and its applications to matrix recovery problems, see [27,11]. The algorithm can be summarized as in Algorithm 2.

Algorithm 2: ADMM Algorithm.
Set up the values of maxIteration and the tolerance ε_tol > 0; initialize S^{(0)}, S̃^{(0)} ∈ D; and run the ADMM updates until the tolerance is met or the maximum number of iterations is reached.

The underlying matrix valued function we create is in the Hölder class Σ(β, L) with β = 3/2, L = 24, and rank constraint r ≤ 3. The orthogonal polynomials we choose are the Chebyshev polynomials of the second kind.
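For reference, a simplified ADMM routine for the core nuclear norm penalized least squares subproblem is sketched below. It uses a vectorized design and a scaled dual variable; it is a minimal sketch of the standard splitting, not the exact update rules of Algorithm 2.

```python
import numpy as np

def admm_matrix_lasso(X, y, eps, rho=1.0, n_iter=200, tol=1e-6):
    """ADMM sketch for min_S (1/2n)||y - <S, X_j>||^2 + eps*||S||_nuc,
    the building block solved by Algorithm 2; X: (n, m, m), y: (n,)."""
    n, m, _ = X.shape
    D = X.reshape(n, m * m)                        # vectorized design
    G = D.T @ D / n + rho * np.eye(m * m)          # Gram matrix for the S-update
    b = D.T @ y / n
    S = Z = U = np.zeros((m, m))
    for _ in range(n_iter):
        # S-update: ridge-type linear system
        S = np.linalg.solve(G, b + rho * (Z - U).ravel()).reshape(m, m)
        # Z-update: singular value soft-thresholding (prox of the nuclear norm)
        Uu, s, Vt = np.linalg.svd(S + U, full_matrices=False)
        Z_new = Uu @ np.diag(np.maximum(s - eps / rho, 0.0)) @ Vt
        # scaled dual update and a simple stopping rule
        U = U + S - Z_new
        if np.linalg.norm(Z_new - Z) <= tol * max(np.linalg.norm(Z), 1.0):
            Z = Z_new
            break
        Z = Z_new
    return Z
```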

Simulation results of theoretical bounds
We present numerical simulation results to validate the theoretical bounds proved in sections 3, 4 and 5. Plugging in the optimal bandwidth from Theorem 3.1, we run Algorithm 2 to solve for the pointwise estimator at t_0 = 0.5 with m = 150. Fig. 1a-Fig. 1g show different levels of recovery of the underlying true data matrix shown in Fig. 1h. As we can see, the recovery quality improves evidently as the sample size n grows.
In Fig. 2a, we display the comparison of the pointwise risk between the theoretical bounds proved in (3.5), (5.1) and our simulation results. In Fig. 2b, we display the comparison of the integrated risk measured by the L_2-norm between the theoretical bounds proved in (4.2), (5.2) and our simulation results. Since β = 3/2 and ℓ = 1, we use piecewise linear polynomials to approximate the underlying matrix valued function. Fig. 2a and 2b show that the simulation results match the minimax lower bounds (5.1) and (5.2) well. One should notice that our simulated error rate is sometimes smaller than the theoretical minimax lower bound. We attribute this discrepancy to the fact that the constant factors depending on a and L in the minimax lower bound are not computed very accurately.

Simulation results of model selection
Recall that in section 6, we developed Algorithm 1 to adaptively choose the parameters h and ℓ. Since the choice of ℓ amounts to simply picking one element from a finite set of integers and is quite straightforward, while choosing a good bandwidth h is more critical and complicated, we focus on the choice of the smoothing parameter h in our simulation study. We set ℓ = 1, which is the true parameter, and focus on the selection of h.
We implement Algorithm 1 in this section and perform a simulation with m = 90 and n = 3200000. The theoretically optimal bandwidth h* is around 0.09. We choose h_max = 1.0 and h_min = 0.01 to construct the geometric grid H as in (6.2). We display the simulation results in Table 1; the values of the selection criterion (6.3) are displayed in the third column. One should expect a better integrated risk for a smaller value in the third column. The data are plotted in Fig. 3. As we can see, our model selection procedure selects h = 0.0853 with the smallest criterion value of 0.3490, which is very close to the theoretically optimal value of h. The corresponding integrated risk is also the smallest among all candidates on the grid and stays very close to the global minimum.

Proof of Theorem 3.1
Proof. Firstly, we introduce a sharp oracle inequality for the locally integrated risk of the estimator (3.3) in the following lemma. The proof of Lemma 1 can be found in the appendix; it follows the same derivation as the proof of Theorem 19.1 in [19].

To be more specific, one just needs to rewrite (3.1) in the form (8.1); then, for arbitrary η > 0, the estimator (3.4) satisfies (8.2) with probability at least 1 − e^{−η}. We then consider (8.3). Therefore, from (8.2) and (8.3), we have, for any S ∈ D, the bound (8.4), where we used the fact that for any positive constants a and b, 2ab ≤ c^{−2}a² + c²b² for some c > 1. Take S such that (8.5) holds. Note that this is possible since the right hand side is a matrix valued polynomial in (τ − t_0)/h of degree up to ℓ, and span{p_0(t), p_1(t), ..., p_ℓ(t)} = span{1, t, ..., t^ℓ}. Under the condition that all entries of A^{(k)}(t) are bounded by a, the entries of S_k are bounded by R(T)a; thus, the corresponding S ∈ D. Obviously, rank(S_i) ≤ (ℓ + 1 − i)r. Since A ∈ Σ(β, L), we consider the ℓ-th order Taylor expansion (8.6) of A at t_0. We then apply the Taylor expansion (8.6) and the identity (8.5) to obtain a bound in which U denotes the matrix with all entries equal to 1; the first inequality is due to A_ij ∈ Σ(β, L), and the second is due to |τ − t_0| ≤ h. Under the condition that X is uniformly distributed on 𝒳, and using the orthogonality of {p_i(t)}_{i=0}^ℓ, the resulting bound is easy to check. Optimizing the right hand side with respect to h and taking η = mr log n yields the stated choice of h_n, where C is a numerical constant. This completes the proof of the theorem.

Proof of Theorem 4.1
Proof. It is easy to see that (8.7) holds. By (8.2), (8.7), and the arguments used to prove Theorem 3.1, we obtain the corresponding local bound with probability at least 1 − n^{−mr}. Then, taking the union bound over k, we get from (8.10), with probability at least 1 − n^{−(mr−1)}, the claimed bound, where C_2(a, Φ, ℓ, L) is a constant depending on a, Φ, ℓ, and L.

Proof of Theorem 4.2
Proof. In this proof, we use C(K) to denote any constant depending on K which may vary from place to place.

‖Ã(t) − A(t)‖ ≤ ‖Ã(t) − EÃ(t)‖ + ‖EÃ(t) − A(t)‖. (8.11)

The first term on the right hand side is recognized as the variance term and the second as the bias. Firstly, we deal with the bias term, denoted B(t_0). By applying the Taylor expansion of A(τ) as in (8.6) and the fact that K is a kernel of order ℓ, we obtain an expansion of EÃ(t_0), with notation as in (8.6). It is easy to check that the first term on the right hand side is A(t_0). Therefore we can rewrite B(t_0), where the second equality is due to the fact that each element of A(t) is in Σ(β, L) and K is a kernel of order ℓ. We can then bound each element of the matrix in (8.13). Next, we bound both terms on the right hand side of (8.13) respectively. For each t_i, the quantity S_n(t_i) is a sum of zero mean random matrices, so we apply the matrix Bernstein inequality, see [34]. Under the assumptions of Theorem 4.2, one can easily check that, with probability at least 1 − e^{−η},

‖S_n(t_i)‖ ≤ C(K) ( √( m² a² (η + log 2m) / (mnh) ) ∨ a(η + log 2m)/(nh) ).
By taking the union bound over all i and setting η = 4 log n, we get the corresponding bound with probability at least 1 − n^{−2}. As for the second term on the right hand side of (8.13), by the assumption that K is a Lipschitz function with Lipschitz constant L_K, we obtain the corresponding Lipschitz bound; choosing the bandwidth to balance the bias and variance terms then yields the bound of the theorem.

Proof of Theorem 5.1
Proof. Without loss of generality, we assume that both m and r are even numbers. We introduce several objects which are key to constructing the hypothesis set. For some constant γ > 0, consider the set of block matrices in (8.14), where O denotes the m/2 × (m/2 − r⌊m/r⌋/2) zero matrix. We then consider a subset of Hermitian matrices S_m ⊂ H_m. An immediate observation is that for any matrix A ∈ S_m, rank(A) ≤ r. By the Varshamov-Gilbert bound (see Lemma 2.9 in [35]), there exists a subset A_0 ⊂ S_m with cardinality Card(A_0) ≥ 2^{mr/32} + 1, containing the zero m × m matrix, such that any two distinct elements A_1 and A_2 of A_0 are well separated. For some sufficiently small α > 0, define f_n accordingly; it is easy to check that f_n(t) ∈ Σ(β, L) on [0, 1]. We consider the corresponding hypotheses for A at t_0. The following claims are easy to check: firstly, any element in A_0^β together with its derivatives has rank uniformly bounded by r, and the difference of any two elements of A_0^β satisfies the same property for fixed t_0; secondly, the entries of any element of A_0^β together with its derivatives are uniformly bounded by some constant for a sufficiently small choice of γ; finally, each element A(t) ∈ A_0^β belongs to Σ(β, L). Therefore, A_0^β ⊂ A(r, a) for an appropriately chosen γ. According to (8.16), for any two distinct elements A_1(t) and A_2(t) of A_0^β, the difference between A_1(t) and A_2(t) at the point t_0 is given by (8.18). On the other hand, we consider the joint distributions P^A_{τ,X,Y} such that τ ∼ U[0, 1], X ∼ Π_0, where Π_0 denotes the uniform distribution on 𝒳, τ and X are independent, and the conditional distribution of Y given (τ, X) is chosen accordingly. One can easily check that as long as A(τ) ∈ A_0^β, such a P^A_{τ,X,Y} belongs to the distribution class P(r, a). We denote the corresponding n-product probability measure by P_A. Then, for any A(τ) ∈ A_0^β, we bound the Kullback-Leibler divergence between P_0 and P_A. Thus, by the inequality −log(1 + u) ≤ −u + u²/2, ∀u > −1, and the fact that P_A(Y = a | τ, X) ∈ [1/4, 3/4], we obtain (8.19). Recalling that A(τ) = A f_n(τ) ∈ A_0^β, and using τ ∼ U[0, 1] and X ∼ Π_0, we obtain the desired divergence bound. Therefore, given that Card(A_0) ≥ 2^{mr/32} + 1, together with (8.19), we see that (8.20) is satisfied for any α > 0 if γ is chosen as a sufficiently small constant. In view of (8.18) and (8.20), the lower bound (5.1) follows from Theorem 2.5 in [35].

Proof of Theorem 5.2
Proof. Without loss of generality, we assume that both m and r are even numbers. Take a real number c_1 > 0 and define the functions in (8.21), where f is defined the same as in (8.17). Meanwhile, we consider the set Ω_0 of all binary sequences of length M. In what follows, we combine two fundamental results from coding theory: one is the Varshamov-Gilbert bound [14, 36] in its general form for q-ary codes; the other is the volume estimate of Hamming balls. Let A_q(n, d) denote the largest size of a q-ary code of block length n with minimal Hamming distance d.
We now have all the elements in hand to construct our hypothesis set. Denote Ω_1 = {ω_1, ..., ω_N}, which is the subset of Ω_0 without ω_0. We then consider a subset E_1 of E given by the corresponding construction. Clearly, S_1 := Card(E_1) ≥ 2^{M/8}. Then we define a new collection C of matrix valued functions; the collection C is obviously an S_1-ary code of block length mr/4. Thus, we can apply the result of Proposition 8.1. It is easy to check the required estimate for p = 1/4 and q ≥ 4. In our case, q = S_1 ≥ 2^{M/8} and n = mr/4. Taking p = 1/4, we obtain a lower bound on A_{S_1}(mr/4, mr/16), stated in (8.24). In other words, (8.24) guarantees that there exists a subset H_0 ⊂ C with Card(H_0) ≥ 2^{Mmr/128} such that for any A_1, A_2 ∈ H_0 the Hamming distance between A_1 and A_2 is at least mr/16. Now we define the building blocks of our hypothesis set, where O_{m/2 × r/2} denotes the (m/2) × (r/2) zero matrix. Obviously, H has size Card(H) ≥ 2^{Mmr/64} + 1, and for any A_1(t), A_2(t) ∈ H, the minimum Hamming distance is still greater than mr/16. We consider the set of matrix valued functions built from these blocks, where O denotes the m/2 × (m/2 − r⌊m/r⌋/2) zero matrix. Finally, our hypothesis set of matrix valued functions taking values in H_m is defined accordingly.

Now we consider any two different hypotheses indexed by ω and ω′ with ω ≠ ω′. Based on (8.21), we obtain the separation bound (8.27), where c* is a constant depending on ‖f‖²_2, L, c_1 and γ.
On the other hand, we repeat the same analysis of the Kullback-Leibler divergence K(P_0, P_A) as in the proof of Theorem 5.1. One can then check that (8.29) is satisfied for any α > 0 if γ is chosen as a sufficiently small constant. In view of (8.27) and (8.29), the lower bound follows from Theorem 2.5 in [35].

Proof of Theorem 5.3
Proof. Without loss of generality, assume that m is an even number. For some constant γ > 0, denote by V a suitable set of vectors v ∈ C^{m/2} and consider the set of matrices B(V) built from them. Clearly, B(V) is a collection of rank one matrices. Then we construct another matrix set V_m, where Õ is the m/2 × m/2 zero matrix. Apparently, V_m ⊂ H_m.

On the other hand, we define a grid on [0, 1] and associated functions φ_j, where f is defined the same as in (8.17) and c_2 is some constant. Denote Φ := {φ_j : j = 1, ..., M}. We consider the corresponding set of hypotheses A^β_B. By construction, the following claims are obvious: any element A(t) of A^β_B has rank at most 2; the entries of A(t) ∈ A^β_B are uniformly bounded for some sufficiently small γ; and A_ij(t) ∈ Σ(β, L). Thus A^β_B ⊂ A(a). Now we consider the distance between two distinct elements A(t) and A′(t); the relevant norm comparison holds due to the fact that, ∀t ∈ (0, 1), rank(A(t) − A′(t)) ≤ 4. We then turn to obtaining a lower bound on sup_{t∈(0,1)} ‖A(t) − A′(t)‖. Recall that, by the construction of A^β_B, there are three cases to be considered: 1) A_1 ≠ A_2 and j = k; 2) A_1 = A_2 ≠ 0 and j ≠ k; 3) A_1 ≠ A_2 and j ≠ k. For case 1, the separation is bounded below with a constant c*_1 depending on ‖f‖²_∞, β, L and γ. For case 2, the analogous bound holds with a constant c*_2 depending on ‖f‖²_∞, β, L and γ.
For case 3, the corresponding bound holds with a constant c*_3 depending on ‖f‖²_∞, β, L and γ. Therefore, by the analysis above, we conclude that for any two distinct elements A(t) and A′(t) the bound (8.32) holds, where c* is a constant depending on ‖f‖²_∞, L, γ and β. Meanwhile, we repeat the same analysis of the Kullback-Leibler divergence K(P_0, P_A) as in the proof of Theorem 5.1. One can check that for any A ∈ A^β_B, the Kullback-Leibler divergence K(P_0, P_A) between P_0 and P_A satisfies (8.33). Combining (8.31) and (8.33), we see that (8.34) is satisfied for any α > 0 if γ is chosen as a sufficiently small constant. In view of (8.32) and (8.34), the lower bound follows from Theorem 2.5 in [35].

Proof of Theorem 6.1
Proof. For any Â_k, denote by U_j the difference in empirical loss between Â_k and A, as in (8.35); its elementary properties are easy to check. The following concentration inequality (Lemma 2), developed by [12] to prove Bernstein's inequality, is key to our proof.
Firstly, we bound the variance of U_j. Under the assumption that |Y| and |⟨A(τ), X⟩| are bounded by a constant a, one can easily check that h = 8a²/3. Given E(Y_j | τ_j, X_j) = ⟨A(τ_j), X_j⟩, we know that the covariance between the two terms on the right hand side of (8.35) is zero. Conditionally on (τ, X), the second order moment of the first term satisfies the stated bound. To see why, one can consider the random variable Ỹ with the distribution P{Ỹ = a} = P{Ỹ = −a} = 1/2. The variance of Y is always bounded by the variance of Ỹ, which is a², under the assumption that |Y_j| and |⟨Â_k(τ_j), X_j⟩| are bounded by a constant a > 0. Similarly, the variance of the second term conditioned on (τ, X) is also bounded by 4a² E⟨Â_k(τ_j) − A(τ_j), X_j⟩². As a result, we obtain the variance bound for U_j. By the result of Lemma 2, we have, for any Â_k, a deviation bound that holds with probability at least 1 − e^{−t}. Setting t = επ_k + log(1/δ), we get the corresponding bound with probability at least 1 − δ/e^{επ_k}. By the definition of Â*, we have, with probability at least 1 − δ/e^{επ_*}, the analogous bound, where π_* is the penalty term associated with Â*. Now we apply the result of Lemma 2 one more time with t = log(1/δ) and get a bound with probability at least 1 − δ. Applying the union bound to (8.36) and (8.37), taking ε = 3/(32a²) and c = εh, and then taking δ = 1/n^{mr} and adjusting the constants, we have, with probability at least 1 − 1/n^{mr}, the bound of the theorem, where C(a) is a constant depending on a.

Appendix: Proof of Lemma 1
The proof of Lemma 1 follows from a similar approach introduced by [19].
Proof. For any S ∈ H_m of rank r, consider its spectral decomposition, and let P_L, P_L^⊥ be the following orthogonal projectors in the space (H_m, ⟨·, ·⟩), where P_L denotes the orthogonal projector associated with the linear span of {e_1, ..., e_r} and P_L^⊥ is its orthogonal complement. Clearly, this formulation provides a decomposition of a matrix A into a "low rank part" P_L(A) and a "high rank part" P_L^⊥(A) when rank(S) = r is small. Given b > 0, define the cone K(D; L; b) in the space H_m, which consists of matrices with a "dominant" low rank part if S is low rank. Firstly, we can rewrite (3.1) as (A.1). Denote the loss function by L(Ỹ; ⟨S(τ), X̃⟩) := (Ỹ_j − ⟨S, X̃_j⟩)², and define the corresponding risk. Since S_h is a solution of the convex optimization problem (A.1), there exists V ∈ ∂‖S_h‖_1 such that the first order optimality condition holds for all S ∈ D (see [2], Chap. 2). This implies that, for all S ∈ D, (A.2) holds, where L′ denotes the partial derivative of L(y; u) with respect to u. One can easily check that (A.3) holds for all S ∈ D, where Π̃ denotes the distribution of X̃. If EL(Ỹ; ⟨S_h, X̃⟩) ≤ EL(Ỹ; ⟨S, X̃⟩) for all S ∈ D, then the oracle inequality of Lemma 1 holds trivially, so we assume that EL(Ỹ; ⟨S_h, X̃⟩) > EL(Ỹ; ⟨S, X̃⟩) for some S ∈ D. Thus, inequalities (A.2) and (A.3) imply (A.4). According to the well known representation of the subdifferential of the nuclear norm, see [17], Sec. A.4, any V ∈ ∂‖S‖_1 can be written as V = sign(S) + P_L^⊥(W) with ‖W‖ ≤ 1. By the duality between the nuclear norm and the operator norm, and by the monotonicity of subdifferentials of the convex function ‖·‖_1, for any V := sign(S) + P_L^⊥(W) ∈ ∂‖S‖_1 we can use (A.5) to refine the bound in (A.4) and obtain (A.6). For simplicity of presentation, we use the notation (A.7) for the empirical process. The next part of the proof derives an upper bound on the empirical process (A.7). Before we start the derivation, let us present several vital ingredients that will be used in what follows. For a given S ∈ D and for δ_1, δ_2, δ_3, δ_4 ≥ 0, denote the class A(δ_1, δ_2) and

α_n(δ_1, δ_2) := sup{ |(P − P_n)( L′(Ỹ; ⟨A, X̃⟩) ⟨A − S, X̃⟩ )| : A ∈ A(δ_1, δ_2) },

and similarly ᾱ_n(δ_1, δ_2, δ_3) and α̃_n(δ_1, δ_4). Given the definitions above, Lemma 3 below gives upper bounds on the three quantities α_n(δ_1, δ_2), ᾱ_n(δ_1, δ_2, δ_3), and α̃_n(δ_1, δ_4). The proof of Lemma 3 can be found in section A.1. Denote by Ξ the associated Rademacher average, where the ε_j are i.i.d. Rademacher random variables.
Since both S_h and S are in D, by the definition of ᾱ_n and α̃_n we have (A.12) and (A.13). Assume for a while that (A.15) holds. By the definition of the subdifferential, for any V ∈ ∂‖S_h‖_1, the corresponding inequality holds. Then we apply (A.13) in bound (A.4) and use the upper bound on α̃_n(δ_1, δ_4) of Lemma 3 to get a bound with probability at least 1 − e^{−η}; applying it to (A.6), we get, with probability at least 1 − e^{−η}, a further bound, where the first inequality is elementary. With assumption (A.17) in force, we get (A.20) from (A.19).

If the following condition holds, we can immediately conclude that P(L(Ỹ; ⟨S_h, X̃⟩)) ≤ P(L(Ỹ; ⟨S, X̃⟩)) plus a remainder term, which is sufficient for the bound of Lemma 1. Otherwise, by the assumption that P(L(Ỹ; ⟨S_h, X̃⟩)) > P(L(Ỹ; ⟨S, X̃⟩)), one can easily check an inequality which implies that S_h − S ∈ K(D; L; 5). This fact allows us to use the bound on α_n(δ_1, δ_2) of Lemma 3. We get from (A.6)

P(L(Ỹ; ⟨S_h, X̃⟩)) + ‖S_h − S‖²_{L_2(Π)} + ε‖P_L^⊥(S_h)‖_1 ≤ P(L(Ỹ; ⟨S, X̃⟩)) + ε⟨sign(S), S − S_h⟩.

We still need to specify δ_k^−, δ_k^+, k = 1, 2, 3, 4, to establish the bound of the theorem. By the definition of S_h, we have η ∨ η̄ ∨ η̃ ≤ η* for a proper choice of the numerical constant B in the definition of η*. When condition (A.15) does not hold, which means that at least one of the numbers δ_k^−, k = 1, 2, 3, 4, we chose is not a lower bound on the corresponding norm, we can still use the bounds in (A.26) instead of (A.12) and (A.13). In the case when S_h − S ∈ K(D; L; 5), we can use the bound in (A.27) instead of bound (A.14). Then one can repeat the arguments above with only minor modifications. By adjusting the constants, the result of Lemma 1 holds.
The last thing we need to specify is the size of ε, which controls the nuclear norm penalty. Recall from condition (A.17) that the essence is to control E‖Ξ‖. Here we use a simple but powerful noncommutative matrix Bernstein inequality. The original approach was introduced by [1]; later, the result was improved by [34] based on the classical result of [26]. We state the following lemma, which is a direct consequence of the result proved by [34], and omit its proof; by integrating its exponential tail bound one obtains a bound on E‖Ξ‖, where C is a numerical constant.
Together with (A.17), we conclude that it suffices to take ε as stated, for some numerical constant D > 0, which completes the proof of Lemma 1.
The proofs of the second and the third bounds are similar to this one, so we omit the repeated arguments.