Minimax risks for sparse regressions: Ultra-high-dimensional phenomenons

Consider the standard Gaussian linear regression model $Y=X\theta_0+\epsilon$, where $Y\in R^n$ is a response vector and $X\in R^{n\times p}$ is a design matrix. Numerous works have been devoted to building efficient estimators of $\theta_0$ when $p$ is much larger than $n$. In such a situation, a classical approach amounts to assuming that $\theta_0$ is approximately sparse. This paper studies the minimax risks of estimation and testing over classes of $k$-sparse vectors $\theta_0$. These bounds shed light on the limitations due to high-dimensionality. The results encompass the problem of prediction (estimation of $X\theta_0$), the inverse problem (estimation of $\theta_0$) and linear testing (testing $X\theta_0=0$). Interestingly, an elbow effect occurs when the quantity $k\log(p/k)$ becomes large compared to $n$. Indeed, the minimax risks and hypothesis separation distances blow up in this ultra-high dimensional setting. We also prove that even dimension reduction techniques cannot provide satisfying results in an ultra-high dimensional setting. Moreover, we compute the minimax risks when the variance of the noise is unknown. The knowledge of this variance is shown to play a significant role in the optimal rates of estimation and testing. All these minimax bounds provide a characterization of statistical problems that are so difficult that no procedure can provide satisfying results.


Introduction
In many important statistical applications, including remote sensing, functional MRI and gene expression studies, the number $p$ of parameters is much larger than the number $n$ of observations. An active line of research aims at developing computationally fast procedures that also achieve the best possible statistical performance in this "$p$ larger than $n$" setting. A typical example is the study of $\ell_1$-based penalization methods for the estimation of linear regression models. However, if $p$ is really too large compared to $n$, all these new procedures fail to achieve good estimation.
Thus, there is a need to understand the intrinsic limitations of a statistical problem: what is the best rate of estimation or testing achievable by a procedure? Is it possible to design good procedures for arbitrarily large $p$, or are there theoretical limitations when $p$ becomes "too large"? These limitations tell us what kinds of data analysis problems are so complex that no statistical procedure is able to provide reasonable results. Furthermore, the knowledge of such limitations may drive research towards areas where computationally efficient procedures are shown to be suboptimal.

Linear regression and statistical problems
We observe a response vector $Y \in R^n$ and a real design matrix $X$ of size $n \times p$. Consider the linear regression model
$$Y = X\theta_0 + \epsilon, \qquad (1.1)$$
where the vector $\theta_0$ of size $p$ is unknown and the random vector $\epsilon$ follows a centered normal distribution $N(0_n, \sigma^2 I_n)$. Here, $0_n$ stands for the null vector of size $n$ and $I_n$ for the identity matrix of size $n$.
In some cases, the design $X$ is considered as fixed, either because it has been previously chosen or because we work conditionally on the design. In other cases, the rows of the design matrix $X$ correspond to an $n$-sample of a random vector $X$ of size $p$. The design $X$ is then said to be random. A specific class of random designs is made of Gaussian designs, where $X$ follows a centered normal distribution $N(0_p, \Sigma)$. The analyses of fixed and Gaussian designs share many common points. In this work, we shall highlight the similarities and the differences between both settings.
There are various statistical problems arising in the linear regression model (1.1). Let us list the most classical issues:
(P1): Linear hypothesis testing. In general, the aim is to test whether $\theta_0$ belongs to a linear subspace of $R^p$. Here, we focus on testing the null hypothesis $H_0: \{\theta_0 = 0_p\}$. In Gaussian design, this is equivalent to testing whether $Y$ is independent from $X$.
(P2): Prediction, that is the estimation of $X\theta_0$.
(P3): Inverse problem, that is the estimation of $\theta_0$.
(P4): Support estimation aims at recovering the support of $\theta_0$, that is the set of indices corresponding to non-zero coefficients. The easier problem of dimension reduction amounts to estimating a set $M \subset \{1, \ldots, p\}$ of "reasonable" size that contains the support of $\theta_0$ with high probability.
Much work has been devoted to these statistical questions in the so-called high-dimensional setting, where the number of covariates $p$ is possibly much larger than $n$. A classical approach to statistical analysis in this setting is to assume that $\theta_0$ is sparse, in the sense that most of the components of $\theta_0$ are equal to 0. For the problem of prediction (P2), procedures based on complexity penalization are proved to provide good risk bounds for known variance [11] and unknown variance [6] but are computationally inefficient. In contrast, convex penalization methods such as the Lasso or the Dantzig selector are fast to compute, but only provide good performance under restrictive assumptions on the design $X$ (e.g. [8,13,50]). Exponential weighted aggregation methods [18,40] are another example of fast and efficient methods. The $\ell_1$ penalization methods have also been analyzed for the inverse problem (P3) [8] and for support estimation (P4) [36,49]. Dimension reduction methods are often studied in more general settings than linear regression [17,26]. In the linear regression model, the SIS method [25], based on the correlation between the response and the covariates, allows one to perform dimension reduction. The problem of high-dimensional hypothesis testing (P1) has so far attracted less attention. Some testing procedures are discussed in [7,3] for fixed design and in [44,34] for Gaussian design.

Sparsity and ultra-high dimensionality
Given a positive integer $k$, we say that the vector $\theta_0$ is $k$-sparse if $\theta_0$ contains at most $k$ nonzero components. We call $k$ the sparsity parameter. In this paper, we are interested in the setting $k < n < p$. We denote by $\Theta[k, p]$ the set of $k$-sparse vectors in $R^p$.
In linear regression, most of the results about classical procedures require that the triplet $(k, n, p)$ satisfies $k[1 + \log(p/k)] < n$. When $k$ is "small", this corresponds to assuming that $p$ is subexponential with respect to $n$. The analysis of the Lasso in prediction, inverse problems [8], and support estimation [38] entails such assumptions. In dimension reduction, the SIS method [25] also requires this assumption. Although the multiple testing procedure of [7] can be analyzed for $k[1 + \log(p/k)]$ larger than $n$, it exhibits a much slower rate of testing in this case. In noiseless problems ($\sigma = 0$), compressed sensing methods [23] fail when $k[1 + \log(p/k)]$ is large compared to $n$ (see [22] for numerical illustrations). In the sequel, we say that the problem is ultra-high dimensional$^1$ when $k[1+\log(p/k)]$ is large compared to $n$. Observe that ultra-high dimensionality does not necessarily imply that $p$ is exponential with respect to $n$. As an example, taking $p = n^3$ and $k = n/\log\log(n)$ asymptotically yields an ultra-high dimensional problem.
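As a rough numerical illustration of the last example, the following sketch evaluates the ratio $k[1+\log(p/k)]/n$ for $p = n^3$ and $k = n/\log\log(n)$. The function names and the values of $n$ are illustrative assumptions and do not come from the paper. Since the text gives no numerical threshold at this point, the sketch only shows that the ratio keeps growing with $n$.

```python
import math


def uhd_ratio(k: float, n: float, p: float) -> float:
    """Ratio k[1 + log(p/k)] / n; large values indicate the ultra-high dimensional regime."""
    return k * (1.0 + math.log(p / k)) / n


def example_ratio(n: int) -> float:
    """Ratio for the example p = n^3 and k = n / log log(n).

    Assumes n >= 16 so that log log(n) > 1 and hence k < n (illustrative restriction).
    k is left as a real number rather than rounded to an integer.
    """
    p = n ** 3
    k = n / math.log(math.log(n))
    return uhd_ratio(k, n, p)


if __name__ == "__main__":
    for n in (10**2, 10**4, 10**8):
        print(n, round(example_ratio(n), 2))
```

The ratio behaves like $2\log(n)/\log\log(n)$ for large $n$, so it diverges even though $p$ is only polynomial in $n$.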
Why should we care about ultra-high dimensional problems? In this setting, there are so many variables that statistical questions such as the estimation of $\theta_0$ (P3) or its support (P4) are likely to be difficult. Nevertheless, if the signal-to-noise ratio is large, do there exist estimators that perform relatively well? The answer is no. We prove in this paper that a phase transition phenomenon occurs in an ultra-high dimensional setting and that most of the estimation and testing problems become hopeless. This phase transition phenomenon implies that some statistical problems that are tackled in postgenomics or functional MRI cannot actually be addressed properly.
Example 1.1 (Motivating example). In some gene network inference problems (e.g. [16]), the number $p$ of genes can be as large as 5000 while the number $n$ of microarray experiments is only of order 50. Let us consider a gene $A$. We denote by $G_A$ the set of genes that interact with the gene $A$, and $k$ stands for the cardinality of $G_A$. How large can $k$ be so that it is still "reasonable" to estimate $G_A$ from the microarray experiments? In statistical terms, inferring the set of genes interacting with $A$ amounts to estimating the support of a vector $\theta_0$ in a linear regression model (see e.g. [38]). Our answer is that if $k$ is larger than 4, then the problem of network estimation becomes extremely difficult. We will come back to this example and explain this answer in Section 7.

Minimax risks
A classical way to assess the performance of an estimator $\hat\theta$ is to consider its maximal risk over a class $\Theta \subset R^p$. This is the minimax point of view. For the time being, we only define the notions of minimaxity for estimation problems (P2 and P3). Their counterparts in the case of testing (P1) and dimension reduction (P4) will be introduced in subsequent sections. Given a loss function $l(\cdot,\cdot)$ and an estimator $\hat\theta$, the maximal risk of $\hat\theta$ over $\Theta[k, p]$ for a design $X$ (or a covariance $\Sigma$ in the Gaussian design case) and a variance $\sigma^2$ is defined by $\sup_{\theta_0\in\Theta[k,p]} E_{\theta_0,\sigma}[l(\hat\theta, \theta_0)]$. Taking the infimum of the maximal risk over all possible estimators $\hat\theta$, we obtain the minimax risk
$$\inf_{\hat\theta}\ \sup_{\theta_0\in\Theta[k,p]} E_{\theta_0,\sigma}[l(\hat\theta, \theta_0)].$$
We say that an estimator $\hat\theta$ is minimax if its maximal risk over $\Theta[k, p]$ is close to the minimax risk.
In practice, we do not know the number $k$ of non-zero components of $\theta_0$ and we seldom know the variance $\sigma^2$ of the error. If an estimator $\hat\theta$ does not require the knowledge of $k$ and nearly achieves the minimax risk over $\Theta[k, p]$ for a range of $k$, we say that $\hat\theta$ is adaptive to the sparsity. Similarly, an estimator $\hat\theta$ is adaptive to the variance $\sigma^2$ if it does not require the knowledge of $\sigma^2$ and nearly achieves the minimax risk for all $\sigma^2 > 0$. When possible, the main challenge is to build adaptive procedures. In some statistical problems considered here, adaptation is in fact impossible and there is an unavoidable loss when the variance or the sparsity parameter is unknown. In such situations, it is interesting to quantify this unavoidable loss.

Our contribution and related work
In the specific case of the Gaussian sequence model, where $n = p$ and $X = I_n$, the minimax risks over $k$-sparse vectors have been studied for a long time. Donoho and Johnstone [21,35] have provided the asymptotic minimax risks of prediction (P2). Baraud [5] has studied the optimal rate of testing from a non-asymptotic point of view, while Ingster [31,32,33] has provided the asymptotic optimal rate of testing with exact constants.

$^1$ [...] $O(n^\beta)$ with $\beta < 1$. We argue in this paper that as soon as $k\log(p)/n$ goes to 0, the case $\log(p) = O(n^\beta)$ is not intrinsically more difficult than conditions such as $p = O(n^\delta)$ with $\delta > 0$.
Recently, some high-dimensional problems have been studied from a minimax point of view. Wainwright [45,46] provides minimax lower bounds for the problem of support estimation (P4). Raskutti et al. [39] and Rigollet and Tsybakov [40] have provided minimax upper bounds and lower bounds for (P2) and (P3) over $\ell_q$ balls for general fixed designs $X$ when the variance $\sigma^2$ is known (see also Ye and Zhang [47] and Abramovich and Grinshtein [1]). Arias-Castro et al. [3] and Ingster et al. [34] have computed the asymptotic minimax detection boundaries for the testing problem (P1) for some specific designs. However, their study only encompasses reasonable dimensional problems ($p$ grows polynomially with $n$). Some minimax lower bounds have also been stated for testing (P1) and prediction (P2) problems with Gaussian design [42,44]. None of the aforementioned results covers the ultra-high dimensional case or tackles the problem of adaptation to both $k$ and $\sigma$.
This paper provides minimax lower bounds and upper bounds for the problems (P1), (P2), (P3) when the regression vector $\theta_0$ is $k$-sparse, for fixed and random designs, known and unknown variance, known and unknown sparsities. The lower and upper bounds match up to possible differences in the logarithmic terms. The main discoveries are the following:

1. Phase transition in an ultra-high dimensional setting. Contrary to previous work, our results cover both the high-dimensional and ultra-high dimensional settings. We establish that for each of the problems (P1), (P2) and (P3), an elbow effect occurs when $k\log(p/k)$ becomes large compared to $n$. Let us emphasize the difference between the high-dimensional and the ultra-high dimensional regimes for two problems: prediction (P2) and support estimation (P4).

Prediction with random design. In the (non-ultra) high-dimensional setting, the minimax risk of prediction for a random design regression is of order $\sigma^2 k\log(p/k)/n$ (see Section 3). Thus, the effect of the sparsity $k$ is linear and the effect of the number of variables $p$ is logarithmic.
In an ultra-high dimensional setting, that is when $k\log(p/k)/n$ is large, we establish that an elbow effect occurs in the minimax risk. In this setting, the minimax risk becomes of order $\sigma^2\exp[Ck\{1 + \log(p/k)\}/n]$, where $C$ is a positive constant: it grows exponentially fast with $k$ and polynomially with $p$ (see the red curve in Figure 1). While it was expected that the minimax risk cannot be small for such problems, we prove here that the minimax risk is in fact exponentially larger than the usual $k\log(p/k)/n$ term.

Support estimation. In a non-ultra high dimensional setting, it is known [46] that under some assumptions on the design $X$ (e.g. each component of $X$ is drawn independently from the standard normal distribution), the support of a $k$-sparse vector $\theta_0$ is recoverable with high probability if [condition (1.2)], where $C$ is a numerical constant. In an ultra-high dimensional setting, even if [condition (1.3)], it is not possible to estimate the support of $\theta_0$ with high probability. Observe that the condition (1.3) is much stronger than (1.2). In fact, it is not even possible to drastically reduce the dimension of the problem without forgetting relevant variables with positive probability. More precisely, for any dimension reduction procedure that selects a subset of variables $M \subset \{1, \ldots, p\}$ of size $p^\delta$ with some $0 < \delta < 1$ (described in Proposition 6.7), we have $\mathrm{supp}(\theta_0) \not\subset M$ with probability bounded away from zero (see Proposition 6.7). Thus, it is almost hopeless to obtain a reliable estimation of the support of $\theta_0$, even if $\|\theta_0\|_p^2/\sigma^2$ is large. This impossibility of dimension reduction for ultra-high dimensional problems is numerically illustrated in Section 7.
2. Adaptation to the sparsity $k$ and to the variance $\sigma^2$. Most theoretical results for the problems (P1) and (P2) require that the variance $\sigma^2$ is known. Here, we establish these minimax bounds for both known and unknown variance and known and unknown sparsity. The knowledge of the variance is proved to play a fundamental role for the testing problem (P1) when $k[1 + \log(p/k)]$ is large compared to $\sqrt{n}$. The knowledge of $\sigma^2$ is also proved to be crucial for (P2) in an ultra-high dimensional setting. Thus, specific work is needed to develop fast and efficient procedures that do not require the knowledge of the variance. Furthermore, variance estimation is extremely difficult in an ultra-high dimensional setting.

3. Effect of the design. Lastly, the minimax bounds of (P1), (P2) and (P3) are established for fixed and Gaussian designs. Except for the problem of prediction (P2), the minimax risks are shown to be of the same nature for both forms of the design. Furthermore, we investigate the dependency of the minimax risks on the design $X$ (resp. $\Sigma$) in Sections 4-6.

The minimax bounds stated in this paper are non-asymptotic. While some upper bounds are consequences of recent results in the literature, most of the effort is spent here on deriving the lower bounds. These bounds rely on Fano's and Le Cam's methods [48] and on geometric considerations. In each case, near optimal procedures are exhibited.

Organization of the paper
In Section 3, we summarize the minimax bounds for specific designs called "worst-case" and "best-case" designs in order to emphasize the effects of dimensionality. The general results are stated in Section 4 for the tests and Section 5 for the problem of prediction. The problems of inverse estimation, support estimation, and dimension reduction are studied in Section 6. In Section 7, we address the following practical question: for exactly what range of $(k, p, n)$ should we consider a statistical problem as ultra-high dimensional? A small simulation study illustrates this answer. Section 8 contains the final discussion and side results about variance estimation. Section 9 is devoted to the proof of the main minimax lower bounds. Specific statistical procedures allow us to establish the minimax upper bounds. Most of these procedures are used as theoretical tools but should not be applied in a high dimensional setting because they are computationally inefficient. In order to clarify the statements of the results in Sections 4-6, we postpone the definition of these procedures to Section 10. The remaining proofs are described in a technical appendix [43].

Notations and preliminaries
We respectively denote by $\|\cdot\|_n$ and $\|\cdot\|_p$ the $\ell_2$ norms in $R^n$ and $R^p$, while $\langle\cdot,\cdot\rangle_n$ refers to the inner product in $R^n$. For any $\theta_0 \in R^p$ and $\sigma > 0$, $P_{\theta_0,\sigma}$ and $E_{\theta_0,\sigma}$ refer to the joint distribution of $(Y, X)$ and to the corresponding expectation. When there is no risk of confusion, we simply write $P$ and $E$. All references with a capital letter such as Section A or Eq. (A.3) refer to the technical Appendix [43].
In the sequel, we denote by $\mathrm{supp}(\theta_0)$ the support of $\theta_0$. For any $1 \le k \le p$, $M(k, p)$ stands for the collection of all subsets of $\{1, \ldots, p\}$ with cardinality $k$. Given $i \in \{1, \ldots, p\}$, we denote by $X_i$ the vector of size $n$ corresponding to the $i$-th column of $X$. For $m \subset \{1, \ldots, p\}$, $X_m$ stands for the $n \times |m|$ submatrix of $X$ that contains the columns $X_i$, $i \in m$. In what follows, we denote by $X^T$ the transpose of $X$.
Gaussian design and conditional distribution. When the design is said to be "Gaussian", the $n$ rows of $X$ are $n$ independent samples of a random row vector $X$ such that $X^T \sim N(0_p, \Sigma)$. Thus, $(Y, X)$ is an $n$-sample of the random vector $(Y, X^T) \in R^{p+1}$, where $Y$ is defined by
$$Y = X\theta_0 + \epsilon,$$
where $\epsilon \sim N(0, \sigma^2)$. The linear regression model with Gaussian design is relevant to understand the conditional distribution of a Gaussian variable $Y$ conditionally on a Gaussian vector, since $E[Y|X] = X\theta_0$ and $\mathrm{Var}(Y|X) = \sigma^2$. This is why we shall often refer to $\sigma^2$ as the conditional variance of $Y$ when considering Gaussian design. This model is also closely connected to the estimation of Gaussian graphical models [38,44].
As explained later, the minimax risk over Θ[k, p] strongly depends on the design X. This is why we introduce some relevant quantities on X.
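Definition 2.1. Consider some integer $k > 0$ and some design $X$. We define
$$\Phi_{k,+}(X) := \sup_{\theta \in \Theta[k,p]\setminus\{0_p\}} \frac{\|X\theta\|_n^2}{\|\theta\|_p^2}, \qquad \Phi_{k,-}(X) := \inf_{\theta \in \Theta[k,p]\setminus\{0_p\}} \frac{\|X\theta\|_n^2}{\|\theta\|_p^2}.$$
In fact, $\Phi_{k,+}(X)$ and $\Phi_{k,-}(X)$ respectively correspond to the largest and the smallest restricted eigenvalue of order $k$ of $X^T X$.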
Given a symmetric real square matrix A, ϕ max (A) stands for the largest eigenvalue of A. Finally, C, C 1 ,. . . denote positive universal constants that may vary from line to line. The notation C(.) specifies the dependency on some quantities.
In the propositions, the constants involved in the assumptions are not always expressly specified. For instance, sentences of the form "Assume that $n \ge C$. Then, . . ." mean that "There exists a universal $C > 0$ such that if $n \ge C$, then . . .".

Main results
The exact bounds are stated in Sections 4-6. In order to explain these results, we now summarize the main minimax bounds by focusing on the role of $(k, n, p)$ rather than on the dependency on the design $X$. In order to keep the notation short, we do not provide in this section the minimal assumptions of the results. Let us simply mention that all of them are valid if the sparsity $k$ satisfies $k \le p^{1/3} \wedge (n/5)$ and $p \ge n \ge C$, where $C$ is a positive numerical constant.

Definitions
First, the results are described for the problem of prediction (P2), since the problem of minimax estimation is more classical in this setting. Different prediction loss functions are used for fixed and Gaussian designs. When the design is considered as fixed, we study the loss $\|X(\theta_1 - \theta_2)\|_n^2/(n\sigma^2)$. For Gaussian design, we consider the integrated prediction loss function
$$\|\sqrt{\Sigma}(\theta_1 - \theta_2)\|_p^2/\sigma^2.$$
Given a design $X$, the minimax risk of prediction over $\Theta[k, p]$ with respect to $X$ is
$$R_F[k, X] := \inf_{\hat\theta}\ \sup_{\theta_0\in\Theta[k,p]} E_{\theta_0,\sigma}\left[\frac{\|X(\hat\theta - \theta_0)\|_n^2}{n\sigma^2}\right].$$
For a Gaussian design with covariance $\Sigma$, we study the quantity
$$R_R[k, \Sigma] := \inf_{\hat\theta}\ \sup_{\theta_0\in\Theta[k,p]} E_{\theta_0,\sigma}\left[\frac{\|\sqrt{\Sigma}(\hat\theta - \theta_0)\|_p^2}{\sigma^2}\right].$$
These minimax risks of prediction depend not only on $(k, n, p)$ but also on the design $X$ (or on the covariance $\Sigma$). The computation of the exact dependency of the minimax risks on $X$ or $\Sigma$ is a challenging question. To simplify the presentation in this section, we only describe the minimax prediction risks for worst-case designs defined by
$$R_F[k] := \sup_{X} R_F[k, X], \qquad R_R[k] := \sup_{\Sigma} R_R[k, \Sigma],$$
the supremum being taken over all designs $X$ of size $n \times p$ (resp. all covariance matrices $\Sigma$). The quantity $R_F[k]$ corresponds to the smallest risk achievable uniformly over $\Theta[k, p]$ and all designs $X$. It is shown in Section 5 that the quantity $R_R[k]$ is achieved (up to constants) for a covariance $\Sigma = I_p$, while the quantity $R_F[k]$ is achieved with high probability for designs $X$ that are realizations of the standard Gaussian design (all the components of $X$ are drawn independently from a standard normal distribution). This corresponds to designs used in compressed sensing [23]. In fact, the maximal risks $R_F[k]$ and $R_R[k]$ for the prediction problem correspond to typical situations where the design is well-balanced, that is as close as possible to orthogonality.

Results
In the sequel, we say that $R_F[k]$ is of order $f(k, p, n, C)$, where $C$ is a positive constant, when there exist two positive universal constants $C_1$ and $C_2$ such that
$$f(k, p, n, C_1) \le R_F[k] \le f(k, p, n, C_2).$$
These minimax risks are computed in Section 5 and are gathered in Table 1. They are also depicted in Figure 1.

When $k\log(p/k)$ remains small compared to $n$, the minimax risk of prediction is of the same order for fixed and Gaussian design. The $k\log(p/k)/n$ risk is classical and has been known for a long time in the specific case of the Gaussian sequence model [35]. Some procedures based on complexity penalization or aggregation (e.g. [11]) are proved to achieve these risks uniformly over all designs $X$. Computationally efficient procedures like the Lasso or the Dantzig selector are only proved to achieve a $k\log(p)/n$ risk under assumptions on the design $X$ [8]. If the support of $\theta_0$ is known in advance, the parametric risk is of order $k/n$. Thus, the price to pay for not knowing the support of $\theta_0$ is only logarithmic in $p$.
In an ultra-high dimensional setting, the minimax prediction risk in fixed design remains smaller than one. It is the minimax risk of estimation of the vector $E(Y)$ of size $n$. This means that the sparsity index $k$ no longer plays a role in ultra-high dimension. For a Gaussian design, the minimax prediction risk becomes of order $C_1 (p/k)^{C_2 k/n}$: it increases exponentially fast with respect to $k$ and polynomially fast with respect to $p$. Comparing this risk with the parametric rate $k/n$, we observe that the price to pay for not knowing the support of $\theta_0$ is now far higher than $\log(p)$.
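The two regimes described above can be compared numerically with the following minimal sketch. It is not a procedure from the paper: the constants $C$, $C_1$, $C_2$ are unknown and are all set to 1 here, the cut-off $k\log(p/k) \le n$ between the two Gaussian-design regimes is an assumption made for illustration, and the function names are hypothetical. The values $n = 50$ and $p = 5000$ are those of Example 1.1.

```python
import math


def fixed_design_order(k: int, n: int, p: int) -> float:
    """Order of the worst-case prediction risk in fixed design: min(k log(p/k)/n, 1)."""
    return min(k * math.log(p / k) / n, 1.0)


def gaussian_design_order(k: int, n: int, p: int,
                          c1: float = 1.0, c2: float = 1.0) -> float:
    """Order of the worst-case prediction risk in Gaussian design.

    Uses k log(p/k)/n while this ratio is at most 1 (assumed cut-off),
    and c1 * (p/k)^(c2 * k/n) beyond it; c1 and c2 are placeholder constants.
    """
    ratio = k * math.log(p / k) / n
    if ratio <= 1.0:
        return ratio
    return c1 * (p / k) ** (c2 * k / n)


if __name__ == "__main__":
    n, p = 50, 5000
    for k in (2, 5, 10, 20):
        print(k, fixed_design_order(k, n, p), gaussian_design_order(k, n, p))
```

With these placeholder constants, the two orders coincide for small $k$, after which the fixed-design order saturates at one while the Gaussian-design order keeps growing.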
In Section 5, we also study the adaptation to the sparsity index $k$ and to the variance $\sigma^2$. We prove that adaptation to $k$ and $\sigma^2$ is possible for a Gaussian design. In fixed design, no procedure can be simultaneously adaptive to the sparsity $k$ and the variance $\sigma^2$ (see the red curve in Figure 1, which corresponds to fixed design, $\sigma$ and $k$ unknown).

Definitions
Let us turn to the problem (P1) of testing $H_0: \{\theta_0 = 0_p\}$. We fix a level $\alpha > 0$ and a type II error probability $\delta > 0$. Minimax lower and upper bounds for this problem are discussed in Section 4.
Suppose we are given a test procedure $\Phi_\alpha$ of level $\alpha$ for fixed design $X$ and known variance $\sigma^2$. The $\delta$-separation distance of $\Phi_\alpha$ over $\Theta[k, p]$, denoted $\rho_F[\Phi_\alpha, k, X]$, is the minimal number $\rho$ such that $\Phi_\alpha$ rejects $H_0$ with probability larger than $1 - \delta$ if $\|X\theta_0\|_n/\sqrt{n} \ge \rho\sigma$. Hence, $\rho_F[\Phi_\alpha, k, X]$ corresponds to the minimal distance such that the hypotheses $\{\theta_0 = 0_p\}$ and $\{\theta_0 \in \Theta[k, p],\ \|X\theta_0\|_n^2 \ge n\rho^2\sigma^2\}$ are well separated by the test $\Phi_\alpha$. Although the separation distance also depends on $\delta$, $n$, and $p$, we only write $\rho_F[\Phi_\alpha, k, X]$ for the sake of conciseness. By definition, the test $\Phi_\alpha$ has a power larger than $1 - \delta$ for all $\theta_0 \in \Theta[k, p]$ such that $\|X\theta_0\|_n^2 \ge n\rho_F^2[\Phi_\alpha, k, X]\sigma^2$. We then consider
$$\rho^*_F[k, X] := \inf_{\Phi_\alpha} \rho_F[\Phi_\alpha, k, X].$$
The infimum runs over all level-$\alpha$ tests. We call this quantity the $(\alpha, \delta)$-minimax separation distance over $\Theta[k, p]$ with design $X$ and variance $\sigma^2$. The minimax separation distance is a non-asymptotic counterpart of the detection boundaries studied in the Gaussian sequence model [20]. Similarly, we define the $(\alpha, \delta)$-minimax separation distance over $\Theta[k, p]$ with Gaussian design by replacing the distance $\|X\theta_0\|_n/\sqrt{n}$ by the distance $\|\sqrt{\Sigma}\theta_0\|_p$:
$$\rho^*_R[k, \Sigma] := \inf_{\Phi_\alpha} \rho_R[\Phi_\alpha, k, \Sigma]. \qquad (3.6)$$
Various bounds on $\rho^*_F[k, X]$ and $\rho^*_R[k, \Sigma]$ are stated in Section 4. In this section, we only provide the orders of magnitude of the minimax separation distances in the "worst case" designs in order to emphasize the effect of dimensionality:
$$\rho^*_F[k] := \sup_{X} \rho^*_F[k, X], \qquad \rho^*_R[k] := \sup_{\Sigma} \rho^*_R[k, \Sigma].$$
This is the smallest separation distance that can be achieved by a procedure $\Phi_\alpha$ uniformly over all designs $X$ (resp. $\Sigma$). As for the prediction problem, it will be proved in Section 4 that the quantities $\rho^*_F[k]$ and $\rho^*_R[k]$ are achieved for well-balanced designs. It is not always possible to achieve the minimax separation distances with a procedure $\Phi_\alpha$ that does not require the knowledge of the variance $\sigma^2$. This is why we also consider $\rho^*_{F,U}[k]$ and $\rho^*_{R,U}[k]$, the minimax separation distances for fixed and Gaussian design when the variance is unknown. Roughly, $\rho^*_{F,U}[k]$ corresponds to the minimal distance $\rho^2$ that allows one to separate well the hypotheses $\{\theta_0 = 0_p \text{ and } \sigma > 0\}$ and $\{\theta_0 \in \Theta[k, p] \text{ and } \sigma > 0,\ \|X\theta_0\|_n^2/\sigma^2 \ge n\rho^2\}$ when $\sigma$ is unknown.
We shall provide a formal definition at the beginning of Section 4.

Results
In Table 2, we provide the orders of the minimax separation distances over $\Theta[k, p]$ for fixed and Gaussian designs, known and unknown variance (see also Figure 2).

[Table 2: orders of the minimax separation distances for fixed and Gaussian design, with $\sigma$ known and $\sigma$ unknown.]

In contrast to (P2), the minimax separation distances are of the same order for fixed and Gaussian design.
1. When $k\log(p) \le \sqrt{n}$, all the minimax separation distances are of order $k\log(p)/n$. This quantity also corresponds to the minimax risk of prediction (P2) stated in the previous subsection. This separation distance has already been proved in the specific case of the Gaussian sequence model [5,20].
2. When $k\log(p) \ge \sqrt{n}$, the minimax separation distances are different under known and unknown variance. If the variance is known, the minimax separation distance over $\Theta[k, p]$ stays of order $1/\sqrt{n}$. Here, $1/\sqrt{n}$ corresponds in fixed design to the minimax separation distance of the hypothesis $\{E[Y] = 0_n\}$ against the general hypothesis $\{E[Y] \ne 0_n\}$ for known variance (see Baraud [5]).
3. If the variance is unknown, the minimax separation distance over $\Theta[k, p]$ is still of order $k\log(p)/n$ if $k\log(p)$ is small compared to $n$. In contrast, the minimax separation distance blows up to the order $C_1 p^{C_2 k/n}$ in an ultra-high dimensional setting. This blow-up phenomenon has also been observed in the previous section for the problem of prediction (P2) in Gaussian design.
In conclusion, the knowledge of the variance is of great importance for $k\log(p)$ larger than $\sqrt{n}$.
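The three cases above can be summarized in a small sketch that returns the order of the minimax separation distance under known and unknown variance. This is an illustration only: the constants $C_1$ and $C_2$ are set to 1, the cut-off $k\log(p) \le n$ used for the unknown-variance blow-up is an assumption (the text only says "small compared to $n$"), and the function names are hypothetical.

```python
import math


def separation_order_known(k: int, n: int, p: int) -> float:
    """Known variance: k log(p)/n while k log(p) <= sqrt(n), then 1/sqrt(n)."""
    return min(k * math.log(p) / n, 1.0 / math.sqrt(n))


def separation_order_unknown(k: int, n: int, p: int,
                             c1: float = 1.0, c2: float = 1.0) -> float:
    """Unknown variance: k log(p)/n while k log(p) <= n (assumed cut-off),
    then c1 * p^(c2 * k/n); c1 and c2 are placeholder constants."""
    ratio = k * math.log(p) / n
    if ratio <= 1.0:
        return ratio
    return c1 * p ** (c2 * k / n)


if __name__ == "__main__":
    n, p = 100, 10_000  # p = n^2, as assumed in the theorems of Section 4
    for k in (1, 5, 20):
        print(k, separation_order_known(k, n, p), separation_order_unknown(k, n, p))
```

For $k\log(p) \le \sqrt{n}$ both functions return the same value; beyond that point the known-variance order is capped at $1/\sqrt{n}$ while the unknown-variance order keeps increasing.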

Definitions
In the inverse problem (P3), we are primarily interested in the estimation of $\theta_0$ rather than $X\theta_0$. This is why the loss function under study is $\|\theta_1 - \theta_2\|_p^2$. Minimax lower and upper bounds for this loss function are discussed in Section 6. For a fixed design $X$, we denote by $RI_F[k, X]$ the minimax risk of estimation over $\Theta[k, p]$. If one transforms the design $X$ by a homothety of factor $\lambda > 0$, then this multiplies the minimax risk for the inverse problem by a factor $1/\lambda^2$. For the sake of simplicity, we restrict ourselves to designs $X$ such that each column has been normed to $\sqrt{n}$. The collection of such designs is denoted $D_{n,p}$. The supremum of the minimax risks over the designs in $D_{n,p}$ is $+\infty$: take for instance a design where the first two columns are equal. In this section, we only present the infimum of the minimax risks over $\Theta[k, p]$ as $X$ varies across $D_{n,p}$:
$$RI_F[k] := \inf_{X \in D_{n,p}} RI_F[k, X].$$
The quantity $RI_F[k]$ is interpreted in the following way: given $(k, n, p)$, what is the smallest risk we can hope for if we use the best possible design? Alternatively, given $n$ observations, what is the intrinsic difficulty of estimating a $k$-sparse vector of size $p$? We call this quantity the minimax risk for the inverse problem over $\Theta[k, p]$.
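The effect of a homothety on the minimax risk can be checked with a short computation. The symbols $\tilde X$, $\tilde\theta_0$ and $\tilde\theta$ are introduced here for illustration only and are not notation from the paper.

```latex
% Rescaling the design by \lambda > 0 is the same as rescaling the parameter by 1/\lambda:
\[
Y = X\theta_0 + \epsilon = (\lambda X)(\theta_0/\lambda) + \epsilon
  = \tilde X \tilde\theta_0 + \epsilon,
\qquad \tilde X := \lambda X, \quad \tilde\theta_0 := \theta_0/\lambda .
\]
% \tilde\theta_0 is k-sparse if and only if \theta_0 is, and any estimator \hat\theta of \theta_0
% yields the estimator \tilde\theta := \hat\theta/\lambda of \tilde\theta_0 (and conversely), with
\[
\|\tilde\theta - \tilde\theta_0\|_p^2 = \lambda^{-2}\,\|\hat\theta - \theta_0\|_p^2 .
\]
```

Taking the supremum over $k$-sparse vectors and the infimum over estimators on both sides shows that the minimax risk for the design $\lambda X$ is $1/\lambda^2$ times the minimax risk for $X$, which is why some normalization of the columns is needed.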
In Section 6, we also study the corresponding minimax risks of the inverse problem in the random design case. Let $S_p$ stand for the set of covariance matrices that contain only ones on the diagonal. We respectively denote by $RI_R[k, \Sigma]$ and $RI_R[k]$ the minimax risk of estimation over $\Theta[k, p]$ for a covariance $\Sigma$ and the minimax risk of estimation over $\Theta[k, p]$.

Results
In Table 3, we provide the minimax risks in fixed design for different values of $(k, n, p)$ (see also Figure 3). If $k\log(p/k)$ remains smaller than $n$, it is possible to recover the risk $Ck\log(p/k)$ for "good" designs. This risk is for instance achieved by the Dantzig selector of Candès and Tao [15] for nearly-orthogonal designs, which roughly means that the restricted eigenvalues $\Phi_{3k,+}(X)$ and $\Phi_{3k,-}(X)$ of $X^T X$ are close to one. In an ultra-high dimensional setting, it is no longer possible to build nearly-orthogonal designs $X$, and the minimax risk of the inverse problem blows up as for testing problems (P1) or prediction problems in Gaussian design (P2). Moreover, adaptation to the sparsity $k$ and to the variance $\sigma^2$ is possible for the inverse problem. As explained in Section 6, the quantities $RI_R[k, \Sigma]$ and $RI_R[k]$ behave somewhat similarly to their fixed design counterparts.
In Section 6, we also discuss the consequences of the minimax bounds on the problem of support estimation (P4). We prove that, in an ultra-high dimensional setting, it is not possible to estimate the support of $\theta_0$ with high probability unless the ratio $\|\theta_0\|_p^2/\sigma^2$ is larger than $C_1 (p/k)^{C_2 k/n}$. In fact, even the problem of support estimation is almost hopeless in an ultra-high dimensional setting.

Hypothesis Testing
We start with the testing problem (P1) because some minimax lower bounds in prediction and inverse estimation derive from testing considerations.

Gaussian design
As mentioned in the introduction, the knowledge of $\sigma^2 = \mathrm{Var}(Y|X)$ is really unlikely in many practical applications. Nevertheless, we study this case to highlight the differences between known and unknown conditional variances. Furthermore, these results turn out to be useful for analyzing the minimax separation distances in fixed design problems. We recall that the notions of minimax separation distances have been defined in Section 3.2.

Theorem 4.1. Assume that $\alpha + \delta \le 53\%$, $p \ge n^2$, and that $n \ge 8\log(2/\delta)$. For any $1 \le k \le n$, the $(\alpha, \delta)$-minimax separation distance (3.6) with covariance $I_p$ is lower bounded by [bound (4.1)]. For any $1 \le k \le p$ and any covariance $\Sigma$, we have [bound (4.2)]. Furthermore, this upper bound is simultaneously achieved for all $k$ and $\Sigma$ by a procedure $T^*_\alpha$ (defined in Section 10.1.1).

Remark 4.1. [Adaptation to sparsity] It follows from Theorem 4.1 that adaptation to the sparsity is possible and that the optimal separation distance is of order [order (4.3)] for all sparsities $k$ between 1 and $n$. In contrast, the minimax lower bound (4.1) is restricted to the case $\Sigma = I_p$. This implies that there exists some constant $C(\alpha, \delta)$ such that, for any covariance $\Sigma$, $\rho^*_R[k, \Sigma] \le C(\alpha, \delta)\,\rho^*_R[k, I_p]$. In other words, the testing problem is more complex (up to constants) for an independent design than for a correlated design.
These two bounds are of the order of (4.3), since it is assumed that $p \ge n^2$. However, the dependency of the logarithmic terms on $k$ in the last bounds does not allow us to provide the minimax separation distance when $p = n$ and $k$ is close to $\sqrt{n}$. For instance, if $p = n$ and $k = \sqrt{n}/\log(n)$, the two bounds only match up to a factor $\log(n)/\log\log(n)$. The non-asymptotic minimax bounds of Baraud [5] in the Gaussian sequence model suffer from the same weakness. To our knowledge, the dependency on $\log(k)$ of the minimax separation distances has only been captured in an asymptotic setting [3,34] ($(k, p, n) \to \infty$).

Fixed design
The separation distances are similar to the Gaussian design case.
Theorem 4.2. Assume that α + δ ≤ 33%, p ≥ n 2 ≥ C(α, δ), and that n ≥ 8 log(2/δ). For any 1 ≤ k ≤ n, there exist some n × p designs X such that For any 1 ≤ k ≤ p and any design X, we have Furthermore, this upper bound is simultaneously achieved for all k and X by a procedure T * α (defined in Section 10.1.1).
As for the random design case, we conclude that adaptation to the sparsity is possible and that In fact, the proof shows that, with large probability, designs X whose components are independently sampled from a standard normal variable satisfy (4.4).
Arias-Castro et al. [3] and Ingster et al. [34] have recently provided the asymptotic minimax separation distance with exact constant for known variance when the design satisfies very specific conditions. Theorem 4.2 provides the non-asymptotic counterpart of their result, but the constants in (4.4) and (4.5) are not optimal.

Preliminaries
We now turn to the study of the minimax separation distances when the variance σ 2 is unknown. In Section 3.2, we have introduced the notions of δ-separation distances and (α, δ)-minimax separation distances when the variance σ 2 is known. We now define their counterparts for an unknown variance σ 2 .
Let us consider a test Φ α of the hypothesis H 0 for the linear regression model with fixed design X. We say that Φ α has a level α under unknown variance if This means that the type I error probability is controlled uniformly over all variance σ 2 . Similarly, we want to control the type II error probabilities uniformly over all variances. The δ-separation Hence, ρ F,U [Φ α , k, X] corresponds to the minimal distance such that the hypotheses {θ 0 = 0 p and σ > 0} and Taking the infimum over all level α tests, we get the (α, δ) minimax separation distance over Θ[k, p] with design X and unknown variance is In the Gaussian design, we define analogously to (4.6) and (4.7) by replacing the norm Xθ 0 n / √ n by √ Σθ 0 p .

Gaussian design
Minimax bounds have been proved in [44] in the non ultra-high dimensional setting. The next theorem encompasses high dimensional and ultra-high dimensional settings.
Theorem 4.3. Suppose that α+ δ ≤ 53% and that p ≥ n ≥ 8 log(2/δ). For any 1 ≤ k ≤ ⌊p 1/3 ⌋, the (α, δ)-minimax separation distance over Θ[k, p] with covariance I p and unknown variance satisfies For any 1 ≤ k ≤ n/2 and any covariance Σ, we have Furthermore, this upper bound is simultaneously achieved for all k and Σ by a procedure T α (defined in Section 10. Remark 4.5. The condition k ≤ p 1/3 can be replaced by k ≤ p 1/2−γ with γ > 0, the only difference being that the constants involved in (4.8) would depend on γ. These conditions are not really restrictive for a sparse high-dimensional regression since the usual setting is k ≤ n ≪ p.
As a consequence (4.10) does not necessarily capture the right dependency on k in the logarithmic terms. This observation also holds for all the next results that require k ≤ p 1/3 .
Remark 4.6. [Dependent design] As for the known variance case, we have ρ * R,U [k, , that is, the testing problem is more complex for an independent design than for a correlated design. For some covariance matrices Σ, the minimax separation distance with covariance Σ is much smaller than ρ * R,U [k, I p ]. Verzelen and Villers [44] provide an example of such a matrix Σ (see Propositions 8 and 9). However, the arguments used in the proof of their example do not generalize to other covariances. In fact, the computation of sharp minimax bounds that capture the dependency of ρ * R,U [k, Σ] on Σ remains an open problem.

Fixed design
Ingster et al. [34] derive the asymptotic minimax separation distance for some specific designs when k log(p)/n goes to 0. Here, we provide the non-asymptotic counterpart that encompasses all the regimes.
Proposition 4.4. Assume that α + δ ≤ 26% and that p ≥ n ≥ C(α, δ). For any 1 ≤ k ≤ ⌊p 1/3 ⌋, there exist some n × p designs X such that For any 1 ≤ k ≤ n/2 and any n × p design X, we have Furthermore, this upper bound is simultaneously achieved for all k and X by a procedure T α (defined in Section 10.1.2).
Again, we observe a phenomenon analogous to the random design case.

Comparison between known and unknown variance
There are three regimes depending on (k, p, n). They are depicted on Figure 2:

1. k log(p) ≤ √ n. The minimax separation distances are of the same order for known and unknown σ 2 . The minimax distance k log(p)/n is also of the same order as the minimax risk of prediction.

2. √ n ≤ k log(p) ≤ n. If σ 2 is known, the minimax separation distance is always of order 1/ √ n. In such a case, an optimal procedure amounts to testing the hypothesis $\{E[\|Y\|_n^2] = n\sigma^2\}$ against $\{E[\|Y\|_n^2] > n\sigma^2\}$ using the statistic $\|Y\|_n^2/\sigma^2$. If σ 2 is unknown, the statistic $\|Y\|_n^2/\sigma^2$ is not available and the minimax separation distance behaves like k log(p)/n.

3. k log(p) ≥ n. If σ 2 is unknown, the minimax separation distance blows up. It is of order (p/k) Ck/n . Consequently, the problem of testing {θ 0 = 0 p } becomes extremely difficult in this setting.

Prediction
In contrast to the testing problem, the minimax risks of prediction (P 2 ) exhibit really different behaviors in fixed and in random design. The big picture is summarized in Figure 1. We recall that the minimax risks , and R R [k] are defined in Section 3.1.

Gaussian design
for any k ≤ n ≤ p/2. This statement has been proved in [42] (Proposition 4.5) in the special case of restricted isometry, but the proof straightforwardly extends to restricted eigenvalue conditions. For Σ = I p , the lower bound (5.2) does not capture the elbow effect in an ultra-high dimensional setting (compare with (5.1)).

Theorem 5.2. [Minimax upper bound]
Assume that n ≥ C. There exists an estimator θ V (defined in Section 10.2.1) such that the following holds: 1. The computation of θ V does not require the knowledge of σ 2 or k.
2. For any covariance Σ, any σ > 0, any 1 ≤ k ≤ ⌊(n − 1)/4⌋, and any θ 0 ∈ Θ[k, p] we have In contrast to similar results such as Theorem 1 in Giraud [27] or Theorem 3.4 in Verzelen [42], we do not restrict k to be smaller than n/(2 log p), that is, we encompass both high-dimensional and ultra-high dimensional settings. The proof of the theorem is based on a new deviation inequality for the spectrum of Wishart matrices stated in Lemma 11.2.

Remark 5.2. [Minimax risk]
We derive from Theorem 5.2 and Proposition 5.1 that the minimax risk R R [k] is of order If k log(p/k) is small compared to n, the minimax risk of estimation is of order Ck log(p/k)/n. In an ultra-high dimensional setting, we again observe a blow up. 3 is restricted to the identity covariance. This implies that the minimax prediction risk for a general matrix Σ is at worst of the same order as in the independent case: there exists a universal constant C > 0 such that for all covariance Σ, In Remark 5.1, we have stated a minimax lower bound for prediction that depends on the restricted eigenvalues of Σ. Fix some 0 < γ < 1. If we consider some covariance matrices Σ such

Known variance
The minimax prediction risk with known variance has been studied in Raskutti et al. [39] and Rigollet and Tsybakov [40] (see also [1,47]). For any design X and any 1 ≤ k ≤ n, these authors have proved that the minimax risk R F [k, X] satisfies Next, we bound the supremum sup X R F [k, X] and we study the possibility of adaptation to the sparsity.
Proposition 5.3. For any 1 ≤ k ≤ n, the supremum sup X R F [k, X] is lower bounded as follows Assume that p ≥ n. There exists an estimator θ BM (defined in Section 10.2.2) which satisfies for any 1 ≤ k ≤ n.
Remark 5.5. If k log(p/k) is small compared to n, the minimax risk is of order Ck log(p/k)/n. In an ultra-high dimensional setting, this minimax risk remains close to one. This corresponds (up to renormalization) to the minimax risk of estimation of the vector E[Y] of size n. As a consequence, the sparsity assumption no longer plays a role in an ultra-high dimensional setting. From (5.6), we derive that adaptation to the sparsity is possible when the variance σ 2 is known.
For designs X, such that the ratio Φ 2k,− (X)/Φ 2k,+ (X) is close to one, the lower bounds and upper bounds of (5.4) agree with each other. This is for instance the case of the realizations (with high probability) of a Gaussian standard independent design (see the proof of Proposition 5.3 for more details). However, the dependency of the minimax lower bound in (5.4) on X is not sharp when the ratio Φ 2k,− (X)/Φ 2k,+ (X) is away from one. Take for instance an orthogonal design with p = n and duplicate the last column. Then, the lower bound (5.4) for this new design X is 0 while the minimax risk is of order k log(p/k)/n.
Similarly, the dependency of the minimax upper bound in (5.4) on X is not sharp. For very specific designs, it is possible to obtain a minimax risk R F [k, X] that is much smaller than (k/n) log(p/k) ∧ 1 (see Abramovich and Grinshtein [1]).
Remark 5.7. [Comparison with l 1 procedures] The designs X for which l 1 procedures such as the Lasso or the Dantzig selector are proved to perform well require that Φ 2k,− (X)/Φ 2k,+ (X) is close to one. It is interesting to notice that these designs X precisely correspond to situations where the minimax risk is close to its maximum k log(p/k)/n (see Equation (5.4)). We refer to [39] for a more complete discussion.
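The duplicated-column example mentioned above can be checked numerically. The sketch below builds an orthogonal design with p = n, duplicates its last column, and exhibits a 2-sparse vector in the kernel of the new design, so that its smallest restricted eigenvalue vanishes. The normalization of the columns to norm √ n and the variable names are assumptions made for illustration.

```python
import numpy as np

n = 6
rng = np.random.default_rng(0)

# Orthogonal design with p = n, rescaled so that each column has norm sqrt(n).
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
X = np.sqrt(n) * Q
smallest_singular_value = np.linalg.svd(X, compute_uv=False).min()

# Duplicate the last column: the new design has n + 1 columns.
X_dup = np.hstack([X, X[:, [-1]]])

# 2-sparse vector supported on the two identical columns.
v = np.zeros(n + 1)
v[n - 1], v[n] = 1.0, -1.0
image_norm = np.linalg.norm(X_dup @ v)

print(smallest_singular_value)  # all singular values of X equal sqrt(n)
print(image_norm)               # X_dup maps the 2-sparse vector v to zero
```

Because a non-zero 2-sparse vector is sent to zero, any lower bound proportional to the smallest restricted eigenvalue degenerates for this design, although the statistical problem is essentially unchanged.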

Unknown variance
We now consider the problem of prediction when the variance σ 2 is unknown.
Proposition 5.4. For any 1 ≤ k ≤ n, there exists an estimator θ (k) that does not require the knowledge of σ 2 such that Thus, the optimal risk of prediction over Θ[k, p] remains of the same order for known and unknown σ 2 .
Let us now study to what extent adaptation to the sparsity is possible when the variance σ 2 is unknown. In order to get some ideas, let us provide risk bounds for two procedures that do not require the knowledge of σ: the estimator θ V already studied for Gaussian design (defined in Section 10.2.1) and the estimator θ n defined by $\theta_n \in \arg\min_{\theta \in \mathbb{R}^p} \|Y - X\theta\|_n^2$.
For any 1 ≤ k ≤ n, the maximal risk of θ n over Θ[k, p] is upper bounded as follows The risk bound (5.8) is also satisfied by the procedure of Baraud et al. [6]. The proof of (5.8) is a consequence of one of their results.
Remark 5.9. As a consequence, θ V simultaneously achieves the minimax risk over all Θ[k, p] for all k ≤ ⌊(n − 1)/4⌋ such that k(1 + log(p/k)) ≤ n. In an ultra-high dimensional setting, the maximum risk of θ V over Θ[k, p] is controlled by (ep/k) Ck/n while the minimax risk is smaller than 1. If the upper bound (5.8) is sharp, then θ V is not adaptive to the sparsity in an ultra-high dimensional setting.
In contrast, θ n is minimax adaptive over all Θ[k, p] such that k(1 + log(p/k)) ≥ n, but its behavior is suboptimal in a non-ultra-high dimensional setting.
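To see why the least-squares estimator θ n ignores the sparsity, note that when p ≥ n and X has rank n, any minimizer of the least-squares criterion interpolates the data, so that Xθ n = Y and the prediction error reduces to the noise. The sketch below illustrates this with `numpy.linalg.lstsq`, which returns the minimum-norm minimizer; the dimensions, the seed and the choice of θ 0 are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 20, 100, 3

X = rng.standard_normal((n, p))      # Gaussian design, rank n almost surely
theta0 = np.zeros(p)
theta0[:k] = 1.0                     # k-sparse parameter (arbitrary choice)
noise = rng.standard_normal(n)       # sigma = 1
Y = X @ theta0 + noise

# One element of argmin ||Y - X theta||^2 (the minimum-norm solution).
theta_n, *_ = np.linalg.lstsq(X, Y, rcond=None)

residual = np.linalg.norm(Y - X @ theta_n)
prediction_loss = np.linalg.norm(X @ (theta_n - theta0)) ** 2 / n
noise_level = np.linalg.norm(noise) ** 2 / n

print(residual)                      # numerically zero: the fit interpolates Y
print(prediction_loss, noise_level)  # the two quantities coincide
```

The prediction loss is therefore of the order of the noise level whatever the value of k, which is optimal only in the ultra-high dimensional regime.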
In order to get an estimator that is adaptive to all indexes k, we would need to merge the properties of θ V (for non-ultra-high dimensional cases) and of θ n (for ultra-high dimensional cases). The following proposition tells us that it is in fact impossible.

Proposition 5.6. [Adaptation to the sparsity is impossible under unknown variance]
Consider any p ≥ n ≥ C 1 and 1 ≤ k ≤ ⌊p 1/3 ⌋ such that k log(ep/k) ≥ C 2 n. There exists a design X of size n × p such that for any estimator θ, we have either As a benchmark, we recall the minimax upper bounds: The proof of proposition 5.6 is based on the minimax lower bounds (4.11) for the testing problem (P 1 ) under unknown variance. The proof uses designs X that are realizations of standard Gaussian designs.
Remark 5.10. In the setup of Proposition 5.6, any estimator θ that does not require the knowledge of k and σ 2 has to pay at least one of these two prices: 1. The estimator θ does not use the sparsity of the true parameter θ 0 . Its risk for estimating 0 p is of the same order as the minimax risk over R p . The estimator θ n has this drawback.

For any
This is the price for adaptation when σ 2 is unknown. The estimator θ V exhibits this behavior.
As a conclusion, it is impossible to merge the qualities of θ V and of θ n . The best prediction risk that can be achieved by a procedure that aims at adaptation to the sparsity is of order $\frac{k}{n}\log\left(\frac{p}{k}\right)\exp\left[C\frac{k}{n}\log\left(\frac{p}{k}\right)\right]$.
In other words, the unavoidable loss for adaptation under unknown variance is a factor exp[Ck/n log(p/k)]. In this sense, the estimator θ V (and as a byproduct the procedure of Baraud et al. [6]) achieves the optimal prediction risk under unknown variance and unknown sparsity.
In conclusion, the minimax risks of prediction are of the same order for fixed and Gaussian design and for known and unknown variance when k log(p/k) is small compared to n. In an ultra-high dimensional setting, the minimax risks behave differently. For Gaussian design, the minimax risk is of order (p/k) Ck/n . In contrast, the minimax risk of prediction remains smaller than one for fixed design regression with known variance. When the sparsity and the variance are unknown, there is a price to pay for adaptation under fixed design. All these behaviors are depicted on Figure 1.

Minimax risk of estimation
We recall that the minimax risks of estimation for the inverse problem RI F [k, X], RI F [k], RI R [k, Σ], and RI R [k] have been defined in Section 3.3.

Fixed design
First, we consider the problem (P 3 ) for a fixed design regression model. The minimax risk of estimation over Θ[k, p] with a design X is denoted RI F [k, X] and is defined in (3.8). Raskutti et al. [39] have recently provided the following bounds that hold for any fixed design X and any 1 ≤ k ≤ n. The lower and upper bounds match up to the factor Φ 2k∧p,+ (X)/Φ 2k∧p,− (X). The upper bound is achieved by the least-squares estimator over Θ[k, p] [39]. If the restricted eigenvalues of X are close to one, then the minimax risk is of order k log(ep/k). Next, we improve the lower bound in (6.1) in order to grasp the behavior of the minimax risk for non-orthogonal designs.
Proposition 6.1. For any design X and any 1 ≤ k ≤ n, we have In order to interpret these bounds let us restrict ourselves to design X such that each column has √ n norm, as justified in Section 3.3. The collection of such designs is noted D n,p . Observe that X ∈ D n,p enforces Φ 1,+ (X) = n.
In the sequel, we are interested in the smallest minimax risk RI F [k, X] that is achievable if we can choose the n × p design X ∈ D n,p , that is, we want to bound RI F [k] = inf X∈Dn,p RI F [k, X]. The minimax risk RI F [k] tells us the intrinsic difficulty of estimating a k-sparse vector of size p with n observations.
Proposition 6.2.
1. Assume that k[1 + log(p/k)] ≤ Cn. Then, we have This bound is for instance achieved for designs X that are realizations (with high probability) of a normalized standard Gaussian design.
2. For any design X ∈ D n,p and any k ≤ n ∧ p/2, we have
3. For any k ≤ n/4 ∧ p/2, we have (6.5)
Remark 6.1. The bound (6.3) tells us that the best minimax risk that is achievable in a non-ultra-high dimensional setting is of order k log(ep/k)/n. The Lasso achieves the (almost optimal) risk bound k log(p)/n under some assumptions on the design matrix.
Remark 6.2. The lower bound (6.4) is of geometric nature. Combined with (6.2), it implies the lower bound of (6.5). In an ultra-high dimensional setting, it is not possible to build a design X such that Φ 2k,+ (X) /Φ 2k,− (X) is close to one (see Remark 5.8). In fact, the quantity Φ −1 2k,− (X) blows up because of geometric constraints. When k[1 + log(p/k)] is large compared to n log(n), both bounds in (6.5) are comparable and the minimax risk is of order exp[Ck/n log(p/k)]. As a consequence, the inverse problem becomes extremely difficult in an ultra-high dimensional setting.
Remark 6.3. While the quantity k log(p/k) in (6.3) is due to the "size" of the parameter space Θ[k, p], the exponential term of the minimax risk in ultra-high dimension is essentially driven by geometrical constraints on the design X.
Proposition 6.3 (Adaptation to the sparsity and the variance). As in the prediction case, we consider the estimator θ V (defined in Section 10.2.1). Assume that p ≥ 2n. For any design X, any σ > 0, any 1 ≤ k ≤ ⌊(n − 1)/4⌋, and any θ 0 ∈ Θ[k, p], we have with probability larger than 1 − e −n − C/p.
Remark 6.4. Although the bound (6.6) holds in probability and not in expectation, it suggests that adaptation to the sparsity and to the variance is possible.

Random design
Let us turn to the Gaussian design case. We are interested in bounding RI R [k, Σ] and RI R [k] as defined in (3.9).
Proposition 6.4. For any 1 ≤ k ≤ (n − 1)/4, and any covariance Σ we have  .7) is blowing up, the lower bound remains as small as k log(p/k)/n. Nevertheless, we know from Proposition 5.1 that This suggests that RI R [k] is blowing up in an ultra-high dimensional setting but the problem remains open.
In the next proposition, we state the counterpart of Proposition 6.3 in the random design case.

Consequences on support estimation
We deduce from the minimax lower bounds for the inverse problem (P 3 ) some consequences for the support estimation problem (P 4 ) in an ultra-high dimensional setting. The case where k[1 + log(p/k)] is small compared to n has been studied in Wainwright [45].
Definition 6.1. For any ρ > 0 and any k ≤ p, the set C p k (ρ) is made of all vectors θ in Θ[k, p] such that θ contains exactly k non-zero coefficients that are all equal to ρ/ √ k.
In a non-ultra-high dimensional setting, Wainwright [46] has proved that, under suitable conditions on a design X ∈ D n,p , it is possible to recover the support of any vector θ 0 that belongs to C p k (ρ) with ρ of order of k log(p)/nσ. Here, we prove that ρ has to be much larger in an ultra-high dimensional setting. For any design X ∈ D n,p it is not possible to recover the support of θ 0 with high probability, unless θ 0 satisfies: θ 0 2 p This quantity blows up in an ultra-high dimensional setting and it can be much larger than the usual k log(p)/n that can be achieved in a non-ultra-high dimensional setting.
As it is almost impossible to estimate the support of θ 0 in an ultra-high dimensional setting, we may aim at an easier objective. Can we choose a subset M of {1, . . . , p} of size p 0 ≤ p that contains the support of θ 0 with high probability? This would allow us to reduce the dimension of the problem from p to p 0 . Dimension reduction techniques are popular for analyzing high dimensional problems. We study here to what extent dimension reduction is a realistic objective: how large should the non-zero components of θ 0 be? How small can we choose p 0 ?
Proposition 6.7. Consider a Gaussian design regression with Σ = I p and σ 2 = 1. We assume that p ≥ k 3 ∨ C and n ≥ C. Set There exists a universal constant 0 < δ < 1 such that for any measurable subset M of {1, . . . , p} of size p 0 ≤ p δ , we have In an ultra-high dimensional setting, it is therefore not possible to reduce the dimension of the problem to p δ unless the square norm of θ 0 is of order exp[Ck/n log(p)]σ 2 . In (6.10), the number 1/8 is of no particular significance. It can be replaced by any constant c ∈ (0, 1) if we take an asymptotic point of view ((k, p, n) → ∞).
Remark 6.6. In Proposition 6.7, we have taken the maximal risk point of view. If we put a uniform prior π on C p k (ρ), it is possible to replace (6.10) by where C is a positive constant.
Remark 6.7. In order to shed light on the problem of dimension reduction, let us consider a simple asymptotic example: p n = exp(n γ1 ) and k n = n 1−(γ1∧1)+γ2 with γ 1 > 0 and γ 2 > 0. If we assume that θ n ∈ Θ[k n , p n ] is such that θ n 2 p ≤ exp(Cn γ2+(γ1−1)+ ), then it is not possible to find a subset M n of size exp(δn γ1 ) that contains the support of θ n with probability going to one, where δ is defined as in Proposition 6.7. Consequently, we still have to keep at least exp(δn γ1 ) variables after the process of dimension reduction if we do not want to forget relevant variables!

What is an ultra-high dimensional problem?
Until now, we have stated that a problem is ultra-high dimensional when k log(p/k) is large compared to n. It has been proved that in such a setting, estimation of θ 0 , support estimation and even dimension reduction become almost impossible. In this section, we numerically illustrate this phase transition phenomenon. This allows us to quantify on specific examples how large k log(p/k)/n should be for the phase transition to occur.
Dimension reduction procedures. We apply the SIS method [25] to reduce the dimension to a set M S of size p 0 = 50. We then compute the Power of the procedure, The power measures whether the dimension reduction has been performed efficiently. We also compute the regularization path of the Lasso using the LARS [24] algorithm. Before applying the Lasso, each column of X is normalized. We consider the set M L made of the p 0 covariates occurring first in the regularization path. We do not argue that SIS and the Lasso are the best methods here. We have chosen them because they are classical and easy to implement.
Results. The results are presented on Figure 4. When k is small, the dimension reduction problem is not ultra-high dimensional and the Lasso and the SIS methods keep all the relevant covariates. For large k, both methods miss some of the relevant covariates. For p = 5000, there is a clear decrease in the power beyond k = 4. For p = 5000 and k = 8, both methods only have a power close to 0.5. In expectation, only four covariates belong to the sets M S and M L of size 50. For p = 200, the transition is less clear, but the power decreases slowly for k > 8. If there were no elbow effect in the minimax risk of estimation, then it would still be possible to recover the support of θ 0 with high probability. Indeed, each non-zero component of θ 0 is larger than 4 √(log(p)/n), which is detectable in a reasonable setting (see e.g. [46]). For instance, for k = 6 and p = 5000, θ 0 2 p /σ 2 = 16k log(p)/n ≈ 16.4. Here, the elbow effect implies that even for a huge signal-to-noise ratio, it is impossible to reduce the dimension of the problem without forgetting relevant variables.
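The screening experiment described above can be sketched in a few lines of Python. This is only a minimal illustration and not the code used for Figure 4: it assumes n = 50, Σ = I p , σ = 1, non-zero coefficients equal to 4 √(log(p)/n) (which is consistent with the ratio 16k log(p)/n ≈ 16.4 quoted for k = 6 and p = 5000), a power defined as the proportion of relevant covariates retained, and a bare-bones version of SIS that ranks the covariates by |X j T Y| without standardization. The function names are hypothetical.

```python
import numpy as np


def signal_ratio(k: int, p: int, n: int) -> float:
    """Squared norm of theta_0 over sigma^2 when each of the k non-zero
    coefficients equals 4*sqrt(log(p)/n)."""
    return 16 * k * np.log(p) / n


def sis_power(n: int, p: int, k: int, p0: int, rng: np.random.Generator):
    """One run of marginal screening; returns the selected set and its power."""
    X = rng.standard_normal((n, p))            # independent Gaussian design
    theta0 = np.zeros(p)
    theta0[:k] = 4 * np.sqrt(np.log(p) / n)    # assumed signal amplitude
    Y = X @ theta0 + rng.standard_normal(n)    # sigma = 1

    scores = np.abs(X.T @ Y)                   # marginal scores |X_j^T Y|
    selected = np.argsort(scores)[-p0:]        # keep the p0 largest scores
    power = np.intersect1d(selected, np.arange(k)).size / k
    return selected, power


rng = np.random.default_rng(0)
print(round(signal_ratio(6, 5000, 50), 1))
for k in (1, 4, 8):
    powers = [sis_power(50, 5000, k, 50, rng)[1] for _ in range(20)]
    print(k, np.mean(powers))
```

Averaging the power over independent repetitions, as in the loop above, gives a Monte Carlo estimate that can be compared qualitatively with the behavior reported in the text.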
Second simulation setting. We still take p = 5000, n = 50, Σ = I p , σ = 1, and k ranging from 1 to 5. k being fixed, we take θ 0 such that (θ 0 ) 1 = . . . = (θ 0 ) k = u √(log(p)/n) and (θ 0 ) k+1 = . . . = (θ 0 ) p = 0. Relying on N = 100 experiments, we estimate u * k , the smallest u such that M L has a power larger than 0.9. u * k corresponds (up to the renormalization √(log(p)/n)) to the minimal intensity of the signal so that the dimension reduction method does not forget relevant covariates.
Results. The results are presented on Figure 5. For small k, u * k remains close to √ 2. In contrast, we observe that u * k blows up at k = 5. We have not depicted u * 6 , but we have u * 6 ≥ 100. These two simulation studies confirm that when k becomes large (in comparison to p and n), the dimension reduction problem becomes extremely difficult.
Remark 7.1 (Rule of thumb). From these simulations and from other theoretical arguments (e.g. [27,22,45]), we derive a simple rule of thumb. We say that a problem is ultra-high dimensional if For p = 5000 and n = 50, this corresponds to k ≥ 4. Setting p = 200 and n = 50 yields k ≥ 8.
In practice, we do not know k in advance. Nevertheless, this criterion (7.1) helps us to know what is the largest sparsity index such that the statistical problem remains reasonably difficult in the minimax sense.
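The displayed criterion (7.1) is not reproduced in this version of the text. As an illustration only, the sketch below assumes that it has the form k log(p/k)/n ≥ c with c = 1/2; this value is an assumption, chosen because it reproduces the two thresholds quoted in Remark 7.1 (k ≥ 4 for p = 5000 and k ≥ 8 for p = 200, with n = 50). The function name is hypothetical.

```python
import math


def smallest_ultra_high_k(p: int, n: int, c: float = 0.5):
    """Smallest sparsity k such that k*log(p/k) >= c*n.

    The constant c = 1/2 is an assumed value, not taken from the text.
    Returns None if no k in {1, ..., p} satisfies the criterion.
    """
    for k in range(1, p + 1):
        if k * math.log(p / k) >= c * n:
            return k
    return None


print(smallest_ultra_high_k(5000, 50))
print(smallest_ultra_high_k(200, 50))
```

Such a helper gives, for a given pair (p, n), the largest sparsity index below which the problem remains reasonably difficult in the minimax sense.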

Discussion
As stated in Sections 4-6, the behaviors of the minimax separation distances and of the minimax risks become really different in an ultra-high dimensional setting. Apart from the test problem (P 1 ) with known variance and the problem of prediction (P 2 ) with fixed design, all the other separation distances and minimax risks blow up when k log(p/k) becomes larger than n. This elbow effect has important practical implications: there is no hope of selecting the relevant covariates in an ultra-high dimensional setting, unless the signal-to-noise ratio is exponentially large. Moreover, even dimension reduction techniques cannot work well in such a setting.
In linear testing (P 1 ), we have proved that the optimal separation distances highly depend on the knowledge of the variance. Most of the testing procedures in the literature rely on the knowledge of σ 2 . Some specific work is therefore needed to derive fast and efficient procedures under unknown variance (but see [34] for a procedure in a specific situation).
We have not discussed so far the problem of variance estimation. From the minimax lower bounds of testing, we deduce the following lower bound.
As a consequence, the problem of variance estimation becomes extremely difficult in an ultra-high dimensional setting.
In Propositions 5.3 and 6.1, we have provided minimax lower bounds for (P 2 ) and (P 3 ) over Θ[k, p] for arbitrary designs X. Our corresponding upper bounds match these lower bounds when the restricted eigenvalues of X T X are close to each other. However, these bounds do not agree anymore when these restricted eigenvalues are away from each other. Deriving the exact dependency of the minimax risks on X would require sharper lower bounds and the analysis of new estimation procedures.
Our minimax results use the Gaussianity of the noise ε and the Gaussianity of the design X in the random design setting. In an ultra-high dimensional setting, the minimax upper bounds do not seem to be robust with respect to the Gaussianity. In smaller dimensions (k[1 + log(p/k)] < n), the Gaussian distribution of the design is less critical. For instance, consider a design X where all the components are independent and follow a subgaussian distribution. By a result of Rudelson and Vershynin [41], the restricted eigenvalues of X T X remain away from 0 with high probability. Consequently, some of the minimax bounds should still hold for subgaussian designs. Nevertheless, the derivation of sharp minimax bounds for non-Gaussian designs and noises remains an open problem.

Proofs of the minimax lower bounds
Some propositions contain both minimax lower bounds and upper bounds. This section is devoted to the proof of the main lower bounds, while the upper bounds are proved in Appendix B in [43]. In order to keep our notations as short as possible, we set We also write ‖.‖ TV for the total variation norm. For any subset T ⊂ R p , α ∈ (0, 1), covariance matrix Σ, and any variance σ 2 , we denote β R Σ,σ,α (T ) the quantity the infimum being taken over all tests Φ α satisfying P 0p,σ [Φ α = 0] ≤ α. Its counterpart for unknown variance is defined by the infimum being taken over all tests Φ α satisfying sup σ>0 P 0p,σ [Φ α = 0] ≤ α. Similarly, we define β F X,σ,α (T ) for fixed design and β F X,α (T ) for fixed design and unknown variance. Most of the minimax lower bounds in this paper are based on an approach which goes back to Ingster [28,29,30]. The following lemma encompasses fixed and random design and known and unknown variance.
Lemma 9.1. Let T be a subset of R p \ {0 p } and let σ and σ 0 be two positive numbers. Consider µ a probability measure on σT := {σθ, θ ∈ T }. We write $P_{\mu,\sigma} = \int_{\sigma T} P_{\theta,\sigma}\, d\mu$ and L µ = dP µ,σ /dP 0p,σ0 . Then, Here, β α (T ) can be replaced by β F X,α (T ) or β R Σ,α (T ). If we also have σ = σ 0 , then β α (T ) can be replaced by β R Σ,σ0,α (T ) or β F X,σ0,α (T ).
We refer to Baraud [5] Section 7.1 for a proof and further explanations in a close framework. The main idea is to find a prior probability on T so that the total variation distance between P µ,σ and P 0p,σ0 is as small as possible. We derive from Lemma 9.1 that β α (T ) ≥ δ if E 0p,σ0 [L 2 µ (Y, X)] ≤ 1+η 2 .
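The definition of η and the displayed inequality of Lemma 9.1 are not reproduced in this version of the text. The following derivation is a sketch of the standard argument, in the spirit of Baraud [5], under two assumptions: that Lemma 9.1 takes the usual total variation form (with ‖.‖ TV normalized as an L 1 distance), and that η = 2(1 − α − δ). The latter choice is consistent with the value η ≥ 0.94 used in the proofs below when α + δ ≤ 53%.

```latex
% Assumed form of Lemma 9.1, followed by Cauchy-Schwarz with E_{0_p,\sigma_0}[L_\mu] = 1:
\beta_\alpha(T)
  \;\ge\; 1-\alpha-\tfrac12\,\bigl\|P_{\mu,\sigma}-P_{0_p,\sigma_0}\bigr\|_{TV}
  \;=\; 1-\alpha-\tfrac12\,E_{0_p,\sigma_0}\bigl|L_\mu(Y,X)-1\bigr|
  \;\ge\; 1-\alpha-\tfrac12\Bigl(E_{0_p,\sigma_0}\bigl[L_\mu^2(Y,X)\bigr]-1\Bigr)^{1/2}.

% Hence, if E_{0_p,\sigma_0}[L_\mu^2(Y,X)] \le 1+\eta^2 with the assumed \eta = 2(1-\alpha-\delta):
\beta_\alpha(T) \;\ge\; 1-\alpha-\tfrac{\eta}{2} \;=\; \delta .
```

Under this reading, every lower bound proof of this section reduces to controlling the second moment of the likelihood ratio L µ under the null distribution.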

Proof of the lower bound (4.1) in Theorem 4.1
Proof of Theorem 4.1. By homogeneity, we can assume that σ 2 = Var(Y |X) = 1. We first build a suitable prior probability µ ρ in order to apply Lemma 9.1.
Let us take a set m̂ of size k uniformly in M(k, p) (defined in Section 2). Let ξ = (ξ j ) 1≤j≤p be a sequence of independent Rademacher random variables. Consider some ρ > 0. Define λ = ρ/ √ k and consider µ ρ the distribution of the random variable $\theta_{\hat m,\xi} = \sum_{j\in \hat m} \lambda \xi_j e_j$. P µρ,1 stands for the distribution of (Y, X) with θ 0 ∼ µ ρ and σ = 1. Here, (e j ) 1≤j≤p is the orthonormal family of vectors of R p defined by (e j ) i = 1 if i = j and (e j ) i = 0 otherwise.
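The prior µ ρ is straightforward to simulate. The sketch below draws one realization: a support chosen uniformly among the subsets of size k, independent Rademacher signs, and the common amplitude λ = ρ/ √ k. The function name `sample_prior` and the numerical values are hypothetical; by construction, each draw is exactly k-sparse with Euclidean norm ρ.

```python
import numpy as np


def sample_prior(k: int, p: int, rho: float, rng: np.random.Generator) -> np.ndarray:
    """Draw theta = sum_{j in m} lambda * xi_j * e_j with lambda = rho / sqrt(k)."""
    support = rng.choice(p, size=k, replace=False)   # uniform subset of size k
    signs = rng.choice([-1.0, 1.0], size=k)          # independent Rademacher signs
    theta = np.zeros(p)
    theta[support] = (rho / np.sqrt(k)) * signs
    return theta


rng = np.random.default_rng(0)
theta = sample_prior(k=5, p=100, rho=0.7, rng=rng)
print(np.count_nonzero(theta), np.linalg.norm(theta))
```

Since all the draws have the same norm ρ, the prior is supported on the set of k-sparse vectors at distance ρ from the null hypothesis, which is what Lemma 9.1 requires.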
The likelihood ratio L µρ (X, Y) = P µρ,1 /P 0p,1 writes where E ξ,m stands for the expectation with respect to the distribution of ξ and m.
In order to apply Lemma 9.1, we need to upper bound the expectation of L 2 µρ (X, Y). Let us first take the expectation of L 2 µρ (X, Y) with respect to Y.
Lemma 9.2. If we assume that $\rho^2 \le C\left[\frac{k}{n}\log\left(1+\frac{p}{k^2}\right) \wedge \frac{1}{\sqrt{n}}\right]$, then we have In this lemma, we have specifically distinguished the integration with respect to X from the integration with respect to Y. This will be useful for deriving minimax lower bounds in fixed design (Proposition 4.2). Gathering Lemmas 9.1 and 9.2 allows us to derive that This last bound allows us to conclude since p ≥ n 2 .
where I |m1∪m2| is the identity matrix of size |m 1 ∪ m 2 | and C is a block symmetric matrix of size |m 1 ∪ m 2 | defined by Each block corresponds to one of the four previously defined subsets of m 1 ∪ m 2 (i.e. m 1 \ m 2 , m 2 \ m 1 , m 3 , and m 4 ). The matrix C is of rank at most four. Hence, I |m1∪m2| − λ 2 C has the same determinant as the matrix D of size 4 defined by: After some computations, we lower bound the determinant of D From now on, we assume that ρ 2 ≤ 1/20 so that |D| ≥ 1/2. Hence, we get
Then, we take the expectation with respect to ξ (1) , ξ (2) , m 1 and m 2 . When m 1 and m 2 are fixed, the expression (9.3) depends on ξ (1) and ξ (2) only through the cardinality of m 3 . As ξ (1) and ξ (2) follow independent Rademacher distributions, the random variable 2|m 3 | − |m 1 ∩ m 2 | follows the distribution of Z, a sum of |m 1 ∩ m 2 | independent Rademacher variables and where E Z stands for the expectation with respect to Z. We now proceed as in the proof of Theorem 1 in Baraud [5] in order to upper bound the term Following Baraud's arguments, we get that E Z exp 2nλ 2 Z ≤ 1 + η 2 when Moreover, we have exp(8ρ 4 n) ≤ 1 + η 2 as soon as ρ 2 ≤ C/ √ n since η ≥ 0.94. Gathering these observations with (9.4), we conclude that E X E 0p,1 {L 2 µρ (Y, X) | X} ≤ 1 + η 2 as soon as

Proof of the lower bound (4.8) in Theorem 4.3
Proof of (4.8) in Theorem 4.3. Consider the Condition We deduce Theorem 4.3 from the following result.
for any ρ 2 > 0 such that If we assume that Condition (A.1) holds, (9.5) holds for any ρ > 0 such that If p ≥ k 3 ∨ C and k log(p)/n ≥ C 1 with C and C 1 large enough, then Assumption (A.1) is satisfied. For C large enough, the quantity k log(p)/ log(k) is large enough so that the lower bound (9.7) satisfies Let us now assume that p ≥ k 3 ∨ C and k log(p)/n ≤ C 1 where C 1 has been previously fixed. Then, the first lower bound (9.6) satisfies: Gathering the two previous lower bounds with Lemma 9.3 allows to conclude.
Proof of Lemma 9.3. Consider some ρ > 0. To apply Lemma 9.1, we first have to define a suitable prior µ ρ on θ 0 and a suitable σ 2 . More specifically, we set σ 2 = (1 + ρ 2 ) −1 and the distribution µ ρ is supported by Θ[k, p, ρ] defined by Let m̂ be a random variable uniformly distributed over M(k, p). Let µ ρ be the distribution of the random variable $\theta = \sum_{j\in \hat m} \lambda e_j$ where and where (e j ) 1≤j≤p is the orthonormal family of vectors of R p defined by (e j ) i = 1 if i = j and (e j ) i = 0 otherwise. By Lemma 9.1, we only have to prove that, under conditions (9.6) or (9.7) with (A.1), we have E 0p,1 (L 2 µρ (Y, X)) ≤ 1 + η 2 . (9.8) Observe here that we use a variance 1 for H 0 and a variance 1 − θ 0 2 p for the hypothesis H 1 . Using these two different variances allows us to take advantage of the fact that we work under unknown variance.
As a specific case of [44] Eq.(8.5), we have where Z follows a hypergeometric distribution with parameters p, k, and k/p. We know from Aldous (p.173) [2] that Z follows the same distribution as the random variable E(W |B p ) where W is a binomial random variable of parameters k, k/p and B p some suitable σ-algebra. By a convexity argument, we get (9.9) Hence, we only need to upper bound the expectation of the second random variable.
CASE 1: Proof of Equation (9.6). Since log(1 + x) ≤ x and since W ≤ k, we have As a consequence, the condition (9.8) holds if $\rho^2 \le \frac{k}{n}\log\left(1+\frac{p}{k^2}\log(1+\eta^2)\right)$. Observe that log(1 + η 2 ) ≥ 0.6. Since log(1 + ux) ≥ u log(1 + x) for any 0 < u < 1 and any x > 0, the last condition is enforced by $\rho^2 \le \frac{k}{2n}\log\left(1+\frac{p}{k^2}\right)$.
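The convexity argument behind (9.9) can be checked numerically on a small example: since Z has the same distribution as a conditional expectation of W, Jensen's inequality gives E[f(Z)] ≤ E[f(W)] for any convex function f. The sketch below compares the two expectations for f(z) = exp(tz), which is only an illustrative convex function and not necessarily the one appearing in the proof. It assumes that Z counts the common elements of two independent uniform subsets of size k of {1, . . . , p}; the values of p, k and t are arbitrary.

```python
import math


def hypergeom_pmf(j: int, p: int, k: int) -> float:
    """P(Z = j) when Z is the overlap of two uniform k-subsets of a set of size p."""
    return math.comb(k, j) * math.comb(p - k, k - j) / math.comb(p, k)


def binom_pmf(j: int, k: int, q: float) -> float:
    """P(W = j) for W binomial with parameters (k, q)."""
    return math.comb(k, j) * q**j * (1 - q) ** (k - j)


def expectation(pmf, k: int, f) -> float:
    """E[f(V)] for a variable V supported on {0, ..., k} with the given pmf."""
    return sum(pmf(j) * f(j) for j in range(k + 1))


p, k, t = 50, 5, 1.3
convex_f = lambda z: math.exp(t * z)          # illustrative convex function
lap_Z = expectation(lambda j: hypergeom_pmf(j, p, k), k, convex_f)
lap_W = expectation(lambda j: binom_pmf(j, k, k / p), k, convex_f)
print(lap_Z, lap_W)
```

The binomial expectation dominates the hypergeometric one, which is why it suffices to control the expectation with respect to W in the remainder of the proof.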
Gathering this bound with Lemma 9.4, we get a new deviation inequality for W .
for any $x<1$. We apply this bound with $x = i/k$. Then, Inequality (9.10) holds if Taking the logarithm of this expression leads to Since $i$ is constrained to be smaller than $k/2$, we get $-\frac{ik}{n}\log\left(\frac{p}{ek^2}\right) + \frac{k}{n}\log\left(\frac{4}{\eta^2}\right) + 2i \le 0$.
By Assumption (A.1), $\frac{k}{n}\log[p/(ek^2)]$ is larger than 2. Consequently, the worst case among all $i$ between $1$ and $k/2$ is $i = 1$. Hence, we only need to prove that Since $\eta$ is larger than 0.94, $\log(4e/\eta^2)$ is smaller than 3 and this last inequality is ensured by Assumption (A.1).
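The two numerical constants used in this case can be checked directly. The following is a minimal sketch that evaluates both quantities at the boundary value $\eta = 0.94$; since $\log(1+\eta^2)$ increases and $\log(4e/\eta^2)$ decreases with $\eta$, the boundary is the least favourable value. The variable names are illustrative only.

```python
from math import e, log

# Boundary value of eta: the text works with eta larger than 0.94.
eta = 0.94

# Quantity claimed to be at least 0.6 in the proof of (9.6).
lower = log(1 + eta**2)

# Quantity claimed to be smaller than 3 in the case i = 1.
upper = log(4 * e / eta**2)

print(f"log(1 + eta^2)   = {lower:.4f}")
print(f"log(4e / eta^2)  = {upper:.4f}")
```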
Proof of FACT 2. We consider here the case $1/2 < i/k \le 1$. We derive from (9.14) that Consequently, we want to ensure that for any $i$ between $\lfloor k/2\rfloor$ and $k$. For any $x$ and $u$ between 0 and 1, $(1-x)^u \le 1-xu$. Setting $u = i/k$ and $x = \rho^2/(1+\rho^2)$, we obtain that the last inequality holds if $\left(\frac{2ek}{p}\right)^{k/n}\left(\frac{2k}{\eta^2}\right)^{k/(in)}$
Since $2k/\eta^2$ is positive, the largest term in the bound corresponds to $i = k/2$. Hence, it remains to prove that We conclude that the upper bounds hold if
Proof of Lemma 9.4. We prove this deviation inequality using the Laplace transform of $W/k$. Consider some $x\in(0,1)$ and $\lambda>0$.
Differentiating an upper bound of the last expression with respect to $\lambda$ leads to the following choice Hence, we get Since we assume $x<1$, we conclude that Since $P(W=k) = (k/p)^k$, this upper bound is also valid when $x = 1$.
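The Laplace-transform technique used in this proof can be illustrated numerically. The sketch below does not reproduce the exact statement of Lemma 9.4; it uses the generic Chernoff bound $P(W \ge xk) \le \exp[-k\,\mathrm{KL}(x \,\|\, q)]$ for a binomial variable $W$ with parameters $k$ and $q = k/p$, which is what optimizing over $\lambda$ yields in general, and compares it with the exact tail. The function names and the values of $k$ and $p$ are assumptions chosen for illustration.

```python
from math import comb, exp, log

def binomial_tail(k, q, t):
    """Exact P(W >= t) for W ~ Binomial(k, q)."""
    return sum(comb(k, j) * q**j * (1 - q) ** (k - j) for j in range(t, k + 1))

def chernoff_bound(k, q, x):
    """Generic Chernoff bound exp(-k * KL(x || q)) on P(W >= x k), for q < x <= 1."""
    if x == 1.0:
        # Limit of the bound as x -> 1; it coincides with P(W = k) = q^k.
        return q**k
    kl = x * log(x / q) + (1 - x) * log((1 - x) / (1 - q))
    return exp(-k * kl)

k, p = 10, 1000          # illustrative sizes only
q = k / p
checks = []
for i in range(1, k + 1):
    exact = binomial_tail(k, q, i)
    bound = chernoff_bound(k, q, i / k)
    checks.append(exact <= bound * (1 + 1e-12))
    print(f"i={i:2d}  exact tail={exact:.3e}  Chernoff bound={bound:.3e}")
```

At $x = 1$ the bound is attained, which mirrors the remark that the inequality remains valid there because $P(W=k) = (k/p)^k$.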
Combining this result with Equations (9.15) and (9.16) allows us to conclude.
For any estimator $\widehat\sigma > 0$, we define $\widetilde\sigma$ by $\widetilde\sigma \in \arg\min_{\sigma\in\{1,\sigma_0\}} l(\widehat\sigma,\sigma)$. For any $\sigma\in\{1,\sigma_0\}$, the loss $l(\widetilde\sigma,\sigma)$ is controlled as follows: Thus, we get the minimax lower bound Let us note two numbers $\eta_1 = 1.5$ and $\eta_2 = 1.8$. If $X$ is a standard Gaussian design and if $k\le p^{1/3}$, then the proof of Theorem 4.3 states for where the expectation is taken both with respect to $Y$ and $X$. Applying Markov's inequality, we derive that with positive probability, while we have $\|\theta_1-\theta_2\|_p^2 \ge r^2/2$. Applying Birgé's version of Fano's lemma [10] we conclude that: where $\mathrm{Conv}[A]$ stands for the convex hull of $A$. Taking $r^2 = k[1+\log(p/k)]\sigma^2/\Phi_{2k,+}(X)$ allows us to conclude. The proof of the minimax lower bound (6.7) in Proposition 6.4 follows exactly the same steps. The minimax lower bound (6.8) is a consequence of (6.7) and the fact that $\Phi_{1,+}(\sqrt{\Sigma}) = 1$ for any $\Sigma\in S_p$.

Proof of Proposition 6.2
Proof of the first result. First, the minimax lower bound is a straightforward consequence of (6.2), since $\Phi_{1,+}(X) = n$ if $X\in D_{n,p}$. Let us turn to the upper bound. Thanks to the minimax upper bound (6.1), we only have to prove that there exists a design $X$ such that its $2k$-restricted eigenvalues remain close to $n$.
Proof of the second result. Let $X$ be a design in $D_{n,p}$. Take $\delta\in(0,1]$. Let us consider the collection $\mathcal{M}(k,p)$ (defined in Section 2). As explained in the proof of Proposition 6.1, there exists $\mathcal{M}'(k,p)\subset\mathcal{M}(k,p)$ of size larger than $\exp[Ck\log(ep/k)]$ such that, for any pair of distinct sets $m_1$, $m_2$ in $\mathcal{M}'(k,p)$, we have $|m_1\cap m_2|\le 3k/4$. For any $m\in\mathcal{M}'(k,p)$, we define a vector $\theta_m$ such that $|(\theta_m)_i| = 1/\sqrt{k}$ if $i\in m$ and $0$ otherwise, and such that $\|X\theta_m\|_n^2\le n$. Such a construction is justified in the proof of Proposition 6.1. For any $m_1\ne m_2$ in $\mathcal{M}'(k,p)$, we have $\|\theta_{m_1}-\theta_{m_2}\|_p^2\ge 1/2$. If there exist two distinct sets $(m_1,m_2)\in\mathcal{M}'(k,p)$ such that $\|X(\theta_{m_1}-\theta_{m_2})\|_n^2\le n\delta^2$, then the design $X$ satisfies $\Phi_{2k,-}(X)\le 2n\delta^2$. A necessary condition for $X$ to satisfy $\Phi_{2k,-}(X)\ge 2n\delta^2$ is therefore that the vectors $X\theta_m$ are $\sqrt{n}\delta$-separated.
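The two constants $1/2$ and $2n\delta^2$ follow from a short computation. The sketch below assumes that $\Phi_{2k,-}(X)$ denotes the infimum of $\|X\theta\|_n^2/\|\theta\|_p^2$ over nonzero $2k$-sparse vectors $\theta$ (its definition is not restated in this section), and it uses only the coordinates in the symmetric difference $m_1\triangle m_2$, each of which contributes exactly $1/k$.

```latex
\|\theta_{m_1}-\theta_{m_2}\|_p^2
  \;\ge\; \sum_{i\in m_1\triangle m_2}\frac{1}{k}
  \;=\; \frac{2\,(k-|m_1\cap m_2|)}{k}
  \;\ge\; \frac{2\,(k-3k/4)}{k} \;=\; \frac12 ,
\qquad
\Phi_{2k,-}(X)
  \;\le\; \frac{\|X(\theta_{m_1}-\theta_{m_2})\|_n^2}{\|\theta_{m_1}-\theta_{m_2}\|_p^2}
  \;\le\; \frac{n\delta^2}{1/2} \;=\; 2n\delta^2 .
```

The second display applies because $\theta_{m_1}-\theta_{m_2}$ is supported on $m_1\cup m_2$, hence has at most $2k$ nonzero coordinates.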
If $X$ satisfies $\Phi_{2k,-}(X)\ge 2n\delta^2$, then the balls in $R^n$ with radius $\sqrt{n}\delta$ centered at $X\theta_m$ are all disjoint. Thus, the sum of their volumes is smaller than the volume of a ball of radius $\sqrt{n}(1+\delta)$ in $R^n$. This implies that $\delta\le 2\,(k/(ep))^{Ck/n}$. Hence, for any design $X$ with unit columns, we have $\Phi_{2k,-}(X)\le C_1\,(k/(ep))^{C_2k/n}$, which proves the second result.
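The volume comparison can be written out explicitly. In the sketch below, $N$ denotes the cardinality of $\mathcal{M}'(k,p)$ and $V_n$ the volume of the unit ball of $R^n$ (both symbols are introduced here for illustration); the argument uses $\|X\theta_m\|_n\le\sqrt{n}$, so that every small ball lies inside the ball of radius $\sqrt{n}(1+\delta)$ centered at the origin, and $\delta\le 1$.

```latex
N\,V_n\,(\sqrt{n}\,\delta)^n \;\le\; V_n\,\big(\sqrt{n}\,(1+\delta)\big)^n
\;\Longrightarrow\;
\delta \;\le\; (1+\delta)\,N^{-1/n} \;\le\; 2\,N^{-1/n}
\;\le\; 2\exp\Big[-\frac{Ck}{n}\log\Big(\frac{ep}{k}\Big)\Big]
\;=\; 2\Big(\frac{k}{ep}\Big)^{Ck/n}.
```

The last inequality uses the lower bound $N\ge\exp[Ck\log(ep/k)]$ on the size of the packing.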

Proof of Proposition 6.7
For the sake of simplicity, we assume that $\sigma^2 = 1$ and that $p$ is even. Consider any estimator $M$ of size $p_0$. We set $\rho^2 = \frac{C_1}{2}\frac{k}{n}\log(p)\exp\left[\frac{C_2}{2}\frac{k}{n}\log(p)\right]$ (9.20) where the constants $C_1$, $C_2$ correspond to the ones used at the end of the proof of Proposition 5.1. We also consider the set $C^p_k(\rho)$. Suppose that we have

The procedure $T^*_\alpha$ is defined by $T^*_\alpha = \bigvee_{1\le k<k^*} T^*_{\alpha/(2k^*),k} \vee T^*_{\alpha/2,n}$. (10.1) The hypothesis $H_0$ is rejected if $T^*_\alpha$ is positive. $T^*_{\alpha,k}$ corresponds to a Bonferroni multiple testing procedure based on a large number of parametric tests of the hypothesis $H_0$: $\{\theta_0 = 0_p\}$ against $H_{1,m}$: $\{\theta_0\ne 0$ and $\mathrm{supp}(\theta_0)\subset m\}$ for any $m\in\mathcal{M}(k,p)$. As a consequence, $T^*_{\alpha,k}$ allows us to test the hypothesis $H_0$: $\{\theta_0 = 0\}$ against $H_{1,k}$: $\{\theta_0\in\Theta[k,p]\setminus\{0_p\}\}$. Then, $T^*_\alpha$ corresponds to a Bonferroni multiple testing procedure based on the statistics $T^*_{\alpha,k}$, $k\in\{1,\ldots,k^*\}\cup\{n\}$. Obviously, the procedure $T^*_\alpha$ is computationally intensive. It is used here as a theoretical tool to derive minimax upper bounds.
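To give a sense of why the procedure is computationally intensive, the following minimal sketch counts the parametric tests involved at each sparsity level $k<k^*$, that is the number of supports in $\mathcal{M}(k,p)$. It also reports the level each individual test would receive if the budget $\alpha/(2k^*)$ were split evenly among the supports; this even split is an assumption made for illustration, as are the function name and the numerical values.

```python
from math import comb

def bonferroni_levels(p, k_star, alpha):
    """For each k < k_star, return (k, number of supports of size k,
    per-test level under an even split of alpha / (2 * k_star))."""
    rows = []
    for k in range(1, k_star):
        n_models = comb(p, k)                      # |M(k, p)| = C(p, k)
        per_test = alpha / (2 * k_star) / n_models  # assumed even split
        rows.append((k, n_models, per_test))
    return rows

rows = bonferroni_levels(p=200, k_star=5, alpha=0.05)
for k, n_models, per_test in rows:
    print(f"k={k}: {n_models} tests, per-test level {per_test:.3e}")
```

The number of tests grows like $\binom{p}{k}$, so an exhaustive search is only feasible for very small $k$.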

Unknown variance: test T α
We introduce a second testing procedure to handle the case of unknown variance $\sigma^2$. where $\bar F_{k,n-k}(u)$ denotes the probability for a Fisher variable with $k$ and $n-k$ degrees of freedom to be larger than $u$. Finally, the statistic $T_\alpha$ is defined by The hypothesis $H_0$ is rejected when $T_\alpha$ is positive.
In fact, $T_\alpha$ is a Bonferroni multiple testing procedure. Unlike $T^*_\alpha$, it is based on Fisher statistics to handle the unknown variance. The ideas underlying this statistic have been introduced in Baraud et al. [7] in the context of fixed design regression.

where $K>0$ is a tuning parameter. The dimension $\widehat k_V$ is selected as follows: $\widehat k_V\in\arg\min_{1\le k\le\lfloor(n-1)/4\rfloor}\left\{\log\left(\|Y-X\widehat\theta_k\|_n^2\right)+\mathrm{pen}(k)\right\}$.
For short, we write $\widehat\theta_V = \widehat\theta_{\widehat k_V}$.
This variable selection procedure relies on complexity penalization. The penalty $\mathrm{pen}(k)$ depends on the size $k$ and on the number $\binom{p}{k}$ of subsets of $\{1,\ldots,p\}$ of size $k$. Observe that the estimator $\widehat\theta_V$ does not require the knowledge of $\sigma^2$.
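The selection rule can be sketched in a few lines once the residual sums of squares of the candidate estimators are available. The exact penalty is not reproduced in this section, so the function `illustrative_penalty` below is a hypothetical placeholder that merely grows with $k$ and with the logarithm of the number of subsets of size $k$; it should not be read as the penalty of the paper. The names `select_dimension` and `rss_by_k` and the numerical values are likewise assumptions.

```python
from math import comb, log

def illustrative_penalty(k, n, p, K=2.0):
    # HYPOTHETICAL placeholder, not the paper's penalty: increases with k
    # and with log C(p, k), scaled by a tuning parameter K.
    return K * (k + log(comb(p, k))) / n

def select_dimension(rss_by_k, n, p, pen=illustrative_penalty):
    """Return argmin over 1 <= k <= floor((n - 1) / 4) of log(rss_k) + pen(k),
    where rss_by_k maps k to the residual sum of squares ||Y - X theta_k||^2."""
    k_max = (n - 1) // 4
    scores = {
        k: log(rss) + pen(k, n, p)
        for k, rss in rss_by_k.items()
        if 1 <= k <= k_max and rss > 0
    }
    return min(scores, key=scores.get)

# Made-up residual sums of squares: a large drop from k=1 to k=2, then a plateau.
rss = {1: 80.0, 2: 30.0, 3: 28.0, 4: 27.5}
k_hat = select_dimension(rss, n=50, p=100)
print("selected dimension:", k_hat)
```

Because the criterion uses the logarithm of the residual sum of squares, rescaling $Y$ shifts every score by the same constant, which is why no knowledge of $\sigma^2$ is needed.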
The choice of the tuning parameter $K$ is universal: it depends neither on $n$, $p$, $k$, nor on $\Sigma$, $\theta_0$, $\sigma^2$. It is only constrained to be larger than a positive numerical constant so that the equations if $k = n$, We recall that for $k\le k^*$, the estimators $\widehat\theta_k$ are defined in (10.5) and that $\widehat\theta_n\in\arg\min_{\theta\in R^p}\|Y-X\theta\|_n^2$. The size $\widehat k_{BM}$ is selected by minimizing the following penalized criterion For short, we write $\widehat\theta_{BM} = \widehat\theta_{\widehat k_{BM}}$.
Observe that the estimator $\widehat\theta_{BM}$ requires the knowledge of the variance $\sigma^2$. Then, Eq. (5.6) is a special case of Theorem 1 in Birgé and Massart [12].

Deviation inequalities
The proofs of the deviation inequalities stated in this section are postponed to Appendix C in [43].

where C is a numerical constant.
The first two deviation inequalities are taken from Theorem 2.13 in [19]. The bound (11.3) controls the tail distribution of the smallest eigenvalue of a Wishart distribution. Rudelson and Vershynin [41] have provided a control similar to (11.3) under subgaussian assumptions. However, their results only hold for events of probability smaller than $1-e^{-n}$.