Detection of sparse additive functions

We study the problem of detecting a high-dimensional signal function in the white Gaussian noise model. In addition to a smoothness assumption on the signal function, we impose a sparse additive structure on it. The detection problem is expressed as a nonparametric hypothesis testing problem and is solved within the asymptotic minimax approach. In the high sparsity case, the minimax test procedures are adaptive in the sparsity parameter. We extend to the functional case the known results on the detection of sparse high-dimensional vectors. In particular, our asymptotic detection boundaries are derived from the same asymptotic relations as in the vector case.


Introduction
Over the past years, boosted by applications and computer performance, high-dimensional problems have been explored in a number of statistical studies. If no additional structure is assumed, high-dimensional data processing suffers from intrinsic difficulties such as the curse of dimensionality, which results in a loss of efficiency of statistical procedures and in the inconsistency of classical statistical procedures, even in the linear regression model, unless the dimension of the variables is smaller than the sample size.
In order to overcome the curse of dimensionality in a nonparametric framework, where typical functional classes are Sobolev, Hölder, or Besov balls, some additional conditions, such as additivity or a tensor product structure, are assumed; see, for instance, [20,6,18,14,15,16] and references therein. Even when one of these conditions is assumed, it is still required that the sample size be larger than the data dimension. One way to free oneself from the latter condition is to impose an additional sparsity constraint.
In this paper we focus on the problem of detection of high-dimensional signal functions in the Gaussian white noise model. To avoid difficulties stemming from high-dimensional settings, we suppose that the signal function satisfies an additional structural condition. Specifically, it is assumed to be sparse additive, meaning that the high-dimensional function of interest is a sum of a few univariate functions. Formally, we consider the d-dimensional (d ∈ N, d > 0) Gaussian white noise model

dX(t) = f(t) dt + ǫ dW(t), t ∈ [0, 1]^d, (1.1)

where W(t) is the Wiener process, ǫ > 0 is the noise level, and f, the quantity of interest, is the signal function. The additive sparse structure means that f is the sum of d univariate functions f_j:

f(t) = Σ_{j=1}^d ξ_j f_j(t_j), t = (t_1, . . ., t_d) ∈ [0, 1]^d, (1.2)

where the ξ_j's are unknown but deterministic and take their values in {0, 1}: "0" means that the j-th component f_j is not active whereas "1" means that f_j is active. Denote by K the positive number of active components, that is, K = Σ_{j=1}^d ξ_j. There is a huge statistical literature on estimation in sparse models; see, for instance, [1,2,3] and references therein. In particular, there are many works related to the well-known Lasso introduced by Tibshirani [21] in 1996. There are also a number of papers that deal with nonparametric estimation in sparse additive models. For a complete review of these topics, we refer to [19], where minimax estimation rates in sparse additive models are obtained, to [5], where a Lasso-type estimate in sparse additive models is studied, and to [20], where various structural assumptions on high-dimensional models are discussed.
Back to our study, the detection problem at hand can be expressed in terms of a nonparametric hypothesis testing problem with the null hypothesis stating that "the signal is a constant", with "there is no signal" being a particular case of the null hypothesis. In order to specify an alternative hypothesis, recall that, within the minimax framework, it is impossible to detect signal functions that are "too close" to the null one, as well as to test the null and alternative hypotheses for too large alternative classes. Therefore, we are interested in the following nonparametric hypothesis testing problem:

H_0 : f = const_0 against H_1 : f = const_1 + f^1, f^1 ∈ F_d(τ, r_ǫ, b), (1.3)

where const_0, const_1 are some constants and F_d(τ, r_ǫ, b) is the class of functions f^1 of the form (1.2) whose active components f_j belong to the Sobolev ball S_τ = {f_j : ||f_j||²_{S_τ} ≤ 1} and satisfy ||f_j||_2 ≥ r_ǫ, the number of active components being K = d^{1−b} with sparsity index b ∈ (0, 1).
The functional class S_τ is the Sobolev ball, expressed via the Sobolev semi-norm ||·||_{S_τ}, that contains τ-smooth functions, which are assumed 1-periodic and orthogonal to a constant. Due to the periodicity constraints, it is possible to express the semi-norm in terms of Fourier coefficients; this will be done in Section 3. The quantity τ is the smoothness parameter. Both the smoothness condition and the separation condition between H_0 and H_1 are expressed in terms of the components f_j that are linked to the whole signal f via (1.2): each active component f_j is smooth and is separated from the null hypothesis in the L_2-norm by a positive value r_ǫ.
In Section 6, we generalize the hypothesis testing problem (1.3) by considering a more general class of alternatives consisting of signals f equal, up to a constant, to a function f^1 ∈ F_{d,b} that is separated from the null hypothesis in the L_2([0, 1]^K)-norm and whose smoothness is expressed in terms of the whole function f. For these two hypothesis testing problems, the main questions are: what are the separation rates in the problem, i.e., what are the asymptotics for the minimal r_ǫ such that one can distinguish between H_0 and H_1? And what are the optimal test procedures that provide distinguishability?
To answer these questions, we use the asymptotically minimax approach, which provides detection boundaries, or distinguishability conditions, i.e., necessary and sufficient conditions for the possibility of successful detection; these detection boundaries yield asymptotics for the minimal r_ǫ separating the areas of distinguishability and non-distinguishability (between H_0 and H_1). The asymptotics for the minimal values of r_ǫ are called either the (minimax) separation rates or the minimax rates of testing; in the present paper, the separation rates are denoted by r⋆_ǫ. In connection with the current study, a number of works on detection and classification boundaries in Gaussian sequence models could be mentioned; see, for example, [7,8,9,13,12,4,15,16,11]. Also, in [17], rather than considering a Gaussian sequence model, the authors generalize the problem of finding a detection boundary to the linear regression model. Another paper [10] deals with the signal detection problem in a multichannel model in the functional framework. At the end of the next paragraph, we explain the differences between the results in [10] and our study.
The main contribution of this paper consists in extending the results on detection boundaries obtained for d-dimensional sparse Gaussian vectors, see, for instance, [12], to the functional case. In particular, we obtain the same detection boundaries as in the vectorial case. However, in the high sparsity case b > 1/2, an additional assumption on the growth of d as a function of ǫ is required. Distinguishability is possible when the sum of the type I error probability and the maximum over the alternatives of the type II error probability vanishes asymptotically, and distinguishability is impossible when this sum tends to one. Boundary conditions depend on the quantity a(r_ǫ) = a(r_ǫ, d, τ), which is the solution of a certain extremal problem stated in Section 4. In the vectorial case, the quantity a(r_ǫ) corresponds to the energy of a signal (see [12] and [10]). In the functional case, this quantity characterizes the distinguishability in a one-variable hypothesis testing problem. The minimax separation rates obtained in this paper depend on the value of b: for large b they are worse than for small b. Such a behaviour is expected because, with large b, only a few components are active, and hence the problem of distinguishing between the alternative and the null hypothesis becomes more difficult.
For the most difficult case b ∈ (1/2, 1), not only separation rates but also sharp separation rates, which include both rates and constants, are obtained. We also provide optimal test procedures for which the minimax rates of testing are achieved asymptotically. Depending on the value of b, we propose two types of test procedures: one is of a χ² type, the other is related to the Higher-Criticism statistic introduced in [4] and based on Tukey's ideas. In the case b ∈ (1/2, 1), our test procedure is adaptive in the sparsity index b, see Remark 5.3.
In the paper [10], which is focused on a similar problem of multichannel signal detection, the optimal rates are obtained. In our study, we obtain sharp separation rates for b ∈ (1/2, 1). The main difference between the study [10] and our work lies in the quantity a(r_ǫ) that characterizes the distinguishability: in our work, it is simply the solution of a certain extremal problem, whereas in [10], it is obtained directly from the use of the respective test procedures.
The rest of the paper is organized as follows. Section 2 is concerned with the problem of finding the detection boundary in a sparse Gaussian d-vector model. In Section 3, we give a new formulation of the problem (1.3) in terms of sequence spaces. Section 4 is devoted to the description of the extremal problem that gives the distinguishability characteristics. The main results are stated in Section 5. In Section 6, we generalize the hypothesis testing problem (1.3) by considering more general alternatives. The proofs are given in Section 7.

Detection boundaries in a vectorial Gaussian model
Hypothesis testing problems for d-dimensional vectors, under sparsity conditions similar to ours, were studied in [7,12,4]. Namely, let X = (X_1, . . ., X_d) be a random vector of the form X_j = v_j + η_j, where the η_j's are i.i.d. N(0, 1), j = 1, . . ., d, and v = (v_1, . . ., v_d) is an unknown sparse mean vector with K = d^{1−b} nonzero components, each equal to some a = a_d > 0. Then the testing problem is stated as follows: it is required to test the null hypothesis H_0 : v = 0 against the alternative H_1 under which v has K = d^{1−b} nonzero components equal to a. Here the questions of interest are: what are the asymptotics for a = a_d as d → +∞ for which the hypotheses H_0 and H_1 separate asymptotically? Also, what are the optimal test procedures that provide the distinguishability (or separation) of H_0 and H_1?
The answer to each question depends essentially on the sparsity index b ∈ (0, 1); see [7,12,4]. In the high-sparsity case, the hypotheses separate asymptotically at the detection boundary a⋆_d = ϕ(b)√(log d), where

ϕ(b) = √(2b − 1) for b ∈ (1/2, 3/4], ϕ(b) = √2 (1 − √(1 − b)) for b ∈ (3/4, 1]. (2.2)

Observe that the function ϕ is positive, continuous, and increasing. The test procedure that provides distinguishability in the high-sparsity case is based on the Higher-Criticism statistic introduced in [4]. It is defined as

L_d = max_{t > 0} ( Σ_{j=1}^d 1I{X_j > t} − d Φ(−t) ) / √( d Φ(−t)(1 − Φ(−t)) ), (2.3)

where, here and later, Φ stands for the standard Gaussian cumulative distribution function. Note that it suffices to take the maximum of L_d over a discrete grid of thresholds of the form u_l √(log d), u_l ∈ (0, √2], provided the grid step is small enough.
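As an illustration, the Higher-Criticism statistic can be sketched as follows. This is a minimal sketch of the standard Donoho–Jin form; the function names and the threshold grid are illustrative assumptions, not the exact procedure of the paper.

```python
import math

def Phi(t):
    """Standard Gaussian cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def higher_criticism(x, grid):
    """Maximize the standardized exceedance count over a grid of thresholds.

    For each threshold t, compare the observed number of X_j > t with its
    null expectation d * Phi(-t), standardized by the null binomial
    standard deviation.
    """
    d = len(x)
    best = -float("inf")
    for t in grid:
        p = 1.0 - Phi(t)  # null exceedance probability Phi(-t)
        count = sum(1 for xj in x if xj > t)
        hc = (count - d * p) / math.sqrt(d * p * (1.0 - p))
        best = max(best, hc)
    return best
```

A vector with a few large components pushes the statistic up, while a pure-noise vector keeps it small, which is the mechanism the test exploits.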

Transformation of the statistical testing problem
Consider the tensor product structure of the space L_2([0, 1]^d) = ⊗_{j=1}^d L_2([0, 1]). Then the corresponding orthonormal basis (φ^d_l)_{l∈Z^d} is given by φ^d_l(t) = Π_{j=1}^d φ^1_{l_j}(t_j), where (φ^1_k)_{k∈Z} is an orthonormal basis of L_2([0, 1]). It is assumed that φ^1_0 = 1. For any (j, k) ∈ {1, . . ., d} × Z, let us define φ^d_{j,k} as φ^d_{j,k} = φ^d_l, where l ∈ Z^d has j-th component equal to k and all other components equal to 0. Observe that φ^d_{j,0} = 1. Using the orthonormal system (φ^d_{j,k})_{(j,k)∈{1,...,d}×Z}, consider the statistics (x_j)_{1≤j≤d} = {x_{j,k}; k ∈ Z}_{1≤j≤d} defined by

x_{j,k} = ∫_{[0,1]^d} φ^d_{j,k}(t) dX(t) = ξ_j θ_{j,k} + ǫ η_{j,k}, (3.1)

where the random variables η_{j,k} = ∫_{[0,1]^d} φ^d_{j,k}(t) dW(t) are i.i.d. real standard Gaussian random variables and θ_{j,k} = ∫_{[0,1]} φ^1_k(t_j) f_j(t_j) dt_j. Set θ_j = (θ_{j,k})_{k∈Z} and θ = (θ_j)_{1≤j≤d}. Thanks to the periodicity constraints, we may take (φ^1_k)_{k∈Z} to be the standard Fourier basis. Then the Sobolev semi-norm of f_j can be expressed in terms of its Fourier coefficients as ||f_j||²_{S_τ} = Σ_{k∈Z} (2π|k|)^{2τ} θ²_{j,k}. Therefore, the functional class F_d(τ, r_ǫ, b) can be equivalently represented as a sequence space Θ_d(τ, r_ǫ, b), and the testing problem of interest (1.3) can be rewritten in the form

H_0 : θ = 0 against H_1 : θ ∈ Θ_d(τ, r_ǫ, b).

Denote by IP_0 and IP_θ the distributions under the null and alternative hypotheses, respectively. Also, denote by IE_0, Var_0, IE_θ, and Var_θ the expectations and variances with respect to IP_0 and IP_θ, respectively. The notation IP_θj, IE_θj and Var_θj will also be used: it refers to the distribution of the observations x_j = (x_{j,k})_{k∈Z}.
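The Fourier expression of the Sobolev semi-norm can be sketched numerically. This assumes the representation ||f_j||²_{S_τ} = Σ_k (2π|k|)^{2τ} θ²_{j,k} used above; the function name is hypothetical.

```python
import math

def sobolev_seminorm_sq(theta, tau):
    """Squared Sobolev semi-norm from Fourier coefficients,
    assuming the representation sum_k (2*pi*|k|)^(2*tau) * theta_k^2.

    theta: dict mapping a frequency k in Z to the coefficient theta_k.
    """
    return sum((2.0 * math.pi * abs(k)) ** (2 * tau) * c * c
               for k, c in theta.items())
```

Note that the constant component k = 0 contributes nothing, which matches the convention that the functions are orthogonal to constants.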
For any test procedure ψ, that is, for any function measurable with respect to the observations and taking its values in the interval [0, 1], let ω(ψ) = IE_0(ψ) be the type I error probability and let β(ψ, Θ_d(τ, r_ǫ, b)) = sup_{θ∈Θ_d(τ,r_ǫ,b)} IE_θ(1 − ψ) be the maximal type II error probability; set

γ(Θ_d(τ, r_ǫ, b)) = inf_ψ ( ω(ψ) + β(ψ, Θ_d(τ, r_ǫ, b)) ),

where the infimum is taken over all test procedures. One cannot distinguish between H_0 and H_1 if γ → 1, and distinguishability occurs if there exists ψ such that either γ(ψ, Θ_d(τ, r_ǫ, b)) → 0 or β(ψ, Θ_d(τ, r_ǫ, b)) → 0 once ψ has a given asymptotic level. The aim of this paper is to provide separation rates for the alternatives Θ_d(τ, r_ǫ, b) and to determine statistical procedures ψ and/or ψ_α asymptotically of level α, i.e., ω(ψ_α) ≤ α + o(1), for which these separation rates are achieved.
By separation rates we mean a family r⋆_ǫ such that distinguishability is impossible when r_ǫ/r⋆_ǫ → 0 and possible when r_ǫ/r⋆_ǫ → +∞. By sharp separation rates we mean a family r⋆_ǫ such that distinguishability is impossible when lim sup r_ǫ/r⋆_ǫ < 1 and possible when lim inf r_ǫ/r⋆_ǫ > 1. Typically, asymptotics for models like model (1.1) are given as ǫ → 0. However, we are mainly interested in high-dimensional settings where d → +∞. Therefore, here and later, asymptotics and the symbols o, O, ∼ and ≍ are used as ǫ → 0 and d → +∞, except where explicitly specified otherwise; for instance, o_d is used when d → +∞ only. The notation A ∆= B means that we use the notation A for the quantity B.

Extremal problem
In this section, we explain the role of the quantity a(r_ǫ), which corresponds to the energy of a signal in the vectorial case. Only in this section, we assume that the observations have the form x_k = θ_k + ǫη_k for k ∈ Z, where the η_k's are i.i.d. real standard Gaussian random variables. The quantity a(r_ǫ) denotes the value of the extremal problem

a(r_ǫ) = ǫ^{−2} inf_{θ∈Θ(τ,r_ǫ)} ( (1/2) Σ_{k∈Z} θ⁴_k )^{1/2}, Θ(τ, r_ǫ) = {θ ∈ l_2(Z) : Σ_k (2π|k|)^{2τ} θ²_k ≤ 1, Σ_k θ²_k ≥ r²_ǫ}, (4.1)

and characterizes distinguishability in the minimax detection problem for one-variable functions lying in S_τ and separated from the null hypothesis in L_2 by the positive value r_ǫ, i.e., in the one-dimensional model dx(t) = f(t) dt + ǫ dW(t), t ∈ [0, 1]. Namely, if a(r_ǫ) → 0 then the minimax total error probability γ(Θ(τ, r_ǫ)) → 1, and if a(r_ǫ) → +∞, then γ(Θ(τ, r_ǫ)) → 0. Furthermore, let θ⋆ ∆= θ⋆(r_ǫ) be a sequence in l_2(Z) that solves the extremal problem (4.1). Then we get the sharp asymptotics

γ(Θ(τ, r_ǫ)) = 2Φ(−a(r_ǫ)/2) + o(1).

For the reader's convenience, we give a sketch of the proofs of these results. The proofs are based on the methods and results of Sections 3.1, 3.3 and 4.3 in [13]. As in the vectorial case, we also describe the structure of asymptotically minimax tests.
Under assumption (4.3), taking the prior based on the extremal sequence of problem (4.1), one can show that the Bayesian log-likelihood ratio is asymptotically Gaussian:

log(dIP_π / dIP_0) = a(r_ǫ) η_ǫ − a²(r_ǫ)/2 + ρ_ǫ,

where η_ǫ → η ∼ N(0, 1) and ρ_ǫ → 0 in IP_0-probability. The proof is based on a Taylor expansion, see Section 4.3.1 of [13]. This yields the sharp lower bounds.
In order to obtain the upper bounds, take a sequence q = (q_k)_{k∈Z} such that q_k ≥ 0, Σ_k q²_k = 1/2, and consider t_q, a centered and normalized (under IP_0) statistic of weighted χ²-type:

t_q = Σ_{k∈Z} q_k (x²_k ǫ^{−2} − 1).

Consider also the test procedures ψ_{H,q} = 1I{t_q > H}. Observe that IE_0 t_q = 0, Var_0 t_q = 1, and t_q is asymptotically standard Gaussian under IP_0. These observations imply ω(ψ_{H,q}) = Φ(−H) + o(1).
Denote by κ(θ, q) and κ(q) the following functions:

κ(θ, q) = Σ_{k∈Z} q_k θ²_k, κ(q) = inf_{θ∈Θ(τ,r_ǫ)} κ(θ, q). (4.4)

Then IE_θ t_q = ǫ^{−2} κ(θ, q) ≥ ǫ^{−2} κ(q), and hence, by Chebyshev's inequality, β(ψ_{H,q}, Θ(τ, r_ǫ)) → 0 when ǫ^{−2}κ(q) → +∞ and H ≤ cǫ^{−2}κ(q), c ∈ (0, 1). Under assumption (4.3), one can check that the statistic t̃_q = t_q − IE_θ t_q is asymptotically standard Gaussian under IP_θ with IE_θ t_q = O(1). Therefore

β(ψ_{H,q}, Θ(τ, r_ǫ)) ≤ Φ(H − ǫ^{−2}κ(q)) + o(1).

In order to determine the "asymptotically best sequence" (q_k)_{k∈Z}, it suffices to solve the following maximin problem:

sup_q inf_{θ∈Θ(τ,r_ǫ)} κ(θ, q). (4.5)

Then, by convexity of the set Θ(τ, r_ǫ) and using the minimax theorem, we get

sup_q inf_θ κ(θ, q) = inf_θ sup_q κ(θ, q) = inf_θ ( (1/2) Σ_k θ⁴_k )^{1/2}.

Thus, the asymptotically best sequence (q_k)_{k∈Z} is the sequence w(r_ǫ) defined in (4.2), and the value of the problem (4.5) coincides with the value of the problem (4.1). Setting H = a(r_ǫ)/2, we get the upper bounds and the structure of asymptotically minimax tests.
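The weighted χ²-type statistic above admits a direct sketch. Assuming the form t_q = Σ_k q_k(x_k²ǫ^{-2} − 1) with Σ_k q_k² = 1/2, the statistic has mean 0 and variance 1 under the null, because Var(η_k² − 1) = 2; the function name is illustrative.

```python
def chi2_statistic(x, q, eps):
    """Weighted chi-square type statistic t_q = sum_k q_k (x_k^2/eps^2 - 1).

    With q_k >= 0 and sum_k q_k^2 = 1/2, t_q has mean 0 and variance 1
    under the null x_k = eps * eta_k, eta_k i.i.d. N(0, 1),
    since Var(eta_k^2 - 1) = 2.
    """
    return sum(qk * ((xk / eps) ** 2 - 1.0) for xk, qk in zip(x, q))
```

For instance, with eight equal weights q_k = 1/4 (so that Σ q_k² = 1/2) and observations all of magnitude 2ǫ, each term contributes q_k·3 and the statistic equals 6.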
The solution of the extremal problem (4.1) is obtained in Ingster and Suslina [13], Section 4.3. Adapting the derivations on pages 146–147 of Section 4.3.2 in [13] to our case, we obtain Lemma 4.1, which gives the asymptotics (4.8) of a(r_ǫ), the constant c_1(τ) defined in (4.9), and the extremal sequence of weights.

Remark 4.1. One must note that r_ǫ → 0 is the only condition needed to obtain the asymptotic solution of (4.1). In particular, it is not required that ǫ → 0, and Lemma 4.1 is valid whatever the value of ǫ > 0.

Sketch of proof of Lemma 4.1.
Following Chapter 4 in [13], observe that by setting v_k = θ²_k/√2 for all k ∈ Z, one can transform the minimization problem under constraints (4.1) into the following one:

v_ǫ = inf_{v∈V+} ||v||_2,

where V+ is defined by equation (4.6). The space l+_1(Z) contains the non-negative sequences lying in l_1(Z). Note that v²_ǫ = ǫ⁴ a²(r_ǫ). The convexity of the set V+ ensures the uniqueness of the minimizer. In order to determine the solution, rewrite, as in Section 4.3 of [13], the sequence (v_k)_{k∈Z} in the form v_k = v (1 − (|k|/m)^{2τ})_+ with v > 0 and m > 0. By using the Lagrange multiplier rule, it is possible to obtain the relations (4.10) between v, m, and r_ǫ, as r_ǫ → 0 and m → +∞; the first and second relations in (4.10) then entail the announced asymptotics, which implies that m → +∞, since the third relation in (4.10) yields m ≍ v
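The order of magnitude announced for a(r_ǫ) can be cross-checked against the rate quoted in Section 5. This is a sketch under the assumption that (4.8) has the classical Ingster–Suslina form (only the exponent matters for the check):

```latex
% Sketch: assuming (4.8) has the classical form
%   a(r_\epsilon) \sim c_1(\tau)\,\epsilon^{-2}\, r_\epsilon^{(4\tau+1)/(2\tau)},
% plugging it into the moderate-sparsity detection condition
% K a(r_\epsilon)/\sqrt{d} \asymp 1 with K = d^{1-b} gives
\[
  d^{1-b}\,\epsilon^{-2}\, r_\epsilon^{(4\tau+1)/(2\tau)} \asymp d^{1/2}
  \quad\Longleftrightarrow\quad
  r_\epsilon \asymp \bigl(\epsilon^{4} d^{\,2b-1}\bigr)^{\tau/(4\tau+1)},
\]
% which matches the adaptive rate quoted in Section 5 up to the
% \log\log d factor specific to adaptation.
```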

Main results
Depending on the value of b, we distinguish between two types of sparsity: the moderate sparsity case b ∈ (0, 1/2] and the high sparsity case b ∈ (1/2, 1). In each case, although they are of different types, the "best" test procedures achieving the separation rates are based on χ²-type statistics (t_j)_{1≤j≤d} determined in the same way as the "best" weighted χ²-type statistic t_q of Section 4.
Let us introduce a general version of the χ²-type statistics of interest. For j ∈ {1, . . ., d}, put

t_j = Σ_{k∈Z} t_{j,k}, t_{j,k} = w_k (x²_{j,k} ǫ^{−2} − 1), (5.1)–(5.2)

where (w_k)_{k∈Z} is a sequence of weights such that w_k ≥ 0 for all k ∈ Z and Σ_{k∈Z} w²_k = 1/2. Set also Φ̃_0(u) = IP_0(t_j > uT_d), which is the quantity defined in (5.4). Recall that T_d = √(log d) (see Section 2). Similarly to (2.3), for any u ∈ (0, √2], let us define the statistic L(u), on which the Higher-Criticism type test procedure is built:

L(u) = ( Σ_{j=1}^d 1I{t_j > uT_d} − d Φ̃_0(u) ) / √( d Φ̃_0(u)(1 − Φ̃_0(u)) ), (5.3)

with the normalizing quantity C_u defined in (5.5); here the threshold parameter u is related to the sparsity index b in (0, 1). Then, for all j ∈ {1, . . ., d}, we consider the statistics t_{j,b} defined as in (5.1) with the weight sequence (w_k(r⋆_ǫ))_{k∈Z}, that is,

t_{j,b} = Σ_{k∈Z} w_k(r⋆_ǫ) (x²_{j,k} ǫ^{−2} − 1).

Also, denote by t_b the normalized empirical mean of the t_{j,b}'s:

t_b = d^{−1/2} Σ_{j=1}^d t_{j,b}. (5.6)

Similarly, replacing t_j by t_{j,b}, consider the statistics L(u, b), C_{u,b}, and Φ̃_{0,b} defined by equations (5.3), (5.5) and (5.4), respectively.

Moderate sparsity
In the case of moderate sparsity, for any α ∈ (0, 1), consider the χ²-type test procedure

ψ_{χ², α} = 1I{t_b > T_α}, (5.8)

where t_b is defined in (5.6) and T_α is the (1 − α)-quantile of the standard Gaussian distribution.
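A minimal sketch of such a level-α χ²-type test, assuming the aggregated statistic t_b is already computed (the function name is hypothetical):

```python
from statistics import NormalDist

def chi2_test(t_b, alpha):
    """Reject H0 when the aggregated statistic t_b exceeds the
    (1 - alpha)-quantile of the standard Gaussian distribution."""
    T_alpha = NormalDist().inv_cdf(1.0 - alpha)
    return t_b > T_alpha
```

Since t_b is asymptotically standard Gaussian under the null, this threshold makes the test asymptotically of level α.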
The Higher-Criticism type test procedure ψ_L is based on the function L defined in (5.3).
Finally, combining ψ_L and ψ_max, we define the test procedure ψ_HC that rejects H_0 if both ψ_L and ψ_max reject H_0.
For the high sparsity case, not only separation rates but also sharp asymptotics are obtained; two ranges of b should be distinguished: b ∈ (1/2, 3/4], called the intermediate sparsity case, and b ∈ (3/4, 1), called the highest sparsity case.

Theorem 5.2. Assume that r_ǫ → 0 and that log d = o(ǫ^{−2/(2τ+1)}). Let a(r_ǫ) be given by (4.8) and let ϕ be given by (2.2). • Set a(r⋆_ǫ) = T_d ϕ(b). In our sparse functional framework, the distinguishability conditions are the same as for a d-dimensional sparse vector (see, e.g., [12]), with the only difference that in our case the assumption log d = o(ǫ^{−2/(2τ+1)}) is required. Under this assumption, the result of Theorem 5.2 means that distinguishability is impossible if lim sup a(r_ǫ)/a(r⋆_ǫ) < 1 and possible if lim inf a(r_ǫ)/a(r⋆_ǫ) > 1. Due to (4.8), these conditions provide sharp separation rates, since they are equivalent to lim sup r_ǫ/r⋆_ǫ < 1 and lim inf r_ǫ/r⋆_ǫ > 1, respectively, where r⋆_ǫ is determined by the relation a(r⋆_ǫ) = T_d ϕ(b) via (4.8), in which c_1(τ) is defined by (4.9). Note that the condition r⋆_ǫ → 0 is fulfilled under the assumption log d = o(ǫ^{−2/(2τ+1)}). The values r⋆_ǫ mark the border between the areas of distinguishability and non-distinguishability. Indeed, for r_ǫ → 0 such that lim sup r_ǫ/r⋆_ǫ < 1, the alternatives separated from the null hypothesis by r_ǫ are not distinguishable, and, on the other side, for r_ǫ → 0 such that lim inf r_ǫ/r⋆_ǫ > 1, the alternatives separated from the null hypothesis by r_ǫ are distinguishable.
• Actually, the assumption (5.12) is required when dealing with the asymptotic behavior of the tail distribution of t_{j,b} (see Lemma 7.1); (5.12) follows from the relations in (4.10). Concerning the lower bound, condition (5.12) is necessary when we evaluate the second moment of the Bayesian likelihood ratio under the null hypothesis.
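The detection boundary function can be sketched numerically. This assumes ϕ has the standard sparse-vector form, √(2b − 1) on (1/2, 3/4] and √2(1 − √(1 − b)) on (3/4, 1]; this is an assumption of the sketch, not a quotation of the paper's (2.2).

```python
import math

def phi(b):
    """Detection boundary function, assuming the standard sparse-vector
    form: sqrt(2b - 1) for b in (1/2, 3/4], and
    sqrt(2) * (1 - sqrt(1 - b)) for b in (3/4, 1]."""
    if not 0.5 < b <= 1.0:
        raise ValueError("phi is used here for b in (1/2, 1]")
    if b <= 0.75:
        return math.sqrt(2.0 * b - 1.0)
    return math.sqrt(2.0) * (1.0 - math.sqrt(1.0 - b))
```

Under this form, the two branches agree at b = 3/4 (both equal √2/2), the function is increasing, and ϕ(1) = √2, consistent with the threshold grid u ∈ (0, √2].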
For the moderate sparsity case, it is worth noting that the family of test procedures depends on b ∈ (0, 1/2], since the sequence of weights w(r⋆_ǫ(b)) does. It is shown in Theorem 3 of [10] that "adaptive" separation rates for unknown b ∈ (0, 1/2) are of the form r⋆_ǫ ≍ (ǫ⁴ d^{2b−1} log log d)^{τ/(4τ+1)}, i.e., the adaptive case leads to an unavoidable log log-loss in the separation rates compared to the non-adaptive setting. Using the Bonferroni method, it is possible to prove that the test procedures based on a grid of tests of the form ψ_{χ², α_d, b_l} are adaptive rate-optimal test procedures. Since this result is similar to the one stated in [10], we omit it.
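The Bonferroni aggregation over a grid of candidate sparsity values can be sketched as follows; the function name and the dictionary interface are hypothetical, and the per-b statistics are taken as already computed.

```python
from statistics import NormalDist

def bonferroni_grid_test(stats_by_b, alpha):
    """Adaptive test over a grid of candidate sparsity values: each
    per-b statistic is compared with the Gaussian quantile at level
    alpha / N (Bonferroni over the N grid points), and H0 is rejected
    if any individual test rejects.

    stats_by_b: dict mapping a candidate b to its aggregated statistic t_b.
    """
    n = len(stats_by_b)
    threshold = NormalDist().inv_cdf(1.0 - alpha / n)
    return any(t > threshold for t in stats_by_b.values())
```

Since the grid has logarithmic size in typical constructions, the Bonferroni correction costs only the log log-type factor mentioned above.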

Extended problem
In this section, we generalize the hypothesis testing problem stated in (1.3) to more general alternatives. The null hypothesis H_0 is still characterized by some constant const_0 and, as in (1.3), under the alternative, the signal function f is, up to some constant, equal to f^1, i.e., f = const_1 + f^1. The additive sparse structure of f^1 is still assumed, i.e., f^1 ∈ F_{d,b}, and every component f^1_j is assumed 1-periodic and orthogonal to a constant (recall that, for any t ∈ [0, 1]^d, f^1(t) = Σ_{j=1}^d ξ_j f^1_j(t_j), where ξ_j ∈ {0, 1} and t_j ∈ [0, 1] for any j ∈ {1, . . ., d}). We then denote by F̃_{d,b} the set of signal functions in F_{d,b} whose components are 1-periodic and orthogonal to a constant. Rather than imposing smoothness constraints component-wise, we now study alternative classes for which the smoothness and separation conditions are expressed in terms of the whole signal function f^1. In other words, the main difference between the extended and initial detection problems is that the distinguishability problem is studied with respect to the global signal.
Then, given the alternatives that include signal functions f as in (1.3), where f^1 belongs to the functional class F^ext_d(τ, L, r_ǫ, b), the testing problem of interest is stated as follows:

H_0 : f = const_0 against H_1 : f = const_1 + f^1, f^1 ∈ F^ext_d(τ, L, r_ǫ, b), (6.1)

where the smoothness and separation conditions now involve the whole function f^1. Due to the periodicity constraint, we consider the standard Fourier basis. This allows us to express the semi-norm in terms of Fourier coefficients. As in Section 3, we then transform the functional space F^ext_d(τ, L, r_ǫ, b) into the sequence space Θ^ext_d(τ, L, r_ǫ, b), which consists of sequences θ = (ξ_j θ_{j,k})_{j,k} satisfying the corresponding smoothness and separation constraints.

imsart-generic ver. 2011/11/15 file: Corrected-Gayraud-Ingster-Second-Round-Submitted.tex date: July 24, 2012

Note that if L² = K and r̃²_ǫ = K r²_ǫ, then Θ_d(τ, r_ǫ, b) ⊂ Θ^ext_d(τ, L, r̃_ǫ, b). This implies that the results on the lower bound continue to hold for Θ^ext_d(τ, L, r̃_ǫ, b) with the separation rates (r̃⋆_ǫ)² = K(r⋆_ǫ)², where r⋆_ǫ is defined by either (5.9) or (5.11) depending on the value of b. Here, the quantity of interest is ã(r̃_ǫ), the solution of the extremal problem (6.2), defined analogously to (4.1) over Θ^ext_d(τ, L, r̃_ǫ, b). As follows from Section 4.3 in [13], the solution of the extremal problem (6.2) is expressed through the constant c_1(τ) defined in (4.9); that is, ã(r̃_ǫ) = K a(r_ǫ), where a(r_ǫ) is the solution (4.8) of the extremal problem (4.1).
Remark 6.1. Consider the function κ defined by (4.4), for which the sequence of weights w(r_ǫ) = (w_k(r_ǫ))_k is defined as in (4.2). Then the corresponding expression of κ follows from (4.7) and, similarly to Proposition 4.1, the analogous bound holds for any D ≥ 1. Now, as in Section 3, with the use of the orthonormal system, instead of considering the random process X(t) defined in model (1.1), we observe the family of random sequences (x_{j,k})_{k∈Z, j∈{1,...,d}} defined by (3.1). Finally, the remaining question is: do the families of test procedures ψ_{χ², α} given by (5.8) and ψ_HC given by (5.10) provide distinguishability? The answer is affirmative and is given below. Note that it is then sufficient to study the type II error probability of these tests, since their type I error probability has already been studied for the hypothesis testing problem (1.3).

Theorem 6.1. Assume that r_ǫ → 0 and let a(r_ǫ) and ϕ be given by (4.8) and (2.2), respectively. Then the following results hold true.
ii) High sparsity: type II error probability of ψ_HC defined by (5.10).
Remark 6.2. One should note that the detection boundaries are the same for the hypothesis testing problems (1.3) and (6.1), the initial one and its generalization.

Proofs
Proofs of our main results require some preliminary results, stated below under both the null and the alternative hypotheses. Specifically, we establish the asymptotic tail distributions of the test statistics at hand and find their first and second moments.

Properties of test statistics
In this section, we consider the statistics t_j defined by (5.1) with any sequence of weights w = (w_k)_{k∈Z} such that w_k ≥ 0 for all k ∈ Z and Σ_k w²_k = 1/2. Therefore the quantities L(u), C(u), and Φ̃_0 are those defined by (5.3), (5.5) and (5.4).
Proposition 7.1. Assume that T max_k w_k = o(1) as T → +∞. Then, as T − IE_θj(t_j) → +∞,

log IP_θj(t_j > T) = −(T − IE_θj(t_j))² (1 + o(1))/2.

Proof of Proposition 7.1.
We consider only the distribution IP_θj, since IP_0 is a particular case of IP_θj. The proof consists in bounding IP_θj(t_j > T) from above and below. This is done by using the cumulant-generating function of t_j under IP_θj, defined by φ_θj(h) = log(IE_θj(exp(h t_j))) for any h. Let us consider only positive h and introduce a new family of probability measures IP_θj,h such that dIP_θj,h / dIP_θj = exp(h t_j − φ_θj(h)). This yields

IP_θj(t_j > T) = exp(φ_θj(h) − hT) IE_θj,h( exp(−h(t_j − T)) 1I{t_j > T} ). (7.1)

Let us start with the upper bound.
Upper bound. The second factor on the right-hand side of (7.1) is less than 1. Hence there is a straightforward upper bound on IP_θj(t_j > T):

IP_θj(t_j > T) ≤ exp(φ_θj(h) − hT). (7.2)

To complete this part of the proof, it remains to minimize the right-hand side of (7.2) over positive h. The minimum is attained at the positive h such that

φ′_θj(h) = T, i.e., IE_θj,h(t_j) = T, (7.3)

where (•)′ and (•)′′ denote the first and second derivatives with respect to h, and IE_θj,h and Var_θj,h are the expectation and variance with respect to IP_θj,h.
In order to find the h that solves equation (7.3), we need to determine φ_θj. For this, set ν_{j,k} = θ_{j,k}/ǫ.
Then, for any positive h such that h → +∞ and h max_k w_k = o(1), we obtain (7.4), where the last equality in (7.4) follows from (T − IE_θj(t_j)) → +∞ and T max_k w_k = o(1) as T → +∞. Next, differentiating the right-hand side of (7.4) with respect to h yields the solution of (7.3). As (T − IE_θj(t_j)) → +∞ as T → +∞, this leads to the optimal upper bound exp(−(T − IE_θj(t_j))² (1 + o(1))/2) for the right-hand side of (7.2) as T goes to infinity. This completes the proof of the upper bound.
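The upper-bound step above, minimizing exp(φ(h) − hT) over the tilting parameter h, is a Chernoff-type argument. A minimal sketch for the standard Gaussian case, where the cumulant-generating function is h²/2 and the optimal h equals T (all names here are illustrative, not the paper's):

```python
import math

def chernoff_bound(cgf, T, h_grid):
    """Minimize the Chernoff bound exp(cgf(h) - h*T) over a grid of h > 0.

    This mirrors the upper-bound step of the proof: bound P(t > T) by
    exp(phi(h) - h*T) and optimize over the tilting parameter h.
    """
    return min(math.exp(cgf(h) - h * T) for h in h_grid)

def gaussian_cgf(h):
    """Cumulant-generating function of N(0, 1): log E exp(h X) = h^2 / 2."""
    return 0.5 * h * h

def gaussian_tail(T):
    """Exact standard Gaussian tail P(X > T)."""
    return 0.5 * math.erfc(T / math.sqrt(2.0))
```

For a standard Gaussian, the optimized bound exp(−T²/2) indeed dominates the exact tail, illustrating why the matching lower bound in the next step requires the change-of-measure argument.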
Lower bound. We are interested in obtaining a lower bound for (7.1). This is done by first considering a new family of probability distributions under which the normalized statistics t_j are proved to be asymptotically Gaussian.
For h > 0 satisfying equation (7.3), let us introduce the following probability measures IP_θj,h,k, with t_{j,k} defined in (5.2), φ_θj,k(h) = log IE_θj,k(exp(h t_{j,k})), and where IE_θj,k stands for the expectation with respect to the observations (x_{j,k})_{j,k} of (3.1). Denote by IE_θj,h,k and Var_θj,h,k the expectation and variance with respect to IP_θj,h,k.
To establish the asymptotic normality of t_j, we check that the Lyapunov condition is satisfied. To this end, set σ²_{j,h,k} = Var_θj,h,k(t_{j,k}) and σ²_{j,h} = Σ_k σ²_{j,h,k}. Denote by φ^(2)_θj,k and φ^(4)_θj,k the second and fourth derivatives of φ_θj,k with respect to h, respectively. Using the well-known relations between the moments of t_j under IP_θj,h,k and the successive derivatives of φ_θj,k(h) with respect to h, in particular σ²_{j,h} = Σ_k φ^(2)_θj,k(h), one controls Σ_k φ^(4)_θj,k(h), where the last relation follows from max_k w_k = o(1) and relation (7.4). The Lyapunov condition is then satisfied. This implies that, under IP_θj,h, the normalized statistic (t_j − IE_θj,h(t_j))/σ_{j,h} is asymptotically a real standard Gaussian random variable.
Let us return to relation (7.1), where h is chosen so that IE_θj,h(t_j) = T, and observe the resulting identity. Due to the asymptotic normality of t_j, for any δ > 0, the bound (7.6) holds. Up to o(h²), the right-hand side of (7.6) corresponds to the argument of the exponential function on the right-hand side of (7.2). This entails that the right-hand side of (7.6) is equivalent to the optimized upper bound obtained from (7.2). This completes the proof of the lower bound, and thus Proposition 7.1 is proved.
• (i) Expectation and variance of t j defined by (5.1).
Remark 7.1. Under IP_0, the statistics t_j and L(u) have zero mean and unit variance. Moreover, under IP_0 and the assumption max_k w_k = o(1), they are asymptotically standard Gaussian.

(ii) For any u ∈ (0, √2], as T_d → +∞, Proposition 7.1 gives a control over C_u defined by (5.5).

Case 1: for the nonzero ξ_j's, assume that lim sup(uT_d − IE_θj(t_j)) < +∞. In this case, the probability IP_θj(t_j > uT_d) = IP_θj(t_j − IE_θj(t_j) > uT_d − IE_θj(t_j)) is bounded away from zero. This follows from the asymptotic normality of t_j − IE_θj(t_j) when IE_θj(t_j) = O(T_d) (see Remark 7.1).

Case 2: for the nonzero ξ_j's, assume that uT_d − IE_θj(t_j) → +∞. Then, for any nonzero ξ_j, Proposition 7.1 implies that log IP_θj(t_j > uT_d) ∼ −(uT_d − IE_θj(t_j))²/2.

Recall that the number of nonzero ξ_j's is equal to K = d^{1−b} and that, for all nonzero ξ_j, IE_θj(t_j) ≥ cT_d for some positive c such that max_{j:ξ_j=1} IE_θj(t_j) = O(T_d). To sum up, Cases 1 and 2 entail the required control of the expectation of L(u). Similarly, let us study the variance of L(u): using Proposition 7.1, we obtain the corresponding control.

Proof of (ii) of Theorem 5.1. Type I error probability of ψ_{χ², α}. It follows from the Central Limit Theorem that, under the null hypothesis, t_b is asymptotically a standard normal random variable. Therefore ω(ψ_{χ², α}) = α + o(1).
Finally, denote by M the main term in the exponent of d in (7.14). To obtain the result, it is sufficient to prove that M is positive and bounded away from zero for any δ > 0.
Intermediate sparsity case. This case corresponds to b ∈ (1/2, 3/4], for which the required inequalities are obviously satisfied. This leads to the result. Highest sparsity case. In this case b ∈ (3/4, 1), and again the required inequalities are satisfied; the result follows.
The proof of the fact that the type II error probability of ψ_HC goes to zero as d → +∞ is similar to that of Theorem 5.2. Recall that K = d^{1−b} is the number of nonzero ξ_j's and suppose, without loss of generality, that ξ_j = 1 for all j ∈ {1, . . ., K} and ξ_j = 0 for all j ∈ {K + 1, . . ., d}. Note that relations (7.10) and (7.11) remain valid for any θ ∈ Θ^ext_d(τ, K^{1/2}, K^{1/2} r_ǫ, b). First, similarly to (7.12), for any θ ∈ Θ^ext_d(τ, K^{1/2}, K^{1/2} r_ǫ, b) such that, for the nonzero ξ_j's, there exists l ∈ {1, . . ., N} for which IE_θj(t_{j,b_l}) ≥ D_1 T_d with D_1 > D, the type II error probability of ψ_HC vanishes asymptotically. Therefore, it suffices to study the remaining test procedures. Let us then take δ > 0 and consider the alternatives that are separated from the null hypothesis by r_ǫ. Second, for any l ∈ {1, . . ., N}, observe that the only difference between the proofs of the extended and initial problems lies in the study of the infimum (7.15). It is no longer possible to control (7.15) by using Lemma 7.1 (ii), because the condition IE_θj(t_j) ≥ cT_d is not necessarily satisfied for all nonzero ξ_j's; in fact, the only condition at our disposal is the global separation condition. Let us now explain why the current proof reduces to the study of (7.15). As in (7.13), due to Lemma 7.1 and the fact that H = O_d((log d)^C) with C > 1/4, in order to obtain the result, it remains to prove that, for any l such that b_l ≤ b ≤ b_{l+1}, the quantity in (7.15) tends to +∞ as a positive power of d. Finally, recall that the term on the right-hand side of (7.16) corresponds to the product of (7.15) and C_{u_l,b_l}. The quantity C_{u_l,b_l} is controlled by Lemma 7.1 and Proposition 7.1. Thus it remains to study (7.15). Third, the application of Proposition 7.1 gives an approximation of (7.15). Recall that a(r_ǫ), given by (4.8), is the solution of the extremal problem (4.1). Set η_j = IE_θj(t_{j,b_l}), where R = R(T) > 0 will be specified later on. Due to relation (6.4), the sequence w(r⋆_ǫ(b_l)) satisfies the required bound. Then, in order to obtain the same right-hand side as in (7.14), it is sufficient to show that, for any l in {1, . . ., N} such that T = u_l T_d, relation (7.17), stated below, holds. This is handled by a technical result, Lemma 7.2, similar to Lemma 7.4 and Lemma 7.5 in Ingster et al. [17]. The proof of Lemma 7.2 is postponed to Section 7.4.

The conditions in (7.18) are then satisfied. Therefore, the application of Lemma 7.2 yields the result since, for all θ ∈ Θ^ext_d(τ, K^{1/2}, K^{1/2} r_ε, b), the resulting bound corresponds to the right-hand side of (7.14).

Lower Bound
The prior distribution we consider is a classical one for a functional Gaussian model. In Section 4.3 of [13], it is referred to as the prior with symmetric three-point factors.
Prior. Before defining the prior Π_d formally, we shall start with an informal discussion.
The prior Π_d puts mass on (ξ_j θ_j)_{1≤j≤d}: the components are i.i.d., and ξ_j and θ_j are supposed to be independent. A natural choice for ξ_j is a Bernoulli random variable with a parameter p_d ∈ (0, 1) such that IE(Σ_{j=1}^d ξ_j) ∼ K. The θ_j's are symmetric binary random variables (each sign taken with probability 1/2) such that θ_j² = (θ⋆)², where the sequence θ⋆ is a solution of the extremal problem (3.1); this guarantees that θ_j belongs to Θ(τ, r_ε).

Consider two sequences (ξ_j)_j and (θ_{j,k})_{j,k} of independent random variables, where each ξ_j is Bernoulli and each θ_{j,k} takes the values ±ǫ z_k with probability 1/2, for j ∈ {1, . . . , d} and k ∈ ZZ. The sequence (z_k)_{k∈ZZ} is deterministic and defined by (ǫ z_k)_k = (θ⋆_k)_{k∈ZZ} = θ⋆, where θ⋆ is the sequence leading to the solution (4.8) of the extremal problem (4.1). The sequences (ξ_j)_j and (θ_{j,k})_{j,k} are also taken mutually independent. For each j in {1, . . . , d}, we define the prior distribution π_j^d on (ξ_j, θ_j) through these two sequences.

Proof of Proposition 7.2.
Finally, in view of Proposition 2.11 in [13], it remains to check (7.30). Similarly to the proof of Proposition 2.9 in [13], it is easily seen that (7.30) follows from relation (7.31). Acting as in the proof of Proposition 3 in [12], we obtain by Chebyshev's inequality a bound whose right-hand side tends to zero as d goes to infinity. Relation (7.31) is then proved.
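Recall that for a simple null IP_0 against a simple alternative Q, the minimal total error probability satisfies the classical identity γ(Q) = 1 − (1/2)‖Q − IP_0‖_1. The following numerical sketch illustrates this identity in a one-coordinate caricature of the mixture alternative; the values p and a are illustrative assumptions, not the paper's quantities.

```python
import numpy as np

# Minimal total error probability for simple vs simple testing:
#   gamma(Q) = 1 - (1/2) * || Q - IP_0 ||_1.
# One-coordinate caricature: under Q the observation is N(0,1) with
# probability 1-p and N(+/-a, 1) with probability p/2 each.
p, a = 0.3, 1.0   # illustrative values, not the paper's

x = np.linspace(-12.0, 12.0, 200_001)
dx = x[1] - x[0]

def phi(t):
    # standard normal density
    return np.exp(-t ** 2 / 2) / np.sqrt(2 * np.pi)

q = (1 - p) * phi(x) + (p / 2) * (phi(x - a) + phi(x + a))
tv = 0.5 * np.sum(np.abs(q - phi(x))) * dx   # total variation distance, by Riemann sum
gamma = 1.0 - tv
print(tv, gamma)
```

When p·a is small the two distributions are nearly indistinguishable and γ(Q) is close to one, which is the regime the lower-bound argument exploits.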
Due to Proposition 7.2, the proof of the lower bound reduces to bounding γ⋆ := γ(IP_{Π_d}) from below. Using the inequality 1 + x ≤ exp(x), ∀x ∈ IR, we obtain a bound in which sinh denotes the hyperbolic sine. In view of Remark 7.
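The hyperbolic functions enter through the symmetric ±θ⋆ averaging: in a one-coordinate caricature, the likelihood ratio of the three-point mixture against N(0, 1) is L(x) = (1 − p) + p e^{−a²/2} cosh(ax), which must have mean exactly one under IP_0. A quick quadrature check of this normalization (p and a are illustrative assumptions, not the paper's quantities):

```python
import numpy as np

# Likelihood ratio of the symmetric three-point mixture against N(0,1):
#   L(x) = (1 - p) + p * exp(-a**2 / 2) * cosh(a * x).
# Averaging exp(a x - a^2/2) and exp(-a x - a^2/2) produces the cosh term;
# under IP_0 the ratio must have mean exactly one.  Checked here with
# Gauss-HermiteE quadrature (weight exp(-x^2/2)).
p, a = 0.05, 1.3   # illustrative values, not the paper's

nodes, weights = np.polynomial.hermite_e.hermegauss(80)
norm = 1.0 / np.sqrt(2 * np.pi)   # turns the quadrature into an N(0,1) expectation

L = (1 - p) + p * np.exp(-a ** 2 / 2) * np.cosh(a * nodes)
mean_under_null = norm * np.sum(weights * L)
print(mean_under_null)
```

The same normalization underlies the change-of-measure computations in the proof; the sinh term appears when the odd part of the same exponential average is bounded.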
As d goes to infinity, the right-hand side of (7.36) goes to one provided the corresponding condition on the parameters holds.

Proof of (i) of Theorem 5.1.
Recall that by assumption b ∈ (0, 1/2]. We shall distinguish between two cases depending on the value of r_ε with respect to r⋆_ε defined in (5.9). Later, we shall again consider two cases depending on the value of r_ε with respect to r⋆_ε, where r⋆_ε is then defined by (5.11).

In order to get a lower bound for the minimax total error probability, it is sufficient to prove (see the proof of Theorem 4.1 in [11]) that IE_0((L̂_{Π_d} − 1)²) = o(1), where L̂_{Π_d} is defined in (7.40), provided that (7.46) holds. In fact, it is enough to prove a slightly stronger statement.

First, consider the term IE_0(L̂_{Π_d}), where D_j = { l̂_j ≤ a(r_ε) √((2 + v) log d) } and D̄_j denotes the complement of D_j. Relation (7.46) entails the convergence to zero of the second term inside the logarithm on the right-hand side of (7.50). Therefore, in order to obtain IE_0(L̂_{Π_d}) → 1, it is sufficient to prove (7.51). For any positive v, we can apply relation (7.48) of Lemma 7.3 to get (7.52), whose right-hand side goes to zero as soon as c(r_ε) < √(2 + v) − √(2(1 − b)). This yields relation (7.51).
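The second-moment condition IE_0((L̂_{Π_d} − 1)²) = o(1) can be illustrated in the simplest one-coordinate caricature, where a direct Gaussian computation gives the closed form IE_0(L²) = 1 + p²(cosh(a²) − 1). The sketch below checks that closed form by quadrature; p and a are illustrative assumptions, not the paper's quantities.

```python
import numpy as np

# One-coordinate caricature of the second-moment computation: with
#   L(x) = (1 - p) + p * exp(-a**2 / 2) * cosh(a * x),   x ~ N(0, 1),
# expanding the square and using IE cosh(aX) = exp(a^2/2) and
# IE cosh(aX)^2 = (1 + exp(2 a^2)) / 2 gives
#   IE_0(L^2) = 1 + p^2 * (cosh(a^2) - 1),
# so IE_0((L - 1)^2) = p^2 * (cosh(a^2) - 1) is small when p^2 cosh(a^2) is.
p, a = 0.05, 1.3   # illustrative values, not the paper's

nodes, weights = np.polynomial.hermite_e.hermegauss(80)
norm = 1.0 / np.sqrt(2 * np.pi)   # N(0,1) expectation from the quadrature rule

L = (1 - p) + p * np.exp(-a ** 2 / 2) * np.cosh(a * nodes)
second_moment = norm * np.sum(weights * L ** 2)
closed_form = 1 + p ** 2 * (np.cosh(a ** 2) - 1)
print(second_moment, closed_form)
```

The trade-off between p (sparsity) and a (signal strength) visible in p² cosh(a²) is the caricature of the detection boundary conditions appearing in the actual proof.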
Second, we need to study IE_0(L̂²_{Π_d}). Since the relation dIP_0(D̄_j) = o(1) holds, it suffices to control the remaining term. To this end, observe relation (7.53).

The first term on the right-hand side of (7.53) tends to zero as d goes to infinity. To study the second term on the right-hand side of (7.53), we take into account the following two points: (i) since sup_k z_k² = o(1), we can apply Lemma 7.4 of Section 7.4 with h = 2, X = x_{j,k}/ǫ, and z = z_k, to obtain exp(2 l̂_j) = exp(2a²(r_ε)); and (ii) the bound given in (7.56).

On the other hand, denote by g′_T and g″_T the first and second derivatives of g_T, respectively. Note that g′_T is increasing on ]T + 1, +∞[. Moreover, g′_T(T − 1) > 0 and g′_T(T) = −λ < 0, so that there exists t ∈ ]T − 1, T[ such that g′_T(t) = 0. This yields that η_0 is a local minimum of g_T. In order to prove that η_0 is a global minimum of g_T, it is sufficient to show that g_T(R) − g_T(η_0) > 0. Let us set R = T + x, with a positive real x. We already know that x < T − η_0, since g_T(T + (T − η_0)) = f_T(η_0) − λ(T + T − η_0) = f_T(η_0) − λη_0 − 2λ(T − η_0) < g_T(η_0), where the last inequality is valid because of the choice of λ and T − η_0. For x < T − η_0, we obtain a bound whose right-hand side, with our choice of δ, is small; see (7.62). Now, we move on to the term G_1(h, δ). If δ is small enough, then |zX| is also small and the required relation holds. Then the routine calculations of exponential moments as above lead to a similar bound. Next, due to (7.62) and (7.64), and using the facts that h = O(1) and z = o(1), for small δ such that z_0 δ^{−1} = o(1) and δ = o(1), we obtain log(IE(exp(hg(z, X)))) = log(G_1(h, δ)). Due to (7.61), for any h such that h = O(1), Lemma 7.4 can be applied to the moment-generating function Λ_{j,k}(h).

K = Σ_{j=1}^d ξ_j, and assume that K = d^{1−b}, where b ∈ (0, 1) is the sparsity index. If d^{1−b} is not an integer, then take K as its integer part. Denote by F_{d,b} the functional class of additive sparse signals f of the form (1.2) with K = d^{1−b} active components and d − K non-active components. Model (1.1) with the sparse additive structure (1.2) is a natural generalization of the sparse linear model: the nonparametric nature of the problem suggests considering more flexible models.

Lemma 4.1.
The solution of the extremal problem (4.1) is given by (4.8).

For any b_l ∈ (1/2, 1), if we prove that inf_{θ∈Θ_d(τ, r_ε, b)} IE_θ(L(u_l, b_l)) goes to infinity as a power of d (d → +∞), then Lemma 7.1 and the choice of H (recall that H = O_d((log d)^C), with C > 1/4) yield the result, since in this case the right-hand side of relation (7.13) goes to zero. Third, for b ∈ (1/2, 1), take an index l in {1, . . . , N − 1} such that b_l ≤ b ≤ b_{l+1}. This, combined with the continuity of ϕ, yields the desired bound.

Now, we define the prior distribution more precisely. Let ρ_d be any sequence of positive numbers such that ρ_d → 0 and d^{1−b} ρ_d^s → +∞ as d → +∞, for every s > 0. The prior puts mass on the θ_{j,k}'s, and δ denotes the Dirac mass. Finally, we define the global prior Π_d as the product Π_d = ⊗_{j=1}^d π_j^d.

Minimax and Bayesian risks. Denote by IP_{Π_d} the mixture of the measures IP_θ over the prior Π_d, and let γ(Q) be the minimal total error probability for testing the simple null hypothesis H_0 : IP = IP_0 against the simple alternative H_1 : IP = Q, where IP denotes the distribution of our observations (x_{j,k})_{k∈ZZ, 1≤j≤d}.

Proposition 7.2. γ ≥ γ(IP_{Π_d}) + o(1), (7.25) where γ is the minimax total error probability over Θ_d(τ, r_ε, b) (see (3.2)).
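The prior just described can be sketched by simulation. Two assumptions are made below that are not in the text: the parameterization p_d = d^{−b}, chosen so that IE(Σ_j ξ_j) = d^{1−b} = K, and a hypothetical scalar theta_star standing in for the sequence θ⋆.

```python
import numpy as np

# Simulation sketch of the prior Pi_d.  Assumptions (not from the text):
# p_d = d**(-b) so that IE(sum xi_j) = d * p_d = d**(1 - b) = K, and a
# scalar theta_star standing in for the sequence theta*.
rng = np.random.default_rng(0)
d, b = 100_000, 0.4
p_d = d ** (-b)
theta_star = 0.7

xi = (rng.random(d) < p_d).astype(float)     # xi_j ~ Bernoulli(p_d), i.i.d.
signs = rng.choice([-1.0, 1.0], size=d)      # symmetric sign with probability 1/2
theta = xi * signs * theta_star              # three-point factor: 0 or +/- theta*

K = d ** (1 - b)
print(xi.sum(), K)   # the number of active components concentrates around K
```

By the law of large numbers the realized number of active components concentrates around K, which is why the Bayesian alternative IP_{Π_d} is an admissible surrogate for the minimax problem over Θ_d(τ, r_ε, b) up to the o(1) correction in (7.25).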