Optimal Inference with a Multidimensional Multiscale Statistic

We observe a stochastic process $Y$ on $[0,1]^d$ ($d\geq 1$) satisfying $dY(t)=n^{1/2}f(t)\,dt + dW(t)$, $t \in [0,1]^d$, where $n \geq 1$ is a given scale parameter (`sample size'), $W$ is the standard Brownian sheet on $[0,1]^d$ and $f \in L_1([0,1]^d)$ is the unknown function of interest. We propose a multivariate multiscale statistic in this setting and prove its almost sure finiteness; this extends the work of D\"umbgen and Spokoiny (2001), who proposed the analogous statistic for $d=1$. We use the proposed multiscale statistic to construct optimal tests for testing $f=0$ versus (i) appropriate H\"{o}lder classes of functions, and (ii) alternatives of the form $f=\mu_n \mathbb{I}_{B_n}$, where $B_n$ is an axis-aligned hyperrectangle in $[0,1]^d$ and $\mu_n \in \mathbb{R}$, with $\mu_n$ and $B_n$ unknown. In the process we generalize Theorem 6.1 of D\"umbgen and Spokoiny (2001) about stochastic processes with sub-Gaussian increments on a pseudometric space, which is of independent interest.


Introduction
Let us consider the following continuous multidimensional white noise model:
dY(t) = √n f(t) dt + dW(t),   t ∈ [0,1]^d,    (1.1)
where f ∈ L_1([0,1]^d) is the unknown function of interest, W is the unobserved d-dimensional Brownian sheet (see Definition 6.1), and n ≥ 1 is a known scale parameter (`sample size'). Estimation and inference in this model are closely related to those in nonparametric regression based on sample size n. We work with this white noise model as this formulation is more amenable to rescaling arguments; see e.g., Donoho and Low [9], Dümbgen and Spokoiny [10], Carter [5].
In this paper we develop optimal tests (in an asymptotic minimax sense) based on a newly proposed multidimensional multiscale statistic (i.e., d ≥ 1) for testing: (i) f = 0 versus a Hölder class of functions with unknown degree of smoothness; (ii) f = 0 against alternatives of the form f = µ_n I_{B_n}, where B_n is an unknown hyperrectangle in [0, 1]^d with sides parallel to the coordinate axes (i.e., axis-aligned) and µ_n ∈ R is unknown (for different regimes of µ_n and B_n).
Scenario (i) arises quite often in nonparametric regression where the goal is to test whether the underlying f is 0 versus f ≠ 0 with unknown smoothness; see e.g., Lepski and Tsybakov [30], Horowitz and Spokoiny [18], Ingster and Sapatinas [23] and the references therein. Our proposed multiscale statistic, which extends the work of Dümbgen and Spokoiny [10] that considered the analogous statistic for d = 1, leads to rate optimal detection in this problem. Moreover, with the knowledge of the smoothness of the underlying f, we construct an asymptotically minimax test which even attains the exact separation constant (see Section 1.2 for formal definitions and related concepts). Setting (ii) is a prototypical problem in signal detection: an unknown (constant) signal spread over an unknown hyperrectangular region, and the goal is to detect the presence of such a signal; see e.g., Arias-Castro et al. [3], Chan [6], Walther [42], Frick et al. [12], Butucea et al. [4], Chan and Walther [7], Glaz and Zhang [15], König et al. [28] for a plethora of examples and applications.
Although several minimax rate optimal tests have been proposed in the literature for this problem (see e.g., Arias-Castro et al. [3], Chan [6], Butucea et al. [4] and König et al. [28]), as far as we are aware, our proposed multiscale test is the only test that attains the exact separation constant; this leads to simultaneous optimal detection of signals both at small and large scales.
We first motivate and introduce our multiscale statistic below (Section 1.1) and briefly describe the asymptotic minimax testing framework and our main optimality results in Section 1.2.

Multiscale statistic when d ≥ 1
To motivate our multiscale statistic let us first look at the following testing problem:
H_0 : f ≡ 0    versus    H_1 : f ∈ H_{β,L}, f ≢ 0,    (1.2)
where H_{β,L} is the Hölder class of functions with parameters β > 0 and L > 0. For β ∈ (0, 1] and L > 0 the Hölder class H_{β,L} is defined as
H_{β,L} := { f : [0,1]^d → R : |f(x) − f(y)| ≤ L ‖x − y‖^β for all x, y ∈ [0,1]^d };    (1.3)
see Definition 6.2 for the formal definition for general β > 0.
Our multiscale statistic is based on the idea of kernel averaging. Suppose that ψ : R^d → R is a measurable function such that (i) ψ is 0 outside [−1, 1]^d; (ii) ψ ∈ L_2(R^d), i.e., ∫_{R^d} ψ²(x) dx < ∞; (iii) ψ is of bounded (HK)-variation (see Definition 6.3); and (iv) ∫_{R^d} ψ(x) dx > 0. We call such a function a kernel. For any h := (h_1, . . . , h_d) ∈ (0, 1/2]^d we define
A_h := {t ∈ [0,1]^d : h_i ≤ t_i ≤ 1 − h_i for all i = 1, . . . , d}.
For any t ∈ A_h we define the centered (at t) and scaled kernel function ψ_{t,h} : [0,1]^d → R as
ψ_{t,h}(x) := ψ((x_1 − t_1)/h_1, . . . , (x_d − t_d)/h_d).
For a fixed t ∈ A_h we can construct a kernel estimator f̂_h(t) of f(t) based on the data process Y(·) by averaging dY against ψ_{t,h}; here, for any two functions g_1, g_2 ∈ L_2(R^d), we define ⟨g_1, g_2⟩ := ∫_{R^d} g_1(x) g_2(x) dx. We consider the normalized version of the above kernel estimator f̂_h(t):
Ψ̂(t, h) := (‖ψ‖_2 √(h_1 ⋯ h_d))^{−1} ∫_{[0,1]^d} ψ_{t,h}(x) dY(x),
where ‖ψ‖_2² := ∫_{R^d} ψ²(x) dx < ∞. We can use Ψ̂(t, h) to test (1.2) at a fixed scale h. This is because, for a fixed scale h, sup_{t∈A_h} |Ψ̂(t, h)| = O_p(√(2 log(1/(2^d h_1 ⋯ h_d)))); see Giné and Guillou [13]. Thus, to use the above approach to construct a valid test for (1.2) we need to put the test statistics sup_{t∈A_h} |Ψ̂(t, h)| at different scales (i.e., different h) on the same footing; this leads to the following definition of the multiscale statistic in d dimensions:
T(Y, ψ) := sup_{h ∈ (0,1/2]^d} sup_{t ∈ A_h} ( |Ψ̂(t, h)| − Γ(2^d h_1 ⋯ h_d) ) / D(2^d h_1 ⋯ h_d),    (1.8)
where Γ(r) := √(2 log(1/r)) penalizes the small scales and D(·) is a corresponding standardizing term. In Theorem 2.1, a main result in this paper, we show that the above multivariate multiscale statistic T(Y, ψ) is well-defined and finite a.s. for any kernel function ψ, when f ≡ 0. This result immediately extends the main result of Dümbgen and Spokoiny [10, Theorem 2.1] beyond d = 1. Although there have been several proposals that extend the definition and the optimality properties of the multiscale statistic of Dümbgen and Spokoiny [10] beyond d = 1 (see e.g., König et al. [28], Walther [42], Chan and Walther [7]), we believe that our proposed multiscale statistic is the right generalization. Further, the exact form of T(Y, ψ) leads to optimal tests for (1.2) and other alternatives (which the other competing procedures do not necessarily yield; see Remarks 2.3 and 3.4 for more details).
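To fix ideas, the following is a minimal sketch (in Python, not code from the paper) of a discrete analogue of T(Y, ψ) for d = 2 on an m × m grid, using the box kernel ψ = I_{[−1,1]²}, only square dyadic scales, and omitting the standardizing factor D(·); all function and variable names are our own illustrative choices.

```python
import numpy as np
from itertools import product

def local_stats(y, k1, k2):
    # Normalized box-kernel sums over all k1 x k2 blocks of the grid;
    # under f = 0 each entry is (approximately) N(0, 1).
    m = y.shape[0]
    S = np.zeros((m + 1, m + 1))
    S[1:, 1:] = y.cumsum(axis=0).cumsum(axis=1)
    vals = []
    for i, j in product(range(m - k1 + 1), range(m - k2 + 1)):
        block = S[i + k1, j + k2] - S[i, j + k2] - S[i + k1, j] + S[i, j]
        vals.append(block / np.sqrt(k1 * k2))
    return np.array(vals)

def multiscale_stat(y, scales):
    # Discrete analogue of T(Y, psi): each scale is penalized by
    # Gamma(r) = sqrt(2 log(1/r)), with r the fraction of grid points in a block.
    m = y.shape[0]
    best = -np.inf
    for k1, k2 in scales:
        r = (k1 * k2) / m**2
        penalty = np.sqrt(2 * np.log(1 / r))
        best = max(best, np.abs(local_stats(y, k1, k2)).max() - penalty)
    return best

# Toy usage: one draw from the (approximate) null distribution on a 40 x 40 grid.
rng = np.random.default_rng(0)
m = 40
scales = [(k, k) for k in (1, 2, 4, 8, 16)]
t_null = multiscale_stat(rng.standard_normal((m, m)), scales)
```

The key design point illustrated here is the scale-dependent penalty: without it, the smallest blocks would dominate the supremum, which is exactly the behavior of the unpenalized scan statistic discussed in Section 3.3.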
To show the finiteness of the proposed multiscale statistic we prove a general result about a stochastic process with sub-Gaussian increments (Theorem 2.2) on a pseudometric space which may be of independent interest. This result has the same conclusion as that of Dümbgen and Spokoiny [10, Theorem 6.1] but assumes a weaker condition on the packing numbers of the pseudometric space on which the stochastic process is defined. This weaker condition on the packing numbers is crucial to the proof of Theorem 2.1; see Remark 2.1 where we compare our result with the existing result of Dümbgen and Spokoiny [10, Theorem 6.1]. Moreover, Lemma 2.1 gives a tighter bound on the packing numbers of the pertinent (to our application) pseudometric space, which we believe is also new; see Remarks 2.2 and 2.3 where we compare our result with some relevant recent work.

Optimality of the multiscale statistic
Before we describe our main results let us first introduce the asymptotic minimax hypothesis testing framework. There is an extensive literature on nonparametric testing of the simple hypothesis {0}. As a starting point we refer the readers to Ingster and Suslina [24]. In the nonparametric setting it is usually assumed that f belongs to a certain class of functions F and its distance from the null function f = 0 is measured by a seminorm ‖·‖. In this setting, given α ∈ (0, 1), the goal is to find a level α test φ_n such that the worst-case power
inf_{g ∈ F : ‖g‖ ≥ δρ_n} E_g[φ_n(Y)]    (1.9)
is as large as possible, for some δ > 0 and ρ_n > 0 where ρ_n → 0 as n → ∞ (ρ_n is a function of the sample size n); in the above notation E_g denotes expectation under the alternative function g. However, it can be shown that given F and ‖·‖, the constants δ and ρ_n cannot be chosen arbitrarily if one wants to have a statistically meaningful framework (see the survey papers Ingster [20], Ingster [21], Ingster [22] for d = 1 and Ingster and Sapatinas [23] for d > 1). It turns out that if δρ_n is too small then it is not possible to test the null hypothesis with nontrivial asymptotic power (i.e., the infimum in (1.9) cannot be strictly larger than α + o(1)). On the other hand, if δρ_n is very large many procedures can test f ≡ 0 with significant power (i.e., the infimum in (1.9) goes to 1 as n → ∞). The hypothesis testing problem then reduces to: (a) finding the largest possible δρ_n such that no test can have nontrivial asymptotic power (i.e., under alternatives f with ‖f‖ ≤ δρ_n, the asymptotic power is less than or equal to the level α), and (b) trying to construct test procedures that can detect signals f, with ‖f‖ ≥ δρ_n, with considerable power (power going to 1 as n → ∞). More specifically, δ and ρ_n are defined such that δρ_n is the largest for which, for all ε > 0, we have
lim sup_{n→∞} sup_{φ_n} inf_{g ∈ F : ‖g‖ ≥ (1−ε) δρ_n} E_g[φ_n(Y)] ≤ α,
where the supremum is taken over all sequences of level α tests φ_n. In this case ρ_n is called the minimax rate of testing and δ is called the exact separation constant (see Lepski and Tsybakov [30], Ingster and Stepanova [19] for more details about minimax testing). On the other hand, we want to find a test φ̂_n such that, for all ε > 0,
lim_{n→∞} inf_{g ∈ F : ‖g‖ ≥ (1+ε) δρ_n} E_g[φ̂_n(Y)] = 1.
In such a scenario, φ̂_n is called an asymptotically minimax test. Here we would also like to point out that if there exists a test φ̃_n and a constant δ̃ > δ such that
lim_{n→∞} inf_{g ∈ F : ‖g‖ ≥ δ̃ ρ_n} E_g[φ̃_n(Y)] = 1,
then the test φ̃_n is called a rate optimal test.
In Section 3 we show that our proposed multiscale statistic yields an asymptotically minimax test for the following scenarios:
1. (Optimality for Hölderian alternatives). Consider testing hypothesis (1.2). If
‖f‖_∞ ≥ c* (1 + ε_n) (log(en)/n)^{β/(2β+d)},
where f belongs to the Hölder class H_{β,L} with β > 0 and L > 0, ‖f‖_∞ := sup_{x∈[0,1]^d} |f(x)| denotes the sup-norm of f, and c* is a constant (defined explicitly in Theorem 3.1), we show that we can construct a level α test based on the multiscale statistic (1.8) that has power converging to 1, as n → ∞, provided ε_n does not go to 0 too fast (see Theorem 3.1 for the exact order of ε_n). We note that this multiscale statistic would require the knowledge of β but not of L.
Moreover, we show that if ‖f‖_∞ ≤ c* (1 − ε_n) (log(en)/n)^{β/(2β+d)} no test of level α ∈ (0, 1) can have nontrivial asymptotic power; see Theorem 3.1 for the details. This shows that our proposed multiscale test is asymptotically minimax with rate of testing ρ_n = (log(en)/n)^{β/(2β+d)} and exact separation constant δ = c*. As far as we are aware this is the first instance of an asymptotically minimax test for the Hölder class H_{β,L} when d > 1 (under the supremum norm). Moreover, if the smoothness β of the Hölder class H_{β,L} is unknown (but β ≤ 1) then we can still construct a rate optimal test for this problem; see Proposition 3.1 for the details.
2. (Optimality for axis-aligned hyperrectangular signals). Consider testing f = 0 against alternatives of the form f = µ_n I_{B_n} (see (1.10)), where µ_n ≠ 0 ∈ R and the hyperrectangle B_n := {x ∈ [0,1]^d : |x_i − t_i^{(n)}| ≤ h_i^{(n)} for all i = 1, . . . , d} are unknown, for some h^{(n)} ∈ (0, 1/2]^d and t^{(n)} ∈ A_{h^{(n)}}, and I_{B_n} denotes the indicator of the hyperrectangle B_n. First, consider the scenario lim inf_{n→∞} |B_n| > 0, where |B_n| denotes the Lebesgue measure of B_n. Then, if √n |µ_n| → +∞ as n → ∞, we can construct a level α test based on the multiscale statistic (1.8) that has power converging to 1 as n → ∞; see Theorem 3.2. Further, we show that, if lim sup_{n→∞} √n |µ_n| < ∞, no test of level α can detect the alternative with power going to 1. Thus, the multiscale test is optimal for detecting signals on large scales.
On the other hand, let us now consider the case lim_{n→∞} |B_n| = 0. If |µ_n| √(n|B_n|) ≥ (1 + ε_n) √(2 log(1/|B_n|)), for all n, we can construct a test of level α, based on the proposed multiscale statistic, that has power converging to 1 as n → ∞, provided ε_n does not go to 0 too fast (see Theorem 3.2). Furthermore, we can show that if |µ_n| √(n|B_n|) = (1 − ε_n) √(2 log(1/|B_n|)), for all n, no test can detect the signal reliably with nontrivial power (i.e., for any level α test φ_n there exists a signal f_n of the above described strength such that φ_n will fail to detect f_n with asymptotic probability at least 1 − α); see Theorem 3.2 for the details. This shows that our multiscale test is asymptotically minimax for signals at small scales. A small numerical illustration of these detection boundaries is given after this list.
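As a quick back-of-the-envelope illustration of the two boundaries above (all numerical values here are made up for illustration and are not taken from the paper), the Hölder separation rate and the small-scale boundary for rectangular signals can be computed directly:

```python
import numpy as np

n, d, beta = 100**2, 2, 1.0
rho_n = (np.log(np.e * n) / n) ** (beta / (2 * beta + d))  # minimax rate for the Holder problem

B = 1e-3                                  # Lebesgue measure of a small rectangle B_n
boundary = np.sqrt(2 * np.log(1 / B))     # boundary for |mu_n| * sqrt(n * |B_n|)
mu_min = boundary / np.sqrt(n * B)        # roughly the smallest detectable signal level

print(f"rho_n = {rho_n:.4f}, small-scale boundary = {boundary:.2f}, mu_min = {mu_min:.3f}")
```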

Literature review and connection to existing works
Our multiscale statistic (1.8) can be thought of as a penalized scan statistic, as it is based on the maximum of an ensemble of local test statistics |Ψ̂(t, h)|, penalized and properly scaled. Scan-type procedures have received much attention in the literature over the past few decades. Examples of such procedures can be found in Siegmund and Venkatraman [39], Siegmund and Yakir [40], Naus and Wallenstein [31], Kulldorff [29], Haiman and Preda [16], Jiang [26], etc. All the above mentioned papers consider d = 1 and no penalization term (like Γ(·) in our case) was used. Asymptotic properties of the scan statistic have been studied extensively. In Naus and Wallenstein [31] and Pozdnyakov et al. [33] the authors give asymptotic approximations of the distribution of the scan statistic when d = 1. For d = 2, similar results can be found in Glaz and Zhang [15], Haiman and Preda [16], Wang and Glaz [43], among others. Recently, in Sharpnack and Arias-Castro [38] the authors give exact asymptotics for the scan statistic for any dimension d.
In all of the above papers it is noted that the scan statistic is dominated by small scales; this creates a problem for detecting large scale signals. One common proposal to fix this problem is to modify the scan statistic so that instead of the maximum over all scales we look at the maximum over scales that are in an appropriate interval containing the true scale of the signal; see e.g., Sharpnack and Arias-Castro [38], Naus and Wallenstein [31]. In particular, the last two papers show that if the extent of the signal is of a certain order (log n) then this approach leads to power comparable to an oracle. An obvious drawback with the above approach is that we need to have some prior knowledge on which scales the signal(s) may be present. In contrast, our multiscale method does not require any such knowledge.
Another approach that has been proposed to optimally detect signals on both large and small scales is to use different critical values (of the scan statistic) to test for signals at different scales separately (see e.g., Chan and Walther [7], Walther [42]) and use multiple testing procedures (see Hall and Jin [17] and the references within) to calibrate the method. However, note that a vast majority of the multiple testing literature either assume that the test statistics are independent (which is not the case here) or are too generic and generally quite conservative.
Conceptually, our work is most related to that of Dümbgen and Spokoiny [10], where the authors proposed our multiscale statistic for d = 1. Thus, our work can be thought of as a generalization of Dümbgen and Spokoiny [10] to multiple dimensions (d > 1).

Organization of the paper
The proposed multiscale statistic is studied in Section 2. In Section 3 we construct optimal tests for: (i) f = 0 versus Hölderian alternatives; (ii) f = 0 versus alternatives of the form f = µ n I Bn , where B n is an axis-aligned hyperrectangle in [0, 1] d and µ n ∈ R (for different regimes of µ n and B n , both unknown). We compare the performance of our multiscale based test with other competing methods in Section 4. In Section 5 we discuss some open problems and possible applications/extensions of our work. Section 6 gives the proof of Theorem 2.1. The proofs of the other results are relegated to Appendix A.

Multidimensional multiscale statistic
Let us first recall the definition of the multivariate multiscale statistic T(Y, ψ) given in (1.8). The following theorem, our main result in this section, shows that the multiscale statistic T(Y, ψ) is well-defined and finite a.s. for any (reasonable) kernel function ψ; see Section 6.4 for a proof. Theorem 2.1 immediately extends the main result of Dümbgen and Spokoiny [10, Theorem 2.1] beyond d = 1. The proof of the above theorem crucially relies on the following two results. We first introduce some notation. Definition 2.1 (Packing number). For any pseudometric space (F, ρ) and ε > 0, the packing number N(ε, F) is defined as the supremum of the number of elements of a subset F′ ⊆ F such that ρ(a, b) > ε for all a ≠ b ∈ F′.
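Purely as an illustration of Definition 2.1 (this is not used anywhere in the paper), a greedy procedure yields a lower bound on the packing number of a finite point cloud under any chosen metric; all names and numbers below are our own.

```python
import numpy as np

def greedy_packing(points, eps, dist):
    # Greedily keep points that are pairwise more than eps apart;
    # the number kept is a lower bound on the packing number N(eps, .).
    chosen = []
    for p in points:
        if all(dist(p, q) > eps for q in chosen):
            chosen.append(p)
    return chosen

rng = np.random.default_rng(3)
pts = rng.random((500, 2))                                   # 500 points in [0, 1]^2
packing = greedy_packing(pts, 0.2, lambda a, b: np.abs(a - b).max())
print(len(packing))                                          # lower bound on N(0.2, pts) in sup-norm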
We will prove Theorem 2.1 as a consequence of the following more general result about stochastic processes with sub-Gaussian increments on some pseudometric space (see Section 6.2 for its proof).
Theorem 2.2. Let X be a stochastic process on a pseudometric space (F, ρ) with continuous sample paths. Suppose that the following three conditions hold: (a) there is a function σ : F → (0, 1] and a constant K ≥ 1 controlling the marginal tails of X; (b) the increments of X are sub-Gaussian with respect to ρ; and (c) the packing numbers of (F, ρ) satisfy a suitable polynomial-type bound (the constants K, L, M, A, B, p, V below appear in these conditions). Then the random variable
S(X) := sup_{a∈F} [ X²(a)/σ²(a) − 2V log(1/σ²(a)) ] / log log(e^e/σ²(a))    (2.1)
is finite almost surely. More precisely, P(S(X) > r) ≤ ξ(r) for some function ξ : R_+ → R depending only on the constants K, L, M, A, B, p, V such that lim_{r→∞} ξ(r) = 0. To apply Theorem 2.2 to prove Theorem 2.1 we need to define a suitable pseudometric space (F, ρ) and a stochastic process, and verify that conditions (a)-(c) in Theorem 2.2 hold. In that vein, let us define the set
F := {(t, h) : h ∈ (0, 1/2]^d, t ∈ A_h}
with the pseudometric
ρ((t, h), (t′, h′)) := |B_∞(t, h) Δ B_∞(t′, h′)|^{1/2},
where B_∞(t, h) := {x ∈ [0,1]^d : |x_i − t_i| ≤ h_i for all i = 1, . . . , d}, A Δ B denotes the symmetric difference of the sets A and B, and |A| denotes the Lebesgue measure of the set A. Also, define σ²(t, h) := |B_∞(t, h)| = 2^d h_1 ⋯ h_d. The following important result shows that indeed for the above defined pseudometric space (F, ρ) condition (c) of Theorem 2.2 holds.
Lemma 2.1. Let F, ρ(·, ·) and σ(·) be as described above. Then, for all u, δ ∈ (0, 1],
N(√(uδ), {(t, h) ∈ F : δ/2 < σ²(t, h) ≤ δ}) ≤ K u^{−2d} δ^{−1} (d + 2 log(e/δ))^{d−1}
for some constant K depending only on d.

Remark 2.2.
Here we would like to point out that Lemma 2.1 shows that condition (c) of Theorem 2.2 holds with B = 2d, p = d − 1 and most importantly for V = 1, which was also the case when d = 1 (as shown in Dümbgen and Spokoiny [10]). It is well-known that we should choose the constant V in the penalization term Γ V as small as possible (see e.g., König et al. [28, Section 1.1]) for optimal testing. In our proposed multiscale statistic we take V = 1. The following proposition shows that indeed V = 1 is the smallest possible permissible value; see Section A.1 for a proof.

Optimality of the multiscale statistic in testing problems
In this section we prove that we can construct tests based on the multiscale statistic that are optimal for testing (1.2) and (1.10). For both testing problems we can define a multiscale test based on a kernel ψ as follows. Let κ_{α,ψ} denote the (1 − α)-quantile of T(W, ψ), where W is the standard Brownian sheet on [0, 1]^d. For notational simplicity we will denote κ_{α,ψ} by κ_α from now on. For testing (1.2) and (1.10) a test of level α can then be defined as
φ(Y) := I{ T(Y, ψ) > κ_α },
i.e., we reject the null hypothesis if and only if T(Y, ψ) exceeds κ_α. Let us call this testing procedure the multiscale test. Although any kernel ψ can be used to construct the above test, in Sections 3.1 and 3.2 we show that specific choices of the kernel function ψ lead to asymptotically minimax tests.
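In practice κ_α has no closed form and is obtained by simulation (this is how Section 4 proceeds). The following is a rough d = 1 toy sketch of such a Monte Carlo recipe; the grid size, the restriction to dyadic windows, the omission of the standardizing factor D(·), and all variable names are our own simplifications, not the paper's choices.

```python
import numpy as np

def T_null(eps):
    # Discrete multiscale statistic of a pure-noise sequence eps (d = 1, box kernel).
    n = eps.size
    S = np.concatenate(([0.0], np.cumsum(eps)))
    best = -np.inf
    for k in 2 ** np.arange(int(np.log2(n)) + 1):            # dyadic window widths
        sums = S[k:] - S[:-k]                                 # all window sums of width k
        stat = np.abs(sums).max() / np.sqrt(k)                # sup_t |Psi_hat(t, h)| at this scale
        best = max(best, stat - np.sqrt(2 * np.log(n / k)))   # subtract the penalty Gamma(k / n)
    return best

rng = np.random.default_rng(1)
n_grid, n_rep, alpha = 512, 2000, 0.05
null_draws = [T_null(rng.standard_normal(n_grid)) for _ in range(n_rep)]
kappa_alpha = np.quantile(null_draws, 1 - alpha)              # Monte Carlo critical value
```

The multiscale test then rejects the null hypothesis when the statistic computed from the observed data exceeds kappa_alpha.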

Optimality against Hölder classes of functions
Let us recall the definition of the Hölder class of functions H_{β,L}, for β ∈ (0, 1] and L > 0, as in (1.3); see Definition 6.2 for the formal definition of H_{β,L} for any β > 0. Let ψ_β : R^d → R, for 0 < β < ∞, be the unique solution of the following optimization problem:
Minimize ‖ψ‖ over all ψ ∈ H_{β,1} with ψ(0) ≥ 1.    (3.1)
Elementary calculations show that for 0 < β ≤ 1, we have
ψ_β(x) = (1 − ‖x‖^β) I(‖x‖ ≤ 1);
see Section A.2 for a proof. For β > 1, ψ_β can be calculated numerically. We consider the kernel ψ_β, for β > 0, described above and state our first optimality result for testing (1.2); see Section A.3 for a proof.
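For concreteness, here is a small numeric sketch of ψ_β in d = 1 (so that the choice of norm is immaterial), together with Riemann approximations of ‖ψ_β‖_2 and ∫ ψ_β; this is purely illustrative and the grid resolution is an arbitrary choice of ours.

```python
import numpy as np

def psi_beta(x, beta):
    # psi_beta(x) = (1 - |x|^beta) on [-1, 1] and 0 outside (d = 1 version).
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= 1, 1 - np.abs(x) ** beta, 0.0)

grid = np.linspace(-1, 1, 200001)
dx = grid[1] - grid[0]
for beta in (0.5, 1.0):
    l2 = np.sqrt(np.sum(psi_beta(grid, beta) ** 2) * dx)   # approximates ||psi_beta||_2
    mass = np.sum(psi_beta(grid, beta)) * dx               # approximates the integral of psi_beta (> 0)
    print(beta, round(l2, 4), round(mass, 4))
```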
Then, for arbitrary ε_n > 0 with ε_n → 0 and ε_n √(log n) → ∞ as n → ∞, the following hold: (a) For any arbitrary sequence of tests φ_n with level α for testing (1.2), no alternative g ∈ H_{β,L} with ‖g‖_{J_n,∞} ≤ (1 − ε_n) c* ρ_n can be detected with nontrivial asymptotic power, where ‖g‖_{J_n,∞} := sup_{t∈J_n} |g(t)|.
The above result generalizes Dümbgen and Spokoiny [10, Theorem 2.2] beyond d = 1. Theorem 3.1 can be interpreted as follows: (a) for every test φ_n there exists a function with supremum norm (1 − ε_n) c* ρ_n which cannot be detected with nontrivial asymptotic power; whereas (b) when we restrict to functions with signal strength (i.e., supremum norm in the interior of [0,1]^d) just a bit larger than the above threshold, our proposed multiscale test is able to detect every such function with asymptotic power 1. In this sense our proposed test is optimal in detecting departures from the zero function for the Hölder classes H_{β,L}. We note here that to calculate T_β we need the knowledge of β but we do not need to know L.
If β is unknown, but is less than or equal to 1, we can use T_1 as a test statistic for testing (1.2). Although the resulting test is not asymptotically minimax, the test is still rate optimal. The following result (Proposition 3.1, in which M denotes a sufficiently large constant satisfying a lower bound involving d, L, β and ‖ψ_1‖) formalizes this; see Section A.3.2 for its proof.
Remark 3.1. If, instead of the test statistic T_β, we use the test statistic T̃ (defined in (3.2)) with the kernel ψ_β, then the same conclusions as those of Theorem 3.1 and Proposition 3.1 hold. Thus that multiscale statistic is also optimal against Hölderian alternatives.
Remark 3.2. Note that in König et al. [28] the authors propose a multiscale statistic like T_β, with a slightly different penalization term instead of Γ(·). A close inspection of our proof of Theorem 3.1 reveals that for such a statistic, only signals with ‖g‖_{J_n,∞} ≥ √V (1 + ε_n) c* ρ_n will be detected with power converging to 1. This shows how a proper penalization (as in our multiscale statistic) can lead to the testing procedure attaining the exact separation constant for testing (1.2).

Optimality against axis-aligned hyperrectangular signals
In Theorem 3.1 we proved the optimality of the multiscale test when the supremum norm of the signal is large. A natural question that arises next is: "What if the signal is not peaked but distributed evenly on some subset of [0, 1] d ?". To answer this question we look at the testing problem (1.10), and establish below the optimality of our multiscale test in this setting (see Section A.3 for a proof of Theorem 3.2). Note that when d = 1 similar optimality results are known for the multiscale statistic; see Frick et al. [12,Theorem 2.6] and Chan and Walther [7].
Theorem 3.2. Let f_n = µ_n I_{B_n} where B_n is an axis-aligned hyperrectangle and let |B_n| denote the Lebesgue measure of the set B_n. Then we have the following results: (a) Suppose that lim inf_{n→∞} |B_n| > 0. Let φ_n be any test of level α ∈ (0, 1) for (1.10). Then, for any f_n = µ_n I_{B_n} such that lim sup_n |µ_n| √(n|B_n|) < ∞, φ_n cannot detect f_n with power converging to 1. Moreover, if |µ_n| √(n|B_n|) → ∞, the proposed multiscale test based on T has power converging to 1. (b) Suppose now that lim_{n→∞} |B_n| = 0, and let G⁻_n denote the class of alternatives f_n = µ_n I_{B_n} with |µ_n| √(n|B_n|) ≤ (1 − ε_n) √(2 log(1/|B_n|)), where ε_n → 0 and ε_n √(2 log(1/|B_n|)) → ∞. (Here we have omitted the dependence on h_n in the notation G⁻_n.) If φ_n is any test of level α ∈ (0, 1) for (1.10) then lim sup_{n→∞} inf_{f ∈ G⁻_n} E_f[φ_n(Y)] ≤ α. Moreover, let G⁺_n denote the corresponding class of alternatives with |µ_n| √(n|B_n|) ≥ (1 + ε_n) √(2 log(1/|B_n|)). Then for our multiscale test the power converges to 1 uniformly over G⁺_n. If we use the test statistic T̃, as defined in (3.2) (with the kernel ψ_0), instead of T in Theorem 3.2, the optimality results described in the theorem still hold.
Our first result in Theorem 3.2 shows that as long as lim inf_{n→∞} |B_n| > 0, for any test to have power converging to 1 we need to have |µ_n| √(n|B_n|) → ∞, in which case our multiscale test achieves asymptotic power 1. Thus our multiscale test is optimal for detecting large scale signals. The next result can be interpreted as follows: (i) for signals with small spatial extent (i.e., lim_{n→∞} |B_n| = 0), if the signal strength is too small (|µ_n| √(n|B_n|) ≤ (1 − ε_n) √(2 log(1/|B_n|))) no test can detect the signal reliably with nontrivial probability (i.e., for every test φ_n there exists a signal such that φ_n will fail to detect it with probability 1 − α + o(1)); (ii) on the other hand, if the signal strength is a bit larger than the threshold (i.e., the exact separation constant) described above, our multiscale test will detect the signal with asymptotic power 1. This shows that our multiscale test achieves optimal detection for signals with small spatial footprint. We would like to emphasize here that by using the exact same test (with the same kernel ψ_0) we are able to optimally detect both large and small scale signals.
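The large-scale boundary in part (a) comes from an oracle argument: with the box kernel and the true rectangle known, Ψ̂(t_n, h_n) is, under our reading of the normalization, N(µ_n √(n|B_n|), 1) under the alternative, so the oracle power is Φ(µ_n √(n|B_n|) − z_{1−α}). A small numeric illustration (all values made up):

```python
import numpy as np
from scipy.stats import norm

n, alpha = 100**2, 0.05
z = norm.ppf(1 - alpha)
for B, mu in [(0.25, 0.05), (0.25, 0.002), (1e-3, 1.5), (1e-3, 0.6)]:
    shift = mu * np.sqrt(n * B)                    # noncentrality mu * sqrt(n |B|)
    oracle_power = 1 - norm.cdf(z - shift)         # power of the oracle (known-B) test
    small_scale_boundary = np.sqrt(2 * np.log(1 / B))
    print(B, mu, round(shift, 2), round(small_scale_boundary, 2), round(oracle_power, 3))
```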
Remark 3.4. As we mentioned in Remark 3.2, if we used Γ_V(·) (see (3.3)), for V > 1, instead of Γ(·), in defining the multiscale statistic, then we would only be able to detect signals (when |B_n| → 0) if |µ_n| √(n|B_n|) ≥ √V (1 + ε_n) √(2 log(1/|B_n|)), which is not the exact separation constant as mentioned in Theorem 3.2. This again illustrates the importance of choosing the right penalization term Γ(2^d h_1 ⋯ h_d) in defining the multiscale statistic.
Remark 3.5. Here we would like to point out that the proofs of the minimax lower bounds derived for the two scenarios in Theorems 3.1 and 3.2 follow the standard techniques used in Ingster [20], Ingster [21], Ingster [22], Lepski and Tsybakov [30], Dümbgen and Spokoiny [10], Ingster and Sapatinas [23], etc.

Comparison with the scan and average likelihood ratio statistics when d = 1
When d = 1 there exists an extensive literature on the optimal detection threshold for signals of the form f_n = µ_n I_{B_n}, where now B_n ⊆ [0, 1] is an interval. In Chan and Walther [7] the authors compare the performance of the scan statistic (i.e., the statistic (1.7) in the discrete setup with ψ = I_{[−1,1]}) and the average likelihood ratio (ALR) statistic (again in the discrete setup).
When lim inf_{n→∞} |B_n| > 0 the scan statistic can only detect the signal, with asymptotic power 1, when |µ_n| √n ≥ (1 + ε_n) √(2 log n), whereas the ALR statistic (and the proposed multiscale statistic) can detect the signal whenever we have |µ_n| √n → ∞ (which is a less stringent condition). Note that |µ_n| √n → ∞ is also required for any test to detect the signal with asymptotic power 1. This shows that the scan statistic is not optimal for detecting large scale signals.
On the other hand, if lim_{n→∞} |B_n| = 0, the scan statistic can detect the signal if |µ_n| √(n|B_n|) ≥ (1 + ε_n) √(2 log n) whereas the ALR statistic can detect the signal when |µ_n| √(n|B_n|) ≥ √2 (1 + ε_n) √(2 log(1/|B_n|)). The optimal detection threshold in this scenario is |µ_n| √(n|B_n|) ≥ (1 + ε_n) √(2 log(1/|B_n|)), which is attained by the multiscale statistic. Thus the scan statistic is optimal in detecting signals only when |B_n| = O(1/n). The ALR statistic requires the signal to be at least √2 times the (detectable) threshold. This shows that neither the standard scan nor the ALR statistic is able to achieve the optimal threshold for detecting small scale signals. A quick numerical comparison of these three requirements is given below.
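With illustrative values of n and |B_n| (our choices, not the paper's), the three required signal strengths for |µ_n| √(n|B_n|) can be tabulated directly:

```python
import numpy as np

n = 10**4
for B in (1e-1, 1e-2, 1e-3, 1.0 / n):
    optimal = np.sqrt(2 * np.log(1 / B))   # multiscale / optimal boundary
    alr = np.sqrt(2) * optimal             # ALR needs sqrt(2) times the boundary
    scan = np.sqrt(2 * np.log(n))          # scan needs sqrt(2 log n) regardless of |B|
    print(f"|B| = {B:8.4f}: optimal {optimal:5.2f}, ALR {alr:5.2f}, scan {scan:5.2f}")
```

Note that the scan requirement matches the optimal one only when |B| is of order 1/n, in line with the discussion above.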
Frick et al. [12, Theorem 2.6] show the optimality of the multiscale statistic (which is a modification of the scan statistic) in detecting signals in both cases when d = 1. In Rivera and Walther [35] and Chan and Walther [7] the authors propose a condensed ALR statistic which, much like the multiscale statistic, is able to attain the optimal threshold for detection in both regimes of B_n. As far as we are aware the condensed ALR statistic has not been extended beyond d = 1 and therefore whether it achieves the optimal threshold for d > 1 is not known. In summary, Theorem 3.2 shows that our multidimensional multiscale test is asymptotically minimax even when d > 1.

Simulation studies
In this section we demonstrate the performance of the multiscale testing procedure described in Section 3 and compare it with other competing methods through simulation studies. For computational tractability, we replace the continuous white noise model (1.1) with a discrete one and consider the case d = 2. More specifically, we consider data on the m × m grid S_n = {(i/m, j/m) : 1 ≤ i, j ≤ m} (here n = m²), where the model is
y_{ij} = f(i/m, j/m) + ε_{ij},   for i, j = 1, . . . , m,
with ε_{ij} i.i.d. standard Gaussian errors. In Table 1 we give the empirical 0.95-quantile κ_{0.05} of the multiscale statistic T(W, ψ) (see (1.8)) for different values of n = m²; the computation of the empirical quantiles was based on 3000 replications. Observe that the empirical quantiles seem to stabilize as m increases beyond 100. (The true 0.95-quantiles necessarily increase as n increases, but in our simulations the empirical quantile for n = 150² turned out to be slightly less than that for n = 125² due to sampling variability.) Figure 1 shows the empirical distribution function estimates, based on 3000 replications, of the multiscale statistic for different values of n.
In Tables 2 and 3 we compare the powers of the multiscale test, a test based on a scan statistic, and the ALR test (see Chan and Walther [7] for the details). Formally, we consider testing (1.10) against alternatives of the form H_1 : f = µ_n I_{B_n}, for both small and large scale signals B_n. We briefly describe the above two competing procedures. The scan statistic M_n is the maximum, over the rectangles scanned, of the suitably normalized sums of the observations, without any scale-dependent penalization; the ALR test statistic A_n (see Chan [6]) averages the corresponding likelihood ratios over the scanned rectangles. Table 3 reports the power of the scan, the multiscale and the ALR tests for m = 100 (i.e., n = 100²) as µ changes.
The scan test (resp. ALR test) rejects the null hypothesis if the observed M_n (resp. A_n) exceeds the 0.95-quantile of M_n (resp. A_n) under the null hypothesis. In Tables 2 and 3 we compare the performance of the three procedures. Here µ denotes the signal strength, and k/m denotes the length of each side of the square signal B_n (here m = 40 and 100 for the two cases). The powers of the tests were calculated using 1000 replications.
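For concreteness, the following is a minimal sketch of the (unpenalized) scan statistic M_n over square blocks of the grid; we restrict to dyadic side lengths for speed, and we omit the ALR statistic A_n (see Chan [6] and Chan and Walther [7] for its exact form). All names and choices here are ours.

```python
import numpy as np

def scan_stat(y, sides):
    # M_n: maximum over the chosen square blocks of the normalized block sum.
    # Unlike the multiscale statistic, there is no scale-dependent penalization.
    m = y.shape[0]
    S = np.zeros((m + 1, m + 1))
    S[1:, 1:] = y.cumsum(axis=0).cumsum(axis=1)
    best = -np.inf
    for k in sides:
        for i in range(m - k + 1):
            for j in range(m - k + 1):
                block = S[i + k, j + k] - S[i, j + k] - S[i + k, j] + S[i, j]
                best = max(best, abs(block) / k)   # sd of a k x k block sum of N(0,1)'s is k
    return best

rng = np.random.default_rng(4)
m = 40
sides = (1, 2, 4, 8, 16)
M_null = scan_stat(rng.standard_normal((m, m)), sides)   # one draw under the null
```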
We make the following observations. For both cases (m = 40 and 100), when the signal is at the smallest scale, e.g., k = 1, the scan statistic outperforms everything else. However, when m = 100, even at relatively small scales, e.g., k = 8 (i.e., about 0.6% of the observations contain the signal), our multiscale test starts to outperform the scan test. Note that in this setting (small scales) the ALR performs the worst. As the spatial extent of the signal increases, our multiscale procedure and the ALR procedure start performing favorably whereas the performance of the scan statistic deteriorates. Thus, the simulation experiments corroborate our theoretical findings.

Discussion
In this paper we have proposed a multidimensional multiscale statistic in the continuous white noise model and used this statistic to construct asymptotically minimax tests for testing f = 0 against (i) Hölder classes of functions; and (ii) alternatives of the form f = µ n I Bn , where B n is an unknown axis-aligned hyperrectangle in [0, 1] d and µ n ∈ R is unknown. However, there are many open questions in this area. We briefly delineate a few of them below and in the process describe some important papers in related areas of research.
We have shown that for the Hölder class H_{β,L}, if the smoothness parameter β is known, we can construct an asymptotically minimax test. However, if β is unknown (and β ≤ 1) we can only construct a rate optimal test. A natural question that arises is whether a test can be constructed that is asymptotically minimax (for the Hölder class of functions with the supremum norm) without the knowledge of the smoothness parameter β (and L > 0); see Ji and Nussbaum [25, Section 1.3]. Another interesting question would be to try to extend our results to other smoothness classes like Sobolev/Besov classes; in Ingster and Stepanova [19] the authors gave the minimax rate of testing for a Sobolev class, but no test was proposed that achieves the exact separation constant.
Note that we have shown that our multiscale test is asymptotically minimax for detecting the presence of a signal on an axis-aligned hyperrectangle in [0,1]^d. One obvious extension of our work would be to correctly identify the hyperrectangle on which the signal is present. Further, we could go beyond hyperrectangles and try to identify signals that are present on some other geometric structures A ⊂ [0,1]^d (i.e., f = µ I_A where A is not necessarily an axis-aligned hyperrectangle). Examples of such geometric structures could be: (i) A is a hyperrectangle which is not necessarily axis-aligned, (ii) A is a d-dimensional ellipsoid, (iii) A = ∪_{i=1}^k A_i where each A_i ⊆ [0,1]^d is an (axis-aligned) hyperrectangle, etc. Frick et al. [12] and the references therein investigated the problem of finding change points in d = 1, which can be thought of as detection of multiple intervals. In Arias-Castro et al. [3] the authors use the scan statistic to detect regions in R^d where the underlying function is non-zero. Arias-Castro et al. [2] considers the problem of finding a cluster of signals (not necessarily rectangular) in a network using the scan statistic. Although the method they propose achieves the optimal boundary for detection, it requires the knowledge of whether the signal shape is "thick" or "thin"; for hyperrectangles this refers to whether or not the minimum side length is of order log n/n. We believe that the multiscale statistic, with proper modifications, can be used to find asymptotically minimax/rate optimal tests in such problems.
In our white noise model (1.1) we assume that the distribution of the response variables is (homogeneous and independent) Gaussian. Similar questions about signal detection can be asked when the response is non-Gaussian; see e.g., König et al. [28], Chan and Walther [8], Rivera and Walther [35], Walther [42] etc. In Pein et al. [32] the authors looked at the problem of detecting change points under heterogeneous variance of the response variable (when d = 1). Rohde [36] looked at this problem where the error distribution is known to be symmetric (when d = 1). Walther [42] studied a similar problem where the response variable is binary. A multiscale approach could be used to tackle such problems as well.
Several interesting applications of the multiscale approach exist when d = 1 (following the seminal paper of Dümbgen and Spokoiny [10]): In Dümbgen and Walther [11] the authors propose a multiscale test statistic to make inference about a probability density on the real line given i.i.d. observations; Schmidt-Hieber et al. [37] use multiscale methods to make inference in a deconvolution problem; Rivera and Walther [35] use multiscale methods to detect a jump in the intensity of a Poisson process, etc. We believe that our extension beyond d = 1 will also lead to several interesting multidimensional applications.

Acknowledgements
The authors would like to thank Lutz Dümbgen and Sumit Mukherjee for several helpful discussions.

Some useful concepts
In this subsection we formally define some technical concepts that we use in this paper. In the following we give some useful properties of a Brownian sheet W (·).
• If g ∈ L_2([0,1]^d) then ∫ g dW is a centered Gaussian random variable with variance ‖g‖_2². This, in turn, implies that for any measurable function φ we have E[φ(∫ g dW)] = E[φ(‖g‖_2 Z)], where Z is a standard normal random variable.
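A minimal simulation check of this property (our own discretization, not from the paper): approximate dW on an m × m grid by independent N(0, 1/m²) cell increments and compare the empirical variance of ∫ g dW with ‖g‖_2².

```python
import numpy as np

rng = np.random.default_rng(2)
m, reps = 64, 4000
x = (np.arange(m) + 0.5) / m
g = np.add.outer(np.sin(2 * np.pi * x), np.cos(2 * np.pi * x))   # some g in L_2([0,1]^2)
g_norm_sq = (g ** 2).mean()                                      # Riemann approx of ||g||_2^2

# each cell of area 1/m^2 gets an independent N(0, 1/m^2) increment of W
draws = np.array([(g * rng.standard_normal((m, m)) / m).sum() for _ in range(reps)])
print(draws.var(), g_norm_sq)   # the two numbers should be close
```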
Let us now define the Hölder class of functions H_{β,L}, for β > 0 and L > 0. Remark 6.1. One of the most important properties of H_{β,L} that we will use is the following: if f ∈ H_{β,1} then, for any h = (h_1, . . . , h_d) > 0 and t ∈ A_h, where min(h) := min_{i=1,...,d} h_i.
We say a function f has bounded HK-variation if T V (f ) < ∞.
The main property of a bounded HK-variation function that we will need in this paper is stated below.

Proof of Theorem 2.2
In the following proofs, K will be used to denote a generic constant whose value may change from line to line.
For every v > 0, we define F(v) := {a ∈ F : v/2 < σ²(a) ≤ v}. For simplicity we divide the proof into three steps.
Step 1: In this step we will prove that, for all η > 0 and v ∈ (0, 1], (6.1) holds, where K > 0 is a positive constant not depending on v. We will prove the above result by introducing the notion of an Orlicz norm. Let λ : R_+ → R be a nondecreasing convex function with λ(0) = 0. For any random variable X the Orlicz norm ‖X‖_λ is defined as ‖X‖_λ := inf{c > 0 : E[λ(|X|/c)] ≤ 1}. The Orlicz norm is of interest to us as any Orlicz norm easily yields a bound on the tail probability of a random variable, i.e., P(|X| > x) ≤ 1/λ(x/‖X‖_λ) for all x > 0. Suppose further that the increments of X satisfy an Orlicz-norm bound with respect to some pseudometric ρ on F and constant C. Then, for any ζ, v > 0, Lemma 6.1 provides a maximal inequality with a constant K depending only on λ and C.
By taking δ = 1 and ε = u^{1/2}, condition (c) of Theorem 2.2 yields N(ε, F) ≤ A ε^{−2B}. Thus, Lemma 6.1 gives (with ζ = v) a bound whose right side can easily be shown to be less than or equal to K v log(e/v) for some constant K. This result, along with an application of (6.2) with Γ(X, v) instead of X, implies (6.1) for some constant K.
Thus, both terms on the right side of (6.9) have the form K exp[(C − S/K′) log log(e^e/δ)] for some constants K, C, K′ > 0. Putting these values in (6.9), we get, for suitable constants K, K′ > 0, Π(δ) ≤ K exp((K′ − S/K) log log(e^e/δ)).
Therefore, as F = ∪_{l≥0} F(2^{−l}), the desired tail bound follows. First let us define the sets F_{δ,(l_1,...,l_d)}. We note that F_{δ,(l_1,...,l_d)} is empty unless appropriate bounds hold on each l_i, i = 1, . . . , d (this restriction is a consequence of the fact that h_i ≤ 1/2), and on their sum (this restriction is a consequence of the fact that δ/2 < σ²(t, h) ≤ δ).
Step 1: First, we will show that (6.10) holds for any (l_1, . . . , l_d) ∈ Z^d and δ, u ∈ (0, 1]. Let F′ be a subset of F_{δ,(l_1,...,l_d)} such that for any two elements (t, h), (t′, h′) ∈ F′ we have ρ²((t, h), (t′, h′)) > uδ. (6.11) Our aim is to show that |F′| ≤ K u^{−2d} δ^{−1}, for some constant K independent of (l_1, . . . , l_d), u and δ. If F_{δ,(l_1,...,l_d)} is empty then the assertion is trivial. So assume that F_{δ,(l_1,...,l_d)} is non-empty, which imposes bounds on the l_i's as shown above. Let us define a partition R of [0, 1]^d into disjoint hyperrectangles whose k-th sides have length c^{−1} u δ^{1/d} 2^{l_k}, where we take c := d 4^d. We would like to point out that in the above definition, when i_k = 1 for any k = 1, . . . , d, by ((i_k − 1) c^{−1} u δ^{1/d} 2^{l_k}, i_k c^{−1} u δ^{1/d} 2^{l_k}] we mean the closed interval [0, c^{−1} u δ^{1/d} 2^{l_k}]. Observe that all the sets in R are disjoint; a volume computation then bounds the number of hyperrectangles in R, and here the last inequality follows from the fact that ∑_{i=1}^d l_i ≥ −(d + 1). Let us define the set R_2 of pairs of elements of R satisfying (6.12), for all k = 1, . . . , d. Thus, our proof will be complete if we can show that |R_2| = |F′|. From the definition of R_2 and the fact that the elements in R are disjoint it is easy to observe that |R_2| ≤ |F′|.
Therefore, the only thing left to show is that |F′| ≤ |R_2|. Let us assume the contrary, i.e., |R_2| < |F′|. This implies that there exist two elements (t, h) and (t′, h′) ∈ F′ and a pair (M_ĩ, M_ĩ′) ∈ R_2 such that both t − h and t′ − h′ belong to M_ĩ and, also, t + h and t′ + h′ belong to M_ĩ′. Let us first define the following two hyperrectangles B_1 and B_2. Our goal is to show that (6.13) holds, which is implied by the following two assertions. See the figure below for a visual illustration of (6.13) when d = 2. Now, B_∞(t, h) ⊆ B_1, and a similar argument shows that B_∞(t′, h′) ⊆ B_1; hence assertion (1) above holds. Likewise, B_2 ⊆ B_∞(t, h), and a similar argument shows that B_2 ⊆ B_∞(t′, h′); therefore, assertion (2) is also satisfied. Now let us define the following set. Clearly, using (6.12), its cardinality is controlled. Also see that w = (w_1, . . . , w_d) ∈ B_1 \ B_2 if and only if (1) for every k = 1, . . . , d, we have w_k ∈ (i_k − 1, i_k] × c^{−1} u δ^{1/d} 2^{l_k} (this is true as w ∈ B_1), and (2) there exists l ∈ {1, 2, . . . , d} such that w_l falls in one of the two boundary subintervals of length c^{−1} u δ^{1/d} 2^{l_l} (this is true as w ∉ B_2 together with w ∈ B_1). Therefore, using (6.13) and the fact that c = d 4^d, we easily see that the resulting bound contradicts (6.11). This proves that two elements of F′ cannot correspond to the same pair of hyperrectangles. Hence we have proved (6.10).
Note that the power of (d + 2 log(e/δ)) in the above display is d − 1 because if we fix the values of l_1, l_2, . . . , l_{d−1} then l_d can only take at most (d + 1) values such that (l_1, l_2, . . . , l_d) ∈ S (as ∑_{k=1}^d l_k can take at most d + 1 distinct values). Also note that the above representation of F(δ), along with the trivial fact noted above, gives us (6.14).

A.1. Proof of Proposition 2.1
The proof of this result follows from the following well-known fact. Suppose that Z_1, Z_2, . . . , Z_n are i.i.d. standard normal random variables. Then we know that max_{1≤i≤n} Z_i / √(2 log n) → 1 in probability as n → ∞. Let F_n be the distribution function of max_{1≤i≤n} Z_i / √(2 log n), i.e., F_n(x) := P(max_{1≤i≤n} Z_i ≤ x √(2 log n)), for x ∈ R. Therefore, for every x < 1, we have F_n(x) → 0. We want to show that the corresponding supremum statistic diverges; hence it is enough to show that for every s ∈ R the corresponding exceedance probability tends to 1. We now turn to the proof that ψ_β minimizes (3.1). Let ψ ∈ H_{β,1} be such that ψ(0) ≥ 1. Hence, by the property of H_{β,1}, we have ψ(x) ≥ 1 − ‖x‖^β for ‖x‖ ≤ 1. Hence the only thing left to prove is that ψ_β ∈ H_{β,1}. Here the third inequality follows from the fact that when β ≤ 1 the function u → u^β is a β-Hölder continuous function; the last inequality follows from the triangle inequality. If x, y ∈ R^d are such that ‖x‖ ≥ 1 ≥ ‖y‖ then we have the required bound, and if x, y ∈ R^d are such that ‖x‖ ≥ ‖y‖ ≥ 1 then the assertion is trivial. Hence we have proved that ψ_β minimizes (3.1).

A.3. Proofs of Theorems 3.1 and 3.2
The proofs of Theorems 3.1 and 3.2 depend on the following lemma (stated and proved in Dümbgen and Spokoiny [10, Lemma 6.2]).
where min(h) := min{h_1, h_2, . . . , h_d}. Elementary calculations show that g_t ∈ H_{β,L} and ‖g_t‖_∞ = L min(h)^β. Now let us define the set S of such points t. Let φ_n be an arbitrary test for (1.2) with level α. Here P_0 denotes the measure of the process Y under the null hypothesis f = 0 and P_{g_t} denotes the measure of Y under the alternative f = g_t. Also, for g ∈ H_{β,L}, dP_g/dP_0 denotes the Radon-Nikodym derivative of the measure P_g with respect to the measure P_0. By the Cameron-Martin-Girsanov theorem (see Protter [34, Chapter 3] for more details about absolutely continuous measures and Radon-Nikodym derivatives) we get that
log (dP_g/dP_0)(Y) = √n ∫ g dW − (n/2) ‖g‖².
Then Γ_t = exp(w_n Z_t − w_n²/2) and we can write the average likelihood ratio accordingly. According to Lemma A.1 the above term will go to zero if |S| → ∞ and the corresponding w_n's satisfy the condition stated there. As n → ∞, |S| → ∞, and for all large n the ratio involving log |S| is controlled. Similarly, for suitable constants K, K′ > 0, the required bound follows, as the o(1) term above is positive when n is large. This proves part (a) of Theorem 3.1 by noting that L min(h)^β = (1 − ε_n) c* ρ_n.
For notational simplicity, in the following we drop the subscript n. As the term D(2^d h_1 ⋯ h_d) is bounded from above, for any t ∈ J, the probability of rejecting the null hypothesis, P_g(T_β(Y) > κ_α), is bounded from below, for some constant K > 0, by an expression involving Φ, where Φ is the standard normal distribution function. Hence, to prove our claim it suffices to show that the corresponding argument of Φ diverges uniformly for all g ∈ H_{β,L} such that ‖g‖_{J,∞} ≥ δ. Note that A_h = J. Let g be any such function, and let t ∈ J be such that |g(t)| ≥ δ. Let us assume that g(t) ≥ δ; the other case where g(t) ≤ −δ can be handled similarly by looking at −g. By construction of ψ_β we have δ ψ^{(β)}_{t,h} ∈ H_{β,L}. Also note that as ψ_β minimizes ‖ψ‖ in the set {ψ ∈ H_{β,1} : ψ(0) ≥ 1}, δ ψ^{(β)}_{t,h} minimizes ‖ψ‖ in the set {ψ ∈ H_{β,L} : ψ(t) ≥ δ}. Note that both g and δ ψ^{(β)}_{t,h} belong to the closed convex set {ψ ∈ H_{β,L} : ψ(t) ≥ δ}. As δ ψ^{(β)}_{t,h} is the projection of the zero function onto the above closed convex set, we have ⟨g, δ ψ^{(β)}_{t,h}⟩ ≥ ‖δ ψ^{(β)}_{t,h}‖². Thus,
(1 + ε_n) √((2d/(2β + d)) log n) − √(K + (2d/(2β + d)) log(n/ log n)) ≥ ε_n (2d/(2β + d))^{1/2} (log n)^{1/2} + o(1) → ∞.
This proves part (b) of Theorem 3.1.
Now we want to bound |⟨g, ψ_{t,h}⟩| from below uniformly for all g ∈ H_{β,L} such that ‖g‖_{J_n,∞} ≥ M ρ_n. Without loss of generality, let us assume that g(t) ≥ M ρ_n for some t ∈ J_n and g ∈ H_{β,L}. This shows that if ‖x − t‖ ≤ h̃ then g(x) ≥ 0. Hence the inner product can be bounded from below; here the last equality follows as ψ_β(x) = (1 − ‖x‖^β) I(‖x‖ ≤ 1). Also note that
Γ(2^d h̃^d) = √( 2d log(1/2) + (2d/β) log(L/M) + (2d/(2β + d)) log(n/ log n) ) ≤ √( (2d/(2β + d)) log n )
for large n. Therefore, for large n, the rejection probability converges to 1; here the last equality holds by the choice of M, as the term involving ⟨ψ_β, ψ_1⟩/‖ψ_1‖ exceeds √((2d/(2β + d)) log n).

A.3.3. Proof of Theorem 3.2
Proof of part (a). Let us suppose that B_n := B_∞(t_n, h_n) ⊆ [0, 1]^d for some t_n, h_n ∈ [0, 1]^d. Let us first look at the case when lim inf_{n→∞} |B_n| > 0. Now assume that the location B_n was known and it was also known that µ_n > 0. In such a scenario the best test statistic would be Ψ̂(t_n, h_n) (with kernel ψ_0), which follows the normal distribution with mean 0 and variance 1 under the null hypothesis. Hence in this case the UMP test rejects H_0 : µ_n = 0 if Ψ̂(t_n, h_n) > z_{1−α}, where z_{1−α} is the (1 − α)-quantile of the standard normal distribution. When B_n is not known then, obviously, the power of any test φ_n is at most that of the test described above. Hence,
E_{f_n}[φ_n(Y)] ≤ P_{µ_n}(Ψ̂(t_n, h_n) ≥ z_{1−α}) = P_0(Ψ̂(t_n, h_n) + √(n|B_n|) µ_n ≥ z_{1−α}) = 1 − Φ(z_{1−α} − √(n|B_n|) µ_n),
which converges to 1 only if µ_n √(n|B_n|) → ∞.
Here the last inequality follows from the fact that as lim inf n |B n | > 0, Γ(|B n |) + κ α D(|B n |) is bounded from above (say, by K) for all large n.
Without loss of generality also assume that µ_n > 0. Recall that B_n = B_∞(t_n, h_n) for h_n = (h_{1,n}, . . . , h_{d,n}) ∈ (0, 1/2]^d. Let us first define a grid G_{h_n} of points t such that the hyperrectangles B_∞(t, h_n), t ∈ G_{h_n}, are disjoint. Clearly |G_{h_n}| ≤ 1/|B_n|. Also, as n → ∞, |G_{h_n}| |B_n| → 1. For each t ∈ G_{h_n} define f_t := µ_n I_{B_∞(t,h_n)}. Clearly, as |B_n| = |B_∞(t, h_n)|, we have f_t ∈ G⁻_n. Let φ_n be a test of level α for testing (1.10). Similar arguments as in (A.1) show that the worst-case power over G⁻_n is bounded by an average over t ∈ G_{h_n}. Now, by an argument similar to that in the proof of Theorem 3.1, we have
log (dP_{f_t}/dP_0)(Y) = √n ∫ f_t dW − n ‖f_t‖²/2 = µ_n √(n|B_n|) Ψ̂(t, h_n) − µ_n² n |B_n|/2.
This completes the proof of Theorem 3.2.