A Sharp Lower-tail Bound for Gaussian Maxima with Application to Bootstrap Methods in High Dimensions

Although there is an extensive literature on the maxima of Gaussian processes, there are relatively few non-asymptotic bounds on their lower-tail probabilities. The aim of this paper is to develop such a bound, while also allowing for many types of dependence. Let $(\xi_1,\dots,\xi_N)$ be a centered Gaussian vector with standardized entries, whose correlation matrix $R$ satisfies $\max_{i\neq j} R_{ij}\leq \rho_0$ for some constant $\rho_0\in (0,1)$. Then, for any $\epsilon_0\in(0,\sqrt{1-\rho_0})$, we establish an upper bound on the probability $\mathbb{P}(\max_{1\leq j\leq N} \xi_j\leq \epsilon_0\sqrt{2\log(N)})$ in terms of $(\rho_0,\epsilon_0,N)$. The bound is also sharp, in the sense that it is attained up to a constant, independent of $N$. Next, we apply this result in the context of high-dimensional statistics, where we simplify and weaken conditions that have recently been used to establish near-parametric rates of bootstrap approximation. Lastly, an interesting aspect of this application is that it makes use of recent refinements of Bourgain and Tzafriri's "restricted invertibility principle".


INTRODUCTION
The maxima of Gaussian processes play an essential role in many aspects of probability and statistics, and the literature describing them is highly developed [e.g. Leadbetter et al., 1983, Adler, 1990, Lifshits, 1995, Ledoux and Talagrand, 2013, Talagrand, 2014]. Within this area, a variety of questions are related to showing that the maximum of a process is unlikely to deviate far above, or below, its mean. However, in comparison to the set of tools for handling the upper tail of a maximum, there are relatively few approaches for the lower tail. (Additional commentary related to this distinction may be found in [Talagrand, 2014, p. viii], [Li and Shao, 2001, Sec. 4.2], and [Lifshits, 1995, Sec. 18].) In this paper, our goal is to derive lower-tail bounds for Gaussian maxima that are motivated by statistical applications involving bootstrap methods in high dimensions. We desire bounds that are general enough to handle many types of correlation structures, yet also precise enough to yield explicit rates of convergence in distributional approximation results.
To describe the setting of our lower-tail bounds, let $\xi = (\xi_1,\dots,\xi_N)$ be a Gaussian vector with a correlation matrix $R \in \mathbb{R}^{N\times N}$, as well as $\mathbb{E}(\xi_j) = 0$ and $\mathrm{var}(\xi_j) = 1$ for all $1\leq j\leq N$. We will consider the situation where there is a fixed constant $\rho_0 \in (0,1)$ such that

(1.1) $\max_{i\neq j} R_{ij} \;\leq\; \rho_0$.

In addition, we will consider some relaxations of this condition, where $\rho_0 = 1$, or where only a subset of the off-diagonal entries of $R$ are bounded above by a given constant. (See Corollary 2.1 and Theorem 2.3.) It is also worth noting that (1.1) does not require $R$ to be invertible. Letting the maximum entry of $\xi$ be denoted as $M_N(\xi) = \max_{1\leq j\leq N}\xi_j$, we seek non-asymptotic upper bounds on the probability $\mathbb{P}(M_N(\xi) \leq t)$, where $t$ is a suitable point in the lower tail.
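As a concrete illustration (ours, not from the paper), this lower-tail probability can be estimated by Monte Carlo in the equicorrelated case $R_{ij} = \rho_0$ for $i \neq j$, using the representation $\xi_j = \sqrt{\rho_0}\,z_0 + \sqrt{1-\rho_0}\,z_j$ with independent standard Gaussians; all numerical choices below are placeholders.

```python
import numpy as np

# Monte Carlo sketch (our illustration): estimate
# P( M_N(xi) <= delta0 * sqrt(2*(1 - rho0)*log N) ) for an equicorrelated
# correlation matrix, via xi_j = sqrt(rho0)*z0 + sqrt(1 - rho0)*z_j.
rng = np.random.default_rng(0)
N, rho0, delta0 = 1000, 0.5, 0.5
t = delta0 * np.sqrt(2.0 * (1.0 - rho0) * np.log(N))

hits, reps, chunk = 0, 20_000, 2_000
for _ in range(reps // chunk):
    z0 = rng.standard_normal(chunk)                     # shared factor
    zmax = rng.standard_normal((chunk, N)).max(axis=1)  # max of i.i.d. parts
    M = np.sqrt(rho0) * z0 + np.sqrt(1.0 - rho0) * zmax
    hits += int(np.sum(M <= t))
print("estimated lower-tail probability:", hits / reps)
```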

Background
We now briefly review some leading results on lower-tail bounds for $M_N(\xi)$. Under the preceding conditions, the well-known concentration inequality for Lipschitz functions of Gaussian vectors implies that for any $s > 0$,

(1.2) $\mathbb{P}\big(M_N(\xi) \leq \mathrm{med}(M_N(\xi)) - s\big) \;\leq\; e^{-s^2/2}$,

where $\mathrm{med}(\cdot)$ is any median [Sudakov and Tsirel'son, 1974, Borell, 1975]. Although this bound is broadly applicable, it can fail to describe lower-tail probabilities smaller than $O(N^{-1})$. To see this, consider using (1.2) to bound the probability $\mathbb{P}(M_N(\xi) \leq \delta_0\,\mathrm{med}(M_N(\xi)))$ for some fixed $\delta_0 \in (0,1)$. If the entries of $\xi$ are independent, then the standard fact that $\mathrm{med}(M_N(\xi)) = \sqrt{2\log(N)}\,(1+o(1))$ as $N\to\infty$ implies that (1.2) cannot give a bound better than $O(N^{-1})$ in this case. Furthermore, such a bound is far too large. In fact, when the entries of $\xi$ are independent, this probability is $O(\exp(-aN^b))$, for some fixed constants $a, b > 0$ that may depend on $\delta_0$ but not $N$ [cf. Schechtman, 2007, Section 2]. More recently, an important result of Paouris and Valettas [Paouris and Valettas, 2018] went beyond (1.2), showing that the inequality

(1.3) $\mathbb{P}\big(M_N(\xi) \leq \mathbb{E}(M_N(\xi)) - s\big) \;\leq\; e^{-cs^2/\mathrm{var}(M_N(\xi))}$

holds for any $s > 0$ and a universal constant $c > 0$, which can improve upon (1.2) in situations where $\mathrm{var}(M_N(\xi))$ is small. (For further variations and related results, we refer to the papers Paouris and Valettas [2019], Tanguy [2019], Valettas [2019].) Despite the progress achieved by the bound (1.3), it can still be quite challenging to obtain precise control on the variance $\mathrm{var}(M_N(\xi))$ in the exponent. Indeed, it is often the case that bounds on $\mathrm{var}(M_N(\xi))$ have implicit dependence on the correlation matrix $R$ that is difficult to quantify, and such bounds usually involve constants that are unspecified or conservative. (Note that since $\mathrm{var}(M_N(\xi))$ appears in the exponent, the constant in a variance bound will typically affect the rate of the tail bound.) We refer to the books [Boucheron et al., 2013, Chatterjee, 2014] for further background on variance bounds related to $M_N(\xi)$.
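As an aside, the claim above about the independent case can be verified with a short and standard calculation (ours, not reproduced from the paper). Writing $Q_N = 1 - \Phi(\delta_0\sqrt{2\log(N)})$, independence gives
$$\mathbb{P}\Big(\max_{1\leq j\leq N}\xi_j \leq \delta_0\sqrt{2\log(N)}\Big) \;=\; \Phi\big(\delta_0\sqrt{2\log(N)}\big)^{N} \;=\; (1-Q_N)^{N} \;\leq\; e^{-NQ_N},$$
and the Gaussian tail bound $1-\Phi(u) \geq \tfrac{u}{1+u^2}\,\phi(u)$ yields $Q_N \gtrsim N^{-\delta_0^2}/\sqrt{\log(N)}$, so that the probability is at most $\exp\big(-a\,N^{1-\delta_0^2}/\sqrt{\log(N)}\big)$, in line with the rate $O(\exp(-aN^b))$ quoted above.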
Contributions. With regard to the considerations above, a few other works have developed lower-tail bounds for $M_N(\xi)$ that provide insight into the role of the correlation structure [Hartigan, 2014, Ding et al., 2015, Tanguy, 2015]. However, a limitation shared by all of these works is that they do not explicitly quantify the rate at which probabilities of the form $\mathbb{P}(M_N(\xi) \leq \epsilon_0\sqrt{2\log(N)})$ decrease with respect to $N$ for a fixed value $\epsilon_0 \in (0, \sqrt{1-\rho_0})$. For instance, this limitation can arise from unspecified constants in the exponents of the bounds. See also the discussion after Theorem 2.1 below for more details.
In light of this, an important contribution of our lower-tail bounds is that they provide rates with explicit constants in their exponents. Also, our work goes further by showing that the rates are sharp (as in (2.3) and (2.4) of Theorem 2.1). Moreover, the proof techniques used here are entirely different from those used in the previously mentioned works. In particular, we extend and apply our lower-tail bounds by leveraging recent refinements of Bourgain and Tzafriri's restricted invertibility principle (Section 2.1), which has been a topic of substantial interest in contemporary mathematics. Conventionally, this principle is understood as a functional-analytic result that guarantees the existence of special submatrices within large matrices. Hence, our use of this result to serve the quite different purpose of enhancing tail bounds may be viewed as another notable aspect of our work.
In addition to these contributions, Theorem 3.1 in Section 3 shows how our lower-tail bounds can be applied in the context of high-dimensional statistics, in order to simplify and weaken conditions that are sufficient for near-parametric rates of bootstrap approximation. Specifically, Theorem 3.1 advances the state of the art on such results for "max statistics" in settings where the data satisfy a condition known as "variance decay". Also, a second application of our results has recently been developed by Yi [2021] in connection with stochastic PDEs. Brief descriptions of both applications are given below.

Applications
Bootstrap methods in high dimensions. In recent years, inference problems related to max statistics have attracted significant attention in the high-dimensional statistics literature. The prototypical example of a max statistic is the coordinate-wise maximum of a sum of random vectors $X_1,\dots,X_n \in \mathbb{R}^p$, say

(1.4) $T \;=\; \max_{1\leq j\leq p} \frac{1}{\sqrt{n}}\sum_{i=1}^n X_{ij}$.

Statistics of this type arise often in the construction of hypothesis tests and confidence regions, and consequently, the performance of many inference procedures is determined by how accurately the distribution $\mathcal{L}(T)$ can be approximated. Bootstrap methods are a general approach to construct such approximations in a data-driven way, and they are designed to generate a random variable $T^*$ whose conditional distribution given the data, denoted $\mathcal{L}(T^*|X)$, is close to $\mathcal{L}(T)$. In this context, accuracy is commonly measured with respect to the Kolmogorov metric $d_K$, and the challenge of developing non-asymptotic upper bounds on $d_K(\mathcal{L}(T^*|X), \mathcal{L}(T))$ has stimulated an active line of work [e.g. Chernozhukov et al., 2019, 2020, Deng and Zhang, 2020, Lopes et al., 2020, Lopes, 2020]. As an application of our lower-tail bounds, we will consider a recent bootstrap approximation result of this type from [Lopes et al., 2020], and we will improve upon it by showing that it holds under assumptions that are simpler and less restrictive.
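For reference, the Kolmogorov metric between two distributions can be estimated from samples, as in the following generic sketch (ours; the inputs are placeholders, and this is not code from the cited works).

```python
import numpy as np

def kolmogorov_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Estimate d_K(L(U), L(V)) = sup_t |P(U<=t) - P(V<=t)| from two samples."""
    grid = np.sort(np.concatenate([u, v]))
    Fu = np.searchsorted(np.sort(u), grid, side="right") / len(u)  # ECDF of u
    Fv = np.searchsorted(np.sort(v), grid, side="right") / len(v)  # ECDF of v
    return float(np.max(np.abs(Fu - Fv)))

# Example with two Gaussian samples (placeholder inputs):
rng = np.random.default_rng(1)
print(kolmogorov_distance(rng.standard_normal(10_000),
                          rng.standard_normal(10_000) + 0.1))
```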
Macroscopic properties of solutions to stochastic PDEs. In the study of stochastic partial differential equations, it is of interest to determine whether or not solutions exhibit high peaks over large regions at different scales. Solutions having this property are said to be "multifractal at macroscopic scales". In order to demonstrate that a solution has this property in a precise sense, it is necessary to analyze the regions where a solution rises above a certain height (exceedance sets), and to quantify the "macroscopic Hausdorff dimension" of such regions.
During the last few years, a growing number of results have demonstrated that solutions to fundamental stochastic PDEs possess this multifractal property [e.g. Khoshnevisan et al., 2017, 2018, Kim, 2019, Yi, 2021]. Quite recently, our first main result (Theorem 2.1) was used in Yi [2021] to establish this property for certain versions of the stochastic heat and wave equations. Specifically, the lower-tail bound in our Theorem 2.1 was sharp enough to enable exact calculations of the macroscopic Hausdorff dimension of exceedance sets of certain Gaussian random fields. (See Theorems 1.1 and 1.3, as well as Section 3.2, in Yi [2021].)

Notation
In order to simplify the presentation, we always implicitly assume that $N \geq 2$, and we allow symbols for constants such as $c, C, c_0, C_0, \dots$ to change values with each appearance. When dealing with iterated logarithms, we use the abbreviation $\ell_2(N) := \max\{1, \log\log(N)\}$. For a real matrix $A$, define $\|A\|_F = \sqrt{\mathrm{tr}(A^\top A)}$, $\|A\|_1 = \sum_{i,j}|A_{ij}|$, and also define $\|A\|_{\mathrm{op}}$ to be the largest singular value of $A$. For a real vector $x$, we write $\|x\|_2$ for the Euclidean norm.
The identity matrix of size $N\times N$ is $I_N$, and the standard basis vectors in $\mathbb{R}^N$ are $e_1,\dots,e_N$. For the distribution of a random variable $V$, we write $\mathcal{L}(V)$, and we define its $\psi_1$-Orlicz norm as $\|V\|_{\psi_1} = \inf\{t > 0 \,|\, \mathbb{E}\exp(|V|/t) \leq 2\}$. If $\zeta$ is a Gaussian random vector with mean 0 and covariance matrix $\Sigma$, we write $\zeta \sim N(0,\Sigma)$. For the univariate standard Gaussian distribution, the symbols $\Phi$ and $\phi$ denote the distribution function and density. Lastly, if $a$ and $b$ are real numbers, we use the notation $a\vee b = \max\{a,b\}$ and $a\wedge b = \min\{a,b\}$.

LOWER-TAIL BOUNDS
To clarify the statements of our main results, we first state a basic proposition describing the sizes of $\mathbb{E}(M_N(\xi))$ and $\mathrm{med}(M_N(\xi))$ under condition (1.1). This proposition shows that the value $\sqrt{2(1-\rho_0)\log(N)}$ is a natural reference level for a lower-tail bound. Although this fact might be considered well known among specialists, it is not easily referenced in the form given below, and so we provide a short proof at the end of Section 4.
Proposition 2.1. Let $\mu_N$ stand for either $\mathbb{E}(M_N(\xi))$ or $\mathrm{med}(M_N(\xi))$. If the condition (1.1) holds for some $\rho_0 \in (0,1)$, then there is a universal constant $c_0 > 0$ such that
$$c_0\sqrt{(1-\rho_0)\log(N)} \;\leq\; \mu_N \;\leq\; \sqrt{2\log(N)}.$$
The following is our first main result, which will be extended and refined later in Corollary 2.1 and Theorem 2.3. The proof is deferred to Section 4.
Theorem 2.1 provides an upper bound (2.3) on the probability $\mathbb{P}\big(M_N(\xi) \leq \delta_0\sqrt{2(1-\rho_0)\log(N)}\big)$ for a fixed $\delta_0 \in (0,1)$, which is of order $N^{-\alpha_0}$ up to logarithmic factors, where $\alpha_0 = (1-\rho_0)(1-\delta_0)^2/\rho_0$. Furthermore, the bound (2.3) is sharp in the sense that if $R_{ij} = \rho_0$ for all $i \neq j$, then there is a constant $c > 0$ depending only on $(\delta_0, \rho_0)$, such that the same probability is also bounded below by $c$ times the rate in (2.3); this lower bound is the content of (2.4).
Remarks. To comment on some basic features of the theorem, first note that the dominant exponent $-(1-\rho_0)(1-\delta_0)^2/\rho_0$ takes larger negative values as $\rho_0$ becomes smaller. Hence, the bound respects the fact that the lower-tail probability decays faster than any power of $N^{-1}$ when the entries of $\xi$ are independent. Also, the theorem conforms with the reference level motivated by Proposition 2.1, since (2.3) implies that $\mathrm{med}(M_N(\xi))$ cannot be much less than $\sqrt{2(1-\rho_0)\log(N)}$.
Regarding other recent lower-tail bounds for $M_N(\xi)$, their relation to (2.3) and (2.4) can be summarized as follows (in the setting of Theorem 2.1, with $\rho_0$ and $\delta_0$ regarded as fixed with respect to $N$). First, the paper [Ding et al., 2015, Theorem 1.6] gives a two-sided tail bound for $M_N(\xi)$, implying that if $a$ and $b$ are positive constants satisfying $\mathbb{E}(M_N(\xi)) \geq a\sqrt{\log(N)}$ and $b \leq a/100$, then the probability $\mathbb{P}(M_N(\xi) \leq b\sqrt{\log(N)})$ admits an upper bound involving an unspecified constant $c(a) > 0$. Second, the paper [Tanguy, 2015, Proposition 7] gives a two-sided tail bound implying an upper bound on this probability for any constant $b > 0$, in which $c > 0$ is an unspecified constant. (Note that even if $c$ were specified, this bound would be of larger order than $N^{-\epsilon}$ for any $\epsilon \in (0,1)$.) Third, the paper [Hartigan, 2014, Theorem 3.4] develops a lower-tail bound of the form $\mathbb{P}(M_N(\xi) \leq t_N(\alpha)) \leq 2\alpha$ for any $\alpha \in (0, 1/2)$, where $t_N(\alpha)$ is a threshold with a complex dependence on $N$, $\alpha$, as well as other parameters related to the correlation structure of $\xi$. In particular, Section 4 of that paper explains that the threshold can be expressed in terms of the minimum eigenvalue of $R$. For instance, when $R$ satisfies $R_{ij} = \rho_0$ for all $i \neq j$, the minimum eigenvalue is equal to $1-\rho_0$, and in this case, Theorem 3.4 and the comments preceding equation (4.1) in [Hartigan, 2014] yield a formula for $t_N(\alpha)$ in terms of a quantity $\kappa_N(\alpha)$. However, it seems that this formula for $t_N(\alpha)$ is not quite correct, since it is missing a square root. (Note that the quantity $\kappa_N(\alpha)$ scales like $\log(N)$, rather than $\sqrt{\log(N)}$, as a function of $N$.)¹ Apart from this issue, the intricate form of the threshold also makes our result ostensibly easier to use. Lastly, in comparison to our work, none of the three mentioned papers resolve the question of whether or not the lower-tail bounds yield sharp rates.

¹Relatedly, it also seems that the expression $\lambda_n\sqrt{2\log(n)}$ in that paper's abstract should be replaced with $\sqrt{2\lambda_n\log(n)}$.
Extra correlation structure. We now present a direct corollary of Theorem 2.1 that allows for extra structure in the matrix $R$ to be used. If the matrix $R$ satisfies $\max_{i\neq j} R_{ij} = \rho_0$, but "most" off-diagonal entries are substantially less than $\rho_0$, then we can gain a considerable improvement. Roughly speaking, if there is a number $\rho_1 \in (0, \rho_0)$ such that a large number of off-diagonal entries satisfy $R_{ij} \leq \rho_1$, then Theorem 2.1 can be improved by effectively replacing $\rho_0$ with the better value $\rho_1$.
Another point worth noting about this corollary is that it remains applicable even in situations where some variables are perfectly correlated and ρ 0 = 1.

Further improvements using the restricted invertibility principle
In order to leverage the full strength of Corollary 2.1, we need an index set $J \subset \{1,\dots,N\}$ with large cardinality, such that the off-diagonal entries $R_{ij}$ with $i, j \in J$ are uniformly small. Quite remarkably, it turns out that such an index set is guaranteed to exist under rather general conditions, as a consequence of a seminal result known as the restricted invertibility principle [Bourgain and Tzafriri, 1987]. Over the past decade, this result has received considerable attention, due to its close links with the solution of the longstanding Kadison-Singer problem by Marcus, Spielman and Srivastava [Marcus et al., 2015]. Below, we will apply a recent refinement of the restricted invertibility principle, established in a companion paper by the same group [Marcus et al., 2021].
To introduce some notation, define the stable rank of a non-zero positive semidefinite matrix $A$ as
$$r(A) \;=\; \frac{[\mathrm{tr}(A)]^2}{\|A\|_F^2}.$$
This quantity always satisfies $1 \leq r(A) \leq \mathrm{rank}(A)$, and it has the useful property that it approximately counts the number of dominant eigenvalues of $A$. In terms of the notion of stable rank, the restricted invertibility principle roughly says that for any matrix $L \in \mathbb{R}^{N\times N}$, there exists an index set $J \subset \{1,\dots,N\}$ with cardinality $|J| \approx r(L^\top L)$, such that the column submatrix of $L$ indexed by $J$ has singular values that are well separated from zero.
Theorem 2.2 (Restricted invertibility principle; Marcus et al. [2021]). Let $L \in \mathbb{R}^{N\times N}$ be a non-zero matrix. Then, for any positive integer $k \leq r(L^\top L)$, there exists an index set $J \subset \{1,\dots,N\}$ with cardinality $|J| = k$, such that the following inequality holds for any real numbers $(a_j)_{j\in J}$:

(2.5) $\Big\|\sum_{j\in J} a_j L e_j\Big\|_2^2 \;\geq\; \Big(1 - \sqrt{\tfrac{k}{r(L^\top L)}}\Big)^2\, \frac{\|L\|_F^2}{N} \sum_{j\in J} a_j^2.$

We will apply the restricted invertibility principle above with $R^{1/2}$ playing the role of $L$. Note that because $R$ is a correlation matrix, we have $\|R^{1/2}\|_F^2 = \mathrm{tr}(R) = N$. To simplify the inequality (2.5), let $R(J) \in \mathbb{R}^{|J|\times|J|}$ denote the submatrix of $R$ with entries indexed by $J\times J$. Also, suppose there is an integer $k \geq 2$ and a scalar $\epsilon \in (0,1)$, such that $k \leq \frac{\epsilon^2}{4}\, r(R)$. In this case, the restricted invertibility principle ensures there is a choice of $J$ with cardinality equal to $k$ such that $\lambda_{\min}(R(J)) \geq (1-\epsilon/2)^2 \geq 1-\epsilon$. In turn, this implies that the off-diagonal entries of $R(J)$ are uniformly small, i.e. $\max_{i\neq j;\; i,j\in J} R_{ij} \leq \epsilon$.
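As a numerical illustration of these objects (ours; the greedy search below is not the construction from Marcus et al. [2021]), the following sketch computes the stable rank and searches for an index set $J$ whose submatrix $R(J)$ is well conditioned, so that its off-diagonal entries are small.

```python
import numpy as np

def stable_rank(A: np.ndarray) -> float:
    """r(A) = [tr(A)]^2 / ||A||_F^2 for a non-zero positive semidefinite A."""
    return float(np.trace(A) ** 2 / np.sum(A ** 2))

def greedy_subset(R: np.ndarray, k: int) -> list:
    """Greedily grow J, keeping lambda_min(R(J)) as large as possible."""
    J = [0]
    while len(J) < k:
        candidates = [j for j in range(R.shape[0]) if j not in J]
        J.append(max(candidates,
                     key=lambda j: np.linalg.eigvalsh(R[np.ix_(J + [j], J + [j])])[0]))
    return J

# AR(1)-type correlation matrix: well-spread indices are weakly correlated.
N, rho, k = 60, 0.7, 5
R = rho ** np.abs(np.subtract.outer(np.arange(N), np.arange(N)))
print("stable rank r(R):", stable_rank(R))   # roughly 20 in this example
J = greedy_subset(R, k)
RJ = R[np.ix_(J, J)]
print("J =", J)
print("lambda_min(R(J)):", np.linalg.eigvalsh(RJ)[0])
print("max off-diagonal of R(J):", (RJ - np.eye(k)).max())
```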
The next theorem combines this information about $J$ with Corollary 2.1. Later on, this connection will be used in our application dealing with rates of bootstrap approximation in high dimensions.
Remarks. To discuss the choice of $k$, consider a basic situation where the root mean square of the eigenvalues of $R$, say $\lambda_{\mathrm{rms}}(R) = \big(\frac{1}{N}\sum_{j=1}^N \lambda_j(R)^2\big)^{1/2}$, is upper bounded by a constant $C$. For instance, this condition is quite natural in the context of principal components analysis, where the matrix $R$ may only have a handful of dominant eigenvalues. Note too that this condition is much weaker than requiring $\|R\|_{\mathrm{op}} \leq C$, since it is possible for the largest eigenvalue of $R$ to diverge while $\lambda_{\mathrm{rms}}(R)$ remains bounded as $N\to\infty$. Due to the fact that $R$ is a correlation matrix, the condition $\lambda_{\mathrm{rms}}(R) \leq C$ implies $r(R) = \frac{N^2}{\|R\|_F^2} = \frac{N}{\lambda_{\mathrm{rms}}(R)^2} \geq \frac{N}{C^2}$, and consequently, the integer $k$ in Theorem 2.3 may be chosen to be of order $N$. This leads to an upper bound in (2.6) of order $N^{-(1-\epsilon)(1-\delta)^2/\epsilon}$, up to a logarithmic factor. To see why this illustrates the benefit of the restricted invertibility principle, note that it enables us to bypass an assumption of the form $\max_{i\neq j} R_{ij} \leq \epsilon$, which would have been necessary if Theorem 2.1 had been used directly. Indeed, a condition like $\lambda_{\mathrm{rms}}(R) \leq C$ is often more appealing from a statistical standpoint, because it allows for many off-diagonal entries of $R$ to be close to 1.

APPLICATION TO BOOTSTRAP METHODS IN HIGH DIMENSIONS
Preliminaries. Let $X_1,\dots,X_n \in \mathbb{R}^p$ be centered i.i.d. observations with a standardized sum denoted as $S_n = n^{-1/2}\sum_{i=1}^n X_i$. In addition, define the coordinate-wise variances $\sigma_j^2 = \mathrm{var}(X_{1j})$ for each $j = 1,\dots,p$, which are assumed to be positive, and define the partially standardized max statistic
$$M \;=\; \max_{1\leq j\leq p} \frac{S_{n,j}}{\sigma_j^{\tau}},$$
where $\tau \in [0,1)$ is a tuning parameter. This type of max statistic was studied previously in [Lopes et al., 2020], and reduces to the basic example (1.4) in the case when $\tau$ is chosen to be 0. In order to analyze bootstrap approximation of $\mathcal{L}(M)$, we will focus on the "Gaussian multiplier bootstrap method" popularized in [Chernozhukov et al., 2013]. To describe the method, define the sample mean $\bar X = \frac{1}{n}\sum_{i=1}^n X_i$, the sample covariance matrix $\hat\Sigma = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)(X_i - \bar X)^\top$, and the coordinate-wise sample variances $\hat\sigma_j^2 = \hat\Sigma_{jj}$ for $j = 1,\dots,p$. The bootstrapped statistic $M^\star$ is constructed by generating a Gaussian random vector $S^\star_n \sim N(0, \hat\Sigma)$, and defining
$$M^\star \;=\; \max_{1\leq j\leq p} \frac{S^\star_{n,j}}{\hat\sigma_j^{\tau}}.$$
Here, the key point is that it is possible in practice to directly simulate the conditional distribution of $M^\star$ given the data, denoted $\mathcal{L}(M^\star|X)$. Accordingly, the multiplier bootstrap method uses $\mathcal{L}(M^\star|X)$ as an approximation to $\mathcal{L}(M)$.
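The method can be sketched in a few lines (our illustration, with placeholder data). Here $S^\star_n$ is generated in the equivalent multiplier form $n^{-1/2}\sum_i e_i(X_i - \bar X)$ with i.i.d. $e_i \sim N(0,1)$, which has conditional distribution $N(0,\hat\Sigma)$ given the data.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, tau, B = 200, 500, 0.5, 1000

# Placeholder data: i.i.d. rows with decaying coordinate-wise variances.
sigma = np.arange(1, p + 1) ** -0.5
X = rng.standard_normal((n, p)) * sigma

# Max statistic M = max_j S_{n,j} / sigma_j^tau (sigma known here for simplicity).
S_n = X.sum(axis=0) / np.sqrt(n)
M = np.max(S_n / sigma**tau)

# Gaussian multiplier bootstrap: S*_n = n^{-1/2} sum_i e_i (X_i - Xbar), so that
# S*_n | X ~ N(0, Sigma_hat); then M* = max_j S*_{n,j} / sigma_hat_j^tau.
Xbar = X.mean(axis=0)
sigma_hat = X.std(axis=0)
M_star = np.empty(B)
for b in range(B):
    e = rng.standard_normal(n)
    S_star = (e[:, None] * (X - Xbar)).sum(axis=0) / np.sqrt(n)
    M_star[b] = np.max(S_star / sigma_hat**tau)

# The empirical distribution of M_star approximates L(M*|X), e.g. for
# extracting quantiles used in tests and confidence regions.
print("M =", M, " 95% bootstrap quantile =", np.quantile(M_star, 0.95))
```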
As one last preliminary item, we will adopt the standard convention in high-dimensional statistics of considering a sequence of models indexed by $n$, in which all parameters may depend on $n$, except when stated otherwise. In particular, the dimension $p = p(n)$ is allowed to have arbitrary dependence on $n$. Likewise, if a parameter does not depend on $n$, then it does not depend on $p$ either. To simplify notation, we will write $a_n \lesssim b_n$ for two sequences of real numbers $a_n$ and $b_n$ if there is a constant $c > 0$ not depending on $n$ such that $a_n \leq cb_n$ holds for all large $n$. In the case when both $a_n \lesssim b_n$ and $b_n \lesssim a_n$ hold, we write $a_n \asymp b_n$.

A brief review of bootstrap approximation under variance decay
Recently, the paper [Lopes et al., 2020] analyzed how well $\mathcal{L}(M^\star|X)$ approximates $\mathcal{L}(M)$ in the setting of "variance decay", where the sorted coordinate-wise variances $\sigma_{(1)}^2 \geq \cdots \geq \sigma_{(p)}^2$ have a polynomial decay profile. This means that there is a constant $\gamma > 0$ not depending on $n$ such that the condition $\sigma_{(j)} \asymp j^{-\gamma}$ holds for all $j = 1,\dots,p$. (The constant $\gamma$ is allowed to be arbitrarily large or small.) For instance, the structure of variance decay arises naturally in a variety of high-dimensional statistical applications related to principal components analysis, count data, and the Fourier coefficients of functional data. Under the assumption of variance decay, as well as some additional assumptions on the correlation and tail behavior of the covariates, Theorem 3.2 in the paper [Lopes et al., 2020] established the following bootstrap approximation result. Namely, for any fixed $\delta \in (0, 1/2)$, there is a constant $C > 0$ not depending on $n$ such that the bound

(3.1) $d_K\big(\mathcal{L}(M^\star|X),\, \mathcal{L}(M)\big) \;\leq\; C\,n^{-\frac12+\delta}$

holds with probability at least $1 - C/n$. The result (3.1) has some significant distinguishing features in relation to the other recent works [Lopes, 2020, Chernozhukov et al., 2020] that have obtained near $n^{-1/2}$ rates of bootstrap approximation for max statistics in high-dimensional settings. First, the use of variance decay makes it possible for the bound (3.1) to hold independently of $p$, whereas the other works do not use this structure and obtain rates with polylogarithmic dependence on $p$. Second, the other works require that the correlation matrix $\mathrm{cor}(X_1) \in \mathbb{R}^{p\times p}$ be either positive definite or well approximated by a positive definite matrix, whereas the bound (3.1) does not. Nevertheless, the bound (3.1) does rely on some other assumptions about the matrix $\mathrm{cor}(X_1)$. Below, we outline these assumptions, and later in Section 3.2, we will show how these assumptions can be simplified and weakened by applying our work from Section 2.
To introduce some notation, consider an arbitrary index $d \in \{1,\dots,p\}$, and let $J(d) \subset \{1,\dots,p\}$ denote a set of indices corresponding to the $d$ largest coordinate-wise variances, i.e. $\{\sigma_{(1)}^2, \dots, \sigma_{(d)}^2\} = \{\sigma_j^2 \,|\, j \in J(d)\}$. Next, let $R(d)$ denote the $d\times d$ correlation matrix of the variables $(X_{1j})_{j\in J(d)}$, and let the matrix $R(d)^+$ denote its entrywise non-negative part, with entries $[R(d)^+]_{ij} = R(d)_{ij} \vee 0$. In terms of this notation, the paper [Lopes et al., 2020] makes three correlation assumptions in order to establish (3.1): (a) there is a constant $\epsilon_0 \in (0,1)$ not depending on $n$ that uniformly bounds the relevant correlations away from 1; (b) the matrices $R(d)^+$ are positive semidefinite; and (c) a norm constraint on these matrices holds. Concerning assumption (b), this is non-trivial because the operation of taking the entrywise non-negative part of a matrix does not generally preserve positive semidefiniteness [Guillot and Rajaratnam, 2015, Theorem 4.11].

Bootstrap approximation result with relaxed correlation assumptions
In this subsection, we will present a result of the form (3.1) in which the previous correlation assumptions (a) and (b) are removed, and in which (c) is replaced by a version involving a weaker norm. Apart from the correlation structure, the following conditions on the data-generating model and the variance decay profile are slightly simplified versions of the ones used in [Lopes et al., 2020].
(i) There is a positive semidefinite matrix $\Sigma \in \mathbb{R}^{p\times p}$, such that the observations $X_1,\dots,X_n \in \mathbb{R}^p$ are generated as $X_i = \Sigma^{1/2}Z_i$ for each $i = 1,\dots,n$, where $Z_1,\dots,Z_n$ are i.i.d. random vectors whose entries are independent and standardized, with $\psi_1$-Orlicz norms bounded by a constant not depending on $n$.

(ii) The coordinate-wise variances $\sigma_1^2,\dots,\sigma_p^2$ are positive, and there is a constant $\gamma > 0$ not depending on $n$ such that the condition $\sigma_{(j)} \asymp j^{-\gamma}$ holds for all $j = 1,\dots,p$.
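A minimal sketch of generating data under condition (i) is given below (our illustration; the Gaussian choice of $Z_i$, the AR(1)-type correlation, and all numerical values are placeholder assumptions).

```python
import numpy as np

# Sketch of condition (i): X_i = Sigma^{1/2} Z_i with i.i.d. standardized Z_i,
# and coordinate-wise standard deviations decaying like j^(-gamma).
rng = np.random.default_rng(3)
n, p, gamma = 100, 300, 0.75
sigma = np.arange(1, p + 1) ** -gamma                                # sigma_(j) ~ j^(-gamma)
corr = 0.7 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))  # AR(1) correlation
Sigma = np.outer(sigma, sigma) * corr                                # target covariance
L = np.linalg.cholesky(Sigma + 1e-12 * np.eye(p))                    # jitter for stability
Z = rng.standard_normal((n, p))                                      # standardized entries
X = Z @ L.T    # rows equal Sigma^{1/2} Z_i in distribution (Gaussian case)
print("empirical variance of coordinate 1:", X[:, 0].var(), "target:", sigma[0] ** 2)
```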
The following is the main result of the current section, and the proof will be given in Section 5.
Remarks. To interpret the correlation assumption (3.2), it should be emphasized that the upper bound $\|R(l_n)\|_F^2 \leq l_n^2$ holds under all circumstances, because any positive semidefinite matrix $A$ satisfies $\|A\|_F^2 \leq [\mathrm{tr}(A)]^2$ (the eigenvalues are non-negative, so the sum of their squares is at most the square of their sum). So, given that $\delta$ may be taken to be arbitrarily small, it is not possible to substantially weaken (3.2) in general. Furthermore, it is worth noting that the condition (3.2) only applies to the small set of variables indexed by $J(l_n)$, while all other variables indexed by $\{1,\dots,p\}\setminus J(l_n)$ have a correlation structure that is unrestricted.
A large class of examples of correlation matrices satisfying (3.2) can be constructed in the following way. Let $f : [0,\infty) \to [0,1]$ be any continuous convex function that satisfies $f(0) = 1$, as well as $f(t) \leq ct^{-\delta}$ for some fixed constants $c, \delta > 0$ and all $t > 0$. Then, by Pólya's criterion [Pólya, 1949, Theorem 1], any matrix with entries defined by $R_{ij} = f(|i-j|)$ will be a correlation matrix that satisfies (3.2). Moreover, any other correlation matrix obtained by permuting the rows and columns will continue to satisfy (3.2). For additional discussion of such matrices, and some of their connections to high-dimensional statistics, we refer to [Bickel and Levina, 2008].
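A quick numerical check of this construction is given below (our illustration), with the hypothetical choice $f(t) = (1+t)^{-1/2}$, which is continuous, convex, equal to 1 at zero, and polynomially decaying. The minimum eigenvalue should be non-negative up to rounding error, and $\|R(l)\|_F^2/l^2$ should shrink as $l$ grows.

```python
import numpy as np

# Check: R_ij = f(|i - j|) with convex, decreasing f(t) = (1 + t)^(-1/2)
# should be positive semidefinite (Polya's criterion), and ||R(l)||_F^2
# should grow subquadratically in l.
f = lambda t: (1.0 + t) ** -0.5
for l in [50, 100, 200, 400]:
    idx = np.arange(l)
    R = f(np.abs(np.subtract.outer(idx, idx)))
    print(f"l={l:4d}  lambda_min={np.linalg.eigvalsh(R)[0]: .3e}  "
          f"||R||_F^2 / l^2 = {np.sum(R**2) / l**2:.3f}")
```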
With regard to our lower-tail bounds from Section 2, the connection with Theorem 3.1 can be explained as follows. Overall, the proof of this result is based on a dimension-reduction strategy that involves showing $M$ is well approximated by a statistic of the form $M' = \max_{j\in J'} S_{n,j}/\sigma_j^{\tau}$, where $J' \subset \{1,\dots,p\}$ is an index set with cardinality much less than $p$. In order to show that $M = M'$ holds with high probability, we will localize the maximizing index for $M$ to the set $J'$. That is, if $\hat\jmath$ is a random index such that $M = S_{n,\hat\jmath}/\sigma_{\hat\jmath}^{\tau}$, then we will show that $\hat\jmath$ falls into $J'$ with high probability. It is in this localization step that the lower-tail bound from Theorem 2.3 will be used: two values $x < y$ will be determined, such that $M'$ is likely to be above $y$, while the maximum of $S_{n,j}/\sigma_j^{\tau}$ over the indices $j \notin J'$ is likely to be below $x$. Hence, the first of these items requires a lower-tail bound, because it involves showing that the probability $\mathbb{P}(M' \leq y)$ is small.
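To summarize the localization step as a union bound (our paraphrase of the strategy just described): if $x < y$, then on the event $\{M' > y\} \cap \{\max_{j\notin J'} S_{n,j}/\sigma_j^{\tau} \leq x\}$, the maximizing index $\hat\jmath$ must lie in $J'$, so that $M = M'$. Hence,
$$\mathbb{P}(M \neq M') \;\leq\; \mathbb{P}(M' \leq y) \;+\; \mathbb{P}\Big(\max_{j\notin J'} S_{n,j}/\sigma_j^{\tau} > x\Big),$$
where the first term on the right is the lower-tail probability that Theorem 2.3 is used to control.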

PROOF OF THEOREM 2.1
Throughout the proofs, we always assume that $N$ is sufficiently large for any given expression to make sense, since this can be accommodated by an adjustment of the constants $C$ and $c$ in the statement of Theorem 2.1. The symbols $c$ and $C$ will also frequently be re-used, with the understanding that they never depend on $N$ (and similarly with respect to the sample size $n$ in Section 5). In addition, it will simplify the presentation to introduce the quantity
$$\alpha_0 \;=\; \frac{(1-\rho_0)(1-\delta_0)^2}{\rho_0},$$
so that the bounds in Theorem 2.1 are proportional to $N^{-\alpha_0}$ times a power of $\log(N)$.

Proof of the upper bound (2.3). Let $t_N = \delta_0\sqrt{2(1-\rho_0)\log(N)}$, and let $\xi' \sim N(0, R')$ be a Gaussian vector in $\mathbb{R}^N$, where $R'$ is a correlation matrix satisfying $R'_{ij} = \rho_0$ for all $i \neq j$. Due to the assumption that $\max_{i\neq j} R_{ij} \leq \rho_0$, it follows from Slepian's lemma [Slepian, 1962] that
$$\mathbb{P}(M_N(\xi) \leq t_N) \;\leq\; \mathbb{P}(M_N(\xi') \leq t_N).$$
To control the larger probability, let $\zeta_0, \zeta_1, \dots, \zeta_N$ denote independent $N(0,1)$ random variables, and note that the coordinates of $\xi'$ may be jointly represented in distribution as $\xi'_j = \sqrt{\rho_0}\,\zeta_0 + \sqrt{1-\rho_0}\,\zeta_j$ for $1 \leq j \leq N$. This yields the following representation of the maximum,

(4.2) $M_N(\xi') \;=\; \sqrt{\rho_0}\,\zeta_0 + \sqrt{1-\rho_0}\,\max_{1\leq j\leq N}\zeta_j,$

which allows us to view $M_N(\xi')$ as the convolution of the two independent random variables on the right side. Consequently, a direct calculation may be used to obtain the exact formula
$$\mathbb{P}(M_N(\xi') \leq t_N) \;=\; \int_{-\infty}^{\infty}\psi_N(s)\,ds,$$
where the integrand is defined by
$$\psi_N(s) \;=\; \phi(s)\,\Phi\Big(\frac{t_N - \sqrt{\rho_0}\,s}{\sqrt{1-\rho_0}}\Big)^{N},$$
with $\phi$ and $\Phi$ being the standard normal density and distribution function. The remainder of the proof consists in bounding the integral $\int_{-\infty}^{\infty}\psi_N(s)\,ds$ over the intervals $(-\infty, -c_N]$, $[-c_N, 0]$, and $[0,\infty)$, where $c_N$ is a cut-off point defined in (4.4).

Remarks. Our choice of the cut-off point $c_N$ is a crucial element of the proof. To give a sense of how delicate this is, a close inspection of the proof shows that the $1/4$ coefficient on the trailing term $\ell_2(N)/\sqrt{\log(N)}$ is needed to obtain both the upper and lower bounds in Theorem 2.1. Some intuition for the definition of $c_N$ can be given as follows. First, define $b_N = \sqrt{2\log(N)} - \ell_2(N)/\sqrt{8\log(N)}$ and the random variable $G_N = \sqrt{2\log(N)}\,\big(\max_{1\leq j\leq N}\zeta_j - b_N\big)$, which converges weakly to a standard Gumbel distribution as $N \to \infty$ [Leadbetter et al., 1983, Theorem 1.5.3]. Based on this notation and (4.2), it can be checked that the event $\{M_N(\xi') \leq t_N\}$ can be rewritten as an event where $\zeta_0$ falls below a threshold involving $G_N$; see (4.5). Next, the probability of this event can be approximated heuristically by dropping the random variable $\sqrt{\tfrac{1-\rho_0}{2\rho_0}}\,\tfrac{G_N}{\sqrt{\log(N)}}$, because it is of order $1/\sqrt{\log(N)}$ (in probability), and hence asymptotically negligible compared to $\zeta_0$.
After this heuristic is used, some further calculation then leads to the surmise that $\mathbb{P}(M_N(\xi') \leq t_N)$ should be of order $N^{-\alpha_0}$, up to factors that are logarithmic in $N$. However, this line of reasoning will not be used in the formal proof, because the bounds in Theorem 2.1 are non-asymptotic. Also, note that this asymptotic heuristic suppresses the fact that $1/\sqrt{\log(N)}$ approaches 0 very slowly, which matters much more from a non-asymptotic standpoint.
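To indicate where the order $N^{-\alpha_0}$ comes from, the heuristic can be carried one step further (the following computation is ours). Since $t_N - \sqrt{1-\rho_0}\,b_N \approx -(1-\delta_0)\sqrt{2(1-\rho_0)\log(N)}$, dropping the $G_N$ term reduces the event $\{M_N(\xi') \leq t_N\}$ to approximately $\{\zeta_0 \leq -u_N\}$ with $u_N = (1-\delta_0)\sqrt{2(1-\rho_0)\log(N)/\rho_0}$, and the Gaussian tail approximation $1-\Phi(u) \sim \phi(u)/u$ gives
$$\mathbb{P}(\zeta_0 \leq -u_N) \;\sim\; \frac{\phi(u_N)}{u_N} \;=\; \frac{N^{-\alpha_0}}{u_N\sqrt{2\pi}}, \qquad \text{using}\ \ \tfrac{u_N^2}{2} = \alpha_0\log(N).$$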
Returning to the main thread, the problem of upper bounding the integral of $\psi_N(s)$ over the interval $[-c_N, 0]$ is the most involved part of the proof, and is postponed to Lemma 4.1 later on. (The reason for this difficulty is that the function $\psi_N(s)$ is especially sensitive to small changes in $s$ over $[-c_N, 0]$. In particular, this stage of the analysis involves developing separate bounds over two sub-intervals of $[-c_N, 0]$ in order to account for changes in the local behavior of the function.) Once the proof of that lemma is complete, it will be straightforward to control the integral over $[0,\infty)$, which is done subsequently in Lemma 4.2. For the moment, we only handle the interval $(-\infty, -c_N]$, since it requires no further preparation. Indeed, since $\psi_N(s) \leq \phi(s)$ for all $s$, we have
$$\int_{-\infty}^{-c_N}\psi_N(s)\,ds \;\leq\; \Phi(-c_N) \;\leq\; e^{-c_N^2/2}.$$
Turning to the interval $[-c_N, 0]$, the analysis proceeds through a change of variables indexed by $\delta$ ranging over an interval $I_N$. It follows that for all $\delta \in I_N$, the integrand can be rewritten as in equation (4.13). Due to the lower-bound form of Mills' inequality, we then obtain an estimate in which a quantity $w_N(\delta)$ is defined by its last line. When $N$ is sufficiently large, it is simple to check that the condition $0 < w_N(\delta) < 1$ holds for all $\delta \in I_N$, which gives $\log(1 - w_N(\delta)) \leq -w_N(\delta)$. Combining the last few steps, the bound (4.15) holds for all $\delta \in I_N$. The work up to this point provides us with a useful majorant for $\psi_N$: by looking at the equation (4.13) and the bound (4.15), it is clear that if we define the function $f_N$ as in (4.16), then $\psi_N \leq f_N$ holds on $I_N$. Integrating this bound gives (4.18). We now introduce another change of variable, and write $\delta$ as a function of a positive number $\eta$ via a map $\delta_N(\cdot)$. Defining the interval $J_N$ accordingly, $\delta_N(\cdot)$ maps $J_N$ to $I_N$, and the integral bound (4.18) becomes an integral over $J_N$. The remainder of the proof will be divided into two parts, in which the integral over $J_N$ is decomposed into two subintervals, $J_N'$ and $J_N''$, the latter extending up to a point of order $\sqrt{(1-\delta_0)\log(N)}$.
In handling these subintervals below, it will be convenient to label the summands of $f_N$ in line (4.16) according to the decomposition used there.

The integral over $J_N'$. By expanding out the square $\delta_N(\eta)^2$, and dropping the smallest positive term, a first bound holds for any $\eta$. In addition, if we expand the square $(\delta_0 + \delta_N(\eta))^2$, and use the fact that every $\eta \in J_N'$ is bounded above by $(\log(N))^{1/4}/\ell_2(N)$, then a short calculation gives a further bound for small enough $c > 0$. Directly combining the last two steps gives (4.20). To simplify the previous bound, define $x(\eta) = 2\eta - \tfrac{1}{2}$. Since $x(\eta)$ is non-negative for all $\eta \in J_N'$, we may approximate $\exp\{\ell_2(N)x(\eta)\}$ from below using the second-order Taylor expansion $1 + \ell_2(N)x(\eta) + \tfrac{1}{2}\ell_2(N)^2 x(\eta)^2$. After some arithmetic, the bound (4.20) becomes (4.21), where we define the function $\varphi_N$ accordingly. Integrating the bound (4.21) over $J_N'$, we obtain (4.22). To handle the last integral, note that the function $\varphi_N(x)$ can be written in the form $\varphi_N(x) = \exp(-ax^2 + bx)$, and that the elementary Gaussian integral bound
$$\int_{-\infty}^{\infty} e^{-ax^2 + bx}\,dx \;=\; \sqrt{\frac{\pi}{a}}\; e^{b^2/(4a)}$$
holds for any $a > 0$ and $b \in \mathbb{R}$. Therefore, the last integral in (4.22) can be bounded accordingly.
Combining this with the bound (4.22) completes the work on $J_N'$.
Lemma 4.2. There is a constant $C > 0$ depending only on $(\delta_0, \rho_0)$ such that the integral of $\psi_N$ over $[0,\infty)$ is at most $C$ times the rate stated in Theorem 2.1. To see this, note that on $[0,\infty)$ the integrand $\psi_N(s)$ is decreasing in $s$, and so the integral can be controlled through the behavior of $\psi_N$ near 0. Then, the inequality (4.15) gives a bound which is clearly of smaller order than the stated bound.

PROOF OF THEOREM 3.1
Let $\Sigma = \mathbb{E}(X_1 X_1^\top)$ be the population covariance matrix, and let $\hat\Sigma$ be the sample covariance matrix defined in Section 3, with $\hat\sigma_j^2 = \hat\Sigma_{jj}$ for every $j = 1,\dots,p$. Recall that $S^\star_n$ denotes a Gaussian random vector drawn from $N(0,\hat\Sigma)$, and define a corresponding Gaussian vector $\tilde S_n \sim N(0,\Sigma)$. For each index $d \in \{1,\dots,p\}$, let the index set $J(d)$ be as defined in Section 3, and define three associated max statistics
$$M_d = \max_{j\in J(d)} \frac{S_{n,j}}{\sigma_j^{\tau}}, \qquad \tilde M_d = \max_{j\in J(d)} \frac{\tilde S_{n,j}}{\sigma_j^{\tau}}, \qquad M^\star_d = \max_{j\in J(d)} \frac{S^\star_{n,j}}{\hat\sigma_j^{\tau}}.$$
To compare these statistics, we will use the Kolmogorov metric, defined for generic random variables $U$ and $V$ as
$$d_K\big(\mathcal{L}(U), \mathcal{L}(V)\big) \;=\; \sup_{t\in\mathbb{R}} \big|\mathbb{P}(U\leq t) - \mathbb{P}(V\leq t)\big|.$$
In this notation, the proof amounts to showing there is a constant $C > 0$ not depending on $n$ such that the event
$$d_K\big(\mathcal{L}(M),\, \mathcal{L}(M^\star|X)\big) \;\leq\; C\,n^{-\frac12+\delta}$$
holds with probability at least $1 - C/n$. The overall structure of the proof will be decomposed into two parts, by considering the triangle inequality
$$d_K\big(\mathcal{L}(M),\, \mathcal{L}(M^\star|X)\big) \;\leq\; d_K\big(\mathcal{L}(M),\, \mathcal{L}(\tilde M_p)\big) \;+\; d_K\big(\mathcal{L}(\tilde M_p),\, \mathcal{L}(M^\star_p|X)\big),$$
and then separately bounding the two distances on the right. Hence, it suffices to prove the following proposition.
Proposition 5.1. Suppose the conditions of Theorem 3.1 hold. Then, (i)
$$d_K\big(\mathcal{L}(M),\, \mathcal{L}(\tilde M_p)\big) \;\lesssim\; n^{-\frac12+\delta},$$
and (ii) there is a constant $C > 0$ not depending on $n$, such that the event
$$d_K\big(\mathcal{L}(\tilde M_p),\, \mathcal{L}(M^\star_p|X)\big) \;\leq\; C\,n^{-\frac12+\delta}$$
holds with probability at least $1 - C/n$.
The next two subsections will give the proofs of (i) and (ii) respectively.
As a preparatory step towards applying Theorem 2.3 to the last line of (5.12), we need the following basic observation. Let $(Y_j)_{j\in J(l_n)}$ be a generic set of random variables, and let $(a_j)_{j\in J(l_n)}$ be positive numbers satisfying the condition $\max_{j\in J(l_n)} a_j \leq b$, for some number $b$. It is straightforward to check that for any $t \geq 0$,

(5.13) $\mathbb{P}\Big(\max_{j\in J(l_n)} Y_j \leq t\Big) \;\leq\; \mathbb{P}\Big(\max_{j\in J(l_n)} a_j Y_j \leq b\,t\Big),$

since on the event on the left, each $a_j Y_j$ is at most $b\,t$, regardless of the sign of $Y_j$. Based on Assumption 3.1, there is a positive constant $c_0$ not depending on $n$ such that the inequality $\sigma_j^{\tau-1} \leq 1/c_0$ holds for all $j \in J(l_n)$. Accordingly, we will use (5.13) with the choices $a_j = \sigma_j^{\tau-1}$, $b = 1/c_0$, and $Y_j = \tilde S_{n,j}/\sigma_j^{\tau}$, so that $a_jY_j = \tilde S_{n,j}/\sigma_j$ is standardized. Also, we may choose $c_2$ in the definition of $t_{2,n}$ to have the value $c_2 = \omega c_0\sqrt{2(1-\omega)}$, which implies $b\,t_{2,n} = \omega\sqrt{2(1-\omega)\log(d_n)}$. Under these choices, the inequality (5.13) becomes
$$\mathbb{P}\Big(\max_{j\in J(l_n)} \tilde S_{n,j}/\sigma_j^{\tau} \leq t_{2,n}\Big) \;\leq\; \mathbb{P}\Big(\max_{j\in J(l_n)} \tilde S_{n,j}/\sigma_j \leq \omega\sqrt{2(1-\omega)\log(d_n)}\Big).$$
Now, we apply Theorem 2.3 to the right side, with $(l_n, d_n, \omega, \omega)$ playing the roles of $(N, k, \epsilon, \delta)$ in the statement of that result. (Under these choices, the application of Theorem 2.3 is justified because $\omega \in (0,1)$ and the inequalities $d_n \leq (\omega^2/4)\,r(R(l_n))$ and $d_n \geq 2$ hold for all large $n$.) Hence, $\mathbb{P}\big(\max_{j\in J(l_n)} \tilde S_{n,j}/\sigma_j^{\tau} \leq t_{2,n}\big)$ is upper bounded by a quantity that decays polynomially in $d_n$.
Furthermore, using the lower bound on $d_n$ from (5.7), there is a constant $c > 0$ not depending on $n$ such that the preceding bound is at most $Cn^{-c}$ with $c$ large enough for our purposes, which completes the proof.
5.2 Proof of Proposition 5.1(ii)

The proof of part (ii) is structured mostly along the same lines as the proof of part (i), and so we only sketch out the main steps for the sake of brevity. The current proof will also continue to use the same notation. Consider the inequality
$$d_K\big(\mathcal{L}(\tilde M_p),\, \mathcal{L}(M^\star_p|X)\big) \;\leq\; \mathrm{I}'_n + \mathrm{II}'_n(X) + \mathrm{III}'_n(X),$$
where the three terms on the right are defined by analogy with the decomposition used in part (i). Note that $\mathrm{I}'_n$ is non-random, whereas $\mathrm{II}'_n(X)$ and $\mathrm{III}'_n(X)$ are random. Establishing a bound on $\mathrm{I}'_n$ of order $n^{-1/2+\delta}$ requires no further work, because $\mathrm{I}'_n$ is equal to $\mathrm{III}_n$ in the proof of part (i). Next, it follows from Lemma 5.1 in Section 5.3 that the event

(5.17) $\mathrm{II}'_n(X) \;\leq\; C\,n^{-\frac12+\delta}$

holds with probability at least $1 - C/n$. So, it remains to establish a bound on $\mathrm{III}'_n(X)$ of order $n^{-1/2+\delta}$. For this purpose, let $t'_{1,n}$ and $t'_{2,n}$ be any real numbers satisfying $t'_{1,n} \leq t'_{2,n}$. The reasoning leading up to (5.6) can be re-used to show that the following bound holds almost surely:

(5.18) $\mathrm{III}'_n(X) \;\leq\; \mathbb{P}(A'(t'_{2,n})|X) + \mathbb{P}(B'(t'_{1,n})|X),$

where we define the following events for arbitrary $t \in \mathbb{R}$:
$$A'(t) = \Big\{\max_{j\in J(l_n)} S^\star_{n,j}/\hat\sigma_j^{\tau} \leq t\Big\}, \qquad B'(t) = \Big\{\max_{j\notin J(l_n)} S^\star_{n,j}/\hat\sigma_j^{\tau} > t\Big\}.$$
To bound the two probabilities on the right side of (5.18), we will use values $t'_{1,n}$ and $t'_{2,n}$ of the same form as in part (i), for certain constants $c'_1, c'_2 > 0$ that do not depend on $n$. (Note that for any such choices, the condition $t'_{1,n} \leq t'_{2,n}$ will hold for all large $n$.) The probability $\mathbb{P}(A'(t'_{2,n})|X)$ can be handled using the argument after (5.12) in the proof of part (i), together with the bound (5.17). Specifically, it can be shown that there are constants $C$ and $c'_2$ such that $\mathbb{P}(A'(t'_{2,n})|X) \leq n^{-1}$ holds with probability at least $1 - C/n$. Finally, an argument analogous to the one used to establish (5.9) earlier shows there are constants $C$ and $c'_1$ such that $\mathbb{P}(B'(t'_{1,n})|X) \leq n^{-1}$ holds with probability at least $1 - C/n$. (A more detailed version of this argument can be found in the proof of Lemma C.1 part (b) in [Lopes et al., 2020].)

Background results
The following two bounds, (5.19) and (5.20), follow from the proofs of Propositions B.1 and C.1 in [Lopes et al., 2020]. These proofs can be applied in the setting of Theorem 3.1 in this paper with no essential changes.
Lemma 5.1. Fix any $\delta \in (0, 1/2)$, and suppose the conditions of Theorem 3.1 hold. Also, let $\mathrm{II}_n$ and $\mathrm{II}'_n(X)$ be as defined in (5.2) and (5.15) respectively. Then,

(5.19) $\mathrm{II}_n \;\lesssim\; n^{-\frac12+\delta},$

and there is a constant $C > 0$ not depending on $n$ such that the event

(5.20) $\mathrm{II}'_n(X) \;\leq\; C\,n^{-\frac12+\delta}$

holds with probability at least $1 - C/n$.
For the next lemma, recall that $\|\cdot\|_q$ denotes the $L^q$ norm of a random variable. This lemma is effectively a restatement of Lemma D.4 in [Lopes et al., 2020], and the proof given there can be applied in the same manner under the conditions used here.