Adaptive threshold-based classification of sparse high-dimensional data

Abstract: We revisit the problem of designing an efficient binary classifier in a challenging high-dimensional framework. The model under study assumes some local dependence structure among feature variables, represented by a block-diagonal covariance matrix with a growing number of blocks of an arbitrary but fixed size. The blocks correspond to non-overlapping independent groups of strongly correlated features. To assess the relevance of a particular block in predicting the response, we introduce a measure of "signal strength" pertaining to each feature block. This measure is then used to specify a sparse model of our interest. We further propose a threshold-based feature selector which operates as a screen-and-clean scheme integrated into a linear classifier: the data are subjected to screening and hard-threshold cleaning to filter out the blocks that contain no signals. Asymptotic properties of the proposed classifiers are studied when the sample size n depends on the number of feature blocks b and goes to infinity with b, but at a slower rate than b. The new classifiers, which are fully adaptive to the unknown parameters of the model, are shown to perform asymptotically optimally in a large part of the classification region. The numerical study confirms the good analytical properties of the new classifiers, which compare favorably to the existing threshold-based procedure used in a similar context.


Introduction
Statistical methodology for high-dimensional data is a rapidly growing area where inferential and algorithmic procedures for models with the number of features exceeding the number of observations are of great interest. High-dimensional statistical problems emerge in a variety of applied fields such as genomics and proteomics, cosmology, information technology, finance and banking. Classification is one of the key techniques of high-dimensional statistics where the goal is to predict the categorical class labels of new instances based on past observations.
Despite the abundance of off-the-shelf classifiers with excellent performance in the classical large-sample scenario (examples include support vector machines, AdaBoost, CART, and artificial neural network classifiers), a straightforward extension of these procedures to high-dimensional settings encounters serious challenges for the following reasons. First, these classifiers fail to exploit the sparsity patterns of high-dimensional data. With a variety of modern experimental techniques that make it possible to automatically measure a large number of features on each subject, the number of individually relevant features, or groups (blocks) of such features, is often a small part of the entire set and is hidden in that set. Incorporating too many noise feature variables with little or no relevance to the classification problem at hand can severely degrade classification accuracy. Second, many classification problems in high dimensions stem from applications where identifying useful features, or groups of features that are jointly informative for the class label, is of primary importance. Examples of applications include, among others, the problems of cancer classification with genomics data and disease classification with medical imaging data, where the goal is to design a parsimonious classifier that would not only control the total number of features in the model without a noticeable loss of quality but also allow for effective training procedures and good interpretation.
Such applications may require feature selection techniques that operate effectively in high-dimensional settings under various sparsity and weakness assumptions.
These considerations have motivated us to look at the classification problem in a sparse setup, where only a small fraction of a large number of feature blocks (which are unknown to us) are "useful", and each useful block of feature variables contributes weakly to distinguishing between the classes. Aiming to model the phenomenon of growing dimensionality, we use an asymptotic framework that operates over a sequence of classification problems with increasingly many feature blocks and relatively fewer observations.
In general, the classical theory of supervised classification is not designed to work in a sparse framework. Therefore, over the last decades, substantial efforts have been made to develop appropriate alternatives to standard classification procedures, such as linear and quadratic classifiers (see, for example, Ahmad and Pavlenko [1], Aoshima and Yata [2], Chan and Hall [6], Fan et al. [11], Ingster et al. [16]). Several effective classifiers suitable for situations where the class-covariance matrices are diagonal have recently been proposed and studied (see, for example, Ingster et al. [16] and Donoho and Jin [10]). Some of these procedures include, prior to the classification step, a feature selection step by thresholding. The most recent classification studies pertaining to sparse models have shown that, even under the relatively strong assumption of independence of feature variables, many statistical challenges remain.
In this paper, we examine a new sparse block-diagonal model reflecting the situation where only a small fraction of feature blocks are useful for classification. Statistical properties of the proposed classifiers depend crucially on the accuracy of a cleaning step that identifies relevant feature blocks. Cleaning is done by means of hard thresholding, with carefully chosen data-driven thresholds, which filters out the blocks containing no signals. The choice of the threshold depends on the level of signal separation strength: the weaker the signal, the harder the problem of removing useless feature blocks from the subsequent classification analysis. Where possible, we use our newly proposed variable selection techniques to set up a threshold that retains all useful feature blocks and perhaps a few useless ones. When the signal strength of the useful blocks is too weak to allow feature selection, we propose a hard threshold obtained by employing weighted Kolmogorov-Smirnov test statistics with suitably chosen weight functions. These statistics are known to distinguish between pure noise and sparse mixtures of noise and signal.
By construction, the proposed classifiers contain a random number of terms, representing classification functions for useful feature blocks, which makes the study of their efficiency properties highly nontrivial. In a large part of the classification region, the proposed classifiers are shown to have the maximum classification error and the Bayes classification error tending to zero as the number of feature blocks increases; for the rest of the classification region, numerical results with simulated data are reported.
In Section 2 we introduce a high-dimensional model of our interest and indicate the fundamental limits of sparse classification. In Section 3 we study the classification problem at hand when the covariance matrix Σ of the data is known. The more difficult case of unknown Σ is treated in Section 4. Results of the numerical study are summarized in Section 5. Concluding remarks are given in Section 6. Proofs of Lemmas 1-3, which are essential ingredients for the proofs of Theorems 1 and 2, the main results of this work, are deferred to Section 7.
Throughout the paper, the symbol χ²_ν(λ) is used for a chi-square random variable with ν degrees of freedom and noncentrality parameter λ. The symbol F_{ν1,ν2}(λ) is used for an F-distributed random variable with ν1 numerator and ν2 denominator degrees of freedom and noncentrality parameter λ. Φ denotes the cumulative distribution function (cdf) of the standard normal N(0, 1) distribution. We use the symbol log a for the natural (base e) logarithm of the number a. For an event A, I(A) is the indicator of A. We denote by ⟦n⟧ the set {1, . . . , n} for n ∈ N. The Euclidean norm of a vector x ∈ R^k, k ≥ 1, is denoted by ||x||. The stochastic symbols o_{P_{Π_l}}(1) and O_{P_{Π_l}}(1) are short for a sequence of random variables that converges to zero in probability and for a sequence that is bounded in probability, respectively, where the subscript indicates that the sequence involves an observation generated by the distribution Π_l, l ∈ ⟦2⟧.

Model and problem
Let X^(1) = (X^(1)_j)_{j∈⟦n⟧} and X^(2) = (X^(2)_j)_{j∈⟦n⟧} be random samples drawn from the populations Π1 = N_p(0, Σ) and Π2 = N_p(μ, Σ), respectively, where X^(l)_j = (X^(l)_{1j}, . . . , X^(l)_{pj})^T for j ∈ ⟦n⟧ and l ∈ ⟦2⟧. The mean vector μ ≠ 0 and the common covariance matrix Σ = Cov(X^(l)_j), l ∈ ⟦2⟧, are generally unknown. Assume further that we observe a random vector X_0 ∈ R^p, which is independent of X^(l), l ∈ ⟦2⟧, and the distribution of X_0 is known to be either Π1 (the pure noise) or Π2 (the signal). The goal is to design a classifier ψ = ψ(X_0; X^(1), X^(2)) that would assign X_0 to either Π1 or Π2 and would have a small classification error when the dimension p is much larger than the sample size n. The problem of allocating X_0 to either Π1 or Π2 is difficult only when Π1 and Π2 are "close" to each other. A particular type of closeness for large p is described by the sparsity assumption, which is stated rigorously in Section 2.1 below. Under this assumption, the data is grouped in a large number of blocks, and only a small fraction of the blocks are relevant for classification.
Let E_{Π_i} denote the expectation with respect to the joint distribution of X^(1), X^(2), and X_0 when X_0 ~ Π_i for i ∈ ⟦2⟧. In the present situation of equally-sized random samples, it is natural to measure the accuracy of ψ by the Bayes risk π E_{Π2}(ψ) + (1 − π) E_{Π1}(1 − ψ) with π = 1/2, that is, by

R_B(ψ) = (1/2) ( E_{Π2}(ψ) + E_{Π1}(1 − ψ) ),

and also by the maximum risk

R_M(ψ) = max ( E_{Π2}(ψ), E_{Π1}(1 − ψ) ).

Here, E_{Π2}(ψ) is the probability of misclassifying X_0 as coming from Π1 when in fact X_0 ∈ Π2.
In what follows, Rpψq will be either the Bayes risk R B pψq or the maximum risk R M pψq.
Assume that Σ is a block-diagonal matrix of the form Σ = Diag(Σ_[1], . . . , Σ_[b]) with each block Σ_[k] being symmetric and positive definite. Then the new observation and each element of the training samples can be split into b feature blocks: X^(l)_j = (X^(l)T_{j,[1]}, . . . , X^(l)T_{j,[b]})^T for j ∈ ⟦n⟧ and l ∈ ⟦2⟧. For k ∈ ⟦b⟧ and b = 2, 3, . . ., we define the block-wise estimators μ̂_[k] = μ̂_[k],b and Σ̂_[k] = Σ̂_[k],b and take μ̂ = (μ̂^T_[1], . . . , μ̂^T_[b])^T as an estimator of μ = (μ^T_[1], . . . , μ^T_[b])^T and Σ̂ = Diag(Σ̂_[1], . . . , Σ̂_[b]) as an estimator of Σ = Diag(Σ_[1], . . . , Σ_[b]). In the case of known Σ, we propose to use the classifier ψ̃_b = ψ̃_b(X_0; X^(2)) given by (2.3), where ω̃_k is one if the kth feature block of the data is "useful" and zero otherwise. The new observation X_0 is allocated to Π1 when ψ̃_b(X_0) = 1 and to Π2 otherwise. As seen from Theorem 1 in Section 3.2, the risk R(ψ̃_b) of ψ̃_b with suitably chosen ω̃_k, k ∈ ⟦b⟧, tends to zero as b tends to infinity in a large part of the classification region. Similarly, in the case of unknown Σ, we may consider the classifier ψ̂_b = ψ̂_b(X_0; X^(1), X^(2)) defined by (2.4), where ω̂_k is one if the kth feature block of the data is "useful" and zero otherwise; it allocates X_0 to Π1 when ψ̂_b(X_0) = 1 and to Π2 otherwise. As follows from Theorem 2 in Section 4.1, the risk R(ψ̂_b) of ψ̂_b with suitably chosen ω̂_k, k ∈ ⟦b⟧, converges to zero as b tends to infinity in a large part of the classification region. The behavior of ψ̃_b and ψ̂_b in the remaining part of the classification region, where the selection of useful feature blocks is impossible, is examined in Sections 3.3 and 4.2, and the related numerical results are presented in Section 5. The random functions ω̃_k and ω̂_k are good estimators of ω_k = I(Δ²_{k,b} ≠ 0), k ∈ ⟦b⟧, that attempt to remove from further consideration most of the useless blocks of the data, for which ω_k = 0. Due to the technical issues presented by the case of unknown Σ, the cases of known and unknown Σ will be treated separately.
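The displayed formulas for the block-wise estimators and for the rules (2.3)-(2.4) are not reproduced above, so the following minimal Python sketch only illustrates the structure described in the text: a plug-in Fisher-type linear score computed block by block and switched on or off by a 0/1 selector. Taking μ̂ as the block-wise mean of the sample from Π2 and the particular sign convention of the rule are illustrative assumptions, not the paper's exact definitions.

```python
import numpy as np

# Minimal sketch (hypothetical rule, not the paper's exact (2.3)): block-wise
# plug-in Fisher-type classification with a 0/1 block selector omega.
# Assumes p = b * p0 features, Pi_1 = N(0, Sigma), Pi_2 = N(mu, Sigma),
# and a block-diagonal Sigma with known p0 x p0 blocks.
def classify_blockwise(x0, X2, Sigma_blocks, omega, p0):
    """Return 1 to allocate x0 to Pi_1 and 0 to allocate it to Pi_2."""
    b = len(Sigma_blocks)
    score = 0.0
    for k in range(b):
        idx = slice(k * p0, (k + 1) * p0)
        mu_hat = X2[:, idx].mean(axis=0)          # block-wise mean of the sample from Pi_2
        Sinv = np.linalg.inv(Sigma_blocks[k])     # known-Sigma case
        # Fisher-type linear score of block k, used only if the block is selected
        score += omega[k] * (mu_hat @ Sinv @ (x0[idx] - mu_hat / 2.0))
    return int(score < 0.0)
```

With all ω_k equal to one this reduces to the ordinary plug-in linear rule; the point of the construction above is that, in the sparse regime, the selector must be estimated so that only the few useful blocks contribute to the score.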

Asymptotic regime and sparsity assumption
We shall design a classifier ψ = ψ(X_0; X^(1), X^(2)) in a high-dimensional framework when (i) the sample size n and the dimension p go to infinity together in such a way that n = n_p → ∞ and n = o(p) as p → ∞, (ii) the covariance matrix Σ is a sparse block-diagonal matrix of the form Σ = Diag(Σ_[1], . . . , Σ_[b]), where each block Σ_[k] is symmetric and positive definite, and (iii) feature variables that are deemed useful for classification appear in groups (or blocks), according to the structure of Σ; the useful feature blocks are rare and each block contributes weakly to the classification decision.
We first treat the case of equally-sized p0×p0 blocks, so that b p0 = p, and then comment on the case of unequally-sized blocks. In modern settings, it is often the case that the dimension p exceeds the number of observations n. In this work, we consider a sequence of classification problems in which p0 (p0 < n) is a fixed known integer, the number of blocks b is the driving parameter, and n relates to b through (2.5), in essence n = b^θ for some known θ ∈ (0, 1), implying n = o(p) as p → ∞. (Below, we may think that n = ⌈b^θ⌉.) This assumption yields log n ~ θ log p as p → ∞, which corresponds to scenario (C) in Ingster et al. [16] and is referred to as the regular growth of dimensionality. This asymptotic approach is also similar to the triangular array setup studied in Greenshtein and Ritov [13]. The number of blocks b of Σ is assumed to be at least 2 because the parametrization used for the model of our interest requires log b to be nonzero (see relation (2.8) below).
In an ideal setup, when μ and Σ are known, the optimal (under the Bayes risk with equal prior probabilities) classifier ψ_0 = ψ_0(X_0), which is obtained by employing the likelihood ratio approach, has the risk R(ψ_0) = Φ(−Δ/2), where Δ² = μ^T Σ^{-1} μ is the squared Mahalanobis distance between Π1 and Π2. For the block-diagonal matrix Σ = Diag(Σ_[1], . . . , Σ_[b]), the squared Mahalanobis distance Δ² depends on b and can be expressed as

Δ² = Δ²_b = Σ_{k=1}^{b} Δ²_{k,b},   with   Δ²_{k,b} = μ^T_[k] Σ^{-1}_[k] μ_[k],   (2.6)

where μ = (μ^T_[1], . . . , μ^T_[b])^T. The quantity Δ²_{k,b} is the signal strength of the kth block of the data; it measures the contribution of the kth block towards the total strength of separation Δ² between the populations Π1 and Π2. "Large" values of Δ²_{k,b} suggest that the kth block is useful for classification, and therefore the data (X^(l)_{j,[k]})_{j∈⟦n⟧}, l ∈ ⟦2⟧, should be used for constructing a suitable classification rule; at the same time, "small" values of Δ²_{k,b} mean that the kth block of the data is useless for classification and should be removed from further consideration. In this work, we demonstrate how accurate classification can be achieved by means of a classifier that includes an effective screen-and-clean threshold-based feature selector as its integrated part.
To set up a sparse model of our interest, we take two numbers s and a such that s ∈ ⟦b⟧ and a > 0, and consider the set of vectors v = (v_k)_{k∈⟦b⟧} given by

Γ_b(s, a) = {v ∈ R^b : there exists a set S ⊂ ⟦b⟧ with s elements such that v_k ≥ a for all k ∈ S, and v_k = 0 for all k ∉ S}. (2.7)

The statistical model that consists of observing two independent random samples X^(1) and X^(2) of size n from the respective p-dimensional populations Π1 and Π2, where p = p0 b, is said to have an (s, a)-sparse block-diagonal structure if the vector (nΔ²_{k,b})_{k∈⟦b⟧}, where Δ²_{k,b} is defined in (2.6), belongs to the set Γ_b(s, a). In what follows, both parameters s and a will depend on the driving parameter b. Namely, we assume that the parameter s satisfies s = s_b = ⌈b^{1−β}⌉ for some 0 < β < 1, implying s = o(b) as b → ∞. This type of parametrization for s is quite common in the literature on high-dimensional statistical inference. We speak of β as the sparsity parameter. The parameter a cannot be too small (see, for example, Remark 1 in Ingster et al. [16]). A suitable range for a that makes the classification problem at hand interesting is

a = a_b = 2r log b for some 0 < r < 4, (2.8)

that is, the parameter a is only moderately large. Indeed, in this case, the squared Mahalanobis distance for the kth useful block satisfies Δ²_{k,b} ≥ (2r log b)/n, which tends to zero as b → ∞, so that each useful block contributes only weakly to Δ². This brings us to a nontrivial classification problem which is closely related to the classification problem for the diagonal matrix Σ = σ² I_{p×p}, as studied in Ingster et al. [16] and Donoho and Jin [10]. The restriction of the range of r in (2.8) to the interval (0, 4) is due to a related feature selection problem; the assumption r ≥ 4, which corresponds to relation (2.3) in Ingster et al. [16], makes selecting useful blocks obvious and hence the problem of classifying X_0 easy.
We shall now introduce the collection of parameters μ = μ_{p×1} and Σ = Σ_{p×p} of our interest. For 0 < β < 1 and 0 < r < 4, define the set M_{b,β,r} as follows: M_{b,β,r} = {(μ, Σ) : μ = (μ^T_[1], . . . , μ^T_[b])^T ≠ 0, Σ = Diag(Σ_[1], . . . , Σ_[b]) is positive definite and symmetric, and the vector (nΔ²_{k,b})_{k∈⟦b⟧} belongs to the set Γ_b(⌈b^{1−β}⌉, 2r log b)}, where n relates to b through (2.5) and Γ_b(s, a) is as in (2.7). Then, our sparsity assumption on the model, which is characterized by the numbers β ∈ (0, 1) and r ∈ (0, 4), says that the pair of parameters (μ, Σ) is an element of M_{b,β,r}. We say that the kth block of the data is useful (for classification) if nΔ²_{k,b} ≥ 2r log b, and it is useless if nΔ²_{k,b} = 0. The value b^{−β} may be viewed as the 'probability' of occurrence of useful feature blocks among the b blocks available. Thus, very few blocks of features are useful for classification, and the information carried by each of these blocks contributes weakly to the classification decision. This type of sparse model is sometimes referred to in the literature as the rare and weak feature model (see, for example, Donoho and Jin [10]).
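To make the rare-and-weak parametrization concrete, a toy generator of a configuration (μ, Σ) satisfying the (⌈b^{1−β}⌉, 2r log b)-sparsity assumption might look as follows. Identity within-block covariance and a signal placed on the first coordinate of each useful block are illustrative assumptions only; the model allows arbitrary symmetric positive definite blocks.

```python
import numpy as np

# Toy generator of a parameter configuration in the rare and weak block model.
def make_rare_weak_model(b, p0, theta, beta, r, rng):
    n = int(np.ceil(b ** theta))                   # sample size, n = b^theta as in (2.5)
    s = int(np.ceil(b ** (1.0 - beta)))            # number of useful blocks ("rare")
    a = 2.0 * r * np.log(b)                        # per-block signal n * Delta^2 = 2 r log b ("weak")
    useful = rng.choice(b, size=s, replace=False)
    mu = np.zeros(b * p0)
    for k in useful:
        mu[k * p0] = np.sqrt(a / n)                # gives Delta^2_{k,b} = a / n for each useful block
    Sigma_blocks = [np.eye(p0) for _ in range(b)]  # illustrative within-block covariance
    return n, mu, Sigma_blocks, useful

rng = np.random.default_rng(0)
n, mu, Sigma_blocks, useful = make_rare_weak_model(b=1000, p0=3, theta=0.5, beta=0.6, r=1.5, rng=rng)
```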
The classifier proposed in (2.3) depends on μ̂ and Σ^{-1} only through their block-wise products; a similar comment applies to the classifier in (2.4). Therefore, the idea of imposing the sparsity assumption directly on the signal separation strength vector (nΔ²_{k,b})_{k∈⟦b⟧} is a natural one. This type of sparsity assumption is somewhat weaker and more flexible than some commonly used assumptions that require the sparsity of μ and Σ (or Σ^{-1}) separately. For instance, Shao et al. [21] proposed thresholding procedures in which μ and Σ are first estimated separately and then plugged into classification rules. In general, in the context of classification, the idea of imposing sparsity assumptions separately on μ and Σ (or Σ^{-1}) may be inappropriate, as there are cases where neither μ nor Σ^{-1} is sparse but μ^T Σ^{-1} μ is (see, for example, Cai and Liu [5]).
Below, we consider the regions inside the parameter space {(β, r) ∈ R² : 0 < β < 1, 0 < r < 4} where successful classification is possible in the sense that

lim_{b→∞} inf_ψ sup_{(μ,Σ)∈M_{b,β,r}} R(ψ) = 0, (2.10)

where R(ψ) is either R_B(ψ) or R_M(ψ) and the infimum is over all measurable functions of X_0 and the training data X^(l), l ∈ ⟦2⟧, with values in [0, 1], and construct classifiers that provide successful classification in part of these regions. Following Ingster et al. [16], we say that a classifier ψ = ψ_b is asymptotically optimal if, for all β and r such that successful classification is possible, we have lim_{b→∞} sup_{(μ,Σ)∈M_{b,β,r}} R(ψ) = 0.

Classification regions
Given the sparse model in question, we shall restrict our attention to the most interesting case of high β-sparsity, with values of β between (1−θ)/2 and 1−θ. The reason for this is that the classification problem for "moderately β-sparse" vectors with β ∈ (0, (1−θ)/2) is easy and not of much interest, whereas successful classification of "very highly β-sparse" vectors with β ∈ (1−θ, 1) is impossible (see Remark 1 in Ingster et al. [16]). In the case of moderate β-sparsity, successful classification is possible without preliminary selection of useful feature blocks and is provided, for example, by a classification rule that uses all feature blocks. Another parameter of the model at hand is r; it may be viewed as the signal strength parameter. Depending on the value of r (as a function of β), we will suggest different classification procedures. In general, the larger the value of r, the easier the classification problem.
Given a parameter θ ∈ (0, 1) which relates n and b through (2.5), we shall consider two regions, D1(θ) and D2(θ), of the parameter space {(β, r) ∈ R² : 0 < β < 1, 0 < r < 4} where classification is possible. Figure 1 displays the region D1(θ) ∪ D2(θ), where classification is possible, and its complement in ((1−θ)/2, 1−θ) × (0, 4), where classification is impossible, along with the detection boundary r = ρ(β) and the selection boundaries r = ρ_1(β) and r = ρ_2(β). In the region D1(θ), we construct an asymptotically optimal classifier that is fully data-driven and does not require knowledge of β (for details, see Section 3.2). In the region D2(θ), where the classification problem is much harder, we propose, based on certain heuristic arguments, a classifier that works well and improves on the idea of Donoho and Jin [10] (for details, see Section 3.3). The properties of this classifier are studied numerically in Section 5.
In the case of an unknown covariance matrix Σ, the division of the classification region into two subregions, denoted below by D^0_1(θ) and D^0_2(θ), is slightly different (see Figure 7 in Section 4.1); this is due to the impact of estimating the true Σ^{-1} by Σ̂^{-1} on the classification error. As in the case of known Σ, in the region D^0_1(θ) the classification problem is easier, whereas in the region D^0_2(θ) it is more difficult. We first consider in detail the case of known covariance matrix Σ and then extend the obtained results to the case of unknown Σ (see Section 4).

Classification when Σ is known
In the present setup, we distinguish between the regions D1(θ) and D2(θ). In the region D1(θ) one can identify useful feature blocks in a precise enough way. In the region D2(θ), where the parameter r (the signal strength) is relatively small, the problem of identifying useful blocks is much more difficult.

Some useful statistics
For b " 2, 3, . . ., letμ "pμ J r1s , . . . ,μ J rbs q J be the estimator of μ " pμ J r1s , . . . , μ J rbs q J and let p Σ " Diagp p Σ r1s , . . . , p Σ rbs q be the estimator of Σ " DiagpΣ r1s , . . . , Σ rbs q as above. The random matrix p2n´1q p Σ rks has a (central) Wishart W p0 pΣ rks , 2n´1q distribution. The distribution of p Σ´1 rks {p2n´1q is called the inverted Wishart distribution, and E´p Σ´1 rks {p2n´1q¯" p2n´p 0´2 q´1Σ´1 rks . For k P vbw and b " 2, 3, . . ., we further definẽ Then, if Σ is known, we may consider a triangular array of statistics The statisticsT k,b are independent within each series and The difficulty of identifying useful blocks of the data in the region D 2 pθq as compared to the region D 1 pθq is seen from Figures 2 and 3. Figure 2 shows a histogram for the chi-square data tT k,b : k P vbwu, in the region D 1 pθq, where variable selection is possible. Figure 3 shows a histogram for the data tT k,b : k P vbwu, in the region D 2 pθq, where variable selection is impossible (but classification is still possible). On Figures 2 and 3, the central and noncentral chi-square density curves are seen as red and blue lines, respectively. The failure to classify a new observation X 0 as belonging to either Π 1 or Π 2 outside of the region D 1 pθq Y D 2 pθq in the rectangle pp1´θq{2, 1´θqˆp0, 4q is illustrated by Figure 4, which shows a histogram for the data tT k,b : k P vbwu, in the region where classification is impossible. In this case, rb 1´β s noncentral chi-square statisticsT k,b get too close to the remaining b´rb 1´β s central chisquare statisticsT k,b to allow successful classification in the sparse regime of our interest.
In a more realistic scenario, when Σ is unknown, we shall make use of the statistics T̂_{k,b}, k ∈ ⟦b⟧, defined in (3.2). The statistics T̂_{k,b} are independent within each series and satisfy (see Section 8b of Rao [20]) T̂_{k,b} ~ F_{p0,2n−p0}(nΔ²_{k,b}), k ∈ ⟦b⟧. The "in distribution" closeness of T̃_{k,b} and p0 T̂_{k,b} for large b (for details, see Section 4.1) and the consistency of Σ̂^{-1}_[k] as an estimator of Σ^{-1}_[k] allow us to extend the results obtained for the case of known Σ to the case of unknown Σ.
By the sparsity assumption on the model, only s = ⌈b^{1−β}⌉ = o(b) statistics among {T̃_{k,b} : k ∈ ⟦b⟧} have a noncentral chi-square distribution, and the remaining (b − s) = b + o(b) ones follow a central chi-square distribution. Similarly, only s = ⌈b^{1−β}⌉ = o(b) statistics among {T̂_{k,b} : k ∈ ⟦b⟧} have a noncentral F distribution, and the remaining (b − s) = b + o(b) ones follow a central F distribution. The noncentrally distributed statistics tend to take larger values than the corresponding centrally distributed statistics. Therefore, "large" values of T̃_{k,b} and T̂_{k,b} suggest that the kth block of the data is useful and should be used for classification. These observations lead us to the estimators ω̃_k and ω̂_k of ω_k, k ∈ ⟦b⟧, as given below by (3.4) and (4.1).
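The displayed definitions (3.1) and (3.2) of T̃_{k,b} and T̂_{k,b} are not reproduced in the extracted text, so the sketch below uses Hotelling-type block statistics as an assumption, chosen so that they have exactly the chi-square and F distributions stated above (the sample from Π1 is centered at its known zero mean, which gives the pooled covariance estimator 2n−1 degrees of freedom, in line with the Wishart statement in Section 3.1).

```python
import numpy as np

# Sketch of block-wise screening statistics (assumed forms, not the paper's (3.1)-(3.2)):
#   known Sigma:   T_known[k]   = n * mu_hat' Sigma_k^{-1} mu_hat   ~ chi2_{p0}(n Delta^2_{k,b})
#   unknown Sigma: T_unknown[k] = c_n * n * mu_hat' S_k^{-1} mu_hat ~ F_{p0, 2n-p0}(n Delta^2_{k,b})
# with c_n = (2n - p0) / ((2n - 1) * p0) and S_k the pooled block covariance.
def screening_stats(X1, X2, Sigma_blocks, p0):
    n = X2.shape[0]
    b = len(Sigma_blocks)
    T_known, T_unknown = np.empty(b), np.empty(b)
    for k in range(b):
        idx = slice(k * p0, (k + 1) * p0)
        mu_hat = X2[:, idx].mean(axis=0)                       # ~ N(mu_[k], Sigma_[k] / n)
        T_known[k] = n * mu_hat @ np.linalg.inv(Sigma_blocks[k]) @ mu_hat
        # pooled covariance with 2n - 1 degrees of freedom (Pi_1 has known mean zero)
        S = (X1[:, idx].T @ X1[:, idx]
             + (X2[:, idx] - mu_hat).T @ (X2[:, idx] - mu_hat)) / (2 * n - 1)
        c_n = (2 * n - p0) / ((2 * n - 1) * p0)
        T_unknown[k] = c_n * n * mu_hat @ np.linalg.inv(S) @ mu_hat
    return T_known, T_unknown
```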

Classification rule in the region D1(θ)
Let T̃_{k,b}, k ∈ ⟦b⟧, be the statistics as in (3.1). Consider the classifier ψ̃_b defined by (2.3), for which the selector ω̃_k, k ∈ ⟦b⟧, given by (3.4), is an estimator of ω_k = I(Δ²_{k,b} ≠ 0), with the threshold level t̃ = t̃(X^(2)) > 0 chosen as follows.
Pick a large number M = M_b, an equidistant grid of points (1−θ)/2 < β_1 < . . . < β_M < 1−θ, and a small number δ = δ_b as in (3.5)-(3.7). In view of the above assumptions on M, for all large enough b, b^δ ≤ const. (3.8) Next, for all k ∈ ⟦b⟧ and m ∈ ⟦M⟧, put ω̃_k(β_m) as in (3.9), where ε_b > 0 satisfies (3.10). We define an adaptive selector ω̃(β_m̃) = (ω̃_k(β_m̃))_{k∈⟦b⟧} by formula (3.11), where m̃ = m̃_b is chosen by Lepski's method (see Section 2 of Lepski [18]) as follows, cf. relation (37) in Butucea and Stepanova [4]:

m̃ = max{m ∈ ⟦M⟧ : d(ω̃(β_m), ω̃(β_j)) ≤ v_j for all j ≤ m}, (3.12)

and m̃ = 1 if the set in (3.12) is empty. Here d(ω̂, ω) = Σ_{k=1}^{b} |ω̂_k − ω_k| is the Hamming loss that counts the number of positions at which ω̂ = (ω̂_k)_{k∈⟦b⟧} and ω = (ω_k)_{k∈⟦b⟧} differ, and the quantities v_j = v_{j,b} are specified below. Algorithmically, Lepski's procedure for choosing m̃ works as follows. We start by setting m̃ = 1 and attempt to increase the value of m̃ from 1 to 2. If d(ω̃(β_2), ω̃(β_1)) ≤ v_1, we set m̃ = 2; otherwise, we keep m̃ equal to 1. In case m̃ is increased to 2, we continue the process, attempting to increase it further. If d(ω̃(β_3), ω̃(β_2)) ≤ v_2 and d(ω̃(β_3), ω̃(β_1)) ≤ v_1, we set m̃ = 3; otherwise, we keep m̃ equal to 2; and so on. By construction, v_1 ≥ v_2 ≥ . . . ≥ v_M. It can be seen from the proof of (3.13) below that if m_0 ∈ ⟦M−1⟧ is such that the true β ∈ (β_{m_0}, β_{m_0+1}], then m̃ ≥ m_0 with high probability.
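A short sketch of Lepski's choice of m̃ is given below. The displayed formulas (3.5)-(3.11) are not reproduced above, so the concrete forms of the grid selectors and of the quantities v_j are assumptions made for illustration: ω̃_k(β_m) = I(T̃_{k,b} ≥ (2β_m + ε_b) log b), in line with the threshold (3.14), and v_j = b^{1−β_j}/τ_b with a slowly growing τ_b, in line with the proof of Lemma 1.

```python
import numpy as np

# Sketch of the adaptive selector with Lepski's choice of the grid index (assumed
# forms of the per-grid selectors and of v_j; eps_b and tau_b are illustrative).
def lepski_selector(T, theta, M=50, eps_b=0.05, tau_b=None):
    b = len(T)
    if tau_b is None:
        tau_b = np.log(b)                                         # slowly growing sequence
    beta = np.linspace((1 - theta) / 2, 1 - theta, M + 2)[1:-1]   # equidistant interior grid
    omegas = [(T >= (2 * bm + eps_b) * np.log(b)).astype(int) for bm in beta]
    v = b ** (1 - beta) / tau_b                                   # v_1 >= v_2 >= ... >= v_M
    m_sel = 0                                                     # position 0 corresponds to m-tilde = 1
    for m in range(1, M):
        # accept a finer grid point only if it is consistent with all coarser grid points
        if all(np.sum(np.abs(omegas[m] - omegas[j])) <= v[j] for j in range(m)):
            m_sel = m
        else:
            break
    return omegas[m_sel], beta[m_sel]
```

The returned grid value β_m̃ is then plugged into the classification threshold t̃ = (2β_m̃ + ε_b) log b of (3.14).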
The next result shows that, asymptotically,ωpβmq identifies correctly most of the noncentrally distributed chi-square statistics among tT k,b : k P vbwu.

Lemma 1.
Consider the (⌈b^{1−β}⌉, 2r log b)-sparse block-diagonal normal model with known covariance matrix Σ, and let the statistics {T̃_{k,b} : k ∈ ⟦b⟧, b = 2, 3, . . .} be as defined in (3.1). Then the selector ω̃(β_m̃) in (3.11) based on {T̃_{k,b} : k ∈ ⟦b⟧}, with m̃ defined by (3.12), is an almost full selector in the sense that, for all (1−θ)/2 < β < 1−θ and β < r < 4,

sup_{(μ,Σ)∈M_{b,β,r}} E[ d(ω̃(β_m̃), ω) ] = o(b^{1−β}) as b → ∞. (3.13)

Lemma 1, whose proof is given in Section 7, says that in the region D1(θ) the maximum Hamming risk of ω̃(β_m̃) is small relative to the number of noncentrally distributed statistics among {T̃_{k,b} : k ∈ ⟦b⟧}. This suggests that in the definition of the classifier ψ̃_b given by (2.3) and (3.4) the threshold t̃ = t̃_b should be set at the level

t̃ = (2β_m̃ + ε_b) log b. (3.14)

The next result shows that in the region D1(θ) the classification rule ψ̃_b defined by (2.3), (3.4), and (3.14) is asymptotically optimal.
Theorem 1. Let X^(1) = (X^(1)_j)_{j∈⟦n⟧} and X^(2) = (X^(2)_j)_{j∈⟦n⟧} be training samples of size n in the (⌈b^{1−β}⌉, 2r log b)-sparse block-diagonal normal model, with n and b related through (2.5) for a given number θ ∈ (0, 1), and let X_0 be a new observation to be classified. Assume that the covariance matrix Σ is known. Then, for all (β, r) ∈ D1(θ), the classifier ψ̃_b defined by (2.3), (3.4), and (3.14) satisfies the two limiting relations in (3.15).

Proof. For a number θ ∈ (0, 1), let (β, r) be an arbitrary point in the region D1(θ). We need to establish the relations in (3.15), where E_{Π_i} denotes the expectation with respect to the joint distribution of X^(1), X^(2), and X_0 when X_0 ~ Π_i for i ∈ ⟦2⟧. Define the statistics V_k and Ṽ_k as in (3.16) and (3.17) and observe that the classifier ψ̃_b can be expressed in terms of these quantities, where ω̃_k = ω̃_k(β_m̃) is the kth component of ω̃(β_m̃) in (3.11). Denote also the quantity in (3.18) and note that, by assumption and the fact that n = b^θ, relations (3.19) and (3.20) hold for all (β, r) ∈ D1(θ) as b → ∞. The following result, Lemma 2, shows that the main contribution to the sum of the Ṽ_k over the blocks selected by ω̃ is made by the useful blocks; its proof, in which the key role is played by Lemma 1, is given in Section 7. Next, for b = 2, 3, . . ., let us introduce the event A_b. Then, by (3.19) and Chebyshev's inequality, P_{Π2}(A_b) admits an upper bound for all sufficiently large b.

Together with relations (3.18) and (3.20), and Lemma 2 applied to the term sup_{(μ,Σ)∈M_{b,β,r}} P_{Π2}(A_b), this upper bound yields the required convergence as b → ∞, uniformly in (β, r) ∈ D1(θ) for all θ ∈ (0, 1). This proves the first relation in (3.15). The second relation in (3.15) is proved completely analogously. The proof is complete.

Remark 1.
Inspection of the proof of Theorem 1 shows that it can be extended to the more realistic case of unequally-sized blocks Σ_[1], . . . , Σ_[b] of respective sizes p_1×p_1, . . . , p_b×p_b, where p_k ≥ 3, k ∈ ⟦b⟧, are uniformly bounded integers such that Σ_{k=1}^{b} p_k = p. This is so because, in view of relations (7.2)-(7.4) in the proof of Lemma 1, for large b the tails of the central and noncentral chi-square statistics {T̃_{k,b} : k ∈ ⟦b⟧} are not essentially affected by their degrees of freedom.

Classification rule in the region D2(θ)
In the region D2(θ), where the parameter r is small and feature selection is impossible, the classification problem is very hard (see Figure 3). We suggest that, in this region of (β, r)-values, the threshold t̃ of the classifier ψ̃_b given by (2.3) and (3.4) be chosen by using a weighted Kolmogorov-Smirnov statistic with a suitable weight function q. Namely, we set the threshold t̃ = t̃_q at the level (see formula (3.29) below) t̃_q = T̃_(b+1−k̃_q), where T̃_(k) is the kth order statistic of the data {T̃_{k,b} : k ∈ ⟦b⟧} and the index k̃_q is given by formula (3.28). The idea behind this choice of t̃_q is tightly connected to the problem of signal detection in sparse chi-square mixtures by means of weighted Kolmogorov-Smirnov tests and is detailed below. It is similar to the suggestion of Donoho and Jin [10]; the main differences are a more general classification model and a different choice of the weight function used.
In view of the intrinsic difficulty of the problem, we propose, based on heuristic arguments, a classifier that works well numerically. As seen from Section 5, our classifier ψ̃_b given by (2.3), (3.4), and (3.29) performs better numerically than the procedure of Donoho and Jin [10]. There also exists a Donoho-Jin type classifier, proposed by Fan et al. [11] for the situation when the inverse Σ^{-1} of a covariance matrix Σ admits an "acceptable" estimator, which selects useful information by means of truncated higher-criticism thresholding. Its quality, however, is hard to assess because the numerical results in Section 4 of Fan et al. [11] are given only for the region where feature selection is possible and where the classification problem is relatively easy, whereas the analytical results obtained for the whole classification region are "strongly asymptotic", and it is not clear for what p the asymptotics start to give reasonably accurate descriptions of the actual finite-sample performance.
Assume that (β, r) ∈ D2(θ) and consider the worst-case scenario when all nonzero noncentrality parameters nΔ²_{k,b}, k ∈ ⟦b⟧, are equal to 2r log b with some ρ*(β) < r < β. Then, asymptotically, the statistics {T̃_{k,b} : k ∈ ⟦b⟧} obey a chi-square mixture model with mixing proportion ε_b = b^{−β} for 0 < β < 1 and noncentrality γ_b = 2r log b for ρ*(β) < r < β. Therefore, we may consider an auxiliary problem of testing the null hypothesis H_0, under which all statistics follow the central χ²_{p0}(0) distribution, versus the alternative (more precisely, a sequence of alternatives) H_{1,b}, under which each statistic is drawn from χ²_{p0}(0) with probability 1 − ε_b and from χ²_{p0}(γ_b) with probability ε_b, where ε_b = b^{−β} for 0 < β < 1, p0 is as before, and γ_b = 2r log b for 0 < r < 1. Next, we transform the statistics t_k into the statistics s_k = 1 − G_{p0}(t_k; 0), k ∈ ⟦b⟧, which are uniformly distributed on the interval (0, 1) under the null, where G_ν(x; γ) = P(χ²_ν(γ) ≤ x), x ∈ R. In terms of a common cdf F(u) of the s_k's, the problem of testing H_0 versus H_{1,b} is equivalent to that of testing H'_0 : F(u) = F_0(u), the uniform U(0, 1) cdf, versus a sequence of upper-tailed alternatives H'_{1,b}. In connection with testing H_0 versus H_{1,b} (or, equivalently, H'_0 versus H'_{1,b}), consider the function ρ(β) defined in (2.11). It is known (see, for example, Section 4 of Stepanova and Pavlenko [24]) that if r > ρ(β) then the hypotheses separate asymptotically, whereas if r < ρ(β) then these hypotheses merge asymptotically, that is, no consistent test exists. More precisely, let F_b(u) = b^{-1} Σ_{k=1}^{b} I(s_k < u), 0 < u < 1, be the empirical distribution function (edf) based on the s_k's, and for σ = −1/2, 0, 1/2 consider the weight functions q_σ given in (3.25). The function q_{−1/2}(u) = √(u(1−u)) is a regularly varying function which is known in the literature as the standard deviation proportional (SDP) weight function. The function q_0(u) = √(u(1−u) log log(1/(u(1−u)))) is an Erdős-Feller-Kolmogorov-Petrovski (EFKP) upper-class function of a Brownian bridge; the importance of such weight functions in the theory of weighted quantile and empirical processes has been demonstrated by Csörgő et al. [7]. The function q_{1/2}(u) is an example of a Chibisov-O'Reilly function. For the use of these three classes of functions in the theory of weighted quantile and empirical processes, we refer to Csörgő et al. [7] and Csörgő and Horváth [8]. It is known (see Donoho and Jin [9] and Stepanova and Pavlenko [24]) that the tests based on the (one-sided) weighted Kolmogorov-Smirnov statistics D*_b(q_σ), obtained by maximizing the weighted empirical process √b (F_b(u) − u)/q_σ(u) over 0 < u ≤ α_0, where α_0 ∈ (0, 1/2) is a small number (say, α_0 = 0.2) chosen by the statistician, distinguish between H_0 and H_{1,b} when r > ρ(β), with H_0 being rejected for "large" values of D*_b(q_σ). Moreover, the use of the weight functions q_0 and q_{1/2} makes the problem of distinguishing between the two hypotheses easier than using q_{−1/2}. This is so because, under H_0, the statistics D*_b(q_0) and D*_b(q_{1/2}) are finite with probability 1, whereas the statistic D*_b(q_{−1/2}) introduced by Donoho and Jin [9] tends to infinity, in probability and even almost surely, under both H_0 and H_{1,b}, making the problem of separating these two hypotheses relatively hard. Let s_(1) < s_(2) < . . . < s_(b) be the order statistics of the sample s_1, . . . , s_b. Then, as each weight function q_σ is monotone on (0, α_0) for small α_0 ∈ (0, 0.2), the statistic D*_b(q_σ) is asymptotically equivalent to the maximum-type statistic

max_{1 ≤ k ≤ ⌈α_0 b⌉} √b (k/b − s_(k)) / q_σ(k/b).
In addition to the weights q_σ(u), σ = −1/2, 0, 1/2, we also explore one more weight function, q_{1/4}, given in (3.27), which is another example of a Chibisov-O'Reilly function. As shown in Ingster et al. [16] and Fan et al. [11], in somewhat different yet similar settings, if H_0 and H_{1,b} are indistinguishable (merge asymptotically), then successful classification cannot be achieved. It can only be achieved in the region r > ρ*(β) > ρ(β), with ρ*(β) as in (2.12). Thus, recalling (3.24), we arrive at the following idea of selecting useful feature blocks by means of D*_b(q_σ)-thresholding in the region D2(θ). This idea is similar to that of Donoho and Jin [10] of performing feature selection via higher-criticism thresholding, that is, by using the statistic in (3.26), which is the statistic D*_b(q_{−1/2}) in our notation. First, consider the statistics S_{k,b} = 1 − G_{p0}(T̃_{k,b}; 0), k ∈ ⟦b⟧, and note that, under H_0, the transformed statistics {S_{k,b} : k ∈ ⟦b⟧, b = 2, 3, . . .} form a triangular array of iid uniform U(0, 1) random variables. Next, denote by S_(k) the kth order statistic of the sample {S_{k,b} : k ∈ ⟦b⟧} and define the index k̃_q by

k̃_q = arg max_{1 ≤ k ≤ ⌈α_0 b⌉} √b (k/b − S_(k)) / q(k/b), (3.28)

where q is one of the weight functions q_σ with σ = −1/2, 0, 1/2, as given in (3.25), or q_{1/4} as in (3.27). Finally, we take S_(k̃_q) as a (random) feature selection threshold; that is, for all l ∈ ⟦2⟧, j ∈ ⟦n⟧, and k ∈ ⟦b⟧, the kth sub-vector X^(l)_{j,[k]} of the vector X^(l)_j is deemed useful for classification if S_{k,b} is smaller than S_(k̃_q) or, equivalently, if T̃_{k,b} is larger than t̃_q, where

t̃_q = G^{-1}_{p0}(1 − S_(k̃_q); 0) = T̃_(b+1−k̃_q). (3.29)

The graphs of the objective function √b (k/b − S_(k)) / q(k/b) on the interval (0, 0.2) for four different weight functions q, together with the corresponding thresholds S_(k̃_q), are shown in Figure 5. Figure 6 shows a histogram of T̃_{k,b}, k ∈ ⟦b⟧, and the threshold t̃_q for the four weight functions q of our interest. As seen from Figure 6, the threshold obtained by using the SDP weight function q_{−1/2} (red line) retains too many useless feature blocks for future use in classification, whereas the thresholds that correspond to the Chibisov-O'Reilly weight functions q_{1/2} and q_{1/4} (yellow and green lines) appear to ignore a certain number of useful feature blocks contained in the data. The threshold obtained by using the EFKP upper-class weight function q_0 (blue line) is a compromise that gives a better classification result (see Table 1 in Section 5).
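A compact implementation of the weighted Kolmogorov-Smirnov threshold (3.28)-(3.29) for the known-Σ statistics may be sketched as follows; the EFKP weight q_0 is used, and α_0 = 0.2 follows the suggestion in the text.

```python
import numpy as np
from scipy.stats import chi2

# Weighted Kolmogorov-Smirnov threshold (3.28)-(3.29) with the EFKP weight q_0.
def ks_threshold(T, p0, alpha0=0.2):
    b = len(T)
    S = 1.0 - chi2.cdf(T, df=p0)                  # S_{k,b} = 1 - G_{p0}(T_k; 0), uniform under H_0
    S_sorted = np.sort(S)
    k = np.arange(1, int(np.ceil(alpha0 * b)) + 1)
    u = k / b
    q0 = np.sqrt(u * (1.0 - u) * np.log(np.log(1.0 / (u * (1.0 - u)))))   # EFKP weight function
    objective = np.sqrt(b) * (u - S_sorted[:len(k)]) / q0
    k_q = k[np.argmax(objective)]                 # index (3.28)
    t_q = np.sort(T)[b - k_q]                     # threshold (3.29): the (b + 1 - k_q)th order statistic
    return t_q

# Blocks with T_k > t_q are retained; all remaining blocks are screened out.
```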
Note in passing that, in the rare and weak regime in question, various false discovery rate (FDR) controlling multiple testing procedures, including the Benjamini-Hochberg rule, provide very few discoveries and thus lead to high classification error. The desirable properties of FDR controlling procedures in multiple testing have been analytically justified mainly for the situations where rare signals are strong.

Classification rule in the region D^0_1(θ)
Let T̂_{k,b}, k ∈ ⟦b⟧, be the statistics as in (3.2). In the present setting, when Σ is unknown, consider the classifier ψ̂_b defined by (2.4), for which the selector ω̂_k, k ∈ ⟦b⟧, is given by (4.1) with some threshold level t̂ = t̂(X^(1); X^(2)) > 0. We need to set up the threshold t̂ in such a way that the maximum (over all (μ, Σ) ∈ M_{b,β,r} and all (β, r) ∈ D1(θ)) risk of ψ̂_b is small when b is large. In the case of unknown Σ, we will have to narrow down the region D1(θ) of (β, r)-values, which is the price paid for not knowing Σ.
Note that for all b " 2, 3, . . . the statisticsT k,b , k P vbw, are independent and there exists a set S Ă vbw with s " rb 1´β s elements such thatT k,b " F p0,2n´p0 pnΔ 2 k,b q for all k P S, andT k,b " F p0,2n´p0 p0q for all k R S. For x P R, let F ν1,ν2 px; γq " Ppν 1 F ν1,ν2 pγq ď xq, G ν px; γq " Ppχ 2 ν ď xq. Then, it follows from formula (6.8) of Siotani [22] that for any x ě 0, any ν 1 ą 0, and all large enough ν 2 and γ, with γ tending to infinity not very fast, Relation (4.2) shows that for large b, when multiplied by a constant factor p 0 , the F -distributed statistics tT k,b : k P vbwu in (3.2) are well approximated by the chi-square statistics tT k,b : k P vbwu in (3.1). For all 0 ă θ ă 1, we now define the two subregions D 0 1 pθq and D 0 2 pθq of the classification region as follows: In the region D 0 1 pθq, we define a selectorωpβmq " pω k pβmqq kPvbw based on tT k,b : k P vbwu similar to the one in (3.11)-(3.12). Namely, we first pick a large number M " M b , the equidistant grid points p1´θq{2 ă β 1 ă . . . ă β M ă 1´θ, and a small number δ " δ b as in (3.5)-(3.7). Next, for all k P vbw and m P vM w, we set, cf. m " max tm P vM w : d pωpβ m q,ωpβ j qq ď v j for all j ď mu , (4.4) andm " 1 if the set in (4.4) is empty. Here the quantities v j " v j,b are set to be v j " b 1´βj {τ b , j P vmw, with a sequence of numbers τ b Ñ 8 satisfying It is not difficult to show, cf. Lemma 1, that the selectorωpβmq given by  For the purpose of classification, however, the threshold that would exclude most of the useless blocks from the classification procedure needs to be higher and, as a result of this, the region where the classifierψ b does its job properly is narrowed down, as compared to the region D 1 pθq whereψ b works well, to become D 0 1 pθq. Namely, return to the definition of the classifierψ b given by (2.4) and (4.1), and define the thresholdt "t b in (4.1) bŷ t " p´1 0 p2βm`θ` q log b.  Proof. The proof of Theorem 2 is similar to that of Theorem 1 yet more technical due the presence of the estimator p Σ´1 rks of Σ´1 rks , k P vbw, in the definition ofψ b . For a number θ P p0, 1q, let pβ, rq be an arbitrary point in the region D 0 1 pθq. We need to show that where, as in the proof of Theorem 1, E Πi denotes the expectation with respect to the joint distribution of X p1q , X p2q and X 0 when X 0 " Π i for i P v2w. As the proofs of both relations in (4.6) go along the same lines, we shall only prove the first one. Using the notationΔ 2 k,b " p μ J rks p Σ´1 rks p μ rks , for k P vbw, b " 2, 3, . . ., we putV Recall also the random variables V k defined in (3.17) that, under Π 2 , satisfy whereω k "ω k pβmq is the kth component ofωpβmq in (4.3) and, cf. (3.19), with the term ř b k"1 : ω k "1 Δ 2 k,b obeying relation (3.20). As seen from the next result, the main contribution to ř b k"1 :ω k "1V k is made by The proof of Lemma 3 is given in Section 7. Now, with Lemma 3 available, the rest of the proof of Theorem 2 resembles that of Theorem 1 after Lemma 2.
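Returning to the approximation (4.2) used above, its distributional closeness is easy to check numerically. The snippet below compares upper-tail probabilities of χ²_{p0}(λ) and of p0 F_{p0,2n−p0}(λ) for a noncentrality of the order 2r log b; the exact error bound of (4.2) is not reproduced here.

```python
import numpy as np
from scipy.stats import ncx2, ncf

# Tail comparison illustrating that p0 * F_{p0, 2n - p0}(lambda) is close in
# distribution to chi2_{p0}(lambda) when n is large (cf. relation (4.2)).
p0, n = 3, 1000
lam = 2.0 * 1.0 * np.log(10_000)                   # lambda = 2 r log b with r = 1, b = 10^4
for x in (10.0, 20.0, 30.0, 40.0):
    chi_tail = ncx2.sf(x, df=p0, nc=lam)                       # P( chi2_{p0}(lam) > x )
    f_tail = ncf.sf(x / p0, dfn=p0, dfd=2 * n - p0, nc=lam)    # P( p0 * F_{p0,2n-p0}(lam) > x )
    print(f"x = {x:5.1f}   chi2 tail = {chi_tail:.4f}   p0*F tail = {f_tail:.4f}")
```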

Remark 2.
Inspection of the proof of Theorem 2 shows that it can be extended to the case of blocks Σ_[1], . . . , Σ_[b] of different sizes p_1×p_1, . . . , p_b×p_b, where p_k ≥ 3, k ∈ ⟦b⟧, are uniformly bounded integers such that Σ_{k=1}^{b} p_k = p. Indeed, by (4.2) and relations (7.2)-(7.4), for large b, the tails of the statistics T̂_{k,b} ~ F_{p_k,2n−p_k}(nΔ²_{k,b}), k ∈ ⟦b⟧, are not essentially affected by a change of the numerator degrees of freedom p_k and the denominator degrees of freedom (2n − p_k) by a finite (independent of n) integer number. In this case, the sequence of numbers τ_b, which defines the quantities v_j = b^{1−β_j}/τ_b, j ∈ ⟦m⟧, in (3.12) and (4.4), is to be chosen accordingly.

Classification rule in the region D^0_2(θ)
We shall now discuss a suitable choice of the threshold t̂ (for notational simplicity, we suppress the dependence of t̂ on b) of the classifier ψ̂_b defined by (2.4) and (4.1). By the sparsity assumption on the model, only s = ⌈b^{1−β}⌉ statistics among {T̂_{k,b} : k ∈ ⟦b⟧} have a noncentral F distribution, whereas the remaining (b − s) = b + o(b) statistics are centrally F distributed. In view of (4.2), for all k ∈ ⟦b⟧ and all large enough b, a central random variable p0 F_{p0,2n−p0}(0) is close in distribution to χ²_{p0}(0), and a noncentral random variable p0 F_{p0,2n−p0}(nΔ²_{k,b}) is close in distribution to χ²_{p0}(nΔ²_{k,b}). Therefore, similarly to the case of known Σ, we may consider the problem of testing the hypotheses H_0 versus H_{1,b}, where ε_b = b^{−β} for 0 < β < 1 and γ_b = 2r log b for 0 < r < 1, by means of weighted Kolmogorov-Smirnov test statistics.
To this end, we transform the statistics t_k into the statistics u_k = 1 − F_{p0,2n−p0}(t_k; 0), k ∈ ⟦b⟧, which are uniformly U(0, 1) distributed under H_0, where F_{ν1,ν2}(x; γ) = P(F_{ν1,ν2}(γ) ≤ x), x ∈ R. In terms of a common cdf F(u) of the u_k's, the problem of testing H_0 versus H_{1,b} is equivalent to that of testing H'_0 : F(u) = F_0(u), the uniform U(0, 1) cdf, versus a sequence of upper-tailed alternatives. As in the problem of testing H_0 versus H_{1,b} in Section 3.3, the hypotheses are separated by a weighted Kolmogorov-Smirnov test statistic in which q is one of the weight functions q_σ, σ = −1/2, 0, 1/2, of our interest, as defined in (3.25), or the function q_{1/4} as in (3.27). Similar to the choice of the threshold t̃ = t̃_q in Section 3.3, we now choose the threshold t̂ = t̂_q to be that order statistic of the sample {T̂_{k,b} : k ∈ ⟦b⟧} at which the objective function under the maximum sign in D*_b(q), based on the transformed observations U_{k,b} = 1 − F_{p0,2n−p0}(T̂_{k,b}; 0), k ∈ ⟦b⟧, is maximized. Namely, using relation (4.2) and the arguments that led us to the threshold t̃_q in (3.29), we define the threshold t̂_q = t̂_q(X^(1), X^(2)) by t̂_q = F^{-1}_{p0,2n−p0}(1 − U_(k̂_q); 0),
where the index 1 ≤ k̂_q ≤ ⌈α_0 b⌉ is chosen, cf. (3.28), as the maximizer of √b (k/b − U_(k)) / q(k/b) over 1 ≤ k ≤ ⌈α_0 b⌉, and U_(k) is the kth order statistic of the sample {U_{k,b} : k ∈ ⟦b⟧}. Alternatively, we can write t̂_q = T̂_(b+1−k̂_q). (4.12) If T̂_{k,b} > t̂_q, the kth block is deemed useful and hence is retained to contribute to ψ̂_b. The classifier ψ̂_b defined by (2.4), (4.1), and (4.12) is fully adaptive in the parameters of the model. In the regions D1(θ) and D^0_1(θ), with θ = 0.5, p0 = 3, b = 10³ and θ = 0.5, p0 = 5, b = 10⁴, and various configurations of the parameters β and r, the estimated risks R(ψ̃^(1)_b) and R(ψ̂^(1)_b), obtained by averaging over 100 independent cycles of simulations, were found to be zero up to three decimal places. Note that the choice of θ = 0.5 leads to an interesting case where n = b^{1/2} is much smaller than b for large b.
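For completeness, the unknown-Σ analogue of the weighted Kolmogorov-Smirnov threshold can be sketched in the same way. Whether the F statistics enter the cdf transform with or without the factor p0 depends on the exact convention in (3.2), which is not reproduced above, so the plain central F cdf used below is an assumption.

```python
import numpy as np
from scipy.stats import f as fdist

# Unknown-Sigma analogue of the threshold in (3.28)-(3.29): the F statistics are
# mapped to (0,1) through the central F cdf and the same weighted KS index is used.
def ks_threshold_F(T_hat, p0, n, alpha0=0.2):
    b = len(T_hat)
    U = 1.0 - fdist.cdf(T_hat, dfn=p0, dfd=2 * n - p0)   # U_{k,b}, uniform under the null
    U_sorted = np.sort(U)
    k = np.arange(1, int(np.ceil(alpha0 * b)) + 1)
    u = k / b
    q0 = np.sqrt(u * (1.0 - u) * np.log(np.log(1.0 / (u * (1.0 - u)))))   # EFKP weight
    k_q = k[np.argmax(np.sqrt(b) * (u - U_sorted[:len(k)]) / q0)]
    return np.sort(T_hat)[b - k_q]                        # t_hat_q = T_hat_(b + 1 - k_q), cf. (4.12)
```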

Numerical study
We now present some simulation results related to high-dimensional classification in the regions D2(θ) and D^0_2(θ), where feature selection is impossible. Table 1 gives a numerical summary of the performance of the classifiers ψ̃^(2)_b (when Σ is known) and ψ̂^(2)_b (when Σ is unknown) in the region where variable selection is impossible, for four different choices of the weight function q in (3.29) and (4.12). To run the simulations, we picked b = 10⁴, θ = 0.5, p0 = 3, β = 0.375, r = 0.25, and averaged the results over 100 simulation cycles. It is seen that, in the region where variable selection is impossible, the classifiers ψ̃^(2)_b and ψ̂^(2)_b, for which the selection of useful blocks is done by means of weighted Kolmogorov-Smirnov thresholding, work best when the EFKP weight function q_0(u) is used. At the same time, the SDP weight function q_{−1/2}(u), employed by Donoho and Jin [10] in a similar context, does not appear to be a good choice.

Concluding remarks
This work was inspired by the need for accurate parsimonious classification procedures in sparse high-dimensional settings. Instead of imposing the usual (and often unrealistic) assumption of mutual independence of feature variables, we suggest a different approach by allowing some local dependence, which is modelled by means of a block-diagonal covariance matrix with blocks of possibly different sizes. The assumption of a block-diagonal covariance matrix allows for a variety of within-block covariance structures. The proposed framework has some definite advantages. In particular, it enables us to obtain an accurate classifier with an incorporated group-wise adaptive feature selector. The sparse classification model at hand is described by several known parameters, including θ and p0, and two unknown parameters, β and r. For each of the two assumptions regarding the covariance matrix Σ (known or unknown), depending on the location of the point (β, r) inside the classification region, we have proposed two different classifiers. The classifiers ψ̃^(1)_b and ψ̂^(1)_b were shown to be asymptotically optimal in providing successful classification (see Theorems 1 and 2). For small values of r, when the problem of classification is very difficult, the adaptive procedures ψ̃^(2)_b and ψ̂^(2)_b were proposed and studied numerically. Although all our classifiers are adaptive, that is, their definitions do not involve β and r, the application of ψ̃^(1)_b and ψ̂^(1)_b requires that the respective assumptions r > β and r > β + θ/2 be valid. If one cannot guarantee that r is large enough to use these classifiers, one should apply ψ̃^(2)_b in the case of known Σ and ψ̂^(2)_b in the case of estimated Σ. If we are in a position to assume that r > 1, then the classifiers ψ̃^(1)_b and ψ̂^(1)_b, which work well for both equally-sized and unequally-sized blocks, should be used.

Proofs of Lemmas
Proof of Lemma 1. The proof is largely based on the fact that if the index m_0 ∈ ⟦M−1⟧ is such that β ∈ (β_{m_0}, β_{m_0+1}], then with high probability m̃ ≥ m_0, where m̃ is given by (3.12). To verify this fact, Bernstein's inequality will be used. Fact 1. Bernstein's inequality. If X_1, . . . , X_b, b ∈ N, are independent random variables such that, for all i ∈ ⟦b⟧ and some H > 0, E(X_i) = 0 and the moments |E(X_i^m)|, m ≥ 2, satisfy the Bernstein condition (7.1), then the sum Σ_{i=1}^{b} X_i admits an exponential tail bound.
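The displayed moment condition (7.1) and the resulting exponential bound are not reproduced in the extracted text; a standard form of Bernstein's inequality consistent with Remark 3 below reads as follows (the exact constants used in the paper may differ).

```latex
% Standard Bernstein inequality under the moment (Bernstein) condition.
\[
  \text{If } \ \mathbb{E}(X_i)=0 \ \text{ and } \
  \bigl|\mathbb{E}(X_i^m)\bigr| \le \tfrac{m!}{2}\, D_i^2 H^{m-2}
  \quad \text{for all integers } m\ge 2 \text{ and } i=1,\dots,b,
\]
\[
  \text{then, with } D_b^2 = \sum_{i=1}^{b} D_i^2, \qquad
  \mathbb{P}\Bigl(\sum_{i=1}^{b} X_i > t\Bigr)
  \le \exp\!\Bigl(-\frac{t^2}{2\,(D_b^2 + H t)}\Bigr), \qquad t>0.
\]
```

In the regime t ≥ D_b²/H mentioned in Remark 3(iii), the right-hand side is at most exp(−t/(4H)).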

Remark 3. (i) This version of Bernstein's inequality can be found on pages
162-166 of [3] and in Section 2.2 of [19]. (ii) For independent random variables X 1 , . . . , X b with the properties EpX i q " 0 and |X i | ď L, i P vbw, for some L ą 0, the Bernstein condition (7.1) holds with H " L{3. (iii) Below Bernstein's inequality will be applied for the case of t ě D 2 b {H. The following asymptotics for the chi-square tail probabilities will also be of great help: for any ν ě 1 as b Ñ 8 P`χ 2 ν p0q ą 2s log b˘" O´b´s log ν{2´1 b¯, 0 ă s ă 8, (7.2) P`χ 2 ν p2r log bq ą 2s log b˘" O´b´p The first two relations follow from formula (5.5) in Donoho and Jin [9]. The third one can be obtained by using relations (1) and (2) in Han [14] which give an expression of the cdf of a non-central chi-square distribution with odd degrees of freedom in terms of the cdf and pdf of the standard normal distribution. Consider the selectorωpβmq " pω 1 pβmq, . . . ,ω b pβmqq given by (3.9)-(3.12). Let index m 0 be such that the true (but unknown) parameter β P pp1´θq{2, 1´θq satisfies β Ppβ m0 , β m0`1 s. Using the law of total expectation, we can write where ω " pω k q kPvbw " pIpΔ 2 k,b ‰ 0qq kPvbw . Consider the term I 1,b . Whenm ě m 0 , by the triangle inequality and the definition ofm in (3.12), for all pμ, Σq P M b,β,r dpωpβmq, ωq ď dpωpβmq,ωpβ m0 qq`dpωpβ m0 q, ωq ď v m0`d pωpβ m0 q, ωq, Next, for any non-negative random variable Y , for which the expectations below are well defined, one has EpY |BqPpBq ď EpY q, and hence we can write For the term N 1,b , using relations (3.8), (3.10), and (7.2), we obtain For the term N 2,b , by relation (7.4) and the fact that for all sufficiently large b (recall that " b Ñ 0 as b Ñ 8) one has r ą β m0` {2, we obtain as b Ñ 8 Therefore, in view of (7.7)-(7.9), uniformly in p1´θq{2 ă β ă 1´θ and β ă r ă 4 It remains to show that, uniformly in p1´θq{2 ă β ă 1´θ and β ă r ă 4, it is also true for the term I 2,b in (7.5) that I 2,b " op1q. We have Now, using Fact 1, we show that Ppm ă m 0 q is small yielding I 2,b " op1q as b Ñ 8. Indeed, recalling (3.12) and writingT i instead ofT i,b for i P vbw, we have ą v j¸. (7.11) Now, introducing the events we obtain from (7.11) that PpA i q¸, (7.12) where the random variables X i are defined by To apply Bernstein's inequality to the term P´ř b i"1 X i ą v j´ř b i"1 PpA i q¯on the right-hand side of (7.12), we first show that ř b i"1 PpA i q " opv j q as b Ñ 8. Using (7.2) and (7.4) and recalling that rb 1´β s statistics among tT i : i P vbwu follow a noncentral chi-square distribution and the remaining statistics have a central chi-square distribution, we get for all j P vkw and k P vm 0´1 w as where β j ď β m0´1 ă β ă r and β k`1 ď β m0 ă β ă r. From this, noting that Since by definition v j " τ´1 b b 1´βj and τ b " o´b {2 log 1´p0{2 bq¯, it now follows from (7.14) that for all j P vkw one has ř b i"1 PpA i q " opv j q as b Ñ 8 and hence Also, since the variance of a random variable taking values in t0, 1u is smaller than its expectation, we have by (7.14) and by the independence ofT 1 , . . . ,T b that as b Ñ 8 Thus, for the random variables X 1 , . . . 
, X b defined in (7.13) we have |X i | ď 2 and EpX i q " 0 for i P vbw, and hence for all j P vkw and k P vm 0´1 w Therefore the application of Bernstein's inequality stated in Fact 1 with t " v j p1`op1qq, Remark 3, and the fact that β j ď β m0´1 for all j P vkw and k P vm 0´1 w give From this and (7.12) we deduce that for all large enough b Ppm ă m 0 q ď M 2 expˆ´b 1´βm 0´1 4τ b˙, and hence 4τ b˙" op1q, (7.15) where the last equality is due to the fact that 1´β m0´1 is separated from zero, which follows from the assumptions p1´θq{2 ă β ă 1´θ and β m0´1 ă β m0 ă β ď β m0`1 . Now, combining (7.5), (7.10) and (7.15), we obtain uniformly in pβ, rq P D 1 pθq for all 0 ă θ ă 1. This shows that the selectorωpβmq provides almost full selection in the region D 1 pθq for all 0 ă θ ă 1.
Proof of Lemma 2. Throughout the proof, θ is an arbitrary number in the interval p0, 1q and pβ, rq is an arbitrary point in the region D 1 pθq. For brevity, we shall write Δ 2 k,b andΔ 2 k,b as Δ 2 k andΔ 2 k for k P vbw, b " 2, 3, . . .. Let us first check the validity of (3.21). For all k P vbw and b " 2, 3, . . ., consider k¯, (7.16) where Δ 2 k " μ J rks Σ´1 rks μ rks andΔ 2 k " p μ J rks Σ´1 rks p μ rks . It is easy to see that for all pμ, Σq P M b,β,r , k P vbw, (7.17) and hence (recall that n " b θ and #tk P vbw : Therefore, by the triangle and Chebyshev's inequalities, using the block-wise independence of the data, for any ε ą 0 and all large enough b Consider the numerator on the right side of (7.18). Using relation (7.16), the inequalities pa´bq 2 ď 2pa 2`b2 q and pu J vq 2 ď }u} 2 }v} 2 , and the independence of X 0,rks and p μ rks , From this, using the identity X J AX " TrpAXX J q, the fact E Π2´X0,rks X J 0,rks¯" Σ rks , and the equalities Epχ 2 ν pλqq " ν`λ and Varpχ 2 ν pλqq " 2pν`2λq, we may continue The combination of (7.18) and (7.19) now yields that for any ε ą 0 and all pβ, rq P D 1 pθq, as b Ñ 8 sup pμ,ΣqPM b,β,r P Π2˜ˇb ÿ k"1 : ω k "1 pṼ k´Vk qˇˇˇˇě εb 1´β´θ log b¸" op1q, and hence (3.21) is proved. Next, let us verify relation (3.22). For brevity, we shall omit the argument βm of the selectorωpβmq in (3.11) and writeω " pω 1 , . . . ,ω b q. First, we have Indeed, using Lemma 1, for all pβ, rq P D 1 pθq, uniformly in pμ, Σq P M b,β,r , for all large enough b where C ą 0 is an absolute constant. From (7.20), by the triangle and Chebyshev's inequalities, using the block-wise independence of the data, for any ε ą 0 and all large enough b Consider the numerator on the right side of (7.21). Using pa´bq 2 ď 2pa 2`b2 q and pu J vq 2 ď }u} 2 }v} 2 , and noting that E Π2´X J 0,rks Σ´1 rks X 0,rks¯" p 0 and p2βm` q log b ď 3 log b when b is large, we obtain for all large enough b J 0,rks Σ´1 rks X 0,rks¯E´nΔ 2 k IpnΔ 2 k ď p2βm` q log bq1 From this, recalling (2.5) and applying Lemma 1, we get for all pβ, rq P D 1 pθq It now follows from (7.21) and (7.22) that all pβ, rq P D 1 pθq as b Ñ 8 k Ipω k " 0qˇˇˇˇě εb 1´β´θ log b¸" op1q, yielding relation (3.22). It remains to prove (3.23). Aiming again at using Chebyshev's inequality, we first show that E Π2´ř b k"1 : ω k "0Ṽ k Ipω k " 1q¯" o`b 1´β´θ log b˘, uniformly in pμ, Σq P M b,β,r . To this end, we note that for each k for which ω k " 0 the statistic nΔ 2 k has a central chi-square χ 2 p0 p0q distribution with pdf Also, as seen from the proof of Lemma 1, the grid point β m0 " δm 0 , which is chosen to have β m0 ă β ď β m0`1 , satisfies where M " M b is as in (3.5), that is, the probability Ppm ă m 0 q decreases to zero at an exponential rate as b Ñ 8. In particular, for all large enough b Ppm ă m 0 q ď b´2 β . (7.23) Therefore, since E Π2 pX 0,rks q " 0 for all those indices k for which ω k " 0, we havěˇˇˇˇE Now, applying the Cauchy-Schwarz inequality to E´nΔ 2 k Ipm ă m 0 q¯, using (7.23) and the asymptotic relation where the last two equalities are due to the fact 0 ă β´β m0 ă δ and relations (3.8) and (3.10). Thus, as b Ñ 8 Therefore, by the triangle and Chebyshev's inequalities, using the block-wise independence of the data, for any ε ą 0 and all large enough b Consider the numerator on the right side of (7.25). Applying the arguments similar to those that have led us to (7.24), we obtain the relation which together with (7.25) yields (3.23). 
Noting that θ ∈ (0, 1) and the point (β, r) ∈ D1(θ) were chosen arbitrarily completes the proof.
Proof of Lemma 3. Throughout the proof, θ is an arbitrary number in the interval (0, 1) and (β, r) is an arbitrary point in the region D^0_1(θ). As in the proof of Lemma 2, for all k ∈ ⟦b⟧ and b = 2, 3, . . ., we shall write Δ²_k, Δ̃²_k, and Δ̂²_k, suppressing the dependence on b. For brevity, we shall also omit the argument β_m̃ of the selector ω̂(β_m̃) in (4.3) and write ω̂ = (ω̂_1, . . . , ω̂_b). We first verify relation (4.9). Let V_k, Ṽ_k, and V̂_k be as defined in (3.16), (3.17), and (4.7).

Remark 4.
As seen from the derivation of (7.35), the choice of the threshold t̂ = p0^{-1}(2β_m̃ + θ + ε_b) log b instead of t̂ = p0^{-1}(2β_m̃ + ε_b) log b, which would be sufficient for selecting the useful feature blocks, is made in order to have relation (4.11) valid, and hence successful classification possible.