## The Annals of Statistics

### Optimal classification in sparse Gaussian graphic model

#### Abstract

Consider a two-class classification problem where the number of features is much larger than the sample size. The features are masked by Gaussian noise with mean zero and covariance matrix $\Sigma$, where the precision matrix $\Omega=\Sigma^{-1}$ is unknown but is presumably sparse. The useful features, also unknown, are sparse and each contributes weakly (i.e., rare and weak) to the classification decision.

By obtaining a reasonably good estimate of $\Omega$, we formulate the setting as a linear regression model. We propose a two-stage classification method where we first select features by the method of Innovated Thresholding (IT), and then use the retained features and Fisher’s LDA for classification. In this approach, a crucial problem is how to set the threshold of IT. We approach this problem by adapting the recent innovation of Higher Criticism Thresholding (HCT).
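The threshold-setting step can be illustrated with a small sketch. This is not the paper's full procedure (which first transforms the data by the estimated precision matrix $\Omega$ before thresholding); it only illustrates the generic Higher Criticism Thresholding recipe of Donoho and Jin on a vector of z-scores, and the function name `hct_threshold` and the search fraction `alpha0` are illustrative choices, not notation from the paper.

```python
import math
import numpy as np

def hct_threshold(z, alpha0=0.10):
    """Sketch of Higher Criticism Thresholding (HCT) for feature selection.

    Given z-scores for p features, convert them to two-sided p-values,
    evaluate the HC objective over the smallest alpha0-fraction of the
    sorted p-values, and return the |z| at the maximizing index as the
    data-driven threshold. Features with |z| above this threshold are
    retained for the classifier.
    """
    z = np.asarray(z, dtype=float)
    p = len(z)
    # two-sided normal p-value: 2 * Phi(-|z|) = erfc(|z| / sqrt(2))
    pvals = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])
    order = np.argsort(pvals)
    k = max(int(alpha0 * p), 1)
    sp = np.clip(pvals[order][:k], 1e-300, 1.0 - 1e-16)
    i = np.arange(1, k + 1)
    # HC objective: sqrt(p) * (i/p - p_(i)) / sqrt(p_(i) * (1 - p_(i)))
    hc = math.sqrt(p) * (i / p - sp) / np.sqrt(sp * (1.0 - sp))
    imax = int(np.argmax(hc))
    return abs(z[order[imax]])
```

The clipping of the sorted p-values is a numerical guard for extreme z-scores; restricting the maximization to the smallest `alpha0`-fraction of p-values follows the usual HC convention of searching only the extreme tail.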

We find that when useful features are rare and weak, the limiting behavior of HCT is essentially as good as that of the ideal threshold, the threshold one would choose if the underlying distribution of the signals were known. Somewhat surprisingly, when $\Omega$ is sufficiently sparse, its off-diagonal entries usually do not have a major influence over the classification decision.

Compared to recent work in the case where $\Omega$ is the identity matrix [Proc. Natl. Acad. Sci. USA 105 (2008) 14790–14795; Philos. Trans. R. Soc. Lond. Ser. A Math. Phys. Eng. Sci. 367 (2009) 4449–4470], the current setting is much more general and calls for a new approach and much more sophisticated analysis. One key component of the analysis is the intimate relationship between HCT and Fisher's separation. Another is a set of tight large-deviation bounds on empirical processes for data with unconventional correlation structures, where graph theory on vertex coloring plays an important role.

#### Article information

Source
Ann. Statist., Volume 41, Number 5 (2013), 2537–2571.

Dates
First available in Project Euclid: 19 November 2013

https://projecteuclid.org/euclid.aos/1384871345

Digital Object Identifier
doi:10.1214/13-AOS1163

Mathematical Reviews number (MathSciNet)
MR3161437

Zentralblatt MATH identifier
1294.62061

Subjects
Primary: 62G05: Estimation
Secondary: 62G32: Statistics of extreme values; tail inference

#### Citation

Fan, Yingying; Jin, Jiashun; Yao, Zhigang. Optimal classification in sparse Gaussian graphic model. Ann. Statist. 41 (2013), no. 5, 2537--2571. doi:10.1214/13-AOS1163. https://projecteuclid.org/euclid.aos/1384871345

#### References

• [1] Anderson, T. W. (2003). An Introduction to Multivariate Statistical Analysis, 3rd ed. Wiley, Hoboken, NJ.
• [2] Arias-Castro, E., Candès, E. J. and Plan, Y. (2011). Global testing under sparse alternatives: ANOVA, multiple comparisons and the higher criticism. Ann. Statist. 39 2533–2556.
• [3] Bickel, P. J. and Levina, E. (2004). Some theory of Fisher’s linear discriminant function, “naive Bayes”, and some alternatives when there are many more variables than observations. Bernoulli 10 989–1010.
• [4] Bickel, P. J. and Levina, E. (2008). Regularized estimation of large covariance matrices. Ann. Statist. 36 199–227.
• [5] Bollobás, B. (1998). Modern Graph Theory. Graduate Texts in Mathematics 184. Springer, New York.
• [6] Breiman, L. (2001). Random forests. Mach. Learn. 45 5–32.
• [7] Burges, C. (1998). A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discov. 2 121–167.
• [8] Cai, T. and Liu, W. (2011). A direct estimation approach to sparse linear discriminant analysis. J. Amer. Statist. Assoc. 106 1566–1577.
• [9] Cai, T., Liu, W. and Luo, X. (2011). A constrained $\ell_{1}$ minimization approach to sparse precision matrix estimation. J. Amer. Statist. Assoc. 106 594–607.
• [10] Cai, T. T., Jin, J. and Low, M. G. (2007). Estimation and confidence sets for sparse normal mixtures. Ann. Statist. 35 2421–2449.
• [11] Candes, E. and Tao, T. (2007). The Dantzig selector: Statistical estimation when $p$ is much larger than $n$. Ann. Statist. 35 2313–2351.
• [12] Cayon, L., Jin, J. and Treaster, A. (2005). Higher criticism statistic: Detecting and identifying non-Gaussianity in the WMAP first-year data. Mon. Not. R. Astron. Soc. 362 826–832.
• [13] Dempster, A. P. (1972). Covariance selection. Biometrics 28 157–175.
• [14] Dettling, M. and Bühlmann, P. (2003). Boosting for tumor classification with gene expression data. Bioinformatics 19 1061–1069.
• [15] Donoho, D. and Jin, J. (2004). Higher criticism for detecting sparse heterogeneous mixtures. Ann. Statist. 32 962–994.
• [16] Donoho, D. and Jin, J. (2008). Higher criticism thresholding: Optimal feature selection when useful features are rare and weak. Proc. Natl. Acad. Sci. USA 105 14790–14795.
• [17] Donoho, D. and Jin, J. (2009). Feature selection by higher criticism thresholding achieves the optimal phase diagram. Philos. Trans. R. Soc. Lond. Ser. A Math. Phys. Eng. Sci. 367 4449–4470.
• [18] Donoho, D. L. and Johnstone, I. M. (1994). Minimax risk over $l_{p}$-balls for $l_{q}$-error. Probab. Theory Related Fields 99 277–303.
• [19] Efron, B. (2009). Empirical Bayes estimates for large-scale prediction problems. J. Amer. Statist. Assoc. 104 1015–1028.
• [20] Fan, J. and Fan, Y. (2008). High-dimensional classification using features annealed independence rules. Ann. Statist. 36 2605–2637.
• [21] Fan, J., Feng, Y. and Tong, X. (2012). A road to classification in high dimensional space: The regularized optimal affine discriminant. J. R. Stat. Soc. Ser. B Stat. Methodol. 74 745–771.
• [22] Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96 1348–1360.
• [23] Fan, Y., Jin, J. and Yao, Z. (2013). Supplement to “Optimal classification in sparse Gaussian graphic model.” DOI:10.1214/13-AOS1163SUPP.
• [24] Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics 7 179–188.
• [25] Friedman, J., Hastie, T. and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9 432–441.
• [26] Genovese, C. R., Jin, J., Wasserman, L. and Yao, Z. (2012). A comparison of the lasso and marginal regression. J. Mach. Learn. Res. 13 2107–2143.
• [27] Hall, P. and Jin, J. (2010). Innovated higher criticism for detecting sparse signals in correlated noise. Ann. Statist. 38 1686–1732.
• [28] Hall, P., Pittelkow, Y. and Ghosh, M. (2008). Theoretical measures of relative performance of classifiers for high dimensional data with small sample sizes. J. R. Stat. Soc. Ser. B Stat. Methodol. 70 159–173.
• [29] He, S. and Wu, Z. (2012). Gene-based higher criticism methods for large-scale exonic SNP data. BMC Proceedings 5 S65.
• [30] Ingster, Y. I. (1997). Some problems of hypothesis testing leading to infinitely divisible distributions. Math. Methods Statist. 6 47–69.
• [31] Ingster, Y. I. (1999). Minimax detection of a signal for $l^{p}_{n}$-balls. Math. Methods Statist. 7 401–428.
• [32] Ingster, Y. I., Pouet, C. and Tsybakov, A. B. (2009). Classification of sparse high-dimensional vectors. Philos. Trans. R. Soc. Lond. Ser. A Math. Phys. Eng. Sci. 367 4427–4448.
• [33] Jager, L. and Wellner, J. A. (2007). Goodness-of-fit tests via phi-divergences. Ann. Statist. 35 2018–2053.
• [34] Ji, P. and Jin, J. (2012). UPS delivers optimal phase diagram in high-dimensional variable selection. Ann. Statist. 40 73–103.
• [35] Jin, J. (2009). Impossibility of successful classification when useful features are rare and weak. Proc. Natl. Acad. Sci. USA 106 8859–8864.
• [36] Jin, J. and Wang, W. (2012). Optimal spectral clustering by higher criticism thresholding. Unpublished manuscript.
• [37] Li, C. and Li, H. (2008). Network-constrained regularization and variable selection for analysis of genomic data. Bioinformatics 24 1175–1182.
• [38] Ravikumar, P., Wainwright, M. J., Raskutti, G. and Yu, B. (2011). High-dimensional covariance estimation by minimizing $\ell_{1}$-penalized log-determinant divergence. Electron. J. Stat. 5 935–980.
• [39] Sabatti, C., Service, S., Hartikainen, A.-L. et al. (2008). Genome-wide association analysis of metabolic traits in a birth cohort from a founder population. Nat. Genet. 41 35–46.
• [40] Shao, J., Wang, Y., Deng, X. and Wang, S. (2011). Sparse linear discriminant analysis by thresholding for high dimensional data. Ann. Statist. 39 1241–1265.
• [41] Shorack, G. R. and Wellner, J. A. (1986). Empirical Processes with Applications to Statistics. Wiley, New York.
• [42] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 58 267–288.
• [43] Tibshirani, R., Hastie, T., Narasimhan, B. and Chu, G. (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc. Natl. Acad. Sci. USA 99 6567–6572.
• [44] Tukey, J. W. (1976). T13 N: The higher criticism. Course notes, Stat. 411, Princeton Univ.
• [45] Zhong, P., Chen, S. and Xu, M. (2012). Alternative tests to higher criticism for high dimensional means under sparsity and column-wise dependency. Unpublished manuscript.