## The Annals of Statistics

### Hypothesis testing for high-dimensional sparse binary regression

#### Abstract

In this paper, we study the detection boundary for minimax hypothesis testing in the context of high-dimensional, sparse binary regression models. Motivated by genetic sequencing association studies for rare variant effects, we investigate the complexity of the hypothesis testing problem when the design matrix is sparse. We observe a new phenomenon in the behavior of the detection boundary which does not occur in the case of Gaussian linear regression. We derive the detection boundary as a function of two components: a design matrix sparsity index and signal strength, each of which is a function of the sparsity of the alternative. For any alternative, if the design matrix sparsity index is too high, any test is asymptotically powerless irrespective of the magnitude of signal strength. For binary design matrices with a sparsity index that is not too high, our results are parallel to those in the Gaussian case. In this context, we derive detection boundaries for both dense and sparse regimes. For the dense regime, we show that the generalized likelihood ratio test is rate optimal; for the sparse regime, we propose an extended Higher Criticism Test and show it is rate optimal and sharp. We illustrate the finite sample properties of the theoretical results using simulation studies.
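
For readers unfamiliar with Higher Criticism, the following is a minimal sketch of the classical statistic of Donoho and Jin, computed from a collection of p-values. It is not the extended Higher Criticism Test proposed in this paper, whose construction is given in the article itself. The function name `higher_criticism` and the scan fraction `alpha0` are illustrative assumptions.

```python
import math


def higher_criticism(p_values, alpha0=0.5):
    """Classical Higher Criticism statistic (Donoho-Jin form).

    Illustrative sketch only: this is NOT the extended test from the paper.
    Returns -inf if every scanned p-value is exactly 0 or 1.
    """
    n = len(p_values)
    if n == 0:
        raise ValueError("need at least one p-value")

    sorted_p = sorted(p_values)
    # Scan only the smallest alpha0 fraction of the ordered p-values.
    k_max = max(1, int(alpha0 * n))

    best = -math.inf
    for i in range(1, k_max + 1):
        p = sorted_p[i - 1]
        if not 0.0 < p < 1.0:
            continue  # skip degenerate p-values to avoid dividing by zero
        # Standardized gap between the empirical fraction i/n and p_(i).
        z = math.sqrt(n) * (i / n - p) / math.sqrt(p * (1.0 - p))
        best = max(best, z)
    return best


# Evenly spread p-values mimic a global null; a few tiny ones mimic sparse signal.
null_p = [(i + 0.5) / 100 for i in range(100)]
signal_p = [1e-6] * 5 + null_p[5:]
```

Large values of the statistic point toward a sparse alternative, so `higher_criticism(signal_p)` exceeds `higher_criticism(null_p)` in this toy setup. Calibrating a rejection threshold is a separate step that this sketch does not cover.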

#### Article information

Source
Ann. Statist., Volume 43, Number 1 (2015), 352-381.

Dates
First available in Project Euclid: 6 February 2015

Permanent link to this document
https://projecteuclid.org/euclid.aos/1423230083

Digital Object Identifier
doi:10.1214/14-AOS1279

Mathematical Reviews number (MathSciNet)
MR3311863

Zentralblatt MATH identifier
1308.62094

#### Citation

Mukherjee, Rajarshi; Pillai, Natesh S.; Lin, Xihong. Hypothesis testing for high-dimensional sparse binary regression. Ann. Statist. 43 (2015), no. 1, 352–381. doi:10.1214/14-AOS1279. https://projecteuclid.org/euclid.aos/1423230083

#### Supplemental materials

• Supplementary material: Supplement to “Hypothesis testing for high-dimensional sparse binary regression”. The supplementary material contains the proofs of all theorems, propositions and supporting lemmas.