The Annals of Statistics

Detecting rare and faint signals via thresholding maximum likelihood estimators

Yumou Qiu, Song Xi Chen, and Dan Nettleton

Abstract

Motivated by the analysis of RNA sequencing (RNA-seq) data for genes differentially expressed across multiple conditions, we consider detecting rare and faint signals in high-dimensional response variables. We address the signal detection problem under a general framework, which includes generalized linear models for count-valued responses as special cases. We propose a test statistic that carries out a multi-level thresholding on maximum likelihood estimators (MLEs) of the signals, based on a new Cramér-type moderate deviation result for multidimensional MLEs. Based on the multi-level thresholding test, a multiple testing procedure is proposed for signal identification. Numerical simulations and a case study on maize RNA-seq data are conducted to demonstrate the effectiveness of the proposed approaches on signal detection and identification.
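
The abstract does not state the form of the test statistic, so the following is only a minimal sketch of one generic multi-level thresholding construction, not the authors' procedure. It assumes the MLEs have already been standardized to values z_j that are approximately independent N(0,1) under the null. It also assumes thresholds of the form 2 s log p for levels s in (0, 1) and a maximum taken over a finite grid of levels. The function names and the grid are hypothetical.

```python
import math

def normal_pdf(t):
    # Density of a standard normal at t.
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def normal_sf(t):
    # Upper tail P(Z > t) for a standard normal Z.
    return 0.5 * math.erfc(t / math.sqrt(2.0))

def null_moments(lam, p):
    """Mean and variance of sum_j Z_j^2 * 1{Z_j^2 > lam} for p i.i.d. N(0,1) draws."""
    t = math.sqrt(lam)
    # E[Z^2 1{|Z| > t}] and E[Z^4 1{|Z| > t}], obtained by integration by parts.
    m2 = 2.0 * (t * normal_pdf(t) + normal_sf(t))
    m4 = 2.0 * ((t ** 3 + 3.0 * t) * normal_pdf(t) + 3.0 * normal_sf(t))
    return p * m2, p * (m4 - m2 * m2)

def multilevel_threshold_stat(z, levels):
    """Maximum over threshold levels of the standardized thresholded sum of squares.

    Assumption: z holds standardized estimates; the threshold at level s is 2*s*log(p).
    """
    p = len(z)
    best = -math.inf
    for s in levels:
        lam = 2.0 * s * math.log(p)
        # Keep only squared estimates that exceed the current threshold.
        t_stat = sum(v * v for v in z if v * v > lam)
        mean, var = null_moments(lam, p)
        best = max(best, (t_stat - mean) / math.sqrt(var))
    return best
```

Under these assumptions, a large value of `multilevel_threshold_stat(z, levels)` points to the presence of some nonzero signals. Calibrating a rejection cutoff would require the limiting null distribution, which the abstract does not give.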

Article information

Source
Ann. Statist., Volume 46, Number 2 (2018), 895-923.

Dates
Received: August 2016
Revised: April 2017
First available in Project Euclid: 3 April 2018

Permanent link to this document
https://projecteuclid.org/euclid.aos/1522742440

Digital Object Identifier
doi:10.1214/17-AOS1574

Mathematical Reviews number (MathSciNet)
MR3782388

Zentralblatt MATH identifier
06870283

Subjects
Primary: 62H15: Hypothesis testing
Secondary: 62G20: Asymptotic properties
Secondary: 62G32: Statistics of extreme values; tail inference

Keywords
Detection boundary; false discovery proportion; generalized linear model; moderate deviation; multiple testing procedure; RNA-seq data

Citation

Qiu, Yumou; Chen, Song Xi; Nettleton, Dan. Detecting rare and faint signals via thresholding maximum likelihood estimators. Ann. Statist. 46 (2018), no. 2, 895–923. doi:10.1214/17-AOS1574. https://projecteuclid.org/euclid.aos/1522742440

Supplemental materials

  • Supplement to “Detecting rare and faint signals via thresholding maximum likelihood estimators”. The supplemental article contains additional empirical results, as well as the proofs of all the theoretical results not in the Appendix.