## The Annals of Statistics

### Permutation $p$-value approximation via generalized Stolarsky invariance

#### Abstract

It is common for genomic data analysis to use $p$-values from a large number of permutation tests. The multiplicity of tests may require extremely small $p$-values in order to reject any null hypotheses, and the common practice of using randomly sampled permutations then becomes very expensive. We propose an inexpensive approximation to $p$-values for two-sample linear test statistics, derived from Stolarsky’s invariance principle. The method creates a geometrically derived reference set of approximate $p$-values for each hypothesis. The average of that set is used as a point estimate $\hat{p}$ and our generalization of the invariance principle allows us to compute the variance of the $p$-values in that set. We find that in cases where the point estimate is small, the variance is a modest multiple of the square of that point estimate, yielding a relative error property similar to that of saddlepoint approximations. On a Parkinson’s disease data set, the new approximation is faster and more accurate than the saddlepoint approximation. We also obtain a simple probabilistic explanation of Stolarsky’s invariance principle.
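To make the cost argument concrete, the following is a minimal sketch of the baseline the abstract refers to: a $p$-value estimated from randomly sampled permutations. It is not the Stolarsky-based approximation proposed in the paper. The function name `permutation_p_value`, the choice of statistic (the sum of the first group's values, one simple two-sample linear statistic), the one-sided direction, and the add-one correction are all assumptions made for illustration.

```python
import random


def permutation_p_value(x, y, num_permutations=10_000, seed=0):
    """Monte Carlo permutation p-value (one-sided, large values extreme).

    Hypothetical helper for illustration only. The statistic is the sum
    of the first group's values, which is linear in the group labels.
    """
    rng = random.Random(seed)
    pooled = list(x) + list(y)
    m = len(x)
    observed = sum(x)

    hits = 0
    for _ in range(num_permutations):
        # Shuffling the pooled data reassigns group labels at random.
        rng.shuffle(pooled)
        if sum(pooled[:m]) >= observed:
            hits += 1

    # Add-one correction: the estimate can never fall below
    # 1 / (num_permutations + 1).
    return (hits + 1) / (num_permutations + 1)


if __name__ == "__main__":
    group_a = [51, 60, 58, 63]
    group_b = [10, 12, 9, 11]
    print(permutation_p_value(group_a, group_b, num_permutations=2000))
```

With $N$ sampled permutations this estimate cannot be smaller than $1/(N+1)$, so resolving a $p$-value near $10^{-8}$ would take on the order of $10^{8}$ permutations per hypothesis. That is the expense the abstract describes when many tests are run.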

#### Article information

Source
Ann. Statist., Volume 47, Number 1 (2019), 583-611.

Dates
Revised: February 2018
First available in Project Euclid: 30 November 2018

Permanent link to this document
https://projecteuclid.org/euclid.aos/1543568599

Digital Object Identifier
doi:10.1214/18-AOS1702

Mathematical Reviews number (MathSciNet)
MR3909943

Zentralblatt MATH identifier
07036212

#### Citation

He, Hera Y.; Basu, Kinjal; Zhao, Qingyuan; Owen, Art B. Permutation $p$-value approximation via generalized Stolarsky invariance. Ann. Statist. 47 (2019), no. 1, 583–611. doi:10.1214/18-AOS1702. https://projecteuclid.org/euclid.aos/1543568599

#### References

• Ackermann, M. and Strimmer, K. (2009). A general modular framework for gene set enrichment analysis. BMC Bioinform. 10 1–20.
• Barnard, G. A. (1963). Discussion of the spectral analysis of point processes (by M. S. Bartlett). J. Roy. Statist. Soc. Ser. B 25 294.
• Bilyk, D., Dai, F. and Matzke, R. (2018). Stolarsky principle and energy optimization on the sphere. Constr. Approx. 48 31–60.
• Blanchet, J. and Glynn, P. (2008). Efficient rare-event simulation for the maximum of heavy-tailed random walks. Ann. Appl. Probab. 18 1351–1378.
• Brauchart, J. S. and Dick, J. (2013). A simple proof of Stolarsky’s invariance principle. Proc. Amer. Math. Soc. 141 2085–2096.
• Fadista, J., Manning, A. K., Florez, J. C. and Groop, L. (2016). The (in)famous GWAS $p$-value threshold revisited and updated for low-frequency variants. Eur. J. Hum. Genet. 24 1202–1205.
• He, H. Y. (2016). Efficient permutation-based $p$-value estimation for gene set tests. Ph.D. thesis, Stanford Univ., Stanford, CA.
• He, H. Y., Basu, K., Zhao, Q. and Owen, A. B. (2019). Supplement to “Permutation $p$-value approximation via generalized Stolarsky invariance.” DOI:10.1214/18-AOS1702SUPP.
• Jiang, Z. and Gentleman, R. (2005). Reproducible research: A bioinformatics case study. Bioinformatics 23 306–313.
• Knijnenburg, T. A., Wessels, L. F. A., Reinders, M. J. T. and Shmulevich, I. (2009). Fewer permutations, more accurate $p$-values. Bioinformatics 25 i161–i168.
• Larson, J. L. and Owen, A. B. (2015). Moment based gene set tests. BMC Bioinform. 16 132.
• Lee, Y. and Kim, W. C. (2014). Concise formulas for the surface area of the intersection of two hyperspherical caps. Technical report, Korea Advanced Institute of Science and Technology.
• Lehmann, E. L. and Romano, J. P. (2005). Testing Statistical Hypotheses, 3rd ed. Springer, New York.
• Mardia, K. V. and Jupp, P. E. (2000). Directional Statistics. Wiley, Chichester. [Revised reprint of Mardia, K. V. (1972). Statistics of Directional Data. Probability and Mathematical Statistics 13. Academic Press, London. MR0336854.]
• Moran, L. B., Duke, D. C., Deprez, M., Dexter, D. T., Pearce, R. K. B. and Graeber, M. B. (2006). Whole genome expression profiling of the medial and lateral substantia nigra in Parkinson’s disease. Neurogenetics 7 1–11.
• Narcowich, F. J., Sun, X., Ward, J. D. and Wu, Z. (2010). LeVeque type inequalities and discrepancy estimates for minimal energy configurations on spheres. J. Approx. Theory 162 1256–1278.
• Niederreiter, H. (1992). Random Number Generation and Quasi-Monte Carlo Methods. CBMS-NSF Regional Conference Series in Applied Mathematics 63. SIAM, Philadelphia, PA.
• R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
• Reid, N. (1988). Saddlepoint methods and statistical inference. Statist. Sci. 3 213–238.
• Robinson, J. (1982). Saddlepoint approximations for permutation tests and confidence intervals. J. Roy. Statist. Soc. Ser. B 44 91–101.
• Scherzer, C. R., Eklund, A. C., Morse, L. J., Liao, Z., Locascio, J. J., Fefer, D., Schwarzschild, M. A., Schlossmacher, M. G., Hauser, M. A., Vance, J. M., Sudarsky, L. R., Standaert, D. G., Growdon, J. H., Jensen, R. V. and Gullans, S. R. (2007). Molecular markers of early Parkinson’s disease based on gene expression in blood. Proc. Natl. Acad. Sci. USA 104 955–960.
• Stolarsky, K. B. (1973). Sums of distances between points on a sphere. II. Proc. Amer. Math. Soc. 41 575–582.
• Tian, L., Greenberg, S. A., Kong, S. W., Altschuler, J., Kohane, I. S. and Park, P. J. (2005). Discovering statistically significant pathways in expression profiling studies. Proc. Natl. Acad. Sci. USA 102 13544–13549.
• Zhang, Y., James, M., Middleton, F. A. and Davis, R. L. (2005). Transcriptional analysis of multiple brain regions in Parkinson’s disease supports the involvement of specific protein processing, energy metabolism, and signaling pathways, and suggests novel disease mechanisms. Amer. J. Med. Genet. B Neuropsych. Genet. 137B 5–16.
• Zhou, C., Wang, H. J. and Wang, Y. M. (2009). Efficient moments-based permutation tests. Adv. Neural Inf. Process. Syst. 22 2277–2285.

#### Supplemental materials

• Supplement to “Permutation $p$-value approximation via generalized Stolarsky invariance”. The supplement presents additional material, including lengthier proofs.