Statistical Science

Monte Carlo Null Models for Genomic Data

Egil Ferkingstad, Lars Holden, and Geir Kjetil Sandve

Full-text: Open access

Abstract

As increasingly complex hypothesis-testing scenarios are considered in many scientific fields, analytic derivation of null distributions is often out of reach. To the rescue comes Monte Carlo testing, which may appear deceptively simple: as long as you can sample test statistics under the null hypothesis, the $p$-value is just the proportion of sampled test statistics that exceed the observed test statistic. Sampling test statistics is often simple once you have a Monte Carlo null model for your data, and defining some form of randomization procedure is, in many cases, also relatively straightforward. However, there may be several possible choices of a randomization null model for the data and no clear-cut criteria for choosing among them. Different null models may lead to very different $p$-values, and a very low $p$-value may thus reflect the inadequacy of the chosen null model. It is preferable to use assumptions about the underlying random data generation process to guide the selection of a null model. In many cases, we may order the null models by increasing preservation of the data characteristics, and we argue in this paper that this ordering in most cases gives increasing $p$-values, that is, lower significance. We call this the null complexity principle. The principle gives a better understanding of the different null models and may guide the choice among them.
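The Monte Carlo $p$-value described in the abstract can be sketched in a few lines. The example below is a minimal illustration, not the authors' implementation: the two position tracks, the overlap statistic, the uniform-placement null model, and all function names are hypothetical choices made here. It also uses the common add-one estimator, which counts ties and includes the observed statistic among the samples; this differs slightly from the plain proportion stated in the abstract.

```python
import random


def monte_carlo_p_value(observed, null_stats):
    """One-sided Monte Carlo p-value with the add-one correction (an assumption here)."""
    at_least = sum(1 for t in null_stats if t >= observed)
    return (1 + at_least) / (1 + len(null_stats))


def overlap(track_a, track_b):
    """Test statistic: number of positions shared by the two tracks."""
    return len(set(track_a) & set(track_b))


def sample_null_stat(track_a, track_b, genome_length, rng):
    # Hypothetical null model: keep track_b fixed and place the same number
    # of track_a positions uniformly at random, without replacement.
    # This preserves only the count of positions, not their spacing.
    randomized = rng.sample(range(genome_length), len(track_a))
    return overlap(randomized, track_b)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
genome_length = 1000
track_b = list(range(0, 1000, 10))  # 100 toy positions
track_a = list(range(0, 500, 10))   # 50 toy positions, all shared with track_b

observed = overlap(track_a, track_b)
null_stats = [
    sample_null_stat(track_a, track_b, genome_length, rng) for _ in range(999)
]
p_value = monte_carlo_p_value(observed, null_stats)
```

A null model that preserved more structure of `track_a`, for example the distances between consecutive positions, would be swapped in by replacing `sample_null_stat` only; the $p$-value computation stays the same, which is what makes comparing null models straightforward in practice.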

Article information

Source
Statist. Sci., Volume 30, Number 1 (2015), 59-71.

Dates
First available in Project Euclid: 4 March 2015

Permanent link to this document
https://projecteuclid.org/euclid.ss/1425492440

Digital Object Identifier
doi:10.1214/14-STS484

Mathematical Reviews number (MathSciNet)
MR3317754

Zentralblatt MATH identifier
1332.62411

Keywords
Monte Carlo methods; hypothesis testing; genomics

Citation

Ferkingstad, Egil; Holden, Lars; Sandve, Geir Kjetil. Monte Carlo Null Models for Genomic Data. Statist. Sci. 30 (2015), no. 1, 59–71. doi:10.1214/14-STS484. https://projecteuclid.org/euclid.ss/1425492440


