The Annals of Applied Statistics

A multi-functional analyzer uses parameter constraints to improve the efficiency of model-based gene-set analysis

Zhishi Wang, Qiuling He, Bret Larget, and Michael A. Newton

Full-text: Open access


We develop a model-based methodology for integrating gene-set information with an experimentally-derived gene list. The methodology uses a previously reported sampling model, but takes advantage of natural constraints in the high-dimensional discrete parameter space in order to work from a more structured prior distribution than is currently available. We show how the natural constraints are expressed in terms of linear inequality constraints within a set of binary latent variables. Further, the currently available prior gives low probability to these constraints in complex systems, such as Gene Ontology (GO), thus reducing the efficiency of statistical inference. We develop two computational advances to enable posterior inference within the constrained parameter space: one using integer linear programming for optimization and one using a penalized Markov chain sampler. Numerical experiments demonstrate the utility of the new methodology for a multivariate integration of genomic data with GO or related information systems. Compared to available methods, the proposed multi-functional analyzer covers more reported genes without mis-covering nonreported genes, as demonstrated on genome-wide data from association studies of type 2 diabetes and from RNA interference studies of influenza.
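The idea of expressing a logical constraint as linear inequalities over binary latent variables can be illustrated with a small sketch. The abstract does not spell out the constraints, so the rule used here ("a gene set is active exactly when all of its genes are active") is an assumption made for illustration. The gene sets, gene names, and function names are hypothetical. The sketch enumerates every binary configuration of a toy system and checks that the linear form agrees with the logical form.

```python
from itertools import product

# Hypothetical toy system: two gene sets over three genes (names are made up).
GENE_SETS = {"set_a": ["g1", "g2"], "set_b": ["g2", "g3"]}
GENES = ["g1", "g2", "g3"]


def satisfies_logical_rule(set_state, gene_state):
    """Assumed rule: a set is active exactly when all of its genes are active."""
    return all(
        set_state[s] == int(all(gene_state[g] for g in members))
        for s, members in GENE_SETS.items()
    )


def satisfies_linear_inequalities(set_state, gene_state):
    """The same assumed rule written as linear inequalities over 0/1 variables."""
    for s, members in GENE_SETS.items():
        z = set_state[s]
        # z <= a_g for every member gene: an active set forces active genes.
        if any(z > gene_state[g] for g in members):
            return False
        # sum(a_g) - z <= |set| - 1: all genes active forces an active set.
        if sum(gene_state[g] for g in members) - z > len(members) - 1:
            return False
    return True


def enumerate_feasible():
    """Brute-force all binary configurations; keep those meeting the inequalities."""
    n_sets = len(GENE_SETS)
    feasible = []
    for bits in product([0, 1], repeat=n_sets + len(GENES)):
        set_state = dict(zip(GENE_SETS, bits[:n_sets]))
        gene_state = dict(zip(GENES, bits[n_sets:]))
        linear_ok = satisfies_linear_inequalities(set_state, gene_state)
        # The two formulations should agree on every configuration.
        assert linear_ok == satisfies_logical_rule(set_state, gene_state)
        if linear_ok:
            feasible.append((set_state, gene_state))
    return feasible


feasible = enumerate_feasible()
total = 2 ** (len(GENE_SETS) + len(GENES))
print(len(feasible), "of", total, "configurations satisfy the constraints")
```

In this toy case only a fraction of all configurations is feasible, which gives a rough sense of why a prior that ignores such constraints could place little mass on the constrained region. Brute-force enumeration is used purely for checking; it does not scale to systems the size of GO, where an integer linear programming solver or a sampler would be the practical choice.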

Article information

Ann. Appl. Stat., Volume 9, Number 1 (2015), 225-246.

First available in Project Euclid: 28 April 2015

Keywords: Gene-set enrichment; Bayesian analysis; integer linear programming


Wang, Zhishi; He, Qiuling; Larget, Bret; Newton, Michael A. A multi-functional analyzer uses parameter constraints to improve the efficiency of model-based gene-set analysis. Ann. Appl. Stat. 9 (2015), no. 1, 225–246. doi:10.1214/14-AOAS777.

References


  • Arratia, R., Goldstein, L. and Gordon, L. (1990). Poisson approximation and the Chen–Stein method. Statist. Sci. 5 403–434.
  • Barry, W. T., Nobel, A. B. and Wright, F. A. (2008). A statistical framework for testing functional categories in microarray data. Ann. Appl. Stat. 2 286–315.
  • Bauer, S., Gagneur, J. and Robinson, P. N. (2010). GOing Bayesian: Model-based gene set analysis of genome-scale data. Nucleic Acids Res. 38 3523–3532.
  • Bauer, S., Robinson, P. N. and Gagneur, J. (2011). Model-based gene set analysis for Bioconductor. Bioinformatics 27 1882–1883.
  • Carvalho, L. E. and Lawrence, C. E. (2008). Centroid estimators for inference in high-dimensional discrete spaces. Proc. Natl. Acad. Sci. USA 105 3209–3214.
  • Gentleman, R., Carey, V. J., Bates, D. M., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y. et al. (2004). Bioconductor: Open software development for computational biology and bioinformatics. Genome Biol. 5 R80.
  • Goeman, J. J. and Bühlmann, P. (2007). Analyzing gene expression data in terms of gene sets: Methodological issues. Bioinformatics 23 980–987.
  • Hao, L., He, Q., Wang, Z., Craven, M., Newton, M. A. and Ahlquist, P. (2013). Limited agreement of independent RNAi screens for virus-required host genes owes more to false-negative than false-positive factors. PLoS Comput. Biol. 9 e1003235, 20.
  • Kanehisa, M. and Goto, S. (2000). KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28 27–30.
  • Khatri, P., Sirota, M. and Butte, A. J. (2012). Ten years of pathway analysis: Current approaches and outstanding challenges. PLoS Comput. Biol. 8 e1002375.
  • Matthews, L., Gopinath, G., Gillespie, M., Caudy, M., Croft, D., de Bono, B., Garapati, P., Hemish, J., Hermjakob, H., Jassal, B., Kanapin, A., Lewis, S., Mahajan, S., May, B., Schmidt, E., Vastrik, I., Wu, G., Birney, E., Stein, L. and D’Eustachio, P. (2009). Reactome knowledgebase of biological pathways and processes. Nucleic Acids Res. 37 D619–D622.
  • Morris, A. P. et al. (2012). Large-scale association analysis provides insights into the genetic architecture and pathophysiology of type 2 diabetes. Nat. Genet. 44 981–990.
  • Newton, M. A., He, Q. and Kendziorski, C. (2012). A model-based analysis to infer the functional content of a gene list. Stat. Appl. Genet. Mol. Biol. 11 Art. 9, 27.
  • R Development Core Team (2011). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. Available at
  • Sartor, M. A., Leikauf, G. D. and Medvedovic, M. (2009). LRpath: A logistic regression approach for identifying enriched biological groups in gene expression data. Bioinformatics 25 211–217.
  • The Gene Ontology Consortium (2000). Gene ontology: Tool for the unification of biology. Nat. Genet. 25 25–29.
  • Wang, Z., He, Q., Larget, B. and Newton, M. A. (2014). Supplement to “A multi-functional analyzer uses parameter constraints to improve the efficiency of model-based gene-set analysis.” DOI:10.1214/14-AOAS777SUPP.

Supplemental materials

  • More on role modeling: We provide further details on violation probabilities, on estimating false-positive and true-positive error rates, on preparing data for the ILP algorithm, and on further data analysis findings in the T2D and RNAi examples.