The Annals of Applied Statistics

Sparse regulatory networks

Gareth M. James, Chiara Sabatti, Nengfeng Zhou, and Ji Zhu


Abstract

In many organisms the expression level of each gene is controlled by the activation levels of known “Transcription Factors” (TFs). A problem of considerable interest is that of estimating the “Transcription Regulation Networks” (TRNs) relating the TFs and genes. While the expression levels of genes can be observed, the activation levels of the corresponding TFs are usually unknown, which greatly increases the difficulty of the problem. Previous experimental work often provides partial information about the TRN. For example, certain TFs may be known to regulate a given gene, or in other cases a connection may be predicted with a certain probability. In general, the biology of the problem indicates that there will be very few connections between TFs and genes. Several methods have been proposed for estimating TRNs. However, they all suffer from problems such as unrealistic assumptions about prior knowledge of the network structure or computational limitations. We propose a new approach that directly uses prior information about the network structure, in conjunction with observed gene expression data, to estimate the TRN. Our approach places L1 penalties on the network to ensure a sparse structure. This has the advantage of being computationally efficient while making far fewer assumptions about the network structure. We apply our methodology to construct the TRN for E. coli and show that the estimate is biologically sensible and compares favorably with previous estimates.
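
To illustrate why an L1 penalty produces a sparse network, the following minimal sketch fits the connection strengths of a single gene by L1-penalized least squares. It is not the authors' algorithm: it assumes, purely for illustration, that the TF activities are known (in the problem described above they are not), and the names `P`, `e`, `lam`, `soft_threshold` and `lasso_ista`, as well as the simulated data, are hypothetical.

```python
import numpy as np

def soft_threshold(x, t):
    """Shrink each entry of x toward zero by t; entries within t of zero become exactly 0."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_ista(P, e, lam, n_iter=500):
    """Minimize 0.5 * ||e - P a||^2 + lam * ||a||_1 over a by proximal gradient (ISTA)."""
    step = 1.0 / np.linalg.norm(P, 2) ** 2   # 1 / Lipschitz constant of the gradient
    a = np.zeros(P.shape[1])
    for _ in range(n_iter):
        grad = P.T @ (P @ a - e)             # gradient of the squared-error term
        a = soft_threshold(a - step * grad, step * lam)
    return a

# Simulated data (assumption): 50 experiments, 10 TFs, one gene regulated by 2 TFs.
rng = np.random.default_rng(0)
n_exp, n_tf = 50, 10
P = rng.standard_normal((n_exp, n_tf))       # hypothetical TF activities
a_true = np.zeros(n_tf)
a_true[[1, 4]] = [2.0, -1.5]                 # the only true TF-gene connections
e = P @ a_true + 0.1 * rng.standard_normal(n_exp)   # observed expression of the gene

a_hat = lasso_ista(P, e, lam=10.0)
n_nonzero = int(np.count_nonzero(a_hat))
print("estimated connection strengths:", np.round(a_hat, 2))
print("number of nonzero connections:", n_nonzero)
```

The soft-thresholding step is what sets weak connections exactly to zero, so larger values of `lam` give sparser estimates. With unknown TF activities, a step of this kind would have to be combined with an estimate of the activities themselves.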

Article information

Source
Ann. Appl. Stat. Volume 4, Number 2 (2010), 663-686.

Dates
First available: 3 August 2010

Permanent link to this document
http://projecteuclid.org/euclid.aoas/1280842135

Digital Object Identifier
doi:10.1214/10-AOAS350

Mathematical Reviews number (MathSciNet)
MR2758644

Citation

James, Gareth M.; Sabatti, Chiara; Zhou, Nengfeng; Zhu, Ji. Sparse regulatory networks. The Annals of Applied Statistics 4 (2010), no. 2, 663–686. doi:10.1214/10-AOAS350. http://projecteuclid.org/euclid.aoas/1280842135.


