The Annals of Applied Statistics

Bayesian clustering of replicated time-course gene expression data with weak signals

Audrey Qiuyan Fu, Steven Russell, Sarah J. Bray, and Simon Tavaré


Abstract

To identify novel dynamic patterns of gene expression, we develop a statistical method to cluster noisy measurements of gene expression collected from multiple replicates at multiple time points, with an unknown number of clusters. We propose a random-effects mixture model coupled with a Dirichlet-process prior for clustering. The mixture model formulation allows for probabilistic cluster assignments. The random-effects formulation allows for attributing the total variability in the data to the sources that are consistent with the experimental design, particularly when the noise level is high and the temporal dependence is not strong. The Dirichlet-process prior induces a prior distribution on partitions and helps to estimate the number of clusters (or mixture components) from the data. We further tackle two challenges associated with Dirichlet-process prior-based methods. One is efficient sampling. We develop a novel Metropolis–Hastings Markov Chain Monte Carlo (MCMC) procedure to sample the partitions. The other is efficient use of the MCMC samples in forming clusters. We propose a two-step procedure for posterior inference, which involves resampling and relabeling, to estimate the posterior allocation probability matrix. This matrix can be directly used in cluster assignments, while describing the uncertainty in clustering. We demonstrate the effectiveness of our model and sampling procedure through simulated data. Applying our method to a real data set collected from Drosophila adult muscle cells after five-minute Notch activation, we identify 14 clusters of different transcriptional responses among 163 differentially expressed genes, which provides novel insights into underlying transcriptional mechanisms in the Notch signaling pathway. The algorithm developed here is implemented in the R package DIRECT, available on CRAN.
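The abstract notes that the Dirichlet-process prior induces a prior distribution on partitions, so the number of clusters need not be fixed in advance. One standard way to see this is the Chinese restaurant process listed in the keywords: each item joins an existing cluster with probability proportional to that cluster's size, or opens a new cluster with probability proportional to a concentration parameter. The sketch below is only an illustration of that prior, not the sampler implemented in DIRECT; the function name `sample_crp_partition`, the concentration value `alpha = 1.0`, and the number of draws are assumptions chosen for the example, and the figure of 163 items simply mirrors the number of genes mentioned in the abstract.

```python
import random
from collections import Counter


def sample_crp_partition(n_items, alpha, rng):
    """Draw one partition of n_items from a Chinese restaurant process.

    Hypothetical helper for illustration; alpha > 0 is the concentration
    parameter. Returns a list in which entry i is the cluster label of item i.
    """
    labels = []  # labels[i] is the cluster index assigned to item i
    sizes = []   # sizes[k] is the current number of items in cluster k
    for _ in range(n_items):
        # Existing cluster k has weight sizes[k]; a new cluster has weight alpha.
        weights = sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)   # open a new cluster
        else:
            sizes[k] += 1     # join an existing cluster
        labels.append(k)
    return labels


# Draw several partitions of 163 items and tabulate how many clusters each has.
rng = random.Random(0)
draws = [sample_crp_partition(163, 1.0, rng) for _ in range(200)]
cluster_counts = Counter(max(labels) + 1 for labels in draws)
```

Here `cluster_counts` maps each observed number of clusters to how often it occurred across the draws, which gives a rough picture of the prior on the number of clusters under the assumed `alpha`. Larger values of `alpha` would tend to favor more clusters.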

Article information

Source
Ann. Appl. Stat., Volume 7, Number 3 (2013), 1334-1361.

Dates
First available in Project Euclid: 3 October 2013

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1380804798

Digital Object Identifier
doi:10.1214/13-AOAS650

Mathematical Reviews number (MathSciNet)
MR3127950

Zentralblatt MATH identifier
1283.62050

Keywords
Bayesian clustering; mixture model; random effects; Dirichlet process; Chinese restaurant process; Markov-chain Monte Carlo (MCMC); label switching; multivariate analysis; time series; microarray gene expression

Citation

Fu, Audrey Qiuyan; Russell, Steven; Bray, Sarah J.; Tavaré, Simon. Bayesian clustering of replicated time-course gene expression data with weak signals. Ann. Appl. Stat. 7 (2013), no. 3, 1334–1361. doi:10.1214/13-AOAS650. https://projecteuclid.org/euclid.aoas/1380804798



