## The Annals of Applied Statistics

### Functional clustering in nested designs: Modeling variability in reproductive epidemiology studies

#### Abstract

We discuss functional clustering procedures for nested designs, where multiple curves are collected for each subject in the study. We start by considering the application of standard functional clustering tools to this problem, which leads to groupings based on the average profile for each subject. After discussing some of the shortcomings of this approach, we present a mixture model based on a generalization of the nested Dirichlet process that clusters subjects based on the distribution of their curves. By using mixtures of generalized Dirichlet processes, the model induces a much more flexible prior on the partition structure than other popular model-based clustering methods, allowing for different rates of introduction of new clusters as the number of observations increases. The methods are illustrated using hormone profiles from multiple menstrual cycles collected for women in the Early Pregnancy Study.
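The abstract's claim about different rates of introduction of new clusters can be illustrated with a small simulation. This is only a sketch of a generic truncated stick-breaking prior with Beta(a, b) stick proportions, where a = 1 gives the ordinary Dirichlet process. It is not the authors' nested model or their MCMC algorithm. The function names, the truncation level, and the parameter values are assumptions made for illustration.

```python
import numpy as np

def stick_breaking_weights(a, b, truncation, rng):
    """Truncated stick-breaking weights with Beta(a, b) stick proportions.

    Assumption: a = 1, b = alpha corresponds to a Dirichlet process with
    concentration alpha; other (a, b) give a more general prior.
    """
    v = rng.beta(a, b, size=truncation)
    v[-1] = 1.0  # close the stick so the weights sum to one
    # Length of stick remaining before each break: 1, (1-v1), (1-v1)(1-v2), ...
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

def count_clusters(a, b, n_obs, truncation=500, seed=0):
    """Number of distinct clusters occupied by n_obs draws from the weights."""
    rng = np.random.default_rng(seed)
    w = stick_breaking_weights(a, b, truncation, rng)
    labels = rng.choice(truncation, size=n_obs, p=w)
    return np.unique(labels).size

if __name__ == "__main__":
    # Compare how the occupied-cluster count grows with sample size
    # under two hypothetical choices of (a, b).
    for a, b in [(1.0, 2.0), (0.5, 2.0)]:
        counts = [count_clusters(a, b, n) for n in (10, 100, 1000)]
        print(f"a={a}, b={b}: clusters at n=10, 100, 1000 -> {counts}")
```

Varying `a` and `b` changes how quickly the weights decay, and therefore how fast new clusters appear as `n_obs` grows. That is the kind of flexibility in the partition prior the abstract refers to.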

#### Article information

Source
Ann. Appl. Stat., Volume 8, Number 3 (2014), 1416–1442.

Dates
First available in Project Euclid: 23 October 2014

https://projecteuclid.org/euclid.aoas/1414091219

Digital Object Identifier
doi:10.1214/14-AOAS751

Mathematical Reviews number (MathSciNet)
MR3271338

Zentralblatt MATH identifier
1303.62040

#### Citation

Rodriguez, Abel; Dunson, David B. Functional clustering in nested designs: Modeling variability in reproductive epidemiology studies. Ann. Appl. Stat. 8 (2014), no. 3, 1416–1442. doi:10.1214/14-AOAS751. https://projecteuclid.org/euclid.aoas/1414091219

#### Supplemental materials

• Supplementary material: Supplement to “Functional clustering in nested designs: Modeling variability in reproductive epidemiology studies”. The supplementary materials contain the details of the Markov chain Monte Carlo algorithm used to fit the models introduced in the paper.