Statistics Surveys

Semi-parametric estimation for conditional independence multivariate finite mixture models

Didier Chauveau, David R. Hunter, and Michael Levine

Full-text: Open access

Abstract

The conditional independence assumption for nonparametric multivariate finite mixture models, a weaker form of the well-known conditional independence assumption for random effects models for longitudinal data, is the subject of an increasing number of theoretical and algorithmic developments in the statistical literature. After presenting a survey of this literature, including an in-depth discussion of the all-important identifiability results, this article describes and extends an algorithm for estimation of the parameters in these models. The algorithm works for any number of components in three or more dimensions. It possesses a descent property and can be easily adapted to situations where the data are grouped in blocks of conditionally independent variables. We discuss how to adapt this algorithm to various location-scale models that link component densities, and we even adapt it to a particular class of univariate mixture problems in which the components are assumed symmetric. We give a bandwidth selection procedure for our algorithm. Finally, we demonstrate the effectiveness of our algorithm using a simulation study and two psychometric datasets.

Article information

Source
Statist. Surv. Volume 9 (2015), 1-31.

Dates
First available in Project Euclid: 6 February 2015

Permanent link to this document
https://projecteuclid.org/euclid.ssu/1423229941

Digital Object Identifier
doi:10.1214/15-SS108

Mathematical Reviews number (MathSciNet)
MR3310969

Zentralblatt MATH identifier
1307.62090

Subjects
Primary: 62G05: Estimation
Secondary: 62G07: Density estimation

Keywords
Kernel density estimation; MM algorithms

Citation

Chauveau, Didier; Hunter, David R.; Levine, Michael. Semi-parametric estimation for conditional independence multivariate finite mixture models. Statist. Surv. 9 (2015), 1–31. doi:10.1214/15-SS108. https://projecteuclid.org/euclid.ssu/1423229941.


