## Annals of Statistics

### Identifiability of nonparametric mixture models and Bayes optimal clustering

#### Abstract

Motivated by problems in data clustering, we establish general conditions under which families of nonparametric mixture models are identifiable by introducing a novel framework involving clustering overfitted parametric (i.e., misspecified) mixture models. These identifiability conditions generalize existing conditions in the literature and are flexible enough to include, for example, mixtures of infinite Gaussian mixtures. In contrast to the recent literature, we allow for general nonparametric mixture components and instead impose regularity assumptions on the underlying mixing measure. As our primary application, we apply these results to partition-based clustering, generalizing the notion of a Bayes optimal partition from classical parametric model-based clustering to nonparametric settings. Furthermore, this framework is constructive: it yields a practical algorithm for learning identified mixtures, which we illustrate through several examples on real data. The key conceptual device in the analysis is the convex, metric geometry of probability measures on metric spaces and its connection to the Wasserstein convergence of mixing measures. The result is a flexible framework for nonparametric clustering with formal consistency guarantees.
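To make the idea of clustering an overfitted mixture concrete, here is a minimal sketch in Python. It is not the authors' algorithm. It assumes an overfitted one-dimensional Gaussian mixture has already been fitted (the parameters in `overfitted` are hypothetical), groups the components by single linkage on their means as a simple stand-in for whatever distance between components one would actually use, and then assigns each point to the group with the largest weighted density. That last step is the classical Bayes optimal rule, applied here to merged groups of components. All function names are illustrative.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


# Hypothetical overfitted mixture with L = 5 Gaussian components,
# each given as (weight, mean, standard deviation). Illustrative values only.
overfitted = [
    (0.20, -3.2, 0.5),
    (0.15, -2.6, 0.6),
    (0.15, -2.0, 0.5),
    (0.25, 2.4, 0.7),
    (0.25, 3.1, 0.6),
]


def single_linkage_groups(means, k):
    """Group 1-D points into k clusters by single linkage.

    In one dimension this amounts to sorting the points and cutting
    at the k - 1 largest gaps between neighbours.
    """
    order = sorted(range(len(means)), key=lambda i: means[i])
    gaps = [
        (means[order[j + 1]] - means[order[j]], j)
        for j in range(len(order) - 1)
    ]
    cuts = {j for _, j in sorted(gaps, reverse=True)[: k - 1]}
    labels = [0] * len(means)
    group = 0
    for pos, i in enumerate(order):
        labels[i] = group
        if pos in cuts:
            group += 1
    return labels


def bayes_partition(x, mixture, labels, k):
    """Assign x to the group whose merged, weighted density is largest."""
    scores = [0.0] * k
    for (weight, mean, sd), g in zip(mixture, labels):
        scores[g] += weight * normal_pdf(x, mean, sd)
    return max(range(k), key=lambda g: scores[g])


K = 2  # assumed number of clusters
labels = single_linkage_groups([mean for _, mean, _ in overfitted], K)

for point in (-2.5, 0.1, 2.8):
    print(point, "-> cluster", bayes_partition(point, overfitted, labels, K))
```

Each merged group plays the role of one mixture component that is not itself Gaussian, which is why an overfitted parametric fit can stand in for a nonparametric one in this sketch.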

#### Article information

Source
Ann. Statist., Volume 48, Number 4 (2020), 2277-2302.

Dates
Revised: July 2019
First available in Project Euclid: 14 August 2020

Permanent link to this document
https://projecteuclid.org/euclid.aos/1597370673

Digital Object Identifier
doi:10.1214/19-AOS1887

Mathematical Reviews number (MathSciNet)
MR4134795

#### Citation

Aragam, Bryon; Dan, Chen; Xing, Eric P.; Ravikumar, Pradeep. Identifiability of nonparametric mixture models and Bayes optimal clustering. Ann. Statist. 48 (2020), no. 4, 2277–2302. doi:10.1214/19-AOS1887. https://projecteuclid.org/euclid.aos/1597370673

#### Supplemental materials

• Supplement to “Identifiability of nonparametric mixture models and Bayes optimal clustering”. This supplement contains proofs of all the main results along with various technical results and experiment details.