Electronic Journal of Statistics

Model-based clustering with envelopes

Wenjing Wang, Xin Zhang, and Qing Mai

Full text: Open access

Abstract

Cluster analysis is an important unsupervised learning technique in multivariate statistics and machine learning. In this paper, we propose a set of new mixture models, called CLEMM (short for Clustering with Envelope Mixture Models), that build on the widely used Gaussian mixture model assumptions and the nascent research area of envelope methodology. Formulated mostly for regression models, envelope methodology aims at simultaneous dimension reduction and efficient parameter estimation, and includes a very recent formulation of the envelope discriminant subspace for classification and discriminant analysis. Motivated by the envelope discriminant subspace pursuit in classification, we consider parsimonious probabilistic mixture models in which cluster analysis can be improved by projecting the data onto a latent lower-dimensional subspace. The proposed CLEMM framework and the associated envelope-EM algorithms thus provide foundations for envelope methods in unsupervised and semi-supervised learning problems. Numerical studies on simulated data and two benchmark data sets show significant improvements of our proposed methods over classical methods such as Gaussian mixture models, K-means, and hierarchical clustering algorithms. An R package is available at https://github.com/kusakehan/CLEMM.
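To make the central idea concrete, below is a minimal sketch in R. It is not the authors' envelope-EM algorithm and does not use the CLEMM package API; it only illustrates the general point that model-based clustering can improve when the data are projected onto a lower-dimensional subspace before fitting a Gaussian mixture. PCA is used as a simple stand-in for the envelope subspace estimate, and the simulated data, dimensions, and choice of prcomp are illustrative assumptions.

```r
## Sketch: Gaussian mixture clustering on full data vs. on a projected
## subspace. CLEMM would instead estimate an envelope subspace jointly
## with the mixture parameters; PCA here is only a convenient stand-in.
library(mclust)  # Mclust(): Gaussian mixture model-based clustering via EM

set.seed(1)
n <- 200; p <- 10
# Two clusters that differ only in the first coordinate; the remaining
# nine coordinates are pure noise, immaterial to the clustering.
labels <- rep(1:2, each = n / 2)
X <- matrix(rnorm(n * p), n, p)
X[labels == 2, 1] <- X[labels == 2, 1] + 3

# Fit a two-component Gaussian mixture on the full 10-dimensional data.
fit_full <- Mclust(X, G = 2)

# Project onto a one-dimensional subspace (leading principal component)
# and refit; an envelope estimate of the subspace would replace prcomp.
Z <- prcomp(X)$x[, 1, drop = FALSE]
fit_reduced <- Mclust(Z, G = 2)

# Compare clustering accuracy against the true labels.
adjustedRandIndex(fit_full$classification, labels)
adjustedRandIndex(fit_reduced$classification, labels)
```

On data like these, where a single direction separates the clusters, the reduced fit typically recovers the labels at least as well as the full fit while estimating far fewer parameters, which is the intuition behind pursuing a latent lower-dimensional subspace within the mixture model.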

Article information

Source
Electron. J. Statist., Volume 14, Number 1 (2020), 82–109.

Dates
Received: December 2018
First available in Project Euclid: 3 January 2020

Permanent link to this document
https://projecteuclid.org/euclid.ejs/1578042014

Digital Object Identifier
doi:10.1214/19-EJS1652

Keywords
Clustering; computational statistics; dimension reduction; envelope methods; Gaussian mixture models

Rights
Creative Commons Attribution 4.0 International License.

Citation

Wang, Wenjing; Zhang, Xin; Mai, Qing. Model-based clustering with envelopes. Electron. J. Statist. 14 (2020), no. 1, 82–109. doi:10.1214/19-EJS1652. https://projecteuclid.org/euclid.ejs/1578042014

