The Annals of Applied Statistics

On Bayesian “central clustering”: Application to landscape classification of Western Ghats

Sabyasachi Mukhopadhyay, Sourabh Bhattacharya, and Kajal Dihidar

Full-text: Open access

Abstract

Landscape classification of the well-known biodiversity hotspot, Western Ghats (mountains), on the west coast of India, is an important part of a world-wide program of monitoring biodiversity. To this end, a massive vegetation data set, consisting of 51,834 4-variate observations, has been clustered into different landscapes by Nagendra and Gadgil [Current Sci. 75 (1998) 264–271]. But a study of such importance may be affected by the nonuniqueness of cluster analysis and by the lack of methods for quantifying the uncertainty of the clusterings obtained.

Motivated by this applied problem of much scientific importance, we propose a new methodology for obtaining the global, as well as the local, modes of the posterior distribution of clustering, along with the desired credible and “highest posterior density” regions in a nonparametric Bayesian framework. To meet the need for an appropriate metric for computing the distance between any two clusterings, we adopt, and provide a much simpler but accurate modification of, the metric proposed in [In Felicitation Volume in Honour of Prof. B. K. Kale (2009) MacMillan]. A very fast and efficient Bayesian methodology, based on [Sankhyā Ser. B 70 (2008) 133–155], has been utilized to solve the computational problems associated with the massive data and to obtain samples from the posterior distribution of clustering, on which our proposed methods of summarization are illustrated.

Clustering of the Western Ghats data using our methods yielded landscape types different from those obtained previously, and provided interesting insights concerning the differences between the results obtained by Nagendra and Gadgil [Current Sci. 75 (1998) 264–271] and ours. Statistical implications of the differences are also discussed in detail, shedding light on methodological concerns with the traditional clustering methods.
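The summarization idea in the abstract (pick a "central" clustering from posterior samples and report a credible region around it under a metric on clusterings) can be illustrated with a minimal sketch. This is not the authors' implementation: the distance used here is the pairwise disagreement fraction (one minus the Rand index), swapped in as a stand-in for the metric the paper adopts, and the function names `pair_disagreement`, `central_clustering` and `credible_radius`, as well as the tolerance `eps`, are hypothetical choices made only for illustration.

```python
from itertools import combinations
import math


def pair_disagreement(a, b):
    """Stand-in distance between two clusterings given as label lists.

    Returns the fraction of item pairs that are grouped together in one
    clustering but separated in the other (1 - Rand index). It depends
    only on the partitions, so relabelling clusters does not change it.
    """
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("clusterings must label the same >= 2 items")
    pairs = list(combinations(range(n), 2))
    disagree = sum((a[i] == a[j]) != (b[i] == b[j]) for i, j in pairs)
    return disagree / len(pairs)


def central_clustering(samples, eps):
    """Index of the sampled clustering with the most samples within eps.

    Also returns the fraction of samples in that eps-neighbourhood, an
    empirical estimate of its posterior probability. Ties go to the
    earliest sample.
    """
    counts = [
        sum(pair_disagreement(s, t) < eps for t in samples) for s in samples
    ]
    best = max(range(len(samples)), key=counts.__getitem__)
    return best, counts[best] / len(samples)


def credible_radius(samples, centre, level=0.95):
    """Smallest radius around `centre` covering at least `level` of samples."""
    dists = sorted(pair_disagreement(centre, s) for s in samples)
    k = math.ceil(level * len(dists)) - 1
    return dists[k]


# Toy "posterior sample" of clusterings of four items (assumed data).
samples = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],  # same partition as above, labels switched
    [0, 0, 0, 1],
    [0, 1, 0, 1],
]
best, mass = central_clustering(samples, eps=0.1)
radius = credible_radius(samples, samples[best], level=0.95)
```

The all-pairs comparison costs quadratic time in both the number of items and the number of samples, so a sketch like this suits only small examples; a data set of the size described above would need a cheaper distance and a subsample of the posterior draws.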

Article information

Source
Ann. Appl. Stat., Volume 5, Number 3 (2011), 1948–1977.

Dates
First available in Project Euclid: 13 October 2011

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1318514291

Digital Object Identifier
doi:10.1214/11-AOAS454

Mathematical Reviews number (MathSciNet)
MR2884928

Zentralblatt MATH identifier
1366.62122

Keywords
Bayesian analysis; cluster analysis; Dirichlet process; Gibbs sampling; massive data; mixture analysis

Citation

Mukhopadhyay, Sabyasachi; Bhattacharya, Sourabh; Dihidar, Kajal. On Bayesian “central clustering”: Application to landscape classification of Western Ghats. Ann. Appl. Stat. 5 (2011), no. 3, 1948–1977. doi:10.1214/11-AOAS454. https://projecteuclid.org/euclid.aoas/1318514291


References

  • Antoniak, C. E. (1974). Mixtures of Dirichlet processes with applications to nonparametric problems. Ann. Statist. 2 1152–1174.
  • Berger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis, 2nd ed. Springer, New York.
  • Bhattacharya, S. (2008). Gibbs sampling based Bayesian analysis of mixtures with unknown number of components. Sankhyā Ser. B 70 133–155.
  • Dahl, D. B. (2009). Modal clustering in a class of product partition models. Bayesian Anal. 4 243–264.
  • Escobar, M. D. and West, M. (1995). Bayesian density estimation and inference using mixtures. J. Amer. Statist. Assoc. 90 577–588.
  • Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. Ann. Statist. 1 209–230.
  • Fraley, C. and Raftery, A. E. (1999). MCLUST: Software for model-based cluster analysis. J. Classification 16 297–306.
  • Fraley, C. and Raftery, A. E. (2002). Model-based clustering, discriminant analysis, and density estimation. J. Amer. Statist. Assoc. 97 611–631.
  • Ghosh, J. K., Dihidar, K. and Samanta, T. (2009). On different clusterings of the same data set. In Felicitation Volume in Honour of Prof. B. K. Kale (B. Arnold, U. Gather and S. M. Bendre, eds.). MacMillan, New Delhi.
  • Hartigan, J. A. (1975). Clustering Algorithms. Wiley, New York.
  • MacEachern, S. N. (1994). Estimating normal means with a conjugate-style Dirichlet process prior. Comm. Statist. Simulation Comput. 23 727–741.
  • Mukhopadhyay, S., Bhattacharya, S. and Dihidar, K. (2011). Supplement to “On Bayesian central clustering: Application to landscape classification of Western Ghats”. DOI:10.1214/11-AOAS454SUPP.
  • Nagendra, H. and Gadgil, M. (1998). Linking regional and landscape scales for assessing biodiversity: A case study from Western Ghats. Current Sci. 75 264–271.
  • Nagendra, H. and Gadgil, M. (1999). Biodiversity assessment at multiple scales: Linking remotely sensed data with field information. Proc. Natl. Acad. Sci. USA 96 9154–9158.
  • Richardson, S. and Green, P. J. (1997). On Bayesian analysis of mixtures with an unknown number of components (with discussion). J. Roy. Statist. Soc. Ser. B 59 731–792.

Supplemental materials

  • Supplementary material: Supplement to “On Bayesian ‘central clustering’: Application to landscape classification of Western Ghats”. Sections S-1 and S-2 contain, respectively, the full conditional distributions of the random variables under the nonmarginalized and marginalized versions of SB’s model, that is, the model of Bhattacharya (2008). Section S-3 shows that the K-means clustering algorithm is a special case of the clustering method based on SB’s model. Properties of the approximate distance measure d̂ are explored in Section S-4. Section S-5 reports our investigation of whether the spatial structure of the superpixels should be incorporated in our model. A detailed analysis of the sensitivity of the results to changes in the values of the hyperparameters of our model is provided in Section S-6. Section S-7 gives a thorough explanation of the computational superiority of SB’s model over an efficient implementation of EW’s model, that is, the model of Escobar and West (1995). Finally, Section S-8 proposes a new method for MCMC convergence diagnostics in clustering models, which we apply to study the convergence of our Markov chain.