Open Access
December 2013
An algorithm for deciding the number of clusters and validation using simulated data with application to exploring crop population structure
Mark A. Newell, Dianne Cook, Heike Hofmann, Jean-Luc Jannink
Ann. Appl. Stat. 7(4): 1898-1916 (December 2013). DOI: 10.1214/13-AOAS671

Abstract

A first step in exploring population structure in crop plants and other organisms is to define the number of subpopulations that exist for a given data set. Genetic marker data sets have become increasingly large over time and commonly fall into the high-dimension, low sample size (HDLSS) setting. An algorithm for deciding the number of clusters is proposed and validated on simulated data sets that vary in both the level of structure and the number of clusters, covering the range of variation observed empirically. The algorithm was then tested on six empirical data sets across three small grain species. The algorithm uses bootstrapping and three methods of clustering, and defines the optimum number of clusters based on a common criterion, Hubert’s gamma statistic. Validation on simulated sets coupled with testing on empirical sets suggests that the algorithm can be used for a wide variety of genetic data sets.
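
The combination of bootstrapping, several clustering methods, and Hubert's gamma as a shared criterion can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the three clustering methods or the resampling scheme, so the sketch assumes three hierarchical linkages from SciPy (average, complete, Ward) as stand-ins, resamples individuals with replacement, and simply averages the statistic over replicates and methods. The names `hubert_gamma` and `choose_k` and all default parameters are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def hubert_gamma(dists, labels):
    """Normalized Hubert's gamma: Pearson correlation between the pairwise
    distances and a 0/1 indicator that a pair lies in different clusters."""
    # triu_indices enumerates pairs (i < j) in the same order as pdist.
    i, j = np.triu_indices(len(labels), k=1)
    different = (labels[i] != labels[j]).astype(float)
    if different.std() == 0 or dists.std() == 0:
        return float("nan")  # correlation is undefined for a constant vector
    return float(np.corrcoef(dists, different)[0, 1])


def choose_k(X, k_range=range(2, 7), n_boot=20,
             methods=("average", "complete", "ward"), seed=0):
    """Pick the number of clusters with the highest mean Hubert's gamma.

    X is an (individuals x markers) array. The linkage methods and the
    averaging rule are assumptions made for illustration only.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    scores = {k: [] for k in k_range}
    for _ in range(n_boot):
        # Bootstrap replicate: resample individuals (rows) with replacement.
        sample = X[rng.integers(0, n, size=n)]
        dists = pdist(sample)  # Euclidean distances, condensed form
        for method in methods:
            tree = linkage(dists, method=method)
            for k in k_range:
                # Cut the tree into at most k clusters.
                labels = fcluster(tree, t=k, criterion="maxclust")
                scores[k].append(hubert_gamma(dists, labels))
    mean_gamma = {k: float(np.nanmean(v)) for k, v in scores.items()}
    best_k = max(mean_gamma, key=mean_gamma.get)
    return best_k, mean_gamma
```

Calling `choose_k(X)` on a marker matrix returns the selected number of clusters together with the mean statistic for each candidate value, which makes it easy to inspect how sharply the criterion peaks.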

Citation


Mark A. Newell, Dianne Cook, Heike Hofmann, Jean-Luc Jannink. "An algorithm for deciding the number of clusters and validation using simulated data with application to exploring crop population structure." Ann. Appl. Stat. 7(4): 1898-1916, December 2013. https://doi.org/10.1214/13-AOAS671

Information

Published: December 2013
First available in Project Euclid: 23 December 2013

zbMATH: 1283.62231
MathSciNet: MR3161706
Digital Object Identifier: 10.1214/13-AOAS671

Keywords: bootstrap, cluster analysis, dimension reduction, genetic marker data, high dimensional, low sample size, simulation, visualization

Rights: Copyright © 2013 Institute of Mathematical Statistics
