Electronic Journal of Statistics

Large-scale mode identification and data-driven sciences

Subhadeep Mukhopadhyay


Abstract

Bump-hunting, or mode identification, is a fundamental problem that arises in almost every scientific field of data-driven discovery. Surprisingly, very few data modeling tools are available for mode discovery that is automatic (not requiring manual case-by-case investigation), objective (not subjective), and nonparametric (not based on restrictive parametric model assumptions), and that can scale to large data sets. This article introduces LPMode, an algorithm based on a new theory for detecting multimodality of a probability density. We apply LPMode to answer important research questions arising in fields ranging from environmental science, ecology, econometrics, and analytical chemistry to astronomy and cancer genomics.

Article information

Source
Electron. J. Statist., Volume 11, Number 1 (2017), 215-240.

Dates
Received: August 2016
First available in Project Euclid: 3 February 2017

Permanent link to this document
https://projecteuclid.org/euclid.ejs/1486090845

Digital Object Identifier
doi:10.1214/17-EJS1229

Mathematical Reviews number (MathSciNet)
MR3605036

Zentralblatt MATH identifier
1356.62052

Subjects
Primary: 62G07: Density estimation
62G30: Order statistics; empirical distribution functions
62G86: Nonparametric inference and fuzziness

Keywords
Skew-G modeling; connector density; large-scale mode exploration; bump(s) above background; orthogonal rank polynomials; nonparametric exploratory modeling; multidisciplinary sciences

Rights
Creative Commons Attribution 4.0 International License.

Citation

Mukhopadhyay, Subhadeep. Large-scale mode identification and data-driven sciences. Electron. J. Statist. 11 (2017), no. 1, 215–240. doi:10.1214/17-EJS1229. https://projecteuclid.org/euclid.ejs/1486090845



