The Annals of Applied Statistics

Efficient, adaptive cross-validation for tuning and comparing models, with application to drug discovery

Hui Shen, William J. Welch, and Jacqueline M. Hughes-Oliver

Full-text: Open access

Abstract

Cross-validation (CV) is widely used for tuning a model with respect to user-selected parameters and for selecting a “best” model. For example, the method of k-nearest neighbors requires the user to choose k, the number of neighbors, and a neural network has several tuning parameters controlling the network complexity. Once such parameters are optimized for a particular data set, the next step is often to compare the various optimized models and choose the method with the best predictive performance. Both tuning and model selection boil down to comparing models, either across different values of the tuning parameters or across different classes of statistical models and/or sets of explanatory variables. For multiple large sets of data, like the PubChem drug discovery cheminformatics data that motivated this work, reliable CV comparisons are computationally demanding, or even infeasible. In this paper we develop an efficient sequential methodology for model comparison based on CV that also takes into account the randomness in CV. The number of models is reduced via an adaptive, multiplicity-adjusted sequential algorithm in which poor performers are quickly eliminated. By exploiting matching of individual observations, it is sometimes even possible to establish the statistically significant inferiority of some models with just one execution of CV.
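The tuning-by-elimination idea described in the abstract can be illustrated with a minimal sketch: run one k-fold CV for each candidate tuning value, then drop candidates whose per-fold errors are clearly worse than the current best, exploiting the matching of folds across candidates. Everything here is illustrative: the synthetic data, the candidate values of k, and the two-standard-error elimination rule are simplifications, not the paper's multiplicity-adjusted sequential procedure.

```python
# Sketch of racing-style sequential CV for tuning k in k-nearest neighbors.
# Data, candidates, and the elimination rule are illustrative stand-ins,
# not the multiplicity-adjusted procedure developed in the paper.
import random
import statistics

random.seed(0)

# Synthetic 1-D classification data: label is 1 when x (plus noise) > 0.
X = [random.gauss(0, 1) for _ in range(200)]
y = [1 if x + random.gauss(0, 0.5) > 0 else 0 for x in X]

def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k nearest training points (ties -> 1)."""
    nearest = sorted(range(len(train_X)), key=lambda i: abs(train_X[i] - x))[:k]
    return 1 if sum(train_y[i] for i in nearest) * 2 >= k else 0

def cv_errors(k, n_folds=10):
    """Per-fold misclassification rates from one k-fold CV split."""
    idx = list(range(len(X)))
    random.shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    errs = []
    for fold in folds:
        train = [i for i in idx if i not in fold]
        tX = [X[i] for i in train]
        ty = [y[i] for i in train]
        wrong = sum(knn_predict(tX, ty, X[i], k) != y[i] for i in fold)
        errs.append(wrong / len(fold))
    return errs

candidates = [1, 5, 15, 51]
results = {k: cv_errors(k) for k in candidates}

# Eliminate any candidate whose mean fold error exceeds the best candidate's
# by more than two standard errors of the paired (matched-fold) differences;
# the pairing by fold mirrors the randomized-block matching in the abstract.
best = min(candidates, key=lambda k: statistics.mean(results[k]))
survivors = []
for k in candidates:
    diffs = [a - b for a, b in zip(results[k], results[best])]
    margin = (2 * statistics.stdev(diffs) / len(diffs) ** 0.5
              if k != best else 0.0)
    if statistics.mean(diffs) <= margin:
        survivors.append(k)

print("best k:", best, "survivors:", survivors)
```

Because each candidate is scored on the same folds, comparisons use paired differences, which typically have much smaller variance than the raw error rates; this is what can make a single execution of CV enough to eliminate a clearly inferior model.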

Article information

Source
Ann. Appl. Stat., Volume 5, Number 4 (2011), 2668-2687.

Dates
First available in Project Euclid: 20 December 2011

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1324399611

Digital Object Identifier
doi:10.1214/11-AOAS491

Mathematical Reviews number (MathSciNet)
MR2907131

Zentralblatt MATH identifier
1234.62115

Keywords
Assay data; cheminformatics; drug discovery; k-nearest neighbors; multiplicity adjustment; neural network; PubChem; randomized-block design; sequential analysis

Citation

Shen, Hui; Welch, William J.; Hughes-Oliver, Jacqueline M. Efficient, adaptive cross-validation for tuning and comparing models, with application to drug discovery. Ann. Appl. Stat. 5 (2011), no. 4, 2668--2687. doi:10.1214/11-AOAS491. https://projecteuclid.org/euclid.aoas/1324399611


References

  • Burden, F. R. (1989). Molecular identification number for substructure searches. J. Chem. Inf. Comput. Sci. 29 225–227.
  • Burman, P. (1989). A comparative study of ordinary cross-validation, v-fold cross-validation and the repeated learning-testing methods. Biometrika 76 503–514.
  • Dean, A. and Voss, D. (1999). Design and Analysis of Experiments. Springer, New York.
  • Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput. 10 1895–1923.
  • Dudoit, S. and van der Laan, M. J. (2005). Asymptotics of cross-validated risk estimation in estimator selection and performance assessment. Stat. Methodol. 2 131–154.
  • Hawkins, D. M., Basak, S. C. and Mills, D. (2003). Assessing model fit by cross-validation. J. Chem. Inf. Comput. Sci. 43 579–586.
  • Hughes-Oliver, J. M., Brooks, A. D., Welch, W. J., Khaledi, M. G., Hawkins, D., Young, S. S., Patil, K., Howell, G. W., Ng, R. T. and Chu, M. T. (2011). ChemModLab: A web-based cheminformatics modeling laboratory. Cheminformatics. To appear.
  • Kempthorne, O. (1952). The Design and Analysis of Experiments. Wiley, New York.
  • Kempthorne, O. (1955). The randomization theory of experimental inference. J. Amer. Statist. Assoc. 50 946–967.
  • Li, K.-C. (1987). Asymptotic optimality for Cp, CL, cross-validation and generalized cross-validation: Discrete index set. Ann. Statist. 15 958–975.
  • Maron, O. and Moore, A. W. (1997). The racing algorithm: Model selection for lazy learners. Artificial Intelligence Review 11 193–225.
  • Montgomery, D. C. (1997). Design and Analysis of Experiments, 4th ed. Wiley, New York.
  • Ripley, B. D. (1996). Pattern Recognition and Neural Networks. Cambridge Univ. Press, Cambridge.
  • Shao, J. (1993). Linear model selection by cross-validation. J. Amer. Statist. Assoc. 88 486–494.
  • Sinisi, S. E. and van der Laan, M. J. (2004). Deletion/substitution/addition algorithm in learning with applications in genomics. Stat. Appl. Genet. Mol. Biol. 3 Art. 18, 40 pp. (electronic).
  • Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. J. Roy. Statist. Soc. Ser. B 36 111–147.
  • Stone, M. (1977). Asymptotics for and against cross-validation. Biometrika 64 29–35.
  • Wang, Y. (2005). Statistical methods for high throughput screening drug discovery data. Ph.D. thesis, Dept. Statistics and Actuarial Science, Univ. Waterloo, Ontario, Canada.
  • Yang, Y. (2006). Comparing learning methods for classification. Statist. Sinica 16 635–657.
  • Zhang, P. (1993). Model selection via multifold cross validation. Ann. Statist. 21 299–313.