Statistics Surveys

A survey of cross-validation procedures for model selection

Sylvain Arlot and Alain Celisse

Full-text: Open access

Abstract

Used to estimate the risk of an estimator or to perform model selection, cross-validation is a widespread strategy because of its simplicity and its (apparent) universality. Many results exist on the model selection performance of cross-validation procedures. This survey intends to relate these results to the most recent advances in model selection theory, with a particular emphasis on distinguishing empirical statements from rigorous theoretical results. In conclusion, guidelines are provided for choosing the best cross-validation procedure according to the particular features of the problem at hand.
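To make the procedure concrete, here is a minimal, self-contained sketch of V-fold cross-validation used for model selection: choosing the degree of a polynomial least-squares regressor by minimizing an estimated prediction risk. The simulated data, the choice V = 5, and all names are illustrative assumptions, not code from the paper.

    # Illustrative sketch of V-fold cross-validation for model selection.
    # Everything here (data, V = 5, candidate models) is an assumption
    # chosen for the example, not the authors' code.
    import numpy as np

    rng = np.random.default_rng(0)

    # Simulated regression data: y = sin(2*pi*x) + Gaussian noise.
    n = 200
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)

    def cv_risk(x, y, degree, V=5):
        """Estimate the prediction risk (squared-error loss) of a polynomial
        least-squares fit of the given degree by V-fold cross-validation."""
        idx = rng.permutation(len(x))
        folds = np.array_split(idx, V)       # V roughly equal-sized test folds
        fold_errors = []
        for test in folds:
            train = np.setdiff1d(idx, test)  # train on the other V-1 folds
            coefs = np.polyfit(x[train], y[train], deg=degree)
            pred = np.polyval(coefs, x[test])
            fold_errors.append(np.mean((y[test] - pred) ** 2))
        return float(np.mean(fold_errors))

    # Model selection: keep the candidate minimizing the CV risk estimate.
    degrees = range(1, 11)
    risks = {d: cv_risk(x, y, d) for d in degrees}
    best = min(risks, key=risks.get)
    print(f"selected degree: {best} (estimated risk {risks[best]:.4f})")

Leave-one-out corresponds to the special case V = n; the guidelines the abstract announces concern, among other things, how the choice of V trades off the bias and variance of the risk estimate against computational cost.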

Article information

Source
Statist. Surv. Volume 4 (2010), 40–79.

Dates
First available in Project Euclid: 9 March 2010

Permanent link to this document
http://projecteuclid.org/euclid.ssu/1268143839

Digital Object Identifier
doi:10.1214/09-SS054

Mathematical Reviews number (MathSciNet)
MR2602303

Zentralblatt MATH identifier
1190.62080

Subjects
Primary: 62G08: Nonparametric regression
Secondary: 62G05: Estimation; 62G09: Resampling methods

Keywords
Model selection; cross-validation; leave-one-out

Citation

Arlot, Sylvain; Celisse, Alain. A survey of cross-validation procedures for model selection. Statist. Surv. 4 (2010), 40–79. doi:10.1214/09-SS054. http://projecteuclid.org/euclid.ssu/1268143839.

