The Annals of Statistics

Smoothing spline ANOVA models for large data sets with Bernoulli observations and the randomized GACV

Fangyu Gao, Ronald Klein, Barbara Klein, Xiwu Lin, Grace Wahba, and Dong Xiang

Full-text: Open access


We propose the randomized Generalized Approximate Cross Validation (ranGACV) method for choosing multiple smoothing parameters in penalized likelihood estimates for Bernoulli data. The method is intended for application with penalized likelihood smoothing spline ANOVA models. In addition, we propose a class of approximate numerical methods for solving the penalized likelihood variational problem which, in conjunction with the ranGACV method, allows the application of smoothing spline ANOVA models with Bernoulli data to much larger data sets than previously possible. These methods are based on choosing an approximating subset of the natural (representer) basis functions for the variational problem. Simulation studies with synthetic data, including synthetic data mimicking demographic risk factor data sets, are used to examine the properties of the method and to compare the approach with the GRKPACK code of Wang (1997c). Bayesian “confidence intervals” are obtained for the fits and are shown in the simulation studies to have the “across the function” property usually claimed for these confidence intervals. Finally, the method is applied to an observational data set from the Beaver Dam Eye Study, with scientifically interesting results.
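The idea of solving the penalized likelihood problem over an approximating subset of the representer basis functions can be illustrated with a small sketch. Everything specific below is an assumption made for illustration and is not taken from the paper: the Gaussian kernel (a stand-in for a spline kernel), the random choice of the subset, the hand-fixed smoothing parameter `lam` (the paper selects smoothing parameters by ranGACV), the use of plain gradient descent in place of a Newton-type iteration, and the helper names `kernel`, `objective`, `fit_subset_basis` and `simulate`.

```python
import math
import random


def kernel(s, t, width=0.3):
    """Gaussian kernel; an illustrative stand-in for a spline kernel."""
    return math.exp(-((s - t) ** 2) / (2.0 * width ** 2))


def softplus(f):
    """Numerically stable log(1 + exp(f))."""
    return max(f, 0.0) + math.log1p(math.exp(-abs(f)))


def objective(c, B, Q, y, lam):
    """Penalized negative log-likelihood for Bernoulli responses y."""
    n, m = len(y), len(c)
    nll = 0.0
    for row, yi in zip(B, y):
        f = sum(bj * cj for bj, cj in zip(row, c))  # logit at one data point
        nll += softplus(f) - yi * f
    penalty = sum(c[j] * Q[j][k] * c[k] for j in range(m) for k in range(m))
    return nll / n + lam * penalty


def fit_subset_basis(x, y, m, lam=0.01, iters=200, seed=0):
    """Fit f(x) = sum_j c_j K(x, t_j) using only m of the n basis functions."""
    rng = random.Random(seed)
    centers = rng.sample(x, m)  # assumed: subset chosen uniformly at random
    B = [[kernel(xi, t) for t in centers] for xi in x]      # n x m design
    Q = [[kernel(s, t) for t in centers] for s in centers]  # m x m penalty
    n = len(x)
    # Kernel values lie in (0, 1], so the Hessian norm is at most
    # m/4 + 2*lam*m; a step of its inverse guarantees descent.
    step = 1.0 / (m / 4.0 + 2.0 * lam * m)
    c = [0.0] * m
    for _ in range(iters):
        resid = []
        for row, yi in zip(B, y):
            f = sum(bj * cj for bj, cj in zip(row, c))
            resid.append(1.0 / (1.0 + math.exp(-f)) - yi)   # p_i - y_i
        grad = []
        for j in range(m):
            g = sum(r * row[j] for r, row in zip(resid, B)) / n
            g += 2.0 * lam * sum(Q[j][k] * c[k] for k in range(m))
            grad.append(g)
        c = [cj - step * gj for cj, gj in zip(c, grad)]
    return centers, c, B, Q


def simulate(n, seed=1):
    """Synthetic Bernoulli data with a smooth true logit (illustrative)."""
    rng = random.Random(seed)
    x = [rng.random() for _ in range(n)]
    y = []
    for xi in x:
        p = 1.0 / (1.0 + math.exp(-4.0 * math.sin(2.0 * math.pi * xi)))
        y.append(1.0 if rng.random() < p else 0.0)
    return x, y


x, y = simulate(200)
centers, c, B, Q = fit_subset_basis(x, y, m=10)
```

With 200 observations and 10 basis functions, each iteration works with a 200 x 10 design matrix rather than the full 200 x 200 kernel matrix, which is the source of the savings on large data sets.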

Article information

Ann. Statist., Volume 28, Number 6 (2000), 1570-1600.

First available in Project Euclid: 12 March 2002

Digital Object Identifier: doi:10.1214/aos/1015957471

Primary:
  • 62G07: Density estimation
  • 92C60: Medical epidemiology
  • 68T05: Learning and adaptive systems [See also 68Q32, 91E40]
  • 65D07: Splines
  • 65D10: Smoothing, curve fitting
  • 62A99: None of the above, but in this section
  • 62J07: Ridge regression; shrinkage estimators
Secondary:
  • 41A63: Multidimensional problems (should also be assigned at least one other classification number in this section)
  • 41A15: Spline approximation
  • 62G07: Density estimation
  • 62M30: Spatial processes
  • 65D15: Algorithms for functional approximation
  • 92H25
  • 49M15: Newton-type methods

Keywords: Smoothing spline ANOVA; nonparametric regression; exponential families; risk factor estimation; degrees of freedom; representers; reproducing kernel Hilbert spaces; penalized likelihood


Lin, Xiwu; Wahba, Grace; Xiang, Dong; Gao, Fangyu; Klein, Ronald; Klein, Barbara. Smoothing spline ANOVA models for large data sets with Bernoulli observations and the randomized GACV. Ann. Statist. 28 (2000), no. 6, 1570--1600. doi:10.1214/aos/1015957471.



  • Bowman, K., Sacks, J. and Chang, Y. (1993). Design and analysis of numerical experiments. J. Atmos. Sci. 50 1267-1278.
  • Brumback, B. and Rice, J. (1998). Smoothing spline models for the analysis of nested and crossed samples of curves. J. Amer. Statist. Assoc. 93 961-991.
  • Craven, P. and Wahba, G. (1979). Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation. Numer. Math. 31 377-403.
  • Efron, B. (1986). How biased is the apparent error rate of a prediction rule? J. Amer. Statist. Assoc. 81 461-470.
  • Friedman, J. (1991). Multivariate adaptive regression splines. Ann. Statist. 19 1-141.
  • Gao, F. (1999). Penalized multivariate logistic regression with a large data set. Ph.D. dissertation, Dept. Statistics, Univ. Wisconsin-Madison.
  • Girard, D. (1998). Asymptotic comparison of (partial) cross-validation, GCV and randomized GCV in nonparametric regression. Ann. Statist. 26 315-334.
  • Gong, J., Wahba, G., Johnson, D. and Tribbia, J. (1998). Adaptive tuning of numerical weather prediction models: simultaneous estimation of weighting, smoothing and physical parameters. Monthly Weather Rev. 125 210-231.
  • Gu, C. (1990). Adaptive spline smoothing in non-Gaussian regression models. J. Amer. Statist. Assoc. 85 801-807.
  • Gu, C. (1992). Penalized likelihood regression: a Bayesian analysis. Statist. Sinica 2 255-264.
  • Gu, C. (1998). Structural multivariate function estimation: Some automatic density and hazard estimates. Statist. Sinica 8 317-336.
  • Gu, C. and Wahba, G. (1993). Semiparametric analysis of variance with tensor product thin plate splines. J. Roy. Statist. Soc. Ser. B 55 353-368.
  • Hastie, T. and Tibshirani, R. (1990). Generalized Additive Models. Chapman and Hall, New York.
  • Hutchinson, M. (1984). A summary of some surface fitting and contouring programs for noisy data, Technical Report ACT 84/6, CSIRO Division of Mathematics and Statistics, Canberra, Australia.
  • Kimeldorf, G. and Wahba, G. (1971). Some results on Tchebycheffian spline functions. J. Math. Anal. Appl. 33 82-95.
  • Klein, B. E. K., Klein, R., Jensen, S. and Ritter, L. (1994). Are sex hormones associated with age-related maculopathy in women? The Beaver Dam Eye Study. Trans. Amer. Ophth. Soc. 92 289-297.
  • Klein, R., Klein, B. E. K. and Linton, K. (1992). Prevalence of age-related maculopathy: The Beaver Dam Eye Study. Ophthalmology 99 933-942.
  • Klein, R., Klein, B. E. K., Jensen, S. and Meuer, S. (1997). The five-year incidence and progression of age-related maculopathy: The Beaver Dam Eye Study. Ophthalmology 104 7-21.
  • Klein, R., Klein, B. E. K., Linton, K. and DeMets, D. (1991). The Beaver Dam Eye Study: Visual acuity. Ophthalmology 98 1310-1315.
  • Klein, R., Klein, B. E. K., Moss, S. E., Davis, M. D. and DeMets, D. L. (1984). The Wisconsin Epidemiologic Study of Diabetic Retinopathy. II. Prevalence and risk of diabetic retinopathy when age at diagnosis is less than 30 years. Arch. Ophthalmology 102 520-526.
  • Lin, X. (1998a). Smoothing spline analysis of variance for polychotomous response data. Ph.D. dissertation, Dept. Statistics, Univ. Wisconsin-Madison.
  • Lin, Y. (1998b). Tensor product space ANOVA models. Ann. Statist. 28 734-755.
  • Lin, Y. (1998c). Tensor product space ANOVA models in multivariate function estimation. Ph.D. dissertation, Univ. Pennsylvania, Philadelphia PA.
  • Luo, Z. (1998). Backfitting in smoothing spline ANOVA. Ann. Statist. 26 1733-1759.
  • Luo, Z. and Wahba, G. (1997). Hybrid adaptive splines. J. Amer. Statist. Assoc. 92 107-114.
  • Mallows, C. (1973). Some comments on Cp. Technometrics 15 661-675.
  • SAS Institute, Inc. (1989). SAS/STAT User's Guide, Version 6, 4th ed. SAS Institute, Inc. Cary, NC.
  • Nychka, D. (1988). Bayesian confidence intervals for smoothing splines. J. Amer. Statist. Assoc. 83 1134-1143.
  • Press, W., Teukolsky, S., Vetterling, W. and Flannery, B. (1992). Numerical Recipes in Fortran 77: The Art of Scientific Computing. Cambridge Univ. Press.
  • Silverman, B. (1985). Some aspects of the spline smoothing approach to non-parametric regression curve fitting. J. Roy. Statist. Soc. Ser. B 47 1-52.
  • Verbyla, A., Cullis, B., Kenward, M. and Welham, S. (1997). The analysis of designed experiments and longitudinal data using smoothing splines. Technical Report 97/4, Dept. Statistics, Univ. Adelaide.
  • Wahba, G. (1983). Bayesian "confidence intervals" for the cross-validated smoothing spline. J. Roy. Statist. Soc. Ser. B 45 133-150.
  • Wahba, G. (1990). Spline Models for Observational Data. SIAM, Philadelphia.
  • Wahba, G., Gu, C., Wang, Y. and Chappell, R. (1995). Soft classification, a.k.a. risk estimation, via penalized log likelihood and smoothing spline analysis of variance. In The Mathematics of Generalization (D. Wolpert, ed.) 329-360. Addison-Wesley, Reading, MA.
  • Wahba, G., Wang, Y., Gu, C., Klein, R. and Klein, B. (1995). Smoothing spline ANOVA for exponential families, with application to the Wisconsin Epidemiological Study of Diabetic Retinopathy. Ann. Statist. 23 1865-1895.
  • Wang, Y. (1995). GRKPACK: Fitting smoothing spline analysis of variance models to data from exponential families. Technical Report 942, Dept. Statistics, Univ. Wisconsin-Madison.
  • Wang, Y. (1997). GRKPACK: Fitting smoothing spline analysis of variance models to data from exponential families. Comm. Statist. Simul. Comput. 26 765-782.
  • Wang, Y. (1998). Smoothing spline models with correlated random errors. J. Amer. Statist. Assoc. 93 341-348.
  • Wang, Y. and Wahba, G. (1995). Bootstrap confidence intervals for smoothing splines and their comparison to Bayesian "confidence intervals." J. Statist. Comput. Simul. 51 263-279.
  • Wang, Y., Wahba, G., Gu, C., Klein, R. and Klein, B. (1997). Using smoothing spline ANOVA to examine the relation of risk factors to the incidence and progression of diabetic retinopathy. Statistics in Medicine 16 1357-1376.
  • Wong, W. (1992). Estimation of the loss of an estimate. Technical Report 356, Dept. Statistics, Univ. Chicago.
  • Wood, S. and Kohn, R. (1998). A Bayesian approach to robust binary nonparametric regression. J. Amer. Statist. Assoc. 93 203-213.
  • Xiang, D. (1996). Model fitting and testing for non-Gaussian data with a large data set. Ph.D. dissertation, Univ. Wisconsin-Madison.
  • Xiang, D. and Wahba, G. (1995). Testing the generalized linear model null hypothesis versus "smooth" alternatives. Technical Report 953, Dept. Statistics, Univ. Wisconsin-Madison.
  • Xiang, D. and Wahba, G. (1996). A generalized approximate cross validation for smoothing splines with non-Gaussian data. Statist. Sinica 6 675-692.
  • Xiang, D. and Wahba, G. (1997). Approximate smoothing spline methods for large data sets in the binary case. In Proceedings of the 1997 ASA Joint Statistical Meetings, Biometrics Section 94-98. Amer. Statist. Assoc., Alexandria, VA.
  • Ye, J. (1998). On measuring and correcting the effects of data mining and model selection. J. Amer. Statist. Assoc. 93 120-131.
  • Ye, J. and Wong, W. (1997a). Evaluation of highly complex modeling procedures with Binomial and Poisson data. Unpublished manuscript.
  • Ye, J. and Wong, W. (1997b). Model uncertainty and correcting for selection bias. Unpublished manuscript.