## The Annals of Statistics

### Smoothing spline ANOVA for exponential families, with application to the Wisconsin Epidemiological Study of Diabetic Retinopathy: the 1994 Neyman Memorial Lecture

#### Abstract

Let $y_i$, $i = 1, \dots, n$, be independent observations with the density of $y_i$ of the form $h(y_i, f_i) = \exp[y_i f_i - b(f_i) + c(y_i)]$, where $b$ and $c$ are given functions and $b$ is twice continuously differentiable and bounded away from 0. Let $f_i = f(t(i))$, where $t = (t_1, \dots, t_d) \in \mathsf{T}^{(1)} \otimes \dots \otimes \mathsf{T}^{(d)} = \mathsf{T}$, the $\mathsf{T}^{(\alpha)}$ are measurable spaces of rather general form and $f$ is an unknown function on $\mathsf{T}$ with some assumed "smoothness" properties. Given $\{y_i, t(i), i = 1, \dots, n\}$, it is desired to estimate $f(t)$ for $t$ in some region of interest contained in $\mathsf{T}$. We develop the fitting to these data of smoothing spline ANOVA models of the form $f(t) = C + \sum_{\alpha} f_{\alpha}(t_{\alpha}) + \sum_{\alpha < \beta} f_{\alpha \beta} (t_{\alpha}, t_{\beta}) + \dots$. The components of the decomposition satisfy side conditions which generalize the usual side conditions for parametric ANOVA. The estimate of $f$ is obtained as the minimizer, in an appropriate function space, of $\mathsf{L}(y, f) + \sum_{\alpha} \lambda_{\alpha} J_{\alpha}(f_{\alpha}) + \sum_{\alpha < \beta} \lambda_{\alpha \beta} J_{\alpha \beta}(f_{\alpha \beta}) + \dots$, where $\mathsf{L}(y, f)$ is the negative log likelihood of $y = (y_1, \dots, y_n)'$ given $f$, the $J_{\alpha}, J_{\alpha \beta}, \dots$ are quadratic penalty functionals and the ANOVA decomposition is terminated in some manner. There are five major parts required to turn this program into a practical data analysis tool: (1) methods for deciding which terms in the ANOVA decomposition to include (model selection), (2) methods for choosing good values of the smoothing parameters $\lambda_{\alpha}, \lambda_{\alpha \beta}, \dots$, (3) methods for making confidence statements concerning the estimate, (4) numerical algorithms for the calculations and, finally, (5) public software. In this paper we carry out this program, relying on earlier work and filling in important gaps. The overall scheme is applied to Bernoulli data from the Wisconsin Epidemiologic Study of Diabetic Retinopathy to model the risk of progression of diabetic retinopathy as a function of glycosylated hemoglobin, duration of diabetes and body mass index. It is believed that the results have wide practical application to the analysis of data from large epidemiological studies.
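As an illustration of the exponential-family form in the abstract, the Bernoulli case used in the application corresponds to the standard choices $f = \log[p/(1-p)]$, $b(f) = \log(1 + e^{f})$ and $c(y) = 0$. The Python sketch below evaluates the resulting negative log likelihood and a penalized objective for a single smoothing parameter. It is only a sketch under stated assumptions: the function names are hypothetical, and the squared second-difference penalty on a vector of values is a simple stand-in for the quadratic penalty functionals $J_{\alpha}$, which the paper defines on a function space.

```python
import math


def b(f):
    """Bernoulli log-partition b(f) = log(1 + exp(f)), computed stably."""
    return max(f, 0.0) + math.log1p(math.exp(-abs(f)))


def risk(f):
    """Map the logit f back to a probability p = exp(f) / (1 + exp(f))."""
    if f >= 0.0:
        return 1.0 / (1.0 + math.exp(-f))
    e = math.exp(f)
    return e / (1.0 + e)


def neg_log_lik(y, f):
    """L(y, f) = sum_i [-y_i f_i + b(f_i)]; c(y_i) = 0 for Bernoulli data."""
    return sum(-yi * fi + b(fi) for yi, fi in zip(y, f))


def roughness(f):
    """Assumed stand-in for J: sum of squared second differences of f."""
    return sum(
        (f[i - 1] - 2.0 * f[i] + f[i + 1]) ** 2 for i in range(1, len(f) - 1)
    )


def penalized_objective(y, f, lam):
    """Negative log likelihood plus lam times the stand-in penalty."""
    return neg_log_lik(y, f) + lam * roughness(f)


if __name__ == "__main__":
    y = [0, 1, 1]            # hypothetical Bernoulli outcomes
    f = [-1.0, 0.5, 2.0]     # hypothetical logit values f(t(i))
    print(neg_log_lik(y, f))
    print(penalized_objective(y, f, lam=1.0))
    print([risk(fi) for fi in f])
```

In this toy setting, a larger `lam` favors candidate vectors `f` with smaller second differences, which mirrors the role the smoothing parameters $\lambda_{\alpha}$ play in the abstract's objective.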

#### Article information

Source
Ann. Statist., Volume 23, Number 6 (1995), 1865-1895.

Dates
First available in Project Euclid: 15 October 2002

Permanent link to this document
https://projecteuclid.org/euclid.aos/1034713638

Digital Object Identifier
doi:10.1214/aos/1034713638

Mathematical Reviews number (MathSciNet)
MR1389856

Zentralblatt MATH identifier
0854.62042

#### Citation

Wahba, Grace; Wang, Yuedong; Gu, Chong; Klein, Ronald; Klein, Barbara. Smoothing spline ANOVA for exponential families, with application to the Wisconsin Epidemiological Study of Diabetic Retinopathy: the 1994 Neyman Memorial Lecture. Ann. Statist. 23 (1995), no. 6, 1865-1895. doi:10.1214/aos/1034713638. https://projecteuclid.org/euclid.aos/1034713638
