Statistical Science

ROS Regression: Integrating Regularization with Optimal Scaling Regression

Jacqueline J. Meulman, Anita J. van der Kooij, and Kevin L. W. Duisters

Full-text: Open access

Abstract

We present a methodology for multiple regression analysis that deals with categorical variables (possibly mixed with continuous ones), in combination with regularization, variable selection and high-dimensional data ($P\gg N$). Regularization and optimal scaling (OS) are two important extensions of ordinary least squares regression (OLS) that will be combined in this paper. There are two data analytic situations for which optimal scaling was developed. One is the analysis of categorical data, and the other is the need for transformations because of nonlinear relationships between predictors and outcome. Optimal scaling of categorical data finds quantifications for the categories, both for the predictors and for the outcome variables, that are optimal for the regression model in the sense that they maximize the multiple correlation. When nonlinear relationships exist, nonlinear transformations of the predictors and the outcome maximize the multiple correlation in the same way. We will consider a variety of transformation types; typically we use step functions for categorical variables, and smooth (spline) functions for continuous variables. Both types of functions can be restricted to be monotonic, preserving the ordinal information in the data. In combination with optimal scaling, three popular regularization methods will be considered: Ridge regression, the Lasso and the Elastic Net. The resulting method will be called ROS Regression (Regularized Optimal Scaling Regression). The OS algorithm provides straightforward and efficient estimation of the regularized regression coefficients, automatically yields the Group Lasso and Blockwise Sparse Regression, and extends them with the option to preserve ordinal properties in the data. Extended examples are provided.
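To make the combination concrete, here is a minimal sketch of a loss function consistent with the abstract; the notation ($\varphi_j$, $\beta_j$, $\lambda_1$, $\lambda_2$) is ours, and the exact formulation in the paper may differ. Writing $\varphi_0$ for the optimal scaling of the outcome and $\varphi_j$ for the scaling of predictor $j$ (a step function for a categorical variable, a spline for a continuous one, optionally restricted to be monotonic), ROS Regression can be thought of as minimizing

$$
L(\varphi_0,\ldots,\varphi_P,\boldsymbol\beta) \;=\; \Bigl\lVert \varphi_0(\mathbf{y}) - \sum_{j=1}^{P} \beta_j\,\varphi_j(\mathbf{x}_j) \Bigr\rVert^2 \;+\; \lambda_1 \sum_{j=1}^{P} \lvert\beta_j\rvert \;+\; \lambda_2 \sum_{j=1}^{P} \beta_j^2,
$$

over the transformations and the coefficients jointly, with the transformed variables standardized. Setting $\lambda_2=0$ gives the Lasso penalty, setting $\lambda_1=0$ gives Ridge regression, and taking both nonzero gives the Elastic Net.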

Article information

Source
Statist. Sci., Volume 34, Number 3 (2019), 361–390.

Dates
First available in Project Euclid: 11 October 2019

Permanent link to this document
https://projecteuclid.org/euclid.ss/1570780975

Digital Object Identifier
doi:10.1214/19-STS697

Keywords
Lasso and Elastic Net regularization for nominal and ordinal data; monotonic Group Lasso regularization for categorical high-dimensional data; optimal scaling; linearization of nonlinear relationships; monotonic step functions and splines

Citation

Meulman, Jacqueline J.; van der Kooij, Anita J.; Duisters, Kevin L. W. ROS Regression: Integrating Regularization with Optimal Scaling Regression. Statist. Sci. 34 (2019), no. 3, 361–390. doi:10.1214/19-STS697. https://projecteuclid.org/euclid.ss/1570780975



References

  • Angoff, C. and Mencken, H. L. (1931). The worst American state. American Mercury 24 1–16, 175–188, 355–371.
  • Barlow, R. E., Bartholomew, D. J., Bremner, J. M. and Brunk, H. D. (1972). Statistical Inference Under Order Restrictions: The Theory and Application of Isotonic Regression. Wiley, New York.
  • Bock, R. D. (1960). Methods and Applications of Optimal Scaling. Report 25. L. L. Thurstone Lab, Univ. North Carolina, Chapel Hill.
  • Breiman, L. (1995). Better subset regression using the nonnegative garrote. Technometrics 37 373–384.
  • Breiman, L. and Friedman, J. H. (1985). Estimating optimal transformations for multiple regression and correlation (with discussion). J. Amer. Statist. Assoc. 80 580–619.
  • Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984). Classification and Regression Trees. Wadsworth Statistics/Probability Series. Wadsworth, Belmont, CA.
  • Buja, A. (1990). Remarks on functional canonical variates, alternating least squares methods and ACE. Ann. Statist. 18 1032–1069.
  • Buja, A., Hastie, T. and Tibshirani, R. (1989). Linear smoothers and additive models. Ann. Statist. 17 453–555.
  • Cai, T. T. and Zhang, L. (2018). High-dimensional Gaussian copula regression: Adaptive estimation and statistical inference. Statist. Sinica 28 963–993.
  • Céa, J. and Glowinski, R. (1973). Sur des méthodes d’optimisation par relaxation. Rev. Française Automat. Informat. Recherche Opérationnelle Sér. Rouge 7 5–31.
  • Chen, S. S., Donoho, D. L. and Saunders, M. A. (1998). Atomic decomposition by basis pursuit. SIAM J. Sci. Comput. 20 33–61.
  • Chouldechova, A. and Hastie, T. J. (2015). Generalized Additive Model Selection. Available at arXiv:1506.03850 [stat.ML].
  • Daubechies, I., Defrise, M. and De Mol, C. (2004). An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Comm. Pure Appl. Math. 57 1413–1457.
  • De Leeuw, J., Young, F. W. and Takane, Y. (1976). Additive structure in qualitative data. Psychometrika 41 471–503.
  • Dette, H., Van Hecke, R. and Volgushev, S. (2014). Some comments on copula-based regression. J. Amer. Statist. Assoc. 109 1319–1324.
  • Dhillon, I. S. (2008). The log-determinant divergence and its applications. Paper presented at the Householder Symposium XVII, Zeuthen, Germany.
  • Efron, B. (1983). Estimating the error rate of a prediction rule: Improvement on cross-validation. J. Amer. Statist. Assoc. 78 316–331.
  • Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression. Ann. Statist. 32 407–499.
  • Frank, I. E. and Friedman, J. H. (1993). A statistical view of some chemometrics regression tools. Technometrics 35 109–148.
  • Friedman, J. H. (1991). Multivariate adaptive regression splines (with discussion). Ann. Statist. 19 1–141.
  • Friedman, J., Hastie, T. J. and Tibshirani, R. J. (2010). Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33 1–22.
  • Friedman, J., Hastie, T. and Tibshirani, R. (2012). glmnet: Lasso and elastic-net regularized generalized linear models. Available at http://CRAN.R-project.org/package=glmnet. R package version 1.9-5.
  • Friedman, J. H. and Meulman, J. J. (2003). Prediction with multiple additive regression trees with application in epidemiology. Stat. Med. 22 1365–1381.
  • Friedman, J. H. and Popescu, B. E. (2004). Gradient directed regularization for linear regression and classification. Technical report, Dept. Statistics, Stanford Univ., Stanford, CA.
  • Friedman, J. H. and Stuetzle, W. (1981). Projection pursuit regression. J. Amer. Statist. Assoc. 76 817–823.
  • Friedman, J., Hastie, T., Höfling, H. and Tibshirani, R. (2007). Pathwise coordinate optimization. Ann. Appl. Stat. 1 302–332.
  • Fu, W. J. (1998). Penalized regressions: The Bridge versus the Lasso. J. Comput. Graph. Statist. 7 397–416.
  • Genest, C. and Nešlehová, J. (2007). A primer on copulas for count data. Astin Bull. 37 475–515.
  • Gifi, A. (1981). Nonlinear multivariate analysis. Unpublished Manuscript. Department of Data Theory, Leiden University, Leiden.
  • Gifi, A. (1990). Nonlinear Multivariate Analysis. Wiley Series in Probability and Mathematical Statistics: Probability and Mathematical Statistics. Wiley, Chichester.
  • Golub, G. H., Heath, M. and Wahba, G. (1979). Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics 21 215–223.
  • Groenen, P. J. F., van Os, B.-J. and Meulman, J. J. (2000). Optimal scaling by alternating length-constrained nonnegative least squares, with application to distance-based analysis. Psychometrika 65 511–524.
  • Gurin, L. G., Poljak, B. T. and Raĭk, È. V. (1967). Projection methods for finding a common point of convex sets. Zh. Vychisl. Mat. Mat. Fiz. 7 1211–1228.
  • Hastie, T. J. and Tibshirani, R. J. (1990). Generalized Additive Models. Monographs on Statistics and Applied Probability 43. CRC Press, London.
  • Hastie, T., Tibshirani, R. and Buja, A. (1994). Flexible discriminant analysis by optimal scoring. J. Amer. Statist. Assoc. 89 1255–1270.
  • Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Springer Series in Statistics. Springer, New York.
  • Hayashi, C. (1952). On the prediction of phenomena from qualitative data and the quantification of qualitative data from the mathematico-statistical point of view. Ann. Inst. Statist. Math. 1952 93–96.
  • Hoerl, A. E. and Kennard, R. (1970a). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12 55–67.
  • Hoerl, A. E. and Kennard, R. (1970b). Ridge regression: Applications to nonorthogonal problems. Technometrics 12 69–82.
  • Hoff, P. D. (2007). Extending the rank likelihood for semiparametric copula estimation. Ann. Appl. Stat. 1 265–283.
  • IBM Corp. (2010). IBM SPSS Statistics 19.0 Algorithms. IBM Corp., Armonk, NY.
  • James, W. and Stein, C. (1961). Estimation with quadratic loss. In Proc. 4th Berkeley Sympos. Math. Statist. and Prob., Vol. I 361–379. Univ. California Press, Berkeley, CA.
  • Kim, Y., Kim, J. and Kim, Y. (2006). Blockwise sparse regression. Statist. Sinica 16 375–390.
  • Kolev, N. and Paiva, D. (2009). Copula-based regression models: A survey. J. Statist. Plann. Inference 139 3847–3856.
  • Kruskal, J. B. (1964). Nonmetric multidimensional scaling: A numerical method. Psychometrika 29 115–129.
  • Kruskal, J. B. (1965). Analysis of factorial experiments by estimating monotone transformations of the data. J. Roy. Statist. Soc. Ser. B 27 251–263.
  • Masarotto, G. and Varin, C. (2012). Gaussian copula marginal regression. Electron. J. Stat. 6 1517–1549.
  • Mazumder, R., Friedman, J. H. and Hastie, T. (2011). SparseNet: Coordinate descent with nonconvex penalties. J. Amer. Statist. Assoc. 106 1125–1138.
  • Meulman, J. J. (1986). A Distance Approach to Nonlinear Multivariate Analysis. DSWO Press, Leiden.
  • Meulman, J. J., Heiser, W. J. and SPSS (1998). SPSS Categories 8.0. SPSS Inc., Chicago, IL.
  • Meulman, J. J., Heiser, W. J. and SPSS Inc. (2010). IBM SPSS Categories 19. IBM Corp., Armonk, NY.
  • Meulman, J. J., Zeppa, P., Boon, M. E. and Rietveld, W. J. (1992). Prediction of various grades of cervical neoplasia on plastic-embedded cytobrush samples. Discriminant analysis with qualitative and quantitative predictors. Anal. Quant. Cytol. Histol. 14 60–72.
  • Nelder, J. A. and Wedderburn, R. W. M. (1972). Generalized linear models. J. Roy. Statist. Soc. Ser. A 135 370–384.
  • Nishisato, S. (1980). Analysis of Categorical Data: Dual Scaling and Its Applications. Mathematical Expositions 24. Univ. Toronto Press, Toronto.
  • Nishisato, S. (1994). Elements of Dual Scaling: An Introduction to Practical Data Analysis. Lawrence Erlbaum, Hillsdale, NJ.
  • Noh, H., El Ghouch, A. and Bouezmarni, T. (2013). Copula-based regression estimation and inference. J. Amer. Statist. Assoc. 108 676–688.
  • Oberhofer, W. and Kmenta, J. (1974). A general procedure for obtaining maximum likelihood estimates in generalized regression models. Econometrica 42 579–590.
  • Osborne, M. R., Presnell, B. and Turlach, B. A. (2000a). A new approach to variable selection in least squares problems. IMA J. Numer. Anal. 20 389–403.
  • Osborne, M. R., Presnell, B. and Turlach, B. A. (2000b). On the LASSO and its dual. J. Comput. Graph. Statist. 9 319–337.
  • Parsa, R. A. and Klugman, S. A. (2011). Copula regression. Proc. Casualty Actuar. Soc. 5 45–54.
  • Perkins, S., Lacker, K. and Theiler, J. (2003). Grafting: Fast, incremental feature selection by gradient descent in function space. J. Mach. Learn. Res. 3 1333–1356.
  • Pitt, M., Chan, D. and Kohn, R. (2006). Efficient Bayesian inference for Gaussian copula regression models. Biometrika 93 537–554.
  • R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. Available at http://www.R-project.org/.
  • Ramsay, J. O. (1988). Monotone regression splines in action (with discussion). Statist. Sci. 3 425–441.
  • SAS/STAT (1990). User’s Guide, Version 6, Vol. 2. SAS Institute Inc., Cary, NC.
  • Sklar, M. (1959). Fonctions de répartition à $n$ dimensions et leurs marges. Publ. Inst. Stat. Univ. Paris 8 229–231.
  • Song, P. X.-K. (2000). Multivariate dispersion models generated from Gaussian copula. Scand. J. Stat. 27 305–320.
  • Tibshirani, R. (1988). Estimating transformations for regression via additivity and variance stabilization. J. Amer. Statist. Assoc. 83 394–405.
  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267–288.
  • Tibshirani, R. (2011). Regression shrinkage and selection via the lasso: A retrospective. J. R. Stat. Soc. Ser. B. Stat. Methodol. 73 273–282.
  • Tikhonov, A. N. (1943). On the stability of inverse problems. C. R. (Dokl.) Acad. Sci. URSS 39 176–179.
  • Trivedi, P. K. and Zimmer, D. M. (2005). Copula modeling: An introduction for practitioners. Found. Trends Econom. 1 1–111.
  • Tseng, P. (1988). Coordinate Ascent for Maximizing Nondifferentiable Concave Functions. Technical Report LIDS-P-1840, Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA.
  • Tseng, P. (2001). Convergence of a block method for nondifferentiable minimization. J. Optim. Theory Appl. 109 475–494.
  • van de Geer, S., Bühlmann, P., Ritov, Y. and Dezeure, R. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Statist. 42 1166–1202.
  • Van der Kooij, A. J. (2007). Prediction accuracy and stability of regression with optimal scaling transformations. Thesis, Leiden Univ. Available at https://openaccess.leidenuniv.nl/handle/1887/12096.
  • Wainer, H. and Thissen, D. (1981). Graphical data analysis. Annu. Rev. Psychol. 32 191–241.
  • Walberg, H. J. and Rasher, S. P. (1977). The ways schooling makes a difference. Phi Delta Kappan 58 703–707.
  • Winsberg, S. and Ramsay, J. O. (1980). Monotonic transformations to additivity using splines. Biometrika 67 669–674.
  • Wu, T. T. and Lange, K. (2008). Coordinate descent algorithms for lasso penalized regression. Ann. Appl. Stat. 2 224–244.
  • Young, F. W. (1981). Quantitative analysis of qualitative data. Psychometrika 46 357–388.
  • Young, F. W., De Leeuw, J. and Takane, Y. (1976). Regression with qualitative and quantitative variables: An alternating least squares method with Optimal Scaling features. Psychometrika 41 505–529.
  • Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B. Stat. Methodol. 68 49–67.
  • Zangwill, W. I. (1969/70). Convergence conditions for nonlinear programming algorithms. Manage. Sci. 16 1–13.
  • Zhao, P. and Yu, B. (2007). Stagewise lasso. J. Mach. Learn. Res. 8 2701–2726.
  • Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B. Stat. Methodol. 67 301–320.