Open Access
Envelope-based sparse partial least squares
Guangyu Zhu, Zhihua Su
Ann. Statist. 48(1): 161-182 (February 2020). DOI: 10.1214/18-AOS1796
Abstract

Sparse partial least squares (SPLS) is widely used in applied sciences as a method that performs dimension reduction and variable selection simultaneously in linear regression. Several implementations of SPLS have been derived, among which the SPLS proposed in Chun and Keleş (J. R. Stat. Soc. Ser. B. Stat. Methodol. 72 (2010) 3–25) is very popular and highly cited. However, for all of these implementations, the theoretical properties of SPLS are largely unknown. In this paper, we propose a new version of SPLS, called the envelope-based SPLS, using a connection between envelope models and partial least squares (PLS). We establish the consistency, oracle property and asymptotic normality of the envelope-based SPLS estimator. Both the large-sample scenario and the high-dimensional scenario are considered. We also develop the envelope-based SPLS estimator in the context of generalized linear models, and discuss its theoretical properties, including consistency, oracle property and asymptotic distribution. Numerical experiments and examples show that the envelope-based SPLS estimator has better variable selection and prediction performance than the SPLS estimator of Chun and Keleş (J. R. Stat. Soc. Ser. B. Stat. Methodol. 72 (2010) 3–25).

References

1.

Agresti, A. (2013). Categorical Data Analysis, 3rd ed. Wiley Series in Probability and Statistics. Wiley Interscience, Hoboken, NJ. Zbl 1281.62022.

2.

Chen, L. and Huang, J. Z. (2012). Sparse reduced-rank regression for simultaneous dimension reduction and variable selection. J. Amer. Statist. Assoc. 107 1533–1545. Zbl 1258.62075. DOI: 10.1080/01621459.2012.734178

3.

Chen, X., Zou, C. and Cook, R. D. (2010). Coordinate-independent sparse sufficient dimension reduction and variable selection. Ann. Statist. 38 3696–3723. Zbl 1204.62107. DOI: 10.1214/10-AOS826. Project Euclid: euclid.aos/1291126970

4.

Chun, H. and Keleş, S. (2010). Sparse partial least squares regression for simultaneous dimension reduction and variable selection. J. R. Stat. Soc. Ser. B. Stat. Methodol. 72 3–25. Zbl 1411.62184. DOI: 10.1111/j.1467-9868.2009.00723.x

5.

Chung, D. and Keleş, S. (2010). Sparse partial least squares classification for high dimensional data. Stat. Appl. Genet. Mol. Biol. 9 Art. 17, 32. Zbl 1304.92041. DOI: 10.2202/1544-6115.1492

6.

Cook, R. D., Forzani, L. and Su, Z. (2016). A note on fast envelope estimation. J. Multivariate Anal. 150 42–54. Zbl 1345.62082. DOI: 10.1016/j.jmva.2016.05.006

7.

Cook, R. D., Helland, I. S. and Su, Z. (2013). Envelopes and partial least squares regression. J. R. Stat. Soc. Ser. B. Stat. Methodol. 75 851–877. Zbl 1411.62137. DOI: 10.1111/rssb.12018

8.

Cook, R. D., Li, B. and Chiaromonte, F. (2010). Envelope models for parsimonious and efficient multivariate linear regression. Statist. Sinica 20 927–1010. Zbl 1259.62059.

9.

Cook, R. D. and Zhang, X. (2015). Foundations for envelope models and methods. J. Amer. Statist. Assoc. 110 599–611. Zbl 1390.62131. DOI: 10.1080/01621459.2014.983235

10.

Cook, R. D. and Zhang, X. (2016). Algorithms for envelope estimation. J. Comput. Graph. Statist. 25 284–300.

11.

De Jong, S. (1993). SIMPLS: An alternative approach to partial least squares regression. Chemom. Intell. Lab. Syst. 18 251–263.

12.

Diaconis, P. and Freedman, D. (1984). Asymptotics of graphical projection pursuit. Ann. Statist. 12 793–815. Zbl 0559.62002. DOI: 10.1214/aos/1176346703. Project Euclid: euclid.aos/1176346703

13.

Ding, B. and Gentleman, R. (2005). Classification using generalized partial least squares. J. Comput. Graph. Statist. 14 280–298.

14.

Huang, X., Pan, W., Park, S., Han, X., Miller, L. W. and Hall, J. (2004). Modeling the relationship between LVAD support time and gene expression changes in the human heart by penalized partial least squares. Bioinformatics 20 888–894.

15.

Karabulut, E. M. and Ibrikci, T. (2014). Effective automated prediction of vertebral column pathologies based on logistic model tree with SMOTE preprocessing. J. Med. Syst. 38 1–9.

16.

Khare, K., Oh, S.-Y. and Rajaratnam, B. (2015). A convex pseudolikelihood framework for high dimensional partial correlation estimation with convergence guarantees. J. R. Stat. Soc. Ser. B. Stat. Methodol. 77 803–825. Zbl 1414.62183. DOI: 10.1111/rssb.12088

17.

Khare, K., Pal, S. and Su, Z. (2017). A Bayesian approach for envelope models. Ann. Statist. 45 196–222. Zbl 1367.62174. DOI: 10.1214/16-AOS1449. Project Euclid: euclid.aos/1487667621

18.

Lê Cao, K.-A., Rossouw, D., Robert-Granié, C. and Besse, P. (2008). A sparse PLS for variable selection when integrating omics data. Stat. Appl. Genet. Mol. Biol. 7 Art. 35, 31. Zbl 1276.62061.

19.

Lee, D., Lee, W., Lee, Y. and Pawitan, Y. (2011). Sparse partial least-squares regression and its applications to high-throughput data analysis. Chemom. Intell. Lab. Syst. 109 1–8.

20.

Lichman, M. (2013). UCI machine learning repository.

21.

Ma, Y. and Zhu, L. (2013). Efficiency loss and the linearity condition in dimension reduction. Biometrika 100 371–383. Zbl 1284.62262. DOI: 10.1093/biomet/ass075

22.

Marx, B. D. (1996). Iteratively reweighted partial least squares estimation for generalized linear regression. Technometrics 38 374–381. Zbl 0902.62081. DOI: 10.1080/00401706.1996.10484549

23.

McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models. Monographs on Statistics and Applied Probability. CRC Press, London. Zbl 0744.62098.

24.

Park, P. J., Tian, L. and Kohane, I. S. (2002). Linking gene expression data with patient survival times using partial least squares. Bioinformatics 18 S120–S127.

25.

Peng, J., Wang, P., Zhou, N. and Zhu, J. (2009). Partial correlation estimation by joint sparse regression models. J. Amer. Statist. Assoc. 104 735–746. Zbl 1388.62046. DOI: 10.1198/jasa.2009.0126

26.

Ramsey, F. and Schafer, D. (2012). The Statistical Sleuth: A Course in Methods of Data Analysis. Cengage Learning, Boston.

27.

Rothman, A. J., Bickel, P. J., Levina, E. and Zhu, J. (2008). Sparse permutation invariant covariance estimation. Electron. J. Stat. 2 494–515. Zbl 1320.62135. DOI: 10.1214/08-EJS176

28.

Su, Z. and Cook, R. D. (2011). Partial envelopes for efficient estimation in multivariate linear regression. Biometrika 98 133–146. Zbl 1214.62062. DOI: 10.1093/biomet/asq063

29.

Su, Z. and Cook, D. (2012). Inner envelopes: Efficient estimation in multivariate linear regression. Biometrika 99 687–702. Zbl 06085163. DOI: 10.1093/biomet/ass024

30.

Su, Z., Zhu, G., Chen, X. and Yang, Y. (2016). Sparse envelope model: Efficient estimation and response variable selection in multivariate linear regression. Biometrika 103 579–593. Zbl 07072139. DOI: 10.1093/biomet/asw036

31.

Wang, L., Chen, G. and Li, H. (2007). Group SCAD regression analysis for microarray time course gene expression data. Bioinformatics 23 1486–1494.

32.

Wold, H. (1966). Estimation of principal components and related models by iterative least squares. In Multivariate Analysis (Proc. Internat. Sympos., Dayton, Ohio, 1965) 391–420. Academic Press, New York.

33.

Wold, H. (1975). Path Models with Latent Variables: The NIPALS Approach. Academic Press, Cambridge. Zbl 0331.62058.

34.

Zhang, T. and Zou, H. (2014). Sparse precision matrix estimation via lasso penalized D-trace loss. Biometrika 101 103–120. Zbl 1285.62063. DOI: 10.1093/biomet/ast059

35.

Zhu, G. and Su, Z. (2019). Supplement to “Envelope-based sparse partial least squares.” https://doi.org/10.1214/18-AOS1796SUPP.

36.

Zou, H. (2006). The adaptive lasso and its oracle properties. J. Amer. Statist. Assoc. 101 1418–1429. Zbl 1171.62326. DOI: 10.1198/016214506000000735
Copyright © 2020 Institute of Mathematical Statistics
Guangyu Zhu and Zhihua Su "Envelope-based sparse partial least squares," The Annals of Statistics 48(1), 161-182, (February 2020). https://doi.org/10.1214/18-AOS1796
Received: 1 April 2017; Published: February 2020