Statistics Surveys

PLS for Big Data: A unified parallel algorithm for regularised group PLS

Pierre Lafaye de Micheaux, Benoît Liquet, and Matthew Sutton

Full-text: Open access

Abstract

Partial Least Squares (PLS) methods have been widely used to analyse the association between two blocks of data. These powerful approaches can be applied to data sets in which the number of variables exceeds the number of observations, and in the presence of high collinearity between variables. Different sparse versions of PLS have been developed to integrate multiple data sets while simultaneously selecting the contributing variables. Sparse modelling is a key factor in obtaining better estimators and in identifying associations between multiple data sets. The cornerstone of the sparse PLS methods is the link between the singular value decomposition (SVD) of a matrix (constructed from deflated versions of the original data) and least squares minimisation in linear regression. We review four popular PLS methods for two blocks of data. A unified algorithm is proposed to perform all four types of PLS, including their regularised versions. We present various approaches to decrease the computation time and show how the whole procedure scales to big data sets. The bigsgPLS R package implements our unified algorithm and is available at https://github.com/matt-sutton/bigsgPLS.
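
The SVD link described above can be made concrete in a few lines of base R. The sketch below is only an illustration under simulated data, not the authors' bigsgPLS implementation; the matrices X and Y and all variable names are invented for the example. The first pair of PLS weight vectors (u, v), which maximise the covariance between the scores Xu and Yv, are the leading left and right singular vectors of the cross-product matrix M = t(X) %*% Y.

    ## Minimal sketch of the SVD-PLS link (illustration only; not bigsgPLS)
    set.seed(1)
    n <- 100; p <- 10; q <- 5
    X <- scale(matrix(rnorm(n * p), n, p))   # first data block, n x p
    Y <- scale(matrix(rnorm(n * q), n, q))   # second data block, n x q

    M <- crossprod(X, Y)                     # M = t(X) %*% Y, a p x q matrix
    s <- svd(M, nu = 1, nv = 1)              # leading singular triplet only
    u <- s$u                                 # first X-weight vector (p x 1)
    v <- s$v                                 # first Y-weight vector (q x 1)

    xi    <- X %*% u                         # first X-score (latent variable)
    omega <- Y %*% v                         # first Y-score

    ## Later components repeat the SVD on deflated data, e.g.
    ## X <- X - xi %*% t(crossprod(X, xi)) / drop(crossprod(xi))

Computing only the leading singular triplet (nu = 1, nv = 1) rather than the full SVD is one standard device for reducing computation time on large matrices, in the spirit of the scalability strategies discussed in the paper.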

Article information

Source
Statist. Surv., Volume 13 (2019), 119–149.

Dates
Received: August 2018
First available in Project Euclid: 2 September 2019

Permanent link to this document
https://projecteuclid.org/euclid.ssu/1567411220

Digital Object Identifier
doi:10.1214/19-SS125

Mathematical Reviews number (MathSciNet)
MR3998928

Zentralblatt MATH identifier
07104724

Subjects
Primary: 62-02: Research exposition (monographs, survey articles); 62J99: None of the above, but in this section

Keywords
High dimensional data; Lasso penalties; Partial Least Squares; Singular Value Decomposition

Rights
Creative Commons Attribution 4.0 International License.

Citation

Lafaye de Micheaux, Pierre; Liquet, Benoît; Sutton, Matthew. PLS for Big Data: A unified parallel algorithm for regularised group PLS. Statist. Surv. 13 (2019), 119–149. doi:10.1214/19-SS125. https://projecteuclid.org/euclid.ssu/1567411220
