Electronic Journal of Statistics

On principal components regression, random projections, and column subsampling

Martin Slawski


Abstract

Principal Components Regression (PCR) is a traditional tool for dimension reduction in linear regression that has been both criticized and defended. One concern about PCR is that obtaining the leading principal components tends to be computationally demanding for large data sets. While random projections do not possess the optimality properties of the leading principal subspace, they are computationally appealing and hence have become increasingly popular in recent years. In this paper, we present an analysis showing that for random projections satisfying a Johnson-Lindenstrauss embedding property, the prediction error in subsequent regression is close to that of PCR, at the expense of requiring a slightly larger number of random projections than principal components. Column subsampling constitutes an even cheaper way of randomized dimension reduction outside the class of Johnson-Lindenstrauss transforms. We provide numerical results based on synthetic and real data as well as basic theory revealing differences and commonalities in terms of statistical performance.
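
As a rough illustration of the three dimension-reduction schemes compared above, the sketch below reduces a design matrix to k dimensions via (i) its leading principal directions, (ii) a Gaussian random projection (a standard Johnson-Lindenstrauss construction), and (iii) uniform column subsampling, and then fits ordinary least squares on each reduced design. This is a minimal NumPy sketch on synthetic data, not the paper's experimental setup; the dimensions, noise level, and variable names are assumptions made only for this example.

```python
# Minimal sketch (not from the paper): three ways of reducing X from p columns
# to k dimensions before ordinary least squares. All settings are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 200, 500, 20

# Synthetic data: the response depends on a few columns of X plus noise.
# Columns of X are mean-zero by construction, so no explicit centering is done.
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:10] = 1.0
y = X @ beta + 0.5 * rng.standard_normal(n)

def ols_fit_predict(Z, y):
    """Least-squares fit on the reduced design Z; returns in-sample predictions."""
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return Z @ coef

# (1) PCR: project X onto its k leading right singular vectors.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
Z_pcr = X @ Vt[:k].T

# (2) Random projection: multiply X by a p-by-k Gaussian matrix scaled by 1/sqrt(k),
#     a standard Johnson-Lindenstrauss construction.
R = rng.standard_normal((p, k)) / np.sqrt(k)
Z_rp = X @ R

# (3) Column subsampling: keep k columns of X chosen uniformly at random.
cols = rng.choice(p, size=k, replace=False)
Z_cs = X[:, cols]

for name, Z in [("PCR", Z_pcr), ("random projection", Z_rp), ("column subsampling", Z_cs)]:
    mse = np.mean((y - ols_fit_predict(Z, y)) ** 2)
    print(f"{name:>20s}: in-sample MSE = {mse:.3f}")
```

Consistent with the abstract, one would expect the random-projection fit to need a somewhat larger k than PCR to reach comparable prediction error, while column subsampling is cheaper still but falls outside the Johnson-Lindenstrauss class.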

Article information

Source
Electron. J. Statist., Volume 12, Number 2 (2018), 3673-3712.

Dates
Received: October 2017
First available in Project Euclid: 31 October 2018

Permanent link to this document
https://projecteuclid.org/euclid.ejs/1540951345

Digital Object Identifier
doi:10.1214/18-EJS1486

Rights
Creative Commons Attribution 4.0 International License.

Citation

Slawski, Martin. On principal components regression, random projections, and column subsampling. Electron. J. Statist. 12 (2018), no. 2, 3673--3712. doi:10.1214/18-EJS1486. https://projecteuclid.org/euclid.ejs/1540951345



References

  • [1] D. Achlioptas, Database-friendly random projections: Johnson-Lindenstrauss with binary coins, Journal of Computer and System Sciences, 66 (2003), pp. 671–687.
  • [2] D. Ahfock, W. Astle, and S. Richardson, Statistical Properties of Sketching Algorithms. arXiv:1706.03665, 2017.
  • [3] N. Ailon and B. Chazelle, Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform, in Proceedings of the Symposium on Theory of Computing (STOC), 2006, pp. 557–563.
  • [4] N. Ailon and E. Liberty, Almost optimal unrestricted fast Johnson-Lindenstrauss transform, in Symposium on Discrete Algorithms (SODA), 2011.
  • [5] Z. Allen-Zhu and Y. Li, Faster Principal Component Regression and Stable Matrix Chebyshev Approximation, in International Conference on Machine Learning, 2017.
  • [6] A. Artemiou and B. Li, On principal components and regression: a statistical explanation of a natural phenomenon, Statistica Sinica, 19 (2009), pp. 1557–1565.
  • [7] R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin, A Simple Proof of the Restricted Isometry Property for Random Matrices, Constructive Approximation, 28 (2006), pp. 253–263.
  • [8] L. Bottou, F. Curtis, and J. Nocedal, Optimization Methods for Large-Scale Machine Learning. arXiv:1606.04838, 2018.
  • [9] L. Breiman, Bagging predictors, Machine Learning, 24 (1996), pp. 123–140.
  • [10] P. Bühlmann and S. van de Geer, Statistics for High-Dimensional Data, Springer, 2011.
  • [11] F. Bunea, Y. She, and M. Wegkmap, Optimal selection of reduced rank estimators of high-dimensional matrices, The Annals of Statistics, 39 (2011), pp. 1282–1309.
  • [12] K. Buza, Data Analysis, Machine Learning and Knowledge Discovery, Springer, 2014, ch. Feedback Prediction for Blogs, pp. 145–152.
  • [13] E. Candes and M. Wakin, An introduction to compressive sampling., IEEE Signal Processing Magazine, 25 (2008), pp. 21–30.
  • [14] T. Cannings and R. Samworth, Random-projection ensemble classification, Journal of the Royal Statistical Society Series B, 79 (2017), pp. 959–1035.
  • [15] P. Dhillon, D. Foster, S. Kakade, and L. Ungar, A risk comparison of ordinary least squares vs. ridge regression, Journal of Machine Learning Research, 14 (2013), pp. 1505–1511.
  • [16] L. Dicker, D. Foster, and D. Hsu, Kernel ridge vs. principal component regression: Minimax bounds and the qualification of regularization operators, Electronic Journal of Statistics, 11 (2017), pp. 1022–1047.
  • [17] B. F, S. Pereverzev, and L. Rosasco, On regularization algorithms in learning theory, Journal of Complexity, 23, pp. 52–72.
  • [18] R. Frostig, C. Musco, C. Musco, and A. Sidford, Principal component projection without principal component analysis, in International Conference on Machine Learning (ICML), 2016.
  • [19] G. Golub and C. V. Loan, Matrix Computations, Johns Hopkins University Press, 3rd ed., 1996.
  • [20] A. Gupta and D. Nagar, Matrix Variate Distributions, CRC Press, 1999.
  • [21] N. Halko, P. Martinsson, and J. Tropp, Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions, SIAM Review, (2011), pp. 217–288.
  • [22] T. Hastie, R. Tisbhirani, and M. Wainwright, Statistical Learning with Sparsity, CRC Press, 2015.
  • [23] D. Homrighausen and D. McDonald, Compressed and Penalized Linear Regression. arXiv:1705.08036, 2017.
  • [24] H. Hotelling, Analysis of a complex of statistical variables into principal components, Journal of Educational Psychology, 24 (1933), pp. 417– 441.
  • [25] D. Hsu, S. Kakade, and T. Zhang, A tail inequality for quadratic forms of sub-Gaussian random vectors, Electronic Communications in Probability, 52 (2012), pp. 1–6.
  • [26] P. Indyk and A. Naor, Nearest-neighbor-preserving embeddings, ACM Transactions on Algorithms, 3 (2007), p. 31.
  • [27] A. James, Normal Multivariate Analysis and the Orthogonal Group, The Annals of Mathematical Statistics, 25 (1954), pp. 40–75.
  • [28] W. Johnson and J. Lindenstrauss, Extensions of Lipschitz mappings into a Hilbert space, Contemporary Mathematics, (1984), pp. 189–206.
  • [29] I. Joliffe, A note on the use of principal components in regression, Journal of the Royal Statistical Society Series C, 31 (1982), pp. 300–303.
  • [30] A. Kaban, New Bounds on Compressive Linear Least Squares Regression, in Artificial Intelligence and Statistics (AISTATS), 2014, pp. 448–456.
  • [31] S. Kasiviswanathan and M. Rudelson, Compressed sparse linear regression. arXiv:1707.08902.
  • [32] M. Kendall, A course in Multivariate Analysis, Griffith, London, 1957.
  • [33] F. Krahmer and R. Ward, New and improved Johnson-Lindenstrauss embeddings via the Restricted Isometry Property, SIAM Journal on Mathematical Analysis, 43 (2011), pp. 1269–1281.
  • [34] B. Laurent and P. Massart, Adaptive estimation of a quadratic functional by model selection, The Annals of Statistics, 28 (2000), pp. 1302–1338.
  • [35] Y. Lu and D. Foster, Fast Ridge Regression with Randomized Principal Component Analysis and Gradient Descent, in Uncertainty in Artificial Intelligence (UAI), 2014.
  • [36] O. Maillard and R. Munos, Compressed least-squares regression, in Advances in Neural Information Processing Systems (NIPS), 2009, pp. 1213–1221.
  • [37] T. Marzetta, G. Tucci, and S. Simon, A Random Matrix-Theoretic Approach to Handling Singular Covariance Estimates, IEEE Transactions on Information Theory, 57 (2011), pp. 6256–6271.
  • [38] J. Matousek, On variants of the Johnson-Lindenstrauss lemma, Random Structures and Algorithms, 33 (2008), pp. 142–156.
  • [39] M. Pilanci and M. Wainwright, Randomized Sketches of Convex Programs With Sharp Guarantees, IEEE Transactions on Information Theory, 61 (2015), pp. 5096–5115.
  • [40] G. Raskutti and M. Mahoney, A Statistical Perspective on Randomized Sketching for Ordinary Least-Squares, Journal of Machine Learning Research, 17 (2016), pp. 1–31.
  • [41] T. Sarlos, Improved approximation algorithms for large matrices via random projections, in Foundations of Computer Science (FOCS), 2006, pp. 143–152.
  • [42] R. Shah and N. Meinshausen, On $b$-bit min-wise hashing for large-scale regression and classification with sparse data, Journal of Machine Learning Research, 18 (2018), pp. 1–42.
  • [43] M. Slawski, Compressed least squares regression revisited, in Artificial Intelligence and Statistics (AISTATS), 2017, pp. 1207–1215.
  • [44] G.-A. Thanei, C. Heinze, and N. Meinshausen, Big and Complex Data Analysis, Springer, 2017, ch. Random Projections for Large-Scale Regression, pp. 51–68.
  • [45] J. Tropp, Improved analysis of the subsampled randomized Hadamard transform, Adaptive Data Analysis, (2011), pp. 115–126.
  • [46] S. Vempala, The Random Projection Method, American Mathematical Society, 2005.
  • [47] S. Wang, A. Gittens, and M. Mahoney, Sketched Ridge Regression: Optimization Perspective, Statistical Perspective, and Model Averaging, Journal of Machine Learning Research, 18 (2017), pp. 1–50.
  • [48] M. Woodrufe, Sketching as a Tool for Numerical Linear Algebra, Foundations and Trends in Theoretical Computer Science, 10 (2014), pp. 1–157.
  • [49] S. Zhou, J. Lafferty, and L. Wasserman, Compressed and privacy-sensitive regression, IEEE Transactions on Information Theory, 55 (2009), pp. 846–866.