Electronic Journal of Statistics

On the predictive potential of kernel principal components

Ben Jones, Andreas Artemiou, and Bing Li

Full-text: Open access


We give a probabilistic analysis of a phenomenon in statistics which, until recently, has not received a convincing explanation. This phenomenon is that the leading principal components tend to possess more predictive power for a response variable than lower-ranking ones, despite the procedure being unsupervised. Our result, in its most general form, shows that the phenomenon goes far beyond the context of linear regression and classical principal components: if an arbitrary distribution for the predictor $X$ and an arbitrary conditional distribution for $Y\vert X$ are chosen, then any measurable function $g(Y)$, subject to a mild condition, tends to be more correlated with the higher-ranking kernel principal components than with the lower-ranking ones. The "arbitrariness" is formulated in terms of unitary invariance, and the tendency is explicitly quantified by exploring how unitary invariance relates to the Cauchy distribution. The most general results, for technical reasons, are shown for the case where the kernel space is finite dimensional. The occurrence of this tendency in real-world databases is also investigated to show that our results are consistent with observation.
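The tendency described in the abstract can be illustrated in its classical special case (linear PCA rather than kernel PCA) with a small simulation. The toy setup below is an assumption for illustration, not taken from the paper: predictors with well-separated covariance eigenvalues and a response depending on all coordinates. The leading principal components, although extracted without reference to $Y$, end up more strongly correlated with it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup (not from the paper): predictors with
# well-separated covariance eigenvalues.
n, p = 2000, 4
eigvals = np.array([10.0, 3.0, 1.0, 0.3])
X = rng.standard_normal((n, p)) * np.sqrt(eigvals)

# A response that loads equally on every coordinate, plus small noise.
Y = X.sum(axis=1) + 0.1 * rng.standard_normal(n)

# Classical PCA via the sample covariance matrix.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / (n - 1)
lam, V = np.linalg.eigh(cov)
order = np.argsort(lam)[::-1]          # components by decreasing variance
scores = Xc @ V[:, order]

# Absolute sample correlation of each principal component with Y.
corr = np.array([abs(np.corrcoef(scores[:, k], Y)[0, 1]) for k in range(p)])
print(corr)
```

In this setup the population correlation of the $k$-th component with $Y$ is proportional to $\sqrt{\lambda_k}$, so the printed correlations decrease down the component ranking; the paper's contribution is to show that an analogous tendency holds generically, for kernel principal components and essentially arbitrary $(X, Y)$ distributions.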

Article information

Electron. J. Statist., Volume 14, Number 1 (2020), 1-23.

Received: August 2019
First available in Project Euclid: 3 January 2020


Primary: 60K35: Interacting random processes; statistical mechanics type models; percolation theory [See also 82B43, 82C43]
Secondary: 60K35: Interacting random processes; statistical mechanics type models; percolation theory [See also 82B43, 82C43]

Keywords: Cauchy distribution; dimension reduction; nonparametric regression; kernel principal components; unitary invariance

Creative Commons Attribution 4.0 International License.


Jones, Ben; Artemiou, Andreas; Li, Bing. On the predictive potential of kernel principal components. Electron. J. Statist. 14 (2020), no. 1, 1--23. doi:10.1214/19-EJS1655. https://projecteuclid.org/euclid.ejs/1578020612


