The Annals of Applied Statistics

Bi-cross-validation of the SVD and the nonnegative matrix factorization

Art B. Owen and Patrick O. Perry

Full-text: Open access

Abstract

This article presents a form of bi-cross-validation (BCV) for choosing the rank in outer product models, especially the singular value decomposition (SVD) and the nonnegative matrix factorization (NMF). Instead of leaving out a set of rows of the data matrix, we leave out a set of rows and a set of columns, and then predict the left-out entries by low-rank operations on the retained data. We prove a self-consistency result expressing the prediction error as a residual from a low-rank approximation. Random matrix theory and some empirical results suggest that smaller hold-out sets lead to more over-fitting, while larger ones are more prone to under-fitting. In simulated examples we find that a method leaving out half the rows and half the columns performs well.
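
The hold-out scheme described in the abstract can be illustrated with a minimal NumPy sketch. It rests on an assumption the abstract does not state: the data matrix is partitioned into blocks A (held-out rows and columns), B (held-out rows, retained columns), C (retained rows, held-out columns) and D (retained rows and columns), and A is predicted by B D_k^+ C, where D_k^+ is the pseudo-inverse of the rank-k truncated SVD of D. The function names `truncated_pinv`, `bcv_error` and `bcv_rank` are hypothetical, and the 2 x 2 fold layout follows the "half the rows and half the columns" variant mentioned above.

```python
import numpy as np

def truncated_pinv(D, k):
    """Pseudo-inverse of the best rank-k approximation of D (zero matrix if k == 0)."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    if k == 0:
        return np.zeros((D.shape[1], D.shape[0]))
    # Keep the k leading singular triples and invert the singular values.
    return (Vt[:k].T / s[:k]) @ U[:, :k].T

def bcv_error(X, rows, cols, k):
    """Squared error when the block X[rows, cols] is predicted at rank k.

    Assumed estimator (not spelled out in the abstract): A_hat = B @ pinv(D_k) @ C.
    """
    row_mask = np.zeros(X.shape[0], dtype=bool)
    col_mask = np.zeros(X.shape[1], dtype=bool)
    row_mask[rows] = True
    col_mask[cols] = True
    A = X[np.ix_(row_mask, col_mask)]      # held-out rows, held-out columns
    B = X[np.ix_(row_mask, ~col_mask)]     # held-out rows, retained columns
    C = X[np.ix_(~row_mask, col_mask)]     # retained rows, held-out columns
    D = X[np.ix_(~row_mask, ~col_mask)]    # retained rows, retained columns
    A_hat = B @ truncated_pinv(D, k) @ C
    return float(np.sum((A - A_hat) ** 2))

def bcv_rank(X, max_rank, rng):
    """Pick the rank with the smallest total error over a random 2 x 2 fold layout."""
    m, n = X.shape
    row_folds = np.array_split(rng.permutation(m), 2)
    col_folds = np.array_split(rng.permutation(n), 2)
    errors = np.zeros(max_rank + 1)
    for rows in row_folds:
        for cols in col_folds:
            for k in range(max_rank + 1):
                errors[k] += bcv_error(X, rows, cols, k)
    return int(np.argmin(errors)), errors
```

As a usage sketch, `bcv_rank(X, max_rank=5, rng=np.random.default_rng(0))` returns the selected rank together with the summed hold-out error for each candidate rank from 0 to 5. `max_rank` should not exceed the smaller dimension of the retained block D. When X has exact rank k and D also has rank k, the prediction error at rank k is zero up to rounding, which is one simple way to sanity-check the code.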

Article information

Source
Ann. Appl. Stat. Volume 3, Number 2 (2009), 564-594.

Dates
First available in Project Euclid: 22 June 2009

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1245676186

Digital Object Identifier
doi:10.1214/08-AOAS227

Mathematical Reviews number (MathSciNet)
MR2578836

Zentralblatt MATH identifier
1166.62047

Keywords
Cross-validation, principal components, random matrix theory, sample reuse, weak factor model

Citation

Owen, Art B.; Perry, Patrick O. Bi-cross-validation of the SVD and the nonnegative matrix factorization. Ann. Appl. Stat. 3 (2009), no. 2, 564–594. doi:10.1214/08-AOAS227. https://projecteuclid.org/euclid.aoas/1245676186

