Electronic Journal of Statistics

Efficient moment calculations for variance components in large unbalanced crossed random effects models

Katelyn Gao and Art Owen

Full-text: Open access


Large crossed data sets, often modeled by generalized linear mixed models, have become increasingly common and pose challenges for statistical analysis. At very large sizes it becomes desirable for the computational costs of estimation, inference and prediction, in both space and time, to grow at most linearly with sample size.

Both traditional maximum likelihood estimation and numerous Markov chain Monte Carlo Bayesian algorithms take superlinear time to obtain good parameter estimates in the simple two-factor crossed random effects model. We propose moment-based algorithms that, with at most linear cost, estimate variance components, measure the uncertainties of those estimates, and generate shrinkage-based predictions for missing observations. When run on simulated normally distributed data, our algorithm performs competitively with maximum likelihood methods.
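
To make the idea of a linear-cost moment estimator concrete, here is a minimal sketch for the two-factor crossed random effects model Y_ij = mu + a_i + b_j + e_ij. It is an illustration under stated assumptions, not the authors' algorithm: the function name `moment_estimates`, its argument layout, and the particular three sums of squares it matches to their expectations are choices made for this example. It makes a single pass over the observed cells, keeps only per-row and per-column counts and sums, and then solves three linear moment equations.

```python
from collections import defaultdict


def moment_estimates(rows, cols, y):
    """Method-of-moments sketch for Y_ij = mu + a_i + b_j + e_ij.

    rows, cols, y: equal-length sequences, one entry per observed cell;
    each (row, col) pair is assumed to appear at most once.
    Returns (sigma2_a, sigma2_b, sigma2_e). Estimates are not truncated
    at zero, so they can be negative in small samples.
    """
    n = len(y)
    row_n, row_s = defaultdict(int), defaultdict(float)
    col_n, col_s = defaultdict(int), defaultdict(float)
    total = 0.0
    total_sq = 0.0

    # Single pass: time O(N), memory O(#rows + #cols).
    for i, j, v in zip(rows, cols, y):
        row_n[i] += 1
        row_s[i] += v
        col_n[j] += 1
        col_s[j] += v
        total += v
        total_sq += v * v

    n_rows, n_cols = len(row_n), len(col_n)
    if n <= n_rows or n <= n_cols:
        raise ValueError("need more observations than rows and than columns")

    # Sums of squared deviations from row means, column means, grand mean.
    # (Textbook one-pass formulas; a production version would use a
    # numerically stabler update.)
    u_a = total_sq - sum(row_s[i] ** 2 / row_n[i] for i in row_n)
    u_b = total_sq - sum(col_s[j] ** 2 / col_n[j] for j in col_n)
    u_e = total_sq - total ** 2 / n

    # Expectations under the model:
    #   E[u_a] = (s2_b + s2_e) * (N - R)
    #   E[u_b] = (s2_a + s2_e) * (N - C)
    #   E[u_e] = s2_a * c_a + s2_b * c_b + s2_e * (N - 1)
    c_a = n - sum(m * m for m in row_n.values()) / n
    c_b = n - sum(m * m for m in col_n.values()) / n

    p = u_a / (n - n_rows)   # estimates s2_b + s2_e
    q = u_b / (n - n_cols)   # estimates s2_a + s2_e
    den = (n - 1) - c_a - c_b
    if den == 0:
        raise ValueError("moment equations are singular for this design")
    s2_e = (u_e - q * c_a - p * c_b) / den
    return q - s2_e, p - s2_e, s2_e


# Usage: a tiny balanced 2 x 3 layout with purely additive effects.
rows = [0, 0, 0, 1, 1, 1]
cols = [0, 1, 2, 0, 1, 2]
y = [0.0, 1.0, 2.0, 2.0, 3.0, 4.0]
print(moment_estimates(rows, cols, y))
```

Because the estimator touches each observation once and stores only row and column summaries, both time and memory stay linear in the number of observations, which is the property the abstract asks for. Measuring the uncertainty of the estimates and forming shrinkage predictions are not covered by this sketch.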

Article information

Electron. J. Statist., Volume 11, Number 1 (2017), 1235-1296.

Received: January 2016
First available in Project Euclid: 14 April 2017

Permanent link to this document
https://projecteuclid.org/euclid.ejs/1492135234

Digital Object Identifier
doi:10.1214/17-EJS1236


Primary: 62F10: Point estimation
Secondary: 62J10: Analysis of variance and covariance

Keywords: Crossed random effects, variance components, big data

Creative Commons Attribution 4.0 International License.


Gao, Katelyn; Owen, Art. Efficient moment calculations for variance components in large unbalanced crossed random effects models. Electron. J. Statist. 11 (2017), no. 1, 1235-1296. doi:10.1214/17-EJS1236. https://projecteuclid.org/euclid.ejs/1492135234


