## Statistical Science

### Statistical Significance of the Netflix Challenge

#### Abstract

Inspired by the legacy of the Netflix contest, we provide an overview of what has been learned—from our own efforts, and those of others—concerning the problems of collaborative filtering and recommender systems. The data set consists of about 100 million movie ratings (from 1 to 5 stars) involving some 480 thousand users and some 18 thousand movies; the associated ratings matrix is about 99% sparse. The goal is to predict ratings that users will give to movies; systems that can do this accurately have significant commercial applications, particularly on the World Wide Web. We discuss, in some detail, approaches to “baseline” modeling, singular value decomposition (SVD), as well as kNN (nearest neighbor) and neural network models; temporal effects, cross-validation issues, ensemble methods and other considerations are discussed as well. We compare existing models in a search for new models, and also discuss the mission-critical issues of penalization and parameter shrinkage which arise when the dimension of a parameter space reaches into the millions. Although much work on such problems has been carried out by the computer science and machine learning communities, our goal here is to address a statistical audience, and to provide a primarily statistical treatment of the lessons that have been learned from this remarkable set of data.
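The mix of baseline modeling, SVD-style factorization, and shrinkage described above can be sketched in a few lines. The toy example below (the tiny ratings matrix, the factor dimension `k`, and all tuning constants are invented for illustration; this is not the authors' code) fits a global mean, per-user and per-movie offsets, and regularized latent factors by stochastic gradient descent, in the spirit of the Funk/Webb approach cited in the references:

```python
import numpy as np

# Toy data: (user, movie, rating) triples; the real Netflix matrix is ~99% sparse.
ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 2), (2, 2, 5)]
n_users, n_movies, k = 3, 3, 2            # k = number of latent factors

mu = np.mean([r for _, _, r in ratings])  # global mean ("baseline")
bu = np.zeros(n_users)                    # user offsets
bm = np.zeros(n_movies)                   # movie offsets
rng = np.random.default_rng(0)
P = 0.1 * rng.standard_normal((n_users, k))   # user factor matrix
Q = 0.1 * rng.standard_normal((n_movies, k))  # movie factor matrix

lr, lam = 0.02, 0.05                      # learning rate, ridge (shrinkage) penalty
for _ in range(500):                      # SGD sweeps over the observed ratings
    for u, m, r in ratings:
        err = r - (mu + bu[u] + bm[m] + P[u] @ Q[m])
        bu[u] += lr * (err - lam * bu[u])
        bm[m] += lr * (err - lam * bm[m])
        # simultaneous update: RHS tuple is built from the old P[u], Q[m]
        P[u], Q[m] = (P[u] + lr * (err * Q[m] - lam * P[u]),
                      Q[m] + lr * (err * P[u] - lam * Q[m]))

def predict(u, m):
    """Baseline plus low-rank interaction term."""
    return mu + bu[u] + bm[m] + P[u] @ Q[m]

rmse = np.sqrt(np.mean([(r - predict(u, m)) ** 2 for u, m, r in ratings]))
```

The ridge penalty `lam` is the parameter shrinkage the abstract calls mission-critical: on the real data, millions of unpenalized factor parameters would badly overfit the sparse ratings matrix.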

#### Article information

Source
Statist. Sci. Volume 27, Number 2 (2012), 202–231.

Dates
First available in Project Euclid: 19 June 2012

http://projecteuclid.org/euclid.ss/1340110870

Digital Object Identifier
doi:10.1214/11-STS368

Mathematical Reviews number (MathSciNet)
MR2963993

#### Citation

Feuerverger, Andrey; He, Yu; Khatri, Shashi. Statistical Significance of the Netflix Challenge. Statist. Sci. 27 (2012), no. 2, 202--231. doi:10.1214/11-STS368. http://projecteuclid.org/euclid.ss/1340110870.

#### References

• ACM SIGKDD (2007). KDD Cup and Workshop 2007. Available at www.cs.uic.edu/~liub/Netflix-KDD-Cup-2007.html.
• Adomavicius, G. and Tuzhilin, A. (2005). Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering 17 734–749.
• Barbieri, M. M. and Berger, J. O. (2004). Optimal predictive model selection. Ann. Statist. 32 870–897.
• Barron, A. R. (1984). Predicted squared error: A criterion for automatic model selection. In Self-Organizing Methods in Modeling (S. J. Farlow, ed.). Marcel Dekker, New York.
• Bell, R. and Koren, Y. (2007a). Lessons from the Netflix Prize challenge. ACM SIGKDD Explorations Newsletter 9 75–79.
• Bell, R. and Koren, Y. (2007b). Improved neighborhood-based collaborative filtering. In Proc. KDD Cup and Workshop 2007 7–14. ACM, New York.
• Bell, R. and Koren, Y. (2007c). Scalable collaborative filtering with jointly derived neighborhood interpolation weights. In Proc. Seventh IEEE Int. Conf. on Data Mining 43–52. IEEE Computer Society, Los Alamitos, CA.
• Bell, R., Koren, Y. and Volinsky, C. (2007a). Modeling relationships at multiple scales to improve accuracy of large recommender systems. In Proc. 13th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining 95–104. ACM, New York.
• Bell, R., Koren, Y. and Volinsky, C. (2007b). The BellKor solution to the Netflix Prize. Available at http://www.research.att.com/~volinsky/netflix/ProgressPrizes2007BellKorSolution.pdf.
• Bell, R., Koren, Y. and Volinsky, C. (2007c). Chasing $1,000,000: How we won the Netflix Progress Prize. ASA Statistical and Computing Graphics Newsletter 18 4–12.
• Bell, R., Koren, Y. and Volinsky, C. (2008). The BellKor 2008 solution to the Netflix Prize. Available at http://www.netflixprize.com/assets/ProgressPrize2008_BellKor.pdf.
• Bell, R. M., Bennett, J., Koren, Y. and Volinsky, C. (2009). The million dollar programming prize. IEEE Spectrum 46 28–33.
• Bennett, J. and Lanning, S. (2007). The Netflix Prize. In Proc. KDD Cup and Workshop 2007 3–6. ACM, New York.
• Berger, J. (1982). Bayesian robustness and the Stein effect. J. Amer. Statist. Assoc. 77 358–368.
• Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Clarendon Press, New York.
• Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer, New York.
• Breiman, L. (1996). Bagging predictors. Machine Learning 26 123–140.
• Breiman, L. and Friedman, J. H. (1997). Predicting multivariate responses in multiple linear regression (with discussion). J. Roy. Statist. Soc. Ser. B 59 3–54.
• Burges, C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2 121–167.
• Candès, E. and Plan, Y. (2009). Matrix completion with noise. Technical report, Caltech.
• Candès, E. and Tao, T. (2007). The Dantzig selector: Statistical estimation when $p$ is much larger than $n$. Ann. Statist. 35 2313–2351.
• Canny, J. F. (2002). Collaborative filtering with privacy via factor analysis. In Proc. 25th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval 238–245. ACM, New York.
• Carlin, B. P. and Louis, T. A. (1996). Bayes and Empirical Bayes Methods for Data Analysis. Monogr. Statist. Appl. Probab. 69. Chapman & Hall, London.
• Casella, G. (1985). An introduction to empirical Bayes data analysis. Amer. Statist. 39 83–87.
• Chien, Y. H. and George, E. (1999). A Bayesian model for collaborative filtering. In Online Proc. 7th Int. Workshop on Artificial Intelligence and Statistics. Fort Lauderdale, FL.
• Cristianini, N. and Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge Univ. Press, Cambridge.
• Cohen, W. W., Schapire, R. E. and Singer, Y. (1999). Learning to order things. J. Artificial Intelligence Res. 10 243–270 (electronic).
• Copas, J. B. (1983). Regression, prediction and shrinkage. J. Roy. Statist. Soc. Ser. B 45 311–354.
• DeCoste, D. (2006). Collaborative prediction using ensembles of maximum margin matrix factorizations. In Proc. 23rd Int. Conf. on Machine Learning 249–256. ACM, New York.
• Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W. and Harshman, R. A. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science 41 391–407.
• Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. Roy. Statist. Soc. Ser. B 39 1–38.
• Efron, B. (1975). Biased versus unbiased estimation. Advances in Math. 16 259–277.
• Efron, B. (1983). Estimating the error rate of a prediction rule: Improvement on cross-validation. J. Amer. Statist. Assoc. 78 316–331.
• Efron, B. (1986). How biased is the apparent error rate of a prediction rule? J. Amer. Statist. Assoc. 81 461–470.
• Efron, B. (1996). Empirical Bayes methods for combining likelihoods (with discussion). J. Amer. Statist. Assoc. 91 538–565.
• Efron, B. (2004). The estimation of prediction error: Covariance penalties and cross-validation (with discussion). J. Amer. Statist. Assoc. 99 619–642.
• Efron, B. and Morris, C. (1971). Limiting the risk of Bayes and empirical Bayes estimators. I. The Bayes case. J. Amer. Statist. Assoc. 66 807–815.
• Efron, B. and Morris, C. (1972a). Limiting the risk of Bayes and empirical Bayes estimators. II. The empirical Bayes case. J. Amer. Statist. Assoc. 67 130–139.
• Efron, B. and Morris, C. (1972b). Empirical Bayes on vector observations: An extension of Stein’s method. Biometrika 59 335–347.
• Efron, B. and Morris, C. (1973a). Stein’s estimation rule and its competitors—an empirical Bayes approach. J. Amer. Statist. Assoc. 68 117–130.
• Efron, B. and Morris, C. (1973b). Combining possibly related estimation problems (with discussion). J. Roy. Statist. Soc. Ser. B 35 379–421.
• Efron, B. and Morris, C. (1975). Data analysis using Stein’s estimator and its generalizations. J. Amer. Statist. Assoc. 70 311–319.
• Efron, B. and Morris, C. (1977). Stein’s paradox in statistics. Scientific American 236 119–127.
• Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression (with discussion). Ann. Statist. 32 407–499.
• Fan, J. and Li, R. (2006). Statistical challenges with high dimensionality: Feature selection in knowledge discovery. In International Congress of Mathematicians III 595–622. Eur. Math. Soc., Zürich.
• Friedman, J. (1994). An overview of predictive learning and function approximation. In From Statistics to Neural Networks (V. Cherkassky, J. Friedman and H. Wechsler, eds.). NATO ASI Series F 136. Springer, New York.
• Funk, S. (2006/2007). See Webb, B. (2006/2007).
• Gorrell, G. and Webb, B. (2006). Generalized Hebbian algorithm for incremental latent semantic analysis. Technical report, Linköping Univ., Sweden.
• Greenshtein, E. and Ritov, Y. (2004). Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli 10 971–988.
• Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Springer, New York.
• Herlocker, J. L., Konstan, J. A., Borchers, A. and Riedl, J. (1999). An algorithmic framework for performing collaborative filtering. In Proc. 22nd ACM SIGIR Conf. on Information Retrieval 230–237. ACM, New York.
• Herlocker, J. L., Konstan, J. A. and Riedl, J. T. (2000). Explaining collaborative filtering recommendations. In Proc. 2000 ACM Conf. on Computer Supported Cooperative Work 241–250. ACM, New York.
• Herlocker, J. L., Konstan, J. A., Terveen, L. G. and Riedl, J. T. (2004). Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems 22 5–53.
• Hertz, J., Krogh, A. and Palmer, R. G. (1991). Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, CA.
• Hill, W., Stead, L., Rosenstein, M. and Furnas, G. (1995). Recommending and evaluating choices in a virtual community of use. In Proc. SIGCHI Conf. on Human Factors in Computing Systems 194–201. ACM, New York.
• Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Comput. 14 1771–1800.
• Hofmann, T. (2001a). Unsupervised learning by probabilistic latent semantic analysis. Mach. Learn. 42 177–196.
• Hofmann, T. (2001b). Learning what people (don’t) want. In Proc. European Conf. on Machine Learning. Lecture Notes in Comput. Sci. 2167 214–225. Springer, Berlin.
• Hofmann, T. (2004). Latent semantic models for collaborative filtering. ACM Transactions on Information Systems 22 89–115.
• Hofmann, T. and Puzicha, J. (1999). Latent class models for collaborative filtering. In Proc. Int. Joint Conf. on Artificial Intelligence 2 688–693. Morgan Kaufmann, San Francisco, CA.
• Hu, Y., Koren, Y. and Volinsky, C. (2008). Collaborative filtering for implicit feedback datasets. Technical report, AT&T Labs—Research, Florham Park, NJ.
• Izenman, A. J. (2008). Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning. Springer, New York.
• James, W. and Stein, C. (1961). Estimation with quadratic loss. In Proc. 4th Berkeley Sympos. Math. Statist. Probab. I 361–379. Univ. California Press, Berkeley, CA.
• Kim, D. and Yum, B. (2005). Collaborative filtering based on iterative principal component analysis. Expert Systems with Applications 28 823–830.
• Koren, Y. (2008). Factorization meets the neighborhood: A multifaceted collaborative filtering model. In Proc. 14th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining 426–434. ACM, New York.
• Koren, Y. (2009). Collaborative filtering with temporal dynamics. In Proc. 15th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining 447–456. ACM, New York.
• Koren, Y. (2010). Factor in the neighbors: Scalable and accurate collaborative filtering. ACM Transactions on Knowledge Discovery from Data 4 Article 1.
• Koren, Y., Bell, R. and Volinsky, C. (2009). Matrix factorization techniques for recommender systems. Computer 42 (8) 30–37.
• Li, K.-C. (1985). From Stein’s unbiased risk estimates to the method of generalized cross validation. Ann. Statist. 13 1352–1377.
• Lim, Y. J. and Teh, Y. W. (2007). Variational Bayesian approach to movie rating predictions. In Proc. KDD Cup and Workshop 2007 15–21. ACM, New York.
• Little, R. J. A. and Rubin, D. B. (1987). Statistical Analysis with Missing Data. Wiley, New York.
• Mallows, C. (1973). Some comments on $C_p$. Technometrics 15 661–675.
• Maritz, J. S. and Lwin, T. (1989). Empirical Bayes Methods, 2nd ed. Monogr. Statist. Appl. Probab. 35. Chapman & Hall, London.
• Marlin, B. (2004). Collaborative filtering: A machine learning perspective. M.Sc. thesis, Computer Science Dept., Univ. Toronto.
• Marlin, B. and Zemel, R. S. (2004). The multiple multiplicative factor model for collaborative filtering. In Proc. 21st Int. Conf. on Machine Learning. ACM, New York.
• Marlin, B., Zemel, R. S., Roweis, S. and Slaney, M. (2007). Collaborative filtering and the missing at random assumption. In Proc. 23rd Conf. on Uncertainty in Artificial Intelligence. ACM, New York.
• Moguerza, J. M. and Muñoz, A. (2006). Support vector machines with applications. Statist. Sci. 21 322–336.
• Moody, J. E. (1992). The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems. In Advances in Neural Information Processing Systems 4. Morgan Kaufmann, San Francisco, CA.
• Morris, C. N. (1983). Parametric empirical Bayes inference: Theory and applications (with discussion). J. Amer. Statist. Assoc. 78 47–65.
• Narayanan, A. and Shmatikov, V. (2008). Robust de-anonymization of large datasets (How to break anonymity of the Netflix Prize dataset). Preprint.
• Neal, R. M. and Hinton, G. E. (1998). A view of the EM algorithm that justifies incremental, sparse and other variants. In Learning in Graphical Models (M. I. Jordan, ed.) 355–368. Kluwer.
• Netflix Inc. (2006/2010). Netflix Prize webpage: http://www.netflixprize.com/. Netflix Prize Leaderboard: http://www.netflixprize.com/leaderboard/. Netflix Prize Forum: www.netflixprize.com/community/.
• Oard, D. and Kim, J. (1998). Implicit feedback for recommender systems. In Proc. AAAI Workshop on Recommender Systems 31–36. AAAI, Menlo Park, CA.
• Park, S. T. and Pennock, D. M. (2007). Applying collaborative filtering techniques to movie search for better ranking and browsing. In Proc. 13th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining 550–559. ACM, New York.
• Paterek, A. (2007). Improving regularized singular value decomposition for collaborative filtering. In Proc. KDD Cup and Workshop 2007 39–42. ACM, New York.
• Piatetsky, G. (2007). Interview with Simon Funk. SIGKDD Explorations Newsletter 9 38–40.
• Popescul, A., Ungar, L., Pennock, D. and Lawrence, S. (2001). Probabilistic models for unified collaborative and content-based recommendation in sparse-data environments. In Proc. 17th Conf. on Uncertainty in Artificial Intelligence 437–444. Morgan Kaufmann, San Francisco, CA.
• Pu, P., Bridge, D. G., Mobasher, B. and Ricci, F., eds. (2008). Proc. ACM Conf. on Recommender Systems 2008. ACM, New York.
• Raiko, T., Ilin, A. and Karhunen, J. (2007). Principal component analysis for large scale problems with lots of missing values. In ECML 2007. Lecture Notes in Artificial Intelligence 4701 (J. N. Kok et al., eds.) 691–698. Springer, Berlin.
• Rennie, J. D. M. and Srebro, N. (2005). Fast maximum margin matrix factorization for collaborative prediction. In Proc. 22nd Int. Conf. on Machine Learning 713–719. ACM, New York.
• Resnick, P. and Varian, H. R. (1997). Recommender systems. Communications of the ACM 40 56–58.
• Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P. and Riedl, J. (1994). GroupLens: An open architecture for collaborative filtering of netnews. In Proc. ACM Conf. on Computer Supported Cooperative Work 175–186. ACM, New York.
• Ripley, B. D. (1996). Pattern Recognition and Neural Networks. Cambridge Univ. Press, Cambridge.
• Robbins, H. (1956). An empirical Bayes approach to statistics. In Proc. 3rd Berkeley Sympos. Math. Statist. Probab. I 157–163. Univ. California Press, Berkeley.
• Robbins, H. (1964). The empirical Bayes approach to statistical decision problems. Ann. Math. Statist. 35 1–20.
• Robbins, H. (1983). Some thoughts on empirical Bayes estimation. Ann. Statist. 11 713–723.
• Roweis, S. (1997). EM algorithms for PCA and SPCA. In Advances in Neural Information Processing Systems 10 626–632. MIT Press, Cambridge, MA.
• Salakhutdinov, R. and Mnih, A. (2008a). Probabilistic matrix factorization. In Advances in Neural Information Processing Systems 20 1257–1264. MIT Press, Cambridge, MA.
• Salakhutdinov, R. and Mnih, A. (2008b). Bayesian probabilistic matrix factorization using MCMC. In Proc. 25th Int. Conf. on Machine Learning. ACM, New York.
• Salakhutdinov, R., Mnih, A. and Hinton, G. (2007). Restricted Boltzmann machines for collaborative filtering. In Proc. 24th Int. Conf. on Machine Learning. ACM International Conference Proceeding Series 227 791–798. ACM, New York.
• Sali, S. (2008). Movie rating prediction using singular value decomposition. Technical report, Univ. California, Santa Cruz.
• Sarwar, B., Karypis, G., Konstan, J. and Riedl, J. T. (2000). Application of dimensionality reduction in recommender system—a case study. In Proc. ACM WebKDD Workshop. ACM, New York.
• Sarwar, B., Karypis, G., Konstan, J. and Riedl, J. T. (2001). Item-based collaborative filtering recommendation algorithms. In Proc. 10th Int. Conf. on the World Wide Web 285–295. ACM, New York.
• Srebro, N. and Jaakkola, T. (2003). Weighted low-rank approximations. In Proc. Twentieth Int. Conf. on Machine Learning (T. Fawcett and N. Mishra, eds.) 720–727. ACM, New York.
• Srebro, N., Rennie, J. D. M. and Jaakkola, T. S. (2005). Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems 17 1329–1336. MIT Press, Cambridge, MA.
• Stein, C. (1974). Estimation of the mean of a multivariate normal distribution. In Proceedings of the Prague Symposium on Asymptotic Statistics (Charles Univ., Prague, 1973) II 345–381. Charles Univ., Prague.
• Stein, C. M. (1981). Estimation of the mean of a multivariate normal distribution. Ann. Statist. 9 1135–1151.
• Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions (with discussion). J. Roy. Statist. Soc. Ser. B 36 111–147.
• Takacs, G., Pilaszy, I., Nemeth, B. and Tikk, D. (2007). On the Gravity recommendation system. In Proc. KDD Cup and Workshop 2007 22–30. ACM, New York.
• Takacs, G., Pilaszy, I., Nemeth, B. and Tikk, D. (2008a). Major components of the Gravity recommendation system. SIGKDD Explorations 9 80–83.
• Takacs, G., Pilaszy, I., Nemeth, B. and Tikk, D. (2008b). Investigation of various matrix factorization methods for large recommender systems. In Proc. 2nd Netflix-KDD Workshop. ACM, New York.
• Takacs, G., Pilaszy, I., Nemeth, B. and Tikk, D. (2008c). Matrix factorization and neighbor based algorithms for the Netflix Prize problem. In Proc. ACM Conf. on Recommender Systems 267–274. ACM, New York.
• Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267–288.
• Tintarev, N. and Masthoff, J. (2007). A survey of explanations in recommender systems. In Proc. 23rd Int. Conf. on Data Engineering Workshops 801–810. IEEE, New York.
• Toscher, A. and Jahrer, M. (2008). The BigChaos solution to the Netflix Prize 2008. Technical report, commendo research and consulting, Köflach, Austria.
• Toscher, A., Jahrer, M. and Bell, R. M. (2009). The BigChaos solution to the Netflix Grand Prize. Technical report, commendo research and consulting, Köflach, Austria.
• Toscher, A., Jahrer, M. and Legenstein, R. (2008). Improved neighbourhood-based algorithms for large-scale recommender systems. In Proc. 2nd Netflix-KDD Workshop 2008. ACM, New York.
• Toscher, A., Jahrer, M. and Legenstein, R. (2010). Combining predictions for accurate recommender systems. In Proc. 16th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining 693–701. ACM, Washington, DC.
• Tuzhilin, A., Koren, Y., Bennett, C., Elkan, C. and Lemire, D. (2008). Proc. 2nd KDD Workshop on Large Scale Recommender Systems and the Netflix Prize Competition. ACM, New York.
• Ungar, L. and Foster, D. (1998). Clustering methods for collaborative filtering. In Proc. Workshop on Recommendation Systems. AAAI Press, Menlo Park.
• van Houwelingen, J. C. (2001). Shrinkage and penalized likelihood as methods to improve predictive accuracy. Statist. Neerlandica 55 17–34.
• Vapnik, V. N. (2000). The Nature of Statistical Learning Theory, 2nd ed. Springer, New York.
• Wang, J., de Vries, A. P. and Reinders, M. J. T. (2006). Unifying user-based and item-based collaborative filtering approaches by similarity fusion. In Proc. 29th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval 501–508. ACM, New York.
• Webb, B. (aka Funk, S.) (2006/2007). ‘Blog’ entries, 27 October 2006, 2 November 2006, 11 December 2007 and 17 August 2007. Available at http://sifter.org/~simon/journal/.
• Wu, M. (2007). Collaborative filtering via ensembles of matrix factorizations. In Proc. KDD Cup and Workshop 2007 43–47. ACM, New York.
• Ye, J. (1998). On measuring and correcting the effects of data mining and model selection. J. Amer. Statist. Assoc. 93 120–131.
• Yuan, M. and Lin, Y. (2005). Efficient empirical Bayes variable selection and estimation in linear models. J. Amer. Statist. Assoc. 100 1215–1225.
• Zhang, Y. and Koren, J. (2007). Efficient Bayesian hierarchical user modeling for recommendation systems. In Proc. 30th Int. ACM SIGIR Conf. on Research and Development in Information Retrieval. ACM, New York.
• Zhou, Y., Wilkinson, D., Schreiber, R. and Pan, R. (2008). Large scale parallel collaborative filtering for the Netflix Prize. In Proc. 4th Int. Conf. Algorithmic Aspects in Information and Management. Lecture Notes in Comput. Sci. 5031 337–348. Springer, Berlin.
• Zou, H., Hastie, T. and Tibshirani, R. (2006). Sparse principal component analysis. J. Comput. Graph. Statist. 15 265–286.
• Zou, H., Hastie, T. and Tibshirani, R. (2007). On the “degrees of freedom” of the lasso. Ann. Statist. 35 2173–2192.