Statistical Science

Statistical Theory Powering Data Science

Junhui Cai, Avishai Mandelbaum, Chaitra H. Nagaraja, Haipeng Shen, and Linda Zhao



Statisticians are finding their place in the emerging field of data science. However, many issues considered “new” in data science have long histories in statistics. Statistical thinking is illustrated through examples that range from exploratory data analysis to measuring uncertainty to accommodating nonrandom samples. These examples are then applied to service networks, baseball predictions and official statistics.

Article information

Statist. Sci., Volume 34, Number 4 (2019), 669–691.

First available in Project Euclid: 8 January 2020


Keywords: Service networks, queueing theory, empirical Bayes, nonparametric estimation, sports statistics, decennial census, house price index


Cai, Junhui; Mandelbaum, Avishai; Nagaraja, Chaitra H.; Shen, Haipeng; Zhao, Linda. Statistical Theory Powering Data Science. Statist. Sci. 34 (2019), no. 4, 669–691. doi:10.1214/19-STS754.


