Statistical Science

Principles of Experimental Design for Big Data Analysis

Christopher C. Drovandi, Christopher C. Holmes, James M. McGree, Kerrie Mengersen, Sylvia Richardson, and Elizabeth G. Ryan

Abstract

Big Datasets are endemic, but are often notoriously difficult to analyse because of their size, heterogeneity and quality. The purpose of this paper is to open a discourse on the potential for modern decision theoretic optimal experimental design methods, which by their very nature have traditionally been applied prospectively, to improve the analysis of Big Data through retrospective designed sampling in order to answer particular questions of interest. By appealing to a range of examples, it is suggested that this perspective on Big Data modelling and analysis has the potential for wide generality and advantageous inferential and computational properties. We highlight current hurdles and open research questions surrounding efficient computational optimisation in using retrospective designs, and in part this paper is a call to the optimisation and experimental design communities to work together in the field of Big Data analysis.
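To make the idea of retrospective designed sampling concrete, the sketch below is a minimal illustration and is not taken from the paper: the function name greedy_d_optimal_subsample, the pilot estimate beta_pilot and all numerical values are assumptions made for the example. It greedily selects a sub-sample from a large design matrix by maximising a D-optimality criterion for a logistic regression model, one common instance of the decision-theoretic design criteria the article discusses.

    import numpy as np

    def greedy_d_optimal_subsample(X, beta_pilot, k, rng=None):
        """Greedily pick k rows of X that (approximately) maximise the
        determinant of the Fisher information of a logistic regression
        model evaluated at a pilot estimate beta_pilot."""
        rng = np.random.default_rng(0) if rng is None else rng
        n, p = X.shape
        # Working weights w_i = p_i (1 - p_i) under the pilot estimate.
        prob = 1.0 / (1.0 + np.exp(-(X @ beta_pilot)))
        w = prob * (1.0 - prob)
        # Seed with p random rows so the information matrix is (generically) invertible.
        chosen = list(rng.choice(n, size=p, replace=False))
        M = (X[chosen] * w[chosen, None]).T @ X[chosen]
        remaining = np.setdiff1d(np.arange(n), chosen)
        while len(chosen) < k:
            Minv = np.linalg.pinv(M)
            # Adding row i multiplies det(M) by (1 + w_i x_i' M^{-1} x_i),
            # so the row with the largest weighted quadratic form gives the biggest D-gain.
            gains = w[remaining] * np.einsum('ij,jk,ik->i', X[remaining], Minv, X[remaining])
            best = remaining[np.argmax(gains)]
            chosen.append(best)
            remaining = remaining[remaining != best]
            M += w[best] * np.outer(X[best], X[best])
        return np.array(chosen)

    # Toy illustration with simulated covariates (all values are arbitrary).
    rng = np.random.default_rng(1)
    X_big = rng.normal(size=(20_000, 5))
    beta_pilot = np.zeros(5)              # e.g. a crude fit from a small random subsample
    idx = greedy_d_optimal_subsample(X_big, beta_pilot, k=100, rng=rng)
    subsample = X_big[idx]                # rows carried forward for the focused analysis

A production implementation would score candidate rows in blocks and refresh the pilot estimate as the sub-sample grows; the greedy rank-one update is used here only to keep the sketch short.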

Article information

Source
Statist. Sci. Volume 32, Number 3 (2017), 385-404.

Dates
First available in Project Euclid: 1 September 2017

Permanent link to this document
https://projecteuclid.org/euclid.ss/1504253123

Digital Object Identifier
doi:10.1214/16-STS604

Keywords
Active learning, Big Data, dimension reduction, experimental design, sub-sampling

Citation

Drovandi, Christopher C.; Holmes, Christopher C.; McGree, James M.; Mengersen, Kerrie; Richardson, Sylvia; Ryan, Elizabeth G. Principles of Experimental Design for Big Data Analysis. Statist. Sci. 32 (2017), no. 3, 385–404. doi:10.1214/16-STS604. https://projecteuclid.org/euclid.ss/1504253123


