## The Annals of Applied Statistics

### Statistical paradises and paradoxes in big data (I): Law of large populations, big data paradox, and the 2016 US presidential election

Xiao-Li Meng

#### Abstract

Statisticians are increasingly confronted with thought-provoking and even paradoxical questions, challenging our qualifications for entering the statistical paradises created by Big Data. By developing measures for data quality, this article suggests a framework to address such a question: “Which one should I trust more: a 1% survey with 60% response rate or a self-reported administrative dataset covering 80% of the population?” A 5-element Euler-formula-like identity shows that for any dataset of size $n$, probabilistic or not, the difference between the sample average $\overline{X}_{n}$ and the population average $\overline{X}_{N}$ is the product of three terms: (1) a data quality measure, $\rho_{R,X}$, the correlation between $X_{j}$ and the response/recording indicator $R_{j}$; (2) a data quantity measure, $\sqrt{(N-n)/n}$, where $N$ is the population size; and (3) a problem difficulty measure, $\sigma_{X}$, the standard deviation of $X$. This decomposition provides multiple insights: (I) Probabilistic sampling ensures high data quality by controlling $\rho_{R,X}$ at the level of $N^{-1/2}$; (II) When we lose this control, the impact of $N$ is no longer canceled by $\rho_{R,X}$, leading to a Law of Large Populations (LLP), that is, our estimation error, relative to the benchmarking rate $1/\sqrt{n}$, increases with $\sqrt{N}$; (III) The “bigness” of such Big Data (for population inferences) should be measured by the relative size $f=n/N$, not the absolute size $n$; and (IV) When combining data sources for population inferences, those relatively tiny but higher quality ones should be given far more weight than suggested by their sizes.
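Written out, the three-term product described above takes the following form. This is a sketch assembled from the quantities named in the abstract; the second line is an added step (not quoted from the article) that rescales the error by the benchmarking rate $1/\sqrt{n}$ to show where the $\sqrt{N}$ growth in insight (II) comes from, under the assumption that $\rho_{R,X}$ does not shrink as $N$ grows.

$$
\overline{X}_{n}-\overline{X}_{N}
=\underbrace{\rho_{R,X}}_{\text{data quality}}
\times\underbrace{\sqrt{\frac{N-n}{n}}}_{\text{data quantity}}
\times\underbrace{\sigma_{X}}_{\text{problem difficulty}}
$$

$$
\frac{\overline{X}_{n}-\overline{X}_{N}}{1/\sqrt{n}}
=\rho_{R,X}\,\sqrt{N-n}\;\sigma_{X}
=\rho_{R,X}\,\sqrt{N(1-f)}\;\sigma_{X},
\qquad f=\frac{n}{N}.
$$

Under probabilistic sampling, $\rho_{R,X}$ is of order $N^{-1/2}$, so the factor $\sqrt{N}$ cancels; with a fixed nonzero $\rho_{R,X}$ it does not.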

Estimates obtained from the Cooperative Congressional Election Study (CCES) of the 2016 US presidential election suggest a $\rho_{R,X}\approx-0.005$ for self-reporting an intention to vote for Donald Trump. Because of LLP, this seemingly minuscule data defect correlation implies that the simple sample proportion of the self-reported voting preference for Trump from $1\%$ of US eligible voters, that is, $n\approx 2{,}300{,}000$, has the same mean squared error as the corresponding sample proportion from a genuine simple random sample of size $n\approx400$, a $99.98\%$ reduction of sample size (and hence our confidence). The CCES data demonstrate LLP vividly: on average, the larger the state’s voter population, the further the actual Trump vote share falls from the usual $95\%$ confidence interval based on the sample proportion. This should remind us that, without taking data quality into account, population inferences with Big Data are subject to a Big Data Paradox: the more the data, the surer we fool ourselves.
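The arithmetic behind these figures can be checked with a short Python sketch. It is illustrative only: the function names, the synthetic population, and its response rates are assumptions made here, not part of the article. The first function verifies numerically that the realized error equals $\rho_{R,X}\sqrt{(N-n)/n}\,\sigma_{X}$ on a finite population; the second uses the approximation $n_{\text{eff}}\approx \frac{f}{1-f}\cdot\frac{1}{\rho_{R,X}^{2}}$, which treats the observed $\rho_{R,X}^{2}$ as its expected value and ignores finite-population corrections.

```python
import math
import random


def defect_decomposition(x, r):
    """Compare the realized error with rho * sqrt((N-n)/n) * sigma_X.

    x: values for all N population units; r: 0/1 recording indicators.
    Moments use the divide-by-N (finite-population) convention.
    """
    N, n = len(x), sum(r)
    f = n / N
    mean_N = sum(x) / N
    mean_n = sum(xi for xi, ri in zip(x, r) if ri) / n
    sigma_x = math.sqrt(sum((xi - mean_N) ** 2 for xi in x) / N)
    sigma_r = math.sqrt(f * (1 - f))  # std. dev. of a 0/1 indicator with mean f
    cov = sum((xi - mean_N) * (ri - f) for xi, ri in zip(x, r)) / N
    rho = cov / (sigma_x * sigma_r)
    return mean_n - mean_N, rho * math.sqrt((N - n) / n) * sigma_x, rho


def approx_effective_sample_size(f, rho):
    """Approximate simple-random-sample size with the same MSE."""
    return f / (1 - f) / rho ** 2


# Hypothetical population: units with x = 1 are slightly less likely to respond.
random.seed(0)
x = [1 if random.random() < 0.5 else 0 for _ in range(100_000)]
r = [1 if random.random() < (0.009 if xi else 0.011) else 0 for xi in x]

error, product, rho = defect_decomposition(x, r)
print(f"error={error:.6f}  product={product:.6f}  rho={rho:.4f}")

# Figures from the abstract: f = 1%, |rho| = 0.005.
print(f"n_eff ~ {approx_effective_sample_size(0.01, 0.005):.0f}")
```

With $f=0.01$ and $|\rho_{R,X}|=0.005$ the approximation gives roughly 404, in line with the $n\approx400$ quoted above.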

#### Article information

Source
Ann. Appl. Stat., Volume 12, Number 2 (2018), 685–726.

Dates
Revised: April 2018
First available in Project Euclid: 28 July 2018

https://projecteuclid.org/euclid.aoas/1532743473

Digital Object Identifier
doi:10.1214/18-AOAS1161SF

Mathematical Reviews number (MathSciNet)
MR3834282

Zentralblatt MATH identifier
06980472

#### Citation

Meng, Xiao-Li. Statistical paradises and paradoxes in big data (I): Law of large populations, big data paradox, and the 2016 US presidential election. Ann. Appl. Stat. 12 (2018), no. 2, 685–726. doi:10.1214/18-AOAS1161SF. https://projecteuclid.org/euclid.aoas/1532743473
