## The Annals of Applied Statistics

### Hypothesis testing for high-dimensional multinomials: A selective review

#### Abstract

The statistical analysis of discrete data has been the subject of extensive research dating back to the work of Pearson. In this survey we review some recently developed methods for testing hypotheses about high-dimensional multinomials. Traditional tests like the $\chi^{2}$-test and the likelihood ratio test can have poor power in the high-dimensional setting. Much of the research in this area has focused on finding tests with asymptotically normal limits and on developing (stringent) conditions under which tests have normal limits. We argue that this perspective suffers from a significant deficiency: it can exclude many high-dimensional cases in which carefully designed tests can have high power despite having non-normal null distributions. Finally, we illustrate that taking a minimax perspective, and considering refinements of it, can lead naturally to powerful and practical tests.

#### Article information

Source
Ann. Appl. Stat., Volume 12, Number 2 (2018), 727–749.

Dates
Revised: February 2018
First available in Project Euclid: 28 July 2018

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1532743474

Digital Object Identifier
doi:10.1214/18-AOAS1155SF

Mathematical Reviews number (MathSciNet)
MR3834283

#### Citation

Balakrishnan, Sivaraman; Wasserman, Larry. Hypothesis testing for high-dimensional multinomials: A selective review. Ann. Appl. Stat. 12 (2018), no. 2, 727–749. doi:10.1214/18-AOAS1155SF. https://projecteuclid.org/euclid.aoas/1532743474
