## Statistical Science

### Equitability, Interval Estimation, and Statistical Power

#### Abstract

Emerging high-dimensional data sets often contain many nontrivial relationships, and, at modern sample sizes, screening these using an independence test can yield too many relationships to be a useful exploratory approach. We propose a framework to address this limitation, centered on a property of measures of dependence called equitability. Given some measure of relationship strength, an equitable measure of dependence is one that assigns similar scores to equally strong relationships of different types. We formalize equitability within a semiparametric inferential framework in terms of interval estimates of relationship strength. Using the correspondence of these interval estimates to hypothesis tests, we then show that, under moderate assumptions, equitability is equivalent to requiring that a measure of dependence yield well-powered tests not only for distinguishing nontrivial relationships from trivial ones but also for distinguishing stronger relationships from weaker ones. We then show that equitability, to the extent it is achieved, implies that a statistic will be well powered to detect all relationships of a certain minimal strength, across different relationship types in a family. Thus, equitability is a strengthening of power against independence that enables exploration of data sets with a small number of strong, interesting relationships and a large number of weaker, less interesting ones.
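The core idea can be seen in a minimal numerical sketch (illustrative only, not code from the paper). Here relationship strength is taken to be the R² of y with respect to the true regression function f, and the helper names `noisy_sample`, `pearson_r2`, and `strength_r2` are hypothetical. Squared Pearson correlation is not equitable in this sense: a linear and a parabolic relationship of roughly equal strength receive very different scores.

```python
# Illustrative sketch: why squared Pearson correlation is not equitable.
# "Strength" here is R^2 of y against the true regression function f(x);
# all helper names below are invented for this example.
import numpy as np

rng = np.random.default_rng(0)

def noisy_sample(f, n=2000, noise=0.3):
    """Draw (x, y) with y = f(x) + Gaussian noise."""
    x = rng.uniform(-1, 1, n)
    y = f(x) + noise * rng.standard_normal(n)
    return x, y

def pearson_r2(x, y):
    """Squared Pearson correlation of (x, y)."""
    return np.corrcoef(x, y)[0, 1] ** 2

def strength_r2(x, y, f):
    """R^2 of y against the true regression function f: the 'strength' axis."""
    resid = y - f(x)
    return 1 - resid.var() / y.var()

linear = lambda x: x
parabola = lambda x: 2 * x**2 - 1  # same range as the linear f on [-1, 1]

for name, f in [("linear", linear), ("parabola", parabola)]:
    x, y = noisy_sample(f)
    print(f"{name}: strength R^2 = {strength_r2(x, y, f):.2f}, "
          f"Pearson r^2 = {pearson_r2(x, y):.2f}")
```

Both relationships have comparable strength R², yet Pearson r² is high for the linear one and near zero for the parabola; an equitable measure of dependence would assign the two similar scores.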

#### Article information

Source
Statist. Sci., Volume 35, Number 2 (2020), 202–217.

Dates
First available in Project Euclid: 3 June 2020

https://projecteuclid.org/euclid.ss/1591171227

Digital Object Identifier
doi:10.1214/19-STS719

Mathematical Reviews number (MathSciNet)
MR4106601

#### Citation

Reshef, Yakir A.; Reshef, David N.; Sabeti, Pardis C.; Mitzenmacher, Michael. Equitability, Interval Estimation, and Statistical Power. Statist. Sci. 35 (2020), no. 2, 202--217. doi:10.1214/19-STS719. https://projecteuclid.org/euclid.ss/1591171227

#### References

• Acharya, J., Daskalakis, C. and Kamath, G. (2015). Optimal testing for properties of distributions. In Advances in Neural Information Processing Systems 3591–3599.
• Arias-Castro, E., Pelletier, B. and Saligrama, V. (2018). Remember the curse of dimensionality: The case of goodness-of-fit testing in arbitrary dimension. J. Nonparametr. Stat. 30 448–471.
• Baraud, Y. (2002). Non-asymptotic minimax rates of testing in signal detection. Bernoulli 8 577–606.
• Breiman, L. and Friedman, J. H. (1985). Estimating optimal transformations for multiple regression and correlation. J. Amer. Statist. Assoc. 80 580–619.
• Casella, G. and Berger, R. L. (2002). Statistical Inference, 2nd ed. The Wadsworth & Brooks/Cole Statistics/Probability Series. Wadsworth & Brooks/Cole Advanced Books & Software, Pacific Grove, CA.
• Cover, T. M. and Thomas, J. A. (2006). Elements of Information Theory, 2nd ed. Wiley Interscience, Hoboken, NJ.
• Csiszár, I. (2008). Axiomatic characterizations of information measures. Entropy 10 261–273.
• Ding, A. A., Dy, J. G., Li, Y. and Chang, Y. (2017). A robust-equitable measure for feature ranking and selection. J. Mach. Learn. Res. 18 Paper No. 71, 46.
• Emilsson, V., Thorleifsson, G., Zhang, B., Leonardson, A. S., Zink, F., Zhu, J., Carlson, S., Helgason, A., Walters, G. B. et al. (2008). Genetics of gene expression and its effect on disease. Nature 452 423–428.
• Faust, K. and Raes, J. (2012). Microbial interactions: From networks to models. Nat. Rev., Microbiol. 10 538–550.
• Fromont, M., Lerasle, M. and Reynaud-Bouret, P. (2016). Family-wise separation rates for multiple testing. Ann. Statist. 44 2533–2563.
• Gretton, A., Bousquet, O., Smola, A. and Schölkopf, B. (2005a). Measuring statistical dependence with Hilbert–Schmidt norms. In Algorithmic Learning Theory. Lecture Notes in Computer Science 3734 63–77. Springer, Berlin.
• Gretton, A., Herbrich, R., Smola, A., Bousquet, O. and Schölkopf, B. (2005b). Kernel methods for measuring independence. J. Mach. Learn. Res. 6 2075–2129.
• Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B. and Smola, A. (2012). A kernel two-sample test. J. Mach. Learn. Res. 13 723–773.
• Heller, R., Heller, Y. and Gorfine, M. (2013). A consistent multivariate test of association based on ranks of distances. Biometrika 100 503–510.
• Heller, R., Heller, Y., Kaufman, S., Brill, B. and Gorfine, M. (2016). Consistent distribution-free $K$-sample and independence tests for univariate random variables. J. Mach. Learn. Res. 17 Paper No. 29, 54.
• Hoeffding, W. (1948). A non-parametric test of independence. Ann. Math. Stat. 19 546–557.
• Huo, X. and Székely, G. J. (2016). Fast computing for distance covariance. Technometrics 58 435–447.
• Ingster, Y. I. (1987). Asymptotically minimax testing of nonparametric hypotheses. In Probability Theory and Mathematical Statistics, Vol. I (Vilnius, 1985) 553–574. VNU Sci. Press, Utrecht.
• Ingster, Y. I. (1989). Asymptotic minimax testing of independence hypothesis. J. Sov. Math. 44 466–476.
• Ingster, Y. I. and Suslina, I. A. (2003). Nonparametric Goodness-of-Fit Testing Under Gaussian Models. Lecture Notes in Statistics 169. Springer, New York.
• Jiang, B., Ye, C. and Liu, J. S. (2015). Nonparametric $K$-sample tests via dynamic slicing. J. Amer. Statist. Assoc. 110 642–653.
• Kinney, J. B. and Atwal, G. S. (2014). Equitability, mutual information, and the maximal information coefficient. Proc. Natl. Acad. Sci. USA 111 3354–3359.
• Kraskov, A., Stögbauer, H. and Grassberger, P. (2004). Estimating mutual information. Phys. Rev. E (3) 69 066138, 16.
• Lepski, O. V. and Spokoiny, V. G. (1999). Minimax nonparametric hypothesis testing: The case of an inhomogeneous alternative. Bernoulli 5 333–358.
• Linfoot, E. H. (1957). An informational measure of correlation. Inf. Control 1 85–89.
• Lopez-Paz, D., Hennig, P. and Schölkopf, B. (2013). The randomized dependence coefficient. In Advances in Neural Information Processing Systems 1–9.
• Lopez-Paz, D., Muandet, K., Schölkopf, B. and Tolstikhin, I. (2015). Towards a learning theory of causation. In International Conference on Machine Learning (ICML).
• Móri, T. F. and Székely, G. J. (2019). Four simple axioms of dependence measures. Metrika 82 1–16.
• Murrell, B., Murrell, D. and Murrell, H. (2014). R2-equitability is satisfiable. Proc. Natl. Acad. Sci. USA 111 E2160.
• Murrell, B., Murrell, D. and Murrell, H. (2016). Discovering general multidimensional associations. PLoS ONE 11 e0151551.
• Paninski, L. (2008). A coincidence-based test for uniformity given very sparsely sampled discrete data. IEEE Trans. Inform. Theory 54 4750–4755.
• Reimherr, M. and Nicolae, D. L. (2013). On quantifying dependence: A framework for developing interpretable measures. Statist. Sci. 28 116–130.
• Rényi, A. (1959). On measures of dependence. Acta Math. Acad. Sci. Hung. 10 441–451.
• Reshef, D. N., Reshef, Y. A., Finucane, H. K., Grossman, S. R., McVean, G., Turnbaugh, P. J., Lander, E. S., Mitzenmacher, M. and Sabeti, P. C. (2011). Detecting novel associations in large data sets. Science 334 1518–1524.
• Reshef, D. N., Reshef, Y. A., Mitzenmacher, M. and Sabeti, P. C. (2014). Cleaning up the record on the maximal information coefficient and equitability. Proc. Natl. Acad. Sci. USA 111 E3362–E3363.
• Reshef, Y. A., Reshef, D. N., Finucane, H. K., Sabeti, P. C. and Mitzenmacher, M. (2016). Measuring dependence powerfully and equitably. J. Mach. Learn. Res. 17 Paper No. 212, 63.
• Reshef, D. N., Reshef, Y. A., Sabeti, P. C. and Mitzenmacher, M. (2018). An empirical study of the maximal and total information coefficients and leading measures of dependence. Ann. Appl. Stat. 12 123–155.
• Romano, S., Vinh, N. X., Verspoor, K. and Bailey, J. (2018). The randomized information coefficient: Assessing dependencies in noisy data. Mach. Learn. 107 509–549.
• Schweizer, B. and Wolff, E. F. (1981). On nonparametric measures of dependence for random variables. Ann. Statist. 9 879–885.
• Simon, N. and Tibshirani, R. (2012). Comment on “Detecting novel associations in large data sets.” Unpublished.
• Speed, T. (2011). A correlation for the 21st century. Science 334 1502–1503.
• Storey, J. D. and Tibshirani, R. (2003). Statistical significance for genomewide studies. Proc. Natl. Acad. Sci. USA 100 9440–9445.
• Sugiyama, M. and Borgwardt, K. M. (2013). Measuring statistical dependence via the mutual information dimension. In The International Joint Conferences on Artificial Intelligence (IJCAI) 1692–1698. AAAI Press, Menlo Park, CA.
• Sun, N. and Zhao, H. (2014). Putting things in order. Proc. Natl. Acad. Sci. USA 111 16236–16237.
• Székely, G. J. and Rizzo, M. L. (2009). Brownian distance covariance. Ann. Appl. Stat. 3 1236–1265.
• Székely, G. J., Rizzo, M. L. and Bakirov, N. K. (2007). Measuring and testing dependence by correlation of distances. Ann. Statist. 35 2769–2794.
• Turk-Browne, N. B. (2013). Functional interactions as big data in the human brain. Science 342 580–584.
• Wang, X., Jiang, B. and Liu, J. S. (2017). Generalized R-squared for detecting dependence. Biometrika 104 129–139.
• Wang, Y. X. R., Waterman, M. S. and Huang, H. (2014). Gene coexpression measures in large heterogeneous samples using count statistics. Proc. Natl. Acad. Sci. USA 111 16371–16376.
• Yatracos, Y. G. (1985). On the existence of uniformly consistent estimates. Proc. Amer. Math. Soc. 94 479–486.
• Yodé, A. F. (2011). Adaptive minimax test of independence. Math. Methods Statist. 20 246–268.
• Zhang, K. (2016). Bet on independence. Preprint. Available at arXiv:1610.05246.