## The Annals of Statistics

### Randomized incomplete $U$-statistics in high dimensions

#### Abstract

This paper studies inference for the mean vector of a high-dimensional $U$-statistic. In the era of big data, the dimension $d$ of the $U$-statistic and the sample size $n$ of the observations both tend to be large, and the computation of the $U$-statistic is prohibitively demanding. Data-dependent inferential procedures such as the empirical bootstrap for $U$-statistics are even more computationally expensive. To overcome such a computational bottleneck, incomplete $U$-statistics obtained by sampling fewer terms of the $U$-statistic are attractive alternatives. In this paper, we introduce randomized incomplete $U$-statistics with sparse weights whose computational cost can be made independent of the order of the $U$-statistic. We derive nonasymptotic Gaussian approximation error bounds for the randomized incomplete $U$-statistics in high dimensions, namely in cases where the dimension $d$ is possibly much larger than the sample size $n$, for both nondegenerate and degenerate kernels. In addition, we propose generic bootstrap methods for the incomplete $U$-statistics that are computationally much less demanding than existing bootstrap methods, and establish finite sample validity of the proposed bootstrap methods. Our methods are illustrated by an application to nonparametric testing for the pairwise independence of a high-dimensional random vector under weaker assumptions than those appearing in the literature.
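As a rough illustration of "sampling fewer terms", the sketch below compares a complete order-2 $U$-statistic with a randomized incomplete version that averages the kernel over a fixed number of index subsets drawn uniformly at random with replacement. The sampling scheme, the scalar Kendall's tau kernel, and the names `complete_u`, `incomplete_u` and `n_terms` are assumptions made for this example; the abstract does not specify the authors' construction, and in the paper's setting the kernel is vector-valued with dimension $d$.

```python
import random
from itertools import combinations


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def kendall_kernel(a, b):
    """Symmetric order-2 kernel whose U-statistic is Kendall's tau."""
    return sign(a[0] - b[0]) * sign(a[1] - b[1])


def complete_u(data, kernel, r=2):
    """Average the kernel over all C(n, r) index subsets (costly for large n)."""
    values = [
        kernel(*(data[i] for i in idx))
        for idx in combinations(range(len(data)), r)
    ]
    return sum(values) / len(values)


def incomplete_u(data, kernel, n_terms, r=2, rng=None):
    """Average the kernel over n_terms index subsets sampled with replacement.

    Assumed scheme for illustration: each draw picks r distinct indices
    uniformly at random, so the cost is n_terms kernel evaluations
    regardless of how many subsets exist in total.
    """
    rng = rng or random.Random()
    n = len(data)
    total = 0.0
    for _ in range(n_terms):
        idx = rng.sample(range(n), r)  # r distinct indices, uniform subset
        total += kernel(*(data[i] for i in idx))
    return total / n_terms


if __name__ == "__main__":
    gen = random.Random(0)
    # Hypothetical toy data: the second coordinate is a noisy copy of the first.
    data = []
    for _ in range(200):
        x = gen.gauss(0, 1)
        data.append((x, x + gen.gauss(0, 1)))
    print("complete:  ", complete_u(data, kendall_kernel))
    print("incomplete:", incomplete_u(data, kendall_kernel, 2000, rng=gen))
```

Here the complete statistic evaluates the kernel on all 19,900 pairs, while the incomplete one uses 2,000 sampled pairs; the two estimates target the same mean, with the incomplete version trading extra sampling variability for lower cost.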

#### Article information

Source
Ann. Statist., Volume 47, Number 6 (2019), 3127–3156.

Dates
Revised: October 2018
First available in Project Euclid: 31 October 2019

https://projecteuclid.org/euclid.aos/1572487388

Digital Object Identifier
doi:10.1214/18-AOS1773

Mathematical Reviews number (MathSciNet)
MR4025737

#### Citation

Chen, Xiaohui; Kato, Kengo. Randomized incomplete $U$-statistics in high dimensions. Ann. Statist. 47 (2019), no. 6, 3127–3156. doi:10.1214/18-AOS1773. https://projecteuclid.org/euclid.aos/1572487388

#### Supplemental materials

• Supplement to “Randomized incomplete $U$-statistics in high dimensions”. The Supplementary Material contains the proofs, additional discussion, simulation results and applications for the main paper.