A New Framework for Distance and Kernel-based Metrics in High Dimensions

Abstract: We present new metrics to quantify and test for (i) the equality of distributions and (ii) the independence between two high-dimensional random vectors. We show that the energy distance based on the usual Euclidean distance cannot completely characterize the homogeneity of two high-dimensional distributions, in the sense that it can only detect the equality of the means and of the traces of the covariance matrices in the high-dimensional setup. We propose a new class of metrics which inherit the desirable properties of the energy distance and distance covariance in the low-dimensional setting, and which are capable of detecting the homogeneity of, and completely characterizing the independence between, the low-dimensional marginal distributions in the high-dimensional setup. We further propose t-tests based on the new metrics to perform high-dimensional two-sample testing and independence testing, and study their asymptotic behavior under both the high dimension low sample size (HDLSS) and high dimension medium sample size (HDMSS) setups. The computational complexity of the t-tests grows only linearly with the dimension, so the tests are scalable to very high-dimensional data. We demonstrate the superior power behavior of the proposed tests for homogeneity of distributions and for independence via both simulated and real datasets.


1 Introduction
Nonparametric two-sample testing of homogeneity of distributions has been a classical problem in statistics, finding a plethora of applications in goodness-of-fit testing, clustering, change-point detection and so on. Some of the most traditional tools in this domain are the Kolmogorov-Smirnov test and the Wald-Wolfowitz runs test, whose multivariate and multidimensional extensions have been studied by Darling (1957), David (1958) and Bickel (1969) among others. Friedman and Rafsky (1979) proposed a distribution-free multivariate generalization of the Wald-Wolfowitz runs test applicable for arbitrary but fixed dimensions. Schilling (1986) proposed another distribution-free test for the multivariate two-sample problem based on k-nearest neighbor (k-NN) graphs. Maa et al. (1996) suggested a technique for reducing the dimensionality by examining the distribution of interpoint distances. In a recent novel work, Chen and Friedman (2017) proposed graph-based tests for moderate to high dimensional data and non-Euclidean data. The last two decades have seen an abundance of literature on distance and kernel-based tests for equality of distributions. Energy distance (first introduced by Székely (2002)) and maximum mean discrepancy or MMD (see Gretton et al. (2012)) have been widely studied in both the statistics and machine learning communities. Sejdinovic et al. (2013) provided a unifying framework establishing the equivalence between the (generalized) energy distance and MMD. Although there have been some very recent works gaining insight into the decaying power of distance and kernel-based tests for high-dimensional inference (see for example Ramdas et al. (2015a, 2015b), Kim et al. (2018) and Li (2018)), the behavior of these tests in the high-dimensional setup is still a largely unexplored area.
Measuring and testing for independence between two random vectors has been another fundamental problem in statistics, which has found applications in a wide variety of areas such as independent component analysis, feature selection, graphical modeling, causal inference, etc. There has been an enormous amount of literature on developing dependence metrics to quantify non-linear and non-monotone dependence in the low-dimensional context. Gretton et al. (2005, 2007) introduced a kernel-based independence measure, namely the Hilbert-Schmidt Independence Criterion (HSIC). Bergsma and Dassios (2014) proposed a consistent test of independence of two ordinal random variables based on an extension of Kendall's tau. Josse and Holmes (2014) suggested tests of independence based on the RV coefficient. Székely et al. (2007), in their seminal paper, introduced distance covariance (dCov) to characterize dependence between two random vectors of arbitrary dimensions. Lyons (2013) extended the notion of distance covariance from Euclidean spaces to arbitrary metric spaces. Sejdinovic et al. (2013) established the equivalence between HSIC and (generalized) distance covariance via the correspondence between positive definite kernels and semi-metrics of negative type. Over the last decade, the idea of distance covariance has been widely extended and analyzed in various ways; see for example Zhou (2012), Wang et al. (2015), Shao and Zhang (2014), Huo and Székely (2016), Zhang et al. (2018) and Edelmann et al. (2018), among many others. There has also been some very recent literature aiming to generalize distance covariance to quantify the joint dependence among more than two random vectors; see for example Matteson and Tsay (2017), Jin and Matteson (2017), Chakraborty and Zhang (2018) and Böttcher (2017). However, in the high-dimensional setup the literature is scarce, and the behavior of the widely used distance and kernel-based dependence metrics is not very well explored to date. Székely and Rizzo (2013) proposed a distance correlation based t-test to test for independence in high dimensions. In a very recent work, Zhu et al. (2018) showed that in the high dimension low sample size (HDLSS) setting, i.e., when the dimensions grow while the sample size is held fixed, the sample distance covariance can only measure the component-wise linear dependence between the two vectors. As a consequence, the distance correlation based t-test proposed by Székely and Rizzo (2013) for independence between two high-dimensional random vectors has trivial power when the two random vectors are nonlinearly dependent but component-wise uncorrelated. As a remedy, Zhu et al. (2018) proposed a test by aggregating the pairwise squared sample distance covariances and studied its asymptotic behavior under the HDLSS setup.

This paper presents a new class of metrics to quantify the homogeneity of distributions and the independence between two high-dimensional random vectors. The core of our methodology is a new way of defining the distance between sample points (the interpoint distance) in high-dimensional Euclidean spaces. In the first part of this work, we show that the energy distance based on the usual Euclidean distance cannot completely characterize the homogeneity of two high-dimensional distributions, in the sense that it can only detect the equality of the means and the traces of the covariance matrices in the high-dimensional setup.
To overcome such a limitation, we propose a new class of metrics based on the new distance which inherits the desirable properties of energy distance and maximum mean discrepancy in the low-dimensional setting and is capable of detecting the pairwise homogeneity of the low-dimensional marginal distributions in the HDLSS setup. We construct a high-dimensional two-sample t-test based on the U-statistic type estimator of the proposed metric, which can be viewed as a generalization of the classical two-sample t-test with equal variances. We show under the HDLSS setting that the new two-sample t-test converges to a central t-distribution under the null and that it has nontrivial power for a broader class of alternatives compared to the energy distance. We further show that the two-sample t-test converges to a standard normal limit under the null when the dimension and sample size both grow to infinity with the dimension growing more rapidly. It is worth mentioning that we develop an approach to unify the analysis for the usual energy distance and the proposed metrics. Compared to existing works, we make the following contribution:
• We derive the asymptotic variance of the generalized energy distance under the HDLSS setting and propose a computationally efficient variance estimator (whose computational cost is linear in the dimension). Our analysis is based on a pivotal t-statistic which does not require permutation or resampling-based inference and allows an asymptotic exact power analysis.
In the second part, we propose a new framework to construct dependence metrics to quantify the dependence between two high-dimensional random vectors X and Y of possibly different dimensions.
The new metric, denoted by D^2(X, Y), generalizes both the distance covariance and HSIC. It completely characterizes independence between X and Y and inherits all other desirable properties of the distance covariance and HSIC for fixed dimensions. In the HDLSS setting, we show that the proposed population dependence metric behaves as an aggregation of group-wise (generalized) distance covariances. We construct an unbiased U-statistic type estimator of D^2(X, Y) and show that with growing dimensions, the unbiased estimator is asymptotically equivalent to the sum of group-wise squared sample (generalized) distance covariances. Thus it can quantify group-wise non-linear dependence between two high-dimensional random vectors, going beyond the scope of the distance covariance based on the usual Euclidean distance and of HSIC, which have recently been shown to capture only component-wise linear dependence in high dimension; see Zhu et al. (2018). We further propose a t-test based on the new metrics to perform high-dimensional independence testing and study its asymptotic size and power behaviors under both the HDLSS and high dimension medium sample size (HDMSS) setups. In particular, under the HDLSS setting, we prove that the proposed t-test converges to a central t-distribution under the null and a noncentral t-distribution with a random noncentrality parameter under the alternative. Through extensive numerical studies, we demonstrate that the newly proposed t-test can capture group-wise nonlinear dependence which cannot be detected by the usual distance covariance and HSIC in the high-dimensional regime. Compared to the marginal aggregation approach in Zhu et al. (2018), our new method enjoys two major advantages.
• Our approach provides a neater way of generalizing the notion of distance and kernel-based dependence metrics. The newly proposed metrics completely characterize dependence in the lowdimensional case and capture group-wise nonlinear dependence in the high-dimensional case. In this sense, our metric can detect a wider range of dependence compared to the marginal aggregation approach.
• The computational complexity of the t-tests only grows linearly with the dimension and thus is scalable to very high dimensional data.
Notation. Let X = (X_1, . . . , X_p) ∈ R^p and Y = (Y_1, . . . , Y_q) ∈ R^q be two random vectors of dimensions p and q respectively. Denote by ‖·‖_p the Euclidean norm of R^p (we shall use it interchangeably with ‖·‖ when there is no confusion). Let 0_p be the origin of R^p. We use X ⊥⊥ Y to denote that X is independent of Y, and use "X =^d Y" to indicate that X and Y are identically distributed. Let (X', Y'), (X'', Y'') and (X''', Y''') be independent copies of (X, Y). We utilize the order-in-probability notations such as stochastic boundedness O_p (big O in probability), convergence in probability o_p (small o in probability) and equivalent order ≍_p, which is defined as follows: for a sequence of random variables {Z_n}_{n=1}^∞ and a sequence of real numbers {a_n}_{n=1}^∞, Z_n ≍_p a_n if and only if Z_n/a_n = O_p(1) and a_n/Z_n = O_p(1) as n → ∞. For a metric space (X, d_X), let M(X) and M_1(X) denote the set of all finite signed Borel measures on X and all probability measures on X, respectively; define M(Y) and M_1(Y) in a similar way. For a matrix A = (a_kl)_{k,l=1}^n ∈ R^{n×n}, define its U-centered version Ã = (ã_kl) ∈ R^{n×n} as

ã_kl = a_kl − (1/(n−2)) Σ_{j=1}^n a_kj − (1/(n−2)) Σ_{i=1}^n a_il + (1/((n−1)(n−2))) Σ_{i,j=1}^n a_ij for k ≠ l, and ã_kk = 0,   (1)

for k, l = 1, . . . , n. Define the inner product (Ã · B̃) := (1/(n(n−3))) Σ_{k≠l} ã_kl b̃_kl. Denote by tr(A) the trace of a square matrix A, and by A ⊗ B the Kronecker product of two matrices A and B. Let Φ(·) be the cumulative distribution function of the standard normal distribution. Denote by t_{a,b} the noncentral t-distribution with a degrees of freedom and noncentrality parameter b, and write t_a = t_{a,0}. Denote by q_{α,a} and Z_α the upper-α quantiles of the distribution of t_a and of the standard normal distribution, respectively, for α ∈ (0, 1). Also denote by χ²_a the chi-square distribution with a degrees of freedom. Denote U ∼ Rademacher(0.5) if P(U = 1) = P(U = −1) = 0.5.
Let 1_A denote the indicator function associated with a set A. Finally, denote by ⌊a⌋ the integer part of a ∈ R.
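For concreteness, the U-centering operation in (1) is easy to vectorize. Below is a minimal numpy sketch; the function name and interface are ours, not from the paper:

```python
import numpy as np

def u_center(A):
    """U-center a pairwise distance matrix A as in equation (1):
    for k != l, subtract row/column sums scaled by (n - 2) and add back
    the grand sum scaled by (n - 1)(n - 2); the diagonal is set to zero."""
    n = A.shape[0]
    At = (A
          - A.sum(axis=1, keepdims=True) / (n - 2)
          - A.sum(axis=0, keepdims=True) / (n - 2)
          + A.sum() / ((n - 1) * (n - 2)))
    np.fill_diagonal(At, 0.0)
    return At
```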
2 An overview: distance and kernel-based metrics

2.1 Energy distance and MMD

Energy distance (see Székely et al. (2004, 2005) and Baringhaus and Franz (2004)), or the Euclidean energy distance, between two random vectors X, Y ∈ R^p with X ⊥⊥ Y, E‖X‖_p < ∞ and E‖Y‖_p < ∞, is defined as

ED(X, Y) = 2 E‖X − Y‖_p − E‖X − X'‖_p − E‖Y − Y'‖_p,

where (X', Y') is an independent copy of (X, Y). Theorem 1 in Székely et al. (2005) shows that ED(X, Y) ≥ 0 and the equality holds if and only if X =^d Y. In general, for an arbitrary metric space (X, d), the generalized energy distance between X ∼ P_X and Y ∼ P_Y with X ⊥⊥ Y is defined as

ED_d(X, Y) = 2 E d(X, Y) − E d(X, X') − E d(Y, Y').   (3)

Definition 2.1 (Negative type). The metric space (X, d) is said to have negative type if for all n ≥ 2, x_1, . . . , x_n ∈ X and α_1, . . . , α_n ∈ R with Σ_{i=1}^n α_i = 0,

Σ_{i,j=1}^n α_i α_j d(x_i, x_j) ≤ 0.   (4)

The metric space (X, d) is said to be of strong negative type if the equality in (4) holds only when α_i = 0 for all i ∈ {1, . . . , n}.
If (X, d) has strong negative type, then ED_d(X, Y) completely characterizes the homogeneity of the distributions of X and Y (see Lyons (2013) and Sejdinovic et al. (2013) for detailed discussions). This quantification of the homogeneity of distributions lends itself to use in one-sample goodness-of-fit testing and in two-sample testing for the equality of distributions.
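As a quick illustration of (3) with the Euclidean distance, here is a hedged numpy sketch of a plug-in estimator of the energy distance, with the within-sample averages taken over off-diagonal pairs (the function name is ours):

```python
import numpy as np
from scipy.spatial.distance import cdist

def energy_distance(X, Y):
    """Estimate ED(X, Y) = 2 E d(X, Y) - E d(X, X') - E d(Y, Y')
    with d the Euclidean distance, from samples X (n x p) and Y (m x p)."""
    n, m = X.shape[0], Y.shape[0]
    dxy = cdist(X, Y).mean()                  # cross-sample average
    dxx = cdist(X, X).sum() / (n * (n - 1))   # off-diagonal average
    dyy = cdist(Y, Y).sum() / (m * (m - 1))
    return 2.0 * dxy - dxx - dyy
```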
On the machine learning side, Gretton et al. (2012) proposed a kernel-based metric, namely maximum mean discrepancy (MMD), to conduct two-sample testing for equality of distributions. We provide some background before introducing MMD.
Definition 2.2. (RKHS) Let H be a Hilbert space of real-valued functions defined on some space X. A bivariate function K : X × X → R is called a reproducing kernel of H if (i) K(·, x) ∈ H for all x ∈ X, and (ii) f(x) = ⟨f, K(·, x)⟩_H for all x ∈ X and f ∈ H, where ⟨·, ·⟩_H is the inner product associated with H. If H has a reproducing kernel, it is said to be a reproducing kernel Hilbert space (RKHS).
By Moore-Aronszajn theorem, for every positive definite function (also called a kernel) K : X × X → R, there is an associated RKHS H K with the reproducing kernel K. The map Π : M 1 (X ) → H K , defined as Π(P ) = X K(·, x) dP (x) for P ∈ M 1 (X ) is called the mean embedding function associated with K. A kernel K is said to be characteristic to M 1 (X ) if the map Π associated with K is injective.
Suppose K is a characteristic kernel on X. Then the MMD between X ∼ P_X and Y ∼ P_Y, where P_X, P_Y ∈ M_1(X), is defined as

MMD(X, Y) = ‖Π(P_X) − Π(P_Y)‖_{H_K}.

By virtue of K being a characteristic kernel, Gretton et al. (2012) shows that the squared MMD can be equivalently expressed as

MMD²(X, Y) = E K(X, X') + E K(Y, Y') − 2 E K(X, Y).

Theorem 22 in Sejdinovic et al. (2013) establishes the equivalence between the (generalized) energy distance and MMD. Following is the definition of a kernel induced by a distance metric (refer to Section 4.1 in Sejdinovic et al. (2013) for more details).

Definition 2.3. (Distance-induced kernel and kernel-induced distance) Let (X, d) be a metric space of negative type and x_0 ∈ X. Define K : X × X → R as

K(x, x') = (1/2) { d(x, x_0) + d(x', x_0) − d(x, x') }.   (7)

The kernel K is positive definite if and only if (X, d) has negative type, and thus K is a valid kernel on X whenever d is a metric of negative type. The kernel K defined in (7) is said to be the distance-induced kernel induced by d and centered at x_0. On the other hand, a distance d can be generated by a kernel K via

d(x, x') = K(x, x) + K(x', x') − 2 K(x, x').

Proposition 29 in Sejdinovic et al. (2013) establishes that the distance-induced kernel K induced by d is characteristic to M_1(X) ∩ M_1^K(X) if and only if (X, d) has strong negative type. Therefore, MMD can be viewed as a special case of the generalized energy distance in (3) with d being the metric induced by a characteristic kernel.
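For illustration, a minimal sketch of the unbiased estimator of the squared MMD with a Gaussian kernel; the bandwidth gamma and the function name are our own choices:

```python
import numpy as np
from scipy.spatial.distance import cdist

def mmd2_unbiased(X, Y, gamma=1.0):
    """Unbiased estimate of MMD^2 with K(x, y) = exp(-||x-y||^2 / (2 gamma^2))."""
    n, m = X.shape[0], Y.shape[0]
    Kxx = np.exp(-cdist(X, X, 'sqeuclidean') / (2 * gamma ** 2))
    Kyy = np.exp(-cdist(Y, Y, 'sqeuclidean') / (2 * gamma ** 2))
    Kxy = np.exp(-cdist(X, Y, 'sqeuclidean') / (2 * gamma ** 2))
    exx = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))   # drop diagonal terms
    eyy = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return exx + eyy - 2.0 * Kxy.mean()
```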
In Section 4, we shall propose a new class of metrics for quantifying the homogeneity of high-dimensional distributions. This new class can be viewed as a particular case of the general measures in (3) with a suitably chosen distance.

2.2 Distance covariance and HSIC
Distance covariance (dCov) was first introduced in the seminal paper by Székely et al. (2007) to quantify the dependence between two random vectors of arbitrary (fixed) dimensions. Consider two random vectors X ∈ R^p and Y ∈ R^q with E‖X‖_p < ∞ and E‖Y‖_q < ∞. The Euclidean dCov between X and Y is defined as the positive square root of

dCov²(X, Y) = (1/(c_p c_q)) ∫_{R^{p+q}} |f_{X,Y}(t, s) − f_X(t) f_Y(s)|² / ( ‖t‖_p^{1+p} ‖s‖_q^{1+q} ) dt ds,

where f_X, f_Y and f_{X,Y} are the marginal and joint characteristic functions of X and Y respectively, and c_p = π^{(1+p)/2} / Γ((1+p)/2) is a constant with Γ(·) being the complete gamma function.
The key feature of dCov is that it completely characterizes independence between two random vectors of arbitrary dimensions; in other words, dCov(X, Y) = 0 if and only if X ⊥⊥ Y. According to Remark 3 in Székely et al. (2007), dCov can be equivalently expressed as

dCov²(X, Y) = E‖X − X'‖_p ‖Y − Y'‖_q + E‖X − X'‖_p E‖Y − Y'‖_q − 2 E‖X − X'‖_p ‖Y − Y''‖_q.   (10)

Lyons (2013) extends the notion of dCov from Euclidean spaces to general metric spaces. For arbitrary metric spaces (X, d_X) and (Y, d_Y), the generalized dCov between X ∼ P_X ∈ M_1(X) ∩ M_1^{d_X}(X) and Y ∼ P_Y ∈ M_1(Y) ∩ M_1^{d_Y}(Y) is defined as

dCov²_{d_X, d_Y}(X, Y) = E d_X(X, X') d_Y(Y, Y') + E d_X(X, X') E d_Y(Y, Y') − 2 E d_X(X, X') d_Y(Y, Y'').   (11)

Theorem 3.11 in Lyons (2013) shows that if (X, d_X) and (Y, d_Y) are both metric spaces of strong negative type, then dCov_{d_X, d_Y}(X, Y) = 0 if and only if X ⊥⊥ Y. In other words, the complete characterization of independence by dCov holds true for any metric spaces of strong negative type. According to Theorem 3.16 in Lyons (2013), every separable Hilbert space is of strong negative type. As Euclidean spaces are separable Hilbert spaces, the characterization of independence by dCov between two random vectors in (R^p, ‖·‖_p) and (R^q, ‖·‖_q) is just a special case.
The Hilbert-Schmidt Independence Criterion (HSIC) was introduced as a kernel-based independence measure by Gretton et al. (2005, 2007). Suppose X and Y are arbitrary topological spaces, and K_X and K_Y are characteristic kernels on X and Y with respective RKHSs H_{K_X} and H_{K_Y}. Let K = K_X ⊗ K_Y be the tensor product of the kernels K_X and K_Y, and H_K be the tensor product of the RKHSs H_{K_X} and H_{K_Y}. The HSIC between X and Y is defined as

HSIC(X, Y) = ‖Π(P_{XY}) − Π(P_X P_Y)‖_{H_K},

where P_{XY} denotes the joint probability distribution of X and Y. The HSIC between X and Y is essentially the MMD between the joint distribution P_{XY} and the product of the marginals P_X and P_Y. Gretton et al. (2005) shows that the squared HSIC can be equivalently expressed as

HSIC²(X, Y) = E K_X(X, X') K_Y(Y, Y') + E K_X(X, X') E K_Y(Y, Y') − 2 E K_X(X, X') K_Y(Y, Y'').

Theorem 24 in Sejdinovic et al. (2013) establishes the equivalence between the generalized dCov and HSIC.
For an observed random sample (X_i, Y_i)_{i=1}^n from the joint distribution of X and Y, a U-statistic type estimator of the generalized dCov in (11) can be defined as

dCov²_n(X, Y) := (Ã · B̃) = (1/(n(n−3))) Σ_{k≠l} ã_kl b̃_kl,   (14)

where Ã, B̃ are the U-centered versions (see (1)) of the distance matrices A := (d_X(X_k, X_l))_{k,l=1}^n and B := (d_Y(Y_k, Y_l))_{k,l=1}^n, respectively.
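Putting (1) and (14) together, the U-statistic type estimator of the squared dCov takes a few lines; a sketch reusing the `u_center` helper from the earlier block:

```python
import numpy as np
from scipy.spatial.distance import cdist

def dcov2_ustat(X, Y):
    """U-statistic type estimator (14): sum_{k != l} a~_kl b~_kl / (n(n-3))."""
    n = X.shape[0]
    At = u_center(cdist(X, X))   # U-centered Euclidean distance matrices, see (1)
    Bt = u_center(cdist(Y, Y))
    return (At * Bt).sum() / (n * (n - 3))
```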

3 New distance for Euclidean space
We introduce a family of distances for Euclidean spaces which shall play a central role in the subsequent developments. For x ∈ R^p̃, we partition x into p sub-vectors or groups, namely x = (x^(1), . . . , x^(p)), where x^(i) ∈ R^{d_i} and p̃ = Σ_{i=1}^p d_i. Let ρ_i be a metric or semimetric (see for example Sejdinovic et al. (2013)) defined on R^{d_i} for 1 ≤ i ≤ p. We define a family of distances for R^p̃ as

K_d(x, x') = { Σ_{i=1}^p ρ_i(x^(i), x'^(i)) }^{1/2},   (15)

where x, x' ∈ R^p̃ with x = (x^(1), . . . , x^(p)) and x' = (x'^(1), . . . , x'^(p)), and d = (d_1, . . . , d_p).

Proposition 3.1. Suppose each ρ_i is a metric of strong negative type on R^{d_i}. Then (R^p̃, K_d) satisfies the following two properties: 1. K_d is a valid metric on R^p̃; 2. (R^p̃, K_d) has strong negative type.
In a special case, suppose ρ_i is the Euclidean distance on R^{d_i}. By Theorem 3.16 in Lyons (2013), R^{d_i} equipped with the Euclidean distance is a separable Hilbert space and hence has strong negative type. Then the Euclidean space R^p̃ equipped with the metric

K_d(x, x') = { Σ_{i=1}^p ‖x^(i) − x'^(i)‖ }^{1/2}

is of strong negative type. Further, if all the components x^(i) are unidimensional, i.e., d_i = 1 for 1 ≤ i ≤ p, then the metric boils down to

K_d(x, x') = ‖x − x'‖_1^{1/2},   (17)

where ‖x‖_1 = Σ_{j=1}^p |x_j| is the l_1 or absolute norm on R^p. If

ρ_i(x^(i), x'^(i)) = ‖x^(i) − x'^(i)‖²,   (18)

then K_d reduces to the usual Euclidean distance. We shall unify the analysis of our new metrics with the classical metrics by considering K_d defined in (15) with

S1. each ρ_i being a metric of strong negative type on R^{d_i};

S2. each ρ_i being the semimetric defined in (18).
The first case corresponds to the newly proposed metrics, while the second case leads to the classical metrics based on the usual Euclidean distance. Remarks 3.1 and 3.2 provide two different ways of generalizing the class in (15). To keep the presentation focused, our analysis below only concerns the distances defined in (15); a code sketch follows. In the numerical studies in Section 6, we consider ρ_i to be the Euclidean distance and the distances induced by the Laplace and Gaussian kernels (see Definition 2.3), which are of strong negative type on R^{d_i} for 1 ≤ i ≤ p.
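A hedged sketch of the proposed distance (15), with the group partition supplied as index sets and Euclidean ρ_i by default (Case S1); replacing ρ_i by the squared Euclidean distance recovers Case S2:

```python
import numpy as np

def K_d(x, y, groups, rho=None):
    """New distance (15): K_d(x, y) = sqrt(sum_i rho_i(x_(i), y_(i))).
    `groups` is a list of index arrays partitioning {0, ..., p~ - 1}."""
    if rho is None:
        rho = lambda u, v: np.linalg.norm(u - v)   # Euclidean rho_i (Case S1)
    return np.sqrt(sum(rho(x[g], y[g]) for g in groups))

# Example: p~ = 6 coordinates split into p = 3 groups of size d_i = 2.
x, y = np.random.randn(6), np.random.randn(6)
groups = [np.arange(0, 2), np.arange(2, 4), np.arange(4, 6)]
print(K_d(x, y, groups))
```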
Remark 3.1. A more general family of distances can be defined as

K_{d,r}(x, x') = { Σ_{i=1}^p ρ_i(x^(i), x'^(i)) }^r,  0 < r < 1.

According to Remark 3.19 of Lyons (2013), the space (R^p̃, K_{d,r}) is of strong negative type. The proposed distance is a special case with r = 1/2.
Remark 3.2. Based on the proposed distance, one can construct the generalized Gaussian and Laplacian kernels as

K_G(x, x') = exp( − K_d²(x, x') / (2γ²) )  and  K_L(x, x') = exp( − K_d(x, x') / γ ),  γ > 0.

If K_d is translation invariant, then by Theorem 9 in Sriperumbudur et al. (2010) it can be verified that each of these kernels is a characteristic kernel on R^p̃. As a consequence, the Euclidean space equipped with the distance induced by such a kernel is of strong negative type.

Remark 3.3. In Sections 4 and 5 we develop new classes of homogeneity and dependence metrics to quantify the pairwise homogeneity of distributions or the pairwise non-linear dependence of the low-dimensional groups. A natural question in this regard is how to partition the random vectors optimally in practice. We present some real data examples in Section 6.3 of the main paper where all the group sizes are taken to be one (as a special case of the general theory proposed in this paper), and an additional real data example in Section C of the supplement where the data admits a natural grouping. We believe this partitioning can be highly problem-specific and may require subject knowledge. We leave it for future research to develop an algorithm to find the optimal groups using the data and perhaps some auxiliary information.
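Under our reading of Remark 3.2, the generalized kernels can be built directly on top of K_d. A sketch reusing `K_d` from the block above (the Gaussian/Laplacian forms and the bandwidth gamma are our assumptions):

```python
import numpy as np

# Generalized kernels constructed from the proposed distance K_d (sketch).
def gaussian_kernel_Kd(x, y, groups, gamma=1.0):
    # exp(-K_d^2 / (2 gamma^2)); note K_d^2 is just the sum of group distances
    return np.exp(-K_d(x, y, groups) ** 2 / (2.0 * gamma ** 2))

def laplacian_kernel_Kd(x, y, groups, gamma=1.0):
    return np.exp(-K_d(x, y, groups) / gamma)
```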

4 Homogeneity metrics
Consider X, Y ∈ R^p̃. Suppose X and Y can be partitioned into p sub-vectors or groups, viz. X = (X^(1), X^(2), . . . , X^(p)) and Y = (Y^(1), Y^(2), . . . , Y^(p)), where the groups X^(i) and Y^(i) are d_i-dimensional, 1 ≤ i ≤ p, and p might be fixed or growing. We assume that the X^(i)'s and Y^(i)'s are finite (low) dimensional vectors, i.e., the d_i's are uniformly bounded. Denote the mean vectors and the covariance matrices of X and Y by μ_X and μ_Y, and Σ_X and Σ_Y, respectively. We propose the following class of metrics E to quantify the homogeneity of the distributions of X and Y:

E_d(X, Y) = 2 E K_d(X, Y) − E K_d(X, X') − E K_d(Y, Y'),

with d = (d_1, . . . , d_p), where X ⊥⊥ Y. We shall drop the subscript d below for ease of notation.
Assumption 4.1. Assume that sup_{1≤i≤p} E ρ_i(X^(i), 0_{d_i}) < ∞ and sup_{1≤i≤p} E ρ_i(Y^(i), 0_{d_i}) < ∞.

Under Assumption 4.1, E is finite. In Section A.1 of the supplement we illustrate that in the low-dimensional setting, E(X, Y) completely characterizes the homogeneity of the distributions of X and Y.
We propose an unbiased U-statistic type estimator E_{n,m}(X, Y) of E(X, Y) as in equation (9) with d being the new metric K. We refer the reader to Section A.1 of the supplement, where we show that E_{n,m}(X, Y) essentially inherits all the desirable properties of the U-statistic type estimator of the generalized energy distance and MMD.
We define the following quantities, which will play an important role in our subsequent analysis:

τ² := E K²(X, Y),  τ_X² := E K²(X, X'),  τ_Y² := E K²(Y, Y').   (20)

In Case S2 (i.e., when K is the Euclidean distance), we have

τ² = tr Σ_X + tr Σ_Y + ‖μ_X − μ_Y‖²,  τ_X² = 2 tr Σ_X,  τ_Y² = 2 tr Σ_Y.   (21)

Under the null hypothesis H_0 : X =^d Y, clearly τ = τ_X = τ_Y.

Lemma 4.1. (1) In Case S1, 2τ − τ_X − τ_Y = 0 if and only if X^(i) =^d Y^(i) for all 1 ≤ i ≤ p. (2) In Case S2, 2τ − τ_X − τ_Y = 0 if and only if μ_X = μ_Y and tr Σ_X = tr Σ_Y.

In the subsequent discussion we study the asymptotic behavior of E in the high-dimensional framework, i.e., when p grows to ∞ with fixed n and m (discussed in Subsection 4.1) and when n and m grow to ∞ as well (discussed in Subsection B.1 in the supplement). We point out some limitations of the test for homogeneity of distributions in the high-dimensional setup based on the usual Euclidean energy distance. Consequently we propose a test based on the proposed metric and justify its consistency for growing dimension.
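To make the construction concrete, here is a hedged plug-in sketch of the homogeneity metric built on the group-wise distances; the paper's E_{n,m} is the U-statistic type version of equation (9), but the quantities below are the same population targets. Note the cost is O((n + m)² p̃), i.e., linear in the dimension:

```python
import numpy as np

def pairwise_K(A, B, groups):
    """Matrix of K_d distances between rows of A and rows of B,
    with Euclidean rho_i on each group (Case S1)."""
    out = np.zeros((A.shape[0], B.shape[0]))
    for g in groups:
        diff = A[:, None, g] - B[None, :, g]
        out += np.sqrt((diff ** 2).sum(axis=-1))   # rho_i = Euclidean distance
    return np.sqrt(out)

def E_nm(X, Y, groups):
    """Plug-in estimate of E(X, Y) = 2 E K(X, Y) - E K(X, X') - E K(Y, Y')."""
    n, m = X.shape[0], Y.shape[0]
    Kxy = pairwise_K(X, Y, groups).mean()
    Kxx = pairwise_K(X, X, groups).sum() / (n * (n - 1))
    Kyy = pairwise_K(Y, Y, groups).sum() / (m * (m - 1))
    return 2.0 * Kxy - Kxx - Kyy
```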

4.1 High dimension low sample size (HDLSS)
In this subsection, we study the asymptotic behavior of the Euclidean energy distance and our proposed metric E when the dimension grows to infinity while the sample sizes n and m are held fixed.
We make the following moment assumption.
Assumption 4.2. There exist constants a, a', a'', A, A', A'' such that uniformly over p,

a ≤ inf_{1≤i≤p} E ρ_i(X^(i), X'^(i)) ≤ sup_{1≤i≤p} E ρ_i(X^(i), X'^(i)) ≤ A,
a' ≤ inf_{1≤i≤p} E ρ_i(Y^(i), Y'^(i)) ≤ sup_{1≤i≤p} E ρ_i(Y^(i), Y'^(i)) ≤ A',
a'' ≤ inf_{1≤i≤p} E ρ_i(X^(i), Y^(i)) ≤ sup_{1≤i≤p} E ρ_i(X^(i), Y^(i)) ≤ A''.

Under Assumption 4.2, it is not hard to see that τ_X, τ_Y, τ ≍ p^{1/2}. The proposition below provides an expansion for K evaluated at random samples.

Proposition 4.1. We have

K(X, Y) = τ (1 + (1/2) L(X, Y) + R(X, Y)),

and analogous expansions hold for K(X, X') and K(Y, Y') with (τ_X, L_X, R_X) and (τ_Y, L_Y, R_Y), where

L(X, Y) := (K²(X, Y) − τ²)/τ²,  L_X(X, X') := (K²(X, X') − τ_X²)/τ_X²,  L_Y(Y, Y') := (K²(Y, Y') − τ_Y²)/τ_Y²,

and R, R_X, R_Y are the corresponding remainder terms. Henceforth we will drop the subscripts X and Y from L_X, L_Y, R_X and R_Y for notational convenience.
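The expansion in Proposition 4.1 is just a Taylor expansion of the square root: since K²(X, X') = τ_X²(1 + L(X, X')) by the definition of L, we have (consistent with the proof in Section D)

```latex
K(X, X') = \tau_X \sqrt{1 + L(X, X')}
         = \tau_X \Big( 1 + \tfrac{1}{2} L(X, X') - \tfrac{1}{8} L^2(X, X') + \cdots \Big),
```

so the remainder satisfies R(X, X') = O_p(L²(X, X')) whenever L(X, X') = o_p(1).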
Remark 4.1. Under certain strong mixing conditions, or more generally under certain weak dependence assumptions, it is not hard to see that Σ_{i,j=1}^p cov(Z_i, Z_j) = O(p) as p → ∞, where Z_i := ρ_i(X^(i), X'^(i)) (see for example Theorem 1.2 in Rio (1993)). Therefore we have var(L(X, X')) = O(1/p), and hence by Chebyshev's inequality, L(X, X') = O_p(1/√p). We refer the reader to Remark 2.1.1 in Zhu et al. (2019) for illustrations when each ρ_i is the squared Euclidean distance.
Theorem 4.1. Suppose Assumptions 4.2 and 4.3 hold. Further assume that the three sequences {p L²(X, Y)}, {p L²(X, X')} and {p L²(Y, Y')}, indexed by p, are all uniformly integrable. Then we have, as p → ∞,

E(X, Y) = (2τ − τ_X − τ_Y) + o(1).   (25)

Remark 4.2. Remark D.1 in the supplementary materials provides some illustrations of sufficient conditions under which the uniform integrability required in Theorem 4.1 holds.

To illustrate that the leading term in equation (25) indeed gives a close approximation of the population E(X, Y), we consider the special case when K is the Euclidean distance. We simulate large samples of sizes m = n = 5000 from the distributions of X and Y for p = 20, 40, 60, 80 and 100. The large sample sizes are to ensure that the U-statistic type estimator of E(X, Y) gives a very close approximation of the population E(X, Y). In Table 1 we list the ratio between E(X, Y) and the leading term in (25) for the different values of p, which turns out to be very close to 1, demonstrating that the leading term in (25) indeed approximates E(X, Y) reasonably well.
It is to be noted that assuming τ, τ_X, τ_Y < ∞ for each p does not contradict the growth rate τ, τ_X, τ_Y ≍ p^{1/2}. Clearly under H_0, 2τ − τ_X − τ_Y = 0 irrespective of the choice of K. In view of Lemma 4.1 and Theorem 4.1, in Case S2 the leading term of E(X, Y) becomes zero if and only if μ_X = μ_Y and tr Σ_X = tr Σ_Y. In other words, when the dimension grows high, the Euclidean energy distance can only capture the equality of the means and the first spectral means, whereas our proposed metric captures the pairwise homogeneity of the low-dimensional marginal distributions of X^(i) and Y^(i). Clearly X^(i) =^d Y^(i) for 1 ≤ i ≤ p implies μ_X = μ_Y and tr Σ_X = tr Σ_Y. Thus the proposed metric can capture a wider range of inhomogeneity of distributions than the Euclidean energy distance. We impose the following conditions to study the asymptotic behavior of the (unbiased) U-statistic type estimator of E(X, Y) in the HDLSS setup.
Assumption 4.4. For fixed n and m, as p → ∞,

( τ L(X_k, Y_l), τ_X L(X_s, X_t), τ_Y L(Y_u, Y_v) )_{k,l, s<t, u<v} →^d ( a_kl, b_st, c_uv )_{k,l, s<t, u<v},

where {a_kl, b_st, c_uv}_{k,l, s<t, u<v} are jointly Gaussian with zero mean, var(a_kl) = σ², var(b_st) = σ_X² and var(c_uv) = σ_Y². Further we assume that {a_kl, b_st, c_uv}_{k,l, s<t, u<v} are all independent of each other.
Due to the double-centering property and the independence between the two samples, it is straightforward to verify that these quantities are asymptotically uncorrelated, so it is natural to expect that the limits {a_kl, b_st, c_uv}_{k,l, s<t, u<v} are all independent of each other.
Remark 4.4. The above multi-dimensional central limit theorem is classical and can be derived under suitable moment and weak dependence assumptions on the components of X and Y, such as mixing or near epoch dependence conditions. We refer the reader to Doukhan and Neumann (2008) for such results.

Next we define the cross distance covariance (cdCov) between X and Y, which plays an important role in the construction of the t-test statistic. Let v_s := s(s − 3)/2 for s = m, n. We introduce the quantities σ_{nm} and a_{nm} in equation (26), where σ², σ_X² and σ_Y² are defined in Assumption 4.4. We are now ready to introduce the two-sample t-test statistic T_{n,m}, whose pooled variance estimator is built from cdCov²_{n,m}(X, Y) together with D²_n(X, X) and D²_m(Y, Y), the unbiased estimators of the (squared) distance variances defined in equation (14). It is interesting to note that the variability of the sample generalized energy distance depends on the distance variances as well as on the cdCov. It is also worth mentioning that the computational complexity of the pooled variance estimator, and thus of the t-statistic, is linear in p.
To study the asymptotic behavior of the test, we consider a class P of distributions on (X, Y) satisfying suitable moment and weak dependence assumptions on the components of X and Y. In Case S2 (i.e., K is the Euclidean distance), a set of sufficient conditions for (P_X, P_Y) ∈ P requires that the first two moments of P_X and P_Y are not too far away from each other. In this sense, P defines a class of local alternative distributions (with respect to the null H_0 : P_X = P_Y). We now state the main result of this subsection.
Theorem 4.2. In both Cases S1 and S2, under Assumptions 4.2, 4.3 and 4.4, as p → ∞ with n and m remaining fixed, and further assuming that (P_X, P_Y) ∈ P, T_{n,m} converges in distribution to a t-type limit of the form

(Z + a_{nm}/σ_{nm}) / sqrt( (W_1 + W_2 + W_3) / ((n−1)(m−1) + v_n + v_m) ),

where W_1 ∼ χ²_{(n−1)(m−1)}, W_2 ∼ χ²_{v_n} and W_3 ∼ χ²_{v_m} are independent chi-squared random variables, Z ∼ N(0, 1) is independent of them, and σ_{nm} and a_{nm} are defined in equation (26). In particular, under H_0, we have T_{n,m} →^d t_{(n−1)(m−1)+v_n+v_m}.

Based on the asymptotic behavior of T_{n,m} for growing dimensions, we propose a test for H_0 as follows: at level α ∈ (0, 1), reject H_0 if T_{n,m} > q_{α,(n−1)(m−1)+v_n+v_m} and fail to reject H_0 otherwise. The asymptotic power curve for testing H_0 based on T_{n,m} is given by 1 − φ_{m,n}(t). The following proposition gives a large-sample approximation of the power curve, thereby justifying the consistency of the test.
Remark 4.5. We first derive the power function 1 − φ_{n,m}(t) under the assumption that n and m are fixed. The main idea behind Proposition 4.2, where we let n, m → ∞, is to see whether we get a reasonably good approximation of the power when n and m are large. In a sense we are doing sequential asymptotics: first letting p → ∞ and deriving the power function, and then deriving the leading term by letting n, m → ∞. This is a quite common practice in econometrics (see for example Phillips and Moon (1999)). The aim is to derive a leading term for the power when n, m are fixed but large.
Consider ∆ = s/√(nm) (as in Proposition 4.2) and set σ² = σ_X² = σ_Y² = 1. In Figure 1 below, we plot the exact power (computed from (28) with 50,000 Monte Carlo samples from the distribution of M) with n = m = 5 and 10, t = q_{α,(n−1)(m−1)+v_n+v_m} and α = 0.05, over different values of s. We overlay the large-sample approximation of the power function (given in Proposition 4.2) and observe that the approximation works reasonably well even for small sample sizes. Clearly, larger s results in better power, and s = 0 corresponds to trivial power.

We now discuss the power behavior of T_{n,m} based on the Euclidean energy distance. In Case S2, σ_X² admits an explicit expression in terms of the matrices Σ_X(i, i'), where Σ_X(i, i') is the covariance matrix between X^(i) and X^(i'); a similar expression holds for σ_Y². In Case S2 (i.e., when K is the Euclidean distance), if we further assume μ_X = μ_Y, an analogous expression can be verified for σ². Hence in Case S2, under the assumptions that μ_X = μ_Y, tr Σ_X = tr Σ_Y and tr Σ_X² = tr Σ_Y² = tr Σ_X Σ_Y, it can be seen from equations (21), (29) and (30) that the noncentrality parameter vanishes, i.e., ∆*_0 = 0 in Proposition 4.2. Consider the class of alternative distributions

H_A : μ_X = μ_Y, tr Σ_X = tr Σ_Y, tr Σ_X² = tr Σ_Y² = tr Σ_X Σ_Y, and P_X ≠ P_Y.

According to Theorem 4.2, the t-test T_{n,m} based on the Euclidean energy distance has trivial power against H_A. In contrast, the t-test based on the proposed metrics has non-trivial power against H_A as long as some of the low-dimensional marginal distributions of the groups of X and Y differ, so that the corresponding noncentrality parameter is nonzero.

To summarize our contributions:

• We show that the Euclidean energy distance can only detect the equality of means and the traces of covariance matrices in the high-dimensional setup. To the best of our knowledge, such a limitation of the Euclidean energy distance has not been pointed out in the literature before.
• We propose a new class of homogeneity metrics which completely characterizes homogeneity of two distributions in the low-dimensional setup and has nontrivial power against a broader range of alternatives, or in other words, can detect a wider range of inhomogeneity of two distributions in the high-dimensional setup.
• Grouping allows us to detect homogeneity beyond univariate marginal distributions, as the difference between two univariate marginal distributions is automatically captured by the difference between the marginal distributions of the groups that contain these two univariate components.
• Consequently we construct a high-dimensional two-sample t-test whose computational cost is linear in p. Owing to the pivotal nature of the limiting distribution of the test statistic, no resampling-based inference is needed.
Remark 4.6. Although the test based on our proposed statistic is asymptotically powerful against the alternative H_A (unlike the test based on the Euclidean energy distance), it can be verified that it has trivial power against alternatives under which X^(i) =^d Y^(i) for all 1 ≤ i ≤ p but the joint distributions of X and Y differ. Thus although it can detect differences between two high-dimensional distributions beyond the first two moments (a significant improvement over the Euclidean energy distance), it cannot capture differences beyond the equality of the low-dimensional marginal distributions. We conjecture that there might be some intrinsic difficulties for distance and kernel-based metrics to completely characterize the discrepancy between two high-dimensional distributions.

5 Dependence metrics
In this section, we focus on dependence testing for two random vectors X ∈ R^p̃ and Y ∈ R^q̃. Suppose X and Y can be partitioned into p and q groups, viz. X = (X^(1), . . . , X^(p)) and Y = (Y^(1), . . . , Y^(q)), where the components X^(i) and Y^(j) are d_i and g_j dimensional, respectively, for 1 ≤ i ≤ p and 1 ≤ j ≤ q. Here p, q might be fixed or growing. We assume that the X^(i)'s and Y^(j)'s are finite (low) dimensional vectors. We define a class of dependence metrics D between X and Y as the positive square root of

D²(X, Y) = E K_d(X, X') K_g(Y, Y') + E K_d(X, X') E K_g(Y, Y') − 2 E K_d(X, X') K_g(Y, Y''),

where d = (d_1, . . . , d_p) and g = (g_1, . . . , g_q). We drop the subscripts d, g of K for notational convenience.
To ensure the existence of D, we make the following assumption.

Assumption 5.1. Assume that sup_{1≤i≤p} E ρ_i(X^(i), 0_{d_i}) < ∞ and sup_{1≤j≤q} E ρ_j(Y^(j), 0_{g_j}) < ∞.

In Section A.2 of the supplement we demonstrate that in the low-dimensional setting, D(X, Y) completely characterizes independence between X and Y. For an observed random sample (X_k, Y_k)_{k=1}^n from the joint distribution of X and Y, define D^X := (d^X_kl) ∈ R^{n×n} with d^X_kl := K(X_k, X_l) for k, l ∈ {1, . . . , n}. Define d^Y_kl and D^Y in a similar way. With some abuse of notation, we consider the U-statistic type estimator D²_n(X, Y) of D² as defined in (14) with d_X and d_Y being K_d and K_g respectively. In Section A.2 of the supplement, we illustrate that D²_n(X, Y) essentially inherits all the desirable properties of the U-statistic type estimator of the generalized dCov and HSIC.
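A hedged sketch of the estimator D²_n(X, Y): form the K_d and K_g distance matrices and plug them into the U-centered form (14), reusing `pairwise_K` and `u_center` from the earlier sketches:

```python
import numpy as np

def D2_n(X, Y, groups_x, groups_y):
    """U-statistic type estimator of D^2(X, Y), cf. (14):
    U-center the K_d / K_g distance matrices and average their products."""
    n = X.shape[0]
    At = u_center(pairwise_K(X, X, groups_x))
    Bt = u_center(pairwise_K(Y, Y, groups_y))
    return (At * Bt).sum() / (n * (n - 3))
```

The overall cost is O(n²(p̃ + q̃)), matching the linear-in-dimension scalability claimed for the proposed t-tests.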
In the subsequent discussion we study the asymptotic behavior of D in the high-dimensional framework, i.e., when p and q grow to ∞ with fixed n (discussed in Subsection 5.1) and when n grows to ∞ as well (discussed in Subsection B.2 in the supplement).

5.1 High dimension low sample size (HDLSS)
In this subsection, our goal is to explore the behavior of D²(X, Y) and its unbiased U-statistic type estimator in the HDLSS setting, where p and q grow to ∞ while the sample size n is held fixed. Denote τ_{XY} := τ_X τ_Y, where τ_X² := E K_d²(X, X') and τ_Y² := E K_g²(Y, Y'). We impose the following conditions.
Remark 5.1. We refer the reader to Remark 4.1 in Section 4 for illustrations of some sufficient conditions under which we have var(L(X, X')) = O(1/p), and similarly for var(L(Y, Y')), R(X, X') and R(Y, Y').
Theorem 5.1. Under the above conditions, as p, q → ∞ with n held fixed,

D²(X, Y) = (1/(4 τ_{XY})) Σ_{i=1}^p Σ_{j=1}^q dCov²_{ρ_i, ρ_j}(X^(i), Y^(j)) + R,   (33)

where R is the remainder term, which is of smaller order than the leading term. Theorem 5.1 shows that when dimensions grow high, the population D²(X, Y) behaves as an aggregation of group-wise generalized dCov and thus essentially captures group-wise non-linear dependencies between X and Y.
Remark 5.2. Consider a special case where d_i = 1 and g_j = 1, and ρ_i and ρ_j are Euclidean distances, for all 1 ≤ i ≤ p and 1 ≤ j ≤ q. Then Theorem 5.1 essentially boils down to

D²(X, Y) = (1/(4 τ_{XY})) Σ_{i=1}^p Σ_{j=1}^q dCov²(X_i, Y_j) + R,

where R = o(1). This shows that in a special case (when we have unit group sizes), D²(X, Y) essentially behaves as an aggregation of cross-component dCov between X and Y. If K_d and K_g are Euclidean distances, or in other words if each ρ_i and ρ_j is the squared Euclidean distance, then using equation (10) it is straightforward to verify that

D²(X, Y) = (1/τ_{XY}) Σ_{i=1}^p Σ_{j=1}^q cov²(X_i, Y_j) + R_1,

where R_1 = o(1), which essentially presents a population version of Theorem 2.1.1 in Zhu et al. (2019) as a special case of Theorem 5.1.
Remark 5.3. To illustrate that the leading term in equation (33) indeed gives a close approximation of the population D²(X, Y), we consider the special case when K_d and K_g are Euclidean distances and p = q. We simulate a large sample of size n = 5000 from the distribution of (X, Y) for p = 20, 40, 60, 80 and 100. The large sample size is to ensure that the U-statistic type estimator of D²(X, Y) (given in (14)) gives a very close approximation of the population D²(X, Y). We list the ratio between D²(X, Y) and the leading term in (33) for the different values of p, which turns out to be very close to 1, demonstrating that the leading term in (33) indeed approximates D²(X, Y) reasonably well.

The following theorem explores the behavior of the population D²(X, Y) when p is fixed and q grows to infinity, while the sample size is held fixed. As far as we know, this asymptotic regime has not been previously considered in the literature. In this case, the Euclidean distance covariance behaves as an aggregation of the martingale difference divergences (MDD) proposed in Shao and Zhang (2014), which measure conditional mean dependence. Figure 2 below summarizes the curse of dimensionality for the Euclidean distance covariance under the different asymptotic regimes.
Theorem 5.2. Under Assumption 4.2 and suitable moment conditions, as q → ∞ with p and n remaining fixed, D²(X, Y) behaves as an aggregation of group-wise measures of the conditional mean dependence of the Y^(j)'s given X, scaled by τ_Y.

Remark 5.4. In particular, when both K_d and K_g are Euclidean distances, we have

D²(X, Y) = (1/τ_Y) Σ_{j=1}^q MDD²(Y^(j) | X) + R,

where R = o(1). Next we study the asymptotic behavior of the sample version D²_n(X, Y).
Theorem 5.3. Under the conditions of Theorem 5.1, as p, q → ∞ with n held fixed,

D²_n(X, Y) = (1/(4 τ_{XY})) Σ_{i=1}^p Σ_{j=1}^q dCov²_{n; ρ_i, ρ_j}(X^(i), Y^(j)) + R_n,   (36)

where X^(i), Y^(j) are the i-th and j-th groups of X and Y, respectively, 1 ≤ i ≤ p, 1 ≤ j ≤ q, and R_n is the remainder term satisfying R_n = o_p(1), i.e., R_n is of smaller order compared to the leading term and hence is asymptotically negligible.
The above theorem generalizes Theorem 2.1.1 in Zhu et al. (2019) by showing that the leading term of D²_n(X, Y) is the sum of all the group-wise (unbiased) squared sample generalized dCovs, scaled by τ_{XY}. In other words, in the HDLSS setting, D²_n(X, Y) is asymptotically equivalent to the aggregation of group-wise squared sample generalized dCovs. Thus D²_n(X, Y) can quantify group-wise non-linear dependencies between X and Y, going beyond the scope of the usual Euclidean dCov.
Remark 5.6. Consider a special case where d_i = 1 and g_j = 1, and ρ_i and ρ_j are Euclidean distances, for all 1 ≤ i ≤ p and 1 ≤ j ≤ q. Then Theorem 5.3 essentially states that

D²_n(X, Y) = (1/(4 τ_{XY})) Σ_{i=1}^p Σ_{j=1}^q dCov²_n(X_i, Y_j) + R_n,

where R_n = o_p(1). This demonstrates that in a special case (when we have unit group sizes), D²_n(X, Y) is asymptotically equivalent to the marginal aggregation of cross-component distance covariances proposed by Zhu et al. (2019).

Remark 5.7. To illustrate the approximation of D²_n(X, Y) by the aggregation of group-wise squared sample generalized dCov given by Theorem 5.3, we simulated the datasets in Examples 6.4.1, 6.4.2, 6.5.1 and 6.5.2 100 times each with n = 50 and p = q = 50. For each of the datasets, the difference between D²_n(X, Y) and the leading term on the RHS of equation (36) is smaller than 0.01 in every simulation run, which illustrates that the approximation works reasonably well.
The following theorem illustrates the asymptotic behavior of D²_n(X, Y) when p is fixed and q grows to infinity while the sample size is held fixed. Under this setup, if both K_d and K_g are Euclidean distances, the leading term of D²_n(X, Y) is the sum of the group-wise unbiased U-statistic type estimators of MDD²(Y^(j)|X) for 1 ≤ j ≤ q, scaled by τ_Y. In other words, the sample Euclidean distance covariance behaves as an aggregation of sample martingale difference divergences.

Theorem 5.4. Under the conditions of Theorem 5.2, as q → ∞ with p and n remaining fixed, we have

D²_n(X, Y) = (1/τ_Y) Σ_{j=1}^q MDD²_n(Y^(j)|X) + R_n,

where R_n is the remainder term such that R_n = o_p(1).

Remark 5.8. In particular, when both K_d and K_g are Euclidean distances and g_j = 1 for 1 ≤ j ≤ q, MDD²_n(Y_j|X) above is the unbiased U-statistic type estimator of MDD²(Y_j|X), defined as in (14) with d_X(x, x') = ‖x − x'‖ for x, x' ∈ R^p̃ and d_Y(y, y') = |y − y'|²/2 for y, y' ∈ R, respectively.

Now denote X_k = (X_{k(1)}, . . . , X_{k(p)}) and Y_k = (Y_{k(1)}, . . . , Y_{k(q)}) for 1 ≤ k ≤ n. Define the leading term of D²_n(X, Y) in equation (36) as L. It can be verified that L depends on the samples only through the double-centered distance matrices (d¹_kl)_{k,l=1}^n and (d²_kl)_{k,l=1}^n, respectively. As an advantage of using the double-centered distances, the entries corresponding to non-overlapping index pairs are uncorrelated for all 1 ≤ i, i' ≤ p; see (38).

Assumption 5.4. For fixed n, as p, q → ∞, the array (d¹_kl, d²_uv)_{k<l, u<v} converges in distribution to a limiting array {d¹_kl, d²_uv}_{k<l, u<v} which is jointly Gaussian. Further we assume that the limit is non-degenerate. In view of (38), we have cov(d¹_kl, d¹_uv) = cov(d²_kl, d²_uv) = cov(d¹_kl, d²_uv) = 0 for {k, l} ≠ {u, v}.

Theorem 5.3 states that for growing p and q and fixed n, D²_n(X, Y) and L are asymptotically equivalent. By studying the leading term, we obtain the limiting distribution of D²_n(X, Y) as follows.
Theorem 5.5. Under Assumptions 4.2, 5.3 and 5.4, for fixed n and p, q → ∞, D²_n(X, Y) converges in distribution to a quadratic form in the Gaussian limits from Assumption 5.4, where the associated matrix M is a projection matrix of rank ν = n(n−3)/2.

To perform independence testing, in the spirit of Székely and Rizzo (2014), we define the studentized test statistic T_n; see equation (39). The following theorem states the asymptotic distributions of the test statistic T_n under the null hypothesis H̃_0 : X ⊥⊥ Y and the alternative hypothesis H̃_A : X and Y are dependent.
Theorem 5.6. Under Assumptions 4.2, 5.3 and 5.4, for fixed n and p, q → ∞, T_n converges in distribution to the central t-distribution t_ν under H̃_0 and to a noncentral t-distribution with a random noncentrality parameter under H̃_A; the corresponding approximations to P(T_n > t) hold for any fixed real number t.

We next approximate the power of the test under the local alternative hypothesis H̃*_A when n is allowed to grow.

Proposition 5.1. Under H̃*_A, as n → ∞ and t = O(1), the power P(T_n > t) admits a large-sample normal approximation analogous to Proposition 4.2.

The following summarizes our key findings in this section.
• Advantages of our proposed metrics over the Euclidean dCov and HSIC: i) Our proposed dependence metrics completely characterize independence between X and Y in the low-dimensional setup, and can detect group-wise non-linear dependencies between X and Y in the high-dimensional setup. ii) We also showed that with p remaining fixed and q growing high, the Euclidean dCov can only quantify conditional mean independence of the components of Y given X (which is weaker than independence). To the best of our knowledge, this has not been pointed out in the literature before.
• Advantages over the marginal aggregation approach by Zhu et al. (2019) : i) In the low-dimensional setup, our proposed dependence metrics can completely characterize independence between X and Y , whereas the metric proposed by Zhu et al. (2019) can only capture pairwise dependencies between the components of X and Y .
ii) We provide a neater way of generalizing dCov and HSIC between X and Y, which is shown to be asymptotically equivalent to the marginal aggregation of cross-component distance covariances proposed by Zhu et al. (2019) as the dimensions grow high. Moreover, grouping or partitioning the two high-dimensional random vectors (which again may be problem-specific) allows us to detect a wider range of alternatives compared to only detecting component-wise non-linear dependencies, as the independence of two univariate marginals is implied by the independence of two higher-dimensional marginals containing them.
iii) The computational complexity of the (unbiased) squared sample D(X, Y ) is O(n 2 (p + q)).
Thus the computational cost of our proposed t-tests only grows linearly with the dimension, and therefore they are scalable to very high-dimensional data.

[Figure 2: summary of the limiting behaviors under the different asymptotic regimes; in particular, the proposed homogeneity metric behaves as a sum of group-wise MMDs with the kernels k_i, and the proposed dependence metric behaves as a sum of group-wise HSICs with the kernels k_i.]

6 Numerical studies

6.1 Testing for homogeneity of distributions
We investigate the empirical size and power of the tests for homogeneity of two high dimensional distributions. For comparison, we consider the t-tests based on the following metrics: I. E with ρ i as the Euclidean distance for 1 ≤ i ≤ p; II. E with ρ i as the distance induced by the Laplace kernel for 1 ≤ i ≤ p; III. E with ρ i as the distance induced by the Gaussian kernel for 1 ≤ i ≤ p; IV. the usual Euclidean energy distance; V. MMD with the Laplace kernel; VI. MMD with the Gaussian kernel.
We set d i = 1 in Examples 6.1 and 6.2, and d i = 2 in Example 6.3 for 1 ≤ i ≤ p.
Example 6.3. Consider X_k = (X_{k(1)}, . . . , X_{k(p)}) and Y_l = (Y_{l(1)}, . . . , Y_{l(p)}) with k = 1, . . . , n and l = 1, . . . , m and d_i = 2 for 1 ≤ i ≤ p. We generate i.i.d. samples from the following models. Note that for Examples 6.1 and 6.2, the metric defined in equation (15) essentially boils down to the special case in equation (17). We try small sample sizes n = m = 50, dimensions p = q = 50, 100 and 200, and β = 1/2. Table 4 reports the proportion of rejections out of 1000 simulation runs for the different tests. For tests V and VI, we chose the bandwidth parameter heuristically as the median distance between the aggregated sample observations. For tests II and III, the bandwidth parameters are chosen using the median heuristic separately for each group.
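The median heuristic used for tests II, III, V and VI is simple to implement; a sketch (for the two-sample tests the two samples are aggregated first):

```python
import numpy as np
from scipy.spatial.distance import pdist

def median_bandwidth(Z):
    """Median heuristic: bandwidth = median pairwise Euclidean distance."""
    return np.median(pdist(Z))

# e.g. for tests V and VI: gamma = median_bandwidth(np.vstack([X, Y]));
# for tests II and III, apply median_bandwidth to each group separately.
```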
In Example 6.1, the data generating scheme implies that X and Y are identically distributed. The results in Table 4 show that the tests based on the proposed homogeneity metrics and those based on the usual Euclidean energy distance and MMD perform more or less equally well, with rejection probabilities quite close to the 10% or 5% nominal level. In Example 6.2, clearly X and Y have different distributions but μ_X = μ_Y and Σ_X = Σ_Y. The results in Table 4 indicate that the tests based on the proposed homogeneity metrics are able to detect the differences between the two high-dimensional distributions beyond the first two moments, unlike the tests based on the usual Euclidean energy distance and MMD, and thereby outperform the latter in terms of empirical power.
In Example 6.3, clearly μ_X = μ_Y and tr Σ_X = tr Σ_Y, and the results show that the tests based on the proposed homogeneity metrics are able to detect the inhomogeneity of the low-dimensional marginal distributions, unlike the tests based on the usual Euclidean energy distance and MMD.

Remark 6.1. In Example 6.3.1, marginally the p-many two-dimensional groups of X and Y are not identically distributed, but each of the 2p unidimensional components of X and Y has an identical distribution. Consequently, choosing d_i = 1 for 1 ≤ i ≤ p leads to trivial power even for our proposed tests, as is evident from Table 5 below. This demonstrates that grouping allows us to detect a wider range of alternatives.

6.2 Testing for independence
We study the empirical size and power of tests for independence between two high-dimensional random vectors. We consider the t-tests based on the following metrics: I. D with d_i = 1 and ρ_i being the Euclidean distance for 1 ≤ i ≤ p; II. D with d_i = 1 and ρ_i being the distance induced by the Laplace kernel for 1 ≤ i ≤ p; III. D with d_i = 1 and ρ_i being the distance induced by the Gaussian kernel for 1 ≤ i ≤ p; IV. the usual Euclidean distance covariance; V. HSIC with the Laplace kernel; VI. HSIC with the Gaussian kernel.
The numerical examples we consider are motivated by Zhu et al. (2019).
For each example, we draw 1000 simulated datasets and perform tests for independence between the two variables based on the proposed dependence metrics, and the usual Euclidean dCov and HSIC. We try a small sample size n = 50 and dimensions p = 50, 100 and 200. For the tests II, III, V and VI, we chose the bandwidth parameter heuristically as the median distance between the sample observations. Table 6 reports the proportion of rejections out of the 1000 simulation runs for the different tests.
In Example 6.4, the data generating scheme suggests that the variables X and Y are independent.
The results in Table 6 show that the tests based on the proposed dependence metrics perform almost as well as the other competitors, with rejection probabilities quite close to the 10% or 5% nominal level. In Examples 6.5 and 6.6, the variables are clearly (componentwise non-linearly) dependent by virtue of the data generating scheme. The results indicate that the tests based on the proposed dependence metrics are able to detect the componentwise non-linear dependence between the two high-dimensional random vectors, unlike the tests based on the usual Euclidean dCov and HSIC, and thereby outperform the latter in terms of empirical power.

6.3 Real data analysis

6.3.1 Testing for homogeneity of distributions
We consider the two-sample testing problem of homogeneity of two high-dimensional distributions on the Earthquakes data. The dataset has been downloaded from the UCR Time Series Classification Archive.

6.3.2 Testing for independence
We consider the daily closed stock prices of p = 127 companies under the finance sector and q = 125 companies under the healthcare sector on the first dates of each month during the time period between January 1, 2017 and December 31, 2018. The data has been downloaded from Yahoo Finance via the R package 'quantmod'. At each time t, denote the closed stock prices of these companies from the two different sectors by X_t = (X_{1t}, . . . , X_{pt}) and Y_t = (Y_{1t}, . . . , Y_{qt}) for 1 ≤ t ≤ 24. We consider the stock returns S^X_t = (S^X_{1t}, . . . , S^X_{pt}) and S^Y_t = (S^Y_{1t}, . . . , S^Y_{qt}), where S^X_{it} := (X_{i(t+1)} − X_{it})/X_{it} and S^Y_{jt} := (Y_{j(t+1)} − Y_{jt})/Y_{jt} for 1 ≤ i ≤ p and 1 ≤ j ≤ q. It seems intuitive that the stock returns for companies under two different sectors are not totally independent, especially when a large number of companies is being considered. The tests based on the proposed dependence metrics deliver much smaller p-values compared to the tests based on the traditional metrics. We note that the tests based on the usual dCov and on HSIC with the Laplace kernel fail to reject the null at the 5% level, thereby indicating cross-sector independence of the stock return values. These results are consistent with the fact that the dependence among financial asset returns is usually nonlinear and thus cannot be fully characterized by traditional metrics in the high-dimensional setup. We present an additional real data example on testing for independence in high dimensions in Section C of the supplement. There the data admits a natural grouping, and our results indicate that our proposed tests for independence exhibit better power when we consider the natural grouping than when we consider unit group sizes. It is to be noted that considering unit group sizes makes our proposed statistics essentially equivalent to the marginal aggregation approach proposed by Zhu et al. (2019). This indicates that grouping or clustering might improve the power of testing, as it is capable of detecting a wider range of dependencies.
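For reproducibility, the monthly simple returns used above are a one-line transformation of the price matrix; a sketch (the array layout is our assumption):

```python
import numpy as np

def simple_returns(prices):
    """Simple returns S_{it} = (P_{i,t+1} - P_{i,t}) / P_{i,t}.
    `prices` has shape (T, d): rows are months, columns are companies;
    24 monthly prices yield 23 monthly returns (shape (T - 1, d))."""
    prices = np.asarray(prices, dtype=float)
    return (prices[1:] - prices[:-1]) / prices[:-1]
```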

7 Discussions
In this paper, we introduce a family of distances for high-dimensional Euclidean spaces. Built on the new distances, we propose a class of distance and kernel-based metrics for high-dimensional two-sample and independence testing. The proposed metrics overcome certain limitations of the traditional metrics constructed based on the Euclidean distance. The new distance we introduce corresponds to a semi-norm given by

‖x‖ = { Σ_{i=1}^p ρ_i(x^(i)) }^{1/2},

where ρ_i(x^(i)) = ρ_i(x^(i), 0_{d_i}) and x = (x^(1), . . . , x^(p)) ∈ R^p̃ with x^(i) = (x_{i,1}, . . . , x_{i,d_i}). Such a semi-norm has an interpretation based on a tree, as illustrated by Figure 3. The tree structure provides useful information for doing grouping at different levels/depths. Theoretically, grouping allows us to detect a wider range of alternatives. For example, in two-sample testing, the difference between two one-dimensional marginals is always captured by the difference between two higher-dimensional marginals that contain the two one-dimensional marginals. The same is true for dependence testing. Generally, one would like to find blocks which are nearly independent of each other, while the variables inside a block have significant dependence among themselves. It is interesting to develop an algorithm for finding the optimal groups using the data and perhaps some auxiliary information.
Another interesting direction is to study semi-norms and distances constructed based on a more sophisticated tree structure. For example, in microbiome-wide association studies, a phylogenetic or evolutionary tree is a branching diagram showing the evolutionary relationships among various biological species. Distance and kernel-based metrics constructed based on a distance utilizing the phylogenetic tree information are expected to be more powerful in signal detection. We leave these topics for future investigation.

The supplement is organized as follows. In Section A we explore our proposed homogeneity and dependence metrics in the low-dimensional setup. In Section B we study the asymptotic behavior of our proposed homogeneity and dependence metrics in the high dimension medium sample size (HDMSS) framework, where both the dimension(s) and the sample size(s) grow. Section C illustrates an additional real data example of testing for independence in the high-dimensional framework. Finally, Section D contains additional proofs of the main results in the paper and in Sections A and B of the supplement.

A Low-dimensional setup
In this section we illustrate that the new class of homogeneity metrics proposed in this paper inherits all the nice properties of generalized energy distance and MMD in the low-dimensional setting. Likewise, the proposed dependence metrics inherit all the desirable properties of generalized dCov and HSIC in the low-dimensional framework.

A.1 Homogeneity metrics
Note that in either Case S1 or S2, the Euclidean space equipped with the distance K is of strong negative type. As a consequence, we have the following result.

Theorem A.1. E(X, Y) ≥ 0, and E(X, Y) = 0 if and only if X =^d Y.

The following proposition shows that E_{n,m}(X, Y) is a two-sample U-statistic and an unbiased estimator of E(X, Y).
Proposition A.1. The U-statistic type estimator enjoys the following properties:

1. E_{n,m} is an unbiased estimator of the population E.

2. E_{n,m} admits the form of a two-sample U-statistic with a kernel h.

The following theorem shows the asymptotic behavior of the U-statistic type estimator of E for fixed p and growing sample sizes.
Theorem A.2. Under Assumption 4.5 and the assumptions that sup_{1≤i≤p} E ρ_i(X^(i), 0_{d_i}) < ∞ and sup_{1≤i≤p} E ρ_i(Y^(i), 0_{d_i}) < ∞, as m, n → ∞ with p remaining fixed, we have the following:

1. E_{n,m}(X, Y) →^{a.s.} E(X, Y).

2. When X =^d Y, E_{n,m} has degeneracy of order (1, 1), and

(nm/(n + m)) E_{n,m}(X, Y) →^d Σ_{k=1}^∞ λ_k (Z_k² − 1),

where {Z_k} is a sequence of independent N(0, 1) random variables and the λ_k's depend on the distribution of (X, Y).
Proposition A.1, Theorem A.1 and Theorem A.2 demonstrate that E inherits all the nice properties of generalized energy distance and MMD in the low-dimensional setting.

A.2 Dependence metrics
Note that Proposition 3.1 in Section 3 and Proposition 3.7 in Lyons (2013) ensure that D(X, Y) completely characterizes independence between X and Y, which leads to the following result: D(X, Y) ≥ 0, and D(X, Y) = 0 if and only if X ⊥⊥ Y.

The following proposition shows that D²_n(X, Y) is an unbiased estimator of D²(X, Y) and is a U-statistic of order four.

B High dimension medium sample size (HDMSS)

B.1 Homogeneity metrics
In this subsection, we consider the HDMSS setting where p → ∞ and n, m → ∞ at a slower rate than p. Under H 0 , we impose the following conditions to obtain the asymptotic null distribution of the statistic T n,m under the HDMSS setup.
Assumption B.1. As n, m and p → ∞, L(X, X') = O_p(α_p), where α_p is a positive real sequence such that τ_X α_p² = o(1) as p → ∞; further regularity conditions of a similar nature are assumed as n, p → ∞.

We refer the reader to Remark 4.1 in the main paper, which illustrates some sufficient conditions under which α_p = O(1/√p), and consequently τ_X α_p² = o(1) holds, as τ_X ≍ p^{1/2}. Along the lines of Remark D.1 in Section D of the supplementary material, it can be argued that E[R⁴(X, X')] = O(1/p⁴). If we further assume that Assumption 4.4 holds, then we have E[H²(X, X')] ≍ 1. Combining all of the above, it is easy to verify that Assumption B.1 holds under mild conditions. The following theorem illustrates the limiting null distribution of T_{n,m} under the HDMSS setup. We refer the reader to Section D of the supplement for a detailed proof.

B.2 Dependence metrics
In this subsection, we consider the HDMSS setting where p, q → ∞ and n → ∞ at a slower rate than p, q. The following theorem shows that, similar to the HDLSS setting, under the HDMSS setup D²_n is asymptotically equivalent to the aggregation of group-wise generalized dCov. In other words, D²_n(X, Y) can quantify group-wise nonlinear dependence between X and Y in the HDMSS setup as well.
Remark B.3. Following Remark 4.1 in the main paper, we can write L(X, X') = O_p(α_p) as p → ∞, with analogous rates β_q, γ_p and λ_q for the other L-terms. Therefore the expansion holds with R_n the remainder term satisfying R_n = O_p(τ_{XY}(α_p λ_q + γ_p β_q + γ_p λ_q)) = o_p(1), i.e., R_n is of smaller order compared to the leading term and hence is asymptotically negligible.
The following theorem states the asymptotic null distribution of the studentized test statistic T_n (given in equation (39) in the main paper) under the HDMSS setup.

C Additional real data example
We consider the monthly closed stock prices of p̃ = 33 companies under the oil and gas sector and q̃ = 34 companies under the transport sector over 24 consecutive months. Under the oil and gas sector, we have p = 18 countries or groups, the last being Argentina, with d = (5, 1, 2, 5, 4, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1). And under the transport sector, we have q = 14 countries or groups, viz. USA, Brazil, Canada, Greece, China, Panama, Belgium, Bermuda, UK, Mexico, Chile, Monaco, Ireland and Hong Kong, with g = (5, 1, 2, 6, 4, 1, 1, 3, 1, 3, 1, 4, 1, 1). At each time t, denote the closed stock prices of these companies from the two different sectors by X_t = (X_{1t}, . . . , X_{p̃t}) and Y_t = (Y_{1t}, . . . , Y_{q̃t}) for 1 ≤ t ≤ 24. We consider the stock returns S^X_t = (S^X_{1t}, . . . , S^X_{p̃t}) and S^Y_t = (S^Y_{1t}, . . . , S^Y_{q̃t}), where S^X_{it} := (X_{i(t+1)} − X_{it})/X_{it} and S^Y_{jt} := (Y_{j(t+1)} − Y_{jt})/Y_{jt}. The intuitive idea is that the stock returns of oil and gas companies should affect the stock returns of companies under the transport sector, and here both random vectors admit a natural grouping based on the countries. Table 9 shows the p-values corresponding to the different tests for independence between {S^X_t}_{t=1}^{23} and {S^Y_t}_{t=1}^{23}. The tests based on the proposed dependence metrics considering the natural grouping deliver much smaller p-values compared to the tests based on the usual dCov and HSIC, which fail to reject the null hypothesis of independence between {S^X_t}_{t=1}^{23} and {S^Y_t}_{t=1}^{23}. This makes intuitive sense, as the dependence among financial asset returns is usually nonlinear in nature and thus cannot be fully characterized by the usual dCov and HSIC in the high-dimensional setup. Table 10 shows the p-values corresponding to the different tests for independence when we disregard the natural grouping and consider d_i = 1 and g_j = 1 for all groups. Considering unit group sizes makes our proposed statistics essentially equivalent to the marginal aggregation approach proposed by Zhu et al. (2019). In this case the proposed tests have higher p-values than when we consider the natural grouping, indicating that grouping or clustering might improve the power of testing, as it is capable of detecting a wider range of dependencies.

D Technical Appendix
Proof of Proposition 3.1. To prove (1), note that if $d$ is a metric on a space $\mathcal{X}$, then so is $d^{1/2}$ (the triangle inequality for $d^{1/2}$ follows from the subadditivity $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$ for $a, b \ge 0$); it is then easy to see that $K$ is a metric on $\mathbb{R}^p$. To prove (2), note that $(\mathbb{R}^{d_i}, \rho_i)$ has strong negative type for $1 \le i \le p$. The rest follows from Corollary 3.20 of the reference cited in the main paper. ♦ Proof of Proposition A.1. It is easy to verify that $E_{n,m}$ is an unbiased estimator of $E$ and is a two-sample U-statistic with the kernel $h$. ♦ Proof of Theorem A.2. The first part of the proof follows from Theorem 1 in Sen (1977) and the observation that $E\big[|h| \log^+ |h|\big] \le E[h^2]$. The power mean inequality says that for $a_i \in \mathbb{R}$, $1 \le i \le n$, $n \ge 2$ and $r > 1$, $\big|\sum_{i=1}^n a_i\big|^r \le n^{r-1} \sum_{i=1}^n |a_i|^r$. Using the power mean inequality, it is easy to see that the stated moment assumptions (involving $\sup_{1 \le i \le p} E\,\rho_i$) imply the required integrability, unless the corresponding kernels are degenerate at 0 almost surely. Following Theorem 1.1 in Neuhaus (1977), we obtain the stated limit. Using $R$ and $L$ interchangeably with $R(X, X')$ and $L(X, X')$ respectively, we can write the displayed expansion, and it is clear that $R(X, X') = O_p(L^2(X, X'))$ provided that $L(X, X') = o_p(1)$. ♦ Proof of Theorem 4.1. Observe that $E[L(X, Y)] = E[L(X, X')] = E[L(Y, Y')] = 0$. The result then follows from Proposition 4.1 together with (42) and the stated assumption; the other terms can be handled in a similar fashion. ♦ Denote $L(X, Y)$ by $L$ and $R(X, Y)$ by $R$ for notational simplicity. Further assume that $E[\exp(t A_p)] = O\big((1 - \theta_1 t)^{-\theta_2 p}\big)$ for some $\theta_1, \theta_2 > 0$ with $\theta_2 p > 4$, uniformly over $t < 0$ (which is clearly satisfied when the $Z_i$'s are independent and $E[\exp(t Z_i)] \le a_1 (1 - a_2 t)^{-a_3}$ uniformly over $t < 0$ and $1 \le i \le p$, for some $a_1, a_2, a_3 > 0$ with $a_3 p > 4$). Under certain weak dependence assumptions, the displayed bounds can then be shown; similar arguments hold for $L(X, X')$ and $R(X, X')$, and for $L(Y, Y')$ and $R(Y, Y')$, as well.
Proof of Remark D.1. To prove the first part, define $L_p := \sqrt{p}\, L^2/(1 + L)$. Following Chapter 6 of Resnick (1999), it suffices to show that $\sup_p E[L_p^2] < \infty$. Towards that end, using Hölder's inequality we obtain the displayed bound. With $\sup_i E[Z_i^8] < \infty$ and under certain weak dependence assumptions, it can be shown that $E[(A_p - E A_p)^8] = O(p^4)$ (see, for example, Theorem 1 of the cited reference). Equation (3) in Cressie et al. (1981) states that for a non-negative random variable $U$ with moment-generating function $M_U(t) = E[\exp(tU)]$, one can write, for any positive integer $k$, $E[U^{-k}] = \Gamma(k)^{-1} \int_0^\infty t^{k-1} M_U(-t)\, dt$, provided both integrals exist. Using equation (45), the assumptions stated in Remark D.1 and basic properties of beta integrals, some straightforward calculations yield bounds involving positive constants $C_1$ and $C_2$, which clearly imply that $E[(1 + L)^{-2}] = O(1)$. Combining all of the above, we get from (43) that $E[L_p^2] = O(1/p)$, and therefore $\sup_p E[L_p^2] < \infty$, which completes the proof of the first part.
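The negative-moment identity quoted above can be recovered from the Gamma integral: for $u > 0$ and a positive integer $k$, one has $u^{-k} = \Gamma(k)^{-1} \int_0^\infty t^{k-1} e^{-tu}\, dt$, so by Tonelli's theorem,

$$E[U^{-k}] = \frac{1}{\Gamma(k)} \int_0^\infty t^{k-1}\, E[e^{-tU}]\, dt = \frac{1}{\Gamma(k)} \int_0^\infty t^{k-1} M_U(-t)\, dt,$$

the interchange being justified since the integrand is non-negative.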
To prove the second part, note that following the proof of Proposition 4.1 and Hölder's inequality we obtain the displayed bound. Following the arguments in the proof of the first part, we clearly have $E[L^8] = O(1/p^4)$ and $E[(1 + L)^{-4}] = O(1)$. From this and equation (47), it is straightforward to verify that $E[R^2] = O(1/p^2)$, which completes the proof of the second part. ♦ Proof of Lemma 4.1. To see (2), first observe that sufficiency is straightforward from equation (21) in the main paper. For necessity, denote $a = \operatorname{tr} \Sigma_X$, $b = \operatorname{tr} \Sigma_Y$ and $c = \|\mu_X - \mu_Y\|^2$. The displayed identity then implies the rest; see the worked step below.
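The concluding step can be reconstructed under the standard HDLSS approximation, in which the (suitably scaled) Euclidean energy distance converges to $2\sqrt{a + b + c} - \sqrt{2a} - \sqrt{2b}$; this explicit form is our assumption here, as the display itself is not recoverable from the text. Concavity and monotonicity of the square root give

$$\sqrt{2a} + \sqrt{2b} \le 2\sqrt{a + b} \le 2\sqrt{a + b + c},$$

with equality in the first inequality if and only if $a = b$, and in the second if and only if $c = 0$. Hence the limit vanishes if and only if $\operatorname{tr} \Sigma_X = \operatorname{tr} \Sigma_Y$ and $\mu_X = \mu_Y$, which is exactly the sense in which the Euclidean energy distance only detects the equality of means and of traces of covariance matrices.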
To see (1), again sufficiency is straightforward from equation (20) in the paper and the form of $K$ given in equation (15) in the paper. For necessity, first note that as $(\mathbb{R}^{d_i}, \rho_i)$ is a metric space of strong negative type for $1 \le i \le p$, there exist a Hilbert space $\mathcal{H}_i$ and an injective map $\phi_i$ such that $\rho_i(x, y) = \|\phi_i(x) - \phi_i(y)\|_{\mathcal{H}_i}^2$, where $\langle \cdot, \cdot \rangle_{\mathcal{H}_i}$ is the inner product defined on $\mathcal{H}_i$ and $\|\cdot\|_{\mathcal{H}_i}$ is the norm induced by the inner product (see Proposition 3 in Sejdinovic et al. (2013) for detailed discussions). Further, if $k_i$ is a distance-induced kernel induced by the metric $\rho_i$, then by Proposition 14 therein, $\mathcal{H}_i$ is the RKHS with reproducing kernel $k_i$, and $\phi_i$ is essentially the canonical feature map for $\mathcal{H}_i$, viz. $\phi_i : z \mapsto k_i(\cdot, z)$. From the displayed identities it follows that $2\tau - \tau_X - \tau_Y = 0$ holds if and only if (1) $\zeta = 0$, i.e., $E[\phi_i(X^{(i)})] = E[\phi_i(Y^{(i)})]$ for all $1 \le i \le p$, and (2) $\tau_X = \tau_Y$, where $\Pi_i$ is the mean embedding function (associated with the distance-induced kernel $k_i$) defined in Section 2.1, and $P_i$ and $Q_i$ are the distributions of $X^{(i)}$ and $Y^{(i)}$, respectively. As $\rho_i$ is a metric of strong negative type on $\mathbb{R}^{d_i}$, the induced kernel $k_i$ is characteristic to $\mathcal{M}_1(\mathbb{R}^{d_i})$ and hence the mean embedding $\Pi_i$ is injective. Therefore condition (1) above implies $X^{(i)} \stackrel{d}{=} Y^{(i)}$. ♦ Now we introduce some notation before presenting the proof of Theorem 4.2. The key to our analysis is to study the variance of the leading term of $E_{n,m}(X, Y)$ in the HDLSS setup, to propose a variance estimator, and to study the asymptotic behavior of that estimator. It will be shown later (in the proof of Theorem 4.2) that the leading term in the Taylor expansion of $E_{n,m}(X, Y)$ can be written as $L_1 + L_2$, with the $L_1^i$'s defined accordingly. By the double-centering properties, it is easy to see that the $L_1^i$, $1 \le i \le 3$, are uncorrelated. Define the $V_i$'s accordingly, and let $V$ denote the resulting total variance. We study the variances of the $L_1^i$, $1 \le i \le 3$, and propose suitable estimators. The variance of $L_1^2$ is $V_2$. From Theorem 5.3 in Section 5.1, we know that for fixed $n$ and growing $p$, $D_n^2(X, X)$ is asymptotically equivalent to the aggregated group-wise quantity displayed above; therefore an estimator of $V_2$ is given by $4\, D_n^2(X, X)$. Note that the computational cost of $D_n^2(X, X)$ is linear in $p$, while direct calculation of its leading term $\sum_{i, i' = 1}^{p} \mathrm{dCov}^2_{n;\, \rho_i, \rho_{i'}}(X^{(i)}, X^{(i')})$ requires computation quadratic in $p$; see the sketch below. Similarly, it can be shown that the variance of $L_1^3$ is $V_3$, and $V_3$ can be estimated by $4\, D_m^2(Y, Y)$. Likewise, some easy calculations show that the variance of $L_1^1$ is $V_1$. Define $\widehat{L}$ and $\widehat{R}$ as displayed. Let $A_i = \big(\rho_i(X_k^{(i)}, Y_l^{(i)})\big)_{k,l} \in \mathbb{R}^{n \times m}$, and let $\widetilde{A}_i$ denote its double-centered version. These observations suggest an unbiased estimator $\widehat{V}_1$ of $V_1$. However, the computational cost of $\widehat{V}_1$ is linear in $p^2$, which is prohibitive for large $p$. We therefore aim to find a statistic, based on a joint metric, whose computational cost is linear in $p$ and whose leading term is proportional to $\widehat{V}_1$. It can be verified that $\mathrm{cdCov}^2_{n,m}(X, Y)$ is asymptotically equivalent to such a quantity; this follows from Hölder's inequality together with the fact that $\tau^2 \widehat{R}^2(X_k, Y_l)$ is $O_p(\tau^2 a_p^4) = o_p(1)$ under Assumption 4.3. Therefore, we can estimate $V_1$ by $4\, \mathrm{cdCov}^2_{n,m}(X, Y)$. Thus the variance of $L_1$ is $V$, which can be estimated by $\widehat{V}$, obtained by combining the three estimators above.
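To make the computational contrast concrete, the following Python sketch (ours; the joint metric below is a hypothetical additive stand-in for the metric $K$ of equation (15), ignoring the square-root transform and scaling constants) computes a surrogate for $D_n^2(X, X)$ from a single U-centered joint distance matrix in $O(n^2 p)$ time, whereas aggregating all $p^2$ pairwise group statistics directly costs $O(n^2 p^2)$. It reuses `u_centered` from the sketch in Section B.2.

```python
import numpy as np
# u_centered as defined in the Section B.2 sketch.

def joint_dist(X, d_sizes):
    """One-pass joint distance matrix: within-group Euclidean distances,
    combined additively (a stand-in for the metric K). Cost: O(n^2 p)."""
    xs = np.split(X, np.cumsum(d_sizes)[:-1], axis=1)
    return sum(np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
               for x in xs)

def D2n_XX(X, d_sizes):
    """Surrogate for D^2_n(X, X): one U-centered joint distance matrix,
    versus O(n^2 p^2) for summing all p^2 pairwise group dCov^2 terms."""
    n = X.shape[0]
    A = u_centered(joint_dist(X, d_sizes))
    return (A * A).sum() / (n * (n - 3))
```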
Proof of Theorem 4.2. Using Proposition 4.1, some algebraic calculations yield the displayed expansion. By Assumption 4.3, $R_{n,m} = O_p(\tau a_p^2 + \tau_X b_p^2 + \tau_Y c_p^2) = o_p(1)$ as $p \to \infty$. Denote the leading term above by $L$. We can rewrite $L$ as $L_1 + L_2$, where $L_1$ and $L_2$ are defined in equations (48) and (49), respectively.
and $\mathcal{F}_{N,r} := \sigma(X_1, \dots, X_r)$. Then the leading term of $E_{n,m}(X, Y)$, viz. $L_1$ (see equation (48)), can be expressed in the displayed martingale form. By Corollary 3.1 of Hall and Heyde (1980), it suffices to show the following: (1) $\{S_{N,r}, \mathcal{F}_{N,r}\}$ is a sequence of zero-mean, square-integrable martingales; (2) the normalized sum of conditional variances converges in probability; and (3) a conditional Lindeberg-type condition holds. To show (1), it is easy to see that $S_{N,r}$ is square integrable with mean zero. If $j \le r < q$ and $i < j$, then $E(\phi_{ij} \mid \mathcal{F}_{N,r}) = \phi_{ij}$; if $i \le r < j \le q$, then $E(\phi_{ij} \mid \mathcal{F}_{N,r}) = 0$ (due to U-centering).
Therefore $E(S_{N,q} \mid \mathcal{F}_{N,r}) = S_{N,r}$ for $q > r$. This completes the proof of (1).
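For reference, the form of the martingale central limit theorem being invoked (Corollary 3.1 in Hall and Heyde (1980), stated here in the version we use) requires, for the martingale differences $\xi_{N,r} := S_{N,r} - S_{N,r-1}$, that

$$\sum_r E\big[\xi_{N,r}^2 \mid \mathcal{F}_{N,r-1}\big] \stackrel{p}{\longrightarrow} \sigma^2 \qquad \text{and} \qquad \sum_r E\big[\xi_{N,r}^2\, \mathbf{1}\{|\xi_{N,r}| > \epsilon\} \mid \mathcal{F}_{N,r-1}\big] \stackrel{p}{\longrightarrow} 0 \quad \text{for all } \epsilon > 0,$$

which are precisely the contents of conditions (2) and (3) verified below, the Lindeberg part being checked through a fourth-moment bound.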
To show (2), define $L_j(i, k) := E[\phi_{ij} \phi_{kj} \mid \mathcal{F}_{N, j-1}]$ for $i, k < j \le N$, and let $\eta_N$ denote the sum of the conditional variances. By virtue of Chebyshev's inequality, it will suffice to show that $\mathrm{var}(\eta_N / V) = o(1)$. In view of equation (62), it can be verified that the expression for $E[L_j(i, k)\, L_{j'}(i', k')]$ holds for $j = j'$ as well. Under Assumption 4.5 and $H_0$, the displayed bounds can be verified; therefore, under Assumption B.1 and $H_0$, we have $\mathrm{var}(\eta_N / V) = o(1)$, which completes the proof of (2). To show (3), note that it suffices to bound the normalized sum of fourth conditional moments. Under Assumption 4.5, the displayed bound holds; this, together with the observation from equation (65) and Assumption B.1, completes the proof of (3).
Finally, to see that $R_{n,m}/\sqrt{V} = o_p(1)$, note that from equation (59) we can derive, using the power mean inequality, that $E[R_{n,m}^2] \le C\, \tau^2\, E[R^2(X, X')]$ for some positive constant $C$. Using this, equation (66), Chebyshev's inequality and Hölder's inequality, we have, for any $\epsilon > 0$, the displayed bound for some positive constant $C'$. From this and Assumptions 4.5 and B.2, we get $R_{n,m}/\sqrt{V} = o_p(1)$, as $N \asymp n$. This completes the proof of the lemma. ♦ Lemma D.2. Under $H_0$ and Assumptions 4.5 and B.2, as $n$, $m$ and $p \to \infty$, the stated ratio convergences hold, where the $V_i$ and $\widehat{V}_i$, $1 \le i \le 3$, are defined in equations (50) and (58), respectively, of the supplementary material.
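This ratio consistency is what licenses studentization: writing $\widehat{V} = \widehat{V}_1 + \widehat{V}_2 + \widehat{V}_3$ and $V = V_1 + V_2 + V_3$ (our reading of equation (58)), Lemma D.2 gives $\widehat{V}/V = \sum_{i=1}^3 (\widehat{V}_i / V_i)(V_i / V) \stackrel{p}{\to} 1$, since the weights $V_i / V$ lie in $[0, 1]$ and sum to one. Hence, by Slutsky's theorem,

$$\frac{L_1}{\sqrt{\widehat{V}}} = \frac{L_1}{\sqrt{V}} \cdot \sqrt{\frac{V}{\widehat{V}}}$$

has the same limiting null distribution as $L_1/\sqrt{V}$.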
Using the power mean inequality and Jensen's inequality, it is not hard to verify that $E[\widehat{R}^4(X, X')] = O(E[R^4(X, X')])$. Using this and Hölder's inequality, we can bound $\tau^2 E[\widehat{R}^2(X, X')]$ and the cross terms $E\big[\rho_i(X_k^{(i)}, Y_l^{(i)})\, \rho_{i'}(X_k^{(i')}, Y_l^{(i')})\big]$, where the expression for $\widehat{R}(X_k, Y_l)$ is given in equation (54). Following equation (56), we can write the displayed decomposition. Therefore, in view of equations (50), (55) and (58), using the power mean inequality we obtain the required bound. Proof of Lemma D.3. Again we deal with $\widehat{V}_2$ first. To simplify notation, denote $A_{ij} = K(X_i, X_j)$ and let $\widetilde{A}_{ij} = \widetilde{D}^X_{ij}$, $1 \le i \ne j \le n$, denote its U-centered version. Observe that $\mathrm{var}\big(D_n^2(X, X)\big) = \mathrm{var}\big(\tfrac{1}{n(n-3)} \sum_{i \ne j} \widetilde{A}_{ij}^2\big)$. As in the proof of Lemma D.2, we can write the decomposition in which the four summands are uncorrelated with each other. Using the power mean inequality, it can be shown that the first summand, scaled by $V_2^2$, is $o(1)$ as $n, p \to \infty$, provided $\tfrac{1}{n^2} E[K^4(X, X')]$ is of smaller order, where $V_2$ is defined in equations (51) and (52). Note that $\bar{K}(X, X') = \tau_X \big(2 \bar{L}(X, X') + \bar{R}(X, X')\big)$.
Using the power mean inequality, we can write the displayed bound. With this and Hölder's inequality, it can be verified that when $\{i, j\} \cap \{i', j'\} = \emptyset$, the leading term of $\mathrm{cov}\big(\widetilde{A}_{ij}^2, \widetilde{A}_{i'j'}^2\big)$ is $O\big(\tfrac{1}{n^2} E[\bar{A}_{ij}^4]\big)$. Therefore the third summand in equation (71), scaled by $V_2^2$, can be argued to be $o(1)$ along similar lines to the argument for the first summand in equation (71).
For the second summand in equation (71), along similar lines we can argue that its leading term admits an analogous bound, and therefore the leading term of the scaled variance is $o(1)$. For the second term above, using the power mean inequality we can write the displayed bound; the claims then follow from the corresponding results in the cited references (in particular, Theorem 2.7) and the parallel U-statistics theory. To see that the remainder term is negligible, note that the displayed bound holds under Assumption 5.2. Proof of Theorem 5.2. The proof is essentially similar to that of Theorem 5.1. Note that, using Proposition 4.1, we can write the analogous expansion. Under the assumption that $E[R^2(Y, Y')] = O(b_q^4)$, using Hölder's inequality it is easy to see that the remainder term is asymptotically negligible.