Asymptotic Normality of Gini Correlation in High Dimension with Applications to the K-sample Problem

The categorical Gini correlation proposed by Dang et al. is a dependence measure that characterizes independence between a categorical and a numerical variable. The asymptotic distributions of the sample correlation under dependence and under independence have been established when the dimension of the numerical variable is fixed. However, its asymptotic behavior for high-dimensional data has not been explored. In this paper, we develop a central limit theorem for the Gini correlation in the more realistic setting where the dimensionality of the numerical variable is diverging. We then construct a powerful and consistent test for the $K$-sample problem based on the asymptotic normality. The proposed test not only avoids the computational burden of permutation but also gains power over the permutation procedure. Simulation studies and real data illustrations show that the proposed test is highly competitive with existing methods across a broad range of realistic situations, especially in unbalanced cases.


Introduction
Recently, Dang et al. [7] proposed the categorical Gini covariance and correlation, gCov(X, Y) and gCor(X, Y), to measure the dependence between a p-variate numerical variable X and a categorical variable Y. Suppose that the categorical variable Y takes values L_1, ..., L_K and its distribution P_Y is P(Y = L_k) = p_k > 0 for k = 1, 2, ..., K. X is from F and ψ denotes its characteristic function. Assume that the conditional distribution of X given Y = L_k is F_k with the corresponding characteristic function ψ_k. The Gini covariance is defined as

gCov(X, Y) = Σ_{k=1}^K p_k ∫_{R^p} |ψ_k(t) − ψ(t)|² / (c(p)‖t‖^{p+1}) dt, (1.1)

where c(p) = π^{(p+1)/2}/Γ((p+1)/2) is a known constant. The Gini covariance measures the dependence of X and Y by quantifying the difference between the conditional and the unconditional characteristic functions. The corresponding Gini correlation standardizes the Gini covariance to have range [0, 1]. The Gini covariance or correlation is zero if and only if X and Y are independent. Another dependence measure that can characterize independence between two random variables is the popular distance correlation proposed by Székely, Rizzo and Bakirov [27]. It is flexible for X and Y of arbitrary dimensions and of any type (numerical or categorical), and it has attracted much attention since then; see, e.g., [11, 19, 20, 28, 29, 30, 31, 35, 36] and the references therein. In the case of p-variate X and categorical Y, the distance covariance becomes

dCov(X, Y) = Σ_{k=1}^K p_k² ∫_{R^p} |ψ_k(t) − ψ(t)|² / (c(p)‖t‖^{p+1}) dt. (1.2)

Comparing (1.1) and (1.2), we see that the two covariances are closely related.
When the categorical variable Y takes two values (K = 2) or P_Y is uniform, they differ only by a scaling factor [7]. For general K ≥ 3 and unbalanced P_Y, however, the Gini covariance is a better dependence measure than the distance covariance: the weight p_k in (1.1) respects the nature of the categorical variable, while dCov(X, Y), because of its squared weights, is dominated by the classes with large probabilities, and the contribution from smaller classes is substantially reduced.
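The effect of the squared weights can be seen with a toy unbalanced P_Y (the numbers below are ours, chosen for illustration): after normalization, dCov gives the largest class a far bigger share of the total weight than gCov does.

```python
# Toy illustration (not from the paper): relative weight each class receives
# in gCov (weights p_k) versus dCov (weights p_k^2) for an unbalanced P_Y.
p = [0.7, 0.2, 0.1]  # class probabilities, chosen for illustration

gcov_w = [pk / sum(p) for pk in p]                    # gCov weights
dcov_w = [pk**2 / sum(q**2 for q in p) for pk in p]   # dCov weights, renormalized

print([round(w, 3) for w in gcov_w])  # [0.7, 0.2, 0.1]
print([round(w, 3) for w in dcov_w])  # [0.907, 0.074, 0.019]
```

Under dCov the smallest class keeps under 2% of the weight, versus 10% under gCov.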
A fruitful line of research has studied the asymptotic distributions of the sample distance statistics in different scenarios. Under independence of X ∈ R^p and Y ∈ R^q, the standard sample distance covariance or correlation converges in distribution to a mixture of chi-squared distributions in the classical setting where the sample size n → ∞ and p, q are fixed [27, 15]. In the high-dimension-low-sample-size (HDLSS) setting, where p, q → ∞ and n is fixed, Székely and Rizzo [30] derived a t-distribution limit of the unbiased sample covariance by assuming that the components of the high-dimensional vectors in X and in Y are exchangeable; Zhu et al. [35] relaxed that assumption and considered the high-dimension-medium-sample-size (HDMSS) setting, where n, p, q → ∞ with p, q growing more rapidly than n; Gao et al. [11] developed central limit theorems in a more realistic high-dimension-high-sample-size (HDHSS) setting, where n and p + q diverge in an arbitrary fashion. This last result applies to the sample distance covariance of (1.2), in which q = 1 and n, p → ∞. However, there is no literature on the limiting distributions of the sample Gini covariance and correlation in high dimension.
In the classical setting, Dang et al. [7] studied the asymptotic distributions of the V-statistic sample Gini covariance and correlation with the dimension of X fixed. They admit normal limits when X and Y are dependent and converge in distribution to a quadratic form of centered Gaussian random variables when X and Y are independent. In this paper, we work with the unbiased U-statistic covariance estimator and the associated sample Gini correlation in high dimension. The first objective of this paper is to establish the asymptotic distributions of the sample Gini covariance and correlation under independence of X and Y in the HDHSS setting.
The derived asymptotic distributions can be used for an independence test. Testing this independence amounts to testing the equality of the K conditional distributions, which is the classical K-sample problem encountered in almost every research field. Owing to its fundamental importance and wide applications, research on the K-sample problem has remained active since the 1940s. For example, the widely used and well-studied tests such as the Cramér-von Mises test [17, 6], the Anderson-Darling test [8, 25] and their variations utilize different norms on the difference of empirical distribution functions, while some tests [1, 21] are based on the comparison of density estimators when the underlying distributions are continuous. Other tests [26, 10] are based on the difference between characteristic functions. Indeed, the test in [26] is equivalent to ours, but it only considers the case K = 2. Another equivalent test is DISCO [23], whose test statistic is the ratio of the between Gini variation to the within Gini variation, while our Gini correlation is the ratio of the between Gini variation to the total Gini variation. Heller, Heller and Gorfine [13] and Heller et al. [14] proposed dependence tests based on rank distances. All those distance-based tests require a permutation procedure to determine the critical values. Sang, Dang and Zhao [24] developed a nonparametric test that applies the jackknife empirical likelihood and has a standard limiting chi-squared distribution. Other tests viewing the K-sample test as an independence test between a numerical and a categorical variable can be found in [4, 16, 33]. However, most of the aforementioned works focus on the fixed-dimensional case and perform poorly, or may even fail, in high dimension.
Recently, several distance-based tests for the two-sample problem in high dimension have been proposed; see [3, 5, 19, 36]. Li [19] constructed a test based on interpoint distances under HDLSS. Zhu and Shao [36] studied the two-sample problem using the energy distance (ED) and the maximum mean discrepancy with Gaussian and Laplacian kernels under HDLSS and HDMSS, and showed that all these tests are inconsistent under some scenarios. General K-sample testing in high dimension is more challenging, and results in the literature are very scarce. Mukhopadhyay and Wang [22] constructed a graph-based nonparametric approach under HDLSS; however, the power of their test is extremely low under some settings. Gao et al. [11] tested the K-sample problem in high dimension based on the distance correlation.
The second objective of this paper is to use the asymptotic result for the Gini covariance or correlation to construct a new consistent K-sample test in high dimension. The advantages of the new test include its computational efficiency, since no permutation is needed, and its power gains in unbalanced cases. Throughout this paper, unless mentioned otherwise, the letter C, with or without a subscript, denotes a generic positive finite constant whose exact value is independent of the sample sizes and may change from line to line. The remainder of the paper is organized as follows. In Section 2, we briefly review another representation of the Gini correlation and the existing statistical inference, then present a U-estimator for the Gini correlation and the central limit theorem for the U-estimator when both the sample sizes and the dimensionality are diverging. The K-sample test is proposed and its consistency is established. In Section 3, we conduct simulation studies to evaluate the performance of the proposed test. A real data analysis is given in Section 4 to compare the proposed test with existing approaches. We conclude and discuss future work in Section 5. All technical proofs are provided in the Appendix.

Categorical Gini correlation
The Gini covariance between X and Y defined in (1.1) can be represented in terms of multivariate Gini mean differences (GMD). Let (X_1, X_2) be an independent pair from F and (X_1^(k), X_2^(k)) an independent pair from F_k. Define Δ = E‖X_1 − X_2‖ as the GMD of F and Δ_k = E‖X_1^(k) − X_2^(k)‖ as the GMD of F_k. Then

gCov(X, Y) = Δ − Σ_{k=1}^K p_k Δ_k, (2.1)

and the Gini correlation is

gCor(X, Y) = (Δ − Σ_{k=1}^K p_k Δ_k)/Δ. (2.2)

This representation allows a nice interpretation: the Gini covariance is the between-group Gini variation, and the Gini correlation is the ratio of the between variation to the total variation. Also, from this representation it is straightforward to obtain sample estimators. Dang et al. [7] used V-statistic estimators and derived the limiting distributions of the estimators in the classical setting where the dimension of X is fixed. More specifically, given a sample, plugging the V-statistic estimates of Δ and the Δ_k's into (2.2) yields the estimator ρ̂_g(X, Y). Under the assumption E‖X‖² < ∞ with p fixed and n → ∞, ρ̂_g(X, Y) has the limiting distributions below.

imsart-ejs ver. 2020/08/06 file: High_dimensional_Gini_correlation_Final.tex date: April 19, 2023
Under independence of X and Y, ρ̂_g(X, Y) converges to a quadratic form of normal random variables. This result is difficult to apply directly to the independence test, and hence one has to rely on a permutation procedure to determine a critical value for the test, which is computationally expensive.
This result was obtained under the classical setting. Inference for the Gini correlation in high dimension has not been explored, and we fill this gap by developing the asymptotic distributions when both the sample sizes and the dimensionality diverge to infinity.

U-estimators and projection representation
When the dimension p is large, the V-statistic Gini covariance and correlation estimators may suffer from bias. Therefore, we estimate the GMDs by unbiased U-statistics, denoted U_n and U_{n_k} in (2.4), and the Gini covariance and correlation are estimated by gCov_n(X, Y) in (2.5) and gCor_n(X, Y), respectively. Both are functions of the U-statistics U_n and the U_{n_k}'s. We shall focus on the asymptotic distribution of gCov_n(X, Y); an application of Slutsky's theorem then yields the result for gCor_n(X, Y) immediately. Under independence of X and Y, the sample Gini covariance gCov_n in (2.5) is a linear combination of U-statistics with first-order degeneracy. By the classical theory of U-statistics in the fixed-dimensional asymptotics (fixed dimension with sample sizes diverging to infinity), a non-normal limiting distribution holds, a result similar to (2.3). However, as both the dimension and the sample sizes grow large, the degenerate U-statistic admits a normal limit. To establish this result, we first take decompositions of the U-statistics in (2.4) and rewrite (2.5).
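The U-statistic estimators above can be sketched directly from the GMD representation gCov = Δ − Σ_k p_k Δ_k. The following minimal Python sketch (function names are ours, not from the paper) computes the unbiased GMD estimates and the resulting gCov_n and gCor_n:

```python
import numpy as np

def gmd(x):
    """U-statistic estimate of the Gini mean difference E||X1 - X2||."""
    n = len(x)
    s = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            s += np.linalg.norm(x[i] - x[j])
    return 2.0 * s / (n * (n - 1))

def gini_cov_cor(x, y):
    """Sample Gini covariance and correlation between numeric x and labels y."""
    x, y = np.asarray(x, float), np.asarray(y)
    n = len(y)
    delta = gmd(x)                         # total Gini variation
    within = 0.0
    for lev in np.unique(y):
        xk = x[y == lev]
        within += (len(xk) / n) * gmd(xk)  # sum_k p_k * Delta_k
    gcov = delta - within                  # between Gini variation
    return gcov, gcov / delta

# Toy data with a clear group difference (our own construction).
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0, 1, (50, 5)), rng.normal(2, 1, (50, 5))])
y = np.repeat([0, 1], 50)
gcov, gcor = gini_cov_cor(x, y)
```

With a genuine mean shift between the two groups, gcov is positive and gcor lies strictly between 0 and 1.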
By the Hoeffding decomposition of U_n, the quantity

d(X_1, X_2) = ‖X_1 − X_2‖ − E(‖X_1 − X_2‖ | X_1) − E(‖X_1 − X_2‖ | X_2) + E‖X_1 − X_2‖ (2.7)

is called the double-centered distance; it is the second-order centered projection of the kernel function of U_n. The U-statistics U_{n_k} admit analogous decompositions. Under independence of X and Y, we can then represent (2.5) as (2.8). The representation (2.8) has advantages over (2.5) due to the appealing orthogonality properties of d(X_1, X_2) stated in Lemmas 6.1 and 6.2 in the Appendix. Those properties largely simplify the calculation of the specific moments involved.
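Empirically, double centering has a simple plug-in analogue: subtracting the row means and column means of the pairwise distance matrix and adding back the grand mean leaves a matrix that is centered in each argument. A minimal sketch (our own construction, mirroring (2.7) empirically):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(30, 10))

# Pairwise Euclidean distance matrix a_ij = ||x_i - x_j||.
a = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)

# Empirical double centering: plug-in analogue of
# d(x1, x2) = ||x1-x2|| - E(||x1-X2|| | x1) - E(||X1-x2|| | x2) + E||X1-X2||.
d = a - a.mean(axis=1, keepdims=True) - a.mean(axis=0, keepdims=True) + a.mean()

# Each column (and row) of d averages to zero, up to floating-point error.
col_center = np.abs(d.mean(axis=0)).max()
```

This exact centering is what makes the orthogonality calculations in Lemmas 6.1 and 6.2 tractable.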

Asymptotic normality
We study the asymptotic distributions of the U-estimators in this section. Let X_1, X_2, X_3 and X_4 be i.i.d. copies of X. The following conditions will be needed to facilitate the proofs.
Remark 2.1. Our conditions C2 and C3 correspond to conditions (18) and (19) in [11] with τ = 1. In fact, condition C2 can be weakened; however, it is hard to check the weakened condition when 0 < α < 1, so we adopt the stronger but simpler condition.
Applying the martingale central limit theorem, we establish the limiting distribution of the sample Gini covariance in the following theorem.
Theorem 2.1. Under independence of X and Y, and conditions C1-C3, the suitably standardized gCov_n(X, Y) converges in distribution to the standard normal as the sample sizes and the dimension p diverge. Theorem 2.1 reveals that a degenerate U-statistic admits a normal limit due to the high dimensionality. This is an inspiring observation for other high-dimensional problems whose quantities of interest can be estimated by U-statistics.
To make the inference feasible, we need to estimate the asymptotic variance, in which V²_n(X) is the bias-corrected estimator of the squared distance variance in [30]. Theorem 2.2. Under independence of X and Y, and conditions C1-C3, the variance estimator is ratio-consistent. The estimators in (2.4) are U-statistics and hence ratio-consistent, with ∆̂/∆ → 1 in probability. Applying Slutsky's theorem, we obtain the CLT for the Gini correlation.
Corollary 2.1. Under independence of X and Y, and conditions C1-C3, the standardized sample Gini correlation is asymptotically standard normal. From the result of (2.3) and Corollary 2.1, we see that when X and Y are independent, as the dimensionality of the numerical variable diverges and under some conditions on the fourth moment, the complicated quadratic form of normal distributions converges to a normal distribution.
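The phenomenon behind Corollary 2.1, a centered chi-square mixture becoming Gaussian as the number of terms grows, can be seen in a toy Monte Carlo (equal eigenvalues and all sample counts here are our own choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

def standardized_quadratic_form(m, reps=20000):
    """Draws of sum_i lam_i (Z_i^2 - 1), standardized to unit variance."""
    lam = np.ones(m)                    # equal eigenvalues for illustration
    z = rng.standard_normal((reps, m))
    q = (lam * (z**2 - 1)).sum(axis=1)
    return q / np.sqrt(2 * (lam**2).sum())

# The sample skewness of the standardized quadratic form shrinks toward 0
# as the number of terms m grows, i.e., the law approaches N(0, 1).
skews = {}
for m in (1, 5, 50, 500):
    q = standardized_quadratic_form(m)
    skews[m] = float(((q - q.mean())**3).mean())
```

For a single chi-square term the skewness is large (about 2.8 in theory); with many balanced terms it is close to zero, matching the normal limit.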

High-dimensional K-sample test
These established CLTs can be applied to test the independence of X and Y. We use the CLT for the Gini covariance; the test based on the Gini correlation is asymptotically equivalent.
The independence test is stated as (2.10). Note that the null hypothesis of the test in (2.10) is equivalent to the null of the K-sample test of the equality of F_1, ..., F_K. In the K-sample setting, we view each sample point X_i^(k) from the k-th sample as an observation of X paired with the categorical value Y = L_k. For K = 2, the two-sample problem, the proposed test is asymptotically equivalent to the test based on the distance covariance because gCov(X, Y) = dCov(X, Y)/dCov(Y, Y); this is the result of Remark 9 in [7]. Hence the two test statistics estimate the same population quantity. They are also asymptotically equivalent to Székely's energy test [26, 2], which is based on the energy statistic between F_1 and F_2.
Theorem 2.1 allows us to avoid the computational burden of permutation tests. As demonstrated in the simulations, the test based on the limiting normality is also more powerful than the permutation tests. Based on the power function of the proposed test, its consistency is established in the theorem below.
Theorem 2.3. For any alternative H_1 satisfying conditions C1 and C4, as min{n_1, n_2, ..., n_K} → ∞, p_k > 0 and p → ∞, the power tends to one. Condition C1 is the usual assumption of a finite fourth moment. Condition C4, √n gCov(X, Y) → ∞, requires that the dependence between X and Y not be too weak; one might also state a corresponding local alternative. The proposed test is able to detect the dependence under H_1 with power going to 1 as the sample sizes increase.

Simulation study
In this section, we conduct three simulation studies to verify the theoretical properties of the standardized Gini covariance statistic and to compare its performance in K-sample tests with that of other methods.

Limiting normality
We generate K independent samples from the same multivariate normal distribution and compute the standardized Gini covariance statistic. The procedure is repeated 5000 times. The setup parameters are listed below.
For each dimension p, the histogram of the 5000 standardized Gini covariance statistics is plotted in Figure 1. The kernel density estimation (KDE) curve and the standard normal density curve are added to the histogram to visualize the closeness between the empirical and asymptotic density functions. For p = 5 in Figure 1(a), the histogram is slightly right-skewed and there is some discrepancy between the KDE and the normal curve; as the dimension increases, the discrepancy diminishes, as shown in Figure 1(b)-(d). We also compute the maximum pointwise distance between the KDE and the normal density as a measure of discrepancy in Table 1. It is clear that the difference decreases with the dimensionality. Gao et al. [11] developed the limiting normal distribution for the distance correlation, so we also report the corresponding maximum pointwise distance for the distance measure. Compared with the scaled distance covariance statistic, the Gini statistic has a better normal approximation in each dimension.
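A stripped-down version of this null simulation can be sketched as follows. Since the paper's variance estimator (2.9) is not reproduced here, we standardize the replicated statistics by their Monte Carlo mean and standard deviation; the sample sizes, dimension and replication count below are our own choices:

```python
import numpy as np

rng = np.random.default_rng(3)

def gcov_n(x, y):
    """U-statistic Gini covariance: pooled GMD minus weighted within-class GMDs."""
    def gmd(z):
        n = len(z)
        a = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=2)
        return a.sum() / (n * (n - 1))  # off-diagonal average (diagonal is 0)
    n = len(y)
    return gmd(x) - sum((np.sum(y == k) / n) * gmd(x[y == k]) for k in np.unique(y))

# K = 3 equal groups drawn from the SAME distribution (the null).
n, p, reps = 30, 100, 300
y = np.repeat([0, 1, 2], n // 3)
stats = np.array([gcov_n(rng.standard_normal((n, p)), y) for _ in range(reps)])
t = (stats - stats.mean()) / stats.std()  # Monte Carlo standardization
skew = float((t**3).mean())               # modest when p is large
```

The histogram of t should look roughly Gaussian for large p, echoing Figure 1.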

Size and power in K-sample tests
In this simulation, we compare five methods for the K-sample problem, two of which are permutation tests. The permutation test based on the distance covariance in high dimension has been studied in [36] for K = 2; here we examine both permutation tests for the K-sample problem in high dimension. The five methods are:
• gCov: our proposed method using the rescaled Gini covariance statistic with the normal percentile as the critical value.
• gCov-perm: the permutation test using the Gini covariance statistic. This test is asymptotically equivalent to the one-way DISCO method [23].
• dCov: the method using the rescaled distance covariance statistic with the percentile of the standard normal as the critical value [11].
• dCov-perm: the permutation test using the distance covariance statistic.
• GLP: the graphic LP polynomial basis function method proposed in [22].
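The gCov-perm benchmark can be sketched in a few lines (a minimal permutation test on the Gini covariance statistic; the function names and toy data are ours):

```python
import numpy as np

def gcov_stat(x, y):
    """Gini covariance statistic: pooled GMD minus p_k-weighted within-class GMDs."""
    def gmd(z):
        n = len(z)
        a = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=2)
        return a.sum() / (n * (n - 1))
    n = len(y)
    return gmd(x) - sum((np.sum(y == k) / n) * gmd(x[y == k]) for k in np.unique(y))

def gcov_perm_test(x, y, n_perm=200, seed=0):
    """Permutation p-value for the K-sample null of equal distributions."""
    rng = np.random.default_rng(seed)
    obs = gcov_stat(x, y)
    null = [gcov_stat(x, rng.permutation(y)) for _ in range(n_perm)]
    return (1 + sum(s >= obs for s in null)) / (1 + n_perm)

# Toy two-sample data with a mean shift (our own construction).
rng = np.random.default_rng(4)
x = np.vstack([rng.normal(0, 1, (20, 30)), rng.normal(1, 1, (20, 30))])
y = np.repeat([0, 1], 20)
pval = gcov_perm_test(x, y)
```

With a clear mean shift in every coordinate, the observed statistic exceeds the permutation draws and the p-value is small.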
Example 2. Generate samples of X^(1), X^(2) and X^(3). We conduct 1000 simulations. The size and power of each test are computed and reported in Table 2; the column β = 0.0 corresponds to the size of the tests. Several observations can be drawn. All tests maintain the nominal level of 5% quite well. Permutation tests are slightly less powerful than their corresponding counterparts. The GLP test is inferior to the others in all cases. In the equal-size case, the Gini method gCov produces almost the same size and power as dCov, an expected result since the Gini covariance and the distance covariance are asymptotically equivalent there. In the unbalanced cases, our Gini method gains a 1%-6% power advantage over the distance-based one. An intuitive interpretation of this advantage is that gCov is a better measure than dCov for unbalanced distributions, as stated in the Introduction.
Example 3. Let Z_k = (Z_k1, Z_k2, ..., Z_kp)^T − 1_p, where for k = 1, 2, 3 and j = 1, ..., p, the Z_kj's are i.i.d. from Exp(1). Although the distributions are not elliptically symmetric, the patterns and observations from this simulation are very similar to those in Example 2 for all tests but GLP. We present the results in Figure 2. GLP seems sensitive to the asymmetry of the distributions, not only in terms of performance but also in terms of computation. The GLP algorithm includes an intermediate K-means clustering step, and that step occasionally stops, especially for unbalanced sample sizes. GLP is slightly oversized and its power is extremely low.

Real data analysis
Two data sets from the UCI machine learning repository [9] are studied with K-sample tests.

LSVT voice rehabilitation data
The first data set is the LSVT Voice Rehabilitation data. After speech rehabilitation treatments for Parkinson's disease, 126 patients were evaluated on 310 attributes; refer to [32] for details of the data set and the dysphonia measure attributes. Phonations of 42 patients were evaluated as 'acceptable', while 84 patients had 'unacceptable' phonations. This data set has dimension larger than the sample size. Our goal is to test whether or not the phonation features have the same distribution in the 'acceptable' and 'unacceptable' groups, which is a K = 2 sample problem. Before we perform the test, we do some exploratory analysis to visualize the data in the original high-dimensional space and projected into a low-dimensional space. A heatmap of all 310 variables is plotted in Figure 3(a), in which the values are centered and scaled by each column variable. The top third of the rows correspond to the acceptable group and the bottom two thirds to the unacceptable group. It is quite difficult to see differences between the two groups. However, the difference shows in the heatmap of 12 selected variables in Figure 3(b); the selected 12 variables are those whose categorical Gini correlation with the group label is greater than 0.1.
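A screening rule of this kind, keeping the features whose marginal categorical Gini correlation exceeds 0.1, can be sketched as follows. The data below are a synthetic stand-in for the LSVT data, with only the first three features informative:

```python
import numpy as np

def gini_cor_1d(x, y):
    """Categorical Gini correlation of a single numeric feature x with labels y."""
    def gmd(z):
        n = len(z)
        return np.abs(z[:, None] - z[None, :]).sum() / (n * (n - 1))
    n = len(y)
    total = gmd(x)
    within = sum((np.sum(y == k) / n) * gmd(x[y == k]) for k in np.unique(y))
    return (total - within) / total

# Synthetic stand-in: 20 features, only the first 3 carry a group difference.
rng = np.random.default_rng(5)
y = np.repeat([0, 1], 60)
X = rng.standard_normal((120, 20))
X[y == 1, :3] += 1.5                      # group shift in the first 3 features

selected = [j for j in range(20) if gini_cor_1d(X[:, j], y) > 0.1]
```

The screening recovers the informative features; for pure-noise features the sample Gini correlation stays far below the 0.1 threshold.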
We also conduct principal component analysis (PCA) on all variables. The proportions of variance of the first two principal components (PCs) are 32.29% and 19.87%, together accounting for 52.16% of the total variance. The data projected on the plane of the first two PCs are shown in the left panel of Figure 3(c), in which several patients with unacceptable evaluations are clear outliers. We also plot the projection on the first two PCs when PCA is conducted on the 12 selected variables in Figure 3(d); there, the unacceptable group tends to have larger values in the first PC. After a simple feature selection to reduce the dimensionality, the separation of the two groups is more evident. Next, we perform formal tests of the equality of the two group distributions, both on all 310 variables and on the 12 selected variables.
Besides the five methods considered in Section 3.2, five two-sample test methods are added for comparison. Three methods are proposed in [19] and denoted Li-loc, Li-scal and Li-both. Székely's energy test statistic in high dimension is also studied in [19]; it is asymptotically normally distributed and equivalent to gCov and dCov, but its variance estimation in [19] is different and quite complicated, and we include it to compare the efficiency of the variance estimation. The last method, denoted BG, is proposed by Biswas and Ghosh [3]. The p-values of these ten methods are reported in Table 3. With feature selection to reduce the dimension, all methods except Li-scal strongly reject the equality of the two distributions. For the full high-dimensional data, three methods, GLP, Li-scal and BG, fail to conclude that the two groups have different distributions. The gCov and dCov methods provide the most significant evidence of differences between the two groups.

Arcene data
The second data set we consider is the Arcene mass-spectrometric data on 900 patients from a cancer group and a healthy group. The data set was merged from three sources of ovarian cancer and prostate cancer data. Preprocessing steps of limiting the mass range, averaging the technical repeats, removing the baseline, smoothing, rescaling and aligning the spectra were applied to reduce disparities between data sources. The Arcene data have 10000 features, comprising 7000 real features and 3000 random probes; the dimension is much higher than the sample size. The data were formatted for benchmarking variable selection algorithms for the two-class classification problem at NIPS 2003, a top conference on machine learning and computational neuroscience. The data were partitioned into training, validation and test sets: the training and validation sets each have 44 cancer positives and 56 negatives, while the test set has 310 positives and 390 negatives. Refer to [12] for details of the data preparation and the NIPS challenge results. Rather than conducting a two-sample test, we perform a three-sample test of the distributional equality of the training, validation and test data. That equality is the assumption and the logic behind the procedure of using training data to build a model, validation data to select a model and test data to assess the model. P-values of the five methods are reported in Table 4. Only GLP rejects the equality, with p-value 0.0389, while the other four methods, with large p-values, support the distributional equality assumption that makes the data mining challenge valid.

Conclusions and future work
The categorical Gini correlation is an alternative to the distance correlation for measuring the correlation between a p-variate numeric variable X and a categorical variable Y, and it has more appealing properties, such as a nicer representation and better interpretability. When p is fixed, Dang et al. [7] showed that the sample Gini correlation converges in distribution to a quadratic form of normal distributions under independence of X and Y. In this paper, we have studied the inference of the categorical Gini correlation in a more realistic setting where both the sample sizes and the dimensionality diverge in an arbitrary fashion. One of our main results, Theorem 2.1, reveals that those complicated quadratic forms of normal random variables admit a normal limit as the dimensionality p diverges to infinity, providing an intriguing example of the distinction between classical and high-dimensional theory.
Based on these asymptotic distributions, a new consistent K-sample test has been developed. Both simulation studies and real data illustrations have shown that the proposed test performs uniformly better than the distance correlation based test in unbalanced cases.
The Gini covariance has been generalized to a reproducing kernel Hilbert space (RKHS) in [34] by replacing the Euclidean distance with d_κ, the distance in the feature space induced by a positive definite kernel κ; the kernelized version is denoted gCov(X, Y; d_κ) in (5.1). More specifically, a positive definite kernel κ : R^p × R^p → R implicitly defines an embedding map Φ into a feature space F via an inner product, κ(x, y) = ⟨Φ(x), Φ(y)⟩_F, and

d_κ(x, y) = ‖Φ(x) − Φ(y)‖_F = (κ(x, x) + κ(y, y) − 2κ(x, y))^{1/2}.

Replacing the expectations in (5.1) by the corresponding U-statistics and replacing p_k by p̂_k, we obtain the sample kernelized Gini covariance gCov_n(X, Y; d_κ). With a bounded kernel such as the popular radial basis function (RBF) kernel, the moment condition C1 can be dropped. It will be interesting to derive similar results for the kernel covariance and correlation.
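Under the GMD representation of (5.1), a sketch of the sample kernelized Gini covariance with the RBF kernel (for which κ(x, x) = 1, so d_κ is bounded by √2) might look like the following; the function names, bandwidth and toy data are our own choices:

```python
import numpy as np

def kernel_gcov(x, y, gamma=0.1):
    """Kernelized sample Gini covariance with the RBF kernel. The Euclidean
    distance is replaced by the bounded feature-space distance
    d_kappa(u, v) = sqrt(2 - 2 * exp(-gamma * ||u - v||^2))."""
    def mean_dist(z):
        n = len(z)
        sq = ((z[:, None, :] - z[None, :, :])**2).sum(axis=2)
        d = np.sqrt(np.maximum(2.0 - 2.0 * np.exp(-gamma * sq), 0.0))
        return d.sum() / (n * (n - 1))
    n = len(y)
    return mean_dist(x) - sum((np.sum(y == k) / n) * mean_dist(x[y == k])
                              for k in np.unique(y))

# Toy data: two groups differing in scale, which a bounded kernel still detects.
rng = np.random.default_rng(6)
x = np.vstack([rng.normal(0, 1, (50, 5)), rng.normal(0, 2, (50, 5))])
y = np.repeat([0, 1], 50)
val = kernel_gcov(x, y)
```

Because the two conditional distributions differ, the between-group kernel Gini variation is positive.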
As long as pairwise (dis)similarities are available, the kernel Gini covariance can be used for complex data types. It would also be interesting to adopt kernel Gini covariance and correlation based on the neural tangent kernel (NTK) in the study of deep artificial neural networks (ANNs). Continuations of this work could take those directions as well as the following.
• The permutation test based on Gini covariances in high dimension has demonstrated good size and power empirically. A theoretical and rigorous treatment is needed.
• Under the null that X and Y are independent, gCov_n has a U-statistic representation with first-order degeneracy but admits a normal limit in high dimension. When X and Y are dependent, the representation is non-degenerate, and we would therefore also expect a non-null CLT for gCov_n when p → ∞.
• In this study, the number of levels of Y is fixed and finite. However, some applications, such as Poisson processes, have infinitely many levels, and in others, such as discretization procedures, the number of levels may increase with the sample size. It would be interesting to study the estimation of the Gini correlation in those cases and to explore its asymptotic distribution when n, p and K all diverge.

Appendix
Let X, X_1, X_2, X_3 and X_4 be independent random variables from F. We adopt the following notation throughout this section.

Lemmas
Before proving the main result, Theorem 2.1, we provide several necessary lemmas and their proofs; the remaining lemmas are given within the proof of Theorem 2.1. The double-centered distance d(·, ·) in (2.7) has the appealing orthogonality properties stated in Lemmas 6.1 and 6.2 below.
Proof. It is straightforward to obtain that Ed(X_1, X_2) = 0. By the double expectation argument, the remaining identities follow. The other properties can be proved similarly.
Using the double expectation argument and the properties in Lemma 6.1, the result follows. This completes the proof of Lemma 6.2.
For brevity, we denote gCov_n(X, Y) by G_n. To show the asymptotic normality of G_n, we construct a martingale sequence as follows. Assume that the X_i's have been sorted by the Y_i's, that is, X_i = X_i^(1) for i = 1, ..., n_1; X_{n_1+i} = X_i^(2) for i = 1, ..., n_2; and so on. Then {M_{n,l}, 1 ≤ l ≤ n} is a martingale difference sequence with respect to the nested σ-fields {F_l, 1 ≤ l ≤ n}, and under independence its partial sums recover G_n. We need to establish the asymptotic normality of Σ_{l=1}^n M_{n,l}. Without loss of generality, we prove the case K = 3.
We first work out the representations of M_{n,l} using the properties in Lemmas 6.1 and 6.2. Depending on l, M_{n,l} takes three forms, one for each of the cases l ≤ n_1, n_1 < l ≤ n_1 + n_2 and n_1 + n_2 < l ≤ n; the explicit expressions follow from Lemmas 6.1 and 6.2.
To apply the martingale central limit theorem to the constructed martingale sequence M_{n,l}, l = 1, ..., n, we need the following Lemma 6.4 and Lemma 6.5. Lemma 6.4. Under conditions C1-C3 and independence of X and Y, the normalized sum of conditional variances converges in probability to one. Proof. We first obtain three formulas for σ²_{n,l} according to l: Case 1 for l ≤ n_1, Case 2 for n_1 < l ≤ n_1 + n_2 and Case 3 for n_1 + n_2 < l ≤ n. Therefore, under independence of X and Y, the stated expression for the sum of the σ²_{n,l} follows, and it is not difficult to show that its expectation converges. To complete the proof of Lemma 6.4, it suffices to show that the variance of the sum vanishes. Under independence of X and Y and by the properties in Lemmas 6.1 and 6.2, the terms R^(1) and R^(2) are orthogonal; the last equality is due to condition C2 and Lemma 6.3. This completes the proof of the lemma. Lemma 6.5 implies that the Lindeberg condition holds. Together with Lemma 6.4, an application of the martingale CLT completes the proof of Theorem 2.1.

Proof of Theorem 2.3
As σ̂_0 in (2.9) is a ratio-consistent estimator of σ_0, it suffices to show that gCov_n(X, Y)/σ_0 → ∞ in probability. The inequality (6.5) is obtained by applying the moment inequality for U-statistics from [18] (p. 72) and the conditional Jensen inequality. Combining (6.6) and (6.7), we conclude that gCov_n(X, Y)/σ_0 → ∞ in probability. Therefore, P(gCov_n(X, Y) > Z_α σ̂_0) → 1, which completes the proof.
Throughout, a_n = o(b_n) means a_n/b_n → 0, and a_n = O(b_n) means L ≤ a_n/b_n ≤ U for some finite constants L and U. For sequences of random variables, the analogous notations o_p and O_p stand for the corresponding relationships holding in probability.

Fig 1 .
Fig 1. Histograms of the standardized Gini covariance statistic in Example 1 with kernel density estimation curves in red and standard normal density curves in blue.

Fig 2 .
Fig 2. Size and power of tests in Example 3. Dashed horizontal line is the nominal level 0.05.
(c) PCA on all 310 variables; (d) PCA on the selected 12 variables

Fig 3 .
Fig 3. Heatmaps and 2 dimensional PCA projections of Voice rehabilitation data of all 310 variables and of the selected 12 variables.

Table 1
The maximum pointwise distances between the kernel density estimation function and the standard normal density function. KDE_g is for the rescaled gCov_n and KDE_d for the rescaled dCov_n.

Table 2
Size and Power of Tests for K = 3 samples in Example 2.

Table 3
P-values of various 2-sample tests for all features and for the selected 12 features in the LSVT voice rehabilitation data.

Table 4
P-values of testing whether the training, validation and test data in ARCENE have the same distribution.