Selective inference for clustering with unknown variance

,


Introduction
Data clustering is a popular method for summarizing trends in unlabeled data, and is a powerful tool for gaining understanding and interpretability in large datasets.However, it is well known that clustering can easily lead to false discoveries, in the sense that data from a single source (one true cluster) can be falsely separated into multiple clusters.A core challenge in addressing this issue is that of selective inference-the problem of performing inference on a hypothesis that was developed using the same dataset.In-depth motivations for selective inference can be found in Taylor and Tibshirani [2015] and Benjamini [2020], where the latter illustrates its importance from the perspective of replicability of experimental results.There has been extensive work on selective inference in supervised settings, such as the work by Fithian et al. [2014] and Loftus [2015] that provide selective inference frameworks for doing valid inference after model selection, but less is known about the unsupervised setting, which is the context for clustering.Some popular clustering methods include hierarchical clustering (Murtagh and Contreras [2012]), k-means clustering (Steinley [2006]), decision tree clustering (Liu et al. [2000]), and spectral clustering (Von Luxburg [2007]).
Concretely, for the clustering problem, if the n data points X 1 , . . ., X n ∈ R q are partitioned into clusters C 1 ∪ • • • ∪ C K = [n] := {1, . . ., n}, how can we determine whether the "discovered" clusters C k and C k are genuinely different using the observed data (the X i 's for i ∈ C k , and for i ∈ C k ), when these same data values were used to choose the clusters themselves?
To address this problem, Zhang et al. [2019] propose a method based on data splitting, where the hyperplane separating two clusters is estimated using a portion of the data, and the fitted hyperplane, instead of the clustering algorithm, is used on the rest of the data for generating cluster assignments.They then condition on the selection event-the event where the hyperplane is chosen-to account for the data dependency in the hypothesis test.Gao et al. [2022] provide an alternative solution to this problem, where they account for the clustering event by directly conditioning on it, specifically for the hierarchical clustering algorithm, and a recent work of Chen and Witten [2022] extends their work to the k-means clustering algorithm.Relatedly, Hivert et al. [2022] propose a set of valid inference procedures for three different null hypotheses testing whether two clusters estimated by a clustering algorithm are truly different, where one of the proposed procedures is an extension of the work of Gao et al. [2022] to testing whether a single feature plays a role in distinguishing the two clusters.There have also been relevant work on data with structural assumptions.For example, the work of Watanabe and Suzuki [2021] provides a method for doing inference on a data matrix represented by a latent block model after choosing the cluster membership of each entry of the data matrix using the same data matrix, and conditions on this selection event to do a valid inference.
The aforementioned work of Gao et al. [2022] offers an elegant solution to this problem, providing a framework for performing selective inference to test the null hypothesis that states for each pair k = k , where µ i = E [X i ] is the true mean of the i-th data point-in other words, is the mean of cluster C k equal to the mean of cluster C k ?Note that this hypothesis is indeed data-dependent, since the clusters C k and C k are chosen based on the observed data, and therefore testing this null must account for this data dependence.In Gao et al.
[2022]'s work, it is assumed that the data is distributed as where the means µ i ∈ R q are unknown while the (shared) variance σ 2 > 0 is known.In practice, however, σ would often need to be estimated from the data, and this poses a particular challenge in the setting of this clustering problem.Without knowing the true cluster structure of the data (since of course, this is exactly the question we are aiming to test), it is difficult to obtain a reliable estimate of σ-indeed, we will see shortly that many natural options lead to either substantial power loss or substantial loss of the Type I error control.This motivates the need for the more general model that avoids the need to estimate the true variance.In this work, we propose a method that avoids this obstacle, by allowing for an unknown variance σ 2 (or more generally, an unknown structured covariance matrix), while guaranteeing Type I error control and maintaining high power.
The remainder of this paper is organized as follows.In Section 2, we review the selective inference framework developed by Gao et al. [2022] for the setting where σ is known and discuss motivations for allowing σ to be unknown.In Section 3, we present our new method for performing inference on clustering in the setting of an unknown σ (with proofs deferred to the Appendix).Empirical results are presented in Section 4 to demonstrate the performance of the new method and compare against the existing framework.Finally, we conclude with a discussion and some open questions in Section 5.

Background: the known variance case
In this section, we will first give a brief overview of the selective inference method developed in Gao et al. [2022]'s work, and discuss some of the challenges posed by unknown variance σ 2 .

Gao et al. [2022]'s Method
Consider clusters C k , C k , which are two disjoint subsets of [n].If these clusters were chosen ahead of time-that is, independently of the data-then it would be simple to test the null hypothesis H 0 (C k , C k ) defined in (1)-specifically, we would naturally use the test statistic Here X ∈ R n×q is the matrix of observed data with i-th row X i ∈ R q , and where, for a subset C ⊆ [n], 1 C ∈ R n represents the vector with ith entry equal to 1 for each i ∈ C and 0 for i ∈ C.This test statistic follows a mean-zero normal distribution under the null hypothesis H 0 (C k , C k ), and so its norm follows a rescaled χ distributed under the null, However, since the clusters were chosen in a data-dependent way, this distribution is not the correct null distribution for X v 2 .To address this, we can rewrite X as which decomposes X into components lying in the span of v and its orthogonal complement, with denoting the projection matrix that projects to the span of v, and projecting to its orthogonal complement, and where • 2 denotes the Euclidean norm and • F the Frobenius norm.Gao et al. [2022]'s insight into handling the data-dependent cluster selection is to condition on the normalized matrix vv X vv X F and the orthogonal projection P ⊥ v X, so that only the test statistic X v 2 remains unknown, and moreover to condition on the range of values of X v 2 that agree with the clustering selection.Specifically, defining where Cluster(•) refers to the outcome of the clustering procedure.In other words, S contains all values of φ for which the same clustering outcome would have been obtained, if we plug in φ in place of the observed test statistic value X v 2 .Their main result establishes that, even given the data-dependent clustering procedure, the re-scaled χ distribution is the correct null distribution once truncated to this set S.
where σ is known, and let v be defined as above.Then, conditional on Cluster(X), vv X vv X F , and • χ q truncated to S. In particular, the p-value , S is uniformly distributed under H 0 (C k , C k ), where F χq (•; c, S) is the CDF of a c • χ q random variable truncated to the set S. Gao et al. [2022] provide an algorithm for exactly computing the set S for the hierarchical clustering algorithm with linkages for which the exact computation of this set is tractable, along with an implementation of an importance sampling algorithm for clustering algorithms where this set cannot be efficiently computed.
It can be seen in Theorem 1 that Gao et al. [2022] condition on Cluster(X), which is the entire outcome of the clustering algorithm and contains information about all K estimated clusters, as opposed to conditioning only on the event C k , C k ∈ Cluster(X), where C k and C k are the two clusters of interest.We might ask whether it is necessary to condition on this additional information, or whether this might lead to loss of power.Indeed, this type of approach is common in the selective inference literature-in order to be able to perform selective inference, we may need to condition on additional information for statistical reasons (i.e., to avoid nuisance parameters) and/or computational reasons; see, for instance, Fithian [2015, Section 3.2.4], Lee et al. [2016, Section 5.2] for more discussion.

Challenges in estimating σ
We next discuss motivations for allowing σ to be unknown.Continuing the discussion earlier on the difficulty of estimating σ from the data, we consider a simple scenario where we are aiming to determine whether data points X 1 , . . ., X n arise from a single cluster or from two clusters.To test this, we would choose a data-dependent clustering [n] = C 1 ∪ C 2 , and would now need to estimate σ in order to run Gao et al. [2022]'s test.
• Suppose we estimate the variance by using the within-cluster means, for instance, where XC k is the sample mean in cluster C k .With this choice, we might substantially underestimate the variance if the true data distribution only has a single cluster.The middle column of Figure 1 demonstrates this problem in practice-we can see that, when the null H 0 (C 1 , C 2 ) is true, the variance may sometimes be vastly underestimated and, as a result, the empirical distribution of the p-value is far from uniform, which would lead to false positives.
• Alternatively, we might take a more conservative estimate of variance by treating the data as a single cluster, e.g., Indeed, this is the estimator proposed in Gao et al. [2022, Section S3], and they prove theoretically that, as this is asymptotically an over-estimate of σ2 , Type I error control is guaranteed.However, this choice can lead to a substantially over-conservative test, as demonstrated in the right column of Figure 1-if the true data distribution arises from two clusters, this estimate can massively over-estimate σ 2 leading to a large loss of power.1 See Section 4 for details on these simulations.Thus, in Figure 1, we clearly see a tradeoff between Type I error control and power.When using the cluster-wise estimate σclustered , we see that power is high under the alternative, with the empirical power being as high as the case where the true σ is used, but Type I error control is lost under the null.On the other hand, when using the estimate σall that treats the entire dataset as a single cluster, we see that it controls Type I error under the null but incurs a loss in power under the alternative.
To avoid this tradeoff, in this work we propose a selective inference procedure for the clustering problem that can handle an unknown variance σ 2 > 0. This more general model resolves the issue-the p-value distribution is uniform when the data is generated from a single cluster, without sacrificing too much power in the scenario where the data is instead generated from distinct clusters.3 Proposed method: the unknown variance case We now introduce our proposed method for the setting where the variance is unknown.In this new setting, we assume that the data is distributed as X i ⊥ ⊥ ∼ N (µ i , σ 2 I q ), where the means µ i ∈ R q , as well as the (shared) variance σ 2 > 0, are unknown.
Recall the null hypothesis Gao et al. [2022], and the corresponding test statistic X v 2 .Unfortunately, in our new setting where σ is unknown, the distribution of this test statistic cannot be computed.With unknown σ, a typical approach would be to construct a data dependent estimator of σ to result in a test statistic of the form X v 2 2 /(estimate of σ 2 ), which would follow an F distribution (after rescaling the numerator appropriately).However, this is impossible to do in our setting under the broad null hypothesis H 0 .This is due to the fact that, to estimate σ, we would need to examine the within-cluster variability, i.e., quantities of the form i∈C k X i − XC k 2 2 for estimated clusters k = 1, . . ., K; however, these quantities follow a noncentral χ 2 distribution when each cluster 2 ).In other words, any nonzero differences in means within a cluster will lead to nuisance parameters, arising from within-cluster differences in means, that make it impossible to handle unknown variance.
To avoid these nuisance parameters, we restrict to a stronger null hypothesis, In other words, H 0 assumes that each data point in clusters C k and C k has the same mean, while H 0 makes the weaker assumption that the sample mean of data points in cluster C k and in cluster C k have the same mean.We can equivalently rewrite H 0 (C k , C k ) as e i e i   µ = 0 where w := (5)

Decomposition of X
To define our test statistic, we begin by taking a decomposition of the observed data X.This decomposition plays an analogous role to the decomposition (3) used by Gao et al. [2022], but is more complex to allow us to handle unknown variance.We begin by writing where is the rank-one projection matrix that captures the difference in cluster means for C k and C k , while where, for any subset C ⊆ [n], I C represents the diagonal matrix with entry (i, i) is the projection matrix to the orthogonal complement of P 0 and P 1 .We can see that P 0 , P 1 , and P 2 project to subspaces of dimension 1, m − 2, and n − m + 1, respectively, where is the number of data points in the two clusters.Intuitively, we can think of this decomposition of the data as follows: • P 0 X captures the difference in means between clusters C k and C k ; • P 1 X captures differences among points within C k , and among points within C k ; • P 2 X captures all other aspects of the data (i.e., the overall mean of the combined clusters C k ∪ C k , as well as information about data points not lying in Figure 2 illustrates the roles of these three terms in the decomposition of the data X.

The test statistic
In Gao et al. [2022]'s work, the test statistic they use is equivalent to P 0 X F , which under the null hypothesis H 0 , follows a χ q distribution (rescaled by σ), truncated to a region S that controls for the selection event.In our work, since σ is unknown, we will use an F distribution in place of the χ.The test statistic we propose is given by the ratio where the numerator is the same (up to squaring and rescaling) as the statistic used by Gao et al. [2022], while the denominator acts as an estimate of the noise variance σ 2 (up to rescaling).We can gain additional intuition for this test statistic by considering the case K = 2-if we define T two-sample as the test statistic for the standard two-sample t-test (i.e., testing for equality of means in clusters k, k , assuming equal variances), we can observe that R = T 2 two-sample in this case.We next need to define the truncation set.First, we rewrite our decomposition as Our test will condition on: • The total squared norm P 0 X 2 F + P 1 X 2 F for the first and second terms in the decomposition; • The normalized terms P 0 X P 0 X F and P 1 X P 1 X F for the first and second terms in the decomposition; • The third term P 2 X in the decomposition.
With these terms treated as known, the data X can then be fully determined by revealing the value R = (m − 2) of the test statistic.For any r > 0, define We can verify from the definition of R that X = x (R) holds by definition.Finally, define S = {r > 0 : Cluster(X) = Cluster(x (r))} ⊆ (0, ∞).

Main result
Our main result, presented next, establishes that we can compute the exact post-selection distribution of R, which thus allows us to perform valid selective inference in the unknownvariance setting.
∼ N (µ i , σ 2 I q ) where σ is unknown, and let P 0 , P 1 , and P 2 be defined as above.Then, conditional on Cluster(X), P 0 X 2 F + P 1 X 2 F , P 0 X P 0 X F , P 1 X P 1 X F , and P 2 X, under the null hypothesis H 0 (C k , C k ), the random variable R follows the F q,(m−2)q distribution truncated to the set S .In particular, the p-value , where F F q,(m−2)q (•; S ) is the CDF of a F q,(m−2)q random variable truncated to the set S .
The intuition is that, if P 0 and P 1 were fixed rather than data-dependent (i.e., if the clusters C k and C k were chosen before viewing the data), then we would have σ −2 P 0 X 2 F ∼ χ 2 q and, independently, would follow an F q,(m−2)q distribution.After accounting for the selection event, the null distribution is instead given by a truncated F distribution.
To implement the results of Theorem 2 in practice, we need to be able to compute this p-value.In other words, we need to either explicitly characterize the set S that is consistent with the selection event, or develop an empirical sampling strategy to estimate the p-value.We next consider this computational question.

Computing the p-value
To characterize the truncation set S , we will split into two cases.In the general setting, when the data is separated into an arbitrary number K ≥ 2 of clusters, we will handle the truncation event via numerical approximation.For the special case K = 2, however, we will show that S can potentially be computed explicitly, by relating the problem back to the work of Gao et al. [2022] for the known-variance case.

Special case: K = 2
Rewriting Gao et al. [2022]'s procedure in our notation, the modified data is defined as where v = is projection onto the span of v, and their selection set is given by S = {φ > 0 : Cluster(X) = Cluster(x(φ))}.In contrast, our method defines and S = {r > 0 : Cluster(X) = Cluster(x (r))}.
The next result shows that, for the case K = 2, these two definitions can be related in a simple way.
Proposition 1. Suppose that K = 2 (so that we have k = 1, k = 2).Assume that the clustering procedure is location-and scale-invariant-that is, for any x ∈ R n×q , for any a ∈ R and b ∈ R q .Then, under the notation and definitions above, The work of Gao et al. [2022] gives an explicit characterization of the set S for a family of agglomerative hierarchical clustering algorithms, while Chen and Witten [2022] does the same for k-means clustering.Moreover, both of these algorithms are location-and scaleinvariant.Consequently, for the case K = 2, we can explicitly characterize the truncation set S , and thus can compute the p-value P constructed in Theorem 2 by leveraging Gao et al. [2022]'s construction of the set S, for these two popular algorithms.Of course, a major limitation of this result is that we can only handle K = 2.However, in iterative procedures (e.g., hierarchical clustering), the test can be applied at the first step (i.e., the first split, from a single cluster to K = 2 clusters).This hypothesis test can then essentially be interpreted as testing the global null, i.e., whether the data should be split into clusters at all or simply treated as a single cluster.

General case: K ≥ 2
Next we consider the general case.If K > 2, then the result of Proposition 1 no longer applies-as we will see in the proof of this proposition, the case K = 2 leads to the result specifically because P 2 = 1n1 n n in that case, but this no longer holds for K > 2.Moreover, even if K = 2, it might be the case that we are using a clustering algorithm for which S does not have an explicit characterization and/or which is not location-and scale-invariant.In any of these settings, we do not have an explicit characterization of S , and thus the truncated CDF needed in Theorem 2 cannot be computed exactly.
Nonetheless, the inference procedure can still be run in the general case, by computing P approximately through importance sampling.Specifically, in order to (approximately) compute the p-value P , we need to be able to estimate , and then evaluate this probability at r = R. Since r = R may be extremely large, and/or the selection set S may be very small, the events R > r and/or R ∈ S may have extremely low probability under the distribution R ∼ F q,(m−2)q .We approximate this probability with importance sampling, using a truncated normal distribution as the proposal distribution.Further details are given in Appendix A.3.2.

Empirical results
We now provide empirical results for our proposed method.We present simulation results for Type I error control and empirical power for our proposed method, as well as results on a small real dataset (the penguin dataset provided by [Horst et al., 2020], which was also analyzed in Gao et al. [2022]).For all experiments, we use the hierarchical clustering algorithm with average linkage for clustering.2

Type I error control
Theorem 2 states that P follows the uniform distribution under the null, so it controls the Type 1 error rate of the hypothesis test.We illustrate this empirically by plotting empirical quantiles of a sample of P against the theoretical quantiles of the uniform distribution.
We compare to the p-value P computed by Gao et al. [2022]'s method, with either oracle knowledge of the true σ, or with plug-in estimates σall or σclustered .(The results for the latter two variants of Gao et al. [2022]'s method were also shown in the top row of Figure 1 in Section 2.2.)We perform 2000 independent trials, with datasets generated as , with µ i = 0 q for all i ∈ [n] so that the null hypothesis is true.We set σ = 1, n = 30, and q = 2. Figure 3 presents results for two different settings, K = 2 and K = 3.We use the exact computation method presented in Section 3.4.1 for K = 2, and the importance sampling method discussed in 3.4.2for K = 3 (with N = 8000 draws when we run importance sampling).For the K = 2 case, the p-values are generated for the comparison between clusters k = 1 and k = 2.For the K = 3 case, in each trial, the p-values are generated for the comparison between two randomly chosen clusters k = k ∈ {1, 2, 3}. Figure 3 illustrates that the proposed method, along with Gao et al. [2022]'s method with either the conservative estimate σall or with oracle knowledge of the true σ, all result in uniformly distributed p-values (and thus, control the Type I error rate) for both settings.On the other hand, Gao et al. [2022]'s method applied with σclustered fails to do so, with a substantially anticonservative distribution of the p-values.As expected, since σclustered is a more extreme underestimate for larger K, the nonuniformity of the p-values is more substantial when K = 3 than when K = 2.

Empirical power
It is inevitable that not knowing σ will cause some loss in power, as we are forced to condition on more components of X than we would otherwise do when computing the p-value.On the other hand, using σall in place of σ in P also causes a loss in power, due to σall being an overestimate of σ in settings where the alternative hypothesis is true.Here, we compute the empirical power of the same four methods as above.To be more precise, to calculate empirical power, for each trial, we again consider clusters k = 1 and k = 2 for the case K = 2, or a randomly chosen pair k = k ∈ {1, 2, 3} for the case K = 3.For each trial, we first verify whether the null H 0 (C k , C k ) holds-in other words, if the recovered clusters C k , C k are both subsets of the same original true cluster.Empirical power is then defined as the proportion of times that we reject the null, among all trials for which this null is false.
• Setting 3: K = 3 true clusters, with µ i = (0, 0) for n/3 = 10 data points, µ i = (δ, 0) for n/3 = 10 data points, and µ i = (2δ, 0) for the remaining n/3 = 10 data points.We then run hierarchical clustering with the true number of clusters K.In each setting, the parameter δ ∈ {0, 1, . . ., 7} controls the signal strength.Note that δ = 0 corresponds to the null(s) being true, as all data points have mean µ i = 0. Setting 1 with δ = 0 is identical to the first simulation under the null in Section 4.1 (where hierarchical clustering is run with K = 2), while Settings 2 and 3 with δ = 0 are identical to the second simulation under the null in Section 4.1 (where hierarchical clustering is run with K = 3).For the K = 2 case (Setting 1), the p-values are generated for the comparison between clusters k = 1 and k = 2.For the K = 3 case (Settings 2 and 3), in each trial, the p-values are generated for the comparison between two randomly chosen clusters k = k ∈ {1, 2, 3}.In all settings, the threshold for rejecting a p-value is α = 0.05. Figure 4 illustrates the power, as a function of δ, for the four methods in each of the three settings.We see that the highest power is achieved by Gao et al. [2022]'s method applied with the anticonservative estimate σclustered , but this power comes at the cost of loss of Type I error control (as we can see due to the high rejection rate at δ = 0, where the null is true).Among the remaining methods, we see that the proposed method always has power at least as high as Gao et al. [2022]'s method applied with the conservative estimate σall -sometimes approximately the same, and sometimes substantially higher, in the various settings.Determining the types of settings where our proposed method will result in a substantial gain in power remains an interesting open question.

Robustness to model misspecification
In this section, we empirically study the robustness of the proposed method to model misspecification for two different settings, with comparison to that of Gao et al. [2022]'s method applied with σall and σclustered .
We reproduce the experiments in Section 4.1 (where µ i ≡ 0, i.e., all data is drawn from the null), with the only difference being the data generating process.First we consider a setting where the data are generated from an extremely heavy-tailed distribution, We next repeat with a more moderately heavy-tailed distribution, Finally, we consider a Gaussian distribution with a non-isotropic covariance matrix.
The results for these three settings are shown in Figures 5, 6, and 7, respectively.In all three settings, we see that all of the methods show inflation of Type I error in the presence of model misspecification.As expected, this is more severe for the extremely heavy tailed t 5 noise, and more mild for the other two settings.Moreover, as expected, Gao et al. [2022]'s method with the anti-conservative estimate σ2 clustered shows substantially more inflation of the Type I error; the inflation is similar between our method and Gao et al. [2022]'s method with the anti-conservative estimate σ2 all .

Real dataset
We next compare the methods on a real dataset-the penguin dataset [Horst et al., 2020], which contains information about the bill length (mm) and flipper length (mm) of three different species of penguins, Adelie, Chinstrap, and Gentoo.(This dataset was also studied by Gao et al. [2022, Section 6] to test their method for inference after clustering.)The data is shown in Figure 8.
Table 1 shows the p-values computed by each of the three methods that does not require knowledge of σ-our proposed method, along with Gao et al. [2022]'s with σall or with σclustered .We test three different pairwise cluster comparisons (i.e., comparing C k and C k , for three different choices of (k, k )).Table 1: Results on the centered and standardized penguin dataset.Since the true clustering (i.e., the penguin species) is known, we can see that (k, k ) = (1, 2) corresponds to two estimated clusters that are not very well separated in terms of the true species labeling, while (k, k ) = (1, 5) and (k, k ) = (4, 5) clearly correspond to a true difference in species.In Table 1, overall we see that the proposed method produces p-values that are substantially lower than those computed by Gao et al. [2022]'s method with the conservative estimate σall , corresponding to a gain in power.Gao et al. [2022]'s method applied with σclustered produces p-values even lower than our proposed method, but may not be reliable in terms of Type I error control as we have seen in our simulations.

Discussion
In this work, we have proposed an extension of Gao et al. [2022]'s framework for selective inference for clustering to the case where the isotropic covariance matrix is unknown.Since the method does not rely on plug-in empirical estimates of σ, we can avoid loss of power and/or loss of Type I error control.For location-and scale-invariant clustering algorithms, we have shown that the resulting p-value can be computed exactly in the case of K = 2 clusters by leveraging connections to the findings of Gao et al. [2022]; more generally, standard sampling strategies allow for accurate estimation of the p-value.
These results suggest a number of possible extensions and open questions.First, in our work we allow for unknown variance but assume an isotropic covariance structure, i.e., σ 2 I q , for the q-dimensional data points.It would be interesting to extend these techniques to the setting of diagonal covariance, or even an arbitrary covariance matrix, to allow for nonconstant variance along the q coordinates and/or nonzero correlation.Another important question is whether these tools can be extended to the non-Gaussian setting, which would offer further flexibility and robustness for real-data applications.

A.2 Proof of Proposition 1
Let , specializing the construction from before to the case K = 2 (i.e., we are comparing clusters k = 1 and k = 2).Define We can rearrange terms to obtain and therefore We compute where for the last step we apply the calculation (7).Furthermore, in the case K = 2, we can verify that P 2 = 1n1 n n , the projection to the span of 1 n .Therefore, we can write where By our assumption on the clustering procedure, we therefore have Cluster(x (r)) = Cluster(x(φ)), and so r ∈ S if and only if φ ∈ S, as desired.

A.3 Implementation details
In our experiments, the three variants of Gao et al. [2022]'s method (i.e., using the true σ, or the estimates σall or σclustered ) are run using the code provided by Gao et al. [2022] in the R package clusterpval.We now give details for implementation of our proposed method.
A.3.1 Notes on computation in the K = 2 case Computing the p-value P = 1 − F F q,(m−2)q (R; S ) for our method often requires extremely precise calculations of the CDF of the F distribution due to the truncation.
To estimate the CDF in the tail of the F distribution, we use Li and Martin [2002]'s approximation, Note that Li and Martin [2002]'s method is accurate in the regime where k is finite while → ∞, which is appropriate to our setting as we apply the approximation with k = q and = (m − 2)q.Combining all these calculations means that we can approximate P via the CDF of a truncated χ 2 distribution, P ≈ 1 − F χ 2 q 2(m − 2)q + qR 3 + q − 2 2(m − 2)q + 4qR 3 • qR ; S where S = 2(m − 2)q + qr 3 + q − 2 2(m − 2)q + 4qr 3 • qr : r ∈ S .
Finally, we use the TChisqRatioApprox function from Gao et al. [2022]'s R package clusterpval for this remaining calculation with the truncated χ 2 distribution.
A.3.2 Importance sampling algorithm for the K ≥ 2 case For the setting where K > 2, or where K = 2 but we do not have an exact characterization of the truncation set S , we use importance sampling to approximate the p-value P .
In order to run importance sampling, it is convenient for us to transform the F distribution to a Beta distribution, since the Beta distribution takes values over a bounded range.From properties of these two distributions, it holds that Z ∼ Beta(a/2, b/2) if and only if b a • Z 1−Z ∼ F a,b , or equivalently, R ∼ F a,b if and only if R b a +R ∼ Beta(a/2, b/2).We can then derive an analogous transformation for the truncated versions of these distributions: for any r it holds that F F q,(m−2)q (r; S ) = F Beta(q/2,(m−2)q/2) r m − 2 + r ; S ,

Figure 1 :
Figure 1: The top row shows results under the null, and the bottom row shows results under the alternative.In each row, the left plot shows one draw of the data, along with the estimated values σall and σclustered , while the middle and right plots show results for Gao et al. [2022]'s method applied with σclustered or with σall , respectively.(See Section 2.2 for discussion.)

Figure 2 :
Figure 2: Left: visualization of a dataset with colors indicating the clusters formed by the clustering algorithm with C k represented in blue and C k in red.The blue triangle represents the combined mean of C k ∪ C k , and the black dots represent the cluster means.Middle: the original dataset with P 0 X F scaled by a factor of 2, which pushes apart the clusters C k and C k .Right: the original dataset with P 1 X F scaled by a factor of 2, which spreads points in C k apart from each other while preserving the cluster mean, and same for points in C k .

Figure 3 :
Figure 3: Simulation under the null (see Section 4.1 for details).The left plots show one draw of the data, along with the result of hierarchical clustering for K = 2 (top) and K = 3 (bottom); in these figures, and in analogous figures below, the shape of each data point indicates the true cluster from which the data point originated, while the color indicates the cluster assignment estimated from the data.The right plots show QQ plots comparing the p-values obtained via the four different methods.

Figure 4 :
Figure 4: Simulation under the alternative, for Setting 1 (top row), Setting 2 (middle row), and Setting 3 (bottom row).(See Section 4.2 for details.)The left and middle column of plots show one draw of the data for two different signal strengths δ, along with the result of hierarchical clustering.The right column of plots shows power as a function of signal strength δ for the four different methods.

Figure 5 :
Figure 5: Simulation under the null under model misspecification, with noise generated from an extremely heavy-tailed distribution, t 5 .

Figure 6 :
Figure 6: Simulation under the null under model misspecification, with noise generated from a moderately heavy-tailed distribution, t 10 .

Figure 7 :
Figure 7: Simulation under the null under model misspecification, with noise generated from a Gaussian distribution with a non-isotropic covariance matrix.

Figure 8 :
Figure 8: Visualization of the penguin dataset.The output of the clustering algorithm corresponds to that run on the centered and standardized dataset.