Selective Inference for Latent Block Models

Model selection in latent block models has been a challenging but important task in the field of statistics. Specifically, a major challenge is encountered when constructing a test on a block structure obtained by applying a specific clustering algorithm to a finite-size matrix. In this case, it becomes crucial to consider the selective bias in the block structure, that is, the fact that the block structure is selected from all the possible cluster memberships based on some criterion by the clustering algorithm. To cope with this problem, this study provides a selective inference method for latent block models. Specifically, we construct a statistical test on a set of row and column cluster memberships of a latent block model, which is given by a squared residue minimization algorithm. The proposed test, by its nature, includes and thus can also be used as a test on the set of row and column cluster numbers. We also propose an approximated version of the test based on simulated annealing to avoid combinatorial explosion in searching for the optimal block structure. The results show that the proposed exact and approximated tests work effectively, compared to a naive test that does not take the selective bias into account.


Introduction
A latent block model or an LBM [9,12] has been widely used as a generative model of a relational data matrix, where the rows and columns represent different objects (e.g., customers and items), and its (i, j)th element shows some relationship between objects i and j (e.g., how many times customer i purchased item j). Until now, its effectiveness has been shown in various practical datasets, including customer-product transaction relationships [28] and gene expression data [27,36]. In LBMs, we assume that there is an underlying block structure (i.e., a set of row and column cluster memberships) behind the observed data matrix and that each element of the matrix is generated independently from an identical distribution, given such a block structure. Particularly, a Gaussian LBM [24,25] is useful for modeling a relational data matrix with real elements; this type of LBM is the focus of the current study. In a Gaussian LBM, we assume that each entry follows a Gaussian distribution, whose mean and variance are fixed constants within the same block (a formal description of Gaussian LBMs is given in Section 2.1).
Besides estimating the block structure from a given observed data matrix based on an LBM, it is also important to test the validity of a model (i.e., the number of blocks) or an estimation result. Until now, several tests [3,14,22,35,37] have been proposed for determining the number of blocks in block models, such as a stochastic block model (SBM), which is a model for a square symmetric matrix (e.g., an adjacency matrix of the network structure). Among these studies, only [35]'s test can be applied to the LBM setting; however, its target is different from ours in that it is limited to the number of blocks, not to the cluster memberships. Moreover, it is an asymptotic test, and thus its guarantee cannot be verified with a finite size observed matrix.
In regard to an SBM, several studies have proposed a statistical test for a given set of community memberships of an observed matrix [8,14,16]. In [8], based on the numbers of edges within and across the clusters, two tests were proposed for an SBM; one of these tests included a goodness-of-fit test of community memberships. Although this study's objective is similar to ours, its problem setting is quite different from ours in various aspects, such as the setting of the alternative hypothesis and the assumptions on the network structure (e.g., there are two equal-sized communities in a given network and more intra-community edges than inter-community ones). Another study [14] proposed an asymptotic test on both the number of communities and the community memberships of an SBM, whose validity is guaranteed only in the limit of infinite matrix size. This study is different from ours in that our proposed test is validated with a finite-size matrix. [16] proposed a non-asymptotic test for an SBM setting; they generate finite samples of networks from the distribution of an SBM, conditioned on its sufficient statistics based on Markov chain Monte Carlo (MCMC), and, subsequently, compute the estimator of the p-value as the ratio of the test statistics of sampled networks being equal to or larger than that of an observed network. This study is somewhat similar to ours in that it tries to approximate the p-value under the condition that some function value of an observed matrix is given; however, it is fully based on a Metropolis-Hastings (MH) algorithm, and thus the resulting p-value is not exact with finite samples.
There have been many studies on statistical tests for SBMs, but none of them have enabled us to test the cluster memberships of LBMs. Particularly, in this study, we derive an exact p-value in the following context, which is a typical case in practice. First, we estimate an underlying block structure or cluster memberships of the rows and columns of an observed data matrix, based on a specific criterion. For instance, as a criterion, we use the squared residue or the sample variance within the same block [5,12], whose formal definition is given in Section 2.2. Subsequently, we perform a statistical test on the clustering result, which has been selected as an optimal block structure based on the data matrix, in terms of the criterion described above. In regard to the construction of a valid statistical test, one concern is that it necessitates taking into account the selective bias [2,18,23]. A test on cluster memberships tends to be inappropriately positive, that is, it tends not to reject the hypothesis that the estimated cluster memberships are correct, when the test fails to consider the fact that the hypothetical set of cluster memberships was selected by using the information of a data matrix.
In order to perform a valid statistical inference in such a situation, [2,18] introduced the methodology of post-selection inference. Particularly, selective inference methods facilitate inference on a hypothesis selected based on some criterion, where we use the same data for the hypothesis selection as well as for its inference [18]. The main idea behind selective inference is to reveal the probability distribution of a given test statistic under the selection condition. By conditioning on the selection event, we can appropriately construct a test without the selective bias. Such selective inference methods have been developed for various problem settings, including variable selection in linear regression with L1 regularization [18] and that with marginal screening [20] and k-means clustering [15]. Concerning the problems related to the analysis of relational data matrices, several studies have proposed selective inference methods for biclustering [13,19]. Although they also concern a block (or multiple blocks) in a relational data matrix, their problem settings are different from ours. In our problem setting, a block structure corresponds to a set of cluster memberships of all the rows and columns of an observed matrix. In other words, by rearranging the indices of rows and columns, a block structure is represented by a regular lattice on a matrix. However, [13,19] aimed to find a submatrix (or multiple submatrices) of the original data matrix whose mean is significantly larger than zero. Figure 1 illustrates the difference between the optimal cluster memberships of the proposed and existing methods [19].¹ Since they are based on mutually different assumptions on the latent bicluster structure, their "optimal" cluster memberships are not always identical, even with the same observed matrix. To the best of our knowledge, no study has proposed a selective inference method for the LBM setting, despite the effectiveness of LBMs in relational data analysis.

Figure 1: Examples of the null and estimated bicluster structures of the observed data matrix with the size of (n, p) = (9, 9). The rows and columns of the observed matrix were sorted according to their clusters, and the blue lines indicate the cluster memberships. The biclustering algorithms of the proposed and existing methods [19] do not necessarily yield identical bicluster structures with the same observed matrix.

¹ To plot Figure 1, we randomly generated data matrices with the sizes of (n, p) = (9, 9). We set the null and hypothetical sets of cluster numbers at (2, 2); we defined the null cluster memberships as g_i^(N),(1) = (i mod 2) + 1 for all i, and g_j^(N),(2) = (j mod 2) + 1 for all j. In regard to the mean vector, we used the setting µ_0 = vec([0.5, 0; 0, 0]). Based on the above settings, we generated a data vector by x ∼ N(µ_0, 0.75² I) and applied the biclustering algorithms of the proposed and existing methods [19]. The biclustering algorithm of the proposed method outputs a regular-grid bicluster structure based on the squared residue minimization, while that of [19] outputs an n_0 × p_0 submatrix with the largest sample mean, where we set n_0 = p_0 = 5.
This study proposes a new selective inference method for LBMs. Unlike our previous study [35], where the validity of the test is guaranteed only in the asymptotic sense (i.e., with infinite matrix size), we develop an exact test on a block structure, which is selected from a finite-size observed matrix by the squared residue minimization algorithm. To construct such a statistical test, we exploit the fact that the selection event based on the squared residue can be formulated as a set of quadratic inequalities in terms of the data vector, which is the vectorization of the observed data matrix. On this basis, we can show that the test statistic follows a truncated chi distribution, under the selection condition (a formal definition of the test statistic is given in Section 3).
Since the exact test requires solving two combinatorial optimization problems, one for selecting the block structure with the minimum squared residue and the other for determining the truncation interval of the distribution of the test statistic, its computation becomes intractable with a large observed matrix or a large hypothetical number of blocks. To cope with such combinatorial explosion, we also develop an approximated version of the test based on simulated annealing (SA).
The remaining part of this paper is organized as follows. In Section 2, we first define notations and describe the assumptions necessary for developing the proposed statistical test. We also define the squared residue, which we use for measuring the quality of a given set of row and column cluster memberships. In Section 3, we give the formal statement of the null and alternative hypotheses of the proposed test, define the test statistic, and derive its null distribution. Our main contribution lies in Theorem 3.1; it states that, under the null hypothesis, the test statistic follows a truncated chi distribution, whose truncation interval is determined by the selection result. We also give an approximated version of the test. In Section 4, we experimentally show the effectiveness of the proposed exact and approximated tests, by checking the behavior of the p-values and measuring the true and false positive ratios (TPR and FPR) in both the realizable (i.e., the hypothetical cluster numbers of rows and columns (K, H) are equal to the null ones (K^(N), H^(N))) and unrealizable (i.e., at least one of K < K^(N) and H < H^(N) holds) cases. Finally, we discuss the findings and conclude the paper in Sections 5 and 6, respectively.
2 Problem settings

2.1 Notations and assumptions on data matrix

Throughout this study, we use the following definitions and notations.
• Let A = (A_ij)_{1≤i≤n, 1≤j≤p} ∈ R^{n×p} be an observed data matrix with the size of n × p. When constructing a statistical test, it is more convenient to use the vector representation of matrix A, instead of A itself:

x = vec(A) ∈ R^{np}, where x_{n(j−1)+i} = A_ij, i = 1, …, n, j = 1, …, p. (1)

• Let g_i^(1) be the cluster index of the ith row, and g^(1) = (g_i^(1))_{1≤i≤n}. Similarly, let g_j^(2) be the cluster index of the jth column, and g^(2) = (g_j^(2))_{1≤j≤p}. We denote a set of row and column clusters as g = (g^(1), g^(2)) ∈ G, where G = {(g^(1), g^(2))} is the set of all possible cluster memberships. We also define G_KH as the set of all possible cluster memberships with K × H or fewer blocks.
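In NumPy, for instance, the column-major vectorization in (1) corresponds to flattening in Fortran order; a minimal sketch (the matrix values below are arbitrary):

```python
import numpy as np

# Column-major vectorization x = vec(A): x_{n(j-1)+i} = A_{ij} in the paper's
# 1-based notation, i.e., x[n*j + i] == A[i, j] with 0-based indices.
n, p = 3, 2
A = np.arange(6, dtype=float).reshape(n, p)

x = A.flatten(order="F")  # stack the columns of A into a vector of length n*p

assert x.shape == (n * p,)
assert x[n * 1 + 2] == A[2, 1]  # (i, j) = (2, 1) sits at position n*j + i
```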
• In the null hypothesis of the proposed test, we assume that there exists a set of block memberships g^(N) = (g^(N),(1), g^(N),(2)) and that, given g^(N), each (i, j)th element A_ij of an observed matrix A is generated independently from a Gaussian distribution with a block-wise (unknown) mean and (known) variance σ_0²,² where B_kh is the mean of the (k, h)th block:

A_ij ∼ N(B_{g_i^(N),(1) g_j^(N),(2)}, σ_0²). (2)

In vector representation, this assumption is given by

x ∼ N(µ_0, σ_0² I), (3)

where µ_0 is the unknown block-wise mean vector.
• Let (K^(N), H^(N)) be the minimum set of row and column cluster numbers required to represent the above null set of block memberships g^(N). In the proposed test, we fix a hypothetical set of cluster numbers (K, H), estimate the block structure of an observed matrix with K × H blocks, and perform a test on the estimated block memberships, which, by its nature, includes a test on the cluster numbers (i.e., whether (K^(N), H^(N)) = (K, H) holds or at least one of K < K^(N) and H < H^(N) holds).³
• We denote the set of rows in the kth cluster as I_k = {i : g_i^(1) = k}. Similarly, we denote the set of columns in the hth cluster as J_h = {j : g_j^(2) = h}.

• We denote the cluster membership vector of rows as follows:

e_i^(k) = 1/√|I_k| if g_i^(1) = k, and e_i^(k) = 0 otherwise. (4)
² We also derive the null distribution of a test statistic in the case that the variance σ_0² is unknown in Appendix G.

³ It must be noted, however, that the proposed test cannot be applied directly to sequential testing on cluster numbers, where the hypothetical numbers of clusters are tested in ascending order (i.e., (K, H) = (1, 1), (1, 2), (2, 1), …) until the null hypothesis is accepted. This is because the proposed test cannot distinguish the following two alternative cases: (1) (K^(N), H^(N)) = (K, H) holds but the estimated cluster memberships are incorrect, and (2) K < K^(N) or H < H^(N) holds.
Similarly, we denote the cluster membership vector of columns as follows:

e_j^(h) = 1/√|J_h| if g_j^(2) = h, and e_j^(h) = 0 otherwise. (5)

Based on these vectors e^(k) and e^(h), we define a vector e^(k,h) ≡ e^(h) ⊗ e^(k) ∈ R^{np} and a matrix E^(g) ≡ I_{np} − Σ_{k=1}^K Σ_{h=1}^H e^(k,h) (e^(k,h))^⊤. It must be noted that E^(g) is a projection matrix, that is, (E^(g))^⊤ = E^(g) and (E^(g))² = E^(g) hold.
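As a concrete check, the following sketch builds E^(g) for a small matrix and verifies the projection properties stated above. It assumes, as in our reading of (4) and (5), that e^(k) and e^(h) are normalized indicator vectors, which is what makes E^(g) a projection of rank np − KH:

```python
import numpy as np

n, p, K, H = 4, 4, 2, 2
g1 = np.array([0, 0, 1, 1])   # row cluster indices (0-based)
g2 = np.array([0, 1, 0, 1])   # column cluster indices (0-based)

# E^(g) = I_{np} - sum_{k,h} e^(k,h) e^(k,h)^T with e^(k,h) = e^(h) kron e^(k).
E = np.eye(n * p)
for k in range(K):
    ek = (g1 == k) / np.sqrt((g1 == k).sum())      # normalized row indicator
    for h in range(H):
        eh = (g2 == h) / np.sqrt((g2 == h).sum())  # normalized column indicator
        ekh = np.kron(eh, ek)
        E -= np.outer(ekh, ekh)

# Projection: symmetric, idempotent, rank np - K*H.
assert np.allclose(E, E.T)
assert np.allclose(E @ E, E)
assert np.linalg.matrix_rank(E) == n * p - K * H

# E^(g) annihilates any block-wise constant mean vector mu_0.
B = np.array([[1.0, 2.0], [3.0, 4.0]])
mu0 = B[g1][:, g2].flatten(order="F")   # mu0[n*j + i] = B[g1[i], g2[j]]
assert np.allclose(E @ mu0, 0.0)
```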

Clustering algorithm based on squared residue minimization
To estimate the block structure of a given observed matrix A, we use a clustering algorithm A : x ↦ M̂(x) ∈ G_KH that outputs a block structure minimizing the squared residue, that is, the sample variance σ̂² within the same block. The squared residue has been proposed for measuring the quality of a biclustering result [5,12], and its definition is given by

σ̂²(g) ≡ (1/np) x^⊤ E^(g) x. (6)

Therefore, the squared residue minimization clustering algorithm A outputs the set of cluster memberships ĝ = (ĝ^(1), ĝ^(2)) which satisfies

ĝ = M̂(x) ≡ argmin_{g ∈ G_KH} σ̂²(g). (7)

It must be noted that the above solution ĝ is the maximum likelihood estimator of the cluster memberships with a mean estimator B̂(g) and a known standard deviation σ_0. The log likelihood of a set of cluster memberships g = (g^(1), g^(2)) and the mean parameter B is given by

log p(x | g, B) = −(np/2) log(2πσ_0²) − (1/(2σ_0²)) Σ_{k=1}^K Σ_{h=1}^H Σ_{i∈I_k} Σ_{j∈J_h} (x_{n(j−1)+i} − B_kh)². (8)

Let B̂(g) = (B̂_kh(g))_{1≤k≤K, 1≤h≤H} be the maximum likelihood estimator of mean B for given fixed cluster memberships g. From (8), we can easily derive that B̂_kh(g) = (1/(|I_k||J_h|)) Σ_{i∈I_k} Σ_{j∈J_h} x_{n(j−1)+i}. By combining this fact with (6) and (8), we see that the squared residue minimization is equivalent to the likelihood maximization with the mean estimator B̂(g). Equation (7) is equivalent to a set of quadratic inequalities:

M̂(x) = ĝ ⟺ x^⊤ (E^(g) − E^(ĝ)) x ≥ 0 for all g ∈ G_KH. (9)

In other words, the selection rule can be represented as a set of quadratic inequalities in terms of the data vector x. It must be noted that, under the null hypothesis, the solution ĝ of (7) is unique almost surely. To prove this fact, we first define a quadratic function F^(g,g′) : R^{np} → R by F^(g,g′)(x) ≡ x^⊤ (E^(g) − E^(g′)) x. We also define that g = g′ if the sets of cluster memberships g and g′ are equivalent up to the permutation of cluster indices, and that g ≠ g′ otherwise.
If g ≠ g′, E^(g) − E^(g′) is not a zero matrix (the proof of this is in Appendix A), and thus the Lebesgue measure of the set of points x that satisfy F^(g,g′)(x) = 0 is zero. By combining this fact and the assumption (3) of the null hypothesis, F^(g,g′)(x) ≠ 0 holds for a fixed combination of (g, g′) almost surely. Since {(g, g′) : g, g′ ∈ G_KH, g ≠ g′} is a finite set, we finally have

Pr(F^(g,g′)(x) ≠ 0 for all g, g′ ∈ G_KH with g ≠ g′) = 1. (10)

In case of a tie (i.e., multiple solutions ĝ exist that satisfy (7)), which occurs with probability zero, we can choose any one of them as ĝ independently of x.
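For intuition, the criterion (6) and the exhaustive minimization (7) can be sketched as follows for a tiny matrix. The function names are ours, and the brute-force search is only feasible for very small n, p, K, H, which is exactly the motivation for the SA approximation developed later:

```python
import numpy as np
from itertools import product

def squared_residue(A, g1, g2, K, H):
    """Within-block sample variance (6): squared deviation of each entry
    from its block mean, averaged over all n*p entries."""
    total = 0.0
    for k in range(K):
        for h in range(H):
            block = A[np.ix_(g1 == k, g2 == h)]
            if block.size > 0:
                total += ((block - block.mean()) ** 2).sum()
    return total / A.size

def exhaustive_min(A, K, H):
    """Brute-force version of (7): scan all memberships with <= K*H blocks."""
    n, p = A.shape
    best, best_g = np.inf, None
    for g1 in product(range(K), repeat=n):
        for g2 in product(range(H), repeat=p):
            r = squared_residue(A, np.array(g1), np.array(g2), K, H)
            if r < best:
                best, best_g = r, (np.array(g1), np.array(g2))
    return best_g, best

rng = np.random.default_rng(0)
A = rng.normal(scale=0.1, size=(4, 4))
A[:2, :2] += 5.0   # plant a 2x2 block structure with a shifted mean
(g1_hat, g2_hat), res = exhaustive_min(A, 2, 2)

# With a well-separated block, the minimizer groups rows {0,1} and {2,3}.
assert g1_hat[0] == g1_hat[1] and g1_hat[2] == g1_hat[3]
assert res < squared_residue(A, np.zeros(4, int), np.zeros(4, int), 2, 2)
```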
3 Statistical test on the solution of squared residue minimization

3.1 Null distribution of test statistic T

As described in Section 2, in the null hypothesis of the proposed test, we assume that there exists a set of block memberships g^(N) and that, given g^(N), each element of an observed data vector x is generated independently from a Gaussian distribution whose mean is constant within the same block. Our main purpose is to test whether an estimated block structure ĝ ≡ (ĝ^(1), ĝ^(2)), which is selected based on the squared residue criterion in Section 2.2, is equal to the null one g^(N). Formally, the null and alternative hypotheses of the proposed test are given by

H_0: E^(ĝ) µ_0 = 0 versus H_1: E^(ĝ) µ_0 ≠ 0. (11)

It must be noted that the equation E^(ĝ) µ_0 = 0 is equivalent to the statement that the elements of the vector µ_0 are constant within each block of the set of cluster memberships ĝ. In other words, the statement of the null hypothesis is that a given observed matrix is generated based on the latent block structure ĝ, which is selected as a solution that minimizes the squared residue.
To perform the test of (11), we check the squared residue σ̂² of the given observed matrix A under the condition that the estimated block structure ĝ is selected. Under the null hypothesis, we have E^(ĝ) µ_0 = 0. Here, matrix E^(ĝ) depends solely on the estimated set of cluster memberships ĝ. In other words, under the condition that M̂(x) = ĝ holds, matrix E^(ĝ) is fixed. Therefore, based on the result in [23], the following theorem holds:

Theorem 3.1. Under the null hypothesis, we have

T ≡ (1/σ_0) ‖E^(ĝ) x‖_2, T | {M̂(x) = ĝ, u, z} ∼ χ_{np−KH} | M^(ĝ), (12)

where ‖·‖_2 and χ_{c|M}, respectively, denote the Euclidean norm and the truncated chi distribution with c degrees of freedom and with truncation interval M, and

u ≡ E^(ĝ) x / ‖E^(ĝ) x‖_2, z ≡ (I_{np} − E^(ĝ)) x, M^(ĝ) ≡ {t ≥ 0 : M̂(tσ_0 u + z) = ĝ}. (13)

Proof. Let E be a fixed np × np projection matrix satisfying the following conditions:

• Eµ_0 = 0.

• rank(E) = np − KH.
A singular value decomposition of a matrix E satisfying the above two conditions is given by

E = D̄ [I_{np−KH}, O_{np−KH,KH}; O_{KH,np−KH}, O_{KH,KH}] D̄^⊤, (14)

where we denote the a × a identity matrix and the a × b zero matrix, respectively, as I_a and O_{a,b}, and D̄ ∈ R^{np×np} is an orthogonal matrix. Based on such a matrix E, we use the following notations:

T_E ≡ (1/σ_0) ‖E x‖_2, u_E ≡ E x / ‖E x‖_2, z_E ≡ (I_{np} − E) x, (15)

and we let D ∈ R^{np×(np−KH)} be the first np − KH columns of D̄, so that E = D D^⊤ and D^⊤ D = I_{np−KH}. With the above definitions, we can transform T_E by the following equations:

T_E = (1/σ_0) ‖D D^⊤ x‖_2 = (1/σ_0) ‖D^⊤ x‖_2. (16)

Here, we used the fact that D^⊤ D = I_{np−KH}. By using the assumption that x ∼ N(µ_0, σ_0² I) holds and the independence of matrix E of x, we have

D^⊤ x ∼ N(0, σ_0² I_{np−KH}), (17)

where we considered the fact that E µ_0 = 0 implies D^⊤ µ_0 = 0. Therefore, by combining (16) and (17), we have

T_E ∼ χ_{np−KH}, (18)

where χ_c denotes the chi distribution with c degrees of freedom.
In regard to u_E and z_E, we have

E u_E = u_E, E z_E = (E − E²) x = 0, σ_0 T_E u_E + z_E = x. (19)

In the second equation, we considered the fact that E² = E.
Here, since T_E and (u_E, z_E) are mutually independent (the proof of this is in Appendix D), we have

T_E | {u_E, z_E} ∼ χ_{np−KH}. (20)

Next, we consider adding the condition of the selection event of ĝ to the distribution of T_E | {u_E, z_E} in (20). Given u_E and z_E, the result of the selection depends solely on the value of T_E. Therefore, adding the selection condition M̂(u_E T_E σ_0 + z_E) = ĝ to (20) corresponds to truncating T_E to the region where M̂(u_E T_E σ_0 + z_E) = ĝ holds:

T_E | {M̂(u_E T_E σ_0 + z_E) = ĝ, u_E, z_E} ∼ χ_{np−KH} | {t ≥ 0 : M̂(u_E t σ_0 + z_E) = ĝ}. (21)

Third, we consider replacing E in (21) with E^(ĝ), which is determined by the output of clustering algorithm A for the data vector x. It must be noted that, from its definition, the matrix E^(ĝ) is also a projection matrix with rank np − KH (the proof of this is in Appendix B), and E^(ĝ) µ_0 = 0 holds under the null hypothesis.
Since matrix E^(ĝ) depends on the data vector x only through the choice of ĝ (i.e., E^(ĝ) is fixed given ĝ), (21) still holds with matrix E^(ĝ) under the condition that the selection result ĝ is given, which concludes the proof.
It must be noted that if there exist multiple sets of cluster memberships that minimize the squared residue σ̂², which occurs with probability zero from the discussion in Section 2.2, then Theorem 3.1 will hold for any one of them. Moreover, we define that a set of block memberships g′ is a refinement of g iff any block in g′ is a submatrix of some block in g. If ĝ′ is a refinement of ĝ, then Theorem 3.1 will also hold when ĝ is replaced by ĝ′. In other words, we cannot detect that a given block structure represents a "finer division than necessary" with the proposed test; solving this problem is beyond the scope of this paper.

Statistical test based on truncated chi distribution
To perform a statistical test based on Theorem 3.1, we have to specify the truncation interval M^(ĝ) ≡ {t ≥ 0 : M̂(tσ_0 u + z) = ĝ}. As shown in (9), this is equivalent to an interval satisfying the following condition for all g ∈ G_KH:

(tσ_0 u + z)^⊤ (E^(g) − E^(ĝ)) (tσ_0 u + z) ≥ 0, (22)

that is,

a^(g,ĝ) t² + b^(g,ĝ) t + c^(g,ĝ) ≥ 0, (23)

where

a^(g,ĝ) ≡ σ_0² u^⊤ (E^(g) − E^(ĝ)) u, b^(g,ĝ) ≡ 2σ_0 u^⊤ (E^(g) − E^(ĝ)) z, c^(g,ĝ) ≡ z^⊤ (E^(g) − E^(ĝ)) z. (24)

From the definition of u and z in (13), we have E^(ĝ) u = u and E^(ĝ) z = 0, which simplifies the above coefficients a^(g,ĝ), b^(g,ĝ), and c^(g,ĝ) as follows:

a^(g,ĝ) = σ_0² (u^⊤ E^(g) u − 1), b^(g,ĝ) = 2σ_0 u^⊤ E^(g) z, c^(g,ĝ) = z^⊤ E^(g) z. (25)

Here, in the transformation of b^(g,ĝ), we used the fact that matrices E^(g) and E^(ĝ) are symmetric. We consider the condition under which (23) holds in the two cases, a^(g,ĝ) = 0 and a^(g,ĝ) < 0.
Overall, the interval of t where (23) holds for all g is given by

M^(ĝ) = [0, min_{g ∈ G_KH, g ≠ ĝ} (−b^(g,ĝ) − √((b^(g,ĝ))² − 4 a^(g,ĝ) c^(g,ĝ))) / (2 a^(g,ĝ))]. (26)

It must be noted that a^(g,ĝ) < 0 holds for all g ∈ G_KH with g ≠ ĝ almost surely, based on a similar discussion as that in Section 2.2. Formally, for fixed g, g′ ∈ G_KH, y ≡ E^(g′) x follows a Gaussian distribution. If g ≠ g′, E^(g) − E^(g′) is not a zero matrix, and thus the Lebesgue measure of the set of points y satisfying y^⊤ (E^(g) − E^(g′)) y = 0 is zero. Similarly, ‖y‖_2² > 0 holds with probability one. By combining these facts, a^(g,g′) ≡ (1/‖y‖_2²) y^⊤ (E^(g) − E^(g′)) y ≠ 0 holds for a fixed combination of (g, g′) satisfying g ≠ g′ almost surely. Therefore, we have

Pr(a^(g,g′) ≠ 0 for all g, g′ ∈ G_KH with g ≠ g′) = 1. (27)

To derive the last equation, we used the fact that {(g, g′) : g, g′ ∈ G_KH, g ≠ g′} is a finite set. We denote the set of cluster memberships attaining the boundary of this interval as g̃, that is,

g̃ ≡ argmin_{g ∈ G_KH, g ≠ ĝ} (−b^(g,ĝ) − √((b^(g,ĝ))² − 4 a^(g,ĝ) c^(g,ĝ))) / (2 a^(g,ĝ)). (28)

From Theorem 3.1, given {ĝ, z, u}, the p-value p_T of the test statistic T in (12) is given by

p_T = [γ(c/2, a²/2) − γ(c/2, T²/2)] / γ(c/2, a²/2), c = np − KH, (29)

where γ(·, ·) is the lower incomplete gamma function and a denotes the upper boundary of M^(ĝ) in (26). This holds from the fact that, for any random variable X with a probability density function f, the density of X conditioned on X ∈ [0, a] is given by f(x) / ∫_0^a f(x′) dx′. To derive the p-value in (29), we used the fact that the cumulative distribution function of the chi distribution with c degrees of freedom, truncated to the interval [0, a], is given by F(t) = γ(c/2, t²/2) / γ(c/2, a²/2) for t ∈ [0, a].
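As a sketch, the selective p-value (29) reduces to evaluations of the regularized lower incomplete gamma function, implemented below in pure Python via its standard series expansion (the helper names are ours; c = np − KH and a is the upper boundary of the truncation interval, both illustrative values here):

```python
import math

def reg_lower_gamma(a, x, tol=1e-12):
    """Regularized lower incomplete gamma P(a, x) via its series expansion."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    n = 0
    while term > tol * total:
        n += 1
        term *= x / (a + n)
        total += term
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))

def chi_cdf(t, c):
    """CDF of the chi distribution with c degrees of freedom."""
    return reg_lower_gamma(c / 2.0, t * t / 2.0)

def truncated_chi_p_value(T, a, c):
    """p-value (29): upper-tail probability of chi_c truncated to [0, a]."""
    return (chi_cdf(a, c) - chi_cdf(T, c)) / chi_cdf(a, c)

p = truncated_chi_p_value(T=3.0, a=10.0, c=21)
assert 0.0 <= p <= 1.0
```

As the truncation boundary a grows, the selective p-value approaches the ordinary chi p-value, since chi_cdf(a, c) tends to one.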

Approximated test based on simulated annealing
The exact statistical test in Section 3.2 requires us to search for (i) the optimal set of cluster memberships ĝ, which minimizes the squared residue in (6), and (ii) the set of cluster memberships g̃ in (28), which determines the truncation interval. We can see that the number of mutually different patterns of block structures with exactly K × H blocks is lower bounded by K^(n−K) H^(p−H) (see Appendix C for more detailed discussions).
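To see the scale, this lower bound already rules out exhaustive search for quite small matrices (the sizes below are illustrative):

```python
# Lower bound K**(n-K) * H**(p-H) on the number of block structures with
# exactly K x H blocks; exhaustive search is hopeless even for n = p = 20.
n, p, K, H = 20, 20, 3, 3
bound = K ** (n - K) * H ** (p - H)
assert bound == 3 ** 34
assert bound > 10 ** 16
```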
To cope with such combinatorial explosion, we propose an approximated statistical test based on SA, besides the exact test described in Section 3.2. SA is an iterative algorithm that can be used for obtaining approximated solutions of combinatorial optimization problems [34,17]; its basic procedure is given as follows:

1. Define a cooling schedule or sequence of temperatures {T_t}_{t=0}^∞, a threshold ε (for all the experiments, we set ε = 10⁻⁶), a finite set of states S, and an objective function f on S. Our purpose is to find a state x ∈ S that minimizes f(x). For each state x ∈ S, we also define a set of neighbors N(x) ⊆ S and a transition probability R(x, x′) from state x to x′, for all x′ ∈ S, where R(x, x′) > 0 if x′ ∈ N(x) and R(x, x′) = 0 otherwise. Finally, define an initial step t ← 0 and an initial state x_0 ∈ S, and let f^(0) ≡ f(x_0).

2. If T_t < ε, stop the algorithm and output the current state x_t. Otherwise, randomly choose a neighbor x′ of the current state x_t with probability R(x_t, x′), compute f′ ≡ f(x′), and let Δf ≡ f′ − f^(t).

• If Δf < 0, then move to state x′ and set x_{t+1} = x′ and f^(t+1) = f′.

• Otherwise, with probability exp(−Δf / T_t), move to state x′ and set x_{t+1} = x′ and f^(t+1) = f′; otherwise, stay at the current state x_t and set x_{t+1} = x_t and f^(t+1) = f^(t).

3. Set t ← t + 1 and return to step 2.
It has been proven that the solution given by the above SA algorithm converges in probability to the global optimal solution of a given problem, under the following conditions [10]: (a) Irreducibility: we say that state y is reachable at height E from state x if x = y or if a sequence of states x = x_1, x_2, …, x_p = y exists such that (1) R(x_t, x_{t+1}) > 0 for all t ∈ {1, …, p − 1}, and (2) f(x_t) ≤ E for all t ∈ {1, …, p}. We simply say that y is reachable from x if y is reachable from x at some height E. The first condition is that, for any pair of states (x, y), y is reachable from x.
Algorithm 1 SA algorithm for finding the minimum squared residue solution ĝ.
Input: A cooling schedule of temperature {T_t}_{t=0}^∞ and a threshold ε.
Output: Approximated optimal set of cluster memberships ĝ in terms of the squared residue.
1: t ← 0.
2: Randomly generate initial cluster memberships: ĝ = (ĝ^(1), ĝ^(2)).
3: Compute the initial value of the objective function: f^(0) ← x^⊤ E^(ĝ) x.
4: while T_t ≥ ε do
5: Randomly choose a row or column index from {1, …, n + p}.
6: if the ith row is selected then
7: Randomly generate a new cluster index k′ of the ith row from the uniform distribution on {1, …, K} \ {ĝ_i^(1)}.
8: Let ĝ′ be the set of cluster memberships given by replacing ĝ_i^(1) with k′.
9: else if the jth column is selected then
10: Randomly generate a new cluster index h′ of the jth column from the uniform distribution on {1, …, H} \ {ĝ_j^(2)}.
11: Let ĝ′ be the set of cluster memberships given by replacing ĝ_j^(2) with h′.
12: end if
13: Compute the value of the objective function: f′ ← x^⊤ E^(ĝ′) x.
14: Δf ← f′ − f^(t).
15: if Δf < 0 then
16: ĝ ← ĝ′, f^(t+1) ← f′.
17: else
18: With probability exp(−Δf / T_t), set ĝ ← ĝ′ and f^(t+1) ← f′; otherwise, keep ĝ and set f^(t+1) ← f^(t).
19: end if
20: t ← t + 1.
21: end while

(b) Weak reversibility: The second condition is that, for any E ∈ R and for any pair of states (x, y), y is reachable at height E from x iff x is reachable at height E from y.
(c) We say that a state x is a local minimum if no state y ∈ S satisfying f(y) < f(x) (i.e., a better solution) is reachable at height f(x) from x. In other words, to find a better solution from a local minimum x, we need to pass through some "worse" states, where the value of the objective function is larger than that of x. We define that the depth of a local minimum x is +∞ if x is a global optimal state; otherwise, it is the minimum E > 0 such that some state y (i.e., a better solution) with f(y) < f(x) exists and is reachable at height f(x) + E from x. The third condition is that the cooling schedule of temperature satisfies (1) T_t ≥ T_{t+1} for all t, (2) lim_{t→∞} T_t = 0, and (3) Σ_{t=0}^∞ exp(−d*/T_t) = ∞, where d* is the maximum depth of all the states that are locally, but not globally, optimal solutions. Algorithm 1 is the SA algorithm for obtaining an approximated solution for the optimal set of cluster memberships ĝ in terms of the squared residue. In this algorithm, from (7), we define the set of states S and the objective function f as S ≡ G_KH and f(g) ≡ x^⊤ E^(g) x, respectively. In each step of the algorithm, the neighbors N(g) of the current state g are defined as the set of all the cluster memberships that differ from g in exactly one row or column. It must be noted that the size of this neighborhood is |N(g)| = n(K − 1) + p(H − 1). We choose a neighbor g′ from the uniform distribution on N(g) (i.e., with probability R(g, g′) = 1/|N(g)|). By these definitions, Algorithm 1 satisfies the conditions of (a) irreducibility and (b) weak reversibility.
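A compact sketch of Algorithm 1 in Python; the variable names and the geometric schedule T_t = T_0 · r^t are ours (the convergence-guaranteed logarithmic schedule is discussed below), and the neighborhood is the single-row/column move described above:

```python
import math
import random
import numpy as np

def squared_residue(A, g1, g2, K, H):
    # Within-block sample variance: the objective f(g), up to a 1/np factor.
    total = 0.0
    for k in range(K):
        for h in range(H):
            block = A[np.ix_(g1 == k, g2 == h)]
            if block.size > 0:
                total += ((block - block.mean()) ** 2).sum()
    return total / A.size

def sa_min_residue(A, K, H, T0=1.0, r=0.99, eps=1e-6, seed=0):
    rng = random.Random(seed)
    n, p = A.shape
    g1 = np.array([rng.randrange(K) for _ in range(n)])
    g2 = np.array([rng.randrange(H) for _ in range(p)])
    f = squared_residue(A, g1, g2, K, H)
    t, T = 0, T0
    while T >= eps:
        # Neighbor: move exactly one row or column to a different cluster.
        i = rng.randrange(n + p)
        c1, c2 = g1.copy(), g2.copy()
        if i < n:
            c1[i] = rng.choice([k for k in range(K) if k != g1[i]])
        else:
            c2[i - n] = rng.choice([h for h in range(H) if h != g2[i - n]])
        f_new = squared_residue(A, c1, c2, K, H)
        d = f_new - f
        # Accept downhill moves always, uphill moves with prob. exp(-d/T).
        if d < 0 or rng.random() < math.exp(-d / T):
            g1, g2, f = c1, c2, f_new
        t += 1
        T = T0 * r ** t
    return g1, g2, f

rng = np.random.default_rng(1)
A = rng.normal(scale=0.1, size=(6, 6))
A[:3, :3] += 5.0
g1_hat, g2_hat, f_hat = sa_min_residue(A, 2, 2)
assert f_hat >= 0.0
```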
Algorithm 2 SA algorithm for finding the solution g̃ of the truncation interval.
Input: Optimal set of cluster memberships ĝ in terms of the squared residue, a cooling schedule of temperature {T_t}_{t=0}^∞, and a threshold ε.
Output: Approximated optimal set of cluster memberships g̃ for determining the truncation interval.
1: t ← 0.
2: Randomly generate initial cluster memberships: g̃ = (g̃^(1), g̃^(2)).
3: Compute the initial value of the objective function: if a^(g̃,ĝ) = 0, then f^(0) ← +∞; otherwise, f^(0) ← (−b^(g̃,ĝ) − √((b^(g̃,ĝ))² − 4 a^(g̃,ĝ) c^(g̃,ĝ))) / (2 a^(g̃,ĝ)).
4: while T_t ≥ ε do
5: Randomly choose the size s of a subset of row or column indices from {1, …, n + p}.
6: Randomly choose a set S of s row or column indices without duplication from the uniform distribution.
7: g̃′ ← g̃.
8: for each row or column index in S do
9: if the ith row is selected then
10: Randomly generate a new cluster index k′ of the ith row from the uniform distribution on {1, …, K} \ {g̃_i^(1)}, and replace (g̃′)_i^(1) with k′.
11: else if the jth column is selected then
12: Randomly generate a new cluster index h′ of the jth column from the uniform distribution on {1, …, H} \ {g̃_j^(2)}, and replace (g̃′)_j^(2) with h′.
13: end if
14: end for
15: Compute the value of the objective function: if a^(g̃′,ĝ) = 0, then f′ ← +∞; otherwise, f′ ← (−b^(g̃′,ĝ) − √((b^(g̃′,ĝ))² − 4 a^(g̃′,ĝ) c^(g̃′,ĝ))) / (2 a^(g̃′,ĝ)).
16: Δf ← f′ − f^(t).
17: if Δf < 0 then
18: g̃ ← g̃′, f^(t+1) ← f′.
19: else
20: With probability exp(−Δf / T_t), set g̃ ← g̃′ and f^(t+1) ← f′; otherwise, keep g̃ and set f^(t+1) ← f^(t).
21: end if
22: t ← t + 1.
23: end while
Algorithm 2 is the SA algorithm used for finding an approximated solution of the cluster memberships g̃, which determines the truncation interval. In this algorithm, we define the set of states S and the objective function f as S ≡ G_KH and, from (28), f(g) ≡ (−b^(g,ĝ) − √((b^(g,ĝ))² − 4 a^(g,ĝ) c^(g,ĝ))) / (2 a^(g,ĝ)), respectively. Unlike Algorithm 1, we have to consider the feasibility of a solution g̃, that is, it should satisfy a^(g̃,ĝ) < 0. To guarantee the condition of (a) irreducibility while avoiding infeasible solutions, we define the neighbors N(g) of the current state g as N(g) ≡ G_KH. By this definition, for any pair of states (g, g′), a transition from g to g′ is possible with non-zero probability: R(g, g′) > 0. Accordingly, we restrict large changes in the state by controlling the transition probability R(g, g′), as in (31). By setting the objective function value of infeasible solutions to +∞, we can avoid moving to them throughout the algorithm while satisfying the conditions of (a) irreducibility and (b) weak reversibility.
Regarding the cooling schedule of temperature, we can use the following definition [10], which satisfies the conditions (1), (2), and (3) in (c):

T_t = c / log(t + 2), (32)

where c is a constant satisfying c ≥ d*. In our cases, for instance, we can define the constant c for Algorithm 1 as c ≡ x^⊤ E^(ḡ) x, where ḡ ≡ argmax_{g ∈ G_KH} x^⊤ E^(g) x. Here, we take into account the fact that the maximizing cluster memberships ḡ are attained by assigning all the elements of an observed matrix into a single block, whose objective function value upper bounds the depth d* of any local minimum. For Algorithm 2, there is no state that is a local but not global minimum (i.e., all local minima are also global minima); this is because, for any pair of states (g, g′) with f(g′) < f(g), g′ is reachable at height f(g) from g. Therefore, from the result in [10], the convergence in probability to a global minimum state is guaranteed without the condition (3) in (c).
Practically, an algorithm based on the cooling schedule (32) is too slow, that is, it requires too much computation time before convergence. Therefore, in the experiments in Section 4, we used the geometric cooling schedule T_t = T_0 × r^t for all t ≥ 0, though this definition satisfies only the conditions (1) and (2), not (3).
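The contrast between the convergence-guaranteed logarithmic schedule and the geometric one we actually use can be seen numerically (the constants c, T_0, and r below are illustrative):

```python
import math

def log_schedule(t, c=1.0):
    # Logarithmic schedule of the form T_t = c / log(t + 2): satisfies
    # conditions (1)-(3) when c >= d*, but decays extremely slowly.
    return c / math.log(t + 2)

def geom_schedule(t, T0=1.0, r=0.99):
    # Geometric schedule T_t = T0 * r**t: satisfies (1) and (2) only.
    return T0 * r ** t

# After a million steps the logarithmic temperature is still above 0.07,
# while the geometric one passes a 1e-6 threshold within ~1400 steps.
assert log_schedule(10 ** 6) > 0.07
assert geom_schedule(1400) < 1e-6
```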

Experiment
To show the validity of our proposed test, we compared its behavior with that of a naive statistical test, which does not consider the selection event. By ignoring the fact that the set of cluster memberships ĝ was selected based on the data vector x, we construct a naive test (which is in fact invalid) with the test statistic T in (12) by assuming, from (20),

T ∼ χ_{np−KH}. (34)

The p-value of such a naive test is given by

p_T^naive = 1 − γ((np − KH)/2, T²/2) / Γ((np − KH)/2), (35)

where Γ(·) is the gamma function.
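For completeness, the naive p-value (35) is just the survival function of an untruncated chi distribution; a pure-Python sketch (the helper names are ours):

```python
import math

def reg_lower_gamma(a, x, tol=1e-12):
    # Regularized lower incomplete gamma P(a, x) = gamma(a, x) / Gamma(a),
    # computed via its standard series expansion.
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    n = 0
    while term > tol * total:
        n += 1
        term *= x / (a + n)
        total += term
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))

def naive_chi_p_value(T, c):
    # (35): survival function of the (untruncated) chi distribution with
    # c = np - K*H degrees of freedom, ignoring the selection event.
    return 1.0 - reg_lower_gamma(c / 2.0, T * T / 2.0)

# Sanity check against the standard normal: chi with 1 dof is |N(0, 1)|,
# so the p-value at T = 1.96 is 2 * (1 - Phi(1.96)) = 0.05.
assert abs(naive_chi_p_value(1.96, 1) - 0.05) < 0.001
```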
In the following sections 4.1 and 4.2, we check the behavior of the p-values and the TPR and FPR when using the proposed and naive tests, in order to show the validity of our proposed method. In these sections, we use the term "realizable case" to indicate that the hypothetical cluster numbers of rows and columns (K, H) are equal to the null ones (K (N) , H (N) ), and the term "unrealizable case" to indicate that at least one of K < K (N) and H < H (N) holds, as described in Section 1.
Aside from the experiments in this section, we conducted sensitivity analysis of the approximated version of the proposed test with respect to the cooling schedule of SA in the realizable case in Appendix E. Moreover, we conducted an additional experiment to employ an existing fast biclustering method [30] instead of the proposed SA-based algorithm for estimating the cluster memberships in Appendix F.

Exact test in realizable case
First, we check the behavior of the p-values calculated by the proposed (29) and naive (35) tests, under the condition that the given set of cluster numbers (K, H) is equal to the null one (K^(N), H^(N)). As shown in Section 3.2, the p-value of the proposed test follows the uniform distribution on [0, 1], while there is no such guarantee for that of the naive test.
For the experiment, we randomly generated data matrices with sizes (n, p) = (5, 5), (6, 6), . . . , (9, 9). We set both the null and hypothetical sets of cluster numbers at (2, 2); we defined the null cluster memberships as g^(N),(1)_i = (i mod K^(N)) + 1 for all i, and g^(N),(2)_j = (j mod H^(N)) + 1 for all j. In regard to the mean vector µ_0, we tried five settings µ^(l)_0, l = 1, . . . , 5. Based on the above settings, we generated 1000 data vectors by x ∼ N(µ^(l)_0, 0.05^2 I) for each setting of matrix size (n, p) and mean vector µ_0. Figure 2 shows examples of the generated data matrices. For each generated data vector x, we computed the squared residues of all the patterns of cluster memberships g. Subsequently, we chose the optimal set of cluster memberships ĝ (i.e., the solution with the minimum squared residue) and checked whether it was equivalent to the null set of cluster memberships g^(N). For both cases of ĝ = g^(N) and ĝ ≠ g^(N), we computed the test statistic T in (12), the truncated interval in (26), and the p-values in (29) and (35). Subsequently, we plotted the results as follows:
• For the trials where ĝ = g^(N) holds (i.e., under the null hypothesis), we plotted the p-values given by (29) and (35) in Figures 3 and 4, respectively. We also plotted (i) the test statistic D√r of the Kolmogorov-Smirnov test [6] for the p-values of the proposed and naive tests, and (ii) the accuracy of the clustering algorithm A, that is, the ratio of the number of null cases (i.e., ĝ = g^(N)) to the 1000 trials, for each setting, in Figures 5 and 6, respectively.
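The exhaustive step described above (computing the squared residue for every membership pattern and keeping the minimizer) can be sketched as follows for a tiny matrix; the data, sizes, and planted block are illustrative:

```python
import itertools
import numpy as np

# Enumerate ALL row/column membership patterns for a tiny matrix and keep
# the pattern with the minimum squared residue. This brute force is what
# makes the exact test feasible only for very small n, p, K, H.
def squared_residue(A, g1, g2, K, H):
    """Sum over blocks of squared deviations from the block-wise mean."""
    res = 0.0
    for k in range(K):
        for h in range(H):
            blk = A[np.ix_(np.array(g1) == k, np.array(g2) == h)]
            if blk.size:
                res += ((blk - blk.mean()) ** 2).sum()
    return res

rng = np.random.default_rng(1)
n, p, K, H = 4, 4, 2, 2
A = rng.normal(size=(n, p))
A[:2, :2] += 5.0                      # plant a block structure

best = min(
    itertools.product(itertools.product(range(K), repeat=n),
                      itertools.product(range(H), repeat=p)),
    key=lambda g: squared_residue(A, g[0], g[1], K, H))
print(best)
```

Here the search space already has K^n × H^p = 256 patterns for a 4 × 4 matrix, which illustrates why the matrix sizes in this section are kept small.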
From Figures 3, 4, and 5, we see that the distribution of the p-values of the proposed test was closer to the uniform distribution on [0, 1] than that of the naive test, particularly when the difference in block-wise mean between the blocks was small. This result shows that the proposed test can successfully account for the selective bias of using the squared residue minimization solution, by using the truncated chi distribution in (12). In contrast, the p-values of the naive test based on (34) did not follow the uniform distribution on [0, 1], since the naive test ignores the selective bias and is based on the (non-truncated) chi distribution. It must be noted that, in our problem setting, unlike in common statistical tests, the assertion of the null hypothesis ((g^(N),(1), g^(N),(2)) = (ĝ^(1), ĝ^(2))) is stronger than that of the alternative hypothesis ((g^(N),(1), g^(N),(2)) ≠ (ĝ^(1), ĝ^(2))). As a result, the p-values of the naive test are biased toward larger values than the correct ones.
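The uniformity comparison behind Figures 3-5 can be illustrated with the Kolmogorov-Smirnov statistic on simulated p-values (a sketch with synthetic stand-ins, not the paper's experimental data):

```python
import numpy as np
from scipy import stats

# Under a valid test, null p-values are Uniform[0, 1]; the KS statistic D
# (scaled by sqrt(r) for r trials, as in the figures) measures deviation.
rng = np.random.default_rng(0)
r = 1000
valid_p = rng.uniform(size=r)             # stand-in for a calibrated test
biased_p = rng.uniform(size=r) ** 0.5     # stand-in for upward-biased p-values

for name, pvals in (("valid", valid_p), ("biased", biased_p)):
    D = stats.kstest(pvals, "uniform").statistic
    print(name, round(D * np.sqrt(r), 3))
```

The upward-biased sample mimics the naive test's behavior reported above: its empirical CDF lies below the diagonal, inflating D√r.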
In regard to the test performance, from the results in the top of Figure 7, we see that the FPR was low in all the settings (i.e., proposed and naive; significance levels α = 0.1, 0.05, and 0.01; and block-wise mean µ_0). The results in the bottom of Figure 7 show that the TPR of the proposed test was higher than that of the naive test in the same setting, which is consistent with the discussion in the previous paragraph. However, the TPR of the proposed test was not sufficiently close to one in all the settings. This can be attributed to the few "true positive" cases in the realizable setting. Figure 6 shows that the estimated block structure ĝ that attained the minimum squared residue was equivalent to the null one g^(N) in most cases. In other words, almost all trials were "null cases," so the small number of false negative cases significantly affects the TPR. Particularly, when the matrix size or the difference in block-wise mean between the blocks increases, it becomes easier to estimate the null block structure, and the clustering algorithm almost always outputs the correct cluster memberships.

Figure 2: Examples of the observed data matrices with the size of (n, p) = (9, 9), generated based on different block-wise means. The title of each panel shows the range of the block-wise mean vector µ_0. The blue lines show the null cluster memberships. For visibility, the rows and columns of each matrix are sorted according to their null clusters.

Exact test in unrealizable cases
Next, we compared the behavior of the proposed and naive tests in the unrealizable cases, that is, where at least one of K < K^(N) and H < H^(N) holds. For the experiment, we randomly generated data matrices with sizes (n, p) = (5, 5), (6, 6), . . . , (9, 9). We set the null set of cluster numbers at (3, 2) and defined the null cluster memberships as in Section 4.1. In regard to the mean vector µ_0, we tried five settings µ^(l)_0, l = 1, . . . , 5. Based on the above settings, we generated 1000 data vectors by x ∼ N(µ^(l)_0, 0.05^2 I) for each setting of matrix size (n, p) and mean vector µ_0. For each generated data vector x, we computed the squared residues of all the patterns of cluster memberships g. Subsequently, we chose the optimal set of cluster memberships ĝ for a given set of cluster numbers (K, H). In regard to the hypothetical cluster numbers, we tried the following five settings: (K, H) = (1, 1), (2, 1), (3, 1), (1, 2), and (2, 2). For each setting, based on the selected result ĝ, we computed the test statistic T in (12), the truncated interval in (26), and the p-values in (29) and (35). Finally, we plotted the TPR of the proposed and naive tests in Figure 8.

Figure 8 shows that the TPR of the proposed test was higher than that of the naive test in the same setting; however, in most cases, the difference between them was small. This may be attributed to the fact that we set the matrix size (n, p) and the hypothetical block size (K, H) to small numbers in order to perform the exact test, which is computationally expensive, so the effect of selecting the optimal block structure ĝ from all the patterns G_KH is marginal. It must be noted that, unlike in the realizable case in Section 4.1, the block structures output by the clustering algorithm were always different from the null ones, because the hypothetical set of cluster numbers was insufficient to represent the null block structure in the unrealizable cases. In other words, all the 1000 trials in each setting correspond to alternative cases.
The TPRs of the proposed and naive tests were comparable particularly under the following two settings: (1) the case where we set the hypothetical number of blocks at (K, H) = (1, 1), and (2) the case where the difference in the null block-wise mean µ between the blocks was relatively large. These results stem from the nature of the biclustering problem itself, as well as from the limited power of the proposed selective test. In case (1), we assume that the entire data matrix A consists of a single block, and thus there is only one possible estimated bicluster structure ĝ, regardless of the selection event. Therefore, in this case, the proposed and naive tests are equivalent in the first place. As for case (2), since the difference in the null block-wise mean between the blocks was large, the test statistics of both the proposed and naive tests became large enough for the null hypothesis to be rejected.

Approximated test in both realizable and unrealizable cases
Finally, we checked the behavior of the approximated test introduced in Section 3.3. In both the realizable and unrealizable cases, we generated data matrices with sizes (n, p) = (10 + 2m, 10 + 2m), for m = 0, 1, . . . , 4, in the same way as in Sections 4.1 and 4.2. We used the same settings as in Sections 4.1 and 4.2, respectively, for the realizable and unrealizable cases, with regard to the following conditions: the null and hypothetical sets of cluster numbers, the definition of the null cluster memberships g^(N), the mean vectors and the standard deviation σ_0, and the number of data vectors for each setting. Concerning the SA algorithms, in both Algorithms 1 and 2, we defined the cooling schedule as T_0 = 10 and T_t = T_0 × 0.99^t for all t.
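A generic simulated-annealing loop with this geometric schedule looks as follows (a minimal sketch on a toy one-dimensional energy; the move set and energy are illustrative stand-ins for the squared-residue minimization over memberships in Algorithms 1 and 2):

```python
import math
import numpy as np

def anneal(energy, neighbor, state, n_steps=2000, T0=10.0, r=0.99, seed=0):
    """Metropolis-style annealing with geometric cooling T_t = T0 * r^t."""
    rng = np.random.default_rng(seed)
    e = energy(state)
    best_state, best_e = state, e
    for t in range(n_steps):
        T = T0 * r ** t
        cand = neighbor(state, rng)
        e_cand = energy(cand)
        # Always accept improvements; accept uphill moves with
        # probability exp(-(e_cand - e) / T).
        if e_cand <= e or rng.random() < math.exp(-(e_cand - e) / T):
            state, e = cand, e_cand
            if e < best_e:
                best_state, best_e = state, e
    return best_state, best_e

# Toy usage: a 1-D energy with several local minima.
energy = lambda s: (s - 3.0) ** 2 + math.sin(5.0 * s)
neighbor = lambda s, rng: s + rng.normal(scale=0.5)
state, e = anneal(energy, neighbor, state=0.0)
print(round(state, 2), round(e, 2))
```

In the paper's setting the state is a membership pattern g, the energy is the squared residue, and the neighbor move reassigns a single row or column; the acceptance rule and cooling are the same.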
As in the exact tests, Figures 9 and 10, respectively, show the histograms of the p-values of the proposed and naive approximated tests in the realizable case. For the realizable case, we also plotted (i) the test statistic D√r of the Kolmogorov-Smirnov test [6] for the p-values of the proposed and naive tests, and (ii) the accuracy of the approximated clustering algorithm, in Figures 11 and 12, respectively. Figure 13 shows the FPR and TPR in the realizable case, and Figure 14 shows the TPR in the unrealizable cases.

Figures 9, 10, and 11 show that the distribution of the p-values of the proposed test was closer to the uniform distribution on [0, 1] than that of the naive test, as in the result of the exact test in Section 4.1. Concerning the test performance in the realizable case, Figure 13 shows that the FPR was low in all the settings, and the TPR of the proposed test was higher than that of the naive test in the same setting. However, as in the exact case, the TPR of the proposed test was not sufficiently close to one in all the settings; this can be attributed to the few "true positive" cases. In the next Section 4.4, we check more difficult cases for the approximated clustering algorithm, where the null cluster numbers are larger than (2, 2).

4.4 Approximated test in the realizable case, (K^(N), H^(N)) = (3, 3), (4, 4), (5, 5)

To check the behavior of the p-values, FPR, and TPR of the proposed test in more difficult settings, where the clustering algorithm cannot successfully estimate the cluster memberships in most cases, we tried the following three settings of null cluster numbers: (K^(N), H^(N)) = (3, 3), (4, 4), and (5, 5). These settings have more patterns of possible block structures than the case of (K^(N), H^(N)) = (2, 2) in Section 4.3. Hence, it becomes difficult for the approximated clustering algorithm (which stops after a fixed finite number of steps in the experiment) to output the null set of cluster memberships.
We generated data matrices in the same way as in Section 4.3. We used the same settings as in the realizable case of Section 4.3 with regard to the following conditions: the set of matrix sizes (n, p), the definition of the null cluster memberships g^(N), the standard deviation σ_0, and the cooling schedule of the SA algorithm. We tried the following three settings of the null number of blocks: (3, 3), (4, 4), and (5, 5); subsequently, for each setting, we defined the mean vector µ_0 accordingly. We set the number of data vectors for each setting at 500.
We plotted the test statistic D√r of the Kolmogorov-Smirnov test [6] for the p-values of the proposed and naive tests, and the accuracy of the approximated clustering algorithm, in Figures 15 and 16, respectively. Figures 17, 18, and 19 show the FPR and TPR of the proposed and naive tests. These figures show that the TPRs of both the proposed and naive tests were higher than in the case of (K^(N), H^(N)) = (2, 2), and that the TPR of the proposed test was higher than that of the naive one in these settings.

Discussion
In this section, we discuss the following three points about the proposed statistical test: its power, the trade-off between computational efficiency and accuracy, and the extension of our findings to more general cases.
First, as also pointed out in a previous study [23], the null distribution of the test statistic of the proposed test is given by conditioning on z and u, besides the selected set of cluster memberships ĝ, which leads to a reduction in the test power [7]. For now, we do not have a way of removing these nuisance parameters, owing to the problem setting of an LBM. In a one-way clustering problem, where there are n data vectors with p dimensions, we can at least approximate the distributions of z and u based on their histograms; however, in the LBM setting, there is only a single observed matrix of size n × p. Solving this problem is beyond the scope of this paper; future studies should focus on constructing a more powerful selective test on a bicluster structure by using an additional technique such as the bootstrap [31,33].
Next, we proposed both exact and approximated tests to cope with the combinatorial explosion of the possible block memberships. The null distribution (12) of the proposed test statistic is based on the assumption that the estimated cluster memberships ĝ are the global minimum solution of the squared residue, which is difficult to obtain in the first place. Although the solutions of the two SA algorithms 1 and 2 are guaranteed to converge in probability to the globally optimal solutions of their corresponding problems, we cannot validate the outputs of these algorithms when they are run for a finite number of steps, as in practice. It would be more desirable to derive an exact p-value for some other approximated test. Stopping the SA algorithms after a constant number of steps also affects the accuracy of the test; to find the optimal solutions, we would have to check all the patterns of possible block memberships, whose number increases with the observed matrix size and the number of blocks. However, if we increase the number of steps according to such a problem size, the computation of the SA algorithms becomes intractable. Therefore, another important direction is to seek a more computationally efficient test that mitigates this trade-off.
Finally, the proposed test enables valid statistical inference for a Gaussian LBM, where we assume that each element of an observed matrix independently follows a Gaussian distribution, given a block structure. This Gaussian assumption is crucial for deriving the exact p-value in the selective inference framework, as in [23]. However, in many practical datasets, including the "MovieLens" dataset of movie ratings [11] and the dataset of document-word relationships in NeurIPS conference papers [26], the elements of the observed matrix take discrete values, to which the proposed test cannot be applied. So far, there has been no selective test that can be directly applied to binary data vectors from a Bernoulli distribution. To address this problem, a randomized model selection method [32] has been proposed to construct an asymptotically valid selective test on binary data by adding random noise to the statistic used for the selection event. By using such a technique, future studies should generalize the proposed test to non-Gaussian cases.

Conclusion
We developed a new selective inference method for the row and column cluster memberships of a latent block model given by a clustering algorithm based on squared residue minimization. By considering the selective bias, which is caused by the fact that the hypothetical block structure is estimated from the given data matrix, we constructed a valid test based on a truncated chi distribution. Since such an exact test requires the globally optimal solutions of two combinatorial optimization problems, we also constructed an approximated test based on simulated annealing algorithms. Experimental results showed that the proposed exact and approximated tests worked successfully, compared to the naive tests that did not take the selective bias into account.

Figure 7: FPR and TPR in the realizable case with different significance levels (α = 0.1, 0.05, and 0.01), for the proposed (left) and naive (right) statistical tests. If there were no null (i.e., ĝ = g^(N)) or alternative (i.e., ĝ ≠ g^(N)) cases, respectively, then the corresponding points of FPR or TPR are not plotted.

Figure 20: A data matrix A whose squared residue σ^2 is zero with block structure g.
Appendix A Proof that E^(g) − E^(g') ≠ O for g, g' ∈ G_KH, g ≠ g'

Proof. We prove that E^(g) − E^(g') ≠ O by contradiction. Let g, g' ∈ G_KH be two sets of cluster memberships, both of which have K × H blocks or fewer and which satisfy g ≠ g'. We denote the exact number of blocks of g as (K^(g), H^(g)). Assume that E^(g) − E^(g') = O holds. Then, for all x ∈ R^np, we have x^T E^(g) x = x^T E^(g') x; in other words, from (6), block structures g and g' yield the same squared residue σ^2 for any data matrix A. Let us consider a data matrix A that has block structure g and in which all the elements in the (k, h)th block are (k − 1)H^(g) + h, where k = 1, . . . , K^(g) and h = 1, . . . , H^(g), as shown in Figure 20. The squared residue of such a matrix A with block structure g is zero, and thus x^T E^(g) x = 0 holds. However, under a block structure g' ≠ g, there exists at least one block of matrix A that contains two or more mutually different values, unless g' is a refinement of g, which results in x^T E^(g') x > 0. In case g' is a refinement of g, by considering an observed matrix A with block structure g' instead of g, we obtain x^T E^(g') x = 0 and x^T E^(g) x > 0 from a similar discussion. This contradicts the assumption that E^(g) − E^(g') = O.

Appendix B Proof that rank(E^(ĝ)) = np − KH

Proof. For any cluster memberships ĝ, by simultaneously switching rows and columns with the same indices, matrix E^(ĝ) can be transformed into a block-diagonal matrix Ẽ^(ĝ) with one block for each (k, h). We show that, within the same [H(k − 1) + h]th block, the maximum number of linearly independent columns is |I_k||J_h| − 1. First, from (42), the columns of each block sum to the zero vector, and therefore the maximum number of linearly independent columns is smaller than |I_k||J_h|.
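The rank identity in Appendix B can be checked numerically (a sketch; the membership pattern and sizes are illustrative, and E^(g) is built as the complement of the orthogonal projection onto block-wise means):

```python
import numpy as np

# Numerical check of Appendix B: for a block structure g with all blocks
# non-empty, E^(g) = I - P, where P projects each element onto its
# block-wise mean, has rank np - KH.
n, p, K, H = 6, 6, 2, 3
g_row = np.array([i % K for i in range(n)])   # null-style memberships
g_col = np.array([j % H for j in range(p)])

# Block indicator matrix Z in {0,1}^{np x KH}: entry (i*p + j, k*H + h)
# is 1 iff row i is in row-cluster k and column j is in column-cluster h.
Z = np.zeros((n * p, K * H))
for i in range(n):
    for j in range(p):
        Z[i * p + j, g_row[i] * H + g_col[j]] = 1.0

# P = Z (Z^T Z)^{-1} Z^T is the orthogonal projection onto block means.
P = Z @ np.linalg.inv(Z.T @ Z) @ Z.T
E = np.eye(n * p) - P

print(np.linalg.matrix_rank(E))  # expect np - KH = 36 - 6 = 30
```

Since span(Z) has dimension KH when every block is non-empty, the projector P has rank KH and its complement E has rank np − KH, matching the proof.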
Next, the columns with indices i = 1, . . . , |I_k||J_h| − 1 are linearly independent. By combining the above results, the maximum number of linearly independent columns of matrix Ẽ^(ĝ) is Σ_{k,h} (|I_k||J_h| − 1) = np − KH. Since the rank of matrix E^(ĝ) is equal to that of matrix Ẽ^(ĝ), we finally have rank(E^(ĝ)) = np − KH.

Appendix C Lower bound on the number of mutually different block structures

The last (n − ñ) elements are arbitrary (i.e., different indexings of these elements yield mutually non-equivalent sets of row cluster memberships), which gives K^(n−ñ) patterns. Therefore, there are Σ_{ñ=K}^{n} [(ñ − 2)! / ((ñ − K)!(K − 2)!)] K^(n−ñ) patterns of mutually different sets of row cluster memberships. From the same discussion for column cluster memberships, we obtain a lower bound for the total number κ of patterns of mutually different block structures, which is exponential in n and p for a fixed number of blocks (K, H).
Appendix D Proof that T_E and (u_E, z_E) are mutually independent

Proof. We have assumed that x ∼ N(µ_0, σ_0^2 I). To obtain the last equation of (48), we used the assumption that Eµ_0 = 0. Next, we use Proposition 2.1 in [29]: let p_θ be the probability density function of an exponential family distribution with parameter θ; then the natural statistic T is complete and sufficient for the natural parameter η. From this proposition and (48), z_E is complete and sufficient for µ_0.
We also show that (T_E, u_E) are ancillary for µ_0. To prove this, we first show that T_E and u_E are mutually independent. Let y ≡ D̃V^T x ∈ R^(np−KH), where V and D̃ are the matrices defined in (14) and (16), respectively, and consider the distribution of y, which follows from (17). From Proposition 2.1 in [29] and the fact that T_E^2 = ||y||_2^2 / σ_0^2, T_E^2 is complete and sufficient for µ_0. Since there is a one-to-one correspondence between T_E and T_E^2, T_E is also complete and sufficient for µ_0. Let ũ_E ≡ y / ||y||_2. Since ũ_E follows the uniform distribution on the surface of the unit sphere and u_E = V D̃^T ũ_E, u_E is ancillary for µ_0. By combining these results, T_E and u_E are mutually independent from Basu's theorem [1]. Therefore, we have p(T_E, u_E) = p(T_E) p(u_E), where p(·) denotes a probability density function. From the above discussion about p(u_E) and the fact that T_E ∼ χ_(np−KH) (from (18)), (T_E, u_E) are ancillary for µ_0.
Based on the above results, z_E and (T_E, u_E) are independent from Basu's theorem [1], which gives (50). From (50) and the fact that the ranges of z_E, T_E, and u_E do not depend on each other, we also have p(u_E, z_E) = p(u_E) p(z_E), and thus p(T_E, u_E, z_E) = p(T_E) p(u_E) p(z_E), which concludes the proof.

Appendix E Sensitivity analysis with respect to the cooling schedule of simulated annealing

We conducted a sensitivity analysis of the approximated version of the proposed test with respect to the cooling schedule of SA in the realizable case (i.e., (K, H) = (K^(N), H^(N))). Aside from the settings of the mean vector µ_0 and the cooling schedule of SA, we employed the same settings as in Section 4.3. We tried the following five cooling schedules: T_t = 10 × r^t, for all t ≥ 0, where r = 0.99, 0.97, 0.95, 0.93, 0.91. As for the mean vector µ_0, we used a single fixed setting. We plotted (i) the test statistic D√r of the Kolmogorov-Smirnov test [6] for the p-values of the proposed and naive tests, and (ii) the accuracy of the approximated clustering algorithm, in Figures 24 and 25, respectively. Figure 26 shows the FPR and TPR. From Figure 25, we see that the accuracy of the SA algorithm became lower for smaller values of r. As shown in Figure 26, the FPR was low in all the settings, while the TPRs of both the proposed and naive tests became lower for larger values of r. A possible reason for this result is that, with small r, the SA algorithm tends to output "bad" solutions (i.e., solutions that yield large squared residues), and thus both the proposed and naive tests can easily reject the null hypothesis.
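As a side note, the unconditional distribution T_E ∼ χ_(np−KH) used in Appendix D can be checked by simulation for a fixed (not selected) block structure; a sketch, where the memberships and sizes are illustrative:

```python
import numpy as np
from scipy import stats

# For a FIXED block structure g (no selection event), the statistic
# T_E = sqrt(x^T E^(g) x) / sigma_0 should follow the chi distribution
# with np - KH degrees of freedom.
rng = np.random.default_rng(0)
n, p, K, H = 5, 5, 2, 2
g_row = np.array([i % K for i in range(n)])
g_col = np.array([j % H for j in range(p)])

# Projection onto block-wise means via the block indicator matrix Z.
Z = np.zeros((n * p, K * H))
for i in range(n):
    for j in range(p):
        Z[i * p + j, g_row[i] * H + g_col[j]] = 1.0
P = Z @ np.linalg.inv(Z.T @ Z) @ Z.T
E = np.eye(n * p) - P

sigma0 = 0.05
mu0 = P @ rng.normal(size=n * p)       # mean vector inside the block space
xs = rng.normal(mu0, sigma0, size=(5000, n * p))
T = np.sqrt(np.einsum("ti,ij,tj->t", xs, E, xs)) / sigma0

df = n * p - K * H                      # 21 degrees of freedom
print(T.mean(), stats.chi(df).mean())   # the two should be close
```

Because µ_0 lies in the block-mean space, Eµ_0 = 0 and the statistic is central; once ĝ is selected by minimizing the residue, this marginal law no longer holds, which is exactly why the truncation of Section 3.2 is needed.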

Appendix F Application of a computationally efficient biclustering algorithm for estimating the cluster memberships
The proposed approximated test based on the SA algorithm is guaranteed to converge in probability to the global minimum solution in terms of the squared residue under the conditions given in Section 3.3. However, this SA algorithm requires a long computation time before convergence. As another option, we can use a computationally efficient biclustering algorithm for estimating the cluster memberships ĝ.
Various fast biclustering algorithms have been proposed [4,21,30]. Among them, we applied the algorithm proposed by Tan and Witten [30], which aims to minimize the loss function L(g, B; x) in (8). In this algorithm, to find a locally optimal solution, we iteratively estimate the block-wise mean B and the row and column cluster memberships g^(1) and g^(2). The specific procedure is given in Algorithm 3. There is no theoretical guarantee that this algorithm converges to the globally optimal solution in terms of the squared residue; however, under the assumption that it yields a good approximation of the globally optimal solution, we can use it instead of the proposed SA algorithm in Section 3.3 for estimating ĝ.
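The alternating-minimization idea can be sketched as follows (a simplified stand-in for Algorithm 3, using random restarts instead of the paper's k-means initialization; the planted data are illustrative):

```python
import numpy as np

def squared_residue(A, g1, g2, K, H):
    """Sum of squared deviations of each element from its block mean."""
    res = 0.0
    for k in range(K):
        for h in range(H):
            blk = A[np.ix_(g1 == k, g2 == h)]
            if blk.size:
                res += ((blk - blk.mean()) ** 2).sum()
    return res

def bicluster(A, K, H, n_iter=30, seed=0):
    rng = np.random.default_rng(seed)
    n, p = A.shape
    g1 = rng.integers(K, size=n)              # random initial memberships
    g2 = rng.integers(H, size=p)
    for _ in range(n_iter):
        B = np.zeros((K, H))                  # block-wise means
        for k in range(K):
            for h in range(H):
                blk = A[np.ix_(g1 == k, g2 == h)]
                B[k, h] = blk.mean() if blk.size else 0.0
        # Reassign each row, then each column, to its best cluster.
        g1 = np.argmin(
            [((A - B[k][g2]) ** 2).sum(axis=1) for k in range(K)], axis=0)
        g2 = np.argmin(
            [((A - B[g1, h][:, None]) ** 2).sum(axis=0) for h in range(H)],
            axis=0)
    return g1, g2

# Planted 2 x 2 block structure; restarts guard against local minima.
A = np.block([[np.zeros((3, 4)), np.full((3, 4), 5.0)],
              [np.full((3, 4), 5.0), np.zeros((3, 4))]])
g1, g2 = min((bicluster(A, 2, 2, seed=s) for s in range(10)),
             key=lambda g: squared_residue(A, g[0], g[1], 2, 2))
print(squared_residue(A, g1, g2, 2, 2))
```

Each sweep cannot increase the loss L(g, B; x), so the iteration reaches a local minimum quickly, which is the speed-versus-optimality trade-off noted above.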
We checked the behavior of the approximated test in a realizable case when using Algorithm 3 to estimate the optimal cluster memberships ĝ. For finding the solution g̃ of the truncation interval, we used Algorithm 2, as in the experiment in Section 4.3. As in Section 4.3, we generated data matrices and applied the approximated test. Aside from the method for estimating the cluster memberships, we used the same settings as in Section 4.3. This experiment was conducted on an Intel Xeon E5-2680 v3 (12 cores @ 2.50 GHz) server with 1,007 GB of RAM.

Figures 27 and 28, respectively, show the histograms of the p-values of the proposed and naive approximated tests. We also plotted (i) the test statistic D√r of the Kolmogorov-Smirnov test [6] for the p-values of the proposed and naive tests, and (ii) the accuracy of the biclustering algorithm in [30], in Figures 29 and 30, respectively. Figure 31 shows the FPR and TPR. Finally, we plotted the computation time for each setting of mean vector µ_0 and matrix size n in Figure 32. From these figures, we see that the biclustering algorithm in [30] was able to achieve accuracy comparable to or better than the proposed SA-based algorithm in less computation time.

Algorithm 3 Computationally efficient biclustering algorithm proposed by Tan and Witten [30].
Output: Approximated optimal set of cluster memberships ĝ = (ĝ^(1), ĝ^(2)).
2: Define initial row cluster memberships ĝ^(1) by applying one-way k-means clustering to the rows of matrix Ā.
3: Define initial column cluster memberships ĝ^(2) by applying one-way k-means clustering to the columns of matrix Ā.
4: while true do
5: ĝ^(1)_0 ← ĝ^(1), ĝ^(2)_0 ← ĝ^(2)

From a similar discussion as in the proof of Theorem 3.1, we have (55) and (56). Since r_Q and r_Q̃ are mutually independent, so are ||r_Q||^2 and ||r_Q̃||^2. By combining this fact, (55), and (56), we have (57). Here, we show that T_E and (u_Q, u_Q̃, z_E, ||r_E||_2) are mutually independent.
With an appropriate definition of the natural parameter η, (||r_E||_2^2 + ||z_E||_2^2, z_E) are complete and sufficient for η from Proposition 2.1 in [29]. Since there is a one-to-one correspondence between (z_E, ||r_E||_2) and (||r_E||_2^2 + ||z_E||_2^2, z_E), (z_E, ||r_E||_2) are also complete and sufficient for η.
Next, we show that (T_E, u_Q, u_Q̃) are ancillary for η. To prove this, we first show that T_E and (u_Q, u_Q̃) are mutually independent. We use the notations in which Q = V_Q D_Q V_Q^T, Q̃ = V_Q̃ D_Q̃ V_Q̃^T, and I − E = V_E D_E V_E^T are singular value decompositions of the matrices Q, Q̃, and I − E, respectively. It must be noted that y_Q ∼ N(0, σ_0^2 I_{d1}), y_Q̃ ∼ N(0, σ_0^2 I_{d2}), and y_E ∼ N(D_E V_E^T µ_0, σ_0^2 I_{KH}), and that p(y_Q, y_Q̃, y_E) = p(y_Q) p(y_Q̃) p(y_E), which results in y_Q, y_Q̃, and y_E being independent. Since r_Q = V_Q D̃_Q y_Q, r_Q̃ = V_Q̃ D̃_Q̃ y_Q̃, and z_E = V_E D̃_E y_E hold, r_Q, r_Q̃, and z_E are also independent. Based on a similar discussion as in Appendix D, ||r_Q||_2 and u_Q are mutually independent, and so are ||r_Q̃||_2 and u_Q̃. Therefore, we have p(||r_Q||_2, u_Q, ||r_Q̃||_2, u_Q̃, z_E) = p(||r_Q||_2, u_Q) p(||r_Q̃||_2, u_Q̃) p(z_E) = p(||r_Q||_2) p(u_Q) p(||r_Q̃||_2) p(u_Q̃) p(z_E) = p(||r_Q||_2, ||r_Q̃||_2) p(u_Q, u_Q̃) p(z_E), which results in (||r_Q||_2, ||r_Q̃||_2) and (u_Q, u_Q̃) being mutually independent. Based on this fact, T_E and (u_Q, u_Q̃) are mutually independent. By using this result and the fact that u_Q and u_Q̃ are also mutually independent, we have p(T_E, u_Q, u_Q̃) = p(T_E) p(u_Q) p(u_Q̃). Based on a similar discussion as in Appendix D about p(u_Q) and p(u_Q̃), and the fact that T_E ∼ F_{d1,d2} from (57), (T_E, u_Q, u_Q̃) are ancillary for η. Therefore, (T_E, u_Q, u_Q̃) and (z_E, ||r_E||_2) are mutually independent from Basu's theorem [1]. By combining the above results, we have (65); to derive the third equation of (65), we used the fact that (u_Q, u_Q̃) and (z_E, ||r_E||_2) are mutually independent, based on a similar discussion as above. From (65), T_E and (u_Q, u_Q̃, z_E, ||r_E||_2) are mutually independent. By combining this fact and (57), we have (66). Next, we consider adding the condition of the selection event of ĝ to the distribution of T_E | u_Q, u_Q̃, z_E, ||r_E||_2 in (66). Given (u_Q, u_Q̃, z_E, ||r_E||_2), the result of the selection depends solely on the value of T_E.
Therefore, adding the selection condition to (66) corresponds to truncating T_E to the region where the selection event occurs, which gives (67). Third, we consider replacing Q and Q̃ in (67) with Q^(ĝ) and Q̃^(ĝ), where ĝ is the output of the clustering algorithm A based on the data vector x. Based on a similar discussion as in Appendix B, the matrices Q^(ĝ) and Q̃^(ĝ) are also projection matrices with ranks d_1 and d_2, respectively, and they satisfy the following conditions:
• Q^(ĝ) µ_0 = Q̃^(ĝ) µ_0 = 0.
• There exists a set of row and column indices I ⊆ {1, . . . , np} such that Q^(ĝ) . . .

Since the matrices Q^(ĝ) and Q̃^(ĝ) depend on the data vector x only through the choice of ĝ, under the condition that the selection result ĝ is given, (67) still holds with the matrices Q^(ĝ) and Q̃^(ĝ), which concludes the proof.