Simultaneous confidence intervals for ranks using the partitioning principle

We consider the problem of constructing simultaneous confidence intervals (CIs) for the ranks of n means based on their estimates together with the (known) standard errors of those estimates. We present a generic method based on the partitioning principle in which the parameter space is partitioned into disjoint subsets and then each one of them is tested at level α. The resulting CIs then have a simultaneous coverage of 1 − α. We show that any procedure which produces simultaneous CIs for ranks can be written as a partitioning procedure. We present a first example where we test the partitions using the likelihood ratio (LR) test. Then, in a second example, we show that a recently proposed method for simultaneous CIs for ranks using Tukey’s honest significant difference test has an equivalent procedure based on the partitioning principle. By embedding these two methods inside our generic partitioning procedure, we obtain improved variants. We illustrate the performance of these methods through simulations and real data analysis on hotel ratings. While the novel method that uses the LR test and its variant produce shorter CIs when the number of means is small, the Tukey-based method and its variant produce shorter CIs when the number of means is large.
MSC2020 subject classifications: Primary 62F07; secondary 62F03, 62F30.


Introduction
In many applications, we seek to rank objects, entities or institutions on the basis of some numerical characteristic that is measured with uncertainty. One important example is assessing the quality of institutions such as medical centers and universities [19,3]. Ranking institutions is usually carried out using performance indicators calculated based on samples. However, these indicators are only estimates of the true ones, and they are accompanied by standard errors, so that confidence intervals for their ranks are crucial.
We refer to a collection of CIs for ranks as having pointwise coverage of, say, 95% when the rank of any particular object is covered with 95% probability. We refer to simultaneous coverage if all ranks are covered jointly with 95% probability. The latter is more useful, because it remains valid for objects that are selected after looking at the data. For example, it ensures correct coverage for the object with the highest observed rank, for the second-highest, or for all objects in the top 5.
In the literature, the ranking problem is considered in several papers mostly focusing on pointwise CIs for the ranks. We mention the parametric bootstrap approach of [19] which is widely used, see [34,17,14] among others. Other methods are proposed based on testing pairwise differences between institutions [29,30,23,10].
Methods for simultaneous CIs for ranks are proposed only by [47,1,25]. The method of [47] uses the parametric bootstrap to construct CIs for ranks and then Monte-Carlo simulations to estimate the simultaneous coverage. In [1], the authors show that the method of [47] is anticonservative and propose a new method based on Tukey's honest significant difference (HSD) test which ensures that the simultaneous confidence level is at least 1 − α. The method of [25] creates simultaneous CIs for ranks starting from simultaneous CIs for the means, which is less powerful than the method of [1].
Other papers from the literature considered the ranking problem but not with the objective to calculate CIs for the ranks, see [9,21,7,27,11,31,32,33,36,43,40].
We introduce in this paper a generic method for simultaneous CIs for ranks of a vector of means. We propose to partition the parameter space by considering all possible orderings of a set of means. Then, using a suitable (local) test, we test all the partitions at level α. The partitioning principle [42,15,18] ensures that by doing so, the familywise error rate is controlled at level α, which enables us to build simultaneous CIs for ranks at level 1 − α. The properties of the CIs for ranks depend on the local test we use; therefore, different choices of the local test will lead to different methods.
Another important property of our generic procedure is that given some procedure that produces simultaneous CIs for ranks, we can construct a local test for the partitions so that the resulting partitioning procedure is equivalent to the original procedure. This shows that all valid procedures for simultaneous CIs are special cases of our approach, motivating the use of our generic procedure when looking for new methods for simultaneous CIs for ranks. Furthermore, in order to improve the partitioning procedure, it suffices to improve its local test which might be easier than improving the original procedure.
We present two examples of local tests. First, we use the likelihood ratio test as a local test and show how the partitioning procedure can be carried out. Although the complexity of the procedure is very high, we show some shortcuts which make the method feasible for up to 30 to 40 means on a regular computer. Second, we show that the method of [1] based on Tukey's HSD can be written as a partitioning procedure by giving an explicit local test for the partitions. Using our generic partitioning procedure, we present two variants, one of the partitioning procedure that uses the LR test and one of the Tukey-based method, which uniformly improve their corresponding methods.
Simulation studies show that the method of [1] based on Tukey's HSD is more powerful than the one based on the LR test especially when the number of means is high while the converse happens when the number of means is small. The improved versions of these methods are shown to be computationally feasible only up to n = 10.
The paper is organized as follows. In Section 2, we give a formal definition of the ranking problem and set the objectives. In Section 3, we present the testing problem and show how to use it in order to produce simultaneous confidence intervals for the ranks. In Section 4, we show how, for any procedure for CIs for ranks, we can construct an equivalent one based on the partitioning principle. In Section 5, we present the likelihood ratio (LR) test and use it as a first example of a local test. In Section 6, we revisit the Tukey-based method of [1] and give an equivalent procedure using the partitioning principle. Finally, in Section 7, we compare the results of the LR test and the Tukey-based method on simulated and real data samples. In the Appendix, we provide proofs of main results and further details (algorithms, simulations and data). Software to perform the methods presented in this paper is available in the ICRanks package downloadable from CRAN.

Context and notation
Let $\mu_{1,T}, \dots, \mu_{n,T}$ be real-valued numbers representing the unknown true means, for example the true performance of the $n$ institutions we want to rank. Denote $\mu_T = (\mu_{1,T}, \dots, \mu_{n,T})$. For ease of readability, we will use $\mu_i$ in place of $\mu_{i,T}$ when reference to the true mean is clear from context. The performance could be the mortality rates of hospitals or the customer rating as in our example in Section 8. We estimate $\mu_{i,T}$ using $y_i$. We assume that the estimator $y_i$ is calculated based on many independent and identically distributed subjects (e.g. customers, patients), so that it is reasonable to assume that $y_i$ is normally distributed with known standard error $\sigma_i$. Our starting point, therefore, is a sample $y = (y_1, \dots, y_n)$ of $n$ independent observations, each drawn from a Gaussian distribution $N(\mu_{i,T}, \sigma_i^2)$.
Definition 1 (Ranks in the presence of ties). Let $\mu = (\mu_1, \dots, \mu_n) \in \mathbb{R}^n$. We define the lower-rank of $\mu_i$ by
$$l_i = 1 + \#\{j : \mu_j < \mu_i\}.$$
We also define the upper-rank of $\mu_i$ by
$$u_i = n - \#\{j : \mu_j > \mu_i\}.$$
We finally define the set-rank of $\mu_i$ as the set of natural numbers $r_i = \{l_i, l_i + 1, \dots, u_i\}$, denoted here $[l_i, u_i]$. Furthermore, if for all $j \neq i$ we have $\mu_j \neq \mu_i$, then $l_i = u_i$ and the set-rank is the singleton $\{l_i\}$.
When there are ties between the means, then each one of the tied means possesses a set of ranks r i = [l i , u i ]. For example, suppose that we only have 3 means μ 1 , μ 2 and μ 3 such that μ 1 = μ 2 < μ 3 . Then, the set-rank of μ 1 is [1,2] which includes both ranks 1 and 2, and the set-rank of μ 2 is also [1,2], whereas the rank of μ 3 is [3,3] which is simply rank 3. The rationale of the definition of the set-ranks is that in case of ties, the ranking is arbitrary, and a small perturbation of the true performance may produce any rank in the set of ranks. We call the ranks induced from the observed sample y the empirical ranks. These ranks might be different from the true ranks of the means, and since the sample is assumed to have a continuous distribution, the empirical (set-)ranks are all singletons.
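The set-ranks of Definition 1 are easy to compute directly. The following is a minimal sketch, assuming the lower-rank is one plus the number of strictly smaller means and the upper-rank is n minus the number of strictly larger means (which reproduces the three-mean example above); the function name `set_ranks` is ours.

```python
def set_ranks(mu):
    """Return the set-rank [l_i, u_i] of each mean as a (lower, upper) tuple."""
    n = len(mu)
    ranks = []
    for mi in mu:
        lower = 1 + sum(1 for mj in mu if mj < mi)  # strictly smaller means
        upper = n - sum(1 for mj in mu if mj > mi)  # strictly larger means
        ranks.append((lower, upper))
    return ranks


# The tied example from the text: mu_1 = mu_2 < mu_3
print(set_ranks([1.0, 1.0, 2.5]))  # [(1, 2), (1, 2), (3, 3)]
```

When no two means are tied, every tuple collapses to a single rank, as stated at the end of Definition 1.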
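We aim, on the basis of the sample $y$, to construct (rectangular) simultaneous confidence intervals for the set-ranks of the means $\mu_{1,T}, \dots, \mu_{n,T}$. In other words, for each $i$ we search for a confidence interval $[L_i, U_i]$ such that
$$P\big([l_i, u_i] \subseteq [L_i, U_i] \ \text{for all } i \in \{1, \dots, n\}\big) \geq 1 - \alpha,$$
where $[l_i, u_i]$ is the set-rank of $\mu_{i,T}$.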

The testing problem and the partitioning principle
To obtain simultaneous confidence intervals for the set-ranks, we propose to test simultaneously all possible sets of set-ranks and then use the non-rejected ones to construct these CIs. Proposition 1 establishes this result. First, we need to define the hypotheses more formally. Consider the special case of three means A, B and C.
Each set of set-ranks corresponds to an ordering of the means. Figure 1 shows all the corresponding cases. Note that these cases partition the space $\mathbb{R}^3$. In order to calculate simultaneous CIs for the ranks of the three means, we propose to test whether the vector of means $(A, B, C)$ complies with each of the cases in Figure 1, which is equivalent to testing all sets of set-ranks (1). Note that in this example, if we assume that there are no ties, only the third level of hypotheses remains. More generally, let $=:<$ denote either $=$ or $<$. In order to calculate simultaneous CIs for the ranks of $\mu_{1,T}, \dots, \mu_{n,T}$, we test the elementary hypotheses
$$H : \mu_{\pi(1)} \mathrel{=:<} \mu_{\pi(2)} \mathrel{=:<} \cdots \mathrel{=:<} \mu_{\pi(n)} \qquad (2)$$

D. Al Mohamad et al.
for all possible configurations of equalities and inequalities among the means and for all permutations π ∈ S n (the symmetric group of all permutations with n numbers). Some of these hypotheses are the same because permuting tied means has no effect on the ordering. Therefore, we do not need these redundant copies.
Clearly, the elementary hypotheses (2) become disjoint after we omit the redundant copies. In other words, if (μ 1 , · · · , μ n ) ∈ H 1 ∩ H 2 , then H 1 = H 2 . Moreover, the union of the parameter spaces implied by these elementary hypotheses is equal to R n , and hence they form a partitioning of R n . In order to test all these hypotheses simultaneously at level α, it suffices to test each one of them at level α, due to the so-called partitioning principle [42,15,18]. Two examples of tests will be introduced later on, in Sections 5 and 6.
The following result states how the confidence intervals are constructed. This is the general approach for simultaneous CIs for ranks based on testing the elementary hypotheses (2) and using the partitioning principle.

Proposition 1. Assume we have a statistical test for the elementary hypotheses (2) with significance level equal to α. The union of unrejected elementary hypotheses at level α constitutes simultaneous confidence intervals for the set-ranks of the means μ 1,T , · · · , μ n,T at level 1 − α.
Although this is useful as a general method, it is not always practical because the number of elementary hypotheses increases rapidly with n. Proposition 2 states the exact number of the elementary hypotheses (2) defining the partitioning of R n . When there are no ties, the number of hypotheses to test drops to n!; in general, therefore, the number of hypotheses to test is higher than n!, and we can even be more precise and calculate an upper and a lower bound using the result in [37]. In R, the function Stirling2 from the package multicool calculates these numbers. In any case, the total number of hypotheses is large, and it is very important to find a way to reduce this complexity by finding relations among tested partitions. In the literature, these relations are called shortcuts [12]. We provide two examples where some shortcuts are exploited in order to reduce the number of tested hypotheses.
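To get a feeling for this growth, the sketch below counts the elementary hypotheses under the assumption that each one corresponds to exactly one ordering of the n means with ties allowed, so that the count is the sum over k of k! times the Stirling number of the second kind S(n, k). This formula is our reading of the text, consistent with its reference to Stirling numbers, and is not quoted from Proposition 2; the helper names are ours.

```python
from math import factorial


def stirling2(n, k):
    """Stirling number of the second kind, via the standard recurrence."""
    table = [[0] * (k + 1) for _ in range(n + 1)]
    table[0][0] = 1
    for i in range(1, n + 1):
        for j in range(1, min(i, k) + 1):
            table[i][j] = j * table[i - 1][j] + table[i - 1][j - 1]
    return table[n][k]


def count_elementary_hypotheses(n):
    """Number of orderings of n means when ties are allowed (assumed formula)."""
    return sum(factorial(k) * stirling2(n, k) for k in range(1, n + 1))


for n in (3, 5, 10):
    # Compare with n!, the count when ties are excluded
    print(n, count_elementary_hypotheses(n), factorial(n))
```

For n = 3 this gives 13: one hypothesis with all means equal, six with exactly one tie, and six strict orderings.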

Any procedure generating valid simultaneous CIs for ranks is equivalent to a partitioning procedure
Although the literature on simultaneous CIs for ranks includes the two papers [1,47], it is always possible to start from pointwise CI methods such as [23,10,29] and correct for multiplicity, for example using Bonferroni's method, so that the resulting CIs become simultaneous, but they tend to be very conservative. In this section, we show that any procedure which produces valid simultaneous CIs for ranks can be written as a partitioning procedure with elementary hypotheses (2) and a suitable statistical test. We note that the class of partitioning procedures is larger than the class of rectangular confidence intervals for ranks, since partitioning may sometimes lead to non-rectangular confidence intervals.
Assume that we have a procedure that generates confidence intervals for ranks with joint confidence level 1 − α. Let $[\tilde L_1, \tilde U_1], \dots, [\tilde L_n, \tilde U_n]$ be the corresponding confidence intervals. This means
$$P\big(r_i(\mu_{i,T}) \subseteq [\tilde L_i, \tilde U_i] \ \text{for all } i \in \{1, \dots, n\}\big) \geq 1 - \alpha. \qquad (3)$$
Let $H$ be some elementary hypothesis from (2). This partition includes all vectors of means having one specific set of set-ranks, say $r_1(H), \dots, r_n(H)$. Hence, for any $(\mu_1, \dots, \mu_n)$ under $H$, we have $r_i(\mu_i) = r_i(H)$ according to Definition 1. We define a local test for $H$, say $\phi$, by
$$\phi(H) = \begin{cases} 1 & \text{if there exists } i \text{ such that } r_i(H) \not\subseteq [\tilde L_i, \tilde U_i], \\ 0 & \text{otherwise.} \end{cases} \qquad (4)$$

Proposition 3. ϕ is a valid test for H at level α.
The test does not reject an elementary hypothesis $H$ if the confidence intervals $[\tilde L_1, \tilde U_1], \dots, [\tilde L_n, \tilde U_n]$ cover the set-ranks of any vector of means under $H$. Since we have a valid local test for the partitions, we can build a partitioning scheme leading to a set of simultaneous CIs with joint confidence level of 1 − α for the ranks, say $[L_1, U_1], \dots, [L_n, U_n]$, due to Proposition 1. We show in the following proposition that they are the same as the ones produced by the original procedure (3).

Proposition 4. The confidence intervals produced by the method (3) are the same as the ones produced by the partitioning procedure using the elementary hypotheses (2) and the local test $\phi$, that is, for all $i$, $[L_i, U_i] = [\tilde L_i, \tilde U_i]$.
This fundamental result indicates that our partitioning procedure is complete [28, Section 1.8] for constructing simultaneous confidence intervals for ranks: every valid method is a special case of this method. Partitioning should, therefore, be considered as a design principle when thinking about new methods. When designing a method, it suffices to look for a new local test for the elementary hypotheses. In order to improve a procedure uniformly, it suffices to improve the local test. Note that the result holds whether there are ties or not. Two examples are given in the following sections. Note that in these two examples, the resulting confidence intervals are invariant against a translation of the means by a common constant.
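The equivalence can be checked numerically on a small case. The sketch below enumerates all elementary hypotheses for n means as ordered blocks, builds a local test of the form (4) from a given set of rectangular CIs, and runs the partitioning procedure, reporting for each mean the smallest interval containing all unrejected set-ranks. All function names are ours, and the input intervals are hypothetical values chosen for illustration, not output of any method in the paper.

```python
from itertools import combinations


def weak_orderings(items):
    """Yield every ordered partition of `items` into blocks B_1 < ... < B_l."""
    items = list(items)
    if not items:
        yield []
        return
    for size in range(1, len(items) + 1):
        for first in combinations(items, size):
            rest = [x for x in items if x not in first]
            for tail in weak_orderings(rest):
                yield [list(first)] + tail


def hypothesis_set_ranks(blocks, n):
    """Set-ranks (lower, upper) implied by the hypothesis B_1 < ... < B_l."""
    ranks = [None] * n
    start = 1
    for block in blocks:
        for i in block:
            ranks[i] = (start, start + len(block) - 1)
        start += len(block)
    return ranks


def local_test_from_cis(cis):
    """Local test in the spirit of (4): reject H unless every set-rank lies in its CI."""
    def phi(blocks):
        ranks = hypothesis_set_ranks(blocks, len(cis))
        return any(lo < L or up > U for (lo, up), (L, U) in zip(ranks, cis))
    return phi


def partitioning_cis(n, phi):
    """Smallest intervals containing the set-ranks of all hypotheses not rejected by phi."""
    lower, upper = [n] * n, [1] * n
    for blocks in weak_orderings(range(n)):
        if not phi(blocks):
            for i, (lo, up) in enumerate(hypothesis_set_ranks(blocks, n)):
                lower[i], upper[i] = min(lower[i], lo), max(upper[i], up)
    return list(zip(lower, upper))


original = [(1, 2), (1, 3), (2, 3)]  # hypothetical CIs for three means
print(partitioning_cis(3, local_test_from_cis(original)))  # [(1, 2), (1, 3), (2, 3)]
```

Replacing `local_test_from_cis` by any other function that maps a list of blocks to a reject/accept decision gives a different partitioning procedure, which is the design freedom described above. The full enumeration is only practical for very small n.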

A first example: The likelihood ratio test
In the literature on ordered hypotheses such as our elementary hypotheses (2), there is not yet a general result about an optimal test. However, as stated by [8], "the method has a strong intuitive appeal and leads to a meaningful test", referring to the likelihood ratio test. Let $H$ be some elementary hypothesis. The likelihood ratio statistic (LR) related to testing $H$ against all alternatives is given by
$$\mathrm{LR}(H) = \min_{\mu \in H} \sum_{i=1}^{n} \frac{(y_i - \mu_i)^2}{\sigma_i^2}. \qquad (5)$$
Note that the term related to the alternative hypothesis is zero since the minimum is attained at $\mu_i = y_i$. In some cases, the minimum in (5) can be calculated directly. For example, when $H$ is the hypothesis under which the means are equal, the minimum in (5) is attained at the weighted average of the observations (the plain average when the standard errors are equal).
In the sequel, we will use the following notation. Let $H$ be the hypothesis $\mu_1 = \mu_2 < \mu_3 < \mu_4 = \mu_5$. We write $H : B_1 < B_2 < B_3$ with $B_1 = \{\mu_1, \mu_2\}$, $B_2 = \{\mu_3\}$ and $B_3 = \{\mu_4, \mu_5\}$. More generally, assume that $H$ can be written as a union of blocks of means $B_1, \dots, B_l$, where in each block the means are equal under $H$ and such that if $\mu_i$ is in block $B_t$ and $\mu_j$ is in block $B_s$ with $t < s$, then $\mu_i < \mu_j$ under $H$. Using our new notation, we write $H : B_1 < \cdots < B_l$. For each block, let
$$\hat\mu_{B_j} = \frac{\sum_{i \in B_j} y_i / \sigma_i^2}{\sum_{i \in B_j} 1 / \sigma_i^2}. \qquad (6)$$
It can be shown [8] that if $\hat\mu_{B_1} \leq \cdots \leq \hat\mu_{B_l}$, the minimum in (5) is attained on $H$ and the LR is given by
$$\mathrm{LR}(H) = \sum_{j=1}^{l} \sum_{i \in B_j} \frac{(y_i - \hat\mu_{B_j})^2}{\sigma_i^2}. \qquad (7)$$
Note that if the block $B_j$ contains only one mean, say $\mu_k$, the argument of the minimum $\hat\mu_{B_j}$ is equal to the single observation $y_k$. Hence, all subsets $B_j$ with a single mean do not appear in the calculation of the LR statistic.
In general, the basic R function isoreg or the function activeSet from the package isotone [13] can be used to calculate the minimum in (5) using the Pool Adjacent Violators Algorithm, known as the PAVA [5,45,8,6,13].
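For readers who want to see the pooling step itself, here is a minimal pure-Python sketch. It assumes the LR statistic is the minimum over H of the sum of (y_i − μ_i)²/σ_i², computes the inverse-variance weighted mean of each block, and pools adjacent blocks whose means violate the hypothesised order. It is not the implementation used in isoreg, isotone or ICRanks, and the function name and the block representation (lists of 0-based indices in the hypothesised order) are ours.

```python
def lr_statistic(blocks, y, sigma):
    """LR statistic for H: B_1 < ... < B_l, blocks given as index lists in order."""
    pools = []  # each pool: [weighted sum of y, total weight, member indices]
    for block in blocks:
        weight = sum(1.0 / sigma[i] ** 2 for i in block)
        wsum = sum(y[i] / sigma[i] ** 2 for i in block)
        pools.append([wsum, weight, list(block)])
        # PAVA step: merge backwards while the last two pool means are out of order
        while len(pools) > 1 and (pools[-2][0] / pools[-2][1]
                                  > pools[-1][0] / pools[-1][1]):
            wsum2, weight2, idx2 = pools.pop()
            pools[-1][0] += wsum2
            pools[-1][1] += weight2
            pools[-1][2] += idx2
    lr = 0.0
    for wsum, weight, idx in pools:
        fitted = wsum / weight  # common fitted mean of the pooled block
        lr += sum((y[i] - fitted) ** 2 / sigma[i] ** 2 for i in idx)
    return lr


y, sigma = [0.0, 2.0, 3.0], [1.0, 1.0, 1.0]
print(lr_statistic([[0], [1], [2]], y, sigma))  # mu_1 < mu_2 < mu_3 gives 0.0
print(lr_statistic([[0, 2], [1]], y, sigma))    # mu_1 = mu_3 < mu_2 gives 4.5
print(lr_statistic([[2], [1], [0]], y, sigma))  # mu_3 < mu_2 < mu_1: all blocks pooled
```

In the last call every block violates the empirical order, so the PAVA pools all three means and the statistic coincides with that of the hypothesis of equal means.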
In order to perform an LR test of H against all alternatives, two approaches from the literature are available: an adaptive one [46,24,2] and a non adaptive one [38]. The adaptive approach compares the LR with a quantile of a chi-square with data-dependent degrees of freedom, whereas the non adaptive one compares the LR with the quantile of a mixture of chi-squares. Paragraph B in the Appendix provides further details. We will see later on that the adaptive approach is always more suitable in this context.

Some shortcuts and practical issues
The complexity of the partitioning scheme is very high (Proposition 2), so that it is important to find a way to simplify calculations as much as possible and preferably reduce the complexity of the algorithm we use.
We start with the simplest and most immediate shortcut.

Lemma 1.
If the hypothesis μ 1 = · · · = μ n is not rejected, then there is no need to test any other hypothesis and the confidence intervals for the ranks are the trivial ones, that is [1, n].
Without loss of generality, assume that $y_1 < \dots < y_n$. The hypotheses in the partitioning scheme are of three types according to the relative ordering of the $y_i$'s with respect to the ordering of the $\mu_i$'s in the hypothesis. Let $H : B_1 < \cdots < B_l$ be some partition where we group the means which are equal under $H$ in the blocks $B_1, \dots, B_l$. Let $\hat\mu_{B_j}$ be given by (6). We say that
1. $H$ is a correctly ordered hypothesis whenever $\hat\mu_{B_1} < \cdots < \hat\mu_{B_l}$ and for any $i < j$ we have $\mu_i \leq \mu_j$;
2. $H$ is a partially correctly ordered hypothesis whenever $\hat\mu_{B_1} < \cdots < \hat\mu_{B_l}$ and there exist $i < j$ such that $\mu_i > \mu_j$. This means that the sample means of the blocks respect the empirical ordering (of the $y_i$'s) whereas the means do not;
3. $H$ is an incorrectly ordered hypothesis if there exist $i < j$ such that $\hat\mu_{B_i} > \hat\mu_{B_j}$, which means that neither the means nor the sample means of the blocks respect the empirical ordering of the $y_i$'s.
Note that in the first two cases, the LR is given by (7) which means that the solution of the PAVA does not pool any adjacent blocks. In the third case, the LR results from the PAVA by pooling all blocks of means B i , · · · , B j violating the empirical ordering into one block, and then using formula (7). The resulting pooled blocks are partially correctly ordered hypotheses. To illustrate the differences among the three types of hypotheses, assume for example y 1 = 0, y 2 = 2, y 3 = 3.
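The three types can be told apart mechanically. The sketch below assumes the observations are already sorted increasingly, represents a hypothesis as a list of blocks of 0-based indices in the hypothesised order, and uses inverse-variance weighted block means. The function name and the three hypotheses tested on y = (0, 2, 3) are our own illustrative choices.

```python
def classify(blocks, y, sigma):
    """Classify H: B_1 < ... < B_l, assuming y is sorted in increasing order."""
    means = []
    for block in blocks:
        weight = sum(1.0 / sigma[i] ** 2 for i in block)
        means.append(sum(y[i] / sigma[i] ** 2 for i in block) / weight)
    # Type 3: some block mean exceeds the mean of a later block
    if any(a > b for a, b in zip(means, means[1:])):
        return "incorrectly ordered"
    # Types 1 and 2: do the means under H respect the empirical order of the y_i?
    position = {i: t for t, block in enumerate(blocks) for i in block}
    respects = all(position[i] <= position[i + 1] for i in range(len(y) - 1))
    return "correctly ordered" if respects else "partially correctly ordered"


y, sigma = [0.0, 2.0, 3.0], [1.0, 1.0, 1.0]
print(classify([[0], [1, 2]], y, sigma))    # mu_1 < mu_2 = mu_3
print(classify([[0, 2], [1]], y, sigma))    # mu_1 = mu_3 < mu_2
print(classify([[2], [1], [0]], y, sigma))  # mu_3 < mu_2 < mu_1
```

The second hypothesis is the interesting case: its block means 1.5 and 2 are increasing, so no pooling occurs, yet μ 3 is placed below μ 2 against the empirical order.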
The following propositions state that only the correctly and partially correctly ordered hypotheses are required. When we have a common standard deviation, we show that only a subset of the partially correctly ordered hypotheses is required.

Proposition 5.
Assume that all the standard deviations are the same. If we use the LR to test the elementary hypotheses (2), then in order to obtain simultaneous CIs for ranks at joint level 1 − α, it suffices to test at level α the following list of hypotheses:
1. the correctly ordered hypotheses;
2. if the correctly ordered hypothesis H : μ 1 =:< · · · =:< μ n is not rejected, test all partially correctly ordered hypotheses of the form μ π(1) =:< · · · =:< μ π(n) for all permutations π from the list (1, 2, 3, · · · , n), (n, n − 1, · · · , 2, 1)
Furthermore, if we apply the list (8) column after column, then for each column it suffices to test until we encounter the first rejected partially correctly ordered hypothesis.
When the standard deviations are not equal, the proof of Proposition 5 is no longer valid. This is because we use (and prove) the fact that the LR does not decrease as the number of permuted means increases. For example, let H : B 1 < · · · < B k be some correctly ordered hypothesis. If we permute μ i with μ j , then the LR increases. When the standard deviations are not the same, this result no longer holds in general, especially when we permute two means, one with high standard deviation and one with small standard deviation.
Proposition 6. Assume that there exist i ≠ j such that σ i ≠ σ j . If we use the LR to test the elementary hypotheses (2), then it suffices to test the correctly ordered and partially correctly ordered hypotheses.
While this result shows that there is no need to test the incorrectly ordered hypotheses, we do not know how to characterize the set of partially correctly ordered hypotheses efficiently in general. In the Appendix, we provide an algorithm to test all the elementary hypotheses. In that algorithm, we first test the correctly ordered hypotheses; then we permute the indexes using some π ∈ S n and repeat the same procedure, taking into account that some hypotheses become incorrectly ordered because of the permutation and can be discarded. When n > 10, this becomes computationally infeasible. In that case, we may still use the list of permutations (8), then randomly select about 10^5 further permutations from S n and apply them all. Of course, this is an approximation: we can only hope that the resulting CIs have a joint confidence level of 1 − α, with no guarantee that they will. In the Appendix, we provide a few simulations for the case of different standard deviations with n = 10 and n = 15, which show that the approximate CIs are still conservative for a variety of vectors of means and vectors of standard deviations.
As mentioned above, it is possible to test the LR statistic using either an adaptive test [24,46,2] or a non adaptive test [38]. According to Propositions 5 and 6, we only test the correctly and partially correctly ordered hypotheses, for which the LR is given by (7). Therefore, the adaptive quantile is the quantile of a χ 2 (ℓ), where ℓ is the number of equalities in H, whereas the non adaptive quantile is the quantile of a mixture of the chi-squares χ 2 (ℓ), · · · , χ 2 (n − 1). This proves the following corollary.

Corollary 1.
If we compare the LR to the adaptive quantile, the resulting simultaneous CIs for ranks are never longer than the ones obtained if we compare the LR to the non adaptive quantile, that is the quantile of a mixture of chi-squares.
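The comparison behind the corollary can be written out in one line. Write ℓ for the number of equalities in H and suppose, as a notational assumption of ours, that the mixture puts weights $w_\ell, \dots, w_{n-1} \geq 0$ summing to one on the chi-squares $\chi^2(\ell), \dots, \chi^2(n-1)$. Then, for any critical value $c > 0$:

```latex
\begin{align*}
P\big(\bar\chi^2 > c\big)
  &= \sum_{k=\ell}^{n-1} w_k \, P\big(\chi^2(k) > c\big) \\
  &\geq \sum_{k=\ell}^{n-1} w_k \, P\big(\chi^2(\ell) > c\big)
   = P\big(\chi^2(\ell) > c\big),
\end{align*}
% because \chi^2(k) is stochastically larger than \chi^2(\ell) whenever k \geq \ell.
% Hence the (1-\alpha)-quantile of the mixture is at least that of \chi^2(\ell).
```

Every hypothesis rejected with the non adaptive quantile is therefore also rejected with the adaptive one, so the set of unrejected hypotheses, and with it the CIs, can only shrink.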

An improved variant
We give in this paragraph a way to improve the partitioning procedure when the local test is the LR test using our generic procedure from Section 4. We use the partitioning procedure in order to produce simultaneous CIs for the ranks of μ 1,T , · · · , μ n,T and then define function ϕ LR through (4).
Consider the hypothesis H : μ 1 = · · · = μ n . Using the LR test, it is tested at an exact level α by comparing LR(H) with the quantile q n−1 of χ 2 (n − 1). However, using the test ϕ LR , hypothesis H is not rejected not only when LR(H) ≤ q n−1 , but also whenever a set of partitions implying the trivial CIs [1, n] to all the means is not rejected either. For example, the CI for the rank of μ 1,T can be [1, n] if the elementary hypotheses μ 1 < μ 2 = · · · = μ n and μ 2 < μ 1 = · · · = μ n are both not rejected. This means that the test ϕ LR (H) does not exhaust the α-level and thus we can estimate the gap between the actual level of the test and α and rescale the test significance level in order to gain power. In Section 7, we show that when μ 1,T = · · · = μ n,T with n = 10 and α = 0.1, then 95.6% of the simulations result in trivial CIs [1, n] for all the means simultaneously. This means that we actually reject μ 1 = · · · = μ n using function ϕ LR at the rate of 0.044 instead of the 0.1 used in the simulations.
In order to rescale the local tests $\phi_{LR}$, we need to find a least favorable vector $\mu_0$ with respect to which we can rescale. In other words, $\mu_0$ has to verify, for any $\mu$ under $H$,
$$P_{\mu}\big(\phi_{LR}(H) = 1\big) \leq P_{\mu_0}\big(\phi_{LR}(H) = 1\big).$$
It appears that $\mu_0 = (0, \dots, 0)$, and the following lemma states this result when the non adaptive critical value is used. The case where the adaptive critical value is used remains open.

Lemma 2.
Let $H$ be some elementary hypothesis from (2). Assume that we compare the LR with the non adaptive critical value [38]. Then, for any $\mu \in H$,
$$P_{\mu}\big(\phi_{LR}(H) = 1\big) \leq P_{0}\big(\phi_{LR}(H) = 1\big).$$
Let $\phi_{LR(\alpha)}$ denote $\phi_{LR}$ when the original partitioning procedure is calculated at level α. Rescaling the partitioning procedure defined using $\phi_{LR(\alpha)}$ as a local test is done by looking for a zero of the function $\tilde\alpha \mapsto P_0\big(\phi_{LR(\tilde\alpha)}(H) = 1\big) - \alpha$ for all the elementary hypotheses $H$ from (2). For any $\tilde\alpha$, the probability $P_0\big(\phi_{LR(\tilde\alpha)}(H) = 1\big)$ can be estimated by simulations.
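The root search can be sketched with a simple bisection. The code below assumes that the rejection probability under μ = 0 is nondecreasing in the level at which the original procedure is run, and that it is available as a function `rejection_rate`; in practice that function would be a Monte Carlo estimate, which makes the root only approximate. The linear stand-in `toy_rate` is hypothetical, calibrated only to the figure quoted earlier (a rejection rate of 0.044 when the procedure is run at 0.1), and the function names are ours.

```python
def rescale_level(rejection_rate, alpha, tol=1e-6):
    """Find a level in [alpha, 1] at which rejection_rate equals alpha, by bisection.

    Assumes rejection_rate is nondecreasing in its argument.
    """
    lo, hi = alpha, 1.0
    if rejection_rate(hi) <= alpha:
        return hi  # even the largest level does not exhaust alpha
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if rejection_rate(mid) > alpha:
            hi = mid
        else:
            lo = mid
    return lo


def toy_rate(level):
    """Hypothetical stand-in for the simulated rejection rate under mu = 0."""
    return 0.44 * level


print(round(rescale_level(toy_rate, 0.10), 3))  # 0.227
```

The returned level is never below α, which is why the rescaled procedure cannot be less powerful than the original one.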
This improved variant is uniformly more powerful than the original partitioning procedure since the rescaled level α̃ is in [α, 1]. However, in practice, the variant is computationally feasible only for a small number of means. As implemented in our R package ICRanks, when the standard deviations are the same, the improvement is computationally feasible up to n = 10. When the standard deviations are not the same, we have to calculate the non adaptive quantile for each one of the elementary hypotheses (2) by simulations or using iterative methods [20], which makes the improvement computationally feasible only for n ≤ 5.

A second example: Tukey's procedure for ranks
Tukey's pairwise comparison procedure [44,22] well-known as the Honest Significant Difference test (HSD) is an easy way to compare a set of n means based on a Gaussian sample especially in ANOVA models. The interesting point about the procedure is that it provides simultaneous confidence intervals for the differences and controls the FWER at level α. Tukey's HSD is employed by [1] to produce simultaneous CIs for the ranks. The objective here is to review this method and get more insights about it in terms of the partitioning principle.

The method
Suppose that $y_1, \dots, y_n$ are generated independently from the Gaussian distributions $N(\mu_{i,T}, \sigma_i^2)$. In order to produce simultaneous confidence intervals for the ranks of the means $\mu_{1,T}, \dots, \mu_{n,T}$, we test all hypotheses of the form $H_{i,j} : \mu_i = \mu_j$ using the following rejection region
$$\left\{ \frac{|y_i - y_j|}{\sqrt{\sigma_i^2 + \sigma_j^2}} > q_{1-\alpha} \right\},$$
where $q_{1-\alpha}$ is the quantile of order 1 − α of the distribution of
$$\max_{i,j} \frac{|Y_i - Y_j|}{\sqrt{\sigma_i^2 + \sigma_j^2}} \qquad (9)$$
and $Y_i$ and $Y_j$ are two independent Gaussian random variables with mean 0 and standard deviations $\sigma_i$ and $\sigma_j$ respectively. The confidence interval for the rank of mean $\mu_{i,T}$, say $[L_i, U_i]$, is calculated by counting how many hypotheses $H_{i,j}$ are rejected and such that $y_j < y_i$ (which yields $L_i - 1$). Then we count how many hypotheses $H_{i,j}$ are rejected and such that $y_j > y_i$ (which yields $n - U_i$).
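The counting rule translates directly into code. The sketch below assumes the pairwise statistic is |y_i − y_j| divided by the square root of σ_i² + σ_j², counts for each mean the others that are significantly smaller and significantly larger, and estimates the critical value by plain Monte Carlo. It is not the ICRanks implementation; the function names are ours, and the value q = 2.34 in the example is an illustrative critical value, roughly what the simulation returns for three means with unit standard deviations at α = 0.05.

```python
import random
from math import sqrt


def simulate_q(sigma, alpha, n_sim=20000, seed=1):
    """Monte Carlo quantile of order 1 - alpha of the maximal pairwise statistic."""
    rng = random.Random(seed)
    n = len(sigma)
    stats = []
    for _ in range(n_sim):
        z = [rng.gauss(0.0, s) for s in sigma]
        stats.append(max(abs(z[i] - z[j]) / sqrt(sigma[i] ** 2 + sigma[j] ** 2)
                         for i in range(n) for j in range(i + 1, n)))
    stats.sort()
    return stats[min(n_sim - 1, int((1 - alpha) * n_sim))]


def tukey_rank_cis(y, sigma, q):
    """Rank CIs (L_i, U_i) obtained by counting significantly different means."""
    n = len(y)
    cis = []
    for i in range(n):
        smaller = larger = 0
        for j in range(n):
            if j == i:
                continue
            stat = abs(y[i] - y[j]) / sqrt(sigma[i] ** 2 + sigma[j] ** 2)
            if stat > q:  # H_{i,j}: mu_i = mu_j is rejected
                if y[j] < y[i]:
                    smaller += 1
                else:
                    larger += 1
        cis.append((1 + smaller, n - larger))
    return cis


print(tukey_rank_cis([0.0, 5.0, 5.5], [1.0, 1.0, 1.0], q=2.34))
# [(1, 1), (2, 3), (2, 3)]
```

In the example the first mean is significantly smaller than the other two, which cannot be separated from each other, so they share the interval [2, 3].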

A new look at Tukey's pairwise comparison using the partitioning principle
We define a statistical (local) test over the elementary hypotheses (2) which yields the same confidence intervals for the ranks as the method based on Tukey's HSD. Assume that σ i = σ for all i. Let H : where q 1−α is the quantile of order 1 − α of the Studentized range (9) as in Tukey's HSD procedure. Note that we use the same critical value for all the elementary hypotheses.
Proposition 7. If we use the Tukey-based method for ranks to construct a new partitioning procedure using the local test ϕ, then {ϕ = 1} is equivalent to the rejection region (10).
When the standard deviations are not the same, we can show that the partitioning procedure produces slightly shorter CIs for the ranks than the Tukey-based method. Although this might suggest that the partitioning procedure gives an improved method, we do not have a proof that the local test (10) is an α-level test, and hence the resulting CIs are not guaranteed to have a joint level 1 − α.

An improved variant based on the partitioning principle
Similarly to the partitioning procedure that uses the LR as a local test, we can define an equivalent partitioning procedure to the Tukey-based method of [1] using the test ϕ (4). We show using Proposition 3.2 from [1], that μ = 0 is the least favorable case. We consider a new partitioning procedure in which we test the elementary hypotheses (2) using a local test ϕ = ϕ TKY of the form of (4) that uses the simultaneous CIs for the ranks obtained through the Tukey-based method of [1].

Lemma 3. Let $H$ be some elementary hypothesis from (2). For any $\mu \in H$,
$$P_{\mu}\big(\phi_{TKY}(H) = 1\big) \leq P_{0}\big(\phi_{TKY}(H) = 1\big).$$
Rescaling the partitioning procedure defined using function ϕ TKY as a local test is done by looking for a zero of the function α̃ → P 0 (ϕ TKY(α̃) (H) = 1) − α for all the elementary hypotheses H from (2). The probability P 0 (ϕ TKY(α̃) (H) = 1) can be estimated through simulations.
Similarly to the case of the partitioning procedure that uses the LR test, in practice, this improvement is computationally feasible on ordinary computers for n ≤ 10. In contrast to the LR case, when the standard deviations are not the same, the procedure does not imply any further complications and is computationally feasible up to n = 10.

Simulation study: A comparison of simultaneous coverage and efficiency
The goal of this section is to compare the performance of the novel LR-based method with the Tukey-based method of [1], which is the only method available in the literature which provides valid simultaneous CIs for ranks. Note that [1] show that the method of [47] does not control the joint confidence level of the CIs; therefore, it is unfair to include it in the comparison. We also illustrate the performance of the method of [25] that uses the Sidak correction.
The simulation setup is the following. We estimate the simultaneous coverage of both methods, the LR-based method and the Tukey-based method of [1], for vectors of n means with n ∈ {5, 10, 20}. We generate 1000 vectors of means μ 1,T , · · · , μ n,T , with components drawn independently from the Gaussian distribution N (0, τ 2 ) for τ ∈ {0, 1, 3, 5}. For each vector of means, we generate independently a Gaussian sample y 1 , · · · , y n such that y i ∼ N (μ i,T , 1). For each τ , the coverage is estimated as the proportion of vectors of means which are covered simultaneously by the CIs calculated based on the corresponding samples y 1 , · · · , y n . The results are presented in Table 1. We also calculate the average length of the confidence intervals where R n (α) is the rankability measure defined in [1]. The quantity 1 − R n (α) is a measure of efficiency of a method producing CIs for ranks. A better method has shorter CIs and therefore a smaller 1 − R n (α). We provide in Appendix D simulations when the standard deviations are not the same, cases when more ties are present among the true means, and cases when the normality assumption is violated.
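The coverage estimate described above can be sketched as a small harness that accepts any CI procedure. The code assumes unit standard errors as in the setup, uses the set-rank definition with ties, and is demonstrated on two deliberately extreme procedures of our own (trivial intervals [1, n] and intervals reduced to the empirical rank) rather than on the LR-based or Tukey-based methods. All names are ours.

```python
import random


def set_ranks(mu):
    """Set-ranks (lower, upper) of each mean, allowing ties."""
    n = len(mu)
    return [(1 + sum(m < mi for m in mu), n - sum(m > mi for m in mu)) for mi in mu]


def trivial_cis(y):
    """Always report [1, n]; covers every set-rank by construction."""
    return [(1, len(y))] * len(y)


def empirical_rank_cis(y):
    """Report only the empirical rank; ignores the uncertainty in y."""
    order = sorted(range(len(y)), key=lambda i: y[i])
    ranks = [0] * len(y)
    for position, i in enumerate(order, start=1):
        ranks[i] = position
    return [(r, r) for r in ranks]


def simultaneous_coverage(ci_method, n, tau, n_rep=1000, seed=1):
    """Proportion of replications in which all true set-ranks are covered."""
    rng = random.Random(seed)
    covered = 0
    for _ in range(n_rep):
        mu = [rng.gauss(0.0, tau) if tau > 0 else 0.0 for _ in range(n)]
        y = [rng.gauss(m, 1.0) for m in mu]
        cis = ci_method(y)
        if all(L <= lo and up <= U
               for (lo, up), (L, U) in zip(set_ranks(mu), cis)):
            covered += 1
    return covered / n_rep


for tau in (0, 1, 3, 5):
    print(tau, simultaneous_coverage(trivial_cis, 5, tau),
          simultaneous_coverage(empirical_rank_cis, 5, tau))
```

Any function mapping a sample y to a list of (L_i, U_i) pairs can be passed as `ci_method`, so the same harness serves to compare competing procedures on identical simulated data.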
The results of Table 1 show that on average the Tukey-based method produces shorter confidence intervals than the LR-based one, especially as the number of means increases to 20. When the number of means is smaller than 10, our LR-based method produces shorter CIs. Both variants produce shorter CIs than their corresponding methods.
It is not surprising that when τ = 0 (all the means are tied and their true set-ranks are all [1, n]), the Tukey-based method delivers CIs for ranks with joint level equal to 1 − α because this method is exact when μ 1,T = · · · = μ n,T [1, Proposition 3.2]. On the other hand, our novel method based on the LR test does not seem to share this property empirically except for n = 5. The method of [25] performs worst. We recall Proposition 3.2 from [1], which states that when the standard deviations are the same, the Tukey-based method produces shorter simultaneous confidence intervals for the ranks than the method of [25].
The rescaled version of the Tukey-based method does not improve as much as the rescaled version of the partitioning procedure that uses the likelihood ratio test. When the standard deviations are equal, we show in Lemma 3 in the Appendix that in order to perform the partitioning procedure that uses the local test (10), it suffices to test the correctly ordered hypotheses. We can see the implication of such result on the example of testing the hypothesis μ 1 = · · · = μ n . Indeed, we obtain trivial CIs for the ranks only when that hypothesis is not rejected, because if μ 1 = · · · = μ n is rejected, then there is no correctly ordered hypothesis that has μ 1 in the nth position except for μ 1 = · · · = μ n . This means that it is not possible to improve the local test for this hypothesis. In the case of the partitioning procedure that uses the LR test, it is possible to improve the level at which we test the hypothesis μ 1 = · · · = μ n . When the standard deviations are not the same, Lemma 3 from the Appendix no longer holds, and the improved procedure may be more efficient.

Data analysis
Hotel ratings are one of the tools that booking websites use to show the quality of hotels and to help new customers choose a suitable one. Booking.com is one of the world's leading websites for booking hotels. A hotel is rated by some of its customers on different criteria such as cleanliness, breakfast, etc. Each customer also gives the hotel an overall rating between 1 and 5 stars. We used the data publicly available on the website www.booking.com for a room reservation in the city of Leiden (The Netherlands) for one night on the 2nd of May 2019. The query was made on the 15th of April 2019. We restricted our search to hotels with free Wifi, free cancellation and within 1 km of the city center. We obtained a list of 9 hotels (see raw data in the Appendix). For each hotel, we have the number of customers who rated the hotel with 1, 2, 3, 4 or 5 stars. We compute the average rating for each hotel and its standard error in the following way. Let $X$ be a random variable taking values in the set $\{1, 2, 3, 4, 5\}$ which represents the rating of a customer. We calculate the sample mean and its standard error from these counts, where $n_i$ is the number of customer reviews for the $i$th hotel and $n_{i,j}$ is the number of customer reviews of $j$ stars for the $i$th hotel. The result is in Table 2.
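As a rough illustration of this computation, the following Python sketch derives an average rating and a standard error from star counts. The function name is hypothetical, and the use of the unbiased sample variance (denominator $n_i - 1$) is an assumption, since the exact estimator is not recoverable from the text here.

```python
import math

def rating_mean_and_se(counts):
    """counts[j-1] is the number of reviews with j stars, for j = 1..5.

    Returns (mean rating, standard error of the mean).
    Assumption: the variance is estimated with denominator n - 1.
    """
    n = sum(counts)
    mean = sum(j * c for j, c in enumerate(counts, start=1)) / n
    var = sum(c * (j - mean) ** 2 for j, c in enumerate(counts, start=1)) / (n - 1)
    return mean, math.sqrt(var / n)
```

For instance, a hotel with one 1-star review and one 5-star review has mean 3 and, under this convention, a standard error of 2.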
We apply both the Tukey-based method of [1] and our new LR-based method to these data and calculate simultaneous CIs for the ranks of these hotels at joint level 90%. We also apply the rescaled versions of these methods presented in paragraphs 5.2 and 6.3. Since the standard errors of the means are not the same, Proposition 5 does not hold, so that we have to test all elementary hypotheses (2). Furthermore, the rescaled version of the partitioning procedure that uses the LR as a local test is not computationally feasible; we therefore use the maximum standard error over all the hotel ratings as the common standard error. The resulting simultaneous CIs are upper bounds of the CIs that the procedure would produce if it were applied. We also illustrate the result of the method of [25] that uses the Sidak correction. The method of [25], the partitioning procedure that uses the LR, the Tukey-based procedure and its rescaled version all gave the same result. The rescaled version of the partitioning procedure that uses the LR delivered the best result. Furthermore, all the methods single out the best and second best hotels. The rescaled version of the partitioning procedure that uses the LR also singles out the worst two hotels.

Discussion
We presented in this paper a generic method for simultaneous CIs for ranks where we partitioned the parameter space $\mathbb{R}^n$ into sets defined through possible orderings of a set of means $\mu_1, \dots, \mu_n$. The partitioning principle allowed us to control the FWER at level $\alpha$ by testing each set at level $\alpha$, which in turn was used to construct simultaneous CIs for the ranks at level $1 - \alpha$. We showed that any procedure producing simultaneous CIs for ranks could be written as a partitioning procedure with a suitable local test for the partitions.
We presented an example of our procedure using the likelihood ratio test and also showed that a recently developed method based on Tukey's HSD could be written as a partitioning procedure. We proposed rescaled versions of these two methods by embedding them inside a new partitioning procedure. Although the rescaled versions uniformly improve these methods, they are computationally feasible only up to 10 means. Recall that the procedure that uses the LRT is feasible up to $n = 40$ when the standard deviations are equal and only up to $n = 10$ when they are not equal. The Tukey-based approach has polynomial complexity and is feasible for large $n$.
In [1], the authors propose a rescaling method based on empirical evidence in order to reduce the conservativeness of the Tukey-based method. The idea is to rescale the Tukey-based method with respect to a worst-case which is different from our rescaling idea in this paper where the rescaling is done for each partition separately. We believe that a similar method can be developed for our LR-based method which could lead to a procedure that is more computationally feasible.
We assumed the standard errors to be known, which is a standard assumption in most papers considering confidence intervals for ranks; see [34,40,25] among others. This assumption becomes questionable when the standard errors are estimated from only a few measurements (patients, ratings, etc.). In Appendix D.4, a simulation example shows that using estimated standard errors still results in conservative CIs for the ranks, with results close to those obtained with the true standard errors, except for the case when there are only three measurements for each sample mean. More extensive simulations are needed, and developing a rigorous approach under the assumption of unknown standard errors remains an open question.
For a different objective, it is possible to look for the rank of only one prespecified institution that we are interested in. [16] use the partitioning principle to make multiple comparisons to the best or to a control. Combining their work with ours could be the objective of a future work.
We provide in this appendix proofs of the main results in the paper and a detailed algorithm of how to perform the partitioning scheme when we use the likelihood ratio (LR) test. It also includes further simulations and the raw data that we collected for the data analysis section of the paper.

A.1. Proof of Proposition 1
Proof. Since the partitioning principle ensures that the FWER is below α, we may write P (Number of type I errors ≥ 1) ≤ α which is equivalent to 1 − P (Number of type I errors = 0) ≤ α.
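Denote by $\cup_{i \in I} P_i$ the union of the elementary hypotheses rejected at level $\alpha$ and by $\mu_T$ the true vector of means. A type I error occurs only if the partition containing $\mu_T$ is rejected, so we can write
$$P\left(\mu_T \notin \cup_{i \in I} P_i\right) \geq 1 - \alpha.$$
Since the $P_i$'s partition the parameter space $\mathbb{R}^n$, then
$$P\left(\mu_T \in \cup_{i \notin I} P_i\right) \geq 1 - \alpha.$$
Finally, recall that each partition represents a single set of set-ranks of the means. Thus, the union of unrejected partitions implies a set of simultaneous confidence intervals for the ranks of the means, and this set has a confidence level of at least $1 - \alpha$.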

A.2. Proof of Proposition 2
Following the example in Figure 1, we arrange the set of elementary hypotheses by levels according to the number of ties between the means. The 1st level corresponds to the hypothesis where all means are tied. The second level corresponds to hypotheses with $n - 2$ equalities, and so on. The $n$th level corresponds to hypotheses without any ties. We calculate the number of hypotheses in each level, that is the hypotheses having the same number of inequalities between the means, and then sum them up. At level $n - i$, for $i \in \{0, \dots, n - 1\}$, we have $i$ equalities and $n - i - 1$ inequalities. Any partition $H$ from level $n - i$ can be written as a set of $n - i$ blocks $H : B_1 < \dots < B_{n-i}$ where each block includes means which are related to each other by an equality. Given a set of blocks, the number of different orderings of these blocks is equal to $(n - i)!$. It remains then to calculate the number of possible partitions for a given ordering of the means. This number is the same for all possible orderings. Assume then that $\mu_1 \leq \mu_2 \leq \dots \leq \mu_n$. The indexes are the set $\{1, \dots, n\}$ and the blocks are mere ordered subsets (or partitions) of indexes which are disjoint and whose union is equal to the whole set $\{1, \dots, n\}$. This is an ordered partition of the set $\{1, \dots, n\}$ [
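The counting argument can be checked numerically with a short Python sketch. It assumes that the total number of elementary hypotheses equals the number of ordered set partitions of $\{1, \dots, n\}$, that is the sum over the number of blocks $k$ of $k!$ times the Stirling number of the second kind $S(n, k)$; the function names are illustrative only.

```python
from math import factorial

def stirling2(n, k):
    """Stirling number of the second kind S(n, k): partitions of n items into k blocks."""
    table = [[0] * (k + 1) for _ in range(n + 1)]
    table[0][0] = 1
    for i in range(1, n + 1):
        for j in range(1, min(i, k) + 1):
            # item i either joins one of j existing blocks or opens a new block
            table[i][j] = j * table[i - 1][j] + table[i - 1][j - 1]
    return table[n][k]

def count_elementary_hypotheses(n):
    """Number of ordered set partitions of n means (assumed to be the total count)."""
    return sum(factorial(k) * stirling2(n, k) for k in range(1, n + 1))
```

With three means this gives 13 hypotheses: 1 with all means tied, 6 with one equality and 6 without ties.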

A.3. Proof of Proposition 3
Due to equation (3), it is straightforward that ϕ is a valid test for H at level α. Indeed,

A.4. Proof of Proposition 4
We first show that $[L_i, U_i] \subset [\tilde{L}_i, \tilde{U}_i]$. This is straightforward because, by construction of the test $\phi$, a partition (a set of set-ranks) is not rejected only if it induces set-ranks in the CIs $[\tilde{L}_i, \tilde{U}_i]$. Moreover, the confidence intervals for the ranks based on the partitioning scheme are built from the unrejected partitions only. Thus, the inclusion holds.

A.5. Proof of Proposition 5
The proof requires the following Lemma.
Lemma 4. Let $B_1, \dots, B_l$ be subsets that partition the set $\{\mu_1, \dots, \mu_n\}$ so that $B_i \cap B_j = \emptyset$ and $\cup_i B_i = \{\mu_1, \dots, \mu_n\}$. Assume that we obtain $\tilde{B}_1, \dots, \tilde{B}_l$ by swapping $\mu_{j_1}$ with $\mu_{j_2}$ such that $y_{j_1} < y_{j_2}$ (so that all subsets remain the same except for two). Let $\hat{\mu}_j$ denote the sample mean over block $B_j$, whereas $\tilde{\mu}_j$ denotes the sample mean over block $\tilde{B}_j$. Then $\sum_{j=1}^{l} \sum_{\mu_i \in B_j} \dots$ In particular, if $H : B_1 < \dots < B_l$ and $\tilde{H} : \tilde{B}_1 < \dots < \tilde{B}_l$ are two (partially) correctly ordered hypotheses, then

$LR(\tilde{H}) \geq LR(H)$.
Proof. Let $\tilde{B}_{i_1}$ and $\tilde{B}_{i_2}$ be the two subsets that have changed due to swapping $\mu_{j_1}$ with $\mu_{j_2}$. Let also $B_{i_1}$ and $B_{i_2}$ be the corresponding subsets before swapping. When both $H$ and $\tilde{H}$ are partially correctly ordered hypotheses, then Thus, we can easily write the LR of both hypotheses $H$ and $\tilde{H}$ as We study the contribution of the blocks that have changed. Let

D. Al Mohamad et al.
Note that the two likelihood ratios $LR(H)$ and $LR(\tilde{H})$ have the same first term. Therefore, it suffices to prove that Finally, since all these terms are non-negative, in order to prove the lemma it suffices to compare $\hat{\mu}_{i_2} - \hat{\mu}_{i_1}$ with $\tilde{\mu}_{i_2} - \tilde{\mu}_{i_1}$. It is straightforward to see that

$LR(H) \leq LR(\tilde{H})$.
Without loss of generality, we assume that σ i = 1 for all i = 1, · · · , n. The proof consists of two main parts. We prove in the first part that it suffices to test only hypotheses corresponding to cases 1 and 2. In other words, there is no need to test incorrectly ordered hypotheses (case 3). We show in the second part that not all the hypotheses corresponding to case 2 need to be tested and give only the relevant list.
We prove the first part. Consider a hypothesis $H_l$ from the $l$th level, that is, one containing $l - 1$ inequalities. Write this hypothesis as a union of blocks where each block contains all means which are equal under $H_l$, that is $H_l = A_1 < \dots < A_l$. Suppose that this hypothesis is incorrectly ordered. According to Proposition 1, we are interested in $H_l$ only if it is not rejected. Suppose then that the hypothesis $H_l$ is not rejected. When we calculate the maximum likelihood under this hypothesis by the pool adjacent violators algorithm (PAVA), adjacent blocks which violate the ordering $\hat{\mu}_{A_1} < \dots < \hat{\mu}_{A_l}$ will be pooled together. By merging the pooled blocks of hypothesis $H_l$, we can construct a partially correctly ordered hypothesis $\tilde{H}_s = \{\tilde{A}_1, \dots, \tilde{A}_s\}$ with $s < l$ such that $\hat{\mu}_{\tilde{A}_1} < \dots < \hat{\mu}_{\tilde{A}_s}$. Note that $LR(\tilde{H}_s) = LR(H_l)$ due to the PAVA. Moreover, the adaptive critical value is also the same since it depends on the PAVA solution. Thus, the non-rejection of $H_l$ implies the non-rejection of the hypothesis $\tilde{H}_s$. The set-ranks induced by $H_l$ are subsets of the set-ranks induced by $\tilde{H}_s$, since in the latter the pooled blocks become one so that their means are equal under $\tilde{H}_s$ whereas they were not under $H_l$. Thus, testing the partially correctly ordered hypothesis $\tilde{H}_s$ is sufficient.
We prove the second part. The partially correctly ordered hypotheses result from the correctly ordered hypotheses by switching at least one pair of means in a way that the switching does not modify the ordering of the observed means inside the blocks. Moreover, the switching only influences the position of the means and not the size of the blocks defining the hypothesis. We need to show two things.
1. If a partially correctly ordered hypothesis is not rejected then the corresponding correctly ordered hypothesis is not rejected either. This allows to conclude that we need to look at switches only if we find a correctly ordered hypothesis which is not rejected. As long as we are rejecting the correctly ordered hypothesis, we do not need to care about partially correctly ordered ones because they are automatically rejected. 2. If a correctly ordered hypothesis is not rejected, then we need to consider permutations of indexes only from the list (8).
We prove the first claim. Let H be any hypothesis (correctly ordered or partially correctly ordered) that consists of l blocks such thatμ B1 < · · · <μ B l . Assume that we switch between two means μ j1 from block B i1 with mean μ j2 from block B i2 such that j 1 < j 2 . Assume also that this permutation does not result in changing the hypothesis from being (partially) correctly ordered into incorrectly ordered hypothesis. Due to Lemma 4, we have

$LR(H) \leq LR(\tilde{H})$.
Now, if the hypothesis $\tilde{H}$ is not rejected, then neither is $H$, since they are tested against the same adaptive quantile, that is a quantile of $\chi^2(l)$. Conversely, if the hypothesis $H$ is rejected, then so is $\tilde{H}$.
Last but not least, assume that a partially correctly ordered hypothesis $\tilde{H}$ results from a correctly ordered hypothesis $H$ by permuting $s$ means following some permutation $p$. It is possible to write $p$ as the composition of a finite set of transpositions, that is, there exist $m \leq s$ transpositions $\tau_i$ such that $p = \tau_m \cdots \tau_2 \tau_1$. Applying the permutation $p$ to the set of indexes of the means is equivalent to applying the transpositions successively. In other words, the hypothesis $\tilde{H}$ is the result of $m$ single switches applied successively to the indexes of the means considered in $H$. Denote by $\tau(H)$ the hypothesis which results from $H$ by applying the transposition $\tau$ to the indexes of the means. Then $\tilde{H} = \tau_m(\cdots \tau_2(\tau_1(H)))$. In order to apply Lemma 4, the transpositions must change the positions of two means $\mu_{j_1} < \mu_{j_2}$ (under $\tilde{H}$) only if $y_{j_1} > y_{j_2}$. In order to do so, we start by picking the mean which corresponds to $y_1$ (the smallest observation), that is $\mu_1$. If it is already in position 1 in $\tilde{H}$, we do nothing; otherwise, we switch it with the mean in position 1 in $\tilde{H}$. We thus set $\tau_1 = (1, i_{y_1})$. More generally, let $i_{y_j}$ be the position of $\mu_j$ in $\tilde{H}$. Then, we have Some of these transpositions may be the identity function, so that only $m \leq s$ transpositions remain. Thus, by induction and using Lemma 4, we have This reads as follows. Any supplementary switch between two means in a (partially) correctly ordered hypothesis results in increasing the LR.
We now prove our second claim. Since we need to consider a partially correctly ordered hypothesis only when the corresponding correctly ordered hypothesis is not rejected, let $H$ be a correctly ordered hypothesis which is not rejected. Let $\tilde{H}$ be some partially correctly ordered hypothesis which results from $H$ by permuting the indexes of the means using a permutation $p$ such that $\tilde{H}$ is not rejected. We show that if $\tilde{H}$ induces a wider CI for the rank of $\mu_{i,T}$ than $H$, then there exist permutations $p_1, \dots, p_k$ from the list (8) such that the partially correctly ordered hypotheses resulting from applying these permutations to the indexes of the means in $H$, denoted as before $p_1(H), \dots, p_k(H)$, are not rejected. Furthermore, the non-rejection of those hypotheses results in the same CI for the rank of $\mu_{i,T}$ as $\tilde{H}$. This suffices to conclude that only permutations from the list (8) are needed.
Any permutation has a disjoint decomposition of cycles. Two cycles in this decomposition have disjoint orbits. Two disjoint cycles modify the set-ranks of two disjoint groups of means. Therefore, it is possible to treat each cycle separately. For this reason and without loss of generality, we assume that p = (i 1 , · · · , i k ) is a permutation with one cycle. Note that if the orbit is smaller than n, that is k < n, then the permutation p leaves some of the means in their own position. Otherwise, all the means move from their original positions in H to new ones inH.
Let s ∈ {1, · · · , n}. Suppose that the original position of mean μ is in H is i s−1 , then its new position inH is i s with the convention i 0 = i k . If i s > i s−1 , then μ is−1 moves forward inH (with respect to H). Otherwise, it moves backward inH. The proof slightly differs according to whether μ is−1 moves forward or backward.
We assume first that μ is−1 moves forward inH. It is possible to reorder all the means which have new positions different from i s by composing p successively with suitable transpositions. The reordering will be done based on the corresponding observed values. We will prove that this reordering results in a decrease of the LR or at least does not increase it. Indeed, we choose the mean with the maximum observed value among the means with new positions different from i s . If its new position is different from n, say i max1 , then there is some mean whose new position is n and whose observed value is inferior to the maximum. We switch these two by composing p with the transposition (i max 1 , n). This single reordering puts a mean with a small observed value back before another mean with a larger observed value. Therefore, this single reordering does not make the LR increase similarly to (13). Now, we consider again the set of means whose new positions in (i max 1 , n)p(H) = (i max 1 , n)H are different from i s except for the one who is at position n, that is the set {1, · · · , n − 1} \ {i s }. We choose the mean with maximum observed value. If its new position, say i max2 , is inferior to n − 1, then we switch it with the one whose new position is n − 1 by composing (i max1 , n)H with the transposition (i max2 , n − 1). Similarly to the previous switch, this one also makes the LR decrease (or at least does not increase). We iterate this procedure t times until we reorder all the means whose new positions are different from i s . The result of this reordering is denotedH t and is given byH This can also be written as so that using Lemma 4, we have Moreover, we can writeH t explicitly as H t : μ 1 =:< · · · =:< μ is−1−1 =:< μ is−1+1 =:< · · · =:< μ is =:< μ is−1 =:< μ is+1 =:< · · · =:< μ n In other words,H

Thus using Lemma 4, we have LR(H t ) ≤ LR(H) which together with (14) implies LR(H) ≤ LR(H t ) ≤ LR(H).
We conclude that ifH is not rejected, then any mean whose position in H moves forward inH does not get a wider CI for its rank than the CI that it gets from testing the partially correctly ordered hypotheses resulting from applying the list (8) on H.
Last but not least, if μ is moves backward inH with respect to H to position i s−1 , then similar steps to the previous case allows to reorder the means whose new positions are different from i s−1 . Denote the resulting hypothesisH t , we haveH We can writeH t explicitly as H t : μ 1 =:< · · · =:< μ is−1−1 =:< μ is =:< μ is−1 =:< · · · =:< μ is−1 =:< μ is+1 =:< · · · =:< μ n In other words,H Using Lemma 4, we get LR(H t ) ≤ LR(H) which together with (15) implies We conclude that ifH is not rejected, then any mean whose position in H moves backward inH does not get a wider CI for its rank than the CI that it gets from testing the partially correctly ordered hypotheses resulting from applying the list (8) on H.
To end the proof, since any transposition (i, j) is the composition of transpositions (i, i + 1), · · · , (j − 1, j), we conclude that for any partially correctly ordered hypothesis that we do not reject, we may construct partially correctly ordered hypotheses using the list (8) which are not rejected either and which produce the same CIs for the ranks of μ 1,T , · · · , μ n,T .
Finally, if we test the list (8) column after column, then for each column it suffices to test until one of the permutations gets rejected then the remaining permutations with a larger orbit (the set of indexes to permute) will automatically be rejected. Indeed, by Lemma 4, as the orbit of the permutation contains more means, the LR increases.

A.6. Proof of Proposition 6
See the first part of the proof of Proposition 5.

A.7. Proof of Lemma 2
Proof. We characterize the event $\{\phi_{LR}(H) = 1\}$ when $\mu \in H$. We abbreviate by PP the partitioning procedure that uses $\phi_{LR}$ as a local test, and by PLR the partitioning procedure which uses the LR as a local test. Let $[L_i, U_i]$ for $i = 1, \dots, n$ be the set of simultaneous CIs produced by PLR. Note that according to Proposition 4, the simultaneous CIs produced by PP are the same as the ones produced by PLR, which are $[L_i, U_i]$ for $i = 1, \dots, n$. For $\mu = (\mu_1, \dots, \mu_n)$, let $r_i(H)$ be the set-rank of $\mu_i$ when $\mu \in H$. According to the definition of $\phi_{LR}$, we reject $H$ in PP ($\phi_{LR} = 1$) if for any $\mu \in H$, $r_i(H) \not\subset [L_i, U_i]$ for some $i \in \{1, \dots, n\}$. In other words,
Further results for the Tukey-based method

Lemma 5. For the partitioning procedure defined for the Tukey-based method using the local test $\phi_{TKY}$, it suffices to test only the correctly ordered hypotheses, that is the hypotheses whose ordering does not violate the empirical one.
Let $H$ be an elementary hypothesis. Without loss of generality, suppose that it has only three blocks $H : B_1 < B_2 < B_3$. Suppose that the empirical ordering is such that $\max_{\mu_i \in B_1} y_i > \min_{\mu_i \in B_2} y_i$; then our testing procedure will pool $B_1$ and $B_2$ into $\tilde{B}_1$. In the same spirit as the proof of Proposition 5 and according to Proposition 1, if $H$ is rejected, this changes nothing in terms of the confidence intervals and we only need to look at the unrejected hypotheses.
Suppose now that $H$ is not rejected, then where $y_{i_1}$ and $y_{i_3}$ correspond to the smallest observed values related to the means in blocks $\tilde{B}_1$ and $B_3$ respectively. The hypothesis $\tilde{H} : \tilde{B}_1 < B_3$ is also an elementary hypothesis whose ordering coincides with the empirical one, so that it is a correctly ordered one. Besides, this hypothesis is not rejected due to (16) because, on the one hand, it has the same test statistic as $H$ and, on the other hand, it is tested against the same common critical value $q_{1-\alpha}$. Thus, for any hypothesis $H$ with incorrect ordering, there exists a correctly ordered hypothesis $\tilde{H}$ which has the same test statistic, so that whenever one of them is not rejected the other one is not rejected either.

Proposition 8.
Assume that we have a common standard deviation σ. In terms of ranks, the partitioning procedure defined using the rejection region (10) is equivalent to the Tukey-based method of [1]. In other words, they produce the same simultaneous confidence intervals for the ranks of the means μ 1,T , · · · , μ n,T at level 1 − α.
Due to Lemma 5, we only need to test the correctly ordered hypotheses. The rejection region for these hypotheses turns out to be a calculus of the maximum of the maximal differences inside the blocks composing the hypothesis.
Take mean $\mu_{i,T}$. Suppose that with the Tukey-based procedure, we determine a confidence interval for the rank of $\mu_{i,T}$ to be $[L_i, U_i]$. This means that we could not reject any of the hypotheses $\mu_i = \mu_j$ for $j \in [L_i, U_i]$. In other words, we have: Besides, we reject all hypotheses $\mu_i = \mu_l$ for $l \leq L_i - 1$ and $l \geq U_i + 1$. In other words Let us check which confidence interval we can get using the partitioning with (10) from these rejections and non-rejections. First of all, we have Thus any partition containing the block $\mu_i = \dots = \mu_{U_i+1}$ or the block $\mu_{L_i-1} = \dots = \mu_i$ (or larger ones) is rejected using the rejection region (10). This also entails that any hypothesis producing a larger confidence interval (more equalities) will also be rejected. Therefore, we can conclude that the confidence interval for $\mu_{i,T}$ produced by the partitioning procedure is at most the one produced by the Tukey-based procedure.
Suppose now that with the partitioning procedure, we get a confidence interval for $\mu_{i,T}$ equal to $[L_P, U_P]$. We are then sure that any hypothesis containing the block $\mu_i = \dots = \mu_{U_P+1}$ or the block $\mu_{L_P-1} = \dots = \mu_i$ is also rejected. In particular, the hypotheses $\{\mu_1 < \dots < \mu_i = \dots = \mu_{U_P+1} < \dots < \mu_n\}$ and $\{\mu_1 < \dots < \mu_{L_P-1} = \dots = \mu_i < \dots < \mu_n\}$ are rejected. This means that $\max_{j = i, \dots, U_P+1}$ for some $j_0 \in \{L_P - 1, \dots, i\}$ and $j_1 \in \{i, \dots, U_P + 1\}$ verifying $\forall j \in \{i, \dots, U_P + 1\}$, This entails that with the Tukey-based procedure, we must reject the hypotheses $\mu_i = \mu_{j_1}$ and $\mu_{j_0} = \mu_i$. Thus, the confidence interval provided by Tukey's procedure is at most the confidence interval produced by the partitioning, that is $[L_P, U_P]$. We proved that the Tukey-based procedure cannot produce larger confidence intervals than the partitioning procedure using (10), and that the latter cannot produce larger confidence intervals than the former.
Hence, both methods are equivalent in terms of ranks, that is, they produce the same simultaneous confidence intervals for the ranks.
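To make the pairwise logic of this argument concrete, here is a minimal Python sketch of rank CIs built from pairwise comparisons. It is an illustration under stated assumptions, not the implementation of [1]: the function name is hypothetical, the critical value `q` must be supplied by the caller, and the convention that $\mu_i = \mu_j$ is rejected when $|y_i - y_j| > q\sqrt{\sigma_i^2 + \sigma_j^2}$ is assumed for the example.

```python
import math

def pairwise_rank_cis(y, sigma, q):
    """Rank CIs from pairwise comparisons with a common critical value q.

    Assumed rejection rule: mu_i = mu_j is rejected when
    |y_i - y_j| > q * sqrt(sigma_i^2 + sigma_j^2).
    The lower bound is 1 + (number of means significantly below i);
    the upper bound is n - (number of means significantly above i).
    """
    n = len(y)
    cis = []
    for i in range(n):
        below = sum(1 for j in range(n)
                    if j != i and y[i] - y[j] > q * math.hypot(sigma[i], sigma[j]))
        above = sum(1 for j in range(n)
                    if j != i and y[j] - y[i] > q * math.hypot(sigma[i], sigma[j]))
        cis.append((1 + below, n - above))
    return cis
```

With observations 0, 1 and 10, unit standard deviations and a hypothetical critical value of 3, the first two means share the interval [1, 2] and the third gets [3, 3].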

A.8. Proof of Proposition 7
Proof. Assume y 1 < · · · < y n . Let H be a correctly ordered hypothesis that consists of l blocks, that is H : B 1 < · · · < B l . Assume that H is not rejected. Let μ is (μ it , resp.) denote the mean with the smallest (highest, resp.) observed value in block B i . Since H is not rejected, then for all j ∈ {i s , · · · , i t }, the rank CI of μ j includes the ranks {i s , · · · , i t }. Since the standard deviations are the same, then it implies that Tukey's procedure does not reject the hypothesis μ is = μ it and any hypothesis μ k = μ r for k, r ∈ {i s , · · · , i t }. This means that if H is not rejected, then max k,r∈{is,··· ,it} Similarly, if for all blocks of means in H (17) holds, then μ k = μ r is not rejected for μ k , μ r ∈ B i for i = 1, · · · , l. Thus, not rejecting H is equivalent to LetH be a hypothesis that results from H by switching μ i with μ j . Assume also that μ i ∈ B s and μ j ∈ B t and denoteB s andB t the new blocks after switching μ i with μ j . We only need to take care of the blocksB s , B s+1 , If ϕ(H) = 1, thenH is not rejected and μ i gets rank j whereas μ j gets rank i. On the other hand, since the empirical ranks are never rejected, μ i has already rank i in its rank CI. Since the standard deviations are assumed equal, then μ i = μ j is not rejected by the Tukey procedure. Moreover, for all k ∈ {i + 1, · · · , j − 1}, Tukey's procedure does not reject μ i = μ k .
SinceH is not rejected, then all means in blockB s get the same set-rank. If μ is corresponds to the mean with the lowest observed value in blockB s , then μ j gets also rank i s . Similarly, if μ it corresponds to the mean with the highest observed value in blockB t , then μ i gets also rank i t . Since y i − y is < y j − y is , then Tukey's procedure does not reject any of the hypotheses μ i = μ k for any k ∈ {i s , · · · , i j }. This is equivalent to pooling all the blocks B s , · · · , B t into one block. Moreover, not rejecting H is equivalent to y it − y is < q where q is the Studentized range quantile.
More generally, any conflict of ordering between the empirical ranks and the ranks that the elementary hypothesis imply leads to pooling all the blocks of means in between and all means in these blocks share the same set-ranks.

A.9. Proof of Lemma 3
Proof. Let [L i , U i ] for i = 1, · · · , n be the set of simultaneous CIs produced by the Tukey-based method. Note that according to Proposition 4, the simultaneous CIs produced by partitioning procedure defined on the elementary hypotheses (2) through function ϕ TKY are also [L i , U i ] for i = 1, · · · , n. For μ = (μ 1 , · · · , μ n ), let r i (H) be the set-rank of μ i when μ ∈ H.

Appendix B: Testing a simple order
Let $Y_1, \dots, Y_p$ be random variables distributed independently as $N(\mu_{i,T}, \sigma_i^2)$ for $i = 1, \dots, p$. We test the null hypothesis $H : \mu_1 \leq \dots \leq \mu_p$ against all alternatives based on the observation $(y_1, \dots, y_p)$. The likelihood ratio can be calculated using the pool adjacent violators algorithm, known as the PAVA (Bartholomew [8], van Eeden [45]). The function isoreg in the statistical program R does the job. Note that the maximum likelihood estimator results from the vector $y = (y_1, \dots, y_p)$ by pooling certain adjacent observations, so that the maximum likelihood estimator has at most $p$ distinct coordinates. In the literature, [38] proposed to compare the LR statistic with the quantile of a mixture of chi-squares with degrees of freedom ranging from 1 to $p$. In our paper, this is referred to as the nonadaptive test, since the critical value does not adapt to the form of the maximum likelihood estimator. The nonadaptive test is defined by
$$P(LR > \gamma) \leq P_{\mu_i = 0, \forall i}(LR > \gamma) = \sum_{j=1}^{p-1} w_{j,p}\, q_{p-j},$$
where
$$w_{1,p} = \frac{1}{p}, \qquad w_{p,p} = \frac{1}{p!}, \qquad w_{j,p} = \frac{1}{p} w_{j-1,p-1} + \frac{p-1}{p} w_{j,p-1}.$$
These weights can be calculated using Stirling numbers of the second kind, see [35]. The adaptive LR test compares the likelihood ratio statistic with the quantile of a $\chi^2(p - \ell)$ at order $1 - \alpha$, where $\ell$ is the number of levels in the result of the PAVA. The adaptive critical value is given by $q(y, \alpha) = q_{p-\ell}$. Theorem 1 from [2] shows that this adaptive LR test has level $\alpha$.
Similarly, if we want to test $H : \mu_1 = \dots = \mu_m \leq \mu_{m+1} \leq \dots \leq \mu_p$, then the PAVA provides a solution where the first $m$ observations are always pooled (possibly together with other ones). The adaptive LR test compares the LR statistic with the quantile of a $\chi^2(p - \ell)$ at order $1 - \alpha$, where $\ell$ is the number of levels in the result of the PAVA. Note that $p - \ell \in \{m - 1, \dots, p - 1\}$. The nonadaptive test compares the LR statistic with the quantile of a mixture of chi-squares with degrees of freedom ranging from $m - 1$ to $p - 1$.
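For readers who prefer code to prose, the PAVA fit and the adaptive degrees of freedom for the simple order can be sketched in Python as follows. This is not the ICRanks or isoreg implementation; the function names are hypothetical, and the LR statistic is taken to be the weighted sum of squared deviations between the observations and the isotonic fit, which is an assumption consistent with Gaussian data with known variances.

```python
def pava(y, sigma):
    """Weighted pool adjacent violators fit for mu_1 <= ... <= mu_p.

    Returns (fitted values, number of distinct levels in the fit).
    """
    blocks = []  # each block stores [weighted sum, total weight, size]
    for yi, si in zip(y, sigma):
        w = 1.0 / si ** 2
        blocks.append([w * yi, w, 1])
        # pool while the previous block mean is not below the last one
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] >= blocks[-1][0] / blocks[-1][1]):
            s, wt, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += wt
            blocks[-1][2] += c
    fitted = []
    for s, wt, c in blocks:
        fitted.extend([s / wt] * c)
    return fitted, len(blocks)

def lr_simple_order(y, sigma):
    """Return (LR statistic, degrees of freedom p - l of the adaptive test)."""
    fitted, levels = pava(y, sigma)
    lr = sum(((yi - fi) / si) ** 2 for yi, fi, si in zip(y, fitted, sigma))
    return lr, len(y) - levels
```

For example, with observations (1, 3, 2) and unit standard deviations, the last two observations are pooled at 2.5, the statistic is 0.5 and the adaptive test would use a chi-square quantile with one degree of freedom.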
2 n−1 × c where c is the length (or the lengths) of the representation. Thus, for "normal" computers it becomes easily impossible to generate such matrix (or structure) as n grows. Therefore, it is necessary to be able to generate the configurations (representations) one by one to avoid memory issues.
We propose to represent a hypothesis by keeping track of the positions of the inequalities, so that a hypothesis is split into groups of means which are equal under that hypothesis. This is the same representation considered in the paper. Let $H : B_1 < \dots < B_l$. Since the hypotheses are grouped in levels where the level number is given by the number of inequalities, then $H$ belongs to the $(l+1)$th level. This representation of $H$ also provides an efficient way to calculate the LR. Indeed, since we only test hypotheses with a correct ordering w.r.t. the empirical one, the PAVA is not needed: the LR for a given partition is only a sum over the blocks of equal means, and our representation tells us directly where the bounds of each block are. Indeed, the LR is given by
The first level has only one hypothesis, which is $\mu_1 = \dots = \mu_n$. This hypothesis is tested at the beginning of the procedure. The hypotheses from level 2 to level $n - 1$ are coded according to the positions of the inequalities among the means in the following manner. Consider first the case of 3 means $A$, $B$ and $C$; the representation of the correctly ordered hypotheses (excluding the 1st level), say $A < B = C$, $A = B < C$ and $A < B < C$, is the set For $n \leq 25$, it is possible (on a regular computer) to use the function combn from the utils package in the statistical program R in order to generate efficiently the set of configurations for levels 2 to $n - 1$. For higher values of $n$, we need to generate these configurations one by one in order to avoid memory issues.
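The representation by positions of inequalities can be illustrated with a short Python sketch. It is not the ICRanks code: the function names, the 0-based coding of positions (a cut at position $c$ places a strict inequality between the sorted means $c$ and $c+1$) and the assumption of unit standard deviations in the LR are choices made for the example.

```python
from itertools import combinations

def configurations(n, num_inequalities):
    """Lazily enumerate the positions of the inequalities (0-based, in 0..n-2)."""
    return combinations(range(n - 1), num_inequalities)

def blocks_from_cuts(n, cuts):
    """Index ranges of the blocks of tied means induced by the cut positions."""
    bounds = [0] + [c + 1 for c in cuts] + [n]
    return [range(bounds[k], bounds[k + 1]) for k in range(len(bounds) - 1)]

def lr_correctly_ordered(y_sorted, cuts):
    """LR of a correctly ordered hypothesis, assuming unit standard deviations:
    the sum over blocks of squared deviations from the block average."""
    lr = 0.0
    for block in blocks_from_cuts(len(y_sorted), cuts):
        vals = [y_sorted[i] for i in block]
        mean = sum(vals) / len(vals)
        lr += sum((v - mean) ** 2 for v in vals)
    return lr
```

With three means, `configurations(3, 1)` yields `(0,)` and `(1,)`, which under this coding correspond to $A < B = C$ and $A = B < C$, and `configurations(3, 2)` yields `(0, 1)` for $A < B < C$. Because `combinations` is a lazy iterator, the configurations are produced one by one rather than stored in memory.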
In the ICRanks package, we generate the representations for any $n$ in the same way by considering the following function $C$, well known in combinatorics as the combinatorial number system, see [26]. Consider level $l + 1$, where the hypotheses have $l$ inequalities. Let $(c_1, \dots, c_l)$ be a vector of natural numbers such that $c_1 < \cdots < c_l$. Define the function $C$ as follows:
$$C(c_1, \dots, c_l) = \binom{c_1}{1} + \cdots + \binom{c_l}{l}.$$
This is a one-to-one function between the set of configurations $\{(c_1, \dots, c_l) \in \mathbb{N}^l : 0 \le c_1 < \cdots < c_l \le n - 2\}$, which represent the correctly ordered hypotheses from level number $l + 1$, and the set of numbers $S_l = \{1, \dots, \binom{n-1}{l}\}$.
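The one-to-one property can be checked numerically on a small case. The sketch below evaluates the sum of binomial coefficients on every increasing vector with entries in $\{0, \dots, n - 2\}$. The name `rank` is ours. With the sum written this way the values run from 0 to $\binom{n-1}{l} - 1$, so we assume that a number $m$ in $S_l$ corresponds to the value $m - 1$.

```python
from itertools import combinations
from math import comb


def rank(c):
    """Combinatorial number system: sum of comb(c_i, i) for i = 1, ..., l."""
    return sum(comb(ci, i) for i, ci in enumerate(c, start=1))


n, l = 6, 3
configs = list(combinations(range(n - 1), l))  # 0 <= c_1 < ... < c_l <= n - 2
values = sorted(rank(c) for c in configs)

# Every value in 0, ..., comb(n - 1, l) - 1 is hit exactly once.
print(values == list(range(comb(n - 1, l))))
```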
In order to generate the coding, we go through the numbers in $S_l$. For each number $m$, we calculate the inverse of the function $C$ using Algorithm 1.
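For illustration, the inversion can be done with the standard greedy algorithm for the combinatorial number system, which may differ in its details from Algorithm 1. The sketch assumes a 0-based index `m` (that is, a number in $S_l$ minus one) and returns positions in $\{0, \dots, n - 2\}$; the name `unrank` is ours.

```python
from math import comb


def unrank(m, l):
    """Return (c_1 < ... < c_l) such that comb(c_1, 1) + ... + comb(c_l, l) == m."""
    c = []
    for i in range(l, 0, -1):
        # Find the largest c_i with comb(c_i, i) <= m, starting from c_i = i - 1,
        # for which comb(i - 1, i) == 0.
        ci = i - 1
        while comb(ci + 1, i) <= m:
            ci += 1
        c.append(ci)
        m -= comb(ci, i)
    return tuple(reversed(c))


# Generate the configurations with l = 3 inequalities for n = 6 means, one by one.
n, l = 6, 3
for m in range(comb(n - 1, l)):
    print(m, unrank(m, l))
```

Only the current configuration is held in memory, so the loop can be run for values of $n$ where storing all configurations at once is not feasible.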

Algorithm 1:
An iterative algorithm to calculate $C^{-1}$.
Data: Level number $l$ and a number $m$ between 1 and $\max S_l$.
Result: A vector $(c_1, \dots, c_l)$ such that $0 < c_1 < \cdots < c_l < n$.

Algorithm 2 provides pseudo-code for the procedure explained above when the standard deviations are the same. If the standard deviations are not equal, Algorithm 3 provides the corresponding pseudo-code. In both algorithms, the set $\Pi$ refers to the list of permutations (8). The set $S$ represents a randomly selected subset of $S_n$ that the user provides. For $n \le 10$, we can take $S = S_n$; otherwise the computation becomes infeasible on a normal laptop.

Algorithm 3:
(1 − α)-Simultaneous CIs when the standard deviations are not the same.

D.3. Example with more ties
Here is an example with 3 groups of 3 means (so that $n = 9$) and another with 2 groups of 4 means ($n = 8$). We follow the same setup as in Section 7, but we only use $\tau = 1$ (recall that we generate the true means from $\mathcal{N}(0, \tau)$). The results in Table 4 show that the confidence intervals are conservative, but less so than when there are no ties.

D.4. Example with estimated standard errors
In this example, we consider 2 groups of 4 means ($n = 8$). We follow the same setup as in paragraph D.3, again with $\tau = 1$ (recall that we generate the true means from $\mathcal{N}(0, \tau)$). For each true mean, we generate $m$ observations randomly from the Gaussian distribution $\mathcal{N}(\mu_{i,T}, \sqrt{m})$. Then, we calculate the sample means and sample standard errors. Note that the true standard error is 1, so that the results are comparable to those of paragraph D.3. For $m = 30$, the results in Table 5 are very close to those obtained with the true standard error. For $m = 3$, the simultaneous coverage falls slightly below the nominal level.
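The data-generating step of this example can be sketched as follows. The sketch makes several assumptions: a "group" is a set of tied true means sharing one draw from $\mathcal{N}(0, \tau)$, the second parameter of each Gaussian is read as a standard deviation (so that the true standard error of a sample mean is $\sqrt{m}/\sqrt{m} = 1$), and the function names are ours. The construction of the CIs themselves is not shown.

```python
import random
import statistics


def tied_true_means(n_groups, group_size, tau, rng):
    """Draw one value per group and repeat it, giving ties within each group."""
    means = []
    for _ in range(n_groups):
        means.extend([rng.gauss(0.0, tau)] * group_size)
    return means


def simulate_estimates(true_means, m, rng):
    """Sample m observations per mean; return sample means and standard errors."""
    estimates, std_errors = [], []
    for mu in true_means:
        obs = [rng.gauss(mu, m ** 0.5) for _ in range(m)]
        estimates.append(statistics.fmean(obs))
        # Estimated standard error of the sample mean.
        std_errors.append(statistics.stdev(obs) / m ** 0.5)
    return estimates, std_errors


rng = random.Random(2024)  # arbitrary seed, for reproducibility only
mu = tied_true_means(n_groups=2, group_size=4, tau=1.0, rng=rng)
for m in (3, 30):
    est, se = simulate_estimates(mu, m, rng)
    print(m, [round(s, 2) for s in se])
```

For small $m$ the estimated standard errors vary widely around their true value of 1, which is in line with the loss of coverage reported for $m = 3$.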