Informative Priors for the Consensus Ranking in the Bayesian Mallows Model

The aim of this work is to study the problem of prior elicitation for the consensus ranking in the Mallows model with Spearman’s distance, a popular distance-based model for rankings or permutation data. Previous Bayesian inference for such a model has been limited to the use of the uniform prior over the space of permutations. We present a novel strategy to elicit informative prior beliefs on the location parameter of the model, discussing the interpretation of hyper-parameters and the implication of prior choices for the posterior analysis.


Motivation
In recent years, interest in preference data has increased, partly due to internet-related activities. The study of rankings, in particular, has received special attention, since this type of data arises in many fields. Notable examples are electoral systems in which voters are required to rank candidates, as in the Irish general elections (Gormley and Murphy, 2008); automatic recommender systems seeking to aggregate preferences in order to suggest products to customers (Sun et al., 2012); market research based on surveys in which competing services, or items, are compared or ranked by customers (Dabic and Hatzinger, 2009); and medical applications, especially in genomics, in which genes are sometimes ranked according to their expression levels under various experimental conditions (Vitelli et al., 2018), and other data are often transformed into rankings in order to minimize the effect of miscalibration error from the measuring devices (Mollica and Tardella, 2014).
The Mallows model (MM) (Mallows, 1957; Diaconis, 1988) is a popular two-parameter distance-based family of models for ranking data, based on the assumption that a modal ranking, which can be interpreted as the consensus ranking of the population, exists. The probability of observing a given ranking is then assumed to decay exponentially fast as its distance from the consensus grows. Individual models with different properties can be obtained depending on the choice of distance on the space of permutations. The scale or precision parameter, controlling the concentration of the distribution, determines the rate of decay of the probability of individual ranks.
We focus on the Mallows model with Spearman's distance (MMS), introduced by Mallows (1957) under the name rho-model, since Spearman's distance, when re-scaled to lie between −1 and 1, arises naturally as the correlation between the ranks of two samples. Marden (1995) and Vitelli et al. (2018) have studied Bayesian inference for the MMS, limiting the analysis to a uniform, non-informative prior on the consensus ranking.
Within the Bayesian literature, non-informative and objective priors can be used to provide a sense of neutrality to the analysis by allowing the data to be the only source of information in the estimation procedure. However, when information is available from experts or external sources, it may be argued that a fully Bayesian analysis should include this subjective prior belief. Dawid (1997) clearly stated that "no theory which incorporates non-subjective priors can truly be called Bayesian, and no amount of wishful thinking can alter this reality". While admitting that both approaches may be valid in different situations, in this paper we explore the possibility of including genuine prior information, which might come from a literature review, from an expert or from an earlier data analysis, into the Bayesian Mallows model for ranking data (Vitelli et al., 2018).
Previous proposals to include prior information on the consensus ranking of a MM include Gupta and Damien (2002), who suggest eliciting a prior on the consensus which is constant on conjugacy classes. In other words, they propose a prior that assigns a priori equal probability to all permutations with the same cyclic structure. However, the conjugacy classes defined by cyclic structures do not coincide with the sets of permutations lying at the same distance (e.g. Spearman's) from the consensus ranking, making this approach impractical for the MMS, as it is difficult to assess how prior information enters the model. Meilǎ and Bao (2010) and Meilǎ and Chen (2010) consider the MM with Kendall's distance within the Bayesian paradigm and provide a conjugate prior for the model parameters which is known up to a normalization constant. However, their analysis does not extend to the MMS. Xu et al. (2018) propose an alternative family of models for rankings, based on a mapping of the data to the unit sphere (see also McCullagh, 1993). The location parameter of their model has an interpretation analogous to that of the consensus ranking, but it is not constrained to be itself a ranking, thus allowing a more general form of consensus to be expressed. The MMS is a particular case of this model, and the authors propose a conjugate Bayesian prior for the consensus parameter. However, the emphasis of their paper is on efficient inference via an approximation of the model's normalizing constant and the use of variational methods; prior elicitation and the inclusion of prior information are not discussed. In a different setting, when data consist of rankings which vary in time, Asfaw et al. (2017) introduce a dynamic version of the Bayesian Mallows model and assume a smoothing prior for modelling the slowly time-varying consensus ranking.
In the present work, which stems from Chapter 6 of Crispino (2017), we aim to provide experts using the MMS with a tool to express their beliefs, knowing the effect of prior choices on their analysis, should they wish to do so. With this in mind, by exploiting the notion of permutohedron, also known as permutation polytope (Thompson, 1993; McCullagh, 1993; Marden, 1995), we propose an explicit form for a conjugate prior on the consensus parameter of the MMS. We then study its properties, presenting some theoretical insights into the prior elicitation problem. Subjective prior information on the consensus ranking can therefore be elicited by choosing appropriate hyper-parameters. The proposed prior density can handle situations in which only partial information is available, which is particularly relevant when the set of items to be ranked is very large. In such cases it is unlikely that a full ranking is a priori available, while it could be possible to express some prior belief regarding which are the most (or least) preferred items. An additional advantage of our prior is the interpretability of the hyper-parameters in terms of the amount and type of information included.
We initially assume the scale parameter of the MMS to be known, given that in most applications it is considered a nuisance, the interest being focused on the estimation of the consensus ranking (see Vitelli et al., 2018, Section 3). In the more realistic case when the scale parameter is unknown, multiple approaches are possible. For instance, Vitelli et al. (2018) propose an exponential prior density, while Marden (1995, Section 6.4) uses the conjugate prior for the scale parameter together with a uniform prior density for the location. In this manuscript we propose as an alternative a reference prior on the scale parameter, which is a valid option when no prior information on this parameter is available.
The paper is organized as follows. In Section 2 we give an overview of the MMS. In Section 3 we discuss the novel results regarding the conjugate prior for the consensus parameter of the MMS, initially assuming the dispersion parameter to be known (Section 3.1), then (Section 3.2) working with both parameters unknown. In Section 4 we outline the MCMC algorithm used to perform inference on our model, and in Section 5 we illustrate the inference on simple examples, exploiting both simulations and real datasets. We conclude with some final remarks in Section 6.

Preliminaries
A (full) ranking of n items, or n-ranking, is defined as a map from a finite set {A_1, . . . , A_n} of labeled items to the space P_n of n-dimensional permutations. A ranking can therefore be represented by a vector r = (r_1, . . . , r_n), where r_i is the rank assigned to item A_i according to some criterion. Formally, individual ranks are ordinal numbers, so that r_i < r_j when item A_i is preferred to (ranked lower than) item A_j. Alternatively, rank data may be represented through orderings, which are ordered vectors of labels. Clearly, there is a one-to-one relationship between the two representations: e.g., a possible ranking of the set {A_1, . . . , A_5} is r = (1, 3, 4, 5, 2), corresponding to the ordering o = (A_1, A_5, A_2, A_3, A_4). Since the ranking vector representation has many advantages in terms of modelling, we use it throughout the paper, resorting to orderings only when useful for illustrative purposes. Given the trivial one-to-one relation between ordinal and cardinal numbers, with a slight abuse of notation, one may consider n-rankings as n-dimensional vectors obtained by permuting the first n natural numbers, {1, . . . , n}. It is then easy to see that P_n is contained in an (n − 1)-dimensional affine subspace of R^n. In fact, it is composed of the n! points in the intersection between the hyper-plane of vectors with coordinate sum equal to s_n = n(n + 1)/2 and the surface of an n-dimensional sphere of squared radius c_n = n(n + 1)(2n + 1)/6 centered at the origin. Thus, all the points of P_n lie on an (n − 1)-dimensional sphere of squared radius φ_n = n(n² − 1)/12 centered at ((n + 1)/2) 1_n, where 1_n ∈ R^n denotes the vector with all entries equal to 1 (McCullagh, 1993).
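As a quick sanity check (our own illustration, not part of the paper), the geometric properties above can be verified numerically by enumerating P_n for small n:

```python
from itertools import permutations

def check_geometry(n):
    """Verify that all n! permutations lie on the sphere described in the text."""
    s_n = n * (n + 1) // 2                # common coordinate sum
    c_n = n * (n + 1) * (2 * n + 1) // 6  # squared radius about the origin
    phi_n = n * (n * n - 1) / 12          # squared radius about the barycenter
    center = (n + 1) / 2
    for r in permutations(range(1, n + 1)):
        assert sum(r) == s_n
        assert sum(x * x for x in r) == c_n
        assert abs(sum((x - center) ** 2 for x in r) - phi_n) < 1e-12
    return s_n, c_n, phi_n

print(check_geometry(4))  # → (10, 30, 5.0)
```

For n = 4, every permutation has coordinate sum 10, squared norm 30, and squared distance 5 from the barycenter (2.5, 2.5, 2.5, 2.5).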
The Mallows model for ranking data (Mallows, 1957) defines the probability that a random n-ranking R takes a value r ∈ P_n as

P(R = r | ρ, θ) = exp{−θ d(r, ρ)} / Z_d(θ),    r ∈ P_n,    (1)

where ρ ∈ P_n is a location parameter representing the shared consensus ranking and θ ≥ 0 is a scale parameter describing the concentration of the mass around the shared consensus. Different families of models are obtained through different choices of the right-invariant (Diaconis, 1988) distance d(·, ·) on P_n. Right-invariance, which ensures that distances are independent of any relabeling of the items, is an important property in this context, as it ensures that the partition function Z_d(θ) = Σ_{r ∈ P_n} exp{−θ d(r, ρ_I)} of the MM does not depend on ρ (Mukherjee, 2016; Vitelli et al., 2018). In the above expression, ρ_I = (1, 2, 3, . . . , n) denotes the identity permutation. Nevertheless, the number of terms in the sum makes direct calculation of this partition function unfeasible for all but very small values of n. As a consequence, the MM is known up to a proportionality constant, except for some particular choices of the distance for which Z_d has a closed form (Fligner and Verducci, 1986). Different approximation strategies have been proposed (see e.g. McCullagh, 1993; Mukherjee, 2016; Vitelli et al., 2018), allowing inference even with a large number, n, of items. Notice that the distance function induces a partition of P_n formed by sets of rankings which are equidistant from ρ. Within each partition set, the MM assigns equal probability to all rankings. As a consequence, exact computation of the partition function is possible for moderate n, for those choices of d(·, ·) for which the cardinalities of the partition sets are known (see e.g. Irurozki et al., 2016; Vitelli et al., 2018). The partitions of P_n associated with Spearman's distance play a crucial role in understanding the behavior of the prior proposed here for the MMS.
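The partition-set structure just described can be exploited computationally: grouping P_n by distance value reduces the sum over n! terms to a sum over the few distinct distance values, weighted by their cardinalities. A minimal brute-force sketch (our own, feasible only for small n):

```python
from itertools import permutations
from collections import Counter
from math import exp

def spearman(r, rho):
    """Spearman's distance: squared Euclidean distance between two rankings."""
    return sum((a - b) ** 2 for a, b in zip(r, rho))

def partition_function(n, theta):
    """Exact Z(theta) for the MMS by enumeration of P_n (small n only)."""
    rho_I = tuple(range(1, n + 1))
    # Group P_n into partition sets of equidistant rankings; by right-invariance
    # the choice of the reference ranking rho_I is immaterial.
    counts = Counter(spearman(r, rho_I) for r in permutations(rho_I))
    # Summing over distance values, weighted by cardinalities, equals summing over P_n.
    return sum(m * exp(-theta * d) for d, m in counts.items())
```

At theta = 0 the model is uniform and Z equals n!, which provides a simple correctness check.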
In this work we focus on the Mallows model with Spearman's distance, d_S(r, ρ) = ||r − ρ||² = Σ_{i=1}^n (r_i − ρ_i)², for r, ρ ∈ P_n, which was first introduced by Mallows (1957). Notice that Spearman's distance is an unnormalized version of Spearman's rank correlation, used to measure the statistical association between the ranks of two variables; when rankings are considered as vectors in R^n, it is simply the squared Euclidean (L_2) distance. Therefore, we say that a random ranking R follows an MMS distribution, denoted by R | ρ, θ ∼ M(ρ, θ), if its probability mass function is given by

P(R = r | ρ, θ) = exp{−θ ||r − ρ||²} / Z(θ),    r ∈ P_n,    (2)

where Z(θ) := Z_{d_S}(θ) does not have a closed form. Notice that when θ = 0, the MMS reduces to the uniform distribution on P_n.
Given a sample R_1, . . . , R_N | ρ, θ iid∼ M(ρ, θ), the likelihood function takes the form

L(ρ, θ | R_1, . . . , R_N) = Z(θ)^{−N} exp{−θ Σ_{j=1}^N ||R_j − ρ||²}.    (3)

In most applications the parameter θ is considered a nuisance and the main interest is in the estimation of ρ. It can be shown that, for θ > 0, the maximum likelihood estimator (MLE) of ρ is given by

ρ_MLE = argmax_{ρ ∈ P_n} ρ · R̄ = Y(R̄),

where the dot denotes the scalar product on R^n, R̄ = (R̄_1, . . . , R̄_n) is the sample mean vector with R̄_i = (1/N) Σ_{j=1}^N R_{ij}, i = 1, . . . , n, and Y(r) = (Y_1(r), . . . , Y_n(r)) ∈ P_n is the rank transformation of the vector r, whose coordinates are defined as Y_i = Y_i(r) = Σ_{h=1}^n 1(r_h ≤ r_i), i = 1, . . . , n, with 1(E) the indicator function of the event E.
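The MLE is thus obtained by simply ranking the coordinates of the sample mean. A short sketch (the helper names are our own):

```python
def rank_transform(v):
    """Y(v): Y_i counts the coordinates of v not exceeding v_i."""
    return [sum(1 for h in v if h <= x) for x in v]

def mle_consensus(sample):
    """rho_MLE = Y(R_bar): rank transformation of the sample mean.

    The result is a proper ranking (and the MLE is unique) only when the
    coordinates of the sample mean are all distinct."""
    n = len(sample[0])
    mean = [sum(r[i] for r in sample) / len(sample) for i in range(n)]
    return rank_transform(mean)

# e.g., the sample mean reported in Section 5.1:
print(rank_transform([2.33, 2.17, 3, 2.5]))  # → [2, 1, 4, 3], matching rho_MLE there
```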
In the remainder, we propose and study an informative prior density for ρ, specifically tailored to the MMS, building on the Bayesian Mallows model for ranking data of Vitelli et al. (2018).

An informative prior
This section is devoted to the proposal of a prior distribution for the ρ parameter of the MMS. In Section 3.1 we analyze the simpler case in which the precision parameter θ is assumed known. Then, in Section 3.2, we give an intuition on how to deal with the more general and realistic case of unknown θ.

Known precision parameter
For fixed θ and ρ ∈ P_n, the likelihood (3) can be simplified as

L(ρ | θ, R_1, . . . , R_N) ∝ exp{2θN ρ · R̄},    (4)

since ||R_j − ρ||² = 2c_n − 2 R_j · ρ for R_j, ρ ∈ P_n. Notice that the sample mean R̄ belongs to the permutohedron of order n, denoted by pp_n, that is, the convex hull of the points ρ ∈ P_n ⊂ R^n. The set pp_n is sometimes called the permutation polytope (see e.g. Thompson, 1993; Marden, 1995). This term, however, also refers to a similar polytope whose vertices follow a different order; we use the term permutohedron here to avoid ambiguity.
A conjugate prior for ρ ∈ P_n is given by

π(ρ | ρ_0, η_0) = exp{−η_0 ||ρ − ρ_0||²} / Z*(η_0, ρ_0),    ρ ∈ P_n.    (5)

We call this the Extended Mallows Model with Spearman distance (EMMS) and write ρ | η_0, ρ_0 ∼ EM(ρ_0, η_0). Note that the conjugate prior (5) is analogous to the angle-based model proposed by Xu et al. (2018), originally developed in McCullagh (1993). The two hyper-parameters η_0 ≥ 0 and ρ_0 ∈ pp_n can be interpreted as precision and location parameters, respectively, analogous to those of the MMS. In particular, η_0 determines the concentration of the distribution around ρ_0, with η_0 = 0 corresponding to a uniform prior on P_n, while larger values reflect a stronger prior belief in ρ_0. Notice, however, that differently from the MMS, the modal parameter ρ_0 is not, in general, a permutation, except when it lies on the vertices of the permutohedron pp_n. Recall that Mallows models have the limitation that all rankings which are equidistant (in terms of the distance in (1)) from the consensus ranking have the same probability. For the MMS, this implies in particular that it is not possible to freely assign different masses to different rankings at the same Spearman's distance from the consensus ranking. By allowing the modal parameter of (5) to take any value in the permutohedron pp_n, that is, to be any convex combination of the elements of P_n, this structure can be broken, allowing for a more flexible distribution of the mass. In fact, the prior (5) assigns equal mass to all permutations that lie at the same Euclidean distance from ρ_0, with greater mass given to the permutations closest to ρ_0. For instance, consider the EMMS centered at the barycenter of the permutohedron, that is, with ρ_0 = ((n + 1)/2) 1_n. This results in a uniform distribution on rankings for any value of the precision parameter η_0. Small deviations from uniformity can be achieved by letting η_0 > 0 and ||ρ_0 − ((n + 1)/2) 1_n||² be small.
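For small n, the EMMS prior can be tabulated exactly by enumeration. The sketch below (our own illustration, assuming the form of (5) reconstructed above) also confirms numerically that centering at the barycenter yields the uniform distribution, since every permutation is then equidistant from ρ_0:

```python
from itertools import permutations
from math import exp, isclose

def emms_prior(rho0, eta0):
    """EMMS prior on P_n: pi(rho) proportional to exp(-eta0 * ||rho - rho0||^2),
    tabulated exactly by enumeration (small n only)."""
    n = len(rho0)
    perms = list(permutations(range(1, n + 1)))
    w = [exp(-eta0 * sum((a - b) ** 2 for a, b in zip(r, rho0))) for r in perms]
    Z = sum(w)  # Z*(eta0, rho0)
    return {r: wi / Z for r, wi in zip(perms, w)}

# Centering at the barycenter yields the uniform prior, for any eta0 > 0:
assert all(isclose(v, 1 / 24) for v in emms_prior([2.5] * 4, 3.0).values())
```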
The direction of the vector ρ_0 − ((n + 1)/2) 1_n in R^n determines the rankings for which the mass increases and those for which it decreases. The case described above, where ρ_0 = ((n + 1)/2) 1_n, is therefore equivalent to assigning to ρ the uniform prior on P_n, π(ρ) = 1/n!, as in Marden (1995) and Vitelli et al. (2018). Note that, since ρ_0 ∈ pp_n, the partition function in (5) in general depends on both η_0 and ρ_0, unless ρ_0 ∈ P_n ⊂ pp_n, in which case Z* is a function of η_0 only. This implies that (5) is known up to a normalization constant. However, in the following sections we show that this drawback can be overcome in practice.
The posterior density for ρ is given by

π(ρ | R_1, . . . , R_N, θ) ∝ exp{−θ Σ_{j=1}^N ||R_j − ρ||² − η_0 ||ρ − ρ_0||²},    ρ ∈ P_n.    (6)

The first thing we observe is that the proposed prior is indeed conjugate. In other words,

ρ | R_1, . . . , R_N, θ ∼ EM(ρ_N, η_N),    (7)

with posterior parameters

η_N = η_0 + θN,    ρ_N = (η_0 ρ_0 + θN R̄) / (η_0 + θN).    (8)

The above expressions evoke the classical result (Diaconis and Ylvisaker, 1979) that, under regularity conditions, the posterior estimates have the form of a linear combination of the prior belief and the empirical evidence. Furthermore, the reparametrization of (5) obtained by letting η_0 = θ_0 N_0 (with the possibility of choosing θ_0 = θ) shows that the mixing weights of the posterior parameters in (8) explicitly depend on N and N_0, the latter of which can be thought of as an a priori sample size, representing the amount of information on which the expert bases the prior belief about the central tendency of ρ. For any finite prior precision, as the sample size increases, the posterior accumulates mass around ρ_N, which approaches the sample mean R̄ as N increases. Some insights into the role of the prior hyper-parameters can be obtained by considering limiting situations. An infinite prior precision would express a priori certainty, by accumulating all the prior mass on ρ_0. The posterior would maintain the infinite precision, thus accumulating mass on ρ_N = ρ_0. In such a hypothetical case, learning would be possible only for infinite sample sizes. Notice that, if all the coordinates of the vector ρ_N take different values, the maximum a posteriori (MAP) estimate of ρ is unique and given by ρ_MAP = Y(ρ_N) ∈ P_n.
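The conjugate update can be sketched in a few lines (under our reconstruction of the posterior parameters above; the helper names are our own):

```python
def rank_transform(v):
    """Y(v): Y_i counts the coordinates of v not exceeding v_i."""
    return [sum(1 for h in v if h <= x) for x in v]

def posterior_update(sample, rho0, eta0, theta):
    """Posterior parameters (eta_N, rho_N) of the EMMS posterior, plus the MAP
    estimate Y(rho_N) (unique when the coordinates of rho_N are distinct)."""
    N, n = len(sample), len(sample[0])
    mean = [sum(r[i] for r in sample) / N for i in range(n)]          # R_bar
    eta_N = eta0 + theta * N
    rho_N = [(eta0 * rho0[i] + theta * N * mean[i]) / eta_N for i in range(n)]
    return eta_N, rho_N, rank_transform(rho_N)
```

With eta0 = 0 (flat prior), rho_N reduces to the sample mean and the MAP coincides with the MLE, as expected.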
The prior (5) has a shape analogous to the one discussed earlier by Gupta and Damien (2002). In their paper, however, the authors propose the use of the Hausdorff distance between subsets (conjugacy classes) of P_n, in place of the squared L_2-norm between a ranking and the location parameter of the prior (5), which is an element of the permutohedron. This difference implies that the proposal of Gupta and Damien (2002) assigns equal probability to all permutations within a conjugacy class. In particular, all rankings in the modal conjugacy class of the prior are assigned the same mass, even if information may not be available on all such rankings. Furthermore, two permutations in the same class are not necessarily close with respect to the distance used in the MM, which is a crucial element of the model specification. Our proposal, instead, is specifically tailored to the MMS, and offers the possibility of choosing whether to give maximum prior weight to a unique permutation or to more than one. In Section 5.2 we show the inferential differences resulting from using the prior of Gupta and Damien (2002) and ours.
To complete this section, we note in the following Result that the findings in Gupta and Damien (2002, Section 3.3) can be extended to our prior (5).
Result 1. Let D(ρ) := Σ_{j=1}^N ||R_j − ρ||² denote the total Spearman distance of ρ from the data, and D*(ρ) := ||ρ − ρ_0||² its squared Euclidean distance from the prior location. Then:
a) for each ρ_1, ρ_2 ∈ P_n, and given θ and η_0, the ranking ρ_1 has higher posterior probability than ρ_2 if and only if θ D(ρ_1) + η_0 D*(ρ_1) < θ D(ρ_2) + η_0 D*(ρ_2);
b) if both D(ρ_1) ≤ D(ρ_2) and D*(ρ_1) ≤ D*(ρ_2), with at least one inequality strict, then ρ_1 has higher posterior probability than ρ_2.
The result, analogous to Gupta and Damien's Theorem 2 and its corollaries, gives an intuition of the behavior of the posterior density by providing a relationship between θ and η_0 that determines which rankings receive the highest posterior probability. In Section 5.2 we illustrate, through simulated data, some of the consequences of this result on the inference.

Elicitation of the hyper-parameters
The elicitation involves the two hyper-parameters (ρ_0, η_0) of (5), which, as mentioned in the previous section, can be interpreted as a location and a precision parameter, analogous to the parameters (ρ, θ) of the MMS.
An expert would be asked her prior opinion about the modal (also referred to as consensus) ranking ρ, and to express it via the vector ρ_0. In the simplest case, we request from the expert a prior modal ranking of all the items {A_1, . . . , A_n}. If she were able to provide one, this would result in ρ_0 being a proper ranking, that is, ρ_0 ∈ P_n. However, particularly in situations when the set of items to be ranked is very large, the expert may only be able to express partial information about the consensus ranking. For example, in genomics the number n of items, corresponding to genes, is often of the order of thousands, with geneticists typically knowing only a few dozen of them, namely the k most relevant for their analysis. In such a case, the expert would be asked to rank as many items as possible, say the in-her-opinion top-k out of n. Then ρ_0 would contain the k elicited ranks, and n − k values equal to (n + k + 1)/2, the average of the remaining ranks {k + 1, . . . , n}, corresponding to the items that the expert was not able to rank. This corresponds to assigning the same prior mass to all rankings whose top-k ranks coincide with the elicited ones. The uniform distribution on this class represents the lack of prior information on the ranks of the n − k bottom items. Therefore, the vector ρ_0 would not be a ranking, ρ_0 ∉ P_n. However, being an element of pp_n, it could still be used as a hyper-parameter of the EMMS prior, conveying only partial information about the modal ranking ρ.
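The construction of a partial-information ρ_0 can be sketched as follows (a minimal illustration of our own; items are indexed from 0 by convention):

```python
def partial_rho0(n, top_k_items):
    """Top-k elicitation: item top_k_items[j] (0-based index) receives rank j + 1;
    every unranked item gets the average of the remaining ranks, (n + k + 1) / 2."""
    k = len(top_k_items)
    fill = (n + k + 1) / 2
    rho0 = [fill] * n
    for rank, item in enumerate(top_k_items, start=1):
        rho0[item] = rank
    return rho0

# rho0 lies in the permutohedron pp_n: its coordinates sum to s_n = n(n + 1) / 2
assert sum(partial_rho0(5, [0, 3, 2])) == 15
```

With k = 0 the construction returns the barycenter (no prior information); with k = n it returns a proper ranking in P_n.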
Subsequently, to elicit a value for η_0, we would reason by calibration, in the spirit of Paganin et al. (2021). Consider the prior expectation f(η_0, ρ_0) := E_{π(ρ)}[ (1/n) ||ρ − ρ_0||² | η_0, ρ_0 ]. This quantity is decreasing in η_0 for each ρ_0 ∈ pp_n, and can be interpreted as the expected average (per-item) error of the prior ranks ρ_{0i} (Vitelli et al., 2018). A value for η_0 may be found by first asking the expert to choose the a priori per-item expected error size, e_0, and then finding the value of η_0 such that f(η_0, ρ_0) = e_0. We can also guide the expert in the choice of a reasonable value of e_0, for instance by providing the range of possible values of f(η_0, ρ_0) and asking her to express a belief on the per-item expected error size as a fraction of that range, for instance e_0 = 0.5(f_max − f_min). The minimum and maximum values of f(η_0, ρ_0) depend only on k and n (that is, on the partial information carried by ρ_0) and can be easily computed, for given ρ_0, on a grid of η_0 values.²
In a different setting, we can imagine a researcher wishing to include covariate information in the analysis. For instance, H covariates x_h = (x_{h1}, . . . , x_{hn}), h = 1, . . . , H, may be available, describing features of the items. We could then introduce this information into the prior (5) by choosing a hyper-parameter ρ_0 = ρ_0(x_1, . . . , x_H) which depends on the relevant covariates. An example of the latter scenario is given in Section 5.3.
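For small n, the calibration f(η_0, ρ_0) = e_0 can be carried out exactly by enumeration plus a grid search. A minimal sketch (our own, not the authors' code):

```python
from itertools import permutations
from math import exp

def expected_error(eta0, rho0):
    """f(eta0, rho0) = E[ ||rho - rho0||^2 / n ] under the EMMS prior,
    computed exactly by enumerating P_n (small n only)."""
    n = len(rho0)
    ds, ws = [], []
    for r in permutations(range(1, n + 1)):
        d = sum((a - b) ** 2 for a, b in zip(r, rho0))
        ds.append(d / n)
        ws.append(exp(-eta0 * d))
    Z = sum(ws)
    return sum(d * w for d, w in zip(ds, ws)) / Z

def calibrate(rho0, e0, grid):
    """Return the grid value of eta0 whose expected per-item error is closest to e0."""
    return min(grid, key=lambda eta: abs(expected_error(eta, rho0) - e0))
```

For larger n the same expectation could be approximated by Monte Carlo rather than enumeration; the grid-search step is unchanged.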
Our proposed prior naturally handles the case when multiple sources of prior information are available. Notice that, since any ρ 0 ∈ pp n can be expressed as a convex combination of rankings in P n , it can always be interpreted as arising from multiple (possibly infinite) experts, the calculation of the individually elicited parameters being an exercise in linear algebra.
In the elementary case, two experts may believe, a priori, in different modal rankings, say ρ_{0,1} and ρ_{0,2}. An analyst wishing to express an equally strong prior belief in these two rankings may simply use the prior (5) with ρ_0 = (ρ_{0,1} + ρ_{0,2})/2 ∈ pp_n. More generally, an analyst may wish to aggregate the prior opinions of L experts by calculating ρ_0 and η_0 as simple averages (see e.g. Burgman et al., 2011) of the individual parameters ρ_{0,ℓ}, η_{0,ℓ} elicited from each expert ℓ. A more robust way of aggregating multiple prior opinions is pooling (O'Hagan et al., 2006), which also allows the different experts' opinions to be weighted unequally (Genest et al., 1986). However, when it is not reasonable to think that the experts provide independent observations, these approaches may not be adequate.
² If ρ_0 is the barycenter of the permutohedron, ρ_0 = ((n + 1)/2) 1_n (which corresponds to the case k = 0), the prior is uniform, and the choice of η_0 is not relevant (no prior information is available). If all the ranks of ρ_0 are elicited (which corresponds to the case ρ_0 ∈ P_n, that is, k = n), then [...].

An interesting way of dealing with dependent experts is to treat the elicited information as data, in the spirit of French (2011) and Albert et al. (2012). This latter approach amounts to performing indirect elicitation, that is, inferring the hyper-parameters of interest from the posterior of a preliminary Bayesian analysis. In our framework, this means that, instead of directly using a combination of the experts' opinions ρ_{0,ℓ}, ℓ ≥ 1, as hyper-parameter of (5), we treat the ρ_{0,ℓ}, ℓ ≥ 1, as conditionally independent data and elicit the hyper-parameters ρ_0 and η_0 from a preliminary analysis. More formally, letting ρ_{0,1}, . . . , ρ_{0,L} | ρ_0, η_0 iid∼ M(ρ_0, η_0), we can compute, based on the posterior density of such an analysis, estimates η̂_0 and ρ̂_0 and set them equal to the hyper-parameters η_0 and ρ_0 of (5), respectively. In Section 5.3 we show how this can be done in a practical example.

Unknown precision parameter
When θ is unknown, the Bayesian paradigm requires a prior on the pair of parameters (ρ, θ). We suggest choosing a joint prior of the form π(ρ, θ) = π(θ)π(ρ | θ), where π(ρ | θ) is the EMMS of (5). Notice that the particular case of prior independence, π(ρ, θ) = π(θ)π(ρ), is achieved in practice by choosing the hyper-parameter η_0 independently of θ. Regarding the choice of π(θ), some proposals exist in the literature, for instance an exponential density (Vitelli et al., 2018) or the conjugate prior of Marden (1995). Both options can be employed in our framework if the researcher wishes to include prior information on the θ parameter. As an alternative, we suggest the use of the Jeffreys prior for θ, which, for small values of n, can be computed exactly and is an interesting option when no information on θ is available a priori. The following proposition, proved in Crispino and Antoniano-Villalobos (2022), holds for any MM with a right-invariant distance, and in particular for the MMS.

Proposition 1. The Jeffreys prior for θ in an MM with right-invariant distance d takes the form

π_J(θ) ∝ √( V_{R|θ}[ d(R, ρ_I) ] ),

where V_{R|θ} denotes the variance with respect to R ∼ M(ρ_I, θ), which depends on θ.
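For small n, the variance in Proposition 1 can be evaluated exactly by enumeration. An illustrative sketch under Spearman's distance (our own implementation, returning the unnormalized density):

```python
from itertools import permutations
from math import exp, sqrt

def jeffreys_unnormalized(theta, n):
    """Unnormalized Jeffreys prior sqrt(Var[d_S(R, rho_I)]) under R ~ M(rho_I, theta),
    computed by enumerating P_n (small n only)."""
    rho_I = tuple(range(1, n + 1))
    dists = [sum((a - b) ** 2 for a, b in zip(r, rho_I)) for r in permutations(rho_I)]
    w = [exp(-theta * d) for d in dists]
    Z = sum(w)
    m1 = sum(d * wi for d, wi in zip(dists, w)) / Z        # E[d]
    m2 = sum(d * d * wi for d, wi in zip(dists, w)) / Z    # E[d^2]
    return sqrt(max(m2 - m1 * m1, 0.0))                    # guard against round-off
```

Evaluating this on a grid of θ values yields the prior density up to normalization; as θ grows, the mass of the MMS concentrates near ρ_I and the variance, hence the prior density, decreases.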
The posterior density of the model parameters, with the conjugate prior π(ρ | θ) given in (5), is

π(ρ, θ | R_1, . . . , R_N) ∝ π(θ) exp{−θ Σ_{j=1}^N ||R_j − ρ||² − η_0 ||ρ − ρ_0||²} / (Z(θ)^N Z*(η_0, ρ_0)).    (12)

Equation (12) can be easily evaluated in two cases: (a) when Z* does not depend on θ, that is, when η_0 is independent of θ (prior independence scenario); or (b) when η_0 = θN_0 and n is small enough that Z* can be calculated exactly, for given prior hyper-parameters ρ_0 and N_0 (see also Section 4).
The more problematic case (c), when η_0 = θN_0 and n is too large for computing Z* exactly, can be handled by using as prior density for θ, π_large n(θ) ∝ Z*(θN_0, ρ_0), so that the posterior density (12) can be written as

π(ρ, θ | R_1, . . . , R_N) ∝ exp{−θ [ Σ_{j=1}^N ||R_j − ρ||² + N_0 ||ρ − ρ_0||² ]} / Z(θ)^N,

in which the intractable Z* cancels. We believe that the choice of π_large n(θ), although motivated by the simplification of the posterior, nevertheless represents a sensible belief. Indeed, Z*(θN_0, ρ_0) is a decreasing function of θ, and its shape is dominated by an exponential with a rate parameter depending on both N_0 and ρ_0. The larger N_0, the more peaked the density is around θ = 0 (reducing to the improper constant on R_+ when N_0 = 0). The hyper-parameter ρ_0 also affects the tightness of the prior (the larger the number of elicited ranks, the more peaked the density around θ = 0), but its influence is milder.
In the next section we outline the algorithms developed for inference on the MMS in both cases of known and unknown θ, within the situations (a), (b) and (c) described above.

Posterior simulation
Notice that, when θ = θ* is known, the posterior (7) is known up to a normalization constant. Posterior simulation is straightforward in this case, the main difficulty being the summarization and visualization of the posterior over the complex space of permutations. In this simple case, we employ a Metropolis-Hastings (M-H) Markov chain Monte Carlo (MCMC) scheme for the update of ρ. We propose ρ′ according to the Leap and Shift distribution of Vitelli et al. (2018), which is an asymmetric proposal centered around the current value of ρ. We then accept ρ′ with probability α = min{1, a_ρ}, where

log a_ρ = 2θ* (ρ′ − ρ) · R̃ + log p_LS(ρ | ρ′) − log p_LS(ρ′ | ρ),    (14)

with R̃ = N R̄ + N_0 ρ_0, and p_LS denoting the transition probability of the Leap and Shift distribution. Notice that, for the sake of simplicity, we are considering the case η_0 = θ*N_0, but the results follow trivially for other parametrizations.
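A minimal sampler along these lines can be sketched as follows. For simplicity we replace the Leap and Shift proposal with a symmetric random transposition, so the proposal terms in the acceptance ratio cancel; this is our own simplification, not the algorithm of Vitelli et al. (2018):

```python
import random
from math import exp

def mh_consensus(sample, rho0, N0, theta, iters=2000, seed=1):
    """M-H sampler for rho with known theta (sketch). Uses a symmetric swap
    proposal instead of Leap and Shift, so the proposal densities cancel."""
    rng = random.Random(seed)
    N, n = len(sample), len(sample[0])
    # R_tilde = N * R_bar + N0 * rho0, as in the text (case eta_0 = theta * N0)
    Rt = [sum(r[i] for r in sample) + N0 * rho0[i] for i in range(n)]
    rho = list(range(n, 0, -1))            # arbitrary starting permutation
    for _ in range(iters):
        i, j = rng.sample(range(n), 2)     # propose swapping two coordinates
        prop = rho[:]
        prop[i], prop[j] = prop[j], prop[i]
        # log acceptance ratio; proposal terms cancel for a symmetric proposal
        log_a = 2 * theta * sum((p - c) * t for p, c, t in zip(prop, rho, Rt))
        if log_a >= 0 or rng.random() < exp(log_a):
            rho = prop
    return rho
```

On a sharply concentrated sample the chain quickly settles on the posterior mode; in practice one would of course retain the whole chain rather than only its final state.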
When θ is not known, we implement a Metropolis within Gibbs scheme for posterior simulation. However, further considerations must be made for the different cases outlined in Section 3.2. First, we consider case (a), where ρ is assumed a priori independent of θ, which amounts to eliciting η 0 of (5) independently of θ; in cases (b) and (c) the precision parameter of the EMMS takes the form η 0 = θN 0 .
In case (a), Z* is simply a constant, so it creates no additional difficulty. Posterior inference can be performed with the efficient scheme of Vitelli et al. (2018, Algorithm 1), by simply modifying the acceptance probabilities of the M-H steps to include the non-uniform prior density on ρ.
In cases (b) and (c) we have the additional issue of dealing with Z*, for which different solutions are possible. In case (b), that is, for small n, we can compute Z* on a grid of η_0 values; whenever its evaluation is required within the M-H step for the update of θ, an approximate value can be obtained via interpolation for values of η_0 = θN_0 not in the grid. In this case we therefore have two steps. First, we update ρ conditional on θ from its posterior full conditional (see (12)). This is done as described above: we propose ρ′ according to the Leap and Shift distribution and accept it with probability α = min{1, a_ρ}, where a_ρ is given in (14), with θ* equal to the current value of θ. Second, we update θ conditional on ρ. The posterior full conditional for θ is

π(θ | ρ, R_1, . . . , R_N) ∝ π(θ) exp{−θ (g̃ − 2ρ · R̃)} / (Z(θ)^N Z*(θN_0, ρ_0)),

where g̃ = (2N + N_0) c_n + N_0 ||ρ_0||². The proposal θ′ is sampled from a log-normal density centered on the current value of θ, with variance tuned to obtain a desired acceptance rate.
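The log-normal random-walk update for θ can be sketched generically, for any unnormalized log full conditional; note the Jacobian correction log θ′ − log θ required by the asymmetric proposal (a sketch of our own, not the authors' implementation):

```python
import random
from math import exp, log

def lognormal_mh_step(theta, log_target, sigma, rng):
    """One M-H update of theta > 0 with a log-normal proposal centered at the
    current value; log(theta_prop) - log(theta) corrects for the proposal's
    asymmetry (ratio of log-normal densities)."""
    theta_prop = theta * exp(sigma * rng.gauss(0, 1))
    log_a = (log_target(theta_prop) - log_target(theta)
             + log(theta_prop) - log(theta))
    if log_a >= 0 or rng.random() < exp(log_a):
        return theta_prop
    return theta
```

Here `log_target` would evaluate the log of the full conditional above, with Z* interpolated from the precomputed grid; the sketch is agnostic to that choice.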
In case (c), that is, for large values of n, only the prior for θ, and therefore its posterior full conditional, changes; the latter is given by

π(θ | ρ, R_1, . . . , R_N) ∝ exp{−θ (g̃ − 2ρ · R̃)} / Z(θ)^N.

Posterior simulation is therefore identical to that of case (b), with the obvious difference in the acceptance probability for θ.

Illustrative analyses
The examples considered in this section have multiple purposes. First, we illustrate the effects of our prior on the inference through very elementary datasets (Sections 5.1 and 5.2). Second, we show an example of how to elicit the hyper-parameters of interest based on covariates (Section 5.3). Finally, in Section 5.4, we consider an application to data related to the COVID-19 pandemic, where the inclusion of prior information is relevant.

Simulation study
In this section we illustrate the effect of the prior on the posterior via a small simulated dataset. A small n is used so that all possible permutations can be listed.
We generate a sample of N = 30 rankings in P_4 from the MMS with true parameters ρ* = (2, 1, 4, 3) and θ* = 0.06. We then set the prior consensus to ρ_0 = (2, 1, 3, 4), and perform inference on the model in different settings corresponding to increasing prior sample sizes, for the prior parametrization η_0 = θN_0 and the Jeffreys prior for θ. The observed sample mean vector is R̄ = (2.33, 2.17, 3, 2.5), which leads to ρ_MLE = Y(R̄) = (2, 1, 4, 3). We report in Table 1 the estimated posterior probability (EPP) of each of the rankings in P_4. Notice that ρ_MLE is the ranking with the smallest value of D(ρ).

Table 1: Results of the simulation study of Section 5.1. List of the 24 4-rankings (column 1), along with the quantities D(ρ) and D*(ρ) defined in Result 1 (columns 2 and 3, respectively). Columns 4 to 9 contain the estimated posterior probabilities of each ranking (rows) in each setting, for increasing values of N_0. Four rows are highlighted: in dark-gray, the prior consensus ρ = ρ_0 (D*(ρ) = 0); in light-gray, the rankings nearest ρ_0 (D*(ρ) = 2). The MLE (where D(ρ) = 220 is minimized) is indicated in bold.
We can also notice the following sensitivity behavior of the posterior probabilities: as N_0 increases, the rankings which are closer to ρ_0 (in terms of Spearman's distance, or equivalently with a smaller D*(ρ)) have increasing posterior probabilities, while those that are farthest from ρ_0 have decreasing posterior probabilities, even when their distance to the data D(ρ) is not too high. An example of this can be seen in the row corresponding to ρ = (3, 1, 4, 2), which has D(ρ) = 230 and D*(ρ) = 6, and for which increasing N_0 from 0 to 20 has the effect of decreasing the posterior probability from 0.169 to 0.012. The posterior means of θ in the six settings were 0.068, 0.074, 0.065, 0.06, 0.057, 0.055, while θ_MLE = 0.08.

Table 2: Results for the idea dataset. List of the orderings corresponding to the rankings with the highest observed frequencies in the data (columns 1 and 2, respectively), along with their EPP in different settings, corresponding to values of N_0 between 0 and N (columns 3 to 8). In column 9 we report the Spearman distance between each ranking and the prior mode. The highest EPP of each setting is highlighted in bold.

idea dataset
For illustrative purposes, in this section we use the benchmark dataset idea (see e.g. Fligner and Verducci, 1990; Gupta and Damien, 2002). The data, collected by the Graduate Record Examination (GRE) Board, consist of a sample of N = 98 rankings, each of them generated by a college student who was asked to rank n = 5 words according to their strength of association with the target word 'idea'. The five words are 'thought' (A), 'play' (B), 'theory' (C), 'dream' (D), and 'attention' (E). Our aim is to show the effect of our informative prior for ρ on the inference. Since n is very small in this example, we can use the exact framework for posterior simulation outlined in Section 4, and choose the Jeffreys prior for the parameter θ, thus reflecting our lack of prior knowledge. In this example, we assume there is reason to believe that o_0 = (A, D, C, B, E) is the true ordering of association of the five words. We therefore choose the corresponding ranking vector ρ_0 = (1, 4, 3, 2, 5) as the prior mode. The choice of N_0, interpreted as an equivalent sample size, reflects our confidence in ρ_0, so we consider different settings, corresponding to increasing values of N_0. Inference is carried out via MCMC posterior simulation, using a sample size of 5 × 10^4 iterations, after a burn-in of 5 × 10^3, and the results are shown in Table 2. The orderings corresponding to the most frequently observed rankings in the dataset and their empirical frequencies or sample proportions are shown in columns 1 and 2, respectively, along with their estimated posterior probabilities (EPP) in the different settings (columns 3 to 8). In column 9 we report the Spearman distance between each of the top observed rankings and the prior mode (that is, D*(ρ)).
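MCMC posterior simulation over rankings can be sketched with a simple Metropolis sampler using random-swap proposals. The snippet below is our own minimal illustration, not the sampler of Section 4: θ is held fixed, and the prior on ρ is taken as π(ρ) ∝ exp(−η_0 d_S(ρ, ρ_0)), with d_S the Spearman distance.

```python
import math
import random

def spearman(r, s):
    # Spearman's distance: sum of squared rank differences
    return sum((a - b) ** 2 for a, b in zip(r, s))

def log_post(rho, data, rho0, theta, eta0):
    # Log of the unnormalized posterior:
    #   -theta * sum_j d(R_j, rho) - eta0 * d(rho, rho0)
    return (-theta * sum(spearman(R, rho) for R in data)
            - eta0 * spearman(rho, rho0))

def metropolis(data, rho0, theta, eta0, n_iter=5000, seed=1):
    rng = random.Random(seed)
    n = len(rho0)
    rho = list(range(1, n + 1))           # start from the identity ranking
    lp = log_post(rho, data, rho0, theta, eta0)
    chain = []
    for _ in range(n_iter):
        i, j = rng.sample(range(n), 2)    # propose swapping two entries
        prop = rho[:]
        prop[i], prop[j] = prop[j], prop[i]
        lp_prop = log_post(prop, data, rho0, theta, eta0)
        # Symmetric proposal, so the Metropolis acceptance ratio suffices
        if math.log(rng.random()) < lp_prop - lp:
            rho, lp = prop, lp_prop
        chain.append(tuple(rho))
    return chain
```

Estimated posterior probabilities (EPPs) like those in Table 2 can then be obtained as the relative visit frequencies of each ranking in the chain, after discarding a burn-in.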
Recall that our prior (5) assigns equal mass to all rankings at the same Spearman distance from ρ_0. This behavior has some analogies with the prior of Gupta and Damien (2002). However, while there is always a unique ranking at Spearman's distance 0 from ρ_0, each conjugacy class contains more than one ranking, all of which are assigned the same mass by the prior of Gupta and Damien (2002), henceforth GD. As we show below, this difference has a relevant effect on the posterior inferences based on our prior (5), when compared to the results of GD.
From this table we can notice the following:
• the EPP of (A, D, C, B, E), which corresponds to the prior mode ρ_0 (row 4), increases consistently with N_0; when N_0 = N, it becomes the posterior modal ranking;
• the ordering (A, C, D, E, B), corresponding to ρ_MLE (row 1), remains the ranking with the largest EPP provided that the equivalent sample size N_0 is not too large; in other words, provided that the prior does not assign too much mass to ρ_0 ≠ ρ_MLE;
• the relative ordering of the seven rankings in terms of posterior probability depends on N_0, changing for large values which imply strong prior information.
Comparing our results with the findings of GD (Table 3), we notice that:
1. the posterior distribution of GD places most of the mass (about 0.93) on the top 6 rankings, thus penalizing all other rankings in P_5;
2. the EPP of the prior modal ranking with ordering (A, D, C, B, E) obtained by GD does not increase with the concentration parameter (in their paper denoted by λ*), but rather decreases (from 0.019 when λ* = 0, to 0.0067 when λ* = 0.1). This is not in line with the expected behavior of an informative prior.
Our posterior distributions, instead, are generally flatter and, importantly, do not show the contradictory behavior with respect to the concentration parameter exhibited by the results of GD, which is probably a consequence of the complex structure of the conjugacy classes of P_5.

The prior elicitation problem in practice
In this section we show an example of prior elicitation based on covariates. For the illustration we use the sushi benchmark data of Kamishima (2003), which consist of full rankings of n = 10 different kinds of sushi items given by N = 5000 respondents according to their personal preference. This dataset, available at http://www.kamishima.net/sushi/, has been extensively analyzed (see for instance Lu and Boutilier, 2011; Vitelli et al., 2018; Xu et al., 2018), and exploited to show inferential results under different models. Here we are not interested in doing inference on this dataset (which would require a mixture model extension and a deeper analysis), but rather in illustrating how to elicit the hyper-parameters of the proposed prior in a real case study. Indeed, this dataset is particularly interesting because it includes covariates of the sushi items, which we use to build an informative prior over the consensus ranking.

Table 3: Covariate values of interest (columns) for each of the n = 10 sushi items (rows).
We begin with the elicitation of the consensus ranking hyper-parameter ρ_0 of (5). The following covariates of the sushi items (see Table 3) are likely to have an impact on the personal preference of the respondents:
1. oil: the oiliness in taste (measured on a 0-4 continuous scale, where the smaller the value, the more oily the sushi item);
2. eat: the frequency with which the sushi item is eaten in sushi shops (measured on a 0-3 continuous scale, where high values correspond to highly frequently eaten items);
3. price: the normalized price of the item;
4. sell: the frequency with which the sushi item is sold (measured on a 0-1 continuous scale, where high values correspond to highly frequently sold items).
According to our judgement, we believe that: i) the more oily the sushi item, the more it is preferred; ii) the more frequently eaten, the more it is preferred; iii) price is positively correlated with preference; iv) the more a sushi item is sold, the more it is preferred. Clearly, the above assumptions are subjective, and someone else may decide to include these covariates differently (for instance, another judge could let the price play the opposite role). Table 4 shows the rank vectors obtained from the above criteria by applying the rank transformation Y introduced in Section 2 to the covariate vectors of Table 3. Notice that, while the transformation gives rise to proper rankings when applied to the oil, eat and price variables, it does not result in a proper ranking when applied to the sell variable (column 5): sea eel, tuna, sea urchin and salmon roe have the same covariate value (0.88 in Table 3), which results in a tied rank (3.5 in Table 4); similarly, shrimp and egg have the same value (0.84), resulting in the tied rank 6.5. Nonetheless, the transformed vector for the covariate sell is an element of the permutation polytope pp_10, and is therefore a valid choice for the hyper-parameter ρ_0.
An interesting feature of Table 4 is that the rankings induced by the different covariates are not equal but partially agree. The researcher would therefore be interested in combining the prior information coming from these different sources.

Table 4: Rank vectors obtained by applying the rank transformation Y to the covariate vectors of Table 3.

Sushi item    | oil | eat | price | sell
shrimp        |  9  |  2  |   6   |  6.5
sea eel       |  3  |  5  |   4   |  3.5
tuna          |  5  |  1  |   5   |  3.5
squid         |  8  |  4  |   8   |  1
sea urchin    |  2  |  9  |   2   |  3.5
salmon roe    |  4  |  6  |   3   |  3.5
egg           |  7  |  8  |   9   |  6.5
fatty tuna    |  1  |  3  |   1   |  8
tuna roll     |  6  |  7  |   7   |  9
cucumber roll | 10  | 10  |  10   | 10

The simplest possibility in this regard is to set the prior consensus hyper-parameter equal to the average of the rankings induced by the four covariates, that is, ρ_01 = (5.875, 3.875, 3.625, 5.25, 4.125, 4.125, 7.625, 3.25, 7.25, 10) ∈ pp_10, or to its rank vector, ρ_02 = Y(ρ_01) = (7, 3, 2, 6, 4.5, 4.5, 9, 1, 8, 10) ∈ pp_10. Alternatively, the four rankings could be given unequal weights, which would amount to calculating a weighted average, in the same spirit as Genest et al. (1986). The elicitation of the precision parameter η_0 requires more qualitative reasoning. Considering the parametrization η_0 = θ_0 N_0, we may decide to fix N_0 = 4, since the consensus hyper-parameter comes from the average of four rankings, which may be interpreted as the opinions of four experts. At the same time, we may choose a relatively large value of θ_0, for instance θ_0 = 0.1 (which is considered large, given the scale of the problem), thus reflecting confidence in ρ_0, given the partial agreement of the four rankings used to construct the consensus hyper-parameter.
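The averaging step and the rank transformation Y (with average ranks for ties) can be sketched as follows; this is our own minimal Python illustration, and the function name `midrank` is ours.

```python
def midrank(v):
    # Rank transformation Y: smallest value gets rank 1; tied values receive
    # the average (mid) rank, so the output lies in the permutation polytope.
    order = sorted(range(len(v)), key=lambda i: v[i])
    ranks = [0.0] * len(v)
    i = 0
    while i < len(v):
        j = i
        # extend j over the block of values tied with position i
        while j + 1 < len(v) and v[order[j + 1]] == v[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1            # average of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

# Rank vectors of Table 4 (rows: items; columns: oil, eat, price, sell)
table4 = [
    (9, 2, 6, 6.5), (3, 5, 4, 3.5), (5, 1, 5, 3.5), (8, 4, 8, 1),
    (2, 9, 2, 3.5), (4, 6, 3, 3.5), (7, 8, 9, 6.5), (1, 3, 1, 8),
    (6, 7, 7, 9), (10, 10, 10, 10),
]
rho01 = [sum(row) / 4 for row in table4]   # average of the four rankings
rho02 = midrank(rho01)                     # its rank vector Y(rho01)
```

For the values of Table 4 this reproduces ρ_01 = (5.875, 3.875, 3.625, 5.25, 4.125, 4.125, 7.625, 3.25, 7.25, 10) and ρ_02 = (7, 3, 2, 6, 4.5, 4.5, 9, 1, 8, 10) given above.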
Another option for the elicitation of η_0 is to reason by calibration, as explained in Section 3. After having elicited the prior modal vector ρ_02 = Y(ρ_01) = (7, 3, 2, 6, 4.5, 4.5, 9, 1, 8, 10) ∈ pp_10, the analyst may elicit the a priori expected per-item error size, e_0, and then solve for η_0. To do so, we first compute the expected error size range with n = 10 and k = 8, which results in f_max − f_min = 16.4. Then, we elicit e_0 as a fraction of the range, say e_0 = 0.5(f_max − f_min) = 8.2. Finally, we find the value η_0 such that f(η_0, ρ_0) = e_0. For instance, if e_0 = 8.2 the corresponding value for the prior precision is η_0 = 0.03, while if e_0 = 0.1(f_max − f_min) = 1.64, then η_0 = 0.2, reflecting the larger a priori confidence of the expert in the elicited modal ranking.
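Solving f(η_0, ρ_0) = e_0 can be done numerically, since the expected error size decreases monotonically in η_0. The sketch below is generic: it assumes only that a function `expected_error(eta0)` implementing the f of Section 3 is available; the toy `f_toy` used to exercise it is ours for illustration, not the paper's f.

```python
import math

def solve_eta0(expected_error, e0, lo=1e-6, hi=10.0, tol=1e-8):
    # Bisection on a monotonically decreasing function of eta0:
    # find eta0 such that expected_error(eta0) == e0.
    for _ in range(200):
        mid = (lo + hi) / 2
        if expected_error(mid) > e0:
            lo = mid          # error still too large: need more prior precision
        else:
            hi = mid
        if hi - lo < tol:
            break
    return (lo + hi) / 2

# Toy stand-in for f(eta0, rho0): decreasing in eta0, range f_max - f_min = 16.4
f_toy = lambda eta0: 16.4 * math.exp(-5 * eta0)
eta0 = solve_eta0(f_toy, e0=8.2)
```

With the actual f of Section 3, the same routine returns the values η_0 = 0.03 (for e_0 = 8.2) and η_0 = 0.2 (for e_0 = 1.64) quoted above.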
Another interesting possibility, mentioned in Section 3.1, is to treat the four rankings as data and perform a prior analysis on them, through a simple Mallows model. The posterior estimates of the parameters resulting from this prior analysis can then be used as the elicited hyper-parameters of the prior. We proceeded as follows. First, we converted the four rankings to pairwise preferences: each item was preferred to all items with strictly higher rank. This preprocessing was done in order to take the sell covariate into account, since the Mallows model does not admit rankings with ties (like the sell covariate of Table 4) as input. After this transformation, however, the Mallows model for pairwise preferences (Vitelli et al., 2018) can be used. We fit a Mallows model with Spearman's distance on such data, and a point estimate is obtained from the posterior distribution of the consensus ranking, which is then used as the hyper-parameter in the prior. We choose, as the point estimate, the cumulative probability (CP) consensus ranking, obtained by first assigning rank 1 to the item with the maximum a posteriori marginal probability of having rank 1, then assigning rank 2 to the item, among the remaining ones, with the maximum a posteriori marginal probability of having rank 1 or 2, and so on. As noted in Vitelli et al. (2018), the CP consensus ranking is a robust estimator which can be seen as a sequential MAP estimator. We obtain ρ̂_CP = (7, 3, 2, 6, 4, 5, 9, 1, 8, 10) and the posterior mean of the scale parameter is θ̂ = 0.44. The results of this prior analysis can now be used as hyper-parameters of the prior (5), by setting η_0 = θ̂ and ρ_0 = ρ̂_CP.
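The CP consensus described above can be computed from a set of posterior draws of ρ; the sketch below is our own minimal implementation of that sequential rule, with function and variable names of our choosing.

```python
def cp_consensus(samples):
    # samples: list of rank vectors (tuples of equal length n), e.g. MCMC draws of rho.
    # Sequentially assign rank r to the not-yet-assigned item that maximizes
    # the posterior marginal probability P(rank of item <= r).
    n = len(samples[0])
    remaining = set(range(n))
    consensus = [0] * n
    for r in range(1, n + 1):
        best_item, best_prob = None, -1.0
        for i in remaining:
            # marginal posterior probability that item i has rank at most r
            prob = sum(1 for s in samples if s[i] <= r) / len(samples)
            if prob > best_prob:
                best_item, best_prob = i, prob
        consensus[best_item] = r
        remaining.remove(best_item)
    return consensus
```

Ties in the cumulative marginals are broken here by item order; as a sequential MAP estimator, the procedure depends only on the posterior marginals, which makes it robust to multimodality in the joint posterior.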

A real-world example: COVID-19 and Italian support policies
In this section we use some data collected by the Bank of Italy in order to show the use of our proposed prior.
The data are part of a special survey (Iseco) carried out between March and May 2020, a period marked by the spread of the COVID-19 pandemic and by the containment measures taken by the Italian Government. The survey contained questions intended to assess how the pandemic was affecting firms' business and how firms were responding to it. In total, 3503 firms were interviewed.
We here focus on one particular question (answered by N = 3462 firms), which asked the firms which support policies were judged most appropriate to contain the impact of the spread of the Coronavirus on the economy. Each firm was asked to select up to two policies (among n = 8 possible options, labelled a1 to a8) in order of importance. As such, the answers to the question are an example of top-2 rankings.
Interestingly, the survey was conducted over a 10-week period, and at each time point a different sample of firms was interviewed. We therefore divide the original sample into 10 sub-samples corresponding to the week in which the survey was answered by the firm, and run the model separately in each time period. We denote by R^(t) the sample of top-2 rankings provided by the N^(t) firms at time point t, and assume that, for each week t, there is a consensus ranking ρ^(t) of the eight answers, which reflects the consensus of the N^(t) firms at time t. We model this with a Mallows model, thus assuming that, for each t, the rankings in R^(t) are drawn from an MMS with consensus ρ^(t). We aim at making inference on ρ^(1), ..., ρ^(10).
The data, in each time period, are analyzed with an adapted version of the Bayesian Mallows model for partial data (see Vitelli et al., 2018), which can handle top-k rankings with the help of data augmentation techniques, in two different settings:
a. with the uniform prior over ρ^(t), t = 1, ..., 10; this amounts to setting the central parameter of (5) equal to the barycenter of the permutohedron, ρ_0 = ((n+1)/2) 1_n, in each time period t;
b. using a summary of the posterior density of ρ^(t) as hyper-parameter for prior (5) in the following week's inference (at time t + 1).

Figure 1: Results of scenario a. The development of the ranks of the considered 8 options (a1 to a8, see right vertical axis). Ranking obtained from the posterior cumulative probability rankings, computed from the marginals of the posteriors π_{N^(t)}(ρ^(t) | R^(t)), t = 1, ..., 10. The best option has rank 1.
In Figure 1, obtained under setting a., the CP consensus ranking of the eight options is reported (on the right vertical axis) for each time period (on the x-axis; in brackets, the number of firms interviewed during the corresponding week is also shown).

Figure 2: Results of scenario b. The development of the ranks of the considered 8 options (a1 to a8, see right vertical axis). Ranking obtained from the posterior cumulative probability rankings, computed from the marginals of the posteriors π_{N^(t)}(ρ^(t) | R^(t)), t = 1, ..., 10. The best option has rank 1.
We see that answer a1 (gray line) gains popularity over time, passing from being ranked third (first three weeks) to being ranked first in the remaining weeks. Answer a5 (yellow line), instead, goes from being the first choice (first three weeks) to being ranked third in the following weeks.
We then repeat the analysis using setting b. and inspect the differences in the inference. Under scenario b., where the inference at time t + 1 is enriched with the prior information gained in period t, one might expect the resulting estimates to be more stable than those in scenario a. Indeed, in Figure 2 we see that answer a5 (yellow line) loses popularity over time as in scenario a., but it passes from being ranked first to being ranked third more smoothly, by being ranked second in week 4. Note that, indirectly, this may also affect the ranks of the other answers, because the ranks are mutually exclusive (see how a2, orange line, makes room for a5 in week 4, allowing the smoother adjustment). The increased stability in the time development of some ranks is also apparent when looking at answers a4 and a3 (green and light blue lines, respectively).
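The sequential scheme of scenario b. can be illustrated on a toy problem with few items, where the posterior over rankings can be enumerated exactly. The sketch below is our own illustration, not the code used for the survey analysis: it summarizes each week's posterior by its mode (rather than the CP consensus), holds θ fixed, and uses n = 4 items instead of the n = 8 survey options.

```python
from itertools import permutations

def spearman(r, s):
    # Spearman's distance: sum of squared rank differences
    return sum((a - b) ** 2 for a, b in zip(r, s))

def posterior_mode(data, rho0, theta, eta0):
    # Exact posterior mode over P_n by enumeration:
    #   argmax_rho  -theta * sum_j d(R_j, rho) - eta0 * d(rho, rho0)
    n = len(rho0)
    return max(
        permutations(range(1, n + 1)),
        key=lambda rho: (-theta * sum(spearman(R, rho) for R in data)
                         - eta0 * spearman(rho, rho0)),
    )

def sequential_analysis(weekly_data, theta=0.1, eta0=0.5):
    n = len(weekly_data[0][0])
    rho0 = tuple([(n + 1) / 2] * n)      # week 1: barycenter of the permutohedron
    estimates = []
    for data_t in weekly_data:
        rho_hat = posterior_mode(data_t, rho0, theta, eta0)
        estimates.append(rho_hat)
        rho0 = rho_hat                   # week t's summary becomes week t+1's prior location
    return estimates
```

Because each week's prior is centered at the previous week's estimate, the sequence of estimated consensus rankings changes more smoothly over time, which is exactly the stabilizing effect observed in Figure 2.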

Conclusion
In this paper we have proposed an informative prior distribution for the consensus ranking of the Mallows model with Spearman's distance. The peculiarity of the proposed prior is that it is a location-scale family whose location parameter does not need to be a ranking. This is convenient for the elicitation problem, since the prior can naturally handle the case when it is difficult to indicate a full ranking which is a priori the most likely. For instance, when the total number of items in the application considered is very large, it may be unrealistic to expect an expert to elicit a prior ranking over all the items. On the contrary, it may be possible to place prior information only on the top-ranked items. This is often the case in genomics applications, where thousands of genes are considered in the statistical analysis, but only a few of them are known to be related to some disease. Another case which is naturally handled by our prior is when multiple competing rankings are available prior to the analysis, and we are interested in including all of them in the analysis.
A limitation, discussed in Section 4, arises from the intractability of the normalizing constant Z* of (5) when the location parameter is not itself a ranking. Possible directions for future work include exploring tractable approximations for this quantity, perhaps in the spirit of Mukherjee (2016). In general, more efficient methods for posterior simulation might be developed, but such developments fall outside the scope of the present work. We hope, however, that some of the ideas presented here can shed light on the potential and limitations of the Mallows model with Spearman's distance, and encourage further work on constructing more flexible priors.