On Bayesian inference for the Extended Plackett-Luce model

The analysis of rank ordered data has a long history in the statistical literature across a diverse range of applications. In this paper we consider the Extended Plackett-Luce model, which induces a flexible (discrete) distribution over permutations. The parameter space of this distribution is a combination of potentially high-dimensional discrete and continuous components, and this presents challenges for parameter interpretability and also posterior computation. Particular emphasis is placed on the interpretation of the parameters in terms of observable quantities, and we propose a general framework for preserving the mode of the prior predictive distribution. Posterior sampling is achieved using an effective simulation-based approach that does not require imposing restrictions on the parameter space. Working in the Bayesian framework permits a natural representation of the posterior predictive distribution, and we draw on this distribution to address the rank aggregation problem and also to identify potential lack of model fit. The flexibility of the Extended Plackett-Luce model, along with the effectiveness of the proposed sampling scheme, is demonstrated using several simulation studies and real data examples.


Introduction
Rank ordered data arise in many areas of application and a wide range of models have been proposed for their analysis; for an overview see Marden (1995) and Alvo and Yu (2014). In this paper we focus on the Extended Plackett-Luce (EPL) model proposed by Mollica and Tardella (2014); this model is a flexible generalisation of the popular Plackett-Luce model (Luce, 1959; Plackett, 1975) for permutations. In the Plackett-Luce model, entity k ∈ K = {1, . . . , K} is assigned a parameter λ_k > 0, and the probability of observing the ordering x = (x_1, x_2, . . . , x_K) (where x_j denotes the entity ranked in position j) given the entity parameters (referred to as "support parameters" in Mollica and Tardella (2014)) λ = (λ_1, . . . , λ_K) is

Pr(X = x | λ) = ∏_{j=1}^{K} λ_{x_j} / Σ_{m=j}^{K} λ_{x_m}.   (1)

We refer to (1) as the standard Plackett-Luce (SPL) probability. This probability is constructed via the so-called "forward ranking process" (Mollica and Tardella, 2014), that is, it is assumed that a rank ordering is formed by allocating entities from most to least preferred. This is a rather strong assumption, and it is easy to imagine scenarios where an individual ranker might assign entities to positions/ranks in an alternative way. For example, it is quite plausible that rankers may find it easier to identify their most and least preferred entities first, rather than those entities they place in the middle positions of their ranking (Mollica and Tardella, 2018). In such a scenario rankers might form their rank ordering by first assigning their most and then least preferred entities to a rank before filling out the middle positions through a process of elimination using the remaining (unallocated) entities; that is, they use a different ranking process. The Extended Plackett-Luce model relaxes the assumption of a fixed and known ranking process.
It is somewhat natural to recast the underlying ranking process in terms of a "choice order", where the choice order is the order in which rankers assign entities to positions/ranks. For example, suppose a ranker must provide a preference ordering of K entities; a choice order of σ = (1, K, 2, 3, . . . , K − 1) corresponds to the ranking process where the most preferred entity is assigned first, followed by the least preferred entity, and then the remaining entities are assigned in rank order from second down. Note that the choice order σ is simply a permutation of the ranks 1 to K and is referred to as the "reference order" in Mollica and Tardella (2014). Whilst the EPL model is motivated in terms of a choice order as described above, this justification is not always appropriate. For example, the notion of a choice order clearly does not apply in the analysis of the Formula 1 data in Section 6, where the data are simply the finishing orders of the drivers in each race. We prefer to view the EPL model as a flexible probabilistic model for rank ordered data; ultimately all such probabilistic models induce a discrete distribution P_x over the set of all K! permutations S_K, and we wish this distribution to provide a flexible model for the observed data.
We adopt a Bayesian approach to inference, which we find particularly appealing and natural as we focus heavily on predictive inference for observable quantities. We also make three main contributions, as outlined below. When the number of entities is not small, choosing a suitable prior distribution for σ, the permutation of the ranks 1 to K, is a somewhat daunting task. We therefore propose to use the (standard) Plackett-Luce model to define the prior probability of each permutation, although we note that our inference framework is sufficiently general and does not rely on this choice. We also address the thorny issue of specifying informative prior beliefs about the entity parameters λ by proposing a class of priors that preserve the modal prior predictive rank ordering under different choice orders σ. Constructing suitable posterior sampling schemes for the Extended Plackett-Luce model is challenging; the parameters (λ, σ) ∈ R^K_{>0} × S_K reside in a complicated mixed discrete-continuous space that must be explored effectively. This is made more challenging by the multi-modality of the marginal posterior distribution for σ, with local modes separated by large distances within permutation space. We overcome this difficult sampling problem by appealing to Metropolis coupled Markov chain Monte Carlo (MC3). To the best of our knowledge, the algorithm we outline is the only one currently capable of sampling from the posterior distribution under the Extended Plackett-Luce model when the full parameter space is considered. Unfortunately the recent algorithm proposed by Mollica and Tardella (2020) relies on a rescaling of the entity parameters that does not preserve the target distribution; the solution given by Mollica and Tardella (2018) considers a restricted parameter space for σ and suffers from the same issue.
The remainder of the paper is structured as follows. In Section 2 we outline the Extended Plackett-Luce model and our associated notation, and in Section 2.2 we provide some guidance on interpreting the model parameters. In Section 3 we propose our Bayesian approach to inference; in particular we discuss suitable choices for the prior distribution and describe our simulation-based scheme for posterior sampling. A simulation study illustrating the efficacy of the posterior sampling scheme and the performance of the EPL model over a range of values for the number of entities and number of observations is considered in Section 4, with further details given in Section 2 of the supplementary material (Johnson et al., 2021). Section 5 outlines how we use the posterior predictive distribution for inference on observable quantities and for assessing the appropriateness of the model. Two real data analyses are considered in Section 6 to illustrate the use of the (unrestricted) EPL model. Section 7 offers some conclusions.

The Extended Plackett-Luce model
We now present the Extended Plackett-Luce model along with our associated notation and also discuss the interpretation of the model parameters in terms of the preferences of entities.

Model and notation
Recall that there are K entities to be ranked and that the collection of all entities is denoted K = {1, . . . , K}. The Extended Plackett-Luce model we consider is that proposed by Mollica and Tardella (2014) and, in its current form, is only appropriate for complete rank orderings in which all entities are included. Thus a typical observation is x_i = (x_{i1}, . . . , x_{iK}), where x_{ij} denotes the entity ranked in position j in the ith rank ordering. The choice order is represented by σ = (σ_1, . . . , σ_K), where σ_j denotes the rank allocated at the jth stage. Conditional on σ, each entity has a corresponding parameter λ_k > 0 for k = 1, . . . , K; let λ = (λ_1, . . . , λ_K). Crucially, the meaning and interpretation of λ depends on σ, and this is addressed shortly.
The probability of a particular rank ordering under the Extended Plackett-Luce model (Mollica and Tardella, 2014) is defined as

Pr(X_i = x_i | λ, σ) = ∏_{j=1}^{K} λ_{x*_{ij}} / Σ_{m=j}^{K} λ_{x*_{im}}.   (2)

Therefore, the Extended Plackett-Luce probability (2) is simply the standard Plackett-Luce probability (1) evaluated with entity parameters λ and "permuted data" x*_i = x_i • σ, where • denotes composition of permutations which, in terms of vectors, implies that if z = x • y then z_i = x_{y_i}. Here x*_{ij} = x_{iσ_j} denotes the entity chosen at the jth stage of the ith ranking process and therefore receiving rank σ_j.
Indeed, both the (forward ranking) standard Plackett-Luce model and the (backward ranking) reverse Plackett-Luce model are special cases of (2), recovered when σ = (1, . . . , K) ≡ I, the identity permutation, and σ = (K, K − 1, . . . , 1), the reverse of the identity permutation, respectively. We use the notation X_i |λ, σ ∼ EPL(λ, σ) to denote that the probability of rank ordering i is given by (2). Note that here and throughout we have adopted different notation from that in Mollica and Tardella (2014); however, the essential components of the model remain unchanged.
It is clear that the EPL probability (2) is invariant to scalar multiplication of the entity parameters λ, that is, Pr(X|λ, σ) = Pr(X|cλ, σ) for any positive constant c > 0. From a Bayesian perspective this type of identifiability issue is straightforwardly resolved by placing a non-uniform prior on λ. In this case, although the likelihood function may be constant for multiple (scaled) λ values, the posterior distribution is not. That said, this issue can still lead to potential mixing problems for MCMC algorithms (Caron and Doucet, 2012) and thus an inefficient sampling scheme; this is revisited in Section 3.3.

Interpretation of the entity parameters λ
A key aspect of analysing rank ordered data using Plackett-Luce type models is the interpretation of the entity parameters λ. Moreover, it is essential to understand the interpretation of the λ parameters if one is to specify informative prior beliefs about the likely preferences of the entities.
For the Extended Plackett-Luce model, λ k is proportional to the probability that entity k is selected at the first stage of the ranking process and therefore ranked in position σ 1 of the rank ordering x. Then, conditional on an entity being assigned to position σ 1 in the rank ordering, the entity with the largest parameter of those remaining is that most likely to be assigned to position σ 2 , and so on. For the standard Plackett-Luce model, arising from the forward ranking process with σ = I, we have that λ k is proportional to the probability that entity k is assigned rank σ 1 = 1 (and is thus the most preferred entity), and so on. Therefore, for the standard Plackett-Luce model, entities with larger values are more likely to be given a higher rank. In other words, the λ parameters for the standard Plackett-Luce model correspond directly with preferences for entities. A consequence is that ordering the entities in terms of their values in λ, from largest to smallest, will give the modal orderingx, that is, the permutation of the entities which yields the maximum Plackett-Luce probability (1), given λ. Specifically, x = order ↓ (λ), where order ↓ (·) denotes the ordering operation from largest to smallest. This makes specifying a prior distribution for λ, when σ = I, relatively straightforward based on entity preferences. The interpretation of the λ parameters directly in terms of preferences can also be achieved in a straightforward manner with the backward ranking process (σ = (K, . . . , 1) ) of the reverse Plackett-Luce model. Apart from these special cases, however, the interpretation of the λ parameters in terms of preferences is not at all transparent for other choices of σ. For example, suppose that λ i > λ j and σ = (2, 3, 1) . Here entity i is more likely to be ranked in second position than entity j. Further, if another entity = i, j, is assigned to rank 2 then entity i is preferred for rank 3 (σ 2 ) over entity j.
Understanding the preferences of the entities under the Extended Plackett-Luce model based on values of λ and σ is made more straightforward if we first introduce the inverse of the choice order permutation, σ^{-1}. This is defined such that σ • σ^{-1} = σ^{-1} • σ = I, the identity permutation. Here σ^{-1}_j denotes the stage of the ranking process at which rank j is assigned. We can then obtain directly the modal ordering of the entities under the EPL(λ, σ) model, x̂^{(σ,λ)}, and thus obtain a representation of the preferences of the entities. Here x̂^{(σ,λ)} is obtained without enumerating any probabilities, by permuting the entries in x̂ (the modal ordering under the standard Plackett-Luce model conditional on λ) by σ^{-1}, that is, x̂^{(σ,λ)} = x̂ • σ^{-1}. In other words, if x̂ = order↓(λ) then x̂^{(σ,λ)}_j = x̂_{σ^{-1}_j} for j = 1, . . . , K. Let x̂^{-1} represent the ranks assigned to the entities under the standard Plackett-Luce model; this is the inverse permutation corresponding to x̂, that is, the permutation such that x̂^{-1} • x̂ = I. Now define η^{(σ)} = x̂^{(σ,λ)} • x̂^{-1}; this represents the permutation of the entities ranked under the EPL model at the stage corresponding to the rank assigned to entities 1 to K under the standard Plackett-Luce model. It follows that, if λ^{(σ)} has jth element λ_{η^{(σ)}_j}, where η^{(σ)}_j is the jth element of η^{(σ)}, then x̂^{(σ,λ^{(σ)})} ≡ x̂ for all σ ∈ S_K, and the modal preference ordering is preserved.
Some simplification is possible if we first label the entities in order of preference. Clearly, with x̂ = I we have x̂^{(σ,λ)} = σ^{-1} and so the modal ordering is given by the inverse choice order permutation. Moreover, if x̂ = I then x̂^{-1} = I and so η^{(σ)} = σ^{-1}. It follows that choosing λ^{(σ)} such that its jth element is λ_{σ^{-1}_j} gives x̂^{(σ,λ^{(σ)})} ≡ I for all σ ∈ S_K. Therefore, if the entities are labelled in preference order, then permuting the λ parameters from the standard Plackett-Luce model by the inverse of the choice order permutation preserves the modal permutation in that same preference order. This suggests a simple strategy for specifying prior distributions for the entity parameters which preserves modal preferences under different choice orders; we revisit this in Section 3.1.
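The identity x̂^{(σ,λ)} = x̂ • σ^{-1} can be checked against a direct maximisation of (2) for small K; the sketch below does this by brute force (all function names, and the particular λ and σ, are illustrative):

```python
from itertools import permutations

def spl_prob(x, lam):
    # standard Plackett-Luce probability (1)
    prob, remaining = 1.0, list(x)
    for entity in x:
        prob *= lam[entity - 1] / sum(lam[k - 1] for k in remaining)
        remaining.remove(entity)
    return prob

def epl_prob(x, lam, sigma):
    # EPL probability (2): SPL probability of x* with x*_j = x_{sigma_j}
    return spl_prob([x[s - 1] for s in sigma], lam)

def inverse(p):
    # inv such that p . inv = inv . p = identity (1-indexed values)
    inv = [0] * len(p)
    for pos, val in enumerate(p, start=1):
        inv[val - 1] = pos
    return inv

def compose(x, y):
    # z = x . y with z_i = x_{y_i}
    return [x[v - 1] for v in y]

lam = [5.0, 3.0, 2.0, 1.0]              # entity 1 most preferred, etc.
x_hat = [1, 2, 3, 4]                    # order_desc(lam)
sigma = [2, 3, 1, 4]                    # an illustrative choice order
modal = compose(x_hat, inverse(sigma))  # x_hat . sigma^{-1}

# brute-force check: maximise (2) directly over all K! orderings
best = max(permutations([1, 2, 3, 4]),
           key=lambda p: epl_prob(list(p), lam, sigma))
assert list(best) == modal == [3, 1, 2, 4]
```

For distinct λ values the maximiser of (2) is unique, so the brute-force mode and the closed-form composition agree exactly.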

Bayesian modelling
Suppose we have data consisting of n independent rank orderings, collectively denoted D = (x_1, . . . , x_n). We wish to make inferences about the unknown quantities in the model, σ and λ, as well as future observable rank orderings. Specifically, we adopt a Bayesian approach to inference in which we quantify our uncertainty about the unknown quantities (before observing the data) through a suitable prior distribution.

Prior for σ
For the choice order σ we need to define a discrete distribution P_σ over the K! elements of S_K. If K is not small, perhaps larger than 4, then this could be a rather daunting task. Given that the choice order parameter σ is a permutation, or equivalently a complete rank ordering, one flexible option is to use the Plackett-Luce model to define the prior probability of each choice order. More specifically we let σ|q ∼ PL(q), where q = (q_1, . . . , q_K) ∈ R^K_{>0} are to be chosen a priori and

Pr(σ | q) = ∏_{j=1}^{K} q_{σ_j} / Σ_{m=j}^{K} q_{σ_m}.

If desired, it is straightforward to assume each choice order is equally likely a priori by letting q_k = q for k = 1, . . . , K. Furthermore, the inference framework that follows is sufficiently general and does not rely on this prior choice. In particular, if we only wish to consider a subset of all possible choice orderings R ⊂ S_K, as in Mollica and Tardella (2018), then this can be achieved by making an appropriate choice of prior probabilities for all σ ∈ R and letting Pr(σ) = 0 for all σ ∈ S_K \ R.
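Simulating from this prior is straightforward because the Plackett-Luce model corresponds to sampling the ranks without replacement with probabilities proportional to their q values; a minimal sketch (the function name is ours):

```python
import random

def sample_pl(q, rng=None):
    """Draw a permutation sigma ~ PL(q): at each stage choose one of
    the remaining ranks with probability proportional to its q value."""
    rng = rng or random.Random()
    remaining = list(range(1, len(q) + 1))
    sigma = []
    while remaining:
        k = rng.choices(remaining, weights=[q[i - 1] for i in remaining])[0]
        sigma.append(k)
        remaining.remove(k)
    return sigma
```

With q_k = q for all k, every choice order is equally likely a priori, as noted above.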

Prior for λ|σ
It is natural to wish to specify prior beliefs in terms of preferences for the entities. However, we have seen in Section 2.2 that the interpretation of the entity parameters λ in terms of preferences is dependent on the value of σ. It follows that specifying an informative prior for the entity parameters is problematic unless the choice order σ is assumed to be known. We therefore consider separate prior distributions for λ conditional on the value of σ. Since the entity parameters λ_k > 0 must be strictly positive, a suitable, relatively tractable, choice of conditional prior distribution is a gamma distribution, λ_k|σ ∼ Ga(a^{(σ)}_k, b^{(σ)}_k), for k = 1, . . . , K and σ ∈ S_K. Without loss of generality we set b^{(σ)}_k = b = 1 for all k and σ, since b is not likelihood identifiable. Our proposed strategy for specifying the hyperparameters a^{(σ)} is to first consider the prior distribution for λ under the standard Plackett-Luce model with σ = I. If we specify a = (a_1, . . . , a_K) then x̂, the modal preference ordering from the prior predictive distribution, is obtained by simply ordering the a_k values from largest to smallest, and so x̂ = order↓(a). Then, following the arguments of Section 2.2, we can preserve our beliefs about the modal preference ordering for all values of σ by constructing η^{(σ)} = x̂ • σ^{-1} • x̂^{-1} and setting a^{(σ)}_k = a_{η^{(σ)}_k} for k = 1, . . . , K and σ ∈ S_K. Again some simplification of notation is achievable if we first relabel the entities in terms of our a priori preferences. Specifically, if we label the entities such that our most preferred entity is labelled 1, second most preferred labelled 2, and so on, then a_1 > a_2 > · · · > a_K and x̂ = I. In this case η^{(σ)} = σ^{-1} and so a^{(σ)}_k = a_{σ^{-1}_k} for k = 1, . . . , K. Section 8 of the supplementary material contains two simple numerical examples to help illustrate the issues with prior specification and our proposed mode-preserving solution.
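The construction of a^{(σ)} is a pair of permutation compositions, and the preservation of the modal ordering can be verified with the closed form x̂^{(σ,λ)} = order↓(λ) • σ^{-1} from Section 2.2; a sketch with illustrative values (function names are ours):

```python
def inverse(p):
    # inv such that p . inv = inv . p = identity (1-indexed values)
    inv = [0] * len(p)
    for pos, val in enumerate(p, start=1):
        inv[val - 1] = pos
    return inv

def compose(x, y):
    # z = x . y with z_i = x_{y_i}
    return [x[v - 1] for v in y]

def order_desc(a):
    # entity labels sorted by decreasing parameter value
    return sorted(range(1, len(a) + 1), key=lambda k: -a[k - 1])

def mode_preserving_a(a, sigma):
    """a^(sigma) with kth element a_{eta_k}, where
    eta = x_hat . sigma^{-1} . x_hat^{-1} and x_hat = order_desc(a)."""
    x_hat = order_desc(a)
    eta = compose(compose(x_hat, inverse(sigma)), inverse(x_hat))
    return [a[eta[k] - 1] for k in range(len(a))]

a = [5.0, 1.0, 3.0, 2.0]        # a priori preference ordering (1, 3, 4, 2)
sigma = [2, 3, 1, 4]
a_sigma = mode_preserving_a(a, sigma)

# modal EPL ordering via Section 2.2: order_desc(a_sigma) . sigma^{-1}
modal = compose(order_desc(a_sigma), inverse(sigma))
assert modal == order_desc(a)   # modal preference ordering is preserved
```

When the entities are labelled in preference order (x̂ = I) the construction collapses to a^{(σ)}_k = a_{σ^{-1}_k}, as noted above.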
Whilst the framework above gives us a relatively straightforward way to encode information when we have a clear prior preference over the entities, additional complications arise in settings where preferences for some (or all) of the entities are exchangeable a priori. More specifically, if we wish to specify no (prior) preference between two distinct entities j ≠ k then, for the standard Plackett-Luce model, this can be achieved by setting a_j = a_k. Such a choice gives rise to a multi-modal prior predictive distribution, and so x̂ = order↓(a), and thus η^{(σ)} = x̂ • σ^{-1} • x̂^{-1}, is not uniquely defined. Thankfully this issue can be straightforwardly resolved as part of our simulation-based inference approach. In particular, if we let X denote the set of modes of the prior predictive distribution (under the SPL model) then we can simply sample x̂ (uniformly at random) from X at each iteration of our MCMC scheme. A formal discussion is provided in Section 4 of the supplementary material. Note that if we are unwilling to favour any particular preference ordering a priori then this can be achieved by letting a_k = a (for all k), as this choice induces a uniform prior predictive distribution over all preference orders. Formally X = S_K; however, given that a^{(σ)} is formed by permuting the elements of a, it follows that a^{(σ)} = a for all x̂ and σ ∈ S_K, and so sampling x̂ from X is redundant in this special case.

Bayesian model
The complete Bayesian model is

X_i | λ, σ ∼ EPL(λ, σ), independently for i = 1, . . . , n,
λ_k | σ ∼ Ga(a^{(σ)}_k, 1), independently for k = 1, . . . , K,
σ | q ∼ PL(q).

That is, we assume that our observations follow the distribution specified by the Extended Plackett-Luce model (2) and the prior distribution for (λ, σ) is as described in Section 3.1.
We quantify our beliefs about σ and λ through their joint posterior density π(σ, λ|D) ∝ π(D|λ, σ) π(λ|σ) π(σ), obtained via Bayes' Theorem. The posterior density π(σ, λ|D) is not available in closed form and so we use simulation-based methods to sample from the posterior distribution, as described in the next section.

Posterior sampling
Due to the complex nature of the posterior distribution we use Markov chain Monte Carlo (MCMC) methods to sample realisations from π(σ, λ|D). The structure of the model lends itself naturally to sampling alternately from two blocks of full conditional distributions: π(σ|λ, D) and π(λ|σ, D).

Sampling the choice order parameter σ from π(σ|λ, D)
Given that the choice order parameter σ is a member of S_K, it is fairly straightforward to obtain its (discrete) full conditional distribution; specifically this is the discrete distribution with probabilities

Pr(σ = σ_j | λ, D) = π(D|λ, σ_j) Pr(σ = σ_j) / Σ_{i=1}^{K!} π(D|λ, σ_i) Pr(σ = σ_i)

for j = 1, . . . , K!. Clearly sampling from this full conditional will require K! evaluations of the EPL likelihood π(D|λ, σ = σ_j), and so sampling from Pr(σ = σ_j |λ, D) for j = 1, . . . , K! (a Gibbs update) is probably only plausible if K is sufficiently small; perhaps not much greater than 5. Of course, the unnormalised probabilities can be evaluated independently for each σ_j and so could be computed in parallel, which may facilitate this approach for slightly larger values of K.
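For small K the Gibbs update above amounts to enumerating S_K, normalising, and drawing one permutation; a self-contained sketch (function names are ours, and the uniform prior default is illustrative):

```python
import random
from itertools import permutations

def spl_prob(x, lam):
    # standard Plackett-Luce probability (1)
    prob, remaining = 1.0, list(x)
    for entity in x:
        prob *= lam[entity - 1] / sum(lam[k - 1] for k in remaining)
        remaining.remove(entity)
    return prob

def epl_prob(x, lam, sigma):
    # EPL probability (2)
    return spl_prob([x[s - 1] for s in sigma], lam)

def gibbs_sigma(data, lam, prior=lambda s: 1.0, rng=None):
    """Draw sigma from its full conditional by enumerating all K!
    permutations; only feasible for small K."""
    rng = rng or random.Random()
    K = len(lam)
    sigmas = [list(p) for p in permutations(range(1, K + 1))]
    weights = []
    for s in sigmas:
        like = 1.0
        for x in data:
            like *= epl_prob(x, lam, s)
        weights.append(like * prior(s))
    total = sum(weights)
    probs = [w / total for w in weights]
    idx = rng.choices(range(len(sigmas)), weights=probs)[0]
    return sigmas[idx], probs
```

The per-permutation weights in the loop are exactly the independent evaluations that could be distributed across processors for slightly larger K.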
So as to free ourselves from the restriction to the case where K is small, we instead consider a more general sampling strategy by constructing a Metropolis-Hastings proposal mechanism for updating σ. Our investigation into the likelihood of the Extended Plackett-Luce model given different choice orders, in Section 3 of the supplementary material, revealed that π(D|λ, σ) is likely to be multi-modal. Further, local modes can be separated by large distances within permutation space. In an attempt to explore this large discrete space effectively we consider 5 alternative proposal mechanisms, where mechanism ℓ is used with probability p_ℓ for ℓ = 1, . . . , 5. The mechanisms used to construct the proposed permutation σ† are as follows.
1. The random swap: sample two positions φ_1, φ_2 ∈ {1, . . . , K} uniformly at random and let the proposed choice order σ† be the current choice order σ with the elements in positions φ_1 and φ_2 swapped.
2. The Poisson swap: sample φ_1 ∈ {1, . . . , K} uniformly at random and let φ_2 = φ_1 + ε where ε ∼ Po(t). Note that t is a tuning parameter, and φ_2 is wrapped back onto {1, . . . , K} via φ_2 → {(φ_2 − 1) mod K} + 1 as appropriate. Again the proposed choice order σ† is formed by swapping the elements in positions φ_1 and φ_2 of the current choice order σ.
3. The random insertion (Bezáková et al., 2006): sample two positions φ_1 ≠ φ_2 ∈ {1, . . . , K} uniformly at random and form the proposed choice order σ† by taking the value in position φ_1 and inserting it back into the permutation so that it instead occupies position φ_2.
4. The prior proposal: here σ † is simply an independent draw from the prior distribution, that is, σ † |q ∼ PL(q).
Note that performing any of the swap or insertion moves (1-3) above may result in slow exploration of the set of all permutations, as the proposal σ† may not differ much from the current value σ. To alleviate this potential issue we propose to perform each of these moves iteratively S times, where S is to be chosen (and fixed) by the analyst. More formally (when using proposal mechanisms 1-3) we construct a sequence of intermediate proposals σ†_s from the "current" choice order σ†_{s-1}. In particular we let σ†_0 = σ and generate σ†_s |σ†_{s-1} for s = 1, . . . , S; the proposed value for which we evaluate the acceptance probability is σ† = σ†_S. Further, for moves 1 and 2 it may seem inefficient to allow for the "null swap" φ_1 = φ_2; however, this is done to avoid only proposing permutations with the same (or opposing) parity as the current value. Put another way, as S → ∞ we would expect Pr(σ†|σ) > 0 for all σ†, σ ∈ S_K, and this only holds if we allow for the possibility that φ_1 = φ_2. Finally we note that each of these proposal mechanisms is "simple" in the sense that Pr(σ†|σ) = Pr(σ|σ†), and so the proposal ratio cancels in each case. The full acceptance ratio is presented within the algorithm outline in Section 3.4.
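The elementary moves and their S-fold iteration can be sketched as follows; the Poisson swap is written in the assumed form φ_2 = φ_1 + ε with ε ∼ Po(t), and all names are ours:

```python
import math
import random

def _poisson(t, rng):
    # simple Poisson(t) sampler (Knuth's method); adequate for small t
    L, k, p = math.exp(-t), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def random_swap(sigma, rng):
    s = list(sigma)
    i, j = rng.randrange(len(s)), rng.randrange(len(s))  # null swap allowed
    s[i], s[j] = s[j], s[i]
    return s

def poisson_swap(sigma, t, rng):
    # assumed form: phi2 = phi1 + eps, eps ~ Po(t), wrapped onto 1..K
    K = len(sigma)
    i = rng.randrange(K)
    j = (i + _poisson(t, rng)) % K
    s = list(sigma)
    s[i], s[j] = s[j], s[i]
    return s

def random_insertion(sigma, rng):
    s = list(sigma)
    i, j = rng.sample(range(len(s)), 2)  # two distinct positions
    s.insert(j, s.pop(i))
    return s

def iterate_move(move, sigma, S, rng):
    # chain S elementary moves to build the final proposal sigma_dagger
    for _ in range(S):
        sigma = move(sigma, rng)
    return sigma
```

Each elementary move returns a permutation of 1..K, so iterating any of them S times still yields a valid proposal in S_K.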

Sampling the entity parameters λ from π(λ|σ, D)
Bayesian inference for variants of standard Plackett-Luce models typically proceeds by first introducing appropriate versions of the latent variables proposed by Caron and Doucet (2012), which in turn facilitate a Gibbs update for each of the entity parameters (assuming independent gamma prior distributions are chosen). However we found that, when coupled with a Metropolis-Hastings update on σ, this strategy does not work well for parameter inference under the Extended Plackett-Luce model (not reported here); this observation was also noted by Mollica and Tardella (2020). We therefore propose to use a Metropolis-Hastings step for sampling the entity parameters; specifically we use (independent) log-normal random walks for each entity parameter in turn, that is, we propose λ†_k = λ_k exp(ε_k) where ε_k ∼ N(0, ζ²) and ζ is a tuning parameter. We also implement an additional sampling step in the MCMC scheme, analogous to that in Caron and Doucet (2012), in order to mitigate the poor mixing that is caused by the invariance of the Extended Plackett-Luce likelihood to scalar multiplication of the λ parameters. Note that this step does not equate to an arbitrary rescaling of the λ parameters but rather corresponds to a draw of Λ = Σ_k λ_k from its full conditional distribution. Full details are given in Section 3.4.
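A minimal sketch of the two updates follows. The acceptance ratio combines the likelihood ratio, the Ga(a_k, 1) prior ratio, and the log-normal proposal correction λ†_k/λ_k; the rescaling step draws Λ from Ga(Σ_k a_k, 1), which we take to be its full conditional under independent Ga(a_k, 1) priors since the likelihood depends only on λ/Σ_k λ_k (an assumption consistent with the Caron and Doucet (2012) construction; function names are ours):

```python
import math
import random

def update_lambda(lam, sigma, data, log_like, a, zeta=0.1, rng=None):
    """One sweep of log-normal random-walk MH updates on each entity
    parameter under Ga(a_k, 1) priors; log_like(data, lam, sigma)
    returns the EPL log-likelihood."""
    rng = rng or random.Random()
    lam = list(lam)
    for k in range(len(lam)):
        prop = list(lam)
        prop[k] = lam[k] * math.exp(rng.gauss(0.0, zeta))
        log_ratio = math.log(prop[k] / lam[k])
        # prior term (a_k - 1) * log_ratio plus proposal correction
        # log(prop/current) combine to a_k * log_ratio
        log_A = (log_like(data, prop, sigma) - log_like(data, lam, sigma)
                 + a[k] * log_ratio - (prop[k] - lam[k]))
        if rng.random() < math.exp(min(0.0, log_A)):
            lam = prop
    return lam

def rescale(lam, a, rng=None):
    """Redraw the scale Lambda = sum_k lam_k from Ga(sum_k a_k, 1),
    leaving the direction lam / sum(lam) unchanged."""
    rng = rng or random.Random()
    total = sum(lam)
    new_total = rng.gammavariate(sum(a), 1.0)
    return [l * new_total / total for l in lam]
```

Because the likelihood is invariant to the scale, the rescaling move is always accepted, yet it breaks the slow random-walk behaviour of the overall scale of λ.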

Metropolis coupled Markov chain Monte Carlo
Unfortunately the sampling strategy described in the previous two subsections proves ineffective, with the Markov chain suffering from poor mixing; in particular the chain for σ is prone to becoming stuck in local modes (results not reported here). In an attempt to resolve these issues, and therefore aid the exploration of the posterior distribution, we appeal to Metropolis coupled Markov chain Monte Carlo, or parallel tempering.
Metropolis coupled Markov chain Monte Carlo (Geyer, 1991) is a sampling technique that aims to improve the mixing of Markov chains in comparison to standard MCMC methods, particularly when the target distribution is multi-modal (Gilks and Roberts, 1996; Brooks, 1998). The basic premise is to consider C chains evolving simultaneously, each of which targets a tempered posterior distribution π̃_c(θ_c|D) ∝ π(D|θ_c)^{1/T_c} π(θ_c), where, for each chain c, T_c ≥ 1 denotes the temperature and θ_c = {σ_c, λ_c} is the collection of unknown quantities. Note that the posterior of interest is recovered when T_c = 1. Further note that we temper only the likelihood component, as we suggest that any prior beliefs should remain consistent irrespective of the temperature. Now, as the tempered posteriors π̃_c(θ_c|D) are conditionally independent given D, we can consider the C chains to be targeting the joint posterior

π̃(θ_1, . . . , θ_C | D) = ∏_{c=1}^{C} π̃_c(θ_c | D).   (4)

Suppose now we propose to swap θ_i and θ_j for some i ≠ j within a Markov chain targeting the joint posterior (4). If we let Θ = {θ_1, . . . , θ_C} denote the current state and Θ† = {θ†_1, . . . , θ†_C} the proposed state, where θ†_i = θ_j, θ†_j = θ_i and θ†_ℓ = θ_ℓ for ℓ ≠ i, j, then, assuming a symmetric proposal mechanism, the acceptance probability of the state space swap is min(1, A) where

A = [π(D|θ_j)^{1/T_i} π(D|θ_i)^{1/T_j}] / [π(D|θ_i)^{1/T_i} π(D|θ_j)^{1/T_j}].

Of course, if the proposal mechanism is not symmetric then the probability A must be multiplied by the proposal ratio q(Θ|Θ†)/q(Θ†|Θ). Further, it is straightforward to generalise the above acceptance probability to allow the states of more than 2 chains to be swapped. However, this is typically avoided as such a proposal can result in poor acceptance rates. Our specific Metropolis coupled Markov chain Monte Carlo algorithm is outlined in the next section.
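In log form the swap acceptance is particularly simple: because only the likelihood is tempered, the prior terms cancel and log A = (1/T_i − 1/T_j)(ℓ_j − ℓ_i), where ℓ_c denotes the log-likelihood of chain c's current state. A one-line sketch (function name ours):

```python
def log_swap_accept(loglike_i, loglike_j, T_i, T_j):
    # log A for swapping the states of chains i and j; the (untempered)
    # prior terms cancel because only the likelihood is tempered
    return (1.0 / T_i - 1.0 / T_j) * (loglike_j - loglike_i)
```

The swap is accepted when log u < log A for u ∼ U(0, 1); note that when chain i is the colder chain, A > 1 whenever the hotter chain currently holds the higher-likelihood state.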

Outline of the posterior sampling algorithm
A parallel Metropolis coupled Markov chain Monte Carlo algorithm to sample from the joint posterior distribution of the entity parameters λ and the choice order parameter σ is as follows.

1. Tune:
• choose the number of chains (C); let T_1 = 1 and choose T_c > 1 for c = 2, . . . , C;
• choose appropriate values for the tuning parameters of the MH proposals outlined in Section 3.3.
2. Initialise: take a prior draw or alternatively choose σ_c ∈ S_K and λ_c ∈ R^K_{>0} for c = 1, . . . , C.
3. For c = 1, . . . , C perform (in parallel) the following steps:
• update the choice order σ_c, either by a draw from its full conditional distribution with probabilities Pr(σ_c = σ_j |λ_c, D) (feasible for small K) or via one of the Metropolis-Hastings proposals described above, accepting σ† with probability min{1, [π(D|λ_c, σ†)^{1/T_c} π(σ†)] / [π(D|λ_c, σ_c)^{1/T_c} π(σ_c)]};
• update each entity parameter λ_{ck} via a log-normal random walk Metropolis-Hastings step;
• redraw the scale Λ_c = Σ_k λ_{ck} from its full conditional distribution.
4. Propose to swap the states θ_c and θ_{c+1} of a pair of adjacent chains and accept the swap with probability min(1, A).
5. Return to Step 3.

Tuning the MC 3 algorithm
The Metropolis coupled Markov chain Monte Carlo scheme targets the joint density (4) by simultaneously evolving C chains, each of which targets an alternative (tempered) density π̃_c(θ_c|D). Given the data D, these chains are conditionally independent and should therefore be individually tuned to target their respective density in the typical fashion. Of course, it may not be possible to obtain near optimal acceptance rates within the posterior chain (and other chains with temperatures close to 1); however, the analyst should aim to ensure reasonable acceptance rates, even if this results in small moves around the parameter space. Tuning the between-chain proposal (Step 4 of the MC3 algorithm) can be tricky in general. The strategy we suggest, also advocated by Wilkinson (2013), is to choose the temperatures so that they exhibit geometric spacing, that is, T_{c+1}/T_c = r for some r > 1; this eliminates the burden of specifying C − 1 temperatures and instead only requires a choice of r. We also suggest only considering swaps between adjacent chains as, intuitively, the target densities are most similar when |T_c − T_{c+1}| is small. It is generally accepted that between-chain acceptance rates of around 20% to 60% provide reasonable mixing (with respect to the joint density of θ_1, . . . , θ_C); see, for example, Geyer and Thompson (1995) and Altekar et al. (2004). A suitable choice of the temperature ratio r can be guided via pilot runs of the MC3 scheme, and individual temperatures can also be adjusted as appropriate.
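The geometric spacing above reduces the temperature schedule to a single tuning constant; a trivial sketch (function name ours):

```python
def geometric_ladder(C, r):
    """Temperatures T_1, ..., T_C with T_1 = 1 and T_{c+1}/T_c = r."""
    return [r ** (c - 1) for c in range(1, C + 1)]
```

Pilot runs then amount to adjusting r until adjacent-chain swap rates fall in the 20% to 60% range suggested above.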

Monitoring convergence
The target (4) of our algorithm is a complicated mixed discrete-continuous distribution, which means that traditional convergence checks applied to each stochastic quantity are of little use. Indeed the unique samples of the discrete parameter (σ) could be arbitrarily relabelled, which may give a false impression of good mixing and convergence. Further, the natural dependence between the parameters (λ, σ) means that inspecting trace plots of the continuous (λ) components is not sensible, as these parameters may make fairly large jumps around the parameter space when the choice order parameter (σ) changes. These issues are akin to those that arise when fitting Bayesian mixture models, and so, following the approach used in that setting, we choose to perform convergence checks with respect to the joint distribution. More specifically we apply the convergence diagnostic of Geweke (1992) to the (log) target distribution (4), as advocated by Cowles and Carlin (1996) and Brooks and Roberts (1998). Additionally we also apply this procedure to samples of the (log) posterior distribution of interest, that is, the marginal distribution π(λ, σ|D) that corresponds to the chain with T_c = 1, and also to the (log) observed data likelihood π(D|λ, σ) evaluated at the posterior draws. Geweke-Brooks plots are also used to determine whether there is any evidence to suggest that a longer burn-in period may be required; these diagnostics are readily available within the R package coda (Plummer et al., 2006). To alleviate potential concerns about the sampling of the discrete choice order parameter (σ) we also check that the marginal posterior distribution π(σ|D) is consistent over multiple (randomly initialised) runs of our algorithm.
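The idea behind the Geweke diagnostic on a scalar summary (such as the log target) is to compare the means of an early and a late segment of the chain; a crude sketch, which uses naive iid standard errors rather than the spectral-density correction implemented in coda (function name ours):

```python
import math

def geweke_z(chain, first=0.1, last=0.5):
    """Crude Geweke-style z-score on a scalar chain (e.g. the log
    target): compare means of the first 10% and last 50% of draws.
    coda's geweke.diag additionally corrects the standard errors
    for autocorrelation via spectral density estimates."""
    n = len(chain)
    a = chain[: int(first * n)]
    b = chain[int((1 - last) * n):]

    def mean_var(x):
        m = sum(x) / len(x)
        v = sum((xi - m) ** 2 for xi in x) / (len(x) - 1)
        return m, v

    ma, va = mean_var(a)
    mb, vb = mean_var(b)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))
```

Values of |z| well beyond 2 suggest the early segment has not yet reached the same distribution as the late segment, i.e. a longer burn-in may be required.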

Simulation study
To investigate the performance of the posterior sampling algorithm outlined in Section 3.4 we apply it to several synthetic datasets for which the values of the parameters used to simulate the data are known. We consider K ∈ {5, 10, 15, 20} entities and generate 1000 rank orderings for each choice of K. Further, we subset each of these datasets by taking the first n ∈ {20, 50, 200, 500, 1000} orderings, thus giving rise to 20 (nested) datasets. The parameter values (λ*, σ*) from which these data are generated are drawn from the prior distribution outlined in Section 3.1 with a_k = q_k = 1 for k = 1, . . . , K. That is, all choice orders and entity preferences (specified by the (λ*, σ*) pair) are equally likely. The values of the parameters for each choice of K are given in Section 2 of the supplementary material. For each dataset, posterior samples were obtained via the algorithm outlined in Section 3.4. We choose to use C = 5 chains in each case, with both the temperatures and tuning parameters chosen appropriately. Convergence is assessed as outlined in Section 3.4, and the raw posterior draws are also thinned to obtain (approximately) 10,000 uncorrelated draws from the posterior distribution.
The aim of the simulation study is to investigate whether our algorithm is capable of generating posterior samples for different (n, K) pairs. To this end we would like to see that the values (λ⋆, σ⋆) used to simulate the data look plausible under the respective posterior distributions. This can be judged by inspecting the marginal posterior distributions π(σ|D) and π(λ_k|D). However, with space in mind, here we provide some key summaries and refer the interested reader to Section 2 of the supplementary material for a more detailed discussion. Table 1 shows the posterior probability Pr(σ⋆|D) of the choice order parameter used to generate each respective dataset. Perhaps unsurprisingly, we see that for each K ∈ {5, 10, 15, 20} the posterior support for the choice order parameter used to generate the data increases with the number of observations (rank orderings) considered. Interestingly, we observe reasonable posterior support for σ⋆ when considering only n = 20 preference orders of K = 10 entities. However, for some of the analyses, those where n is relatively small in comparison to K, the choice order σ⋆ is not observed in any of the 10K posterior draws. Further inspection reveals that the posterior draws of σ are reasonably consistent with the σ⋆ used to generate the respective datasets; this can be seen by considering the marginal posterior distribution for each stage in the ranking process, that is, Pr(σ_j = k|D) for j, k ∈ {1, . . . , K}. Figure 1 shows heat maps of Pr(σ_j = k|D) for those analyses where σ⋆ was not observed; the crosses highlight Pr(σ_j = σ⋆_j|D) in each case.

Table 1: Posterior probability Pr(σ⋆|D) of the choice order used to generate each dataset along with the range of the tail area probabilities p_k = Pr(λ_k > λ⋆_k|σ = σ⋆, D). * indicates that σ⋆ is also the (posterior) modal observed choice order.
These figures reveal that, even with limited information, we are able to learn the final few entries in σ fairly well, and that much of the uncertainty resides within the earlier stages of the ranking process. Section 2.2 of the supplementary material presents the probabilities Pr(σ_j = k|D) from Figure 1 in tabular form, along with the corresponding heat maps for the remaining analyses.
For the Extended Plackett-Luce model we are trying to quantify our uncertainty not only about the choice order parameter but also about the entity parameters. As discussed in Section 2, the entity parameter values λ only have a meaningful interpretation for a given choice order parameter σ. Section 2 of the supplementary material contains boxplots of the marginal posterior distributions π(λ_k|σ = σ⋆, D), that is, the marginal posterior distribution of λ_k where we only consider posterior draws with σ = σ⋆. To summarise the results we compute the tail area probability p_k = Pr(λ_k > λ⋆_k|σ = σ⋆, D) of λ⋆_k under its respective marginal distribution for k = 1, . . . , K. The range of these values, that is, (min_k p_k, max_k p_k), for each analysis is given in Table 1 and these indicate that there is reasonable posterior support for λ⋆, even when n is small relative to K. In other words, the parameters λ⋆ look plausible under the posterior distribution. Of course, we cannot compute these quantities for those analyses where σ⋆ is not observed. However, although this prevents direct inference for λ, it does not prohibit inference on observable quantities (rank orderings), as this is achieved via the posterior predictive distribution; this is the topic of the next section.
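The tail area probabilities p_k can be estimated directly from the stored posterior output. A minimal sketch (ours, in Python), assuming the draws are held as arrays with one row per retained iteration:

```python
import numpy as np

def tail_area_probs(lam_draws, sigma_draws, lam_star, sigma_star):
    """Estimate p_k = Pr(lambda_k > lambda*_k | sigma = sigma*, D) from
    posterior draws: keep only the iterations whose sampled choice order
    equals sigma*, then compute the proportion of the retained lambda_k
    draws that exceed the data-generating value lambda*_k."""
    keep = np.all(sigma_draws == sigma_star, axis=1)
    lam_kept = lam_draws[keep]
    return (lam_kept > lam_star).mean(axis=0)
```

Values of p_k close to 0 or 1 would indicate that λ⋆_k sits in the tail of its marginal posterior distribution.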

Inference and model assessment via the posterior predictive distribution
In this section we consider methods for performing inference for the entities by appealing to the posterior predictive distribution which will also provide us with a mechanism for detecting lack of model fit. Obtaining the posterior predictive distribution, and, more generally, predictive quantities of interest, is computationally burdensome when the number of entities is not small and so we also outline Monte Carlo based approximations that can be used to facilitate this approach for larger values of K.

Inference for entity preferences
The Extended Plackett-Luce model we consider is only defined for complete rankings and so the posterior predictive distribution is a discrete distribution defined over all possible observations x̃ ∈ S_K. These probabilities can be approximated by taking the sample mean of the EPL probability (2) over draws from the posterior distribution for λ and σ, that is,

Pr(X = x̃|D) ≃ (1/N) Σ_{i=1}^{N} Pr(X = x̃|λ^{(i)}, σ^{(i)}),

where the (λ^{(i)}, σ^{(i)}), i = 1, . . . , N, are sampled from the distribution with density π(λ, σ|D). We can then use this posterior predictive distribution, for example, to obtain the marginal posterior predictive probability that entity k is ranked in position j, that is, Pr(x_j = k|D) for j, k ∈ {1, . . . , K}. The posterior modal ordering x̃ is also straightforward to obtain and is simply the ordering with the largest posterior predictive probability. However, when the number of entities is larger than, say, 9, this procedure involves enumerating the predictive probabilities for more than O(10^6) possible observations. Clearly this becomes computationally infeasible as the number of entities increases, particularly as computing each posterior predictive probability also involves taking an expectation over many thousands of posterior draws. When the number of entities renders full enumeration infeasible we suggest approximating the posterior predictive distribution via a Monte Carlo approach as in Johnson et al. (2020). In particular we obtain a collection P = {x̃^{(m,ℓ)} : m = 1, . . . , M, ℓ = 1, . . . , L} of draws from the posterior predictive distribution by sampling L rank orderings at each of the M iterations of the posterior sampling scheme. We can then approximate Pr(x_j = k|D) by the empirical probability computed from the collection of rankings P, that is,

Pr(x_j = k|D) ≃ (1/ML) Σ_{m=1}^{M} Σ_{ℓ=1}^{L} I(x̃_j^{(m,ℓ)} = k),

where I(x) denotes an indicator function which returns 1 if x is true and 0 otherwise. If desired, the mode of the posterior predictive distribution can be obtained via an efficient optimisation algorithm based on cyclic coordinate ascent (Johnson et al., 2020).
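The empirical approximation to Pr(x_j = k|D) from the collection P amounts to simple counting; a Python sketch (illustrative, not the authors' C implementation), assuming the sampled rankings are stacked into a single array:

```python
import numpy as np

def marginal_rank_probs(rankings, K):
    """Empirical estimate of Pr(x_j = k | D) from a collection of
    rankings drawn from the posterior predictive distribution.
    rankings: (M*L, K) array with entry [i, j] the entity (1..K) in
    rank position j+1 of the i-th sampled ordering.  Returns a (K, K)
    matrix P with P[j, k] approximating Pr(x_{j+1} = k+1 | D)."""
    P = np.zeros((K, K))
    for x in rankings:
        for j, k in enumerate(x):
            P[j, k - 1] += 1.0
    return P / len(rankings)
```

Each row of the returned matrix sums to one, since exactly one entity occupies each rank position in every sampled ordering.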

Model assessment via posterior predictive checks
In the Bayesian framework, assessment of model fit to the data can be provided by comparing observed quantities with potential future observations through the posterior predictive distribution; the basic idea being that the observed data D should appear to be a plausible realisation from the posterior predictive distribution. This approach to Bayesian goodness of fit dates back at least to Guttman (1967) and is described in detail in Gelman et al. (2014), for example. Several methods for assessing goodness of fit for models of rank ordered data were proposed in Cohen and Mallows (1983) and more recently similar methods have been developed in a Bayesian framework by, amongst others, Yao and Böckenholt (1999), Mollica and Tardella (2017) and Johnson et al. (2020). In the illustrative examples on real data in Section 6 we propose a range of diagnostics tailored to the specific examples. For example, one generic method for diagnosing lack of model fit is to monitor the (absolute value of the) discrepancy between the marginal posterior predictive probabilities of entities taking particular ranks and the corresponding empirical probabilities computed from the observed data. That is, we consider

d_jk = |Pr(x_j = k|D) − Pr(x_j = k)|,

where Pr(x_j = k) denotes the empirical probability computed from those x ∈ D and the posterior predictive probabilities Pr(x_j = k|D) are computed as described in Section 5.1. These discrepancies d_jk for j, k ∈ {1, . . . , K} can then be depicted as a heat map where large values could indicate potential lack of model fit. By focusing on the marginal probabilities Pr(x_j = k) we obtain a broad-scale "first-order" check on the model but, as described in Cohen and Mallows (1983), we could also look at finer-scale features such as pairwise comparisons, triples and so on. Of course, if the full posterior predictive distribution over all K! possible observations is available (that is, if K is small) then we could also check that the observed data look plausible under this distribution.
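The discrepancies d_jk can be computed by comparing the predictive matrix with the analogous empirical matrix built from the observed data; a minimal sketch (ours), assuming the posterior predictive marginals are held as a (K, K) array pred with pred[j, k] = Pr(x_{j+1} = k+1|D):

```python
import numpy as np

def discrepancies(pred, observed):
    """d_jk = |Pr(x_j = k | D) - Pr(x_j = k)|, where pred[j, k] holds the
    posterior predictive marginal and the empirical probabilities are
    counted from the observed orderings (one ranking per row, entities
    labelled 1..K)."""
    K = pred.shape[0]
    emp = np.zeros((K, K))
    for x in observed:
        for j, k in enumerate(x):
            emp[j, k - 1] += 1.0
    emp /= len(observed)
    return np.abs(pred - emp)
```

The resulting (K, K) matrix can be plotted directly as the heat maps described above, with large entries flagging (position, entity) combinations that the model fails to capture.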

Illustrative examples
We now summarise analyses of two real datasets which together highlight how valuable insights can be obtained by considering the Extended Plackett-Luce model as opposed to simpler alternatives. With no direct competitor for Bayesian analyses of the Extended Plackett-Luce model, we compare our conclusions to those obtained under standard and reverse Plackett-Luce analyses. For the standard and reverse Plackett-Luce analyses posterior samples are obtained using the (partially collapsed) Gibbs sampling scheme of Caron and Doucet (2012), as this algorithm has been shown to be very efficient for these models. Of course, our algorithm is also capable of targeting these much simpler posterior distributions; this could be achieved by considering a single chain (C = 1) and repeatedly performing the MH update of λ followed by the rescaling step, with σ = I or σ = (K, . . . , 1) fixed throughout.

Song data
For our first example we consider a dataset with a long history in the literature, first presented in Critchlow et al. (1991). The original dataset was formed by asking ninety-eight students to rank K = 5 words, (1) score, (2) instrument, (3) solo, (4) benediction and (5) suit, according to their association with the target word "song". However, the available data in Critchlow et al. (1991) are in grouped format and the rankings of 15 students are unknown and hence discarded. The resulting dataset therefore comprises n = 83 rank orderings and is reproduced in the supplementary material.
Posterior samples are obtained via the algorithm outlined in Section 3.4, where the prior specification is as in Section 3.1 with q_k = 1 and a_k = 1 (for k = 1, . . . , K), so that all choice and preference orderings are equally likely a priori. The following results are based on a typical run of our (appropriately tuned) MC^3 scheme initialised from the prior. We performed 1M iterations, with an additional 10K discarded as burn-in, and thinned the output by 100 to obtain 10K (almost) uncorrelated realisations from the posterior distribution. The algorithm runs fairly quickly, taking around 9 minutes using C code on five threads of an Intel Core i7-4790S CPU (3.20GHz clock speed). The results of the convergence diagnostics described in Section 3.4, along with the tuning parameter values, are given in Section 5 of the supplementary material.
Inference for the Extended Plackett-Luce model is substantially more challenging than for the simpler standard/reverse Plackett-Luce models and so it is important to ask whether this additional complexity allows us to better describe the data. Put another way, does the EPL model give rise to improved model fit? To this end we investigate the discrepancies d_jk = |Pr(x_j = k|D) − Pr(x_j = k)|. For comparative purposes we also compute the discrepancies obtained under both standard and reverse Plackett-Luce analyses of these data; Figure 2 shows these values as a heat map for j, k ∈ {1, . . . , K}. Visual inspection clearly suggests that the observed data look more plausible under the EPL model when compared to the simpler models. In particular, there are rather large discrepancies between the predictive and empirical probabilities that entity k = 1 (Score) is ranked in position j = 3 under the SPL (0.34) and RPL (0.40) analyses. The superior fit of the EPL model is further supported by the Watanabe-Akaike information criterion (WAIC; Watanabe, 2010), with values of 464.12, 546.84 and 654.21 for the extended, standard and reverse PL models respectively. Note that these values are on the deviance scale and the effective number of parameters is computed as in Gelman et al. (2014). Given that we have a mixed discrete-continuous parameter space, we suggest that the WAIC is preferable to other popular information criteria such as AIC, DIC or BIC due to their reliance on point estimates and/or the assumption of a (multivariate) Gaussian posterior, which may not be appropriate here.
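The WAIC on the deviance scale can be computed from a matrix of pointwise log-likelihood evaluations over the posterior draws; this sketch (ours, following the construction in Gelman et al. (2014)) takes the effective number of parameters to be the sum of the posterior variances of the log-likelihood contributions:

```python
import numpy as np

def waic(loglik):
    """WAIC on the deviance scale from an (S, n) matrix of pointwise
    log-likelihoods log p(x_i | theta^(s)) over S posterior draws:
      lppd   = sum_i log( (1/S) sum_s p(x_i | theta^(s)) )
      p_waic = sum_i Var_s[ log p(x_i | theta^(s)) ]
      waic   = -2 * (lppd - p_waic).
    The log-sum-exp is stabilised by subtracting the column maxima."""
    S = loglik.shape[0]
    m = loglik.max(axis=0)
    lppd = np.sum(np.log(np.mean(np.exp(loglik - m), axis=0)) + m)
    p_waic = np.sum(loglik.var(axis=0, ddof=1))
    return -2.0 * (lppd - p_waic)
```

Here the pointwise log-likelihoods would be the log EPL probabilities of each observed ranking evaluated at each retained (λ, σ) draw.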
Turning now to inference for observable quantities, we again appeal to the posterior predictive distribution. More specifically, we can use the (predictive) probabilities Pr(x_j = k|D) to deduce the likely positions of entities within rankings. Figure 3 shows these probabilities as a heat map for j, k ∈ {1, . . . , K}. Focusing on the Extended Plackett-Luce analysis, it is fairly clear that "Suit" (5) is the least preferred entity and "Benediction" (4) is the 4th most preferred, with relatively little (predictive) support for any other entities in these positions. There is perhaps more uncertainty over the entities ranked within positions j = 1, 2, 3, although the figure suggests that the preference order of the entities is (Solo, Instrument, Score, Benediction, Suit). Indeed this is the modal predictive ranking and has predictive probability 0.232. Interestingly, the same modal (predictive) ranking is obtained under the RPL analysis, although it only receives predictive probability 0.07, whereas for the SPL analysis the modal (predictive) ranking is (Instrument, Solo, Score, Benediction, Suit) and occurs with probability 0.122. Figure 3 also shows there to be much more uncertainty, particularly for the top 3 entities, under the SPL and RPL analyses; this is perhaps more naturally seen through Figure 4, which shows the (full) posterior predictive distribution for each possible future observation x̃ ∈ S_K. The crosses (×) show the empirical probabilities of the observed data (those x ∈ D). Figure 4 illustrates that the observed modal ranking is (Solo, Instrument, Score, Benediction, Suit).
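For K = 5 the full posterior predictive distribution over all 5! = 120 orderings is available by direct enumeration. A Python sketch (ours) of how the modal predictive ranking can be found, assuming the EPL probability of equation (2):

```python
import itertools
import numpy as np

def epl_prob(x, lam, sigma):
    """EPL probability of the ordering x (x[j] = entity in rank j,
    0-indexed) given entity parameters lam and choice order sigma."""
    chosen = [x[s] for s in sigma]       # entities in selection order
    p = 1.0
    for j in range(len(chosen)):
        p *= lam[chosen[j]] / lam[chosen[j:]].sum()
    return p

def predictive_mode(lam_draws, sigma_draws, K):
    """Posterior predictive mode by full enumeration (small K only):
    average the EPL probability of each permutation over the posterior
    draws and return the permutation with the largest average."""
    best, best_p = None, -1.0
    for perm in itertools.permutations(range(K)):
        x = np.array(perm)
        p = np.mean([epl_prob(x, l, s)
                     for l, s in zip(lam_draws, sigma_draws)])
        if p > best_p:
            best, best_p = perm, p
    return best, best_p
```

For larger K this enumeration is replaced by the Monte Carlo approximation and the cyclic coordinate ascent optimisation described in Section 5.1.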

Formula 1 data
We now analyse a dataset containing the finishing orders of drivers within the 2018 Formula 1 (F1) season, giving n = 21 rank orderings of the K = 20 drivers. It will be interesting to see whether we are able to gain more valuable insights using the EPL model when compared to the simpler variants, in particular, whether we are able to gain any information about the choice order parameter σ in this setting where K is fairly large relative to n. The rank orderings considered here were collected from www.espn.co.uk and are reproduced in Section 7 of the supplementary material.
Numerous variants of the Plackett-Luce model have previously been developed for the analysis of F1 finishing orders; see Henderson and Kirrane (2018) and the discussion therein. In general, models derived from the reverse Plackett-Luce (RPL) model appear to perform better than the standard Plackett-Luce model in the sense that they give rise to better model fit. We choose to incorporate this prior information by letting q = (1, . . . , K) so that, a priori, the modal choice ordering is σ = (K, . . . , 1), that is, the choice ordering corresponding to the reverse Plackett-Luce model. Figure 5 (right) shows the (log) prior probabilities Pr(σ_j = k) for j, k ∈ {1, . . . , K}. Regarding the drivers, although their individual ability no doubt plays a part, ultimately Formula 1 is a team sport and the performance of a particular driver within a race is often limited by the quality of the car their respective team is able to produce. To this end we suppose that teams with larger budgets, and hence more capacity to invest in research and development, are more likely to produce a superior car and thus improved race performance. We therefore let a_k be equal to the team budget for each of the K drivers and then rescale the a_k values so that the smallest value is 1. The implication is that, under the standard Plackett-Luce model, an individual who drives for a particular team is twice as likely to win a race as another driver whose team has half the budget. The budgets were obtained from www.racefans.net and the a_k values for each driver are given in Section 7 of the supplementary material. Because this prior specification does not lead to a unique prior predictive mode under the SPL model, we use a slight modification of our mode preserving procedure which preserves all modes when averaged over the prior for σ; details of this modification are provided in the supplementary material.
Of course, for the EPL model there is also additional uncertainty about the choice order parameter and the resulting prior predictive distribution is shown in Figure 5 (left). Here the prior (predictive) probabilities Pr(x_j = k) are obtained via a Monte Carlo approach akin to that outlined in Section 5, where the (λ, σ) draws are obtained from the prior rather than the posterior. Section 6 of the supplementary material provides posterior summaries from alternative analyses with a = 1 (a uniform prior predictive distribution) and q = 1 (a uniform prior distribution for σ). These analyses reveal that the posterior distribution is not particularly sensitive to the choices of prior distribution considered here.
The following results are based on a typical run of our (appropriately tuned) MC^3 scheme initialised from the prior. We allow a burn-in period of 100K iterations, after which we perform a further 2M iterations, thinned by 200 to obtain 10K (almost) uncorrelated realisations from the posterior distribution. This analysis takes around an hour using C code on five threads of an Intel Core i7-4790S CPU (3.20GHz clock speed). Section 6 of the supplementary material contains the results of the convergence diagnostics and the tuning parameter values used for this analysis.
First, to address the model selection issue, the estimated WAIC values for the extended, standard and reverse PL models are (on the deviance scale) 1501.58, 1672.49 and 1507.32 respectively. These values suggest the marginal superiority of the EPL model over the RPL model, with the SPL model a distant third.
Focusing now on the EPL analysis, investigation of the posterior distribution reveals that there is a large amount of uncertainty about the choice order parameter σ and also potential bi-modality within certain ranking stages. That said, further inspection of the marginal posterior distributions Pr(σ_j = k|D) reveals that there is a surprisingly small amount of uncertainty about the ranks allocated in the 13th-20th stages; see Figure 6 (right). Further, within these stages (σ_13, . . . , σ_20) the ranks allocated are consistent with the choice order parameter corresponding to the reverse Plackett-Luce model, which suggests why previous authors may have found the RPL model preferable to the SPL model for modelling F1 results. We also note that, although there is some uncertainty within certain ranking stages, it is clear that these data are rather informative about the likely values of σ in the sense that the posterior distribution is not that similar to the prior; see Figure 5 (right).
Again we can turn to the posterior predictive distribution to deduce the likely positions of entities within rankings. Note that here complete enumeration of the posterior predictive probabilities for each x̃ ∈ S_K is computationally infeasible as K! is of O(10^18). Figure 6 (left) therefore shows the (log) probabilities Pr(x_j = k|D) as a heat map for j, k ∈ {1, . . . , K}; the corresponding figures for the RPL and SPL analyses are provided within the supplementary material. Note that the predictive probabilities Pr(x_j = k|D) are computed from synthetic data simulated from the predictive distribution as discussed in Section 5 with L = 10. This figure shows that, perhaps unsurprisingly, there is a fairly large amount of uncertainty about the finishing positions of some of the drivers. That said, this does not mean that useful inferences cannot be made from these predictive probabilities. For example, the figure suggests that Stoffel Vandoorne (column 7) is most likely to place somewhere between 11th-15th (probability 0.46) whereas the probability of this driver finishing within the top-10 positions is only 0.21. In contrast, Fernando Alonso (column 8) is much more likely to obtain a top-10 finish (probability 0.43) and we would only expect Alonso to finish within positions 11 to 15 in around 25% of races. This is particularly interesting given that both of these individuals drove for the same team (McLaren Renault) within the 2018 season and hence were exchangeable a priori under our model. It is also clear that the first 6 drivers are those most likely to perform well during a race. However, a particularly interesting aspect is the bi-modality of the distribution for some of these drivers.

Figure 6: F1 data: heat maps showing the posterior predictive distribution (left) and the marginal posterior distribution of the choice order parameter (right). Crosses (×) highlight the probabilities corresponding to σ = (K, . . . , 1), the reverse Plackett-Luce model.

Take, for example, Sebastian Vettel (column 1); we can see that although most of the probability is assigned to this driver finishing within the top-10 positions, there is also an increased chance of finishing within the bottom 5 when compared to the middle (11-15) positions. Although perhaps counter-intuitive, this result is not as surprising as it first seems. Indeed, although the drivers in the top cars are likely to do well, there is also a chance they could suffer a mechanical issue, or perhaps be involved in an accident, and hence obtain a surprisingly low finishing position within any given race. We also note that this particular aspect is not captured by either the standard or reverse Plackett-Luce analyses. The corresponding heat maps can be found in Section 6 of the supplementary material and these show that the distribution of the finishing position of each driver, that is, the distribution within each column, is uni-modal; this perhaps explains why the Extended Plackett-Luce model is favoured here.
We can also look at other quantities of interest, for example, the number of times we would expect each of the top 6 drivers to win a race, feature on the podium (top 3), and obtain a points (top 10) finish, based on the predictive probabilities. More specifically, Table 2 shows n × Σ_{j=1}^{p} Pr(x_j = k|D) for p = 1, 3, 10, along with the observed number of times computed from those x ∈ D. It is interesting to see that both the extended and reverse Plackett-Luce models are able to capture these aspects of the distribution fairly well. Further, the expected number of points (top 10) finishes under the standard Plackett-Luce model is also fairly consistent with the observed data. However, the shortcomings of the simpler standard Plackett-Luce model become clear if we instead consider the expected number of wins and podiums. For example, we observed that Hamilton won 11 races; the SPL model suggests that he would be expected to win around 5 races within an F1 season whereas the EPL model suggests 10 wins, which is much more consistent with the observed data. Again, additional insight into the question of model fit can be obtained via heat maps showing the discrepancies d_jk = |Pr(x_j = k|D) − Pr(x_j = k)| for j, k ∈ {1, . . . , K}; these are provided in Section 6 of the supplementary material. The total discrepancies Σ_{j,k} d_jk are 13.35, 17.33 and 14.14 for the extended, standard and reverse PL models respectively. These values support the conclusion from the comparison of the WAIC values: the EPL model provides the best fit of the three models.
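The expected counts in Table 2 follow directly from the marginal predictive probabilities; a short sketch (ours), assuming these are held as a (K, K) array pred with pred[j, k] = Pr(x_{j+1} = k+1|D):

```python
import numpy as np

def expected_top_p(pred, n, p):
    """Expected number of top-p finishes over n races for each driver:
    n * sum_{j=1}^{p} Pr(x_j = k | D), computed for all k at once."""
    return n * pred[:p].sum(axis=0)
```

Calling this with p = 1, 3 and 10 gives the expected numbers of wins, podiums and points finishes respectively.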
Finally, whilst the number of wins, podium and points finishes is clearly of interest, ultimately the F1 Drivers' Championship is determined by the number of points each driver is able to accrue over an F1 season. Given that the number of points awarded is simply a deterministic function of the finishing orders, we can straightforwardly compute the expected number of points each driver obtains by simulating synthetic F1 seasons D̃ = {x̃_1, . . . , x̃_n} from the posterior predictive distribution. Table 3 shows the observed number of points in addition to the expected number under each model considered. We can also use D̃ to compute the (predictive) probability that each driver finishes within their respective (observed) position; these values are given in Table 3 (right).
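Given simulated seasons from the posterior predictive distribution, the expected points totals are a deterministic transformation; a sketch (ours) using the 2018 scoring system (25-18-15-12-10-8-6-4-2-1 for the top ten positions, zero otherwise):

```python
import numpy as np

# Points awarded by finishing position under the 2018 F1 scoring system.
POINTS = np.array([25, 18, 15, 12, 10, 8, 6, 4, 2, 1] + [0] * 10)

def expected_points(seasons):
    """Expected championship points per driver.  seasons is an (S, n, K)
    array of simulated seasons: seasons[s, r, j] is the driver (1..K)
    finishing in position j+1 of race r within simulated season s."""
    S, n, K = seasons.shape
    totals = np.zeros(K)
    for season in seasons:
        for race in season:
            totals[race - 1] += POINTS[:K]   # race is a permutation of 1..K
    return totals / S
```

Averaging over many simulated seasons gives Monte Carlo estimates of the expected totals, and the same simulated seasons can be reused to estimate the probability of each driver attaining any given championship position.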

Conclusion
We have considered the problem of implementing a Bayesian analysis of rank ordered data using the Extended Plackett-Luce model. In particular we have considered carefully the problem of prior specification, proposing a Plackett-Luce model as the prior for the choice order parameter σ and a prior distribution on the entity parameters that preserves the modal ordering under the prior predictive distribution. We have also tackled the challenging issue of sampling from a potentially highly multi-modal posterior distribution with both discrete and continuous components via a Metropolis coupled Markov chain Monte Carlo scheme. This enables efficient posterior sampling, which in turn facilitates further analyses based on the Extended Plackett-Luce model and extensions of it. Finally, we have focused on predictive inference for observable quantities, which naturally facilitates the assessment of model adequacy.

Reproducibility
With reproducibility in mind, the code to run the algorithm outlined in Section 3.4 can be found at the GitHub repository https://github.com/srjresearch/ExtendedPL. In addition, C code for performing a standard Plackett-Luce analysis along with each of the synthetic and real datasets presented are also provided.

Supplementary Material
Supplementary material of "On Bayesian inference for the Extended Plackett-Luce model" (DOI: 10.1214/21-BA1258SUPP; .pdf). The supplementary material contains details of the data generating mechanism, further discussion of our mode preserving prior specification and additional simulation and real data results.