Bayesian inference, model selection and likelihood estimation using fast rejection sampling: the Conway-Maxwell-Poisson distribution

Bayesian inference for models with intractable likelihood functions represents a challenging suite of problems in modern statistics. In this work we analyse the Conway-Maxwell-Poisson (COM-Poisson) distribution, a two-parameter generalisation of the Poisson distribution. COM-Poisson regression modelling allows the flexibility to model dispersed count data as part of a generalised linear model (GLM) with a COM-Poisson response, where exogenous covariates control the mean and dispersion level of the response. The major difficulty with COM-Poisson regression is that the likelihood function contains multiple intractable normalising constants and is not amenable to standard inference and MCMC techniques. Recent work by Chanialidis et al. (2018) has seen the development of a sampler to draw random variates from the COM-Poisson likelihood using a rejection sampling algorithm. We provide a new rejection sampler for the COM-Poisson distribution which significantly reduces the CPU time required to perform inference for COM-Poisson regression models. A novel extension of this work shows that for any intractable likelihood function with an associated rejection sampler it is possible to construct unbiased estimators of the intractable likelihood, which proves useful for model selection or for use within pseudo-marginal MCMC algorithms (Andrieu and Roberts, 2009). We demonstrate all of these methods on a real-world dataset of takeover bids.


Introduction
Modelling count data continues to be an important area in the practice of statistics. Often covariate information is available which may prove useful in explaining the counts, for example, time of day influencing the number of calls to a particular call centre or amount borrowed influencing the number of loan defaults by a consumer. Poisson regression is a standard framework for modelling covariate-dependent count data, allowing the mean response to depend on the covariates through a log-linear link function. In practice, Poisson regression fails to capture the phenomenon of dispersion, where the mean and variance of the response differ significantly. Dispersion measures the spread of a random variable and is quantified through the ratio of the variance of a random variable to its mean. Count data often exhibit underdispersion (variance less than the mean), overdispersion (variance greater than the mean) or equidispersion (equality of mean and variance), but Poisson regression is only capable of modelling equidispersed data. Many alternative approaches have been suggested to capture dispersion, such as using quasilikelihood or the popular negative binomial regression (Hilbe, 2011), which allows modelling of overdispersed data only. The COM-Poisson distribution (Conway and Maxwell, 1962) was first introduced to model queuing systems and was then revived as a statistical model for dispersed count data by Shmueli et al. (2005). This revival led to the introduction of COM-Poisson regression models by Guikema and Goffelt (2008) as a flexible regression model for dispersed count data with covariates. However, this flexibility comes with the caveat that the COM-Poisson likelihood function contains an intractable normalising term, necessitating non-standard methods for parameter inference. This intractability has restricted the popularity of COM-Poisson regression as a viable solution to modelling dispersion. Recent work by Chanialidis et al. 
(2018) has seen a Bayesian analysis of COM-Poisson regression under the framework of doubly-intractable Bayesian inference, where the authors overcome the intractability through the use of a rejection sampling algorithm for the COM-Poisson distribution. This rejection sampler is then used within the exchange algorithm of Murray et al. (2006) to perform Bayesian parameter inference.
Our novel contribution in this paper is to provide a more efficient approach to that of Chanialidis et al. (2018) by designing a simpler rejection sampler for the COM-Poisson distribution which is computationally less intensive to sample from and leads to faster inference within COM-Poisson regression models. In one example in this paper, we observe a greater than threefold reduction in CPU time by using our rejection sampler. The rejection sampler of Chanialidis et al. (2018) is derived from a discrete version of adaptive rejection sampling (Gilks and Wild, 1992) and requires the costly construction of a piecewise-truncated geometric enveloping distribution to tightly encapsulate the COM-Poisson probability mass function. Our rejection sampler requires only a single envelope distribution depending on the dispersion parameter of the COM-Poisson distribution. This single envelope distribution does come with a higher rejection rate, but completely bypasses the costly setup and sampling involved in piecewise-truncated geometric envelope distributions. A further and important novelty of our work is to show that when a rejection sampling algorithm is available to sample from an intractable likelihood function of interest, then it is possible to construct an unbiased estimate of this intractable likelihood function using the rejection sampler. Our idea works by exploiting a link between rejection sampling efficiency and its relationship with the reciprocal of the intractable normalising constant of the likelihood function.
The remainder of this paper is organised as follows: Section 2 reviews the COM-Poisson distribution and COM-Poisson regression models under the framework of doubly-intractable Bayesian inference. Section 3 reviews rejection sampling for intractable likelihoods and our new COM-Poisson rejection sampler is introduced in Section 3.1. Section 3.3 describes the novel method of constructing unbiased estimators of intractable likelihood functions through rejection sampling. Section 4 shows how pseudo-marginal MCMC can be performed using this unbiased intractable likelihood estimator. Section 5 shows results of these methods applied firstly to a dataset with no covariates in Section 5.1 and then to a real-world COM-Poisson regression dataset of company takeover bids in Section 5.2. Finally, Section 6 gives discussion and some thoughts for future work.

Bayesian doubly-intractable inference for GLMs
The COM-Poisson distribution (Conway and Maxwell, 1962) is a two-parameter discrete probability distribution that allows the flexibility to model count data with over-, under- and equi-dispersion. The probability mass function for a COM-Poisson random variable Y with parameters μ > 0 and ν ≥ 0 is defined over the non-negative integers as

f(y|μ, ν) = (μ^y / y!)^ν / Z_f(μ, ν),  y = 0, 1, 2, . . . .

The unnormalised component of the mass function is denoted q_f(y|μ, ν) = (μ^y / y!)^ν, and

Z_f(μ, ν) = Σ_{y=0}^{∞} (μ^y / y!)^ν

is an intractable normalising constant, having no closed-form representation for this model. The mode of the COM-Poisson distribution is ⌊μ⌋, with two consecutive modes at μ and μ − 1 in the case where μ is an integer. The moments of the COM-Poisson distribution are unavailable directly due to the intractable normalising constant. The moments can however be approximated through the use of an asymptotic representation of Z_f(μ, ν) derived by Shmueli et al. (2005). The approximations for the mean and variance using this asymptotic representation are, respectively,

E(Y) ≈ μ + 1/(2ν) − 1/2  and  Var(Y) ≈ μ/ν.  (1)

These approximations are quite accurate for a wide range of μ and ν, except for small μ or small ν in the case of E(Y) and for ν > 1, regardless of the value of μ, in the case of Var(Y). The purpose of the ν parameter is to control dispersion through this parameter's inverse relationship with the variance, as seen in (1). When ν = 1 the distribution exhibits equidispersion and the probability mass function reduces to that of a Poisson random variable with Z_f(μ, 1) = exp(μ). Alternatively, setting ν < 1 gives overdispersion and setting ν > 1 gives underdispersion.
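To make these definitions concrete, the unnormalised mass, an illustrative truncated evaluation of Z_f, and the asymptotic moment approximations (1) can be sketched as follows. Function names are ours, work is done in log space to avoid overflow, and the truncated sum is for illustration only; its truncation bias is exactly the issue this paper works around.

```python
import math

def log_q(y, mu, nu):
    """Log unnormalised COM-Poisson mass: log q_f(y|mu,nu) = nu*(y*log(mu) - log(y!))."""
    return nu * (y * math.log(mu) - math.lgamma(y + 1))

def pmf(y, mu, nu, trunc=500):
    """Illustrative pmf using a truncated sum for Z_f (biased by the dropped tail)."""
    logs = [log_q(j, mu, nu) for j in range(trunc)]
    m = max(logs)
    log_z = m + math.log(sum(math.exp(v - m) for v in logs))
    return math.exp(log_q(y, mu, nu) - log_z)

def approx_mean_var(mu, nu):
    """Asymptotic approximations (1): E(Y) ~ mu + 1/(2 nu) - 1/2 and Var(Y) ~ mu/nu."""
    return mu + 1.0 / (2.0 * nu) - 0.5, mu / nu
```

For ν = 1 this recovers the Poisson(μ) mass, since Z_f(μ, 1) = exp(μ).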
The main application of the COM-Poisson distribution is to GLMs, referred to as COM-Poisson regression models, where n responses y_{1:n} = (y_1, . . . , y_n) are assumed to follow a COM-Poisson distribution with the parameters μ and ν of each response being conditional on b available covariates x_i = (x_{i1}, . . . , x_{ib}) for i = 1, . . . , n. Due to this conditioning we now index the parameters for observation y_i as μ_i and ν_i. COM-Poisson regression modelling was first introduced by Guikema and Goffelt (2008) and Sellers and Shmueli (2010) in an effort to extend Poisson regression by allowing covariates to influence not only the location parameter μ_i but also the dispersion parameter ν_i. For notational simplicity we now let θ_i = (μ_i, ν_i)^T be the parameter for observation y_i, with θ_i derived from setting θ_i = η(β, x_i), where η(β, x_i) is a link function of the b covariates and β are the parameters of this link function. The major difficulty is that the likelihood of the COM-Poisson regression model involves multiple intractable normalising constants,

f(y_{1:n}|θ_{1:n}) = Π_{i=1}^{n} q_f(y_i|θ_i) / Z_f(θ_i).  (2)

The particular link function η(β, x_i) for the COM-Poisson was specified by Guikema and Goffelt (2008) as a dual-link function, allowing the available covariates to enter both the μ_i and ν_i parameters. In this dual-link scenario the parameter β is split into two sub-vectors as β = (β_μ, β_ν), where β_μ are the coefficients for the μ link component and β_ν are the coefficients for the ν link component. The dual-link function is then

log μ_i = η_1(β_μ, x_i),  log ν_i = η_2(β_ν, x_i),

where some β_μ are fixed to 0 when the covariate is not included in η_1(β_μ, x_i), and similarly for β_ν in η_2(β_ν, x_i).
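As an illustrative sketch of the dual-link map from covariates to (μ_i, ν_i), assuming log-links with linear predictors on both components (a common choice in the COM-Poisson regression literature; the exact form of η used by Guikema and Goffelt may differ, and all names here are ours):

```python
import math

def dual_link(beta_mu, beta_nu, x):
    """Map a covariate vector x_i to (mu_i, nu_i) via a dual-link function.
    Sketch assuming log-links: mu_i = exp(x_i . beta_mu), nu_i = exp(x_i . beta_nu)."""
    eta1 = sum(b * xj for b, xj in zip(beta_mu, x))
    eta2 = sum(b * xj for b, xj in zip(beta_nu, x))
    return math.exp(eta1), math.exp(eta2)
```

Coefficients fixed at zero simply exclude the corresponding covariate from that link component.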
Bayesian parameter inference for β in COM-Poisson regression models is a doubly-intractable problem, as we now demonstrate. Placing a prior on β, the posterior distribution for β is formed with this prior and the complete likelihood (2) as

π(β|y_{1:n}) ∝ f(y_{1:n}|θ_{1:n}) π(β).  (3)

This posterior is termed doubly-intractable since the likelihood is intractable, containing n intractable normalising constants, and the posterior itself cannot be normalised. Without the ability to evaluate f(y_{1:n}|θ_{1:n}) pointwise, the direct application of standard MCMC methods is infeasible. For example, a Metropolis-Hastings algorithm proposing a move in the link function from β to β′ using a proposal distribution h(β, β′) requires evaluation of the intractable ratios Z_f(θ_i)/Z_f(θ′_i) to compute the acceptance ratio

a(β′|β) = [Π_{i=1}^{n} (q_f(y_i|θ′_i)/q_f(y_i|θ_i)) (Z_f(θ_i)/Z_f(θ′_i))] × π(β′)h(β′, β) / [π(β)h(β, β′)].  (4)

Possible methods to compute this acceptance ratio include replacing the intractable ratios with approximations using the asymptotic approximation from Shmueli et al. (2005) or with a truncated sum Ẑ_f(θ_i) = Σ_{y=0}^{k} q_f(y|θ_i), but both of these methods introduce some bias into the acceptance ratio.
An ingenious solution to the issue of computing the acceptance ratio (4) was proposed by Murray et al. (2006), drawing on earlier work by Møller et al. (2006). The solution, known as the exchange algorithm, augments the doubly-intractable posterior (3) with auxiliary data, provided that it is possible to sample data exactly from the intractable likelihood. The original exchange algorithm overcomes a single intractable ratio, i.e. n = 1; however, the algorithm can easily be extended to the case of n > 1 intractable ratios, as we now show. Starting from the posterior (3), which contains the product of n independent non-identical intractable likelihood functions, the exchange algorithm proposes, as in standard Metropolis-Hastings, to update the parameter from the current state β to proposed state β′ using h(β, β′), but in addition the posterior is augmented with n auxiliary draws y′_{1:n} = (y′_1, . . . , y′_n) generated from the likelihood at the proposed parameters θ′_i = η(β′, x_i). The augmented posterior is written

π(β′, y′_{1:n}, β|y_{1:n}) ∝ f(y_{1:n}|θ_{1:n}) π(β) h(β, β′) f(y′_{1:n}|θ′_{1:n}),  (5)

where the first factor is the likelihood and the final factor corresponds to the auxiliary draws, and the marginal of this augmented posterior for β is the target posterior of interest (3). The algorithm can be seen as a Markov chain over an augmented parameter (β, y′_{1:n}). The acceptance ratio for the augmented posterior (5) is now tractable due to the cancellation of all intractable normalising constants:

a(β′|β) = Π_{i=1}^{n} [q_f(y_i|θ′_i) q_f(y′_i|θ_i)] / [q_f(y_i|θ_i) q_f(y′_i|θ′_i)] × π(β′)h(β′, β) / [π(β)h(β, β′)].  (6)

The clever cancellation of the normalising constants in (6) is due to the exchange of the parameters (θ_i, θ′_i) associated with the observed data y_{1:n} and the auxiliary data y′_{1:n}, the auxiliary data being discarded after each move. The major requirement in the exchange algorithm is the ability to generate the auxiliary data, requiring exact draws from the likelihood at each proposed parameter. Perfect sampling (Propp and Wilson, 1996) from the likelihood is one possible method to do this, but can be prohibitively expensive for many models. 
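A single move of the exchange algorithm can be sketched as follows for n = 1. All function arguments are illustrative placeholders: log_q is the log unnormalised likelihood, sample_lik draws exactly from the likelihood (e.g. by rejection sampling), and the proposal is assumed symmetric.

```python
import math
import random

def exchange_step(theta, y, log_q, sample_lik, propose, log_prior, rng=random):
    """One exchange-algorithm move (Murray et al., 2006) for a single observation.
    Every normalising constant cancels in log_a, so only unnormalised masses are
    evaluated; the auxiliary draw y_aux is discarded after the move."""
    theta_p = propose(theta, rng)
    y_aux = sample_lik(theta_p, rng)  # auxiliary draw at the proposed parameter
    log_a = (log_q(y, theta_p) - log_q(y, theta)            # observed-data exchange
             + log_q(y_aux, theta) - log_q(y_aux, theta_p)  # auxiliary-data exchange
             + log_prior(theta_p) - log_prior(theta))       # symmetric proposal assumed
    if math.log(1.0 - rng.random()) < log_a:
        return theta_p
    return theta
```

For n > 1, each of the four log_q terms simply becomes a sum over i = 1, . . . , n.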
In certain models sampling from an intractable likelihood may even be as difficult as computing the intractable normalising constant itself and instead one could resort to sampling using an approximate likelihood sampler such as in the case of the exponential random graph model (Caimo and Friel, 2011). The COM-Poisson distribution was shown by Chanialidis et al. (2018) to be amenable to rejection sampling, allowing the auxiliary data to be generated. In Section 3.1 we present our faster rejection sampler having a cheaper sampling cost than the sampler of Chanialidis et al. (2018). Rejection sampling is not applicable to all intractable likelihoods but when rejection sampling is available we show in Section 3.3 that the intractable likelihood can be estimated without bias through an unbiased estimate of the reciprocal normalising constant. This unbiased intractable likelihood estimator is also guaranteed to be positive which differs from other estimates using Russian roulette stochastic truncation (Lyne et al., 2015) or truncated Markov chains (Wei and Murray, 2017).

From rejection sampling to an intractable likelihood estimator
Rejection sampling (von Neumann, 1951) is a sampling scheme to generate statistically independent samples from a target probability distribution of interest. We denote the target density (likelihood) as

f(y|θ) = q_f(y|θ) / Z_f(θ),

where in this section we drop the observation index i for ease of notation. This target density may be tractable or intractable, but is assumed in any case to be difficult to sample from. The rejection sampling method works by constructing an envelope distribution over the target density, proposing samples from this envelope distribution and choosing to accept a portion of these samples. The envelope distribution should be chosen so that it is both computationally efficient to sample from and matches the target density closely in shape. The portion of samples that will be accepted depends on the similarity of the envelope and the target distribution: the closer the envelope distribution matches the target distribution, the higher the proportion of samples that will be accepted. To generate a single draw y from f(y|θ) using rejection sampling, we consider an envelope distribution g(y|γ), where γ ∈ Γ is its parameter. We assume that the envelope density can be written in a similar form to f(y|θ) as

g(y|γ) = q_g(y|γ) / Z_g(γ).

The envelope distribution's normalising constant Z_g(γ) may be either tractable or intractable. A necessary condition for g(·|γ) is that it dominates the support of f(·|θ), i.e. g(y|γ) = 0 =⇒ f(y|θ) = 0. It is also necessary to bound f(y|θ) using g(y|γ) and a positive finite bounding constant M that satisfies the envelope inequality Mg(y|γ) ≥ f(y|θ), ∀y. Finding such a constant can be a difficult task even for tractable densities. The optimal (smallest) value of M is denoted

M_{f/g} = sup_y f(y|θ) / g(y|γ).

In the case of intractable likelihoods, this optimal constant is impossible to evaluate since at least one of either Z_f(θ) or Z_g(γ) is unknown, leaving M_{f/g} intractable. 
In the case where both f (y|θ) and g(y|γ) are tractable distributions and M f/g can be found, the vanilla rejection sampling algorithm to obtain a single draw from f (y|θ) proceeds as in Algorithm 1.
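Algorithm 1 can be sketched in code for the fully tractable case. The generic routine below is a sketch (names are ours); as an illustrative target we take the Beta(2, 1) density f(y) = 2y on [0, 1] with a uniform envelope and M = 2.

```python
import random

def rejection_sample(f, g_sample, g_pdf, M, rng=random):
    """Vanilla rejection sampling (Algorithm 1): propose y* ~ g, then run the
    Bernoulli trial with acceptance probability alpha(y*) = f(y*) / (M * g(y*)).
    Requires M * g(y) >= f(y) for all y."""
    while True:
        y = g_sample(rng)
        if rng.random() < f(y) / (M * g_pdf(y)):
            return y

# Example: target f(y) = 2y on [0, 1], uniform envelope g(y) = 1, bound M = 2.
```

On average half of the uniform proposals are accepted here, matching the general rate 1/M discussed in Section 3.3.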
Algorithm 1: Rejection sampling.

Algorithm 1 introduces the rejection sampling Bernoulli trial, with the outcome conditional on the proposed y* through the acceptance probability α(y*). This Bernoulli trial is at the core of the rejection sampling method, as it decides, for a given y*, what proportion of samples from g(·|γ) to accept. Once intractability is introduced into this Bernoulli trial through M_{f/g}, it would appear that α(y*) becomes intractable, but this is fortunately not the case. The intractable bounding constant M_{f/g} can be decomposed into tractable and intractable components by extracting the reciprocal normalising constants from M_{f/g} as follows:

M_{f/g} = sup_y [q_f(y|θ) / q_g(y|γ)] × Z_g(γ)/Z_f(θ),  (7)

and this introduces the tractable bounding constant

B_{f/g} = sup_y q_f(y|θ) / q_g(y|γ),  so that  M_{f/g} = B_{f/g} Z_g(γ)/Z_f(θ).  (8)

B_{f/g} can be viewed as the upper bound on the ratio of the unnormalised densities q_f(·|θ) and q_g(·|γ), whereas M_{f/g} is the upper bound on the ratio of the associated normalised densities f(·|θ) and g(·|γ). The advantage of rejection sampling algorithms is that they can be run without knowledge of M_{f/g}: substituting (8) into the acceptance probability of Algorithm 1 gives the tractable acceptance probability for the conditional Bernoulli trial as

α(y*) = q_f(y*|θ) / (B_{f/g} q_g(y*|γ)).  (9)

It is important to note that Z_g(γ) is also not required to be tractable, although it typically is for simple envelope distributions. A case where Z_g(γ) is intractable would be when the envelope distribution itself requires sampling by another rejection sampler, and this opens the possibility of chaining many rejection samplers together to sample from ever more complex target distributions. The main purpose of rejection sampling in this paper is to provide a means to generate the auxiliary draws required within the exchange algorithm acceptance ratio (6), allowing the doubly-intractable inference to be performed, but Section 3.3 will show another novel usage for rejection sampling. Before discussing this, we demonstrate our COM-Poisson rejection sampler in the next section.

COM-Poisson rejection sampler
For the COM-Poisson distribution, our sampler uses a choice of two envelope distributions depending on the value of ν. This choice is motivated by Figure 1, where a Poisson distribution is used in the case of ν ≥ 1 and a geometric distribution in the case of ν < 1. Theorem 3.1 states the two envelopes and their tractable bounds, and proves why this approach works. The COM-Poisson rejection sampling algorithm is given in pseudocode after this theorem.
Theorem 3.1 (COM-Poisson intractable rejection sampler). Suppose that Y ∼ g_1(y|γ), a Poisson random variable with parameter γ = μ, and Y ∼ g_2(y|γ), a geometric random variable with parameter γ = p, for some 0 < p < 1, are used as the two enveloping distributions for ν ≥ 1 and ν < 1 respectively, with unnormalised masses q_{g1}(y|μ) = μ^y/y! and q_{g2}(y|p) = (1 − p)^y p. Then the tractable enveloping bounds are

B^{[ν≥1]}_{f/g} = (μ^{⌊μ⌋}/⌊μ⌋!)^{ν−1}  and  B^{[ν<1]}_{f/g} = (μ^{y_m}/y_m!)^ν / ((1 − p)^{y_m} p),  where y_m = ⌊μ/(1 − p)^{1/ν}⌋.  (10)

Proof. Case 1 (ν ≥ 1, Poisson envelope): The enveloping bound as illustrated in Figure 1 (left) is the supremum of the ratio of the COM-Poisson mass function to the Poisson mass function,

q_f(y|μ, ν)/q_{g1}(y|μ) = (μ^y/y!)^ν / (μ^y/y!) = (μ^y/y!)^{ν−1}.

To find the supremum, assume that the supremum occurs at the point y_m; by the unimodality of the COM-Poisson distribution, it will satisfy

(μ^{y_m}/y_m!)^{ν−1} ≥ (μ^{y_m+h}/(y_m + h)!)^{ν−1}  for all h ∈ {−y_m, −y_m + 1, . . . }.  (11)

To find the value of y_m that will satisfy this collection of inequalities over h, it is enough to consider the dominant inequalities at h = ±1,

(μ/(y_m + 1))^{ν−1} ≤ 1  and  (μ/y_m)^{ν−1} ≥ 1,  which give y_m = ⌊μ⌋.  (12)

Therefore if y_m satisfies (12), it will also satisfy (11) for all h ∈ {−y_m, −y_m + 1, . . . }.
Thus the bound becomes

B^{[ν≥1]}_{f/g} = (μ^{⌊μ⌋}/⌊μ⌋!)^{ν−1},

which is a tractable bound, and for the special case of the Poisson (ν = 1), B^{[ν=1]}_{f/g} = 1. In the case of integer μ this bound will also ensure coverage of the dual modes at μ and μ − 1 (see Figure 1).
Case 2 (ν < 1, geometric envelope): The enveloping bound as illustrated in Figure 1 (right) is the supremum of the ratio of the COM-Poisson mass function to the geometric mass function,

q_f(y|μ, ν)/q_{g2}(y|p) = (μ^y/y!)^ν / ((1 − p)^y p).

For now we assume p is unknown and we prove the general bound for any p. A discussion of which value of p to choose in practice will be given after this proof. As in Case 1, assume that the supremum occurs at the point y_m, which by the unimodality of the COM-Poisson distribution will satisfy

(μ^{y_m}/y_m!)^ν / ((1 − p)^{y_m} p) ≥ (μ^{y_m+h}/(y_m + h)!)^ν / ((1 − p)^{y_m+h} p)  for h ∈ {−y_m, −y_m + 1, . . . }.  (13)

To find the value of y_m that will satisfy this collection of inequalities over h, it is enough to consider again only the inequalities at h = ±1. This leads to

(μ/(y_m + 1))^ν ≤ 1 − p  and  (μ/y_m)^ν ≥ 1 − p,  which give y_m = ⌊μ/(1 − p)^{1/ν}⌋.  (14)

As for Case 1, if y_m satisfies (14) it will also satisfy (13) for all h ∈ {−y_m, −y_m + 1, . . . }.
Thus the bound becomes

B^{[ν<1]}_{f/g} = (μ^{y_m}/y_m!)^ν / ((1 − p)^{y_m} p),

which completes the proof.

Before we present the practical implementation of the COM-Poisson sampling algorithm developed from Theorem 3.1, it is necessary to provide some brief discussion on the choice of p in the geometric envelope. The optimal p parameter, p* say, would be one for which the acceptance rate of the rejection sampler is maximised. However it is not possible to compute p*, since the acceptance rate, as we will show in Section 3.3, involves intractable normalising constants. Nevertheless, any choice of p will result in a valid rejection sampler and clearly the closer p is to p*, the more efficient the sampler is. We have chosen to select p by matching the first moment of the geometric distribution to the approximate expected value of the COM-Poisson distribution using (1). Setting the first moment of the geometric distribution, (1 − p)/p, equal to the expected value approximation (1) implies

p = 2ν / (2μν + 1 + ν).  (15)

Algorithm 2 presents pseudocode of the COM-Poisson sampler for coding on a computer.

Figure 1: Enveloping the COM-Poisson distribution for rejection sampling. In the underdispersed (ν ≥ 1) case (left), a Poisson distribution, having a higher relative dispersion, is scaled to envelope the COM-Poisson distribution. In the overdispersed (ν < 1) case (right), a geometric distribution, having a higher relative dispersion, is used. The parameter p of the geometric distribution in the overdispersed case controls the efficiency of the enveloping. To choose p we match the first moment of the geometric distribution with that of the COM-Poisson distribution (15), shown as the black curve (right). The blue curve corresponds to an alternative choice of p, showing how an inefficient choice of p results in a geometric envelope with a high rejection region (area between the blue and green curves).
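A compact sketch of the sampler of Theorem 3.1, with the moment-matched p of (15), is given below. Helper names and the inversion-based envelope samplers are ours; a production implementation would use a library Poisson sampler.

```python
import math
import random

def log_q(y, mu, nu):
    """Log unnormalised COM-Poisson mass: nu * (y*log(mu) - log(y!))."""
    return nu * (y * math.log(mu) - math.lgamma(y + 1))

def _poisson_draw(mu, rng):
    """Simple inversion sampler for Poisson(mu); adequate for moderate mu."""
    u, y, p = rng.random(), 0, math.exp(-mu)
    c = p
    while u > c:
        y += 1
        p *= mu / y
        c += p
    return y

def _geometric_draw(p, rng):
    """Geometric on {0, 1, 2, ...}: number of failures before the first success."""
    return int(math.log(1.0 - rng.random()) / math.log(1.0 - p))

def sample_com_poisson(mu, nu, rng=random):
    """One COM-Poisson draw: Poisson(mu) envelope when nu >= 1 (bound attained
    at the mode floor(mu)), geometric envelope with moment-matched p when nu < 1."""
    if nu >= 1:
        m = math.floor(mu)
        log_B = (nu - 1) * (m * math.log(mu) - math.lgamma(m + 1))
        while True:
            y = _poisson_draw(mu, rng)
            log_alpha = log_q(y, mu, nu) - (y * math.log(mu) - math.lgamma(y + 1)) - log_B
            if math.log(1.0 - rng.random()) < log_alpha:
                return y
    p = 2 * nu / (2 * mu * nu + 1 + nu)          # moment matching, equation (15)
    ym = math.floor(mu / (1 - p) ** (1 / nu))    # mode of the envelope ratio
    log_B = log_q(ym, mu, nu) - (ym * math.log(1 - p) + math.log(p))
    while True:
        y = _geometric_draw(p, rng)
        log_alpha = log_q(y, mu, nu) - (y * math.log(1 - p) + math.log(p)) - log_B
        if math.log(1.0 - rng.random()) < log_alpha:
            return y
```

Working with log(1 − u) rather than log(u) avoids evaluating log(0), since u ∈ [0, 1).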

Efficiency and comparison of the rejection sampling algorithm
To analyse the efficiency of the COM-Poisson sampler, we conducted an experiment in which the sampler was run for different values of μ, each over a fine grid of ν values in [0, 6], as outlined in Figure 2. Here we monitored the acceptance rate of the COM-Poisson rejection sampler, based on the average of the reciprocal of the number of draws from the envelope distribution needed to yield a single draw from the COM-Poisson distribution over 100,000 independent runs at each grid point. This illustrates that the sampler can achieve acceptance rates between 50% and 100% for ν > 1 (corresponding to a Poisson envelope distribution) and acceptance rates between 30% and 80% for ν < 1 (corresponding to a geometric envelope distribution). For ν < 1, the acceptance rate of the sampler increases as ν → 0. This occurs because the COM-Poisson distribution, with parameter ν = 0, is exactly a geometric distribution with parameter p = μ^ν, provided μ^ν < 1. Thus the geometric proposal matches the COM-Poisson distribution closely for small ν.

Calculate B^{[ν<1]}_{f/g} using (10) and set α = (μ^y/y!)^ν once, and then use these as inputs to the algorithm.
In Figure 3 we compare the efficiency of our COM-Poisson rejection sampler to the rejection sampler of Chanialidis et al. (2018). The sampler developed in Chanialidis et al. (2018) involves constructing a tight envelope distribution around the COM-Poisson distribution using a piecewise envelope distribution constructed from four truncated geometric distributions. This piecewise geometric distribution is then sampled using the inversion method for truncated geometric distributions. These truncated geometric dis- tributions are designed to match closely to the COM-Poisson density function, however the computational cost of constructing this envelope is significant and can impact on the CPU time. Our COM-Poisson sampler described in Algorithm 2 is much simpler in construction and requires only a single envelope component depending on the value of ν. Although the single envelope can result in a moderately higher rejection rate compared to Chanialidis et al. (2018) and as illustrated in Figure 2, we describe an experiment to illustrate that a potentially higher rejection rate is offset, in CPU time, by the simplicity of constructing and sampling from the single envelope distribution. For example, spending time searching for p * may improve the acceptance rate; but this search is costly and will reduce the number of effective draws per unit time.
In this experiment we compare the computational run time of the sampler in Algorithm 2 to that presented in Chanialidis et al. (2018) over a wide range of μ and ν values. To do this, a 128 × 128 grid of equally-spaced μ ∈ [1, 25] and ν ∈ [0.01, 10.0] values was constructed. At each site 2500 values were drawn from each sampler and the average time taken to draw from each sampler was computed. The plot in Figure 3 illustrates that Algorithm 2 is between 1.01 and 8.01 times faster. From this one can observe that our sampler represents a significant speedup when used in the context of an MCMC algorithm, as outlined in Section 5.2. This analysis highlights an important aspect of rejection sampling: the choice of an efficient envelope must be considered simultaneously with the computational cost of constructing and sampling from that envelope.
The speedup of our algorithm in Figure 3 is notably reduced for μ ≥ 10. The reason for this reduction is a subtle artefact of computer generation of Poisson samples for large μ. As discussed in Ahrens and Dieter (1982), computer implementations of Poisson samplers generally switch to a different sampler for values of μ ≥ 10. Further work could examine methods to overcome this restriction in order to improve our sampler in this region. Moreover, the results of the experiment in Figure 3 can be viewed as a lower bound on the potential speedup of our algorithm.
Another benefit of rejection sampling algorithms is that they can be implemented in parallel thus speeding up the computation, however care is needed when implementing parallel random number generators with potential issues discussed in Mascagni and Srinivasan (2000). Algorithm 2 gives the pseudocode for the COM-Poisson rejection sampler.

Unbiased likelihood estimation by monitoring rejection sampling efficiency
Note: In this section we reintroduce the observation index i to discuss rejection sampling at different parameters θ_i; each θ_i necessarily introduces indexed tractable and intractable sampling bounds and envelope parameters, which we denote as B_{f/g,i}, M_{f/g,i} and γ_i respectively.
Literature on rejection sampling has explored some methods for recycling the rejected proposals from the envelope distribution. Casella and Robert (1996) consider reducing the variance of rejection sampling estimates by incorporating the rejected samples in a post-processing step using the Rao-Blackwell theorem. Rao et al. (2016) use the rejected proposals from a rejection sampler in a different way, by considering the joint distribution of rejected and accepted draws, leading to a method to perform doubly-intractable inference in the case where the likelihood has an associated rejection sampler. Our approach considers yet another use for the rejected proposals by examining the relationship of rejected proposals with the efficiency of the rejection sampler. The efficiency is used to provide an unbiased estimator of the reciprocal normalising constant 1/Z_f(θ_i) for each parameter θ_i. An unbiased estimator of the reciprocal normalising constant at each θ_i will lead directly to an unbiased estimator of the complete likelihood (2).
The efficiency of the rejection sampler is critical to the overall performance of the exchange algorithm. The efficiency is defined through the total number of draws N_1 required from the envelope distribution until the first acceptance of a draw from f(y|θ_i), or in general the total number of draws N_r required until r acceptances from f(y|θ_i). The number of draws reflects how closely the envelope distribution matches the target distribution. The ideal envelope distribution would exactly match the target distribution and reject no proposals; however, in this scenario we could simply circumvent rejection sampling and sample from the envelope distribution directly. In Algorithm 1, run at parameter θ_i, each of the samples proposed from the envelope is followed by a Bernoulli trial with conditional acceptance probability α_i(y*) = f(y*|θ_i) / (M_{f/g,i} g(y*|γ_i)). To remove the conditional dependence on y*, consider the unconditional acceptance probability ᾱ_i over all envelope proposals, obtained by averaging α_i(y*) with respect to the envelope,

ᾱ_i = Σ_{y*} α_i(y*) g(y*|γ_i) = (1/M_{f/g,i}) Σ_{y*} f(y*|θ_i) = 1/M_{f/g,i}.

It is now obvious that the unconditional acceptance probability, and from this the rejection sampler efficiency, are controlled directly by the magnitude of the intractable bound M_{f/g,i}. On average, one expects to accept a proportion 1/M_{f/g,i} of proposals from g(·|γ_i) as being from f(·|θ_i) and to discard the remainder. Equivalently, the total number of draws until the first accepted draw follows a geometric distribution with success probability 1/M_{f/g,i} and mean M_{f/g,i}, i.e. N_1 ∼ Geometric(1/M_{f/g,i}). In general, the total number of draws N_r until r acceptances follows a negative binomial distribution with mean rM_{f/g,i}. A natural method to estimate M_{f/g,i} is to run rejection sampling at the parameter θ_i and record the observed total number of draws n_{r,i} required for r acceptances. 
Then an unbiased estimate of M_{f/g,i} based on r acceptances is given as

M̂^{(r)}_{f/g,i} = n_{r,i} / r.  (16)

By the central limit theorem, increasing r will lead to lower variance estimates of M_{f/g,i}, since M̂^{(r)}_{f/g,i} can be viewed as the average of r independent geometric trials, each having variance M_{f/g,i}(M_{f/g,i} − 1). To use M̂^{(r)}_{f/g,i} to give an unbiased estimate of the likelihood, we make the final assumption that Z_g(γ_i) is known, i.e. the envelope distribution is tractable. Then by rewriting (8), a link emerges between the inverse normalising constants of g(·|γ_i) and f(·|θ_i) as

1/Z_f(θ_i) = M_{f/g,i} / (B_{f/g,i} Z_g(γ_i)).

By replacing M_{f/g,i} with its unbiased estimate M̂^{(r)}_{f/g,i} and multiplying both sides by q_f(y_i|θ_i), an unbiased estimator of the likelihood for observation y_i is given as

f̂(y_i|θ_i) = q_f(y_i|θ_i) M̂^{(r)}_{f/g,i} / (B_{f/g,i} Z_g(γ_i)),  (17)

with all terms on the right-hand side (RHS) of (17) known. The estimate of the complete likelihood follows as

f̂(y_{1:n}|θ_{1:n}) = Π_{i=1}^{n} f̂(y_i|θ_i),  (18)

by estimating M̂^{(r)}_{f/g,i} at each θ_i. We point out here that the idea to use the acceptance probability of a rejection sampling algorithm to develop an unbiased estimate of the intractable normalising constant has previously been explored, in the context of inference for stochastic differential equations; we refer the reader to Section 5 of Beskos et al. (2006). This unbiased estimator of the intractable likelihood is very useful, especially for model selection, as it can be used to construct an estimate of the Bayesian information criterion (BIC) (Schwarz, 1978). Bayesian model choice is a challenging problem for models with intractable likelihood functions, since the intractability prevents calculation of the likelihood, which is required for most model selection criteria. The BIC estimate from the unbiased likelihood estimator (18) is

BIC = k log(n) − 2 log f̂(y_{1:n}|θ̂_{1:n}),  (19)

where k is the number of model parameters and θ̂_{1:n} maximises the unbiased estimate. The variance of this BIC estimate depends on r, and we find in experiments that r needs to be high (> 1000) to achieve an accurate estimate of the BIC. 
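The estimator can be sketched generically: run the sampler until r acceptances, record the number of envelope proposals, and rescale. Here try_once is an illustrative stand-in for one Bernoulli trial of Algorithm 1, and all names are ours.

```python
import random

def estimate_M(try_once, r, rng=random):
    """Unbiased estimate M^(r) = n_r / r of the intractable bound M_{f/g}:
    n_r is the total number of envelope proposals needed for r acceptances,
    a sum of r independent Geometric(1/M) waiting times."""
    n_r = accepts = 0
    while accepts < r:
        n_r += 1
        if try_once(rng):
            accepts += 1
    return n_r / r

def unbiased_likelihood(q_f_y, M_hat, B, Z_g):
    """Unbiased likelihood estimate (17): since 1/Z_f = M_{f/g} / (B_{f/g} Z_g),
    f_hat(y) = q_f(y) * M_hat / (B * Z_g)."""
    return q_f_y * M_hat / (B * Z_g)
```

Because q_f, B and Z_g are fixed constants, the unbiasedness of M̂ carries over directly to the likelihood estimate.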
We note that an interesting future research question is to understand the statistical properties, such as consistency, of the maximum likelihood estimator θ̂_{1:n}. A useful starting point might be to consider the literature where this issue has been explored for different intractable likelihood problems, including Geyer and Thompson (1992) and Beskos et al. (2006).
It may also be possible to use the unbiased likelihood estimator (17) as part of other Bayesian model selection frameworks, for example, in the calculation of the marginal likelihood (or evidence) and Bayes factors. This could be a focus of future work. In addition to BIC estimation we consider plugging the unbiased estimator directly into the intractable acceptance ratio (4) which is the basis of pseudo-marginal MCMC algorithms discussed in the next section.
Finally, we note that Chanialidis et al. (2018) presented results comparing models based on the deviance information criterion (DIC) (Spiegelhalter et al., 2002), without explaining how they overcame the need to evaluate the intractable likelihood for each draw from the MCMC sample. Following personal communication with the authors, it turns out that they approximated the normalising constant, $Z_f(\mu, \nu)$, using a truncated summation involving $k_2 - k_1 + 1$ terms, for positive integers $k_1 < k_2$, as follows:
$$\hat{Z}_f(\mu, \nu) = \sum_{y=k_1}^{k_2} \left(\frac{\mu^y}{y!}\right)^{\nu}. \qquad (20)$$
Here $k_1$ and $k_2$ are selected to satisfy the inequality $k_1 < \mu < k_2$, where, as before, $\mu$ is the mode of the COM-Poisson distribution. In turn, they approximated the likelihood for each draw from the sample generated by their MCMC algorithm by
$$\hat{f}(y_{1:n}|\mu_{1:n}, \nu_{1:n}) = \prod_{i=1}^{n} \frac{q_f(y_i|\mu_i, \nu_i)}{\hat{Z}_f(\mu_i, \nu_i)}, \qquad (21)$$
which is then used, for example, to estimate the posterior expected (deviance) log-likelihood required for calculation of the DIC. Of course, this introduces an approximation of the likelihood into the estimation of the DIC. This raises questions as to how many terms $k_2 - k_1 + 1$ are needed to obtain an accurate approximation of the likelihood, how to choose both $k_1$ and $k_2$, and, finally, the computational cost involved in evaluating the finite truncation for the $n$ terms in the product in (21). We return to this issue in Section 5.1, where we find that the number of terms required to obtain an accurate approximation of the likelihood depends considerably on the values of $\mu$ and $\nu$.
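A minimal sketch of the truncated sum (20), assuming the Guikema and Goffelt parameterisation $q_f(y|\mu,\nu) = (\mu^y/y!)^{\nu}$ (the function name is ours); evaluating each term on the log scale guards against overflow of $\mu^y$ and $y!$:

```python
from math import lgamma, log, exp

def z_truncated(mu, nu, k1, k2):
    # Truncated COM-Poisson normalising constant: sum over y = k1..k2
    # of (mu^y / y!)^nu, with each term evaluated on the log scale.
    total = 0.0
    for y in range(k1, k2 + 1):
        total += exp(nu * (y * log(mu) - lgamma(y + 1)))
    return total

# Sanity check: with nu = 1 the COM-Poisson reduces to Poisson(mu),
# whose normalising constant is exp(mu).
z = z_truncated(3.0, 1.0, 0, 100)
```

For $\nu = 1$ and a truncation point well past the mode, the result agrees with $e^{\mu}$ to machine precision, which is a quick way to validate an implementation.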

Pseudo-marginal MCMC with the intractable likelihood estimator
MCMC methods for Bayesian inference require the ability to evaluate the unnormalised posterior distribution. For doubly-intractable problems, where the posterior distribution cannot be evaluated in this way, there exist pseudo-marginal MCMC algorithms (Andrieu and Roberts, 2009) which require only an unbiased, positive estimate of the posterior distribution. Pseudo-marginal algorithms have been applied to a range of problems such as genetic modelling (Beaumont, 2003) and Markov jump processes (Georgoulas et al., 2017). The pseudo-marginal approach can be seen as an alternative to the exchange algorithm, which uses exact sampling in place of unbiased posterior estimators.
We now describe the pseudo-marginal framework and show how our unbiased estimator of the likelihood (18) can be used. Consider a target distribution $\pi(\theta)$ which cannot be evaluated pointwise, but assume that it is possible to construct an unbiased estimate $\hat{\pi}(\theta, z)$ of the target through the use of auxiliary variables $z \sim w(\cdot|\theta)$, $z \in \mathcal{Z}$.
Using $\hat{\pi}(\theta)$ in place of the true target in a Metropolis-Hastings algorithm leads to the approximate acceptance ratio
$$\min\left\{1, \frac{\hat{\pi}(\theta')\,h(\theta|\theta')}{\hat{\pi}(\theta)\,h(\theta'|\theta)}\right\}.$$
Pseudo-marginal MCMC algorithms can be viewed as Metropolis-Hastings algorithms over the joint parameter and auxiliary variable state space $\Theta \times \mathcal{Z}$. Andrieu and Roberts (2009) describe two alternative forms: the "Monte Carlo within Metropolis" or MCWM algorithm and the "grouped independence Metropolis-Hastings" or GIMH algorithm, with only the latter having exactly the correct target distribution $\pi(\theta)$. In MCWM, both the current estimate $\hat{\pi}(\theta)$ and the proposed estimate $\hat{\pi}(\theta')$ of the target are refreshed every iteration using new auxiliary draws $z$ and $z'$. In GIMH, only the proposed estimate $\hat{\pi}(\theta')$ is refreshed, while $\hat{\pi}(\theta)$ is fixed at the estimate computed when $\theta$ was last accepted.
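The GIMH recycling rule can be sketched as follows. All names here are ours; the mean-one log-normal noise is a stand-in for the variability of an unbiased likelihood estimator such as (18), and the target is a toy normal rather than a COM-Poisson posterior.

```python
import math
import random

def gimh(log_pi_hat, theta0, n_iter, step, rng):
    # GIMH: the noisy estimate attached to the current state is recycled
    # until a proposal is accepted; only the proposal's estimate is fresh.
    theta = theta0
    log_cur = log_pi_hat(theta, rng)
    draws = []
    for _ in range(n_iter):
        prop = theta + step * rng.gauss(0.0, 1.0)
        log_prop = log_pi_hat(prop, rng)  # fresh estimate for proposal only
        if math.log(rng.random()) < log_prop - log_cur:
            theta, log_cur = prop, log_prop  # carry the accepted estimate
        draws.append(theta)
    return draws

# Toy target N(2, 1), "estimated" with multiplicative log-normal noise
# whose mean is one on the natural scale, so the estimate is unbiased.
def log_pi_hat(theta, rng):
    sigma = 0.3
    return (-0.5 * (theta - 2.0) ** 2
            + sigma * rng.gauss(0.0, 1.0) - 0.5 * sigma ** 2)

rng = random.Random(7)
draws = gimh(log_pi_hat, 0.0, 20_000, 1.0, rng)
post_mean = sum(draws[5_000:]) / len(draws[5_000:])
```

Because the estimator is unbiased and positive, the chain targets the exact posterior, so `post_mean` converges to 2 up to Monte Carlo error; replacing the recycled `log_cur` with a fresh estimate each iteration would instead give the approximate MCWM algorithm.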
The difficulty with pseudo-marginal algorithms lies in constructing an unbiased estimate of the target. Using the rejection sampling unbiased estimator (18), the pseudo-marginal acceptance ratio for the $\beta \to \beta'$ move in the COM-Poisson model is
$$\min\left\{1, \frac{\hat{f}(y_{1:n}|\theta'_{1:n})\,p(\beta')\,h(\beta|\beta')}{\hat{f}(y_{1:n}|\theta_{1:n})\,p(\beta)\,h(\beta'|\beta)}\right\}.$$
This acceptance ratio is entirely tractable once the estimates $M^{(r)}_{f/g,i}$ have been computed by running rejection sampling at each $\theta'_i$ for the GIMH algorithm, or at both $\theta_i$ and $\theta'_i$ for the MCWM algorithm.

Results
We present two examples of COM-Poisson modelling. The first example is a retail inventory dataset from Shmueli et al. (2005) which contains no covariate information; it is a toy example that demonstrates how the BIC can be approximated using the unbiased likelihood estimate. The second example is a takeover bids dataset which contains covariates, which we model using a COM-Poisson regression model. This second example shows both the reduction in computation time offered by our sampler and the ability to choose between different regression models using the unbiased likelihood estimate.

Figure 4: Inventory data: Sales of a particular item of clothing from a well-known clothing brand. The sample mean sales is 3.56 with sample variance 11.31.

Inventory data
Inventory data, or stock count data, is of high importance to retailers as it facilitates decision-making regarding future stock levels. The counts we observe in these data are the quarterly sales counts of a particular item of clothing in each store across n = 3,168 retail stores. A frequency barplot of the data is shown in Figure 4. The maximum quarterly sales count of the clothing item was 30, observed in one store; no stores sold between 22 and 29 items, and many stores had 0 sales of the item. The sample mean sales is 3.56, the sample variance is 11.31 and the sample dispersion index is 3.18, so the data are overdispersed. These data were analysed by Shmueli et al. (2005) using a maximum likelihood approach.
Since these data contain no extra covariate information, we model μ and ν directly rather than considering them as functions of covariates. This could be viewed as using the link function with only an intercept, where μ = exp(β_{μ,0}) and ν = exp(β_{ν,0}), but we consider it simpler to model μ and ν directly. We place Gamma(1, 1) and Gamma(0.0625, 0.25) priors on μ and ν, respectively. We ran the exchange algorithm for 200,000 iterations and discarded the first 5,000 as burn-in; each parameter was updated using a single-site update. Table 1 presents posterior estimates for both parameters and a reasonable acceptance rate for the MCMC sampler. The BIC for the COM-Poisson model was estimated using (19) with r = 5,000, while the BIC for the Poisson distribution can be calculated exactly since its likelihood is tractable. The comparison of both models is shown in Table 2, where the COM-Poisson is shown to outperform the Poisson under BIC.

Comparing our method to an approximate (truncated) likelihood approach
We now compare our exact approach to carrying out MCMC with the true likelihood in the posterior distribution replaced by an approximate likelihood based on a normalising constant truncated to $m$ terms,
$$\hat{Z}_f(\mu, \nu) = \sum_{y=0}^{m-1} \left(\frac{\mu^y}{y!}\right)^{\nu}. \qquad (23)$$
Such a plug-in approximation of the likelihood means carrying out MCMC on an approximate posterior distribution. In turn, this raises the important issue of understanding the ergodicity of the resulting Markov chain, an issue which is attracting much interest in the literature, for example, the noisy MCMC approach of Alquier et al. (2016) and the Russian roulette approach of Lyne et al. (2015). An immediate question arising from using the approximation of the normalising constant in (23) would therefore be to understand the convergence properties of the resulting noisy MCMC algorithm, although this is beyond the scope of this paper. Of course, the exact sampling algorithm developed in this paper avoids any such approximation.
Instead, we carry out a small study to explore empirically the estimated posterior means and standard deviations arising from the noisy MCMC algorithm obtained by replacing the true likelihood with the approximate likelihood based on the truncated normalising constant. To do this, we again analyse the inventory dataset, experimenting with two choices for the number of terms m in the truncated normalising constant (23). In particular, we set m = 3,300, as this led to a computation time similar to one transition of the exact MCMC algorithm. Table 3 shows that the results of this experiment are in good agreement with the results of the exact MCMC algorithm detailed in Table 1. We also note that truncating the normalising constant to m = 100 terms yielded results identical to those for m = 3,300, with an 86% decrease in computation time relative to our exact sampler. However, these results disguise the fact that m = 100 or even m = 3,300 terms are not sufficient to give an accurate likelihood approximation in all regions of the parameter space. To investigate, we started the algorithm at the point (μ = 500, ν = 0.0001) and found that for both m = 100 and m = 3,300 the resulting noisy MCMC algorithms failed to converge, whereas our algorithm converged to the posterior estimates in Table 1. The failure to converge is due to an insufficient number of terms in the truncation, meaning the truncation is practical in some, but not all, areas of the parameter space. This has immediate implications when implementing the truncated normalising constant within an MCMC algorithm, as above, since the algorithm may visit areas of the parameter space for which the truncation is inaccurate. We also emphasise that one would not know the appropriate value of m to use in advance.

Table 3: Truncated normalising constant estimates: Posterior estimates obtained using a truncated normalising constant in the likelihood with m = 3,300 terms.
This is illustrated in Figure 5 where we explore the accuracy of (23) for different values of μ and ν.
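As a rough numerical illustration of this dependence (the function name and tolerance are ours, not the paper's), one can count how many terms of the series $(\mu^y/y!)^{\nu}$ must be summed before the remainder becomes negligible:

```python
from math import lgamma, log, exp

def terms_needed(mu, nu, rel_tol=1e-12):
    # Accumulate terms (mu^y / y!)^nu until we are past the mode
    # (approximately mu), where the terms decay, and the current term
    # is negligible relative to the running sum.
    total, y = 0.0, 0
    while True:
        term = exp(nu * (y * log(mu) - lgamma(y + 1)))
        total += term
        if y > mu and term < rel_tol * total:
            return y + 1
        y += 1

few = terms_needed(3.0, 1.0)    # near-Poisson case: a few dozen terms
many = terms_needed(30.0, 0.1)  # large mu, small nu: far more terms
```

For small μ and ν near 1 a short truncation suffices, but for large μ combined with small ν the terms decay very slowly past the mode, so the required number of terms grows sharply, consistent with the convergence failure at (μ = 500, ν = 0.0001) noted above.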

Takeover bids COM-Poisson regression model
We consider the dataset from Jaggia and Thosar (1993) on the number of bids received by U.S. firms that were targets of bid offers during the period 1978-85 and taken over within one year of the initial offer. The response variable is the count of takeover bids, excluding the initial bid (NUMBIDS), filed for the particular company. The available covariates capture firm-specific characteristics and defensive actions taken by the firm in response to its initial takeover bid, and are presented in Table 4. These data have previously been analysed using Poisson regression by Jaggia and Thosar (1993) and series expansions by Cameron and Johansson (1997). We revisit this analysis by comparing two Poisson regression models to three COM-Poisson regression models; the five models considered are listed in Table 5. We performed a preliminary variable selection by fitting a Poisson GLM to the data by maximum likelihood, to establish which predictors to consider in the Bayesian GLM analysis. The first and simplest model considers NUMBIDS in a Poisson GLM where the linear predictor of log(μ_i) contains an intercept and two predictors (BIDPREM and WHTKNGHT). Priors for β and ρ were set as N(0, 5^2). The exchange algorithm was run for 100,000 iterations, with a burn-in of 10,000 draws discarded. Each coefficient was updated using a single-site update with a random walk proposal, and the proposal variance was tuned to achieve an acceptance rate of 44% for each site. The posterior results for all models are presented in Table 6. The BIC was calculated using (19) with r set to 5,000 to give an accurate estimate. Model 5 was selected as the best model, with the lowest BIC of 386.40. All three COM-Poisson regression models outperform the Poisson models (Models 1 and 2), suggesting that including covariates in the dispersion link function is worthwhile.

Table 5 (column headings): Model; Response (y_i); Linear Predictor(s).

Comparing sampler efficiency
Following Section 3.2, we compare (in Table 7) the efficiency of Algorithm 2 to the rejection sampler of Chanialidis et al. (2018) in the context of the takeover bids dataset.
To do so, we examine the number of MCMC draws per second resulting from Algorithm 2 and from the rejection sampler of Chanialidis et al. (2018) for each of the COM-Poisson models (Models 3, 4 and 5) outlined in Table 5. Our sampler provides a more than threefold increase in the number of MCMC draws per second for each of the COM-Poisson models, in agreement with the experiment presented in Figure 3.
                                              Model 3   Model 4   Model 5
MCMC draws per sec. (Algorithm 2)               9,973    12,534     9,297
MCMC draws per sec. (Chanialidis et al. 2018)   2,866     3,585     2,828
Relative efficiency                             3.480     3.496     3.287

Table 7: Algorithm efficiency comparison: the number of MCMC draws per second for our rejection sampler (Algorithm 2) compared with the rejection sampler of Chanialidis et al. (2018). Our algorithm is at least 3 times faster for all 3 models.

Pseudo-marginal MCMC results
We choose Model 5 in Table 5 to implement the GIMH pseudo-marginal MCMC algorithm using the unbiased likelihood estimator (18) in the acceptance ratio. The unbiased likelihood estimator was calculated for five scenarios, with r = 1, 5, 10, 50 and 100. The posterior density estimates for each of the pseudo-marginal MCMC runs are shown in Table 8, along with the CPU time and multivariate effective sample size (mESS) (Vats et al., 2019). The mESS is a generalisation of the effective sample size (ESS) which captures the cross-correlation between parameters in the MCMC output,
$$\mathrm{mESS} = n_{\mathrm{mcmc}} \left( \frac{|\Lambda|}{|\Sigma|} \right)^{1/p},$$
where $n_{\mathrm{mcmc}}$ is the number of MCMC draws, $p$ is the number of parameters, $\Lambda$ is the sample covariance matrix of the MCMC output and $\Sigma$ is the asymptotic covariance matrix of the Monte Carlo estimate, which can be estimated by a batch-means method described in Vats et al. (2019). Figure 6 compares the pseudo-marginal (r = 100) posterior MCMC chain, density and autocorrelation with the exchange algorithm. There is close agreement between the two methods; however, the CPU time for the pseudo-marginal algorithm is dramatically increased.
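A sketch of the mESS computation using NumPy (the function name and batch count are our own choices), with $\Sigma$ estimated by batch means:

```python
import numpy as np

def mess(chain, n_batches=50):
    # mESS = n * (|Lambda| / |Sigma|)^(1/p): Lambda is the sample
    # covariance of the chain, Sigma the batch-means estimate of the
    # asymptotic covariance in the Markov chain CLT.
    n, p = chain.shape
    lam = np.cov(chain, rowvar=False)
    b = n // n_batches                       # batch length
    means = chain[: b * n_batches].reshape(n_batches, b, p).mean(axis=1)
    sigma = b * np.cov(means, rowvar=False)  # batch-means estimator
    _, logdet_lam = np.linalg.slogdet(lam)
    _, logdet_sig = np.linalg.slogdet(sigma)
    return n * np.exp((logdet_lam - logdet_sig) / p)

# For i.i.d. draws, mESS should be close to the number of draws.
rng = np.random.default_rng(0)
ess = mess(rng.normal(size=(10_000, 3)))
```

For an autocorrelated chain $|\Sigma|$ exceeds $|\Lambda|$ and the mESS falls below $n_{\mathrm{mcmc}}$, which is why it is a useful single-number summary for comparing the pseudo-marginal runs in Table 8.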

Discussion
This paper provides a new rejection sampler for the COM-Poisson distribution. This rejection sampler allows faster parameter inference in COM-Poisson regression models compared with the sampler of Chanialidis et al. (2018). Our rejection sampling algorithm shows that one must consider both the rejection rate of the enveloping distribution and the time taken to sample from it in order to construct an efficient rejection sampler. We have also shown how the number of rejected proposals within rejection sampling can be used to construct an unbiased estimator of an intractable likelihood. This estimator can be used to perform model selection using BIC, and future work could include this estimator within other Bayesian model choice strategies. Model selection for doubly-intractable problems such as COM-Poisson regression is a difficult problem, and our approach offers one solution. The unbiased likelihood estimator was shown to work well within a pseudo-marginal MCMC algorithm (Andrieu and Roberts, 2009), as it provides an unbiased estimate of the acceptance ratio. This unbiased likelihood estimator could be constructed for any other intractable model for which a rejection sampling algorithm exists to sample from the likelihood, presenting opportunities for future research.