Practical use of importance sampling for repeated MCMC in Poisson models

The Importance Sampling method is used as an alternative approach to MCMC in repeated Bayesian estimations. In the particular context of numerous data sets, MCMC algorithms have to be called several times, which may become computationally expensive. Since Importance Sampling requires a sample from a posterior distribution, our idea is to use MCMC to generate only a certain number of Markov chains and to reuse them in the subsequent IS estimations. For each Importance Sampling procedure, the suitable chain is selected by one of three criteria we present here. The first and second criteria are based on the L1 norm of the difference between two posterior distributions and on their Kullback-Leibler divergence, respectively. The third criterion results from minimizing the variance of the IS estimate. A supplementary automatic selection procedure is also proposed to choose those posteriors for which Markov chains will be generated, and to avoid an arbitrary choice of importance functions.

Introduction
The Bayesian approach allows the introduction of prior knowledge of the parameter via a prior probability distribution. The Bayesian idea consists of combining two sources of information on the parameter, one from the data through a likelihood function, and the other implied by the prior. The result is the posterior distribution of the parameter, which is a conditional distribution given the data. In complex models, an analytical solution of the posterior distribution and its descriptive statistics are generally not available. In this case, approximation methods are used which are based on stochastic MCMC algorithms such as the Hastings-Metropolis algorithm (Hastings, 1970) or the Gibbs sampler (Geman and Geman, 1984). These algorithms generate a Markov chain whose stationary distribution is the posterior distribution. The ergodic theorem guarantees that empirical averages provide good approximations, on condition that the algorithm achieves convergence, which often requires a huge number of iterations. When new models are studied, performances of estimates can be evaluated in a simulation study. For example, traditionally, in order to assess qualities such as bias or quadratic errors of estimates, different data sets are simulated for different parameter settings; relevant priors are then set on these parameters and posterior estimates computed for all data sets using an MCMC approach. Unfortunately, the time required to perform this computation is usually prohibitive. An alternative solution is to apply the Importance Sampling (IS) method, by which we can estimate an expectation under one probability distribution using a sample from a different distribution which is easier to simulate. The simultaneous use of IS and MCMC was previously proposed by Geyer and Thompson (1992) in likelihood computation, by Gelfand, Dey and Chang (1992) in cross-validation and model determination, and more recently in the Population Monte Carlo approach (Cappé et al., 2004; Douc et al., 2007a,b; Cappé et al., 2007). A global use of IS in repeated MCMC is described in McVinish et al. (2008), where IS is used among other techniques in simulation studies and its asymptotic efficacy is discussed.
Importance Sampling has to be done under conditions where both densities are close and the support of the sampling distribution covers the support of the target one. When simulation studies are performed, the posterior distribution is conditional on different data sets which have been simulated under the same model. Theoretically, these posteriors should resemble one another, as should the data sets. A first idea is to run an MCMC algorithm on the first data set and then apply IS to obtain estimates for the other distributions, using the first posterior distribution as the importance function. This approach can be compared with the predictive distribution estimations of Gelfand, Dey and Chang (1992), where a single choice of importance sampling distribution is taken into account. Nevertheless, in a more general case, a posterior distribution conditional on one data set could lead to poor approximations (when used as an importance function in the IS method) for the other data sets. This is why the choice of sampling distribution may depend on the data set. We therefore propose three criteria for choosing the posterior sampling distribution.
The objective of this paper is to assess and improve the efficacy of IS in relation to the MCMC method as an easier and faster way of Bayesian estimation. The method is illustrated on Poisson models, and the data sets used in this study are simulated for different values of the parameters. The parameters were estimated for each data set by both MCMC and the IS method, and the results were compared through the mean square errors.
In section two we present the concept of the IS method and define the context in which we use it. We then propose three criteria for choosing the best sampling distribution for the IS approximation among a set of preselected posterior distributions. In the third section we present results of the discussed estimation methods applied to different Poisson models. A selection procedure is presented in the fourth section to obtain the set of preselected posterior distributions automatically. Conclusions are reported in the final section.

Methods
Suppose data X are described by a probability model (π(x|θ), θ ∈ Θ). The prior distribution on the parameter θ (which can be multidimensional) is denoted by π(θ) and the posterior distribution by π(θ|x). We are equally interested in some descriptive statistics of this posterior distribution, such as the posterior mean, posterior standard deviation, posterior quantiles (credibility interval), etc. Unfortunately, a closed-form solution for the posterior is very often not available, and nor are its descriptive statistics. In the Bayesian context the standard solution for dealing with this problem is provided by MCMC methods (see Gilks, Richardson and Spiegelhalter, 1996). For a given data set X and a given prior distribution on θ, an MCMC algorithm generates a Markov chain θ_1, ..., θ_N whose stationary distribution is π(θ|x). The ergodic theorem ensures that the expected value E_{θ|X}[g(θ)] of any integrable function g with respect to the posterior distribution of θ can be approximated almost surely by (1/N) Σ_{i=1}^N g(θ_i). According to the choice of the function g, we can approximate different descriptive statistics of the posterior distribution.
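The ergodic-average approximation can be sketched with a random-walk Metropolis sampler for the conjugate Poisson-Gamma model used later in the paper. This is a minimal illustrative sketch, not the Gibbs sampler actually used in the applications; the step size, iteration counts and seed are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_post(lam, x, alpha=0.01, beta=0.01):
    """Unnormalized log posterior of a Poisson rate lam under a G(alpha, beta) prior."""
    if lam <= 0:
        return -np.inf
    return (x.sum() + alpha - 1) * np.log(lam) - (len(x) + beta) * lam

def metropolis(x, n_iter=20000, step=0.5, lam0=1.0):
    """Random-walk Metropolis chain whose stationary distribution is pi(lam | x)."""
    chain = np.empty(n_iter)
    lam, lp = lam0, log_post(lam0, x)
    for i in range(n_iter):
        prop = lam + step * rng.normal()
        lp_prop = log_post(prop, x)
        if np.log(rng.uniform()) < lp_prop - lp:   # accept/reject step
            lam, lp = prop, lp_prop
        chain[i] = lam
    return chain

x = rng.poisson(5.0, size=20)       # one simulated data set
chain = metropolis(x)[5000:]        # discard burn-in
post_mean = chain.mean()            # ergodic average approximating E[lam | x]
```

Since this toy posterior is G(α + Σx_i, β + n) in closed form, the ergodic average can be checked against the exact posterior mean (α + Σx_i)/(β + n).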
In the case of many repeated Bayesian estimations (e.g. while testing a new model's properties by simulating many data sets and estimating the model's parameters afterwards), MCMC has to be used for each sample, which is often expensive and time-consuming. When carrying out an empirical study of estimates, data are generally simulated from the model π(x|θ) under different parameterization schemes. Parameters (or functions of the parameters) are then estimated for each simulated data set. Since we are interested in Bayesian estimation and the posterior does not have an analytical solution, we can approximate it via MCMC. Consider L parameterization schemes and, for each, K simulated samples; the MCMC algorithm must then be run L×K times, thus extending the time taken for the estimations. In order to speed up the computation of posterior expectations, we propose two strategies applying the Importance Sampling method. Globally, these methods require that the MCMC algorithm is used only on a subset of samples. As the IS theoretical results are based on the assumption that MCMC samples are from the posterior distribution, we assume that the convergence of the algorithms is carefully checked. In section 3, different diagnostics are proposed for our applications.

Importance Sampling: fixed strategy
Importance Sampling (IS) is a well-known estimation method. We are particularly interested in using it to approximate expected values under posterior distributions in order to reduce the number of times MCMC is used. We present the concept of IS on the example of the estimation of E_{θ|X^(k)}[g(θ)] using two data sets X^(k) and X^(m) drawn from the same model π(x|θ) (with the same model parameterization). The expected value E_{θ|X^(k)}[g(θ)] is given by the integral formula, and the Importance Sampling method relies on the following transformation under the integral sign:

E_{θ|X^(k)}[g(θ)] = ∫ g(θ) π(θ|X^(k)) dθ = ∫ g(θ) [π(θ|X^(k)) / π(θ|X^(m))] π(θ|X^(m)) dθ,   (1)

with π(θ|X^(k)) = 0 whenever π(θ|X^(m)) = 0. The posterior distribution π(θ|X^(m)) is called the importance function in the IS procedure (right-hand term of (1)). Then, if we already have an MCMC sample θ_1, ..., θ_N of length N from the posterior distribution π(θ|X^(m)), the approximation of the expected value by the ergodic theorem (Theorem 3, Tierney, 1994) is:

(1/N) Σ_{i=1}^N g(θ_i) π(θ_i|X^(k)) / π(θ_i|X^(m)).   (2)

If there is no closed-form solution for the posterior distributions, the approximation (2) can be replaced by another one which is easily calculable with normalized weights:

Σ_{i=1}^N g(θ_i) w̃_i(k, m),   (3)

where w̃_i(k, m) = u(θ_i) / Σ_{j=1}^N u(θ_j) and u(θ) = π(X^(k)|θ) / π(X^(m)|θ), the unknown normalizing constants of the posteriors cancelling in the ratio. The IS estimator with normalized weights given in (3) is also consistent for E_{θ|X^(k)}[g(θ)] as a consequence of the strong law of large numbers. Geweke (1989) discusses the IS estimator with normalized weights for the case of iid samples, which can be extended to the case of Markovian samples as a result of Markov chain theory (see, for example, Tierney, 1994). Moreover, Geweke (1989) gives conditions for the central limit theorem to hold for the IS estimator with normalized weights, which can also be extended to Markovian samples. Indeed, in the discussion of Tierney (1994), Doss (1994) proves that for a stationary and uniformly ergodic chain for which the conditions E_{π(θ|X^(m))}[(g(θ)w(θ))²] < ∞ and E_{π(θ|X^(m))}[(w(θ))²] < ∞ are satisfied, an IS estimator with normalized weights has a normal asymptotic distribution. As discussed in Geweke (1989), the limiting distribution assessment may give us some indications of the accuracy of the estimates. The chain used in further calculations needs to be ergodic with the posterior distribution π(θ|X^(m)) as its equilibrium distribution. Consequently we can easily approximate E_{θ|X^(k)}[g(θ)] for all k = 1, ..., m−1, m+1, ..., K, systematically using the MCMC sample from π(θ|X^(m)). We call this first strategy the "fixed strategy" because m is fixed and unique over all IS estimations of E_{θ|X^(k)}[g(θ)] for k ≠ m. In practice this means that it is enough to run MCMC only once to obtain K estimations of E_{θ|X^(k)}[g(θ)] for k = 1, ..., K. Without loss of generality, m is fixed to 1.
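For the toy Poisson model, the fixed strategy can be sketched as follows. As an illustrative assumption we exploit conjugacy and draw the "MCMC" sample exactly from the Gamma posterior of the first data set; the normalized weights then only need the likelihood ratio u(θ) = π(X^(k)|θ)/π(X^(1)|θ), since the prior and the normalizing constants cancel. Function names and settings are ours, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(1)
ALPHA = BETA = 0.01

def log_lik(lam, x):
    """Poisson log-likelihood, up to terms free of lam."""
    return x.sum() * np.log(lam) - len(x) * lam

def is_estimate(chain, x_target, x_source, g=lambda t: t):
    """Self-normalized IS estimate of E[g(lam) | x_target], reusing a chain
    drawn under pi(lam | x_source); weights are likelihood ratios."""
    logw = log_lik(chain, x_target) - log_lik(chain, x_source)
    w = np.exp(logw - logw.max())      # subtract the max for numerical stability
    w /= w.sum()                       # normalized weights
    return float(np.sum(w * g(chain)))

x1 = rng.poisson(5.0, size=20)         # data set m = 1: its posterior is sampled once
x2 = rng.poisson(5.0, size=20)         # data set k: estimated by IS only
chain = rng.gamma(ALPHA + x1.sum(), 1.0 / (BETA + len(x1)), size=50000)
est = is_estimate(chain, x2, x1)       # approximates E[lam | x2] with no new sampling
```

In the fixed strategy this single chain is reused for every k; the quality of the estimate degrades as π(θ|X^(k)) moves away from π(θ|X^(1)).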
Even if the data sets are simulated under the same model, considerable variability between samples can appear, and thus the posteriors may be far apart. For this reason we propose a second strategy to improve the IS method by choosing a different importance function for each estimation.

Importance Sampling: modulated strategy
The basic idea is to use an MCMC algorithm on M preselected simulated data sets (M < K) and to apply the IS procedure to the remaining data, where, for each data set, the importance function is one of the M preselected posterior distributions approximated by MCMC. We denote this second strategy the "modulated strategy." M has to be relatively small with regard to K in order to ensure a gain in computing time. In our examples, M is equal to 10 and K is equal to 100. The difficulty of this strategy is to establish relevant criteria for choosing the most adequate preselected posterior distribution for the IS procedure. We propose and compare three criteria, which are detailed further. More precisely, this second strategy can be formulated as follows.
Suppose that M simulated data sets (X^(m), m = 1, ..., M) have been preselected from all data sets and that M Markov chains (θ_1^(m), ..., θ_N^(m)), m = 1, ..., M, have been produced by MCMC under (π(θ|X^(m)), m = 1, ..., M) respectively. We now want to approximate E_{θ|X^(k)}[g(θ)] for k = M+1, ..., K via IS. In the modulated strategy, IS estimations are based on the choice of m_k ∈ {1, ..., M}, so that π(θ|X^(m_k)) becomes the importance function, which may change from one estimation to another within the M possible choices. Following the same reasoning as before, the approximation is

Σ_{i=1}^N g(θ_i^(m_k)) w̃_i(k, m_k),   (4)

which depends on the choice of m_k.
For M > 1, we propose three criteria for choosing m_k ∈ {1, ..., M} for each k = M+1, ..., K. The choice of importance function in the IS procedure is discussed in the literature with different solutions. For instance, solutions are proposed as an adaptation procedure by Cappé et al. (2007), where this function is based on the minimization of a deviance criterion, or as in Douc et al. (2007b) via minimization of the asymptotic variance of IS. Let the IS estimation be as in equation (4). We expect that the IS method will provide a good estimation if π(θ|X^(m_k)) is "close" to the target distribution π(θ|X^(k)). It seems natural then to choose the m_k for which the L1 norm of the difference between these two posterior distributions, ||π(θ|X^(k)) − π(θ|X^(m_k))||_{L1}, is the smallest. The first criterion therefore chooses

m_k = argmin_{m ∈ {1,...,M}} ||π(θ|X^(k)) − π(θ|X^(m))||_{L1},

an empirical approximation of this norm being used when the closed form of the posteriors is not available. A major concern in the context of our study is verifying the support condition needed to apply the importance sampling estimator, as discussed in section 2.4.
To handle this difficulty we may check the tail behavior of both target and sampling distributions, since we want the sampling distribution to be flatter. To control tail behaviors, many authors suggest basing the choice of IS density on the minimization of a Kullback-Leibler divergence (see, for example, Asmussen, Kroese and Rubinstein (2005); Chen and Shao (1997); Rubinstein and Kroese (2004); and Gustafson and Wasserman (1995)). Following this concept, the second criterion chooses the m_k corresponding to the IS distribution for which the Kullback-Leibler divergence KL(π(θ|X^(k)), π(θ|X^(m_k))) is the smallest, an empirical counterpart again being used when the closed form of the posteriors is not available. The third criterion is based on the variance of g(θ) π(θ|X^(k)) / π(θ|X^(m_k)). We would like to choose m_k ∈ {1, ..., M} satisfying the inequality

Var_{π(θ|X^(m_k))}[ g(θ) π(θ|X^(k)) / π(θ|X^(m_k)) ] ≤ Var_{π(θ|X^(k))}[g(θ)],

corresponding to the variance of the IS estimate on the left and the variance of the MCMC estimate on the right. As the choice of π(θ|X^(m_k)) is limited to the M preselected data sets, it is not certain that the above inequality will be satisfied for any of these densities.
Nevertheless, provided E_{π(θ|X^(m_k))}[(g(θ)w(θ))²] is finite, we can find the m_k for which the variance of the IS estimate is smallest. It can be shown (see Robert, 2007, section 6.2.2) that to minimize this variance, π(θ|X^(m_k)) should make the ratio g(θ) π(θ|X^(k)) / π(θ|X^(m_k)) the most stable. Thus, for each k = M+1, ..., K, the third criterion chooses the m_k which minimizes the empirical variance of g(θ_i) u(θ_i) along the chain, with u(θ) = π(X^(k)|θ) / π(X^(m_k)|θ) used when the posterior distributions cannot be calculated. Note that this third criterion depends on the function g and thus must be computed for each function g; this is not the case for the two previous criteria.
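For the conjugate toy model, where the posteriors are Gamma densities, the three selection criteria can be sketched numerically: the L1 distance and Kullback-Leibler divergence are computed between closed-form posteriors on a grid, and the third criterion uses the empirical variance of g(θ)u(θ) along each preselected chain. Grid bounds, sample sizes and helper names are our own illustrative assumptions, not the paper's code.

```python
import numpy as np
from math import lgamma

rng = np.random.default_rng(2)
ALPHA = BETA = 0.01

def log_pdf(grid, x):
    """Log of the Gamma(ALPHA + sum x, BETA + n) posterior density (conjugate model)."""
    a, b = ALPHA + x.sum(), BETA + len(x)
    return a * np.log(b) - lgamma(a) + (a - 1) * np.log(grid) - b * grid

def choose_m(x_k, preselected, grid, g=lambda t: t, n_chain=20000):
    """Return the index chosen by each of the three criteria."""
    dx = grid[1] - grid[0]                 # uniform grid spacing
    lp_k = log_pdf(grid, x_k)
    p_k = np.exp(lp_k)
    crit1, crit2, crit3 = [], [], []
    for x_m in preselected:
        lp_m = log_pdf(grid, x_m)
        crit1.append(float(np.abs(p_k - np.exp(lp_m)).sum() * dx))   # L1 distance
        crit2.append(float((p_k * (lp_k - lp_m)).sum() * dx))        # KL divergence
        a, b = ALPHA + x_m.sum(), BETA + len(x_m)
        chain = rng.gamma(a, 1.0 / b, size=n_chain)                  # sample under m
        u = np.exp(log_pdf(chain, x_k) - log_pdf(chain, x_m))        # weight ratio
        crit3.append(float(np.var(g(chain) * u)))                    # variance criterion
    return int(np.argmin(crit1)), int(np.argmin(crit2)), int(np.argmin(crit3))

x_k = rng.poisson(5.0, size=20)
preselected = [rng.poisson(lam, size=20) for lam in (5, 5, 5, 20, 20)]
m1, m2, m3 = choose_m(x_k, preselected, grid=np.linspace(0.5, 30.0, 4000))
```

With the target simulated under λ = 5, the first two criteria select one of the preselected data sets drawn under the same λ rather than under λ = 20.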

Comparison of different strategies
In a simulation study, to judge the performance of one estimation method against another, we calculate the mean square error for each one, which by definition is the mean square difference between the estimations and the true expected value. However, when there is no analytical expression of E_{θ|X^(k)}[g(θ)], comparison with the true posterior mean is not possible. We therefore propose to compare the IS estimate directly with the MCMC estimate, as MCMC is the traditional procedure and can thus be considered as the reference. Let us denote by MCMC^(k) the estimate of E_{θ|X^(k)}[g(θ)] obtained using the traditional MCMC procedure, and by IS^(k)/(m) the estimate obtained using IS with importance function π(θ|X^(m)), as in (3). We thus define the mean square errors as the average of (IS^(k)/(1) − MCMC^(k))² over k ≠ 1 for the "fixed strategy" (MSE_fs), and the average of (IS^(k)/(m_k) − MCMC^(k))² over k = M+1, ..., K for the "modulated strategy" (MSE_ms). In this second case, the mean square errors will be evaluated for the three criteria, allowing us to assess the effect of the criteria on the IS estimates. In fact, this mean square error can be considered as a distance between the IS and MCMC methods.
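Both mean square errors are the same computation applied to different pairings of estimates; a minimal helper (the name and signature are ours) makes this explicit.

```python
import numpy as np

def mse(is_estimates, reference_estimates):
    """Mean square 'distance' between IS estimates and reference (MCMC) estimates,
    used for both the fixed strategy (m fixed to 1) and the modulated strategy
    (m_k chosen per data set)."""
    d = np.asarray(is_estimates, dtype=float) - np.asarray(reference_estimates, dtype=float)
    return float(np.mean(d ** 2))
```

For the fixed strategy the first argument collects the IS estimates obtained with the single importance function over all remaining data sets; for the modulated strategy it collects the estimates obtained with the criterion-selected importance functions.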
As each procedure (MCMC or IS) corresponds to an approximation, we also calculate the mean square errors of the MCMC and IS procedures with respect to the values fixed in the simulation scheme.
The choice of M preselected simulated data for the three criteria could appear arbitrary and not necessarily optimal.We therefore propose a procedure allowing "automatic" selection.This point will be described in section 4. Convergence of MCMC algorithms was checked systematically using different tools.

Importance Sampling performances
Generally speaking, the use of IS is justified if the support of the target posterior distribution is included in the support of the importance function. In our study, the target density is π(θ|X^(k)) and the importance function is the posterior density π(θ|X^(m)) for the fixed strategy or π(θ|X^(m_k)) for the modulated strategy. This problem of support is particularly delicate and crucial when the support of the posterior distribution depends on parameter values. A simple example of this case is when data are assumed to be uniform on an interval determined by parameters, which implies that the parameter space of the posterior depends entirely on the data. GLM models do not correspond to this situation. Apart from these cases of links between parameter space and data, the difficulty appears when, in some part of the support, the sampling distribution π(θ|X^(m_k)) tends to zero faster than the target distribution π(θ|X^(k)). Then the weights involved in the IS procedure are very unstable, with possibly huge values. The accuracy of estimates obtained by IS depends a great deal on weight behavior. Hence, the support assumption may be ensured in a way by the control of the tail behavior of the weights. Different tools exist in the literature to detect weight instability. For instance, Geweke (1989) proposed two diagnostics for computational accuracy: the first, based on order statistics of the weights, is able to indicate thin tails in the IS density relative to the posterior, and the second, relative numerical efficiency (RNE), is the ratio between the number of iterations used in IS and the number of iterations used in traditional MCMC to give the same numerical standard error. McVinish et al. (2008) proposed a statistic ∆(X^(k), X^(m_k)) to control the variance of the weights when the posterior distributions π(θ|X^(k)) and π(θ|X^(m_k)) are approximated by Gaussian distributions. The greater the value of ∆(X^(k), X^(m_k)), the bigger the variability of the weights. Under this Gaussian approximation, we can show that the Kullback-Leibler divergence is proportional to this statistic ∆(X^(k), X^(m_k)). More precisely, with the same notations as McVinish et al. (2008), π(θ|X^(k)) and π(θ|X^(m_k)) are approximated by Gaussians centered at θ̂_k and θ̂_{m_k} respectively, with asymptotic covariance matrices J_k and J_{m_k} proportional to a fixed positive definite matrix I_0; the authors derive the asymptotic variance of the weights as an expression involving J_k, J_{m_k} and the dimension r of the parameter space. If we assume that J_k = J_{m_k} (sample sizes of X^(k) and X^(m_k) are implicitly assumed to be equal), this variance reduces to an increasing function of ∆(X^(k), X^(m_k)), which is exactly the statistic controlling the variability of the weights in McVinish et al. (2008). If X^(k) and X^(m_k) are mutually independent and if θ̂_k and θ̂_{m_k} are maximum likelihood estimators, then ∆(X^(k), X^(m)) converges in distribution to 2χ²_r. This remark argues in favor of the choice of the Kullback-Leibler divergence as a criterion for the modulated strategy, because it minimizes the statistic ∆(X^(k), X^(m_k)). A critical situation for the modulated strategy would be when these computational accuracy diagnostics give poor indications for all M preselected data sets. Note that this situation occurs very rarely, especially because, as mentioned by Asmussen, Kroese and Rubinstein (2005), the importance function is chosen from the same parametric family as the target density, as happens in our case. If this critical situation exists for a specific π(θ|X^(k)), one solution is to add the data X^(k) to the preselected data set (containing then M + 1 elements); this can be done without any difficulty.
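In the equal-covariance Gaussian case the link between the weight variance, the Kullback-Leibler divergence and ∆ can be made concrete: for w = N(θ̂_k, J)/N(θ̂_{m_k}, J) evaluated under N(θ̂_{m_k}, J), a direct calculation gives E[w] = 1, Var(w) = exp(∆) − 1 and KL = ∆/2, with ∆ = (θ̂_k − θ̂_{m_k})' J⁻¹ (θ̂_k − θ̂_{m_k}). The sketch below is our own derivation of this equal-covariance case, not the exact expressions of McVinish et al. (2008).

```python
import numpy as np

def delta_stat(mu_k, mu_m, cov):
    """Delta = (mu_k - mu_m)' cov^{-1} (mu_k - mu_m), equal-covariance case."""
    d = np.asarray(mu_k, dtype=float) - np.asarray(mu_m, dtype=float)
    return float(d @ np.linalg.solve(np.asarray(cov, dtype=float), d))

def weight_variance(mu_k, mu_m, cov):
    """Var of w = N(mu_k, cov)/N(mu_m, cov) under N(mu_m, cov): exp(Delta) - 1."""
    return float(np.expm1(delta_stat(mu_k, mu_m, cov)))

def kl_equal_cov(mu_k, mu_m, cov):
    """KL(N(mu_k, cov) || N(mu_m, cov)) = Delta / 2 when covariances are equal."""
    return 0.5 * delta_stat(mu_k, mu_m, cov)
```

Minimizing the KL criterion therefore minimizes ∆ and hence the weight variance, which is the argument made above in favor of the second criterion.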

Applications
In this section we use both the MCMC (Gibbs sampling in our case) and IS methods to estimate parameters of three Poisson models. The first is a Poisson model with one parameter (the mean); the second is a Poisson regression on one covariate with two parameters (intercept and covariate association); and the third is a Poisson regression on one covariate with extra-Poisson variability introduced by a Gaussian residual error term, with three parameters (intercept, covariate association and residual variance). The first model can be seen as a toy example with explicit posterior distributions; the second corresponds to a widely used GLM model; and the third introduces over-dispersion, which is essential, for example, in medical applications, since association estimates would be biased if extra-Poisson variability were not modelled (see Breslow (1984) for motivations). For each model, K = 101 data sets are simulated for different values of the parameters. All data sets contain n = 20 observations. Vague priors are assigned to the parameters and the posterior values are estimated via MCMC and IS as discussed above. Note that it is essential that MCMC convergence is achieved; therefore several diagnostics of convergence (and not only one) have to be checked, as suggested by Brooks and Roberts (1998) and Mengersen, Robert and Guihenneuc-Jouyaux (1999). Many diagnostics are available in the BRugs and CODA packages, namely the convergence diagnostics of Geweke, Gelman and Rubin, Raftery-Lewis, and Heidelberger and Welch. In our examples, we use the MC error (the MC error was less than 5% of the posterior standard deviation) and the Gelman and Rubin diagnostic, as well as graphical tools (history, autocorrelation). For all estimations we ran an MCMC algorithm for 50,000 iterations with a burn-in of 5,000.
Firstly, we study the performance of the fixed strategy with m = 1 over all IS estimations. Then we generalize the concept of the fixed strategy by evaluating mean square errors for all possible values of m (m = 1, ..., 101). This means that the fixed strategy is repeated 101 times, each time with a single posterior distribution taken as the importance function in all estimations; at each repetition, this importance function is based on a different data set. This approach allows us to assess the best and the worst performance of this strategy in terms of mean square error. We also draw box-plots to summarize all these mean square errors.
For the second strategy, we preselect the first 10 simulated data sets (M = 10) and use them in the criteria. For each criterion we calculate the corresponding mean square errors, which are then reported on the box-plots mentioned above.
Finally, we compare the results with those obtained when: (1) the number of observations increases to n = 1,000, and (2) the number of coefficients increases by introducing 10 covariates into the regression. All the calculations were done using the R software environment for statistical computing and graphics (R Development Core Team, 2008). The BRugs package (Thomas et al., 2006) was used in the MCMC simulations. Note that for storage reasons, a subset of MCMC iterations (thin equal to 20) is used for Model 3 with n = 1,000 for all procedures.

Simple Poisson model
As a toy example, we consider that the data are described by a simple Poisson model with one parameter λ, that is, X_1, ..., X_n are iid and X_i|λ ∼ P(λ). We use λ ∼ G(α, β) as a prior distribution; the true posterior distribution of the parameter given the whole data set is then G(α + Σ_{i=1}^n X_i, β + n). To simulate data sets of size n = 20 each, we use two values for the model parameter: λ = 1 and λ = 20. In both cases the parameters of the prior are α = 0.01 and β = 0.01.
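The closed-form posterior makes this model convenient for checking the approximation procedures; a minimal sketch (the seed and sample sizes are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(3)
ALPHA = BETA = 0.01                   # prior G(ALPHA, BETA)
n, lam_true = 20, 1.0

x = rng.poisson(lam_true, size=n)
a_post, b_post = ALPHA + x.sum(), BETA + n     # conjugate Gamma posterior
true_post_mean = a_post / b_post               # known exactly
sample = rng.gamma(a_post, 1.0 / b_post, size=100_000)
```

Any MCMC or IS estimate of E[λ|X] can be compared with `true_post_mean`, which is how the MSEs with regard to the true posterior mean are defined in this section.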
The mean square errors for the fixed strategy and the modulated strategy for g(λ) = λ are presented in figure 1, with λ = 1 on the left and λ = 20 on the right. Black, white and grey diamonds correspond to the first, second and third criteria respectively for the modulated strategy. The box-plots give results for the fixed strategy when all possibilities of fixed m are considered. Box-plots without outliers (extreme MSEs) are presented in the top right corner of each graphic. We observe that mean square errors are greater for λ = 20, as the data variability is greater. However, for both values of λ, the modulated strategies (first to third criteria) seem to reduce the mean square errors compared with MSE_fs for the fixed strategy. As it is impossible to choose with certainty one good sampling distribution for the fixed strategy, the results show that, thanks to the criteria, we can avoid those m (fixed strategy) for which the associated MSE is greatest.
The advantage of this simple example is that the true posterior mean of λ is known, equal to (α + Σ_{i=1}^n X_i^(k)) / (β + n) for data set X^(k). For each approximation procedure (MCMC and IS), table 1 presents the MSEs with regard to the true posterior mean, for λ = 1 on the left and λ = 20 on the right.
Generally these MSEs are small for all the strategies studied (notice that the values reported in table 1 are multiplied by 10^6). For λ = 1 the MSEs calculated for MCMC are smaller than for the other strategies; however, for λ = 20 the MSEs calculated for the first and third criteria are smaller than those calculated for MCMC. The three modulated strategies have MSEs markedly smaller than the biggest MSE (the worst case) of the fixed strategy. An advantage of the modulated strategies is that the estimations are on average closer to the true posterior means than are the estimations calculated with the fixed strategy. For the second criterion, based on the Kullback-Leibler divergence, numerical problems can appear due to the log of the weights. For λ = 20 and the second criterion, the MSE is equal to 124.95 × 10^−6 due to two extreme values. If these values are omitted (the mean then being based on 89 values instead of 91), the MSE is equal to 53.3 × 10^−6, which is comparable to the other MSEs obtained from the modulated strategies. The mean square errors calculated with respect to the true values set in the simulations (results not shown) show similar performances for the modulated strategies and MCMC; sometimes, as in table 1, slightly better results are obtained from the modulated strategies than from the MCMC procedure.

Poisson regression model
The second model is a Poisson regression with a covariate Z and two parameters a and b, that is, X_i|λ_i ∼ P(λ_i) where log(λ_i) = a + bZ_i. For both coefficients a and b we use N(0, 10^5) as vague priors, and the data sets are simulated with a = 0 and b = 0.5. For this model, no analytical solution of the posterior parameter distribution is available. Table 2 shows the mean square errors for the best and the worst choice of the sampling distribution with the fixed strategy, and also when the sampling distribution changes according to the three criteria (mean square errors on the K−M samples, ×10^3, together with the highest 97.5% of the terms involved in the mean square error expression). These results are illustrated by the box-plots of all mean square errors (figure 2). Again, the modulated strategies provide better estimations in comparison with most of the estimations for which a fixed choice of sampling distribution was used. Indeed, we can ascertain from figure 2 that the MSEs for the criteria are smaller than the quasi-totality of the MSEs for the fixed choice.

Poisson regression model with extravariability
The third model gives an example of extra-variability: a Poisson regression on one covariate Z with extra-Poisson variability. The model is then the following: X_i|λ_i ∼ P(λ_i), with log(λ_i) = a + bZ_i + ε_i and ε_i ∼ N(0, σ²). For both coefficients a and b we use N(0, 10^5) as vague priors, and an inverse gamma distribution IG(0.01, 0.01) (the second coefficient being the rate) as the prior on the residual variance σ². In the simulations we set a = 0 and b = 0.5, and we consider three settings of σ²: 1/8, 1/4, 1/2. There is no closed-form expression for the posterior parameter distribution. For each strategy and each parameter setting, the mean square errors (MSE_fs and MSE_ms) with regard to the MCMC posterior means are presented in table 3, as well as the highest 97.5% of the terms involved in the mean square error expression.
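Simulating one data set from this third model can be sketched as follows (the seed, the N(0, 1) distribution assumed for the covariate Z, and the sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
a, b, sigma2, n = 0.0, 0.5, 0.5, 20

Z = rng.normal(size=n)                            # one covariate
eps = rng.normal(scale=np.sqrt(sigma2), size=n)   # extra-Poisson variability
lam = np.exp(a + b * Z + eps)                     # log-linear predictor
X = rng.poisson(lam)                              # observed counts
```

The residual term `eps` inflates the variance of X beyond the Poisson mean, which is the over-dispersion this third model is designed to capture.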
Overall, the modulated strategy always gives better results than the worst fixed strategy for the three parameters, although sometimes the best fixed strategy is better. The impact of the criterion choice for the modulated strategy is minor, the results being almost the same. Figure 3 presents box-plots for the fixed strategy when all possibilities of fixed m are considered. As previously, black, white and grey diamonds correspond to the first, second and third criteria respectively for the modulated strategy. From left to right, the results are given for the three values of σ² (1/8, 1/4, 1/2) respectively. These box-plots clearly confirm that the modulated strategy avoids the worst cases of the fixed strategy, and show that in most cases the fixed strategy leads to greater MSEs, even if it is less clear that the modulated strategy performs better for the parameter σ².
As the MCMC procedure itself corresponds to an approximation, the mean square errors of MCMC and of the two IS strategies are also assessed with respect to the "true" parameter values (results not shown). The modulated strategies again lead to good results, sometimes better than those from the MCMC procedure. For instance, in the case of σ² = 1/2, the MSEs of coefficient b for the three criteria of the modulated strategy are equal to 67×10^−3, 64×10^−3 and 58×10^−3 respectively, while the MSE of b from MCMC is 83×10^−3.
In all three models, there are cases where the fixed strategy corresponds to a smaller MSE_fs than MSE_ms, and also where it is smaller than the MSE of MCMC. Nevertheless, when comparing posterior estimations with the values set in the simulations, the modulated strategy again shows better results than the fixed strategy, except in a small number of cases, and avoids the worst case of the fixed strategy.

Sensitivity analysis
As an extension, three other cases are considered, in each of which we increase the number n of observations per data set to n = 1,000. The first case (case 1) is again the Poisson regression model with extra-variability, as previously described in section 3.3. The same parameter values are chosen (a = 0, b = 0.5, σ² = 1/2) and the same priors are taken. Two other cases of Poisson regression models are studied where the number of covariates is now 10, either all continuous (normally distributed from N(0, 1)) (case 2), or continuous and binary (Z_ij ∼ N(0, 1) for j = 1, ..., 5 and Z_ij ∼ Bernoulli(0.25) for j = 6, ..., 10) (case 3). For these last two cases, the linear predictor becomes log(λ_i) = a + Σ_{j=1}^{10} b_j Z_ij + ε_i. We use N(0, 10^5) as vague priors for the coefficients a and {b_j, j = 1, ..., 10}, and an inverse gamma distribution IG(0.01, 0.01) as the prior on the residual variance σ² of ε_i. In the simulations we set a = 0, b_j = 0.05 for j = 1, ..., 10, and σ² = 1/2.
For each case, mean square errors are calculated with regard to the MCMC posterior means. For these three cases, the modulated strategies perform similarly to those in section 3.3. The increase in the number of observations n leads to smaller MSEs, as expected. Table 4 gives these MSEs for case 1 and has to be compared with the last line of Table 3, which corresponds to a similar case but with n = 20. The MSEs are approximately divided by 3 with n = 1,000 but the ordering between strategies is preserved. An increase in the number of covariates gives similar results. Concerning case 2, the mean of the 10 MSEs corresponding to the 10 covariate associations is calculated for each strategy. The three modulated strategies give very close results (2.4 × 10⁻³, 2.9 × 10⁻³ and 2.3 × 10⁻³ for criteria 1, 2 and 3 respectively), the MSE means for the best and the worst fixed strategy being 1.2 × 10⁻³ and 11.7 × 10⁻³ respectively. As before, the result for the worst fixed strategy is considerably worse than for the modulated strategies but slightly better for the best fixed strategy. Concerning case 3, results for associations with normal covariates are separated from those with Bernoulli covariates. For the 5 normal covariates, results are completely similar to the previous case. For the 5 Bernoulli covariates, the MSE means for the three modulated strategies are 13.6 × 10⁻³, 15.0 × 10⁻³ and 12.5 × 10⁻³ for criteria 1, 2 and 3, and 6.1 × 10⁻³ and 66.4 × 10⁻³ for the best and worst fixed strategy respectively. The poorer MSEs are approximately multiplied by 6, but the conclusions are similar.

Table 4
Mean square errors in Model 3 with a = 0, b = 0.5, σ² = 1/2 and n = 1,000 for g(θ) = a (left), g(θ) = b (middle) and g(θ) = σ² (right), calculated with regard to the MCMC posterior means. ¹ The best fixed strategy (×10³). ² Highest 97.5% of the terms involved in the mean square error expression (×10³). ³ IS with fixed strategy associated to the best MSE (×10³). ⁴ IS with fixed strategy associated to the worst MSE (×10³). ⁵ IS with modulated strategy (×10³).
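The error measure used throughout can be written out explicitly; the following minimal Python sketch uses our own notation (it is not the authors' code), with any posterior-mean estimates compared over the K data sets:

```python
import numpy as np

def mse(is_estimates, mcmc_estimates):
    """Mean square error of IS posterior-mean estimates with regard to the
    reference (here MCMC) posterior means, averaged over the K data sets."""
    is_estimates = np.asarray(is_estimates, dtype=float)
    mcmc_estimates = np.asarray(mcmc_estimates, dtype=float)
    return float(np.mean((is_estimates - mcmc_estimates) ** 2))
```

The same function applies when the reference values are the "true" parameter values set in the simulations instead of the MCMC posterior means.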
An increase in n and in the number of covariates clearly increases computation times. Improving procedures in terms of computation times is then a major challenge, particularly in high dimension. Comparisons of these times for the different procedures are presented and discussed in the last section (section 5).

Selection procedure
In this section, we propose a procedure to select the M sampling distributions used for the modulated strategies. The basic idea is to find M posterior distributions which are, in a certain sense, representative of the K − M remaining posterior distributions. As vague priors are chosen in our examples, this is almost equivalent to considering sampling distributions instead of posterior distributions. First, we define this selection procedure and then illustrate it on Model 3 defined previously.

Selection method
The basic idea of this automatic selection method is to build M clusters from the K sampling distributions and then to select one distribution in each cluster to serve as the M preselected simulated data sets used in section 2.2. The cluster construction is based on a summary statistic characterizing each sampling distribution and on a distance between these statistics. Different choices of statistic and distance are possible but, for simplicity, we choose the empirical mean as summary statistic and the Euclidean distance. The most suitable methods here are "k-medoids" methods, which partition all elements into M clusters and return for each cluster its central element, the medoid. Since we want to select M sampling distributions representative of the K sampling distributions, the advantage of such a method is that the medoid of each cluster is obtained directly. Several clustering methods exist and some are implemented in R packages. To illustrate this procedure, we choose the "Partitioning Around Medoids" (PAM) method introduced by Kaufman and Rousseeuw (1990) and well documented in Nakache and Confais (2005). The R procedure is pam in the cluster package (Maechler et al., 2005). Note that other k-medoid methods exist, such as Clustering LARge Applications (Kaufman and Rousseeuw, 1990), Clustering LARge Applications based on RANdomized Search (Ng and Han, 1994) and the Fast Intelligent Subspace Clustering Algorithm using Dimension Voting (Woo et al., 2004).
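Since the paper relies on R's pam procedure, the following Python sketch is only an illustration of the k-medoids idea applied to empirical-mean summaries: the simple alternating algorithm and all names below are our own assumptions, not the PAM implementation itself.

```python
import numpy as np

def k_medoids(summaries, M, n_iter=100, seed=0):
    """Partition K summary statistics into M clusters under Euclidean
    distance and return the indices of the M medoids."""
    summaries = np.asarray(summaries, dtype=float)
    K = len(summaries)
    # Pairwise Euclidean distances between summary statistics
    D = np.linalg.norm(summaries[:, None, :] - summaries[None, :, :], axis=-1)
    rng = np.random.default_rng(seed)
    medoids = rng.choice(K, size=M, replace=False)
    for _ in range(n_iter):
        # Assign each distribution to its closest current medoid
        labels = np.argmin(D[:, medoids], axis=1)
        # Move each medoid to the member minimizing within-cluster distance
        new_medoids = medoids.copy()
        for c in range(M):
            members = np.where(labels == c)[0]
            if len(members):
                within = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(np.sort(new_medoids), np.sort(medoids)):
            break
        medoids = new_medoids
    return np.sort(medoids)

# Example: K = 101 posterior-mean summaries in dimension 3, select M = 10
summaries = np.random.default_rng(1).normal(size=(101, 3))
selected = k_medoids(summaries, M=10)
```

The returned indices play the role of the M preselected data sets for which MCMC chains are actually generated.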

Illustration
The selection procedure is applied to the third model as defined in section 3.3 with the parameter settings a = 0, b = 0.5 and σ² = 1/2. The number of simulated data sets is K = 101 and the number of preselected data sets M = 10, as before. Figure 4 represents the estimated marginal posterior distributions for each parameter approximated by classical MCMC, where bold densities correspond to the 10 data sets preselected by the automatic PAM procedure. This figure shows clearly that the 10 posterior distributions of the selected data sets represent all the existing posterior distributions well.
Table 5 presents the mean square errors when IS with the modulated strategies is used with the 10 simulated data sets selected automatically using the above procedure. The reader can check that these results are comparable to those given in Table 3. They are again always better than those obtained with the worst fixed choice of Table 3. The selection procedure thus seems to give results as good as the previous ones and has the great advantage of being automatic, thereby avoiding the arbitrary choice of preselected sampling distributions.

Conclusions
In this paper, we compare a classical MCMC algorithm with MCMC combined with an IS approach in repeated posterior estimations conditional on different data sets. The aim is to run MCMC on only a small number of data sets and then use the generated chains in the IS procedure for the estimations conditional on the remaining data sets. The sampling distribution, namely a chain, can be the same in each IS estimation ("fixed strategy") or differ from one estimation to another according to one of the three proposed criteria ("modulated strategy"). We compare the results of the approaches using IS with those based on MCMC only, using mean square errors. When the true posterior means are available we compare the estimations with them; when the posterior expectation values are unknown, such a comparison becomes impossible. To overcome this difficulty we compare the estimation strategies involving IS directly with the MCMC posterior mean estimations, considering MCMC estimation as a benchmark. The disadvantage of this comparison, however, is that both IS and MCMC are only approximate. In the studied examples, we chose vague priors on the parameters. Hence, we also calculated mean square errors with regard to the values set on the parameters in the simulations.
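To make the reweighting step concrete: for the simple Poisson model, a chain targeting π(θ|X^(m)) can be recycled to estimate a posterior expectation under π(θ|X^(k)) by self-normalized importance sampling, and with a common prior the weights reduce to a likelihood ratio. The Python sketch below is a toy illustration under our own assumptions; in particular, exact conjugate Gamma sampling stands in for an actual MCMC chain, and it is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_poisson_lik(lam, data):
    """Poisson log-likelihood up to an additive constant:
    sum(y) * log(lam) - n * lam."""
    return data.sum() * np.log(lam) - len(data) * lam

# Two data sets from the same simple Poisson model (Model 1 style)
x_m = rng.poisson(1.0, size=20)   # data set whose chain was generated by MCMC
x_k = rng.poisson(1.0, size=20)   # data set to estimate without rerunning MCMC

# Stand-in for an MCMC chain targeting pi(lambda | x_m): the exact conjugate
# Gamma posterior under a vague Gamma(0.01, 0.01) prior
chain = rng.gamma(0.01 + x_m.sum(), 1.0 / (0.01 + len(x_m)), size=5000)

# Self-normalized IS: weights proportional to pi(lam|x_k) / pi(lam|x_m),
# which with a common prior is the ratio of likelihoods
log_w = log_poisson_lik(chain, x_k) - log_poisson_lik(chain, x_m)
w = np.exp(log_w - log_w.max())
is_estimate = np.sum(w * chain) / np.sum(w)   # estimate of E[lambda | x_k]
```

In the fixed strategy the same chain (here for data set m) is reused for every remaining data set k, while the modulated strategy reselects m for each k according to one of the three criteria.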
The methods discussed were applied to Poisson models. The results of IS estimation were quite satisfying for both the fixed strategy and the modulated strategy. The first model was the simplest and its true posterior characteristics were available. In this example we saw that the IS estimations were very close to them and also to the MCMC estimations. For the other two models, analytical estimates were not available, and the IS results were therefore compared with the MCMC estimations, showing concordance between the two methods. For Model 3, it emerges that the residual variance is more difficult to estimate and the MCMC algorithm does not handle it well either, returning estimates with large credibility intervals. In this case, larger MSEs are not surprising. For the modulated strategy, none of the criteria managed to produce mean square errors smaller than the smallest MSEs obtained for the fixed strategy but, nevertheless, they are always smaller than the largest ones and in most cases smaller than the fixed strategy MSEs. As an extension of this study, we have tested the IS strategies on the same Poisson regression, but increasing the number of observations per data set to n = 1,000. To increase the number of parameters, we have also added more covariates, chosen arbitrarily to be normal, or normal and Bernoulli. The main conclusions remain unchanged. The modulated strategy allows us to avoid the worst case of the fixed strategy for both types of mean square error. When comparing estimations with the values set in the simulations, IS may outperform MCMC, as in the best case of the fixed strategy. Unfortunately, it is impossible to know in advance, for a given data set, which posterior distribution is the best importance function in the fixed strategy, i.e. the one giving the smallest corresponding MSE_fs.
In terms of computation times, the modulated and fixed strategies always run faster than the MCMC procedure for complex models such as Model 3. The gain depends on the number of observations n and the number of parameters. Generally speaking, the higher the dimension of the parameter space and/or the number of observations, the greater the time difference between the methods. All comparisons were carried out on the same computer and with the same number of iterations in the MCMC procedure. Table 6 gives the ratio of computation times between the IS and MCMC methods for Models 2 and 3 with n = 1,000 and ten covariates. For the Poisson model with extra-variability (Model 3), the ratio between fixed strategy and MCMC computation times is on average 0.036, while the ratio between modulated strategy and MCMC is 0.278, 0.266 and 0.652 for criteria 1, 2 and 3 respectively. The benefit in computation times of MCMC combined with IS over classic MCMC is thus considerable. For the Poisson model without extra-variability (Model 2), the ratio between fixed strategy and MCMC computation times is on average 0.057, again showing a clear benefit of the IS procedure. However, the gain between MCMC and IS with the modulated strategies is less important. Even if the ratio is smaller than 1 for criteria 1 and 2, note that classic MCMC is faster than IS with criterion 3, because this last criterion needs to be assessed for each function g, as discussed in section 2.2. For simple models, MCMC runs very fast and so the use of alternative approaches such as IS seems less crucial than for models requiring expensive computation times, such as Model 3.
To conclude, the modulated strategies, especially when associated with the first or second criterion, show the best compromise between estimation performance and computation time. Indeed, the estimation performances are nearly equivalent for the modulated strategies and MCMC, but with significantly smaller computation times. Moreover, the comparison with the fixed strategy revealed better performances for the modulated strategy. An interesting extension of this work would be to study IS estimates for generalized linear mixed models (GLMM), which are widely used in practice. These models typically require computation time improvement because they correspond to a high-dimensional case. From a methodological point of view, IS based on a mixture of preselected posterior distributions rather than on a single distribution is an important perspective. It offers the advantage of probably providing an importance function with a greater support range, and hence more stable weights.
The featured methods are illustrated in simulation studies on three types of Poisson model: a simple Poisson model, a Poisson regression model and a Poisson regression model with extra-Poisson variability. Different parameter settings are considered.

Fig 1 .
Fig 1. Box-plot of the MSE_fs of all possible m in the fixed strategy together with the MSE_ms of the three criteria in Model 1 with g(λ) = λ for λ = 1 (left) and λ = 20 (right); black diamond for the first criterion, white diamond for the second criterion and grey diamond for the third.

Fig 2 .
Fig 2. Box-plot of the MSE_fs of all possible fixed choices together with the MSE_ms of the three criteria in Model 2 with θ = (a = 0, b = 0.5) for g(θ) = a (top left), g(θ) = a² (top right), g(θ) = b (bottom left) and g(θ) = b² (bottom right); black diamond for the first criterion, white diamond for the second criterion and grey diamond for the third.

Fig 3 .
Fig 3. Box-plot of the MSE_fs of all possible fixed choices together with the MSE_ms of the three criteria in Model 3 with a = 0, b = 0.5 and σ² = 1/8 (1st line), σ² = 1/4 (2nd line) or σ² = 1/2 (3rd line), for g(θ) = a (left), g(θ) = b (middle) and g(θ) = σ² (right); black diamond for the first criterion, white diamond for the second criterion and grey diamond for the third criterion in the modulated strategy.

Table 1
Mean square errors with regard to the true posterior mean in Model 1 for g(λ) = λ with

Table 2
Mean square errors with regard to the MCMC posterior means in Model 2 with a = 0, b = 0.5 and n = 20 for g(θ) = a (left) and g(θ) = b (right)

Table 5
Mean square errors in Model 3 with the Automatic Selection Procedure and parameter settings a = 0, b = 0.5, σ² = 1/2. See the legend of Table. ³ IS with the different π(θ|X^(m)) selected by the 1st or the 2nd criterion (×10³)

Table 6
Ratio of computation times between the IS strategies and the MCMC method with n = 1,000 and 10 covariates