Likelihood-free inference by ratio estimation

We consider the problem of parametric statistical inference when likelihood computations are prohibitively expensive but sampling from the model is possible. Several so-called likelihood-free methods have been developed to perform inference in the absence of a likelihood function. The popular synthetic likelihood approach infers the parameters by modelling summary statistics of the data by a Gaussian probability distribution. In another popular approach called approximate Bayesian computation, the inference is performed by identifying parameter values for which the summary statistics of the simulated data are close to those of the observed data. Synthetic likelihood is easier to use as no measure of `closeness' is required but the Gaussianity assumption is often limiting. Moreover, both approaches require judiciously chosen summary statistics. We here present an alternative inference approach that is as easy to use as synthetic likelihood but not as restricted in its assumptions, and that, in a natural way, enables automatic selection of relevant summary statistics from a large set of candidates. The basic idea is to frame the problem of estimating the posterior as a problem of estimating the ratio between the data generating distribution and the marginal distribution. This problem can be solved by logistic regression, and including regularising penalty terms enables automatic selection of the summary statistics relevant to the inference task. We illustrate the general theory on canonical examples and employ it to perform inference for challenging stochastic nonlinear dynamical systems and high-dimensional summary statistics.


Introduction
We consider the problem of inferring the posterior probability density function (pdf) of some model parameters θ ∈ R^d given observed data x_0 ∈ X when computation of the likelihood function is too costly but data can be sampled from the model. In particular, we assume that the model specifies the data generating pdf p(x|θ) not explicitly, e.g. in closed form, but only implicitly in terms of a stochastic simulator that generates samples x from the model p(x|θ) for any value of the parameter θ. The simulator can be arbitrarily complex, so we do not impose any particular conditions on the data space X. Such simulator-based (generative) models are used in a wide range of scientific disciplines to simulate different aspects of nature on the computer, ranging from sub-atomic particles (Martinez et al., 2016) to human societies (Turchin et al., 2013) and universes (Schaye et al., 2015).
Denoting the prior pdf of the parameters by p(θ), the posterior pdf p(θ|x_0) can be obtained from Bayes' formula,

p(θ|x) = p(θ) p(x|θ) / p(x),   p(x) = ∫ p(x|θ) p(θ) dθ,   (1)

for x = x_0. Exact computation of the posterior pdf is, however, impossible if the likelihood function L(θ) ∝ p(x_0|θ) is too costly to compute. Several approximate inference methods have appeared for simulator-based models. They are collectively known as likelihood-free inference methods, and include approximate Bayesian computation (Tavaré et al., 1997; Pritchard et al., 1999; Beaumont et al., 2002) and the synthetic likelihood approach (Wood, 2010). For a comprehensive introduction to the field, we refer the reader to the review papers by Lintusaari et al. (2017), Marin et al. (2012), Hartig et al. (2011), and Beaumont (2010).
Approximate Bayesian computation (ABC) relies on finding parameter values for which the simulator produces data that are similar to the observed data. Similarity is typically assessed by reducing the simulated and observed data to summary statistics and comparing their distance. While the summary statistics are classically determined by expert knowledge about the problem at hand, there have been recent pursuits in choosing them in an automated manner (Aeschbacher et al., 2012; Fearnhead and Prangle, 2012; Blum et al., 2013; Gutmann et al., 2014, 2017). While ABC can be considered to implicitly construct a nonparametric approximation of p(x|θ) (e.g. Hartig et al., 2011; Gutmann and Corander, 2016), synthetic likelihood assumes that the summary statistics for a given parameter value follow a Gaussian distribution (Wood, 2010). The synthetic likelihood approach is applicable to a diverse set of problems (Price et al., 2016), but the Gaussianity assumption may not always hold and the original method does not include a mechanism for choosing summary statistics automatically.
In this paper, we propose a framework and practical method to directly approximate the posterior distribution in the absence of a tractable likelihood function. The proposed approach includes automatic selection of summary statistics in a natural way.
The basic idea is to frame the original problem of estimating the posterior as a problem of estimating the ratio r(x, θ) between the data generating pdf p(x|θ) and the marginal distribution p(x),

r(x, θ) = p(x|θ) / p(x).   (2)

By definition of the posterior distribution, an estimate r̂(x, θ) for the ratio implies an estimate p̂(θ|x_0) for the posterior,

p̂(θ|x_0) = p(θ) r̂(x_0, θ).   (3)
In addition, the estimated ratio also yields an estimate L̂(θ) of the likelihood function, as the denominator p(x) in the ratio does not depend on θ,

L̂(θ) ∝ r̂(x_0, θ).   (4)

We can thus perform likelihood-free inference by ratio estimation, and we call this framework in short "LFIRE". If approximating the likelihood function is the goal, other distributions than the marginal can also be chosen for the denominator. While we do not further address the question of which distributions can be chosen, for reasons of stability it seems reasonable to prefer distributions in the denominator that have heavier tails than the p(x|θ) in the numerator.
There are several methods available in the literature for the estimation of density ratios (Sugiyama et al., 2012), of which estimation through logistic regression is widely used and has favourable asymptotic properties (Qin, 1998; Cheng and Chu, 2004; Bickel et al., 2007). Logistic regression is very closely related to probabilistic classification, and we use it in this paper to estimate the ratio r(x, θ).
Logistic regression and probabilistic classification have been employed before to address computational problems in statistics. Gutmann and Hyvärinen (2012) used it to estimate unnormalised models, Pham et al. (2014) and Cranmer et al. (2015) used it to estimate likelihood ratios, and Goodfellow et al. (2014) employed it for training neural networks to generate samples following the same distribution as given reference data. More general methods for ratio estimation are now considered for training such neural networks (see e.g. the review by Mohamed and Lakshminarayanan, 2016), and they were used before to estimate unnormalised models (Pihlaja et al., 2010; Gutmann and Hirayama, 2011). Classification in general has been shown to yield a natural distance function in terms of the classifiability between simulated and observed data, which can be used for ABC (Gutmann et al., 2014, 2017). While this earlier approach is very general, the classification problem can be difficult to set up when the observed data consist of very few data points only. The method proposed in this paper avoids this difficulty. The details on how to generally estimate the ratio r(x, θ) and hence the posterior by logistic regression are presented in Section 2. In Section 3, we model the ratio as a linear superposition of summary statistics and show that this assumption yields an exponential family approximation of the intractable model pdf. As Gaussian distributions are part of the exponential family, our approach thus includes the synthetic likelihood approach as a special case. We then show in Section 4 that including a penalty term in the logistic regression enables automatic selection of relevant summary statistics. In Section 5, we validate the resulting method on toy examples, and in Section 6, we apply it to challenging inference problems in ecology and weather forecasting. All simulation studies include a comparison with the synthetic likelihood approach. The new method yielded consistently more accurate inference results.

Posterior estimation by logistic regression
We here show that the ratio r(x, θ) in Equation (2) can be estimated by logistic regression, which yields estimates for the posterior and the likelihood function together with Equations (3) and (4). Figure 1 provides an overview.
By assumption, we can generate data from the pdf p(x|θ) in the numerator of the ratio; let X_θ = {x_i^θ}_{i=1}^{n_θ} be such a set with n_θ independent samples generated with a fixed value of θ. Additionally, we can also generate data from the marginal pdf p(x) in the denominator of the ratio; let X_m = {x_i^m}_{i=1}^{n_m} be such a set with n_m independent samples. As the marginal p(x) is obtained by integrating out θ, see Equation (1), the samples can be obtained by first sampling from the joint distribution of (x, θ) and then ignoring the sampled parameters,

θ_i ∼ p(θ),   x_i^m ∼ p(x|θ_i),   i = 1, . . . , n_m.   (5)

We now formulate a classification problem where we aim to determine whether some data x were sampled from p(x|θ) or from p(x). This classification problem can be solved via (nonlinear) logistic regression (e.g. Hastie et al., 2001), where the probability for x to belong to X_θ, for instance, is parametrised by some nonlinear function h(x),

P(X_θ | x) = exp(h(x)) / ( exp(h(x)) + ν ),   ν = n_m / n_θ,   (6)

with ν compensating for unequal class sizes. A larger value of h at x indicates a larger probability for x to originate from X_θ. A suitable function h is typically found by minimising the loss function J on the training data X_θ and X_m,

J(h, θ) = (1/(n_θ + n_m)) [ Σ_{i=1}^{n_θ} log( 1 + ν exp(−h(x_i^θ)) ) + Σ_{i=1}^{n_m} log( 1 + (1/ν) exp(h(x_i^m)) ) ].   (7)

The dependency of the loss function on θ is due to the dependency of the training data X_θ on θ.
We prove in Appendix A that for large n_m and n_θ, the minimising function h* is given by the log-ratio between p(x|θ) and p(x), that is

h*(x) = log p(x|θ) − log p(x) = log r(x, θ).   (8)

For finite sample sizes n_m and n_θ, the minimising function ĥ,

ĥ = argmin_{h ∈ H} J(h, θ),   (9)

where H is the family of functions over which we optimise, thus provides an estimate r̂(x, θ) of the ratio r(x, θ),

r̂(x, θ) = exp( ĥ(x) ),   (10)

and Equations (3) and (4) yield the corresponding estimates for the posterior and likelihood function, respectively,

p̂(θ|x_0) = p(θ) exp( ĥ(x_0) ),   L̂(θ) ∝ exp( ĥ(x_0) ).   (11)

In case samples from the posterior are needed, we can use standard sampling schemes with p̂(θ|x_0) as the target pdf (Andrieu and Roberts, 2009). The estimates can also be used together with Bayesian optimisation for fast likelihood-free inference (Gutmann and Corander, 2016).
When estimating the posterior or likelihood function by logistic regression as outlined above, the sample sizes n_m and n_θ are entirely under our control. Their values reflect the trade-off between computational and statistical efficiency. We note that both X_θ and X_m can be constructed in a perfectly parallel manner. Moreover, while X_θ needs to be re-constructed for each value of θ, X_m is independent of θ and needs to be generated only once.
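To make the procedure concrete, the following is a minimal end-to-end sketch on a hypothetical one-dimensional Gaussian simulator; the simulator, the prior, the feature map and all sample sizes are illustrative choices and not part of the method's specification.

```python
# Minimal end-to-end sketch of likelihood-free inference by ratio estimation.
# The simulator, prior, feature map and sample sizes below are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def simulator(theta, n):                       # x ~ p(x | theta), a toy model
    return rng.normal(theta, 1.0, size=(n, 1))

def features(x):                               # simple summary statistics psi(x)
    return np.hstack([x, x**2])

n_theta = n_m = 1000
x_obs = np.array([[1.3]])                      # the observed datum x_0

# X_m: sample theta from the prior, then x from the simulator (done only once).
prior_draws = rng.uniform(-5.0, 5.0, size=n_m)
X_m = np.concatenate([simulator(t, 1) for t in prior_draws])

def log_ratio_at(theta):
    """Estimate h(x_0) = log r(x_0, theta) by logistic regression."""
    X_theta = simulator(theta, n_theta)
    X = features(np.concatenate([X_theta, X_m]))
    y = np.concatenate([np.ones(n_theta), np.zeros(n_m)])
    clf = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)   # nearly unpenalised
    # With n_theta = n_m, the log-odds returned by decision_function estimate
    # log p(x | theta) - log p(x) at the query point.
    return clf.decision_function(features(x_obs))[0]

thetas = np.linspace(-5.0, 5.0, 41)
log_post = np.array([log_ratio_at(t) for t in thetas])   # + log prior (flat here)
post = np.exp(log_post - log_post.max())
post /= post.sum() * (thetas[1] - thetas[0])             # normalise on the grid
```

Since X_m is generated only once, the per-θ cost is one call to the simulator batch and one logistic regression fit.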
Different models can be used for probabilistic classification; equivalently, different assumptions can be made on the family of functions to which the log-ratio h belongs. While non-parametric families or deep architectures can be used, we next consider a simple parametric family that is spanned by a set of summary statistics.

Exponential family approximation
We here restrict the search in Equation (9) to functions h that are members of the family spanned by b summary statistics ψ_i(x), each mapping data x ∈ X to R,

h(x) = Σ_{i=1}^{b} β_i ψ_i(x) = β^T ψ(x),   (12)

with β_i ∈ R, β = (β_1, . . . , β_b), and ψ(x) = (ψ_1(x), . . . , ψ_b(x)). This corresponds to performing logistic regression with a linear basis expansion (Hastie et al., 2001). The observed data x_0 may be used in the definition of the summary statistics, as for example with the Ricker model in Section 6, and thus influence the logistic regression part of the likelihood-free inference pipeline in Figure 1 (not shown in the figure).
When we assume that h(x) takes the functional form in Equation (12), estimation of the ratio r(x, θ) boils down to the estimation of the coefficients β_i. This is done by minimising J(β, θ) = J(β^T ψ, θ) with respect to β,

J(β, θ) = (1/(n_θ + n_m)) [ Σ_{i=1}^{n_θ} log( 1 + ν exp(−β^T ψ_i^θ) ) + Σ_{i=1}^{n_m} log( 1 + (1/ν) exp(β^T ψ_i^m) ) ],   (13)

β̂(θ) = argmin_β J(β, θ).   (14)

The terms ψ_i^θ = ψ(x_i^θ) and ψ_i^m = ψ(x_i^m) denote the summary statistics of the simulated data sets x_i^θ ∈ X_θ and x_i^m ∈ X_m, respectively. The estimated coefficients β̂ depend on θ because the training data x_i^θ ∈ X_θ depend on θ. With the model assumption in Equation (12), the estimate for the ratio in Equation (10) thus becomes

r̂(x, θ) = exp( β̂(θ)^T ψ(x) ),   (15)

and the estimates for the posterior and likelihood function in Equation (11) are

p̂(θ|x_0) = p(θ) exp( β̂(θ)^T ψ(x_0) ),   L̂(θ) ∝ exp( β̂(θ)^T ψ(x_0) ),   (16)

respectively.
As r(x, θ) is the ratio between p(x|θ) and p(x), we can consider the estimate r̂(x, θ) in Equation (15) to provide an implicit estimate p̂(x|θ) of the intractable model pdf p(x|θ),

p̂(x|θ) = p(x) exp( β̂(θ)^T ψ(x) ).   (17)

The estimate is implicit because we have not explicitly estimated the marginal pdf p(x).
Importantly, the equation shows that p̂(x|θ) belongs to the exponential family, with ψ(x) being the sufficient statistics for the family and β̂(θ) the vector of natural parameters.
In previous work, Wood (2010) in his synthetic likelihood approach, as well as Leuenberger and Wegmann (2010), approximated the model pdf by a member from the Gaussian family. As the Gaussian family belongs to the exponential family, the approximation in Equation (17) includes this previous work as a special case. Specifically, a synthetic likelihood approximation with summary statistics φ corresponds to an exponential family approximation where the summary statistics ψ are the individual φ_k, all pairwise combinations φ_k φ_k′ with k ≥ k′, and a constant. While in the synthetic likelihood approach, the weights of the summary statistics are determined by the mean and covariance matrix of φ, in our approach, they are determined by the solution of the optimisation problem in (14). Hence, even if equivalent summary statistics are used, the two approaches can yield different approximations if the summary statistics are actually not Gaussian. Computationally, both methods require generating the data set X_θ, which most often will dominate the computational cost. The proposed method has the additional cost of constructing the set X_m once and the cost of performing logistic regression for each θ. Synthetic likelihood, on the other hand, requires inversion of the covariance matrix of the summary statistics for each θ. We thus consider the computational cost of both methods to be roughly equal.
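As a concrete illustration of this correspondence, a small sketch of the expansion from φ to ψ (the example values of φ are arbitrary):

```python
# Build psi = (1, phi_k, phi_k * phi_k' for k >= k') from a vector of summary
# statistics phi, so that a linear logistic-regression model on psi spans the
# same statistics that the Gaussian synthetic likelihood uses implicitly.
import numpy as np

def expand_summaries(phi):
    phi = np.asarray(phi, dtype=float)
    pairwise = [phi[k] * phi[j] for k in range(len(phi)) for j in range(k + 1)]
    return np.concatenate(([1.0], phi, pairwise))

phi = np.array([0.2, -1.3, 0.7])        # hypothetical 3-dimensional phi
psi = expand_summaries(phi)             # 1 + 3 + 6 = 10 components
```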
Later in the paper, we compare the posteriors estimated by the two approaches. We will see that for equivalent summary statistics, relaxing the Gaussianity assumption typically leads to better inference results.

Data-driven selection of summary statistics
The estimated coefficients β̂(θ) are weights that determine to which extent a summary statistic ψ_i(x) contributes to the approximation of the posterior. As the number of simulated data sets n_m and n_θ increases, the error in the estimates β̂(θ) decreases and the importance of each summary statistic can be determined more accurately. Increasing the number of simulated data sets, however, increases the computational cost too. As an alternative to increasing the number of simulated data sets, we here use an additional penalty term in the logistic regression to determine the importance of each summary statistic. This approach enables us to work with a large list of candidate summary statistics and automatically select the relevant ones in a data-driven manner. This makes the posterior inference more robust and less dependent on subjective user input.
While many choices are possible, we use the L_1 norm of the coefficients as the penalty term, as in lasso regression (Tibshirani, 1994). The coefficients β in the basis expansion in Equation (12) are thus determined as the solution of an L_1-regularised logistic regression,

β̂(θ) = argmin_β [ J(β, θ) + λ Σ_{i=1}^{b} |β_i| ].   (18)

The value of λ determines the degree of the regularisation. Sufficiently large values cause some of the coefficients to be exactly zero. Different schemes to choose λ have been proposed that aim at minimising the prediction risk (Zou et al., 2007; Wang and Leng, 2007; Tibshirani and Taylor, 2012; Dutta et al., 2012). Following common practice and recommendations (Tibshirani, 1994; Hastie et al., 2001; Greenshtein and Ritov, 2004; Efron et al., 2004; Zou et al., 2007; van de Geer, 2008; Friedman et al., 2010; Tibshirani, 2011; van de Geer and Lederer, 2013), we here choose λ by minimising an estimate R(λ) of the prediction risk obtained by cross-validation. The minimising value λ_min determines the coefficients β̂(θ), which are used in the estimate of the density ratio in Equation (15), and thus in the posterior and likelihood estimates in Equation (16). Algorithm 1 presents pseudo-code that summarises the procedure for joint summary statistics selection and posterior estimation. Algorithm 1 is a special case of the scheme described in Figure 1 for the case where h(x) is a linear combination of the summary statistics ψ(x), as described in Equation (12).
Algorithm 1. Given a parameter value θ, the observed data x_0, and the pre-computed marginal data set X_m:
a. Generate the data set X_θ by simulating n_θ data sets from p(x|θ), and compute the summary statistics ψ for all elements of X_θ and X_m.
b. Fit the L_1-regularised logistic regression in Equation (18) for a grid of values of the regularisation strength λ.
c. Choose λ_min by minimising the estimated prediction risk R(λ) and set β̂(θ) to the corresponding coefficients.
d. Compute the value of the estimated posterior pdf p̂(θ|x_0) according to Equation (16).
For the results in this paper, we always used n_θ = n_m. To implement steps b and c we used the R package 'glmnet' (Friedman et al., 2010).
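The sketch below is a rough scikit-learn analogue of steps b and c; the solver, the number of folds and the grid over the regularisation strength are our own choices, not those of the authors' glmnet-based implementation.

```python
# Rough analogue of steps b-c: L1-penalised logistic regression with the
# regularisation strength chosen by cross-validated prediction risk.
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

def fit_lfire_coefficients(psi_theta, psi_m, n_folds=10):
    """psi_theta, psi_m: arrays of summary statistics for X_theta and X_m."""
    X = np.vstack([psi_theta, psi_m])
    y = np.concatenate([np.ones(len(psi_theta)), np.zeros(len(psi_m))])
    clf = LogisticRegressionCV(
        Cs=20,                  # grid over the inverse regularisation strength
        penalty="l1",
        solver="saga",          # solver that supports the L1 penalty
        scoring="neg_log_loss", # prediction risk = cross-validated logistic loss
        cv=n_folds,
        max_iter=5000,
    ).fit(X, y)
    return clf.coef_.ravel(), clf.intercept_[0]   # beta_hat(theta), intercept
```

Exactly-zero entries of the returned coefficient vector correspond to summary statistics that are dropped for the given θ.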

Validation on toy inference problems
We here validate and illustrate the presented theory on a set of toy inference problems.

Gaussian distribution
We illustrate the proposed inference method on the simple example of estimating the posterior pdf of the mean of a Gaussian distribution. The observed data x_0 is a single observation that was sampled from a univariate Gaussian with mean µ_o = 2.3 and standard deviation σ_o = 3. Assuming a uniform prior U(−20, 20) on the unknown mean µ, the log posterior can be written as

log p(µ|x_0) = α_0(µ) + α_1(µ) x_0 + α_2(µ) x_0²   (21)

for µ ∈ (−20, 20), and the posterior is zero otherwise. The model is thus within the family of models specified in Equation (16). Coefficient α_0(µ) equals

α_0(µ) = −µ²/(2σ_o²) − log(√(2π) σ_o) − log[ Φ((20 − x_0)/σ_o) − Φ((−20 − x_0)/σ_o) ],

where Φ is the cumulative distribution function of the standard normal distribution, and the coefficients α_1(µ) and α_2(µ) are

α_1(µ) = µ/σ_o²,   α_2(µ) = −1/(2σ_o²).

For Algorithm 1, we used ten-dimensional summary statistics ψ(x) = (1, x, x², . . . , x^{b−1}), so that b = 10, and fixed n_m = n_θ = 1000. As an illustration of step c of Algorithm 1, we show the prediction error R(λ) in Figure 2a. In Figure 2b, we plot α̂(µ) and α_0(µ), α_1(µ), α_2(µ) from Equation (21) for µ ∈ [−5, 5].
We notice that the estimated coefficients α̂_k are exactly zero for k > 2, while for k ≤ 2 they match the true coefficients up to random fluctuations. This shows that our inference procedure can select the summary statistics that are relevant for the estimation of the posterior distribution from a larger set of candidates.
In Figure 2c, we compare the estimated posterior pdf (yellow) with the true posterior pdf (blue). We can see that the estimate matches the true posterior up to random fluctuations.
The figure further depicts the posterior obtained by the synthetic likelihood approach of Wood (2010) (red) where the summary statistics φ(x) are equal to x. Here, working with Gaussian data, the performance of the proposed inference scheme based on penalised logistic regression and the performance of the existing synthetic likelihood approach are practically equivalent.
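A quick numerical check of the closed-form expressions above; µ_o, σ_o and the prior range follow the text, while the value of x_0 and the evaluation grid are our own illustrative choices.

```python
# Numerical check of the closed-form posterior for the Gaussian toy example.
import numpy as np
from scipy.stats import norm

sigma_o = 3.0
x0 = 2.3                                   # placeholder observation
mu = np.linspace(-20.0, 20.0, 4001)        # prior support U(-20, 20)

alpha2 = -1.0 / (2 * sigma_o**2)
alpha1 = mu / sigma_o**2
alpha0 = (-mu**2 / (2 * sigma_o**2)
          - np.log(np.sqrt(2 * np.pi) * sigma_o)
          - np.log(norm.cdf((20 - x0) / sigma_o) - norm.cdf((-20 - x0) / sigma_o)))

log_post = alpha0 + alpha1 * x0 + alpha2 * x0**2
post = np.exp(log_post)
print(post.sum() * (mu[1] - mu[0]))        # approximately 1: correctly normalised
```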
Autoregressive time series model

The true posterior distribution of θ = (θ_1, θ_2) can be computed numerically (e.g. Gutmann and Corander, 2016). For estimating the posterior distribution with Algorithm 1, we used summary statistics ψ that measure the (nonlinear) temporal correlation between the time points, namely the auto-correlations with lag one up to five, all pairwise combinations of them, and additionally a constant. For checking the robustness of the approach, we also considered the case where almost 50% of the summary statistics are noise, by augmenting the above set of summary statistics with 15 white-noise random variables. For synthetic likelihood, we used the auto-correlations as the summary statistics without any additional noise variables, as the synthetic likelihood approach is typically not adapted to selecting among relevant and irrelevant summary statistics. As explained in Section 3, synthetic likelihood always uses the pairwise combinations of the summary statistics due to its underlying Gaussianity assumption. In order to assess the performance more systematically, we next performed posterior inference for 100 observed time series that were each generated from Equation (24). Table 1 shows the average value of the symmetrised Kullback-Leibler divergence between the estimated and the true posterior for n_θ = n_m ∈ {100, 500, 1000}. The average divergence decreases as the number of simulated data sets increases for our method, in contrast to the synthetic likelihood approach. We can attribute the better performance of our method to its ability to better handle non-Gaussian summary statistics and its ability to select the summary statistics that are relevant.
We further compared the performance of the proposed method and synthetic likelihood case by case for the 100 different observed data sets. For that purpose, we computed the difference ∆_sKL between sKL(p̂(θ|x_0)||p(θ|x_0)) when p̂(θ|x_0) is estimated by the proposed method and when it is estimated by synthetic likelihood. A value of ∆_sKL < 0 indicates a better performance of the proposed method, while a value ∆_sKL > 0 indicates that synthetic likelihood is performing better. As ∆_sKL depends on x_0, it is a random variable, and we can compute its empirical distribution over the 100 different inference problems corresponding to the different observed data sets. Figure 4 shows the distribution of ∆_sKL when the white-noise variables are absent (blue) and present (red) for the proposed method. The area under the curve on the negative side of the x-axis is 82% (white noise absent) and 83% (white noise present), which indicates a superior performance of the proposed method over synthetic likelihood and robustness to the irrelevant summary statistics.
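A short sketch of the symmetrised Kullback-Leibler divergence used for Table 1 and Figure 4; the densities are assumed to be evaluated on a common parameter grid, and the function name is ours.

```python
# Symmetrised Kullback-Leibler divergence between two posteriors on a grid.
import numpy as np

def skl(p, q, cell_volume):
    """sKL(p, q) = KL(p||q) + KL(q||p) for densities evaluated on a grid."""
    eps = 1e-300                                   # guard against log(0)
    p = p / (p.sum() * cell_volume)                # normalise on the grid
    q = q / (q.sum() * cell_volume)
    kl_pq = np.sum(p * np.log((p + eps) / (q + eps))) * cell_volume
    kl_qp = np.sum(q * np.log((q + eps) / (p + eps))) * cell_volume
    return kl_pq + kl_qp
```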

Bayesian inference for nonlinear dynamical systems
We here apply Algorithm 1 to two realistic models with intractable likelihood functions and compare the inference results with those of the synthetic likelihood approach by Wood (2010). The first one is the ecological model of Ricker (1954) that was also previously used by Wood (2010). The second one is the widely used weather prediction model of Lorenz (1995) with a stochastic parametrisation (Wilks, 2005), which we simply call the "Lorenz model". Both are time series models, and the inference is difficult due to unobserved variables and their strongly nonlinear dynamics.

Models
Ricker model. This is a model from ecology that describes the size of an animal population over time. The observed population size at time t, y^(t), is assumed to be a stochastic observation of the actual but unobservable population size N^(t). Conditional on N^(t), the observable y^(t) is assumed Poisson distributed,

y^(t) | N^(t) ∼ Poisson(φ N^(t)),

where φ is a scaling parameter. The dynamics of the unobservable population size N^(t) is described by a stochastic version of the Ricker map (Ricker, 1954),

N^(t+1) = r N^(t) exp( −N^(t) + σ e^(t) ),   t = 1, . . . , T,

where T = 50, the e^(t) are independent standard normal random variables, log r is related to the log population growth rate, and σ is the standard deviation of the innovations. The model has in total three parameters, θ = (log r, σ, φ). The observed data x_0 are the time series (y^(t), t = 1, . . . , T), generated using θ_0 = (log r_0, σ_0, φ_0) = (3.8, 0.3, 10). We assumed uniform priors for all parameters: U(3, 5) for log r, U(0, 0.6) for σ, and U(5, 15) for φ.
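A minimal simulator sketch for this model; the initial population size N^(0) = 1 and the random seed are assumptions of ours, while the parameter values follow the text.

```python
# Sketch of the stochastic Ricker model: latent Ricker map plus Poisson
# observations. N0 (the initial population size) is an assumption.
import numpy as np

def simulate_ricker(log_r, sigma, phi, T=50, N0=1.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    r, N = np.exp(log_r), N0
    y = np.empty(T, dtype=np.int64)
    for t in range(T):
        N = r * N * np.exp(-N + sigma * rng.normal())   # stochastic Ricker map
        y[t] = rng.poisson(phi * N)                     # y^(t) ~ Poisson(phi * N^(t))
    return y

y_obs = simulate_ricker(3.8, 0.3, 10.0, rng=np.random.default_rng(1))
```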
For our method, we use the set of 13 summary statistics φ suggested by Wood (2010), as well as all their pairwise combinations and a constant, in order to make the comparison with synthetic likelihood fair: as pointed out in Section 3, synthetic likelihood implicitly uses the pairwise combinations of the summary statistics due to its underlying Gaussianity assumption. The set of 13 summary statistics φ are: the mean observation ȳ, the number of zero observations, the auto-covariances with lag one up to five, the coefficients of a cubic regression of the ordered differences y^(t) − y^(t−1) on those of the observed data, and the least squares estimates of the coefficients for the model (y^(t+1))^0.3 = b_1 (y^(t))^0.3 + b_2 (y^(t))^0.6 + ε^(t); see Wood (2010) for details.
Lorenz model. This model is a modification of the original weather prediction model of Lorenz (1995) for the case where the fast weather variables are unobserved (Wilks, 2005). The model assumes that weather stations measure a high-dimensional time series of slow weather variables (y_k^(t), k = 1, . . . , 40), which follow a coupled stochastic differential equation (SDE), called the forecast model (Wilks, 2005),

dy_k^(t)/dt = −y_{k−1}^(t) ( y_{k−2}^(t) − y_{k+1}^(t) ) − y_k^(t) + F − g(y_k^(t), θ) + η_k^(t),   k = 1, . . . , 40,   (28)

with cyclic boundary conditions and constant external forcing F, where η_k^(t) is stochastic and represents the uncertainty due to the forcing of the unobserved fast weather variables. The function g(y_k^(t), θ) is a deterministic parametrisation of this forcing in terms of the parameters θ. We discretised the time axis at intervals Δt of a few hours and solved the SDEs by using a 4th order Runge-Kutta solver at these time-points (Carnahan et al., 1969, Section 6.5). In the discretised SDEs, following Wilks (2005), the stochastic forcing term is updated over an interval of Δt as

η_k^(t+Δt) = φ η_k^(t) + √(1 − φ²) e_k^(t),

where the e_k^(t) are independent standard normal random variables and η_k^(0) = √(1 − φ²) e_k^(0).
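For illustration, a rough simulator sketch; the Lorenz-96 drift with forcing F, the linear closure g(y, θ) = θ_1 + θ_2 y, and the numerical constants (F, Δt, φ, the initial state) are assumptions made for this sketch rather than values taken from the text.

```python
# Sketch of a stochastic Lorenz forecast model with an AR(1) forcing term.
# Drift, closure g(y, theta) = theta1 + theta2*y, and all constants are
# illustrative assumptions.
import numpy as np

def drift(y, theta, F=10.0):
    # Cyclic Lorenz-96 tendency minus the deterministic closure term.
    g = theta[0] + theta[1] * y
    return -np.roll(y, 1) * (np.roll(y, 2) - np.roll(y, -1)) - y + F - g

def simulate_lorenz(theta, T=4.0, dt=0.005, K=40, phi=0.4, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    n_steps = int(round(T / dt))
    y = rng.normal(size=K)                            # arbitrary initial state
    eta = np.sqrt(1 - phi**2) * rng.normal(size=K)    # initial stochastic forcing
    traj = np.empty((n_steps + 1, K))
    traj[0] = y
    for i in range(n_steps):
        # 4th-order Runge-Kutta step for the deterministic part ...
        k1 = drift(y, theta)
        k2 = drift(y + 0.5 * dt * k1, theta)
        k3 = drift(y + 0.5 * dt * k2, theta)
        k4 = drift(y + dt * k3, theta)
        y = y + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6 + dt * eta  # ... plus forcing
        eta = phi * eta + np.sqrt(1 - phi**2) * rng.normal(size=K)  # AR(1) update
        traj[i + 1] = y
    return traj

x_obs = simulate_lorenz(np.array([2.0, 0.1]), rng=np.random.default_rng(0))
```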
The inference problem that we solve here is the estimation of the posterior distribution of the parameters θ = (θ_1, θ_2), called closure parameters in weather modelling, from the 40 slow weather variables y_k^(t) recorded over twenty days. We simulated such observed data x_0 from the model by solving the SDEs numerically as described above with θ_0 = (θ_1^o, θ_2^o) = (2.0, 0.1) over a period of twenty days. The uniform priors assumed for the parameters were U(0.5, 3.5) for θ_1 and U(0, 0.3) for θ_2. As summary statistics we used six statistics that characterise the distribution of the slow weather variables y_k^(t) and their spatial and temporal correlations, among them cross-covariances involving the neighbouring variable y_{k+1}^(t) for time lag one. These values were computed and averaged over all k due to the symmetry in the model. We used the six summary statistics for synthetic likelihood, and, to make the comparison fair, for the proposed method, we also used their pairwise combinations as well as a constant, as in the previous sections.

Results
We used an importance sampling (IS) scheme (Ripley, 1987), drawing 8000 samples from the prior distribution and computing their weights using Algorithm 1, which is equivalent to one generation of a sequential Monte Carlo (SMC) algorithm (Cappé et al., 2004; Del Moral et al., 2006). Throughout, Algorithm 1 and synthetic likelihood were run with n_θ = n_m = 100 to estimate the posterior pdfs. As suggested by Wood (2010), for the synthetic likelihood approach we used a robust variance-covariance matrix estimation scheme for a better estimation of the likelihood function.
To compare the two approaches quantitatively, we computed the relative error of the posterior mean with respect to the true parameter value θ_0,

rel-error = ( Ê(θ|x_0) − θ_0 )² / θ_0²,

where Ê(θ|x_0) denotes the posterior mean obtained with either method, and the squaring and division should be understood as element-wise operations. As the relative error depends on the observed data x_0, we computed the error for 100 different observed data sets. We performed a point-wise comparison between the proposed method and synthetic likelihood by computing the difference ∆_rel−error between the relative errors of the two methods for all elements of the parameter vector θ. A value of ∆_rel−error < 0 means that the relative error for the proposed method is smaller than the relative error for the synthetic likelihood approach; a value of ∆_rel−error > 0, on the other hand, indicates that synthetic likelihood is performing better. The empirical distributions of ∆_rel−error place most of their mass on negative values, which indicates that the proposed method is performing better in both applications. (The densities of ∆_rel−error were estimated using a Gaussian kernel density estimator with bandwidths 0.025 and 0.037; more area under the curve on the negative side of the x-axis indicates better performance of the proposed method.) As the proposed and the synthetic likelihood method use exactly the same summary statistics, we do not expect a large improvement in performance. Nevertheless, it can be seen that the proposed method is able to reduce the error for most parameters, including the difficult σ parameter of the Ricker model. We next analysed the impact of the improved inference on weather prediction, which is the main area of application of the Lorenz model. Having observed the weather variables for t ∈ [0, 4], or 20 days, we would like to predict the weather of the following days. We here consider prediction over a horizon of ten days, which corresponds to t ∈ [4, 6].
Given x_0, we first estimated the posterior means Ê(θ|x_0) and Ê_SL(θ|x_0) of the parameters using the proposed method and the synthetic likelihood approach, respectively. Taking the final values of the observed data (y_k^(4), k = 1, . . . , 40) as initial values, we then simulated the future weather development using the SDE in Equation (28) for the true parameter value θ_0 as well as for the two competing sets of estimates. Let us denote the 40-dimensional time series corresponding to θ_0, Ê(θ|x_0) and Ê_SL(θ|x_0) at time t by y^(t), ŷ^(t), and ŷ_SL^(t), respectively. We then compared the proposed method and the synthetic likelihood method by comparing their prediction errors. Denoting the Euclidean norm of a vector by ||·||, we computed

ζ^(t)(x_0) = ( ||y^(t) − ŷ_SL^(t)|| − ||y^(t) − ŷ^(t)|| ) / ||y^(t) − ŷ_SL^(t)||,

which measures the relative decrease in the prediction error achieved by the proposed method over synthetic likelihood. As the estimates depend on the observed data x_0, ζ^(t)(x_0) depends on x_0. We assessed its distribution by computing its values for 100 different x_0.
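A small sketch of this comparison; the array shapes and names are illustrative, assuming the relative decrease in Euclidean prediction error described above.

```python
# Relative decrease in the Euclidean prediction error of the proposed method
# over synthetic likelihood, per forecast time step.
import numpy as np

def zeta(y_true, y_hat, y_hat_sl):
    """Each argument: array of shape (n_steps, 40) with simulated trajectories."""
    err = np.linalg.norm(y_true - y_hat, axis=1)       # proposed method
    err_sl = np.linalg.norm(y_true - y_hat_sl, axis=1) # synthetic likelihood
    return (err_sl - err) / err_sl                     # > 0: proposed is better
```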
In Figure 9, we plot the median and the 1/4 and 3/4 quantiles of ζ^(t)(x_0) for t ∈ [4, 6], corresponding to one to ten days into the future. As the median is always positive, the proposed method obtains, on average, a smaller prediction error than synthetic likelihood: we achieve a clear improvement in prediction performance for the first days, while for longer-term forecasts the improvement becomes smaller, which is due to the inherent difficulty of making long-term predictions for chaotic time series.

Discussion
In the paper, we considered the problem of estimating the posterior density when the likelihood function is intractable but generating data from the model is possible. We framed the posterior density estimation problem as a density ratio estimation problem.
The latter problem can be solved by (nonlinear) logistic regression and is thus related to classification. This approach for posterior estimation with generative models mirrors the approach of Gutmann and Hyvärinen (2012) for the estimation of unnormalised models.
The main difference is that here, we classify between two simulated data sets while Gutmann and Hyvärinen (2012) classified between the observed data and simulated reference data.
This difference reflects the fact that generating samples is relatively easy for generative models while typically difficult for unnormalised models.
For unnormalised models, Gutmann and Hyvärinen (2012) found that increasing the size of the reference data set increases the accuracy of the estimates. We here worked with two equally sized data sets. But the data set with samples from the marginal can be made larger without much overhead as it is only generated once, which will likely lead to more accurate posterior estimates.
In our previous work (Gutmann et al., 2014, 2017), we showed how to perform likelihood-free inference via classification. The difference to the current work is that we there classified between observed and simulated data, and not between two simulated data sets as done here. The key advantage of working with two simulated data sets is that it supports posterior inference given a single observed datum only, as we are guaranteed to have enough data to train the classifier. Goodfellow et al. (2014) classified between observed and simulated data as well, in order to generate random samples via a neural network. Our approach thus distinguishes itself not only in the different goals but also in the fact that we work with two simulated data sets.
Working with two simulated data sets has the additional advantage that most computations can be performed offline before the observed data are seen, and that computations can be recycled for newly observed data sets, which enables "crowd-sourcing" of computations. This kind of (shared) pre-computations can be particularly advantageous when the posterior needs to be estimated as part of a decision making process that is subject to time constraints.
Our method requires that several samples from the model are generated for the estimation of the posterior at any parameter value, like for synthetic likelihood (Wood, 2010).
While the sampling can be performed perfectly in parallel, it constitutes the main computational cost. There are several ways to reduce it: First, the proposed approach is compatible with likelihood-free inference that uses Bayesian optimisation to intelligently decide where to evaluate the posterior (Gutmann and Corander, 2016), thus reducing unnecessary computations. Second, we can learn the relation between the parameters and the weights in the logistic regression model from the outcomes of the regressions performed for previous parameter values. An initial estimate of the posterior can thereby be obtained without any new sampling from the model, and additional computations may only be spent on fine-tuning that estimate. Third, for prior distributions much broader than the posterior, performing logistic regression with samples from the marginal distribution is not very efficient. Iteratively constructing a proposal distribution that is closer to the posterior, e.g.
by re-using data sets and parameter values from previous simulations, will likely lead to computational gains. The estimated posterior will not be properly normalised but this is typically not a problem when used as the target distribution for posterior sampling by Monte Carlo algorithms.
Density ratio estimation has previously been used by Izbicki et al. (2014) and Cranmer et al. (2015) to approximate likelihood functions and to estimate likelihood-ratios in the framework of hypothesis testing, respectively. In other recent work, Pham et al. (2014) used classification to estimate the ratio of the likelihoods of two parameters appearing in the acceptance probability of the Metropolis-Hastings sampling scheme. If we used the approximate posterior distribution in Equation (11) to estimate the acceptance probability, we would end up with a similar density ratio as Pham et al. (2014). A key difference is that our approach results in estimates of the posterior and not in a single accepted, or rejected, parameter value, which is arguably less informative. Additionally, the MCMC approach is computationally often not very efficient for likelihood-free inference, since long simulations from the model may nonetheless result in rejected parameter values. Importantly, our work goes beyond using density ratio estimation for posterior estimation in that we show how it can be used to automatically combine and select relevant summary statistics from a large pool of candidates.
The existing work related to summary statistics selection in likelihood-free inference (Aeschbacher et al., 2012; Fearnhead and Prangle, 2012; Blum et al., 2013; Gutmann et al., 2014, 2017; Marin et al., 2016) has generally concentrated on ABC. Our approach is more closely related to the synthetic likelihood approach by Wood (2010), including it as a special case. While the cited methods for summary statistics selection in ABC might be adaptable for use with synthetic likelihood, the summary statistics generally have to be transformed before use in order to match the Gaussianity assumption of synthetic likelihood. This is in contrast to our approach, which automatically adapts to non-Gaussianity of the summary statistics.
Our method is simple to use and relies on standard statistics or machine learning libraries. Ease of use is shared with existing likelihood-free inference methods that are based on a Gaussianity assumption for the summary statistics or the data generating process (Wood, 2010; Fan et al., 2013; Price et al., 2016). But even the simple approach of using a linear basis expansion in the logistic regression yields a richer statistical model than the Gaussian family, in the form of the exponential family. The advantage of working with the exponential family has been demonstrated on a number of applications, including an auto-regressive time series model (Section 5.2) and stochastic nonlinear dynamical systems (Section 6). Moreover, we are not restricted to (penalised) linear logistic regression. Our approach can be used with more general regression models too, as portrayed in Figure 1. Gutmann and Hirayama (2011) used the Bregman divergence to estimate ratios of probability distributions, thereby introducing a large family of estimators for unnormalised models that contains the method by Gutmann and Hyvärinen (2012) as a special case. The connection between Gutmann and Hyvärinen (2012) and the paper at hand suggests that an equally large family of inference methods can be obtained for the estimation of generative models, containing our method based on logistic regression as a special case.
In conclusion, our paper opens up a direction to new likelihood-free inference methods based on logistic regression or other density ratio estimation schemes that can be used whenever the likelihood function is not available but sampling from the model is possible.

Acknowledgements
The work was partially done when RD and MUG were at the Department of Computer Science.

Author contributions: Idea and theory: MUG; implementation and simulations: RD; writing of the paper: MUG and RD; contributed to writing: JC and SK.

A Proof of Equation (8)
We here prove that log r(x, θ) = log (p(x|θ)/p(x)) minimises J(h, θ) in Equation (7) in the limit of large n_θ and n_m. We first simplify the notation and denote x^θ by x, its pdf p(x|θ) by p_x, n_θ by n, x^m by y, its pdf p(x) by p_y, and n_m by m. Moreover, as θ is considered fixed in this step, we drop the dependency of J on θ. Equation (7) thus reads

J(h) = (1/(n + m)) [ Σ_{i=1}^{n} log( 1 + ν exp(−h(x_i)) ) + Σ_{i=1}^{m} log( 1 + (1/ν) exp(h(y_i)) ) ].

We will consider the limit where n and m are large, with fixed ratio ν = m/n. In this limit, by the law of large numbers, J(h) converges to

J̃(h) = (1/(1 + ν)) E_x[ log( 1 + ν exp(−h(x)) ) ] + (ν/(1 + ν)) E_y[ log( 1 + (1/ν) exp(h(y)) ) ],

where E_x and E_y denote expectations with respect to p_x and p_y, respectively.
The function h* that minimises J̃(h) also minimises J(h) in the limit of large n and m.
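The minimiser of J̃ can be obtained by pointwise minimisation; a brief sketch of this standard argument:

Since J̃(h) depends on h only through its values at individual points, we can minimise the integrand

(1/(1 + ν)) p_x(u) log( 1 + ν exp(−h(u)) ) + (ν/(1 + ν)) p_y(u) log( 1 + (1/ν) exp(h(u)) )

separately for each u. Setting the derivative with respect to h(u) to zero gives

−(1/(1 + ν)) p_x(u) ν / ( exp(h(u)) + ν ) + (ν/(1 + ν)) p_y(u) exp(h(u)) / ( exp(h(u)) + ν ) = 0,

which simplifies to p_x(u) = p_y(u) exp(h(u)), that is exp(h*(u)) = p_x(u)/p_y(u). The integrand is convex in h(u), so this stationary point is the minimum, and hence h*(x) = log( p_x(x)/p_y(x) ) = log( p(x|θ)/p(x) ), which is Equation (8).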