On a Class of Objective Priors from Scoring Rules

Objective prior distributions represent an important tool that allows one to have the advantages of using a Bayesian framework even when information about the parameters of a model is not available. The usual objective approaches work off the chosen statistical model and in the majority of cases the resulting prior is improper, which can pose limitations to a practical implementation, even when the complexity of the model is moderate. In this paper we propose to take a novel look at the construction of objective prior distributions, where the connection with a chosen sampling distribution model is removed. We explore the notion of defining objective prior distributions which allow one to have some degree of flexibility, in particular in exhibiting some desirable features, such as being proper, log-concave or convex. The basic tools we use are proper scoring rules, and the main result is a class of objective prior distributions that can be employed in scenarios where the usual model based priors fail, such as mixture models and model selection via Bayes factors. In addition, we show that the proposed class of priors is the result of minimising the information it contains, providing a solid interpretation of the method.


Introduction
With the ever increasing popularity of Bayesian methods, attributable largely to the advent of Markov chain Monte Carlo methods and other sampling techniques, the demand for default priors, otherwise known as objective or noninformative priors, has also grown. Model based objective priors, such as the reference prior (Berger et al., 2009) and Jeffreys prior (Jeffreys, 1961), are commonly used when available. However, as models become larger and more complex, such priors are becoming more difficult to obtain, if not altogether unavailable. Indeed, it is our contention that model based objective priors have now reached their natural ceiling, with little progress in recent years.
Our observation is that the limits to progress in research on objective priors are connected to their improperness. In fact, with very few exceptions, objective priors are improper. Although this may not represent a problem, as long as the posterior is proper, it causes severe limitations to the use of objective prior distributions. Indeed, the improperness of objective priors is the main motivation which brought us to investigate a novel approach to derive objective priors. A thorough discussion of the problems that improper priors cause can be found in Kass and Wasserman (1996), where they illustrate issues such as incoherence, strong inconsistencies and nonconglomerability, dominating effect of the prior, inadmissibility, marginalisation paradoxes and improper posteriors. Although all points are undoubtedly important, it is probably the last issue that requires careful consideration. The main concern is that, as of today, general results that allow one to assess if a given improper prior yields a proper posterior are yet to be found. Research has progressed on a case by case basis; for example, see Ibrahim and Laud (1991) for the use of Jeffreys prior in generalised linear models, extended to overdispersed models of the same kind by Dey et al. (1993), Natarajan and McCulloch (1995), Berger and Strawderman (1993), and Yang and Chen (1995). More recently, Rubio and Steel (2018) describe general conditions to use improper priors for linear mixed models with longitudinal and survival data. As one would expect, the task of assessing posterior properness becomes more onerous the more complex the model is. But, even for simple models, the risk is high, as shown, for example, by the discussion in Vallejos and Steel (2013) about the use of the Jeffreys rule prior for the Student-t regression model, derived in Fonseca et al. (2008).
In this paper, we investigate constructing objective prior distributions that are not model dependent and are based solely on knowledge of the parameter space, say, Θ. As such, the connection between the prior distribution, p(θ), and the likelihood function, f(·|θ), is limited to the common parameter space only. Therefore, the prior p(θ) using only Θ loses the connection with the subjective component f(·|θ) and could, as a consequence, be argued to be more objective. Conversely, model based priors, such as the Jeffreys prior (Jeffreys, 1946, 1961) or the reference prior (Berger et al., 2009), necessarily include the subjective choice of the model. In fact, models are by and large misspecified and, consequently, model based priors propagate this misspecification. So, while a model based prior reinforces the connection between the misspecified model and the prior itself, a prior that depends only on the parameter space loses that connection.
The method we propose to derive a novel class of objective priors is based on defining a scoring rule as a combination of the log-score and of the Hyvärinen score (Parry et al., 2012). We then seek a prior such that the above score, say S(θ, p), is constant, that is
$$S(\theta, p) = \text{constant} \quad \forall\, \theta \in \Theta. \qquad (1)$$
We show in Section 3 that the density p satisfying (1) identifies a class of objective priors. Furthermore, we show in Section 4 that the density p(θ) solving (1) minimises the information in the prior as measured by a combination of the Shannon information and the Fisher information. As a result of these equivalent approaches, we show that the objective prior p(θ) is obtained by solving a second order differential equation, referred to as (2), which is the same as equation (4) of Section 3. Our prior is the class of solutions to the differential equation (2). The class allows us to include constraints such as properness, convexity and being decreasing, among others.
The development of objective priors is thoroughly reviewed in Consonni et al. (2018). The idea is that in a scenario where prior elicitation is not feasible, or not desirable, a prior distribution can be formed through structural or formal rules (Kass and Wasserman, 1996). The most popular objective prior is that of Jeffreys (1946, 1961), who proposed a prior distribution for continuous parameter spaces which is invariant under one-to-one transformations of the parameter space. In scenarios where there is only one parameter of interest, Jeffreys prior yields sensible posterior distributions. However, in cases where the parameter space has a dimension of two or more, the prior is known to yield posteriors with poor performance (sometimes giving paradoxical results, such as the marginalisation paradox, see Stone and Dawid (1972)). In such cases the priors are taken to be independent. Although other more general invariance priors have been proposed, such as in Dawid (1983), Hartigan (1964) and Jaynes (1968), the reference prior of Bernardo (Berger et al., 2009) represents an alternative to Jeffreys prior. A limitation of reference priors is sensitivity to the order of importance of parameters; this issue and possible solutions have been discussed in Berger et al. (2015). Other objective priors proposed include that of Box and Tiao (1973), based on data-translated likelihoods, and maximum entropy priors, see, for example, Jaynes (1957, 1968). As discussed in Kass (1990), these priors turn out to be very restrictive. Another important class of objective priors are the probability matching priors, first proposed in Welch and Peers (1963). The aim is to obtain a prior distribution under which the posterior probabilities of certain regions coincide with their coverage probabilities, either exactly or approximately. Recent developments of this method can be found in Sweeting et al. (2006) and Sweeting (2008).
A different method, based on information theoretical concepts, has been proposed by Zellner and Min (1993), giving the so called maximal data information prior. Possibly, the most recent development in defining prior distributions, although not strictly in an objective sense, is discussed in Simpson et al. (2017). The idea is to identify the parts in a complex model that require subjective input, while the remaining parts can be associated with non-informative priors. It is important to point out that objective priors derived from scoring rules have been proposed in the recent work of Giummolè et al. (2018). A final consideration is reserved for discrete parameter spaces, whose systematic discussion can be seen to be generated by the paper of Rissanen (1983). The lack of general methods, due to the challenges that discreteness imposes, has been filled by Berger et al. (2012) first, and by Villa and Walker (2015) later.
It is known that for a mixture model Jeffreys prior can only be found under specific conditions, as studied in Grazian and Robert (2018), for reasons identified, in part, by Titterington et al. (1985). Also, standard objective priors are problematic in model comparison, since Bayes factors depend on the arbitrary normalizing constants of improper priors, although special solutions have been proposed, for example, by O'Hagan (1995) and Berger and Pericchi (1996). Further discussion of these contexts is contained in Sections 6.1 and 6.2, where solutions provided by our approach are explored. There are other examples where the use of improper priors is challenging. For example, Kass and Wasserman (1996) discuss some issues related to their use in hierarchical modelling. Also, paradoxical results may appear, such as the marginalisation paradox (Stone and Dawid, 1972) or Stein's paradox (Bernardo and Smith, 1994). For more examples, see Stone (1976) and Syversveen (1998).

The paper is structured as follows. In Section 2 we give an outline of the idea and discuss a simple introductory example. In Section 3 we introduce the foundations of the proposed prior on the basis of scoring rules and their properties. An interesting aspect of the prior based on scoring rules is its interpretation in terms of the information content carried by the prior itself. This aspect is explored in Section 4. In Section 5 we present the objective priors concentrating on Θ = (0, 1), (0, ∞), (−∞, +∞) and (−M, M ), for a finite M , as well as on a multidimensional space. The implementation of the prior for some specific applications is presented in Section 6 where, in addition, we explore some properties of the proposed prior. Finally, Section 7 is dedicated to some concluding remarks.

Outline of the idea
Here we illustrate two key motivating examples, in particular in relation to the use of improper priors, and give an intuitive description of the core ideas of this work.
The key to our idea is to consider a loss function l(θ, p(θ)) which penalizes, for each θ ∈ Θ, a choice of a prior density p(θ). The objective criterion is then based on the idea of finding the class of p which makes l(θ, p(θ)) constant. For obvious reasons, the loss function should have the following property
$$\int_\Theta l(\theta, q(\theta))\, q(\theta)\, d\theta \le \int_\Theta l(\theta, p(\theta))\, q(\theta)\, d\theta \qquad (3)$$
for all q's representing a density for θ. In other words, if a "true" density for θ exists, the expected loss should be minimised when such a density is chosen. The condition in (3) identifies a particular class of loss functions, known as proper scoring rules. One way of interpreting (proper) scoring rules is as loss functions that measure the quality of a quoted density p for an uncertain quantity θ; see, for example, Parry et al. (2012). We indicate a proper scoring rule by S(θ, p), and we ask it to be constant for all θ ∈ Θ. So we set S(θ, p) = constant ∀ θ ∈ Θ, and the densities satisfying the above equality identify a class of objective priors. We set the constant to 1 and show later that this choice is without loss of generality. The criterion defining this class of priors is clearly objective, for if the scoring rule were not constant, some parts of the space Θ would be given preference above others.
As discussed in Parry et al. (2012) a scoring rule is defined as local if it depends on the density function p(·) only through its value at θ, that is p(θ). Holding the above definition, we have that any proper local scoring rule is equivalent to the log score, − log p(θ), also known as the self information loss function. However, the above scoring rule is not suitable to us, for if we set − log p(θ) = constant, we only achieve p(θ) ∝ 1, and the result is not interesting. Hence, we extend the score function to include the first and second derivatives of p(θ); a natural extension, in our opinion.
Thus, we consider additionally the Hyvärinen scoring rule (Hyvärinen, 2005), which makes use of the first two derivatives of p, written as p′ and p″. We then have a scoring rule S(θ, p) which has two components: the log score and the Hyvärinen score. Finding solutions to S(θ, p) = 1 will now involve solving a second order differential equation, and we obtain the class of priors through the two constants connected with the two derivatives. Multivariate versions of the scoring rule criterion we are proposing, and corresponding solutions, are discussed in Section 3. Parry et al. (2012) describe a larger class of scores based on the first two derivatives; see (39) in their paper. In the second order case these are based on an Euler-Lagrange equation, which itself is based on a measure of information of the form $\int [p'(\theta)]^k / [p(\theta)]^{k-1}\, d\theta$ for k = 2, 3, .... The only widely recognised information occurs with k = 2, namely the Fisher information, and the corresponding score coincides with the Hyvärinen score. Hence, we use this. A connection between our scoring rule criterion and information is made through an alternative formulation based on variational methods, which is discussed in Section 4.

Priors from scoring rules
Let us consider a quantity of interest, θ, which can take values in the space $\Theta \subset \mathbb{R}^k$. The fundamental argument behind objective prior distributions is that they should represent a state of actual or alleged prior ignorance about the true value of θ. Several criteria have been proposed to select such a prior, all of which assume that a probabilistic model generating the data (given θ) has been chosen. What we propose is to avoid this choice and derive a prior depending on Θ only. The idea is to measure the quality of the prior p with a proper scoring function, say S(θ, p), and require it to be constant, as discussed in the Introduction.

Definition 1.
A density p with respect to the Lebesgue measure on Θ, is objective (in accordance with commonly accepted meaning of the expression) if S(θ, p) = constant for all θ ∈ Θ, where S is a proper scoring rule.
The constant here is unimportant and we can take it as 0. Any constant works; effectively it becomes 0 once the normalizing constant for p has been established.
Before proceeding we provide a brief discussion on scoring rules. A scoring rule is proper if $\int_\Theta S(\theta, p)\, q(\theta)\, d\theta$ is minimized at p = q, and local if it depends on p only through the value p(θ). The unique proper local scoring rule is the log score, defined as
$$S_L(\theta, p) = -\log p(\theta).$$
Parry et al. (2012) extend the local property to m-local, in that now S(θ, p) depends also on the l-th derivative $p^{(l)}(\theta)$, for 0 ≤ l ≤ m. In particular, for m = 2, there is the Hyvärinen scoring rule (Hyvärinen, 2005), given by
$$S_H(\theta, p) = \Delta \log p(\theta) + \tfrac{1}{2}\, |\nabla \log p(\theta)|^2.$$
In Section 4 we will illustrate the connection of the Hyvärinen scoring rule to Fisher information. Our choice of scoring rule, following our reasoning in Section 2 and with w the weighting factor, is
$$S(\theta, p) = w\, S_L(\theta, p) + S_H(\theta, p).$$
That this is a proper scoring rule is derived from the fact that it is the sum of two proper scoring rules. It is also clearly 2-local. Previously, priors have been sought based solely on log p, for example the reference prior, and the mathematics becomes unnatural as a consequence. On the other hand, including higher derivatives yields well defined solutions to optimization procedures. That we set this score to 1 for all θ is done without loss of generality, as we shall see later on. That we understand this to be an objective procedure is evident from the fact that no part of Θ is being given preference; the loss at θ for our choice of p(θ) is the same for all θ. For, if S(θ, p) did depend on θ, then we argue that this could only be driven by information; i.e. parts of the Θ space are preferential to others.
Predominantly, throughout the paper, we will be using the choice of w = 1. After all, the value of w is a calibration issue between the two scores, i.e. to put them on a comparable scale. The reason for w = 1 is that, for the benchmark standard normal density function, i.e. p(θ) ∝ exp(−θ²/2), the difference between the scores S_L and S_H is a constant (i.e. does not depend on θ) only for w = 1, so that one score does not end up dominating the other.
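The calibration argument for w = 1 can be checked numerically. The following is a minimal sketch, assuming the log score S_L = −log p and the Hyvärinen score in the one-dimensional form S_H = (log p)″ + ½[(log p)′]²; the function names and the finite-difference step are illustrative choices, not part of the original text.

```python
import math

def log_p(theta):
    """Log density of the standard normal benchmark."""
    return -0.5 * theta ** 2 - 0.5 * math.log(2.0 * math.pi)

def log_score(theta):
    """Log score S_L = -log p(theta)."""
    return -log_p(theta)

def hyvarinen_score(theta, h=1e-3):
    """Assumed form S_H = (log p)'' + 0.5 * ((log p)')**2, by central differences."""
    d1 = (log_p(theta + h) - log_p(theta - h)) / (2.0 * h)
    d2 = (log_p(theta + h) - 2.0 * log_p(theta) + log_p(theta - h)) / h ** 2
    return d2 + 0.5 * d1 ** 2

def score_gap(theta, w=1.0):
    """Difference w * S_L - S_H at theta."""
    return w * log_score(theta) - hyvarinen_score(theta)

grid = [-3.0, -1.0, 0.0, 0.5, 2.0]
# With w = 1 the gap should not depend on theta.
differences = [score_gap(t, w=1.0) for t in grid]
# With w = 2 the gap grows with theta squared.
differences_w2 = [score_gap(t, w=2.0) for t in grid]
```

Under these assumptions the entries of `differences` agree up to rounding, while those of `differences_w2` vary with θ.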
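Hence, we see that the objective prior, p(θ) ∝ exp{−u(θ)}, is obtained by solving the following differential equation:
$$\sum_{j=1}^{k} \left\{ \frac{\partial^2 u}{\partial\theta_j^2} - \frac{1}{2}\left(\frac{\partial u}{\partial\theta_j}\right)^2 \right\} = w\,u. \qquad (4)$$
To derive the solution, we have the following result: a solution to (4) satisfies, for j = 1, ..., k and constants c_j,
$$\frac{\partial u}{\partial\theta_j} = \pm\sqrt{c_j\, e^{u} - \frac{2w}{k}\,(1+u)}. \qquad (5)$$
Proof. Solving the differential equation (4) is equivalent to solving the following differential equation,
$$\frac{\partial v_j}{\partial\theta_j} = \frac{1}{2}\, v_j^2 + \frac{w\,u}{k},$$
having defined v_j = ∂u/∂θ_j. It is now seen that the solution is given by (5). This follows by differentiating (5) with respect to θ_j, which gives ∂v_j/∂θ_j = ½ c_j e^u − w/k, and noting that c_j e^u − 2w/k = v_j² + 2wu/k. Therefore (6) holds.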

The missing pieces in (5) are c = (c_j) and, say, u(0), the constants of integration. Note that the initial value u(0) (together with the constant c) is required to ensure the existence and uniqueness of the solution. We will see how to complete these when we look at illustrations in Section 5. In general, as the solution depends on the above arbitrary constants, our method provides a class of solutions, where some are proper and some are improper and, more generally, where the priors will have some assigned properties via specification of (c, u(0)).
We also note here that we do not need the normalizing constant for p and neither do we need to find an explicit solution for u, and p, beyond (5). The reason for this is that we can find an accurate solution via numerical methods; i.e. if we have u(θ) at a particular θ value, then we can evaluate
$$u(\theta+\varepsilon) \approx u(\theta) + \varepsilon\, \frac{\partial u}{\partial\theta} + \frac{\varepsilon^2}{2}\, \frac{\partial^2 u}{\partial\theta^2}$$
for small ε, and the ∂u/∂θ and ∂²u/∂θ² are available explicitly, combined with the ease of obtaining higher derivatives if needed. From here we can evaluate p(θ).
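The stepping scheme just described can be sketched in a few lines. This is a minimal illustration for the one-dimensional case, assuming w = k = 1, the increasing branch of (5), and the explicit derivatives u′ = √(c eᵘ − 2(1 + u)) and u″ = ½ c eᵘ − 1; the function names, the starting value u(0) = 0.5 and the grid are hypothetical choices made for the example.

```python
import math

def u_prime(u, c=2.0):
    """First derivative from (5) with w = k = 1, increasing branch."""
    return math.sqrt(max(c * math.exp(u) - 2.0 * (1.0 + u), 0.0))

def u_second(u, c=2.0):
    """Second derivative obtained by differentiating (5) once."""
    return 0.5 * c * math.exp(u) - 1.0

def solve_u(u0, theta_max, n_steps, c=2.0):
    """Second-order Taylor stepping for u on [0, theta_max], starting at u(0) = u0."""
    eps = theta_max / n_steps
    us = [u0]
    for _ in range(n_steps):
        u = us[-1]
        us.append(u + eps * u_prime(u, c) + 0.5 * eps ** 2 * u_second(u, c))
    thetas = [i * eps for i in range(n_steps + 1)]
    return thetas, us

# u can grow quickly, so the grid is kept short in this sketch.
thetas, us = solve_u(u0=0.5, theta_max=1.0, n_steps=1000)
unnormalised_p = [math.exp(-u) for u in us]
```

Halving the step size and comparing the end values is a simple way to judge whether the grid is fine enough for a given (c, u(0)).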
To ease the reader into the proposed prior, we illustrate the following simple example, where the parameter space is Θ = (−M, +M). Here we discuss an explicit solution to the equation $u'(\theta) = \pm\sqrt{c\,e^{u(\theta)} - 2(1 + u(\theta))}$, which is (5) where we have set w = 1 and, as the parameter space is unidimensional, k = 1. If we set c = 0 then for a solution to exist we must have 1 + u negative. Consider a prior on Θ = (−M, +M) for some finite M. With c = 0 we have the solution for u in the form $1 + u(\theta) = -\tfrac{1}{2}(\theta - \mu)^2$ for some μ. Hence, $p(\theta) \propto \exp\{\tfrac{1}{2}(\theta - \mu)^2\}$, which will provide a proper density on Θ. A more general solution $p(\theta) \propto \exp\{\tfrac{1}{2} w (\theta - \mu)^2\}$ arises when we take the more general form of score function, i.e. S(θ, p) = w S_L(θ, p) + S_H(θ, p), providing interpretation for the score weighting parameter in this case. Plots of such p(θ) depending on w are presented in Figure 1, with M = 2.
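The closed-form solution on (−M, +M) is easy to verify numerically. The sketch below assumes that the one-dimensional differential equation takes the form u″ = ½(u′)² + w u, which is consistent with (5) for k = 1, and checks that u(θ) = −1 − ½ w (θ − μ)² satisfies it; it then normalises p on a grid with the trapezoidal rule. The function names and grid size are illustrative.

```python
import math

def u_closed(theta, w=1.0, mu=0.0):
    """Closed-form u for c = 0: 1 + u = -(w / 2) * (theta - mu)**2."""
    return -1.0 - 0.5 * w * (theta - mu) ** 2

def ode_residual(theta, w=1.0, mu=0.0, h=1e-3):
    """Residual u'' - 0.5 * u'**2 - w * u, which should vanish (assumed ODE form)."""
    u0 = u_closed(theta, w, mu)
    up = (u_closed(theta + h, w, mu) - u_closed(theta - h, w, mu)) / (2.0 * h)
    upp = (u_closed(theta + h, w, mu) - 2.0 * u0 + u_closed(theta - h, w, mu)) / h ** 2
    return upp - 0.5 * up ** 2 - w * u0

def normalised_prior(M=2.0, w=1.0, mu=0.0, n=2001):
    """Normalise p proportional to exp(0.5 * w * (theta - mu)**2) on (-M, M) by the trapezoidal rule."""
    step = 2.0 * M / (n - 1)
    grid = [-M + i * step for i in range(n)]
    unnorm = [math.exp(0.5 * w * (t - mu) ** 2) for t in grid]
    area = step * (sum(unnorm) - 0.5 * (unnorm[0] + unnorm[-1]))
    return grid, [v / area for v in unnorm]

grid, density = normalised_prior(M=2.0, w=1.0)
```

The resulting density is lowest at μ and highest at the boundaries, and a larger w makes this bowl shape more pronounced.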

Variational problems and solutions
Here we provide an alternative derivation of (4) using information theory, specifically entropy information and Fisher information. We show that the p solving (4) can also be regarded as a density carrying minimal local information. This material then is to provide support for the solution to (4) being an objective prior.
The entropy information (negative entropy) of a density function p is given by $I_E(p) = \int p(\theta) \log p(\theta)\, d\theta$, which is related to Shannon's entropy and is equal to the negative of the expected self-information loss. In addition to I_E(p), we consider a measure of the information in the density p known as Fisher information, given by
$$I_F(p) = \int \frac{[p'(\theta)]^2}{p(\theta)}\, d\theta.$$
See, for example, Bobkov et al. (2014).
Now consider $I(p) = I_E(p) + \tfrac{1}{2} I_F(p)$, where the aim is to find the p which minimizes I(p). Recalling variational methods (Rustagi, 1976), if we wish to minimise $\int_a^b L(\theta, p, p')\, d\theta$, a necessary condition for a local extremum of the integral of the Lagrangian L is the Euler-Lagrange equation
$$\frac{\partial L}{\partial p} - \frac{d}{d\theta}\frac{\partial L}{\partial p'} = 0. \qquad (7)$$
Minimising $\int_a^b L(\theta, p, p')\, d\theta$ reduces to the classical calculus of variations problem where we want to extremize the integral of the function
$$L(p, p') = p \log p + \frac{1}{2}\,\frac{(p')^2}{p}.$$
The solution to the extremal problem, if it exists, is obtained from the Euler-Lagrange equation, given by (7). According to page 44 of Rustagi (1976), if L(p, p′) is strictly convex on (0, ∞) × (−∞, +∞), and p satisfies the Euler equation, then p is a minimum of $\int_a^b L(p, p')\, d\theta$.

Theorem 4.1. A minimum satisfying the Euler-Lagrange equations is given by the p solving the differential equation $p' = \pm p \sqrt{c/(e\,p) + 2 \log p}$, for some suitable c.

Proof. Calculations give the Hessian of L with respect to (p, p′),
$$\frac{1}{p}\begin{pmatrix} 1 + \kappa^2 & -\kappa \\ -\kappa & 1 \end{pmatrix},$$
where κ = p′/p. This is easily seen to be a positive definite matrix; the eigenvalues are given by
$$\frac{1}{p}\left\{ 1 + \frac{\kappa^2}{2} \pm \frac{|\kappa|}{2}\sqrt{\kappa^2 + 4} \right\},$$
which are positive.
Then (7), after some elementary algebra and differentiation, leads to the differential equation (2), which we report here for convenience, which is the same as (4).
This differential equation has the solution derived in the previous section. It is interesting that the Euler-Lagrange equations are solved by precisely the same p solving (4).
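For the one-dimensional case, the step from the Euler-Lagrange equation to the differential equation in u can be sketched as follows. This is a reconstruction under stated assumptions rather than the authors' own working: it takes the Fisher information as the integral of (p′)²/p, so that the integrand of I(p) is L(p, p′) = p log p + ½(p′)²/p, and it writes p = exp(−1 − u), a parameterisation chosen here so that the additive constant is absorbed.

```latex
\begin{align*}
L(p, p') &= p \log p + \frac{1}{2}\,\frac{(p')^2}{p}, \\
\frac{\partial L}{\partial p} &= \log p + 1 - \frac{1}{2}\left(\frac{p'}{p}\right)^2,
\qquad
\frac{\partial L}{\partial p'} = \frac{p'}{p}, \\
\frac{d}{d\theta}\,\frac{\partial L}{\partial p'} &= \frac{p''}{p} - \left(\frac{p'}{p}\right)^2, \\
0 &= \frac{\partial L}{\partial p} - \frac{d}{d\theta}\,\frac{\partial L}{\partial p'}
   = \log p + 1 + \frac{1}{2}\left(\frac{p'}{p}\right)^2 - \frac{p''}{p}. \\
\intertext{Substituting $p = e^{-1-u}$, so that $p'/p = -u'$ and $p''/p = (u')^2 - u''$:}
0 &= -u + \frac{1}{2}(u')^2 - (u')^2 + u''
\quad\Longrightarrow\quad
u'' = \frac{1}{2}(u')^2 + u .
\end{align*}
```

The last line is the one-dimensional differential equation with w = 1, and its first integral (u′)² = c eᵘ − 2(1 + u) is the form that appears in Theorem 4.1 once it is rewritten in terms of p.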
It might be thought that we would need the constraint $\int p(\theta)\, d\theta = 1$, that is, to consider L(θ, p, p′) + λ(p), i.e. to include a Lagrange multiplier. However, any ensuing differences are covered by our note in Section 5.

Illustrations
Before proceeding with some illustrations and applications, it is important that we discuss two aspects of the proposed class of priors within the boundaries of Objective Bayes: i.e. uniqueness and invariance.
It is important to discuss uniqueness and flexibility associated with objective priors. It is widely acknowledged that a prior representing total ignorance is elusive, and it might not even be possible to obtain in principle, see Bernardo and Smith (1994). As a consequence, any prior distribution, objective or not, must out of necessity provide some knowledge about something, and this "something" is not necessarily unique. For example, given a particular problem, the corresponding objective prior over a given parameter space could be proper or improper; differentiable everywhere or not; convex; log-concave; etc. In other words, a prior can be objective and exhibit desirable features of choice without impinging on subjective components relating to information.
So while we will be introducing a Bayesian objective prior criterion, it does not lead to a unique prior, rather to a class of priors, where some desirable features may or may not be included. We believe that this level of flexibility is a point of strength of the proposed approach, making it adaptable to different scenarios, including those where model based priors do not work.
Another fundamental point of discussion about prior distributions and, in particular, objective prior distributions, is invariance. Indeed, Jeffreys' rule to derive a prior distribution for the parameters of a given model is based on an invariance requirement, in particular on invariance under one-to-one reparameterisations. Also, other common objective priors, such as reference priors, have been shown to be invariant, and the same applies, for example, to the priors in Simpson et al. (2017).
Here we discuss invariance from two opposite perspectives: that it is not important, and that it is important. Before discussing this apparent contradiction, we need to point out that the criterion by which we define the objective prior, that is setting the scoring rule equal to a constant, S(θ, p(θ)) = constant, is invariant under location transformations.
The question is whether lack of invariance has any practical implications given that only one parameterisation will be used. Current model based objective procedures are bound to throw away some coherence properties to achieve invariance, see Kass and Wasserman (1996). However, our point is that there is no practical consequence of any relevance arising from the lack of invariance, given that, as mentioned above, a single parameterisation will be used. For example, in the case the chosen model is the normal density, one either considers the precision parameter or the variance parameter, not both. And whichever parameterisation is used, our claim is that the corresponding objective prior is adequate for the purpose to which it has been assigned.
The above points of discussion are concerned with the perspective that invariance is not important. To consider the opposite point of view, let us assume that there is a canonical parameterisation for the model f(·|θ). Certainly, for most models the set of parameters for which priors would be assigned is obvious. For example, the exponential family has density of the form $f(x \mid \theta) = h(x) \exp\{\sum_{j=1}^{p} \theta_j T_j(x) - A(\theta)\}$, where h, T_j and A denote the usual base function, sufficient statistics and log-normalising function, with θ = (θ_1, ..., θ_p) being the canonical parameterisation. We can then define the canonical objective prior for the statistical model f(·|θ), θ ∈ Θ, as $p_\Theta(\theta) = \prod_{j=1}^{p} p_{\Theta_j}(\theta_j)$, where $\Theta = \otimes_{j=1}^{p} \Theta_j$. Then, any transformed prior can be obtained in the usual way involving variable transformations; that is p(φ) = |J| p_Θ(θ(φ)), where J is the Jacobian matrix for the transformation.
To illustrate the proposed method we consider three common parameter spaces in the unidimensional case and one in the multidimensional case. In particular, we consider the space for a parameter representing a probability, that is Θ = (0, 1), the space Θ = (0, ∞), usually representing the support of scale parameters, and the support for (location) parameters Θ = (−∞, +∞). The space Θ = (−M, M), for some finite M, has been illustrated in Section 3. As an illustration for the multidimensional case, we consider the bidimensional parameter space (0, ∞)².

One dimensional parameter space
The aim here is to solve (7) for particular motivated choices of (c, u(0)), equivalently, (c, p(0)) or (p′(0), p(0)). In fact, the solutions to the Euler-Lagrange equations are many, and the choice of the two constants (c, u(0)) will then determine a unique solution.
[Figure 2: plots of u obtained by setting u(1/2) = 1.1 (a) and u(1/2) = 1.14 (b); plot of the normalised prior density obtained by setting u(1/2) = 1.14 (c).]
Now there is the flat solution for all Θ in (4) given by p(θ) ∝ 1. This is achieved by setting c = 2 and u(0) = 0. However, in each of the settings of Θ considered we can find alternative priors with particular features. So, e.g., for Θ = (0, 1) we ask that p(0) = p(1) = 0 and for Θ = (0, ∞) we ask that p is convex and decreasing.
Case Θ = (0, 1)
Here we consider the u function, recall p ∝ e^{−u}, and so for p(0) = p(1) = 0 we require u(0) = u(1) = ∞. For additional symmetry, we can take u(1/2) > 0, with c = 2 as the extremal value. Note there is a discontinuity in the derivative of u at θ = 1/2. As u(1/2) increases, u(0) and u(1) go to ∞. A plot of u is given in Figure 2a for u(1/2) = 1.1 and for u(1/2) = 1.14 in Figure 2b. For these figures we used a grid of 1000 points either side of θ = 1/2 to obtain the numerical solutions. In the latter case, the corresponding density p is presented in Figure 2c.
It is also possible to obtain a prior that mimics Jeffreys'; that is, a distribution that has spikes at θ = 0 and θ = 1 with the lowest value at θ = 1 2 . This is done by simply inverting u: i.e. set u = −u in the above prior.
Case Θ = (0, ∞)
For a prior defined on the space (0, ∞) we require a specific shape property for p (convex and decreasing) and then take extremal values for c and u(0). This property is common to most objective priors on (0, ∞). Thus, since p′ < 0 we require u′ > 0 and so $u' = \sqrt{c\,e^{u} - 2(1 + u)}$, and for u′ to exist for all u we must have c ≥ 2. Thus, as an extremal value, we take c = 2.
In the next result we show that u′ is bounded away from 0, and this will have important consequences for the properness of p.

Lemma 1. It is that u′ is bounded away from 0.
Proof. To show this we need to show that e^u − 1 − u is bounded away from 0 for u ≥ u(0). This follows trivially since e^u − 1 − u ≥ ½u² ≥ ½u(0)².
The result of Lemma 1 has also the implication that p is a proper density function. To show this, we require Gronwall's inequality (Gronwall, 1919). This inequality states that, if f and g are real valued functions on Θ = (0, ∞), g is differentiable on int(Θ), and $g'(t) \le f(t)\, g(t)$, then $g(t) \le g(0) \exp\{\int_0^t f(s)\, ds\}$.

It follows that, for some ε > 0, $p(\theta) \le p(0)\, e^{-\varepsilon\theta}$, and hence p is proper.
Proof. Since p′ = −u′p and we have u′ ≥ ε for some ε > 0, it is that p′ ≤ −εp. From Gronwall's lemma, with f(t) = −ε and g = p, we have that $p(\theta) \le p(0) \exp(-\varepsilon\theta)$, and hence the proof is complete.
To have a graphical image, in Figure 3a we plot the prior using the approximation available via a numerical solution to the differential equation for p. Note that this is the unnormalised p.
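The exponential bound behind the properness argument can also be checked on a numerical solution. The sketch below assumes c = 2 and w = 1, for which the inequality in the proof of Lemma 1 gives u′ ≥ u(0), so that ε = u(0) is one admissible choice; the starting value u(0) = 0.5, the grid, and the function names are illustrative, and p is scaled so that p(0) = 1.

```python
import math

def solve_u(u0, theta_max, n_steps, c=2.0):
    """Second-order Taylor stepping for u' = sqrt(c*exp(u) - 2*(1 + u)) on [0, theta_max]."""
    eps = theta_max / n_steps
    us = [u0]
    for _ in range(n_steps):
        u = us[-1]
        du = math.sqrt(max(c * math.exp(u) - 2.0 * (1.0 + u), 0.0))
        d2u = 0.5 * c * math.exp(u) - 1.0
        us.append(u + eps * du + 0.5 * eps ** 2 * d2u)
    return [i * eps for i in range(n_steps + 1)], us

def gronwall_check(u0=0.5, theta_max=1.0, n_steps=1000):
    """Return the grid, p(theta)/p(0), and the bound exp(-eps * theta) with eps = u0."""
    thetas, us = solve_u(u0, theta_max, n_steps)
    p_ratio = [math.exp(-(u - u0)) for u in us]
    bound = [math.exp(-u0 * t) for t in thetas]
    return thetas, p_ratio, bound

thetas, p_ratio, bound = gronwall_check()
```

On this grid the scaled density stays below the exponential envelope at every point, in line with the Gronwall argument.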

Case Θ = (−∞, +∞)
A solution here is a symmetric version of the case Θ = (0, +∞), which will represent a proper prior. On the other hand, if we ask that p is smooth at the origin, i.e. p′(0) = 0, then we need u′(0) = 0 and hence we must take (c, u(0)) to satisfy c e^{u(0)} = 2 + 2u(0). If now we take c = 2, then u(θ) is a constant, resulting in a flat (improper) prior for p.
For a proper prior here one could take u(0) to be small, say u(0) = 0.01 and then to take c = 2{1 + u(0)}/ exp{u(0)}. We computed numerically the right side; i.e. the (0, ∞) side, for p, which is shown in Figure 3b.
To make the choice of u(0) better motivated, we could equally set p(0) = 1/(σ√(2π)), corresponding to a normal density with zero mean and variance σ².

Applications
The proposed class of priors is illustrated through a simulation study and the discussion of some practical implementations. Besides two initial simple examples, this section is dedicated to showing the behaviour of the prior in cases where improper priors cannot be used (i.e. mixture models and model comparison via Bayes factors), while simulation studies and a real data example are illustrated, respectively, in Appendix B and Appendix C in the Supplementary Material.
Although we do not have an explicit form for p(θ), we can use (5) to calculate it numerically quite easily. In particular, if we know p(θ) then we calculate p(θ +δθ) for small δθ, hence setting up the possibility of a posterior estimation process via Metropolis-Hastings sampling. The algorithm employed is detailed in Appendix A of the Supplementary Material.
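To be specific, suppose we are currently at θ and the proposal value is θ′. The acceptance probability is
$$\alpha(\theta, \theta') = \min\left\{1,\ \frac{l(\theta')\, p(\theta')\, q(\theta \mid \theta')}{l(\theta)\, p(\theta)\, q(\theta' \mid \theta)}\right\}, \qquad (8)$$
where l(θ) is the likelihood function, and q(θ′|θ) is the proposal density. The evaluation of p(θ′)/p(θ) in (8) does not represent any particular challenge. In fact, we have
$$\frac{p(\theta')}{p(\theta)} = \exp\{-(u(\theta') - u(\theta))\},$$
where u is the solution of the differential equation
$$u''(\theta) = \tfrac{1}{2}\,[u'(\theta)]^2 + u(\theta). \qquad (9)$$
Equation (9) allows us to evaluate u(θ′) − u(θ) numerically, via
$$u(\theta') - u(\theta) \approx (\theta' - \theta)\, u'(\theta) + \tfrac{1}{2}(\theta' - \theta)^2\, u''(\theta) + \tfrac{1}{6}(\theta' - \theta)^3\, u'''(\theta),$$
where the derivatives are $u'(\theta) = \sqrt{c\,e^{u(\theta)} - 2(1 + u(\theta))}$, $u''(\theta) = \tfrac{1}{2} c\,e^{u(\theta)} - 1$, and $u'''(\theta) = \tfrac{1}{2} c\,e^{u(\theta)} \sqrt{c\,e^{u(\theta)} - 2(1 + u(\theta))}$, and so on. Depending on how far θ′ is from θ we can either use the direct approximation just given or otherwise get from θ to θ′ using smaller step sizes.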
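A Metropolis-Hastings step of this kind can be sketched as follows. This is an illustrative variant, not the algorithm of Appendix A: instead of stepping from θ to θ′ at each iteration, it tabulates u once on a fine grid with the third-order Taylor step and interpolates. The Poisson likelihood, the counts, the values c = 2 and u(0) = 0.01, the grid limit, the cap `U_CAP` and all function names are assumptions made for the example; the random-walk proposal is symmetric, so the q terms cancel.

```python
import math
import random

U_CAP = 50.0  # beyond this, exp(-u) is treated as numerically zero (guard chosen for this sketch)

def build_u_grid(u0, c, theta_max, step):
    """Tabulate u on [0, theta_max] with a third-order Taylor step (w = k = 1)."""
    us = [u0]
    for _ in range(int(round(theta_max / step))):
        u = us[-1]
        if u > U_CAP:
            us.append(math.inf)
            continue
        eu = math.exp(u)
        du = math.sqrt(max(c * eu - 2.0 * (1.0 + u), 0.0))
        d2u = 0.5 * c * eu - 1.0
        d3u = 0.5 * c * eu * du
        us.append(u + step * du + 0.5 * step ** 2 * d2u + step ** 3 * d3u / 6.0)
    return us

def log_prior(theta, us, step):
    """Unnormalised log prior -u(theta), by linear interpolation on the grid."""
    n = len(us) - 1
    if theta <= 0.0 or theta >= n * step:
        return -math.inf
    i = min(int(theta / step), n - 1)
    frac = theta / step - i
    u = (1.0 - frac) * us[i] + frac * us[i + 1]
    return -u if math.isfinite(u) else -math.inf

def log_likelihood(theta, data):
    """Poisson log likelihood up to an additive constant."""
    return sum(data) * math.log(theta) - len(data) * theta

def metropolis_hastings(data, n_iter=5000, start=2.0, proposal_sd=0.5, seed=1):
    rng = random.Random(seed)
    step = 1e-3
    us = build_u_grid(u0=0.01, c=2.0, theta_max=5.0, step=step)
    theta, samples, accepted = start, [], 0
    for _ in range(n_iter):
        proposal = theta + rng.gauss(0.0, proposal_sd)
        lp_new = log_prior(proposal, us, step)
        if lp_new > -math.inf:  # proposals outside the tabulated support are rejected
            log_alpha = (log_likelihood(proposal, data) + lp_new
                         - log_likelihood(theta, data) - log_prior(theta, us, step))
            if rng.random() < math.exp(min(0.0, log_alpha)):
                theta = proposal
                accepted += 1
        samples.append(theta)
    return samples, accepted

counts = [2, 3, 1, 4, 2, 3, 2, 1, 3, 3]  # hypothetical Poisson counts
samples, accepted = metropolis_hastings(counts)
```

Tabulating u once trades a little memory for a cheap prior evaluation at each iteration; stepping directly from θ to θ′, as described in the text, avoids fixing a grid limit in advance.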
The frequentist performances are illustrated in a thorough simulation study, which is presented in Appendix B of the Supplementary Material. There, we also show the complete analysis for two single i.i.d. samples, that is where we obtain data from a Poisson distribution and from a normal density with unknown mean and known variance.

Mixture models
In this section we discuss the application of the proposed method to a scenario where objective priors have been notoriously challenging, namely mixture models. Due to their flexibility, mixtures of probability distributions allow models suitable for complex data by building on simple components. As an example, consider a mixture of normal densities,
$$f(x) = \sum_{j=1}^{k} \omega_j\, N(x \mid \mu_j, \sigma_j^2), \qquad (10)$$
where k is a positive integer, including ∞, and the (ω_j, μ_j, σ_j) are the collection of parameters. Even under the scenario when k is known, the reference prior for model (10) has yet to be derived, and Jeffreys prior can only be obtained under specific conditions; see Grazian and Robert (2018). Furthermore, this type of model is subject to other issues related to non-identifiability and unbounded likelihoods, among others. The issues mainly arise from the fact that improper priors may not be appropriate, as we might not observe outcomes from every component of the mixture (Titterington et al., 1985). For example, Grazian and Robert (2018) show that Jeffreys prior is suitable for mixtures of normal densities only in certain circumstances; that is, when the unknown parameters are the weights. If the unknown parameters are the means or the variances, then using Jeffreys prior may lead to improper posteriors. In particular, if the unknown parameters are the means only, proper posteriors exist only when the number of mixture components is at most two; while, if the unknown parameters are the variances, or the means and the variances, then Jeffreys prior is not suitable for inference. The above issues can be generalised to apply to any type of mixture model.
Given that the objective prior we propose is proper, it allows inference on the parameters of a mixture density, as the resulting posteriors will be proper. As an illustration, we consider a mixture of three normal densities, where the weights and the parameters of the components are unknown. In particular, we sample from a three-component mixture with weights ω₁ = 0.25, ω₂ = 0.35 and ω₃ = 0.40, means μ₁ = −3.5, μ₂ = 0 and μ₃ = 2.5, and variances σ₁² = 0.5, σ₂² = 0.1 and σ₃² = 1.2. Note that we have chosen mixture components that are reasonably distant, so as not to be forced to impose any constraint to overcome identifiability, as this is not the focus of the paper. However, the implementation of such constraints is straightforward. For the parameters we have chosen prior independence, where the prior is the one defined on the space (0, 1) for the weights, on the space (−∞, ∞) for the means and on the space (0, ∞) for the variances, in agreement with Section 5. The prior on (−∞, ∞) is the symmetrised version of the one on Θ = (0, ∞). To ensure properness of the priors for the means and variances, we have set c = 2 and u(0) = 1.31 (as discussed in Section 5).
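To make the simulation setting concrete, the sketch below draws a sample from the three-component mixture with the weights, means and variances stated above. It is only an illustration of the data-generating step, not the code used for the paper; the function names and the seed are hypothetical, and only the Python standard library is used.

```python
import math
import random

WEIGHTS = [0.25, 0.35, 0.40]
MEANS = [-3.5, 0.0, 2.5]
VARIANCES = [0.5, 0.1, 1.2]


def sample_mixture(n, seed=0):
    """Draw n observations: pick a component by its weight, then sample from that normal."""
    rng = random.Random(seed)
    sds = [math.sqrt(v) for v in VARIANCES]  # gauss() expects a standard deviation
    data = []
    for _ in range(n):
        j = rng.choices(range(len(WEIGHTS)), weights=WEIGHTS)[0]
        data.append(rng.gauss(MEANS[j], sds[j]))
    return data


def mixture_density(x):
    """Evaluate the mixture density at x."""
    total = 0.0
    for w, m, v in zip(WEIGHTS, MEANS, VARIANCES):
        total += w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2.0 * math.pi * v)
    return total
```

Calling `sample_mixture(100)` and `sample_mixture(250)` gives data sets of the two sizes considered in the analysis, although the actual samples used in the paper will of course differ.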
We have performed the analysis on two data sets, of size n = 100 and n = 250. Full details, including histograms of the sample data, a description of the algorithm implemented, and convergence diagnostics, are reported in Appendix D of the Supplementary Material.

Model comparison
Another simple case where objective priors are problematic is in model comparison (or selection) via Bayes factors. If we wish to compare model M₁ = {f₁(x|θ₁), p₁(θ₁)} to model M₂ = {f₂(x|θ₂), p₂(θ₂)}, where both θ₁ and θ₂ are vectors of parameters with some elements not in common, then the Bayes factor is, in general, meaningful only if the priors assigned to the non-common parameters are proper.
If not, then the arbitrary multiplicative constants up to which the priors are defined do not cancel, and the Bayes factor depends on an arbitrary constant. Solutions to the issue have been proposed; see, for example, O'Hagan (1995) and Berger and Pericchi (1996). However, the resulting procedures are still quite tedious to implement and are limited to simple models. By and large the above issue remains, although Berger et al. (1998) give an exception. Furthermore, Dawid and Musio (2015) and Dawid et al. (2017) propose to make use of homogeneous scoring rules that circumvent the problem of using improper priors on the parameters.
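The dependence on the arbitrary constants can be written out in one line. In the sketch below, the constants κᵢ and the unnormalised prior kernels hᵢ are notation introduced here purely for illustration; they do not appear in the paper.

```latex
\begin{align*}
p_i(\theta_i) &= \kappa_i\, h_i(\theta_i), \qquad i = 1, 2, \\
B_{12}(x) &= \frac{\int f_1(x \mid \theta_1)\, p_1(\theta_1)\, d\theta_1}
                 {\int f_2(x \mid \theta_2)\, p_2(\theta_2)\, d\theta_2}
           = \frac{\kappa_1}{\kappa_2} \cdot
             \frac{\int f_1(x \mid \theta_1)\, h_1(\theta_1)\, d\theta_1}
                  {\int f_2(x \mid \theta_2)\, h_2(\theta_2)\, d\theta_2}.
\end{align*}
```

When the priors are proper, each κᵢ is fixed by normalisation. When they are improper, κ₁/κ₂ can be chosen freely, so the Bayes factor can be made to take any positive value.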
An example of the application of the proposed prior in model selection is discussed in Appendix E of the Supplementary Material. There, a Poisson distribution and a geometric distribution are compared, where the priors are defined, respectively, on the parameter spaces (0, ∞) and (0, 1). As properness is necessary in this context, we have set c = 2 and u(0) = 1.31.

Nested models
When models under comparison are nested, there are particular considerations that need to be taken into account; see, for example, Consonni et al. (2013). The point is that a diffuse type prior for the larger model will end up lacking focus so that the mass assigned to the smaller model is too much. However, our argument is that if two nested models are under comparison, it is essential, at least from a coherent point of view, to center the larger prior on the fixed part of the smaller one. Let us elaborate.
Suppose f(y|θ) for θ ∈ Θ₁ is the larger model and the smaller one is given by θ ∈ Θ₀, where Θ₀ ⊂ Θ₁. Typically Θ₁ will be of higher dimension than Θ₀, and to get the latter from the former one fixes a particular value in the higher dimension. To make this concrete, let us consider Example 2.1 from Consonni et al. (2013), where M₀ : f(y|θ₀) is binomial(n, θ₀), with θ₀ = 1/4 fixed, and M₁ : f(y|θ) is binomial(n, θ), for which a prior for θ, p(θ), is required. Given the nature of the comparison, it is our argument that p(θ) must be centered on θ₀.
We can adapt quite easily the prior obtained in Section 4 for Θ = (0, 1) to be centered on 1/4 rather than 1/2. Without repeating the mathematics, we can take u(1/4) = w and c = 2. For the illustration of the prior p(θ), obtained numerically from u, in Figure 5 we took w = 1.5.
The proposed prior, centered at θ₀ = 0.25, is compared with the intrinsic prior in Consonni et al. (2013), with b = 1 and t = 8. The intrinsic prior is centered at wθ₀ + (1 − w)/2, where w = t/(2b + t), and behaves similarly to the prior in Figure 5. The intrinsic Bayes factor in favour of M₁ is computed for n = 12. A relatively small sample size makes it easier to capture differences in the performance of the two priors. Figure 6 shows the posterior probability for M₁, i.e. $P(M_1 \mid y) = (1 + 1/\mathrm{BF}^{I}_{10})^{-1}$. The priors yield model probabilities that are similar; in fact, in both cases the lowest point is at θ = θ₀ and the further θ moves away from θ₀, the higher the posterior probability for M₁.
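The mapping from a Bayes factor to the posterior model probability used in Figure 6 can be sketched as follows, assuming equal prior probabilities for the two models. The marginal likelihood under M₁ is approximated by a midpoint rule for a user-supplied proper prior density. The `uniform_prior` below is only a stand-in to make the example runnable; it is neither the proposed prior nor the intrinsic prior, and all function names are hypothetical.

```python
import math


def binom_pmf(y, n, theta):
    """Binomial probability of y successes in n trials."""
    return math.comb(n, y) * theta ** y * (1.0 - theta) ** (n - y)


def marginal_likelihood(y, n, prior_density, grid_size=2000):
    """Midpoint-rule approximation of the integral of Bin(y | n, theta) p(theta) over (0, 1)."""
    h = 1.0 / grid_size
    total = 0.0
    for i in range(grid_size):
        theta = (i + 0.5) * h
        total += binom_pmf(y, n, theta) * prior_density(theta)
    return total * h


def posterior_prob_m1(y, n, theta0, prior_density):
    """P(M1 | y) = (1 + 1/BF10)^(-1), assuming equal prior model probabilities."""
    bf10 = marginal_likelihood(y, n, prior_density) / binom_pmf(y, n, theta0)
    return 1.0 / (1.0 + 1.0 / bf10)


def uniform_prior(theta):
    """Stand-in proper prior on (0, 1); NOT the prior proposed in the paper."""
    return 1.0
```

With n = 12 and θ₀ = 1/4, evaluating `posterior_prob_m1(y, 12, 0.25, uniform_prior)` for y = 0, ..., 12 gives a curve that is lowest for y/n near θ₀ and rises as y/n moves away from it, which is the qualitative pattern described above.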

Further properties of the prior
It is important to illustrate how the choices of the constraints u(0) and c impact the prior; in particular, in terms of mean, variance and tail behaviour.
As a general result, the prior can be centered at a particular θ₀ by a suitable choice of the initial condition; see Section 6.2.1. In some cases, for example when nested models are compared, one wishes to have a prior distribution with tails that are as heavy as possible; i.e. suitable for alternative priors. This is achieved by aiming for a prior for which u is as small as possible. Thus we can set c to the smallest value for which we can solve the equation, which is c = 2 if u(0) < 0, and $c = 2(1 + u(0))\, e^{-u(0)}$ if u(0) > 0.
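The rule for the smallest admissible c is easy to encode. The sketch below also evaluates the quantity c e^u − 2(1 + u), which must be non-negative for u′ to be real. Reading "the smallest value for which we can solve the equation" as this non-negativity condition is an interpretation made here, and the function names are hypothetical. The boundary case u(0) = 0, for which both expressions give c = 2, is grouped with the first branch.

```python
import math


def smallest_c(u0):
    """Smallest c stated in the text for a given initial condition u(0) = u0."""
    if u0 <= 0.0:
        return 2.0
    return 2.0 * (1.0 + u0) * math.exp(-u0)


def discriminant(c, u):
    """Value of c e^u - 2(1 + u); u' is real only where this is non-negative."""
    return c * math.exp(u) - 2.0 * (1.0 + u)
```

For u(0) = 1.31 the rule gives a value of c below 2, at which the discriminant vanishes at the starting point, so u′(0) = 0 there.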

Figure 6: Small sample evidence for the Binomial example. The graph shows the posterior probability for model f(y|θ) = Bin(n = 12, θ) using the proposed prior (squares) and the intrinsic prior (circles).

As an illustration, let us consider the case Θ = (−∞, ∞), and compare the tail of the prior distribution to the tails of a normal density and a Student-t with 1 degree of freedom. For simplicity, we consider only the positive half of the real line. Figure 7 shows the comparison of the prior based on scoring rules for c = 2 and for u(0) ∈ {0.1, 1.31, 2}. When compared with both the standard normal and the Student-t with 1 degree of freedom, the proposed prior appears to have lighter tails for relatively large values of u(0). In particular, the higher the value, the quicker the prior drops towards 0. On the other hand, should we select a small value of the initial condition u(0), then the proposed prior has a more gentle decrease to zero.
line with the tail behaviour discussed above, where the larger the value chosen for u(0), the faster the prior drops to 0.
For the case Θ = (−∞, ∞) we have chosen to center the prior at zero, although other values are easily attainable and yield similar results. Again, the prior variance (Table 3) decreases as we set the initial conditions to increasingly large values.
A final application of the proposed prior is for a Poisson regression model. The example is presented in Appendix C of the Supplementary Material, where we have worked with simulated data as well as real data. One objective of the simulation study is to show the robustness of the prior when nuisance covariates are added in the regression model; in fact, it can be seen that there is no noticeable impact on the size of the posterior credible intervals.

Discussion
In this paper we have introduced a new class of objective priors derived from scoring rules. A remarkable aspect is that we have been able to show that the same result can be achieved via the rigour of calculus of variations, by finding objective priors which solve the Euler-Lagrange equation for finding extrema of integrals of the type $\int L(\theta, p, p')\, d\theta$.
If we can establish suitable choices of $L(\theta, p, p')$ which can be motivated and satisfy conditions for the existence of an extremum, then new classes of objective prior can be sought. The case we have considered, which we can consider as a first step, is to use a combination of two well known measures of information in a prior density function; i.e. $L(\theta, p, p') = \tfrac{1}{2}\, p'(\theta)^2 / p(\theta) + p(\theta) \log p(\theta)$.
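As a consistency check, the Euler-Lagrange equation for this choice of L can be worked out directly. The sketch below rests on two assumptions that are not stated in this section: that the prior is written as p = e^{−u}, and that normalisation of p is imposed through a Lagrange multiplier λ (notation introduced here).

```latex
\begin{align*}
\frac{\partial L}{\partial p} &= -\frac{1}{2}\frac{p'^2}{p^2} + \log p + 1,
\qquad
\frac{d}{d\theta}\frac{\partial L}{\partial p'} = \frac{d}{d\theta}\frac{p'}{p}
  = \frac{p''}{p} - \frac{p'^2}{p^2}, \\
0 &= \frac{\partial L}{\partial p} + \lambda - \frac{d}{d\theta}\frac{\partial L}{\partial p'}
  = \frac{1}{2}\frac{p'^2}{p^2} - \frac{p''}{p} + \log p + 1 + \lambda. \\
\intertext{Substituting $p = e^{-u}$, so that $p'/p = -u'$, $p''/p = u'^2 - u''$ and $\log p = -u$:}
u'' &= \frac{1}{2}u'^2 + u - (1 + \lambda).
\end{align*}
```

The relations u′² = c e^u − 2(1 + u) and u″ = ½ c e^u − 1 used for the numerical evaluation of the prior satisfy u″ = ½ u′² + u, which corresponds to the case 1 + λ = 0 of the equation above.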
The objective priors here defined have two desirable properties. The first is that they are somewhat detached from the choice of the sampling distribution and are dependent on the parameter space only. In other words, the information required to derive the prior is limited to the range of values that the quantity of interest can take.
The second property is that the prior can be proper. Besides the advantage of not having to check properness of the posterior, this allows the prior to be used in scenarios where improper objective priors have been challenging. For example, as illustrated in Section 6.1, the proposed prior is used to estimate the means of a mixture of normal densities with three components. Another potential application, discussed in Section 6.2, is in model selection. In particular, the objective prior may be used to represent minimal information on the parameters that are not common to two models. In fact, the Bayes factor used to compare two models is, in general, sensitive to the proportionality constant of improper priors. While for common parameters the constant will cancel out, this is not the case if the parameter appears only in the numerator or only in the denominator of the ratio. Hence the necessity of assigning a proper prior to this kind of parameter.
The simulation study, aimed at comparing the frequentist performance of the proposed prior with that of the Jeffreys prior, has shown no appreciable differences, with the exception of a slightly larger Mean Squared Error (MSE) for the proposed prior. This last result is expected, as it is a consequence of the smaller amount of information used to define the proposed prior in comparison with any model based objective prior.
Future and ongoing work involves using the Fisher information alone; i.e. to minimize $\int (p')^2 / p \, d\theta$ subject to p satisfying certain constraints; for example, a zero mean, a specified variance, or being log-concave. The mathematical results here would be able to provide explicit solutions, including a class of non-local prior distributions (Johnson and Rossell, 2010).

Supplementary Material
On a Class of Objective Priors from Scoring Rules. Supplementary Material (DOI: 10.1214/19-BA1187SUPP; .pdf). Supplement to "On a Class of Objective Priors from Scoring Rules". The supplementary material contains the Appendixes A, B, C and D of the paper.