An Explanatory Rationale for Priors Sharpened Into Occam’s Razors

In Bayesian statistics, if the distribution of the data is unknown, then each plausible distribution of the data is indexed by a parameter value, and the prior distribution of the parameter is specified. To the extent that more complicated data distributions tend to require more coincidences for their construction than simpler data distributions, default prior distributions should be transformed to assign additional prior probability or probability density to the parameter values that refer to simpler data distributions. The proposed transformation of the prior distribution relies on the entropy of each data distribution as the relevant measure of complexity. The transformation is derived from a few first principles and extended to stochastic processes.


Introduction
The typical Bayesian data analysis involves specifying one or more default prior distributions, often called "objective priors" (Ghosh et al., 2006; Press, 2009). They are objective in the sense that they are automatically determined by the application of some algorithm as opposed to representing the beliefs of one or more people. The simplest case is the uniform prior distribution on a finite set of parameter values. In hypothesis testing, the assignment of equal prior probability to the null hypothesis and the alternative hypothesis is the most common default. In Bayesian model selection and Bayesian model averaging, the most common default is to assign each model equal prior probability. When the parameter value is continuous, more sophisticated procedures replace the assignment of equal probabilities (Kass and Wasserman, 1996).
The following toy models explain why default prior distributions may need to be modified to reflect the simplicity or complexity of each data distribution specified by a parameter value. Example 1. The observable outcomes from a black box are independent and identically distributed (IID) integers between 1 and 20. Before observations are made, it is known that n outcomes x = (x_1, x_2, . . . , x_n) will be generated by rolls of a fair die with a number on each face from 1 up to the number of sides of the die. The die is shaped like one of the five Platonic solids, which implies that the die has 4, 6, 8, 12, or 20 sides.
The die was constructed inside the box by an unknown mechanism that constructs shapes at random until it happens upon one that closely resembles a Platonic solid. To make Bayesian inferences about θ, the number of sides of the die, we need the posterior probability that it has θ sides,

P(θ | x) = P(θ) ∏_{i=1}^{n} f_θ(x_i) / Σ_{θ′ ∈ {4,6,8,12,20}} P(θ′) ∏_{i=1}^{n} f_θ′(x_i),   (1)

where P(θ) is the prior probability that it has θ sides, and f_θ(x_i) is the probability that x_i would be observed if it has θ sides. From the given information, each data distribution f_θ is a uniform distribution on {1, . . . , θ}, so that f_θ(i) = 1/θ for i = 1, . . . , θ but f_θ(i) = 0 for i = θ + 1, . . . , 20. While it might be tempting to assign the uniform prior distribution such that P(θ) = 1/5 for θ = 4, 6, 8, 12, 20, that would not account for how many more coincidences it would take for the mechanism to generate a die with a higher number of sides than one with a lower number of sides. Incorporating that information means assigning more probability to simpler dice and less probability to more complex dice: P(4) > P(6) > P(8) > P(12) > P(20).
Which prior distribution satisfying that constraint should be used?
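Anticipating the transformation derived in later sections, under which the sharpened prior is proportional to the preliminary prior times e^(−κ S(f_θ)), the constraint of Example 1 can be realized numerically. The following sketch is illustrative only; the helper name `sharpened_prior` and the choice κ = 1 are assumptions, not part of the example as stated.

```python
import math

def sharpened_prior(prior, entropies, kappa=1.0):
    """Reweight a prior by exp(-kappa * entropy) and renormalize."""
    weights = {t: prior[t] * math.exp(-kappa * entropies[t]) for t in prior}
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}

sides = [4, 6, 8, 12, 20]
uniform_prior = {t: 1 / len(sides) for t in sides}
# The entropy of the uniform distribution on {1, ..., theta} is ln(theta).
entropies = {t: math.log(t) for t in sides}

prior = sharpened_prior(uniform_prior, entropies, kappa=1.0)
# The sharpened prior is proportional to 1/theta, so P(4) > P(6) > ... > P(20).
```

With κ = 1, the prior probability of the 4-sided die is exactly five times that of the 20-sided die, matching the ratio of their support sizes.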
Such coincidences, occurrences of multiple improbable events without strong dependence on each other, are tacit in many other complex distributions (cf. White, 2005). A less Platonic example emphasizes the need to consider the simplicity of data distributions when assigning a prior.
Example 2. Inside a black box, an unknown mechanism randomly constructed one or more balls of different colors and placed them in an urn. The number of balls in the urn is the number that could be constructed within a short time window. If none were constructed within that time period, the process started over and continued until at least one ball was placed into the urn. The mechanism had access to a million colors. From the urn, n balls will be drawn independently, with equal probability, and with replacement. The observer wants to make inferences about φ, the set of the colors of the |φ| balls in the urn. That set differs from the configuration θ, the |φ|-tuple of the colors of the balls in the order in which they were placed into the urn, in that φ is an unordered set and θ is an ordered set or vector of the same number of colors. Since θ is a permutation of the members of φ, the posterior probability that the set of colors of the balls in the urn is φ is the sum of each posterior probability that the configuration is θ over all permutations of the members of φ. The latter posterior probability is given by (1) with P(θ) as the prior probability that the configuration is θ and with f_θ(x_i) as the probability that the ith ball drawn from the urn will be of color x_i, conditional on θ as the configuration. It may have been reasonable to assign a uniform prior distribution over Θ, the configuration space, were it not for the information about how the urns were populated, information indicating the higher number of coincidences needed to populate an urn with more balls as opposed to fewer. That information, without being enough to determine the prior distribution, requires configurations with fewer balls to have higher prior probabilities than those with more balls: P(θ_1) > P(θ_2) for all θ_1, θ_2 ∈ Θ such that |θ_1| < |θ_2|, where |θ| is the dimension of θ. Subject to that constraint, what should the prior be?
Occam's razor is the principle that simpler explanations are more credible than more complex explanations in the absence of evidence favoring more complex explanations. In a Bayesian framework with the number of free parameters in a model as the measure of complexity, that greater credibility may show up as a higher posterior probability (cf. Rosenkrantz, 1976) or, via the simplicity postulate (Jeffreys, 1948, pp. 100-101, 113, 222), as a higher prior probability (Jefferys and Berger, 1992). Multiple methodology researchers reached similar conclusions for other forms of complexity. Among others, Poston (2014) argues that complexity should constrain the prior distribution, with simpler explanations being at least as probable as more complex explanations. Explanations requiring more coincidences, while not impossible, tend to be less probable than those requiring fewer coincidences (Myrvold, 2017; Blanchard, 2018).
A special case of that type of constraint on prior distributions is seen in Examples 1 and 2. In both examples, each data distribution f_θ is uniform on some sample space X_θ with a number of possible outcomes equal to |X_θ| = θ in Example 1 and |X_θ| = |θ| in Example 2. Also in both examples, the prior probability P(θ) decreases as a function of |X_θ| since it reflects the number of coincidences that f_θ would require. Although |X_θ| increases with the complexity of a uniform distribution, another measure of complexity is needed for other data distributions.
Entropy is a measure of complexity that generalizes the reasoning of Examples 1 and 2, for the entropy of a uniform distribution f_θ is log |X_θ|. The conditions of Section 2 result in the constraint that parameter values corresponding to data distributions with lower entropy have higher prior probabilities than those corresponding to higher entropy. Although those conditions are not universally applicable, they provide the foundation for the more general methods of later sections.
Merely arranging parameter values in order of prior probability is not enough for Bayesian data analysis, as an ordering in itself does not determine a prior distribution. Starting with the ordering, Section 3 derives a method for transforming a preliminary prior distribution such as the uniform distributions of Examples 1 and 2 into a prior distribution informed by Occam's razor. The derivation is based on desirable properties of such a transformation.
That prior distribution, however, is only determined up to a parameter that controls the extent to which it differs from the preliminary prior. In applications requiring flexibility, the ability to set the parameter on a case-by-case basis may be desirable. In other applications, using a default value would save resources or reduce concerns about a potential conflict of interest. Section 4 derives such a value from an idealized model of constructing a data distribution, with more complex distributions being less probable because they require more coincidences to construct.
Because Shannon's entropy has a number of complexity-suitable properties that uniquely characterize it (e.g., Rényi, 1965), it is the measure of complexity emphasized in this paper. As an excursus, Section 5 explores alternative definitions of entropy as complexity. It notes that since all Rényi entropies are additive and have the property that the entropy of a uniform distribution f_θ is log |X_θ|, any of them may replace the Shannon entropy.
Since most statistics applications involve probability distributions with infinite domains, the framework is generalized from finite parameter spaces to infinite parameter spaces in Section 6 and from finite sample spaces to infinite sample spaces in Section 7. The latter section applies the framework to null hypothesis significance testing.
Since the entropy of a whole sample is greater than that of a single observation, the relation to the sample size is specified in Section 8, which extends the framework to stochastic processes. In the usual case of a sample of n IID observations from a distribution conditional on θ, the sharpened prior probability is proportional to the unsharpened prior probability and inversely proportional to the exponential of the entropy or differential entropy of that distribution. Examples include both the IID processes of Examples 1-2 and some binomial and normal models that commonly occur in practice.
Section 9 closes with a discussion of which priors require simplicity adjustments.

Prior probabilities constrained by the simplicity of data distributions
The preliminary concepts of this section provide a foundation for generating prior distributions that satisfy a generalization of the simplicity conditions suggested by Examples 1 and 2. The subsequent sections build on this foundation.
The entropy of a probability mass function (PMF) g on a finite set X of possible observations is

S(g) = − Σ_{x ∈ X} g(x) ln g(x),   (2)

understood such that 0 ln 0 = 0. All data distributions are on the same sample space X. That means the data distributions of Example 1, while uniform if restricted, are not uniform on X = {1, . . . , 20}, with the exception of f_20, the distribution of outcomes of the 20-sided die. For reasons explained below, we also need a continuum between different uniform distributions on a finite sample space. For example, on the sample space X = {1, 2, 3, 4, 5, 6}, a PMF g with g(1) = g(2) > g(3) > 0 and g(4) = g(5) = g(6) = 0 is uniform on {1, 2} but with x = 3 having a probability mass between that of each x in the main supported set {1, 2} and that of each x in the non-supported set {4, 5, 6}. To streamline development involving distributions like g, we need a term for them.

Definition 1.
A PMF g on X is called partially uniform if it meets these conditions:
1. It is uniform on X(g), a non-empty subset of X. That is, there is a g_max > 0 such that g(x) = g_max for all x ∈ X(g).
2. There is no more than one y ∈ X such that 0 < g(y) < g_max.
3. It otherwise has the value 0. That is, if y exists, then g(x) = 0 for all x ≠ y in X\X(g); otherwise, g(x) = 0 for all x ∈ X\X(g).
Condition 3 requires g to have the value 0 outside its support. Loosely speaking, conditions 1-2 require g to be as uniform as possible on its support.
More precisely, condition 2 requires that, outside of X(g), which is the uniform portion of g's domain, g have at most one component deviating from the uniformity specified by condition 1. It allows one component ("y") to have lower probability than the components on X(g), as in the example above, in which y = 3. The reason to allow that departure from uniformity is to create a continuum of PMFs between the uniform distribution on the whole support and the uniform distribution on the support without y. That continuum is needed in the next paragraph to generalize the cardinality of a sample space's support from a counting number to a positive real number. Without the continuum and its real-valued generalization of cardinality, the measure of complexity to be derived would be unnecessarily burdened with discontinuities to keep track of. In short, a little inelegance in Definition 1 will pay off in the elegance of the result.
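The continuum motivating condition 2 can be checked numerically: as the mass on y grows from 0 to g_max, the exponential of the entropy moves continuously between the cardinalities of the two uniform supports. A minimal sketch (the helper names are illustrative):

```python
import math

def entropy(pmf):
    """Shannon entropy in nats, with 0 ln 0 = 0."""
    return -sum(p * math.log(p) for p in pmf if p > 0)

def partially_uniform(k, t):
    """PMF with k atoms of equal mass g_max and one atom of mass t * g_max, t in [0, 1]."""
    g_max = 1.0 / (k + t)
    return [g_max] * k + [t * g_max]

# At t = 0 the PMF is uniform on k points; at t = 1 it is uniform on k + 1 points.
# exp(entropy) moves continuously from k to k + 1 as t goes from 0 to 1,
# giving the promised real-valued generalization of cardinality.
for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(t, math.exp(entropy(partially_uniform(2, t))))
```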
A Bayesian model is a pair (θ → f_θ, P), where θ → f_θ, abbreviated as f_•, is a function on Θ such that, for every θ ∈ Θ, f_θ is a (data) PMF on X, and where P is a (prior) PMF on Θ. Let M(X, Θ) denote the set of all Bayesian models with X as the sample space and Θ as the parameter space.
Defining what it means for a prior distribution to be constrained by the simplicity of the sampling distributions uses entropy as a generally applicable measure of complexity and intricacy as a measure of complexity that applies only to partially uniform distributions.

Definition 2.
Let M_S denote a subset of M(X, Θ). M_S is called a set of Bayesian models with simplicity-constrained PMFs if these conditions are satisfied: 1. There exists a function p such that, for every (f_•, P) ∈ M_S,

P = p(S_{f_•}),   (5)

where S_{f_•} is the function on Θ defined by S_{f_•}(θ) = S(f_θ) for all θ ∈ Θ. The function p is called the prior generator for M_S, and the function S_{f_•} is called the entropy spectrum of f_•.

2. For every Bayesian model (f_•, P) ∈ M(X, Θ) such that f_θ is partially uniform for every θ ∈ Θ, (f_•, P) ∈ M_S and

I(f_θ1) < I(f_θ2) implies P(θ_1) > P(θ_2)   (6)

for all θ_1, θ_2 ∈ Θ, where I(g) denotes the intricacy of a partially uniform PMF g.

While (5) says the prior distribution is a function of the entropy spectrum, (6) says parameter values labeling less intricate partially uniform distributions have higher prior probabilities. The rationale is that, in the absence of other information, uniform distributions on larger domains tend to require more coincidences and thus to be less probable than those on smaller domains, as seen in Examples 1 and 2.
The result is that simpler data PMFs tend to have higher prior probabilities.

Lemma 1. If M_S is a set of Bayesian models with simplicity-constrained PMFs, then the prior generator for M_S satisfies

S(f_θ1) < S(f_θ2) implies P(θ_1) > P(θ_2)   (7)

for all θ_1, θ_2 ∈ Θ and every (f_•, P) ∈ M_S.
Proof. Consider a Bayesian model (f_•, P) ∈ M_S such that f_θ is partially uniform for every θ ∈ Θ. Conditions (6) and (8) would be satisfied if I(f_θ) = exp(S(f_θ)) for every θ ∈ Θ. In fact, since such a Bayesian model is in M_S by part 2 of Definition 2 for every real value of f_θ(y) strictly between 0 and 1, the functions I and S must be related by a strictly increasing transformation on the domain of I by its monotonicity and continuity properties. Thus, (5) constrains the prior generator p such that its PMF assignments satisfy (7) for all θ_1, θ_2 ∈ Θ.
Since every possible entropy spectrum S_{f_•} is achieved by some Bayesian model (f_•, P) such that f_θ is partially uniform for every θ ∈ Θ, and since every such model is in M_S (Definition 2, part 2), p is completely determined in such a way that (7) holds for all θ_1, θ_2 ∈ Θ. That same p is the prior generator not only for those Bayesian models but for all Bayesian models in M_S by part 1 of Definition 2. It follows that (7) holds just as generally.

Adjusting prior probabilities for the simplicity of data distributions
The method of this section transforms a prior distribution that does not account for the simplicity of the data distributions into a prior that does. That prior satisfies the constraints of Section 2 in the special case that the pre-transformation prior is uniform. More generally, any prior on a finite parameter space may serve as the pre-transformation prior, regardless of how far it deviates from uniformity. That is how this section paves the way for the extensions to priors for more general parameter spaces in Section 6, including commonly used objective priors such as those of Section 8's Examples 6 and 7.
Definition 3. A function that transforms each PMF on a parameter space into another PMF on the same parameter space is called a sharpener if it satisfies the following conditions for a prior generator p. Given any PMF P, its sharpened counterpart, the result of evaluating the sharpener at P, is denoted by P̆. Conditions:
1. Simplicity constraint. The sharpened counterpart P̆ of a uniform PMF P is a simplicity-constrained PMF generated by p.
2. Coherence. For any observed sample x, the sharpened counterpart of the posterior PMF P(•|X = x) based on a prior PMF P is the posterior PMF based on P̆, the sharpened counterpart of P.
3. Independence preservation. Consider the finite parameter sets Θ and Φ. Suppose that, for all θ ∈ Θ and φ ∈ Φ, X ∼ f_θ and Y ∼ g_φ are independent random variables with joint PMF h_{θ,φ} and values in X and Y, where f_θ and g_φ are PMFs on X and Y, respectively. If P̆ is the sharpened counterpart of the PMF (θ, φ) → P(θ, φ) = P_1(θ) P_2(φ), the joint PMF of independent parameters θ and φ that have prior PMFs P_1 and P_2 with sharpened counterparts P̆_1 and P̆_2, then P̆(θ, φ) = P̆_1(θ) P̆_2(φ) for all θ ∈ Θ and φ ∈ Φ.
The simplicity constraint (condition 1) builds Definition 3 on the foundation laid in Section 2. The coherence condition (condition 2) means that adjusting for simplicity commutes with conditioning on the observed data, so that it does not matter which happens first. Independence preservation (condition 3) means that if two quantities have nothing to do with each other, then that independence should be reflected in their priors adjusted for simplicity.
Theorem 1. For every sharpener, there is a κ > 0 such that, for any Bayesian model (f_•, P), the sharpened counterpart P̆ of P satisfies

P̆(θ) ∝ P(θ) e^(−κ S(f_θ))   (9)

for all θ ∈ Θ.

Proof. Let f_θ, g_φ, h_{θ,φ}, P_1, P_2, and P denote PMFs that satisfy the independence assumptions of condition 3 of Definition 3, and assume P_1 and P_2 are uniform. Since P, P_1, and P_2 are then uniform, the simplicity constraint (condition 1) requires that P̆, P̆_1, and P̆_2 are simplicity-constrained PMFs generated by the same prior generator p. By (5) and Lemma 1, they satisfy (7) for the same prior generator p. Thus, there is a strictly monotonically decreasing function q such that P̆(θ, φ) ∝ q(S(h_{θ,φ})), P̆_1(θ) ∝ q(S(f_θ)), and P̆_2(φ) ∝ q(S(g_φ)). According to the independence preservation condition, P̆(θ, φ) = P̆_1(θ) P̆_2(φ) for all θ ∈ Θ and φ ∈ Φ. Since S(h_{θ,φ}) = S(f_θ) + S(g_φ) for independent random variables, that implies there are real numbers a and b such that ln q(•) = a × • + b, where a < 0 since q is strictly monotonically decreasing, and b may differ between f_θ, g_φ, and h_{θ,φ}.
It follows that, even when the independence assumptions are not satisfied, ln P̆(θ) = a S(f_θ) up to a constant term for every uniform prior PMF P according to the simplicity constraint. Letting κ = |a| establishes (9) for uniform P. For a general prior PMF P and an observed sample x such that the posterior PMF P(•|X = x) is uniform, the coherence condition and an application of Bayes's theorem give, up to proportionality, P̆(θ) f_θ(x) ∝ P(θ | X = x) e^(−κ S(f_θ)) ∝ P(θ) f_θ(x) e^(−κ S(f_θ)), which ensures that the more generally applicable sharpener has the form of (9).
A prior would require sharpening whenever it neglects relevant information about the simplicity of the data distributions.
The value of κ is called the sharpness of the sharpener, which may now be written as • κ to distinguish it from other sharpeners. Each sharpener corresponds to a different sharpened prior distribution, as will be seen in Figure 1 of Example 3. The application to real data may suggest a way to specify the value of κ in some cases. In other cases, a default value is needed.
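Assuming the sharpener has the exponential-entropy form derived in this section, the coherence condition of Definition 3 can be verified numerically for Example 1: sharpening the posterior equals computing the posterior from the sharpened prior. A sketch under that assumption (the function names are illustrative):

```python
import math

def normalize(d):
    s = sum(d.values())
    return {k: v / s for k, v in d.items()}

def sharpen(prior, f, kappa):
    """Sharpened prior proportional to P(theta) * exp(-kappa * S(f_theta))."""
    def entropy(pmf):
        return -sum(p * math.log(p) for p in pmf.values() if p > 0)
    return normalize({t: prior[t] * math.exp(-kappa * entropy(f[t])) for t in prior})

def posterior(prior, f, x):
    return normalize({t: prior[t] * f[t].get(x, 0.0) for t in prior})

# Example 1's dice: f_theta is uniform on {1, ..., theta}, embedded in {1, ..., 20}.
f = {t: {i: 1 / t for i in range(1, t + 1)} for t in (4, 6, 8, 12, 20)}
P = {t: 1 / 5 for t in f}
x = 3  # an observed roll

# Coherence: sharpening commutes with conditioning on the observed data.
a = sharpen(posterior(P, f, x), f, kappa=1.0)
b = posterior(sharpen(P, f, 1.0), f, x)
```

Both orders yield a posterior proportional to 1/θ², since the likelihood and the sharpening factor each contribute a factor of 1/θ.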

How much should priors be adjusted for simplicity by default?
The method of Section 3 cannot be applied without somehow specifying a value of κ, the degree to which priors are adjusted for the simplicity of the data distributions. This section argues for a default setting of κ = 1.
Definition 4. Let •_κ denote a sharpener of sharpness κ > 0 with the following constraint. Consider any Bayesian model (f_•, P) such that f_θ is partially uniform with f_θ(x) ∈ {0, 1/|X(f_θ)|} for all x ∈ X and every θ ∈ Θ, and such that there is an x ∈ X with x ∈ X(f_θ) for every θ ∈ Θ. For any such model, the sharpened counterpart P̆_κ of P is the conditional PMF given by

P̆_κ(θ) = Prob_{ϑ∼P, X∼f_ϑ}(ϑ = θ | X = x)   (10)

for all θ ∈ Θ. Then, given any Bayesian model (f_•, P), the sharpened counterpart P̆_κ of P is called ideal, whether or not (f_•, P) meets the above conditions.
Thus, the universal ideal κ may be found by conditioning on successfully generating the correct realization, with the unsharpened PMF as the distribution of opportunities to attempt a correct realization. In Example 1, that means successfully constructing a die of a certain number of sides, whereas in Example 2, it means successfully constructing an urn with the specified configuration of colors.
Theorem 2. For any Bayesian model (f_•, P), the ideal sharpened counterpart P̆_κ of P satisfies

P̆_κ(θ) ∝ P(θ) e^(−S(f_θ))   (11)

for all θ ∈ Θ.
Proof. By (10), the conditions on f_•, and Bayes's theorem, P̆_κ(θ) ∝ P(θ) f_θ(x) = P(θ)/|X(f_θ)| = P(θ) e^(−S(f_θ)). That only agrees with (9) if κ = 1. Thus, P̆_κ = P̆_1 for any Bayesian model (f_•, P), and the right-hand side of (11) results from the relevant special case of (9).
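The conditioning argument behind Theorem 2 can be replayed numerically for Example 1: drawing θ from the unsharpened prior and conditioning on the construction hitting a fixed target outcome reproduces the κ = 1 sharpened prior. A sketch (the target x = 1 is an illustrative choice that lies in every support):

```python
import math

# Draw theta from the unsharpened prior P, then an outcome X from f_theta;
# condition on X hitting a fixed target x in the support of every f_theta.
sides = [4, 6, 8, 12, 20]
P = {t: 1 / len(sides) for t in sides}
joint = {t: P[t] * (1 / t) for t in sides}        # P(theta) * f_theta(1)
norm = sum(joint.values())
conditioned = {t: j / norm for t, j in joint.items()}

# The same distribution arises from sharpening with kappa = 1, since
# f_theta(1) = 1/theta = exp(-S(f_theta)) for the uniform f_theta.
sharpened = {t: P[t] * math.exp(-math.log(t)) for t in sides}
s = sum(sharpened.values())
sharpened = {t: v / s for t, v in sharpened.items()}
```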
Similar results may be derived from fewer assumptions using the concept of Rényi entropy (Section 5).
An alternative derivation of (11) appears in Bickel (2016). Instead of conditioning on the event that a data distribution is constructed correctly, it conditions on the event that a randomly typed computer program yields output representing the data distribution.

Excursus: Rényi entropy as a measure of complexity
For any α > 0, the α-Rényi entropy of a probability mass function (PMF) g on a finite set X of possible observations is

S_α(g) = (1/(1 − α)) ln Σ_{x ∈ X} g(x)^α   (12)

if α ≠ 1 and is the Shannon entropy given by (2) if α = 1. Thus, if g is uniform, then S_α(g) = ln |X(g)| for all α. With that property and additivity under independence, substituting S_α for S_1 throughout the paper would yield analogous results for any other Rényi entropy.
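A Rényi entropy function covering the Shannon (α = 1) and min-entropy (α = ∞) limiting cases, checked against the uniform-PMF property, might be sketched as follows; the function name is illustrative:

```python
import math

def renyi_entropy(pmf, alpha):
    """alpha-Renyi entropy in nats; alpha = 1 is Shannon, math.inf is min-entropy."""
    probs = [p for p in pmf if p > 0]
    if alpha == 1:
        return -sum(p * math.log(p) for p in probs)
    if alpha == math.inf:
        return -math.log(max(probs))
    return math.log(sum(p ** alpha for p in probs)) / (1 - alpha)

uniform = [0.25] * 4
# For a uniform PMF, every Renyi entropy equals ln |X(g)|.
for a in (0.5, 1, 2, math.inf):
    assert abs(renyi_entropy(uniform, a) - math.log(4)) < 1e-12
```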
In Section 4, the Shannon entropy was derived as a component of the ideal sharpened prior given assumptions including one about the coincidences involved in constructing a data distribution. A Rényi entropy and a limiting case of Rényi entropy can be derived from fewer assumptions, as follows.
S_2(g), the quadratic entropy, also called the "collision entropy" (Teixeira et al., 2012), is related to the prior distribution obtained if the difficulty of constructing a data distribution is modeled in terms of the coincidence that two independent realizations collide with each other.
Definition 5. Given any Bayesian model (f_•, P), the collision prior PMF corresponding to P is

P_coll(θ) = Prob_{ϑ∼P; X,X′∼f_ϑ}(ϑ = θ | X = X′),

as a function of θ, where X and X′ are IID given ϑ.

Theorem 3. The collision prior PMF corresponding to P satisfies P_coll(θ) ∝ P(θ) e^(−S_2(f_θ)) for all θ ∈ Θ.
Proof. By Bayes's theorem with the event X = X′ as data, P_coll(θ) ∝ P(θ) Prob(X = X′ | ϑ = θ) = P(θ) Σ_{x∈X} f_θ(x) f_θ(x), where the substitution of e^(−S_2(f_θ)) for Σ_{x∈X} f_θ(x) f_θ(x) is sanctioned by (12) with α = 2.
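Theorem 3 can be verified numerically on a small family of data PMFs: the exact collision probability Σ_x f_θ(x)² equals e^(−S_2(f_θ)), so the two constructions of the prior coincide. A sketch with an arbitrary illustrative family:

```python
import math

# Collision prior: P(theta) times the probability that two IID draws collide.
f = {
    1: [0.5, 0.5, 0.0],
    2: [0.5, 0.25, 0.25],
    3: [1 / 3, 1 / 3, 1 / 3],
}
P = {t: 1 / 3 for t in f}

collision = {t: P[t] * sum(p * p for p in f[t]) for t in f}
z = sum(collision.values())
collision = {t: v / z for t, v in collision.items()}

def s2(pmf):
    """Collision (quadratic Renyi) entropy in nats."""
    return -math.log(sum(p * p for p in pmf))

sharpened = {t: P[t] * math.exp(-s2(f[t])) for t in f}
z = sum(sharpened.values())
sharpened = {t: v / z for t, v in sharpened.items()}

# The two constructions agree, and the lowest-collision-entropy distribution
# (the two-point uniform) receives the most prior probability.
```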
A limiting case of Rényi entropy that is important in cryptography is the min-entropy (Teixeira et al., 2012),

S_∞(g) = − ln max_{x∈X} g(x).   (13)

It is related to the prior distribution obtained if the difficulty of constructing a data distribution is modeled in terms of the coincidence of a successful prediction using x_θ^pred, the best predicted observation given the data distribution f_θ.

Theorem 4. The prediction prior PMF corresponding to P satisfies

P_pred(θ) ∝ P(θ) e^(−S_∞(f_θ))
for all θ ∈ Θ.
Proof. By Bayes's theorem with x_θ^pred as data, P_pred(θ) ∝ P(θ) f_θ(x_θ^pred), where the substitution of e^(−S_∞(f_θ)) for f_θ(x_θ^pred) is permitted by (13).

Adjusting prior densities for the simplicity of data distributions
The extension of sharpened prior PMFs to general sharpened priors, while requiring more notation, is straightforward. It says that sharpened prior probability distributions and sharpened probability density functions (PDFs) project to sharpened prior PMFs on all finite partitions of the parameter space, which need not be finite.
Definition 7. Given any κ > 0 and a probability measure Π on a measure space (Θ, F), the probability measure Π_κ on (Θ, F) is the κ-sharpened counterpart of Π if, for every finite partition F ⊂ F of Θ,

P_{Π_κ,F} is the κ-sharpened counterpart of P_{Π,F},   (14)

where each P_{Π,F} is the PMF on F such that

P_{Π,F}(N) = Π(N) for all N ∈ F.   (15)

For every measure ν that dominates Π, the PDF dΠ_κ/dν is the κ-sharpened counterpart of the PDF dΠ/dν.
The differential element "dν (θ )" in the next result may be read as "dθ " in the usual case that the dominating measure is uniform on the real line.

Corollary 1. The κ-sharpened counterpart π_κ of any continuous PDF π on Θ satisfies

π_κ(θ) ∝ π(θ) e^(−κ S(f_θ))   (16)

almost surely with respect to Π, where ν is the measure that defines π as dΠ/dν for some probability measure Π on (Θ, F) that is dominated by ν.
Proof. By the definition of the κ-sharpened counterpart of a PDF and by the definition of a PDF, Π_κ(N) = ∫_N π_κ(θ′) dν(θ′) for all N ∈ F, where Π_κ is the κ-sharpened counterpart of Π. Thus, from Theorem 1 and (14)-(15), Π_κ(N) ∝ Π(N) e^(−κ S(f_N)) for all N ∈ F. According to the mean value theorem, the continuity of both π and θ → S(f_θ) requires that there is a θ ∈ N such that π_κ(θ) ∝ π(θ) e^(−κ S(f_N)). That can only be the case for every arbitrarily small N if π_κ(θ) ∝ π(θ) e^(−κ S(f_θ)) holds up to a Π-null set. Since π_κ, as a PDF, integrates to 1, it follows that (16) holds up to a Π-null set.
Equation (16) reveals that sharpened prior distributions are not alternatives to elicited priors, objective Bayes priors such as reference priors, or other priors in the literature, for it requires as input π, the unsharpened prior density. On the contrary, any prior density might serve as π, as will be illustrated for commonly used default priors in Examples 6 and 7.
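On a grid, the sharpening of a density amounts to reweighting it by e^(−κ S(f_θ)). A sketch for Bernoulli data distributions, whose entropy peaks at θ = 1/2 (the grid and the uniform unsharpened density are illustrative choices):

```python
import math

# Grid version of density sharpening: pi_kappa(theta) ∝ pi(theta) * exp(-kappa * S(f_theta)),
# here with Bernoulli(theta) data distributions.
def bernoulli_entropy(theta):
    if theta in (0.0, 1.0):
        return 0.0
    return -theta * math.log(theta) - (1 - theta) * math.log(1 - theta)

grid = [i / 1000 for i in range(1, 1000)]
flat = [1.0] * len(grid)                     # unsharpened (uniform) density
kappa = 1.0
sharp = [p * math.exp(-kappa * bernoulli_entropy(t)) for p, t in zip(flat, grid)]
z = sum(sharp) / len(sharp)
sharp = [s / z for s in sharp]               # renormalize as a density on (0, 1)

# Near-deterministic coins (theta near 0 or 1) gain density; fair coins lose it.
```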

Adjusting priors for the simplicity of data PDFs
Since infinite sample spaces are valuable only as approximations of finite sample spaces (e.g., Evans, 2015, Appendix A), previous sections used finite sample spaces to determine how to adjust prior distributions of parameters to reflect the simplicity of each data distribution indexed by a parameter value. The results are now extended to infinite sample spaces, as anticipated in Bickel (2016), following Cover and Thomas (2006, §8.3).
The technical tools to accomplish that are relative entropy and the convergence of sample spaces of increasing cardinality. The relative entropy function D has values equal to D(μ ‖ ν) = ∫ log(dμ/dν) dμ, the entropy of a probability measure μ relative to a measure ν that dominates μ (e.g., Maas, 2017). Complementary statistical applications of relative entropy include the prevention of overfitting models (Fúquene et al., 2016; Gelman et al., 2017), the idealization of Cromwell's rule for revising priors (Bickel, 2018), and the automatic construction of unsharpened priors (Section 8, Example 7).
"The convergence of sample spaces" refers to the weak convergence of the data distributions and other measures defined on them. Lemmas 2-3 both involve a sequence of finite sample spaces that approach X^(1), a countably infinite sample space. The (1) in the superscript indicates that 1 is the lowest possible distance between members of X^(1). Lemma 3 additionally uses a sequence of countably infinite sample spaces that approach X^(0), an uncountably infinite sample space. The (0) in the superscript indicates that the members of X^(0) may be arbitrarily close to each other.
Definition 8. Consider a κ > 0 and probability measures Π and Π_κ on a measure space (Θ, F). If the probability measure Π_κ^((1),m) on (Θ, F) is the κ-sharpened counterpart of Π with X^((1),m) as the sample space for all m = 1, 2, . . ., if Π_κ^((1),m) → Π_κ weakly as m → ∞, and if the convergence conditions of Lemma 2 hold for all finite-sample PMFs corresponding to each θ ∈ Θ, then Π_κ is the κ-sharpened counterpart of Π with X^(1) as the sample space, with μ_θ^(1) as the data probability measure on the power set of X^(1) and f_θ^(1) as the data PDF for each θ ∈ Θ. Similarly, if the probability measure Π_κ^(Δ,m) on (Θ, F) is the κ-sharpened counterpart of Π with X^(Δ,m) as the sample space for all m = 1, 2, . . . and Δ > 0, if Π_κ^(Δ,m) → Π_κ weakly as m → ∞ followed by Δ → 0, and if the convergence conditions of Lemma 3 hold for all finite-sample PMFs corresponding to each θ ∈ Θ, then Π_κ is the κ-sharpened counterpart of Π with X^(0) as the sample space, with μ_θ as the data probability measure on the measurable subsets of X^(0) and f_θ^(0) = dμ_θ/dλ as the data PDF for each θ ∈ Θ.
Since (16) holds for each sharpened prior distribution based on a finite sample space, its equivalent holds for each limiting sharpened prior distribution based on an infinite sample space, where "equivalent" means writing f_θ^(1) or f_θ^(0) in place of f_θ when the infinite sample space is X^(1) or X^(0), respectively.
Theorem 5. Consider a κ > 0 and probability measures Π and Π_κ on a measure space (Θ, F). Let Π_κ denote the κ-sharpened counterpart of Π with X^(1) or X^(0) as the sample space, and let π = dΠ/dν and π_κ = dΠ_κ/dν for a measure ν that dominates Π. Almost surely with respect to Π, if X^(1) is the sample space, then

π_κ(θ) ∝ π(θ) e^(−κ S(f_θ^(1))),   (17)

where S is the entropy function defined by (2), but if X^(0) is the sample space, then

π_κ(θ) ∝ π(θ) e^(−κ S(f_θ^(0))),   (18)

where S is the differential entropy function defined by

S(f) = − ∫ f(x) ln f(x) dλ(x).   (19)

Proof. First, suppose Π_κ is the κ-sharpened counterpart of Π with X^(1) as the sample space. By Lemma 2, for all θ ∈ Θ, the entropy of the PMF on X^((1),m) corresponding to μ_θ^(1) converges to S(f_θ^(1)) as m → ∞. Therefore, applying (16) with each X^((1),m) as the sample space and passing to the limit implies (17) Π-almost surely. The argument for (18) is analogous, with Lemma 3 in place of Lemma 2.
The next example presents a new Bayesian method of calibrating p values.
Example 3. To address the problem of interpreting p values, Held and Ott (2016) presented various lower bounds on the Bayes factor in favor of the null hypothesis. For example, suppose that under the alternative hypothesis z(X) ∼ N(0, σ²), where σ is the alternative hypothesis's standard deviation of z(X), the p_one(X)-quantile of the standard normal distribution, with p_one(X) = 1 − p(X)/2 as the one-sided p value derived from p(X), the two-sided p value. Then one of the lower bounds is based on the observed Bayes factor B(x; σ). With P(0) as the prior probability of the null hypothesis and P(1) = 1 − P(0) as the prior probability of the alternative hypothesis, the null hypothesis's posterior probability is

P(0 | x) = B(x; σ) P(0) / (B(x; σ) P(0) + 1 − P(0))

by Bayes's theorem. P(0) may be interpreted as the probability that the null hypothesis component of a mixture model rather than the alternative hypothesis component generated the observation that z(X) = z(x).
If the complexity of the distribution of z(X) conditional on the alternative hypothesis is its differential entropy and was not considered when the value of σ was elicited, then P is an unsharpened prior PMF on {0, 1}. Since S(N(0, σ²)), the differential entropy of a normal distribution of standard deviation σ, is ln σ plus a constant (Michalowicz et al., 2013, p. 127), sharpening P according to (18) leads to

P_κ(0) = σ^κ P(0) / (σ^κ P(0) + 1 − P(0))   (23)

and

P_κ(0 | x) = σ^κ B(x; σ) P(0) / (σ^κ B(x; σ) P(0) + 1 − P(0))   (24)

as the sharpened prior and posterior probabilities of the null hypothesis. The sharpened prior is plotted in Figure 1.
Equation (24) suggests viewing σ^κ B(x; σ) as the κ-sharpened Bayes factor, applicable regardless of the value of P(0). Under the ideal value of κ derived in Section 4, that simplicity adjustment, when coupled with an argument of Benjamin et al. (2017), leads to 0.001 or 0.01 rather than 0.005 or 0.05 as the default p-value threshold of statistical significance (Bickel, 2019c).
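A numerical sketch of the κ-sharpened posterior as a function of the two-sided p value follows. Since the closed form of B(x; σ) is not restated above, the normal-versus-normal density ratio below is an assumed illustration consistent with z(X) ∼ N(0, 1) under the null and z(X) ∼ N(0, σ²) under the alternative; the function name is also illustrative.

```python
from statistics import NormalDist

def sharpened_posterior_null(p_value, sigma, kappa, prior_null=0.5):
    """Posterior probability of the null from the kappa-sharpened Bayes factor."""
    z = NormalDist().inv_cdf(1 - p_value / 2)          # p_one-quantile of N(0, 1)
    # Assumed illustrative form of B(x; sigma): the N(0, 1) density under the
    # null over the N(0, sigma^2) density under the alternative, at z.
    bf = NormalDist().pdf(z) / NormalDist(sigma=sigma).pdf(z)
    sharp_bf = sigma ** kappa * bf                     # kappa-sharpened Bayes factor
    odds = prior_null / (1 - prior_null) * sharp_bf    # posterior odds of the null
    return odds / (1 + odds)

# Sharpening (kappa > 0) multiplies the Bayes factor by sigma^kappa, raising
# the posterior probability of the null when sigma > 1.
p0 = sharpened_posterior_null(0.05, sigma=2.0, kappa=0.0)
p1 = sharpened_posterior_null(0.05, sigma=2.0, kappa=1.0)
```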
Alternatively, there is a true Bayes factor that is either the unsharpened Bayes factor B(x; σ) or the κ-sharpened Bayes factor σ^κ B(x; σ) for a κ known only to lie within some interval. Let κ_min ≥ 0 and κ_max ≥ κ_min denote the lowest and highest possible values of the true κ; hence, κ_min = 0 if the true Bayes factor is possibly unsharpened, and κ_max = 0 if the true Bayes factor is necessarily unsharpened. Since κ_min ≤ κ ≤ κ_max, the true Bayes factor is in [σ^κ_min B(x; σ), σ^κ_max B(x; σ)]. Suppose a decision maker (DM) must decide whether or not to reject the null hypothesis under some loss for rejecting a true null hypothesis and some potentially different loss for failing to reject a false null hypothesis. The action the DM takes minimizes the expected loss with respect to the posterior distribution corresponding to the Bayes factor that a scientist reports. When deciding which Bayes factor in [σ^κ_min B(x; σ), σ^κ_max B(x; σ)] to report, the scientist incurs regret according to a caution parameter C ∈ [0, 1] whenever the DM takes an action different from the action that minimizes the DM's loss function with respect to the posterior distribution corresponding to the unknown true Bayes factor. Accordingly, the scientist reports the minimax Bayes factor, the Bayes factor that minimizes the scientist's expected regret while an opponent chooses the true Bayes factor to maximize that expected regret.

Figure 1: The sharpened prior probability P_κ(0) that the null hypothesis is true, as a function of the unsharpened prior probability P(0), according to (23) with σ = 2. The degrees of sharpness are κ = 4, κ = 2, κ = 1, κ = 1/2, and κ = 0 from the darkest curve to the lightest curve.
If the scientist does not know the DM's loss function but can assume certain invariance properties, then the minimax Bayes factor is a C-weighted geometric mean of the two most extreme Bayes factors (Bickel, 2019b, Proposition 1), which in this case is the κ*-sharpened Bayes factor, where κ* = (1 − C) κ_min + C κ_max, the weighted arithmetic mean called the minimax sharpness. That has some interesting consequences:
1. At least some sharpening is optimal unless the true Bayes factor is known to be unsharpened or the regret is at the least cautious extreme. More precisely, κ* > 0 unless κ_max = 0 or both C = 0 and κ_min = 0.
3. In the case of intermediate caution (C = 1/2), the minimax sharpness is the unweighted arithmetic mean of the lowest possible sharpness and the highest possible sharpness: κ* = (κ_min + κ_max)/2.
4. For any degree of caution, if the unsharpened Bayes factor could be true (κ_min = 0), then the minimax sharpness is simply κ* = C κ_max.
5. Putting the conditions of consequences 3 and 4 together with taking Section 4's ideal sharpness as the maximum (κ_max = 1) yields κ* = 1/2 as a low-sharpness default.
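The consequences above can be checked with a minimal sketch that assumes only the quoted geometric-mean result, i.e., the identity σ^((1−C)κ_min + C κ_max) B = (σ^κ_min B)^(1−C) (σ^κ_max B)^C; the function names are illustrative:

```python
def minimax_sharpness(kappa_min, kappa_max, C):
    # C-weighted arithmetic mean of the extreme sharpness values
    return (1.0 - C) * kappa_min + C * kappa_max

def minimax_bayes_factor(bayes_factor, sigma, kappa_min, kappa_max, C):
    # The C-weighted geometric mean of the two extreme Bayes factors,
    # sigma**kappa_min * B and sigma**kappa_max * B, equals the
    # kappa*-sharpened Bayes factor sigma**kappa_star * B.
    kappa_star = minimax_sharpness(kappa_min, kappa_max, C)
    return sigma ** kappa_star * bayes_factor

# Geometric-mean identity check for B = 0.3, sigma = 2, C = 1/2:
B, sigma, C = 0.3, 2.0, 0.5
low = sigma ** 0.0 * B    # unsharpened extreme (kappa_min = 0)
high = sigma ** 1.0 * B   # ideal-sharpness extreme (kappa_max = 1)
geometric_mean = low ** (1.0 - C) * high ** C
assert abs(geometric_mean - minimax_bayes_factor(B, sigma, 0.0, 1.0, C)) < 1e-12
```

Consequences 3-5 follow directly: `minimax_sharpness(0.0, 1.0, 0.5)` returns 1/2, the low-sharpness default.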

Adjusting priors for the simplicity of stochastic processes
In a typical Bayesian analysis, the data constitute a sample of n observations that are conditionally independent given each value of the parameter θ. Such data can be viewed as n observations of an IID stochastic process labeled by an unknown value of θ. More generally, the data consist of a time series of n observations of a stationary stochastic process labeled by an unknown value of θ. In either case, Bayesian coherence does not allow either the unsharpened prior distribution of θ or the sharpened prior distribution of θ to depend on n. Under that restriction, this section applies sharpened prior distributions to stochastic processes in order to facilitate Bayesian data analysis.
That definition ensures that the expression for sharpened priors over stochastic processes has the same form as those over other observables.

Finite-Θ examples
In both of the following examples from Section 1, the sample space X^(∗) = X^(1) is finite, as is the parameter space Θ. In (26), both f^(1)_θ and π are PMFs, and ν is the counting measure.
according to (26). Thus, the ideal sharpened prior probability (κ = 1) of a die is inversely proportional to how many sides it has.
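Equation (26) is not reproduced in this excerpt; the sketch below assumes the sharpened prior has the form π_κ(θ) ∝ π(θ) exp(−κ S(f_θ)), with a uniform preliminary prior π and S the Shannon entropy. For a fair m-sided die, S = ln m, so each die's weight is m^(−κ):

```python
PLATONIC_SIDES = [4, 6, 8, 12, 20]  # the five Platonic solids of Example 1

def sharpened_die_prior(kappa):
    # entropy of a fair m-sided die is ln m, so the sharpening factor
    # exp(-kappa * ln m) equals m ** (-kappa); preliminary prior is uniform
    weights = {m: m ** (-kappa) for m in PLATONIC_SIDES}
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}

unsharpened = sharpened_die_prior(0.0)  # kappa = 0: uniform over the dice
ideal = sharpened_die_prior(1.0)        # kappa = 1: probability ∝ 1/m
```

At κ = 1 the prior probability of the 4-sided die is exactly five times that of the 20-sided die, matching the inverse proportionality stated above.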
Example 5. Example 2, continued. With the uniform distribution on the configuration space Θ as the preliminary prior, reasoning analogous to that of Example 4 yields the analog of (27), with θ as the configuration of colors. Substituting κ = 1 shows that the ideal sharpened prior probability of a configuration is inversely proportional to how many colors it has.
The prior distributions resulting in the κ = 1 case of both examples are reciprocal distributions, which are important in studies of Benford's law (Hill, 1995; Pietronero et al., 2001; Kossovsky, 2014, p. 238). While those finite-domain examples motivate the theory, the remaining examples illustrate the sharpening of priors for statistical data analysis with infinite domains.
Example 7. Maximizing the entropy of an asymptotic posterior, relative to a prior, leads to the Berger-Bernardo (e.g., Berger and Bernardo, 1989) reference priors (Kass and Wasserman, 1996); see Berger et al. (2009). In the case of a normal data distribution with unknown mean μ and unknown standard deviation σ, the reference prior density π(μ, σ) is proportional to 1/σ (Ghosh et al., 2006, §5.1.0). As noted in Example 3, the differential entropy S(f^(0)_{μ,σ}) of a normal distribution f^(0)_{μ,σ} = N(μ, σ²) is ln σ up to a constant term. Then (26) prescribes the sharpened counterpart (28) of the reference prior density, which, in the case of Section 4's κ = 1, is the left-invariant measure (Bickel, 2016). Since the reference density in this case is a probability matching prior, (28) may also be used to adjust p values and confidence intervals for simplicity (Bickel, 2019a).
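Equation (28) is not reproduced in this excerpt; if (26) sharpens a preliminary density by the factor e^{−κS}, as assumed in the finite examples above, the computation would run:

```latex
\pi_\kappa(\mu,\sigma)
  \;\propto\; \pi(\mu,\sigma)\, e^{-\kappa S\left(f^{(0)}_{\mu,\sigma}\right)}
  \;\propto\; \frac{1}{\sigma}\, e^{-\kappa \ln \sigma}
  \;=\; \sigma^{-(1+\kappa)} .
```

With κ = 1 this gives π_1(μ, σ) ∝ σ^{−2}, consistent with the left-invariant measure of the location-scale group mentioned above.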
9 Discussion: Which priors should be adjusted for simplicity?
The idea of Examples 1-2 and Section 4, that priors are adjusted according to the coincidences involved in constructing the system studied, has implications for whether and how to adjust priors for the simplicity of data distributions. First, adjustments for simplicity may be warranted not only for default priors such as those in Examples 4-7 but also for other priors that do not account for the coincidences involved in the construction of the system studied (e.g., Example 3). Another implication is that prior distributions that represent known physical variability do not require adjustments for simplicity (cf. Bickel, 2019c), for their probabilities are limiting relative frequencies that do not depend on the construction of systems.
A third implication is that each f_θ used to adjust a prior for simplicity must reflect the variability intrinsic to the system studied, as opposed to technical variability or measurement error. Otherwise, the sharpened prior, like some default priors, would depend on the details of the experiment or observational study, in violation of the likelihood principle. That charge of violating the likelihood principle is often made against default priors that depend on the sampling model (e.g., Ghosh et al., 2006, §5.2; Kadane, 2011, §12.8).