Controlling the degree of caution in statistical inference with the Bayesian and frequentist approaches as opposite extremes

In statistical practice, whether a Bayesian or frequentist approach is used in inference depends not only on the availability of prior information but also on the attitude taken toward partial prior information, with frequentists tending to be more cautious than Bayesians. The proposed framework defines that attitude in terms of a specified amount of caution, thereby enabling data analysis at the level of caution desired and on the basis of any prior information. The caution parameter represents the attitude toward partial prior information in much the same way as a loss function represents the attitude toward risk. When there is very little prior information and nonzero caution, the resulting inferences correspond to those of the candidate confidence intervals and p-values that are most similar to the credible intervals and hypothesis probabilities of the specified Bayesian posterior. On the other hand, in the presence of a known physical distribution of the parameter, inferences are based only on the corresponding physical posterior. In those extremes of either negligible prior information or complete prior information, inferences do not depend on the degree of caution. Partial prior information between those two extremes leads to intermediate inferences that are more frequentistic to the extent that the caution is high and more Bayesian to the extent that the caution is low.


Introduction
The controversy between Bayesianism and frequentism may be irresolvable to the extent that it reflects honest differences in personal attitudes of statisticians rather than differences in their knowledge or rationality. As Efron (2005) pointed out,

"The Bayesian-frequentist debate reflects two different attitudes about the process of doing science, both quite legitimate. Bayesian statistics is well suited to individual researchers, or a research group, trying to use all of the information at its disposal to make the quickest possible progress. In pursuing progress, Bayesians tend to be aggressive and optimistic with their modeling assumptions. Frequentist statisticians are more cautious and defensive. One definition says that a frequentist is a Bayesian trying to do well, or at least not too badly, against any possible prior distribution. The frequentist aims for universally acceptable conclusions, ones that will stand up to adversarial scrutiny."
On one hand, methodology reflecting extreme caution in the form of the minimax-like attitude attributed to frequentists and, on the other hand, methodology reflecting the extreme reliance on modeling assumptions attributed to Bayesians both play useful roles in statistical inference. Building on that premise, the idea motivating this paper is that methodology for moderate amounts of caution also has a place in practical data analysis. The extent of such caution will be formally defined in order to facilitate making statistical inferences at the level of caution appropriate to the situation.
The formal definition will build on previous work to formalize caution in the face of uncertainty. Attitudes toward uncertainty have long been mathematically modeled in the economics literature. Ellsberg (1961) identified two distinct types of uncertainty: risk is the variability in an unknown quantity that threatens assets, whereas ambiguity is ignorance about the extent of such variability. The same agent may be much more cautious toward risk than toward ambiguity or vice versa. A utility or loss function can model an agent's attitude toward risk but not its attitude toward ambiguity. Because frequentist actions can differ from Bayesian actions given the same loss function, the ambiguity attitude is much more pertinent than the risk attitude to the concept of caution needed to represent and balance the two basic approaches to statistical inference. Ellsberg (1961) distinguished "pessimism" from "conservatism": the former is an excessive belief that worst-case scenarios will materialize, whereas the latter only involves cautiously acting as if they will. In other words, the attitude of "hoping for the best, preparing for the worst" is consistent with conservatism but not pessimism.

             Red drawn   Black drawn   Yellow drawn
action I     $100        $0            $0
action II    $0          $100          $0
Table 1: Utility function for actions I-II and the three possible states of nature.

             Red drawn   Black drawn   Yellow drawn
action III   $100        $0            $100
action IV    $0          $100          $100
Table 2: Utility function for actions III-IV and the three possible states of nature.
While that attitude does motivate much of frequentist statistics, "conservatism" already has technical meanings in the statistics literature; e.g., conservative confidence intervals have higher-than-nominal coverage rates. For that reason, the term "caution" will be used when assigning an operational definition to the degree of conservatism toward ambiguity in the sense of Ellsberg (1961).
In an example from Ellsberg (1961), a ball is randomly drawn from an urn of 90 balls, each of one of three possible colors: red, black, and yellow. Nothing is known about the distribution of the balls in the urn except that exactly 30 are red. Thus, there is ambiguity in the distribution of black and yellow balls. The agent would gain a reward of $0 or $100 based on its taking action I or action II according to the utility function displayed as Table 1 in setting 1. In setting 2, the agent would instead gain $0 or $100 based on its taking action III or action IV according to the utility function displayed as Table 2. Agents cautious toward ambiguity would choose action I over action II in setting 1 but would take action IV over action III in setting 2, against subjective Bayesian concepts of coherence but without requiring the extreme caution of a minimax strategy (Ellsberg, 1961).
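The cautious preferences in this example can be checked numerically. The following sketch is illustrative code (the function and variable names are not from the original): it enumerates the possible black-ball counts and compares each action's worst-case expected payoff over that ambiguity set.

```python
# Worst-case expected payoffs in Ellsberg's urn: 30 red balls, and
# black + yellow = 60 with the split unknown (the ambiguity).
def expected_payoff(action, b):
    red, black, yellow = 30, b, 60 - b
    probs = {"red": red / 90, "black": black / 90, "yellow": yellow / 90}
    payoffs = {
        "I":   {"red": 100, "black": 0,   "yellow": 0},
        "II":  {"red": 0,   "black": 100, "yellow": 0},
        "III": {"red": 100, "black": 0,   "yellow": 100},
        "IV":  {"red": 0,   "black": 100, "yellow": 100},
    }[action]
    return sum(probs[c] * payoffs[c] for c in probs)

# Minimum expected payoff over all plausible black-ball counts b = 0..60.
worst = {a: min(expected_payoff(a, b) for b in range(61))
         for a in ("I", "II", "III", "IV")}
```

Action I beats action II and action IV beats action III in the worst case, reproducing the ambiguity-averse choices without any single prior over the black-yellow split.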
In the absence of ambiguity, the axiomatic system of von Neumann and Morgenstern (1953, §3.6) and later generalizations prescribe choosing the action that maximizes expected utility. By forcefully applying such a system to conditional expectations given observed data, Savage (1954) revitalized Bayesian statistics. The action that maximizes expected utility with respect to a Bayesian posterior is called the posterior Bayes action. Ambiguity about the posterior is usually modeled in terms of a set Ṗ of multiple posteriors in place of a single posterior. A multiplicity of posteriors may arise from insufficient elicitation of subjective prior opinions (e.g., Berger et al., 2000), from a spread in a gamble's buying and selling prices (e.g., Walley, 1991), or, more objectively, from ignorance as to which prior distribution in a set describes the physical variability of a parameter. The last source accords best with the notion of ambiguity as used in Ellsberg (1961), Jaffray (1989b), Jaffray (1989a), and Gajdos et al. (2004).
In the Bayesian statistics literature, the most studied decision-theoretic approach for sets of priors is the (marginal) Γ-minimax strategy (e.g., Berger et al., 2000), which formulates the problem in terms of minimax risk in the frequentist sense of Wald (1961). The closely related conditional Γ-minimax strategy (e.g., Betrò and Ruggeri, 1992) takes the action that minimizes the expected loss maximized over all of the posterior distributions in a set Ṗ, each member of which corresponds to a prior distribution in a set traditionally denoted by Γ. That statistical strategy is a special case of the maxmin expected utility strategy (Hurwicz, 1951b; Gilboa and Schmeidler, 1989), which takes the action that maximizes the expected utility minimized over a set of distributions. Both "robust Bayesian" strategies are reviewed in Vidakovic (2000).
The following equation extends the conditional Γ-minimax strategy to the problem of conducting statistical inference at a specified degree of caution κ and with respect to a Bayesian posterior Ṗ ∈ Ṗ that is not generally the true physical distribution of the parameter θ. For any κ ∈ [0, 1], the κ-conditional-Γ (κCG) action is defined as

ȧ_κ = arg inf_{a ∈ A} [κ sup_{P′ ∈ Ṗ} ∫ L(a, θ) dP′(θ) + (1 − κ) ∫ L(a, θ) dṖ(θ)],    (1)

with the conventions that κ × ∞ = 0 if κ = 0 and (1 − κ) × ∞ = 0 if κ = 1. The κCG action reduces to the conditional Γ-minimax action under complete caution (κ = 1) and to the posterior Bayes action in the complete absence of caution (κ = 0). For discrete θ, this κ is isomorphic to quantities used by Ellsberg (1961), Gajdos et al. (2004), and Tapking (2004) and is similar in spirit to the quantity in Hurwicz (1951a) and Jaffray (1989b) that Augustin (2002) calls "caution." Gajdos et al. (2004) stressed the equivalent of the rearrangement of equation (1) as

ȧ_κ = arg inf_{a ∈ A} sup_{P′ ∈ Ṗ_κ} ∫ L(a, θ) dP′(θ), where Ṗ_κ = {κP′ + (1 − κ)Ṗ : P′ ∈ Ṗ}.    (2)

The κCG strategy has two drawbacks that will prevent its use in many applications. First, under standard loss functions, the conditional Γ-minimax (1CG) strategy requires either that Ṗ impose strict bounds on the parameter space (Abdollah Bayati and Parsian, 2011) or that A be severely restricted (Betrò and Ruggeri, 1992), and the κCG strategy with 0 < κ < 1 has the same limitation. Second, since the 1CG strategy is not necessarily a frequentist procedure, the κCG framework does not fulfill the above goal of formulating procedures that reduce to frequentist procedures given complete caution.
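For finite parameter and action spaces, the κCG action can be computed directly. The sketch below is illustrative only (the names are assumptions, not from the paper); it interpolates between the posterior Bayes action at κ = 0 and the conditional Γ-minimax action at κ = 1.

```python
# A minimal sketch of the κ-conditional-Γ (κCG) action for finite spaces.
# loss[a][t]: loss of action a when θ takes its t-th value;
# posteriors: the set Ṗ of plausible posteriors; working: the working posterior Ṗ.
def kcg_action(loss, posteriors, working, kappa):
    def expected(a, post):
        return sum(p * l for p, l in zip(post, loss[a]))
    def objective(a):
        worst = max(expected(a, post) for post in posteriors)  # sup over Ṗ
        return kappa * worst + (1 - kappa) * expected(a, working)
    return min(range(len(loss)), key=objective)

# Two actions, binary θ, ambiguous posterior probability of θ's second value:
loss = [[0.0, 1.0], [1.0, 0.0]]
posteriors = [[0.9, 0.1], [0.2, 0.8]]
working = [0.3, 0.7]
bayes = kcg_action(loss, posteriors, working, 0.0)          # posterior Bayes action
gamma_minimax = kcg_action(loss, posteriors, working, 1.0)  # conditional Γ-minimax
```

Here the two extremes disagree: the working posterior favors action 1, while the worst case over the plausible posteriors favors action 0; intermediate κ trades the two objectives off.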
Following the preliminary notation and definitions of Section 2, an information-theoretic framework will be introduced in Section 3 to overcome the identified limitations of the κCG framework. Simple examples demonstrating the wide applicability of the information-theoretic framework will appear in Section 4. Further generalization will be carried out in Section 5 by exchanging the roles of frequentist and Bayesian procedures and by noting the particular applications that call for each role exchange.
Section 6 closes the paper with a brief discussion.
2 Bayesian and frequentist posterior distributions

2.1 Preliminary concepts
The observed data vector x ∈ X is modeled as a realization of a random variable X of probability space (X, X, P_{θ*,λ*}), which for some parameter set Θ* × Λ* is indexed by an interest parameter θ* ∈ Θ* and potentially also by a nuisance parameter λ* ∈ Λ*.
Inferences will be made about the focus parameter θ = θ(θ*), a subparameter of the interest parameter, in a set Θ. In the simplest case, θ = θ* and Θ = Θ*, but there are many other possibilities. For example, when testing the null hypothesis that θ* = 0 against the alternative hypothesis that θ* ≠ 0 for Θ* = R, it is convenient to define the focus parameter by θ = 0 if θ* = 0 and θ = 1 if θ* ≠ 0, in which case Θ = {0, 1}. Let H denote a σ-field that allows any physically meaningful hypothesis about θ to be expressed as "θ is in Θ†," where Θ† ∈ H.

2.2 Bayesian posteriors
In the Bayesian setting, the above sampling model is understood as conditional on the parameter values with respect to some prior distribution, as follows. Every member P*^prior of some set P*^prior is a distribution such that there is a random triple (Ẋ, θ*, λ*) ∼ P*^prior and such that P_{θ*,λ*} is the conditional distribution of Ẋ given the parameter values (θ*, λ*). Let P denote the set of all probability distributions on (Θ, H). Before observing data, knowledge about the focus parameter is represented by P^prior, the set of the distributions of (Ẋ, θ(θ*)) for all P*^prior ∈ P*^prior. The marginal distribution of each θ(θ*) is in P and is called a plausible prior distribution since it is consistent with pre-data knowledge.
The Bayesian approach yields inferences about the focus parameter on the basis of a single distribution Ṗ^prior ∈ P^prior. If (Ẋ, θ) ∼ Ṗ^prior, then the working prior distribution is the marginal distribution of θ. It follows that the working prior is one of the plausible priors.
The working Bayesian posterior Ṗ and the knowledge base Ṗ (Topsøe, 2004) are defined such that Ṗ is simply the Bayesian posterior distribution corresponding to the working prior, and Ṗ is likewise the set of Bayesian posteriors in P that correspond to plausible prior distributions. To prevent confusion with Ṗ, members of Ṗ will be referred to as plausible posteriors since they are the parameter distributions consistent with the mathematical representation either of a physical system or of a belief system (cf. Topsøe, 1979, 2004). Thus, the posterior that would be used in purely Bayesian inference is one of the plausible posteriors Ṗ ∈ Ṗ.

2.3 Confidence posteriors
The sampling model of Section 2.1 admits not only system constraints and Bayesian inference (§2.2) but also frequentist inference in the form of confidence intervals and p-values, as follows. A confidence posterior is a distribution P* on (Θ*, H*) such that

P*(Θ†) equals the confidence level of the hypothesis that θ* ∈ Θ†    (4)

for all x ∈ X and Θ† ∈ H*, where θ* is a random variable of distribution P*. Various devices extend confidence posteriors to cases in which their posterior probabilities only approximately match confidence levels (Schweder and Hjort, 2002; Singh et al., 2005; Polansky, 2007; Bickel, 2011b).
The identity between confidence posterior probabilities and levels of confidence (4) clears up the misunderstanding that confidence levels and p-values cannot be interpreted as epistemological probabilities of hypotheses given the observed data.
In fact, since P* is a Kolmogorov probability measure on the parameter space, decisions made by taking the confidence posterior action under each loss function L are coherent with one another in the senses usually associated with Bayesian inference, whether or not P* can be derived from some prior via Bayes's theorem (Bickel, 2011c,b).
Let P* denote the set of confidence posteriors on (Θ*, H*) that are under consideration. For example, P* could be the set containing a single confidence posterior, the set of all distributions on (Θ*, H*) that satisfy equation (4), or, as in Bickel (2011b), the set of two approximate confidence posteriors or the convex set of all mixtures of the two.
The set P will represent the set of distributions of θ(θ*) for all P* ∈ P*. Thus, for any P* ∈ P*, there is a random parameter θ = θ(θ*) of distribution P ∈ P.

3 Framework of moderate inference

3.1 Moderate posteriors
Let P and Q denote probability distributions on (Θ, H). The information divergence of P with respect to Q is defined as

I(P ‖ Q) = ∫ log(dP/dQ) dP

if Q is absolutely continuous with respect to P and I(P ‖ Q) = ∞ if not, where 0 log(0) = 0 and 0 log(0/0) = 0. I(P ‖ Q) goes by many names in the literature, including "Kullback-Leibler information" and "cross entropy." Viewing I(P ‖ Q) as information leads to the concept of how much information for statistical inference would be gained by replacing a confidence posterior P″ ∈ P (the set of benchmark posteriors) with another posterior Q ∈ P if the plausible posterior P′ ∈ Ṗ were the physical distribution of the parameter θ. Specifically, as a special case of "information gain" (Pfaffelhuber, 1977),

gain(Q; P″ | P′) = I(P′ ‖ P″) − I(P′ ‖ Q)

is called the inferential gain of Q relative to P″ given P′ (Bickel, 2011a). (The notation is borrowed from Topsøe (2007).) In analogy with equation (2), the caution κ ∈ [0, 1] is then the extent to which a "worst-case" plausible posterior P′ ∈ Ṗ is used for inference as opposed to the working Bayesian posterior Ṗ in this definition of the κ-inferential gain of Q relative to P″ given P′ and Ṗ:

gain_κ(Q; P″ | P′, Ṗ) = κ gain(Q; P″ | P′) + (1 − κ) gain(Q; P″ | Ṗ).

The posterior distribution that has the highest κ-inferential gain in the following sense will be used for making inferences and decisions. The moderate posterior distribution with caution κ relative to P given Ṗ and Ṗ is denoted by P_κ and defined as a maximizing Q in

inf_{P″ ∈ P} sup_{Q ∈ P} inf_{P′ ∈ Ṗ} gain_κ(Q; P″ | P′, Ṗ).    (7)

Less technically, P_κ is the posterior distribution that maximizes the worst-case inferential gain relative to the confidence posterior P″, which is in turn chosen to minimize the maximum worst-case gain. In the case that equation (7) does not have a unique solution, the moderate posterior is defined to be as close as possible to the working Bayesian posterior:

P_κ = arg inf_{Q ∈ P_κ} I(Q ‖ Ṗ),    (8)

where the set P_κ of candidate moderate posteriors is defined as the set of all distributions in P such that every member of P_κ solves equation (7). By letting

G_κ(Q; P″) = inf_{P′ ∈ Ṗ} gain_κ(Q; P″ | P′, Ṗ)

for any P″ ∈ P, that set may be written as

P_κ = {P ∈ P : inf_{P″ ∈ P} G_κ(P; P″) = inf_{P″ ∈ P} sup_{Q ∈ P} G_κ(Q; P″)}.

The moderate posterior action with caution κ is

ȧ_κ = arg inf_{a ∈ A} ∫ L(a, θ) dP_κ(θ),

which defines making decisions on the basis of the moderate posterior as taking actions that minimize its expected loss. For example, if P is the only confidence posterior under consideration, then P = {P} and

P_κ = arg sup_{Q ∈ P} inf_{P′ ∈ Ṗ_κ} gain(Q; P | P′),

which recalls equation (2). Since Ṗ ∈ Ṗ_κ ⊆ Ṗ, Ṗ_0 = {Ṗ}, and Ṗ_1 = Ṗ, the effect of κ < 1 as opposed to κ = 1 is to replace the knowledge base Ṗ with a subset Ṗ_κ containing the working Bayesian posterior Ṗ (cf. Gajdos et al., 2004).
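As a rough numeric illustration (all names hypothetical, with a grid search standing in for the optimization), the moderate posterior can be computed for a binary parameter and a single benchmark confidence posterior, each distribution summarized by its probability of θ = 0:

```python
import math

def kl(a, b):
    """I(A || B) for A = Bernoulli(a), B = Bernoulli(b)."""
    return sum(pa * math.log(pa / pb)
               for pa, pb in ((a, b), (1 - a, 1 - b)) if pa > 0)

def gain(q, benchmark, physical):
    # inferential gain of Q relative to the benchmark, given physical P'
    return kl(physical, benchmark) - kl(physical, q)

def moderate_posterior(benchmark, working, plausible, kappa, grid):
    def worst_kappa_gain(q):
        return min(kappa * gain(q, benchmark, p)
                   + (1 - kappa) * gain(q, benchmark, working)
                   for p in plausible)
    return max(grid, key=worst_kappa_gain)  # grid search over candidate Q

grid = [i / 100 for i in range(1, 100)]
plausible = [i / 100 for i in range(5, 91)]  # ambiguous null probabilities
benchmark, working = 0.10, 0.40
no_caution = moderate_posterior(benchmark, working, plausible, 0.0, grid)
full_caution = moderate_posterior(benchmark, working, plausible, 1.0, grid)
```

At κ = 0 the search returns the working posterior probability, and at κ = 1 it returns the benchmark whenever the benchmark is itself a plausible posterior, mirroring the reductions to the two extreme cases.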
The two extreme cases of caution reduce decision making to previous frameworks.
A complete lack of caution (κ = 0) leads to the sole use of the working Bayesian posterior for the minimization of posterior expected loss: P_0 = Ṗ. On the other hand, complete caution (κ = 1) leads to ignoring the working Bayesian posterior and, in the case of a single confidence posterior, to the framework of Bickel (2011a), in which P_1 is called the blended posterior.
Remark 1. Unless κ = 0, the condition that P ∩ Ṗ_κ be nonempty holds whenever the plausible posteriors are sufficiently unrestricted. The most important such setting for applications is a complete lack of constraints (Ṗ = P), in which case, provided that P is convex and unbounded, P ∩ Ṗ_κ = P ∩ P = P for any κ ∈ (0, 1].

4 Examples
The first two examples involve the continuous, scalar parameters typical of point and interval estimation (Θ = R). For simplicity, each uses only a single confidence posterior, P = {P}.
Example 2. X ∼ N (θ, 1) with no information about θ except that θ ∈ R = Θ, that X = x is observed, and that Ṗ is the working Bayesian posterior distribution of θ.
It follows that Ṗ is the set of all distributions on the Borel space (R, B(R)). Again under quadratic loss, by equation (1), the κCG estimate is

θ̂_κ = arg inf_{θ̂ ∈ Θ} [κ sup_{P′ ∈ Ṗ} ∫ (θ − θ̂)² dP′(θ) + (1 − κ) ∫ (θ − θ̂)² dṖ(θ)],

which is the posterior mean ∫ θ dṖ(θ) if κ = 0 but which has no unique value for any other value of κ since sup_{P′ ∈ Ṗ} ∫ (θ − θ̂)² dP′(θ) = ∞ for any θ̂. By contrast, equation (11) specifies the unique moderate-posterior estimate given P = N(x, 1):

θ̂_κ = ∫ θ dP_κ(θ),

where, provided that κ > 0, P_κ = P according to Corollary 2 since P ∈ P = Ṗ_κ, leading to the frequentist posterior mean θ̂_κ = x.
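The contrast in this example can be sketched numerically (illustrative names; a point-mass argument stands in for the supremum over the unrestricted set Ṗ):

```python
# Over the unrestricted set Ṗ, a point mass at any θ is a plausible posterior,
# so the worst-case quadratic loss of an estimate t is unbounded: the point
# mass at θ = t + m already gives expected loss m², for arbitrarily large m.
def quadratic_loss_under_point_mass(t, theta):
    return (theta - t) ** 2

def worst_case_lower_bound(t, m):
    # a point mass at t + m certifies that the supremum exceeds m**2
    return quadratic_loss_under_point_mass(t, t + m)

# The moderate-posterior estimate, by contrast, is the mean of P_κ = N(x, 1)
# for any κ > 0 and the working posterior mean at κ = 0.
def moderate_estimate(kappa, x, working_mean):
    return working_mean if kappa == 0 else x
```

Because the certified lower bound grows without limit in m, no estimate has finite worst-case risk, which is why the κCG estimate is non-unique for κ > 0.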
The last example involves a discrete focus parameter, as is typical of hypothesis testing and model selection applications; the setting is that of Example 3, in which Ṗ(θ = 0) is the working Bayesian posterior probability that the null hypothesis is true. Let p(1) and p(2) denote observed p-values of the one-sided test of θ* = 0 versus θ* > 0 and thus of the two-sided test of θ** = 0 versus θ** ≠ 0. In this example, p(1)(x) ≤ p(2)(x), perhaps because p(2)(x) is based on a test that makes weaker parametric assumptions than that of p(1)(x). For i = 1, 2, let P(i)* denote the confidence posterior for θ* defined, as in Section 2.3, given some x ∈ X such that the defining condition holds for all θ* ∈ Θ* and λ* ∈ Λ*, where the dependence of P(i)* on x is suppressed. Since p(i)(X) ∼ U(0, 1) under the null hypothesis that θ* = 0, it follows that

P(i)*(θ* = 0) = p(i)(x),    (18)

i.e., the confidence posterior probability of the null hypothesis is equal to the p-value (Bickel, 2011d,a); cf. van Berkum et al. (1996). With θ = 0 if θ* = 0 and θ = 1 if θ* > 0, equation (18) yields P(i)(θ = 0) = p(i)(x). From widely applicable conditions for two-sided hypothesis testing (Sellke et al., 2001; Bickel, 2011a) and with some Ṗ∅^prior ∈ (0, 1) given as the lower bound of the prior probabilities of the null hypothesis and the restriction that no such probability is 1, the knowledge base Ṗ is the set of plausible posteriors, the distributions on ({0, 1}, 2^{0,1}) with

β(p(1)(x)) ∨ β(p(2)(x)), where β(p) = (1 + ((1 − Ṗ∅^prior)/Ṗ∅^prior)(−e (p ∧ e^{−1}) log(p ∧ e^{−1}))^{−1})^{−1},

as the lower bound of the plausible posterior probability of the null hypothesis, where θ ∼ Ṗ. That lower bound is the greater of the two lower bounds found by separately applying the methodology of Sellke et al. (2001) to p(1)(x) and p(2)(x). (The binary operator ∧ in the above equation means "the minimum of," and ∨ will similarly stand for "the maximum of.") Since Theorem 1 applies, the moderate posterior P_κ is given by equation (8); more simply, the solution depends on whether I(P_κ ‖ P(1)) is less than or greater than I(P_κ ‖ P(2)).
Letting Ṗ∅ = Ṗ(θ = 0) and letting θ denote the focus parameter according to the moderate posterior θ ∼ P_κ, formulas for P_κ(θ = 0) follow since P_κ ∈ P_κ; the extreme condition on p is omitted for brevity. In the case of no caution, the working Bayesian posterior probability is recovered: P_0(θ = 0) = Ṗ(θ = 0), which does not depend on p(x).
More interestingly, the case of complete caution leads to a formula for P_1(θ = 0) with no dependence on Ṗ(θ = 0). The simplifying effect of considering only a single p-value is evident from using p(1)(x) = p(2)(x) in formulas (20) and (21).
For example, expression (21) results in a unique P_1(θ = 0) equal to the blended posterior probability of Bickel (2011a). When formulas (20) and (21) say no more than P_κ(θ = 0) ∈ {p(1)(x), p(2)(x)}, equation (8) ensures the uniqueness of the moderate posterior probability by equating it with the p-value closest, in the sense of information divergence, to the working Bayesian posterior probability, which is a special case of Corollary 2. In this way, the caution parameter, the working Bayesian posterior, and the constraints on the plausible posteriors together overcome the dilemma of whether to use the more conservative p-value or the less conservative p-value.
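The selection rule at the end of this example can be sketched as follows (hypothetical names; distributions on {0, 1} are represented by their Bernoulli probabilities of the null):

```python
import math

def bernoulli_kl(q, p):
    """I(Q || P) for Q = Bernoulli(q), P = Bernoulli(p)."""
    return (q * math.log(q / p)
            + (1 - q) * math.log((1 - q) / (1 - p)))

def moderate_null_probability(p1, p2, working_null_prob):
    # per equation (8): of the two candidate p-values, take the one whose
    # posterior is closest in divergence to the working Bayesian posterior
    return min((p1, p2), key=lambda p: bernoulli_kl(p, working_null_prob))

# With a working posterior probability 0.25 of the null, the larger p-value
# p(2) = 0.20 is closer than p(1) = 0.01:
choice = moderate_null_probability(0.01, 0.20, 0.25)
```

The rule thus resolves the conservative-versus-anticonservative dilemma by deferring to whichever p-value disagrees least with the working Bayesian posterior.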
5 Extending the caution framework

5.1 Variations of the framework
The above framework for balancing Bayesian and frequentist approaches to inference does not apply to all situations encountered in applications. The various permutations of the Bayesian and confidence posteriors as the working posterior Ṗ, used exclusively in the absence of caution, and a benchmark posterior P, over which inference will be improved as much as possible, in equations (12) and (15) lead to four versions of the proposed approach:

1. Ṗ is a Bayesian posterior in Ṗ, and P is a confidence posterior. This version yields the balance between Bayesian and frequentist inference defined in Section 3 and illustrated in Section 4.

2. Ṗ is a confidence posterior, and P is a Bayesian posterior in Ṗ. The potential uses of this reversal are unclear since it would paradoxically lead to dependence on a single Bayesian posterior to the extent of the caution.

3. Ṗ = P, where Ṗ is a Bayesian posterior in Ṗ. Using the same Bayesian posterior as both the working posterior and the benchmark posterior is attractive in the absence of reliable confidence intervals or p-values from which a confidence posterior could be constructed. Thus, this version extends the scope of the framework across the domains to which Bayesian methods apply. However, this version becomes trivial whenever equation (15) holds according to Corollary 1, for in that case, P_κ = arg inf_{Q ∈ Ṗ_κ} I(Q ‖ Ṗ) = Ṗ for all κ ∈ [0, 1] since Ṗ ∈ Ṗ_κ necessarily. In other words, the Bayesian posterior would be used for inference irrespective of the degree of caution and the knowledge base.

4. Ṗ = P, where P is a confidence posterior. Using the same confidence posterior as both the working posterior and the benchmark posterior is useful when a set Ṗ of plausible posteriors can be specified but when no member of that set can be singled out as special. In many cases involving a continuous parameter θ, no such member can be derived from the knowledge base Ṗ without imposing arbitrary procedures such as averaging over the members with respect to some measure chosen for convenience. That will be explained in Section 5.2, where the case of two unequal confidence posteriors will also be considered.
For simplicity, the versions are described as if P = {P}, but they also pertain to a set P of multiple benchmark posteriors that define the moderate posterior P_κ according to equation (8). The use of κ = 1 in the absence of a working Bayesian posterior avoids excessive dependence on P at the expense of Ṗ, the knowledge base. On the other hand, allowing P_κ ∉ Ṗ makes P_κ less dependent on the precise borders of Ṗ, and this may be desirable to the extent that such borders are uncertain or subjectively specified.
An alternative to the above approach in the absence of a specified Ṗ is to apply the strategy of Section 3 with Ṗ as a function of Ṗ, following Gajdos (2008).Examples of functions that transform a set of distributions to a single distribution include the Steiner point (Gajdos, 2008), the arithmetic mean ("center of mass"), and the maximum entropy distribution (Paris, 1994).In the continuous-parameter case, such functions require a base measure for partitioning.
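Two of these set-to-point maps can be sketched for a finite parameter space (illustrative code, not from the original; for a finite set of distributions, the maximum-entropy map here simply selects the maximum-entropy member):

```python
import math

def center_of_mass(dists):
    """Arithmetic mean ("center of mass") of a list of distributions."""
    n = len(dists)
    return [sum(d[i] for d in dists) / n for i in range(len(dists[0]))]

def entropy(d):
    return -sum(p * math.log(p) for p in d if p > 0)

def max_entropy_member(dists):
    """The member of a finite set with maximum Shannon entropy."""
    return max(dists, key=entropy)

# Three plausible posteriors on a binary parameter:
dists = [[0.9, 0.1], [0.5, 0.5], [0.7, 0.3]]
center = center_of_mass(dists)
flattest = max_entropy_member(dists)
```

Either map turns the knowledge base into a single working posterior, at the price of the arbitrariness discussed in the text; in the continuous case, both additionally require a choice of base measure.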
There is no need to impose an arbitrary base measure if two different confidence posteriors Ṗ and P (Ṗ ≠ P) are under consideration. Using them as the working posterior and the benchmark posterior in equations (12) and (15) would be most appropriate when Ṗ represents a newer or riskier procedure and P corresponds to a better-established or more thoroughly tested procedure. More generally, equation (8) specifies how to apply a working confidence posterior Ṗ with a set P of benchmark confidence posteriors.

6 Discussion
The featured moderate-posterior methodology has been contrasted with the simpler κCG methodology.As Examples 1 and 2 illustrated under quadratic loss, the former can yield unique actions in a wide variety of settings in which the latter cannot.
Using CG minimaxity (κ = 1), uniqueness has been achieved under quadratic loss by restricting the action space to finite bounds (Betrò and Ruggeri, 1992) and by similarly restricting the parameter space Θ (Abdollah Bayati and Parsian, 2011).The moderate-posterior estimators did not require such restrictions.
The main advantage of the moderate-posterior framework is that it provides first principles from which a statistician may derive a Bayesian analysis, a frequentist analysis, or a combination of the two, depending on the chosen level of caution and on the quality of prior information.This allows the caution level to be precisely reported with the resulting statistical inferences.In addition, the caution level may be determined by the needs of an organization or collaborating scientist rather than by the personal attitude of the statistician.
Various factors may be considered in choosing the level of caution.For example, more caution with Bayesian inference may be warranted when the confidence posterior represents a frequentist procedure that has stood the test of time than when it represents a new frequentist procedure based on questionable assumptions.The caution level could then be interpreted as the pre-data degree of reluctance an agent has in modifying the frequentist procedures encoded in the confidence posterior.
The moderate-posterior framework of Section 3 is general enough to incorporate conflicting frequentist approaches, as seen in Example 3.For additional generality, Section 5.2 provides ways to modify the framework for situations in which any dependence on a subjective or guessed Bayesian posterior would be undesirable.
In other situations, any dependence of inference on the level of caution would be undesirable.Provided that there is at least a little caution, the use of a sufficiently broad set of plausible posteriors under the unmodified framework ( §3) eliminates any other dependence on the degree of caution (Remark 1).
Polansky (2007) called P*(Θ†) the observed confidence level of the hypothesis that θ* ∈ Θ†. Confidence posteriors for which θ* is a real scalar (Θ* ⊆ R) and the σ-field is Borel (H* = B(Θ*)) are usually called confidence distributions, each of which encodes confidence intervals of all confidence levels and hypothesis tests of all simple null hypotheses (Efron, 1993). P will be considered as a set of confidence posterior distributions of the focus parameter even though more literally they are not necessarily confidence posteriors but rather fiducial-like distributions derived from the set P* of confidence posteriors by the laws of probability. (Hannig (2009) provides a recent review of fiducial inference.) In the simplest case of θ = θ*, (Θ, H) = (Θ*, H*) and P = P*. While confidence distributions are used here for concreteness, P can be a set of any distributions on (Θ, H) to use as benchmarks with respect to which the posterior introduced in the next section is intended as an improvement.

Example 3.
Consider the indicator parameter θ defined such that θ = 0 if the null hypothesis about θ** is true (θ** = 0) and θ = 1 if the alternative hypothesis about θ** is true (θ** ≠ 0). Equivalently, in terms of θ* = |θ**|, θ = 0 if θ* = 0 and θ = 1 if θ* > 0. If Ṗ is a working Bayesian posterior for θ**, then Ṗ(θ = 0) is the corresponding working Bayesian posterior probability that the null hypothesis is true.