On Bayesian Oracle Properties

When model uncertainty is handled by Bayesian model averaging (BMA) or Bayesian model selection (BMS), the posterior distribution possesses a desirable "oracle property" for parametric inference if, for large enough data, it is nearly as good as the oracle posterior, obtained by assuming unrealistically that the true model is known and only the true model is used. We study the oracle properties in a very general context of quasi-posterior, which can accommodate non-regular models with cubic root asymptotics and partial identification. Our approach for proving the oracle properties is based on a unified treatment that bounds the posterior probability of model mis-selection. This theoretical framework can be of interest to Bayesian statisticians who would like to theoretically justify their new model selection or model averaging methods in addition to empirical results. Furthermore, for non-regular models, we obtain nontrivial conclusions on the choice of prior penalty on model complexity, the temperature parameter of the quasi-posterior, and the advantage of BMA over BMS.


Introduction
In one of the most highly cited statistical papers (which is cited more than 2,500 times already), Fan and Li (2001) introduced an "Oracle property" for a frequentist penalization method of model selection, that statistical inferences "work as well as if the correct submodel were known." Such a property has not been widely studied in the Bayesian context, with the exception of a few pioneering works: Ishwaran and Rao (2010) who considered linear models with spike and slab priors, Hong and Preston (2012) who addressed post selection prediction, and Li and Jiang (2014) who considered Bayesian generalized method of moments. However, it is well recognized that such a problem is important: model selection and inference after selection belong to the No. 1 open problem in Bayesian statistics (Jordan 2011).
The current paper reveals a simple and general relation between model selection consistency and the oracle property in the Bayesian context. The simplicity of this relation has motivated us to write a short paper. Instead of focusing on one application and developing the full details, we point out a diversity of possible applications with useful key references, leaving their complete development to possible future work. We believe that a longer paper, perhaps detailing applications to a particular example, would obscure the fundamental importance of the proposed simple and general relation. Nevertheless, we have included some details and technical proofs in the Supplementary Materials (SM), which also contain several other theoretical results of possible interest.
The main messages of this paper and its SM include:
• In the Bayesian context, the seemingly more complicated oracle property on the posterior distribution is equivalent to the seemingly simpler property of posterior consistency of model selection. We do not believe that this was previously unknown; more likely, such a relation has already been used implicitly in proofs for various special situations. The goal of our paper is to point out this relation explicitly and generally, and to apply it systematically.
• There are many previous works which have already established model selection consistency in their specific situations (see Section 4). Our current results then imply that they have actually also proved oracle properties of the posterior distributions as well in each case, even if they did not mention this in their papers. With some additional effort in bounding the model selection error, their results can be used to establish the stronger oracle properties for posterior mean as defined in our SM. These lead to many possibilities for further extension of existing Bayesian asymptotic results to the context of Bayesian model selection.
• The seemingly simpler property of posterior model selection consistency is indeed not a strong assumption. There exist general ways to prove it and bound the convergence rate, based on a general framework of quasi-posterior, which allows any empirical risk to be used as the (quasi-)log-likelihood.
• The quasi-posterior framework allows a nonstandard limiting distribution for the oracle posterior, which can be nonnormal, or have a nonstandard convergence rate different from $n^{1/2}$. As shown in the SM, we can accommodate discontinuous empirical risks with cubic-root asymptotics, and partial identification in Bayesian moment inequalities.
• Many of the relations that we established in this paper and its SM are assumption-free and simple, therefore very general. They reveal intrinsic relations among a variety of concepts without being tied to technical features that are specific to particular statistical models. To name a few of the relations that appear in the main text: Proposition 1 relates the total variation distance from the oracle performance to the model selection error; Proposition 2 relates model averaging, model selection, and the oracle model; Proposition 3 relates the model selection error to the risk function in the quasi-posterior.

Bayesian Model Averaging
Let $\pi$ be any probability measure, which will be taken to be the posterior probability measure conditional on the observed data. Let $M$ be a random index which will be taken to index a model, and $M_0$ be a possible value of $M$, which will be taken as the true model index under which the data are generated.
For any event $A$, we are interested in the difference
$$\pi(A) - \pi(A \mid M_0),$$
where $\pi(A \mid M_0)$ is the "oracle" posterior, pretending to have known the true model $M_0$, whereas $\pi(A) = \sum_M \pi(M)\,\pi(A \mid M)$ is the mixed posterior via model averaging, allowing possibilities of all models as weighted by the model posteriors $\pi(M)$ given the data.
Let us try to define an "oracle property" as the following (Oi)+(Oii):
(Oi) $\pi(M_0) = 1 - o_p(1)$;
(Oii) $\sup_A |\pi(A) - \pi(A \mid M_0)| = o_p(1)$.
That is, the true model has posterior probability converging to 1 as $n$ increases, AND the limiting behavior of the posterior distribution (for any parameter) in total variation is equivalent to the "oracle posterior" pretending to have known the true model. This kind of oracle property is similar in essence to the frequentist oracle property of Fan and Li (2001) but is more general.
To fully appreciate the generality of the current definition, note that in the current general context, we do not require the oracle posterior $\pi(\cdot \mid M_0)$ to have an asymptotic normal limiting distribution at a rate of square root $n$. The most attractive aspect of Fan and Li's oracle property is that the inferential results "work as well as if the correct submodel were known" (Fan and Li 2001, Abstract). This aspect is already fully captured in (Oii) and there is no need to impose restrictions on the nature of the oracle posterior $\pi(\cdot \mid M_0)$.
Such relaxation allows us to include many examples (in addition to classical examples with asymptotic normal limits), such as the situation with a discontinuous posterior distribution which is characterized by cubic root $n$ asymptotics (see, e.g., Jun, Pinske and Wan 2012), and partially-identifiable posterior distributions with $O(1)$ rate asymptotics (see, e.g., Liao and Jiang 2010). It is also noted that the current framework allows quasi-posterior distributions (such as a posterior based on a partially misspecified Gaussian model, or the Bayesian generalized method of moments considered in Li and Jiang 2014), which may not be constructed from a likelihood function, but nevertheless form a quasi-posterior, so long as it forms a probability measure $\pi(\cdot)$.
The formulation (Oii), of course, can also be applied to classical situations. Defining the oracle property through convergence in probability of the total variation difference allows a natural connection to classical Bayesian asymptotic normality, or the Bernstein-von Mises theorem, on the oracle posterior, which is formulated in the same mode of convergence.
The following fundamental equality can be derived:
Proposition 1. $\sup_A |\pi(A) - \pi(A \mid M_0)| = 1 - \pi(M_0)$.
This proposition reveals a deep relation between three different topics: model averaging ($\pi$), oracle performance ($\pi(\cdot \mid M_0)$), and model selection ($\pi(M_0)$). The total variation distance between the model average posterior and the oracle posterior is exactly equal to the posterior probability of missing the true model. Therefore, if there is model selection consistency, then $\sup_{A \in F} |\pi(A) - \pi(A \mid M_0)| = o_p(1)$ for any set of events $F$ that may be data dependent. Then, in total variation, $\pi(\cdot)$ and $\pi(\cdot \mid M_0)$ have the same limiting distributions. Therefore, on this most general level, we have obtained that model selection consistency (Oi) is equivalent to this oracle property (Oii) [or (Oi)+(Oii)]:
Theorem 1. Posterior model selection consistency (Oi) is equivalent to the posterior oracle property (Oii) for Bayesian model averaging.
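The equality between the total variation distance and the probability of missing the true model can be checked numerically on a small discrete example. The sketch below is only an illustration under assumed inputs: the three models, their posterior weights, and the three-point parameter grid are hypothetical and do not come from the paper. It treats events on the joint space of (model index, parameter value), where the equality is exact, and also shows that restricting to events about the parameter alone can only give a smaller distance.

```python
from itertools import product

def tv_distance(p, q):
    """Total variation distance sup_A |P(A) - Q(A)| for two discrete
    distributions given as dicts over a common finite sample space."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Hypothetical toy inputs (not from the paper): model 0 plays the true model.
model_post = [0.7, 0.2, 0.1]               # pi(M_j)
theta_given_model = [                       # pi(theta | M_j) on a 3-point grid
    [0.6, 0.3, 0.1],
    [0.1, 0.3, 0.6],
    [0.3, 0.4, 0.3],
]

# Model-averaged posterior on the joint space (model index, theta index),
# and the oracle posterior, which puts all model mass on M_0.
joint = {(j, t): model_post[j] * theta_given_model[j][t]
         for j, t in product(range(3), range(3))}
oracle = {(0, t): theta_given_model[0][t] for t in range(3)}
tv_joint = tv_distance(joint, oracle)

# Events about theta only: marginalize the model index out.
marginal = {t: sum(joint[(j, t)] for j in range(3)) for t in range(3)}
oracle_marginal = {t: theta_given_model[0][t] for t in range(3)}
tv_theta = tv_distance(marginal, oracle_marginal)

miss_prob = 1.0 - model_post[0]             # posterior probability of missing M_0
print(tv_joint, tv_theta, miss_prob)
```

In this toy case the joint-space distance coincides with the mis-selection probability, while the parameter-only distance is strictly smaller, which is consistent with the bound holding for any (possibly data-dependent) sub-collection of events.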

Bayesian Model Selection
Although we have talked about model averaging so far, similar results also exist for Bayesian model selection. Suppose $\hat M$ is any Maximum-A-Posteriori model choice, so that $\pi(\hat M) = \max_M \pi(M)$. We are interested in the total variation distance between the posterior $\pi(\cdot \mid \hat M)$ based on Bayesian model selection, and the oracle posterior $\pi(\cdot \mid M_0)$ based on the true model $M_0$.
Then we have:
Proposition 2. The maximal total variation distance among the three posteriors $\pi$, $\pi(\cdot \mid \hat M)$, and $\pi(\cdot \mid M_0)$ is at most twice the posterior probability of missing the true model:
$$\max\{ d_{TV}(\pi, \pi(\cdot \mid \hat M)),\ d_{TV}(\pi, \pi(\cdot \mid M_0)),\ d_{TV}(\pi(\cdot \mid \hat M), \pi(\cdot \mid M_0)) \} \le 2\,[1 - \pi(M_0)],$$
where $d_{TV}(P, Q) = \sup_A |P(A) - Q(A)|$.
Therefore, model selection consistency (Oi) also implies the oracle property for Bayesian model selection, (Oiii): $\sup_A |\pi(A \mid \hat M) - \pi(A \mid M_0)| = o_p(1)$.
We suspect that the Bayesian model selection consistency (Oi) may be implicitly stronger than the frequentist version. For example, in the present context, a frequentist style condition of model selection consistency may be $\hat M = M_0$ in probability, which can be achieved by requiring only $1 - \pi(M_0) < 0.5$, instead of asking it to be $o_p(1)$.
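The bound relating the three posteriors (model-averaged, MAP-selected, and oracle) to twice the probability of missing the true model can be illustrated in the same toy style. The numbers below are hypothetical and are chosen so that the MAP model differs from the true model, which is the case where the bound is most interesting; the distances are computed on the parameter grid only.

```python
def tv(p, q):
    """Total variation distance between two distributions on the same finite grid."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def max_pairwise_tv(model_post, theta_given_model, true_idx=0):
    """Largest TV distance among the model-averaged posterior, the posterior
    under the MAP model, and the oracle posterior (theta-marginals only)."""
    n_theta = len(theta_given_model[0])
    averaged = [sum(w * cond[t] for w, cond in zip(model_post, theta_given_model))
                for t in range(n_theta)]
    map_idx = max(range(len(model_post)), key=lambda j: model_post[j])
    selected = theta_given_model[map_idx]
    oracle = theta_given_model[true_idx]
    return max(tv(averaged, selected), tv(averaged, oracle), tv(selected, oracle))

# Hypothetical toy posterior in which the MAP model (index 1) is NOT the true model (index 0).
model_post = [0.35, 0.45, 0.20]
theta_given_model = [[0.6, 0.3, 0.1], [0.1, 0.3, 0.6], [0.3, 0.4, 0.3]]

lhs = max_pairwise_tv(model_post, theta_given_model)
rhs = 2.0 * (1.0 - model_post[0])
print(lhs, rhs)
```

Here the largest distance is between the selected and oracle posteriors, and it stays below the bound even though the wrong model was selected.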
We now comment that (Oi) can indeed be established in a wide variety of contexts, where people have mostly focused on model selection consistency only and stopped short of stating its attractive implications, the oracle properties (Oii) and (Oiii), perhaps subconsciously feeling that these would be much more difficult problems in the theory of limiting distributions. The following section lists some possible applications where (Oi) has already been established.

Possible applications
There has been extensive work on Bayesian model selection consistency (Oi). All these results can be extended to imply Bayesian oracle properties (Oii). When there are already known results on the limiting behavior of the oracle posterior $\pi(\cdot \mid M_0)$ under the true model, then the limiting behavior will apply also to $\pi$ from model averaging and to $\pi(\cdot \mid \hat M)$ from model selection. For example, asymptotic normality can be derived from the Bernstein-von Mises theorems for various kinds of true models. For a finite dimensional true model, see van der Vaart (1998), Section 10.2, and see Shen (2002) for nonparametric and semiparametric situations. For true models being generalized linear models with increasing dimensions, one can apply Ghosal (2000). For quasi-Bayesian posteriors, one can apply Belloni and Chernozhukov (2009).
The following are some examples where posterior consistency for model selection (Oi) has been studied:
(i) (Classical cases) It is known that posterior consistency for model selection (Oi) commonly holds for the usual likelihood-based Bayes posterior, in classical finite dimensional cases (see, e.g., Wasserman 1997, eqn. (42)). Then our result suggests that the oracle properties (Oii) (for model averaging) and (Oiii) (for model selection) also hold. The implication of the oracle property can be the classical square root $n$ asymptotic normal limiting distributions as suggested by the Bernstein-von Mises theorem (see, e.g., van der Vaart (1998), Section 10.2). In these classical cases, model selection consistency (Oi) can be proved from the BIC approximation (Schwarz 1978). The BIC approximation will be extended in the Supplementary Materials to accommodate non-classical cases and prove model selection consistency. See also Item (iv) below regarding the cubic-root asymptotics.
(ii) (GMM) Model selection consistency (Oi) is also proved in Li and Jiang (2014) who consider quasi-posteriors constructed from GMM (generalized method of moments), allowing increasing dimensionality with sample size. This is enough for them to derive the quasi-Bayesian oracle properties which imply asymptotic normality with efficient variances, based on the Bernstein von Mises type results of Belloni and Chernozhukov (2009).
(iii) (GLM) For generalized linear models, in the high dimensional case (with dimension of parameters much higher than the sample size), Liang, Song and Yu (2013) proved the posterior model selection consistency (in their equation (19)). Our results indicate that their results imply the oracle properties on the posterior distribution of, e.g., the mean parameters. Asymptotic normality can be obtained from the oracle properties when the true model is sparse (bounded in n or moderately increasing), if one can apply the appropriate Bernstein von Mises Theorem on the true model, such as those studied in Ghosal (2000) (for generalized linear models with increasing dimensions).
(iv) (Cubic Root Asymptotics; we will discuss some of this in the Supplementary Materials.) Jun, Pinske and Wan (2012) have considered discontinuous quasi-posteriors. With some choice of a scaling parameter used in the quasi-posterior, it is possible to guarantee the model selection consistency (Oi). Then oracle properties (Oii) and (Oiii) will hold for post-model-average or post-model-selection Bayesian inference. An interesting feature here is that the limiting distribution of the oracle posterior does not follow the classical $\sqrt{n}$-asymptotics. For some choices of the scaling parameter used in the quasi-posterior, heuristics following the approach of Jun, Pinske and Wan (2012) suggest a slower than $n^{1/3}$ convergence rate.
(v) (Partial Identification; we will discuss some of this in the Supplementary Materials.) Liao and Jiang (2010) considered a quasi-posterior derived from moment inequalities, where parameters are not point-identified but only identified up to a set $\Omega$ called an identification region. In a model selection setup with a unique true model that intersects with $\Omega$, their Theorem 4.1 implies model selection consistency (Oi). Then oracle results (Oii) and (Oiii) automatically hold. In this setup of partial identification, the limiting distribution of the oracle posterior is nonstandard: it does not shrink as $n$ increases, so the rate is $O(1)$.
Nor is the limiting distribution normal in general: the nonidentifiability over $\Omega$ suggests that the limiting posterior distribution should be roughly the prior distribution truncated inside the identification region $\Omega$. (See comments in Section 3.1, Liao and Jiang 2010.)
(vi) (Quasi-posterior; we will discuss this in more detail in the Supplementary Materials.) This is a general framework of quasi-posteriors of the form $\pi \propto e^{-\lambda R_n}\kappa$, for which we can derive general bounds on the mis-selection probability, where $R_n$ is an empirical risk related to a theoretical risk $R$, $\kappa$ is a prior, and $\lambda > 0$ is a scaling parameter which can increase with $n$. The equations (2) and (3) in the Supplementary Materials (SM) directly imply an assumption-free bound for $\pi(M_0^c)$, which is used in the SM to obtain model selection consistency. The bound involves a risk gap $\gamma$ and two terms $r$ and $u$, and is stated through the "limiting version" of the quasi-posterior $\pi$, where the theoretical risk $R$ is used in place of the empirical risk $R_n$.
This assumption-free bound is only useful when $\gamma > r + 2|u| > 0$. We show in the SM that in fact typically we can make $r + 2|u| = o_p(\gamma)$, so that $\pi(M_0^c)$ can be exponentially small in $\lambda\gamma$ and decrease very quickly with sample size $n$. This can happen with two methods: one includes a complexity penalty in the risk $R$, and another adds no complexity penalty but considers candidate models with separated parameter spaces.

Discussion
We have established a fundamental relation between three different topics: Bayesian model selection, model averaging, and oracle performance. The relatively basic property of model selection consistency is shown to be equivalent to a seemingly more advanced distributional result, the oracle property. The result is very simple and general. Unlike some previous Bayesian oracle properties discussed in special cases such as Ishwaran and Rao (2011) and Li and Jiang (2014), the current work is completely free from any restriction on the type of prior or (quasi-)likelihood function used, or even from any restriction on the limiting distribution of the oracle posterior. It may be the first time that the Bayesian oracle property has been studied at this general level. Given the success of the frequentist analogue studied by Fan and Li (2001), we believe our results will have applications in a wide variety of situations (in addition to the possible examples discussed in this paper). For example, although the applications in our SM focus on finite dimensional cases only, the relationships described in the main text obviously allow increasing dimensions as well.

References:
Belloni, A. and Chernozhukov, V. (2009

July 22, 2015

6.1 Mean oracle property and convergence rate of model selection
In this section we define a different kind of oracle property (the "mean oracle property"), which is typically more demanding than model selection consistency, but can be meaningful in more circumstances than the oracle property (the "posterior oracle property") in the main text.
We will now define the mean oracle property and show that it requires more than model selection consistency: the error rate $\pi(M_0^c)$ needs to be sufficiently small. In some situations the (quasi-)posterior $\pi$ is not useful but its mean $E(\theta) = \int \theta \, d\pi$ is useful, which may have a good limiting distribution for statistical inference on a parameter of interest $\theta_0$, even when the (quasi-)posterior distribution itself does not have a valid interpretation. This can happen for quasi-posteriors whose credible regions do not have asymptotically correct coverage probability (see, e.g., Chernozhukov and Hong 2003). We would like a mean oracle property under which the model average posterior mean $E(\theta)$ is asymptotically as good as the oracle posterior mean $E(\theta \mid M_0)$ for inference on $\theta_0$. In this case, it is not useful enough to establish $|E(\theta) - E(\theta \mid M_0)| = o_p(1)$, since, e.g., the convergence rate of $|E(\theta \mid M_0) - \theta_0|$ is typically $O_p(n^{-1/2})$ for regular cases. Instead, we hope the difference $|E(\theta) - E(\theta \mid M_0)|$ between the model average posterior mean and the oracle posterior mean is smaller to a higher order, e.g., $o_p(n^{-1/2})$ in regular cases; then we have the mean oracle property. We will construct situations when this kind of "fast convergence" for model selection will happen. We will later comment that in many other cases, the rate $o_p(n^{-1/2})$ cannot be guaranteed.
(The BIC approximation on the posterior implies that $\pi(M_0^c)$ can be of order $O_p(n^{-1/2})$ due to candidate models that have only 1 redundant parameter.) A different relation can be useful in this case:
$$E(\theta) - E(\theta \mid M_0) = \sum_{M_j \in M_0^c} \pi(M_j)\,[E(\theta \mid M_j) - E(\theta \mid M_0)]. \quad (1)$$
The mean oracle property holds if there is a finite number of model candidates and each product $\pi(M_j)\,[E(\theta \mid M_j) - E(\theta \mid M_0)]$ is of smaller order than $E(\theta \mid M_0) - \theta_0$. We have to check each model $M_j \in M_0^c$ (i.e., $M_j \ne M_0$). Each product in the sum of (1) can be small enough for 2 different reasons. For models which miss nonzero parameters, $\pi(M_j)$ is typically exponentially small. For models that do not miss nonzero parameters but include redundant parameters, $E(\theta \mid M_j) - \theta_0$ is typically of the same order as $E(\theta \mid M_0) - \theta_0$, and therefore $E(\theta \mid M_j) - E(\theta \mid M_0)$ is also of the same order as $E(\theta \mid M_0) - \theta_0$. Then it is enough to have $\pi(M_j) = o_p(1)$.
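Because the model-averaged mean is the weighted average of the per-model posterior means, the difference between it and the oracle mean splits into one product per wrong model, which is what makes the model-by-model check above possible. The following sketch verifies this split on hypothetical numbers (the weights and conditional means are invented for illustration): one model with a redundant parameter, whose mean is close to the oracle mean, and one model that misses a nonzero parameter, whose mean is far off but whose weight is tiny.

```python
def mean_gap_terms(model_post, cond_means, true_idx=0):
    """Return E(theta) - E(theta | M_0) together with the per-model products
    pi(M_j) * (E(theta | M_j) - E(theta | M_0)) for j != true_idx."""
    averaged_mean = sum(w * m for w, m in zip(model_post, cond_means))
    oracle_mean = cond_means[true_idx]
    products = {j: w * (m - oracle_mean)
                for j, (w, m) in enumerate(zip(model_post, cond_means))
                if j != true_idx}
    return averaged_mean - oracle_mean, products

# Hypothetical numbers (weights sum to one):
#   model 1: redundant parameter, mean near the oracle mean, moderate weight;
#   model 2: misses a nonzero parameter, mean far off, very small weight.
model_post = [0.899, 0.1, 0.001]
cond_means = [1.02, 1.05, 0.40]

gap, products = mean_gap_terms(model_post, cond_means)
print(gap, products)
```

Both products are small here, but for the two different reasons described in the text: a small mean difference in one case and a small model probability in the other.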
Since we have established that both oracle properties (of the mean and of the distribution) depend on the posterior probability of selecting wrong models, $\pi(M_0^c)$, we devote the following sections to bounding this probability.

Risk convergence and model selection convergence
In this section, we show a general way to bound the posterior probability of selecting wrong models, by how often the posterior proposes suboptimal values of a risk function.
We have argued that the oracle properties are determined by the model selection error $\pi(M_0^c)$. We now provide a bound for it through a relation to risk convergence. We will define models $M_j$ as restricting $\theta \in \Theta_j$ for possibly overlapping parameter spaces $\Theta_j$ (possibly dependent on sample size $n$), $j = 0, 1, 2, \ldots$, and we use $M_0$ to denote the true model. We will consider a risk function $R(j, \theta) : \cup_{j \ge 0} (\{j\} \times \Theta_j) \to \mathbb{R}$, which can "risk identify" the true model, roughly in the sense that $\inf R$ over $M_0$ is lower than $\inf R$ over $M_0^c$. Then, if we can prove posterior risk convergence, in the sense that the posterior will favor a small risk $R$, we can hope to show that the posterior will favor the true model $M_0$.
To be more rigorous, let $S = \cup_{j \ge 0} S_j$ and $S_j = \{j\} \times \Theta_j$. We will define a true model $M_0$: $(j, \theta) \in S_0$. Then $M_0^c$ corresponds to the event $(j, \theta) \in \cup_{j > 0} S_j$. Define the gap
$$\gamma = \inf_{\cup_{j > 0} S_j} R - \inf_{S} R.$$
Then we obviously have
$$\pi(M_0^c) \le \pi(R \ge \inf R + \gamma). \quad (2)$$
This connects model selection convergence to risk convergence at the rate of the gap. A zero gap will make the above relation useless, however.
We will consider several risks for R to make a nonzero gap.
1. (separated parameter spaces) The risk $R(j, \theta) = C(\theta)$ is not dependent on $j$. In this case, the gap would be zero and make the above relation useless if the true model is nested in some other models: $\Theta_0 \subset \Theta_j$ for some $j > 0$. Instead, we will consider separated parameter spaces: impose a separation condition on the parameter spaces (in minimal set distance), $d(\Theta_0, \cup_{j > 0} \Theta_j) > \delta\ (> 0)$, under a suitable metric $d$. If we have an identifiability condition on the "true parameter" $\theta_0 \equiv \arg\min_{\Theta_0} C(\theta)$: $d(\theta, \theta_0) > \delta \implies C(\theta) - C(\theta_0) > \gamma\ (> 0)$, then the gap can be taken to be at least $\gamma > 0$. (We allow partial identification, in that the minimizer $\theta_0$ can be a nonsingleton set.)
2. (penalized risk) The risk is dependent on $j$, and includes a penalty against the complexity of $\Theta_j$. Then $\Theta_0 \subset \Theta_j$ for a more complicated $\Theta_j$, $j > 0$, does not automatically lead to a useless zero gap. For example, suppose $R(j, \theta) = C(\theta) + d_j \gamma$, where $d_j$ measures the complexity of $\Theta_j$. Suppose that for all other $j > 0$ (such that $\Theta_j \not\supset \Theta_0$), we have $\inf_{\Theta_j} C(\theta) - \inf_{\Theta_0} C(\theta) \succ \gamma$ (in the sample dependent sense, where $\gamma > 0$ decreases with sample size $n$). Then the gap can be taken as $\gamma > 0$.
3. The risk function $R$ in Case 1 above can be the Hellinger distance $d_H(f_\theta, f_{\theta_0})$ between the parameterized densities, or the regression mean square error $\|\mu_\theta - \mu_{\theta_0}\|^2$ between the parameterized mean functions, or the classification error of a parameterized classification rule. Then the current relation (2) connects model selection consistency to any posterior risk consistency results in a model average context, in Hellinger distance, in regression or classification, such as in Jiang (2007). Then we can prove oracle properties due to the fundamental equivalence that we have established in our paper, between model selection and posterior oracle properties. This scheme can be symbolically represented as: risk convergence ⇒ model selection consistency = oracle property.

Quasi-posterior and risk convergence
In this section, we show a general way to bound the posterior probability of proposing suboptimal values of a risk function, by using its empirical risk to construct the (quasi-)posterior.
We now consider a class of quasi-posteriors $\pi \propto e^{-\lambda R_n} \kappa$, where $\kappa$ is a prior, $R_n$ is an empirical risk, and $\lambda > 0$ is a scaling parameter that can depend on $n$, which is analogous to the inverse temperature in statistical physics. Typically $\lambda = n$, as in the usual Bayesian posterior, where $-\lambda R_n$ is the log likelihood. However, we will allow other rates $\lambda \succ 1$ in our general formulation.
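To make the role of the scaling parameter concrete, the sketch below evaluates a quasi-posterior proportional to exp(-λ R_n) times the prior on a finite grid and reports how much mass sits near the minimizer of the empirical risk as λ grows. Everything specific here is an assumption for illustration only: the squared-error risk, the six data points, the flat prior, the grid, and the helper names `quasi_posterior` and `mass_near_minimizer`.

```python
import math

def quasi_posterior(grid, emp_risk, prior, lam):
    """Normalized weights proportional to exp(-lam * R_n(theta)) * prior(theta)
    on a finite grid of theta values."""
    risks = [emp_risk(t) for t in grid]
    r_min = min(risks)                      # subtract the minimum for numerical stability
    w = [math.exp(-lam * (r - r_min)) * prior(t) for r, t in zip(risks, grid)]
    total = sum(w)
    return [x / total for x in w]

# Hypothetical setup: squared-error empirical risk on fixed data, flat prior.
data = [0.8, 1.1, 0.9, 1.3, 1.0, 0.9]

def emp_risk(theta):
    return sum((x - theta) ** 2 for x in data) / len(data)

def flat_prior(theta):
    return 1.0

grid = [i / 100 for i in range(-100, 301)]  # theta in [-1, 3]
center = sum(data) / len(data)              # minimizer of this particular risk

def mass_near_minimizer(lam, radius=0.25):
    """Quasi-posterior mass within `radius` of the risk minimizer."""
    post = quasi_posterior(grid, emp_risk, flat_prior, lam)
    return sum(p for p, t in zip(post, grid) if abs(t - center) <= radius)

masses = [mass_near_minimizer(lam) for lam in (1, 10, 100)]
print(masses)
```

A larger λ (a lower "temperature") concentrates the quasi-posterior more tightly around low-risk parameter values, which is the mechanism behind the risk convergence bounds discussed next.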
In this case (as we will prove in Section 6.7), the risk convergence rate in $R$ can be bounded in terms of two quantities, $r$ and $u$, and the "limiting version" of the quasi-posterior $\pi$, where the theoretical risk $R$ is used in place of the empirical risk $R_n$.
The term $u$ measures the difference $R_n - R$ on the support of the limiting posterior. We here use the simplest uniform bound, where the supremum is taken over the entire support of the prior. This will typically lead to $u = O_p(d \ln n / \sqrt{n})$ (due to a uniform large deviation theorem), where $d$ is a complexity measure related to the maximal parameter dimension.
The term $r$ can similarly be bounded by $r = O(d \ln \lambda / \lambda)$ (due to a Laplace approximation of $R$). This latter rate can also be derived from an inequality in which one chooses $a = d \ln \lambda / \lambda$. The detailed argument is similar to the remarks after Proposition 1 in Li, Jiang and Tanner (2014).
Therefore, if the gap $\gamma \succ r + 2|u|$ and $\gamma \succ 2.1 \ln n / \lambda$ (where $a_n \prec b_n$ or $b_n \succ a_n$ means $a_n / b_n \to 0$ as $n \to \infty$), then $\pi(M_0^c) \le \pi(R \ge \inf R + \gamma) \prec e^{-\ln n} = 1/n$, and we achieve the fast convergence of model selection, and therefore the oracle properties for both the posterior distribution and the posterior mean hold.
It is noted that when $R$ depends on both the model index $j$ and $\theta$, such as the penalty based $R = C(\theta) + d_j \gamma$ discussed in the last subsection, in order for $|u| \le 2 \sup |R_n - R|$ to be smaller than the gap $\gamma$, we can let $R_n = C_n(\theta) + d_j \gamma$, where $C_n$ is a usual sample version of $C$ with $\sup |C_n - C|$ typically of order $d \ln n / \sqrt{n}$ as mentioned before. This leads to a quasi-posterior $\pi \propto e^{-\lambda C_n} \kappa \, e^{-\lambda \gamma d_j}$.
This corresponds to a strong complexity penalty on the model complexity $d_j$. Such a dimensional penalty is not needed when we use separated parameter spaces (corresponding to a local prior), as discussed in the previous subsection.
Note that both these methods (penalty prior or separated parameter spaces) are very general, and neither requires differentiability of $R_n$ or point-identification of $\theta_0 = \arg\min_{\Theta_0} C$.
A third method will allow more general priors but less general R n which allows a BIC (or Laplace) type approximation.

BIC approximation and model selection convergence
In this section we express the BIC approximation with a general scaling parameter, which can be used later to treat nonstandard asymptotics.
In the BIC approach, the complexity penalty will only come indirectly from approximating an integral. We do not need to directly penalize the risk function. Therefore the risk function $R(j, \theta) = C(\theta)$ does not depend on the model index $j$, and the empirical risk $R_n$ is $C_n(\theta)$.
Assume that the integral in the posterior model probability satisfies a BIC type approximation, where $\theta_j = \arg\min_{\theta \in \Theta_j} C(\theta)$ is assumed to be unique for convenience, and $C$ is the pointwise limit of $C_n$ in probability.
Then, denoting $d_j = \dim(\Theta_j)$, the ratio $\pi(M_j)/\pi(M_0)$ is controlled by a maximum of two terms. Any 'wrong' model, where $\inf_{\{j\} \times \Theta_j} C - \inf C = c_j > 0$ (and $C_n(\theta_j) - C_n(\theta_0) = c_j + o_p(1)$), will have an exponential rate $\pi(M_j)/\pi(M_0) = O_p(e^{-0.9 c_j \lambda})$. Any 'true' model, where $\inf_{\{j\} \times \Theta_j} C - \inf C = 0$, will satisfy $\theta_j = \theta_0$ and $C_n(\theta_j) - C_n(\theta_0) = 0$, and the second term in the maximum dominates. Therefore, any overly complex 'true' model with $d_j > d_0$ will have a polynomial rate. Suppose that there is no simpler 'true' model that has $d_j \le d_0$ other than $M_0$, and that there is a fixed number of candidate models. Then $\pi(M_0^c) = o_p(1)$, and the posterior oracle property holds. In addition, the posterior mean can also satisfy the oracle property due to the comments following (1). This is because the probability of selecting overly complex true models is $o_p(1)$, and the probability of selecting 'wrong' models is exponentially small in $\lambda$. Assume that the scaling parameter $\lambda$ is polynomial in $n$, and that all the 'true' models with $\inf_{\{j\} \times \Theta_j} C - \inf C = 0$ have a common polynomial convergence rate $\epsilon_n$ for $E(\theta \mid M_j) - \theta_0$. Then $|E(\theta) - \theta_0| = o_p(\epsilon_n)$ due to (1).

Cubic root asymptotics
In this section, we argue that the BIC approximation can be applied to a nonstandard case considered by Jun, Pinske and Wan (2012). Our mean oracle property implies that one of the nonstandard convergence rates from Jun, Pinske and Wan (2012), who did not consider model selection but assumed the true model to be known, remains true even after model selection, i.e., even if the true model is unknown.
The BIC condition for (4) is usually derived from a quadratic approximation of the empirical risk $C_n$. It is noted that the BIC condition may still hold even when $C_n$ is not differentiable. For example, for predictors $X_i$ and binary responses $Z_i$, when $C_n$ has a discontinuous form $C_n = n^{-1} \sum_{i=1}^n Z_i I[X_i^\top \theta \ge 0]$, its expectation $C = E C_n$ may still have a quadratic approximation. Following the cubic-root asymptotics as described in Jun, Pinske and Wan (2012) (their Theorem 1), we find that the BIC condition can still hold for an unusual scale parameter $\lambda \prec n^{2/3}$. The heuristics of the condition $\lambda \prec n^{2/3}$ are as derived in Jun, Pinske and Wan (2012). For model $M_0$, where $\theta_0 = \arg\min_{\Theta_0} C(\theta)$ is assumed to be unique (and similarly for all other models):
$$C_n(\theta) - C_n(\theta_0) = [C(\theta) - C(\theta_0)] + [(C_n - C)(\theta) - (C_n - C)(\theta_0)].$$
The second square bracket is stochastic and $O_p(n^{-1/2} \|\theta - \theta_0\|^\beta)$ where $\beta = 0.5$, instead of 1, due to the indicator functions in $C_n$. The first square bracket is nonstochastic and can be approximated by a quadratic form $0.5 (\theta - \theta_0)^\top V (\theta - \theta_0)$ for some positive definite matrix $V$. If the stochastic term is dominated by the quadratic nonstochastic term, then we can ignore the stochastic term and have a BIC type approximation. This will happen when $\lambda n^{-1/2} \|\theta - \theta_0\|^\beta \prec \lambda \|\theta - \theta_0\|^2 = t^2 \sim 1$, where $t$ is called a rescaled parameter. This leads to $\lambda n^{-1/2} \lambda^{-\beta/2} \prec 1$, or $\lambda \prec n^{1/(2-\beta)} = n^{2/3}$. As commented in the last section, the fact that the BIC condition can still hold for $\lambda \prec n^{2/3}$ implies both the posterior oracle property and the mean oracle property. The posterior convergence rate is, due to the quadratic approximation above, $\lambda^{-1/2}$, which is slower than $n^{-1/3}$ due to the condition on $\lambda$. In this case, the posterior oracle property is often not useful because the posterior is a quasi-posterior that does not guarantee a valid interpretation. However, the posterior mean is asymptotically normal with no bias if $\lambda \succ n^{2/5}$ (see the comments after Theorem 1 of Jun, Pinske and Wan 2012).
Therefore, the mean oracle property is still useful. Applying Case (iii) of Theorem 1 in Jun, Pinske and Wan (2012), we have that $\int \theta \, d\pi$ is asymptotically normal for the true parameter, at a convergence rate $n^{-1/2} \lambda^{1/4}$, which is faster than $n^{-1/3}$ and slower than $n^{-2/5}$ for our choice of $\lambda$. Our mean oracle property basically says that this rate from Jun, Pinske and Wan (2012), who did not consider model selection but assumed the true model to be known, remains true even after model selection, i.e., even if the true model is unknown.
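The exponent arithmetic in this section is easy to check with exact fractions. The sketch below writes λ = n^a and tracks only the exponents of n; the function names are introduced here purely for illustration. It recovers the upper limit 2/3 from β = 1/2 and evaluates the posterior-mean rate n^{-1/2} λ^{1/4} at the two ends of the admissible range for a, namely 2/5 and 2/3.

```python
from fractions import Fraction

def max_lambda_exponent(beta):
    """Exponent a such that the BIC-type approximation needs lambda << n**a,
    obtained from lambda * n**(-1/2) * lambda**(-beta/2) << 1."""
    return 1 / (2 - Fraction(beta))

def mean_rate_exponent(a):
    """Exponent of n in the posterior-mean rate n**(-1/2) * lambda**(1/4)
    when lambda = n**a."""
    return Fraction(-1, 2) + Fraction(a) / 4

upper = max_lambda_exponent(Fraction(1, 2))   # indicator-type risk, beta = 1/2
lower = Fraction(2, 5)                        # lambda >> n**(2/5) for no asymptotic bias
print(upper, mean_rate_exponent(lower), mean_rate_exponent(upper))
```

With a strictly between 2/5 and 2/3, the rate exponent lies strictly between -2/5 and -1/3, which matches the statement that the posterior mean converges faster than n^{-1/3} but slower than n^{-2/5}. Setting β = 1 in the same formula gives the exponent 1, i.e., the usual regime in which λ may be as large as order n.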

Partial identification
In this section, we study the partial identification caused by moment inequalities. We first study the oracle posterior, where the model is given, and show that the limiting posterior is nonstandard due to partial identification. Secondly, we introduce an example to show that defining a meaningful true model can be very subtle due to partial identification; simpler models which fit the data as well are not always better than more complex models. We finally show that to be conservative, one can define a mixture true model by allowing all models that are compatible with the data, and that this mixture true model is meaningful and can be consistently selected.
6.6.1 Nonstandard limiting distribution due to partial identification.
First we assume that the model is given and explain how partial identification leads to a nonstandard limiting posterior. Suppose we are interested in making inference about some parameter $\theta$ that satisfies a vector of moment inequalities $E m(\theta) \ge 0$, where $m(\theta)$ is a vector of random moment functions that depend on the data. We can define $\pi \propto e^{-\lambda R_n} \kappa$, where $\lambda$ is the scaling parameter that could depend on $n$, $\kappa$ is a prior for $\theta$, and $R_n = d^2(\bar m(\theta), R_+)$ for some (minimal set) distance $d$; here $R_+$ represents the set of ψ-vectors with all nonnegative components, and $\bar m$ denotes the sample average of $m(\theta)$. When $d$ is the (possibly weighted) Euclidean distance and $\lambda = 0.5 n$, this becomes the choice of empirical risk in Chernozhukov, Hong and Tamer (2007), and also corresponds to a Laplace approximation of the posterior used in Liao and Jiang (2010). When the identification region $\Omega = [\theta : d^2(E m(\theta), R_+) = 0]$ has positive prior probability, typically $\pi$ converges, in total variation distance, to the limiting distribution $\kappa(\theta \mid \Omega) \propto \kappa(\theta) I(\theta \in \Omega)$, i.e., the prior truncated in the identification region $\Omega$. This limiting distribution has contraction rate 1, instead of the usual rate $n^{-1/2}$, due to partial identification, and is generally nonnormal.
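The claim that the quasi-posterior approaches the prior truncated to the identification region, rather than shrinking to a point, can be illustrated on a one-dimensional interval-identified parameter. This is a minimal sketch under stated assumptions: the two moment inequalities (θ minus a lower sample moment, and an upper sample moment minus θ), the squared Euclidean distance to the nonnegative orthant as the risk, the flat prior, the grid, and the numerical values of the sample moments are all hypothetical choices made for the illustration.

```python
import math

def interval_quasi_posterior(l_bar, u_bar, lam, grid):
    """Quasi-posterior on a grid for the moment inequalities
    theta - l_bar >= 0 and u_bar - theta >= 0, with a flat prior.
    The risk is the squared Euclidean distance from the moment vector
    to the nonnegative orthant (an assumed form of the empirical risk)."""
    def risk(theta):
        return min(theta - l_bar, 0.0) ** 2 + min(u_bar - theta, 0.0) ** 2
    w = [math.exp(-lam * risk(t)) for t in grid]
    total = sum(w)
    return [x / total for x in w]

def mass_in_region(l_bar, u_bar, lam, grid):
    """Quasi-posterior mass of the estimated identification region [l_bar, u_bar]."""
    post = interval_quasi_posterior(l_bar, u_bar, lam, grid)
    return sum(p for p, t in zip(post, grid) if l_bar <= t <= u_bar)

grid = [-2.0 + 0.001 * i for i in range(4001)]      # theta in [-2, 2]
# Hypothetical sample moments defining the interval [-0.2, 0.6].
masses = [mass_in_region(-0.2, 0.6, lam, grid) for lam in (10.0, 1e3, 1e5)]
print(masses)
```

As λ grows, the mass outside the interval vanishes while the quasi-posterior stays flat inside it, so the limit is the flat prior restricted to the interval and its width does not shrink.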
6.6.2 Subtlety of consistent model selection due to partial identification.
In this subsection, we claim that due to the uncertainty about the true parameter associated with partial identification, it is NOT always good to favor simpler models that are compatible with the data. We need to be conservative to allow possibly nonzero parameters in the model.
Now we proceed to model selection and describe the subtlety of defining a meaningful true model due to partial identification. In a similar situation of moment inequalities, Liao and Jiang (2010) have considered model selection consistency using complexity penalization. However, as pointed out in Liao (2010), their consistent procedure for joint model and moment condition selection may miss the true parameter. Liao (2010) considered a counter-example (Example 3.4.1) with selection of moment conditions.
A model which proposes $\theta \in \Theta_j$ is compatible if $\Theta_j \cap \Omega$ is nonempty. This implies $\inf_{\theta \in \Theta_j} R(\theta) = \inf_{\theta} R(\theta) = 0$, where $R(\theta) = d^2(Em(\theta), \Psi)$. Incompatible models have $\inf_{\theta \in \Theta_j} R(\theta) > 0$, and can be shown to have negligible posterior probabilities by the earlier sections that discuss risk convergence.

Consider three candidate models
The second component $\theta_2$ therefore has an intrinsic uncertainty for model selection. It is possible that the true parameter is $\theta^* = (0, 0)$, but it is also possible that the true parameter is $\theta^* = (0, 0.5)$, since both $\theta_2 = 0$ and $\theta_2 = 0.5$ fall in $[-0.2, 0.6] = [EL_2Y_2, EU_2Y_2]$. The data cannot distinguish between the two possibilities.
A complexity penalized model selection would select the simplest model that is compatible with the moment constraints, where a compatible model is such that $\inf_{\theta \in \Theta_j} R(\theta) = \inf_{\theta} R(\theta) = 0$, or equivalently, $\Theta_j \cap \Omega$ is nonempty. In our case, models $\Theta_2$ and $\Theta_3$ are compatible but $\Theta_1$ is not, and the simplest compatible model is $\Theta_3$. Yet, this is a wrong model when $\theta^* = (0, 0.5)$, since the nonzero component $\theta^*_2$ is missed by $\Theta_3$!

A possible solution to the subtlety of consistent model selection
We note that in this partial identification case, we cannot simply choose the simplest compatible model. A consistent model selection method that chooses the simplest compatible model may be wrong. Liao (2010) proposes to give up forcing model selection consistency and to use a meaningful prior that allows all compatible models to be represented with nonvanishing posterior probabilities. We view this problem from a different perspective. We still treat it as a problem of consistent model selection, but to be conservative, we redefine the true model to be a mixture distribution over all models that are compatible with the data. Then the model selection procedure based on a quasi-posterior will still consistently select this mixture true model. We will also derive the limiting posterior given this mixture true model, and argue that it makes sense intuitively due to partial identification.
A compatible model $j$ is such that $\Theta_j \cap \Omega \neq \emptyset$ and $\inf_{\theta \in \Theta_j} R(\theta) = 0$. We will say that $j \in J_0$. A model is incompatible if $\Theta_j \cap \Omega = \emptyset$ or $\inf_{\theta \in \Theta_j} R(\theta) > 0$. We will say that $j \in J_1$. Let $g = \inf_{j \in J_1} \inf_{\theta \in \Theta_j} R(\theta)$ and assume it to be positive (which will be true if there is a finite number of candidate models). To be conservative, we would like to allow all compatible models to be kept in the large sample limit. We can therefore attempt to group all $j \in J_0$ together to form the true model $M_0$ and relabel it as $k = 0$. Then one can rewrite $\pi(k, \theta) \propto e^{-\lambda R_n(\theta)} \kappa_k \kappa(\theta \mid k)$, where $\kappa_0 = \nu(j \in J_0)$, the conditional prior $\kappa(\theta \mid 0) = \nu(\theta \mid j \in J_0) = \sum_{j \in J_0} \nu_j \nu(\theta \mid j) / \sum_{j \in J_0} \nu_j$ can be recognized as a mixture prior over all compatible models, and $\{\kappa_k \kappa(\theta \mid k) : k > 0\}$ take the same values as $\{\nu(j, \theta) : j \in J_1\}$.
Then the current formalization is converted to the same form as in Section 6.3. The gap parameter $\gamma$ in (2) and (3) can be taken as $g$ here to obtain: $\pi(M_0^c) \le e^{-0.5\lambda(g - r - 2|u|)}$.
This is typically exponentially small in λ and n, which will imply oracle properties for both the posterior distribution and the posterior mean.
Due to the discussions for (5), we know that the limiting oracle prior (which is also the limit of the posterior) is $\kappa(\theta \mid \Omega, k = 0) \propto \kappa(\theta \mid 0) I(\theta \in \Omega) \propto \frac{\sum_{j \in J_0} \nu_j \nu(\theta \mid j)}{\sum_{j \in J_0} \nu_j} I(\theta \in \Omega)$. This is a mixture of priors over all compatible models, truncated to the identification region $\Omega$. Even though this limit is very different from the classical root-$n$ normal limit, it is meaningful and makes intuitive sense. The problem is partially identified: the data identify only the region $\Omega$, and within the region, information only comes from the priors of all models that are compatible with the data.
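The grouping of compatible models into a single mixture true model, followed by truncation to Ω, can be sketched for a finite parameter grid as follows. This is only an illustration under assumptions that are not in the paper: the three models, their labels, their prior weights ν_j, and their discrete priors ν(θ|j) on the second component θ_2 are all hypothetical, and only the interval [−0.2, 0.6] is borrowed from the example above.

```python
def group_compatible_models(model_priors, model_weights, in_region):
    """Return (J_0, kappa_0, limiting truncated mixture prior) for discrete priors."""
    # J_0: models whose prior support intersects the identification region Omega.
    compatible = sorted(
        j for j, prior in model_priors.items()
        if any(p > 0 and in_region(theta) for theta, p in prior.items())
    )
    # kappa_0 = nu(j in J_0), the total prior weight of the compatible models.
    kappa0 = sum(model_weights[j] for j in compatible)
    # kappa(theta | 0) = sum_{j in J_0} nu_j nu(theta | j) / sum_{j in J_0} nu_j.
    mixture = {}
    for j in compatible:
        for theta, p in model_priors[j].items():
            mixture[theta] = mixture.get(theta, 0.0) + model_weights[j] * p / kappa0
    # Truncate to Omega and renormalize: kappa(theta | Omega, k = 0).
    kept = {theta: p for theta, p in mixture.items() if in_region(theta)}
    norm = sum(kept.values())
    limit = {theta: p / norm for theta, p in kept.items()}
    return compatible, kappa0, limit

# Hypothetical candidate models for theta_2 (labels and numbers are assumptions).
model_priors = {
    "fixed_at_1": {1.0: 1.0},                                  # incompatible
    "free": {-0.5: 0.25, 0.0: 0.25, 0.5: 0.25, 1.0: 0.25},     # compatible
    "fixed_at_0": {0.0: 1.0},                                  # compatible, simplest
}
model_weights = {"fixed_at_1": 0.2, "free": 0.3, "fixed_at_0": 0.5}

def in_omega(theta):
    return -0.2 <= theta <= 0.6

compatible, kappa0, limit = group_compatible_models(model_priors, model_weights, in_omega)
print(compatible, kappa0, limit)
```

In this sketch the value θ_2 = 0.5 retains positive mass in the limit, so a true parameter with a nonzero second component is not ruled out, in contrast to always selecting the simplest compatible model.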

Proof of Propositions
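Proof of Proposition 1: For any events $A$ and $B$, $\pi(A) = \pi(A \mid B)\pi(B) + \pi(A \mid B^c)\pi(B^c)$ and $\pi(A \mid B) = \pi(A \mid B)\pi(B) + \pi(A \mid B)\pi(B^c)$. Subtracting the second identity from the first gives $\pi(A) - \pi(A \mid B) = [\pi(A \mid B^c) - \pi(A \mid B)]\pi(B^c)$, so that $|\pi(A) - \pi(A \mid B)| \le \pi(B^c)$. Taking $B = M_0$ and the supremum over $A$ then gives $\sup_A |\pi(A) - \pi(A \mid M_0)| \le \pi(M_0^c)$. Q.E.D.
Proof of Proposition 2: In the proof of Proposition 1 above, we could replace $M_0$ by $\hat M$ and obtain $\sup_A |\pi(A) - \pi(A \mid \hat M)| \le \pi(\hat M^c)$. The right hand side is at most $\pi(M_0^c)$ since $\pi(\hat M) \ge \pi(M_0)$. Combining this with the result of Proposition 1 using the triangle inequality leads to the proof. Q.E.D.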
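The bound $\sup_A |\pi(A) - \pi(A \mid B)| \le \pi(B^c)$ used in these proofs can be checked by brute force on a small finite space. The sketch below is an illustration only: the four (model, parameter) points and their probabilities are made-up numbers, and `sup_gap` is a hypothetical helper that enumerates every event A.

```python
from itertools import combinations

def prob(p, event):
    """Probability of an event, given as a collection of sample points."""
    return sum(p[x] for x in event)

def sup_gap(p, b):
    """Brute-force sup over all events A of |p(A) - p(A | B)| on a finite space."""
    points = list(p)
    p_b = prob(p, b)
    best = 0.0
    for size in range(len(points) + 1):
        for a in combinations(points, size):
            conditional = prob(p, set(a) & b) / p_b
            best = max(best, abs(prob(p, a) - conditional))
    return best

# Toy joint posterior over (model, parameter value) pairs (assumed numbers).
posterior = {
    ("M0", "a"): 0.60,
    ("M0", "b"): 0.30,
    ("M1", "a"): 0.07,
    ("M1", "b"): 0.03,
}
true_model_event = {point for point in posterior if point[0] == "M0"}
gap = sup_gap(posterior, true_model_event)
complement_mass = 1.0 - prob(posterior, true_model_event)
print(gap, complement_mass)
```

Here the supremum is attained (for example at A equal to the complement of B), so the largest discrepancy between the posterior and the oracle posterior equals the posterior probability of model mis-selection in this toy case.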
Then use Jensen's inequality for the denominator and the Cauchy-Schwarz inequality for the numerator to get an upper bound on $\phi(A)$. Applying this upper bound of $\phi(A)$ to (7) leads to the proof. Q.E.D.
15 Therefore, for any probability measure $p$ and event $B$, we have that the total variation distance $|p - p(\cdot \mid B)|_{TV} = p(B^c)$.
16 We note that the very general relations (6) and (7) appearing in the proof below are assumption free and simple, and may be of interest themselves in bounding the quasi-posterior by its limiting version.