Bayesian Restricted Likelihood Methods: Conditioning on Insufficient Statistics in Bayesian Regression

Bayesian methods have proven themselves to be successful across a wide range of scientific problems and have many well-documented advantages over competing methods. However, these methods run into difficulties for two major and prevalent classes of problems: handling data sets with outliers and dealing with model misspecification. We outline the drawbacks of previous solutions to both of these problems and propose a new method as an alternative. When working with the new method, the data is summarized through a set of insufficient statistics, targeting inferential quantities of interest, and the prior distribution is updated with the summary statistics rather than the complete data. By careful choice of conditioning statistics, we retain the main benefits of Bayesian methods while reducing the sensitivity of the analysis to features of the data not captured by the conditioning statistics. For reducing sensitivity to outliers, classical robust estimators (e.g., M-estimators) are natural choices for conditioning statistics. A major contribution of this work is the development of a data augmented Markov chain Monte Carlo (MCMC) algorithm for the linear model and a large class of summary statistics. We demonstrate the method on simulated and real data sets containing outliers and subject to model misspecification. Success is manifested in better predictive performance for data points of interest as compared to competing methods.


Introduction
Bayesian methods have provided successful solutions to a wide range of scientific problems, with their value having been demonstrated both empirically and theoretically. Bayesian inference relies on a model consisting of three elements: the prior distribution, the loss function, and the likelihood or sampling density. While formal optimality of Bayesian methods is unquestioned if one accepts the validity of all three of these elements, a healthy skepticism encourages us to question each of them.
Concern about the prior distribution has been addressed through the development of techniques for subjective elicitation (Garthwaite et al., 2005;O'Hagan et al., 2006) and objective Bayesian methods (Berger, 2006). Concern about the loss function is reflected in, for example, the extensive literature on Bayesian hypothesis tests (Kass and Raftery, 1995).
The focus of this work is the development of techniques to handle imperfections in the likelihood f(y|θ) = L(θ|y). Concern for imperfections in the likelihood is reflected in work considering minimally informative likelihoods (e.g., Yuan and Clarke, 2004) and in approaches that summarize the data through statistics that target inferential quantities of interest. We call the resulting likelihood a restricted likelihood because conditioning is done on a restricted set of data: the set which satisfies the observed summary statistics. This restricted likelihood leads to a formal update of the prior distribution based on the sampling density of the summary statistics. The remainder of the paper is as follows: Section 2 introduces the Bayesian restricted likelihood and provides context with previous work, Section 3 demonstrates some advantages of the method on simple examples, and Section 4 details an MCMC algorithm to apply the method to Bayesian linear models. This computational strategy is a major contribution of the work, providing an approach to apply the method to realistic examples. Many of the technical proofs are in the Appendix (Section 8), with R code available from the authors. Sections 5 and 6 illustrate the method with simulated data and a real insurance industry data set containing many outliers, with a novel twist on model evaluation. A discussion (Section 7) provides some final commentary on the new method.

Examples
To describe the use of the restricted likelihood, we begin with a pair of simple examples for the one-sample problem. For both, the model takes the data y = (y_1, ..., y_n) to be a random sample of size n from a continuous distribution indexed by a parameter vector θ, with pdf f(y|θ). The standard, or full, likelihood is L(θ|y) = ∏_{i=1}^{n} f(y_i|θ). The first example considers the case where a known subset of the data is bad in the sense of not informing us about θ. This case mimics the setting where outliers are identified and discarded before a formal analysis. Without loss of generality, we label the good cases 1 through n−k and the bad cases n−k+1 through n. The relevant likelihood to be used to move from prior distribution to posterior distribution is clearly L(θ|y_1, ..., y_{n−k}) = ∏_{i=1}^{n−k} f(y_i|θ). For an equivalent analysis, we rewrite the full likelihood as the product of two pieces:

L(θ|y) = [∏_{i=1}^{n−k} f(y_i|θ)] f(y_{n−k+1}, ..., y_n | θ, y_1, ..., y_{n−k}),   (1)

where the second factor may not actually depend on θ. We wish to keep the first factor and drop the second for better inference on θ.
The second example involves deliberate censoring of small and large observations. This is sometimes done as a precursor to the analysis of reaction time experiments (e.g., Ratcliff, 1993), where very small and very large reaction times are physiologically implausible, explained by either anticipation or lack of attention on the part of the subject. With lower and upper censoring times at t_1 and t_2, the post-censoring sampling distribution is of mixed form, with masses F(t_1|θ) at t_1 and 1 − F(t_2|θ) at t_2, and density f(y|θ) for y ∈ (t_1, t_2). We adjust the original data y_i, producing c(y_i), by defining c(y_i) = t_1 if y_i ≤ t_1, c(y_i) = t_2 if y_i ≥ t_2, and c(y_i) = y_i otherwise. The adjusted update is performed with L(θ|c(y)). Letting g(t_1|θ) = F(t_1|θ), g(t_2|θ) = 1 − F(t_2|θ), and g(y|θ) = f(y|θ) for y ∈ (t_1, t_2), we may rewrite the full likelihood as the product of two pieces:

L(θ|y) = [∏_{i=1}^{n} g(c(y_i)|θ)] [∏_{i=1}^{n} f(y_i|θ, c(y_i))].   (2)

Only the first part is retained in the analysis. Several more examples are detailed in Lewis (2014).
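To make the censoring construction concrete, the sketch below applies the map c(·) and evaluates the log-likelihood of the censored data under a normal model. This is an illustrative Python sketch with function names of our own choosing, not the paper's code.

```python
import math

def censor(y, t1, t2):
    """Map each observation to c(y_i): clamp to the censoring times t1 < t2."""
    return [min(max(yi, t1), t2) for yi in y]

def norm_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def norm_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def censored_loglik(cy, mu, sigma, t1, t2):
    """Log-likelihood of censored data under the mixed distribution g(.|theta):
    mass F(t1) at t1, mass 1 - F(t2) at t2, density f on (t1, t2)."""
    ll = 0.0
    for v in cy:
        if v <= t1:
            ll += math.log(norm_cdf(t1, mu, sigma))
        elif v >= t2:
            ll += math.log(1.0 - norm_cdf(t2, mu, sigma))
        else:
            ll += math.log(norm_pdf(v, mu, sigma))
    return ll

y = [-3.0, 0.2, 0.5, 1.1, 4.0]
cy = censor(y, t1=-1.0, t2=2.0)   # -> [-1.0, 0.2, 0.5, 1.1, 2.0]
```

The first factor of (2) is exactly exp(censored_loglik(...)) here; the dropped second factor describes the exact values of the censored observations.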

Generalization
To generalize the approach in (1) and (2), we write the full likelihood in two pieces with a conditioning statistic T (y), as indicated below: L(θ|y) = f (T (y)|θ) f (y|θ, T (y)).
Here, f (T (y)|θ) is the conditional pdf of T (y) given θ and f (y|θ, T (y)) is the conditional pdf of y given θ and T (y). In the dropped case example, the conditioning statistic is T (y) = (y 1 , . . . , y n−k ). In the censoring example, the conditioning statistic is T (y) = (c(y 1 ), . . . , c(y n )). We refer to f (T (y)|θ) as the restricted likelihood and L(θ|y) = f (y|θ) as the full likelihood.
Bayesian methods can make use of a restricted likelihood since T(y) is a well-defined random variable with a probability distribution indexed by θ. This leads to the restricted likelihood posterior

π(θ|T(y)) = π(θ) f(T(y)|θ) / m(T(y)),

where m(T(y)) is the marginal distribution of T(y) under the prior distribution. Predictive statements for further (good) data rely on the model. For another observation, say y_{n+1}, we would have the predictive density f(y_{n+1}|T(y)) = ∫ f(y_{n+1}|θ) π(θ|T(y)) dθ.
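As a minimal illustration of this update, the following Python sketch approximates π(θ|T(y)) on a grid for a normal location problem with T(y) the sample median, replacing f(T(y)|θ) by a kernel density estimate built from simulated medians. All choices (flat prior on the grid, bandwidth, simulation sizes) are ours for illustration and are not the paper's settings.

```python
import random, math, statistics

random.seed(1)

def restricted_posterior(t_obs, theta_grid, n, sims=400, h=0.05):
    """Grid approximation to pi(theta | T(y) = t_obs) for a N(theta, 1) sample
    with T(y) = median(y). f(T|theta) is estimated by a Gaussian kernel density
    over simulated medians; the prior is flat over theta_grid."""
    post = []
    for theta in theta_grid:
        meds = [statistics.median(random.gauss(theta, 1.0) for _ in range(n))
                for _ in range(sims)]
        # kernel density estimate of the sampling density of the median at t_obs
        dens = sum(math.exp(-0.5 * ((t_obs - m) / h) ** 2) for m in meds)
        dens /= sims * h * math.sqrt(2 * math.pi)
        post.append(dens)
    z = sum(post)
    return [p / z for p in post]

grid = [i / 10 for i in range(-20, 21)]   # theta in [-2, 2]
w = restricted_posterior(t_obs=0.0, theta_grid=grid, n=25)
```

Such brute-force density estimation is exactly the low-dimensional strategy that the MCMC algorithm of Section 4 is designed to replace.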

Literature review
Our motivation for the use of summary statistics in Bayesian inference is concern about outliers or, more generally, model misspecification. Specifically, the likelihood is not specified correctly, and concentrating on using well-chosen parts of the data can help improve the analysis (e.g., Wong and Clarke, 2004). Direct use of restricted likelihood for this reason appears in many areas of the literature. For example, the use of rank likelihoods is discussed by Savage (1969), Pettitt (1982, 1983), and more recently by Hoff et al. (2013). Lewis et al. (2012) make use of order statistics and robust estimators as choices for T(y) in the location-scale setting. Asymptotic properties of restricted posteriors are studied by Doksum and Lo (1990), Clarke and Ghosh (1995), Yuan and Clarke (2004), and Hwang et al. (2005). The tenor of these asymptotic results is that, for a variety of conditioning statistics with nontrivial regularity conditions on prior, model, and likelihood, the posterior distribution resembles the asymptotic sampling distribution of the conditioning statistic.
Restricted likelihoods have also been used as practical approximations to a full likelihood. For example, Pratt (1965) appeals to heuristic arguments regarding approximate sufficiency to justify the use of the restricted likelihood of the sample mean and standard deviation. Approximate sufficiency is also appealed to in the use of Approximate Bayesian Computation (ABC), which is related to our method. ABC is a collection of posterior approximation methods which has recently experienced success in applications to epidemiology, genetics, and quality control (see, for example, Tavaré et al., 1997; Pritchard et al., 1999; Marjoram et al., 2003; Fearnhead and Prangle, 2012). Interest typically lies in the full data posterior and ABC is used for computational convenience as an approximation. Consequently, effort is made to choose an approximately sufficient T(y) and update to the ABC posterior by using the likelihood L(θ|B(y)), where B(y) = {y* | ρ(T(y), T(y*)) ≤ ε}, ρ is a metric, and ε is a tolerance level. This is the likelihood conditioned on the collection of data sets that result in a T(·) within ε of the observed T(y). With an approximately sufficient T(·) and a small enough ε, heuristically L(θ|B(y)) ≈ L(θ|T(y)) ≈ L(θ|y). Consequently, the ABC posterior approximates the full data posterior, and efforts have been made to formalize what is meant by approximate sufficiency (e.g., Joyce and Marjoram, 2008). ABC is related to our method in that the conditioning is on something other than the data y. However, we specifically seek to condition on an insufficient statistic to guard against misspecification in parts of the likelihood. Additionally, we develop methods where the conditioning is exact (i.e., ε = 0).
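For contrast with the exact conditioning developed here, a minimal ABC rejection sampler makes the role of the tolerance ε explicit. This is a generic textbook-style sketch (our own toy problem and function names), not the paper's algorithm.

```python
import random

random.seed(7)

def abc_rejection(y_obs, prior_draw, simulate, summary, rho, eps, n_keep=200):
    """Minimal ABC rejection sampler: keep theta whenever the simulated
    summary lands within eps of the observed summary."""
    t_obs = summary(y_obs)
    kept = []
    while len(kept) < n_keep:
        theta = prior_draw()
        y_sim = simulate(theta, len(y_obs))
        if rho(summary(y_sim), t_obs) <= eps:
            kept.append(theta)
    return kept

# Toy normal-location problem: uniform prior, T(y) = sample mean.
y_obs = [random.gauss(1.0, 1.0) for _ in range(50)]
draws = abc_rejection(
    y_obs,
    prior_draw=lambda: random.uniform(-5, 5),
    simulate=lambda th, n: [random.gauss(th, 1.0) for _ in range(n)],
    summary=lambda y: sum(y) / len(y),
    rho=lambda a, b: abs(a - b),
    eps=0.1,
)
```

Shrinking eps toward 0 moves the accepted draws toward the exact restricted posterior π(θ|T(y)), at rapidly increasing computational cost; the data augmentation algorithm of Section 4 avoids this trade-off for the linear model.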
This work extends the development of Bayesian restricted likelihood by arguing that deliberate choice of an insufficient statistic T (y) guided by targeted inference is sound practice. We also expand the class of conditioning statistics for which a formal Bayesian update can be achieved. Our methods do not rely on asymptotic properties, nor do they rely on approximate conditioning.

Illustrative Examples
Before discussing computational details, the method is applied to two simple examples on well known data sets to demonstrate its effectiveness in situations where outliers are a major concern. The full model in each case fits into the Bayesian linear regression framework discussed in Section 4.
The first example is an analysis of Simon Newcomb's 66 measurements of the passage time of light (Stigler, 1977), two of which are significant outliers in the lower tail. The full model is a standard location-scale Bayesian model also used in Lee and MacEachern (2014). The restricted likelihood analyses condition on classical robust estimates, including M-estimators (Huber and Ronchetti, 2009); for comparability, roughly 5% of the residuals are trimmed for LTS. Two additional approaches to outlier handling are considered: 1) the normal distribution is replaced with a t-distribution and, 2) the normal distribution is replaced with a mixture of two normals. The t-model assumes y_i iid ∼ t_ν(β, σ²) with ν = 5. The prior on σ² is IG(5, 10(ν−2)/ν), which ensures that the prior on the variance is the same as in the other models. The mixture model places the prior p ∼ beta(20, 1) on the probability of belonging to the 'good' component. There is modest variation among the posterior centers. Posteriors in this second group have less dispersion than those in the first group.
The pattern for the predictive distributions differs (see the bottom plot in Figure 1): the predictive distributions from the restricted likelihoods based on M-estimators are well centered and concentrated. The restricted likelihood based on LTS and the mixture model results are also centered appropriately, but comparatively less concentrated. The LMS predictive is concentrated, but it is poorly centered. Overall, we find that the restricted likelihood methods based on M-estimators provide the most attractive analysis for these data. They provide sharp and appropriate inference for the parameters (β) and for prediction.
As a second example, a data set measuring the number of telephone calls in Belgium from 1950-1973 is analyzed. The outliers in this case are due to a change in the measurement units with which calls were recorded for part of the data set. Specifically, for the years 1964-1969 and parts of 1963 and 1970, the total length of calls in minutes was recorded rather than the number of calls (Rousseeuw and Leroy, 1987). The full model is a standard normal Bayesian linear regression, where β = (β_0, β_1)', y is the vector of the logarithm of the number of calls, and X is the n × 2 design matrix with a vector of 1's in the first column and the year covariate in the second. Prior parameters are fixed via a maximum likelihood fit to the first 3 data points. In particular, the prior covariance for β is set to Σ_0 = g σ_0² (X_p' X_p)^{-1}, with X_p the 3 × 2 design matrix for the first 3 data points, g = n = 21, σ_0 = 0.03, and µ_0 = (1.87, 0.03)'. This has the spirit of a unit information prior (Kass and Wasserman, 1995) but uses a design matrix for data not used in the fit. Finally, a = 2 and b = 1. We fit the normal theory model, restricted likelihood models conditioning on M-estimators for the slope and intercept with Huber's 'proposal 2' for scale, and a heavy-tailed t-distribution model. The first three data points were used to specify the prior, with each model using the remaining 21 for fitting. The normal theory model was also fit after removing observations 14-20 (years 1963-1970).
For low-dimensional statistics T(y) and parameters θ, the direct computational strategies described in Lewis (2014) can be used to estimate the restricted posterior conditioned on essentially any statistic. These strategies rely on estimation of the density f(T(y)|θ) using samples of T(y) for many values of θ, a strategy which breaks down in higher dimensions. This section outlines a data augmented MCMC algorithm that can be applied to the Bayesian linear model when T(y) consists of estimates of the regression coefficients and scale parameter.

The Bayesian linear model
We focus on the use of the restricted likelihood for the Bayesian linear model with a standard formulation:

y_i = x_i'β + ε_i,   (8)

where x_i and β ∈ R^p, σ² ∈ R^+, and the ε_i are independent draws from a distribution with center 0 and scale σ. X denotes the design matrix whose rows are the x_i'.
For the restricted likelihood model, conditioning statistics are assumed to be of the form T(y) = (b(X, y), s(X, y)), where b(X, y) ∈ R^p is an estimator for the regression coefficients and s(X, y) ∈ {0} ∪ R^+ is an estimator of the scale. Throughout, the observed data and summary statistic are denoted by y_obs and T(y_obs) = (b(X, y_obs), s(X, y_obs)), respectively. Several conditions are imposed on the model and statistic to ensure validity of the MCMC algorithm: C1. The n × p design matrix, X, whose i-th row is x_i', is of full column rank.
C2. The ε_i are a random sample from some distribution which has a density with respect to Lebesgue measure on the real line and whose support is the real line.
C3. b(X, y) is almost surely continuous and differentiable with respect to y.
C4. s(X, y) is almost surely positive, continuous, and differentiable with respect to y.
C5. b(X, y + Xv) = b(X, y) + v for all v ∈ R^p.
C6. b(X, cy) = c b(X, y) for all c > 0.
C7. s(X, y + Xv) = s(X, y) for all v ∈ R^p.
C8. s(X, cy) = c s(X, y) for all c > 0.
Properties C5 and C6 of b are called regression and scale equivariance, respectively. Properties C7 and C8 of s are called regression invariance and scale equivariance.
Many estimators satisfy the above properties, including simultaneous M-estimators (Huber and Ronchetti, 2009; Maronna et al., 2006), for which the R package brlm (github.com/jrlewi/brlm) is available to implement the MCMC described here.
Further software development is required to extend the MCMC implementation beyond these M-estimators. The package also implements the direct computational methods described in Lewis (2014). These methods are effective in lower dimensional problems and were used in both examples in Section 3.
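The equivariance and invariance properties can be checked numerically. The sketch below verifies C5-C8 (using the standard definitions of regression/scale equivariance and invariance) for one simple choice of (b, s), the least squares coefficients and root-mean-square residual. This is an illustrative Python check of our own, unrelated to the brlm package.

```python
import numpy as np

rng = np.random.default_rng(0)

def b_ls(X, y):
    """Least-squares coefficient estimator: a b(X, y) satisfying C5-C6."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def s_ls(X, y):
    """Root-mean-square residual: a scale estimator satisfying C7-C8."""
    r = y - X @ b_ls(X, y)
    return np.sqrt(r @ r / len(y))

n, p = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = rng.normal(size=n)
v = np.array([1.5, -0.7])   # arbitrary shift along C(X)
c = 3.0                     # arbitrary positive rescaling

ok_C5 = np.allclose(b_ls(X, y + X @ v), b_ls(X, y) + v)   # regression equivariance
ok_C6 = np.allclose(b_ls(X, c * y), c * b_ls(X, y))       # scale equivariance
ok_C7 = np.isclose(s_ls(X, y + X @ v), s_ls(X, y))        # regression invariance
ok_C8 = np.isclose(s_ls(X, c * y), c * s_ls(X, y))        # scale equivariance
```

The same checks apply verbatim to robust choices of (b, s) such as simultaneous M-estimators.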

Computational strategy
The general style of algorithm we present is a data augmented MCMC targeting f (θ, y|T (y) = T (y obs )), the joint distribution of θ and the full data given the summary statistic T (y obs ). The Gibbs sampler (Gelfand and Smith, 1990) iteratively samples from the full conditionals 1) π(θ|y, T (y) = T (y obs )) and 2) f (y|θ, T (y) = T (y obs )). When y has the summary statistic T (y) = T (y obs ), the first full conditional is the same as the full data posterior π(θ|y). In this case, the condition T (y) = T (y obs ) is redundant. This allows us to make use of conventional MCMC steps for generation of θ from the first full conditional. For typical regression models, algorithms abound. Details of the recommended algorithms depend on details of the prior distribution and sampling density and we assume this can be done (see e.g., Liu, 1994;Liang et al., 2008).
For a typical model and conditioning statistic, the second full conditional f (y|θ, T (y) = T (y obs )) is not available in closed form. We turn to Metropolis-Hastings (Hastings, 1970), using the strategy of proposing full data y ∈ A := {y ∈ R n |T (y) = T (y obs )} from a well defined distribution with support A and either accepting or rejecting the proposal. Let y p , y c ∈ A represent the proposed and current full data, respectively.
Denote the proposal distribution for y_p by p(y_p|θ, T(y_p) = T(y_obs)) = p(y_p|θ, y_p ∈ A) = p(y_p|θ). The last equality follows from the fact that our p(·|θ) assigns probability one to the event {y_p ∈ A}. These equalities still hold if the dummy argument y_p is replaced with y_c. The conditional density is

f(y|θ, y ∈ A) = f(y|θ) I(y ∈ A) / ∫_A f(u|θ) du,

for y ∈ A and I(·) the indicator function. This includes both y_p and y_c. The Metropolis-Hastings acceptance probability is the minimum of 1 and R, where

R = [f(y_p|θ) p(y_c|θ)] / [f(y_c|θ) p(y_p|θ)],

since the normalizing constant ∫_A f(u|θ) du cancels from the ratio. For the models we consider, evaluation of f(y|θ) is straightforward. Therefore, the difficulty in implementing this Metropolis-Hastings step manifests itself in the ability to both simulate from and evaluate p(y_p|θ), the well-defined distribution with support A. We now discuss such an implementation method for the linear model in (8).

Construction of the proposal
Our computational strategy relies on proposing y such that T (y) = T (y obs ) where T (·) = (b(X, ·), s(X, ·)) satisfies the conditions C3-C8. It is not a simple matter to do this directly, but with the specified conditions, it is possible to scale and shift any z * ∈ R n which generates a positive scale estimate to such a y via the following Theorem, whose proof is in the appendix.
Theorem 4.1. Assume that conditions C4-C8 hold. Then, any vector z* ∈ R^n with conditioning statistic T(z*) for which s(X, z*) > 0 can be transformed into y with conditioning statistic T(y) = T(y_obs) through the transformation

y = h(z*) = z + X(b(X, y_obs) − b(X, z)),  where z = [s(X, y_obs)/s(X, z*)] z*.

Using the theorem, the general idea is to first start with an initial vector z* drawn from a known distribution, say p(z*), and transform via h(·) to y ∈ A. The proposal density p(y|θ) is then a change-of-variables adjustment on p(z*) derived from h(·). In general, however, the mapping h(·) is many-to-one: for any v ∈ R^p and any c ∈ R^+, cz* + Xv and z* map to the same y. This makes the change-of-variables adjustment difficult.
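The scale-then-shift construction of Theorem 4.1 is easy to verify numerically. The sketch below implements h(·) for the equivariant least squares choice of (b, s) (an illustrative stand-in for the robust estimators used in the paper) and checks that the transformed vector matches the observed statistics.

```python
import numpy as np

rng = np.random.default_rng(1)

def b_ls(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def s_ls(X, y):
    r = y - X @ b_ls(X, y)
    return np.sqrt(r @ r / len(y))

def h(z_star, X, b_obs, s_obs):
    """Scale z* so the scale statistic matches s_obs (C8), then shift along
    C(X) so the coefficient statistic matches b_obs (C5, C7); cf. Theorem 4.1."""
    z = (s_obs / s_ls(X, z_star)) * z_star
    return z + X @ (b_obs - b_ls(X, z))

n = 20
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y_obs = X @ np.array([2.0, 1.0]) + rng.normal(size=n)
b_obs, s_obs = b_ls(X, y_obs), s_ls(X, y_obs)

z_star = rng.normal(size=n)          # any vector with a positive scale estimate
y_new = h(z_star, X, b_obs, s_obs)   # lands in A: T(y_new) = T(y_obs)
```

The two steps mirror the proof: C8 fixes the scale under rescaling, C5 fixes the coefficients under the shift, and C7 guarantees the shift does not disturb the scale.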
We handle this by first noticing that the set A is an (n−p−1)-dimensional space: there are p constraints imposed by the regression coefficients and one further constraint imposed by the scale. Hence, we restrict the initial z* to an easily understood (n−p−1)-dimensional space. Specifically, this space is the unit sphere in the orthogonal complement of the column space of the design matrix,

S = {z ∈ C⊥(X) : ||z|| = 1},

where C(X) and C⊥(X) are the column space of X and its orthogonal complement, respectively. The mapping h : S → A is one-to-one and onto. A proof is provided by Theorem 8.1 in the appendix. The one-to-one property makes the change of variables more feasible. The onto property is important so that the support of the proposal distribution (i.e., the range of h(·)) contains the support of the target f(y|θ, y ∈ A), a necessary condition for convergence of the Metropolis-Hastings algorithm (in this case the supports are both A).
Given the one-to-one and onto mapping h : S → A, the general proposal strategy is summarized as follows: 1. Sample z * from a distribution with known density on S.
2. Set y = h(z*) and calculate the Jacobian of this transformation in two steps.
(a) Scale z* to z ∈ Π(A), the projection of A onto C⊥(X); by condition C7, every element of this set has s(X, z) = s(X, y_obs). Specifically, set z = [s(X, y_obs)/s(X, z*)] z*. There are two pieces of this Jacobian: one for the scaling and one for the mapping of the sphere onto Π(A). The latter piece is given in equation (12).
(b) Shift z to y = z + X(b(X, y_obs) − b(X, z)). This shift is along the column space of X to the unique element in A. The Jacobian of this transformation is given by equation (13).
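Step 1 of the strategy, drawing z* uniformly on S, has a simple implementation: project a standard normal vector off C(X) and normalize. The following Python sketch (our own illustration) does this and checks membership in S.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_on_S(X, rng):
    """Draw z* uniformly on the unit sphere in C_perp(X): project a standard
    normal vector onto the orthogonal complement of C(X), then normalize."""
    n = X.shape[0]
    Q, _ = np.linalg.qr(X)        # orthonormal basis for C(X)
    w = rng.normal(size=n)
    w = w - Q @ (Q.T @ w)         # projection onto C_perp(X)
    return w / np.linalg.norm(w)

n = 15
X = np.column_stack([np.ones(n), rng.normal(size=n)])
z_star = sample_on_S(X, rng)      # unit norm, orthogonal to every column of X
```

Uniformity follows from the rotational symmetry of the projected Gaussian within C⊥(X); any other density on S with full support would also be a valid starting distribution.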
The final proposal distribution, including the complete Jacobian, is given in equation (14), with details in the next section. Before giving these details, we provide a visualization in Figure 3 of each of the sets described above, using a notional example to aid understanding of the strategy we take. In the figure, n = 3, p = 1, and the conditioning statistic is T(y) = (min(y), Σ_i (y_i − min(y))²). The set A is depicted for T(y_obs) = (0, 1); we describe it as a "warped triangle", shown in light blue, with each side corresponding to a particular coordinate of y attaining the minimum value of zero. The set A is the combination of three quarter circles, one on each plane defined by y_i = 0. The projection of this manifold onto the deviation space is depicted by the bowed triangular shape in the plane defined by Σ_i y_i = 0. The circle in this plane represents the sample space for the intermediate sample z*. Also depicted is the vector 1, the design matrix for the location and scale setting.
On each side of A, one coordinate of y equals zero while the other two coordinates are restricted by the scale statistic to lie on the quarter circle of radius one in the positive orthant. In this example, the column vector X = 1 (shown as a reference) spans C(X) and S is a unit circle in the orthogonal plane (shown in red). Π(A) is depicted as the bowed triangle in dark blue. We will come back to this artificial example in the next section in an attempt to visualize the Jacobian calculations.

Evaluation of the proposal density
We now explain each step in computing the Jacobian described above.

Scale from S to Π(A)
The first step is constrained to C⊥(X) and scales the initial z* to z = [s(X, y_obs)/s(X, z*)] z*. For the Jacobian, we consider two substeps: first, the distribution on S is transformed to one along a sphere of radius r = ||z|| = s(X, y_obs)/s(X, z*). By comparison of the volumes of these spheres, this transformation contributes a factor of r^{−(n−p−1)} to the Jacobian. For the second substep, the sphere of radius r is deformed onto Π(A).
This deformation contributes an attenuation to the Jacobian equal to the ratio of infinitesimal volumes in the tangent spaces of the sphere and Π(A) at z. Restricting to C ⊥ (X), this ratio is the cosine of the angle between the normal vectors of the two sets at z. The normal to the sphere is its radius vector z. The normal to Π(A) is given in the following lemma with proof provided in the Appendix. Gradients denoted by ∇ are with respect to the data vector.
Lemma 4.2. Assume that conditions C1-C2, C4, and C7 hold and y ∈ A. Let ∇s(X, y) denote the gradient of the scale statistic with respect to the data vector evaluated at y. Then ∇s(X, y) ∈ C⊥(X) and is normal to Π(A) at z = Qy.
As a result of the lemma, the contribution to the Jacobian of this attenuation is

cos(γ) = |⟨z, ∇s(X, y)⟩| / (||z|| ||∇s(X, y)||),   (12)

where γ is the angle between the two normal vectors. Further lemmas in the appendix (Lemmas 4.3 and 4.4) establish, in particular, that the n × (n−p−1) dimensional matrix P = QA is of full column rank. As a consequence, the parallelepiped spanned by the columns of P is not degenerate (it is (n−p−1)-dimensional), and its volume is given by

Vol(P) = ∏_{i=1}^{r} σ_i,   (13)

where r = rank(P) = n−p−1 and σ_1 ≥ σ_2 ≥ ··· ≥ σ_r > 0 are the singular values of P (e.g., Miao and Ben-Israel, 1992). Combining Lemmas 4.3 and 4.4 above leaves us with the following result concerning the calculation of the desired Jacobian.
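The product-of-singular-values formula for the volume agrees with the Gram-determinant identity Vol(P)² = det(P'P) for full column rank P. A quick numerical check (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(3)

def vol(P):
    """Volume of the parallelepiped spanned by the columns of P:
    the product of its nonzero singular values."""
    sv = np.linalg.svd(P, compute_uv=False)
    return float(np.prod(sv[sv > 1e-12]))

# Compare against sqrt(det(P'P)) for a generic full-column-rank P.
P = rng.normal(size=(8, 3))
v1 = vol(P)
v2 = float(np.sqrt(np.linalg.det(P.T @ P)))
```

The SVD route is the useful one in the MCMC, since it remains well defined when P is a tall projection of an orthonormal basis.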
Theorem 4.5. Assume that conditions C1-C5 and C7-C8 hold. Then the Jacobian of the transformation from the distribution along Π(A) to that along A is equal to the volume given in (13).

The proposal density
Putting all the pieces of the Jacobian together we have the following result. Any dependence on other variables, including current states in the Markov chain, is made implicit.
Theorem 4.6. Assume that conditions C1-C8 hold. Let z* be sampled on the unit sphere in C⊥(X) with density p(z*). Using the transformation of z* to y ∈ A described in Theorem 4.1, the density of y is

p(y|θ) = p(z*) r^{−(n−p−1)} cos(γ) Vol(P),   (14)

where r = s(X, y_obs)/s(X, z*), and cos(γ) and Vol(P) are as in equations (12) and (13), respectively.
Some details for computing the needed quantities are worth further explanation.
Computing Vol(P) involves finding an orthonormal matrix A whose columns span T_y(A). This matrix can be found by supplementing B with a set of n linearly independent columns on the right and applying Gram-Schmidt orthonormalization.
The computational complexity of this step is O(n³). This is infeasibly slow when n is large, because it must be repeated at each iterate of the MCMC when a complete data set is drawn. However, using results related to principal angles found in Miao and Ben-Israel (1992), the volume (13) can be computed without explicitly constructing A. The lemma implies that Vol(P) is the product of the singular values of U B.
For example, below we consider M-estimators defined by the estimating equations

∑_{i=1}^{n} ψ((y_i − x_i'β)/σ) x_i = 0,    ∑_{i=1}^{n} χ((y_i − x_i'β)/σ) = 0,

where ψ and χ are almost surely differentiable. The gradients can be found by differentiating this system of equations with respect to each y_i. In principle, finite differences could also be used as an approximation if needed.
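To make the simultaneous M-estimation concrete, the sketch below solves the location-scale special case (p = 1, X = 1) with the Huber ψ and a proposal-2-style χ, via the standard winsorized fixed-point iteration. This is an illustrative Python implementation of ours; the paper's analyses use the regression versions in brlm.

```python
import math

def huber_loc_scale(y, c=1.345, iters=100):
    """Simultaneous Huber location-scale estimate via fixed-point iteration,
    with chi chosen so the scale estimate is consistent at the normal model
    (in the spirit of Huber's 'proposal 2')."""
    phi = lambda z: math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    Phi = lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2)))
    # delta = E[psi_c(Z)^2] for Z ~ N(0,1): the consistency constant
    delta = 2 * Phi(c) - 1 - 2 * c * phi(c) + 2 * c * c * (1 - Phi(c))
    n = len(y)
    mu = sorted(y)[n // 2]                                 # start at the median
    sigma = sum(abs(v - mu) for v in y) / n or 1.0
    for _ in range(iters):
        # winsorized residuals: psi_c(r/sigma) * sigma
        r = [min(max(v - mu, -c * sigma), c * sigma) for v in y]
        mu = mu + sum(r) / n
        sigma = math.sqrt(sum(ri * ri for ri in r) / (n * delta))
    return mu, sigma

y = [-0.9, -0.4, -0.1, 0.0, 0.2, 0.3, 0.7, 1.1, 50.0]     # one gross outlier
mu_hat, s_hat = huber_loc_scale(y)
```

The gross outlier is clamped at c·σ at every iteration, so it shifts the location and scale estimates only modestly; the same clamping appears in the estimating equations through ψ.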

Simulated Data
We study the performance of restricted likelihood methods in a hierarchical setting where the data are contaminated with outliers. Specifically, simulated data come from a contaminated hierarchical normal model with µ = 0, τ² = 1, and σ² = 4. The values of p_i, m_i, and n_i depend on the group and are formed using 5 replicates of the full factorial design over the factors p_i, m_i, and n_i with levels p_i = .1, .2, .3, m_i = 9, 25, and n_i = 25, 50, 100. This results in 90 groups that have varying levels of outlier contamination and sample size. We wish to build models that offer good prediction for the good portion of the data within each group. The full model for fitting is a corresponding normal model without contamination. The grouped structure allows the data augmentation described earlier to be performed independently within each group. That is, there is a separate Gibbs step for each group to generate the group-level data matching the statistics for that group.
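The factorial layout of the simulation is easy to reproduce. The snippet below enumerates the 90 group configurations (5 replicates of the 3 × 2 × 3 design); the dictionary keys are our own labels.

```python
from itertools import product

# Full factorial over (p_i, m_i, n_i) with 5 replicates -> 90 groups
p_levels = [0.1, 0.2, 0.3]   # contamination fractions
m_levels = [9, 25]           # contamination severity factor
n_levels = [25, 50, 100]     # group sample sizes
replicates = 5

design = [dict(p=p, m=m, n=n, rep=r)
          for r in range(replicates)
          for (p, m, n) in product(p_levels, m_levels, n_levels)]
```

Each entry of `design` drives one simulated group; the 18 distinct (p, m, n) configurations each appear five times.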
To assess predictive capability, the models are compared using the Kullback-Leibler (KL) divergence from the distribution of the good data to the posterior predictive distribution. Specifically, for the i-th group of the k-th simulated data set y_k we compute

KL( f(·|θ_i, σ²) || f_M(·|y_k) ) = ∫ f(ỹ|θ_i, σ²) log[ f(ỹ|θ_i, σ²) / f_M(ỹ|y_k) ] dỹ,

where M indexes the fitting model and f(ỹ|θ_i, σ²) = N(ỹ|θ_i, σ²), the normal density function with (known) mean θ_i and variance σ², evaluated at ỹ. For c = 0.5 and c = 1, the results favor the restricted likelihood methods, with a slight advantage to the use of Tukey's location estimator over Huber's. This is likely due to the fact that Tukey's estimator essentially trims extreme outliers in the estimation procedure while Huber's estimator discounts them (Huber and Ronchetti, 2009).
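When the posterior predictive is itself (approximately) normal, this KL divergence has a closed form. The helper below gives that normal-normal formula; it is a sketch of the building block, not the paper's full Monte Carlo evaluation against the exact predictive.

```python
import math

def kl_normal(mu0, var0, mu1, var1):
    """KL( N(mu0, var0) || N(mu1, var1) ) in closed form."""
    return 0.5 * (math.log(var1 / var0) + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)
```

For non-normal predictives, the same divergence can be approximated by quadrature or by Monte Carlo draws of ỹ from N(θ_i, σ²).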
The choice of c = 2 corresponds to a particularly poor prior distribution. The prior has substantial mass above σ² = 4, with prior means for σ² ranging from 8.9 to 32 as a_s varies. Additionally, the tuning parameters chosen for the location and scale estimators result in an upward bias in the estimate of σ². This bias depends on m and p. For example, for m = 9 and p = .1, Huber's version converges to roughly 4.8 as n grows. The bias is greater for more severe levels of contamination. The alignment of biases in the prior distribution and in the likelihood from the summary statistic (when applied to the contaminated data) inflates the estimate of scale. Not surprisingly, a poor prior distribution whose weakness matches the weakness in the likelihood results in poorer inference, in this case poorer than that from the classical estimators.

Real Data
We illustrate our methods with a pair of regression models for data from Nationwide Insurance Company that concern prediction of the performance of insurance agencies. Nationwide sells many of its insurance policies through agencies which provide direct service to policy holders. The contractual agreements between Nationwide and these agencies vary. Our interest is the prediction of future performance of agencies, where performance is measured by the total number of households an agency services ('household count'). The data are grouped by state, with a varying number of agencies per state. Identifiers such as agency/agent names are removed. Likewise, state labels and agency types (identifying the varying contractual agreements) have been made generic to protect the proprietary nature of the data. Additionally, the counts were scaled to have standard deviation one before analysis. As an exploratory view, a plot of the square root of (scaled) household count in 2012 against that in 2010 is shown in Figure 8. Open, 'type 1' agencies in 2012 are of special interest. One could easily subset the analysis to only these agencies, removing the others. However, we leave them in and use the data as a test bed for our techniques by fitting models that do not account for agency closures or contract type. Our expectation is that the restricted likelihood will facilitate prediction for the 'good' part of the data (i.e., open, 'type 1' agencies).

State Level Regression model
The first analysis is based on individual regressions fit separately within states. The following normal theory regression model is used as the full model for a single state, where y_i and x_i are the square-rooted household counts in 2012 and 2010 for the i-th agency, respectively. The hyper-parameters a_0, b_0, µ_0, and σ_0² are all fixed and set from a prior data set. This prior is in the spirit of Zellner's g-prior (Zellner, 1986; Liang et al., 2008).
In general, scaling the prior variance se(β)² by a factor g = n/p is analogous to the unit-information prior (Kass and Wasserman, 1995), with the difference that we are using a prior data set, not the current data set, to set the prior. The obvious reason this model is misspecified is the omission of the contract type and agency closure information. Closing our eyes to these variables, many of the cases appear as outliers. Additionally, the model assumes equal variance within each state, an assumption whose worth is arguable (see Figure 8).
We compare four Bayesian models: the standard Bayesian normal theory model, two restricted likelihood models, both with simultaneous M-estimators, and a heavy-tailed model. For the restricted likelihood methods we use the same simultaneous M-estimators as in the simulation of Section 5, adapted to linear regression. The heavy-tailed model replaces the normal sampling density in (19) with a t-distribution with ν = 5 degrees of freedom. The Bayesian models are all fit using MCMC, with the restricted versions using the algorithm presented in Section 4.2. We also fit the corresponding classical robust regressions and a least squares regression.

Method of model comparison
We wish to examine the performance of the models in a fashion that preserves the essential features of the problem. Since we are concerned with outliers and model misspecification, we understand that our models are imperfect and prefer to use an out-of-sample measure of fit. This leads us to cross-validation. We repeatedly split the data into training and holdout data sets, fitting the model to the training data and assessing performance on the holdout data.
The presence of numerous outliers in the data implies that both training and validation data will contain outliers. For this reason, the evaluation must be robust to a certain fraction of bad data. The two main strategies are to robustify the evaluation function (e.g., Ronchetti et al., 1997) or to retain the desired evaluation function and trim cases (Jung et al., 2014). Here, we pursue the trimming approach, with the log predictive density as the evaluation function for the Bayesian models and the log density with plug-in maximum likelihood estimates for the classical fits.
The trimmed evaluation proceeds as follows in our context. The evaluation function for case i in the holdout data is the log predictive density, say log(f(y_i)), with the conditioning on the summary statistic suppressed. The trimming fraction is set at 0 ≤ α < 1. To score a method, we first identify a base method. Denote the predictive density under this method by f_b(y). Under the base method, log(f_b(y_i)) is computed for each case in the holdout sample, say i = 1, ..., M. Order the holdout sample according to the ordering of log(f_b(y_i)) and denote this ordering by y_b(1), y_b(2), ..., y_b(M). That is, for i < j, log(f_b(y_b(i))) < log(f_b(y_b(j))). All of the methods are then scored on the holdout sample with the mean trimmed log marginal pseudo likelihood (TLM),

TLM_b(A) = (M − m)⁻¹ Σ_{i=m+1}^{M} log(f_A(y_b(i))),  m = ⌈αM⌉,

where f_A denotes the predictive density under the method A being scored.

The Student-t performs similarly to our restricted methods for the smaller training sample size (25% of the sample). However, its performance is slightly worse for the larger training sample size (50% of the sample). Intuitively, as more data is available for fitting, more outliers appear and the heavy-tailed model compensates for them by assuming they come from the tails of the model; an assumption which is detrimental for prediction. Comparisons of the models depend on α, as seen in Figure 10, which shows results for different α for training sample size 0.5n. For smaller α (in this case α = 0.1), many outliers are left untrimmed, resulting in lower TLM for all methods and noticeably larger standard deviations for the classical robust methods and our restricted likelihood. Larger values of α ensure that the predictive performance assessment excludes the majority of outliers. The proportion of 0 counts in the data is roughly 0.14, suggesting that α should be at least this large.
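The trimming procedure above can be sketched as follows, with placeholder log predictive density values standing in for fitted models: order the holdout cases by their score under the base method, drop the worst fraction α, and average the remaining scores under the method being evaluated.

```python
import numpy as np

def tlm(logf_base, logf_method, alpha):
    """Mean trimmed log marginal: trim the ceil(alpha*M) holdout cases with
    the smallest log predictive density under the base method, then average
    the remaining log predictive densities under the scored method."""
    order = np.argsort(logf_base)          # ascending: worst cases first
    m = int(np.ceil(alpha * len(order)))   # number of cases to trim
    return logf_method[order[m:]].mean()

# Placeholder scores: 30 of 200 holdout cases are badly predicted outliers.
rng = np.random.default_rng(2)
M = 200
logf_b = rng.normal(-1.0, 1.0, size=M)           # base-method scores
logf_b[:30] -= 10.0                              # outlying cases
logf_a = logf_b + rng.normal(0.0, 0.1, size=M)   # method A: similar scores

score = tlm(logf_b, logf_a, alpha=0.2)
```

Note that the trimmed set is chosen once, by the base method, so all methods are compared on the same retained cases.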

Hierarchical regression model
The previous analysis treated states independently. A natural extension is to reflect similar business environments between states using a hierarchical regression. The proposed model is

y_ij = x_ijᵀβ_j + ε_ij,  ε_ij ~ N(0, σ_j²),  β_j | β, b ~ N(β, bΣ_0),  β ~ N(µ_0, aΣ_0),   (20)

where y_ij is the i-th observation of square-rooted household count in 2012 in the j-th state, n_j is the total number of agencies in state j, and J is the number of states.
x_ij is the square-rooted household count in 2010 and β_j represents the individual regression coefficient vector for state j. The parameters µ_0, σ_0², a_0, and b_0 are fixed by fitting the regression y_ij = x_ijᵀβ + ε_ij with Huber's M-estimators to the prior data set from two years before. Using the estimates from this fit, we set µ_0 = β̂, σ_0² = n_p se(β̂)² (n_p = 2996 is the number of observations in the prior data set), a_0 = 5 and b_0 = σ̂²(a_0 − 1). We constrain a + b = 1 in an attempt to partition the total variance between the individual β_j's and the overall β. We take b ~ beta(v_1, v_2).
Using the prior data set, we assess the variation between individual estimates of the β_j to set v_1 and v_2, allowing for a reasonable amount of shrinkage. To allow for dependence across the σ_j², we first take (z_1, ..., z_J) ~ N_J(0, Σ_ρ) with Σ_ρ = (1 − ρ)I + ρ11ᵀ.
Then we set σ_j² = H⁻¹(Φ(z_j)), where H is the cdf of an IG(a_0, b_0) and Φ is the cdf of a standard normal. This results in the specified marginal distribution, while introducing correlation via ρ. We assume ρ ~ beta(a_ρ, b_ρ) with mean µ_ρ = a_ρ/(a_ρ + b_ρ) and precision ψ_ρ = a_ρ + b_ρ. The parameters µ_ρ and ψ_ρ are given beta and gamma distributions, with fixed hyperparameters. More details on setting prior parameters are given in the appendix.
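The Gaussian copula construction for the σ_j² can be sketched directly; the values of a_0, b_0, and ρ below are arbitrary illustrations, not the paper's fitted hyperparameters.

```python
import numpy as np
from scipy import stats

# Draw (z_1,...,z_J) from an exchangeable multivariate normal with
# correlation rho, then push each z_j through the standard normal cdf and
# the inverse-gamma quantile function: every sigma2_j then has the
# IG(a0, b0) marginal while being correlated through rho.
rng = np.random.default_rng(3)
J, a0, b0, rho = 22, 5.0, 4.0, 0.3

Sigma_rho = (1 - rho) * np.eye(J) + rho * np.ones((J, J))
z = rng.multivariate_normal(np.zeros(J), Sigma_rho)
u = stats.norm.cdf(z)                          # uniform(0,1) marginals
sigma2 = stats.invgamma.ppf(u, a0, scale=b0)   # IG(a0, b0) marginals
```

Setting ρ = 0 recovers independent IG(a_0, b_0) variances; ρ near 1 makes the state variances move together.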
Using the same techniques as in the previous section, we fit the normal theory hierarchical model above, a thick-tailed t version with ν = 5 d.f., and two restricted likelihood versions (Huber's and Tukey's) of the model. For the restricted methods, we condition on robust regression estimates fit separately within each state.
We also fit classical robust regression counterparts and a least squares regression separately within each state. Hierarchical models naturally require more data, so we include states having at least 25 agencies, resulting in 22 states in total and n = Σ_j n_j = 3180 total agencies. For training data we take a stratified (by state) sample of size 3180/2 = 1590, where the strata sizes are n_j/2 (rounded to the nearest integer). The remaining data are used for a holdout evaluation using TLM computed separately within each state:

TLM_b(A)_j = (M_j − m_j)⁻¹ Σ_{i=m_j+1}^{M_j} log(f_A(y_b(i)j)),  m_j = ⌈αM_j⌉,

where y_b(1)j, y_b(2)j, ..., y_b(M_j)j is the ordering of the M_j holdout observations within state j according to the log marginals under the base model b. For the non-Bayesian models, f_A(y_b(i)j) is estimated using plug-in estimators for the parameters for state j.
T LM b (A) j is computed for each state for K = 50 splits of training and holdout sets.
The Bayesian models are fit using MCMC, with the restricted versions applying the algorithm laid out in Section 4 and adapted to the hierarchical setting as described in Section 5. For the MH-step proposing augmented data, the acceptance rates for the two restricted likelihood models across all states and repetitions range from 0.24 to 0.74.
The average over states, TLM_b(A)· = (1/22) Σ_{j=1}^{22} TLM_b(A)_j, for each of the K repetitions is summarized in Figure 11 for several trimming fractions using the Student-t as the base model. The points are the averages of the TLM_b(A)· over the K repetitions, with error bars of plus/minus one standard deviation over K; larger values represent better predictive performance. As the trimming fraction used for the TLM increases, so does TLM, since more outliers are being trimmed. Similar patterns were seen in the individual state level regressions in Section 6.1. Despite being used as the base model to compute TLM, the Student-t does not perform well in comparison to the robust regressions. We attribute this to the assumption of heavier tails resulting in smaller log marginal values on average, emphasizing again that the t-model will do well to discount outlying observations but does not provide a natural mechanism for predicting 'good' (i.e., non-outlying) data. For each trimming fraction, our restricted likelihood hierarchical models outperform the classical robust regressions fit separately within each state. The hierarchical model also reduces variance in predictions, resulting in smaller error bars. This improvement decreases with α but is still noticeable for α = 0.2. Both the Tukey and Huber versions perform similarly.
It is also interesting to examine the results within each state. Figure 12 presents the comparisons within each state; the gains from the hierarchical model are most apparent in the smaller states. In larger states, the standard deviations are virtually identical. Similar benefits are often seen for hierarchical models (e.g., Gelman, 2006).

Discussion
This paper develops a Bayesian version of restricted likelihood where posterior inference is conducted by conditioning on a summary statistic rather than the complete data. The framework blends classical estimation with Bayesian methods. Here, we concentrate on outlier-prone settings where natural choices for the conditioning statistic are classical robust estimators targeting the mean of the non-outlying data (e.g., M-estimators). The likelihood conditioned on these estimators is used to move from prior to posterior. The update follows Bayes' Theorem, conditioning on the observed estimators exactly. Computation is driven by MCMC methods, requiring only a supplement to existing algorithms by adding a Gibbs step to sample from the space of data sets satisfying the observed statistic. This step has additional computational costs arising from the need to compute the estimator and an orthonormal basis derived from gradients of the estimator at each iteration. The cost of finding the basis can be reduced by exploiting properties of the geometric space from which the samples are drawn, as described in Section 4.2. We have seen good mixing of the MCMC chains across a wide variety of examples.
The Bayesian restricted likelihood framework can be used to address model misspecification, of which the presence of outliers is but one example. The traditional view is that, if the model is inadequate, one should build a better model. In our empirical work, as data sets have become larger and more complex, we have bumped into settings where we cannot realistically build the perfect model. We ask the question "by attempting to improve our model through elaboration, will the overall performance of the model suffer?" If yes, we avoid the elaboration, retaining a model with some level of misspecification. Acknowledging that the model is misspecified implies acknowledging that the sampling density is incorrect, exactly as we do when outliers are present. In this sense, misspecified models and outliers are reflections of the same phenomenon, and we see restricted likelihood as a method for dealing with this more general problem.
Outside of outlier-prone settings, we might condition on the results of a set of estimating equations designed to enforce a lexical preference for those features of the analysis considered most important, yet still produce inferences for secondary aspects of the problem. This leads to questions regarding the choice of summary statistic to apply. In the literature, great ingenuity has been used to create a wide variety of estimators designed to handle specific manifestations of a misspecified model.
The estimators are typically accompanied by asymptotic results on consistency and limiting distribution. These results can be used as a starting point to choose appropriate conditioning statistics in specific settings. For example, a set of regression quantiles may be judged the most important feature of a model. It would then be natural to condition on the estimated regression quantiles and to use a flexible prior distribution to allow for nonlinearities in the quantiles. The computational strategies we have devised allow us to apply our methods in this setting and to make full predictive inference. In general, we recommend a choice of conditioning statistic based on the analyst's understanding of the problem, model, reality, deficiencies in the model, inferences to be made, and the relative importance of various inferences.
The framework we develop here allows us to retain many benefits of Bayesian methods: it requires a complete model for the data; it lets us combine various sources of information, both through the use of a prior distribution and through creation of a hierarchical model; it guarantees admissibility of our decision rules among the class based on the summary statistic T(y); and it naturally leads us to focus on predictive inference. The work does open a number of questions for further work, including a need to investigate restricted likelihood methods as they relate to model selection, model averaging for predictive performance, and model diagnostics.

Proof of Lemma 4.2

Thus, by the chain rule, ∇s(X, y) = Q∇s(X, Qy) = Q∇s(X, z). Hence Xᵀ∇s(X, y) = 0, as desired. From equation (26), all vectors z′ ∈ Π(A) satisfy s(X, z′) = s(X, y) = s(X, y_obs), and so all directional derivatives of s along each tangent v to Π(A) in C⊥(X) at z are equal to 0 (i.e., ∇s(X, z) · v = 0). Thus ∇s(X, z) is orthogonal to Π(A) at z. Since Π(A) has dimension n − p − 1, ∇s(X, z) gives the unique (up to scaling and reversing direction) normal in the n − p dimensional C⊥(X).

Proof of Lemma 4.3
Proof. Without loss of generality, assume the columns of X form an orthonormal basis for C(X) and likewise the columns of W form an orthonormal basis for C⊥(X).
With earlier notation, H = XXᵀ and Q = WWᵀ. The set A is defined by the p + 1 equations s(X, y) = s(X, y_obs), b_1(X, y) = b_1(X, y_obs), ..., b_p(X, y) = b_p(X, y_obs).
Consequently, the gradients are orthogonal to A. Let ∇b(X, y) denote the n × p matrix with columns ∇b_1(X, y), ..., ∇b_p(X, y). We seek to show that the n × (p + 1) matrix [∇b(X, y), ∇s(X, y)] has rank p + 1. Using property C5, we have that b(X, y) = b(X, Qy + Hy) = b(X, Qy) + Xᵀy. Then ∇b(X, y) = Q∇b(X, Qy) + X, and

[∇b(X, y), ∇s(X, y)] = [X, W] [[I_p, 0], [Wᵀ∇b(X, Qy), Wᵀ∇s(X, y)]].   (27)

The last column comes from Lemma 4.2. The matrix [X, W] is of full column rank (rank n), and so the rank of [∇b(X, y), ∇s(X, y)] is the same as the rank of the matrix on the right hand side of (27). This last matrix has rank p + 1 since ∇s(X, y) ≠ 0 by C8, and so does [∇b(X, y), ∇s(X, y)].

Proof of Lemma 4.4
Proof. P is the projection of the columns of A onto C⊥(X). For this to result in a loss of rank, a subspace of T_y(A) must belong to C(X). Following property C5, for an arbitrary vector Xv ∈ C(X), b(X, y + Xv) = b(X, y) + v. From the property, we can show that the directional derivative of b along Xv with v ≠ 0 is v, which is a nonzero vector. Hence Xv ∉ T_y(A).
Proof of Corollary 4.7 Proof. The corollary relies on a lemma and theorem from Miao and Ben-Israel (1992) which we restate slightly for brevity of presentation. The principal angles between subspaces pluck off a set of angles between subspaces, from smallest to largest. The number of such angles is the minimum of the dimensions of the two subspaces. Miao and Ben-Israel's first result (their Lemma 1) connects these principal angles to a set of singular values, and hence to volumes.
Miao and Ben-Israel's second result (their Theorem 3) makes a match between the principal angles between a pair of subspaces and the principal angles between their orthogonal complements. To establish the corollary, we appeal to Lemma 8.2 and Theorem 8.3. Translating Miao and Ben-Israel's notation, we have M = C⊥(X), Q_M = W, L = T_y(A), and Q_L = A. By Theorem 8.3, the nonzero principal angles between T_y(A) and C⊥(X) are the same as the nonzero principal angles between T⊥_y(A) and C(X). By Lemma 8.2, the non-unit singular values of WᵀA are the same as the non-unit singular values of UᵀB.

Setting the hierarchical prior values
This section describes how the prior parameters in Section 6.2 are set. Using the data set from two years prior, we fit separate (robust) regressions to each state and a single regression to the entirety of the data at once. Let the estimates for the fits to each state be β̂_1, ..., β̂_J, σ̂_1, ..., σ̂_J and the estimates from the single regression be β̂ and σ̂. These are classical robust estimates using Tukey's regression and Huber's scale. For this section, let n_j denote the number of observations in the j-th state (of the previous data set) and set n_p = Σ_j n_j.
First, consider v_1 and v_2 in the prior b ~ beta(v_1, v_2). In the hierarchical model (20), b = 0 implies all the β_j's are equal (no variation between states) and b = 1 implies the β_j's vary about µ_0 according to Σ_0 = n_p · se(β̂)² (see Section 6.1). We seek a prior measure for what we think b should be. Using the prior fit, a measure of uncertainty for β is Σ_β̂ = se(β̂)², the estimated variance from the single regression. For the β_j's, take δ_j = β̂_j − β̂ and set the prior uncertainty to Σ_δ = n_p⁻¹ Σ_j n_j δ_j². Consider g = Σ_δ / Σ_β̂, measuring the amount of uncertainty in the β_j's relative to that of β. In the prior, we heuristically set the uncertainty in the β_j's (bΣ_0) approximately equal to g · Σ_β̂. That is, bΣ_0 ≈ g · Σ_β̂ = (g/n_p) Σ_0, suggesting b ≈ g/n_p. Thus, we set E[b] = g/n_p. The precision, v_1 + v_2, is set to 10, completing the specification of the prior on b.
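The heuristic can be sketched numerically with made-up prior-fit summaries (all numbers below are hypothetical stand-ins, not the Nationwide estimates): compute g as the ratio of the weighted spread of the state-level estimates around the pooled estimate to the squared standard error of the pooled estimate, set E[b] = g/n_p, and convert the mean and the fixed precision into beta parameters.

```python
import numpy as np

# Hypothetical prior-fit summaries for J states.
rng = np.random.default_rng(4)
J = 22
n_j = rng.integers(50, 300, size=J)              # state sample sizes
n_p = n_j.sum()
beta_hat = 0.9                                   # pooled estimate
beta_j = beta_hat + rng.normal(0, 0.05, size=J)  # state-level estimates
se2_pooled = 0.0004                              # se(beta_hat)^2, pooled fit

delta = beta_j - beta_hat
Sigma_delta = (n_j * delta**2).sum() / n_p       # weighted spread of deltas
g = Sigma_delta / se2_pooled                     # relative uncertainty
E_b = g / n_p                                    # prior mean for b

precision = 10.0                                 # fixed v1 + v2
v1 = E_b * precision
v2 = (1 - E_b) * precision
```

With n_p in the thousands, E[b] lands well inside (0, 1), so the beta parameters are valid.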
Finally, recall ρ ~ beta(a_ρ, b_ρ) with mean µ_ρ = a_ρ/(a_ρ + b_ρ) given a beta prior and precision ψ_ρ = a_ρ + b_ρ given a gamma prior. There is little evidence of any strong correlation amongst estimates of the σ_j² in the prior data set, so we set the prior mean of µ_ρ equal to 0.2 and the prior variance to 0.01. Noting that var(ρ | µ_ρ, ψ_ρ) = µ_ρ(1 − µ_ρ)/(ψ_ρ + 1), we plug in µ_ρ = 0.2 and var(ρ | µ_ρ, ψ_ρ) = 0.01. Solving for ψ_ρ gives a value of 15, which is taken to be the mean of the gamma prior on ψ_ρ. Finally, we set the rate parameter to 1, implying the variance of the gamma prior equals its mean. With this specification, the prior on ρ has 80% of its central mass between roughly 0.03 and 0.42, reflecting our prior belief that there is likely only weak positive correlation amongst the σ_j²'s.
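A quick Monte Carlo sketch of the implied marginal prior on ρ under the hyperparameter values above. One assumption is made explicit here: the beta prior on µ_ρ with mean 0.2 and variance 0.01 is taken to be Beta(3, 12), and the gamma prior on ψ_ρ with mean 15 and rate 1 is Gamma(shape = 15, scale = 1).

```python
import numpy as np

# Sample the hyperpriors, then rho | mu, psi ~ beta(mu*psi, (1-mu)*psi),
# and summarize the marginal prior on rho.
rng = np.random.default_rng(5)
N = 100_000
mu = rng.beta(3.0, 12.0, size=N)                # assumed: mean 0.2, var 0.01
psi = rng.gamma(shape=15.0, scale=1.0, size=N)  # mean 15, variance 15
rho = rng.beta(mu * psi, (1 - mu) * psi)

q10, q90 = np.quantile(rho, [0.1, 0.9])         # central 80% interval
```

Under these assumptions the simulated central 80% interval is close to the (0.03, 0.42) range quoted above.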