A New Bayesian Approach to Robustness Against Outliers in Linear Regression

Linear regression is ubiquitous in statistical analysis. It is well understood that conflicting sources of information may contaminate the inference when the classical normality of errors is assumed. The contamination caused by the light normal tails follows from an undesirable effect: the posterior concentrates in an area in between the different sources with a large enough scaling to incorporate them all. The theory of conflict resolution in Bayesian statistics (O'Hagan and Pericchi (2012)) recommends addressing this problem by limiting the impact of outliers to obtain conclusions consistent with the bulk of the data. In this paper, we propose a model with super heavy-tailed errors to achieve this. We prove that it is wholly robust, meaning that the impact of outliers gradually vanishes as they move further and further away from the general trend. The super heavy-tailed density is similar to the normal outside of the tails, which gives rise to an efficient estimation procedure. In addition, estimates are easily computed. This is highlighted via a detailed user guide, where all steps are explained through a case study. The performance is shown using simulation. All required code is given.


Introduction
The distribution most commonly assumed for the error term ε in the linear regression model Y = x^T β + ε is without a doubt the normal, denoted ε/σ ∼ N(0, 1). Estimating the regression coefficient vector β is in this case equivalent to using the ordinary least squares (OLS) method, whether Bayesian (setting the usual noninformative prior on β) or maximum likelihood estimates (MLE) are computed. Given the remarkable properties of OLS (under certain conditions), such as minimum variance among unbiased estimators, the normal model is often considered as a benchmark in terms of efficiency in the absence of outliers. The normal model is, however, sensitive to outliers, and tails heavier than the Student's, behaving like |z|^{−1}(log |z|)^{−θ}, are needed to achieve whole robustness for both parameters. The idea of using heavier tails than the Student came after the work of Andrade and O'Hagan (2011) who, in the location-scale framework, achieved only partial robustness for the scale by modelling with polynomial tails. As mentioned by West (1984), an outlying observation is accommodated if the posterior distribution converges to that excluding the outlier as this one tends to infinity, which corresponds to our definition of whole robustness. In contrast, partial robustness translates into a significant (but limited as the outliers approach plus or minus infinity) impact on the estimation of the parameter.
The second step towards the objective is therefore to generalise the results of Desgagné (2015) to linear regression. In fact, it is a generalisation of the results of Desgagné and Gagnon (2017), which are essentially an application of those of Desgagné (2015) in simple linear regression through the origin for robust estimation of ratios. This second step represents our key theoretical contribution. We provide two sufficient conditions that lead to whole robustness. The first one is to assume a super heavy-tailed distribution for the error. The other specifies the breakdown point, which tends to the optimal value of 0.5 as the sample size goes to infinity. The validity of our robust method is thus supported by theoretical results. While these are similar to those of Desgagné and Gagnon (2017), a more sophisticated proof technique is required given that the location parameter of the conditional distribution of Y is now an inner product of a known vector and β containing p unknown parameters. Throughout the paper, we focus on continuous explanatory variables to simplify explanation and notation. The results are nonetheless valid in ANOVA and ANCOVA (analyses of variance and covariance), and for variable selection where joint posteriors of models and parameters are considered. The corresponding sufficient conditions are given as remarks after the theoretical results. The price to pay to achieve whole robustness for all parameters is that the use of super heavy-tailed distributions prevents us from obtaining normal conditional distributions. There is therefore a computational cost, in the sense that we cannot implement a Gibbs sampler; it will however be noticed that easy-to-use samplers can be used, which makes the cost negligible.
The third and final step towards the objective is to carefully select the super heavy-tailed distribution in the wholly robust model. To achieve this, we start with the premise that applied statisticians are satisfied with the normal model in the absence of outliers and we specifically design a robust solution from that. We set the distribution of the error as a log-Pareto-tailed normal (LPTN), a super heavy-tailed distribution introduced by Desgagné (2015). Its density exactly matches the standard normal on the central part having a mass of ρ. The parameter ρ is thus the single one to be chosen by the user, and is typically set to a value between 0.80 and 0.98. The resulting model produces robust estimates exhibiting a similar behaviour to OLS in the absence of outliers, where the trade-off between high degree of similarity with OLS and high degree of robustness is controlled through ρ. The model has built-in robustness that resolves conflict in a sensible way (see Figure 1). It completely considers the nonoutliers (from 30 to 32.5 in Figure 1), essentially excludes the observations that are clearly outlying (beyond 38 in Figure 1), and between these two clear cases, contains and bounds their impact. The first two cases correspond to the strategy commonly applied in practice, where an observation is either kept or discarded. In the last case, the method reflects that in the gray area there is a level of uncertainty about whether those observations really are outliers or not. Our main practical contribution is therefore to provide an efficient and robust model that automatically deals with this type of uncertainty, which is especially valuable in high-dimensional problems and when several analyses have to be performed. The rest of the article is organised as follows. The linear regression model is detailed in Section 2.1, the LRVD family is presented in Section 2.2 and the theoretical results are provided in Section 2.3.
More practically, efficient and robust regression is investigated in Section 3. The LPTN distribution is first presented in Section 3.1. A discussion about efficiency of the robust model with LPTN errors is provided in Section 3.2. Practical details of our approach are addressed in Section 3.3 through a case study on the modelling of house market values. Numerical methods such as Markov chain Monte Carlo (MCMC) are discussed for the computation of different posterior quantities: means, medians, credible intervals, prediction of future observations and hypothesis testing via Bayes factors. A powerful tool for outlier identification is also proposed. In Section 3.4, a simulation study is conducted to compare the performance of our approach with different Bayesian alternatives. Note that even though our approach is Bayesian, it is possible to use it in a frequentist setting through maximum a posteriori probability (MAP) estimates, which correspond to MLE when the prior is set to 1. We thus also include in our study the frequentist methods mentioned above.

Conflict Resolution in Linear Regression via LRVD
We henceforth assume that f is a strictly positive continuous probability density function (PDF) on R that is symmetric with respect to the origin, for which all parameters are known, and such that there exists a threshold above which g(z) = zf(z) is monotonic. Examples of such PDFs are the normal, logistic, Laplace, Student (with prespecified degrees of freedom) and LPTN (see Section 3.1) densities.

Linear Regression Model
(i) Let Y_1, . . . , Y_n ∈ R be n random variables representing data points from the dependent variable and x_1^T := (1, x_12, . . . , x_1p), . . . , x_n^T := (1, x_n2, . . . , x_np) be n vectors of observations from the explanatory variables, where p ∈ {2, 3, . . .}, n ≥ p + 1 and the x_ij ∈ R are assumed to be known. As mentioned in the introduction, we focus on the situation where all explanatory variables are continuous. The linear regression model is given by

Y_i = x_i^T β + ε_i,  i = 1, . . . , n,  (2.1)

where the n random variables ε_1, . . . , ε_n ∈ R and the p-dimensional random variable β := (β_1, . . . , β_p)^T ∈ R^p represent the errors and the vector containing the regression coefficients, respectively. These n + 1 random variables are conditionally independent given σ > 0, a scale parameter, with a conditional density for ε_i given by (1/σ)f(ε_i/σ), i = 1, . . . , n.
A large variety of priors fits within the structure assumed in (ii). This is the case for non-informative priors such as π(β, σ) ∝ 1/σ and π(β, σ) ∝ 1, and practically all proper densities. Informative priors shall however be used with caution, especially when they translate into light tailed densities. They may indeed contaminate the inference if they are in conflict with the information carried by the data. Establishing the conditions that guarantee robustness to informative priors in linear regression is not trivial.
We study robustness of the estimation of β and σ in the presence of outliers. In this paper, an observation (x_i, y_i) is considered as an outlier if its error ε_i = y_i − x_i^T β is relatively far from 0, where β defines the probable hyperplanes for the bulk of the data. Note that robustness against outlying errors is a different concept from robustness against outlying x_i or y_i. They are generally equivalent though, except for the unusual case where an observation is outlying in x_i and y_i but still manages to lie in the general trend, and consequently, to be a nonoutlier in error. From a theoretical perspective, we study the asymptotic behaviour in the sense that we let the outliers' errors ε_i approach +∞ or −∞. Our strategy to mathematically represent this situation is to let their y_i approach +∞ or −∞ while their vectors x_i remain fixed. We thus specify a particular path along which the outliers move away from the general trend.
We assume that each outlier goes to −∞ or +∞ at its own specific rate, to the extent that the ratio of two outliers is bounded. More precisely, we assume that for i = 1, . . . , n,

y_i = a_i + b_i ω,  (2.2)

where a_i, b_i ∈ R are constants such that b_i = 0 if the point is a nonoutlier and b_i ≠ 0 if it is an outlier, and then we let ω → ∞. We mathematically distinguish the outliers from the nonoutliers through the following.
Among the n observations (y_1, . . . , y_n) =: y_n, we assume that k of them form a group of nonoutlying observations, that we denote y_k, while ℓ := n − k of them are considered as outliers. For i = 1, . . . , n, we define the binary functions k_i and ℓ_i as follows: k_i = 1 if y_i is a nonoutlying value, and ℓ_i = 1 if it is an outlier. These functions take the value of 0 otherwise. Therefore, we have k_i + ℓ_i = 1 for i = 1, . . . , n, with Σ_{i=1}^n k_i = k and Σ_{i=1}^n ℓ_i = ℓ. Let the joint posterior density of β and σ be denoted by π(β, σ | y_n) and the marginal density of (Y_1, . . . , Y_n) be denoted by m(y_n), where

π(β, σ | y_n) = π(β, σ) ∏_{i=1}^n (1/σ) f((y_i − x_i^T β)/σ) / m(y_n),  (2.3)

with m(y_n) := ∫_0^∞ ∫_{R^p} π(β, σ) ∏_{i=1}^n (1/σ) f((y_i − x_i^T β)/σ) dβ dσ. Let the joint posterior density of β and σ arising from the nonoutlying observations only be denoted by π(β, σ | y_k) and the corresponding marginal density be denoted by m(y_k), defined as above with each factor (1/σ) f((y_i − x_i^T β)/σ) raised to the power k_i. Proposition 2.1 (Tail behaviour of the posteriors).
Remark 2.1. When any type of explanatory variables is considered (continuous, discrete as in ANOVA, or a mix of both as in ANCOVA), the densities are proper if we additionally assume that the design matrix (comprised of the n or k observations) has full rank. In variable selection, when the joint posterior of the models and parameters is considered, this joint posterior is proper if the assumptions are verified for the "complete" model (the model with all variables). The assumptions for the moments are more technical and are not provided here; we essentially need enough "different" x_i vectors. What is required is made clear in the proof.

Log-Regularly Varying Distributions
We now provide an overview of the class of log-regularly varying functions (LRVF), as introduced in Desgagné (2013) and Desgagné (2015), following the idea of regularly varying functions developed by Karamata (1930). They form an interesting class of functions with useful properties for robustness.
In Desgagné (2015), it is shown that Definition 2.1 is equivalent to the following: there exist a constant A > 1 and a function s ∈ L_0(∞) such that for z ≥ A, g can be written as g(z) = (log z)^{−θ} s(z). Examples of LRVF are g(z) = (log z)^{−θ} (with s(z) = 1) and g(z) = (log z)^{−θ} log(log z) (with s(z) = log(log z)).
Definition 2.2 (LRVD). A random variable Z and its distribution are said to be log-regularly varying with index θ ≥ 1 if their density f is such that zf (z) ∈ L θ (∞).
Definition 2.2 implies that any density f with tails behaving like |z| −1 (log |z|) −θ with θ > 1 is a LRVD. Some examples like the LPTN distribution are given in Desgagné (2015). The most important property of this class of distributions follows from Definition 2.1: the asymptotic location-scale invariance of their density, as stated in Proposition 2.2.
Proposition 2.2 essentially implies that the conditional density of an outlier, (1/σ)f((y − x^T β)/σ), asymptotically behaves like f(y) as y → ∞. The densities of the outliers in the numerator of the posterior densities cancel out with those in the denominator, in the marginal, provided that the integral can be interchanged with the limit. This is the idea of the proof of our robustness result presented in the next section. The greatest challenge is, however, to prove that we can indeed interchange the limit and the integral. This part leads to the condition on the maximum number of outliers that guarantees robustness.

Resolution of Conflicts
We now present Theorem 2.1, the main theoretical contribution of this paper.
The two sufficient conditions of Theorem 2.1 are remarkably simple. Condition (i) indicates that modelling must be performed using a super heavy-tailed density f, more precisely using a LRVD, e.g. a LPTN as proposed. Condition (ii) gives in fact the breakdown point, generally defined as the proportion of outliers (ℓ/n) that an estimator can handle. We have ℓ/n ≤ 1/2 − (p − 1/2)/n, which translates into a breakdown point of 50% as n → ∞ (for fixed p), usually considered as the maximum and best desired value. For instance, with n = 50 and p = 4 as in the case study of Section 3.3, up to ℓ = 21 outliers can be accommodated. Condition (ii) is thus generally satisfied in practice.
Results (a) to (d) are different representations of whole robustness. Essentially, the posterior inference arising from the whole sample converges towards the posterior inference based on the nonoutliers only. The impact of outliers then gradually vanishes as they approach plus or minus infinity.
In Result (a), the asymptotic behaviour of the marginal m(y_n) is described. This result is used in Section 3.3 to assess the robustness of Bayes factors for testing H_0 : β_i = 0 versus H_1 : β_i ≠ 0 (when i ≥ 2). Result (a) is in fact the centrepiece of Theorem 2.1; its demonstration requires considerable work, and leads relatively easily to the other results of the theorem.
The convergence of the posterior density in Result (b) allows establishing that the MAP estimates of β and σ are wholly robust. Given that these estimators correspond to the MLE when the prior is proportional to 1, the frequentist estimates are, as a result, also wholly robust. This establishes a connection between Bayesian and frequentist robustness.
Result (c) indicates that any estimation of β and σ based on posterior quantiles (e.g. using posterior medians and Bayesian credible intervals) is robust to outliers. Note that we in fact obtain the stronger result of L_1 convergence:

∫_{R^p} ∫_0^∞ |π(β, σ | y_n) − π(β, σ | y_k)| dσ dβ → 0 as ω → ∞,

which in turn implies that P((β, σ) ∈ E | y_n) → P((β, σ) ∈ E | y_k) as ω → ∞, uniformly for all sets E ⊂ R^p × R^+; this is slightly stronger than the convergence in distribution given in Result (c), which requires only pointwise convergence.
Posterior expectations are wholly robust as well, as indicated by Result (d). It is interesting to notice that all these results guarantee the robustness of a variety of Bayes estimators.
Remark 2.2. When any type of explanatory variables is considered, the same results as in Theorem 2.1 hold under the following additional assumption: it is possible to choose n/2 + (p − 1/2) (or n/2 + (p − 1/2) + M) nonoutliers, the required number of nonoutliers depending on which results we target (Results (a) to (c) or Results (a) to (d)), that have p-wise linearly independent x_i vectors. This means that any p vectors x_i1, . . . , x_ip among the chosen subgroup must be linearly independent. In variable selection, the convergence of the joint posterior of the models and their parameters, and of the expectations, holds if the assumptions are verified for the complete model.

Remark 2.3. We prove that modelling with f having tails behaving like |z|^{−1}(log |z|)^{−θ} is sufficient to obtain the results in Theorem 2.1. It seems "almost" necessary because, on the one hand, a tail behaviour of z^{−2} (corresponding to a Student density) is not sufficient, and on the other hand, |z|^{−1} is not integrable.

Efficient and Robust Regression Using LPTN
In Section 2.3, we stated theoretical results which essentially indicate that using a LRVD for the errors ensures a high breakdown point of 1/2 − (p − 1/2)/n, with whole rejection of the outliers as their errors go to +∞ or −∞. The conflict is thus resolved and the linear regression is in agreement with the bulk of the data.
In this section, we build on these results to propose a solution in the realistic situation where a statistician satisfied with the normal model in the absence of outliers seeks protection in the eventuality of contamination by outliers. Mathematically, we consider the context where the errors have a mixture distribution, with a normal component for the bulk of the data and another component F_0 for the outliers, that is,

ε_i/σ ∼ α N(0, 1) + (1 − α) F_0,  (3.1)

where 0 < α ≤ 1 represents the proportion of normal observations in the sample. We thus look for efficient estimators that perform well in the absence of outliers, that is, when α = 1 and the model is the pure normal. As mentioned in the introduction, OLS (or equivalently the normal model) is considered as the benchmark in this situation. Our efficient estimators must also be robust and perform well in the presence of outliers, and this for as many scenarios of α < 1 and F_0 as possible.

LPTN Distribution
The solution we propose consists in assuming that the errors have a LPTN distribution with a prespecified parameter ρ ∈ (2Φ(1) − 1, 1) ≈ (0.6827, 1), denoted LPTN(ρ). More precisely, we still have ε_i | σ ∼ (1/σ)f(ε_i/σ), but the density f is now assumed to be

f(z) = ϕ(z) if |z| ≤ τ,  f(z) = ϕ(τ) (τ/|z|) (log τ / log |z|)^{λ+1} if |z| > τ,  (3.2)

where z ∈ R, and τ > 1 and λ > 0 are functions of ρ given by

τ = Φ^{−1}((1 + ρ)/2),  λ = 2(1 − ρ)^{−1} ϕ(τ) τ log τ,  (3.3)

with ϕ(·), Φ(·) and Φ^{−1}(·) being the PDF, cumulative distribution function (CDF) and inverse CDF of a standard normal, respectively. The LPTN distribution was introduced by Desgagné (2015), who in fact presents a more general version than that shown here. The parameter λ that controls the tail decay was originally free and a multiplicative normalising constant K(ρ, λ) was needed. For example, the centre of the density (the area |z| ≤ τ) was given by K(ρ, λ)ϕ(z). In order to pursue our efficiency objective, we set the constant to 1, which in return forces λ to be automatically set as a function of ρ. The parameter ρ, chosen by the user, thus represents the mass of the central part that exactly matches the N(0, 1) density.
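To make the construction concrete, here is a minimal sketch of the LPTN(ρ) density and of the mapping from ρ to (τ, λ). It is written in Python for illustration (the paper's own code is in R), and the bisection-based inverse normal CDF is a simplification introduced here to keep the sketch dependency-free.

```python
import math

def norm_pdf(z):
    """Standard normal density phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    """Standard normal CDF Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def norm_ppf(p):
    """Inverse CDF by bisection; ample accuracy for this illustration."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if norm_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def lptn_params(rho):
    """tau and lambda as functions of rho: tau = Phi^{-1}((1 + rho)/2) and
    lambda chosen so that each log-Pareto tail carries mass (1 - rho)/2."""
    tau = norm_ppf((1.0 + rho) / 2.0)
    lam = 2.0 * norm_pdf(tau) * tau * math.log(tau) / (1.0 - rho)
    return tau, lam

def lptn_pdf(z, rho):
    """LPTN(rho) density: exactly phi(z) on |z| <= tau, log-Pareto tails beyond."""
    tau, lam = lptn_params(rho)
    az = abs(z)
    if az <= tau:
        return norm_pdf(z)
    return norm_pdf(tau) * (tau / az) * (math.log(tau) / math.log(az)) ** (lam + 1.0)
```

For ρ = 0.95 this yields τ ≈ 1.96 and λ ≈ 3.08, the values used in the case study, and the density is continuous at ±τ by construction.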
As ρ increases, f approaches the normal. An increase in ρ also implies an increase in λ and τ, which translates into a density f with lighter tails. Efficiency is also expected to increase, but robustness to decrease. A compromise therefore has to be made, and it is controlled by the statistician through the parameter ρ. In other words, this parameter represents the tolerance to (bounded) impact from outliers at the benefit of efficiency when the data set is not contaminated. The user can also select the value of ρ based on a prior opinion about the probable proportion of outliers, by setting it to 1 minus this proportion.
The rationale behind proposing the LPTN is thus that, in addition to exactly matching the normal density on the part with highest probability, this distribution has log-Pareto tails ensuring that our theoretical robustness results hold, and this for any value of ρ. This type of tails consequently accommodates a large spectrum of α and F_0 in the mixture (3.1) when α < 1, and generates efficient inference when α = 1 as well (this latter characteristic is discussed in Section 3.2). A comparison between different LPTN densities is shown in Figure 2. Note that, as required for our theoretical results of Section 2, the LPTN distribution has a strictly positive continuous PDF on R that is symmetric with respect to the origin and such that zf(z) is monotonic for z > τ.

Efficiency of the LPTN Model
To theoretically study the efficiency of the LPTN model, we consider the situation where the data are generated from a normal and evaluate the performance of the robust estimators in the asymptotic regime n → ∞. We start by providing evidence that the estimators of β are consistent, while consistency for σ depends on ρ. We consider that the generative normal model has β_0 ∈ R^p and σ_0 > 0 as true parameter values, and denote the associated density of one data point by g := N(x_i^T β_0, σ_0^2). Denote that associated with the LPTN model by p_(β,σ)(y_i) := (1/σ)f((y_i − x_i^T β)/σ), where f is a LPTN(ρ). In Bunke et al. (1998), it is proved that if the divergence

KL(β, σ) := ∫ log(g(y_i)/p_(β,σ)(y_i)) g(y_i) dy_i  (3.4)

is minimised at a unique (β*, σ*) and some regularity conditions are satisfied, then lim_{n→∞} E[(β, σ) | y_n] = (β*, σ*) with probability 1, where the expectation is with respect to the posterior arising from the LPTN model. This is proved through the strong consistency of the MAP.
In Section 6.2 of the supplementary material, we prove that the first derivative of (3.4) with respect to β equals 0 at β_0, and this for any value of σ. Setting β = β_0 in (3.4), we show that it is minimised at a σ* which depends on ρ (see Figure 3). We also show that most of the regularity conditions in Bunke et al. (1998) are satisfied. This analysis suggests that the true values of the regression coefficients are recovered even though the LPTN model is misspecified. For σ, the closer ρ is to 1, the more similar σ* and σ_0 are. For instance, when ρ = 0.9, σ*/σ_0 = 1.03, and beyond ρ = 0.95, this ratio is essentially 1.
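The dependence of σ* on ρ can be checked numerically. The sketch below (Python, illustrative only, not the paper's supplementary code) takes β = β_0 = 0 and σ_0 = 1 without loss of generality, so that g is the standard normal density, hardcodes τ and λ derived from the LPTN definition for ρ = 0.9, and minimises the KL divergence over σ by grid search; minimising (3.4) in σ is equivalent to maximising the expected log-likelihood under g.

```python
import math

# tau and lambda for rho = 0.9, precomputed from the LPTN definition
# (these numerical values are assumptions derived here, not quoted from the paper)
TAU, LAM = 1.644854, 1.68874

def log_phi(z):
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)

def lptn_logpdf(z):
    """Log-density of the LPTN(0.9)."""
    az = abs(z)
    if az <= TAU:
        return log_phi(z)
    return (log_phi(TAU) + math.log(TAU / az)
            + (LAM + 1.0) * math.log(math.log(TAU) / math.log(az)))

def expected_loglik(sigma, half_width=12.0, n=2001):
    """E_g[log((1/sigma) f(Y/sigma))] for Y ~ N(0, 1), via the trapezoid rule.
    Minimising KL(beta_0, sigma) over sigma is equivalent to maximising this."""
    h = 2.0 * half_width / (n - 1)
    total = 0.0
    for i in range(n):
        y = -half_width + i * h
        val = math.exp(log_phi(y)) * (lptn_logpdf(y / sigma) - math.log(sigma))
        total += val if 0 < i < n - 1 else 0.5 * val
    return total * h

# Grid search for the minimiser sigma* of the KL divergence at beta = beta_0
sigmas = [0.98 + 0.002 * i for i in range(61)]
sigma_star = max(sigmas, key=expected_loglik)
```

The search should land slightly above 1, in line with the ratio σ*/σ_0 ≈ 1.03 reported above for ρ = 0.9.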
When the data are generated from the normal model, estimators arising from it are certainly more efficient. We however numerically verified that the learning rate for the robust estimators is the same as the normal ones, suggesting that the efficiency is bounded away from 0 for all n. Some additional details are needed to rigorously prove the consistency of the Bayes estimates and to accurately conclude about efficiency.

Case Study
We carry out in this section a linear regression analysis on a given data set using our robust approach and also the classical method with the normal assumption for comparison. In doing so, we address all practical considerations, resulting in a straightforward implementation by users. In this regard, all R code used to produce numerical results is provided at https://arxiv.org/abs/1612.06198, which also allows reproducing these results.
For a given city, we want to model the market value of a house in thousands of dollars using the average home value in its residential sector in thousands of dollars, the living area in square metre (sq.m.) and the land area in sq.m. We consider a simulated sample of size n = 50 that contains 3 outliers (it is given in detail in the provided R code). To give an overview of it, we present in Table 1 the data for Home 2 and for the outliers: Homes 1, 3 and 49.

Home 2 has a value of $326,000 (the sample mean is $504,900), is located in a residential sector where houses are valued at $343,000 on average (the sample mean is $508,880), has a living area of 205 sq.m. (the sample mean is 200 sq.m.) and a land area of 345 sq.m. (the sample mean is 500 sq.m.). Homes 1 and 3 both have aberrantly low values, while it is the opposite for Home 49. They are meant to represent a damaged house, a data entry error and an eco-friendly house, respectively.
To improve the interpretation of the linear regression, the explanatory variables are centred around their respective sample means. Therefore, for each house, we define x_i2 as the average value in its residential sector (in $1,000) minus 508.88, x_i3 as the living area minus 200 and x_i4 as the land area minus 500. Note that centring affects only the constant of the model, β_1, which can now be interpreted as the predicted value of the typical house with average features. In Figure 4, we plot the dependent variable against each explanatory variable to depict their respective linear relations. The pairwise correlations between the explanatory variables are all below 0.10, suggesting that these graphs provide a fair representation of the multivariate relation. The parameters of the generative model have been set to create the expected situation in which an increase in any feature is associated with an increase in home value.

For the analysis, the density f is assumed to be a LPTN(ρ = 0.95) under the robust model and a N(0, 1) under the classical model. We also set π(β, σ) ∝ 1/σ, the usual noninformative prior. The estimation of the parameters is done through the posterior density as expressed in (2.3). The posterior means, medians and credible intervals are computed through a random walk Metropolis (RWM) algorithm, one of the easiest to implement Metropolis-Hastings (Metropolis et al. (1953) and Hastings (1970)) algorithms. More sophisticated methods like Hamiltonian Monte Carlo (HMC, see, e.g., Neal (2011)) could be used given that the likelihood function is differentiable almost everywhere. The MAP and MLE are computed through optimisation procedures; we use the general-purpose optim function in R based on the Nelder-Mead algorithm. It is common knowledge that maximisers (MAP and MLE) may not provide a posterior summary as good as posterior means, for instance. The advantage is that they can be computed quickly.
We find them particularly useful for directly giving starting points for the RWM algorithm and for conducting simulation studies as in Section 3.4.
These estimates are presented in Table 2, in which the numbers in square brackets are those based on the 47 nonoutliers only (the sample without Homes 1, 3 and 49). The credible intervals (CI) are constructed from the regions with highest posterior density using the coda package. Some interesting observations can be made. First, in the absence of outliers (results in brackets), the results of the robust LPTN model are very similar to those of the nonrobust normal model. As mentioned in Section 3.1, the LPTN(0.95) is very similar to the N(0, 1), in fact identical except for the 5% tails. The normal model is the benchmark in terms of efficiency. All presented point estimators of β under the normal model indeed correspond to OLS, which is known to produce the best estimates (in a frequentist sense) when the errors are uncorrelated with zero mean and homoscedastic with finite variance. This is the case for the nonoutliers. Our example thus suggests that the choice between the posterior means, medians, MAP or MLE is not crucial for the robust model as well. Second, we observe that in the presence of the 3 outliers (i.e. using the whole sample of size n = 50), the results of the LPTN model are barely affected, showing similar results to those excluding the outliers, while the normal model is clearly contaminated by the outliers. This is consistent with our theoretical asymptotic results, which indicate agreement with the bulk of the data under the robust model. In particular, the estimate for σ under the LPTN model is about half that arising from the normal model, resulting in much shorter credible intervals for the robust model.

With the posterior in hand, one can take the inference one step further with outlier identification and prediction. The former is first discussed. For each observation i = 1, . . . , n, one can estimate the value fitted by the hyperplane x_i^T β, the realisation of the error y_i − x_i^T β and its standardised version z_i := (y_i − x_i^T β)/σ. This can be achieved through their MAP estimates (or MLE) by simply plugging the MAP estimates (or MLE) of β and σ (as given in Table 2) in their expressions. Or, possibly better, they can be estimated by their posterior means or medians. For this purpose, samples can be directly generated from their posterior distributions through the values of β and σ already generated by the RWM algorithm (or obviously, it can be done at the same time the algorithm runs). Consider for instance Home 49, which is valued at y_49 = 1,000; the posterior means give fitted values of 704.0 (LPTN) and 759.7 (normal), errors of 296.0 (LPTN) and 240.3 (normal) and standardised errors of 6.28 (LPTN) and 2.52 (normal). We note that the hyperplane is attracted towards the outlier under the normal model, which leads to an estimated error less extreme than that under the LPTN model.
Naturally, large estimates for standardised errors |z i | indicate strong evidence of outlyingness. A threshold of 2.5 is sometimes recommended to differentiate outliers from nonoutliers, see, e.g., Gervini and Yohai (2002). On this basis, Home 49 appears clearly as an outlier under the LPTN model, while the conclusion is unclear for the normal model.
To provide a measure of outlyingness, we evaluate the probability for an (unrealised) standardised error ε_{i0}/σ, whose density is f, to be more extreme than |z_i|:

𝒫(z_i) := P(|ε_{i0}/σ| ≥ |z_i|) = (1 − ρ)(log τ / log |z_i|)^λ for |z_i| ≥ τ,

where τ = 1.96 and λ = 3.08 when ρ = 0.95, as computed with (3.3). Under the normal model, we have 𝒫(z_i) = 2(1 − Φ(|z_i|)). The measure 𝒫(z_i) is a random variable as it is a function of the unknown parameters β and σ, and can be estimated a posteriori using the same technique as above. In the same spirit as Gervini and Yohai (2002), one can flag observations with estimates for 𝒫(z_i) smaller than a chosen threshold. A reasonable threshold, in our opinion, should lie between 0.01 and 0.02. This corresponds to a range of 2.47 to 3.11 for the MAP estimates of |z_i| under the LPTN model if 𝒫(z_i) is estimated through its MAP (because this is achieved by plugging in the MAP of |z_i|).
If we look again at the results for Home 49, the posterior means for 𝒫(z_i) give 0.0024 and 0.0208 for the LPTN and normal models, respectively. Home 49 again appears clearly as an outlier under the LPTN model, whereas the evidence is much less convincing for the normal model: at a threshold of 0.02 or less, this observation would not be considered as an outlier. Outlier detection using the wholly robust LPTN model is effective; outliers do not mask each other, a well-known phenomenon arising with nonrobust models, typically due to overestimation of the scale σ and sometimes to the attraction of hyperplanes. The posterior means for the standardised errors z_i are plotted in Figure 5, along with the posterior means for 𝒫(z_i) for the three outliers.
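The tail-probability measure of outlyingness has a closed form under both models, so it is cheap to sketch. The following Python helper (illustrative only; the paper's code is in R) assumes the LPTN parametrisation with τ ≈ 1.96 and λ ≈ 3.08 for ρ = 0.95, and uses the fact that below τ the LPTN coincides with the standard normal.

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def outlyingness_lptn(z, tau=1.959964, lam=3.08136):
    """P(|eps/sigma| >= |z|) under the LPTN: normal centre, log-Pareto tail."""
    az = abs(z)
    rho = 2.0 * norm_cdf(tau) - 1.0  # central mass matching the N(0, 1)
    if az <= tau:
        return 2.0 * (1.0 - norm_cdf(az))  # coincides with the normal measure here
    return (1.0 - rho) * (math.log(tau) / math.log(az)) ** lam

def outlyingness_normal(z):
    """P(|eps/sigma| >= |z|) under the N(0, 1) model."""
    return 2.0 * (1.0 - norm_cdf(abs(z)))
```

Plugging in the posterior-mean standardised errors of Home 49 gives outlyingness_lptn(6.28) ≈ 0.0023, close to the posterior mean 0.0024 reported above, while outlyingness_normal(2.52) ≈ 0.0117 differs more from 0.0208 since the plug-in ignores posterior uncertainty. One can also check that a measure of 0.02 corresponds to |z_i| ≈ 2.47 under the LPTN, consistent with the threshold range given above.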
For predicting a future observation, say Y_{n+1} = x_{n+1}^T β + ε_{n+1}, we estimate its posterior predictive density by sampling from it through the RWM algorithm as before. For each realisation of (β, σ) in the Markov chains, we generate ε_{n+1} from an LPTN (or a normal for the nonrobust model) centred at 0 with a scale parameter σ, to which we add x_{n+1}^T β. We can thus easily compute posterior predictive quantities such as the median, credible intervals, probabilities and so on. Note that the expectation does not exist under the LPTN (because it does not exist for ε_{n+1}). The MAP can be approximated from the sample, but because it requires extra work, we suggest using the median for prediction.
If, for example, we consider the future observation of the typical house with x_{n+1,2} = x_{n+1,3} = x_{n+1,4} = 0, the posterior predictive medians for Y_{n+1} are 514.0 and 504.9 under the LPTN and normal models, respectively; they are, as expected, around the posterior medians of the intercept β_1. The credible intervals are (417.4, 611.6) and (313.7, 698.6) for the LPTN and normal models, respectively. We note the shorter interval for the robust model, which is attributable to the robust estimation of the scale σ.
Finally, we can easily perform statistical hypothesis testing through Bayes factors. For this, we implement a reversible jump algorithm (Green (1995)) with two models and a uniform prior on them. If, for instance, we want to test H_0: β_4 = 0 versus H_1: β_4 ≠ 0, the implementation essentially requires the tuning of an additional RWM algorithm, namely that for sampling the parameters of the model without x_4. In our example, the Bayes factors are 1.68 × 10^3 and 1.74 × 10^3 for the LPTN and normal models, respectively. If we exclude the outliers, they become 2.80 × 10^3 and 2.12 × 10^3, respectively.
The Bayes factor is a robust measure under the model with an LPTN distribution on the error term. Indeed, Result (a) of Theorem 2.1 states that the marginal m(y_n) behaves like m(y_k), and when the assumptions of Theorem 2.1 are satisfied for the larger model, they are automatically satisfied for the smaller one. As a result, the Bayes factor m(y_n)/m(y_n | H_0) behaves like m(y_k)/m(y_k | H_0).

Performance Evaluation
In this section, we evaluate the performance of the robust LPTN model through a simulation study. We consider the same data set and model as in Section 3.3, except that the y_n are now generated. Several values of ρ are considered: ρ = 0.80, 0.84, 0.90, 0.93, 0.95 and 0.98. As in the previous section, the LPTN model is compared with the nonrobust normal model. We add the Bayesian approach of Box and Tiao (1968) with normal mixtures and the model with the Student distribution. For the latter, we consider different degrees of freedom (df): 1, 2, 4, 6 and 10. We set π(β, σ) ∝ 1 and estimate the parameters using the MAP, which therefore corresponds to the MLE. The Bayesian methods thus become direct competitors to frequentist robust estimators like the popular M- and S-estimators. These, as well as the MM-, REWLSE (the two best frequentist methods according to the recent review by Yu and Yao (2017)) and LTS estimators, are included in the simulation study.
The data y_n are generated through the errors ε_i | σ ~ (1/σ) f(ε_i/σ) under the following scenarios, including:
• Scenario 3: …, where the x_i of the outliers are modified to make them high-leverage points (the procedure is explained in detail below);
• Scenario 4: f = 90% N(0, 1) + 10% N(3, 1), where the x_i of the outliers are modified to make them high-leverage points.
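To make the design concrete, here is a minimal sketch (our own illustration, with hypothetical names; the paper's exact high-leverage modification of the x_i is not reproduced) of data generation under a Scenario 4-style error mixture:

```python
import numpy as np

def simulate_mixture(X, beta, sigma=1.0, p_out=0.10, shift=3.0, rng=None):
    """Errors from (1 - p_out) N(0, 1) + p_out N(shift, 1), scaled by sigma.

    Returns y and a flag marking draws from the outlying component.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    is_out = rng.random(n) < p_out           # mixture component indicator
    eps = rng.standard_normal(n) + shift * is_out
    return X @ beta + sigma * eps, is_out
```

A high-leverage variant would additionally modify one randomly chosen covariate x_{i,j0}, j0 ∈ {2, 3, 4}, of each flagged observation, in the spirit of the procedure described in the text.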
Nonoutliers are generated from the first mixture component, whereas outliers are generated from the second one. The choice of locations for the outliers aims at producing challenging and interesting situations, in which a vast spectrum of behaviours is observed, especially for the LPTN and Student models with their different sets of parameters ρ and df. Scenarios 2 and 4 are studied to show how performance varies when the number of outliers is doubled, from 5% to 10% of the sample size. For each scenario, we consider two sample sizes: n = 50 and n = 100. The case n = 50 corresponds to the original x_1, . . . , x_50; for the case n = 100, 50 additional observations of the explanatory variables are generated in the same fashion as the original ones. For Scenarios 3 and 4, when an error is generated from the second mixture component (that generating extreme values), say ε_{i0}, we modify one of the coordinates of the associated x_{i0} to make the observation a high-leverage point. More precisely, we randomly choose a covariable number, say j_0 ∈ {2, 3, 4}, and …

The performance of each model/estimator is evaluated through the premium versus protection approach of Anscombe and Guttman (1960). This approach consists in computing the premium to pay for using a robust alternative R to the normal N when there are no outliers (Scenario 0), and the protection provided by this alternative when the data sets are contaminated (as is likely in the other scenarios). The premium and protection associated with a robust alternative R are evaluated as follows: where S represents the scenario under which the protection is evaluated (1, 2, 3 or 4), and M_N(β̂ | S), for instance, denotes an error measure M for estimating β by β̂ using the normal model N in Scenario S. The scenario is not specified for the premium because it does not vary; it is always Scenario 0. The premium and protection with respect to σ̂, namely Premium(R, σ̂) and Protection(R, σ̂ | S), have analogous definitions.
We consider two distinct error measures, M_R(β̂ | S) and M_R(σ̂ | S), to highlight the difference between them, and also because there is no natural way of combining them. We propose to define M_R(β̂ | S) as the square root of the expectation with respect to Y_n (and therefore with respect to the estimates associated with each realisation) of the average squared vertical distance between the estimated and true hyperplanes measured at each observation x_i, where X is the design matrix with rows x_1^T, . . . , x_n^T. The expression after the second equality provides another interpretation: the measure represents, for a suitably standardised β̂, the square root of the trace of the mean squared error (MSE) matrix of β̂. Given that under the normal model σ^2 (X^T X)^{-1} is the covariance matrix of β̂, standardisation is applied to β̂ in our measure. For σ̂, we simply use the square root of its MSE: (E[(σ̂ − σ)^2])^{1/2}. Note that the expectations are approximated through the simulation of 250,000 vectors y_n. The premium and protection for a given robust alternative R in a given scenario S are therefore the relative increase and decrease in M_R(· | S) due to the use of the robust alternative instead of the normal (the benchmark model), respectively. For each robust alternative, there are four premiums to compute: one for the measure for β̂ and one for the measure for σ̂, in each of the cases n = 50 and n = 100. There are sixteen protections to compute, given that we also do this for Scenarios 1, 2, 3 and 4. The idea is to present the results graphically by plotting the pairs (Premium(R, β̂), Protection(R, β̂ | S)) for all robust alternatives. The results for Scenarios 1 and 2 are shown in Figure 6, and those for Scenarios 3 and 4 in Figure 7.
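With the measures in hand, the premium and protection reduce to a relative increase and decrease. The sketch below is our own illustration with hypothetical names; `M_beta` omits the standardisation by σ^2 (X^T X)^{-1} discussed above and keeps only the vertical-distance interpretation:

```python
import numpy as np

def M_beta(beta_hats, beta_true, X):
    """sqrt of the average (over replications and observations) squared
    vertical distance between fitted and true hyperplanes."""
    D = (np.asarray(beta_hats) - beta_true) @ X.T   # (reps, n) distances
    return float(np.sqrt(np.mean(D ** 2)))

def premium(M_R0, M_N0):
    """Relative increase in error under Scenario 0 (no outliers)."""
    return M_R0 / M_N0 - 1.0

def protection(M_RS, M_NS):
    """Relative decrease in error under a contaminated scenario S."""
    return 1.0 - M_RS / M_NS
```

For instance, a robust fit 5% worse than the normal on clean data but 40% better under contamination gives premium 0.05 and protection 0.40, a point in the upper left part of Figures 6 and 7.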
From this premium versus protection perspective, a robust alternative dominates another if its premium is smaller and its protection larger. This means that in Figures 6 and 7, we are looking for points in the upper left parts. We notice that the robust alternatives are all excellent candidates, except perhaps the S-estimator, which we choose not to show because of its large premium for β̂ and its behaviour for σ̂, identical to that of the MM-estimator. In particular, the presented robust alternatives all handle high-leverage points.
By looking at Figures 6 and 7, we notice that the LPTN curve (in green) dominates the Student curve (in orange), most markedly for σ̂, but also for β̂. We also notice that the optimal values of ρ for the LPTN are around the nonoutlier percentages, i.e. around 0.95 (the second point starting from the lower left corner) in Scenarios 1 and 3, and around 0.90 (the fourth point starting from the lower left corner) in Scenarios 2 and 4. This supports our suggestion in Section 3.1 of selecting ρ based on prior knowledge about probable proportions of outliers, if users do not have other preferences. The best LPTN models in all scenarios essentially dominate all the other alternatives with respect to σ̂. As for β̂, the performance of these LPTN models is among the best. The mixture model appears better in this case, but often only slightly; the difference varies with the number of outliers and the sample size. For instance, look at the LPTN(0.95) in Scenarios 1 and 3 (and also at the scale of the x axis), and notice how the LPTN(0.98) gets closer to the mixture model in these scenarios when the sample size is doubled, which makes this model almost the best. This allows us to make an interesting remark: for a given percentage of outliers (and therefore of nonoutliers), a larger sample size translates into enhanced protection, because there are more nonoutliers. This is especially true for LPTN models with ρ close to 1.
To summarise, the LPTN emerges as the best Bayesian robust approach with top performances for bothβ andσ. Even when considered among the frequentist methods, it seems the most balanced alternative.

Conclusion
The goal of this paper, namely to provide a solution that reaches gold standards in terms of premium versus protection for all parameters, has now been achieved. The foundations for great protection were established through our main theoretical contribution: the proof of whole robustness results for linear regression.
The key result is the convergence of the posterior distribution towards that based on the nonoutliers only, as the outliers approach plus or minus infinity (Result (c), Theorem 2.1). The robustness results hold under two simple and intuitive conditions. Firstly, the error term must follow a super heavy-tailed distribution, namely a LRDV, to accommodate the presence of outliers. Secondly, the number of outliers ℓ must not exceed n/2 − (p − 1/2), that is, half the sample size minus the number of regression coefficients less one half. This last condition translates into a limiting breakdown point of 0.5 as n → ∞. Although the whole robustness results are theoretical and asymptotic, their practical relevance has been shown through a comprehensive study of the LPTN model. This specific choice of super heavy-tailed distribution represents our main practical contribution, as the resulting model is remarkably efficient and deals with outlying observations in an automatic and sensible manner, achieving low premium in addition to large protection. The data analysis procedure to which it gives rise is also easy to use. These characteristics of the LPTN model make it a particularly appealing Bayesian alternative to the partially robust Student model.

Proofs
For space considerations, we provide in this section only sketches of the proofs of Proposition 2.1 and Theorem 2.1. The detailed proofs can be found in Section 6.1 of the supplementary material.

Proof of Proposition 2.1
Assume for now that the scale parameter is known, with value σ_0. To simplify, we denote the posterior density by π(β | y_n) := π(β, σ = σ_0 | y_n). To prove that it is proper, we show that the marginal m(y_n) is finite. This follows from π(β, σ_0) ≤ B max(1, 1/σ_0) (by assumption) and f ≤ B (because of the assumptions on this PDF), together with the changes of variables u_i = (y_i − x_i^T β)/σ_0, i = 1, . . . , p, B being a positive constant. The resulting quantity is finite given that the determinant is different from 0, because all explanatory variables are continuous. Note that this also justifies the assumption mentioned in Remark 2.1 about the full rank of the design matrix when any type of explanatory variables is considered.
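The bounding argument sketched above can be written out as follows (our reconstruction, consistent with the surrounding prose; X_{(p)} denotes the p × p matrix with rows x_1^T, . . . , x_p^T):

```latex
m(y_n)
 = \int_{\mathbb{R}^p} \pi(\boldsymbol\beta,\sigma_0)
   \prod_{i=1}^{n} \frac{1}{\sigma_0}
   f\!\left(\frac{y_i-\mathbf{x}_i^T\boldsymbol\beta}{\sigma_0}\right)
   d\boldsymbol\beta
 \le B\max(1,1/\sigma_0)\left(\frac{B}{\sigma_0}\right)^{\!n-p}
   \int_{\mathbb{R}^p}\prod_{i=1}^{p}\frac{1}{\sigma_0}
   f\!\left(\frac{y_i-\mathbf{x}_i^T\boldsymbol\beta}{\sigma_0}\right)
   d\boldsymbol\beta
 = B\max(1,1/\sigma_0)\left(\frac{B}{\sigma_0}\right)^{\!n-p}
   \bigl|\det X_{(p)}\bigr|^{-1} < \infty,
```

the last equality following from the change of variables u_i = (y_i − x_i^T β)/σ_0, i = 1, . . . , p, and ∫ f(u) du = 1.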
An additional integral with respect to σ is added in front when π(β, σ | y_n) is considered. For σ not too small (bounded from below), it is easy to see that the additional integral is finite because max(1, 1/σ) is bounded and σ^{−(n−p)} is integrable if n − p ≥ 2; this is the case because n > p + 1 by assumption. For small σ, the proof is more technical and requires bounding the densities f more carefully than above. See Section 6.1.1 of the supplementary material for details.
Proving that π(β, σ | y_k) is proper is similar. For the moments of σ, we bound M of the densities f by B. The resulting quantity is finite given that m(y_n) < ∞ and the remaining integral corresponds to the marginal of n − M observations, with n − M > p + 1 by assumption.
For the moments of β_j, the argument is more technical. Consider the first moment. We would like to compute instead the first moment of |y_i − x_i^T β|, because (|y_i − x_i^T β|/σ) f(|y_i − x_i^T β|/σ) ≤ B (by the assumptions on f), and, as for the moments of σ, it would be easy to show that the integral is finite. The strategy is to write β_j as e_j^T β, where e_j is a vector of size p having 1 at the j-th position and 0's elsewhere, and to write e_j^T as a linear combination of p vectors x_{i1}, . . . , x_{ip}, essentially retrieving what we were looking for. See Section 6.1.1 of the supplementary material for details.

Proof of Theorem 2.1
Proof of Result (a). To prove this result, we express the quantity of interest as an integral and show that this integral converges towards 1 as ω → ∞. Assuming that we can interchange the limit and the integral, the result follows by using Proposition 2.2 in the second equality, and then Proposition 2.1. Note that the conditions of Proposition 2.1 are satisfied given that k ≥ ℓ + 2p − 1 ⇒ k ≥ p + 2, assuming that ℓ ≥ 1 (otherwise the proof is trivial) and because p ≥ 2.
To interchange the limit and the integral, we need to prove that the integrand is bounded by an integrable function of β and σ that does not depend on ω. As in Section 5.1, let us set for now the scale parameter to a positive value σ_0. Consider first β ∈ F, a set such that the hyperplanes pass (relatively) close to the nonoutliers (fixed observations) and therefore (relatively) far from the outliers. In this case, for large enough ω, the factor involving the outliers is bounded above using Proposition 2.2 because x_i^T β is bounded (recall that y_i = a_i + b_i ω), and the remaining terms on the right-hand side (RHS) of (5.1) give π(β, σ_0 | y_k), which is integrable.
Consider now β ∈ O, a set such that the hyperplanes pass (relatively) close to the outliers. The difference is that we are not sure that these hyperplanes do not also pass close to the nonoutliers (see Figure 8). In this example, n = 5, k = 4 and ℓ = 1, which satisfy the assumptions of Theorem 2.1: k − ℓ = 3 ≥ 2(p − 1/2) = 3. The factor involving the outlier is again bounded above using Proposition 2.2, but now because |y_4 − x_4^T β| is close to ω (this is explained in greater detail in Section 6.1.2 of the supplementary material). Note that this would not be true if x_1 = x_4, which is why we require enough different vectors x_i in Remark 2.2. The remaining terms on the RHS of (5.1) give a quantity which, after multiplying and dividing by the appropriate marginal, is proportional to the posterior density based on y_1, y_2, y_3, y_5; this is integrable given that 4 = n − ℓ ≥ p + 2 = 2p + ℓ − 1 = 4. This justifies the assumption on the number of nonoutliers in Theorem 2.1, given by k = n − ℓ ≥ 2p + ℓ − 1. The strategy for the general proof is to rewrite the domain of β (which is R^p) as a finite number of mutually exclusive sets, on each of which it is always possible to proceed as above. The function to bound thus becomes a finite sum, where each term is bounded above by an integrable function. When σ is free, an additional level of technicality is added because |y_i − x_i^T β| can be large, but not |y_i − x_i^T β|/σ. See Section 6.1.2 of the supplementary material for all the details.

Proof of Result (b). We have that
The absolute value on the RHS converges to 0 as ω → ∞, uniformly in (β, σ) ∈ [−ϑ, ϑ]^p × [1/η, η], using Proposition 2.2 and Result (a), for any ϑ ≥ 0 and η ≥ 1. On this set, π(β, σ | y_k) is bounded using the assumptions on the prior and f and the fact that m(y_k) is finite. This concludes the proof.
Proof of Result (c). Result (c) is a direct consequence of Result (b) using Scheffé's theorem (see Scheffé (1947)). See the supplementary material Section 6.1.2 for details.

Proof of Result (d). Result (d) is proved through a mix of the strategies used
to show Result (a) and that the moments exist in Proposition 2.1. Assuming that we can interchange the limit and the integral, the result follows using Result (b). Again, we have to prove that the integrand is bounded by an integrable function of β and σ that does not depend on ω. To achieve this, we bound σ^M π(β, σ | y_n) above by a constant times a function similar to the one shown to be bounded by an integrable function of β and σ in the proof of Result (a). See Section 6.1.2 of the supplementary material for details. We proceed with the same strategy for E[β_j^M | y_n].
… the divergence and the regularity conditions in Bunke et al. (1998). Finally, we provide a result in Section 6.3 that was used to verify that all point estimators of β under the normal model correspond to OLS, as mentioned in Section 3.3.

Proofs
Recall the assumptions on f: f is a strictly positive continuous PDF on R that is symmetric with respect to the origin, and such that both tails of |z| f(z) are monotonic, which implies that the tails of f(z) are also monotonic. The monotonicity of the tails of f(z) and |z| f(z) implies that there exists a constant M > 0 such that |y| ≥ |z| ≥ M ⇒ f(y) ≤ f(z) and |y| f(y) ≤ |z| f(z). (6.1) All these assumptions on f imply that f(z) and |z| f(z) are bounded on the real line, and that both converge to 0 as |z| → ∞. We can therefore define the constant B > 0 as follows: B := max{ sup_z f(z), sup_z |z| f(z), sup_{β,σ} π(β, σ)/max(1, 1/σ) }.

Proof of Proposition 2.1
To prove that π(β, σ | y_n) is proper (the proof for π(β, σ | y_k) is omitted because it is similar), it suffices to show that the marginal m(y_n) is finite. Recall that we require n > p + 1. The reader will notice that only n ≥ p + 1 is required if π(β, σ) is bounded by B/σ for all σ > 0 and β ∈ R^p (instead of being bounded by B max(1, 1/σ)). We first show that the function is integrable on the area where the ratio 1/σ is bounded. More precisely, we consider β ∈ R^p and δM^{-1} ≤ σ < ∞, where δ is a positive constant that can be chosen as small as we want (upper bounds are provided in the proof). We next show that the function is integrable on the area where the ratio 1/σ approaches infinity, that is, 0 < σ < δM^{-1}. We then have the following bounds. In Step a, we use π(β, σ) ≤ B max(1, 1/σ) and we bound each of n − p densities f by B. In Step b, we use the change of variables u_i = (y_i − x_i^T β)/σ for i = 1, . . . , p. The determinant is non-null because all explanatory variables are continuous. Indeed, consider the case p = 2 for instance (i.e. the simple linear regression); the determinant is different from 0 provided that x_12 ≠ x_22, and this happens with probability 1. When any type of explanatory variables is considered, we need to be able to select p observations, say those with x_{i1}, . . . , x_{ip}, such that the matrix with rows x_{i1}^T, . . . , x_{ip}^T has a non-null determinant. This is possible when the design matrix has full rank, as specified in Remark 2.1. In Step c, we use n > p + 1. Note that if, instead, we bound π(β, σ) by B/σ in Step a, one can verify that the condition n ≥ p + 1 is sufficient to bound the integral above.
We now show that the integral is finite on β ∈ R^p and 0 < σ < δM^{-1}. In this area, the ratio 1/σ approaches infinity. We have to carefully analyse the subareas where y_i − x_i^T β is close to 0 in order to deal with the 0/0 form of the ratios (y_i − x_i^T β)/σ. To achieve this, we split the domain of β using the sets R_i := {β : |y_i − x_i^T β| < δ}, i = 1, . . . , n. The set R_i represents the hyperplanes y = x_i^T β characterised by the different values of β that satisfy |y_i − x_i^T β| < δ; in other words, it represents the hyperplanes passing near the point (x_i, y_i), and more precisely, at a vertical distance of less than δ. The set ∩_{i1=1}^n R_{i1}^c is therefore comprised of the hyperplanes that do not pass close to any point. The set ∪_{i1=1}^n (R_{i1} ∩ (∩_{i2≠i1} R_{i2}^c)) represents the hyperplanes passing near one (and only one) point. The set ∪_{i1≠i2} (R_{i1} ∩ R_{i2} ∩ (∩_{i3≠i1,i2} R_{i3}^c)) represents the hyperplanes passing near two (and only two) points, and so on.
We choose δ small enough to ensure that R_{i1} ∩ R_{i2} ∩ . . . ∩ R_{ip} ∩ R_{ip+1} = ∅ when i_1, . . . , i_{p+1} are all different. It is possible to do so because a hyperplane passes through no more than p points. Note that all sets R_{i1} ∩ R_{i2} ∩ . . . ∩ R_{ip} are nonempty when i_1, . . . , i_p are all different, because all explanatory variables are continuous (which implies that the p × p matrix with rows x_{i1}^T, . . . , x_{ip}^T has a determinant different from 0). As mentioned above, if they are not all continuous, we have to select them in order to have a matrix with a non-null determinant. Note also that the decomposition of R^p given in (6.2) is comprised of Σ_{i=0}^{p} (n choose i) mutually exclusive sets. We now consider 0 < σ < δM^{-1} and β in one of these mutually exclusive sets. As explained above, the difficulty lies in dealing with the hyperplanes β such that |y_i − x_i^T β| < δ for some points (x_i, y_i). The strategy is essentially to use the product of the terms (1/σ) f((y_i − x_i^T β)/σ) for these points to integrate over β, and to bound the other terms of m(y_n). Therefore, if the hyperplanes pass near p − 1 points only, we consider the points (x_{i1}, y_{i1}), . . . , (x_{ip−1}, y_{ip−1}), and any other point (x_{ip}, y_{ip}) (leading to a matrix with a non-null determinant) to integrate over β, and so on. In Step a, we use π(β, σ) ≤ B max(1, 1/σ) = (B/σ) max(σ, 1) ≤ (B/σ) max(δM^{-1}, 1). In Step b, for all i ∉ {i_1, . . . , i_p}, we use f((y_i − x_i^T β)/σ) ≤ f(δ/σ) by the monotonicity of the tails of f, because |y_i − x_i^T β|/σ ≥ δ/σ ≥ δ(δ^{-1}M) = M. In Step c, we bound n − p − 1 terms (1/σ) f(δ/σ) by B/δ.
We now prove that the M-moments exist if n > p + 1 + M (considering the whole data set; the proof for the posterior expectations based on the nonoutliers only is omitted because it is similar). It is easy to prove that E[σ^M | y_n] < ∞ from what has been demonstrated above: it suffices to bound M of the densities f by B. We know that m(y_n) is finite because n > p + 1. We also know that the remaining integral is finite because it corresponds to the marginal of n − M observations, which is finite given that n − M > p + 1.
For the expectations E[|β_j|^M | y_n], we detail the proof for the cases M = 1 and M = 2. From these, it will be clear that the result holds in general, with further technicalities. For the proof, we use that β_j can be rewritten as e_j^T β, where e_j is a vector of size p having 1 at the j-th position and 0's elsewhere. This vector can be expressed as a linear combination of p vectors x_{i1}, . . . , x_{ip}, i_1, . . . , i_p ∈ {1, . . . , n}, because these vectors are linearly independent and form a basis of R^p (given that all explanatory variables are continuous). Using the triangle inequality, we obtain a bound with coefficients a_1, . . . , a_p ∈ R. We can prove that E[|y_{is} − x_{is}^T β| | y_n] < ∞ in the same way that we proved E[σ^M | y_n] < ∞. Therefore, E[|β_j| | y_n] < ∞ if n > p + 2. For the second moment, using again the triangle inequality, we analyse the last two terms separately, starting with the second one. Applying the triangle inequality once more, all the terms in the sum are finite if n > p + 3. For the other term, we proceed as before, writing x_{is} as a linear combination of x_{i1}, . . . , x_{ip}, i_1, . . . , i_p ∈ {1, . . . , n} \ {i_s}. To be able to do this, we need to select p linearly independent vectors among the remaining n − 1 ≥ p; this is possible given that n > p + 3 and all explanatory variables are continuous. Therefore, E[|β_j|^2 | y_n] < ∞ if n > p + 3.

Proof of Theorem 2.1
Recall that we assume ℓ ≤ n/2 − (p − 1/2) ⇔ k ≥ n/2 + (p − 1/2) ⇔ k ≥ ℓ + 2p − 1. In addition, we will assume that ℓ ≥ 1, i.e. that there is at least one outlier; otherwise the proof is trivial. A proposition and a lemma that are used in the proof are first given, and the proofs of Results (a) to (d) follow. The proofs of this proposition and this lemma can be found in Desgagné (2015).
Proof of Result (a). We first express the quantity of interest as an integral and show that this integral converges towards 1 as ω → ∞ to prove Result (a). If we use Lebesgue's dominated convergence theorem to interchange the limit ω → ∞ and the integral, the result follows using Proposition 2.2 in the second equality (since x_1, . . . , x_n are fixed), and then Proposition 2.1. Note that the conditions of Proposition 2.1 are satisfied because k ≥ ℓ + 2p − 1 ⇒ k ≥ p + 2 (because ℓ ≥ 1 and p ≥ 2). When any type of explanatory variables is considered, we select p observations, say those with x_{i1}, . . . , x_{ip}, such that the matrix with rows x_{i1}^T, . . . , x_{ip}^T has a non-null determinant; this is possible given the assumption mentioned in Remark 2.2. Note also that pointwise convergence is sufficient, for any value of β ∈ R^p and σ > 0, once the limit is inside the integral. However, in order to use Lebesgue's dominated convergence theorem, we need to prove that the integrand is bounded by an integrable function of β and σ that does not depend on ω, for any value of ω ≥ y, where y is a constant. The constant y can be chosen as large as we want, and minimum values for y are given throughout the proof. In order to bound the integrand, we divide the domain of integration into two areas: 1 ≤ σ < ∞ and 0 < σ < 1. Again, we want to separately analyse the area where the ratio 1/σ approaches infinity.
We assumed that y_i can be written as y_i = a_i + b_i ω, where ω → ∞, and a_i and b_i are constants such that a_i ∈ R, and b_i ≠ 0 if observation i is an outlier. Therefore, the ranking of the elements in the set {|y_i| : i is an outlier} is primarily determined by the values |b_1|, . . . , |b_n|, and we can choose the constant y larger than a certain threshold to ensure that this ranking remains unchanged for all ω ≥ y. Without loss of generality, we assume for convenience that ω = min{…}. We now bound the integrand above on the first area. Area 1: Consider 1 ≤ σ < ∞ and assume without loss of generality that y_1, . . . , y_{ℓ+2p−1} are ℓ + 2p − 1 nonoutliers (therefore k_1 = . . . = k_{ℓ+2p−1} = 1). In Step a, we use y_i = a_i + b_i ω and Lemma 6.1, together with π(β, σ) ≤ B max(1, 1/σ) = B and |a_i/σ| ≤ |a_i| for all i. In Step b, we use Lemma 6.1 again. In Step c, we set b_i = 0 if k_i = 1 and we use the symmetry of f to obtain f(−x_i^T β/σ) = f(x_i^T β/σ). In Step d, we use the assumption k_1 = . . . = k_p = 1.
Now it suffices to demonstrate that the remaining factor is bounded by a constant that does not depend on ω, β and σ, using the change of variables u_i = x_i^T β/σ, i = 1, . . . , p. The determinant is different from 0 because all the explanatory variables are continuous. When any type of explanatory variables is considered, we assume without loss of generality that x_1, . . . , x_p are additionally linearly independent. Note that if, instead, we bound π(β, σ) by B/σ in Step a above, one can verify that the condition k ≥ p + 1 is sufficient to bound the integral above.
In order to prove the result, we split the domain of β using the sets O_i := {β : |b_i ω − x_i^T β| < ω/2}, for the outlying observations i ∈ I_O, and F_i := {β : |x_i^T β| < ω/γ}, for the nonoutlying observations i ∈ I_F. The set O_i represents the hyperplanes y = x_i^T β characterised by the different values of β that satisfy |b_i ω − x_i^T β| < ω/2; in other words, it represents the hyperplanes that pass at a vertical distance of less than ω/2 of the point (x_i, b_i ω), which is considered as an outlier since ω → ∞ (recall that b_i ω = y_i − a_i). Analogously, the set F_i represents the hyperplanes that pass at a vertical distance of less than ω/γ of the point (x_i, 0), which is considered as a nonoutlier. Therefore, the set ∩_i O_i^c represents the hyperplanes that pass at a vertical distance of at least ω/2 of all the points (x_i, b_i ω) (all the outliers). The set ∪_i (O_i ∩ (∩_{i1} F_{i1}^c)) represents the hyperplanes that pass at a vertical distance of less than ω/2 of at least one point (x_i, b_i ω) (an outlier), but at a vertical distance of at least ω/γ of all the points (x_i, 0) (all the nonoutliers). For each i_1 ∈ I_F, the set ∪_i (O_i ∩ F_{i1} ∩ (∩_{i2≠i1} F_{i2}^c)) represents the hyperplanes that pass at a vertical distance of less than ω/2 of at least one point (x_i, b_i ω) (an outlier), at a vertical distance of less than ω/γ of the point (x_{i1}, 0) (a nonoutlier), but at a vertical distance of at least ω/γ of all the other nonoutliers, and so on. Now, we claim that O_i ∩ F_{i1} ∩ · · · ∩ F_{ip} = ∅ for all i, i_1, . . . , i_p with i_j ≠ i_s for all j ≠ s, meaning that there is no hyperplane that passes at a vertical distance of less than ω/2 of the point (x_i, b_i ω) (an outlier) and at the same time at a vertical distance of less than ω/γ of p points (x_{ij}, 0) (nonoutliers). To prove this, we use the fact that x_i (a vector of size p) can be expressed as a linear combination of x_{i1}, . . . , x_{ip}. This is true because all explanatory variables are continuous, and therefore linearly independent with probability 1.
When any type of explanatory variables is considered, we select observations 1 to ℓ + 2p − 1 to be such that any p vectors x_{i1}, . . . , x_{ip}, with {i_1, . . . , i_p} ⊂ {1, . . . , ℓ + 2p − 1}, are linearly independent; this is possible given the assumption mentioned in Remark 2.2. As a result, considering that β ∈ F_{i1} ∩ · · · ∩ F_{ip} and x_i = Σ_{s=1}^p a_s x_{is} for some a_1, . . . , a_p ∈ R, we proceed in three steps. In Step a, we use the reverse triangle inequality. In Step b, we use that |b_i| ≥ 1 and |Σ_{s=1}^p a_s x_{is}^T β| ≤ Σ_{s=1}^p |a_s| |x_{is}^T β| ≤ Σ_{s=1}^p |a_s| ω/γ because β ∈ F_{i1} ∩ · · · ∩ F_{ip}, which means that |x_i^T β| < ω/γ for all i ∈ {i_1, . . . , i_p}. In Step c, we define the constant γ such that γ ≥ 2 Σ_{s=1}^p |a_s| (we choose γ such that it satisfies this inequality for any combination of i and i_1, . . . , i_p). Therefore, we have that β ∉ O_i. This proves that O_i ∩ F_{i1} ∩ · · · ∩ F_{ip} = ∅ for all i, i_1, . . . , i_p with i_j ≠ i_s for all j ≠ s. This in turn implies that (6.5) can be rewritten as a decomposition of R^p comprised of 1 + Σ_{i=0}^{p−1} (ℓ+2p−1 choose i) mutually exclusive sets: ∩_i O_i^c, ∪_i (O_i ∩ (∩_{i1} F_{i1}^c)), ∪_i (O_i ∩ F_{i1} ∩ (∩_{i2≠i1} F_{i2}^c)) for i_1 ∈ I_F, and so on.
We are now ready to bound the function on the left-hand side of (6.4). We first show that the function is bounded on β ∈ ∩_i O_i^c and 1 ≤ σ ≤ ω/(γM). For all i ∈ I_O, the corresponding factor is bounded using the monotonicity of f, because |b_i ω − x_i^T β|/σ ≥ ω/(2σ) ≥ γM/2 ≥ M (we choose γ ≥ 2), and then Lemma 6.1. Next, we consider the area defined by 1 ≤ σ ≤ ω/(γM) and β belonging to one of the other mutually exclusive sets. In Step a, we use f ≤ B for all i ∈ I_O. In Step b, we use the fact that, in any of the sets to which β can belong, there are at least ℓ nonoutlying points (x_i, 0) such that |x_i^T β| ≥ ω/γ. Indeed, the case with the fewest nonoutliers such that |x_i^T β| ≥ ω/γ corresponds to β ∈ ∪_i (O_i ∩ F_{i1} ∩ · · · ∩ F_{ip−1} ∩ (∩_{ip≠i1,...,ip−1} F_{ip}^c)). In this case there are p − 1 nonoutliers such that |x_i^T β| < ω/γ (observations i_1 to i_{p−1}), which leaves ℓ + p − 1 − (p − 1) = ℓ nonoutliers such that |x_i^T β| ≥ ω/γ (i.e., there are ℓ sets in the intersection ∩_{ip≠i1,...,ip−1} F_{ip}^c). Therefore, for ℓ nonoutliers such that |x_i^T β| ≥ ω/γ, we use the monotonicity of f, because |x_i^T β|/σ ≥ ω/(γσ) ≥ M, and then Lemma 6.1. For the remaining k − p − ℓ nonoutlying points, we use f ≤ B. Note that this argument justifies the need for the assumption k ≥ ℓ + 2p − 1.
Area 2: Consider 0 < σ < 1; we need to show the analogous domination on this area, and we proceed in a slightly different manner than for Area 1. We begin by dividing the first integral into two parts, according to whether β ∈ ∩_i O_i^c or β ∈ ∪_i O_i, where O_i := {β : |y_i − x_i^T β| < ω/2} for the outlying observations i. Note that the definition of O_i is very similar to that of the set defined in (6.6) (which is why we use the same notation); its interpretation is also very similar. We show that the first part is equal to the integral ∫_{R^p} ∫_0^1 π(β, σ | y_k) dσ dβ and that the second part is equal to 0. For the first part, we again use Lebesgue's dominated convergence theorem in order to interchange the limit ω → ∞ and the integral, using Proposition 2.2 in the second equality (since x_1, . . . , x_n are fixed), and lim_{ω→∞} 1_{∩_i O_i^c}(β) = 1_{R^p}(β) = 1 ⇔ lim_{ω→∞} 1_{∪_i O_i}(β) = 0. Indeed, if observation i is an outlier with b_i > 0 (which implies that y_i > 0), β ∈ O_i implies that |y_i − x_i^T β| < ω/2 ≤ y_i/2, which in turn implies that y_i/2 < x_i^T β < 3y_i/2, and in the limit no β ∈ R^p satisfies this (we reach the same conclusion if b_i < 0). Note that pointwise convergence is sufficient, for any value of β ∈ R^p and σ > 0, once the limit is inside the integral. We now demonstrate that the integrand is bounded, for any value of ω ≥ y, by an integrable function of β and σ that does not depend on ω. By the monotonicity of the tails of |z| f(z) and then the monotonicity of the tails of f(z), because |y_i − x_i^T β|/σ ≥ |y_i − x_i^T β| ≥ ω/2 ≥ y/2 ≥ M if we choose y ≥ 2M, the integrand is bounded above; Lemma 6.1 is used in the last inequality with ω = (y_i − a_i)/b_i. The resulting bound is an integrable function. We now prove that the second part is equal to 0. We first bound the integrand above and then prove that the integral of the upper bound converges towards 0 as ω → ∞. In the same manner as in the proof of the inequality in (6.4), we split the domain of β using F_i := {β : |x_i^T β| < ω/γ}, ∀i ∈ I_F, with I_F := {1, . . . , ℓ + 2p − 1} (we assume as previously that y_1, . . .
, y +2p−1 are + 2p − 1 nonoutliers, and therefore k 1 = . . . = k +2p−1 = 1). The definition of F i is the same as that of the set defined in (6.7). For an interpretation of this set and of the sets involve in the decomposition of ∪ i O i , see the proof of (6.4). Given that |y i | ≥ ω for all i ∈ O i , we can use the same mathematical arguments as in the proof of (6.4) to show that O i ∩ F i1 ∩ · · · ∩ F ip = ∅ for all i, i 1 , . . . , i p with i j = i s , ∀i j = i s such that j = s. Therefore, )) for i 1 ∈ I F , and so on. We now consider the area defined by: 0 < σ < 1 and β belongs to one of these p−1 i=0 +2p−1 i mutually exclusive sets. We have In Step a, we use Lemma 6.1 to obtain f (ω)/f ( Step b, we use π(β, σ) ≤ B max(1, 1/σ) = B/σ. We also use that in any of the sets in which β can belong, there are at least + 1 nonoutlying points such that |x T i β| ≥ ω/γ (corresponding to β ∈ F c i for at least + 1 nonoutlying points). Indeed, the case in which there are the least nonoutliers such that |x T i β| ≥ ω/γ corresponds to β ∈ ∪ i (O i ∩ F i1 ∩ · · · ∩ F ip−1 ∩ (∩ ip =i1,...,ip−1 F c ip )). In this case there are p − 1 nonoutliers such that |x T i β| < ω/γ (say observations i 1 to i p−1 ), which leaves at least +2p−1−(p−1) nonoutliers such that |x T i β| ≥ ω/γ (i.e. that there are +2p−1−(p−1) sets in the intersection ∩ ip =i1,...,ip−1 F c ip ), and we know that +2p−1−(p−1) = +p > +1 because we only consider the models with p ≥ 2. This implies that there exists a set of + 1 indices, say {i p , . . . , i +p } ⊂ I F , such that for all i ∈ {i p , . . . , i +p }, using the monotonicity of the tails of f in the first inequality because, if we define the constant a (k) := max i∈{1,...,k} |a i | with ω ≥ y ≥ (2γ)a (k) , we have In the second inequality, we use Lemma 6.1 (as mentioned in the proof of (6.4), we choose γ ≥ 2). 
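The counting argument in the worst case can be summarised in a short display (this only restates the arithmetic from the proof above):

```latex
\underbrace{(\ell + 2p - 1)}_{\# I_F}
\;-\;
\underbrace{(p - 1)}_{\text{indices } i_1, \dots, i_{p-1}}
\;=\; \ell + p \;>\; \ell + 1
\qquad \text{whenever } p \ge 2 .
```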
In Step c above, we use the monotonicity of the tails of |z|f(z) to obtain (ω/σ)f(ω/σ) ≤ ωf(ω) for the terms involved, because ω/σ ≥ ω ≥ y ≥ M if we choose y ≥ M. In Step d, we use (1/σ)f(ω/σ) ≤ B/ω, which follows by writing (1/σ)f(ω/σ) = (1/ω)(ω/σ)f(ω/σ) and using the bound on |z|f(z).
In order to prove that (B/ω) m(y_{I_R}) → 0 as ω → ∞, it suffices to prove that m(y_{I_R}) is bounded by a constant that does not depend on ω, because 1/ω → 0. In Section 6.1.1, we proved that a marginal such as m(y_{I_R}) is bounded by a constant that does not depend on ω if the number of observations (which is k − 1 in our case) is greater than or equal to p + 1 and if the prior divided by max(1, 1/σ) is bounded (which is the case for m(y_{I_R})). Because we assume that k ≥ ℓ + 2p − 1 and ℓ ≥ 1 (the proof for the case ℓ = 0 is trivial), and because we only consider the models with p ≥ 2, m(y_{I_R}) is the marginal of k − 1 ≥ ℓ + 2p − 2 ≥ p + 1 observations. As a result, (B/ω) m(y_{I_R}) → 0 as ω → ∞, which completes the proof of Result (a).

Proof of Result (b). Consider (β, σ) such that π(β, σ) > 0 (the proof for the case (β, σ) such that π(β, σ) = 0 is trivial). As ω → ∞, the first ratio in the last equality does not depend on β and σ and converges towards 1 using Result (a). Also, the product converges towards 1 uniformly on any set (β, σ) ∈ [−ϑ, ϑ]^p × [1/η, η] using Proposition 2.2, given that x_1, . . . , x_n are fixed. Furthermore, since f and π(β, σ)/max(1, 1/σ) are bounded, π(β, σ | y_k) is also bounded on any set (β, σ) ∈ [−η, η]^p × [1/η, η]. The uniform convergence stated in Result (b) follows.

Proof of Result (c). Using Proposition 2.1, we know that π(β, σ | y_k) and π(β, σ | y_n) are proper. Moreover, using Result (b), we have the pointwise convergence π(β, σ | y_n) → π(β, σ | y_k) as ω → ∞ for any β ∈ R^p and σ > 0, as a consequence of the uniform convergence. Then, the conditions of Scheffé's theorem (see Scheffé (1947)) are satisfied and we obtain the convergence in L^1 of π(β, σ | y_n) towards π(β, σ | y_k), as well as the convergence of the posterior probabilities uniformly over all sets E ⊂ R^p × R^+. Result (c) follows directly.
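In symbols, the conclusion obtained from Scheffé's theorem reads as follows (restating the L^1 and uniform-set convergences used for Result (c)):

```latex
\lim_{\omega \to \infty} \int_{\mathbb{R}^p} \int_0^{\infty}
  \bigl| \pi(\beta, \sigma \mid y_n) - \pi(\beta, \sigma \mid y_k) \bigr|
  \, d\sigma \, d\beta = 0,
\qquad
\sup_{E \subset \mathbb{R}^p \times \mathbb{R}_+}
  \bigl| P\bigl((\beta, \sigma) \in E \mid y_n\bigr)
       - P\bigl((\beta, \sigma) \in E \mid y_k\bigr) \bigr|
  \;\longrightarrow\; 0 .
```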

Proof of Result (d).
We prove that the moments converge through a mix of the strategies used to show Result (a) and to show that the moments exist in Proposition 2.1. For any positive integer M, we have lim_{ω→∞} E[σ^M | y_n] = E[σ^M | y_k], assuming that we can interchange the limit and the integral and using Result (b). To interchange the limit and the integral, we again use Lebesgue's dominated convergence theorem, which requires that the integrand is bounded by an integrable function of β and σ. We prove that this is the case using that m(y_n) is bounded, by Proposition 2.1, using f ≤ B for the first M observations and assuming without loss of generality that these observations are nonoutliers (therefore k_1 = · · · = k_M = 1). Therefore, we need to show that the remaining product is bounded by an integrable function of β and σ, where y*_k := y_k \ {y_1, . . . , y_M} (the nonoutlier group without the first M nonoutliers). In the proof of Result (a), it has been shown that this is the case under the assumptions of Theorem 2.1, which are satisfied considering this modified data set with k − M ≥ n/2 + (p − 1/2) (see the additional assumption for Result (d) of Theorem 2.1).
For the expectations E[β_j^M | y_n], we proceed in the same way; we simply additionally use that, as in the proof of Proposition 2.1 (see Section 6.1.1), β_j can be rewritten as e_j^T β, and that, in turn, e_j can be expressed as a linear combination of p vectors x_{i_1}, . . . , x_{i_p}, where now these are selected among the nonoutliers, i.e. i_1, . . . , i_p ∈ {i : k_i = 1}. We detail the case M = 1. From it and what has been done before, it will be clear that the result holds in general, with further technicalities. As above, we assume that we can interchange the limit and the integral and use Result (b), and we have to show that the integrand is bounded above by an integrable function of β and σ. We beforehand use that β_j = e_j^T β = ∑_{s=1}^p a_s x_{i_s}^T β = −∑_{s=1}^p a_s (y_{i_s} − x_{i_s}^T β) + ∑_{s=1}^p a_s y_{i_s}, where a_1, . . . , a_p ∈ R and i_1, . . . , i_p ∈ {i : k_i = 1}. The integrand thus becomes a sum of 2p terms, and we prove that each one of them is bounded above by an integrable function of β and σ, which will complete the proof. As above, a_s (y_{i_s} − x_{i_s}^T β) π(β, σ | y_n) is bounded in absolute value by |a_s| |y_{i_s} − x_{i_s}^T β| π(β, σ) times the likelihood divided by m(y_n), and m(y_n) is bounded using Proposition 2.1. Therefore, a_s (y_{i_s} − x_{i_s}^T β) π(β, σ | y_n) is bounded above by a constant times an expression involving y*_k := y_k \ {y_{i_s}} (the nonoutlier group without the i_s-th nonoutlier). As mentioned above, in the proof of Result (a), it has been shown that this expression is bounded by an integrable function under the assumptions of Theorem 2.1, which are satisfied considering this modified data set with k − 1 ≥ n/2 + (p − 1/2) (see the additional assumption for Result (d) of Theorem 2.1). The proofs for the terms a_s y_{i_s} π(β, σ | y_n) are similar.
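For reference, the moment convergences established in this proof read:

```latex
\lim_{\omega \to \infty} E\!\left[\sigma^M \mid y_n\right]
  = E\!\left[\sigma^M \mid y_k\right],
\qquad
\lim_{\omega \to \infty} E\!\left[\beta_j^M \mid y_n\right]
  = E\!\left[\beta_j^M \mid y_k\right],
\quad j = 1, \dots, p,
```

for any positive integer M satisfying the additional assumption of Result (d).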

Complement of Section 3.2
In Section 3.2, we mention that the first derivative of the divergence

KL(β, σ) := ∫ log( g(y_i) / p_(β,σ)(y_i) ) g(y_i) dy_i    (6.8)

with respect to β equals 0 at β_0, and this for any value of σ. We also mention that, when setting β = β_0 in (6.8), the divergence is minimised at σ*, which depends on ρ.
Finally, we mention that most of the regularity conditions in Bunke et al. (1998) are satisfied. We now show all this. We rewrite the divergence as a difference of two terms, where E_g denotes the expectation with respect to g (omitting the index i). The first term is computed exactly. The second term is rewritten as an integral with respect to ϕ. In the same way as for δ, we show that we can interchange the derivative with respect to η and the integral: the derivative of the integrand is bounded by |z| times a constant, which is an integrable function with respect to ϕ. Therefore, we can interchange the integral and the derivative. Setting the derivative equal to 0 leads to

∫ zη (f'(zη) / f(zη)) ϕ(z) dz = −1.    (6.9)
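The condition in (6.9) can be solved numerically for η by quadrature and bisection. The sketch below is illustrative only: the paper's density f is not reproduced here, so the score functions f'/f used (a normal and, hypothetically, a Student t with ν = 4) are stand-ins, and the function names are our own.

```python
import numpy as np

# Numerical solution of the first-order condition (6.9):
#   integral of  z*eta * f'(z*eta)/f(z*eta) * phi(z) dz  =  -1,
# where phi is the standard normal density.

def eta_equation(eta, dlogf, z=np.linspace(-10.0, 10.0, 20001)):
    """Trapezoidal approximation of the left-hand side of (6.9);
    dlogf(x) must return the score f'(x)/f(x)."""
    phi = np.exp(-z ** 2 / 2) / np.sqrt(2 * np.pi)
    vals = z * eta * dlogf(z * eta) * phi
    return float(np.sum((vals[1:] + vals[:-1]) * np.diff(z)) / 2)

def solve_eta(dlogf, lo=1e-3, hi=50.0, tol=1e-10):
    """Bisection for the root eta* of eta_equation(eta) = -1 on [lo, hi]."""
    g = lambda eta: eta_equation(eta, dlogf) + 1.0
    assert g(lo) > 0 > g(hi), "root not bracketed"
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

# Sanity check: with normal errors, f'(x)/f(x) = -x, so (6.9) reads
# -eta^2 = -1 and eta* = 1 (that is, sigma* = sigma_0).
print(solve_eta(lambda x: -x))

# Hypothetical heavy-tailed stand-in: Student t with nu = 4 degrees of
# freedom, for which f'(x)/f(x) = -(nu + 1) x / (nu + x^2).
nu = 4.0
print(solve_eta(lambda x: -(nu + 1) * x / (nu + x ** 2)))
```

With the normal score the solver recovers η* = 1, consistent with the fact that no rescaling is needed when the assumed and true error densities coincide; heavier-tailed scores yield a different η*, as in Figure 9 (b).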
We cannot solve this equation explicitly, but numerical calculations show that the solution is unique. For instance, the left-hand side of (6.9) as a function of η with ρ = 0.95 is shown in Figure 9 (a), with the maximiser η* as a function of ρ in Figure 9 (b). The previous analysis suggests that (β*, σ*) = (β_0, σ_0/η*).

Condition 1. The parameter space Θ is a closed (possibly unbounded) convex set in R^d with a nonempty interior. The density p_θ(y) is bounded for all θ and y, and its carrier {y : p_θ(y) > 0} is the same for all θ.
This condition is not directly satisfied because the parameter space is open (σ > 0). But this should not be a problem if we can show that it is possible to choose ε > 0 such that σ_0 ∈ [ε, ∞) and that the mass outside of this set goes to 0 as the sample size increases. Indeed, we could "define" the parameter space to be [ε, ∞) × R^p, which is a closed convex set, and lose nothing asymptotically. On this parameter space, p_θ(y) is bounded for all θ and y, and its carrier {y : p_θ(y) > 0} = R is the same for all θ.
The last part is satisfied if the prior is strictly positive over the parameter space, which is usually the case (it is true in our numerical analyses). The first part essentially requires that the measure does not "explode" in some areas. Under the assumption mentioned in Section 2.1 of our paper on the prior, and if the parameter space is [ε, ∞) × R^p, we have that π(S[θ, r]) = ∫ 1_{S[θ,r]}(θ') π(θ') dθ' ≤ (1/ε) ∫ 1_{S[θ,r]}(θ') dθ' = c r^{p+1}, implying that the first part holds.

Condition 6. Let L : Θ × Θ → R^+ be a measurable loss function with L(θ, θ) = 0, and let c_1, c_2, c_3, b_4, b_5 be positive constants such that the corresponding growth inequalities hold for all t, θ ∈ Θ.
It is easily seen that the quadratic loss function satisfies this, pointing towards the consistency of the posterior mean of β. Under Conditions 1 to 5, a result in Bunke et al. (1998) indicates that the posterior density concentrates around θ * = (β 0 , σ * ), pointing in this case towards the consistency of the part of the posterior mode associated with β.
The last term on the right-hand side is equal to β^T X_n^T y_n, which cancels out with the remaining term in (6.10), minus (X_n^T y_n)^T (X_n^T X_n)^{-1} X_n^T y_n. Putting these results together yields

π(β, σ | y_n) ∝ π(σ) (1/σ^n) exp( −(1/(2σ^2)) [ ∑_{i=1}^n y_i^2 − (X_n^T y_n)^T (X_n^T X_n)^{-1} X_n^T y_n ] )
× exp( −(1/(2σ^2)) (β − (X_n^T X_n)^{-1} X_n^T y_n)^T X_n^T X_n (β − (X_n^T X_n)^{-1} X_n^T y_n) ).
It just remains to prove that ∑_{i=1}^n y_i^2 − (X_n^T y_n)^T (X_n^T X_n)^{-1} X_n^T y_n = ‖y_n − ŷ_n‖^2.
We have

∑_{i=1}^n y_i^2 − (X_n^T y_n)^T (X_n^T X_n)^{-1} X_n^T y_n
= y_n^T y_n − ((X_n^T X_n)^{-1} X_n^T y_n)^T X_n^T y_n
= y_n^T y_n − ŷ_n^T y_n
= (y_n − ŷ_n)^T (y_n − ŷ_n + ŷ_n)
= (y_n − ŷ_n)^T (y_n − ŷ_n) + (y_n − ŷ_n)^T ŷ_n
= (y_n − ŷ_n)^T (y_n − ŷ_n),

where the last equality uses (y_n − ŷ_n)^T ŷ_n = 0, because the residual vector y_n − ŷ_n is orthogonal to the fitted values ŷ_n = X_n (X_n^T X_n)^{-1} X_n^T y_n.
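The second exponential in the expression for π(β, σ | y_n) above is a normal kernel in β; equivalently, with β̂ := (X_n^T X_n)^{-1} X_n^T y_n, the completed square can be read as follows (a standard consequence of the algebra above, not an additional result of the paper):

```latex
\beta \mid \sigma, y_n \;\sim\; \mathcal{N}_p\!\left( \hat{\beta},\; \sigma^2 (X_n^T X_n)^{-1} \right),
\qquad
\pi(\sigma \mid y_n) \;\propto\; \pi(\sigma)\, \sigma^{-(n-p)}
  \exp\!\left( -\frac{\lVert y_n - \hat{y}_n \rVert^2}{2\sigma^2} \right),
```

where the marginal of σ follows from integrating the normal kernel in β, which contributes a factor proportional to σ^p.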