Bayesian inference and prediction for mean-mixtures of normal distributions

We study frequentist risk properties of predictive density estimators for mean mixtures of multivariate normal distributions, involving an unknown location parameter θ ∈ R^d, and which include multivariate skew-normal distributions. We provide explicit representations for Bayesian posterior and predictive densities, including the benchmark minimum risk equivariant (MRE) density, which is minimax and generalized Bayes with respect to an improper uniform density for θ. For four dimensions or more, we obtain Bayesian densities that improve uniformly on the MRE density under Kullback-Leibler loss. We also provide plug-in type improvements, investigate implications for certain types of parametric restrictions on θ, and illustrate and comment on the findings based on numerical evaluations.


Introduction
The findings of this paper relate to predictive density estimation for mean mixtures of normal distributions. The modelling of data via mixing multivariate normal distributions has found many applications and has led to methodological challenges for statistical inference. These include finite mixtures, as well as continuous mixing on the mean and/or the variance. Whereas scale or variance mixtures of multivariate normal distributions form an interesting subclass of spherically symmetric distributions, modelling asymmetry requires mixing on the mean; prominent examples are generated via variance-mean mixtures (e.g., [7]), as well as mean mixtures of multivariate normal distributions (e.g., [1,5] and references therein). Moreover, such mean mixtures, which are the subject of study here, generate or are connected to multivariate skew-normal distributions (e.g., [6]), which have garnered much interest over the years.
The development of shrinkage estimation techniques, namely since Stein's inadmissibility finding ([26]) concerning the maximum likelihood or best location equivariant estimator under squared error loss in three dimensions or more, has had a profound impact on statistical theory, thinking, methods, and practice. Examples include developments on sparsity and regularization methods, empirical Bayes estimation, adaptive inference, small area estimation, non-parametric function estimation, and predictive density estimation. Cast in a decision-theoretic framework, Stein's original result has been expanded in many diverse ways, namely to other distributions or probability models, and namely for spherically symmetric and elliptically symmetric distributions (see for instance [11]). There have been fewer findings for multivariate skew-normal or mean mixtures of normal distributions, but the recent work of Kubokawa et al. [18] establishes point estimation minimax improvements on the best location equivariant estimator under quadratic loss, when the dimension of the location parameter is greater than or equal to four, and with underlying known perturbation parameter.
Predictive density estimation has garnered much interest over the past twenty years or so, and addresses fundamental issues in statistical predictive analysis. Decision-theoretic links between shrinkage point estimation and shrinkage predictive density estimation for normal models have surfaced (e.g., [14], [13]) and stimulated much activity (see for instance [12]), including findings for restricted parameter spaces (e.g., [10]). The main objective of this work is thus to explore the problem of predictive density estimation for mean mixtures of normal (MMN) distributions. A secondary objective is to provide novel representations for Bayesian posterior distributions and predictive densities, which have been found to be lacking in the literature.
Following early findings of Komaki (e.g., [14]) on the predictive density estimation problem for multivariate normal models under Kullback-Leibler loss, George, Liang and Xu in [13] exhibited further parallels with the point estimation problem for normal distributions under quadratic loss. They provide sufficient conditions on marginal distributions and prior distributions to obtain improved shrinkage predictive density estimators when the dimension is greater than or equal to three. Motivated by these connections, it is thus of interest to investigate whether such shrinkage plays a role in the predictive density estimation problem for mean-mixture of multivariate normal models, and we focus on frequentist risk efficiency of predictive density estimators under Kullback-Leibler loss. Our contribution here consists in identifying classes of plug-in type predictive densities and of Bayes predictive densities which are minimax and dominate the benchmark minimum risk equivariant (MRE) density for the case where the dimension of the location parameter is greater than or equal to four.
The organization of this manuscript is as follows. Section 2.1 contains several introductory definitions, properties and examples of MMN models, including a useful canonical form which subdivides the MMN distributed vector into d independent components, one of which has a univariate MMN distribution while the others are normally distributed. Section 2.2 focuses on the predictive estimation framework with a KL loss decomposition, and an initial representation for the MRE density accompanied by various examples. Section 2.3 expands on the calculation of the minimax risk and a representation in terms of the entropy of a univariate distribution. Section 3 is devoted to Bayesian posterior and predictive analysis with several novel representations. Sections 4.1 and 4.2, namely Theorems 4.4 and 4.5 and Corollary 4.1, contain the main dominance findings, with plug-in type and Bayesian improvements. In both cases, the main technique employed rests upon the canonical transformation presented in Section 2.1 and permits splitting the KL risk into the sum of two parts, one of which can be operated on using known normal model prediction analysis findings. Section 4.3 deals with parametric restrictions and further applications of Theorems 4.4 and 4.5. Finally, some further illustrations are provided in Section 5.

Preliminary results and definitions
Here are some details, properties and definitions on mean mixtures of normal distributions, their canonical form, and predictive density estimation. In the following, we will denote by φ_d(z; Σ) the probability density function (pdf) of a N_d(0, Σ) distribution evaluated at z ∈ R^d, for positive definite Σ. When Σ = I_d, we simplify the notation to φ_d(z), and for d = 1 to φ(z). We denote by Φ the cdf of a N(0, 1) distribution.

The model
The distributions of interest are mean-mixtures of multivariate normal distributions, both for our observables and for the densities to be estimated by a predictive density estimator. Such distributions connect to multivariate skew-normal distributions and have been the object of interest in recent work, with studies of stochastic properties (e.g., [1], [5]) and shrinkage estimation of the location parameter ([18]).
Definition 2.1. A random vector X ∈ R^d is said to have a mean-mixture of normal distributions (MMN), denoted X ∼ MMN_d(θ, a, Σ, L), if it admits the representation (2.1), where θ ∈ R^d is a location parameter, a ∈ R^d − {0} is a known perturbation vector, Σ is a known positive definite covariance matrix, and V is a scalar random variable with cdf L.
Alternatively, the random vector X has the stochastic representation X = θ + V a + Σ^{1/2} Z, where Z ∼ N_d(0, I_d) and V ∼ L on R are independent, and its probability density function can be expressed as p(x | θ) = E_V[ φ_d(x − θ − V a; Σ) ]. Thus, we note that the density of an MMN random vector can be decomposed into two parts: a symmetric density φ_d(·) and another part which is a function of the projection of (x − θ) in the direction of Σ^{−1}a. Moreover, this construction isolates the asymmetry in the direction Σ^{−1}a, and the scale of the perturbation is controlled by the random variable V.
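For concreteness, here is a minimal Python sketch, assuming the stochastic representation X = θ + V a + Σ^{1/2} Z stated above; the function names and the half-normal mixing choice are illustrative only. It simulates MMN variates and evaluates p(x|θ) = E_V[φ_d(x − θ − V a; Σ)] by Monte Carlo.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
d = 3
theta = np.zeros(d)
a = np.ones(d)            # perturbation vector
Sigma = np.eye(d)

def rmmn(n, theta, a, Sigma, rV):
    """Draw n variates via X = theta + V*a + Sigma^{1/2} Z, with V ~ L and Z ~ N_d(0, I_d)."""
    V = rV(n)
    Z = rng.standard_normal((n, len(theta)))
    S_half = np.linalg.cholesky(Sigma)
    return theta + V[:, None] * a + Z @ S_half.T

def mmn_pdf(x, theta, a, Sigma, rV, m=5000):
    """Monte Carlo evaluation of p(x|theta) = E_V[ phi_d(x - theta - V a; Sigma) ]."""
    V = rV(m)
    return np.mean([multivariate_normal.pdf(x, mean=theta + v * a, cov=Sigma) for v in V])

rV = lambda n: np.abs(rng.standard_normal(n))   # half-normal mixing: the skew-normal case
X = rmmn(10_000, theta, a, Sigma, rV)
print(X.mean(axis=0))                           # close to theta + E(V)*a
print(mmn_pdf(np.array([0.5, 0.5, 0.5]), theta, a, Sigma, rV))
```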
Remark 2.1. It is easy to see that the family of MMN distributions is closed under linear combinations of independent components. Specifically, a linear combination Σ_i b_i X_i of independent MMN distributed vectors is again MMN distributed. Namely, for the identically distributed case with Σ_i = Σ and the sample mean with b_i = 1/n, we obtain that the sample mean is again MMN distributed. It thus follows, as observed in [18], that findings applicable to a single MMN distributed observable X can be extended to the random sample case.
We now turn our attention to a fundamental decomposition, or canonical form ([1]), for MMN distributions which will be most useful.
Lemma 2.1. For a random vector X ∼ MMN_d(θ, a, Σ, L) as in (2.1), there exists an orthogonal matrix H such that the first row of H is proportional to a'Σ^{−1/2} and Z = HΣ^{−1/2}X has a MMN_d(HΣ^{−1/2}θ, a_0, I_d, L) distribution with a_0 = (√(a'Σ^{−1}a), 0, . . . , 0)'. Such a Z may be referred to as a canonical form and is comprised of d independent components. Moreover, Z − HΣ^{−1/2}θ has d − 1 components which are N(0, 1) distributed and another distributed as MMN_1(0, √(a'Σ^{−1}a), 1, L). Such a canonical form construction is not unique and depends on the choice of H.
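A minimal numerical sketch of such a construction (an assumed QR-based completion; any orthogonal completion of the first row works, since the lemma notes that the choice of H is not unique):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
a = rng.standard_normal(d)
A = rng.standard_normal((d, d))
Sigma = A @ A.T + d * np.eye(d)                 # some positive definite covariance

w, U = np.linalg.eigh(Sigma)
Sig_m_half = U @ np.diag(w ** -0.5) @ U.T       # symmetric square root of Sigma^{-1}

h1 = Sig_m_half @ a
h1 = h1 / np.linalg.norm(h1)                    # first row of H, proportional to a' Sigma^{-1/2}

# Complete h1 to an orthonormal basis via QR; flip the sign so the first row is exactly h1.
Q, _ = np.linalg.qr(np.column_stack([h1, rng.standard_normal((d, d - 1))]))
H = Q.T
if H[0] @ h1 < 0:
    H[0] = -H[0]

a0 = H @ (Sig_m_half @ a)
print(np.round(a0, 10))                         # (sqrt(a' Sigma^{-1} a), 0, ..., 0)
print(np.allclose(H @ H.T, np.eye(d)))          # H is orthogonal
```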
As already mentioned, the family of MMN distributions contains many interesting distributions, and we refer to the above-mentioned references for various properties. We expand here with illustrations, which will also inform our predictive density problem and the related Bayesian posterior analysis.
A prominent example is the multivariate skew-normal distribution due to Azzalini and Dalla Valle [6]. If we consider V ∼ TN(0, 1), the standard truncated normal distribution on R^+, in (2.1), we obtain the multivariate skew-normal family of distributions with densities given in (2.4). We denote this as X ∼ SN_d(θ, a, Σ). Here, we note that V is distributed as the square root of a chi-square variable with k = 1 degree of freedom. Various other choices of the mixing density have appeared in the literature (e.g., [5]), namely cases where V is distributed as the square root of a χ²_k variable or where V is Gamma distributed. Here is a general result containing such cases as well as many others.
Theorem 2.1. For a mixing density of the form given in (2.5), the corresponding model density is given by (2.6), expressed in terms of expectations involving Z ∼ N(0, 1).
Proof. The result follows from (2.3) by carrying out the expectation over V.
(A) Gamma mixing. The density was studied in [1,2]. The exponential case with α = 1 simplifies accordingly. More generally, for positive integer α, the density's expression brings into play the (α − 1)th lower-truncated moment of a normal distribution. For instance, with E{(Z + ∆) | Z + ∆ ≥ 0} = ∆ + R(∆), we obtain for the case α = 2 the model density with the above c_1 and c_2.
(B) For chi-type mixing, the density was given in [5] and, as previously noted, the case k = 1 reduces to the skew-normal case in (2.4). As in Example (A), for positive integer k the density's expression involves a lower-truncated moment of a normal distribution.
(C) Kummer type II mixing, with density proportional to v^{a−1}(v + σ)^{−(a+b)}e^{−cv} for a, c, σ > 0, b ∈ R, and ψ the confluent hypergeometric function of type II defined for γ_1, γ_3 > 0. This class of densities includes, for b = −a, the Gamma densities in (A), as well as Beta type II densities for c = 0 and b > 0. The resulting mean-mixture density is given by (2.6) and involves interesting expectations of the form E[ W^{a−1}(W + σ)^{−(a+b)} | W ≥ 0 ], where W ∼ N(∆, 1) with ∆ = c_2/c_1.

The prediction problem
We consider the model X ∼ MMN_d(θ, a, Σ_X, L_1) independently of Y ∼ MMN_d(θ, a, Σ_Y, L_2), (2.8) with common unknown location parameter θ ∈ R^d. Let p(x|θ) and q(y|θ) denote the conditional densities of X and Y given θ, respectively. Based on observing X = x, we consider the problem of finding a suitable predictive density estimator q(y; x) for q(y|θ), y ∈ R^d.
The ubiquitous Kullback-Leibler (KL) divergence between two Lebesgue densities f and g on R^m, defined as ρ(f, g) = ∫_{R^m} f(t) log{ f(t)/g(t) } dt, is the basis of the Kullback-Leibler loss given by L(θ, q) = ρ(q_θ, q). (2.9) We will make use of Lemma 2.1's canonical form as in (2.1) to transform a mean mixture of normal distributions vector into two independent components, and to capitalize on the corresponding simplification of the KL divergence, which is as follows.
Lemma 2.2. Let T = (T^(1), T^(2)) ∈ R^m and U = (U^(1), U^(2)) ∈ R^m be random vectors subdivided into independently distributed components T^(i) and U^(i) of dimensions m_i for i = 1, 2 with m_1 + m_2 = m. Denote by f and g the densities of T and U, respectively, and by f_1, f_2, g_1, g_2 the densities of T^(1), T^(2), U^(1), U^(2), respectively. Then, we have ρ(f, g) = ρ(f_1, g_1) + ρ(f_2, g_2). (2.10) Proof. By independence, the densities factorize as f = f_1 f_2 and g = g_1 g_2, and the logarithm of the ratio separates, which is (2.10).
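A quick numerical check of the additivity in (2.10), using univariate normal components on a grid (purely illustrative parameter values):

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-15, 15, 2001)
dx = x[1] - x[0]

def kl_1d(f, g):
    """Grid approximation of the KL divergence between two densities sampled on x."""
    return np.sum(f * np.log(f / g)) * dx

f1, g1 = norm.pdf(x, 0, 1), norm.pdf(x, 1, 1)
f2, g2 = norm.pdf(x, 0, 2), norm.pdf(x, -1, 2)

# KL divergence of the product densities on the product grid
F, G = np.outer(f1, f2), np.outer(g1, g2)
kl_2d = np.sum(F * np.log(F / G)) * dx * dx
print(kl_2d, kl_1d(f1, g1) + kl_1d(f2, g2))     # the two values agree (here 0.625)
```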
We evaluate the performance of density estimators using the KL loss in (2.9) and the associated KL risk function R(θ, q) = E_θ[ ρ(q_θ, q(·; X)) ]. For a prior density π for θ with respect to a σ-finite measure ν, it is known (e.g., [3,4]) that the Bayes predictive density is given by the posterior average of q(·|θ), as in (2.12). A benchmark predictive density estimator for q(y|θ), y ∈ R^d, is given by the Bayes predictive density estimator qU(y; X), y ∈ R^d, with respect to the uniform prior density on R^d. It is known to be the minimum risk equivariant (MRE) predictive density estimator under changes of location, as well as minimax. In [16], a representation for qU is provided which applies to both integrated squared-error loss and KL loss. For our prediction problem, the following result makes use of this representation and summarizes the above optimality properties.
Lemma 2.3. The MRE predictive density estimator of the density of Y relative to model (2.8) under KL loss is given by the Bayes predictive density qU under the prior π_U(θ) = 1. Furthermore, qU(y; X) admits an explicit MMN representation, described in the proof below. Finally, qU(y; X) is minimax under KL loss.
Proof. The MRE and minimax properties are given in [24] and [20], respectively. For a location family prediction problem with X ∼ p(x − θ) and Y ∼ q(y − θ) independently distributed, it is shown in [16] that qU(y; X) = (q ∗ p̃)(y − x), with p̃(t) = p(−t), i.e., the convolution of q and the density of the additive inverse of p, followed by a change of location equal to x.
For model (2.8), the above convolution q ∗ p̃ is given by the density of Y − X in model (2.8) with θ = 0, and the result follows. Here, we can see that the MRE density estimator also belongs to the class of MMN distributions, with the same perturbation parameter a and location parameter x. As well, the distribution of the difference V_2 − V_1 plays a key role in Theorem 2.3's representation of the MRE predictive density, as illustrated in the next subsection of examples.
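For intuition, here is a one-dimensional numerical sketch of the convolution representation qU(y; x) = (q ∗ p̃)(y − x), carried out in the plain normal case with illustrative variances (for which the convolution is the N(0, σ²_X + σ²_Y) density):

```python
import numpy as np
from scipy.stats import norm

sX, sY = 1.0, 1.5
t = np.linspace(-12, 12, 4801)
dt = t[1] - t[0]

p_flip = norm.pdf(-t, scale=sX)                       # p~(t) = p(-t)
q = norm.pdf(t, scale=sY)
conv = np.convolve(q, p_flip, mode="same") * dt       # (q * p~)(t) on the same grid

target = norm.pdf(t, scale=np.sqrt(sX**2 + sY**2))    # N(0, sX^2 + sY^2) density
print(np.max(np.abs(conv - target)))                  # ~ 0
```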

Minimax risk and entropy
The Kullback-Leibler risk expressions bring into play the entropy associated with MMN distributions. Such a measure is not easily manipulated into a closed form (see for instance [9] for a study of the entropy of skew-normal distributions), but it can be expressed in terms of the entropy of a univariate MMN distribution, as illustrated with the following expansion of the constant and minimax risk of the MRE density qU in the context of model (2.8). For a Lebesgue density f on R^d, the entropy is defined as H(f) = −∫_{R^d} f(t) log f(t) dt. We will make use of the following well-known and easily established properties.
As implied by part (a) of Lemma 2.4, the entropy H(f_µ) is constant as a function of µ for location family densities f_µ(t) = f_0(t − µ), as is the case for MMN_d(µ, b, Σ, L) densities. Now, we have the following dimension reduction decomposition for the entropy.
Lemma 2.5. For d ≥ 2, the entropy of an MMN_d density decomposes as the sum of the entropy of a univariate MMN density and the entropy H(φ_{d−1}) of a (d − 1)-variate standard normal density, up to the Jacobian term of Lemma 2.4 (a).
With the above, we conclude with an expression for the constant and minimax risk.
Theorem 2.2. In the context of model (2.8), the Kullback-Leibler risk of the MRE density qU is given by (2.14).
Proof. The risk is constant in θ by the independence of X and Y, the constancy in θ of the location family density q_θ, and since Y − X | θ ∼ MMN_d(0, a, Σ_S, L_3). The result then follows from Lemma 2.5.
The particular case with Σ_X = σ²_X I_d and Σ_Y = σ²_Y I_d follows directly from (2.14) and yields the expression in (2.15).
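As a numerical illustration of the univariate entropy term entering such risk expressions, here is a Monte Carlo sketch for the skew-normal case obtained with half-normal mixing; the closed-form density used below is the standard skew-normal form implied by that mixing, an assumption of this sketch rather than a formula taken from the paper.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
a = 2.0
s = np.sqrt(1.0 + a**2)
f = lambda z: (2.0 / s) * norm.pdf(z / s) * norm.cdf(a * z / s)   # MMN_1(0, a, 1, L) density

n = 500_000
Z = a * np.abs(rng.standard_normal(n)) + rng.standard_normal(n)   # Z = a|V| + W
print(-np.mean(np.log(f(Z))))           # Monte Carlo entropy estimate H(f)
print(0.5 * np.log(2 * np.pi * np.e))   # entropy of a N(0,1), for comparison
```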

Minimum risk predictive density: Examples
Theorem 2.1 tells us that the minimum risk predictive density qU(·; x) is itself an MMN density located at x. The result is quite general and can be viewed as an extension of the multivariate normal case with a = 0, for which qU(·; x) is a N_d(x, Σ_X + Σ_Y) density. Here are some interesting examples. When continuous, the mixing distributions can be taken to have a scale parameter equal to one without loss of generality, since a multiple can be integrated into the shape vector a.
(A) For the case of degenerate V_2 with P(V_2 = 0) = 1, the distribution of V_3 = V_2 − V_1 is simply that of −V_1. With exponential mixing, the distribution of V_3 is Laplace or double-exponential with density (1/2)e^{−|v_3|} on R. Therefore, from Theorem 2.1 and by making use of Lemma 5.9 in the Appendix with A = a'a/σ²_S and c = 0, we obtain an explicit expression for qU (for a ≠ 0).

(D) Consider truncated normal TN(0, 1) mixing (or equivalently, mixing by the square root of a χ²_1 variable), for which X and Y are i.i.d. multivariate skew-normal as in (2.4). A straightforward calculation yields the density of V_3 = V_2 − V_1. Now, by making use of Lemma 5.9 with A = (σ²_S + 2a'a)/σ²_S and B = ±(y − x)'a/σ²_S, collecting terms, and setting f_k = σ²_S + k a'a, we obtain the minimum risk equivariant predictive density, where Φ_2(z_1, z_2; ρ) denotes the cdf, evaluated at (z_1, z_2) ∈ R², of a bivariate normal distribution with means equal to 0, variances equal to 1 and covariance equal to ρ. In the evaluation above, we made use of identities which are special cases of the Sherman-Morrison formula for the matrix inversion of A + b_1b_2', with A a square matrix and b_1, b_2 vectors of the same dimension.
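A quick numerical sanity check of the Sherman-Morrison rank-one inversion formula invoked above (random illustrative inputs):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 5
A = rng.standard_normal((d, d)) + d * np.eye(d)
b1, b2 = rng.standard_normal(d), rng.standard_normal(d)

# (A + b1 b2')^{-1} = A^{-1} - (A^{-1} b1 b2' A^{-1}) / (1 + b2' A^{-1} b1)
Ainv = np.linalg.inv(A)
lhs = np.linalg.inv(A + np.outer(b1, b2))
rhs = Ainv - (Ainv @ np.outer(b1, b2) @ Ainv) / (1.0 + b2 @ Ainv @ b1)
print(np.allclose(lhs, rhs))   # True
```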

Bayes posterior analysis and predictive densities
In this section, we expand on and document representations for Bayesian posterior and predictive densities for mean mixture of normal distributions.

Posterior densities
Bayesian posterior analysis of MMN models relates to the general form X | K, θ ∼ f_{θ,K}, K ∼ g, and θ ∼ π, (3.16) with observable X ∈ R^d, density g of K free of θ, and π a prior density for θ ∈ R^d. Such a set-up leads to the following intermediate result, taken from [21].
Lemma 3.6. For model (3.16), the posterior distribution of U =_d θ | x admits a mixture representation over K, with π_{k,x} being the posterior density of θ as if K = k had been observed, with mixing density g_{π,x}(k) proportional to g(k) m_{π,k}(x), and with m_{π,k} being the marginal density of X as if K = k had been observed.
We now apply the above to MMN distributions as in Definition 2.1, for which the mixing distribution L* appearing in the posterior has an explicit density, and explicit expressions follow for the posterior quantities in (3.19)-(3.21). Now, it is easily seen for cases where K ∼ g as in (3.22) that the corresponding density is that of a TN((B − c_2)/(A + c_1), 1/√(A + c_1); (0, ∞)) distribution. Hence, the above, which yields the density associated with L, provides a complete description of the posterior distribution in (3.19) for all considered cases of mixing density (3.22). Analogously, the expectation of this truncated normal distribution provides an explicit expression for the posterior expectation E(θ|x) in (3.21).

Predictive densities
We now continue the above posterior analysis by focussing on the Bayes predictive density (i.e., the conditional density of Y given X = x) for MMN distributions and a normally distributed prior for the unknown location parameter. In doing so, the following extension comes into play.
Definition 3.2. A random vector Z ∈ R^d is said to have a mean mixture of normal distributions with two directions, where θ ∈ R^d is a location parameter, a_1, a_2 ∈ R^d are known perturbation vectors, Σ is a known positive definite covariance matrix, and W_1, W_2 are scalar random variables with joint cdf L.
We make use of the following intermediate result, provided in [21] and applicable to mixture models of the form X | K, θ ∼ f_{θ,K} with K ∼ g; Y | J, θ ∼ f_{θ,J} with J ∼ h; and θ ∼ π. (3.23) In the above set-up, X ∈ R^d is observable, the mixing variables K and J are independently distributed with distributions free of θ, the variables X and Y are conditionally independent given θ, and π is a prior density for θ ∈ R^d with respect to a σ-finite measure ν.
Lemma 3.7. For model (3.23), setting π_{k,x} and g_{π,x} as in Lemma 3.6, the Bayes predictive density of Y admits a mixture representation over (j, k) with components q_π(y | j, k) = ∫_{R^d} q_{θ,j}(y) π_{k,x}(θ) dν(θ), which can be interpreted as the Bayes predictive density for Y as if Y ∼ q_{θ,j} and K = k had been observed.
Applied to mean mixtures of multivariate normal distributions with a normally distributed prior, we obtain the following result, presented as a theorem.
(a) The Bayes predictive density of Y is a mean mixture of normals with two directions, with L the joint cdf of (K, J), where K ∼ g_{π,x} as in (3.20) and J ∼ h are independently distributed. (b) Moreover, whenever a_Y = c a_X for a_X ≠ 0 and a fixed c ∈ R, the above predictive distribution is MMN_d(ωx + (1 − ω)µ, a_X, (ωσ²_X + σ²_Y)I_d, L_3), with L_3 the cdf of cJ − ωK, and (J, K) distributed as above. Finally, for a_X = 0, i.e., for normally distributed X, the corresponding normal-case reduction applies. Proof. Part (b) follows immediately from part (a). For part (a), consider X' = X − K a_X and Y' = Y − J a_Y. The result then follows from Lemma 3.7 together with the familiar predictive density estimation result for the normal case. Remark 3.3. We point out that the minimum risk predictive density matches the density in (b) with τ² = ∞, i.e., ω = 1.
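The "familiar predictive density estimation result" invoked in the proof is the normal-normal one; here is a small Monte Carlo sanity check of it in one dimension (illustrative values; w = τ²/(τ² + σ²_X) is the usual shrinkage weight):

```python
import numpy as np

rng = np.random.default_rng(3)
mu, tau, sX, sY, x = 0.0, 2.0, 1.0, 1.5, 1.2
n = 2_000_000

theta = rng.normal(mu, tau, n)                  # theta ~ N(mu, tau^2)
Xp = rng.normal(theta, sX)                      # X'|theta ~ N(theta, sX^2)
keep = np.abs(Xp - x) < 0.02                    # condition on X' being close to x
Yp = rng.normal(theta[keep], sY)                # Y'|theta ~ N(theta, sY^2)

w = tau**2 / (tau**2 + sX**2)
print(Yp.mean(), w * x + (1 - w) * mu)          # predictive mean: w*x + (1-w)*mu
print(Yp.var(), w * sX**2 + sY**2)              # predictive variance: w*sX^2 + sY^2
```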

Dominance Results
In this section, we first provide KL risk improvements on the MRE predictive density qU for estimating the density of Y in model (2.8) with d ≥ 4. Such improvements are necessarily minimax as a consequence of Theorem 2.3. Our findings cover two types of improvements: (i) plug-in type (Section 4.1), and (ii) Bayesian improvements (Section 4.2). Furthermore, we provide analogous results for certain types of restricted parameter spaces, which are also applicable for d = 2, 3. Examples are provided in Section 5.
The restriction to covariance matrices that are multiples of the identity is justified by convenience and by the fact that there is no loss of generality in doing so.
Remark 4.4. Predictive density estimates are intrinsic by nature, which implies that the developments of this section, presented for Σ_X = σ²_X I_d and Σ_Y = σ²_Y I_d, carry over after a linear transformation of the data. In doing so, one produces a predictive density estimator q_1(y') = q(y'; x'), y' ∈ R^d, for the density q_{Y'} of Y', which equates to a predictive density estimator q_2 of the density q_Y of Y. Moreover, the Kullback-Leibler divergences ρ(q_{Y'}, q_1) and ρ(q_Y, q_2) are equal, as seen with the change of variables t → Σ^{−1/2}_X t.

Plug-in type improvements
In the normal case with a = 0, the MRE density qU(·; X) is a N_d(X, (σ²_X + σ²_Y)I_d) density and can be improved upon by plug-in type densities of the form q_θ̂(·; X) ∼ N_d(θ̂(X), (σ²_X + σ²_Y)I_d). Indeed, the KL risk performance of q_θ̂ relates directly to the "dual" point estimation risk of θ̂(X) for estimating θ under squared error loss ‖θ̂ − θ‖², with q_θ̂(·; X) dominating qU(·; X) if and only if θ̂(X) dominates X ([10]). For MMN distributions, such a duality does not deploy itself in the same way, but it does so after transformation of (X, Y) to a canonical form and through the intrinsic nature of predictive densities. The following result exhibits this and is applicable for d ≥ 4.
Then, the predictive density q_{H,ζ̂_2}(y; X) = q_1(h_1'y; X) × q_{2,ζ̂_2}(H_2 y; X), y ∈ R^d, dominates qU under KL loss if and only if ζ̂_2(Z_2) dominates Z_2 as an estimator of ζ^(2) under squared error loss in the associated (d − 1)-variate normal model. Proof. Set the transformed variables as in the canonical form of Lemma 2.1, with Z_2 = H_2X.

Now consider the class of predictive densities of the form q_{H,ζ̂_2}(·; X) above for estimating the density of Y. As in Remark 4.4, the Kullback-Leibler risk performance of q_{H,ζ̂_2}(·; X) for estimating the density of Y is equivalent to the Kullback-Leibler risk performance of q_{ζ̂_2}(·; X') for estimating the density of Y'. Furthermore, observe that the MRE density estimator qU equates to the density q_{ζ̂_{2,0}}(·; X') with the choice ζ̂_{2,0} equal to the identity. It thus follows, with the independence of the components of Y' and X' and Lemma 2.2, that the risk difference reduces to the corresponding difference for the second components, which yields the result.
The above dominance finding is quite general with respect to the specifications of a, L_1, and L_2 of model (2.8). Furthermore, observe by examining (4.26) that the risk difference depends on θ only through ζ^(2) = H_2θ, and this for any choice of H_2. More strikingly, as seen with (4.26), the risk difference does not depend on the mixing distributions L_1 and L_2 and can be simply described by a quadratic risk difference of point estimators arising in a (d − 1)-variate normal distribution problem. An illustration of Theorem 4.4 will be presented in Section 5.
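To make the duality concrete, here is a small Monte Carlo sketch of the normal-case statement above (a = 0, James-Stein used as an illustrative θ̂, and the plug-in density assumed to carry the MRE variance (σ²_X + σ²_Y)I_d): the KL risk gap between the plug-in and MRE densities then equals the quadratic risk gap divided by 2(σ²_X + σ²_Y).

```python
import numpy as np

rng = np.random.default_rng(4)
d, sX, sY = 6, 1.0, 1.0
theta = np.full(d, 0.5)
n = 200_000

X = theta + sX * rng.standard_normal((n, d))
js = (1.0 - (d - 2) * sX**2 / np.sum(X**2, axis=1))[:, None] * X     # James-Stein estimate

quad_gap = np.mean(np.sum((js - theta) ** 2, 1)) - np.mean(np.sum((X - theta) ** 2, 1))
print(quad_gap / (2 * (sX**2 + sY**2)))   # estimated KL risk gap; negative => plug-in dominates
```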

Bayesian improvements
We now focus on Bayesian predictive densities that dominate qU. In doing so, we work with canonical forms as in Lemma 2.1, apply the partitioning argument of Lemma 2.2, and take advantage of known results for prediction in (d − 1)-variate normal models. We consider a class of improper priors on θ given by the product measure of an (improper) uniform density over the linear subspace spanned by a and a second component of the prior (π_0) supported on the subspace orthogonal to a. A measure of this nature splits the resulting Bayes predictive densities into independent parts and leads to a decomposition of the KL risk into two additive parts. Hence, the dominance result is obtained by dominating the part of the KL risk corresponding to the subspace orthogonal to a, where the transformed variables are N_{d−1} distributed and where we can capitalize on known results. Namely, the superharmonicity of π_0, of its associated marginal density, or of the square root of that marginal density will suffice for dominance and minimaxity.
(a) Eq. (4.28) follows from the transformation of variables under the orthogonal matrix H; the distribution of the transformed variables is the canonical form of Lemma 2.1, with the corresponding a_0 having a single non-zero (first) component. (b) Observe that the MRE density estimator qU(·; X) corresponds to π_0(θ) = 1, i.e., the improper uniform density on ζ^(2) ∈ R^{d−1}. By virtue of Lemma 2.2, the KL risk difference between qU(·; X) and qπ(·; X) is then expressed as the corresponding risk difference for the (d − 1)-variate normal component, and part (b) follows.
Remark 4.5. Theorem 4.5's dominance finding in part (b) is unified with respect to the model settings a, L_1 and L_2, as well as the dimension d ≥ 4, σ²_X, and σ²_Y. Furthermore, as seen in the lines of the proof, the difference in risks between the predictive densities qU and qπ: (i) does not depend on the mixing distributions L_1 and L_2, and (ii) depends on θ only through ζ^(2) = H_2θ.
Starting with [14], and continuing namely with [13], several Bayesian predictive densities q_{π_0}(·; X_2) have been shown to satisfy the dominance condition in part (b) of the above theorem. Such choices lead to predictive densities dominating qU. In [13], analogously to the quadratic risk estimation problem with multivariate normal observables (e.g., [27,11]), sufficient conditions for minimaxity are conveniently expressed in terms of the marginal density associated with π_0, given by m_{π_0}(z, σ²) = ∫_{R^{d−1}} φ_{d−1}(z − ζ; σ²I_{d−1}) π_0(ζ) dζ. The superharmonicity of either π_0, of m_{π_0}(z, σ²) for z ∈ R^{d−1} and the relevant values of σ², or of √(m_{π_0}(z, σ²)), each leads to sufficient conditions for minimaxity. We recall here that the superharmonicity of h : R^{d−1} → R holds whenever its Laplacian is non-positive everywhere.
Corollary 4.1. Consider the prediction context of Theorem 4.5 and a prior density π_0 as in (4.27) other than the uniform density. Suppose that m_{π_0}(z, σ²_X) is finite for all z ∈ R^{d−1} and that d ≥ 4. Then, the following conditions are each sufficient for qπ(·; X), given in (4.28) with prior density as in (4.27), to dominate the MRE density qU: (i) m_{π_0}(z, σ²) is superharmonic in z for the values of σ² specified in [13], with strict inequality on a set of positive Lebesgue measure on R^{d−1} for at least one σ²; (ii) √(m_{π_0}(z, σ²)) is superharmonic in z for those values of σ², with strict inequality on a set of positive Lebesgue measure on R^{d−1} for at least one σ²; (iii) the prior π_0 is superharmonic. Proof. The results follow from part (b) of Theorem 4.5 and Theorem 1 - Corollary 2 in [13].
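As a small numerical illustration of superharmonicity (a condition of type (iii)), the sketch below checks by finite differences that the harmonic prior π_0(z) = ‖z‖^{−(d−3)} on R^{d−1}, used later in Section 5, has vanishing Laplacian away from the origin; the grid step and test point are arbitrary.

```python
import numpy as np

d = 7                                            # prior lives on R^{d-1} = R^6
m = d - 1
pi0 = lambda z: np.sum(z**2) ** (-(d - 3) / 2.0) # ||z||^{-(d-3)}

rng = np.random.default_rng(5)
z = 3.0 * rng.standard_normal(m)                 # a test point away from the origin
h = 1e-4
lap = 0.0
for i in range(m):
    e = np.zeros(m); e[i] = h
    lap += (pi0(z + e) - 2 * pi0(z) + pi0(z - e)) / h**2
print(lap)                                       # ~ 0, up to finite-difference error
```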
Choices of the prior density π_0 satisfying the conditions of Corollary 4.1 thus rest upon analyses for the normal case, which are plentiful. In particular, several examples of π_0, and of the resulting predictive density q_{π_0}, are provided in [13]. These provide explicit representations of the minimax predictive densities qπ given in (4.28). A detailed example is presented in Section 5.
The orthogonality decomposition used in this section leads to a further interesting representation, which generalizes the one obtained in the multivariate normal case and on which we now expand. For the multivariate normal case, referring to Theorem 4.5's decomposition, a well-known representation of the Bayes predictive density associated with a prior density π_0 for ζ^(2), given by [13], expresses it as a multiple of q_U(·; X^(2)), the MRE predictive density of the density of Y^(2) based on X^(2), given by a N_{d−1}(x^(2), (σ²_X + σ²_Y) I_{d−1}) density. For the MMN case, we now have the following.
Lemma 4.8. For a prior π_0 and H as in Theorem 4.5, the corresponding Bayes predictive density qπ admits the representation (4.30).
Proof. Using the set-up of Theorem 4.5 and expressions (4.28) and (4.29), the MRE predictive density is obtained as a product over the two components. Therefore, from (4.28) and (4.29) again, as well as from (4.30), we obtain qπ(y; X) = q_U(h_1'y; X) multiplied by the corresponding factor for the orthogonal component, which yields the result.
Such densities do not depend on a and have contours given by hypersurfaces of cylinders with axis given by a (or h_1 = a/‖a‖); Figure 1 shows an example of three such contours for d = 3 and a = (1, 1, 1)'.

Restricted parameter spaces
Theorem 4.5's decomposition also leads to implications when there exist parametric restrictions on ζ^(2) = H_2θ. Statistical models where parametric restrictions are present appear naturally in a great variety of contexts, and there is a large literature on related inferential problems, namely from a decision-theoretic perspective (e.g., [23,28]). Questions of predictive analysis under parametric restrictions are also of interest, with findings obtained in [22,17,10]. Namely, for normal models, specifically model (2.8) with a = 0, Σ_X = σ²_X I_d, Σ_Y = σ²_Y I_d and θ constrained to a convex set C_0 with non-empty interior, [10] showed that the Bayes predictive density associated with the uniform prior for θ on C_0 dominates the MRE predictive density under Kullback-Leibler loss. The next result (Theorem 4.6) extends this finding to MMN models. Proof. As in Theorem 4.5 and its proof, we infer that qπ given in (4.28), with prior density π_C, dominates qU whenever the corresponding dominance holds in the associated (d − 1)-variate normal problem. But, since this latter dominance holds precisely for the density π = π_C with the uniform density choice π_0 = π_{0,U}, as shown in [10], the result follows.
The setting of C above is quite general, and interesting examples include balls and cones. As earlier, the finding is unified and general across MMN models. Here are two applications of Theorem 4.6.
Example 4.4. As evaluated in [17], one obtains an explicit form for the relevant marginal, and (4.31) then yields qπ_C(y; x) in terms of qU(y; x), the MRE density.
Example 4.5. Theorem 4.6 applies for θ restricted to a cylinder of radius, say m, with axis along the direction a, examples of which are drawn in Figure 1. The dominating predictive density qπ_{C_m} is Bayes with respect to the uniform prior density on C_m, which corresponds to (4.33) with g(t) = I_{(0,m)}(t). An explicit expression for qπ_{C_m} can be derived from Lemma 4.8 with π_0 the uniform density on the ball B_m = {t ∈ R^{d−1} : ‖t‖ ≤ m} and the corresponding marginal density, with F_{ν,λ} the cdf of a χ²_ν(λ) distribution. From (4.31), we thus obtain the resulting expression in terms of qU, the MRE density.
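For the uniform prior on the ball B_m, a standard expression of the marginal density is m_{π_0}(z, σ²) = F_{d−1, ‖z‖²/σ²}(m²/σ²) / vol(B_m); the sketch below assumes that form (an assumption of this illustration, consistent with the F_{ν,λ} notation above) and checks it against direct Monte Carlo integration.

```python
import numpy as np
from scipy.stats import ncx2
from scipy.special import gamma as gamma_fn

rng = np.random.default_rng(8)
d, m, s = 6, 2.0, 1.0
k = d - 1
z = np.full(k, 0.7)

vol = np.pi ** (k / 2) / gamma_fn(k / 2 + 1) * m ** k          # volume of B_m in R^{d-1}
closed = ncx2.cdf(m**2 / s**2, df=k, nc=np.sum(z**2) / s**2) / vol

# Monte Carlo: average of phi_{d-1}(z - t; s^2 I) over t ~ Uniform(B_m)
n = 400_000
t = rng.standard_normal((n, k))
t = t / np.linalg.norm(t, axis=1, keepdims=True) * (m * rng.random(n) ** (1 / k))[:, None]
phi = np.exp(-np.sum((z - t) ** 2, 1) / (2 * s**2)) / (2 * np.pi * s**2) ** (k / 2)
print(closed, phi.mean())                                      # the two values agree
```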

Illustrations
We provide here illustrations of Theorems 4.4 and 4.5, accompanied by numerical comparisons and various observations. We first consider the Bayes predictive density qπ_H associated with the harmonic prior π_0 in (5.34). Thus, the prior density is the product measure on R^d with a uniform prior on the linear subspace spanned by a and the above harmonic measure on the chosen (d − 1)-dimensional subspace orthogonal to a. Since π_0 is superharmonic on R^{d−1} for d ≥ 4, it follows from Corollary 4.1 that the Bayes predictive density qπ_H(·; X) given in (4.28), as well as in (5.36) below, dominates the MRE density qU and is consequently minimax. In particular, for odd d ≥ 5, as shown in the Appendix, one may obtain a closed form for the marginal m_{π_0}, which relates to known results on the inverse moments of a chi-square variable with even degrees of freedom (e.g., [8]), as well as a closed form for an incomplete gamma function which intervenes in Komaki's [14] representation of m_{π_0}. From (4.31) and the above, we thus obtain the explicit expression (5.36), where w and σ²_W are as given in Lemma 4.8. Risk differences between qU and qπ_H are plotted in Figure 2a and Figure 2b as a function of ‖ζ^(2)‖², or equivalently as a function of t = ‖ζ^(2)‖²/(d − 1), i.e., in terms of the average squared component of ζ^(2). The actual risks depend on the underlying mixing distributions L_1 and L_2, but the risk differences do not, as previously observed in Remark 4.5.
Observe as well that t is independent of a and only depends on the direction a/‖a‖. Figure 2a has σ²_X = 1, σ²_Y = 2 and varying d, while Figure 2b has fixed d = 5, σ²_X = 1, with σ²_Y = cσ²_X and varying c. As seen in Figure 2a, the improvement in KL risk vanishes as t → ∞, but gains in prominence with increasing d and with the proximity of θ to the linear subspace spanned by a. As seen in Figure 2b, the KL risk difference loses prominence with larger c, which is consistent with the fact that the MRE density gains in reliability when the variance σ²_X of the observable decreases. Frequentist risk ratios between qU and qπ_H are plotted in Figure 2c.
Example 5.7. (Plug-in type improved predictive density) In the context of Theorem 4.4, consider plug-in type predictive densities q_{H,ζ̂_2}(y; X) = q_1(h_1'y; X) × q_{2,ζ̂_2}(H_2y; X) with the choice of the James-Stein estimator ζ̂_JS for ζ̂_2; the Kullback-Leibler risk ratio between qU and q_1 × q_{2,ζ̂_JS} as a function of t is shown in Figure 2d.
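A Monte Carlo analogue of such risk-difference curves for the plug-in James-Stein competitor of Theorem 4.4 (illustrative values; this sketch assumes, as in the normal-case duality, that the KL risk gap equals the (d − 1)-variate quadratic risk gap scaled by 1/(2(σ²_X + σ²_Y))):

```python
import numpy as np

rng = np.random.default_rng(6)
d, sX, sY, n = 7, 1.0, np.sqrt(2.0), 300_000
m = d - 1
for t in [0.0, 0.5, 1.0, 2.0, 5.0, 10.0]:
    zeta2 = np.full(m, np.sqrt(t))                     # any zeta2 with ||zeta2||^2 = (d-1)*t
    Z2 = zeta2 + sX * rng.standard_normal((n, m))
    js = (1.0 - (m - 2) * sX**2 / np.sum(Z2**2, 1))[:, None] * Z2
    gap = (np.mean(np.sum((js - zeta2) ** 2, 1)) - m * sX**2) / (2 * (sX**2 + sY**2))
    print(f"t = {t:5.1f}   estimated KL risk gap = {gap: .4f}")   # gap fades as t grows
```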

Concluding remarks
In this work, we have addressed the problem of determining efficient predictive densities under Kullback-Leibler frequentist risk for multivariate skew-normal distributions and, more generally, for mean mixtures of multivariate normal (MMN) distributions, and we have provided Bayesian and plug-in type predictive densities which dominate the MRE density and are minimax in four dimensions or more. In doing so, we have made use of a canonical transformation which leads to the decomposition of the Kullback-Leibler risk of the predictive densities under consideration into two additive parts, one of which matches that of the MRE and minimax density, while the other relates to a normal model and permits improvement in view of shrinkage predictive density estimation results for such models. Further implications are provided for certain types of parametric restrictions. In addition, motivated by the relative paucity of analytical representations for Bayesian posterior and predictive densities, we have contributed such explicit representations. This work represents, to the best of our knowledge, a first foray into the study of predictive density estimation for MMN distributions. The findings are thus novel, and they are also unified. The canonical transformation technique may well find further applications in predictive analysis, such as for mean-variance mixtures of normal distributions. Extensions to other choices of loss (e.g., α-divergence) and to unknown covariance structures would be most interesting to investigate as well.

Lemma 2.4. (a) For T ∈ R^d with density f and U = ψ(T) ∼ g, with ψ : R^d → R^d invertible with inverse Jacobian J_ψ, we have H(g) = −E[log |J_ψ|] + H(f); (b) Let T = (T^(1), T^(2)) be a random vector with independently distributed components with densities f_1 and f_2; then the entropy of T is the sum of the entropies of its components. (For the proof of Lemma 2.5: the components of the canonical form are independently distributed, and the result follows from part (b) of Lemma 2.4 and a straightforward evaluation of the entropy H(φ_{d−1}).)

Remark 3.2. For the improper prior density π(θ) = 1, one obtains θ|x ∼ MMN_d(x, −a, Σ, L) by a direct calculation. It can also be inferred from the above example with ∆ = τ²I_d and τ² → ∞.
Example 3.3. It is interesting to further study the above posterior distributions for the particular cases where the mixing density (i.e., of V or K) of the MMN model is of the form (3.22).

Theorem 4.4. Consider X, Y distributed as in model (2.8) with a ≠ 0, d ≥ 4, θ ∈ R^d, Σ_X = σ²_X I_d, and Σ_Y = σ²_Y I_d, and the problem of obtaining a predictive density estimator q(y; X), y ∈ R^d, for the density of Y. Let H be a d × d orthogonal matrix whose first row is h_1' with h_1 = a/‖a‖, and whose remaining rows form H_2. Define the densities q_1 and q_{2,ζ̂_2} accordingly.

These risk ratios are shown for σ²_X = 1, σ²_Y = 2 and varying d. They depend additionally on the mixing distributions L_1 and L_2, and are set here with the mixing of Example 2.1 (B) with k = 1, i.e., X|θ and Y|θ have skew-normal distributions with densities given in (2.4) and MRE density expanded upon in part (D) of Section 2.4. We further set a = 1_d = (1, . . . , 1)', in which case the harmonic prior density on θ in (5.34) reduces to π_0(θ) = ‖θ − θ̄1_d‖^{−(d−3)} with θ̄ = (1/d) Σ_{i=1}^d θ_i. With the above settings, the constant (and minimax) risk of the MRE density can be computed from (2.15). For instance, we obtain R(θ, qU) ≈ 1.0954 for d = 5, ≈ 1.5187 for d = 7 and ≈ 1.9403 for d = 9. These are close to linear in d, with the term representing the MRE risk for the normal case with a = 0 (≈ 1.0137 for d = 5, ≈ 1.4191 for d = 7 and ≈ 1.8246 for d = 9) being dominant in (2.15). As seen in Figure 2c, where the risk ratios are plotted with respect to t = (1/(d − 1))‖θ − θ̄1_d‖², the gains increase with d and with the closeness of the θ_i's to θ̄.

Figure 2: KL risk performance of the different predictive density estimators, with the MRE density as benchmark.

With the standard representation T | K ∼ χ²_{d−1+2K} with K ∼ Poisson(‖z‖²/(2σ²)), we have an explicit expression for E[T^{(3−d)/2}], and the model density is given by (2.6) with the above h and c_1.
To conclude the description of the dominance findings of this section and of Section 4.1, we point out that the plug-in type improvements of Theorem 4.4 and the Bayesian dominance results of Theorem 4.5 and Corollary 4.1 are applicable regardless of the choice of the orthogonal completion H_2 of H, thus adding to the choices of π_0 leading to minimaxity. Furthermore, the above developments are unified and the findings are applicable for all MMN models (2.8) with Σ_X = σ²_X I_d and Σ_Y = σ²_Y I_d, as well as for Σ_Y = cΣ_X, as justified in Remark 4.4.
Remark 4.6. A particularly appealing choice of H_2, which will be further explored below in Sections 4.3 and 5, is the one such that H_2'H_2 = I_d − aa'/(a'a). An explicit expression for the corresponding qπ_H is available from Lemma 4.8, with the associated marginal density.