Optimal Shrinkage Estimation of Predictive Densities under –divergences

We consider the problem of estimating the predictive density in a heteroskedastic Gaussian model under general divergence loss. Based on a conjugate hierarchical set-up, we consider generic classes of shrinkage predictive densities that are governed by location and scale hyper-parameters. For any α-divergence loss, we propose a risk-estimation based methodology for tuning these shrinkage hyper-parameters. Our proposed predictive density estimators enjoy optimal asymptotic risk properties that are in concordance with the optimal shrinkage calibration point estimation results established by Xie, Kou, and Brown [53] for heteroskedastic hierarchical models. These α-divergence risk optimality properties of our proposed predictors are not shared by empirical Bayes predictive density estimators that are calibrated by traditional methods such as maximum likelihood and method of moments. We conduct several numerical studies to compare the non-asymptotic performance of our proposed predictive density estimators with other competing methods and obtain encouraging results. MSC2020 subject classifications: Primary 62L20; secondary 60F15, 60G42.


Introduction
Predictive density estimation (prde) is one of the fundamental problems in statistical prediction analysis (see chapters 2, 7 and 10 of Aitchison and Dunsmore, 1975 and chapters 2, 3 and 9 of Geisser, 1993). Predictive density estimates assign probabilities to all possible future outcomes and can be used for better risk assessment and decision making than traditional point estimation methods (Mukherjee, 2013;Liang, 2002;Xu, 2005). Predictive densities have been widely used in a host of statistical applications in weather forecasting (Taylor and Buizza, 2004), finance (Tay and Wallis, 2000), information theory (Liang and Barron, 2005;Yuan and Clarke, 1999;Barron et al., 1998) as well as for model diagnostics and validation (Gelman et al., 2013;Pardoe, 2001;Gelman et al., 2014).
In this paper, we consider multivariate predictive density estimation under general divergence loss in a heteroskedastic Gaussian model. For point estimation, since the seminal work of James and Stein (1961) there has been substantial research toward understanding the risk properties of shrinkage estimators for the homoscedastic hierarchical normal models (see Fourdrinier et al., 2018, Efron, 2012 and the references therein). The concept of shrinkage is important because it provides an elegant framework for combining information from related populations and often leads to substantial improvements in the performances of estimators used for simultaneous inference. Komaki (2001Komaki ( , 2004; George et al. (2006); Brown et al. (2008) demonstrated the critical role of shrinkage priors for constructing efficient predictive density estimates (prdes) under Kullback-Leibler (KL) loss. High-dimensional decision theoretic parallels between prde under Kullback-Leibler loss and point-estimation under quadratic loss have been established in Fourdrinier et al. (2011); Xu and Liang (2010); George et al. (2012); Kubokawa et al. (2013); Johnstone (2017, 2015); Yano and Komaki (2017); Ghosh et al. (2019). Ghosh et al. (2008), Suzuki and Komaki (2010), Maruyama and Strawderman (2010), L' Moudden and Marchand (2018) and Ghosh and Kubokawa (2018) extended those parallels for prde under general α-divergences. Using KL loss as a divergence measure between the true and estimated predictive density leads to convenient tractable analysis. However, predictive densities calibrated by KL loss are often non-robust to outliers and may under-estimate the variance or ignore important local attributes of the true density. To circumvent these issues, it is becoming increasingly popular in complex prediction approaches (Wang et al., 2018;Hernández-Lobato et al., 2016;Cichocki and Amari, 2010;Zhang et al., 2017) to use the class of α-divergences (Amari, 2009;Liese and Vajda, 2006;Barron et al., 1992) that covers a wide spectrum of divergence measures with contrasting attributes.
For predictive density estimation in multivariate Gaussian models, Ghosh et al. (2008) showed that the canonical minimax prde, which is the Bayes prde under uniform prior, is not admissible under general divergence loss for dimensions greater than 2. For dominating the canonical minimax prde, Ghosh et al. (2008) used prdes that are not necessarily Bayes, whereas Maruyama et al. (2019) established the domination results for the Bayes prde under the harmonic prior of George et al. (2006). Ghosh and Kubokawa (2018) established that the hierarchical Bayes prde has lower frequentist risk than that of the empirical Bayes prde in a regression set-up. While Ghosh et al. (2008); Ghosh and Kubokawa (2018) showcased enhanced predictive efficiency of the Bayes prde from non-informative priors over plug-in prdes, L'Moudden and Marchand (2018) proposed improving plug-in prdes directly. However, all of these results are based on the homoskedastic model. They also do not provide any prescription for selecting a particular prde among the host of feasible and admissible prdes.
Here, we study the prde in a heteroskedastic set-up where our target density is no longer spherically symmetric, and consider the problem of finding optimal shrinkage directions. We provide a data driven program for determining the optimal directions (location) and magnitude (scale) of shrinkage such that the resultant prde has minimal frequentist risk among a wide class of shrinkage estimators. Our proposed prde not only possesses the plug-in-dominance properties of the Bayes prde as in Ghosh et al. (2008), but also obtains the minimal risk among a wide class of shrinkage rules. These αpredictive risk optimality properties parallel those established by Xie, Kou, and Brown (2012) for point estimation in heteroskedastic hierarchical models.
Recent point estimation results of Xie, Kou, and Brown (2012); Xie et al. (2016); Tan (2015); Weinstein et al. (2018) have brought to light new shrinkage phenomena in heteroskedastic models. A hierarchical set-up specifying a second-level structure to motivate the shrinkage is considered, and the corresponding hyper-parameters are subsequently estimated. Whereas, the common practice is to choose the conjugate hierarchical structure and estimate the hyper-parameters through empirical Bayes maximum likelihood estimator (EBMLE) or empirical Bayes method of moments (EBMOM), we instead consider tuning the hyper-parameters by minimizing efficient risk estimates as in Xie, Kou, and Brown (2012).
A significant finding reported in Xie, Kou, and Brown (2012); Xie et al. (2016); Tan (2015); Weinstein et al. (2018) is that, under heteroskedasticity, EBMLE or EBMOM provide sub-optimal predictive performance and are far outperformed by algorithms tuned using risk estimation-based approaches. We establish asymptotic optimality of our proposed predictive methods akin to the point estimation results in Xie, Kou, and Brown (2012). These asymptotic properties are not shared by EBMLE or EBMOM based prdes. We establish asymptotic convergence rates of our risk estimates as dimension increases. Dimension independent non-asymptotic characterizations of the predictive risk of our proposed estimators are also provided using maximal inequalities for martingales. We compare these comprehensive results for general α-divergence with those of Xu and Zhou (2011) who studied empirical Bayes prde in spherically symmetric homoskedastic Gaussian model under KL loss. Our general α-divergence results well reconcile with the KL results in the existing literature. Through numerical studies, we demonstrate the benefits of using α-divergence based risk calibrated prdes over EBMLE or EBMOM based prdes. The direction of shrinkage and the shape of the optimally shrunken prdes greatly varies as α changes.

Predictive Set-Up
Predictive Sequence Model Consider observing a vector X = {X 1 , . . . , X n } where X i are independent among each other and X i follows N (θ i , σ 2 i ), i = 1, . . . , n. Based on observing X we would like to predict the unknown density of future observations Y = Here, σ i , ν i are known and thus, r i = ν 2 i /σ 2 i is the known ratio of the future-to-past variances. The observed past X and unobserved future Y are only related through the unknown location parameter θ = {θ i : 1 ≤ i ≤ n}. This is the heteroskedastic version of the Gaussian predictive model studied in Komaki (2001); George et al. (2006); Brown et al. (2008); Xu and Zhou (2011). Letp(y|x) be any prde for the true density p(y|θ, Loss Function Consider α-divergence as the measure of discrepancy between the prde and the true density. For any fixed α ∈ R, the loss defined as Integrating over the density of the observed past, we have the predictive α-risk as: When α = 3, we have Pearson χ 2 -divergence where as α = −3 corresponds to Neyman χ 2 -divergence; α = −1 is KL loss and α = 1 yields reverse KL loss; α = 0 corresponds to Bhattacharyya-Hellinger (BH) loss Bhattacharyya (1943). Note that, the predictive risk might not be well-defined for all α ∈ R on which we reflect later. For α 0 ∈ {−1, 1}, R n,α0 (θ,p) = lim α→α0 R n,α (θ,p), so that the KL and reverse KL predictive risk expressions will follow from the general α-predictive risk. Note that, the α predictive risk is the posterior predictive relative entropy regret criterion introduced in Sweeting et al. (2006) and differs from the reference prior inducing criterion studied in Clarke and Yuan (2010); Ghosh et al. (2011).

Hierarchical Set-Up and Shrinkage Prdes
Next, we assume a hierarchical higher level exchangeable structure on the unknown location parameters. Let {θ i : 1 ≤ i ≤ n} be independent and identically distributed (i.i.d.) from a N (η, τ ) prior where η ∈ R and τ ≥ 0 are unknown hyper-parameters. This exchangeable hierarchical set-up, wellused in the literature (Xie, Kou, and Brown, 2012;Efron and Morris, 1973;Robbins, 1964;Zhang, 2003), allows partial pooling of information from quantities of interest for different yet related groups of populations. Define the integrated Bayes risk with respect to any prior π n on θ as B n (π,p) = R n,α (θ,p)π n (θ) dθ.
The following result (proved in George et al. (2021, Section 1.2)) shows that for all α ≤ α U and for any product normal priors, i.e., π n (θ) = φ(θ i − η; τ ), the integrated Bayes risk has a well-defined minima and the resultant Bayes predictive density estimator is also a product of normal densities. where Henceforth, we consider only α-divergences with α ≤ α U . Note that as min i r i ≥ 0, BH, KL, reverse KL, Neyman χ 2 divergences are always covered in our results. If min i r i ≥ 1, then the Pearson χ 2 divergence is also covered. Based on the above result, we consider the following flexible class , andq 1 ,q 2 are q/2 and (1 − q/2)th quantiles of X 1 , . . . , X n for any prefixed q ∈ (0, 1). The class is indexed by the location and scale hyper-parameters, η and τ . For any sensible shrinkage predictors, it is enough to confine the location hyper-parameter η within the 100q% range of the observed data. The scale hyper-parameter τ varies over the non-negative axis. In the following section, we provide a methodology for choosing the hyper-parameters so that the resultant prde has optimal risk properties among all prdes in the class S.

Extension to Non-diagonal Predictive Set-Ups
The results and methodology developed here can encompass non-diagonal predictive set-ups where X ∼ N (θ, Σ p ) and Y ∼ N (θ, Σ f ) with Σ p and Σ f being known positive definite matrices. For a nondiagonal prior θ ∼ N (η, Λ), Lemma 2.1 can be extended as follows with α U being 1 + 2λ min (Σ f Σ −1 p ), which is the generalization of the scalar case. The proof of the lemma is presented in George et al. (2021, Section 1.2).
For tractable shrinkage classes, we need to impose lower dimensional structures on η and Λ in (2.2). Perhaps, the most popular choice is η = η1 and Λ = τ I, which extends the class S based on (2.1) to the non-diagonal set-up. In the following sections, we describe our method and its associated results for the class S. However, note that the methodology can be extended to other shrinkage classes based on (2.2).
Hereon, we describe our method first for the diagonal predictive set-up as it produces comparatively simpler expressions that can be intuitively studied and compared to the point estimation results of Xie, Kou, and Brown (2012) and the predictive KL results in Xu and Zhou (2011). The general results for non-diagonal set-ups along with their complete proofs are provided in George et al. (2021, Section 1).

Risk Estimation and Hyper-parameter Calibration
Denote by R n,α (θ; η, τ ) the risk R n,α (θ,p[η, τ ]) of any arbitrary memberp[η, τ ] in S. The following result proved in George et al. (2021, Section 1.3) shows that this multivariate predictive risk decouples as functions of the corresponding coordinate-wise risks and subsequently can be explicitly written through closed form expressions as functions of η, τ, θ and α. Letting τ → ∞, we get the second display in Theorem 2.4 of Ghosh et al. (2008).
Theorem 3.1. For α ≤ α U and α = ±1, the risk of any prdep[η, τ ] ∈ S can be expressed as The risk for the KL and reverse KL losses can be derived from the above expression by noting that for α = α 0 ∈ {−1, 1}, where, the last equality follows from the fact that H i = 0 when α ∈ {−1, 1}, and L'Hôpital's rule. Thus, Note that R n,−1 matches the KL risk expression in equation (11) of Xu and Zhou (2011). We next estimate the predictive risk R n,α (θ; η, τ ). Noting that ( Consider their averageĤ n (τ, η; α) = n −1 n i=1Ĥ i (τ, η; α). For any fixed α ≤ α U , select the hyper-parameters that minimize the average risk estimate, i.e., (η n,α ,τ n,α ) = arg min 0≤τ, η∈ [q1,q2] sign(α 2 − 1)Ĥ n (η, τ ; α) (3.1) to obtain our proposed prdep[η n,α ,τ n,α ](y|x). Thus, when |α| < 1, we maximizê H n (η, τ ; α) and we minimize it when |α| > 1. However, note that in both the cases, this corresponds to minimizing risk estimates of the actual risk. For α = ±1, we directly minimize the unbiased estimates of R n,1 and R n,−1 . Taking the limit of the hyper-parameters (η n,α ,τ n,α ) as α → 1 − or α → −1 + also yields similar results. Figure 1 shows the BH risk of prdes for n = 5 and 10 at θ = t1 where t varies from 0 to ∞. The risks of the best prde in S (which is characterized later in (4.1)) are plotted in blue. The risk of our proposed method is calculated by Monte-Carlo integration and is plotted in red. In dotted black lines, we have the risk of the best invariant prdep U which is the Bayes prde from the uniform prior. We see that compared top U significant gains in risk can be obtained by estimators in S when |t| is near the origin and the gains decrease when |t| is large. Also, the risk of the proposed method is reasonably close to the minimum attainable risk in S. Next, we rigorously document these risk properties of prdes tuned by the proposed procedure.

Theory Results
We first establish a non-asymptotic concentration bound on the deviation ofĤ n from the true log-risk. Consider the expected absolute deviation: We establish an upper bound on D n,α that depends on the L 2 norm of the signal strength: Theorem 4.1. For any α ≤ α U , any fixed η ∈ R and θ ∈ R n , for all n ≥ 1, where, κ 0 = 12 is an absolute constant and r * = inf i r i .
Our next result which uses Theorem 4.1 shows that our proposed risk estimate approximates the average of the logarithm of the true multivariate risk uniformly well for all prdes in the shrinkage class S. Thus, calibrating the hyper-parameters by minimizing the risk estimates is a sensible choice. To facilitate shorter mathematical proofs we assume the following asymptotic conditions: Though these conditions may be further relaxed as noted after Theorem 3.1 of Xie, Kou, and Brown (2012), we do not seek the full generality as the conditions are not restrictive and the proofs presented under these assumptions contain all the essential statistical perspectives.

Theorem 4.2.
Under Assumptions A1-A2, for any α ≤ α U and a n = o(n 1/2 ), a n sup The above result (proved in George et al. (2021, Section 1.5)) shows that the risk es-timatesĤ n have near-parametric √ n-rates of convergence barring some poly-log terms. Thus, we expect the risk estimates to be reasonably precise even for moderate dimensions n. This attribute is reflected in the simulation studies in Section 5.
To study the risk properties of our proposed prdep[η n,α ,τ n,α ], we next introduce the oracle risk (OR) hyper-parameters as those which minimize the true risk function: (η or n,α , τ or n,α ) = arg min 0≤τ, η∈ [q1,q2] R n,α (θ; η, τ ). (4.1) These oracle choices are not really estimators since they depend on the unknown θ values. They are not obtainable in practice but provide the theoretical benchmark that one can ever hope to reach. Indeed, no prdes in S can have smaller risk than the oracle risk prdep[η or n,α , τ or n,α ]. Consider the average logarithmic ratio of the risk deviations where γ α = sign(α 2 − 1).
[A3] For any ε > 0 and −∞ < a 0 < a 1 < ∞, The following result shows that our estimated hyper-parameters are close to the oracle hyper-parameters and our proposed prde has risk close to the oracle risk. This property is not shared by the EBMLE or EBMOM. The estimating equations for calibrating the hyper-parameters based on the EBMLE and EBMOM (discussed in the following section), differ from our proposed methodology so that the estimated hyper-parameters can be highly different (see cases IV-V in Section 5). In such cases the EBMLE and EBMOM tuned prdes have much higher risk than the oracle unless the risk function is completely flat. As our proposed estimatorp[η n,α ,τ n,α ] is asymptotically nearly as good as the oracle prde, its asymptotic risk is no larger than that of any other prdes in the general class S.
Theorem 4.3. For any α ≤ α U , under assumptions A1-A2, the logarithmic ratio ρ n,α (θ) converges asymptotically satisfying lim sup n→∞ n 1/2 ρ n,α (θ) < ∞. Additionally, with assumption A3, we have: The above result is proved in George et al. (2021, Section 1.6). It shows that as n increases, the average logarithmic ratio of risk deviations from the oracle (ALRORD) converges to 0 for any prde calibrated by the proposed method. As such, as n → ∞ the ALRORD of any prde calibrated by the proposed method is always bound above by O p (n −1/2 ). Assumption A3 implies that the ALRORD for an arbitraryp[η, τ ] based on hyper-parameter (η, τ ) cannot be bounded below O p (n −1/2 ) unless the hyper-parameter (η, τ ) is within any prefixed -neighborhood of the oracle hyper-parameter values. This ensures that as n → ∞, the hyper-parameter estimates in (3.1) converge to the oracle values in (4.1).

Simulation Experiments
We conduct six simulation experiments to compare the performance of our estimation methodology with competing methods for calibrating estimators in S. We consider the EBMLE tuned prde which uses hyper-parameters the EBMOM tuned prde whose hyper-parameters are solutions to the following equationŝ the extended James-Stein (Xie, Kou, and Brown, 2012) based prde in S with hyperparameterŝ as well as the completely non-informative (NI) prde which has τ → ∞ and the oracle estimator of (4.1). Additionally, we also consider the Bayes prdep C from the heavy tailed Cauchy prior.
The six simulation regimes are inspired by experiments in section 4 of Xie, Kou, and Brown (2012). In Cases I to V, we have Gaussian noise with mean 0 and variances σ 2 i being i.i.d. from Uniform[0.5, 1.5] distribution. Thus, the mean noise variance is 1. In Case I, we generate θ ∼ N (0, 2I n ) and the r i as uniformly distributed between 0.1 and 1. The average signal-to-noise-ratio (snr) is 2 here. Note, that this case is in perfect congruence with the hierarchical normal set-up of Section 2. The remaining cases are different from the normal models and conjugate priors set-ups of Lemmas 2.1 and 2.2. In Case II, θ i are i.i.d. from a Uniform prior on [− √ 3, √ 3] and thus has variability 1. Here, we consider stratification in r. The r i are generated from a mixture of three uniform distributions with significantly different supports. In Case III, we introduce dependence between the means and the future variances by setting θ = r while r is i.i.d. form Uniform[0.1,2]. Case IV is similar to case 3, except r i 's are no longer bounded as before but are generated from an inverse-Chisquare with 5 degrees of freedom. In Case V, {(θ i , r i ) : i = 1, . . . , n} are independent among themselves, r i takes two possible values with P (r i = 10) = 0.7 and P (r i = 1) = 0.3, and the true mean values are generated conditionally on the r i values with [θ i |r i = 10] and [θ i |r i = 1] both being normal distributions with mean 0 and standard deviations 0.1 and 1, respectively. In Case VI, we consider θ, r as in case 3. However, here the noise is no longer Gaussian but is from a uniform distribution with mean 0 and variance 1. The snr is 1.4 in this set-up. Figure 2: Adjusted BH risk of the prdes based on the oracle estimator of (4.1) (in black), EBMLE (in blue), EBMOM (in green), Cauchy prior based Bayes prde (in cyan) and our proposed method (in red) are plotted as n varies along the x-axis.
We report the Bhattacharyya-Hellinger predictive risk for the six cases in Table 1 and Figure 2. In each of the six cases, we generate {θ i , σ i , ν i : 1 ≤ i ≤ n} values once from the corresponding model and then calculate the adjusted BH risk for differentη andτ estimates across 100 replicates. We calculate the BH risk adjusted for dimension by ABH n (θ; η, τ ) = 1 − {1 − c 0 R n,0 (θ; η, τ )} 1/n . For each case and n, the reported adjusted BH predictive risk in Figure 2 is the average of the ABH value across the 100 replicates.
In Table 1, n = 1000 and so, it reflects the asymptotic risks of these estimators. Figure 2 compares the risk profiles as sample size varies (NI and JS were avoided in the display for their significantly higher error rates reduced clarity in some of the plots).  Table 1: As n → ∞, the limiting BH predictive risk (adjusted for dimensions) of different shrinkage prdes is reported in % of excess risk over that of the oracle estimator in (4.1).
From the table, we see our proposed estimation method is asymptotically close to the oracle in all the cases where as the figure displays that suitable accuracy can be attained for moderate n across all the concerned scenarios. The EBMLE and the EBMOM have risks similar to ours in cases I and II but has significantly worse performance in the other cases when there is dependence between θ and r. The Cauchy prior based Bayes prdep C performs considerably better than the EBMLE and EBMOM in Cases III, IV and V, when n is large (see Table 1), though its risk is still significantly higher than that of the proposed method. In moderate dimensions, the relative risk ofp C is substantially higher than the EBMLE, the EBMOM and the proposed methods. As such for n ≤ 200, the risk curve forp C is higher than the EBMLE or EBMOM in all cases except IV (see Figure 2). The JS based prde has erratic asymptotic risk behavior as it often fails to adapt to the heterogeneity in the data and the non-informative prior based prde has poor performance across all regimes. We present two numerical examples in the supplementary materials (George et al., 2021) where suboptimal performance of the JS based prde is witnessed.

Discussion and Future Work
We developed a risk estimation based methodology for tuning linearly shrunken prdes in Gaussian models under general α-divergence losses. The proposed risk estimation based method will be particularly useful under heterogeneity. If the set-up is homoskedastic (σ i = σ and ν i = ν for all i), then in high dimensions the proposed method will produce hyper-parameter estimates similar to the EBMOM and EBMLE. An interesting topic for future work will be to introspect the roles of prde under α-divergences when covariances are unknown as studied in Kato (2009) for the KL loss. Also, in applications it is important to choose an α-divergence loss that is tailored to the specific prediction task at hand (Kempthorne et al., 1988;Rosasco et al., 2004). Following Nguyen et al. (2009); Gül and Zoubir (2016) it will be important to study the roles different α predictive risks in hypothesis testing and classification problems. Another important direction would be understanding the roles of adaptive calibration under α-risk in the presence of latent structures in the mean parameters such as the sparsity restrictions studied in Mukherjee (2013); Yano et al. (2021) for the KL loss. Finally, extending the risk estimation based methodology developed here to non-normal models will be useful.