On discrete priors and sparse minimax optimal predictive densities

We consider the problem of predictive density estimation under Kullback-Leibler loss in a high-dimensional Gaussian model with exact sparsity constraints on the location parameters. For non-asymptotic sparsity levels, the least favorable prior is discrete. Here, we study the first order asymptotic minimax risk of Bayes predictive density estimates where the proportion of non-zero coordinates converges to zero as dimension increases. Motivated by an optimal thresholding rule in Mukherjee and Johnstone (2015), we propose a discrete prior and show that its Bayes predictive density estimate is minimax optimal. This produces a nonsubjective discrete prior distribution that minimizes the maximum posterior predictive relative entropy regret. We discuss the decision theoretic implications and the structural differences between our proposed prior and its closest predecessor – the geometrically decaying discrete prior of Johnstone (1994a) that produced minimax optimal point estimators under quadratic loss. Through numerical experiments, we present non-asymptotic worst-case risk of our proposed estimator across different sparsity levels. MSC2020 subject classifications: Primary 62L20; secondary 60F15, 60G42.


Introduction and main results
A fundamental problem in statistical prediction analysis is to choose a probability distribution, based on observed data, that will be good at predicting the behavior of future samples (Aitchison and Dunsmore, 1975; Geisser, 1993; Aitchison, 1975). The future probability density conditioned on the observed past is referred to as the predictive density, and estimating it plays an important role in a number of statistical applications (Liang, 2002; Mukherjee, 2013). Consider the problem of predictive density estimation in an n-dimensional Gaussian location model where the observed past vector X ∼ N_n(θ, v_x I) and the future vector Y ∼ N_n(θ, v_y I). The variances v_x and v_y are known. The future and past vectors are related only through the unknown location vector θ. Consider predictive density estimators (PRDE) p̂(y|x) and measure their performance in estimating the true future density p(y|θ, v_y) = N_n(θ, v_y I) by the global divergence measure of Kullback and Leibler (1951), L(θ, p̂(·|x)) = ∫ p(y|θ, v_y) log { p(y|θ, v_y) / p̂(y|x) } dy. (1.1) The KL risk integrates the above loss over the past distribution and is given by ρ(θ, p̂) = ∫ L(θ, p̂(·|x)) p(x|θ, v_x) dx. Sweeting et al. (2006) showed that ρ(θ, p̂) constitutes a posterior predictive relative entropy regret criterion that can be used for the construction of nonsubjective prior distributions. Given any prior π on θ, the Bayes PRDE is p̂_π(y|x) = ∫ p(y|θ, v_y) π(dθ|x). The average integrated risk B(π, p̂) = ∫ ρ(θ, p̂) π(dθ), when well-defined, is minimized by p̂_π, yielding the Bayes risk B(π) = inf_p̂ B(π, p̂).
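The uniform (flat) prior makes the risk computation above fully explicit: its Bayes PRDE is the Gaussian N_n(x, (v_x + v_y)I), and its KL risk equals (n/2) log(1 + 1/r) for every θ. The sketch below checks this closed form by Monte Carlo over the past observation X; the function names are ours, not the paper's.

```python
import numpy as np

def kl_gauss_iso(mu0, v0, mu1, v1):
    """KL( N(mu0, v0 I) || N(mu1, v1 I) ) for isotropic n-dim Gaussians,
    i.e. the loss (1.1) when the PRDE is itself Gaussian."""
    mu0, mu1 = np.asarray(mu0, dtype=float), np.asarray(mu1, dtype=float)
    n = len(mu0)
    return 0.5 * (n * (np.log(v1 / v0) + v0 / v1 - 1.0)
                  + np.sum((mu0 - mu1) ** 2) / v1)

def kl_risk_flat_prior_prde(theta, vx, vy, n_mc=200_000, seed=0):
    """Monte Carlo KL risk of the Bayes PRDE under the flat prior,
    p_hat(.|x) = N_n(x, (vx+vy) I); the closed form is (n/2) log(1 + vx/vy)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta, dtype=float)
    n = len(theta)
    x = theta + np.sqrt(vx) * rng.standard_normal((n_mc, n))  # past draws
    s2 = vx + vy
    # loss (1.1) in closed form for Gaussian-vs-Gaussian, averaged over X
    kl = 0.5 * (n * (np.log(s2 / vy) + vy / s2 - 1.0)
                + np.sum((theta - x) ** 2, axis=1) / s2)
    return kl.mean()
```

Since ||θ − X||² averages to n v_x, the Monte Carlo estimate should match (n/2) log(1 + v_x/v_y), i.e. (n/2) log(1 + 1/r), independently of θ.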
Decision theoretic parallels between PRDE and point estimation (PE) of the multivariate normal mean under squared error loss are established in Komaki (2001); George et al. (2006); Brown et al. (2008); Ghosh et al. (2008); Kato (2009a); Maruyama and Ohnishi (2019). These risk analysis results hold for any dimension n. In higher dimensions, Fourdrinier et al. (2011); Xu and Liang (2010); Kubokawa et al. (2013) developed minimax optimal PRDEs for constrained parameter spaces [see George et al. (2012), Ch. 1 of Mukherjee (2013) and George et al. (2019) for extensive reviews]. Sparse PRDE under exact ℓ_0 sparsity constraints on the location parameter is studied in Mukherjee and Johnstone (2017, 2015), where the efficacy of different PRDEs was evaluated with respect to the minimax benchmark risk R*(Θ) = inf_p̂ sup_{θ∈Θ} ρ(θ, p̂). For an ℓ_0 constrained parameter space Θ_0[s_n] = {θ ∈ R^n : Σ_{i=1}^n 1{θ_i ≠ 0} ≤ s_n}, when η_n = s_n/n → 0, the first order asymptotic minimax risk was evaluated as R*(Θ_0[s_n]) = (1 + r)^{-1} n η_n log η_n^{-1} (1 + o(1)) as n → ∞, (1.2) where r = v_y/v_x. The minimax risk increases as r decreases. The difficulty of the density estimation problem increases as r decreases, as we need to estimate the future observation density based on increasingly noisy past observations. The rate of convergence of the minimax risk with n does not depend on r, and so exact determination of the constants is needed to show the role of r in this prediction problem. Several predictive phenomena that contrast with point estimation results have been reported, with the divergence becoming palpable as r decreases.
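As a quick numerical reading of (1.2), the helper below evaluates the first order approximation R*(Θ_0[s_n]) ≈ (1 + r)^{-1} n η_n log η_n^{-1}; the function name is ours, and the value is only the leading-order benchmark, not an exact risk.

```python
import numpy as np

def asymptotic_minimax_risk(n, eta, r):
    """First order approximation to the minimax KL risk over Theta_0[s_n],
    following (1.2): R* ~ (1+r)^(-1) * n * eta * log(1/eta) as eta -> 0."""
    return n * eta * np.log(1.0 / eta) / (1.0 + r)
```

For instance, at n = 10^4, η_n = 10^{-3} and r = 1 the benchmark is roughly 34.5 nats, and, as the formula shows, it grows as r decreases toward 0.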
Here, we study the risk of Bayes predictive density estimators based on sparse discrete priors. In order to incorporate the knowledge of sparsity of the parameters, we consider priors with an atom of probability (spike) at the origin. Procedures based on spike-and-slab priors have been shown to be very successful for sparse estimation (Johnstone and Silverman, 2004; Clyde and George, 2000; Rockova and George, 2018). Here, we consider slabs based on discrete priors. In regimes with non-asymptotic sparsity levels, i.e., η_n → η ∈ (0, 1), the least favorable prior is unique and discrete (Berger and Bernardo, 1992; Zhang, 1994). Risk analysis of estimators based on discrete priors has a rich history in statistical decision theory (Johnstone, 2013; Marchand et al., 2004; Bickel, 1983; Kempthorne, 1987), particularly for studying the worst-case geometry of parametric spaces. For tractable analysis and detailed insights, minimax optimality based on discrete priors is studied in the asymptotic regime (Johnstone, 1994b; Bickel, 1981, 1983). Johnstone (1994a) (henceforth referred to as J94) established that for sparse point estimation a product prior based on discrete marginals containing equi-spaced support-points with geometrically decaying probability is asymptotically minimax optimal. Mukherjee and Johnstone (2017) (referred to hereon as MJ17) showed that the Bayes PRDE from the J94 prior is minimax sub-optimal under Kullback-Leibler loss. In this paper, we construct a discrete prior whose marginals have geometrically decaying tail probabilities akin to J94 but have different prior spacings, so that the resultant Bayes PRDE is minimax optimal.
The discrete prior we study here is inspired by the risk diversification phenomenon introduced in Mukherjee and Johnstone (2015) (henceforth referred to as MJ15) for constructing minimax optimal PRDEs. MJ15 showed that in contrast to point estimation, for obtaining minimax optimality in sparse PRDE we need to incorporate the notion of diversification of the future risk. The optimal thresholding rule of MJ15 used two Bayes PRDEs: One of those is the Bayes PRDE from a symmetric product prior whose marginals have finitely many support points with atoms except at the origin having equal probability. Here, we conduct detailed worst-case risk analysis of PRDEs based on generic versions of such discrete priors. Unlike MJ15, our proposed prior has marginals with clusters of equi-probable atoms and the clusters have different probabilities. Compared to MJ15, our proposed clustered prior based Bayes PRDE has the advantage of avoiding the discontinuous thresholding operation in order to obtain sparse minimax optimality.
We first present our main result regarding minimax optimality of the Bayes PRDE from the proposed clustered discrete prior. Thereafter, we discuss the implications of the result along with detailed background and connections to the existing literature.

Main result: Minimax optimality
For any fixed positive r, consider the Bayes PRDE from a discrete product prior consisting of symmetric marginals π^CL (defined below). The marginal has equispaced clusters of atoms with geometrically decaying probability content in the clusters as they move away from the origin. For any η ∈ (0, 1) and r ∈ (0, ∞), consider the univariate clustered discrete prior of (1.3), which has an atom of probability 1 − η at the origin and the remaining η probability shared across clusters. Each cluster C_i has κ atoms {μ_{ij} : j = 1, . . . , κ} of equal probability, which is the reason for referring to such prior distributions as clustered priors. Let v = (1 + r^{-1})^{-1}, λ_e := λ_e(η) = (2 v_x log η^{-1})^{1/2} and λ_f := λ_f(η, r) = v^{1/2} λ_e. For any fixed γ ≥ 1, the atoms in C_1 are aligned between λ_f and λ_e in a geometric progression with common ratio γ, i.e., μ_{1j}(η, r, γ) = (γ^{j−1} λ_f) ∧ λ_e for 1 ≤ j ≤ κ. Such geometric spacing was introduced in MJ15 (see Theorem 1C there). For i ≥ 2 the atoms are extended periodically to cluster C_i as μ_{ij} = (i − 1)μ_{1κ} + μ_{1j}, and by symmetry μ_{−i,j} = −μ_{ij} on the negative axis. Thus, the clusters themselves are equidistant at a separation of λ_f, and while the atoms within each cluster have equal probability, the clusters themselves have geometrically decaying probabilities as in (1.4). Our proposed cluster prior π^C has γ = γ_r = 1 + 4r and κ = K_r as defined in (1.6), with r_0 = 0.5. Note that K_r = 1 iff r ≥ r_0. When K_r ≥ 3 and i ≥ 1, all atoms except the K_r-th one in any cluster C_i are aligned in a geometric progression starting from μ_{i1} = (i − 1)λ_e + λ_f with common ratio 1 + 4r, and μ_{iK_r} = iλ_e. Table 1 shows the cluster size as r varies. Figure 1 shows the schematic diagram of the (truncated) prior with 6 clusters for two instances, r = 0.38 and r = 0.14 respectively. While the former has clusters of size 2, the latter has cluster size 4.
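The construction above can be sketched programmatically. The snippet below builds the positive-axis atoms and weights of a univariate clustered prior. It assumes K_r = 1 for r ≥ r_0 and K_r = 1 + ⌈log(1 + r^{-1})/(2 log γ_r)⌉ otherwise (one reading of (1.6) that reproduces the cluster sizes 2 and 4 quoted for r = 0.38 and r = 0.14), and it splits the off-origin mass η geometrically across a truncated number of clusters; the truncation, normalization and function names are illustrative, not taken verbatim from the paper.

```python
import numpy as np

def cluster_prior(eta, r, vx=1.0, n_clusters=6):
    """Sketch of the univariate clustered prior pi^C[eta, r], positive axis only
    (the negative-axis atoms follow by symmetry).  Returns (atoms, weights)."""
    lam_e = np.sqrt(2 * vx * np.log(1 / eta))
    v = 1.0 / (1.0 + 1.0 / r)                 # v = (1 + r^{-1})^{-1}
    lam_f = np.sqrt(v) * lam_e
    gamma = 1 + 4 * r                         # common ratio gamma_r
    if r >= 0.5:                              # r >= r_0: single-atom clusters
        K = 1
        c1 = np.array([lam_f])
    else:                                     # assumed reading of (1.6)
        K = 1 + int(np.ceil(np.log(1 + 1 / r) / (2 * np.log(gamma))))
        # atoms of C_1: geometric progression from lam_f, capped at lam_e
        c1 = np.minimum(gamma ** np.arange(K) * lam_f, lam_e)
        c1[-1] = lam_e                        # last atom of C_1 sits at lam_e
    atoms, weights = [0.0], [1 - eta]         # spike at the origin
    cw = np.array([eta ** i for i in range(1, n_clusters + 1)])
    cw = eta * cw / cw.sum()                  # off-origin mass eta, geometric decay
    for i in range(1, n_clusters + 1):
        atoms.extend((i - 1) * c1[-1] + c1)   # periodic extension mu_ij
        weights.extend([cw[i - 1] / K] * K)   # equi-probable atoms within C_i
    return np.array(atoms), np.array(weights)
```

For r = 0.38 each cluster has 2 atoms and for r = 0.14 it has 4, matching Figure 1; for r ≥ r_0 the atoms sit at jλ_f, j = 1, 2, . . ., as stated in the text.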
Figure 1 illustrates a key aspect of the cluster prior: for r < r_0 the gap μ_{i,K_r} − μ_{i,K_r−1} is allowed to vary widely with r, while μ_{i+1,1} − μ_{i,K_r} is fixed at λ_f for all i. Now, consider the multivariate clustered prior π^C_n[η_n, r](dθ) = Π_{i=1}^n π^C[η_n, r](dθ_i) on R^n. Then, the Bayes PRDE p̂^C[η_n, r] based on π^C_n[η_n, r] is asymptotically minimax optimal. Theorem 1.1. Fix any r ∈ (0, ∞). If η_n = s_n/n → 0, then lim_{n→∞} sup_{θ∈Θ_0[s_n]} ρ(θ, p̂^C[η_n, r]) / R*(Θ_0[s_n]) = 1.

Geometrically decaying priors: Background and risk analysis
For understanding the decision theoretic implications of the above result, we briefly revisit the risk properties of sparse product priors based on symmetric marginals. It follows from J94 that for point estimation of the normal mean over Θ_0[s_n] under ℓ_2 loss, the posterior mean of the grid prior π^EG_n is minimax optimal as η_n → 0. π^EG_n consists of i.i.d. copies of the univariate grid prior π^EG[η_n], defined for any fixed r and η ∈ (0, 1). As η_n → 0, the geometric contamination based discrete prior π^EG[η_n] attains the least Fisher information. In contrast to π^C, π^EG always has only one point in each cluster. However, they have identical probability decay rates (geometric contamination) as the clusters extend away from the origin. MJ17 showed that the PRDE based on π^EG_n is sub-optimal under KL loss. The Bayes PRDE based on a product grid prior whose univariate marginals π^PG (the subscripts PG and EG denote predictive and estimative grids) have reduced spacing between the atoms and a reduced probability decay rate was established to be minimax optimal in the predictive regime, albeit only for r ≥ r̃_0. However, π^PG is sub-optimal for r < r̃_0. Note that, unlike the univariate grid priors π^EG and π^PG, where support points have geometric probability decay, π^C has support points with identical probability within each cluster. The clusters in π^C, however, have the same decay rate as the support points in π^EG. The maximum gap between atoms in π^C equals the spacing in π^PG. Equiprobable atoms in the clusters were introduced in MJ15 to control predictive risk via the new notion of risk diversification. As such, consider a truncated cluster prior π^TC with only two clusters, where C_1 = C_1(η, r; γ̃_r, K̃_r) is as in (1.4), with γ̃_r = 1 + 2r and K̃_r given by K_r − 1 with the formula in (1.6) used with γ̃_r in place of γ_r. As the prior π^TC is bounded at λ_e, its corresponding Bayes PRDE p̂^TC has unbounded risk.
The thresholding based PRDE p̂^T was shown in MJ15 to be minimax optimal for any r ∈ (0, ∞). In p̂^T the threshold is λ_e(η_n); above the threshold the Bayes PRDE based on the uniform prior, which is Gaussian with variance v_x + v_y, is used, whereas below the threshold the Bayes PRDE from π^TC is used. Thresholding rules are not smooth functions of the data, and it was conjectured in Sec. 6 of MJ15 that periodic clustered priors of the form (1.3)-(1.4) can attain minimax optimality without the discontinuous thresholding operation. Here, we study the risk properties of such cluster priors and establish minimax optimality of the properly calibrated prior π^C. We found that the common ratio γ̃_r used in MJ15 is not optimal and can be increased to γ_r. However, as a consequence of removing thresholding, we needed one more atom than MJ15 in our proposed cluster prior π^C for small values of r. We show that the number of support points in π^C as used here is necessary by proving the following risk properties of Bayes PRDEs based on generic cluster priors of (1.3). First, we show that our prescribed choice of r_0 = 0.5 is sharp and cannot be lowered further: any cluster prior with cluster size one and a probability decay rate of at least η_n is sub-optimal for all r < r_0. Consider the univariate prior π^SI[η; ν, l] with singleton atoms in each cluster, with ν ≥ 0 and l ≥ L(η) := (2 log η^{-1})^{1/2}. As ν, L vary, let SI be the class of Bayes PRDEs p̂^SI[η_n; ν, L] based on n i.i.d. copies of π^SI[η_n; ν, L]. Note that this class includes the Bayes PRDEs based on π^EG_n. The following result shows that the class SI is sub-optimal. Lemma 1.2. If η_n = s_n/n → 0, then for any r < r_0, every estimator in SI is asymptotically minimax sub-optimal. Second, we show that our prescribed cluster size K_r cannot be reduced further for any r < r_0.
The following result shows that any prior of the form (1.3)-(1.4) with γ > γ_r and K_γ = 1 + log(1 + r^{-1})/(2 log γ) (which gets specified once γ is fixed, by the structure of the atoms in (1.4)) will produce a sub-optimal Bayes PRDE. Also, dropping atoms from π^C leads to sub-optimality: for any non-empty proper subset S ⊂ {1, . . . , K_r}, instead of (1.4) consider priors π^S[η, r] whose clusters retain only the atoms indexed by S. The following result shows that these priors are sub-optimal. Lemma 1.3. For any fixed r < r_0, as η_n = s_n/n → 0, the Bayes PRDEs based on priors with γ > γ_r, as well as those based on the subset priors π^S, are asymptotically minimax sub-optimal. The proofs of the above two lemmas are provided in section 3. We end this section by summarizing the key features of our results.
(a) We construct a prior π^C_n with geometric contamination akin to the prior π^EG_n in J94. Under high sparsity, the prior in J94 is asymptotically least favorable for point estimation and has the least Fisher information; its posterior mean is minimax optimal under quadratic loss. However, the corresponding Bayes PRDE p̂^EG is minimax sub-optimal under KL loss. Our proposed π^C_n has the same decay rate as π^EG_n but many more atoms. Its Bayes PRDE p̂^C is minimax optimal under KL loss. (b) The proposed prior π^C_n is based on the minimax analysis of the posterior predictive relative entropy regret criterion of Sweeting et al. (2006), which differs from the traditional reference prior inducing criterion (Bernardo, 1979) as it also considers predictive performance in relation to alternative nondegenerate prior distributions. (c) As conjectured in MJ15, using the proposed prior π^C_n leads to a minimax optimal procedure that does not involve thresholding. The optimality in the structure of the proposed prior is established: among all geometrically contaminated sparse discrete priors having cluster decay rate similar to π^EG_n in J94, it has the minimal cardinality, as established in Lemmas 1.2 and 1.3.
(d) Compared to the simpler grid priors π^PG_n analyzed in MJ17, the geometry of the manifold induced by the proposed prior π^C_n is significantly different. This necessitates separate analysis and proofs of the risk properties of the Bayes PRDEs from π^C_n. (e) The essential ingredient in the proof is the asymptotic analysis of the terms in the decomposition of the univariate predictive risk function in (2.2). The terms on the right hand side of (2.2) involve exponential Gaussian sums with different means and variances. Minimax analysis of the predictive regret involves asymptotic characterization of the differences between the logarithms of two exponential Gaussian sums. In contrast, minimax optimality in PE involves studying only one exponential Gaussian sum (see Ch. 8.5 in Johnstone (2013) and the description in Sec. 2). Figure 2 shows the numerical evaluation of the predictive risk ρ(θ, p̂^C[η, r]) of our proposed Bayes PRDE when η = 0.001 and r = 0.225. Each cluster has size three. The maximum risk of p̂^C[η, r] crosses the asymptotic theory limit but does not exceed it by much. It shows that the asymptotic analysis is fairly reflective of this non-asymptotic regime. The risk function has its peak between μ_{11} and μ_{12} and is approximately periodic barring a few clusters near the origin. As the figure shows, the risk function is much smaller than the asymptotic limit of λ_f^2/(2r) for all the points in C_1 barring its first point. As all points in C_1 are equally likely, this implies that the cluster prior is not least favorable. The following result makes this observation rigorous by explicitly evaluating the first order asymptotic Bayes risk of the cluster prior. It establishes that when there are two or more points in each cluster (i.e., r < r_0), the cluster prior is no longer least favorable.

Further result on the asymptotic Bayes risk
Theorem 1.4. If η_n = s_n/n → 0 as n → ∞, then the multivariate cluster prior π^C_n[η_n, r] is not asymptotically least favorable for any r < r_0; its Bayes risk satisfies (1.8). However, π^C_n[η_n, r] is asymptotically least favorable for all r ≥ r_0. Note that when r ≥ r_0, K_r = 1 and the RHS of (1.8) equals 1. For r < r_0, the proposed prior is no longer exactly asymptotically least favorable, but its Bayes risk has the same order as the minimax risk as η_n → 0.

Proof overview
We provide a brief overview of the proof of our main result. Detailed proofs of all the results are provided in section 3. The proof of Theorem 1.1 involves asymptotically upper bounding the maximal risk sup_{θ∈Θ_0[s_n]} ρ(θ, p̂^C[η_n, r]) by the minimax risk. The asymptotic equality then follows, as the maximal risk cannot be smaller than the minimax risk by definition. Also, note that due to the product structure of the prior, the multivariate maximal risk can be evaluated based on the risk of the univariate Bayes PRDE p̂^C[η_n, r] by using the relation in (2.1). Asymptotic evaluation of the two expressions on the right of (2.1) is done by using the risk decomposition Lemma 2.1 of MJ17. It reduces the calculation of the univariate predictive risk to finding expectations of functionals involving a standard normal random variable Z, as in (2.2). Here, q_i = 2^{-1} exp(−|i| λ_{e,n}^2/2) with λ_{e,n} = (2 log η_n^{-1})^{1/2} and λ_{f,n} = v^{1/2} λ_{e,n}; N_{ij} is the contribution to the risk of the j-th support point μ_{ij}(η_n, r) within the i-th cluster.
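The univariate predictive risk can also be checked numerically from first principles, without the decomposition (2.2): for a discrete prior the Bayes PRDE is a posterior-weighted Gaussian mixture, and the KL risk equals E_{X,Y} log{p(Y|θ, v_y)/p̂(Y|X)}. The following Monte Carlo sketch is our own helper, not the paper's code, and is intended only as a sanity check on the risk calculations.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def univariate_kl_risk(theta, atoms, log_weights, vx, vy, n_mc=100_000, seed=0):
    """Monte Carlo estimate of rho(theta, p_hat) for the Bayes PRDE under a
    univariate discrete prior with support `atoms` and log-probabilities
    `log_weights`:  rho(theta) = E_{X,Y}[ log p(Y|theta,vy) - log p_hat(Y|X) ],
    where p_hat(.|x) is the posterior-weighted Gaussian mixture."""
    rng = np.random.default_rng(seed)
    sx, sy = np.sqrt(vx), np.sqrt(vy)
    x = theta + sx * rng.standard_normal(n_mc)   # past draws
    y = theta + sy * rng.standard_normal(n_mc)   # independent future draws
    # posterior log-weights of the atoms given x
    lp = log_weights[None, :] + norm.logpdf(x[:, None], atoms[None, :], sx)
    lp -= logsumexp(lp, axis=1, keepdims=True)
    # log of the Bayes predictive density evaluated at y
    log_phat = logsumexp(lp + norm.logpdf(y[:, None], atoms[None, :], sy), axis=1)
    return np.mean(norm.logpdf(y, theta, sy) - log_phat)
```

With a single atom at θ the PRDE coincides with the truth and the risk is 0; with a single atom at 0, θ = 1 and v_x = v_y = 1, the risk is KL(N(1,1) || N(0,1)) = 1/2, which the estimator recovers up to Monte Carlo noise.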
The risk contributions N_{ij} are exponentials of quadratic forms in μ_{ij}. The risk at the origin is well controlled for this cluster prior based PRDE (see Lemma 3.1), and so, based on (2.1), it suffices to bound sup_θ ρ(θ, p̂^C[η_n, r]) by λ_{f,n}^2/(2r) to arrive at the desired result. This involves tracing two fundamentally different risk phenomena depending on the location of θ: (a) θ ∈ C_{±1}; (b) θ ∉ C_{±1}. In the former case, E log D_θ(Z) = O(λ_{f,n}), and thus the contribution of the third term on the right of (2.2) is not significant. Also, E log N_{θ,v}(Z) = O(λ_{f,n}) for |θ| ≤ λ_{f,n}, and so asymptotically ρ(θ, p̂^C[η_n, r]) initially increases quadratically in θ, with ρ(λ_{f,n}, p̂^C[η_n, r]) = λ_{f,n}^2/(2r) (1 + o(1)). However, if |θ| ∈ C_1 \ [0, λ_{f,n}], then E log N_{θ,v}(Z) is significantly large and controls the predictive risk below the desired asymptotic limit (see Lemma 3.4).
If θ ∈ C_i for any |i| > 1, then the risk phenomenon is quite different from that of the origin adjoining clusters. Now, E log D_θ(Z) is significantly positive. However, an important ingredient of the proof is that its magnitude can be asymptotically well controlled by considering only atoms in C_i or the nearest atom in C_{i−1}. Lemma 3.3 establishes the requisite bound on E log D_θ(Z) for θ ∈ C_i with |i| > 1. Plugging these two bounds into (2.2), we get the desired upper bound in Lemma 3.4.

Background and preliminaries
For the technical proofs, without loss of generality assume v_x = 1, so that r = v_y/v_x = v_y. Recall v = (1 + r^{-1})^{-1} and η_n = s_n/n. As demonstrated in (2.1), the multivariate maximal risk of the Bayes predictive density estimate (PRDE) from the cluster prior can be evaluated by studying the predictive risk of the univariate Bayes PRDE p̂^C[η_n, r] based on the univariate cluster prior π^C[η_n, r]. Henceforth, unless we explicitly mention otherwise, we concentrate on univariate Bayes predictors and their risk functions. Recall that in the multivariate set-up we consider asymptotically sparse regimes, where η_n → 0 as n → ∞. Hereon, for convenience of notation we write η instead of η_n, keeping the dependence on n implicit. Recall λ_e := (2 log η^{-1})^{1/2} and λ_f := (2 log η^{-v})^{1/2}.

U. Gangopadhyay and G. Mukherjee
The point-masses in cluster C_j are denoted by {μ_{jk} : k = 1, . . . , K}, where the common cluster size K is given by (1.6) with r_0 = 1/2. Further, recall that μ_{−j,k} = −μ_{jk} for j > 0. So for r ≥ r_0, that is, when K = 1, the clustered discrete prior only has point-masses {jλ_f : j ∈ Z}. By Lemma 2.1 of MJ17, the predictive KL risk under the univariate cluster prior is given by the decomposition in (3.2), where Z is a standard normal random variable.

Proof of Theorem 1.1
We first present the proof for r < r_0 because it is more intricate compared to the case when r ≥ r_0. In the latter case, by definition K = 1, and the proof is comparatively easier. It uses parts of the proof techniques used for the r < r_0 case but also involves some fundamentally different attributes. Hence, it is presented afterwards, where we also explain the choice of r_0 = 1/2.

Risk at origin
The risk at the origin for our cluster prior based Bayes PRDE is asymptotically much smaller than the risk of the thresholding based risk diversified PRDE of MJ15. As such, comparing equation (51) of that paper with the following result, it follows that any thresholding based minimax optimal PRDE will have much higher risk at the origin than the cluster prior based Bayes PRDE. The Bayes PRDEs based on grid and bi-grid priors, such as the π^EG prior of J94 and the π^PG prior of MJ17, have risk at the origin similar to the cluster prior based Bayes PRDE.

Risk bounds at the non-origin parametric points
Next, we concentrate on the risk at the non-origin points. Our goal is to establish the bound in (3.3). This, along with (3.1) and the above result about the risk bound at the origin, will imply the required bound on the multivariate maximum risk, which would establish the result in Theorem 1.1. By symmetry, it is enough to prove the bound in (3.3) for positive θ. Hence, hereon in this subsection we only consider θ > 0. In this case the contributions of the D_{θi}'s for i < 0 are expected to be negligible. This is formalized in the following result.
Lemma 3.2. For any r ∈ (0, ∞) and any fixed θ > 0, the total contribution of the D_{θi}'s with i < 0 is asymptotically negligible. Proof. Using the inequality log(1 + x + y) ≤ x + log(1 + y) for nonnegative x, y, together with the definitions of j_i and D_{θi} and the fact that μ_i < 0 while θ > 0, we obtain a bound on each E D_{θi}. As i runs from 0 to −∞, j_i goes from 1 to ∞ with each term repeating K times. Hence, summing over i < 0 we get Σ_{i<0} E D_{θi} = o(1) as λ_f → ∞. This completes the proof.
We first provide an upper bound on E log D_θ(Z), which will be substituted into equation (3.2) to get the required upper bound. The following result is crucial, as it shows that the infinite sum in the expression of D_θ(Z) can be asymptotically reduced to the contribution from a single dominant term. This reduction greatly helps in tracking the risk of the cluster prior and is pivotal in the proof of Theorem 1.1. Lemma 3.3. For r < r_0 and any fixed θ > 0, E log D_θ(Z) equals the contribution of the dominant term D_{θ,l_d(θ)}(Z) up to an O(λ_f) remainder. Proof. By virtue of the previous lemma, we can consider only contributions from the D_{θi}(Z)'s with i > 0. We suppress the dependence of the D_{θi}(Z)'s on θ and Z and simply write D_i. Similarly, D_θ(Z) is written only as D_θ. Note that l_d ≥ 0 because θ > 0. First we get an upper bound on E log D_θ by separating out the contribution from μ_{l_d}, as in (3.4). On the right hand side of (3.4), the second term compares the contribution of μ_{l_d} with that of the succeeding terms. We split it further by separating out the contribution of points in the next cluster from the rest, as in (3.5). Using the inequality log(1 + x) ≤ log 2 + (log x)_+ and summing over i in the range l_d + 1 ≤ i ≤ l_d + K, we get that the first term on the right-hand side of equation (3.5) is O(λ_f). Now let us consider the second term on the right-hand side of (3.5). Since i runs from l_d + K to ∞ and θ ≤ μ_{l_d + K}, taking a sum over i gives a geometrically decaying sum, so that the second term on the right-hand side of equation (3.5) is also O(λ_f). Hence, the second term on the right-hand side of equation (3.4) is O(λ_f). Now we show that the third term on the right-hand side of equation (3.4) is also O(λ_f). We split the sum into three parts. To bound the first part, note that θ ≥ μ_{l_d} ≥ (μ_{l_d} + μ_i)/2 and, because of the structure of the atoms in the clusters, θ − (μ_{l_d} + μ_i)/2 = O(λ_f); here i and l_d belong to the same cluster.
Using the symmetry of the distribution of Z and summing over l_d − K + 1 ≤ i ≤ l_d − 1, we get that the first term on the right-hand side of equation (3.8) is O(1). Now let us consider the second term on the right-hand side of (3.8). A similar calculation shows that it is O(λ_f). Here we see how η, which is the probability decay rate from cluster to cluster, interacts with the cluster length λ_e.
Finally, for each 0 ≤ i ≤ l_d − K we have the bound in (3.9). Using (3.9) and summing over i in the range 0 ≤ i ≤ l_d − K, we get that the third term on the right-hand side of (3.8) is O(1). Thus we have proved that the second and third terms on the right-hand side of (3.4) are O(λ_f) as λ_f → ∞. This completes the proof.
The previous lemma essentially shows that to get an upper bound on E log D_θ(Z) it is enough to consider only D_{θ,l_d}(Z), because asymptotically the contributions of the other terms are negligible. To prove (3.3) using (3.2), we need a lower bound on E log N_{θ,v}(Z), which we get from the straightforward inequality E log N_{θ,v}(Z) ≥ E log N_{θ,l_n}(Z). Of course, the novelty is in the choice of l_n(θ), and in the next result we see that these bounds are enough to prove (3.3).

Lemma 3.4. For r < r_0 and for any θ > 0, the risk bound in (3.3) holds.
Proof. For convenience we write l_d(θ) and l_n(θ) as l_d and l_n respectively. Note that from the definition of l_d it follows that μ_{l_d} ≤ θ ≤ μ_{l_d + K}.

Proof of Theorem 1.1 for r ≥ r 0
The proof follows essentially the same ideas as the proof in the case r < r_0, but there are some technical differences. The analysis of the risk at the origin is unchanged because Lemma 3.1 holds for all r. So now we analyze the risk at non-origin points and prove (3.6).
As before, we use the decomposition of risk in (3.2). Our strategy is the same, that is, showing that the contribution of E log D_{θ,l_d(θ)}(Z) for one particular index l_d(θ) is dominant in E log D_θ(Z), and using a naive lower bound on E log N_{θ,v}(Z) via E log N_{θ,l_n(θ)}(Z) for one particular index l_n(θ).
The choices of the indexes in this case are slightly different. Recall that each cluster C_j of π^C[η, r] now consists of only one point; the atoms are at μ_p = pλ_f for all p ∈ Z. By symmetry we only consider θ > 0. By Lemma 3.2, which did not depend on the value of r, we can ignore all D_{θp} with p < 0. Suppose θ ∈ [μ_l, μ_{l+1}) for some l ≥ 0. The contribution of D_{θi} for all i > l is negligible compared to D_{θl}, and the proof is exactly the same as at the beginning of Lemma 3.3, cf. equations (3.4), (3.5), (3.6) and (3.7). The crucial difference from the sub-critical case arises now. We will see that if l ≥ 1, then, unlike the sub-critical case, D_{θl} is not always the dominant term; instead, D_{θ,l−1} dominates D_{θl} for some θ if r > r_0. Comparing the exponents and using r ≥ 1/2, we see that D_{θ,l−2} and the preceding D_{θi}'s are never dominant. Now, if D_{θl} is dominant, that is, θ ∈ [μ_l + λ_f/(2r), μ_{l+1}), then using the naive lower bound on E log N_{θ,v}(Z) we obtain the desired bound. We skip the details of the proof because it is exactly analogous to the case l_d = l_n in Lemma 3.4. On the other hand, if D_{θ,l−1} is dominant, that is, θ ∈ [μ_l, μ_l + λ_f/(2r)), then we use E log N_{θl}(Z) as a lower bound for E log N_{θ,v}(Z). We end up with a quadratic in θ similar to equation (3.12), which is nonnegative on [μ_l, μ_l + 2rλ_f]. Since this interval covers the interval [μ_l, μ_l + λ_f/(2r)] for r ≥ r_0, we get the desired bound.

Proof of Lemma 1.2
Similar to Lemma 3.4, here also, since l > λ_e, we can show that in the risk decomposition (2.2), E log N_{θ,v}(Z) can be replaced with E log N_{θ,l_n(θ)}(Z) for some l_n(θ), and similarly E log D_θ(Z) can be replaced with E log D_{θ,l_d(θ)}(Z) for some l_d(θ) (D_{θ,p} and N_{θ,p} are defined in Subsection 3.2.1), in the sense that the difference is asymptotically negligible, that is, O(λ_f). Consider θ ∈ [ν + (p − 1)l, ν + pl] for some p > 1. By calculations similar to Lemma 3.4, we can show that for θ ∈ [ν + (p − 1)l, ν + (p − 1)l + λ_f], if we choose l_n = l_d = p, then the risk stays below the threshold λ_f^2/(2r) asymptotically, and similarly l_n = l_d = p + 1 works for θ ∈ [ν + pl − λ_f, ν + pl]. But for r ≤ 1/3 we have ν + pl − λ_f > ν + (p − 1)l + λ_f, so that for θ = ν + (p − 1)l + λ_f + ε with small ε > 0, we must choose l_n = p + 1 and l_d = p. This makes the risk go above the threshold λ_f^2/(2r), leading to sub-optimality.

Proof of Lemma 1.3
From the calculations of Lemma 3.4 it is clear that for any r < r_0 the choice γ = γ_r = 1 + 4r cannot be improved upon for asymptotic minimax optimality. So if γ > γ_r, then we have asymptotic minimax sub-optimality. Since the common ratio γ determines the cluster size K_γ, the cluster size cannot be reduced if we want to maintain minimax optimality. Also, if we keep γ = γ_r, dropping points from the geometrically spaced grid causes sub-optimality. To see this, suppose the first point we drop is the m-th point in the cluster. If m = 1, i.e., we drop λ_f, then, as we have already seen, this creates sub-optimality. Let m > 1. Then the quantities (3.13)-(3.14) defined in the proof of Lemma 3.4 must satisfy the constraint β(m + 1, 0) > α(m − 1, 0). Writing down the constraint in terms of r and letting r → 0, we get a contradiction, which shows that we cannot drop support points from our prescribed prior.

Proof of Theorem 1.4
The Bayes risk of the multivariate cluster prior satisfies B(π^C_n) = nB(π^C); the univariate Bayes risk B(π^C) is evaluated below, with K as defined in (1.6). From the risk calculations in Lemmas 3.2, 3.3 and 3.4 it is clear that the first order asymptotic risk as η_n → 0 can be reduced to just concentrating on the origin adjoining clusters C_{±1}, and thereafter, by symmetry, on C_1 alone. Now, by (3.2) and Lemma 3.4, for each 1 ≤ j ≤ K we can evaluate ρ(μ_{1j}, p̂^C[η_n, r]) up to a (1 + o(1)) factor, following an asymptotic analysis exactly similar to that of Lemma 3.4. By construction, μ_{1j} ≥ λ_{f,n}, with equality only when j = 1, and so each of the terms barring the first one has some positive contribution. Thus, ρ(μ_{11}, p̂^C[η_n, r]) = λ_{f,n}^2/(2r) (1 + o(1)). For all j > 1, recalling μ_{1j}/λ_{f,n} = (1 + 4r)^{j−1} ∧ v^{−1/2} and v = (1 + r^{-1})^{-1}, the first term on the right side of the resulting expression is 0 only when j = K. Thus, the maximal risk is attained only at μ_{11} = λ_{f,n}. Thereafter, the risk decays and finally, at j = K, the risk is negligible compared to the asymptotic minimax risk. Figure 3 shows the numerical evaluation of the risk of the cluster prior at the different support points of the first cluster. The figure shows the risk profile when η_n = 10^{−15}, which captures the asymptotic analysis well, and the aforementioned decay in the risk function is evident from the figure. Noting that the multivariate minimax risk is nη_n λ_{f,n}^2/(2r) (1 + o(1)) as η_n → 0, the result follows from the above display. When r ≥ r_0, K = 1 and so the above result directly implies B(π^C_n)/R*(Θ_0[s_n]) → 1 as n → ∞. The condition s_n → ∞ ensures that the prior concentrates on the parametric space Θ_0[s_n] (see Theorem 1B of MJ15 for details) and thus is least favorable in this case.

Simulations
We examine the performance of the aforementioned PRDEs across different sparsity regimes. The product structure of our estimation framework allows us to concentrate on the maximal risk of the corresponding univariate PRDEs.
(2.1) shows that the multivariate maximum risk of p̂_C over Θ_0[s_n] is a function of the sparsity level η_n and the univariate risk of p̂_C. In Table 2, we report the maximum risk of our proposed clustered prior based Bayes PRDE p̂_C (in the last column) as the degree of sparsity η and the predictive difficulty r vary. Using (2.2), we evaluate the univariate risk of p̂_C at any fixed θ with high precision using Monte Carlo integration. Thereafter, the maximum risk is found by conducting a univariate grid search as θ varies over R.
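The Monte Carlo plus grid-search recipe above can be sketched as follows. This is an illustrative implementation under our own conventions: the toy support and weights below are placeholders, not the paper's cluster prior, and the function name is ours.

```python
import numpy as np

def kl_risk(theta, support, weights, vx, vy, n_x=400, n_y=400, seed=0):
    """Monte Carlo estimate of the univariate KL risk rho(theta, p_hat)
    for the Bayes PRDE under a discrete prior on the location.
    Sketch only: support/weights are toy inputs."""
    rng = np.random.default_rng(seed)

    def log_norm(y, mu, v):
        return -0.5 * np.log(2 * np.pi * v) - (y - mu) ** 2 / (2 * v)

    xs = theta + np.sqrt(vx) * rng.standard_normal(n_x)          # past draws
    ys = theta + np.sqrt(vy) * rng.standard_normal((n_x, n_y))   # future draws
    # posterior weights w_k(x) proportional to pi_k * N(x; mu_k, vx)
    logw = np.log(weights) + log_norm(xs[:, None], support[None, :], vx)
    logw -= logw.max(axis=1, keepdims=True)
    w = np.exp(logw)
    w /= w.sum(axis=1, keepdims=True)
    # Bayes PRDE is the mixture sum_k w_k(x) N(y; mu_k, vy)
    comp = np.exp(log_norm(ys[:, :, None], support[None, None, :], vy))
    log_phat = np.log(np.einsum('ik,ijk->ij', w, comp))
    return float(np.mean(log_norm(ys, theta, vy) - log_phat))

support = np.array([0.0, 2.0, 4.0])   # toy discrete prior, NOT the cluster prior
weights = np.array([0.90, 0.05, 0.05])
grid = np.linspace(0.0, 6.0, 13)      # univariate grid search for the maximum risk
max_risk = max(kl_risk(t, support, weights, vx=1.0, vy=0.5) for t in grid)
```

In practice one would use a much finer θ-grid and larger Monte Carlo sizes, and replace the toy prior by the cluster prior's support and weights.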
The performances of the following related PRDEs: (a) the hard thresholding based plug-in estimator, (b) the thresholding based risk diversified PRDE p̂_T of MJ15, (c) the Bayes PRDE p̂_EG based on the π_EG prior of J94, and (d) the Bayes PRDE p̂_PG based on the π_PG prior of MJ17, are reported in columns 4 to 7 of Table 2, respectively. The risks of p̂_EG and p̂_PG are evaluated via their maximum univariate risks and the analogue of (2.1), which follows from Lemma 2.1 of MJ17. The maximum risk of p̂_T is numerically evaluated by combining (24), (34), (43) and (47) in MJ15. Under moderate sparsity, the maximal risks of the PRDEs exceed the minimax value specified by the asymptotic theory in (1.2), and the exceedance is higher for lower values of r. The numerical results are in accordance with the asymptotic theory once η_n ≤ 10^{−3}. As expected, the plug-in PRDE is highly suboptimal for lower values of r across all regimes. Once the asymptotic behaviour sets in, the maximum risk of the proposed PRDE p̂_C is near optimal among the PRDEs considered across all r regimes; under moderate sparsity its maximum risk is slightly worse, but for η_n ≤ 10^{−3} it has lower maximum risk than p̂_EG and p̂_PG, and risk similar to p̂_T. However, because p̂_C is based on a countably infinite univariate discrete prior, the asymptotic approximation to its maximum risk described by Theorem 1.1 comes into effect at relatively smaller η_n values than for p̂_T.

Discussion and future work
The results developed here assume that the variances v x and v y are known.
If v_y = r v_x where r is known but v_x is unknown, a simple approach would be to substitute an estimate v̂_x of v_x into the PRDEs discussed here. For v̂_x we can use the median absolute deviation from zero, which is used for point estimation in the EbayesThresh package of Johnstone and Silverman (2005). For other good candidates for v̂_x, see Xing et al. (2020) and the references therein. However, as shown in Kato (2009b), such a plug-in approach will not be optimal. Recently, Maruyama et al. (2020) developed a decision theoretic framework under repeated sampling for studying point estimation efficiency in Gaussian models with unknown scale. As future work, it will be interesting to study PRDE under Kullback-Leibler loss in such a framework. Also, if the sparsity level is unknown, it can be estimated from the data using the empirical Bayes maximum likelihood approach of Johnstone and Silverman (2005), and the estimated sparsity level can be plugged into the form of the Bayes PRDEs discussed in this paper. The PRDEs discussed here are based on spike-and-slab priors with the slab being an infinite discrete prior. PRDEs based on continuous slabs (Rockova and George, 2018) are preferred for practical implementation. A manuscript studying the adaptivity of such spike-and-slab PRDEs to unknown sparsity is forthcoming.
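The median-absolute-deviation-from-zero estimate of the noise scale mentioned above can be sketched as follows; the function name is ours, and 0.6745 ≈ Φ^{-1}(0.75) is the standard Gaussian consistency factor.

```python
import numpy as np

def mad_from_zero(x):
    """Median absolute deviation from zero, rescaled by
    Phi^{-1}(0.75) ~ 0.6745 so it is consistent for the noise standard
    deviation when most coordinates are pure noise (the scale estimate
    used for point estimation in EbayesThresh)."""
    return np.median(np.abs(x)) / 0.6745

rng = np.random.default_rng(1)
# mostly-null signal: a handful of large means, the rest N(0, 2^2) noise
x = 2.0 * rng.standard_normal(50_000)
x[:50] += 10.0
sigma_hat = mad_from_zero(x)   # close to the true sigma = 2
vx_hat = sigma_hat ** 2        # plug-in estimate of v_x
```

Because the median is insensitive to a small fraction of large coordinates, the estimate remains accurate under the exact-sparsity regime considered here, where the proportion of non-zero means is vanishingly small.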