Convergence Rates for Bayesian Estimation and Testing in Monotone Regression

Shape restrictions such as monotonicity on functions often arise naturally in statistical modeling. We consider a Bayesian approach to the problem of estimation of a monotone regression function and testing for monotonicity. We construct a prior distribution using piecewise constant functions. For estimation, a prior imposing monotonicity of the heights of these steps is sensible, but the resulting posterior is harder to analyze theoretically. We consider a ``projection-posterior'' approach, where a conjugate normal prior is used, but the monotonicity constraint is imposed on posterior samples by a projection map on the space of monotone functions. We show that the resulting posterior contracts at the optimal rate $n^{-1/3}$ under the $L_1$-metric and at a nearly optimal rate under the empirical $L_p$-metrics for $0<p\le 2$. The projection-posterior approach is also computationally more convenient. We also construct a Bayesian test for the hypothesis of monotonicity using the posterior probability of a shrinking neighborhood of the set of monotone functions. We show that the resulting test has a universal consistency property and obtain the separation rate which ensures that the resulting power function approaches one.


Introduction
We consider the nonparametric regression model Y = f(X) + ε for a response variable Y with respect to a one-dimensional predictor variable X taking values in a bounded interval, and ε a mean-zero random error with finite variance σ^2. Instead of the more commonly imposed smoothness condition, here we assume that f is a monotone increasing function on a bounded interval, which can be taken to be [0, 1] without loss of generality. We observe n independent replications (X_1, Y_1), ..., (X_n, Y_n), where the design points X_1, ..., X_n are either deterministic or randomly sampled from a fixed distribution G. The error ε is assumed to be distributed independently of the predictor X.
The problem has been widely studied in the frequentist literature, and is commonly known as isotonic regression. Barlow and Brunk [5] obtained the greatest convex minorant (GCM) of a cumulative sum diagram as the least-squares estimator under the monotonicity constraint. The Pool-Adjacent-Violators Algorithm (PAVA) describes a method of successive approximation to the GCM and is the most commonly used algorithm for isotonic regression (see Ayer et al. [2], Barlow et al. [4], or De Leeuw et al. [19]). Brunk [12] showed that the estimated value of the regression function at a point converges at the rate n^{-1/3}, and obtained its asymptotic distribution. Durot [20] established the n^{-1/3} rate of convergence of the isotonic regression estimator under the L_1-metric. Testing for monotonicity of a regression function has been studied in the frequentist literature by Akakpo et al. [1], Hall and Heckman [25], Baraud et al. [3], Ghosal et al. [21] and Bowman et al. [11].
A Bayesian approach to the monotone regression problem involves putting a prior on functions under the monotonicity constraint. Since step functions can approximate monotone functions, a natural approach is to put priors on step heights under the monotonicity constraint, and possibly also on the locations and the number of intervals. For smoother sample paths, higher-order splines can be used instead of the indicator functions of intervals. To put a prior on a monotone regression function, Neelon and Dunson [28] used a basis consisting of piecewise linear functions, put a prior on the coefficients, and developed Markov chain Monte Carlo (MCMC) methods for posterior computation. Shively et al. [34] used a spline basis and a mixture of constrained normal distributions as a prior for the coefficients in the basis expansion. The resulting procedure can be approximated by a constrained regression spline prior, which leads to an MCMC algorithm on the space of coefficients. They also discussed posterior consistency. Bornkamp and Ickstadt [10] modeled a monotone function as a mixture of parametric probability distribution functions, used a general random probability measure as a prior for the mixing distribution, and developed MCMC methods for drawing posterior samples. Chipman et al. [16] put a prior on a multivariate monotone regression function through a Bayesian additive regression tree structure, restricting the step heights so that the resulting function is multivariate monotone, and devised an MCMC posterior sampling technique.
A Bayesian approach to testing monotonicity was proposed by Salomond [30,32] based on the posterior probability of the event that the regression function is nearly monotone in the sense of a distance measure. The relaxation to near monotonicity instead of exact monotonicity is needed to avoid falsely rejecting the null hypothesis of monotonicity: a monotone true function may be approximated by non-monotone functions, leading to a high probability of Type I error (see Section 4). Scott et al. [33] used a more classical approach based on Bayes factors to test the hypothesis of monotonicity, using an integrated Brownian motion or constrained regression spline prior on the regression function. They converted the hypothesis of monotonicity to a statement about the minimum value of the derivative function, and gave formulas and results on the consistency of the Bayes factor. Coverage of Bayesian credible regions for monotone regression has recently been studied by Chakraborty and Ghosal [14]. They computed the limit and observed an interesting phenomenon: a posterior quantile interval for the value of the function has asymptotic coverage higher than the corresponding credibility level. This is the opposite of the phenomenon of lower asymptotic coverage of Bayesian credible regions observed by Cox [17] for smooth function estimation. Moreover, they showed that, starting with a suitable lower credibility level, which can be calculated and depends only on the target coverage, the intended asymptotic coverage can be obtained. Bayesian nonparametric methods have also been developed for other shape-constrained problems, such as monotone density estimation and the current status censoring model. Salomond [31] established the near-minimax rate n^{-1/3} for a decreasing density using a mixture of uniform densities as a prior.
Chakraborty and Ghosal [15] studied the posterior contraction rate and the limiting coverage of a Bayesian credible interval for a monotone decreasing density, and constructed a Bayesian test for the hypothesis of monotonicity. For a monotone regression function, the asymptotic coverage of a credible interval for a regression quantile was obtained by Chakraborty and Ghosal [13]. They also showed that the posterior contraction rate may be improved by sampling in two stages, where the second-stage samples are collected from the credible interval obtained in the first stage.
A difficulty with the usual Bayesian approach to isotonic regression is that the monotonicity constraint on the coefficients makes both posterior computation and the study of posterior concentration with increasing sample size much more challenging. This is because a neighborhood of the true regression function contains non-monotone functions, which must be discounted both in posterior sampling and in estimating the prior concentration near the true regression function. A very useful approach that can still utilize the conjugacy structure is provided by a "projection-posterior" distribution. In this approach, the monotonicity constraint on the step heights is initially ignored, so that the coefficients may be given independent normal priors. Hence the posterior distribution is also normal, allowing easy sampling and large-sample analysis of posterior concentration. A posterior distribution is then directly induced by a projection map that projects a step function to the nearest monotone function in terms of the L_1-distance or some other metric. A similar idea based on a Gaussian process prior was used by Lin and Dunson [27] for monotone regression. Bhaumik and Ghosal [6,7,8] used this idea of embedding in an unrestricted space and then projecting a conjugate posterior in regression models driven by ordinary differential equations. Bhaumik et al. [9] extended their approach to generalized regression described by partial differential equations. In this paper, we pursue the projection-posterior approach and show that the resulting projection-posterior distribution concentrates at the optimal rate n^{-1/3} in terms of the L_1-distance.
We also obtain nearly optimal posterior concentration under an empirical L_p-distance for 0 < p ≤ 2 using a different prior. In addition to being extremely convenient for theoretical study, the approach is very convenient for posterior computation as well, allowing the use of a conjugate prior and direct posterior sampling without the need for more computationally expensive MCMC methods. It may be mentioned that Chakraborty and Ghosal [14,15,13] also followed the projection-posterior approach, and indeed the prior used in the present paper was also used in Chakraborty and Ghosal [14,13]. In this paper, we also construct a Bayesian test for the hypothesis of monotonicity based on the posterior distribution of the difference between the unrestricted posterior sample and its projection. We show that the resulting test is universally consistent, in that the Type I error probability goes to zero and the power goes to one at any fixed alternative, regardless of smoothness. For a sequence of smooth alternatives, we also compute the separation from the null region needed to obtain high power. Our proposed test is similar in spirit to Salomond's [32] test in that both are based on the posterior probability of a slightly extended null region, but our use of the L_1-metric on the function, or the Hellinger metric on the density of Y, leads to universal consistency.
The paper is organized as follows. In the next section, we formally introduce the modeling assumptions and the prior and describe the projection-posterior approach. In Section 3, we present results on posterior contraction rates of the projection-posterior distribution. In Section 4, we derive asymptotic properties of the proposed Bayesian tests. A simulation study assessing the accuracy of the proposed Bayesian estimator and tests against other Bayesian and non-Bayesian competing methods is presented in Section 5. Proofs of the main results are given in Section 6 and those of the auxiliary results in Section 7.

Model, prior and projection-posterior
The following notation will be used throughout the paper. Let I_m stand for the m × m identity matrix. By Z ~ N_J(μ, Σ), we mean that Z has a J-dimensional normal distribution with mean μ and covariance matrix Σ. For a vector x, the Euclidean norm will be denoted by ||x||. The transpose of a vector x is denoted by x^T and that of a matrix A by A^T. If f is a function and H is a measure, the L_p-norm of f is given by ||f||_{p,H} = (∫ |f|^p dH)^{1/p} for 1 ≤ p < ∞, and the L_p-distance between two functions f and g is given by ||f − g||_{p,H}. The indicator function will be denoted by 1 and # will stand for the cardinality of a finite set.
For two sequences of real numbers a_n and b_n, a_n ≲ b_n means that a_n/b_n is bounded, a_n ≍ b_n means that both a_n ≲ b_n and b_n ≲ a_n, and a_n ≪ b_n means that a_n/b_n → 0. For a random variable X and a sequence of random variables X_n, X_n →_P X means that X_n converges to X in P-probability.
Let F and F_+ respectively denote the spaces of real-valued measurable functions and monotone increasing functions on [0, 1], and for K > 0, let F_+(K) = {f ∈ F_+ : |f| ≤ K}. For f : [0, 1] → R and d a distance on F, let the projection of f on F_+ be the function f* that minimizes d(f, h) over h ∈ F_+, provided such a minimizer exists. The projection need not be unique, in which case any choice may be taken. If a minimizer does not exist, a near-minimizer may be used, as is commonly done in M-estimation; see van der Vaart [35], page 45. However, in this paper, we need the projection map only on piecewise constant functions with respect to well-behaved metrics like L_1 or L_2, for which the projection exists and is unique. The topological closure of F_+ is denoted by F̄_+. The ε-covering number of a set A with respect to a metric d, denoted by N(ε, A, d), is the minimum number of balls of radius ε needed to cover A.
Let G_n(x) = n^{-1} Σ_{i=1}^n 1{X_i ≤ x} denote the empirical distribution of the predictors X_1, ..., X_n.
A prior distribution Π on the regression function f will be given by a random step function f = Σ_{j=1}^J θ_j 1_{I_j}, where I_1, ..., I_J are consecutive intervals partitioning [0, 1], determined by knots ξ = (ξ_1, ..., ξ_{J−1}). For the prior, J or ξ or both may be given, or these may be distributed according to a prior. Depending on their choices, the following three types of prior distributions will be considered in this paper.
1. Type 1 prior: The number of steps J is deterministic and the knots are equidistant, that is, I_j = ((j − 1)/J, j/J], j = 1, ..., J.
2. Type 2 prior: The number of steps J is deterministic and the knots ξ_1, ..., ξ_{J−1} are sampled randomly without replacement from the observed values of the predictor variable (only applicable for a deterministic X with distinct values).
3. Type 3 prior: The knots are equidistant and the number of steps J is given a prior satisfying
exp(−b_1 j (log j)^{t_1}) ≲ Π(J = j) ≲ exp(−b_2 j (log j)^{t_2})   (2.1)
for some b_1, b_2 > 0 and t_1, t_2 ≥ 1.
In all three cases, given σ and J, the coefficients θ_1, ..., θ_J are given independent normal priors θ_j | σ ~ N(ζ_j, σ^2 λ_j^2), with B_1 < λ_j < B_2 for some B_1, B_2 > 0 and bounded |ζ_1|, ..., |ζ_J|. We write Λ = diag(λ_1^2, ..., λ_J^2), the diagonal matrix with entries λ_1^2, ..., λ_J^2. The choices of these parameters are not important for asymptotic properties as long as the stated boundedness conditions are satisfied; in finite samples, though, the choices may have some impact. If a prior guess about the monotone regression function is available, ζ_j may be taken as its average over I_j with respect to G, while λ_j indicates the lack of faith in the prior belief in ζ_j, j = 1, ..., J. More commonly, in the absence of any reliable prior information, ζ_j, j = 1, ..., J, may be set to 0 and λ_j to a relatively large value, for a low-information diffuse prior. The Type 1 prior will be used to obtain optimal posterior contraction in the L_1-distance, the Type 2 prior for posterior contraction in terms of an empirical L_2-distance, and the Type 3 prior for testing monotonicity against smooth alternatives of unspecified smoothness.
The variance parameter σ^2 is either estimated by maximizing the marginal likelihood, or is given an inverse-gamma prior σ^2 ~ IG(β_1, β_2) with β_1 > 2 and β_2 > 0. We write B = ((1{X_i ∈ I_j})), an n × J matrix, and θ = (θ_1, ..., θ_J)^T. Thus the model can be written as Y = Bθ + ε, and the prior (given J and σ) as θ | (J, σ) ~ N_J(ζ, σ^2 Λ) with ζ = (ζ_1, ..., ζ_J)^T. Then we have (see, e.g., Hoff [26], page 155) that
θ | (D_n, J, σ) ~ N_J((B^T B + Λ^{-1})^{-1}(B^T Y + Λ^{-1} ζ), σ^2 (B^T B + Λ^{-1})^{-1}),
where D_n = {(X_i, Y_i) : i = 1, ..., n} denotes the data. Since in our context B^T B = diag(N_1, ..., N_J), where N_j = #{i : X_i ∈ I_j}, it follows that the θ_j are a posteriori independent with
θ_j | (D_n, J, σ) ~ N((N_j Ȳ_j + ζ_j λ_j^{-2})/(N_j + λ_j^{-2}), σ^2/(N_j + λ_j^{-2})), with Ȳ_j = N_j^{-1} Σ_{i: X_i ∈ I_j} Y_i.   (2.2)
The marginal distribution of the observations Y (given X and J, σ^2, ξ) is
Y ~ N_n(Bζ, σ^2 (I_n + B Λ B^T)).   (2.3)
As the coefficients θ have not been restricted to the cone of monotone increasing values Q := {(q_1, ..., q_J) : q_1 ≤ q_2 ≤ ··· ≤ q_J}, the resulting regression function f = Σ_{j=1}^J θ_j 1_{I_j}, sampled randomly from the posterior distribution Π(·|D_n), may not be monotone. In order to comply with the monotonicity restriction, a sampled value of the function f from its posterior (obtained through the posterior sampling of θ) is projected on the set of monotone functions F_+ on [0, 1] to obtain f* ∈ F_+ nearest to f with respect to some distance d. The induced distribution of f* will be called the projection-posterior distribution. It will be denoted by Π*_n and will be the basis for inference on the regression function f. By its definition, the projection-posterior distribution is restricted to F_+.
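As a concrete illustration, the bin-wise conjugate updates in (2.2) can be sketched in a few lines (a minimal sketch, not the paper's code; the function name `step_posterior` and the default scalar choices of ζ_j and λ_j are ours):

```python
import numpy as np

def step_posterior(X, Y, J, sigma, zeta=0.0, lam=1.0, n_draws=1000, seed=None):
    """Draw unrestricted posterior samples of the step heights theta_1..theta_J
    for equidistant knots: theta_j | D_n ~ N(m_j, v_j), independently over j."""
    rng = np.random.default_rng(seed)
    bins = np.minimum((np.asarray(X) * J).astype(int), J - 1)  # interval I_j of each X_i
    N = np.bincount(bins, minlength=J)                 # bin counts N_j
    S = np.bincount(bins, weights=Y, minlength=J)      # bin sums N_j * Ybar_j
    prec = N + 1.0 / lam ** 2                          # posterior precision / sigma^2
    m = (S + zeta / lam ** 2) / prec                   # posterior mean of theta_j
    v = sigma ** 2 / prec                              # posterior variance of theta_j
    return m + np.sqrt(v) * rng.standard_normal((n_draws, J))
```

Each row of the output is a vector of step heights; since the monotonicity constraint has not yet been imposed, individual rows need not be monotone.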
We also find that the projection f* of a step function f = Σ_{j=1}^J θ_j 1_{I_j} is again a step function f* = Σ_{j=1}^J θ*_j 1_{I_j} with the same intervals but monotone heights θ*_1 ≤ ··· ≤ θ*_J. For the L_2(G_n)-distance, these values are obtained by the weighted isotonization
(θ*_1, ..., θ*_J) = argmin{Σ_{j=1}^J N_j (θ_j − q_j)^2 : q_1 ≤ ··· ≤ q_J}.
The optimizing values θ*_1, ..., θ*_J can be computed using the PAVA, and θ*_j can be characterized as the left derivative at the point n^{-1} Σ_{k=1}^j N_k of the greatest convex minorant of the cumulative sum diagram formed by the line segments connecting the points (n^{-1} Σ_{k≤j} N_k, n^{-1} Σ_{k≤j} N_k θ_k), j = 0, 1, ..., J (see Groeneboom and Jongbloed [24]). The same solution is obtained even if the L_2(G_n)-distance is replaced by a member of a wider class of distances; see Theorem 2.1 of Groeneboom and Jongbloed [24].
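The weighted isotonization can be carried out by the PAVA in a single pass; a minimal sketch (our own helper, with weights w_j playing the role of the counts N_j):

```python
def weighted_pava(theta, w):
    """Project step heights theta onto the monotone cone by minimizing
    sum_j w_j (theta_j - q_j)^2 over q_1 <= ... <= q_J (pool adjacent violators)."""
    blocks = []  # each block: [weighted mean, total weight, number of heights pooled]
    for t, wt in zip(theta, w):
        blocks.append([float(t), float(wt), 1])
        # pool adjacent violators: merge blocks while the order is violated
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            blocks.append([(w1 * m1 + w2 * m2) / (w1 + w2), w1 + w2, n1 + n2])
    out = []
    for m, _, n in blocks:
        out.extend([m] * n)
    return out
```

For example, `weighted_pava([3, 1, 2], [1, 1, 1])` returns `[2.0, 2.0, 2.0]`, the least-squares isotonic fit of the heights (3, 1, 2) with equal weights.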
We make one of the following design assumptions (DD) or (DR) on the predictor X and the assumption (E) on the error variables.
Condition (DD) (Deterministic predictor). The predictor variable X is deterministic, assuming values X_1, ..., X_n, and the counts N_1, ..., N_J of the J equispaced intervals I_1, ..., I_J satisfy max{N_j : 1 ≤ j ≤ J} ≲ n/J as J → ∞. The bound is clearly implied by the condition sup{|G_n(x) − G(x)| : x ∈ [0, 1]} ≲ 1/J, where G has a positive and continuous density g on [0, 1].
Condition (DR) (Random predictor). The predictor X is sampled independently from a distribution G, having a density g, which is bounded and bounded away from zero on [0, 1].
The assumption of normality on the error is only a working hypothesis. We only construct the likelihood function using the model ε 1 , . . . , ε n i.i.d. N(0, σ 2 ) with an unknown σ > 0. We assume the following condition on the true distribution of the error.
Condition (E) (True error distribution). The error variables ε_1, ..., ε_n are i.i.d. sub-Gaussian with mean 0 and variance σ_0^2. We denote the true value of the regression function by f_0 and write the vector of its values at the observed points as F_0 = (f_0(X_1), ..., f_0(X_n)). The true value of the variance σ^2 is thus σ_0^2. We denote the true distribution of (X, Y) by P_0, and let E_0(·) and Var_0(·) be the expectation and variance operators under P_0.
The error variance σ^2 may be estimated by maximizing the marginal likelihood of σ. From (2.3), it follows that the marginal maximum likelihood estimate of σ^2 is given by
σ̂_n^2 = n^{-1} (Y − Bζ)^T (I_n + B Λ B^T)^{-1} (Y − Bζ).
The plug-in posterior distribution of f is then obtained by substituting σ̂_n for σ in (2.2). If instead we equip σ^2 with the inverse-gamma prior σ^2 ~ IG(β_1, β_2), then a fully Bayes procedure can be based on the posterior distribution
σ^2 | D_n ~ IG(β_1 + n/2, β_2 + (Y − Bζ)^T (I_n + B Λ B^T)^{-1} (Y − Bζ)/2);
see Hoff [26], page 155.
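In matrix form, the estimate is a single quadratic form; a minimal numerical sketch (assuming the marginal law Y ~ N_n(Bζ, σ^2(I_n + BΛB^T)) of (2.3); the function name is ours):

```python
import numpy as np

def sigma2_mle(Y, B, zeta, Lam):
    """Marginal MLE of sigma^2: maximizing the N_n(B zeta, sigma^2 V) likelihood
    with V = I_n + B Lam B^T gives sigma^2 = r^T V^{-1} r / n for r = Y - B zeta."""
    n = len(Y)
    V = np.eye(n) + B @ Lam @ B.T
    r = Y - B @ zeta
    return r @ np.linalg.solve(V, r) / n
```

When B vanishes (no step information), the estimate reduces to the naive average ||Y − Bζ||^2 / n, as expected.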

Preliminaries
To establish posterior contraction rates for f with unknown σ, we need to effectively control the range of values of σ. It will be shown in Lemma 7.2 that the maximum marginal likelihood estimator of σ^2 in the plug-in Bayes approach and the marginal posterior distribution of σ^2 in the fully Bayes approach are consistent for any f_0 ∈ F_+, and the convergence is also uniform over F_+(K) for any fixed K > 0. This allows us to treat σ as essentially known in studying the posterior contraction.
As mentioned in the last section, we impose monotonicity on f by projecting f onto F + and use the projection-posterior distribution for inference. The following argument shows that the concentration property of the posterior at any monotone function is not weakened by this procedure.
Let Π*_n stand for the projection-posterior distribution given by
Π*_n(A) = Π(f : f* ∈ A | D_n),   (3.1)
where f* is the projection of f on F_+ with respect to some metric d on the space of regression functions. Then for the true regression function f_0 ∈ F_+ and ε > 0, we have the following concentration inequality for the projection-posterior distribution:
Π*_n(f* : d(f*, f_0) > 2ε) ≤ Π(f : d(f, f_0) > ε | D_n).   (3.2)
Consequently, the contraction rate of the unrestricted posterior is inherited by the projection-posterior, giving a path to the derivation of the posterior contraction of the projection-posterior distribution. To see this, note that d(f*, f) ≤ d(f_0, f) by the minimizing property of the projection, since f_0 ∈ F_+. Hence, using the triangle inequality, d(f*, f_0) ≤ d(f*, f) + d(f, f_0) ≤ 2 d(f, f_0), which gives (3.2). For p ≥ 1, the L_p-projection of a step function is easily computable by algorithms similar to the PAVA (see Section 3.1 of De Leeuw et al. [19]). It may be noted that the concentration inequality for the projection-posterior applies well beyond the specific prior based on piecewise constant functions used in this paper. For instance, if the regression function is also known to be smoother, we can use higher-order B-splines (piecewise constant functions are linear combinations of B-splines of order 1, while piecewise linear functions are linear combinations of B-splines of order 2) to construct a prior distribution that has a better concentration property at smoother functions; see Section 10.4 of Ghosal and van der Vaart [23]. The only additional complications are that the posterior distributions of the coefficients are dependent normals and that the projection cannot be computed by the PAVA. Nevertheless, the optimal posterior contraction rate obtained at a smooth monotone function is passed on to the projection-posterior distribution by (3.2). However, if only monotonicity is assumed, piecewise linear or higher-order B-splines, although they can be used to construct a prior, may not be useful for obtaining the contraction rate, due to the lack of an optimal approximation property of such functions at an arbitrary monotone function.

Contraction rates under the L 1 -metric
In this subsection, we derive the posterior contraction rate with respect to the L_1-metric. An important factor determining this rate is the approximation rate of monotone functions by step functions. For the L_1-metric, step functions with regularly placed knots are adequate for the optimal approximation rate (see Lemma 7.3), and hence it is sufficient to consider a Type 1 prior. In the following theorem, we derive the contraction rate at a monotone function in the L_1-metric by directly bounding posterior moments.

Theorem 3.1. Let f_0 ∈ F_+, and assume that Condition (E) holds. Let the prior on f be of Type 1, with J → ∞ and J ≲ n. Let σ^2 be estimated using the plug-in Bayes approach or endowed with the inverse-gamma prior using a fully Bayes approach. Assume that either X is deterministic and Condition (DD) holds, or X is random and Condition (DR) holds. Then for ε_n = max{J^{-1}, (J/n)^{1/2}} and every M_n → ∞,
E_0 Π*_n(f : ||f − f_0||_{1,G} > M_n ε_n) → 0.
In particular, if we choose J ≍ n^{1/3}, the projection-posterior contracts at the minimax rate ε_n = n^{-1/3}. Moreover, the convergence is uniform over F_+(K) for any K > 0.
Under Condition (DR), the L_1(G)-distance is equivalent to the usual Lebesgue L_1-metric on [0, 1], and hence the contraction rate may be stated in terms of the latter. Condition (DD) or (DR) on X in the theorem above is needed only to conclude, using Lemma 7.2, that the estimator (or the posterior) of σ is consistent, and this conclusion is used only to get an upper bound on σ. If instead we assume an upper bound on σ (and change the prior on σ to comply with the bound, if the fully Bayes procedure is used), these conditions can be removed.

Contraction rates under the empirical L p -metric
When the metric under consideration is L_p with p > 1, step functions based on equidistant knots do not have the optimal approximation property. To restore this ability, we need to allow arbitrary knots (see Lemma 7.3) and put a prior on these. Then the theory of posterior contraction for general (independent, not identically distributed) observations of Ghosal and van der Vaart [22] can be applied by computing the prior concentration rate near the truth and bounding the metric entropy of a suitable subset of the parameter space, called a sieve. However, due to their ordering requirement and the possibly very uneven allocation of the knots ξ used for the construction of the optimal approximation, the concentration of the prior distribution of ξ near the values appearing in the optimal approximation may be low, and hence the posterior concentration rate may suffer. The problem can be avoided by choosing knots from the observed values of X when the predictor variable is deterministic and the empirical L_p-norm ||f||_{p,G_n} is used. Then the optimal rate (up to a logarithmic factor) can be obtained.

Theorem 3.2. Let X be deterministic, assuming values X_1, ..., X_n, and let Condition (DD) hold. Let f_0 ∈ F_+ and the prior on f be of Type 2, with log J ≍ log n. Let ε_1, ..., ε_n be i.i.d. normal with mean zero and variance σ^2, which is estimated using the plug-in Bayes approach or is endowed with the inverse-gamma prior using a fully Bayes approach. Then for ε_n = max{J^{-1}, (J log n/n)^{1/2}}, any 0 < p ≤ 2 and every M_n → ∞,
E_0 Π*_n(f : ||f − f_0||_{p,G_n} > M_n ε_n) → 0.
In particular, the best rate ε_n = (n/log n)^{-1/3} is obtained by choosing J ≍ (n/log n)^{1/3}. Moreover, the convergence is uniform over F_+(K) for any K > 0.
If, instead of choosing J deterministically, we put a prior on J satisfying (2.1), then the contraction rate is given by n^{-1/3} (log n)^{(5−3t_2)/6}. Clearly, with a prior on J given by (2.1), the best rate (n/log n)^{-1/3} is obtained when t_1 = t_2 = 1. A Poisson (or a suitably truncated Poisson) prior meets the requirement. Again, Condition (DD) is used only to derive the consistency of the estimator (or the posterior) of σ, and the condition can be removed if σ is assumed to be bounded.
It would be interesting to obtain nearly optimal contraction rates for the continuous L_p-metric, but we do not know an appropriate prior on the knot locations that would allow sufficient prior concentration to yield the desired result. For the continuous L_p-metric, the weak approximation with equal intervals allows only a sub-optimal approximation rate J^{-1/p} (see Lemma 7.3), and consequently a sub-optimal posterior contraction rate (n/log n)^{-1/(p+2)}.

Bayesian testing for monotonicity of f
A natural test for the hypothesis of monotonicity is given by the posterior probability of F_+: reject the hypothesis if Π(f ∈ F_+ | D_n) is smaller than 1/2, say. The problem with this test is that for a true regression function f_0 ∈ F_+, even though the posterior is consistent at f_0, the posterior probability Π(f ∈ F_+ | D_n) may be low because a large part of a neighborhood of f_0 may fall outside F_+. In order to avoid such false rejections, one may base a test on a discrepancy measure d(f, F_+) between f sampled from the posterior and the set of monotone functions F_+ (that is, a nonnegative function of f that vanishes exactly on F_+), or equivalently, on d(f, f*), where f* is the projection of f on F_+. A reasonable test can be determined by the posterior probability Π(f : d(f, F_+) < τ_n | D_n) for a sequence τ_n → 0 slowly. In other words, we reject for low values of the posterior probability Π(F_+^{τ_n} | D_n) of the τ_n-neighborhood F_+^{τ_n} = {f : d(f, F_+) < τ_n} of F_+. This approach was also pursued by Salomond [30,32], with a discrepancy measure based on the step heights of f = Σ_{j=1}^J θ_j 1_{I_j} (with equidistant knots) and a cut-off τ_n = (J log n)/n. This test has the probability of Type I error going to zero and has high power against smooth alternatives, if appropriately separated from the null. However, the power of this test at a non-smooth alternative may not go to one. This prompts us to propose an alternative test, based on the L_1-distance as the discrepancy measure, which has the property of universal consistency, that is, the power at any fixed alternative goes to one.
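In sample terms, the test reduces to a Monte Carlo average over posterior draws; a minimal sketch (helper names are ours; for simplicity the L_2-projection computed by the PAVA stands in for the L_1-projection, and the L_1 discrepancy is approximated by the average height difference over equidistant bins):

```python
import numpy as np

def pava(y):
    """Least-squares isotonic projection of step heights (unweighted PAVA)."""
    blocks = []
    for v in y:
        blocks.append([float(v), 1])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, n2 = blocks.pop()
            m1, n1 = blocks.pop()
            blocks.append([(n1 * m1 + n2 * m2) / (n1 + n2), n1 + n2])
    return np.concatenate([[m] * n for m, n in blocks])

def reject_monotonicity(theta_draws, tau, gamma=0.5):
    """Reject when the posterior probability of the tau-neighborhood of F_+,
    estimated by the fraction of draws with d(f, F_+) <= tau, falls below gamma."""
    d = np.array([np.mean(np.abs(t - pava(t))) for t in theta_draws])
    return np.mean(d <= tau) < gamma
```

With draws concentrated near a monotone truth, the discrepancy stays near zero and the test accepts; a clearly decreasing truth gives large discrepancies and rejection.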
Let H(α, L) be the Hölder space of α-smooth functions with Hölder norm bounded by L (see Definition C.4 of Ghosal and van der Vaart [23]).

Theorem 4.1. Consider a Type 1 prior with J ≍ n^{1/3}. Let σ^2 be estimated using the plug-in Bayes approach or endowed with the inverse-gamma prior using a fully Bayes approach. Assume that X is random and Condition (DR) holds, and that the errors satisfy Condition (E). For the discrepancy measure d(f, F_+) = inf{||f − h||_{1,G} : h ∈ F_+}, let φ_n be the test that rejects the hypothesis of monotonicity for low values of the posterior probability Π(f : d(f, F_+) ≤ τ_n | D_n), for a suitable sequence τ_n → 0. Then: (a) for any f_0 ∈ F_+, the probability of Type I error goes to zero; (b) for any fixed alternative f_0 ∉ F̄_+, the power goes to one; (c) the power goes to one uniformly over alternatives f_0 ∈ H(α, L), 0 < α ≤ 1, separated from F_+ in the L_1(G)-distance by at least the order n^{-α/3}, up to a logarithmic factor.

In the above theorem, the L_1(G)-distance may be replaced by the L_1-distance under the Lebesgue measure, since under Condition (DR) these two metrics are equivalent. In this case, part (c) may be strengthened by replacing the Hölder space H(α, L) by the Sobolev space with (1, α)-Sobolev norm bounded by L (see Definition C.6 of Ghosal and van der Vaart [23]). Also, if G is replaced by the empirical distribution G_n (and assuming that Condition (DD) holds instead of Condition (DR) if X is deterministic), the conclusions in parts (a) and (c) will still hold. The proof is very similar. If σ has a known bound, then Condition (DD) or Condition (DR) is not needed.
The procedure involving the test φ_n is computationally simple, as it does not involve a prior on J. The algorithm for median isotonic regression (see Robertson and Wright [29] and De Leeuw et al. [19]) allows us to compute d(f, F_+) very efficiently. However, with a deterministic choice of J, the posterior contraction is not adaptive over classes of functions with different smoothness α. This is caused by the deterministic choice J ≍ n^{1/3} needed for the optimal rate at a monotone function, which differs from the order n^{1/(1+2α)} needed to match the minimax rate n^{-α/(1+2α)} (up to a logarithmic factor) at a function in H(α, L). This has a consequence on the degree of separation needed for high asymptotic power at an alternative regression function f_0 ∈ H(α, L) in testing for monotonicity: to make the power go to one, f_0 must be separated from F_+ by at least the order n^{-α/3} (up to a logarithmic factor), as in the above result. However, this is more than what should ideally be needed by a testing procedure. Since the contraction rate at a function in H(α, L) is n^{-α/(1+2α)}, achievable by choosing J ≍ n^{1/(1+2α)}, only an n^{-α/(1+2α)}-sized neighborhood of f_0 needs to be disjoint from F_+ to make the power go to one. Thus the minimal order of separation from f_0 to F_+ needed for high power by any testing procedure is n^{-α/(1+2α)}; that is, n^{-α/(1+2α)} is the optimal order of separation. As the separation needed in the above result is larger than this minimal order except for α = 1, the resulting test is not rate-adaptive. Adaptation can, however, be restored in a class of uniformly bounded regression functions by using a prior on J and letting the cut-off value for the discrepancy with F_+ depend on J.
The idea is similar to the one used in Salomond [32], except that we use the Hellinger distance between the underlying densities, instead of the maximum discrepancy measure on the coefficients used by him. This allows us to retain the universal consistency property.
Let d_H(f_1, f_2) stand for the Hellinger distance between p_{f_1,σ} and p_{f_2,σ}, where p_{f,σ} stands for the joint density of (X, Y) following Y | X ~ N(f(X), σ^2) and X ~ G, with respect to the product of G and the Lebesgue measure. By an easy calculation,
d_H^2(f_1, f_2) = 2 (1 − ∫ exp(−(f_1(x) − f_2(x))^2/(8σ^2)) dG(x)).
In the theorem, G can be replaced by the uniform distribution in the definition of the test. In this case, the Hölder space H(α, L) in part (c) can be replaced by the Sobolev space with (2, α)-Sobolev norm bounded by L.
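The squared distance above can be evaluated by one-dimensional integration; a minimal numerical sketch (our helper, taking G uniform on [0, 1] and a plain grid average for the integral):

```python
import numpy as np

def hellinger(f1, f2, sigma, m=10001):
    """d_H(f1, f2) for X ~ Uniform[0,1] and Y | X ~ N(f(X), sigma^2), using
    d_H^2 = 2 (1 - int exp(-(f1 - f2)^2 / (8 sigma^2)) dG)."""
    x = np.linspace(0.0, 1.0, m)
    affinity = np.mean(np.exp(-(f1(x) - f2(x)) ** 2 / (8.0 * sigma ** 2)))
    return np.sqrt(max(0.0, 2.0 * (1.0 - affinity)))
```

Note that d_H is bounded by sqrt(2) however far apart f_1 and f_2 are, which is one way to see that separation in d_H is a weaker requirement than separation in L_1.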
Unlike Theorem 4.1, the proof requires the application of the general theory of posterior contraction. The weaker Hellinger distance for separation is used so that a test required for the application of the theory is automatically available, without requiring the regression functions to be bounded by a known constant, a condition that would rule out the conjugate normal prior needed in the proof. An alternative is to use the empirical L_1-distance and conclude parts (a) and (c) only, assuming that N_j ≍ n/J uniformly in j = 1, ..., J.

Simulation study
In this section, we compare the proposed method with some other Bayesian and non-Bayesian methods. For estimation, we compare the L_1-distance of the Bayes estimator from the true function with the L_1-distances of the other estimators from the true function. For testing, we compare the level and power of the Bayesian test with those of the competing tests for monotonicity.

Simulation for contraction rates
We perform a simulation study to demonstrate the behavior of our projection-posterior method of estimation in finite samples. We use the Bayesian algorithm for estimating monotone functions given by Bornkamp and Ickstadt [10] and the median isotonic regression estimator as benchmarks against which to compare our results. We consider four monotone functions f_1, ..., f_4 on [0, 1], defined in terms of the distribution function of a Beta(a, b) random variable evaluated at x.
For every true regression function, we generate samples of size $n$ from the model $Y = f(X) + \varepsilon$, with $X$ taken as points equally spaced on $(0, 1)$ and $\varepsilon$ generated from $N(0, \sigma^2)$ with $\sigma = 0.1$. We take the number of pieces $J$ as the greatest integer less than or equal to $n^{1/3} \log n$. We generate 1000 posterior samples for each dataset, project them onto the monotone class of functions, and use the mean of the projection-posterior samples as our estimator. We compare the performance of our method with that of the median isotonic estimator and the algorithm of Bornkamp and Ickstadt [10]. The R package "isotone" efficiently computes the $L_1$-projection of a step function onto $F_+$. We use 5000 burn-in samples and 20000 MCMC samples to estimate $f$ using the method of Bornkamp and Ickstadt [10], implemented in the R package "bnpmr".
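The projection-posterior computation can be sketched as follows. This is an illustrative Python translation of the workflow, not the paper's R implementation: the paper uses the "isotone" package and an $L_1$-projection, while here, as a simplifying assumption, we use the $L_2$ pool-adjacent-violators (PAVA) projection; all function names and the bin assignment below are our own.

```python
import numpy as np

def pava(theta, w):
    """Weighted L2 projection of heights onto the increasing cone (pool adjacent violators)."""
    vals, wts, sizes = list(map(float, theta)), list(map(float, w)), [1] * len(theta)
    i = 0
    while i < len(vals) - 1:
        if vals[i] > vals[i + 1]:          # violation: pool the two adjacent blocks
            tot = wts[i] + wts[i + 1]
            vals[i] = (wts[i] * vals[i] + wts[i + 1] * vals[i + 1]) / tot
            wts[i] = tot
            sizes[i] += sizes[i + 1]
            del vals[i + 1], wts[i + 1], sizes[i + 1]
            if i > 0:
                i -= 1                     # a merge may create a new upstream violation
        else:
            i += 1
    return np.repeat(vals, sizes)

def projection_posterior_mean(x, y, J, sigma=0.1, zeta=0.0, lam2=100.0, draws=1000, seed=0):
    """Draw step heights from the conjugate normal posterior, project each draw, average."""
    rng = np.random.default_rng(seed)
    bins = np.minimum((x * J).astype(int), J - 1)      # index of the interval I_j containing X_i
    N = np.bincount(bins, minlength=J)                  # N_j = #{i : X_i in I_j}
    S = np.bincount(bins, weights=y, minlength=J)       # sum of Y_i within I_j
    prec = N / sigma ** 2 + 1.0 / lam2                  # posterior precision per height
    mean = (S / sigma ** 2 + zeta / lam2) / prec        # posterior mean per height
    samples = rng.normal(mean, 1.0 / np.sqrt(prec), size=(draws, J))
    projected = np.array([pava(s, np.maximum(N, 1)) for s in samples])
    return projected.mean(axis=0)                       # heights of the projection-posterior mean
```

Averaging monotone (projected) samples keeps the estimator monotone, which the test below verifies.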
To evaluate the performance of an estimator $f$, we compute the $L_1$-distance between $f$ and the true function $f_0$ on a fine grid of points $t_1, \ldots, t_N$, calculated as $D(f, f_0) = N^{-1} \sum_{i=1}^{N} |f(t_i) - f_0(t_i)|$. We report the results in Table 1. Each entry is $D(f, f_0)$ averaged over 1000 replications, with the standard deviation across these replications reported in brackets. We use "projection", "BNPMR" and "isotone" respectively to denote the estimators obtained from our method, Bornkamp and Ickstadt [10], and median isotonic regression. We observe that while no single method performs better than the others in all the cases considered, the projection-posterior method seems to do better than Bornkamp and Ickstadt [10] for $f = f_2$ when $n = 100$ and for $f = f_4$. For $f = f_1$, projection-posterior and isotonic regression give very similar results. In the experimental setup of Table 1, the approximate computing times needed to generate the posterior samples from one dataset of size 100 were 0.57 seconds for projection-posterior and 1 second for Bornkamp and Ickstadt [10]. This is expected, as our method does not involve drawing MCMC samples and is therefore computationally simpler than MCMC-based Bayesian methods. We display the scatter-plot of a dataset of size 100 generated from $f = f_2$ and $f = f_4$ with $\sigma = 0.1$, along with our estimator and a 95% projection-posterior credible region, in Figure 1. The estimator is seen to approximate the flat parts of $f_4$ and most parts of $f_2$ quite well.
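For concreteness, the grid evaluation of $D(f, f_0)$ can be written as follows (a minimal sketch; the midpoint grid and its size are our choices, not stated in the paper):

```python
import numpy as np

def l1_grid_distance(f, f0, num=10000):
    # midpoint grid on (0, 1); the average absolute deviation approximates
    # the L1([0, 1]) distance between f and f0
    t = (np.arange(num) + 0.5) / num
    return np.mean(np.abs(f(t) - f0(t)))
```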

Simulation for testing monotonicity
In this section we report the results from a simulation study comparing the performance of our test for monotonicity with that of two other methods. The first method is the one by Salomond [32], and the second is a non-Bayesian test proposed by Ghosal et al. [21].
We generate samples of size $n$ from the model $Y_i = f(X_i) + \varepsilon_i$, with the $X_i$'s equally spaced on $(0, 1)$. The errors $\varepsilon_i$ are i.i.d. $N(0, 0.1^2)$. We generate 500 such datasets for each test function $f$ on $[0, 1]$. We choose three monotone functions, together with non-monotone functions considered in [33], Salomond [32] and Ghosal et al. [21] as examples of non-monotone functions with small departures from monotonicity. The test by Salomond [32] is for monotone decreasing functions, so we use his algorithm on the negative of our simulated $Y$-values.
We choose the number of knots $J$ as the greatest integer less than or equal to $n^{1/3}$. The hyperparameters are chosen as $\zeta_j = 0$ and $\lambda_j^2 = 100$ for all $1 \le j \le J$. For the error variance, we use the marginal maximum likelihood estimate of $\sigma^2$. For our test and Salomond's [32], we use 1000 posterior samples for each dataset to make inference. The $\gamma$ in Theorem 4.1 is chosen as $1/2$.
To determine an appropriate cutoff point for our test, we find a slowly growing sequence $M_n$ such that using $M_n n^{-1/3}$ as the cutoff in Theorem 4.1 results in low Type 1 error for $f = 0$. We let $M_n = M_0 (\log n)^\kappa$ for $M_0, \kappa > 0$. We run our test across different combinations of $(M_0, \kappa)$ on datasets generated from the model with $f = 0$ for several values of $n$, ranging from $n = 50$ to $n = 10000$, and keep the combinations for which the misclassification error rate is less than 0.10. Among these, we then select the combination that maximizes the power when the test is applied to datasets generated from non-monotone functions. From this analysis, we found $M_n = 0.8 (\log n)^{0.1}$ to be an appropriate choice.
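In code, the selected cutoff sequence and the resulting decision rule take the following form (a sketch; `cutoff` and `reject_monotonicity` are our own hypothetical helper names wrapping the rule of Theorem 4.1 with $\gamma = 1/2$):

```python
import math

def cutoff(n, M0=0.8, kappa=0.1):
    # M_n * n^(-1/3) with the slowly growing factor M_n = M0 * (log n)^kappa
    return M0 * math.log(n) ** kappa * n ** (-1 / 3)

def reject_monotonicity(posterior_prob_near_monotone, gamma=0.5):
    # reject H0 (monotonicity) when the posterior mass of the cutoff
    # neighborhood of F+ falls below gamma
    return posterior_prob_near_monotone < gamma
```

The cutoff shrinks with $n$, so the neighborhood of the monotone class that must capture posterior mass under $H_0$ becomes tighter as more data arrive.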
The Type 1 errors of the tests are presented in Table 2. The error rate corresponding to $m_1$ gives an idea about the level of the test. The powers of the tests are displayed in Table 3. We find that our test has a low Type 1 error rate and outperforms Salomond's [32] test in samples of sizes 50 and 100 on datasets generated from monotone functions. In terms of power, our test does slightly worse than the other tests in small and moderate sample sizes. However, as $n$ grows, the power approaches one.
The computing time for our procedure was found to be quite reasonable. We evaluated $d(f, F_+)$ using the gpava function in the R package "isotone". For a dataset with $n = 500$, using 2500 posterior draws, our test took 0.75 seconds to execute. On the same dataset and processor, Salomond's [32] test took 2.03 seconds to run, also with 2500 posterior samples. As mentioned before, our test is computationally simpler than Salomond's [32] as it does not draw posterior samples of $J$, which possibly explains the faster computing time.

Proofs of the main results
Proof of Theorem 3.1. In view of (3.2), it is enough to obtain the contraction rate of the unrestricted posterior. We prove the result for the plug-in Bayes approach; the fully Bayes case can be dealt with similarly. From Lemma 7.2, we get a shrinking neighborhood $U_n$ of $\sigma_0$ with $P_0(\hat{\sigma}_n \in U_n) \to 1$. Hence for the purpose of the proof, we may assume that $\hat{\sigma}_n \in U_n$.
We first consider the case that $X$ is deterministic. Let $f_{0J} = \sum_{j=1}^{J} \theta_{0j} \mathbf{1}_{I_j}$ with $\theta_{0j} = N_j^{-1} \sum_{i: X_i \in I_j} f_0(X_i)$ for all $1 \le j \le J$. By Lemma 7.3(a), we have $\|f_{0J} - f_0\|_{1,G_n} \lesssim J^{-1}$, and the bound is also uniform for $f_0 \in F_+(K)$. To complete the proof, we now show that
$$E_0 \Pi(\|f - f_{0J}\|_{1,G_n} > M_n \sqrt{J/n} \mid D_n) \to 0 \quad \text{for any } M_n \to \infty. \tag{6.1}$$
Since $f = \theta_j$ and $f_{0J} = \theta_{0j}$ on $I_j$, we have $\|f - f_{0J}\|_{1,G_n} = n^{-1} \sum_{j=1}^{J} N_j |\theta_j - \theta_{0j}|$. Hence, by the Cauchy–Schwarz inequality followed by Markov's inequality, it suffices to bound the expectation of both terms and substitute in (6.2) to obtain the desired result. For the first term,

Table 2. Type 1 error: the proportion of instances out of 500 when $H_0$ was rejected. Our test based on the $L_1$-distance, the test by Salomond [32] and that of Ghosal et al. [21] are denoted by $T_1$, $T_\infty$ and $T_G$ respectively.

Table 3. Power: the proportion of instances out of 500 when $H_0$ was rejected. Our test based on the $L_1$-distance, the test by Salomond [32] and that of Ghosal et al. [21] are denoted by $T_1$, $T_\infty$ and $T_G$ respectively.

M. Chakraborty and S. Ghosal
This follows because, by the boundedness of $\zeta_j$ and $\lambda_j^{-2}$, we have $|E(\theta_j \mid D_n) - \bar{Y}_j| \lesssim N_j^{-1}$. The second term in the last expression is $\sigma_0^2$, and hence the expression is bounded above by a constant.
For random predictors, we use the $\|\cdot\|_{1,G}$-distance, which involves another integration with respect to $X_1, \ldots, X_n$ on the left side of (6.1).
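The empirical-$L_1$ identity used in the proof above, $\|f - f_{0J}\|_{1,G_n} = n^{-1} \sum_{j=1}^{J} N_j |\theta_j - \theta_{0j}|$ for two step functions on the same partition, is easy to verify numerically (an illustrative check; the variable names and the random configuration are ours):

```python
import numpy as np

# two step functions on the same equidistant partition, evaluated at the design points
rng = np.random.default_rng(0)
n, J = 200, 5
x = rng.uniform(0, 1, n)
bins = np.minimum((x * J).astype(int), J - 1)   # index of the interval I_j containing X_i
N = np.bincount(bins, minlength=J)              # N_j = #{i : X_i in I_j}
theta = np.sort(rng.normal(size=J))             # heights of f
theta0 = np.sort(rng.normal(size=J))            # heights of f_{0J}

# left side: n^{-1} sum_i |f(X_i) - f_{0J}(X_i)|; right side: n^{-1} sum_j N_j |theta_j - theta_{0j}|
lhs = np.mean(np.abs(theta[bins] - theta0[bins]))
rhs = np.sum(N * np.abs(theta - theta0)) / n
```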
Proof of Theorem 3.2. Because of (3.2), it suffices to obtain the contraction rate of the unrestricted posterior. Since for $0 < p < 2$ the $L_p(G_n)$-distance is dominated by the $L_2(G_n)$-distance, it suffices to prove the result for $p = 2$. We shall apply the general theory of posterior contraction (Ghosal and van der Vaart [23], Chapter 8) using the sieve $\mathcal{P}_n$ in (6.4). Let $p_{f,\sigma}$ denote the joint density of $Y_1, \ldots, Y_n$ for a regression function $f$. We verify the conditions of Theorem 8.26 of Ghosal and van der Vaart [23] for $\epsilon_n = \max\{\sqrt{(J \log n)/n}, J^{-1}\}$. Note that by Lemma 7.2, we can restrict $\sigma$ to an arbitrarily small neighborhood of $\sigma_0$, so the test construction in Lemma 8.27 of Ghosal and van der Vaart [23] is applicable.
By direct calculations, the Kullback–Leibler divergence and the squared Kullback–Leibler variation can be computed explicitly. Therefore, for a sufficiently small $\epsilon$, there exists $C_1 > 0$ bounding the prior probability of the Kullback–Leibler neighborhood $B_{n,0}((f_0, \sigma_0), \epsilon_n)$ from below, and the resulting expression is at least of the order $(C_3 \epsilon_n)^J n^{-(J-1)}$ for some $C_3 > 0$. Putting these together, we have $-\log \Pi(B_{n,0}((f_0, \sigma_0), \epsilon_n)) \lesssim J[\log(1/\epsilon_n) + \log J] \lesssim J \log n \lesssim n \epsilon_n^2$ by the definition of $\epsilon_n$, fulfilling the condition of prior probability concentration needed for the posterior contraction rate $\epsilon_n$.
Observe that the metric entropy $\log N(\epsilon, \mathcal{P}_n, \|\cdot\|_{p,G_n})$ of the sieve $\mathcal{P}_n$ in (6.4) is bounded above by a multiple of $J \log(n/\epsilon_n) \lesssim J \log n \lesssim n \epsilon_n^2$. Finally, the prior probability $\Pi(\mathcal{P}_n^c)$ of the complement of the sieve $\mathcal{P}_n$ is bounded by $J e^{-n^2/2} \lesssim e^{-c n \epsilon_n^2}$ for any $c > 0$, establishing condition (8.33) of Ghosal and van der Vaart [23]. This leads to the rate $\epsilon_n = \max\{\sqrt{(J \log n)/n}, J^{-1}\}$ when $J$ is chosen deterministically. Clearly, the best choice is $J \asymp (n/\log n)^{1/3}$, giving the nearly optimal rate $(n/\log n)^{-1/3}$. When $J$ is given a prior, to lower bound $\Pi(B_{n,0}((f_0, \sigma_0), \epsilon))$, we intersect the set with $\{J = J_0\}$, where $J_0 \asymp (n/\log n)^{1/3}$. This gives an additional factor $e^{-b_1 J_0 (\log J_0)^{t_1}}$, which is absorbed in $e^{-c n \bar{\epsilon}_n^2}$ by adjusting the constant for the pre-rate $\bar{\epsilon}_n = (n/\log n)^{-1/3}$, because $t_1 \le 1$. Modify the sieve in (6.4) by intersecting with $\{J \le J_1\}$, where $J_1$ is to be determined. The prior probability of the complement $\mathcal{P}_n^c$ then contributes an extra factor, a constant multiple of $e^{-b_2 J_1 (\log J_1)^{t_2}}$, to $J_1 e^{-n^2/2}$. To obtain the final rate, we need to choose $J_1$ such that $J_1 (\log n)^{t_2}$ exceeds a sufficiently large multiple of $n \bar{\epsilon}_n^2$, and then the rate is given by $\sqrt{(J_1 \log n)/n}$, which is $n^{-1/3}$ times a power of $\log n$.

Proof of Theorem 4.1. (a) The posterior contracts at the rate $M_n n^{-1/3}$ for $J \asymp n^{1/3}$ by Theorem 3.1. Then it follows that $E_0 \phi_n = P_0(\Pi(d(f, F_+) \le M_n n^{-1/3} \mid D_n) < \gamma) \to 0$. Further, the convergence is uniform over $f_0 \in F_+(K)$ for any $K > 0$.
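For reference, the balancing calculation behind the deterministic choice of $J$ reads:

```latex
\epsilon_n = \max\Big\{\sqrt{\tfrac{J\log n}{n}},\, J^{-1}\Big\},
\qquad
\sqrt{\tfrac{J\log n}{n}} \asymp J^{-1}
\;\Longleftrightarrow\;
J \asymp \Big(\tfrac{n}{\log n}\Big)^{1/3},
\qquad
\text{giving}\quad
\epsilon_n \asymp \Big(\tfrac{n}{\log n}\Big)^{-1/3}.
```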
(b) Let $f_0 \notin F_+$ be fixed and integrable. Using the properties of the projection, $d(f_0, F_+) = \|f_0 - f_0^*\|_{1,G}$ is bounded by $\|f_0 - f^*\|_{1,G}$, which, by the triangle inequality, can be bounded further. Then, arguing as in part (a), the posterior probability $\Pi(d(f, F_+) \le M_n n^{-1/3} \mid D_n)$ tends to zero in probability, because $d(f_0, F_+)$ is fixed and positive. This implies that the probability of a Type 2 error, $P_0(\Pi(d(f, F_+) \le M_n n^{-1/3} \mid D_n) \ge \gamma)$, tends to zero.
(c) Let $f_0 \notin F_+$ and $f_0 \in H(\alpha, L)$ be such that $d(f_0, F_+) \ge \rho_n(\alpha)$. Consider the step function $f_{0J}$ of $f_0$ as in part (b). By a well-known fact from approximation theory, $\|f_0 - f_{0J}\|_{1,G} \le C(L) J^{-\alpha}$ for some constant $C(L)$ depending only on $L$. For instance, the bound follows from de Boor [18], as step functions with equidistant knots are B-splines of order 1. Hence (6.5) holds for $J \asymp n^{1/3}$ and $C > C(L)$, as well as for $\alpha = 1$, provided that $\log J_n \lesssim \log n$. We write the expression inside the expectation as a sum and bound each term. In view of Condition (DR), the $G(I_j)$ are of the order $1/J$, and by Lemma 7.1, the $N_j$ are of the order $n/J$ in probability, uniformly in $j = 1, \ldots, J$. Under the boundedness assumption on the prior parameters and the sampling variance, $\mathrm{Var}(\theta_j \mid D_n) \lesssim 1/N_j \lesssim J/n$ with high probability, from the standard expressions for the normal–normal conjugate setting (see the proof of Theorem 3.1).
To estimate $(E(\theta_j \mid D_n) - \theta_{0j})^2$, with $\bar{Y}_j$ standing for $N_j^{-1} \sum_{i: X_i \in I_j} Y_i$ and $\bar{\varepsilon}_j$ standing for $N_j^{-1} \sum_{i: X_i \in I_j} \varepsilon_i$, we first observe that $|\bar{\varepsilon}_j|^2 \lesssim N_j^{-1} \log n \lesssim (J \log n)/n$ with high probability. Here we have used the maximal norm estimate based on the squared-exponential Orlicz norm (see Lemma 2.2.2 of van der Vaart and Wellner [36]) and the fact that $\#\{\bar{\varepsilon}_j : j \le J \le J_n\} \le J_n^2$. By the same argument and the boundedness of $f_0$, a similar bound holds with high probability. Also, $|\bar{Y}_j|$ is uniformly bounded with high probability, because $Y_i = f_0(X_i) + \varepsilon_i$. Substituting in the expression for $E(\theta_j \mid D_n)$, we conclude that $\sum_{j=1}^{J} (E(\theta_j \mid D_n) - \theta_{0j})^2 \lesssim (J \log n)/n$. Putting these estimates in (6.7), we find that the expression is bounded by $M_0^{-2}$ with high probability, simultaneously for all $J \le J_n$. Hence by (6.6), it follows that (6.5) holds.
We also observe that, if the posterior contracts at the rate $\epsilon_n$ at $f_0$ in the metric $d_H$, then (6.8) holds. To see this, let $\tilde{f}_{0J}$ stand for the element of $F_J$ closest to $f_0$ in terms of $d_H$. Recall that $f_{0J}$ is the element of $F_J$ closest to $f_0$ in terms of the $\|\cdot\|_{2,G}$-distance. The function $\tilde{f}_{0J}$ may be different from $f_{0J}$, but clearly $|\tilde{f}_{0J}| \le K$ and $|f_{0J}| \le K$ whenever $|f_0| \le K$. Thus, from the definitions of the respective minimizers and the equivalence of the metrics $d_H$ and $\|\cdot\|_{2,G}$ on uniformly bounded functions, it follows that $d_H(f_0, \tilde{f}_{0J})$ and $\|f_0 - f_{0J}\|_{2,G}$ are of the same order. Since $\tilde{f}_{0J}$ is the closest element of $F_J$ to $f_0$ in the metric $d_H$, in view of the last display, $f_{0J}$ can replace $\tilde{f}_{0J}$ in the assertion at the expense of changing the constant from $M_0$ to some appropriate $M_0'$, giving (6.8).
(a) If $f_0 \in F_+$, then $f_{0J} \in F_+$. By Lemma 7.3, the $L_2$-approximation rate of $F_J$ with equidistant intervals at a monotone function is $J^{-1/2}$. Stated differently, in order to achieve an error bound $\epsilon$, the number of intervals $J$ should be chosen of the order $\epsilon^{-2}$. Then standard arguments, as in the proof of Theorem 3.2, show that the prior probability of a Kullback–Leibler neighborhood of size $\epsilon^2$ is bounded below by $\exp\{-C_1 \epsilon^{-2} \log(1/\epsilon)\}$. The required test with respect to $d$ is automatically available, while the sieve can be chosen as in Theorem 3.2 and its entropy bounded in the same way, by noting that $d$ is bounded by the $L_2(G)$-metric, leading to a (suboptimal) contraction rate $\epsilon_n = (n/\log n)^{-1/4}$. It also follows that for $J_n$ a large constant multiple of $\epsilon_n^{-1}$, the prior probability of $J > J_n$ is exponentially small compared with the prior concentration, and hence $\{J > J_n\}$ also has exponentially small posterior probability (see Theorem 8.11(iii) of Ghosal and van der Vaart [23]). Since $\log J_n \lesssim \log n$, it follows that (6.5) holds.
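The $J^{-1/2}$ approximation rate can be checked empirically: for a bounded monotone $f$ with total increase $f(1) - f(0)$, the best $L_2$ step approximation on $J$ equidistant intervals (bin-average heights) has error at most $(f(1) - f(0)) J^{-1/2}$, with the rate driven by jump discontinuities. A numerical sketch, with a test function of our own choosing:

```python
import numpy as np

def step_l2_error(f, J, num=200000):
    # L2([0,1]) error of the best step approximation of f on J equidistant
    # intervals (bin-average heights), evaluated on a fine midpoint grid
    t = (np.arange(num) + 0.5) / num
    vals = f(t)
    bins = np.minimum((t * J).astype(int), J - 1)
    heights = np.bincount(bins, weights=vals, minlength=J) / np.bincount(bins, minlength=J)
    return np.sqrt(np.mean((vals - heights[bins]) ** 2))

# a bounded monotone function with a jump; total increase f(1) - f(0) = 1.5
f_mono = lambda x: 0.5 * x + (x >= 0.52)
```

Smooth monotone functions do better (rate $J^{-1}$); the $J^{-1/2}$ bound covers the worst case over all monotone functions with the given range.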
(b) Let $f_0 \notin F_+$ be fixed and bounded. By the martingale convergence theorem, $\|f_{0J} - f_0\|_{2,G} \to 0$ as $J \to \infty$, so for a given $\epsilon > 0$, we can find $J_0$ (depending on $\epsilon$ but not on $n$) such that $\|f_{0J_0} - f_0\|_{2,G} < \epsilon/2$. Then the prior concentration condition holds for some $\delta > 0$. Further, for $J_{1n}$ an arbitrarily small multiple of $n/\log n$, the excess prior probability $\Pi(J > J_{1n})$ can be bounded by $e^{-bn}$ for some $b > 0$ depending on $c$. Considering the sieve $\mathcal{P}_n = \{f = \sum_{j=1}^{J} \theta_j \mathbf{1}_{I_j} : \max_j |\theta_j| \le n, J \le J_{1n}\}$, standard estimates give a bound for its $L_\infty$-metric entropy that is an arbitrarily small multiple of $n$. Therefore it follows (see Theorem 6.17 of Ghosal and van der Vaart [23]) that $E_0 \Pi(J > J_{1n} \mid D_n) \to 0$ and the posterior is consistent at $f_0$. Observe that for any $f \in F_J$ and $h \in F_+$, the triangle inequality decomposes the relevant distance into three terms. Since $f_0 \notin F_+$, the first term is bounded below by a fixed positive number for any $h \in F_+$. The second term is bounded above by $\sqrt{(J \log n)/n}$ with high posterior probability, and $J$ can be restricted to at most $J_{1n}$, which can be taken to be an arbitrarily small multiple of $n/\log n$. Hence we can make the second term as small as we like, with high posterior probability. By (6.8) and posterior consistency, the third term can also be made arbitrarily small with high posterior probability; note that here the posterior variation is due to the randomness of $J$ only. Minimizing the left-hand side with respect to $h \in F_+$, this shows that $d_H(f, F_+)$ is larger than some fixed positive number with high posterior probability, in true probability. If $J \le J_{1n}$, this separation will exceed any fixed multiple of $\sqrt{(J \log n)/n}$ with high posterior probability for all $J \le J_{1n}$, while the posterior probability of $J > J_{1n}$ is small in true probability. Hence the posterior probability of the event $\{d_H(f, F_+) > M_0 \sqrt{(J \log n)/n}\}$ tends to one in true probability, prompting the test to reject the null hypothesis of monotonicity with true probability tending to one.
(c) Let $f_0 \notin F_+$ and $f_0 \in H(\alpha, L)$ be such that $d_H(f_0, F_+) \ge \rho_n(\alpha)$. The proof is very similar to that of part (b), with the following changes. First, by the well-known $L_\infty$-approximation rate $J^{-\alpha}$ at functions in $H(\alpha, L)$ by step functions, and the standard arguments used in parts (a) and (b) giving prior concentration and $L_\infty$-metric entropy bounds, the posterior contraction rate at $f_0$ with respect to $d_H$ is $\epsilon_n = (n/\log n)^{-\alpha/(2\alpha+1)}$, since the $L_\infty$-metric is stronger than $d_H$. Also, with high posterior probability, $J$ can be restricted to be less than $J_{2n} \asymp n \epsilon_n^2 / \log n = (n/\log n)^{1/(2\alpha+1)}$. This bounds the second term in the decomposition.

For random $X$, using the fact that $N_j \sim \mathrm{Bin}(n, G(I_j))$, the expectation of (7.1) under $G$ equals $\sum_{j=1}^{J} G(I_j) (f_0(j/J) - f_0((j-1)/J))^2$, which, in view of Condition (DR), has the bound $\max_{1 \le j \le J} G(I_j) (f_0(1) - f_0(0))^2 \lesssim J^{-1}$.
For the rest of the proof, we assume that $X$ is fixed, satisfying Condition (DD); the random case can be dealt with similarly, by taking expectations with respect to $G$ and using Condition (DR). We imitate the proof of Proposition 4.1(a) of Yoo and Ghosal [37], but assuming that $f_0$ is monotone instead of smooth. Define $U = (B \Lambda B^T + I_n)^{-1}$ and decompose the expression into three terms. Among these, only the middle term, arising from the approximation of the true function by step functions, differs from Yoo and Ghosal [37]; the other two terms are bounded by $J/n$, considering step functions as B-splines of order 1 in one dimension. The second term can also be bounded by a multiple of $J^{-1}$ in the same way as Yoo and Ghosal [37], using the $L_2$-approximation rate $J^{-1/2}$ for monotone functions, leading to an upper bound of a constant multiple of $J/n + J^{-1}$ for the expression in (7.3).
To complete the proof of part (a), we bound $\mathrm{Var}_0(\hat{\sigma}_n^2)$ by a multiple of $n^{-1}$. Again, we can follow the same steps as in the proof of Proposition 4.1(a) of Yoo and Ghosal [37], with the approximation rate for a smooth function replaced by the $L_2(G_n)$-approximation rate $J^{-1}$ for a monotone function. We also observe that the bounds obtained in the proof are uniform over $f_0 \in F_+(K)$ for any $K > 0$.
Given part (a), the proof of part (b) follows exactly as in the proof of Proposition 4.1(a) of Yoo and Ghosal [37].

Lemma 7.3. Let $1 \le p < \infty$ and $K > 0$. Then for every $f \in F_+(K)$ and $J > 1$, there exist $\theta_1 \le \cdots \le \theta_J$ in $[-K, K]$ such that the following assertions hold.

The proof of part (b) is essentially contained in the proof of Theorem 2.7.5 of van der Vaart and Wellner [36], although their theorem is about a bound for the bracketing or metric entropy. Implicit in their construction is that, given $\epsilon > 0$, there exist $J = J(\epsilon) \lesssim \epsilon^{-1}$, knots $0 \le \xi_1 < \cdots < \xi_{J-1} \le 1$ and heights $\theta_1, \ldots, \theta_J$ such that $f_J = \sum_{j=1}^{J} \theta_j \mathbf{1}_{I_j}$ satisfies $\|f - f_J\|_{p,H} < \epsilon$, where $I_1, \ldots, I_J$ form an interval partition of $[0, 1]$ with knots $0 = \xi_0 < \xi_1 < \cdots < \xi_{J-1} \le \xi_J = 1$. For instance, one of the lower brackets in their construction of an $\epsilon$-bracketing will satisfy the approximation property. The roles of $\epsilon$ and $J$ can be reversed, in that, given $J$, we can first obtain $\epsilon > 0$ such that the corresponding $J(\epsilon)$ is within $J$.
Finally, we need to conclude that the knot points ξ 1 < · · · < ξ J−1 can be chosen from the support of H. The construction in van der Vaart and Wellner [36] assumed, without loss of generality, that H is uniform. For a general H, the quantile transform is applied, transforming the jth knot ξ j to H −1 (ξ j ), which belongs to the support of H.
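The quantile-transform step can be illustrated directly: choose equispaced knots for the uniform distribution and map them through $H^{-1}$. A sketch with $H(x) = x^2$ on $[0, 1]$ (our example, chosen because $H^{-1}(u) = \sqrt{u}$ is explicit):

```python
import numpy as np

J = 5
xi = np.arange(1, J) / J          # knots 1/J, ..., (J-1)/J for a uniform H
H_inv = np.sqrt                   # inverse CDF of H(x) = x^2 on [0, 1]
knots = H_inv(xi)                 # transformed knots, lying in the support of H
```

The transformed knots remain strictly increasing, and applying $H$ recovers the original uniform knots, which is exactly the property the proof uses.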