Asymptotics for p-value based threshold estimation in regression settings

We investigate the large sample behavior of a p-value based procedure for estimating the threshold level at which a regression function takes off from its baseline value – a problem arising in environmental statistics, engineering and other related fields. The estimate is constructed by fitting a "stump" function to approximate p-values obtained from tests for deviation of the regression function from its baseline level. The smoothness of the regression function in the vicinity of the threshold determines the rate of convergence: a "cusp" of order k at the threshold yields an optimal convergence rate of n^{−1/(2k+1)}, n being the number of sampled covariates. We show that the asymptotic distribution of the normalized estimate of the threshold, for both i.i.d. and short-range dependent errors, is the minimizer of an integrated and transformed Gaussian process. We study the finite sample behavior of confidence intervals obtained through the asymptotic approximation using simulations, consider extensions to short-range dependent data, and apply our inference procedure to two real data sets.

AMS 2000 subject classifications: Primary 62G20, 62G86; secondary 62G30.


Introduction
Consider a data generating model of the form Y = µ(X) + ε, where µ is a continuous function on [0, 1] such that µ(x) = τ for x ≤ d_0 and µ(x) > τ for x > d_0. The covariate X may arise from a random or a fixed design setting, and we assume that ε has mean zero with finite positive variance. The function µ need not be monotone and the baseline value τ is not necessarily known. We are interested in estimating and constructing confidence intervals (CI's) for the threshold d_0 ∈ (0, 1), the point from which the function starts to deviate from its baseline value.
Recently, Mallik et al. (2011) introduced a novel method for estimating the threshold d_0 using a p-value based criterion function but did not provide a recipe for constructing confidence intervals (CI's) for the point d_0. In this paper, we address the inference problem: we study the asymptotic properties of their procedure in the regression setting and use these results to construct asymptotically valid CI's for the threshold, both in simulation settings and for two key motivating examples from Mallik et al. (2011). The problem, which falls within the sphere of non-regular M-estimation, is rather hard, and involves non-trivial applications of techniques from modern empirical process theory, as well as results from martingale theory and the theory of Gaussian processes. Along the way, we also deduce results on the large sample behavior of a kernel estimator at local points (see Lemma 2 and Proposition 3) that are of independent interest. In most of the literature, kernel estimates are considered at various fixed points and are asymptotically independent (Csörgő and Mielniczuk, 1995a,b; Robinson, 1997). Hence, they do not admit a functional limit. However, these estimates, when considered at local points, deliver an invariance principle; see Lemma 2 and the proof of Proposition 3.
It is instructive to note that our problem of interest has natural connections with change-point analysis, though it is not a change-point problem for the regression function itself. Under mild assumptions, estimating d_0 can be treated as a problem of detecting a change-point in the derivative of a certain order of µ. The literature on change-point detection is enormous; earlier work includes Hinkley (1970); Korostelëv (1987); Dümbgen (1991); Müller (1992); Korostelëv and Tsybakov (1993); Loader (1996); Müller and Song (1997). We also refer the reader to Brodsky and Darkhovsky (1993), Bhattacharya (1994) and Csörgő and Horváth (1997) for an overview of results on change-point estimation in various settings. Change-points have also been extensively studied in time-series and sequential problems; see Horváth, Horváth and Hušková (2008), Hušková et al. (2008), Gut and Steinebach (2009), Steland (2010) and the references therein. There has also been some work in settings with gradually changing regression functions (local alternative type formulations) but under certain parametric assumptions (Hušková, 1998; Jarušková, 1998).
In particular, the problem of estimating change-points in derivatives within the context of regression has been addressed by a number of authors, e.g. Müller (1992), Cheng and Raimondo (2008), Wishart (2009), Wishart and Kulik (2010). We compare and contrast our approach to the work of these authors in the next section. We further note that our problem can be viewed as a special case of the more general problem of identifying the region where a function, defined on some multi-dimensional space, assumes its baseline (minimum or maximum) value. This problem is relevant to applications in fMRI and image detection (Willett and Nowak, 2007).
We show in what follows that the smoothness of the function in the vicinity of d_0 determines the rate of convergence of our estimator: for a "cusp" of order k at d_0, the best possible rate of convergence turns out to be n^{−1/(2k+1)}. The limiting distribution of an appropriately normalized version of the estimator is that of the minimizer of the integral of a transformed Gaussian process. The limiting process is new, and while the uniqueness of the minimizer remains unclear (and appears to be an interesting nontrivial exercise in probability), we can bypass the lack of uniqueness and still provide a thorough mathematical framework to construct honest CI's. Under the assumption of uniqueness, which appears to be a very reasonable conjecture based on extensive simulations, we establish auxiliary results to construct asymptotically exact CI's as well.
The paper is organized thus: we briefly discuss the estimation procedure and the basic assumptions in Section 2. The rate of convergence and the asymptotic distribution of the estimated threshold for a particular version of our procedure, along with some auxiliary results for constructing CI's, are deduced in Sections 3.1 and 3.2 assuming a known τ. Asymptotic results for variants of the procedure are discussed in Section 3.3, and extensions of these results to the situation with an unknown τ are presented in Section 4. We study the coverage performance of the resulting CI's through simulations in Section 5. The applicability of our approach to short-range dependent data is the content of Section 6. We apply our procedure to two data examples in Section 7 and end with a discussion in Section 8. The proofs of several technical results are available in the Appendix.

The Method
For simplicity, we first consider the uniform fixed design regression model

Y_i = µ(i/n) + ε_i, i = 1, . . ., n, (2.1)

with the ε_i's i.i.d. having variance σ²_0. Although we suppress the dependence on n, Y_i and ε_i must be viewed as triangular arrays. Let K be a symmetric probability density (kernel) and h_n = h_0 n^{−λ} denote the smoothing bandwidth, for some λ ∈ (0, 1) and h_0 > 0. Then an estimate of the regression function, at stage n, is given by the kernel estimator

μ̂(x) = (1/(nh_n)) Σ_{i=1}^n Y_i K((x − i/n)/h_n). (2.2)

Approximate p-values for testing the null hypothesis µ(x) = τ against the alternative µ(x) > τ can then be constructed from the normalized statistics √(nh_n)(μ̂(x) − τ)/(σ̂_0 √K̄₂), σ̂_0 being a consistent estimate of σ_0 and K̄₂ = ∫ K²(u) du. To the left of d_0, the null hypothesis holds and these approximate p-values converge weakly to a Uniform(0,1) distribution, which has mean 1/2. Moreover, to the right of d_0, where the alternative is true, the p-values converge in probability to 0. This dichotomous behavior of the p-values motivates fitting a stump to them, i.e., minimizing a least squares criterion over values of d in (0, 1), to yield an estimate of d_0. Simple calculations show that this is equivalent to minimizing

M̃_n(d) = (1/n) Σ_{i=1}^n {Φ(√(nh_n)(μ̂(i/n) − τ)/(σ̂_0 √K̄₂)) − γ} 1(i/n ≤ d). (2.5)

Here, γ = 0.75. We refer to the above approach as Method 1.
We next describe an approach which avoids estimating σ_0 altogether. Relying upon the simple fact that E[Φ(Z)] = 0.5 for a normally distributed Z with zero mean and arbitrary variance, it can be seen that E[Φ(√(nh_n)(μ̂(x) − τ))] converges to 1/2 for x < d_0, while for x > d_0, it converges to 1. So, the desired dichotomous behavior is preserved even without normalization by the estimate of the variance, and an estimate of d_0 is given by d̂_n = sargmin_{d∈[0,1]} M_n(d), where

M_n(d) = (1/n) Σ_{i=1}^n {Φ(√(nh_n)(μ̂(i/n) − τ)) − γ} 1(i/n ≤ d). (2.6)

Here, sargmin denotes the smallest argmin of the criterion function, which need not have a unique minimum. We refer to this approach as Method 2 and study its limiting behavior in the paper. Analyzing this method is useful in illustrating the core ideas while avoiding some of the tedious details encountered in analyzing Method 1.
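To make the construction concrete, the following is a minimal sketch of Method 2 in Python. The kernel, bandwidth, and simulated model (µ(x) = (x − 0.5)_+ with Gaussian noise, a cusp of order k = 1 at d_0 = 0.5) are illustrative choices, not the paper's simulation settings.

```python
import numpy as np
from scipy.stats import norm

def epanechnikov(u):
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

def method2_threshold(y, h, tau=0.0, gamma=0.75):
    """Method 2 (sketch): smallest argmin of
    M_n(d) = (1/n) sum_{i/n <= d} {Phi(sqrt(n h)(mu_hat(i/n) - tau)) - gamma}."""
    n = len(y)
    x = np.arange(1, n + 1) / n                # uniform fixed design points i/n
    # kernel estimate mu_hat(x) = (1/(n h)) sum_i Y_i K((x - i/n)/h)
    mu_hat = np.array([(epanechnikov((d - x) / h) * y).sum() / (n * h) for d in x])
    # left of d_0 these are ~ Phi(noise), mean ~ 1/2; right of d_0 they -> 1
    phi_vals = norm.cdf(np.sqrt(n * h) * (mu_hat - tau))
    M = np.cumsum(phi_vals - gamma) / n        # partial sums give M_n on the grid
    return x[np.argmin(M)]                     # np.argmin returns the first (smallest) minimizer

# illustrative simulation
rng = np.random.default_rng(0)
n = 2000
x = np.arange(1, n + 1) / n
y = np.maximum(x - 0.5, 0.0) + 0.1 * rng.standard_normal(n)
d_hat = method2_threshold(y, h=0.1)
```

The smallest argmin is returned because, as noted above, the criterion need not have a unique minimizer.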
Remark 1. The above methods are based on a known τ. When τ is unknown, a plug-in estimate can be substituted in its place (more on this in Section 4). Also, for any choice of γ ∈ (1/2, 1) in (2.5) and (2.6), the estimator of d_0 is consistent. The proof follows along the lines of arguments in Mallik et al. (2011, pp. 898–900).

Variants
If the covariate X is random, one could still use the Nadaraya–Watson estimator to construct p-values. More precisely, let P_n denote the empirical measure of (Y_i, X_i), i = 1, . . ., n, which are independent realizations from the model Y = µ(X) + ε with σ²_0(x) = Var(ε | X = x) > 0 and X having a continuous positive density f. Then the Nadaraya–Watson estimator is

μ̂(x) = Σ_{i=1}^n Y_i K((x − X_i)/h_n) / Σ_{i=1}^n K((x − X_i)/h_n),

and a consistent estimator of d_0 is

d̂_n = sargmin_{d∈(0,1)} P_n [{Φ(√(nh_n)(μ̂(X) − τ)) − γ} 1(X ≤ d)] (2.7)

(consistency follows as in Mallik et al. (2011, pp. 892)). An estimate based on p-values normalized by the estimate of variance can also be constructed but is computationally more complicated, as an estimate of the variance function is needed.
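A sketch of this random-design variant, with the Nadaraya–Watson estimator in place of the fixed-design kernel estimate; again the kernel, bandwidth and simulated model are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from scipy.stats import norm

def nw_threshold(x, y, h, tau=0.0, gamma=0.75):
    """Random-design variant (sketch): minimize the empirical measure of
    {Phi(sqrt(n h)(mu_hat(X) - tau)) - gamma} 1(X <= d) over d."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    n = len(y)
    K = lambda u: np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)
    w = K((x[:, None] - x[None, :]) / h)           # w[i, j] = K((x_i - x_j)/h)
    mu_hat = (w @ y) / np.maximum(w.sum(axis=1), 1e-12)  # Nadaraya-Watson ratio
    phi_vals = norm.cdf(np.sqrt(n * h) * (mu_hat - tau))
    M = np.cumsum(phi_vals - gamma) / n            # partial sums over {X_i <= d}
    return x[np.argmin(M)]

# illustrative simulation with uniform random design
rng = np.random.default_rng(1)
n = 2000
x = rng.uniform(0, 1, n)
y = np.maximum(x - 0.5, 0.0) + 0.1 * rng.standard_normal(n)
d_hat_nw = nw_threshold(x, y, h=0.1)
```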

Basic Assumptions
We adhere to the setup of Section 2, i.e., we assume the errors to be independent and homoscedastic and consider a fixed design for the regression setting. The smoothness of the function in the vicinity of d_0 plays a crucial role in determining the rate of convergence. Throughout this paper, we make the following assumptions.
1. Assumptions on µ: (a) µ has a "cusp" of order k (k ≥ 1) at d_0, i.e., the right derivatives of µ at d_0 satisfy µ^(j)(d_0+) = 0 for 1 ≤ j < k and µ^(k)(d_0+) > 0; (b) µ^(k) is continuous in a right neighborhood of d_0, and µ is Lipschitz continuous of order α_1.

2. The error ε possesses a continuous positive density on an interval.

3. Assumptions on the kernel K: (a) K is a symmetric probability density; (b) K is Lipschitz continuous of order α_2; (c) K is compactly supported, i.e., K(x) = 0 when |x| ≥ L_0, for some L_0 > 0.

As a consequence of these assumptions, µ and K are bounded, K̄₂ = ∫ K²(u) du < ∞ and E|W|^k < ∞, where W has density K. Also, both µ and K are Lipschitz continuous of order α = min(α_1, α_2). These facts are frequently used in the paper. Common kernels such as the Epanechnikov kernel and the triangular kernel conveniently satisfy the assumptions mentioned above. The results in the next section are developed for a γ ∈ (1/2, 1) (cf. Remark 1) and a known τ. It will be seen in Section 4 that τ can be estimated at a sufficiently fast rate; consequently, even if τ is unknown, appropriate estimates can be substituted in its place to construct the p-values that are instrumental to the methods of this paper, without changing the limit distributions. Without loss of generality, we take τ ≡ 0 in the next section, as one can work with the (Y_i − τ)'s in place of the Y_i's.
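As a quick numerical check, both kernels named above satisfy the stated kernel assumptions, and the constant K̄₂ = ∫K²(u)du, which appears in the asymptotic variance, is easily computed; the grid-based integration below is purely illustrative.

```python
import numpy as np

def epanechnikov(u):
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

def triangular(u):
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1, 1 - np.abs(u), 0.0)

du = 1e-4
grid = np.arange(-1.5, 1.5, du)

results = {}
for name, K in [("epanechnikov", epanechnikov), ("triangular", triangular)]:
    vals = K(grid)
    total = vals.sum() * du                 # should be ~1: K is a probability density
    symmetric = bool(np.allclose(vals, K(-grid)))
    compact = bool(np.all(vals[np.abs(grid) >= 1.0] == 0))  # here L_0 = 1
    K2bar = (vals**2).sum() * du            # K2bar = int K^2(u) du
    results[name] = (total, symmetric, compact, K2bar)
```

For the Epanechnikov kernel K̄₂ = 3/5, and for the triangular kernel K̄₂ = 2/3.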
Comparison with existing approaches. Under the above assumptions, d_0 is a 'change-point' in the k-th derivative of µ. Our procedure for estimating this change-point relies on the discrepancy of p-values, the construction of which requires a kernel-smoothed estimate (or, if one desires, a local polynomial estimate) of µ. As noted in the Introduction, the estimation of a change-point in the derivative of a regression function has been studied by a number of authors using kernel-based strategies. However, the approaches in these papers are quite different from ours and, more importantly, our problem cannot be solved by these methods without making stronger model assumptions than those above. In Müller (1992), the change-point is obtained by direct estimation of the k-th derivative (k corresponds to ν in that paper) on either side of the change-point via one-sided kernels and measuring the difference between these estimates. In contrast, our approach does not rely on derivative estimation. We use an ordinary kernel function to construct a smooth estimate of µ, which is required for the pointwise testing procedures that lead to the p-values. In fact, a consistent estimate that attains the same rate of convergence as our current estimate could have been constructed using a simple regressogram estimator with an appropriate bin-width, in contrast to the approach in Müller (1992), which uses a k-times differentiable kernel. Müller (1992) also assumes that the k-th derivative of the regression function is at least twice continuously differentiable at all points except d_0 – see pages 738–739 of that paper – which is stronger than our continuity assumption on µ^(k) (1(b) above). Cheng and Raimondo (2008) develop kernel methods for optimal estimation of the first derivative building on an idea by Goldenshluger, Tsybakov and Zeevi (2006), which is followed up in the context of dependent errors by Wishart and Kulik (2010) and Wishart (2009), but these papers do not consider the case k > 1. We also note that our method is fairly simple to implement.

Main Results
We state and prove results on the limiting behavior of the estimator obtained from Method 2. Results on the variants of the procedure discussed in Section 2.1 follow analogously and are stated without proofs in Section 3.3. We consider the model stated in (2.1) with homoscedastic errors and uniform fixed design, and study the limiting behavior of d̂_n, which minimizes (2.6). Recall that τ is taken to be zero without loss of generality.

Rate of convergence
We first consider the population equivalent of M_n, given here by M̃_n(d) = E{M_n(d)}, and study the behavior of its smallest argmin, d_n. For i = 1, . . ., n, let Z_{i,n} denote the normalized estimate √(nh_n) μ̂(i/n), let Σ_n(i/n) denote its standard deviation, and let Φ_{i,n} denote the distribution function of Z_{i,n}/Σ_n(i/n); also, let Z_0 be a standard normal random variable independent of the Z_{i,n}'s. For indices i with i/n to the left of d_0 and away from the boundary, the Φ_{i,n}'s and Σ_n(i/n)'s do not vary with i; we denote them by Φ̄_n and Σ̄_n for convenience. Using Corollary 1 and (A.1) from the Appendix, Σ̄_n converges to σ_0 √K̄₂. Also, for such i's, any η > 0 and sufficiently large n, the Lindeberg quantity is bounded by a term which converges to zero. Hence, by the Lindeberg–Feller CLT, Z_{i,n}/Σ̄_n, and consequently Φ̄_n, converge weakly to Φ. In fact, for any i, we can also show that Φ_{i,n} converges weakly to Φ. The following lemma (Lemma 1) provides the rate ν_n at which d_n approaches d_0. It can be shown, by arguments analogous to Mallik et al. (2011, pp. 898–900), that d_n is consistent for d_0 and corresponds to a local minimum of M̃_n for sufficiently large n, i.e., d_n satisfies the first order condition (3.2). By Pólya's theorem, Φ̄_n converges uniformly to Φ. Note that μ(x) = 0 for x < d_0 − L_0 h_n and Φ̄_n(0) converges to Φ(0) = 0.5 < γ. So, if d_n < d_0, then for (3.2) to hold, we must have d_n + 1/n + L_0 h_n > d_0 for large n, which gives the result. Also, when d_0 < d_n ≤ d_0 + L_0 h_n, the result automatically holds. So, it suffices to consider the case d_n > d_0 + L_0 h_n. By a change of variable, the relevant integral can be bounded from below on an interval where K is positive; such an interval exists due to the assumptions on K. As the resulting factor is bounded away from zero, we obtain d_n − d_0 = O(ν_n^{−1}), which yields the result.
As d̂_n is, in fact, estimating d_n, its rate of convergence to d_0 can be at most ν_n^{−1}. Fortunately, ν_n^{−1} turns out to be the exact rate of convergence of d̂_n.

Theorem 1. Let ν_n be as defined in Lemma 1. Then ν_n(d̂_n − d_0) = O_p(1).

The proof is given in Section A.1 of the Appendix. It involves coming up with an appropriate distance ρ_n based on the behavior of M̃_n near d_0 (Lemma 4) and then establishing a modulus of continuity bound for M_n − M̃_n with respect to ρ_n. As the summands that constitute M_n are dependent, the latter cannot be handled directly through VC or bracketing results (Theorems 2.14.1 or 2.14.2 of van der Vaart and Wellner (1996)); rather, we require a blocking argument followed by an application of Doob's inequality to the blocks.
The optimal rate is attained when h_n = h_0 n^{−1/(2k+1)}, in which case ν_n is of order n^{1/(2k+1)}. We now deduce the asymptotic distribution for this particular choice of bandwidth.

Asymptotic Distribution
With h_n = h_0 n^{−1/(2k+1)}, we study the limiting behavior of the normalized process Z_n(t) = h_n^{−1} {M_n(d_0 + th_n) − M_n(d_0)}, where M_n is defined in (2.6). The process Z_n(t) is minimized at h_n^{−1}(d̂_n − d_0). At the core of the process Z_n(t) lies the estimator μ̂ computed at the local points d_0 + th_n; accordingly, let W_n(t) = √(nh_n) μ̂(d_0 + th_n), t ∈ R, and let B_loc(R) denote the space of locally bounded functions on R, equipped with the topology of uniform convergence on compacta. We have the following lemma on the limiting behavior of W_n.
Lemma 2. There exists a Gaussian process W(t), t ∈ R, with almost surely continuous paths and drift m(t) such that W_n converges weakly to W in B_loc(R).

The proof is given in Section A.2 of the Appendix. For brevity, when x > y, ∫_x^y is interpreted as −∫_y^x.

Theorem 2. For t ∈ R, the process Z_n(t) converges weakly to the process Z(t) = ∫_0^t {Φ(W(y)) − γ} dy in B_loc(R).

Using arguments almost identical to those for proving Lemma 3 in the Appendix, Z_n(t) can be written as an integral of Φ(W_n(y)) − γ over [0, t] plus a remainder term I_n, with an O(1/(nh_n)) factor accounting for the boundary terms. Using the asymptotic equicontinuity of the process W_n, the remainder I_n converges in probability to zero uniformly in t over compact sets. Further, as the mapping w ↦ ∫_0^· {Φ(w(y)) − γ} dy is continuous on B_loc(R), the result follows from Lemma 2 and the continuous mapping theorem. This completes the proof. A conservative asymptotic CI for d_0 can be obtained using the following result.
Theorem 3. The process Z(t) goes to infinity almost surely (a.s.) as |t| → ∞. Moreover, let ξ^s_0 and ξ^l_0 denote the smallest and the largest minimizers of the process Z, and let c^s_{α/2} and c^l_{1−α/2} be the (α/2)-th and (1 − α/2)-th quantiles of ξ^s_0 and ξ^l_0, respectively. Then

lim inf_{n→∞} P[c^s_{α/2} ≤ h_n^{−1}(d̂_n − d_0) ≤ c^l_{1−α/2}] ≥ 1 − α.

Note that ξ^s_0 and ξ^l_0 are indeed well defined by continuity of the sample paths of Z and the fact that Z(t) goes to infinity as |t| → ∞. They are also Borel measurable: for ξ^s_0, say, the event [ξ^s_0 ≤ a] coincides with the measurable event that the infimum of Z over (−∞, a] is at most its infimum over (a, ∞). The proof of the result is given in Section A.3 of the Appendix.
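The quantiles in Theorem 3 can be approximated by simulation. The sketch below discretizes a Gaussian process whose drift and covariance have the form suggested by the kernel-averaging structure of W_n – a drift proportional to ∫(t − u)_+^k K(u) du and a stationary covariance σ²_0 ∫K(v)K(v + t − s) dv – with all nuisance parameters (σ_0, h_0, k, µ^(k)(d_0+)) set to illustrative values; the exact drift and covariance are those of Lemma 2, not reproduced here, so this is an assumed form for illustration only.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
gamma, sigma0, h0, k, a = 0.75, 1.0, 1.0, 1, 1.0   # illustrative nuisance values
K = lambda u: np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

dt = 0.05
t = np.arange(-6.0, 6.0 + dt / 2, dt)              # grid for the limit process
i0 = int(np.argmin(np.abs(t)))                     # index of t = 0
du = 1e-3
u = np.arange(-1.0, 1.0, du)                       # support of K (L_0 = 1 here)

# assumed drift: m(t) ~ h0^{(2k+1)/2} (a / k!) int (t - u)_+^k K(u) du
m = np.array([h0 ** ((2 * k + 1) / 2) * a *
              ((np.maximum(ti - u, 0.0) ** k) * K(u)).sum() * du for ti in t])

# assumed stationary covariance: sigma0^2 int K(v) K(v + lag) dv
lags = np.arange(len(t)) * dt
kk = np.array([(K(u) * K(u + L)).sum() * du for L in lags])
C = sigma0**2 * kk[np.abs(np.subtract.outer(np.arange(len(t)), np.arange(len(t))))]
L_chol = np.linalg.cholesky(C + 1e-8 * np.eye(len(t)))

smallest, largest = [], []
for _ in range(500):
    W = m + L_chol @ rng.standard_normal(len(t))   # one path of the Gaussian process
    g = norm.cdf(W) - gamma
    cs = np.cumsum(g)
    Z = dt * (cs - cs[i0])                         # Z(t) = int_0^t {Phi(W(y)) - gamma} dy
    idx = np.flatnonzero(np.isclose(Z, Z.min()))
    smallest.append(t[idx[0]]); largest.append(t[idx[-1]])

c_s = np.quantile(smallest, 0.025)                 # (alpha/2)-quantile of smallest argmin
c_l = np.quantile(largest, 0.975)                  # (1 - alpha/2)-quantile of largest argmin
```

Per Theorem 3, a conservative 95% CI for d_0 is then [d̂_n − h_n c_l, d̂_n − h_n c_s].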
A minimum of the underlying limiting process lies in the set {y : Φ(W(y)) = γ}. As any fixed number has probability zero of being in this set, the distributions of ξ^s_0 and ξ^l_0 are continuous. The process {W(y) : y ∈ R} has zero drift for y < −L_0 and is therefore stationary to the left of −L_0. Hence, it must cross the level Φ^{−1}(γ) infinitely often, implying that Z has multiple local extrema. On the other hand, simulations strongly suggest that Z has a unique argmin, though a theoretical justification appears intractable at this point. The issue of the uniqueness of the argmin of a stochastic process has mostly been addressed in the context of Gaussian processes (Lifshits, 1982; Kim and Pollard, 1990; Ferger, 1999), certain transforms of compound Poisson processes (Ermakov, 1976; Pflug, 1983) and set-indexed Brownian motion (Müller and Song, 1996). These techniques do not apply to our setting; in fact, an analytical justification of the uniqueness of the minimizer of Z appears non-trivial. As the simulations provide strong evidence in support of a unique argmin, we use the following result for constructing CI's in practice.
Theorem 4. Assuming that the process Z has a unique argmin, h_n^{−1}(d̂_n − d_0) converges in distribution to that argmin.

Note that when the argmin is unique, Theorem 3 and Theorem 4 yield the same CI. The proof of Theorem 4 is a direct application of the argmin (argmax) continuous mapping theorem; see Kim and Pollard (1990, Theorem 2.7) or van der Vaart and Wellner (1996, Theorem 3.2.2).

Limit distributions for variants of the procedure
The rates of convergence and asymptotic distributions can be obtained similarly for the variants of the procedure discussed in Section 2.1. In what follows, we state the limiting distributions, without proofs, for some of these variants.
Results analogous to Theorem 3 can be shown to hold in the setting with heteroscedastic errors, i.e., Var(ε_i) = σ²_0(i/n), where σ²_0(·) is a positive continuous function. The process Z has the same form as in Theorem 3, except that the σ²_0 involved in the covariance kernel of the process W that appears in the definition of Z is replaced by σ²_0(d_0). When normalized p-values are used to estimate d_0, we have the following form for the limiting distribution.
Proposition 1. Consider the setting with homoscedastic errors and covariates sampled from the fixed uniform design, as discussed in Section 2. Let d̂1_n denote the estimate obtained from Method 1 by minimizing M̃_n defined in (2.5), let h_n = h_0 n^{−1/(2k+1)}, and let W_1(t), t ∈ R, be the Gaussian process obtained from W by normalizing its drift and covariance by σ_0 √K̄₂, with Z_1(t) = ∫_0^t {Φ(W_1(y)) − γ} dy. If σ̂_0 is a √n-consistent estimate of σ_0 and Z_1(t) possesses a unique argmin a.s., then h_n^{−1}(d̂1_n − d_0) converges in distribution to that argmin.

When the covariate is sampled from a random design with heteroscedastic errors, the result extends as follows for the estimate based on non-normalized p-values.
Proposition 2. Consider the setting with covariates sampled from a random design and heteroscedastic errors, as discussed in Section 2.1. The variance function σ²_0(x) = Var(ε | X = x) is assumed to be continuous and positive. Let h_n = h_0 n^{−1/(2k+1)} and let W̃(t), t ∈ R, be a Gaussian process with the same drift as W but with σ²_0(d_0)/f(d_0) in place of σ²_0 in its covariance function, and let Z̃(t) = ∫_0^t {Φ(W̃(y)) − γ} dy. With d̂_n defined as in (2.7) and Z̃ possessing a unique argmin a.s., h_n^{−1}(d̂_n − d_0) converges in distribution to that argmin.

The case of an unknown τ
Although most of the results have been deduced under the assumption of a known τ, in real applications τ is generally not known. In this situation, one would need to impute an estimate of τ into the objective function to carry out the procedure. It can be shown that the limit distribution does not change as long as we have a √n-consistent estimate of τ. Quite a few choices are possible for estimating τ. If d_0 can be safely assumed to be larger than some η, then simple averaging of the observations below η yields a √n-consistent estimate of τ. If a proper choice of η is not available, one can obtain an initial (consistent) estimate of τ using the method proposed in Section 2.4 of Mallik et al. (2011) (see (5.1)), compute d̂_n, and then average the responses from, say, [0, c d̂_n], c ∈ (0, 1), to obtain a √n-consistent estimate of τ. This leads to an iterative procedure which we discuss in complete detail in Section 5.

Simulations
We consider three choices for the underlying regression function: µ_1 (linear), µ_2 (quadratic) and µ_3 (sinusoidal). All these functions are at their baseline value 0 up to d_0 = 0.5. The functions µ_1 and µ_2 both rise to 1, while µ_3 exhibits non-isotonic sinusoidal behavior after rising at d_0. The right derivative at d_0, a factor that appears in the limiting process Z, is the same for µ_1 and µ_3. The functions are plotted in the upper left panel of Figure 1. The functions µ_1 and µ_2 are paired with normally distributed errors having mean 0 and standard deviation σ_0 = 0.1, while the noise added to µ_3 is from a t-distribution with 5 degrees of freedom, scaled to have standard deviation σ_0. The three models, µ_1 with normal errors, µ_2 with normal errors and µ_3 with t-distributed errors, are referred to by the names of their regression functions only. We work with γ = 3/4, as extreme values of γ (close to 0.5 or 1) tend to cause instabilities.
We construct the estimate of d_0 using the normalized p-values, as they exhibit better finite sample performance, and study the coverage performance of the approximate CI's obtained from the limiting distributions with estimated nuisance parameters. The error variance σ²_0 is estimated in a straightforward manner using σ̂² = (1/n) Σ_i {Y_i − μ̂(i/n)}². More sophisticated estimates of the error variance are available (Gasser, Sroka and Jennen-Steinmetz, 1986; Hall, Kay and Titterington, 1990), but we avoid them for the sake of simplicity. We use the Epanechnikov kernel for constructing the estimate of µ. For moderate samples, the bad behavior of kernel estimates near the boundary affects the coverage performance. In order to correct for this, we only consider the terms with i/n between h_n and 1 − h_n in our objective function. The asymptotic distribution of the minimizer of this restricted criterion function still has the same form as in Proposition 1. A good choice of h_0 in the optimal bandwidth h_n = h_0 n^{−1/(2k+1)} can be obtained by minimizing the MSE of μ̂(d_0). Standard calculations show that the MSE is minimized at h_n = h_0^{opt} n^{−1/(2k+1)}, where h_0^{opt} is an explicit constant depending on σ_0, K̄₂, µ^(k)(d_0+) and the moments of W, W having density K.
This bandwidth goes to 0 at the right rate needed for estimating d_0. Moreover, efficient estimation of µ in the vicinity of d_0 is likely to aid in estimating d_0. Hence, we advocate the use of this choice of h_0 for our procedure.
With the above mentioned choice of h_0, we compare the distribution of h_n^{−1}(d̂_n − d_0) for n = 1000 data points over 2000 replications with the deduced asymptotic distribution. As τ is assumed unknown, we implement an iterative scheme. We obtain an initial estimate τ̂_init of τ using the method prescribed in Mallik et al. (2011); see (5.1). This estimate of τ, based on h_0^{opt}, is used to compute d̂_n. We re-estimate τ by averaging the responses for which i/n ∈ [0, 0.9 d̂_n] and proceed thus. The Q-Q plots, shown in Figure 1, exhibit considerable agreement between the two distributions. Next, we explore the coverage performance of the CI's constructed by imputing estimates of the nuisance parameters in the limiting distribution. Computing h_0 requires knowledge of the k-th derivative of µ at d_0, which we also need to generate from the limit process. An estimate of µ^(k)(d_0+) can be obtained by fitting a k-th power of the covariate to the right of d̂_n, over a window of width b_n, where b_n ↓ 0 and nb_n^{2k+1} → ∞. For the optimal h_n, this provides a good estimate of ξ_0.
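One pass of this iterative scheme can be sketched as follows. For simplicity, a left-segment average stands in for the initial estimate (5.1), and the bandwidth is held fixed rather than updated via h_0^{opt} – both are simplifying assumptions, as are the model and parameter values in the demonstration.

```python
import numpy as np
from scipy.stats import norm

def iterate_tau_d(y, h, gamma=0.75, n_iter=7, frac=0.9):
    """Alternate between estimating tau (average of responses on [0, frac * d_hat])
    and re-estimating d_0 by Method 2. Initial tau from the first 10% of the
    design, a stand-in for the (5.1) estimate."""
    n = len(y)
    x = np.arange(1, n + 1) / n
    K = lambda u: np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)
    mu_hat = np.array([(K((d - x) / h) * y).sum() / (n * h) for d in x])
    interior = (x >= h) & (x <= 1 - h)          # restricted criterion, cf. Section 5
    tau = y[x <= 0.1].mean()
    for _ in range(n_iter):
        M = np.cumsum(norm.cdf(np.sqrt(n * h) * (mu_hat - tau)) - gamma) / n
        d_hat = x[interior][np.argmin(M[interior])]
        tau = y[x <= frac * d_hat].mean()       # re-estimate tau left of 0.9 * d_hat
    return tau, d_hat

# illustrative model: baseline tau = -0.35, linear rise after d_0 = 0.5
rng = np.random.default_rng(3)
n = 2000
x = np.arange(1, n + 1) / n
y = -0.35 + np.maximum(x - 0.5, 0.0) + 0.1 * rng.standard_normal(n)
tau_hat, d_hat = iterate_tau_d(y, h=0.1)
```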
We include this in our iterative method, where we start with an arbitrary choice of h_0 and compute τ̂_init. We use τ̂_init to compute d̂_n and μ̂^(k)(d_0+). The parameter μ̂^(k)(d_0+) is estimated using a reasonably wide smoothing bandwidth b_n = 5(n/log n)^{−1/(2k+1)}. These initial estimates are used to compute the next level estimate of h_0 using the expression for h_0^{opt}. We re-estimate τ by averaging the responses for which i/n ∈ [0, 0.9 d̂_n] and proceed thus. On average, the estimates stabilize within 7 iterations. The coverage performance over 5000 replications is given in Table 1. The approximate CI's mostly exhibit overcoverage for moderate sample sizes for µ_1 and µ_3 but converge to the desired confidence levels for large n. Also, the limiting distribution is the same under models µ_1 and µ_3, which is evident from the coverages and the lengths of the CI's for large n.

Table 1: Coverage probabilities and lengths of the CI's (in parentheses) using the true parameters (T) and the estimated parameters (E) for different sample sizes under µ_1, µ_2 and µ_3.

Dependent data
We briefly discuss the extension of Method 2 to dependent data in this section. Our problem is relevant to applications from time series models (see Section 7) where it is not reasonable to assume that the errors ε_i are independent. A data generating model of the form (2.1) can be assumed here, with the exception that the errors now arise from a stationary sequence {. . ., ε_{−1}, ε_0, ε_1, . . .} and exhibit short-range dependence in the sense of Robinson (1997). As with (2.1), the dependence of the Y_i's and ε_i's on n is suppressed, but they must be viewed as triangular arrays. The extension to this setting works along the following lines. The estimate of µ with dependent errors still has the same form as (2.2).
With additional assumptions (Assumptions 1–5 of Robinson (1997)), it is guaranteed that √(nh_n)(μ̂(x_i) − µ(x_i)), for fixed points x_1, x_2 ∈ (0, 1) with x_1 ≠ x_2, converge jointly in distribution to independent normals with zero mean – a fact that justifies the consistency of our p-value based estimates in this setting (Mallik et al., 2011). Hence, d̂_n, defined using (2.6), can still be used to estimate the threshold. The limiting distribution is of the same form as in Lemma 2 but with a different scaling factor appearing in the covariance function of the process W. We outline the form of the limiting distribution below. The technical details are more involved in the sense of tedium, but the approach to deriving the limiting distribution remains the same at the conceptual level.
To precisely state the limiting distribution, let ρ(i, j) = ρ(i − j) denote the covariance between ε_i and ε_j, and let ψ denote the underlying spectral density, defined through the relation ρ(j) = ∫_{−π}^{π} e^{ijλ} ψ(λ) dλ. Let W̃ be a Gaussian process with drift m(·) (defined in Lemma 2) and covariance function obtained from that of W by replacing σ²_0 with 2πψ(0). It is not uncommon for the spectral density at zero, ψ(0) = (2π)^{−1} Σ_{j∈Z} ρ(j), to appear in settings with short-range dependence (Robinson, 1997; Anevski and Hössjer, 2006).
Proposition 3. Consider the setup of Method 2 with the errors now exhibiting short-range dependence as discussed above. Assume that, for h_n = h_0 n^{−1/(2k+1)}, the resulting estimate d̂_n obtained using (2.6) satisfies h_n^{−1}(d̂_n − d_0) = O_p(1), and that the process Z̃(t) = ∫_0^t {Φ(W̃(y)) − γ} dy has a unique argmin a.s. Then h_n^{−1}(d̂_n − d_0) converges in distribution to that argmin.

The proof is outlined in Section A.4 of the Appendix. An illustration of the above phenomenon is given through a Q-Q plot (Figure 2), where we generate the ε_i's from an AR(1) model ε_i = 0.25 ε_{i−1} + z_i. Here, the z_i's are mean 0 normal random variables with variance 0.0094, so that the ε_i's have variance (0.1)². The Q-Q plot shows considerable agreement between the empirical quantiles, obtained from samples of size n = 1000, and the theoretical quantiles.
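For the AR(1) illustration, both the marginal variance and the long-run variance 2πψ(0) = Σ_j ρ(j) have closed forms, namely σ_z²/(1 − φ²) and σ_z²/(1 − φ)², which a quick simulation confirms; the batch-means estimator below is one standard way to estimate 2πψ(0) from data (an illustrative choice, not the estimator used in the paper).

```python
import numpy as np

rng = np.random.default_rng(4)
n, phi, var_z = 200_000, 0.25, 0.0094
z = np.sqrt(var_z) * rng.standard_normal(n)

eps = np.empty(n)
eps[0] = z[0] / np.sqrt(1 - phi**2)        # start the chain in stationarity
for i in range(1, n):
    eps[i] = phi * eps[i - 1] + z[i]       # AR(1): eps_i = 0.25 eps_{i-1} + z_i

# marginal variance: var_z / (1 - phi^2) = 0.0094 / 0.9375 ~ (0.1)^2
marginal_theory = var_z / (1 - phi**2)

# long-run variance 2*pi*psi(0) = sum_j rho(j) = var_z / (1 - phi)^2 for AR(1)
lrv_theory = var_z / (1 - phi)**2

# batch-means estimate of the long-run variance
b = 500
batches = eps[: (n // b) * b].reshape(-1, b).mean(axis=1)
lrv_hat = b * batches.var()
```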

Data Analysis
We now apply our procedure to two interesting examples from Mallik et al. (2011).
The first data set involves measuring the concentration of mercury in the atmosphere through a LIDAR experiment. There are 221 observations, with the predictor variable 'range' varying from 390 to 720, and there is visible evidence of heteroscedasticity. The observed covariates can be considered to have arisen from a random design, and the threshold d_0 corresponds to the distance at which there is a sudden rise in the concentration of mercury. See pages 2 and 10 of Mallik et al. (2011) for more details.
We employ a variant of Method 2 based on the Nadaraya–Watson estimator, without normalizing by the estimate of the variance. It is reasonable to assume here that the function is at its baseline till the range value 480. The estimate of τ is obtained by taking the average of the observations until range reaches 480, which gives τ̂ = −0.0523. The estimate d̂_n is obtained through the iterative approach described in Section 5. The expression for the approximate bias of the Nadaraya–Watson estimator turns out to be the same as that for the fixed design kernel estimator at d_0, while the approximate variance turns out to be (σ²_0 K̄₂)/(nh_n f(d_0)), and the optimal value of h_0 is adjusted accordingly. The limiting distribution, as well as the optimal h_0, involves the parameter σ_0(d_0), which we estimate using σ̂(d̂_n), where σ̂²(x) is a kernel-based estimate of the conditional variance at x. The estimate d̂_n has an inherent bias, which is a recurring feature in boundary estimation problems. A simple but effective way to reduce this bias is to subtract the median of the limiting distribution with imputed parameters, say q̂_{0.5}, from our crude estimate, after proper normalization (so that the limiting median is zero). More precisely, d̂_n − n^{−1/(2k+1)} q̂_{0.5} is our final estimate. Taking k to be 1, the resulting estimate of d_0 is 551.05, which appears reasonable (see Figure 1 of Mallik et al. (2011)). Moreover, the CI's are [550.53, 555.17] and [549.75, 557.82] for confidence levels of 90% and 95%, respectively, which also seem reasonable.
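The median bias correction described above amounts to the following small helper. The draws from the limiting distribution (with imputed parameters) are assumed to have been obtained by simulation, and, following the text, the h_0-dependence is taken as absorbed into the simulated process so that the scaling is n^{−1/(2k+1)}; the numbers below are purely hypothetical.

```python
import numpy as np

def bias_correct(d_hat, n, k, argmin_draws):
    """Subtract the estimated median of the limit distribution, scaled back to
    the covariate scale, so the corrected estimate has limiting median zero."""
    q50 = np.quantile(np.asarray(argmin_draws, dtype=float), 0.5)
    return d_hat - n ** (-1.0 / (2 * k + 1)) * q50

# toy illustration with hypothetical simulated minimizers of the limit process
draws = np.array([0.0, 0.5, 1.0])
d_corrected = bias_correct(0.55, 1000, 1, draws)   # 0.55 - 1000^{-1/3} * 0.5
```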
Our second data set, which comes from the last example in the Introduction of Mallik et al. (2011) (see pages 3 and 10 of that paper for more details), involves the measurement of annual global temperature anomalies, in degrees Celsius, over the years 1850 to 2009. The depiction of the data (see Figure 1 of Mallik et al. (2011)) suggests a trend function which stays at its baseline value for a while, followed by a nondecreasing trend. We follow the approach of Wu, Woodroofe and Mentz (2001) and Zhao and Woodroofe (2012), and model the data as having a non-parametric trend function and short-range dependent errors. The flat stretch at the beginning is also noted in Zhao and Woodroofe (2012), where isotonic estimation procedures are considered in settings with dependent data. They also provide evidence for the errors arising from a lower order auto-regressive process. A comprehensive approach would incorporate a cyclical component as well (Schlesinger and Ramankutty, 1994), which we do not pursue in our paper.
The estimate of the baseline value, obtained by averaging the anomalies up to the year 1875, is τ̂ = −0.3540. Using this estimate of τ, we employ Method 2 with non-normalized p-values (see (2.6)), with the optimal h_0 chosen through an iterative approach. Constructing the CI involves estimating an extra parameter ψ(0), for which we use the estimates computed in Wu, Woodroofe and Mentz (2001, pp. 800) (the parameter σ² estimated in that paper is precisely 2πψ(0)). Taking k to be 1, the estimate of the threshold d_0 after bias correction, which signifies the advent of global warming, turns out to be 1912. The CI's are [1908, 1917] and [1906, 1919] for confidence levels 90% and 95%, respectively. This is compatible with the observation on page 2 of Zhao and Woodroofe (2012) that global warming does not appear to have begun until 1915.

Conclusion
We conclude with a discussion of some open problems that provide avenues for further investigation.
Adaptivity. In this paper we have provided a comprehensive treatment of the asymptotics of a p-value based procedure for estimating the threshold d_0 at which an unknown regression function µ takes off from its baseline value, with the aim of constructing CI's for d_0. We have assumed knowledge of the order of the 'cusp' of µ at d_0, which is needed to achieve the optimal rate of convergence (and construct the corresponding CI's), though not for consistency. When k is unknown, ideas from multiscale testing procedures for white noise models (Dümbgen and Spokoiny, 2001; Dümbgen and Walther, 2008) can conceivably be used to develop adaptive procedures in our model. This is a hard open problem and will be a topic of future research.
Resampling. A natural alternative to using the limit distribution (with estimated nuisance parameters) to construct CI's for d_0 would be to use bootstrap/resampling methods. Drawing from results obtained in similar change-point and non-standard problems (see, e.g., Sen, Banerjee and Woodroofe (2010); Seijo and Sen (2011)), it is very likely that the usual bootstrap method will be inconsistent in our setup. However, model based bootstrap procedures have recently been studied in the change-point context and have been shown to work (Seijo and Sen, 2011). Similar ideas may work for our problem as well, but a thorough understanding of such bootstrap procedures is beyond the scope of the present paper. Subsampling can be shown to be consistent in our setting, but its finite sample performance was rather dismal.
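To make the subsampling remark concrete, here is a hedged sketch of how a subsampling-based CI for d_0 would be assembled; `estimator` is any user-supplied threshold estimator and b is the subsample size (b → ∞, b/n → 0). This is the generic recipe, not a recommendation, given the poor finite sample behavior noted above.

```python
import random

def subsample_ci(data, estimator, b, k, n_rep=500, alpha=0.05, seed=0):
    """Generic subsampling CI using the rate n^{1/(2k+1)}.
    `data` is a list of observations and `estimator` maps a dataset to a
    point estimate of the threshold; both are hypothetical stand-ins."""
    rng = random.Random(seed)
    n = len(data)
    rate_n = n ** (1.0 / (2 * k + 1))
    rate_b = b ** (1.0 / (2 * k + 1))
    d_full = estimator(data)
    # subsampling distribution of the normalized estimate
    roots = sorted(rate_b * (estimator(rng.sample(data, b)) - d_full)
                   for _ in range(n_rep))
    lo = roots[int((1 - alpha / 2) * n_rep) - 1]   # upper quantile
    hi = roots[int((alpha / 2) * n_rep)]           # lower quantile
    return (d_full - lo / rate_n, d_full - hi / rate_n)
```

In the regression setting, `data` would be (covariate, response) pairs and `estimator` the p-value based procedure itself; the sketch only displays the quantile inversion.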
Behavior near the boundary. For our simulations, we concentrated on the case d_0 = 0.5. The estimate also performs well (in terms of MSE) in settings where d_0 is close to the boundary, as long as there are sufficiently many observations on either side of d_0 (see Section 2.3 of the Supplementary material of Mallik et al. (2011)). We do not address the case where d_0 is exactly at the boundary, e.g., d_0 = 0. This leads to a testing problem (flat stretch vs. no flat stretch) which goes beyond the scope of our discussion. However, we point out that d̂_n would remain consistent even when d_0 = 0 or 1, regardless of the bad behavior of the kernel estimates near the boundary.
Minimaxity. The estimators studied in our paper attain the convergence rate of n^{−1/(2k+1)}. This leads to a natural question as to whether this is the best possible rate of convergence. When µ is monotone increasing, d_0 is precisely µ^{−1}(τ), where µ^{−1} is the right continuous inverse of µ. Wright (1981) (Theorem 1) shows that the rate of convergence of the isotonic least squares estimate of µ at a point x_0, where the first k − 1 derivatives vanish but the k-th does not, is precisely n^{−k/(2k+1)}. A slightly more general result establishing a process convergence is stated in Fact 1 of Banerjee (2009). Using this in conjunction with the techniques in the proof of Theorem 1 of Banerjee and Wellner (2005), it can be deduced that the rate of convergence of the isotonic estimate of µ^{−1} at µ(x_0) is n^{−1/(2k+1)}, which matches the rate attained by our approach. Hence, we expect this rate to be minimax in our setting. We note that this rate is not the same as the faster rate min(n^{−2/(2k+3)}, n^{−1/(2k+1)}) obtained in Neumann (1997) for a change-point estimation problem in a density deconvolution model, and also observed in the convolution white noise models of Goldenshluger, Tsybakov and Zeevi (2006) and Goldenshluger et al. (2008). These models are related to our setting; e.g., Problem 1 in Goldenshluger et al. (2008) is a Gaussian white noise model where the underlying regression function also has a cusp of a known order at an unknown point of interest. The convolution white noise model considered in Goldenshluger, Tsybakov and Zeevi (2006) (Problem 2 in Goldenshluger et al. (2008)) is equivalent to this problem for a particular choice of the convolution operator; see Goldenshluger, Tsybakov and Zeevi (2006, pp. 352-353) and Goldenshluger et al. (2008, pp. 790-791) for more details. Besides these being white noise models, they differ from our setting through an additional smoothness condition (Goldenshluger, Tsybakov and Zeevi, 2006, pp. 354-355), which translates, in our setting, to assuming that µ^{(k)} is Lipschitz outside any neighborhood of d_0, an assumption not made in this paper. Hence, Neumann's rate need not be minimax for our setting. The faster rate of Neumann (1997) was also observed for k = 1 in Cheng and Raimondo (2008), but once again under the assumption that the derivative of the regression function is at least twice differentiable away from the change-point, again an assumption not made in this paper.
and therefore, the above display is further bounded (up to a positive constant multiple) by the displayed expression. Here, the final bound does not depend on x, and thus we get the desired result.
Note that the above result holds for generic functions µ and K satisfying Assumptions 1(a), 4(c) and 4(d). Letting µ(x) ≡ σ_0² and substituting K² for K, we get:

A.1. Proof of Theorem 1
To prove Theorem 1, we use Theorem 3.2.5 of van der Vaart and Wellner (1996) (see also Theorem 3.4.1), which requires coming up with a non-negative map ρ_n. A bound φ_n on the modulus of continuity with respect to ρ_n is then needed, where the map δ → φ_n(δ)/δ^α is decreasing for some α < 2. The rate of convergence is then governed by the behavior of φ_n. We start with the following choice for ρ_n.
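For the reader's convenience, the generic rate statement being invoked can be sketched as follows; this paraphrases Theorem 3.2.5 of van der Vaart and Wellner (1996) in the present notation, with M_n denoting the criterion process whose minimizer is the estimate:

```latex
\mathrm{E}^* \sup_{\rho_n(d,\, d_n) < \delta}
  \bigl| (M_n - M)(d) - (M_n - M)(d_n) \bigr|
  \;\lesssim\; \frac{\phi_n(\delta)}{\sqrt{n}},
\qquad
\delta \mapsto \frac{\phi_n(\delta)}{\delta^{\alpha}}
  \ \text{decreasing for some } \alpha < 2,
```

and then any sequence r_n with

```latex
r_n^2 \, \phi_n(1/r_n) \le \sqrt{n} \quad \text{for every } n
\qquad\Longrightarrow\qquad
r_n \, \rho_n(\widehat d_n, d_n) = O_p(1).
```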
for some K_1 > 0. Then K_1 and κ > 0 can be chosen such that the stated bounds hold for sufficiently large n. We first provide the proof of Theorem 1 using Lemma 1. By the above lemma, there exists A < ∞ such that, for sufficiently large n and any δ > 0, the displayed bound holds, where μ̄ is defined in (3.1). Note that E{U(i, d)} = 0 and, for 1 ≤ i, j ≤ n, U(i, d) and U(j, d) are independent whenever |i − j| ≥ 2L_0 nh_n. Let j_1^i = i and j_l^i = j_{l−1}^i + 2L_0 nh_n. Then the displayed bound follows. This bound can also be shown to hold when d ≤ d_n. Also, φ_n(·) and ρ_n(·, d_n) satisfy the conditions of Theorem 3.2.5 of van der Vaart and Wellner (1996). Hence, the rate of convergence, say r_n, satisfies the displayed relation. Note that r_n² = ν_n satisfies this relation, and the conclusion follows. Proof of Lemma 4. Since μ̄(x) = 0 for x < d_0 − L_0 h_n, note that d_n > d_0 − L_0 h_n for sufficiently large n. As Φ_{i,n}(0) converges to 1/2 uniformly in i, it can be seen that the displayed bound holds for large n, for sufficiently large n and some K_0 > 0. Using (3.2), note that the relevant quantity is bounded away from zero. To show this, note that Lemma 3 yields the displayed expression. Choose κ > 0 such that µ is non-decreasing on (d_0, d_0 + 3κ). For sufficiently large n, d_n + η/ν_n + L_0 h_n < d_0 + 3κ, and hence the integrand in the above display is non-negative. With L_1 such that K_min = inf{K(x) : x ∈ [−L_1, L_1]} > 0, the above display is bounded from below as displayed. Using Lemma 1, (d_n − d_0) is O(1/ν_n), and hence the display is further bounded from below. As √(nh_n)/ν_n^k ≥ 1, (A.3) holds. Further, as the kernel K(u) is non-increasing in |u|, μ̄ is non-decreasing over the relevant range. Using Lemma 1, there exists A_0 < ∞ such that, for sufficiently large n, ν_n|d_0 − d_n| ≤ A_0, and hence the bound on ρ_n(d, d_n) follows. Letting K_1 = (1/2) min(γ − 1/2, K_0) and using (A.2) and (A.4), we get the desired result.

A.2. Proof of Lemma 2
In order to prove Lemma 2, we first justify a few auxiliary results required to prove the tightness of W_n. Proof. As the kernel K is Lipschitz of order α > 1/2, there exists a constant for which the displayed bound holds. Since α > 1/2, the result is a consequence of Theorem 12.3 of Billingsley (1968, p. 95).
We use a version of the Arzela-Ascoli theorem to prove the next result and thus we state it below for convenience.
Theorem 5 (Arzelà-Ascoli). Let f_n be a sequence of continuous functions defined on a compact set [a, b] such that f_n converges pointwise to f and the sequence {f_n} is equicontinuous. Then f_n converges to f uniformly on [a, b].
Lemma 6. The sequence of functions √(nh_n) μ̄(d_0 + th_n) converges to m(t), uniformly over compact sets in R.
Proof. The pointwise convergence is evident from Lemma 2. To justify the uniform convergence, let z_n(x, t) = (1/h_n) µ(x) K((d_0 − x)/h_n + t). By arguments similar to those for Lemma 3, the displayed bound holds. As this bound does not depend on t and α > 1/2, for s, t ∈ [−T, T] and δ > 0, we have the displayed inequality, where ζ_u is some intermediate point between d_0 and d_0 + uh_n. The k-th derivative of µ is bounded on (d_0, d_0 + (L_0 + T)h_n) for sufficiently large n, and h_n^k √(nh_n) equals h_0^{k+1/2}. As K is uniformly continuous, the above display goes to zero as δ → 0 by the DCT. Hence, by the Arzelà-Ascoli theorem, we get the desired result.
We now continue with the proof of Lemma 2. For (a_i, t_i) ∈ R², i = 1, . . . , l, the quadratic form Σ_{i,j} a_i a_j Cov(W(t_i), W(t_j)) is non-negative. Hence, the defined covariance function is non-negative definite and, by Kolmogorov consistency, the Gaussian process W exists.
Let r(h) = ∫K(h + u)K(u) du / ∫K² denote the correlation function of W. For W to have a continuous modification, by Hunt's theorem (e.g., see Cramér and Leadbetter (1967, pp. 169-171)), it suffices to show that r(h) = 1 − O(|log h|^{−δ}) for some δ > 3 as h → 0. Note that the kernel K is Lipschitz continuous of order α, and hence the displayed bound holds. Thus, W has a continuous modification. Next, we justify weak convergence of the process W_n to W.
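The correlation function r(h) above is easy to compute numerically for a concrete kernel. The following sketch, with the Epanechnikov kernel as an illustrative choice satisfying the Lipschitz condition, checks that r(0) = 1 and that r vanishes once the supports of K(· + h) and K separate:

```python
def K(u):
    """Epanechnikov kernel: compact support [-1, 1], Lipschitz of order 1."""
    return 0.75 * (1.0 - u * u) if -1.0 <= u <= 1.0 else 0.0

def r(h, grid=20000):
    """Riemann-sum approximation of r(h) = ∫K(h+u)K(u)du / ∫K(u)^2 du."""
    lo, hi = -2.0, 2.0
    du = (hi - lo) / grid
    num = sum(K(h + lo + i * du) * K(lo + i * du) for i in range(grid)) * du
    den = sum(K(lo + i * du) * K(lo + i * du) for i in range(grid)) * du
    return num / den
```

For this kernel, 1 − r(h) decays polynomially in h near zero, comfortably within the logarithmic requirement of Hunt's theorem.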
As a consequence of Lemmas 5 and 6, the process W_n is asymptotically tight. To justify finite dimensional convergence, it suffices to show (A.6), where t_1, t_2 ∈ R. Let x_j = d_0 + t_j h_n, j = 1 and 2. Then the displayed computation holds; the last step follows from the DCT, as the k-th derivative of µ is bounded in a right neighborhood of d_0, and, by a change of variable, the displayed identity holds. Hence, the Lindeberg-Feller condition is satisfied for the ε̄_n(x_j)'s and, by the Cramér-Wold device, (A.6) holds. This justifies the finite dimensional convergence, and hence we have the result.

A.3. Proof of Theorem 3
In order to prove Theorem 3, an ergodic theorem and Borell's inequality will be useful; we state them below for convenience. For proofs of these two results see, for example, Cramér and Leadbetter (1967, p. 147) and Adler and Taylor (2007, pp. 49-53), respectively. We also use Theorem 3 from Ferger (2004), which we state below as well.
Theorem 6. Consider a real continuous second order stationary process ξ(t) with mean 0 and correlation function R(t). If, for any a > 0, the displayed condition holds, then ξ satisfies the law of large numbers, i.e., T^{−1} ∫_0^T ξ(t) dt converges a.s. to zero as T → ∞.
Theorem 7 (Borell's inequality). Let ξ be a centered Gaussian process, a.s. bounded on a set I. Then E{sup_{u∈I} ξ(u)} < ∞ and, for all x > 0, the displayed tail bound holds, where σ_I² = sup_{u∈I} Var{ξ(u)}.
Theorem 8 (Ferger (2004)). Let V_n, n ≥ 0, be stochastic processes in D(R), defined on a common probability space (Ω, A, P). Let ξ_n be a Borel-measurable minimizer of V_n. Suppose that: (i) V_n converges weakly to V_0; (ii) the trajectories of V_0 almost surely possess a smallest and a largest minimizer, ξ_0^s and ξ_0^l respectively, which are Borel measurable; (iii) the sequence ξ_n is uniformly tight.
Then, for every x ∈ X, the displayed convergence holds. Here, X = {x ∈ R : P[V_0 is continuous at x] = 1}.
We now continue with the proof of Theorem 3. Let W_0 be defined as in the display. This is a mean zero stationary process, and thus so is the process that follows. When t < −L_0, m(t) = 0, which gives W(t) = W_0(t); hence the third term in the above display goes to zero and Z(t) → ∞ a.s. as t → −∞. For t > 0, fix M > 0 and let j be a positive integer. Then the displayed bound holds, where by stationarity the relevant supremum has finite expectation, again due to Borell's inequality. Also, it can be seen that m(j) is of order (j − L_0)^k, and hence Σ_{j≥1} P[sup_{t∈[j,j+1]} (−W_0(t)) > m(j) − M] < ∞. Using the Borel-Cantelli lemma, we get P[lim inf_{t→∞} W(t) > M] = 1. As M can be made arbitrarily large, we conclude that W(t) diverges to ∞ a.s. as t → ∞, and consequently so does Z(t).
Note that Z_n (defined in (3.4)) converges weakly to Z in B_loc(R), and consequently in D(R) as well. Moreover, Z has continuous sample paths with probability 1. As Z(t) → ∞ when |t| → ∞, ξ_0^s and ξ_0^l are well defined and Borel measurable. Further, recall that h_n^{−1}(d̂_n − d_0), the smallest argmin of the process Z_n(·), is determined by the ordering of finitely many random variables and hence is measurable. Also, by Theorem 1, it is O_p(1). Hence, conditions (i), (ii) and (iii) of Theorem 8 are satisfied with V_n = Z_n and V_0 = Z, and thus we get the desired result.

A.4. Proof of Proposition 3
Given what has been done earlier in proving the results from Section 3.2, it suffices to show that the process ε̄_n(t), defined in (A.5), converges weakly to a mean zero Gaussian process having the covariance function of W in the setup of Section 6. As W_n(t) = √(nh_n) μ̄(d_0 + th_n) + ε̄_n(t), Lemma 6 then justifies the weak convergence of W_n to W. The statement and proof of Lemma 2 rely on the i.i.d. assumption only through the convergence of the W_n's and the form of their limit. Hence, it would follow that the process Z_n (defined in (3.4)) converges to Z. The result then follows by applying the argmin continuous mapping theorem as in the proof of Theorem 4.
We start by showing that the covariance function of the process ε̄_n converges to that of W. For t_1, t_2 ∈ R, let x_j = d_0 + t_j h_n, j = 1, 2. We have Cov(ε̄_n(t_1), ε̄_n(t_2)) = (σ_0²/(nh_n)) Σ_{l,j} ρ(l − j) K((x_1 − l/n)/h_n) K((x_2 − j/n)/h_n).
Following arguments identical to those in the proof of Lemma 2, this expression can be shown to converge to the covariance function of W. What remains is the justification of the asymptotic normality of the finite dimensional marginals of ε̄_n, and proving tightness. Justifying asymptotic normality of the finite dimensional marginals of ε̄_n requires showing the asymptotic normality of any finite linear combination of marginals of ε̄_n and then applying the Cramér-Wold device. Given the convergence of the covariances, it suffices to prove (A.7) for (c_r, t_r) ∈ R², 1 ≤ r ≤ R ∈ N, where v_n² = Var(Σ_{r≤R} c_r ε̄_n(t_r)). The left hand side equals Σ_i w_{in} ε_i, with w_{in} as displayed. As in Robinson (1997, Assumption 2), we assume the ε_i's to be a linear process with martingale innovations and square summable coefficients, i.e., there is a sequence of martingale differences u_j, j ∈ Z, adapted to F_j = σ{u_k : k ≤ j}, with mean 0 and variance 1, such that the displayed representation holds. To show asymptotic normality, we verify conditions (2.3) and (2.6) from Robinson (1997). Condition (2.3) is just a normalization requirement, which holds in our case as the variance of the left hand side of (A.7) is 1. Condition (2.6) of Robinson (1997) is verified through the displayed computation. The tightness follows from Theorem 12.3 of Billingsley (1968, p. 95).
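The linear-process error model assumed above is straightforward to simulate; a sketch with Gaussian innovations standing in (as an assumption) for the martingale differences, and finitely many square summable coefficients:

```python
import random

def linear_process(n, coeffs, seed=0):
    """Generate e_i = sum_j a_j u_{i-j}, i = 0..n-1, with i.i.d. N(0,1)
    innovations u_j playing the role of the martingale differences."""
    rng = random.Random(seed)
    burn = len(coeffs)                       # pre-sample innovations needed
    u = [rng.gauss(0.0, 1.0) for _ in range(n + burn)]
    return [sum(a * u[burn + i - j] for j, a in enumerate(coeffs))
            for i in range(n)]
```

In a simulation of the short-range dependent setting, such errors would be added to a trend that stays flat up to the threshold d_0.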