Flexible, boundary-adapted, nonparametric methods for the estimation of univariate piecewise-smooth functions

Abstract: We present and compare several nonparametric estimation methods (wavelet- and/or spline-based) designed to recover a one-dimensional piecewise-smooth regression function, in both a fixed (equidistant or non-equidistant) design regression model and a random design model. Wavelet methods are known to be very competitive in terms of denoising and compression, owing to the simultaneous localization property of a function in time and frequency. However, boundary assumptions, such as periodicity or symmetry, generate bias and artificial wiggles which degrade overall accuracy. Simple methods have been proposed in the literature for reducing the bias at the boundaries. We introduce new ones based on adaptive combinations of two estimators. The underlying idea is to combine a method that is highly accurate for non-regular functions, e.g., wavelets, with one that is well behaved at the boundaries, e.g., splines or local polynomials. We provide some asymptotic optimality results supporting our approach. All the methods can handle data with a random design. We also sketch some generalizations to the multidimensional setting. To assess the performance of the proposed approaches we carry out an extensive set of simulations on synthetic data. A regression analysis of two real data applications using these procedures unambiguously demonstrates their effectiveness.


Introduction
Suppose we are given data (t_i, Y_i), i = 1, ..., n, from the regression model
Y_i = f(t_i) + ε_i, i = 1, ..., n, (1.1)
the goal being to construct estimates of f that have "small" risk. In order to have some meaningful estimate according to this criterion, one must assume certain regularity conditions on the unknown function f, such as f belonging to a Hölder class, a Sobolev class, a Besov class and so forth. When the regression function f is sufficiently smooth, efficient smoothing methods such as kernels, splines and basis expansions have received considerable attention in the nonparametric literature (see, for example, Green and Silverman (1993); Eubank (1999); Härdle (1990); Hastie (2003) and references given therein). In contrast, and this is the case of interest in this paper, when the unknown function f, though mostly smooth, is suspected to have a few discontinuities, sharp spikes and abrupt changes, wavelet methods are very popular. The application of wavelet theory to the field of statistical function estimation was pioneered in the seminal papers of Donoho and Johnstone in the early 1990s. The methodology includes a coherent set of procedures that are spatially adaptive and near optimal over a range of function spaces of inhomogeneous smoothness. Wavelet procedures achieve adaptivity through thresholding of the empirical wavelet coefficients. They enjoy excellent mean squared error properties when estimating functions that are only piecewise smooth and have near optimal convergence rates over large function classes. For example, they attain optimal convergence rates for the L² risk when f is in a ball of a Besov space B^s_{p,r} for p < 2, which cannot be achieved by any linear estimator. For a thorough review of wavelet methods in statistics the reader is referred to Antoniadis (2007).
Despite their considerable advantages, however, standard wavelet procedures have limitations. The vast majority of wavelet-based regression estimation has been conducted within the setting where the design points are fixed and equally spaced, to enable the application of the wavelet transform to a compactly supported signal. Moreover, whether the design is equispaced or not, it is customary to impose some boundary assumptions, such as periodicity or symmetry, on the regression function. Such assumptions are not always reasonable, and certain types of bias and artificial wiggles often arise in this context, particularly those due to edge or boundary effects, which detract from the global performance of the estimators and whose influence should be reduced or eliminated whenever possible.
To handle such boundary problems, at least in the equispaced design case, three types of approach are used in wavelet regression: one can either impose extra constraints on the function f, such as periodicity, symmetry or antisymmetry (Ogden, 1996), or construct specialized wavelets on a compact interval (Cohen, Daubechies and Vial, 1993), or combine low-order polynomial terms with a wavelet basis (Oh, Naveau and Lee, 2001). In the first strategy, artificially large wavelet coefficients result when the extra conditions on the regression function f are not satisfied. The second strategy, while theoretically appealing, requires a considerably more involved implementation of a modified discrete wavelet transform. In the third method, called polynomial wavelet regression (PWR), the idea is to estimate f with the sum of a set of (Coiflet) wavelet basis functions, f̂_W, and a low-order (global) polynomial, f̂_P:
f̂_PW = f̂_P + f̂_W,
where f̂_PW is the PWR estimate of f. The hope is that, once f̂_P(x) is removed from the data, the remaining signal hidden in the residuals can be well estimated using wavelet regression with, for example, a periodic boundary assumption. The use of Coiflets allows one, with appropriately chosen hyper-parameters, to prove both analytically and empirically that polynomial wavelet regression is superior to wavelet regression for functions of inhomogeneous smoothness. The use of PWR for resolving boundary problems works very well if f̂_P(x) is able to remove the "non-periodicity" in the data. However, due to the global nature of f̂_P(x), in those cases where f has complex boundary conditions or abrupt changes near the boundaries, PWR does not work well. Oh and Lee (2005) therefore extended the PWR method to a method called LPWR by combining wavelet shrinkage with local polynomial regression, which is known to possess excellent boundary properties.
Note, however, that no asymptotic analysis of the resulting estimator is given in their paper, although extensive simulation results provide strong evidence that LPWR is effective in correcting boundary bias. Because of its performance and its simplicity we have chosen this last approach as a starting point for de-noising signals with irregular boundaries in the equispaced design case.
A closer look at the LPWR estimate shows that it can be considered as a linear combination, with equal coefficients, of a local polynomial estimator and a wavelet regression estimator. Our purpose is then to adopt a different approach by considering an adaptive combination of the two estimators: one based on stronger smoothing assumptions on the regression function f and well behaved at the boundaries, and another based on weaker assumptions. The adaptive choice of the weights will also allow us to obtain some asymptotic optimality results for the combined estimator.
A disadvantage of the above is that the method is limited to the simple equispaced dyadic case. In practice, however, there are many interesting applications in statistics where the samples are not equispaced and their size is not dyadic. It is therefore of interest to propose appropriate penalization methods for wavelet smoothing within the setting of non-equally spaced and non-dyadic design points that can handle the boundary problems efficiently. This is studied in the second part of the paper. Let us just say that the idea of adaptively combining different regression procedures within a collection of regression procedures (e.g. kernel, spline, wavelet, local polynomial, etc.) will be explored in a context of ensemble learning, by mixing or aggregation, and compared to other wavelet regression procedures for random design univariate regression. This paper is organized as follows. In Section 2 we describe the wavelet-based regression model together with the basic concepts of wavelets. We also present an aspect of wavelets, described in Antoniadis and Fan (2001) and crystallizing the penalized least squares approaches to wavelet nonparametric regression, showing that wavelets can be used to construct a set of basis functions over an arbitrary compact interval and that linear combinations of such basis functions are able to estimate particular, usually jagged, regression functions better than spline bases. A detailed description of our boundary correction procedures and their asymptotic properties in the equispaced case is presented in Section 3. The random design case is studied in Section 4, and the procedures are investigated via various simulation settings and real data application examples in Section 6. Some concluding remarks are given in Section 7.

Wavelets and nonparametric regression
We consider the regression problem stated in Equation (1.1) with a non-stochastic equidistant design t_i = (i − 1)/(n − 1), i = 1, ..., n, of size n = 2^J for some positive integer J, noise variables ε_i that are i.i.d. Gaussian N(0, σ²), and a potentially nonsmooth regression function f that may present a wide range of irregular effects. Wavelets provide smoothness characterizations of several function spaces. Many traditional smoothness spaces, for example Hölder spaces, Sobolev spaces and Besov spaces, can be completely characterized by the sequence of wavelet coefficients (e.g., Meyer, 1993). In the present paper we will consider the problem of estimating the regression function either over a range of piecewise Hölder classes or through the sequence space representation of Besov spaces. A function in a piecewise Hölder class can be regarded as the superposition of a regular smooth function in a Hölder class and an irregular perturbation consisting of jump discontinuities. The (inhomogeneous) Besov spaces on the unit interval (e.g., Donoho and Johnstone, 1998), B^s_{p,r}([0,1]), consist of functions that have a specific degree of smoothness in their derivatives. The parameter p can be viewed as a degree of the function's inhomogeneity, while s is a measure of its smoothness. Roughly speaking, the (not necessarily integer) parameter s indicates the number of the function's (fractional) derivatives, where their existence is required in an L^p-sense; the additional parameter r is secondary in its role, allowing for additional fine tuning of the definition of the space. Assuming that f belongs either to a piecewise Hölder class or to a Besov space B^s_{p,r}([0,1]) with s + 1/p − 1/2 > 0 (this condition ensures in particular that evaluation of f at a given point makes sense) enables us to capture key characteristics of the inhomogeneity of f and to exploit the sparse representation of its wavelet coefficients.
Notice that Besov spaces contain not only the standard Hölder and Sobolev spaces but also the piecewise Hölder spaces with a finite number of jump discontinuities (Meyer, 1993).

Wavelet series expansions and discrete wavelet transform (a review)
In this subsection, the wavelet transform and its implementation for discrete signals are briefly reviewed. The sole purpose of this review is to describe the tools which will be used later. We assume that we are working within an orthonormal basis generated by dilations and shifts of a compactly supported scaling function, φ(t), and a compactly supported mother wavelet, ψ(t), associated with an r-regular (r ≥ 0) multiresolution analysis of L²(R), the space of square-integrable functions. Wavelet theory basically considers functions on the real line. When a finite interval [0, 1] is involved, an easy solution is to consider periodic wavelets. More precisely, let (L²[0,1], ⟨·,·⟩) be the space of square-integrable functions on [0,1] endowed with the inner product ⟨f, g⟩ = ∫₀¹ f(t)g(t) dt. Assuming that f is periodic, one may work with periodized wavelet bases on [0,1] (e.g., Mallat, 2009, Section 7.5), obtained by periodizing the functions φ_jk(t) = 2^{j/2} φ(2^j t − k) and ψ_jk(t) = 2^{j/2} ψ(2^j t − k). For any given primary resolution level j₀ ≥ 0, the collection
{φ^per_{j₀k}, k = 0, 1, ..., 2^{j₀} − 1; ψ^per_{jk}, j ≥ j₀; k = 0, 1, ..., 2^j − 1}
is then an orthonormal basis of L²[0,1]. The superscript "per" will be suppressed from the notation for convenience. Despite the poor behaviour of periodic wavelets near the boundaries, where they create high-amplitude wavelet coefficients, they are commonly used because their numerical implementation is particularly simple. For any f ∈ L²[0,1], we denote by c_{j₀k} = ⟨f, φ_{j₀k}⟩, k = 0, 1, ..., 2^{j₀} − 1, the scaling coefficients and by d_{jk} = ⟨f, ψ_{jk}⟩, j ≥ j₀, k = 0, 1, ..., 2^j − 1, the wavelet coefficients of f for the orthonormal periodic wavelet basis defined above; the function f is then expressed in the form
f(t) = Σ_{k=0}^{2^{j₀}−1} c_{j₀k} φ_{j₀k}(t) + Σ_{j≥j₀} Σ_{k=0}^{2^j−1} d_{jk} ψ_{jk}(t), t ∈ [0,1].
The approximation space spanned by the scaling functions {φ_{j₀k}, k = 0, 1, ..., 2^{j₀} − 1} is usually denoted by V_{j₀}, while the details space at scale j, spanned by {ψ_{jk}, k = 0, 1, ..., 2^j − 1}, is usually denoted by W_j.
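In this coefficient notation, the Besov-space membership invoked earlier admits a standard sequence-space characterization. The form below is the textbook one (stated for finite p and r; the usual modifications apply when p or r is infinite), not a formula taken from the present paper:

```latex
\|f\|_{B^s_{p,r}} \;\asymp\;
\Big(\sum_{k=0}^{2^{j_0}-1} |c_{j_0 k}|^p\Big)^{1/p}
\;+\;
\Bigg(\sum_{j \ge j_0}
  \Big[\, 2^{\,j\,(s + 1/2 - 1/p)}
        \Big(\sum_{k=0}^{2^{j}-1} |d_{jk}|^p\Big)^{1/p}
  \Big]^{r}\Bigg)^{1/r}.
```

The factor 2^{j(s + 1/2 − 1/p)} is what forces wavelet coefficients of a Besov function to decay across scales, which is the source of the sparsity exploited by thresholding.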
In statistical settings, we are more usually concerned with discretely sampled, rather than continuous, functions. It is then the wavelet analogue of the discrete Fourier transform which is of primary interest, and this is referred to as the discrete wavelet transform (DWT). Given a vector of real values e = (e_1, ..., e_n)^T, the discrete wavelet transform of e is given by d = W_{n×n} e, where d is an n × 1 vector comprising both discrete scaling coefficients, s_{j₀k}, and discrete wavelet coefficients, w_{jk}, and W_{n×n} is an orthogonal n × n matrix associated with the orthonormal periodic wavelet basis chosen. In the following we will distinguish the blocks of W_{n×n} spanned by the scaling functions and by the wavelets, respectively; explicit expressions for the empirical coefficients s_{j₀k} and w_{jk} of e may be found in Amato et al. When e is a vector of function values f = (f(t_1), ..., f(t_n))^T at equally spaced points t_i, the corresponding empirical coefficients s_{j₀k} and w_{jk} are related to their continuous counterparts c_{j₀k} and d_{jk} (with an approximation error of order n⁻¹) via the relationships s_{j₀k} ≈ √n c_{j₀k} and w_{jk} ≈ √n d_{jk} (see for example Lemma 2 in Cai and Brown (1998)). Note that, because of the orthogonality of W_{n×n}, the inverse DWT (IDWT) is simply given by f = W^T_{n×n} d, where W^T_{n×n} denotes the transpose of W_{n×n}. If n = 2^J for some positive integer J, the DWT and IDWT may be performed through a computationally fast algorithm (e.g., Mallat, 2009, Section 7.3.1) that requires only order n operations. Hereafter, the coarsest wavelet decomposition level j₀ will be chosen as the closest integer to log₂(log(n)) + 1, as suggested in Antoniadis, Bigot and Sapatinas (2001). Let us adopt a vector-matrix form of the nonparametric model given by equation (1.1):
Y = f + ε,
where Y = (Y_1, ..., Y_n)^T, f = (f(t_1), ..., f(t_n))^T and ε = (ε_1, ..., ε_n)^T.
After applying a linear and orthogonal discrete wavelet transform W_{n×n}, the above discretised model becomes
z = γ + ε̃, with z = W_{n×n} Y, γ = W_{n×n} f and ε̃ = W_{n×n} ε.
The orthogonality of the DWT matrix W_{n×n} ensures that the transformed noise vector ε̃ is still distributed as a Gaussian white noise, with covariance matrix σ² I_n. Hence, the representation of the model in the wavelet domain allows one not only to retain the linear structure of the model but also to exploit in an efficient way the sparsity of the wavelet coefficients in the representation of the nonparametric component.
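As a concrete illustration of this matrix viewpoint, the sketch below builds the orthogonal DWT matrix W_{n×n} for the Haar wavelet (the simplest choice; the paper's experiments use smoother compactly supported bases) and checks orthogonality and perfect reconstruction. The function name haar_matrix is ours, not from the paper.

```python
import numpy as np

def haar_matrix(n):
    """Orthogonal Haar DWT matrix W (n x n), for n a power of two.

    Rows hold the discrete scaling/wavelet vectors, so d = W @ e stacks
    the scaling coefficients s_{j0,k} and the wavelet coefficients w_{jk}.
    """
    if n == 1:
        return np.array([[1.0]])
    # One analysis step: pairwise averages (scaling) and differences (wavelet).
    avg = np.kron(np.eye(n // 2), [1.0, 1.0]) / np.sqrt(2.0)
    diff = np.kron(np.eye(n // 2), [1.0, -1.0]) / np.sqrt(2.0)
    # Recurse on the averages to build the coarser scales.
    return np.vstack([haar_matrix(n // 2) @ avg, diff])

n = 8
W = haar_matrix(n)
e = np.array([1.0, 2.0, 3.0, 4.0, 4.0, 3.0, 2.0, 1.0])
d = W @ e          # forward DWT
e_back = W.T @ d   # inverse DWT: orthogonality gives W.T @ W = I
print(np.allclose(W @ W.T, np.eye(n)), np.allclose(e_back, e))  # True True
```

The same orthogonality is what makes the transformed noise vector above a Gaussian white noise again: an orthogonal rotation of N(0, σ²I_n) is N(0, σ²I_n).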
A key step in classical wavelet regression is to estimate the true wavelet coefficients γ = W_{n×n} f by thresholding the empirical wavelet coefficients z = W_{n×n} Y. It is known that such wavelet thresholding estimators are special cases of penalized least-squares estimators (see, for example, Antoniadis and Fan, 2001). That is, a thresholded estimator for γ can be obtained as the minimizer of
‖z − γ‖² + Σ_i p_λ(|γ_i|),
for some suitable penalty function p_λ with penalty parameter λ. Given a penalty function p_λ that is nonnegative, nondecreasing and differentiable on (0, ∞), the solution to the above minimization problem exists and is unique (Antoniadis and Fan, 2001).
When p_λ(·) = λ|·| (the ℓ₁ penalty), the corresponding estimator is obtained by the soft-thresholding operator (Antoniadis, 2007)
δ^S_λ(z) = sign(z)(|z| − λ)₊. (2.2)
Several methods exist to select an appropriate threshold value λ, such as the SUREShrink wavelet regression procedure of Donoho and Johnstone, the cross-validation of Nason (1996), the universal threshold λ(n) = σ√(2 log n) of Donoho and Johnstone (1998) or the EBayesThresh procedure of Johnstone and Silverman (2005). A general review of some threshold selection methods can be found in Antoniadis, Bigot and Sapatinas (2001).
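A minimal numerical sketch of the soft-thresholding rule (2.2) combined with the universal threshold, assuming for illustration a sparse coefficient vector observed in Gaussian noise with known σ (the variable names are ours):

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding: sign(z) * (|z| - lam)_+ , applied entrywise."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

rng = np.random.default_rng(0)
n, sigma = 1024, 1.0
gamma = np.zeros(n)
gamma[:10] = 8.0                                # a few large true coefficients
z = gamma + sigma * rng.standard_normal(n)      # empirical (noisy) coefficients

lam = sigma * np.sqrt(2.0 * np.log(n))          # universal threshold
gamma_hat = soft_threshold(z, lam)
# Thresholding kills most pure-noise coordinates and keeps the spikes,
# so its MSE falls far below that of the raw coefficients (about sigma^2).
print(np.mean((gamma_hat - gamma) ** 2), np.mean((z - gamma) ** 2))
```

With high probability all n Gaussian noise coordinates stay below σ√(2 log n), which is exactly the rationale for the universal threshold.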

Finite interval wavelet transform
To perform wavelet regression on [0,1] by the discrete wavelet transform described above we have used periodic wavelets and scaling functions, which handle the boundaries by imposing periodicity of the regression function. When such an assumption is not satisfied, artificially large wavelet coefficients may be created at the boundaries. A way to solve this problem is to use a particular wavelet basis for L²[0,1], developed by Cohen, Daubechies and Vial (1993), which is closely connected with Daubechies' orthonormal and compactly supported wavelet bases of L²(R). This is accomplished by defining special scaling and wavelet functions at the boundaries as linear combinations of the original scaling and wavelet functions. This approach naturally preserves the regularity and refinability of the wavelet system and maintains orthogonality, so that an appropriate discrete wavelet transform (DWT) algorithm can be applied. In practice, if M denotes the number of vanishing moments of the wavelet system (or, equivalently, M − 1 is the maximum degree of polynomials exactly reproducible by the wavelet system), then M left boundary scaling and wavelet functions are defined at each scale j, φ^L_{jk}, ψ^L_{jk}, k = 0, ..., M − 1, starting from the generating φ^L_k(t) and ψ^L_k(t) as φ^L_{jk}(t) = 2^{j/2} φ^L_k(2^j t) and ψ^L_{jk}(t) = 2^{j/2} ψ^L_k(2^j t); analogously, M right boundary scaling and wavelet functions are defined at each scale j, φ^R_{jk}, ψ^R_{jk}, k = 2^j − M, ..., 2^j − 1. Together with the 2^j − 2M unaltered interior scaling and wavelet functions φ_{jk}(t) = 2^{j/2} φ(2^j t − k) and ψ_{jk}(t) = 2^{j/2} ψ(2^j t − k), k = M, ..., 2^j − M − 1, they provide a full multiresolution analysis on the finite interval. From a DWT perspective, special filters are introduced at the left and right ends of the interval, depending on the scale j, at the price of a more involved implementation of a modified discrete wavelet transform.
Under such a framework, the regression function f can be represented by (the index fi stands for finite interval)
f(t) = Σ_{k=0}^{2^{j₀}−1} c_{j₀k} φ^fi_{j₀k}(t) + Σ_{j≥j₀} Σ_{k=0}^{2^j−1} d_{jk} ψ^fi_{jk}(t), t ∈ [0,1],
provided that 2^{j₀} ≥ 2M, so that boundary scaling and wavelet functions (and the corresponding filters) do not overlap. After applying the finite-interval-adapted discrete wavelet transform W^fi_{n×n}, the nonparametric regression problem can again be written in matrix notation as
z = γ + ε̃,
where z = W^fi_{n×n} Y, γ = W^fi_{n×n} f is the vector of corresponding coefficients and ε̃ = W^fi_{n×n} ε. Once again, wavelet regression estimation can be achieved by penalized least squares, as in the periodic case.

Basis expansions
The wavelet regression methodology discussed in the previous subsections is designed for treating dyadic samples of equispaced data. The application of wavelet analysis to irregularly spaced, possibly random, samples, say T_n = (T_1, ..., T_n)^T, has been a subject of study for more than ten years. Most methods in the area work with a pre- and/or post-processing of the data in order to translate the problem into an equispaced one. Cai and Brown (1998) decompose the non-equispaced data in a warped wavelet basis and then project this decomposition onto a regular wavelet basis. Antoniadis and Pham (1998) implement a direct discretisation of a continuous wavelet analysis on the irregular grid to find numerical values for wavelet coefficients corresponding to regular basis functions. Kovac and Silverman (2000) interpolate the irregular observations at intermediate regular locations before starting the wavelet analysis. These and other methods require user-driven preprocessing, which may become difficult or even fail when the data are "very" non-equidistant, and they are still affected by boundary problems. The idea is then to use wavelet basis functions evaluated on irregular grids, as in Antoniadis and Fan (2001) and Wand and Ormerod (2011), just as is done for other functional bases such as, for example, B-splines: a basis-expansion-based approximation of the nonparametric function f provides a way of handling non-equispaced designs.
When the nonparametric function f is supposed to be smooth, one may use an approximation by its expansion on O'Sullivan spline basis functions {B_ℓ}_{ℓ∈N}:
f(t) ≈ Σ_{ℓ=1}^m α_ℓ B_ℓ(t),
where m is an appropriate truncation index that is allowed to increase to infinity with n. We assume that the B_ℓ are in canonical form (e.g., Wand and Ormerod, 2008, Section 4). Under reasonable smoothness assumptions, the regression function can be well approximated by the above expansion, and its estimation is therefore equivalent to estimating the coefficient vector α = (α_1, ..., α_m)^T. Using the O'Sullivan basis construction described in Wand and Ormerod (2008) it is easy to compute the corresponding n × m regression matrix B of the O'Sullivan basis functions evaluated at the irregular grid T_n, i.e. B_{iℓ} = B_ℓ(T_i),
and adopt the vector-matrix form of the nonparametric regression model (1.1):
Y = Bα + ε.
The vector of spline coefficients α can be estimated by minimizing an objective function of the form
‖Y − Bα‖² + λ J(f), (2.7)
where the parameter λ is a smoothing parameter that controls the trade-off between fit and smoothness. The standard estimation procedures assume a quadratic form in the spline coefficients for the penalty J(f) (e.g., Silverman, 1985; Eilers and Marx, 1996; Wood, 2006). In this case, λ can be selected by minimisation of the generalised cross-validation (GCV) score, the generalised Akaike information criterion (AIC), or restricted maximum likelihood (REML) estimation, to name a few. The computational methods of Wood (2006), implemented in the R package mgcv, are available to estimate f by minimising (2.7). Moreover, it can be shown, when s > 1 (e.g., Du, Parmeter and Racine, 2013, Proposition 2.1), that under appropriate conditions on the design when it is fixed, or on its distribution when it is random, if λ ∼ n^{−2s/(2s+1)}, then the solution f̂ of (2.7) satisfies
‖f̂ − f‖²_n = O_P(n^{−2s/(2s+1)}).
Thus f̂ is consistent, with convergence rates similar to those obtained in the equidistant fixed design case using local polynomials. However, other penalties can be used to impose a sparseness constraint on the coefficients. When the nonparametric function is not smooth, one can approximate it using wavelet bases instead, as in Antoniadis and Fan (2001) and Wand and Ormerod (2011). More precisely, we may use its expansion on wavelet basis functions {W_ℓ}:
f(t) ≈ Σ_{ℓ=1}^K γ_ℓ W_ℓ(t),
where K is again an appropriate truncation index that is allowed to increase to infinity with n. Again, for f within some Besov ball, f can be well approximated by the above expansion, and its estimation is therefore equivalent to estimating the wavelet coefficient vector γ = (γ_1, ..., γ_K)^T.
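To make the quadratic-penalty criterion (2.7) concrete, here is a small penalized least-squares sketch with GCV selection of λ on a random design. For simplicity it uses a truncated-power cubic basis with a ridge penalty on the knot coefficients (an Eilers-Marx-flavoured stand-in for the canonical O'Sullivan basis of Wand and Ormerod (2008)); all names and tuning constants are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
t = np.sort(rng.uniform(0.0, 1.0, n))           # irregular (random) design
f_true = np.sin(2.0 * np.pi * t)
y = f_true + 0.2 * rng.standard_normal(n)

# Toy truncated-power cubic basis: 1, t, t^2, t^3, (t - kappa_j)_+^3.
knots = np.linspace(0.0, 1.0, 12)[1:-1]         # 10 interior knots
B = np.column_stack([t ** p for p in range(4)] +
                    [np.clip(t - k, 0.0, None) ** 3 for k in knots])
D = np.diag([0.0] * 4 + [1.0] * len(knots))     # penalize only knot terms

def fit(lam):
    """Penalized LS: minimize ||y - B a||^2 + lam * a' D a."""
    alpha = np.linalg.solve(B.T @ B + lam * D, B.T @ y)
    return B @ alpha

def gcv(lam):
    """Generalised cross-validation score n * RSS / (n - tr(S))^2."""
    S = B @ np.linalg.solve(B.T @ B + lam * D, B.T)   # smoother matrix
    resid = y - S @ y
    return n * (resid @ resid) / (n - np.trace(S)) ** 2

lams = 10.0 ** np.arange(-6, 3)
best = min(lams, key=gcv)
f_hat = fit(best)
print(np.mean((f_hat - f_true) ** 2))   # typically well under the noise variance 0.04
```

In practice one would use mgcv (or the canonical O'Sullivan construction) rather than this toy basis; the point is only the mechanics of (2.7) and of GCV.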
Similarly to the spline case, as alluded to in Antoniadis and Fan (2001) and implemented in Wand and Ormerod (2011), we can also define design matrices containing wavelet basis functions evaluated at T_n. We will again denote by W the corresponding n × K wavelet regression matrix of the wavelet basis functions evaluated at T_n (see the Appendix for a brief description of such a construction), i.e. W_{iℓ} = W_ℓ(T_i), with the corresponding vector-matrix form
Y = Wγ + ε.
So, estimating f is equivalent to estimating the wavelet coefficient vector γ. However, the fact that f belongs to a Besov space implies sparsity of the wavelet coefficients, and therefore the wavelet vector γ is obtained by minimizing an objective function of the form
‖Y − Wγ‖² + P_λ(γ), (2.10)
using penalties P_λ that are efficient in terms of estimation and model selection consistency, such as the ones discussed in Antoniadis (2007). Directly optimising (2.10) can be tricky for a given penalty function, especially when the penalty is non-convex. To tackle the optimisation, it is more convenient to use an iterative thresholding viewpoint, with a thresholding function corresponding to the selected penalty (e.g., She, 2012; Daubechies, Defrise and De Mol, 2004; Bredies, Lorenz and Reiterer, 2014). The reader may also look in Antoniadis (2007) for a survey of the one-to-one correspondence between threshold functions and penalty functions. It can be shown, using Theorem 2.1 of She (2012), that, provided the spectral norm of the design matrix W is not large, and whatever the starting value of γ, the iterated thresholding estimates minimise (2.10). The condition on the boundedness of the spectral norm of W is easily met by rescaling the vector of coefficients γ and the penalty parameter λ.
Using the results of Bunea, Lederer and She (2014), assuming that K grows to infinity at an appropriate rate in such a way that a compatibility condition holds for the design matrix W rescaled by some constant C₀, and assuming a finite sparsity index less than n and a bounded entropy for the class of irregular nonparametric regression functions, the penalized estimation produces estimates f̂ that are consistent with optimal rates R(f, f̂) = O(n^{−2s/(2s+1)}).
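The iterative thresholding viewpoint can be sketched as follows for the ℓ₁ penalty (ISTA in the sense of Daubechies, Defrise and De Mol (2004)). The design matrix here is a random stand-in for the wavelet matrix W evaluated at irregular points, and the step-size rule reflects the spectral-norm condition mentioned above; everything else (names, constants) is ours.

```python
import numpy as np

def soft(z, lam):
    """Soft-thresholding, the prox operator of the l1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

rng = np.random.default_rng(2)
n, K = 100, 50
X = rng.standard_normal((n, K)) / np.sqrt(n)     # stand-in design matrix
gamma_true = np.zeros(K)
gamma_true[:5] = 3.0                              # sparse truth
y = X @ gamma_true + 0.1 * rng.standard_normal(n)

# ISTA for min (1/2)||y - X g||^2 + lam*||g||_1 ; other thresholding
# rules would correspond to other penalties (She, 2012).
eta = 1.0 / np.linalg.norm(X, 2) ** 2             # step <= 1 / ||X||_op^2
lam = 0.3
g = np.zeros(K)
for _ in range(300):
    g = soft(g - eta * X.T @ (X @ g - y), eta * lam)

# The five largest fitted coefficients sit on the true support.
print(sorted(np.argsort(-np.abs(g))[:5].tolist()))
```

Each iteration is a gradient step on the quadratic fit term followed by the threshold; convergence from any starting value mirrors the statement quoted from She (2012) above.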
The above results on regression splines and regression wavelets can be used to derive some new estimation procedures that tackle the boundary problem in a fashion similar to the equidistant design case.

Boundary treatment in wavelet thresholding with equidistant design data
As noticed in the Introduction, although classical wavelet regression (assuming periodic boundaries) provides reasonable fits far away from the boundaries, artificial wiggles often appear at the boundaries, one reason being that the wavelet transform filtering operations at the boundaries require values of the regression function outside its support. We will therefore first review in this section some of the existing methods for treating boundary problems in wavelet regression with a fixed equidistant design, namely polynomial wavelet regression (PWR) and local polynomial wavelet regression (LPWR), before proposing our adaptive combination of local polynomial and wavelet regression estimators with sound asymptotic properties.

Polynomial wavelet regression (PWR)
The polynomial wavelet regression (PWR) estimator was proposed by Oh, Naveau and Lee (2001). It is based on a combination of a wavelet-based regression estimator f̂_W and a low-order polynomial fit f̂_P. The resulting estimator of f, f̂_PW, is written as
f̂_PW(t) = f̂_P(t) + f̂_W(t),
where f̂_P is a polynomial estimator of degree d. The use of f̂_PW requires the choice of d as well as of the threshold value λ used for estimating the wavelet coefficients in f̂_W. With appropriately chosen d and λ, it is demonstrated in Oh, Naveau and Lee (2001), both analytically and empirically, that f̂_PW is superior to f̂_W. For this, it is desirable to maintain orthogonality between the set of polynomial basis functions {t, ..., t^d} and the wavelet basis. This means that the equations
∫ t^ℓ ψ(t) dt = ∫ t^ℓ φ(t) dt = 0
have to be satisfied for ℓ = 1, ..., d. Wavelets with such properties were constructed in Daubechies (1992) and named Coiflets. Hence, the use of Coiflets with at least d + 1 vanishing moments implies that the polynomial regression term is orthogonal to the wavelet regression term. This orthogonality property allows Oh, Naveau and Lee (2001) to obtain asymptotic results showing that the PWR estimators are competitive with other nonparametric procedures, retaining the asymptotic optimality of the wavelet decomposition while reducing edge effects.
On the practical side, several automatic methods for choosing both d and λ are proposed by Lee and Oh (2004). They are based on estimating values of d and λ that aim to minimize an estimate of the L₂-risk between f and f̂_PW. We only describe here one of the proposed methods; the interested reader is referred to Lee and Oh (2004) for a description of the other approaches. To choose d, a criterion similar to Mallows' C_p is used with the polynomial estimator f̂_P. Then the SUREShrink or the EBayesThresh wavelet regression procedure (e.g., Antoniadis, Bigot and Sapatinas, 2001) is applied to choose the λ that aims to minimize the risk between f − f̂_P and f̂_W, where f̂_W is obtained by applying ordinary wavelet regression to the polynomial residuals Y_i − f̂_P(t_i). Note that, owing to the use of an appropriate Coiflet basis, f̂_W is the same as the wavelet fit obtained on the original observations Y_i.
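The two-step PWR construction can be sketched numerically as follows. Several simplifications relative to the paper are assumed: Haar wavelets with hard universal thresholding replace the Coiflets, the polynomial degree is fixed by hand at d = 2 instead of being chosen by the C_p-type criterion, and σ is taken as known; all function names are ours.

```python
import numpy as np

def hard(z, lam):
    """Hard thresholding: keep a coefficient only if it exceeds lam."""
    return z * (np.abs(z) > lam)

def haar_analysis(x):
    """Full periodic Haar DWT (len(x) = 2^J): scaling coeff + details."""
    x = x.astype(float).copy()
    det = []
    while len(x) > 1:
        det.append((x[0::2] - x[1::2]) / np.sqrt(2.0))
        x = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    return x, det[::-1]          # coarsest scaling coeff, details coarse->fine

def haar_synthesis(a, details):
    """Inverse of haar_analysis."""
    x = a.copy()
    for d in details:
        up = np.empty(2 * len(x))
        up[0::2] = (x + d) / np.sqrt(2.0)
        up[1::2] = (x - d) / np.sqrt(2.0)
        x = up
    return x

rng = np.random.default_rng(3)
n = 256
t = np.arange(n) / (n - 1.0)
f = 2.0 * t + np.sin(2.0 * np.pi * t)     # non-periodic: f(0) != f(1)
y = f + 0.2 * rng.standard_normal(n)

# Step 1: global polynomial fit removes the non-periodic trend.
f_P = np.polyval(np.polyfit(t, y, 2), t)

# Step 2: universal-threshold wavelet fit to the polynomial residuals.
sigma = 0.2                                # assumed known for the sketch
a, det = haar_analysis(y - f_P)
det = [hard(d, sigma * np.sqrt(2.0 * np.log(n))) for d in det]
f_W = haar_synthesis(a, det)

f_PW = f_P + f_W                           # the PWR-style combination
print(np.mean((f_PW - f) ** 2), np.mean((y - f) ** 2))
```

Once the trend is removed, the residual signal is nearly periodic, which is precisely the situation in which periodic-boundary wavelet thresholding behaves well.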

Local polynomial wavelet regression (LPWR)
The use of PWR for handling boundary problems works very well when the polynomial fit f̂_P is able to remove the non-periodicity in the data, so that the remaining signal can be well estimated using wavelet thresholding with periodic boundary conditions. However, due to the global nature of f̂_P, in those cases where f has complex boundary conditions or abrupt changes near the boundaries, PWR does not work well. Lee and Oh (2004) proposed a new method which works well also in such situations. The basic idea behind the proposed method is to add a local polynomial regression component to the wavelet shrinkage. Since local polynomial regression produces excellent boundary handling (Fan, 1992; Hastie and Loader, 1993), it is expected that the addition of this component to wavelet shrinkage will result in equally good boundary properties. Their local polynomial wavelet estimator can be written as
f̂_LPWR(t) = f̂_W(t) + f̂_LP(t). (3.2)
The LPWR estimator is computed through an iterative algorithm inspired by backfitting. The following steps summarize the key points for finding the final local polynomial wavelet regression estimate f̂_LPWR.
1. Select an initial estimate f̂⁽⁰⁾ for f and let f̂_LPWR = f̂⁽⁰⁾.
2. For j = 1, 2, ... iterate the following steps until convergence:
(a) Apply wavelet thresholding to the residuals Y_i − f̂_LP(t_i), where f̂_LP is the current local polynomial component (initialised through f̂⁽⁰⁾), to obtain f̂_W.
(b) Fit a local polynomial to the residuals Y_i − f̂_W(t_i) to obtain f̂_LP, and update f̂_LPWR = f̂_W + f̂_LP.
To use the above algorithm, one needs to choose the initial curve estimate f̂⁽⁰⁾ in Step 1 and the smoothing parameter for the local polynomial fit in Step 2(b). For the former, Oh and Lee use Friedman's super-smoother (available as supsmu in R), while for the smoothing parameter of the local polynomial estimator they use cross-validation. The numerical experiments they report in their simulation study, using their R code hybrid.r, suggest that the above algorithm converges very quickly.
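In outline, the backfitting loop can be sketched as below, with deliberately simplified choices: a Haar universal-threshold smoother, a fixed-bandwidth local linear smoother in place of the cross-validated local polynomial, a zero initial estimate instead of the super-smoother, and known σ. None of this is the authors' hybrid.r code.

```python
import numpy as np

def soft(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def wave_smooth(r, sigma):
    """Haar universal-threshold smoother for a residual vector r (len = 2^J)."""
    x, det = r.astype(float).copy(), []
    while len(x) > 1:
        det.append((x[0::2] - x[1::2]) / np.sqrt(2.0))
        x = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    lam = sigma * np.sqrt(2.0 * np.log(len(r)))
    for d in [soft(d, lam) for d in det[::-1]]:   # coarse -> fine synthesis
        up = np.empty(2 * len(x))
        up[0::2] = (x + d) / np.sqrt(2.0)
        up[1::2] = (x - d) / np.sqrt(2.0)
        x = up
    return x

def locpoly_smooth(t, r, h):
    """Local linear fit at each design point with an Epanechnikov kernel."""
    fit = np.empty(len(t))
    for i, t0 in enumerate(t):
        u = (t - t0) / h
        w = np.maximum(1.0 - u * u, 0.0)          # Epanechnikov weights
        X = np.column_stack([np.ones_like(t), t - t0])
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * r))
        fit[i] = beta[0]                          # fitted value at t0
    return fit

rng = np.random.default_rng(4)
n = 256
t = np.arange(n) / (n - 1.0)
f = np.exp(2.0 * t) + 1.5 * (t > 0.5)             # smooth trend plus one jump
y = f + 0.2 * rng.standard_normal(n)

f_LP = np.zeros(n)                                # simplified initialisation
for _ in range(5):
    f_W = wave_smooth(y - f_LP, sigma=0.2)        # step 2(a)
    f_LP = locpoly_smooth(t, y - f_W, h=0.08)     # step 2(b)
f_LPWR = f_W + f_LP
print(np.mean((f_LPWR - f) ** 2), np.mean((y - f) ** 2))
```

The division of labour that emerges is the intended one: the wavelet component absorbs the jump, while the local linear component tracks the smooth trend, including near the boundaries.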

Proposed adaptive combination of two regression estimators
Inspired by the LPWR strategy, we now propose an adaptive combination of local polynomial and wavelet regression estimators with sound asymptotic properties.
The local polynomial estimator is based on stronger assumptions on the regression function than the wavelet regression one and thus, with appropriately chosen hyper-parameters, the asymptotic rates of convergence of the two estimators differ. We will adopt a linear combination of the two estimators in which the weights are estimated by Stein's unbiased risk estimation (SURE), in such a way that the adaptive estimator retains the optimal rate of convergence. The technique used here is similar to the one adopted by Burman and Chaudhuri (2011), who combined a parametric estimator (rate n⁻¹) with a linear nonparametric estimator to obtain an estimator with optimal rates. The main differences are that we do not need one of the estimators to be parametric, as long as its rate is not faster than n⁻¹, and that in what follows we combine a linear estimator with a nonlinear one.
Let us first recall that, for any s > 0, the Sobolev space W^s_2([0,1]) with non-integer regularity index s > 1/2 is defined as the space of tempered distributions whose Fourier transforms are square integrable with respect to the measure (1 + |x|²)^s on [0,1] (Hörmander, 1989, p. 240). When s is an integer, it coincides with the space of functions f having continuous derivatives of orders up to s − 1 and a square-integrable derivative of order s. By the Sobolev embedding results concerning Besov spaces on the interval, which can be found, for example, in Donoho et al. (1996), W^s_2([0,1]) = B^s_{2,2}([0,1]) ⊂ C([0,1]) for any s > 1/2, the inclusion being continuous.
A local polynomial estimator is linear in the data, and when the regression function is assumed to lie in a Sobolev space W^s̄_2[0, 1] with s̄ ≥ 2, local polynomial regression smoothers, with an optimal choice of kernel and bandwidth, have nice sampling properties and high minimax efficiency. Our attention is focused on a fixed equidistant design. The global L2-risk of any estimator f̂ is then equivalent to the expected empirical risk R_n(f, f̂) = E ||f − f̂||²_n of the estimator, where ||·||²_n = n^{-1} ||·||² and ||·|| is the Euclidean norm of R^n. It is then known (e.g., Fan, 1993) that a local polynomial estimator f̂_LP with an Epanechnikov kernel and an optimal bandwidth h_n of the order n^{-1/(2s̄+1)} is optimal in terms of rates of convergence, with optimal rate r_1(n) := R_n(f, f̂_LP) = n^{-2s̄/(2s̄+1)}. If one assumes instead that the unknown regression function belongs to a Besov space B^s_{2,2}[0, 1] ⊂ W^s̄_2[0, 1] with s > s̄, then the best optimal rate is n^{-2s/(2s+1)}, which is smaller than r_1(n) since 2s̄/(2s̄+1) < 2s/(2s+1). This rate, up to a logarithmic factor, can only be attained by a wavelet threshold estimator f̂_W. In fact, if the optimal rate obtained by wavelet thresholding is denoted by r_2(n) = n^{-2s/(2s+1)} log n, we obviously have r_2(n)/r_1(n) → 0 as n → ∞, which implies that ||f − f̂_LP||²_n = O_P(r_1(n)) and ||f − f̂_W||²_n = O_P(r_2(n)). We would now like to use a combination estimator which decides on the basis of the data which of these two estimators to use. Define the hybrid estimator f̂_α = α f̂_LP + (1 − α) f̂_W.

The adaptive estimator
The Euclidean empirical distance between the true regression function f and the hybrid estimator f̂_α defines the loss L_n(α) = ||f − f̂_α||²_n (3.5). The approach chosen here is to treat α as a hyper-parameter and estimate it using Stein's unbiased risk estimation, which provides an unbiased estimate of the loss. Direct optimization of R_n(f, f̂_α) with respect to α is not feasible, since the function f is unknown in the last term of the corresponding expression. To proceed we need to derive an objective that substitutes for R_n(f, f̂_α) and depends only on the noisy data. We now state a version of Stein's lemma with Gaussian errors that is useful in deriving an unbiased estimator of R_n(f, f̂_α). A proof may be found, for example, in Blu and Luisier (2007).
Once the regression estimates are computed for some fixed bandwidth h_n and some fixed threshold λ_n of the right order, the optimum estimator α̂ is computed by minimizing an unbiased estimator of E(L_n(α)) derived using Lemma 3.1. Both ∂f̂_LP(t_i)/∂Y_i and ∂f̂_W(t_i)/∂Y_i can be easily computed using results in Blu and Luisier (2007). We can now state the following theorem that outlines the asymptotic behavior of our hybrid estimator: Theorem 3.1. Suppose that δ_n tends to zero as n tends to infinity. Then the adaptive estimator f̂_α̂ retains the faster rate of convergence r_2(n), up to a (1 + o_P(1)) factor. The proof relies upon some Lemmas very similar to Lemmas 5.1, 5.2 and 5.4 in Burman and Chaudhuri (2011) and is given in the Appendix.
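As a concrete illustration, consider the simpler situation in which both candidate estimators are linear smoothers, so that SURE is an explicit quadratic in α and the optimal mixing weight has a closed form. The Python sketch below is our own: two kernel smoothers with different bandwidths stand in for f̂_LP and f̂_W, and each divergence term reduces to the trace of the corresponding hat matrix; in the nonlinear wavelet case the divergence must instead be computed as in Blu and Luisier (2007).

```python
import numpy as np

def epan_hat(t, h):
    """Hat matrix of a Nadaraya-Watson smoother with Epanechnikov kernel."""
    u = (t[:, None] - t[None, :]) / h
    W = np.maximum(1.0 - u ** 2, 0.0)
    return W / W.sum(axis=1, keepdims=True)

def sure_alpha(y, L1, L2, sigma2):
    """SURE-optimal weight for the combination f_a = a*L1@y + (1-a)*L2@y.

    For linear smoothers,
      SURE(a) = ||y - f_a||^2/n - sigma2
                + (2*sigma2/n) * (a*tr(L1) + (1-a)*tr(L2))
    is quadratic in a, so setting its derivative to zero gives a explicitly."""
    f1, f2 = L1 @ y, L2 @ y
    d = f1 - f2
    a = (d @ (y - f2) - sigma2 * (np.trace(L1) - np.trace(L2))) / (d @ d)
    return a, a * f1 + (1.0 - a) * f2
```

By construction SURE(α̂) is no larger than SURE(0) or SURE(1), i.e. in terms of the unbiased risk estimate the combination can only improve on either smoother alone.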
Remark 3.1. The simple combination approach described above, even when using the SURE principle to estimate the mixing coefficient, may produce poor results in practice because it does not take into account the correlation induced by the fact that both nonparametric estimators are computed on the same data set. This behavior could of course be improved by using wavelets that are orthogonal to polynomials, such as Coiflets. If one restricts attention to combinations that are linear in the estimators, computing the weights that mimic the least squares oracle weights may also be problematic in the non-asymptotic sense, since the resulting estimated coefficients, even when using the SURE principle, may overfit the data and therefore yield a poor generalization error. To attenuate this correlation problem one could use stacking regression for combining the estimates. Stacking was first presented by Wolpert (1992), who considered neural networks, and extended to statistics by Breiman (1996), who considered "stacked regression" and provided some heuristic and numerical results to justify a method for combining estimators without suffering from the correlation problem and poor generalization error. To avoid overfitting, the weights are computed by minimizing a cross-validated squared error loss, where each estimator to be combined is fitted on the training data and the prediction errors are computed on the left-out test data. While this is not a problem for local polynomial estimators, it may be one when considering wavelet regression on nonequidistant data, since the DWT heavily relies upon equidistant data. This idea will however be pursued later in this paper, when considering regression on non-equidistant or random design data.

Some other methods: trend filtering
We give here a brief background of ℓ1 trend filtering, presented as a nonparametric regression method that uses an ℓ1-type penalty and is able to adapt to local differences in smoothness and achieve the minimax rate of convergence for weakly differentiable regression functions of bounded variation (Tibshirani, 2014). An implicit assumption with trend filtering is that the design points are evenly spaced. Assuming that t_1 < t_2 < · · · < t_n are unique and equally spaced, for a given integer k ≥ 0, the kth order trend filtering estimate β̂ = (β̂_1, . . . , β̂_n) of (f(t_1), . . . , f(t_n)) is defined by the penalized least squares optimization problem

β̂ = argmin_{β ∈ R^n} (1/2) ||y − β||²_2 + λ (n^k / k!) ||D^(k+1) β||_1,   (3.6)

where D^(k+1) ∈ R^{(n−k−1)×n} is the (k + 1)th order finite difference matrix and λ ≥ 0 is a tuning parameter. The constant factor n^k/k! multiplying λ is unimportant and can be absorbed into the tuning parameter λ, but it will facilitate comparisons in future sections. When k = 0 the penalty is the total variation of β; hence constant trend filtering is the same as one-dimensional total variation denoising (e.g., Rudin, Osher and Fatemi, 1992). When k = 1 the estimate is piecewise linear. Apart from requiring unique and equally spaced observations, (3.6) has one parameter per data point, no intercept, and the design matrix is the identity matrix. For a general k, as described by Tibshirani (2014), the kth order trend filtering estimate has the structure of a kth order piecewise polynomial function, evaluated across the inputs t_1, . . . , t_n. The knots in this piecewise polynomial are selected adaptively among t_1, . . . , t_n, with a higher value of the tuning parameter λ (generally) corresponding to fewer knots. Taking λ of the order n^{1/(2k+3)} leads to the minimax rate of convergence over the class of functions whose kth weak derivative is of bounded variation (Tibshirani, 2014, Theorem 1).
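For concreteness, the difference operator can be built recursively in a few lines; the Python sketch below (ours) also verifies the structural fact behind the penalty: D^(k+1) annihilates polynomials of degree k, so the ℓ1 penalty charges only for departures from a degree-k piecewise polynomial.

```python
import numpy as np

def diff_matrix(n, k):
    """k-th order finite difference matrix D^(k), of shape (n - k, n),
    for equally spaced inputs, built recursively as D1 @ D^(k-1)."""
    D = np.eye(n)
    for j in range(k):
        D1 = np.diff(np.eye(n - j), axis=0)  # first differences, (n-j-1) x (n-j)
        D = D1 @ D
    return D
```

For k = 0 trend filtering, the penalty ||D^(1) β||_1 = Σ|β_{i+1} − β_i| is exactly the total variation term of Rudin, Osher and Fatemi.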

Spline-wavelet adaptive combination
For a fixed nonequidistant design, or even a random design, one may still use the simple adaptive combination approach described in Subsection 3.3, applying the SURE principle to estimate the mixing coefficients with similar asymptotic results, since the rates of each estimator, based on spline or wavelet expansions (Subsection 2.3), are similar to those evoked for local polynomial estimators and wavelet thresholding estimators. Moreover, since no restriction is required on the design, "stacked regression" is also a possible option for combining spline and wavelet regression estimators without suffering from the correlation problem between the two estimates.

Spline-wavelet stacking
We will now describe in some more detail how stacked regression is performed with the two basis expansion estimators. To simplify notation, we will call f̂_1 the optimal spline based estimator and f̂_2 the optimal wavelet expansion estimator. Optimality here is meant in an asymptotic sense, that is, the regularization parameters for the spline penalty and for the wavelet penalization are fixed but of the right asymptotic order. In practice, however, these hyperparameters are chosen in a data dependent way when computing the estimates. Stacked regression combines the two estimators f̂_j, j = 1, 2, linearly to obtain f̂_stack given by

f̂_stack = β̂_1 f̂_1 + β̂_2 f̂_2,   (4.1)

where the estimator β̂ = (β̂_1, β̂_2) of the parameters in eq. (4.1) is obtained as

β̂ = argmin_β Σ_{i=1}^n ( Y_i − β_1 f̂_1^{(−i)}(t_i) − β_2 f̂_2^{(−i)}(t_i) )²,

with f̂_j^{(−i)}(t_i) the leave-one-out estimate of the type-j estimator at the design point t_i. By using cross-validated predictions, stacked regression avoids giving unfairly high weight to models with higher complexity. There is a close connection between stacking and winner-takes-all model selection. If we restrict the minimization to weight vectors that have one unit weight and the rest zero, this leads to the model choice returned by winner-takes-all selection based on the leave-one-out error. Rather than choosing a single model, stacking combines the models with estimated optimal weights. This will often lead to better prediction, but less interpretability, than the choice of only one of the models.
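For linear smoothers the leave-one-out predictions need not be recomputed n times: the standard hat-matrix shortcut gives them directly (and is exact for least-squares projections). Below is a minimal Python sketch of the stacking step, with unconstrained OLS for the weights (Breiman recommends additionally constraining them to be nonnegative); all names are ours.

```python
import numpy as np

def loo_predictions(L, y):
    """Leave-one-out fits of a linear smoother f_hat = L @ y via the standard
    shortcut f_hat_{-i}(t_i) = (f_hat_i - L_ii y_i) / (1 - L_ii),
    which is exact for least-squares projection matrices."""
    f = L @ y
    h = np.diag(L)
    return (f - h * y) / (1.0 - h)

def stack(y, smoothers):
    """Stacked regression: regress y on the cross-validated predictions of the
    candidate smoothers to get the mixing weights beta, then combine the full
    fits (unconstrained OLS here, for simplicity)."""
    Z = np.column_stack([loo_predictions(L, y) for L in smoothers])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    fits = np.column_stack([L @ y for L in smoothers])
    return beta, fits @ beta
```

Because the weights are fitted against held-out predictions, a smoother that merely interpolates the training data gains no unfair advantage.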

Matrix stacking regression of spline and wavelet bases
An alternative way of mixing wavelet and spline bases estimators is to stack the corresponding matrices of the bases.
Let S be the matrix of the spline basis and W the corresponding one for the wavelet basis (Subsection 2.3). Then the regression model writes as

Y ≈ α 1_N + S γ_S + W γ_W + ε = [SW] γ + ε,

with [SW] = [1_N | S | W] the matrix stacking S and W (and the intercept term), and γ = (α, γ_S, γ_W) the set of intercept, spline and wavelet coefficients, respectively, to be estimated. As before, a penalization term can be included in the regression, resulting in the minimization of a penalized least squares functional with separate penalties on the K_S spline coefficients and the K_W wavelet coefficients. Inclusion of the spline term is intended to improve accuracy at the boundaries.

Adaptive regression mixing and aggregation
In applications, many other nonparametric regression procedures have been developed in the literature and studied both theoretically and through systematic numerical investigations. Since the model noise level and the true regression function are unknown, the task of identifying the best among several estimation procedures is typically very difficult. There is therefore an advantage in considering a list of distinct nonparametric regression procedures, so that at least one of them is optimal or well-behaved for the unknown underlying data generating process. For the goal of estimating the regression function, as is the focus of this paper, one approach is to combine the various estimation procedures by a proper weighting of their estimates. An example of such an approach, when attention is restricted to estimators based on regularized linear expansions of the regression function in either spline bases or wavelet bases, is the stacking regression proposed in the previous subsection. If the combination leads to a performance similar or close to the best method in each scenario of the underlying data generation process, the combined estimator or prediction can outperform all the candidate procedures in repeated applications across different scenarios of the data generation process.
Combining regression procedures has been studied extensively, and various interesting theoretical properties have been proved. Oracle inequalities show that properly combining arbitrary regression procedures leads to a risk close to the best among a target class of combinations of the candidate estimators/predictions, plus a minimax-rate optimal "price of combining" that reflects the largeness of the class of allowed combinations; see Chen and Yang (2010) for a literature review. Successes of combining different predictions in applications have prompted further interest.
It is not our purpose here to combine all existing nonparametric procedures. We will therefore restrict our attention to the class of regularized linear expansions of the regression function in either spline bases or wavelet bases, and to two supplemental methods developed recently in the machine learning literature that can handle regression functions with heterogeneous smoothness, namely spatially adaptive regression splines with accurate knot selection schemes and ℓ1 trend filtering. After briefly describing each of these procedures below, we will combine them using an aggregation method developed by Yang (2001) and called Adaptive Regression by Mixing (ARM). Results from Yang (2001) and Catoni (2004) show that the combined regression estimator achieves the best performance offered by the candidates in terms of accumulated risk.

Spatially adaptive regression splines (SARS)
Usually, good approximations of inhomogeneous functions by spline functions require a set of highly unevenly distributed knots. Zhou and Shen (2001) proposed an adaptive spline procedure based on a new knot selection scheme for nonparametric regression. The proposed procedure uses certain special local properties of spline functions in knot selection, and thus overcomes the knot confounding problem and the high computational complexity in adaptive estimation encountered by spline procedures based on the traditional knot addition and deletion method. For a spatially inhomogeneous regression function that is smooth in one region and nonsmooth in another (for example at the boundaries or in the vicinity of change-points), one needs to insert more knots in areas where the regression function is less smooth. To achieve this task, and to improve the computational efficiency of the knot selection, the authors make use of a guided knot search, which makes the search more efficient. We will not describe the method further here; the interested reader is referred to Zhou and Shen (2001) for details and properties of their estimates. We are grateful to the authors for providing an R implementation of their algorithm (SARS).
ℓ1 trend filtering

When we reviewed trend filtering we implicitly assumed that the design locations are evenly spaced; Ramdas and Tibshirani (2016) extended ℓ1 trend filtering to irregularly spaced data using a specialized ADMM algorithm. Fortuitously, little needs to be changed in the trend filtering problem when one moves from equispaced design points to arbitrary ones; the only difference is that the operator D^(k+1) is replaced by D^(t,k+1), which is adjusted for the uneven spacings present in the design. These adjusted difference operators are still banded with the same structure, and are still defined recursively. Under appropriate regularity conditions on the design, the resulting ℓ1 trend filtering retains the same asymptotic optimality as in the equidistant design case.
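The uneven-spacing adjustment is, up to row scaling, the classical divided-difference construction; the Python sketch below (ours) builds the order-(k+1) divided-difference matrix recursively and checks that it annihilates polynomials of degree k on an arbitrary design, which is precisely the property the D^(t,k+1) operators preserve.

```python
import numpy as np

def divided_diff_matrix(t, k):
    """Order-k divided-difference matrix for arbitrary sorted inputs t:
    row i holds the weights of f[t_i, ..., t_{i+k}], built via the recursion
    f[t_i,...,t_{i+j}] = (f[t_{i+1},...,t_{i+j}] - f[t_i,...,t_{i+j-1}])
                         / (t_{i+j} - t_i)."""
    n = len(t)
    D = np.eye(n)
    for j in range(1, k + 1):
        D1 = np.diff(np.eye(n - j + 1), axis=0)    # plain first differences
        D = (D1 @ D) / (t[j:] - t[:-j])[:, None]   # divide by span t_{i+j} - t_i
    return D
```

An order-(k+1) divided difference of a degree-k polynomial is zero, and applied to t^{k+1} it returns the leading coefficient; for equispaced t this matrix reduces to D^(k+1) up to a constant row scaling.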
We focus now on adaptive regression by mixing (ARM). When the noise is normally distributed, ARM uses least squares as the discrepancy measure in the core step to apportion the weights to each candidate, and leads to good theoretical results (Yang, 2001). We apply K nonparametric regression procedures to the data: procedure j yields an estimator f̂_j. Denote the set of the K candidate methods by Γ. For simplicity, assume that n is even and that the data are sorted. The ARM algorithm is:
Step 1. Split the data into two parts Z^(1) and Z^(2) of equal size n/2.
Step 2. Based on Z^(1), apply all the estimation procedures in Γ to get the regression estimates f̂_j, j = 1, . . . , K, and compute the mean squared error d̂_j = (2/n) Σ_{Z^(1)} (Y_i − f̂_j(T_i))² for each candidate procedure j.
Step 3. For each procedure j ∈ Γ, predict Y_i by f̂_j(T_i) on Z^(2) and compute the overall measure of discrepancy D̂_j = Σ_{Z^(2)} (Y_i − f̂_j(T_i))².
Step 4. Compute the weight for procedure j as
Ŵ_j = d̂_j^{−n/4} exp(−D̂_j/(2 d̂_j)) / Σ_{k∈Γ} d̂_k^{−n/4} exp(−D̂_k/(2 d̂_k)).   (4.2)
Step 5. Let f̂_ARM = Σ_{j∈Γ} Ŵ_j f̂_j.
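The weighting step can be sketched as follows; this is our Python transcription of the Gaussian-noise ARM weights of Yang (2001), with variable names ours and log-weights used to stabilise the exponentials.

```python
import numpy as np

def arm_weights(y1, fits1, y2, preds2):
    """ARM weights for Gaussian noise, a la Yang (2001).

    y1/fits1: responses and per-method fitted values on the training half Z(1);
    y2/preds2: responses and per-method predictions on the evaluation half Z(2).
    Weight_j is proportional to sig_j^(-n2) * exp(-D_j / (2 sig_j^2)), with
    sig_j^2 the method-j residual variance on Z(1) and D_j its squared
    prediction error on Z(2)."""
    n2 = len(y2)
    sig2 = np.mean((y1[:, None] - fits1) ** 2, axis=0)  # per-method variance
    D = np.sum((y2[:, None] - preds2) ** 2, axis=0)     # per-method discrepancy
    logw = -0.5 * n2 * np.log(sig2) - D / (2.0 * sig2)
    logw -= logw.max()                                  # stabilise exponentials
    w = np.exp(logw)
    return w / w.sum()
```

When one candidate both fits Z^(1) well (small variance estimate) and predicts Z^(2) well (small discrepancy), its weight dominates the mixture.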

U. Amato et al.

Assuming that K is finite and fixed and that each f̂_j is asymptotically optimal within the class of functions it is designed to estimate, the resulting estimate is asymptotically as good as the best estimate in Γ (Yang, 2003, Theorem 1). Such a result therefore justifies the use of this estimator for correcting the boundary problem.

Greedy pursuit and best basis selection from multiple libraries
In this subsection we handle the boundary problems by approximating the regression function f by a finite linear combination of elements of a given dictionary D of functions. Here, by dictionary, we mean a union of several bases or collections of general waveforms from L2([0, 1]). One of the motivations for utilizing general overcomplete dictionaries rather than orthonormal systems is that it is not clear which orthonormal system, if any, is best for representing or approximating f. Thus, dictionaries are preferred, and to manage the number of computations, matching pursuit algorithms will be used with the aim of building "suboptimal yet good" finite approximations through a greedy selection of elements within the dictionary D.
We shall focus our attention on the BSML procedure recently proposed in Sklar et al. (2013). It selects basis functions adaptively from the union of multiple libraries, where each library consists of basis functions with similar forms and properties. Compared to using a single library, the advantage of using multiple libraries is that only relatively few basis functions need to be selected from each library to approximate the target function, particularly if the target function is spatially inhomogeneous and if the basis functions in different libraries capture different inhomogeneous features found in the true function. There are infinitely many choices for the libraries. Libraries may be selected from different families, including Fourier, spline, radial, wavelet bases and so on. They may also be selected from different types within a family: for the spline family we can have B-splines, truncated polynomials and reproducing kernel representers, and within each type we can specify different orders of basis, e.g., linear or cubic for polynomial splines; for wavelet families we can choose wavelets of different types, and within each type we can specify different filter numbers, resulting in more or less smooth wavelets.
To briefly describe the procedure, BSML starts with a null library L_0, which contains all the basis functions that will be included in the model automatically. Let m = |L_0| and let M be a pre-specified maximum number of basis functions we want to select (including those in L_0). The number M is closely related to the best rate of "suboptimal yet good" finite approximation through the greedy selection of elements within the dictionary union of multiple libraries. Basis functions are selected one at a time from the L additional libraries that define the dictionary D. At each step, denote the sequentially selected basis functions as b_k for k = m + 1, . . . , M. Let B_k = {b_1, . . . , b_k} for k = m, . . . , M, where B_m = L_0. Write "model B_k" for "a linear combination of the basis functions in B_k", and M_k for the modeling procedure that includes both the basis function selection and estimation steps. The BSML procedures utilize the generalized degrees of freedom (GDF) to measure the complexity of a modeling procedure. This approach can be computationally demanding, since the GDF needs to be estimated at every step of the forward selection process. At each forward selection step, a greedy search for only one basis function to add to the current model M_k is performed. For more details on the procedures the interested reader is referred to the paper by Sklar et al. (2013).
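The greedy backbone of such a forward selection can be sketched as orthogonal matching pursuit over the dictionary columns; the simplified Python sketch below is ours, and omits the GDF scoring that BSML adds at each step.

```python
import numpy as np

def omp(y, D, max_atoms):
    """Greedy forward selection (orthogonal matching pursuit) over a
    dictionary D whose columns may come from several libraries."""
    selected = []
    resid = y.copy()
    for _ in range(max_atoms):
        # score each atom by its normalized correlation with the residual
        corr = np.abs(D.T @ resid) / np.linalg.norm(D, axis=0)
        corr[selected] = -np.inf          # do not reselect an atom
        selected.append(int(np.argmax(corr)))
        B = D[:, selected]                # refit on all selected atoms
        coef, *_ = np.linalg.lstsq(B, y, rcond=None)
        resid = y - B @ coef
    return selected, coef
```

With an orthonormal dictionary the procedure recovers a sparse combination exactly; with an overcomplete union of libraries it yields the "suboptimal yet good" greedy approximation described above.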
To explore numerically the advantage of multiple libraries of basis functions using advanced model selection methods for treating the boundary problems in nonparametric regression for functions of heterogeneous smoothness, we have used a collection of R functions available in the R package bsml whose source can be downloaded from Yuedong Wang's web site.

Gaussian processes and stochastic partial differential equations
As suggested by one of the reviewers of our work, another possibility for deriving flexible models that are practical to work with for estimating functions from noisy data is offered by Gaussian processes (GP) and stochastic partial differential equations (SPDE). We therefore add in this subsection some brief additional discussion of these two approaches.
A basic idea of how Gaussian process models can be used for such a task is to formulate a Bayesian framework for regression. In order for GP techniques to be of value in practice, one must choose between different mean and covariance functions in the light of the data at hand, reflecting any prior knowledge about the structure of the function of interest (a process that is referred to as training the GP model). The interested reader is referred to the book by Rasmussen and Williams (2005) for a comprehensive exposition of Gaussian process regression models. These models can also be extended to handle piecewise-smooth functions with boundary constraints by adapting them for smoothing in the presence of change-points, which may be seen as more or less abrupt changes in the properties of the observed data. The paper by Osborne, Garnett and Roberts (2010) describes prior covariance functions for one-dimensional regression that model change-points and faults of many different types, gives a Bayesian solution to the smoothing of data from sources that may contain change-points, and provides some MATLAB code. We prefer not to pursue this approach here, since training a GP model involves both model selection, i.e. the discrete choice between different functional forms for the mean and covariance functions, and the adaptation and estimation of the hyper-parameters of these functions, which could be a disadvantage compared to the nonparametric methods discussed in our paper.
Concerning the second approach, the SPDE approach introduced by Lindgren and Rue (2008) and implemented in the R-INLA software package (Rue, Martino and Chopin, 2009) is also a flexible and efficient method to analyse data exhibiting complicated boundary constraints. Basically, the SPDE approach involves applying a differential operator D to a stochastic process representing the structured dependence among observations; but this cannot be done in the same way as when one applies D to a known function, because the process is random and, in many cases, its realizations will not be suitably differentiable. Moreover, available software implementations are difficult to customize without high-level technical knowledge, limiting application to only those models available in the software or specially requested from software developers. We therefore prefer a presentation of SPDE smoothing that explicitly draws links with basis-penalty smoothing approaches. There is a direct correspondence between smoothing and stochastic processes, and the works of Fahrmeir and Lang (2001), Lindgren and Rue (2008) and Yue et al. (2014) show how basis-penalty smoothers in a Bayesian framework can be interpreted within the SPDE paradigm. For practical purposes, one may use the results of Miller, Glennie and Seaton (2019) and their R-code to better understand the implementation and theory behind the SPDE approach. We believe, however, that the results obtained using such an SPDE approach to smoothing data with the typical boundary constraints addressed in our paper are comparable to those of the methods we consider.

Multidimensional problems
The present paper specifically addresses unidimensional functions. However many applications involve multidimensional problems, so that it is interesting to briefly consider a possible generalization to this setting.
It is beyond the scope of the paper to present a full theoretical treatment and extensive experimentation. However, we observe that all methods can easily be plugged into an additive framework by relying on a backfitting procedure. On the other hand, one of them is naturally suited to a multidimensional setting without the need to resort to a backfitting iteration. This is the case of matrix stacking regression (Section 4.3), where, analogously to the procedure of stacking matrices of different bases for the same dimension of the data, these stacked matrices are further stacked across all dimensions, giving rise to a simultaneous estimate of the coefficients of the series expansion across all dimensions. In this respect we observe that grouped penalization can also be invoked to achieve selection of dimensions and/or regression methods.

Numerical experiments
The present Section describes numerical experiments carried out on some test functions with a selection of the methods considered in the previous Sections or available from the literature.

Methods
Throughout the paper, the following methods have been considered for comparison (see Table 1).

WAV (Wavelets): the wavelet expansion method with the MCP and regularization parameter λ chosen by GCV. The wavelet basis is generated from Daubechies Extremal Phase wavelets with 5 vanishing moments and maximum level 4. Due to the periodicity of the wavelets, the basis is not effective at the boundaries for nonperiodic functions.

CDV (CDV Wavelets): the wavelet expansion method discussed in Subsection 2.3 with the MCP and regularization parameter λ chosen by GCV, considering the wavelet basis constructed on the finite interval wavelets of Section 2.2. In this way the method is also suited for generally nonequispaced and non-dyadic grids. We set the filter number to 3 and the highest level to 4.

LPWR (Local Polynomial Wavelet Regression): the method presented in Subsection 3.2. It is claimed to take boundaries into account and therefore to improve accuracy there. LPWR works only for an equispaced and dyadic grid.

WHYBRID (Hybrid LPWR): an adapted version of LPWR (Subsection 3.2) where an MCP penalization term is used for the regression of the residuals instead of wavelet thresholding. In this way the method is suited for grids that are not necessarily equispaced and dyadic. Moreover, it is designed to improve accuracy of the solution at the boundaries.

WMESH (waveMesh; Haris, Simon and Shojaie, 2018): based on a wavelet decomposition on a generally non-equispaced grid, using a linear interpolation scheme from wavelets on a regular grid to the actual data grid and an ℓ1 penalization functional; a proximal gradient descent algorithm developed in Parikh (2014) is used to solve the corresponding optimization problem. The method is suited for generic non-equispaced grids and has no special treatment of boundary conditions.

TREND (Trend Filtering): the method described in Subsection 3.4 in its version adapted to generic non-equispaced grids. No special treatment of boundaries is present.
AC (Adaptive Combination): the method presented in Subsection 4.1 with basic regressors given by SARS, Trend Filtering and Wavelet series. Generalizing Eq. (3.4), the AC combination is obtained by unconstrained OLS with respect to the weights w_1, w_2 and w_3, minimizing ||Y − w_1 f̂_TREND − w_2 f̂_W − w_3 f̂_SARS||²_n, with f̂_TREND, f̂_W and f̂_SARS being the solutions obtained by Trend Filtering, Wavelet series and SARS, respectively. Inclusion of SARS among the regression methods is meant to achieve a better behavior at the boundaries. The degree of the spline basis is 3. The wavelet basis is given by Coiflets with 3 vanishing moments and a maximum level of the transform of log2 n − 2; MCP is used for penalization. It works for generic grids.

SSW (Stacking of Splines and Wavelets): spline and wavelet solutions are mixed, the former being included to improve accuracy at the boundaries, see Eq. (4.1). The number of knots for the spline basis is N/4, with lower and upper bounds 5 and 35, respectively. The wavelet family is Coiflet with 5 vanishing moments and 2 levels of the transform. MCP is used for penalization.

MSR (Matrix Stacking Regression with Splines and Wavelets): the method presented in Subsection 4.3, suited for generic grids and aimed at improving accuracy of the solution at the boundaries thanks to the inclusion of splines. The same hyperparameters as SPL and WAV are used for the spline and wavelet bases, respectively.

ARM (Adaptive Regression by Mixing): the method described in Subsection 4.4. We weight the solutions obtained by the spline and wavelet bases, the former intended to improve accuracy at the boundaries.

BSML (Basis Selection Multiple Libraries): the method discussed in Section 4.5. It is suited for a generic grid; the inclusion of a spline library is intended to improve accuracy at the boundaries.
Several other methods and/or different variants have been included in the comparisons, especially mixing different regressors. They are not reported here because of poorer results. R-codes implementing the above methods as well as the simulations and the scripts to produce plots and tables are available as Supplementary Material.

Test functions
We consider three synthetic test functions representative of typical regularity and/or boundary conditions, so as to have full control of the accuracy of the results. All functions have support in [0, 1] for simplicity. The data model is y(x_i) = f(x_i) + σ ε_i, i = 1, . . . , n, x_i ∈ [0, 1], where the ε_i are i.i.d. Gaussian and σ is such that the Signal to Noise Ratio (SNR) is 3.5, with SNR defined as SNR = var(y)/σ². The test functions are defined as follows. Sin: f(x) = sin(11x); it is a regular function discontinuous at the boundaries (|f(0) − f(1)| ≈ 1). Irregular: a nonsmooth function with three discontinuities and 2 bumps. Heavisine: the classical heavisine test function. All the functions are normalized to have standard deviation 7. We consider two different grid designs: equispaced and random, the latter generated by a uniform distribution on [0, 1]. Finally, we choose two different sample sizes to represent small and medium sizes (n = 128 and n = 512).
In order to estimate the accuracy of the different methodologies we rely on the Root Mean Square Error, RMSE = ( n^{-1} Σ_{i=1}^n (ŷ(x_i) − f(x_i))² )^{1/2}, with ŷ being the estimate obtained by any method. To evaluate accuracy at the boundaries in particular, we also computed a specific RMSE at the boundaries (bRMSE), restricted to grid points within a maximum distance δ = 0.1 from the boundaries.
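The two error measures can be written as one small helper (ours, in Python), where the boundary variant simply restricts the average to design points within δ of either endpoint:

```python
import numpy as np

def rmse(y_hat, f, t=None, boundary=None):
    """Root mean square error; if boundary=delta is given, restrict to grid
    points within delta of either endpoint of [0, 1] (the bRMSE of the text)."""
    if boundary is not None:
        mask = (t < boundary) | (t > 1.0 - boundary)
        y_hat, f = y_hat[mask], f[mask]
    return np.sqrt(np.mean((y_hat - f) ** 2))
```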
Experiments were replicated 100 times and RMSE and bRMSE were averaged over the replicates.

Results
Tables 2 and 3, which refer to the total error of the experiments for the three test functions and the two types of grid with sizes 128 and 512, respectively, show that the methods adapted to boundaries do improve the overall accuracy of the estimates.
We observe that LPWR was not included among the competitors in the case of the random grid. Moreover, WHYBRID shows some instabilities for the random grids (all the more so with the larger sample size) that degrade its performance; it was therefore not included in the boxplots because of the corresponding high-value outliers for n = 512. On some occasions the iterations diverge. Figures 1-2 present the corresponding boxplots, which also show the variation of the performance over the repetitions and the outliers, for n = 128 with the equispaced grid and n = 512 with the random uniform grid, respectively.
SSW is the clear winner among the methods, being ranked first in 9 cases out of 12. The closest runner-up is ARM, which is ranked among the best three methods 5 times out of 12. In this respect we mention that the accuracy of ARM is in general very close to that of SSW. Indeed both methods share common steps: both are based on a mix of regressors (chosen to be the same); however, ARM adopts a 2-fold cross validation (without repetitions) to train the regression, while SSW relies on a full leave-one-out procedure. Moreover, the mixing coefficients are estimated by (constrained) linear regression in SSW, while a closed formula (Eq. 4.2) is adopted for ARM.
WHYBRID and MSR also show quite good performance in general. All of them outperform Splines and Wavelets, which can be considered the prototypes of methods for nonperiodic smooth and for periodic irregular functions, respectively. This is confirmed also when the error at the boundaries is considered (bRMSE, Tabs. 4 and 5). Analyzing the results in the tables in greater detail, we observe the good behaviour of LPWR and TREND in the case of the equispaced and random grids, respectively (see in particular Figs. 1-2); both are 3rd in a virtual ranking among the methods, after SSW and ARM. If we restrict our attention to the boundaries (error bRMSE), we notice the good performance of LPWR (virtually ranked 2nd after SSW), which outperforms all other competitors in 3 cases out of 6. This confirms the effectiveness of the method at the boundary, but to the detriment of the performance in the middle. Finally, we observe the good performance of Splines and Wavelets on specific functions, namely Sin for Splines and Irregular for Wavelets. In particular, the latter ranks first, together with SSW, in a virtual ranking among the methods for this function. Indeed these functions satisfy the assumptions of the respective methods in terms of regularity and periodicity conditions.
Finally, we show in Fig. 3 the average of the retrieved functions over the 100 repetitions (thereby estimating the bias of the estimators) for an equispaced grid of size 512 and the best estimator (SSW). The estimate of the smooth parts of the functions is very good, including at the boundaries. Irregular features such as discontinuities or bumps are reproduced less satisfactorily (in particular, the discontinuities in the Heavisine function are not captured).
Summarizing, conventional methods based on a single regressor, on splines or on wavelets, behave quite well for the specific functions that meet the assumptions they are based on. However, their accuracy quickly degrades when such requirements are not satisfied. SSW and ARM, which show some similarities in their procedures, are the best estimators. LPWR performs well at the boundaries, for which it is designed, but not as well in the interior of the function; in addition, it can only be applied to equispaced, dyadic grids. Its upgraded version (WHYBRID), adapted to general grids, is degraded by some instabilities in the case of random grids. By contrast, a mix of regressors, individually chosen with complementary properties and assumptions, does increase the performance of the methods. This is particularly effective at the boundaries, where the inclusion of a method well behaved there, such as Splines or Local Polynomial regression, greatly improves performance not only at the boundaries but over the whole domain.
Several other methods included in the present analysis, and others not reported here, did not perform well in our experiments (PWR, CSRecSP based on compressive sensing (Dai and Milenkovic, 2008), SARS, and several variants of the methods introduced).

Real examples
In this Section we consider two real examples and compare the solutions obtained by the two best methods (SSW and ARM) with classical Wavelet and Spline regression.
Motorcycle data This example was introduced in Silverman (1985). It consists of acceleration data (in gravity units) over time after a motorcycle crash. It is an example of a fairly smooth function (the time span is very short, 57 ms after the impact) with boundary values that can be considered almost equal. Fig. 4 shows the fits obtained by SPL, WAV, ARM and SSW. All solutions fit the data well, with the Spline fit exhibiting a smoother behaviour.

Wool data This example is reported in Diggle (1991). The source of the data is the Australian Wool Corporation, which recorded the price of wool weekly from July 1976 to June 1984. The prices are the floor price, set by the Corporation, and the price actually paid for a particular grade (19 μm nominal thickness in the data set), which turns out to be somewhat different. The data can be downloaded at the link http://lib.stat.cmu.edu/datasets/diggle. The time series, which shows a seasonal trend, is complicated by a devaluation of the Australian dollar in 1983 that generated a discontinuity. As a consequence, the data are less regular; in addition, the values at the boundaries are completely different. Fig. 5 shows the fits obtained by SPL, WAV, ARM and SSW. As expected, the Wavelet regression is unable to handle the differing boundary values. All other methods reproduce the general behaviour of the function quite well, with the Spline fit again being smoother but unable to reproduce some parts exhibiting sharper variability.

Conclusions
This paper dealt with the problem of nonparametric regression of a univariate function when the distribution of nodes is equidistant, fixed or random. As is well known, the most common and effective methods introduce artifacts at the boundaries when their assumptions are not satisfied by the actual data (e.g., periodicity conditions). This is more pronounced with wavelets, which outperform splines and polynomial models from the theoretical point of view when the regularity of functions is considered, but whose boundary constraints are particularly restrictive.
The present paper introduced some ideas on how to circumvent the boundary problem, using as examples some recent nonparametric regression methods claimed to be among the most accurate. These include, in particular, mixing or aggregation of models and methods based on libraries of bases. The key idea is to include at least one method that is accurate also for nonsmooth functions, hence wavelets, and one that is well behaved at the boundaries, e.g., splines or polynomial regression. As an additional benefit, the current methods, particularly those based on wavelets, were also adapted to handle generic nodes, such as nondyadic and nonequispaced ones.
Proofs of convergence for a class of mixing methods were given in the case of equispaced nodes. Experiments on simulated data have shown that these ideas improve the accuracy of the fit not only at the boundaries but over the whole domain of the function. Finally, two real-data applications were illustrated.
The paper considered several methodologies, either as the basis of the ideas or simply for comparison. For many of them, not reported here for brevity, the results turned out to be poor. Among them we mention locally adaptive stacking, where the weights assigned to the different methodologies depend on t. The idea was to weight the single regressions locally, e.g., favouring the more accurate ones where the function is irregular, or the more appropriate ones at the boundaries. We believe this could be an interesting direction to investigate.
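The locally adaptive stacking idea can be sketched as follows: kernel-smooth the squared residuals of each regressor over t, and at each location give more weight to the regressor with the smaller local error. This is a minimal, hypothetical sketch, not the procedure used in the experiments; the function name, the Gaussian kernel and the inverse-error weighting rule are our assumptions for illustration.

```python
import numpy as np

def local_stacking_weights(t, resid1, resid2, bandwidth=0.05):
    """Pointwise weight w1(t) of regressor 1 in a locally adaptive
    stacking of two regressors (hypothetical sketch); resid1, resid2
    are the residuals of the two fitted regressors on the grid t."""
    # Gaussian-kernel smoothing of the squared residuals of each method
    K = np.exp(-0.5 * ((t[:, None] - t[None, :]) / bandwidth) ** 2)
    K /= K.sum(axis=1, keepdims=True)
    e1 = K @ resid1**2
    e2 = K @ resid2**2
    # inverse-error rule: larger local error -> smaller local weight
    w1 = e2 / (e1 + e2)
    return w1  # regressor 2 receives weight 1 - w1 at each t
```

The local prediction would then be `w1 * fit1 + (1 - w1) * fit2`; note that, unlike global mixing, the residuals used here should again come from held-out (e.g., leave-one-out) predictions to avoid favouring overfitted regressors.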
An immediate extension of the methodologies concerns multivariate functions. In this respect, all the proposed methodologies can be straightforwardly plugged into an additive framework by relying on backfitting. In addition, one of them, namely MSR, is natively ready to be applied in a multidimensional setup without needing backfitting iterations.
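The additive extension via backfitting can be sketched as follows. Any of the univariate estimators discussed above plays the role of the generic smoother `smooth(x, r)` below; the function names and the two-component setup are our illustrative assumptions.

```python
import numpy as np

def backfit(x1, x2, y, smooth, n_iter=20):
    """Backfitting for a two-component additive model
    y = mean + f1(x1) + f2(x2) + noise.  `smooth(x, r)` is any
    univariate regression routine returning fitted values of r
    against x (e.g., one of the mixed estimators discussed above).
    Hypothetical sketch of the additive extension."""
    f1 = np.zeros_like(y, dtype=float)
    f2 = np.zeros_like(y, dtype=float)
    mean = y.mean()
    for _ in range(n_iter):
        # fit each component to the partial residuals of the other
        f1 = smooth(x1, y - mean - f2)
        f1 -= f1.mean()  # centre each component for identifiability
        f2 = smooth(x2, y - mean - f1)
        f2 -= f2.mean()
    return mean, f1, f2
```

The loop cycles until the component fits stabilize; each univariate `smooth` call is exactly the one-dimensional regression problem addressed in the paper, which is why the proposed methods plug in directly.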