Gaussian linear model selection in a dependent context

In this paper, we study the nonparametric linear model, when the error process is a dependent Gaussian process. We focus on the estimation of the mean vector via a model selection approach. We first give the general theoretical form of the penalty function, ensuring that the penalized estimator among a collection of models satisfies an oracle inequality. Then we derive a penalty shape involving the spectral radius of the covariance matrix of the errors, which can be chosen proportional to the dimension when the error process is stationary and short range dependent. However, this penalty can be too rough in some cases, in particular when the error process is long range dependent. In a second part, we focus on the fixed-design regression model assuming that the error process is a stationary Gaussian process. We propose a model selection procedure in order to estimate the mean function via piecewise polynomials on a regular partition, when the error process is either short range dependent, long range dependent or anti-persistent. We present different kinds of penalties, depending on the memory of the process. For each case, an adaptive estimator is built, and the rates of convergence are computed. Thanks to several sets of simulations, we study the performance of these different penalties for all types of errors (short memory, long memory and anti-persistent errors). Finally, we give an application of our method to the well-known Nile data, which clearly shows that the type of dependence of the error process must be taken into account.


Introduction
Let us consider the linear model (1), where Y is the n-dimensional vector of observations, t* is an unknown (deterministic) vector to be estimated, and ε is the vector of errors. It is well known that Model (1) can serve as a canonical model for a large class of statistical problems (see [BM01a]). In this paper, we focus on the estimation of the vector t* via a model selection approach, in the general framework where the error process ε is a dependent Gaussian random vector with covariance matrix Σ. Our first goal is to give the theoretical form of the penalty function, depending on Σ, ensuring that the penalized estimator among a collection of models satisfies an oracle inequality. This model has been widely studied for independent and identically distributed (i.i.d.) errors, in particular by Birgé and Massart in the Gaussian case [BM01a]. Baraud worked in the general i.i.d. case, first with a deterministic design [Bar00], then with a random design [Bar02]. Some extensions of these results to a β-mixing framework are presented in [BCV01]. The idea of using a penalty function goes back to the pioneering works of Akaike [Aka73] and Mallows [Mal73]. Later, Birgé and Massart developed a nonasymptotic approach to penalized model selection [BM01a], [BM01b], [BM07].
We follow in this paper the strategy developed by Birgé and Massart which is based on a non-asymptotic control of the fluctuations of the empirical contrast.
Let us be more precise. In order to find a linear subspace that realizes a bias-variance tradeoff, we introduce a finite collection of models {S_m, m ∈ M}, denoting by d_m the dimension of S_m. Let t̂_m be the least squares estimator Proj_{S_m}(Y) of t* on S_m. A penalization strategy is used, selecting a model by a criterion of the form m̂ ∈ argmin_{m∈M} { ‖Y − t̂_m‖_n² + pen(m) }, where ‖·‖_n denotes the (normalized) Euclidean norm on R^n, and pen : M → R_+ is a penalty function defined on the family of models. Following the Birgé and Massart approach, we derive a penalty function which provides an oracle inequality for the model selection procedure in the dependent Gaussian framework.
In Section 2, a general penalty shape is presented. The main term is the quantity tr(Proj_{S_m} Σ) (tr denoting the trace), which plays the same role as the term Var(ε_1) d_m in the results of Birgé and Massart for i.i.d. Gaussian errors. Similar penalties have already been introduced by Gendre [Gen14] in the context of model selection for additive regression. However, Gendre [Gen14] is not interested in the same questions as us: he is concerned with additive regression, whereas our objective is to study Gaussian regression with dependent errors. As in our work, the analysis of [Gen14] is based on a general Gaussian model selection result, but for our purpose the general penalty form we provide is more appropriate than that of [Gen14]. In addition, the assumptions of [Gen14] do not cover the context of long range dependent or anti-persistent errors.
Note that the trace tr(Proj_{S_m} Σ) is bounded by d_m ρ(Σ), where ρ(Σ) is the spectral radius of the covariance matrix. Hence, neglecting some residual terms (see Section 2), the following penalty can be used for any K > 1: pen(m) = K ρ(Σ) d_m / n. (2) For instance, if the error process is a short memory stationary process with bounded spectral density, then the spectral radius is bounded, and this penalty shape is very close to the i.i.d. case, up to a constant. The penalty can still be chosen proportional to the dimension, as in the i.i.d. case, but the usual variance term is replaced by the spectral radius of the covariance matrix. However, the penalty (2) may be too rough in some cases, in particular if the error process is long range dependent. To see how to handle this case in a concrete situation, we study in Sections 3 and 4 the fixed-design regression model Y_i = f*(i/n) + ε_i, i = 1, …, n, (3) where (ε_i)_{i≥1} is a stationary Gaussian process. By standard arguments, this model can be written as a special case of the generic Model (1). Note that Model (3) has been widely studied in the literature (with possibly non-Gaussian errors) via kernel or wavelet methods.
For kernel estimators, let us first quote the paper by Hall and Hart [HH90], who considered a particular class of Gaussian errors. The authors showed in particular that, for a twice differentiable function f*, the rate is the same as in the i.i.d. case if and only if ∑_{k>0} |Cov(ε_1, ε_k)| < ∞, and they gave minimax rates in the long range dependent case. Let us also cite the papers by Csörgő and Mielniczuk [CM95a], [CM95b], [CM95c] (long memory is considered in [CM95b] and [CM95c]), Tran et al. [TRYTV96] (short memory case), and Robinson [Rob97]. Robinson's article provides very general results for short range and long range dependent processes, and rates of convergence for anti-persistent errors (also called negatively correlated errors) can be derived from his Lemma 3. Local polynomial fitting with long memory, short memory and anti-persistent errors is considered by Beran and Feng [BF02]. Note that none of these articles addresses the issues of adaptive estimation or data-driven bandwidth selection.
For wavelet-type estimators, let us first quote the paper by Wang [Wan96], who gave minimax results in the long range dependent case, when the function f* belongs to a Besov class. Let us also cite the papers by Johnstone and Silverman [JS97], Johnstone [Joh99], and more recently Li and Xiao [LX07] and Beran and Shumeyko [BS12]. These four papers addressed the issue of a data-driven choice of the threshold. Theorem 1 in [Joh99] gives a very precise minimax result (up to constants), but for an asymptotic model which is a bit different from (3) (see the discussion at the end of [Joh99]). By adapting the block thresholding method described in Hall et al. [HKP99] to the long memory case, Li and Xiao [LX07] showed that block thresholded wavelet estimators are adaptive and minimax for a large class of functions.
In Sections 3 and 4 of the present paper, we propose a model selection procedure to estimate f* via piecewise polynomials on a regular partition of size m. The choice of piecewise polynomials is very natural here, since the function f* is supported on [0, 1], and such estimators do not exhibit bad behavior near the boundary. We show that:
• For short memory error processes (i.e. when ρ(Σ) is uniformly bounded), the penalty is of the form pen(m) = K m/n (for some constant K > 0 to be calibrated), the penalized estimator is adaptive with respect to the unknown regularity of the function f*, and it yields the same rates of convergence as in the i.i.d. setting.
• For long memory processes, that is, when the auto-covariances γ_ε(k) of the error process satisfy γ_ε(k) ≤ κ k^{−γ} for some κ > 0 and γ ∈ (0, 1), the penalty is a concave function of m/n: pen(m) = K (m/n)^γ (for some constant K > 0 to be calibrated). The penalized estimator is adaptive with respect to the unknown regularity of the function f*, and it yields the same minimax rates of convergence as in [Wan96].
• For anti-persistent errors (in the sense of Condition (12) in Section 3.3) and in the case of regressograms (piecewise polynomials of degree 0), the main part of the penalty is a convex function of m/n (with a constant K > 0 to be calibrated). The penalized estimator is adaptive with respect to the unknown regularity of the function f*, and it yields faster rates of convergence than in the i.i.d. setting. Note that similar rates can also be deduced from Lemma 3 in [Rob97].
In Section 4, we simulate different kinds of short memory processes (a Gaussian ARMA(2,1) process, two non-Gaussian β-mixing Markov chains), of long memory processes (a fractional Gaussian noise with Hurst index in (1/2, 1), and a non-Gaussian β-mixing Markov chain), and an anti-persistent process (a fractional Gaussian noise with Hurst index in (0, 1/2)). For regressograms on a regular partition of size m, we investigate different kinds of penalties: the usual penalty proportional to m/n, a penalty proportional to (m/n)^γ in the case of long range dependent or anti-persistent errors, and some penalties for which γ is estimated via an estimator of the Hurst index based on the Y_i's or on the residuals. Finally, an important message of this paper is that the slope heuristics [BM07] can be adapted to calibrate penalties in the context of regression with dependent errors.
In Section 5, we give an application of our method to the well-known Nile data, and we continue the discussion started in Robinson's article [Rob97]. In Section 6, we discuss other possible applications of the general results of Section 2. Finally, Section 7 is devoted to the proofs of the results of Sections 2 and 3.

General setting
Recall the equation of the Gaussian linear model (1), where the mean vector t* belongs to R^n and the error vector ε is a Gaussian random vector. We consider the general setting where the components of Y are not independent. The covariance matrix Σ is an n × n positive semidefinite matrix with eigenvalues λ_1 ≥ ⋯ ≥ λ_n ≥ 0. The aim is to estimate the unknown vector t* from the observation Y. One standard strategy is to constrain the estimator to belong to a given linear subspace S of R^n. Let ‖·‖_n denote the (normalized) Euclidean norm on R^n. The least squares contrast is defined for t ∈ R^n by ‖Y − t‖_n². With a slight abuse of notation, we write Proj_S both for the projection operator on S and for its matrix in the canonical basis. The L² risk of an estimator t̂ is defined by R(t̂) = E‖t̂ − t*‖_n², where the expectation is under the distribution of Y. Using the Pythagoras equality in R^n together with (1), we find that the risk of Proj_S(Y) satisfies the bias-variance decomposition R(Proj_S(Y)) = ‖(Id − Proj_S) t*‖_n² + E‖Proj_S(ε)‖_n². The bias ‖(Id − Proj_S) t*‖_n² is small for a large enough linear subspace S. It can be easily checked that the variance term is equal to E‖Proj_S(ε)‖_n² = tr(Proj_S Σ)/n; see the proof of Theorem 2.1. As in the i.i.d. case, the variance term tends to increase with the dimension of S.
In order to find a linear subspace that realizes a bias-variance tradeoff, we introduce a finite collection of linear subspaces {S_m, m ∈ M}, that we call models, and we denote by d_m the dimension of S_m. For m ∈ M, we denote by t̂_m the least squares estimator Proj_{S_m}(Y) of t* on S_m. We also introduce the oracle model m_0, that is, the model that provides the least squares estimator with minimum risk: m_0 ∈ argmin_{m∈M} R(t̂_m). The aim is now to select a model in the collection such that the risk of the selected estimator is as close as possible to the risk of the oracle.
The true risk R(t̂_m) of t̂_m being unknown in practice, we introduce the empirical risk ‖Y − t̂_m‖_n². Obviously this criterion cannot be used to select a model in the collection, because of overfitting. We follow a penalization strategy [Aka73, Mal73, BM01a, Mas07] by selecting a model with a criterion of the form m̂ ∈ argmin_{m∈M} { ‖Y − t̂_m‖_n² + pen(m) }, where pen : M → R_+ is a penalty function defined on the family of models. In this paper we perform a nonasymptotic analysis of the risk of the selected estimator t̂_m̂ [Mas07]. In this way we derive a penalty function which provides an oracle inequality for the model selection procedure in the dependent Gaussian context.
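To make the selection criterion concrete, here is a minimal numerical sketch (our own illustration, not the authors' code): regressogram models on a regular partition, AR(1) Gaussian errors, and the penalized criterion with an arbitrary constant K.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256

# Hypothetical target vector t* and AR(1) Gaussian errors (illustrative choices).
t_star = np.sin(2 * np.pi * np.arange(n) / n)
eps = np.zeros(n)
z = rng.standard_normal(n)
for i in range(1, n):
    eps[i] = 0.5 * eps[i - 1] + z[i]
Y = t_star + eps

def regressogram_proj(y, m):
    """Project y onto piecewise-constant vectors on a regular partition of size m."""
    out = np.empty_like(y)
    bounds = np.linspace(0, len(y), m + 1).astype(int)
    for a, b in zip(bounds[:-1], bounds[1:]):
        out[a:b] = y[a:b].mean()
    return out

def select_model(y, models, pen):
    """m_hat = argmin_m ||y - t_hat_m||_n^2 + pen(m) over the collection."""
    crit = {m: np.mean((y - regressogram_proj(y, m)) ** 2) + pen(m) for m in models}
    return min(crit, key=crit.get)

K = 2.0  # penalty constant; in practice calibrated by the slope heuristics
m_hat = select_model(Y, range(1, 65), lambda m: K * m / n)
print(m_hat)
```

Without the penalty term, the empirical contrast alone would always pick the largest model; the penalty is what prevents this overfitting.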

A general Gaussian model selection result
Let π = {π_m, m ∈ M} be a probability measure on M: ∑_{m∈M} π_m = 1. We first give a general shape for the penalty function and the corresponding oracle inequality.
Theorem 2.1. Let K > 1. For any penalty function pen : M → R_+ satisfying (5) for all m ∈ M, there exists a constant C > 1, which only depends on K, such that the estimator t̂_m̂ selected by the criterion (4) satisfies the oracle inequality (6). The main term in the penalty shape (5) is the trace term tr(Proj_{S_m} Σ). This quantity plays the same role as the term Var(ε_1) d_m in the results of Birgé and Massart for independent Gaussian errors [BM01a, Mas07]. Of course, this penalty can only be computed if the matrix Σ is completely known. However, we will see that, in certain cases, effective strategies can circumvent this issue (see Sections 3 and 4).
Penalty shapes can also be derived from the upper bounds tr(Proj_{S_m} Σ) ≤ ∑_{i=1}^{d_m} λ_i ≤ d_m ρ(Σ). Actually, with a minor modification of the proof of Theorem 2.1, it can be checked that the risk bound (6) is still valid when the lower bound in (5) is replaced by (7), for any K > 1. If the sequence (ε_i)_{i≥1} is a stationary, short memory Gaussian process, then the spectral radius is bounded and the penalty shape (7) is completely in line with the case of independent Gaussian errors [BM01a, Mas07], the usual variance term Var(ε_1) being replaced by the spectral radius ρ(Σ).
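The bound tr(Proj_{S_m} Σ) ≤ d_m ρ(Σ) can be checked numerically; the sketch below is an illustration under an assumed AR(1) covariance (our choice, not the paper's), using regressogram projection matrices.

```python
import numpy as np

n, phi = 200, 0.5
# Assumed AR(1) covariance: Sigma_ij = phi^|i-j| / (1 - phi^2), a short memory example.
idx = np.arange(n)
Sigma = phi ** np.abs(idx[:, None] - idx[None, :]) / (1 - phi ** 2)
rho = np.linalg.eigvalsh(Sigma).max()            # spectral radius rho(Sigma)

def proj_matrix_regressogram(n, m):
    """Matrix of the projection onto piecewise-constant vectors (m regular cells)."""
    P = np.zeros((n, n))
    bounds = np.linspace(0, n, m + 1).astype(int)
    for a, b in zip(bounds[:-1], bounds[1:]):
        P[a:b, a:b] = 1.0 / (b - a)
    return P

for m in (5, 20, 50):
    tr = np.trace(proj_matrix_regressogram(n, m) @ Sigma)
    # tr(Proj_Sm Sigma) <= d_m * rho(Sigma), here with d_m = m
    assert tr <= m * rho + 1e-9
    print(m, round(tr, 2), round(m * rho, 2))
```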
The three penalty shapes given above depend on the probability π. If the collection of models is not too rich (see for instance [BM01a, Mas07] or Chapter 2 in [Gir14]), π may be chosen in such a way that the corresponding term is smaller than, or of the same order as, the main terms tr(Proj_{S_m} Σ), ∑_{i=1}^{d_m} λ_i or d_m ρ(Σ). To sum up, if the spectral radius is bounded and the collection of models is not too rich, the penalty can be chosen proportional to the dimension d_m, as in the independent case.
It is tempting to keep the penalty shape (7) as a general penalty shape for Gaussian linear model selection with dependent errors.However, as we will see later in the paper, this penalty shape is too rough in some cases.For instance, it cannot lead to minimax rates of convergence for non parametric regression with long range dependent errors (see Subsection 3.2).
At this point, it should be clearly noted that a penalty similar to (5) has been given in [Gen14]. The main difference is that, in the inequality analogous to (6) proved in [Gen14] (Inequality (2.2) of Theorem 2.1 there), the residual term is ρ(Σ)R_n/n instead of ρ(Σ)/n. For the questions he has in mind (which are not directly related to time series), Gendre is able to control this additional term R_n effectively. But it does not seem easy to handle for long range dependent or anti-persistent errors, which are precisely the kinds of error processes we want to study in the present paper.

Non parametric regression with Gaussian dependent errors
In this section we study the fixed-design regression problem with dependent Gaussian errors. Let f* be a function in L^∞([0,1]), and recall the equation of Model (3): Y_i = f*(i/n) + ε_i for i = 1, …, n. Through the embedding I(f) = (f(1/n), …, f(n/n)), we can easily associate a linear subspace of R^n to any linear subspace of L^∞([0,1]). Slightly abusing the notation, we identify the function f with the vector I(f), and we write ‖f‖_n for ‖I(f)‖_n. For F a finite-dimensional linear subspace of L^∞([0,1]), we define the least squares estimator f̂ of f* on F as the minimizer of ‖Y − f‖_n² over f ∈ F. We shall only consider here the linear spaces S_m of R^n induced by the linear spaces F_m of L^∞([0,1]) generated by the piecewise polynomials of degree at most r (r ∈ N) on the regular partition of size m of the interval [0,1]. Obviously, the linear space S_m has dimension d_m = (r+1)m; the case r = 0 corresponds to the regular regressogram of size m.
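As a concrete illustration of these spaces (a sketch with an arbitrary local basis and a hypothetical test function, not taken from the paper), the following builds a design matrix for F_m and checks that d_m = (r+1)m:

```python
import numpy as np

def piecewise_poly_design(n, m, r):
    """Design matrix of F_m: piecewise polynomials of degree <= r on the regular
    partition of [0, 1] into m cells, evaluated at the design points x_i = i/n."""
    x = np.arange(1, n + 1) / n
    cells = np.minimum((x * m).astype(int), m - 1)     # cell index of each x_i
    X = np.zeros((n, (r + 1) * m))
    for j in range(m):
        in_cell = cells == j
        u = x[in_cell] - (j + 0.5) / m                 # local coordinate (centered)
        for k in range(r + 1):
            X[in_cell, j * (r + 1) + k] = u ** k
    return X

n, m, r = 300, 10, 2
X = piecewise_poly_design(n, m, r)
assert X.shape == (n, (r + 1) * m)                     # d_m = (r + 1) m

f_star = lambda t: np.sin(2 * np.pi * t)               # hypothetical test function
y = f_star(np.arange(1, n + 1) / n)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.max(np.abs(X @ coef - y)))                    # small approximation error
```

Centering the monomial basis in each cell is only a conditioning choice; any basis of the same span gives the same projection.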
We denote by f̂_m the least squares estimator of f* on F_m. We shall always consider weights π_m of order m^{-2} (suitably normalized so that ∑_{m=1}^n π_m = 1). For such weights, the term involving π_m in the general penalty (5) is of order ρ(Σ) log(m); in the applications given below, it is negligible with respect to the main term tr(Proj_{S_m} Σ).

The case of short range dependent sequences
In this subsection, we assume that the error process (ε_i)_{i≥1} is stationary and short range dependent, in the sense of Condition (8). Note that (8) is satisfied as soon as the spectral density of (ε_i)_{i≥1} is bounded, which corresponds to the usual definition of short range dependence.
In this setting, the model selection procedure is exactly the same as in the i.i.d. framework, the variance of the errors being replaced by the spectral radius in the penalty. More precisely, we obtain a penalty of the form pen(m) = K m/n for some positive constant K depending on the degree r. We then select a model in M_n according to the criterion (4), which can be rewritten as (9). Following [Bar00], we derive rates of convergence when f* belongs to some Besov space B_{α,ℓ,∞} for −1 < α < r+1 and ℓ ≥ 2 (see [DL93] for the definition of Besov spaces). In short, the approximation term in the risk decomposition of f̂_m satisfies (10) (see Sections 4 and 7.4 in [Bar00]), where |·|_{α,ℓ} is the usual norm on B_{α,ℓ,∞}. Balancing the variance and approximation terms exactly as in the case of i.i.d. errors, we end up with the same rate of convergence as in the i.i.d. case.

Corollary 3.1. Let (ℓ, α) be such that α ∈ (0, r+1) and ℓ ≥ max(2, (2α+1)/(2α²)). For a stationary Gaussian process satisfying (8), and for the estimator f̂_m̂ selected according to the penalized criterion procedure (9), sup_{|f*|_{α,ℓ} ≤ L} R(f̂_m̂) ≤ C n^{−2α/(2α+1)}, where C depends on ρ_ε, K, α, ℓ and L.
This upper bound is known to be the minimax rate of convergence for the estimation of f* in the i.i.d. case. This is satisfactory, since a sequence of i.i.d. Gaussian random variables is of course short range dependent.
As in the Gaussian i.i.d. case, the penalty is defined up to a multiplicative constant K. The spectral radius is unknown, as is the variance of the errors in the standard i.i.d. setting. In practice, the penalty is chosen proportional to the model dimension m and calibrated with the slope heuristics method introduced by Birgé and Massart [BM01b]; see Section 4.2 below.

The case of long range dependent sequences
In this subsection, we assume that the error process (ε_i)_{i≥1} is strictly stationary, but we do not assume that (8) holds. Instead, we assume that γ_ε(k) ≤ κ k^{−γ} for some κ > 0 and γ ∈ (0, 1) (Condition (11)). Note that (11) is only an upper bound, so it may happen that ∑_{k>0} |γ_ε(k)| < ∞; in such a case (8) holds and the process is short range dependent. But the interesting case is of course when ∑_{k>0} γ_ε(k) = ∞. This is what we mean here by long range dependent.
To control the main term of the penalty, we shall prove the following lemma.

Lemma 3.1. Let S_m be the linear space of R^n induced by the family of piecewise polynomials of degree at most r on the regular partition of size m of the interval [0,1]. Then tr(Proj_{S_m} Σ) ≤ C m^γ n^{1−γ}, where C depends on κ, γ and r.
Moreover, by the classical Gerschgorin theorem, we easily see that ρ(Σ) ≤ B n^{1−γ}, where B depends on κ and γ. Combining this last bound with Lemma 3.1, we infer from (5) that one can choose a penalty of the form pen(m) = K (m/n)^γ, for some positive constant K depending on κ, γ and r. Now, since the bias bound (10) is still valid for any function f* in the Besov space B_{α,ℓ,∞} (with −1 < α < r+1 and ℓ ≥ 2), we can proceed as in Section 3.1 to get the rate of convergence of the estimator f̂_m̂. The difference is that the bias-variance tradeoff now balances two terms of order m^{−2α} (bias) and (m/n)^γ (variance). This leads to the following corollary.

Corollary 3.2. Let (ℓ, α) be such that α ∈ (0, r+1) and ℓ ≥ max(2, (2α+γ)/(2α²)). For a stationary Gaussian process satisfying (11), and for the estimator f̂_m̂ selected according to the penalized criterion procedure (9), sup_{|f*|_{α,ℓ} ≤ L} R(f̂_m̂) ≤ C n^{−2αγ/(2α+γ)}, where C depends on γ, K, α, ℓ and L.
This rate is satisfactory, since it corresponds to the minimax rate described in the same setting by Wang [Wan96] when γ_ε(k) is exactly of order k^{−γ}. Note however that the minimax rate in [Wan96] is stated for the usual L²([0,1])-norm.
Let us make some additional comments: if the exponent γ is known, then the slope heuristics can still be used to calibrate the remaining constants in the penalty term. We shall see that it works pretty well in the simulation section, and we will also investigate the calibration of γ in the more general and difficult framework where the exponent γ is unknown.
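The sublinear growth in m of the trace term under long range dependence can be observed numerically; the following sketch assumes the illustrative autocovariance γ_ε(k) = (1+k)^{−γ} with γ = 0.4 (our choice) and the regressogram projection:

```python
import numpy as np

n, gamma = 1000, 0.4
k = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
Sigma = (1.0 + k) ** (-gamma)        # assumed long memory covariance, gamma in (0, 1)

def trace_proj_sigma(Sigma, m):
    """tr(Proj_Sm Sigma) for the regressogram projection on m regular cells."""
    n = Sigma.shape[0]
    bounds = np.linspace(0, n, m + 1).astype(int)
    return sum(Sigma[a:b, a:b].sum() / (b - a) for a, b in zip(bounds[:-1], bounds[1:]))

t10, t40 = trace_proj_sigma(Sigma, 10), trace_proj_sigma(Sigma, 40)
# Growth from m = 10 to m = 40 is roughly (40/10)^gamma ~ 1.7, far from the
# linear ratio 4 that the rough bound d_m * rho(Sigma) would suggest.
print(t40 / t10)
```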

Regular regressograms and anti-persistent errors
We now assume that the sequence (ε_i)_{i≥1} is stationary and anti-persistent in the following sense: there exist a parameter γ ∈ (1, 2) and a positive constant κ such that Condition (8) holds and Condition (12) is satisfied. For instance, Conditions (8) and (12) hold if (ε_i)_{i≥1} is a fractional Gaussian noise with Hurst index H ∈ (0, 1/2) (see Section 4.2 for the definition of the Hurst index). In that case, γ = 2 − 2H. The term anti-persistent is borrowed from this particular case.
In this subsection, we only consider the case of regular regressograms, which corresponds to estimators via piecewise polynomials of degree 0 on a regular partition of [0, 1].
To control the main term of the penalty, we shall prove a second lemma, in which the constant C depends on κ and γ.
We infer from (5) that one can choose a penalty whose main part is a convex function of m/n, for some positive constant K depending on κ, γ and ρ_ε (recall that ρ_ε is the constant appearing in (8)). Now, since the bias bound (10) is still valid for any function f* in the Besov space B_{α,ℓ,∞} (with −1 < α < 1 and ℓ ≥ 2), we can proceed as in Section 3.1 to get the rate of convergence of the estimator f̂_m̂. This leads to the following corollary.

Corollary 3.3. Let (ℓ, α) be such that α ∈ (0, 1) and ℓ ≥ max(2, (2α+γ)/(2α²)). For a stationary Gaussian process satisfying Conditions (8) and (12), and for the estimator f̂_m̂ selected according to the penalized criterion procedure (9), sup_{|f*|_{α,ℓ} ≤ L} R(f̂_m̂) ≤ C n^{−2αγ/(2α+γ)}, where C depends on γ, K, α, ℓ and L.
It is interesting to notice that, for a regularity α < 1, the rate of convergence given in Corollary 3.3 is faster than in the case where the sequence (ε i ) i≥1 is i.i.d.

Slope heuristics
For the results given in the previous sections, the penalty functions are known, in the best case, up to a multiplicative constant. The aim of the slope heuristics method proposed by Birgé and Massart [BM07] is precisely to calibrate a penalty function for model selection purposes. See [BMM12] and [Arl19] for a general presentation of the method. This method has shown very good performance and comes with mathematical guarantees for nonparametric Gaussian regression with i.i.d. error terms; see [BM07, Arl19] and references therein. The slope heuristics has several versions (see [Arl19]). In this paper we use the dimension jump algorithm, which is implemented for instance in the R package capushe.
The aim is to tune the constant κ in a penalty of the form pen(m) = κ pen_shape(m), where pen_shape is a known penalty shape. In the most standard cases, pen_shape(m) is the dimension of the model. Let m̂(κ) be the model selected by the penalized criterion with constant κ. The dimension jump algorithm consists of the following steps (see Figure 3b for an illustration).
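One common version of the dimension jump can be sketched as follows (a simplified toy implementation with synthetic empirical risks of our own design; see [Arl19] for the precise algorithm): increase κ, record the selected dimension, locate the largest jump, and return twice the constant at which the jump occurs.

```python
import numpy as np

def dimension_jump(emp_risk, pen_shape, kappas):
    """For each kappa, select m minimizing emp_risk[m] + kappa * pen_shape[m];
    return kappa_hat = 2 * kappa located at the largest drop of the selected dimension."""
    dims = np.array([min(emp_risk, key=lambda m: emp_risk[m] + k * pen_shape[m])
                     for k in kappas])
    jumps = dims[:-1] - dims[1:]               # selected dimension drops as kappa grows
    kappa_min = kappas[int(np.argmax(jumps)) + 1]
    return 2.0 * kappa_min, dims

# Toy empirical contrast: roughly linear decrease in high dimension (slope -0.3/100),
# so the "minimal" constant is 0.3 and the heuristic should return about 0.6.
emp_risk = {m: (1.0 / m if m <= 20 else 1.0 / 20 - 0.003 * (m - 20))
            for m in range(1, 101)}
pen_shape = {m: m / 100 for m in range(1, 101)}
kappa_hat, dims = dimension_jump(emp_risk, pen_shape, np.linspace(0.0, 1.0, 201))
print(kappa_hat)
```

For κ below the critical constant the largest model is selected; just above it the selected dimension collapses, which is the jump the algorithm detects.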

Presentation of the experiments
We simulate n observations according to the generative model (13) on [0, 1]. In the simulations, we take for f* a fixed test function. The aim is to estimate f* on a regular partition of size m, for m ∈ {1, …, 200}. We simulate the errors ε according to an ARMA process, a fractional Gaussian noise, or a non-Gaussian Markov chain. The last framework allows us to evaluate the robustness of the model selection procedure beyond the Gaussian assumption. We consider samples of size n = 200, n = 500, n = 2000 and n = 5000, and the risk of each regressogram is computed over 100 simulations.
We now give more details on the error processes we use for the simulations.
• ARMA process. The ARMA(2,1) short memory process is defined by (14), where (W_i)_{i∈Z} is a sequence of i.i.d. N(0,1) random variables.
• Fractional Gaussian noise (FGN). For H < 1/2, the FGN (ε_i)_{i≥1} is anti-persistent in the sense of Definition (12) (with γ = 2 − 2H). This is well known (see for instance [Ber94]), and follows from the fact that the ε_i's are the increments of a fractional Brownian motion B_H, that is, ε_i = B_H(i) − B_H(i−1) for i = 1, 2, …
In the simulations, we shall consider two cases:
- an anti-persistent case, with H = 0.2,
- a long memory case, with H = 0.7.
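A fractional Gaussian noise with a given Hurst index can be simulated exactly from its autocovariance function; the sketch below uses a Cholesky factorization (a simple but O(n³) method chosen for clarity; the paper's simulations may use a different generator).

```python
import numpy as np

def fgn_cov(k, H):
    """Autocovariance of fractional Gaussian noise with Hurst index H."""
    k = np.abs(np.asarray(k, dtype=float))
    return 0.5 * ((k + 1) ** (2 * H) - 2 * k ** (2 * H) + np.abs(k - 1) ** (2 * H))

def simulate_fgn(n, H, rng):
    """Exact simulation via a Cholesky factor of the covariance matrix
    (fine for moderate n; circulant embedding is the fast alternative)."""
    idx = np.arange(n)
    Sigma = fgn_cov(idx[:, None] - idx[None, :], H)
    return np.linalg.cholesky(Sigma) @ rng.standard_normal(n)

rng = np.random.default_rng(1)
for H in (0.2, 0.7):                 # the anti-persistent and long memory cases
    eps = simulate_fgn(500, H, rng)
    print(H, round(eps.var(), 3))
# For H < 1/2 the lag-1 autocovariance is negative (anti-persistence):
assert fgn_cov(1, 0.2) < 0 < fgn_cov(1, 0.7)
```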
• Non Gaussian Markov chain.We start from the Markov chain introduced by Doukhan, Massart and Rio [DMR94].
Let a be a positive real number, let ν be the probability with density x ↦ (1+a) x^a 1_{[0,1]}(x), and let π be the probability with density x ↦ a x^{a−1} 1_{[0,1]}(x). We now define a strictly stationary Markov chain by specifying its transition probabilities K(x, A), where δ_x denotes the Dirac measure at point x. Then π is the unique invariant probability measure of the chain with transition probabilities K(x, ·). Let (Z_i)_{i∈Z} be the stationary Markov chain on [0,1] with transition probabilities K(x, ·) and invariant distribution π. Recall that the β-mixing coefficients of the chain (Z_i)_{i≥1} are defined via the variation norm ‖·‖_v. From [DMR94], we know that β_Z(n) ∼ n^{−a}. One can easily check that Z_i^a is uniformly distributed over [0,1], so that ε_i = Z_i^a − 0.5 is a stationary Markov chain (as an invertible function of a stationary Markov chain), with mean zero and mixing coefficients β(k) ∼ k^{−a}. This chain is short range dependent if a > 1 and long range dependent if a ∈ (0,1) (see for instance [DGM18] for a deeper discussion on this subject).
In the simulations, we shall consider three cases:
- two short memory cases, with a = 8 and a = 1.5,
- a long memory case, with a = 0.5.
In fact, for regressograms on a regular partition of size m, the main term of the penalty can be exactly determined by the behavior of Var(ε_1 + ⋯ + ε_k): if this variance is of order k^{2−γ} for some γ ∈ (0, 2), then the main term of the penalty will be of order (m/n)^γ. We then see that γ is related to the usual Hurst index H (see for instance [Ber94]) of the partial sum process via the equality γ = 2 − 2H. Hence, for regressograms on a regular partition of size m, the main term of the penalty is of order (m/n)^{2−2H}.
This remains true for estimators based on piecewise polynomials of degree r ≥ 1 when γ_ε(n) ∼ κ n^{−γ} for γ ∈ (0, 1): again the penalty is of order (m/n)^{2−2H} with γ = 2 − 2H (see Subsection 3.2). However, for anti-persistent errors in the sense of (12), the penalty term cannot be computed as precisely as for regressograms, and is of the usual order m/n (as in the usual short range dependent case).
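The identity Var(ε_1 + ⋯ + ε_k) = k^{2H} for fractional Gaussian noise, which underlies the relation γ = 2 − 2H, can be checked directly from the autocovariances (a numerical verification of a standard fact, not code from the paper):

```python
import numpy as np

H = 0.7
k = np.arange(200, dtype=float)
gamma_eps = 0.5 * ((k + 1) ** (2 * H) - 2 * k ** (2 * H) + np.abs(k - 1) ** (2 * H))

def var_partial_sum(gamma, k):
    """Var(eps_1 + ... + eps_k) computed from the autocovariance function."""
    lags = np.arange(1, k)
    return k * gamma[0] + 2 * np.sum((k - lags) * gamma[lags])

# Partial sums of FGN are fBm values, so Var(S_k) = k^{2H}, i.e. of order
# k^{2 - gamma} with gamma = 2 - 2H.
for kk in (10, 100):
    assert abs(var_partial_sum(gamma_eps, kk) - kk ** (2 * H)) < 1e-6
print("gamma =", 2 - 2 * H)
```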
For long range dependent Gaussian processes, the variance terms of the risk are not linear functions of the dimension: they behave as m^γ for some γ ∈ (0, 1). Figure 1 shows the risk of the regressograms for observations simulated according to (13), with the error process a fractional Gaussian noise with Hurst exponent between 0.1 and 0.9.
For the anti-persistent cases (H < 0.5), the risk has a convex behavior for large dimensions, in accordance with a variance term of order m^{2−2H} (see Section 3.3). For the i.i.d. case (H = 0.5), the risk is linear for high dimensions. For the long range dependent cases (H > 0.5), the risk shows a concave behavior for large dimensions, in accordance with a variance term of order m^{2−2H} (see Section 3.2).
Figure 2 shows the risk of the regressograms for observations simulated according to (13), when the error process is one of the β-mixing Markov chains described above, with a parameter a between 0.3 and 10. We observe a concave behavior for long range dependent processes (a < 1) and a linear behavior in the short range dependent case (a > 1). This suggests that the theoretical results obtained in Sections 3.1 and 3.2 could also be valid in non-Gaussian contexts. For the simulations, we use the Whittle maximum likelihood estimator [Whi53], implemented in the longmemo package, to estimate the Hurst index H. We compare several approaches:
• CDJ: classical dimension jump method with a penalty shape proportional to the dimension.
• HGiven: dimension jump for the penalty shape m^{2−2H} with the Hurst exponent H given.
• Wh(Y): dimension jump for the penalty shape m^{2−2Ĥ}, where Ĥ is the Whittle estimator computed on the Y process.
• Wh(Res): dimension jump for the penalty shape m^{2−2Ĥ}, where Ĥ is the Whittle estimator computed on the residuals of a model. For the method Wh(Res), we have to propose a model m̂_0 on whose residuals the Hurst exponent is computed. Roughly speaking, the idea is to estimate the Hurst exponent in a sufficiently large model for which the bias is negligible. We propose a two-step procedure, based on the selection of a pre-model m̂_1: the Hurst exponent H is estimated on the residuals of m̂_1, and the resulting estimator Ĥ is used to design the penalty shape. The dimension jump is then used to select the final model m̂. We propose two versions of this two-step procedure:
- CDJ+Wh(Res): classical dimension jump to find a pre-model m̂_1, then the Whittle estimator Ĥ of the Hurst exponent on its residuals, and finally dimension jump with penalty shape m^{2−2Ĥ}.
- Wh(Y)+Wh(Res): dimension jump with penalty shape m^{2−2Ĥ_1}, where Ĥ_1 is the Whittle estimator on Y; this selects a pre-model m̂_1. Then the Whittle estimator Ĥ_2 is computed on the residuals of m̂_1, and finally the dimension jump is run with penalty shape m^{2−2Ĥ_2}.
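As a rough stand-in for the Whittle estimator (which we do not reimplement here), a simple aggregated-variance estimator of H illustrates the kind of plug-in used in the penalty shapes above; the function name and scales are our own choices.

```python
import numpy as np

def hurst_aggvar(x, scales=(2, 4, 8, 16, 32)):
    """Aggregated-variance estimator of H: the variance of block means over
    blocks of size s scales as s^{2H - 2}; regress log-variance on log-scale."""
    logs, logv = [], []
    for s in scales:
        blocks = x[: len(x) // s * s].reshape(-1, s).mean(axis=1)
        logs.append(np.log(s))
        logv.append(np.log(blocks.var()))
    slope = np.polyfit(logs, logv, 1)[0]
    return (slope + 2) / 2

rng = np.random.default_rng(2)
H_hat = hurst_aggvar(rng.standard_normal(20000))
print(round(H_hat, 2))   # close to 0.5 for i.i.d. noise
```

Such an estimate, computed on Y or on the residuals of a pre-model, can then be plugged into the penalty shape m^{2−2Ĥ} before running the dimension jump.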

Short range dependence
In this section we study the performance of the model selection method in the short range dependence framework. The penalty shape is chosen proportional to the model dimension, as in the i.i.d. case, and we can apply the classical dimension jump method (CDJ) to calibrate κ. Roughly speaking, the slope heuristics relies, among other assumptions, on the fact that the empirical contrast behaves in high dimension as a linear function of the penalty shape. We also compare the performance of the CDJ method with that of the other approaches. As we shall see, other methods can give better results for small n.

• Gaussian ARMA process. We begin with the classical ARMA(2,1) short memory process defined in (14). Figure 3 shows the behavior of the empirical contrast for n = 2000 and an illustration of the dimension jump algorithm. As expected by the slope heuristics, a linear behavior of the empirical contrast can be observed in high dimensions (m ≥ 25).
Figure 4 shows the performance of the different methods.The boxplots on the left part of each graph show the risk of this model selection method over 100 trials.On the right, the risk function is displayed.
In this experiment, the classical dimension jump (penalty shape proportional to the dimension) clearly works well for large n (n ≥ 2000). It is however less efficient for small n. Indeed, the risk shows a concave behavior in large dimensions, as in the long memory case (as we shall see later on). For small n, estimating H with the Whittle estimator applied to the Y process and plugging it into the penalty shape finally gives better results than the classical dimension jump method.
The Whittle estimator computed on the residuals is also efficient for selecting the minimal risk model for small n. In this case we consider the residual process of the model chosen at the first step, either by CDJ or by Wh(Y), the method CDJ+Wh(Res) giving bad results when n is too small (n = 200).

• Non Gaussian Markov chain
To evaluate the robustness of the model selection procedure when the Gaussian error assumption does not hold, we consider the non Gaussian Markov chain defined above. We simulate an error process ε distributed according to this stationary Markov chain, and we first run simulations in the short range dependent case with a = 8. As shown in Figure 5, a linear behavior of the empirical contrast can be observed, which is a good point for applying the slope heuristics here.
The performances of the methods are summarized in Figure 6. We can check on this figure that the classical dimension jump shows good performances. For all sample sizes, the dimension jump based on the Whittle estimator applied to Y is a little less efficient than the two-step methods.
We now consider a second short memory case with the Markov chain, with a = 1.5. This case is very close to the limit case a = 1, which separates long memory from short memory. Figure 7 shows that the CDJ method works well for n large. But for n small, the four methods do not really manage to select a model close to the oracle model.
The methods based on the direct estimation of the Hurst exponent, like Wh(Y), give good results for n smaller than 500. Regarding the two-step methods, CDJ+Wh(res) shows bad performances for n small (n ≤ 500), while Wh(Y)+Wh(Res) shows good results for n = 500 but poor results for n = 200.

[Figure: linear behavior of the empirical contrast; dimension jump; values of the penalty constant κ.]

Long range dependence
For long range dependent Gaussian processes, the variance terms of the risk are not linear functions of the dimension; they behave as m^γ for some parameter γ ∈ (0, 1). We thus would like to use penalties proportional to m^γ, see Section 3.2. For instance, for Fractional Gaussian processes, γ = 2 − 2H, where H is the Hurst exponent. Of course this coefficient is unknown in practice, and thus we use some estimator of the Hurst exponent to calibrate the penalty. Generally speaking, estimating the Hurst exponent is a difficult statistical task; however, a rough estimation can be sufficient for the model selection problem we study here.
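In this notation, the penalty shapes compared throughout the experiments can be summarized as

```latex
\operatorname{pen}(m) \;=\; \kappa \left(\frac{m}{n}\right)^{\gamma},
\qquad \gamma = 2 - 2H ,
```

so that γ < 1 for H > 1/2 (long memory), γ = 1 for H = 1/2 (recovering the dimension-proportional penalty of the short memory case), and γ > 1 for H < 1/2 (anti-persistence).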

• Fractional Gaussian Noise
For this experiment we simulate the error process with a Fractional Gaussian Noise of Hurst parameter H = 0.7. The performances of the methods are summarized in Figure 8. We can check on this figure that when using a penalty with the true Hurst exponent (H = 0.7) of the error process, the model selection method works correctly. We also note that the classical dimension jump (penalty shape proportional to the dimension) shows bad performances. On the other hand, the Whittle estimator applied to Y and plugged into the penalty shape shows good results for all sample sizes. The two-step methods also show good performances for n large enough.
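For moderate n, Fractional Gaussian Noise can be simulated exactly by a Cholesky factorization of its Toeplitz covariance matrix. The sketch below takes this simple O(n³) route (the experiments of the paper may well use faster methods, such as circulant embedding; function names are ours):

```python
import numpy as np

def fgn_autocov(k, H):
    """Autocovariance of unit-variance fractional Gaussian noise at lag k."""
    k = np.abs(np.asarray(k, dtype=float))
    return 0.5 * ((k + 1) ** (2 * H) - 2 * k ** (2 * H) + np.abs(k - 1) ** (2 * H))

def simulate_fgn(n, H, rng):
    """Exact simulation via Cholesky factorization of the Toeplitz
    covariance matrix (O(n^3), fine for moderate n)."""
    gamma = fgn_autocov(np.arange(n), H)
    cov = gamma[np.abs(np.subtract.outer(np.arange(n), np.arange(n)))]
    L = np.linalg.cholesky(cov)
    return L @ rng.standard_normal(n)

rng = np.random.default_rng(1)
x = simulate_fgn(500, 0.7, rng)
# Theoretical lag-1 autocovariance for H = 0.7: 0.5*(2^1.4 - 2) ~ 0.3195
```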

• Non Gaussian Markov chain
We now evaluate the robustness of our model selection procedure when the Gaussian error assumption is not satisfied. We consider here the non Gaussian Markov chain in the long range dependent setting. As for the Fractional Gaussian Noise, the risk has a concave behavior for large dimensions, see Figure 2 for an illustration. The penalty shape is then equal to m^a, where a is the decay rate of the covariances.
For this experiment we simulate the Markov chain with a = 0.5 for the error process. The performances of the methods are displayed in Figure 9. We observe that the classical dimension jump shows bad performances in this non Gaussian long range dependent context. When using the penalty shape m^a (H given, with a = 2 − 2H), the performances are a little better than before, but not as good as one could have hoped for. For n large enough (n ≥ 2000), the Whittle estimator applied to Y and plugged into the penalty shape shows satisfactory results. The performances of the two-step methods are similar from n = 5000 onwards.
This experiment suggests that more work should be done in this context. It seems that a concave penalty shape should be used, as expected, but the right exponent could perhaps be different from a = 2 − 2H.

[Figure: linear behavior of the empirical contrast for n = 2000; dimension jump; values of the penalty constant κ.]

Anti-persistent errors with a Fractional Gaussian Noise
We consider the same simulation protocol with anti-persistent errors, following a Fractional Gaussian Noise with H = 0.2. Again, we observe a linear behavior of the empirical contrast in high dimension, see Figure 10a.
The performances of the different methods in this experiment are summarized in Figure 11. We can check that when using a penalty with the true Hurst exponent (H = 0.2), the model selection method works pretty well. The two-step methods, with the Whittle estimator computed on the residuals, give similar results for all n. On the other hand, the Whittle estimator applied directly to Y shows poor performances for n small, but behaves better for n large.
We also note that, in this short range dependent case, the classical dimension jump shows good results for all n, as in the i.i.d. case.

Conclusion on the experiments
In these experiments we see that the penalty proportional to (m/n) (with a constant calibrated thanks to the dimension jump algorithm: the CDJ method) performs quite well for short memory processes, but underperforms in all the other situations. The Wh(Y) method, with a penalty proportional to (m/n)^{2−2Ĥ} and an estimator Ĥ based on the Y_i's, performs quite well in most cases, but can show very bad performances (see for instance Figure 11) and is hard to justify from a heuristic point of view. The two-step methods, with a penalty proportional to (m/n)^{2−2Ĥ2} and an estimator Ĥ2 based on the residuals of the first adjustment, perform well in most cases, with a clear preference for the Wh(Y)+Wh(Res) method. In fact, we suspect overfitting with the CDJ method for long memory processes, so that the residuals based on CDJ are not close to the original error process (see the application to the Nile data in Section 5).
We note that the two-step method Wh(Y)+Wh(Res) gives performances close to, and sometimes even better than, the best of the other proposed methods. An interesting example is the Gaussian ARMA process: for large n (n ≥ 2000), the risk curve is quasi-linear, and the CDJ method is the best method. But for small n (n ≤ 500), the risk curve is concave, as in the long memory case, and Wh(Y)+Wh(Res) is the best method. This suggests that, even for short memory processes, a penalty proportional to (m/n) is not always a wise choice in practice.
Our final comment is then: instead of looking for a penalty proportional to (m/n)^γ for an appropriate γ, it might be preferable to estimate directly the term tr(Proj_{S_m} Σ). This could perhaps be done by estimating the covariance Σ from the residuals of an appropriate pre-model.
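For the regressogram on a regular partition, this trace has a simple closed form, since the projector averages within each block; a sketch of the plug-in computation (assuming an estimate Σ̂ of the covariance is available, for instance built from residual autocovariances — the function name is ours):

```python
import numpy as np

def trace_proj_sigma(n, m, sigma_hat):
    """Compute tr(Proj_{S_m} Sigma) for the regressogram on a regular
    partition of size m: the projector averages within each block, so
    the trace is the sum over blocks of (1/l) * sum_{i,j in block} Sigma_ij."""
    tr = 0.0
    bounds = [round(i * n / m) for i in range(m + 1)]
    for a, b in zip(bounds[:-1], bounds[1:]):
        block = sigma_hat[a:b, a:b]
        tr += block.sum() / (b - a)
    return tr

# Sanity check: with Sigma = sigma^2 * Id, tr(Proj_{S_m} Sigma) = sigma^2 * m
n, m, s2 = 120, 8, 2.5
print(trace_proj_sigma(n, m, s2 * np.eye(n)))  # -> 20.0
```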

Application to Nile data
In this section, we wish to continue the discussion on the Nile data initiated by Robinson in his 1997 article [Rob97]. We borrow from Robinson his presentation of this dataset, as well as some other sentences: "These data consist of readings of annual minimum levels at the Roda gorge near Cairo, commencing in the year 622; often only the first 663 observations are employed because missing observations occur after the year 1284 (see [Tou25]). It was one of the hydrological series examined by [Hur51] which led to his recognition of the "Hurst effect" and invention of the R/S statistic". The data are plotted in Figure 12.
Robinson then summarizes the different ways of interpreting these data: either by considering that the cyclical variations come from a long memory phenomenon, or by considering that the series can be written as the sum of a deterministic trend plus a random noise. We refer to his article for relevant references on these questions.
Robinson applied different kernel estimators (with different bandwidths) to estimate the regression function. Then he estimated the Hurst coefficient H of the errors from the residuals of the regression (see Section 4 of his paper for the definition of the estimator of H). He noted that "These estimates thus vary greatly over the ranges of the smoothing employed" and concluded this section by "This study highlights the need for developing methods for choosing b and c which respond automatically to the strength of the dependence in u_t" (here b and c are the bandwidths used to estimate the regression function and the Hurst index respectively; u_t is the error process, in Robinson's notations).
This last sentence motivates us to apply our methods to these data, since we have a way to automatically select a partition from the data. We try two penalties: the usual penalty proportional to m/n, using the classical dimension jump to calibrate the constant (see the CDJ method in Section 4); this method should work well if the underlying error process were short range dependent. And a penalty proportional to (m/n)^{2−2Ĥ2}, where Ĥ2 is the Hurst estimator based on the residuals, according to the Wh(Y)+Wh(Res) method described in Section 4. Indeed, this method was the best one across the different kinds of simulations done in Section 4. The resulting estimators are plotted in Figure 13.
The CDJ method selects a partition of size m = 54, with a clear impression of overfitting: the estimated trend seems very irregular, with many abrupt changes. It seems that some randomness is still present in the trend. The Hurst index estimated from the residuals obtained with this estimated trend gives Ĥ = 0.59, hence not so far from a white noise.
The Wh(Y)+Wh(Res) method selects a much smaller partition, with m = 7. The trend looks more regular and interpretable, with a clear minimal period, a clear maximal period, and an almost constant tendency in between. It also suggests that an irregular partition should be used, which is a priori doable with our model selection method, at the price of trickier computations and algorithms. The Hurst index estimated from the residuals obtained with this estimated trend gives Ĥ = 0.79, in accordance with the long-range dependence hypothesis.
To be complete, the graph and the ACF of the residuals obtained with the Wh(Y) + Wh(Res) method are plotted in Figure 14.

Discussion
This paper deals with linear model selection with Gaussian dependent errors through ℓ0 penalization. Several generalizations and extensions could be proposed in future works.
In this paper, we apply Theorem 2.1 to study the fixed design case, but clearly the theorem also applies to all the settings considered in [BM01a] (or Chapter 2 in [Gir14]) in the i.i.d. case. In particular, if the error process is short range dependent, then for all these problems the penalty is the same as in the i.i.d. case, the usual variance term being replaced by the spectral radius of the covariance matrix.
The performances of the ℓ0 penalization strategy are studied in this work assuming that the distribution of the errors is stationary. However, Theorem 2.1 does not require this assumption. In a similar line of work, [Gen08] considers model selection for heteroscedastic Gaussian regression with independent observations. It would be possible to study model selection for heteroscedastic Gaussian linear models with dependence, and in particular in the long memory setting.
Another line of research concerns an extension of Theorem 2.1 to non linear models. Indeed, in the independent setting, a general model selection result for non linear models is given in [Mas07] (Theorem 4.18). By combining a Gaussian concentration inequality with a chaining argument for dependent variables, we believe that it is possible to generalize the ℓ0 penalization strategy to non linear models.
Our work strongly relies on the Gaussian assumption. It would also be interesting to provide model selection results for non Gaussian noise. Note that [Gen14] gives a general model selection theorem for linear models under moment conditions. It would be interesting to revisit these results in the context of long range dependence.
As illustrated in the last sections, it appears to be possible to adapt the slope heuristics for calibrating penalties in the context of regression with dependent errors. It would be more satisfactory to provide a justification of the slope heuristics in this context. A first step would be to justify the slope heuristics for regression with short memory errors. Finally, note that model selection for density estimation under mixing conditions with resampling penalties has been studied in [Ler11]. This strategy is computationally expensive, but it deserves to be investigated for regression under short and long memory errors.

Proof of Theorem 2.1
We adapt the proof of Theorem 2.2 in [Gir14] to the framework of dependent Gaussian errors. Starting from the definition of m̂, see Equation (4), we find that for all m ∈ M [display missing], and thus [display missing]. It can be checked that E⟨ε, t* − t̂m⟩_n ≤ 0, and finally we obtain [display missing]. The theorem can be directly derived from the following result.
Proposition 7.1.1. For the penalty defined by Equation (5), there exist constants a > 1 and L_K ≥ 0 that only depend on K, and a random variable [statement missing].
According to the proposition, we find that [display missing]. Thus [display missing], where [definition missing], and the proof of Theorem 2.1 is complete.

Proof of Proposition 7.1.1
We first recall a well known inequality from Cirel'son, Ibragimov and Sudakov [CIS76].
Theorem 7.1. Let F : (R^n, ‖·‖) → R be a 1-Lipschitz function and η a random vector in R^n such that η ∼ N_n(0, σ² Id) for some σ > 0. Then there exists a random variable ξ following an exponential distribution of parameter 1 such that [inequality missing]. Note that the Lipschitz condition is expressed with respect to the (non-normalized) Euclidean norm ‖·‖ in R^n. We derive the following lemma for the projection of Gaussian random vectors.
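The display of Theorem 7.1 did not survive extraction; a standard form of this Gaussian concentration inequality, consistent with the statement above (see, e.g., the corresponding result in [Gir14]), reads:

```latex
F(\eta) \;\le\; \mathbb{E}\big[F(\eta)\big] + \sigma \sqrt{2\,\xi},
\qquad \xi \sim \mathcal{E}(1).
```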
Lemma 7.1. Let Σ be an n × n symmetric positive semidefinite matrix and S a linear subspace of R^n. Let ε be a Gaussian random vector such that ε ∼ N_n(0, Σ). Then there exists a random variable ξ following an exponential distribution of parameter 1 such that [inequality missing].
Proof. Let ε ∼ N_n(0, Σ); then ε satisfies ε = √Σ η with η ∼ N_n(0, Id). Let S be a linear subspace of R^n.
We then check that the function η ↦ [expression missing] is Lipschitz. By applying Theorem 7.1 to this function, [conclusion missing]. We are now in position to prove Proposition 7.1.1. Let S̄m be the linear space spanned by S_m and t*. By applying the inequality 2⟨x, y⟩_n ≤ a‖x‖²_n + ‖y‖²_n/a for a > 1, we find that [display missing], where [definition missing]. Let m ∈ M. We start from the elementary inequality [display missing]. By permuting the matrices inside the trace operator, we can show that the quantity on the right-hand side of (15) is exactly equal to (1/n) tr(Proj_{S̄m} Σ). However, S̄m is unknown because it depends on t*, and thus we cannot directly define the penalty as a function of tr(Proj_{S̄m} Σ). We then use the decomposition [display missing], where V_m is the orthogonal complement of S_m in S̄m. Note that the dimension of V_m is (at most) one. By the Pythagorean theorem, [display missing]. According to Lemma 7.1 and using the inequalities (15) and (16), there exists a random variable ξ_m following an exponential distribution of parameter 1 such that [inequality missing]. Thus, the random variable Z satisfies [inequality missing]. We assume as in (5) that [condition missing]. Then, [inequality missing]. Using the inequality (x + y)² ≤ (1 + α)x² + (1 + α⁻¹)y², and taking α = (K − a)/a for K > a > 1, we find that [inequality missing]. Next, [inequality missing]. For any K > 1, take a = (K + 1)/2. Then K > a > 1 is satisfied, and the proof of Proposition 7.1.1 is complete with L_K = (2K² + 2K)/(K − 1).

Proof of Lemma 3.1
For any m ∈ {1, . . ., n} and any j ∈ {1, . . ., m}, we define the discrete interval I_j [definition missing], and we denote by ℓ(j) the length of I_j: ℓ(j) = Card(I_j). Note that, for all j, [n/m] ≤ ℓ(j) ≤ [n/m] + 1. The linear space S_m induced by the family of piecewise polynomials of degree at most r on the regular partition of size m of the interval [0, 1] is the space generated by the (r + 1)m columns of the design matrix X [display missing]. Let c_k be the k-th column of the matrix X. Note that these columns are not all orthogonal, but they are linearly independent.
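The construction of this design matrix can be sketched as follows (an illustrative implementation under our own naming conventions; the design points are taken to be i/n):

```python
import numpy as np

def piecewise_poly_design(n, m, r):
    """Design matrix whose (r+1)*m columns span the piecewise
    polynomials of degree <= r on the regular partition of [0, 1]
    into m intervals, evaluated at the design points i/n."""
    x = np.arange(1, n + 1) / n
    bounds = [round(i * n / m) for i in range(m + 1)]
    X = np.zeros((n, (r + 1) * m))
    for j in range(m):
        a, b = bounds[j], bounds[j + 1]
        for k in range(r + 1):
            # Column = indicator of the j-th interval times x^k
            X[a:b, j * (r + 1) + k] = x[a:b] ** k
    return X

X = piecewise_poly_design(30, 3, 2)
# The columns are linearly independent: rank = (r+1)*m = 9
```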

Proof of Lemma 3.2
We keep the notations of the proof of Lemma 3.1. Recall that the case of regular regressograms corresponds to the degree r = 0. In that case, the design matrix X of the proof of Lemma 3.1 contains only the m orthogonal columns filled with 0 and 1, and the linear space S_m has dimension m. Denote by c_1, . . ., c_m the m columns of the design X.
This concludes the proof of Lemma 3.2.
and the minimizer of γ_n over S is the orthogonal projection of Y on S: Proj_S(Y) = argmin_{t∈S} γ_n(t).

Lemma 3.2. Let S_m be the linear space of R^n induced by the family of indicators of intervals on the regular partition of size m of the interval [0, 1]. If Conditions (8) and (12) hold, then [conclusion missing].

2. Find the constant κ_dj > 0 that corresponds to the highest jump of the function κ ↦ d_{m̂(κ)}.
3. Select the model m̂(2κ_dj): m̂ ∈ argmin_{m∈M} { ‖Y − f̂_m‖²_n + 2κ_dj pen_shape(m) }.

Figure 1: Comparison of risk shapes for the fractional Gaussian process with Hurst coefficient between 0.1 and 0.9, and for n = 2000.

Figure 2: Comparison of risk shapes for the Markov chain, for n = 2000.
(a) Linear behavior of the empirical contrast (n = 2000).

Figure 4: Short memory ARMA process. Risk curves and performances of the different calibration methods for n = 200, 500, 2000, 5000.

Figure 5: Illustration of the slope heuristics for the non Gaussian process (a = 8).

Figure 10: Illustration of the slope heuristics for the Fractional Gaussian process (H = 0.2).
Figure 13: Nile River data and resulting estimators.

Figure 14: Residuals and ACF of the residuals for the method Wh(Y)+Wh(Res).