Sparsity considerations for dependent variables

The aim of this paper is to provide a comprehensive introduction to the study of ℓ1-penalized estimators in the context of dependent observations. We define a general ℓ1-penalized estimator for solving stochastic optimization problems. This estimator turns out to be the LASSO [Tib96] in the regression estimation setting. Powerful theoretical guarantees on the statistical performance of the LASSO were provided in recent papers; however, they usually deal only with the iid case. Here, we study this estimator under various dependence assumptions.


Sparsity in high dimensional estimation problems
In the last few years, statistical problems in high dimension have received a lot of attention, that is, estimation problems where the dimension p of the parameter to be estimated is larger than the sample size, usually denoted by n. This setting is motivated by modern applications such as genomics, where we often have n ≤ 100 patients with a very rare disease, and p of the order of 10^5 or even 10^6 (CGH arrays); see for example [RBV08] and the references therein. Other examples appear in econometrics; we refer the reader to Belloni and Chernozhukov [BC11a, BC11b].
Probably the most famous example is high-dimensional regression estimation: one observes pairs (x_i, y_i) for 1 ≤ i ≤ n with y_i ∈ R, x_i ∈ R^p, and one wants to find a θ ∈ R^p such that, for a new pair (x, y), θ'x would be a good prediction of y. If p ≥ n, it is well known that a good estimation cannot be performed unless we make an additional assumption. Very often, it is quite natural to assume that there exists such a θ that is sparse: most of its coordinates are equal to 0. If we let ‖θ‖_0 denote the number of non-zero coordinates of θ, this means that ‖θ‖_0 ≪ p. In the genomics example, it means that only a few genes are relevant to explain the disease. Early examples of estimators introduced to deal with this kind of problem include the now famous AIC [Aka73] and BIC [Sch78]. Both can be written

arg min_{θ ∈ R^p} { (1/n) Σ_{i=1}^n (y_i − x_i'θ)^2 + λ_n ‖θ‖_0 },    (1.1)

where λ_n > 0 differs between AIC and BIC. Although AIC and BIC may give poor results when p ≥ n (see [BM01]), taking λ_n ≥ 2σ^2 log(p) leads to estimators with very satisfying statistical properties (σ^2 being the variance of the noise). See for example [BM01, BTW07] for such results, and [BGH09] in the case of unknown variance.
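To make the computational obstacle discussed below concrete, here is a minimal pure-Python sketch (ours, not from the paper) of the exhaustive search that an ℓ0-penalized criterion of the form (1.1) requires: every one of the 2^p supports must be visited, which is only feasible for very small p. The helper names (`ols_fit`, `l0_penalized`) and the toy data are illustrative assumptions.

```python
import itertools

def ols_fit(X, y, cols):
    # Least squares restricted to the columns in `cols`, via the normal
    # equations and Gaussian elimination (fine for tiny supports).
    k = len(cols)
    A = [[sum(X[i][a] * X[i][b] for i in range(len(y))) for b in cols] for a in cols]
    c = [sum(X[i][a] * y[i] for i in range(len(y))) for a in cols]
    for j in range(k):  # forward elimination with partial pivoting
        piv = max(range(j, k), key=lambda r: abs(A[r][j]))
        A[j], A[piv] = A[piv], A[j]
        c[j], c[piv] = c[piv], c[j]
        for r in range(j + 1, k):
            f = A[r][j] / A[j][j]
            for col in range(j, k):
                A[r][col] -= f * A[j][col]
            c[r] -= f * c[j]
    beta = [0.0] * k
    for j in reversed(range(k)):  # back substitution
        beta[j] = (c[j] - sum(A[j][col] * beta[col] for col in range(j + 1, k))) / A[j][j]
    return beta

def l0_penalized(X, y, lam):
    # Exhaustive minimization of (1/n)||y - X theta||^2 + lam * |support|:
    # the loop below visits all 2^p supports, hence exponential cost in p.
    n, p = len(y), len(X[0])
    best, best_crit = ([], []), float("inf")
    for size in range(p + 1):
        for cols in itertools.combinations(range(p), size):
            beta = ols_fit(X, y, cols) if cols else []
            rss = sum((y[i] - sum(X[i][c] * b for c, b in zip(cols, beta))) ** 2
                      for i in range(n)) / n
            crit = rss + lam * size
            if crit < best_crit:
                best_crit, best = crit, (list(cols), beta)
    return best
```

Even at p = 30 the search already visits about 10^9 supports, which is why such estimators are unusable beyond a few tens of variables.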
The main problem with this so-called ℓ0-penalization approach is that the effective computation of the estimators defined in (1.1) is very time consuming: in practice, these estimators cannot be used for p larger than a few tens. This motivated the study of the LASSO, introduced by Tibshirani [Tib96]. This estimator is defined by

arg min_{θ ∈ R^p} { (1/n) Σ_{i=1}^n (y_i − x_i'θ)^2 + λ ‖θ‖_1 }.

The convexity of this minimization problem ensures that the estimator can be computed for very large p; see Efron et al. [EHJT04] for example. This motivated a lot of theoretical studies of the statistical performance of this estimator. The results with the weakest hypotheses can be found in the work of Bickel et al. [BRT09] or Koltchinskii [Kol]. See also the very nice reviews in the paper by Van de Geer and Bühlmann [vdGB09] or in the PhD thesis of Hebiri [Heb09]. Also note that a number of variants of the idea of ℓ1-penalization were studied in parallel with the LASSO: among others, basis pursuit [Che95, CDS01], the Dantzig selector [CT07], and the Elastic Net [ZH05]. Another high-dimensional estimation problem is the so-called problem of sparse density estimation. In this setting, we observe n random variables with (unknown) density f, and the purpose is to estimate f as a linear combination of some functions ϕ_1, ..., ϕ_p. If p ≥ n, we can use SPADES (for SPArse Density EStimator) by Bunea et al. [BWT07, BTWB10] or the iterative feature selection procedure in [Alq08].
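For intuition on why convexity makes the computation tractable, the following pure-Python sketch implements coordinate descent with soft-thresholding for a criterion of the form (1/n)‖y − Xθ‖² + λ‖θ‖₁, under the assumption that each column of X is normalized so that (1/n) Σ_i x_{i,j}² = 1. This is an illustrative solver of our own, not the paper's; the names (`lasso_cd`, `soft_threshold`) are ours.

```python
def soft_threshold(z, t):
    # S(z, t) = sign(z) * max(|z| - t, 0)
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(X, y, lam, n_iter=500):
    """Coordinate descent for (1/n)||y - X theta||^2 + lam * ||theta||_1.
    Assumes the columns of X are normalized so that (1/n) sum_i X[i][j]^2 = 1,
    as in the renormalization discussed for the diagonal of M below."""
    n, p = len(y), len(X[0])
    theta = [0.0] * p
    resid = list(y)  # residual y - X theta, maintained incrementally
    for _ in range(n_iter):
        for j in range(p):
            # correlation of column j with the partial residual (theta_j removed)
            rho = sum(X[i][j] * (resid[i] + X[i][j] * theta[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam / 2.0)
            if new != theta[j]:
                delta = new - theta[j]
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                theta[j] = new
    return theta
```

Each coordinate update is a closed-form scalar minimization, so the whole procedure costs O(n p) per sweep instead of an exponential search over supports.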
One common feature of all these theoretical studies of sparse estimators is that they focus only on the case where the observations are independent. For example, in the density estimation case, the observations are assumed to be iid in [BTWB10] and [Alq08]. The purpose of this paper is to propose a unified framework. Namely, we define a general stochastic optimization problem that contains regression and density estimation as special cases. We then define a general ℓ1-penalized estimator for this problem; in the special case of regression estimation this estimator is actually the LASSO, and in the case of density estimation it is SPADES. Finally, we provide guarantees on the statistical performance of this estimator in the spirit of [BRT09], but we do not only consider independent observations: we study the case of dependent observations and prove that we can still recover the target θ in this case, under various hypotheses.

General setting and ℓ 1 -penalized estimator
We now give the general setting and notations of our paper. Note that the cases of regression and density estimation will appear as particular cases.
We observe n random variables Z_1, ..., Z_n taking values in Z. Let P be the distribution of (Z_1, ..., Z_n). We have a function Q : Z × R^p → R such that, for any z ∈ Z, θ ∈ R^p ↦ Q(z, θ) is a quadratic function. The objective is the estimation of a value θ that minimizes the risk

R(θ) = E[ (1/n) Σ_{i=1}^n Q(Z_i, θ) ],

which only depends on n and θ. All the results that follow are intended to be interesting in the case p > n, on the condition that ‖θ‖_0 := card{j : θ_j ≠ 0} is small. We use the following estimator:

arg min_{θ ∈ R^p} { (1/n) Σ_{i=1}^n Q(Z_i, θ) + λ ‖θ‖_1 },

and θ̂_λ denotes any solution of this minimization problem.
We now detail the notation in the two examples of interest.
1. In the regression example, we observe pairs Z_i = (x_i, y_i) with

y_i = x_i'θ + ε_i,    (1.2)

where E(ε_i) = 0 (the ε_i are not necessarily iid; they may be dependent and have different distributions). Here we take Q((x, y), θ) = (y − x'θ)^2. In this example, θ̂_λ is known as the LASSO estimator [Tib96].
2. In the density estimation case, the Z_i ∈ R have the same density f with respect to the Lebesgue measure (but they are not necessarily independent). We have a family of functions (ϕ_j)_{j=1}^p and we want to estimate the density f of the Z_i by functions of the form f_θ = Σ_{j=1}^p θ_j ϕ_j. In this case we take

Q(z, θ) = ‖f_θ‖^2 − 2 f_θ(z),

and note that this leads to R(θ) = ‖f_θ − f‖^2 − ‖f‖^2. Then θ̂_λ is the estimator known as SPADES [BTWB10].
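As an illustration of the density-estimation criterion, here is a small Python sketch of ours (assuming, for concreteness, an orthonormal cosine family on [0, 1], so that ‖f_θ‖² = Σ_j θ_j²) that evaluates the penalized empirical criterion (1/n) Σ_i Q(Z_i, θ) + λ‖θ‖₁ with Q(z, θ) = ‖f_θ‖² − 2 f_θ(z):

```python
import math

def phi(j, x):
    # Orthonormal cosine family on [0, 1]: phi_0 = 1, phi_j = sqrt(2) cos(pi j x).
    return 1.0 if j == 0 else math.sqrt(2.0) * math.cos(math.pi * j * x)

def spades_criterion(theta, z, lam):
    """(1/n) sum_i Q(Z_i, theta) + lam * ||theta||_1 with
    Q(z, theta) = ||f_theta||^2 - 2 f_theta(z), f_theta = sum_j theta_j phi_j.
    For an orthonormal family, ||f_theta||^2 = sum_j theta_j^2."""
    n = len(z)
    norm2 = sum(t * t for t in theta)
    data_term = sum(sum(t * phi(j, zi) for j, t in enumerate(theta)) for zi in z) / n
    return norm2 - 2.0 * data_term + lam * sum(abs(t) for t in theta)
```

Minimizing this convex function of θ is exactly the SPADES-type program; the dictionary and its size are of course application dependent.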

Overview of the paper
In Section 2 we provide a sparsity inequality that extends the one of Bickel et al. [BRT09] to the case of non-iid variables. This result involves two assumptions: the first one concerns the function Q and is already needed in the iid case; it is usually referred to as the restricted eigenvalue property. The other hypothesis is more involved and is specific to the non-iid case. It roughly says that we are able to control the deviations of empirical means of dependent variables around their expectations.
In Section 3, we provide several examples of classical assumptions on the observations that ensure such a control. These assumptions are expressed in terms of weak dependence coefficients, so at the beginning of that section we briefly introduce weak dependence and provide some references.
We apply the results of Sections 2 and 3 to regression estimation in Section 4 and to density estimation in Section 5.
Finally the proofs are given in Section 7.

Assumptions and result
First, we need an assumption on the quadratic form R(·).
Assumption A(κ), with κ > 0. As Q(z, ·) is a quadratic form, its quadratic part is given by a matrix M that does not depend on θ. We assume that M has only 1's on its diagonal (this just means that we renormalize the observations x_i in the regression case, or the functions ϕ_j in the density estimation case), that it is non-random (here again, this is easily checked in the two examples), and that it satisfies a restricted eigenvalue condition with constant κ. Note that this condition, usually referred to as the restricted eigenvalue property (REP), is already required in the iid setting; see [BRT09, vdGB09] for example. In these papers it is also discussed why we cannot hope to get rid of this hypothesis.
Recall that, as Q(z, θ) is a quadratic function, it may be written as Q(z, θ) = θ'A(z)θ + b(z)'θ + c(z) for a p × p matrix-valued function A on Z and a vector-valued function b : Z → R^p.
Theorem 2.1. Assume that Assumption A(κ) is satisfied, and that the distribution P of (Z_1, ..., Z_n) is such that there are a constant α ∈ [0, 1/2] and a decreasing continuous function ψ(·) satisfying the deviation condition (2.1). The arguments of the proof of Theorem 2.1 are taken from [BRT09]; the proof is given in Section 7.
Note that the hypotheses of this theorem depend heavily on the distribution of the variables Z_1, ..., Z_n, and particularly on their type of dependence. Section 3 provides some examples of situations where these hypotheses are satisfied.
Also note that the upper bound in the inequality is minimized by the choice λ = λ* = 4 n^{α−1/2} ψ^{-1}(ε/p). It is important to remark that this choice may be impossible in practice, as the practitioner may not know α and ψ(·). Moreover, it is not necessarily the best one in practice: in the regression case with iid N(0, σ^2) noise, we will see that this choice leads to λ = 4σ √(2 log(p/ε)/n), which requires the knowledge of σ; and it is not usually the best choice in practice either, see for example the simulations in [Heb09]. Even in the iid case, the choice of a good λ in practice is still an open problem. However, note the following.
1. The question is in some sense ill-posed. For example, the value of λ that minimizes the quadratic risk R(θ̂_λ) is not the same as the value of λ that may ensure, under some supplementary hypotheses, that θ̂_λ correctly identifies the non-zero coordinates of θ; see for example Leeb and Pötscher [LP05] on that topic. One has to be careful about what one means by a good choice of λ.
2. Some popular methods like cross-validation seem to give good results for the quadratic risk, at least in the iid case. An interesting open question is whether one can prove theoretical results for cross-validation in this setting. See also the bootstrap method proposed in [BC11b].
3. The LARS algorithm [EHJT04] computes θ̂_λ for any λ > 0 in a very short time (coordinate descent algorithms [FHHT07] are a valuable alternative to LARS).

Remarks on the density and regression estimation setting
First, note that in the regression setting (Equation 1.2), for any i ∈ {1, ..., n} and j ∈ {1, ..., p}, the relevant empirical means involve the variables ε_i x_{i,j}; in the density estimation context, they involve the variables ϕ_j(Z_i) − E[ϕ_j(Z_i)]. So, in both cases, the assumption given by Equation 2.1 is satisfied as soon as we can control the deviations of empirical means from their expectations. In the next sections, we discuss some conditions ensuring such controls for dependent variables.

Models satisfying the conditions of Theorem 2.1
In this section, we give some results that allow us to control the deviations of empirical means from their expectations for general (non-iid) observations. The idea will be, in the next sections, to apply these results to the processes introduced above. For the sake of simplicity, in this section we deal with a generic process V = (V_i)_{i∈Z}, and the applications are given in the next sections. Various examples of pairs (α, ψ) are given. We will use the classical notation S_n = V_1 + ... + V_n. We are going to introduce some coefficients in order to control the dependence of the V_i. The first example of such coefficients are the α-mixing coefficients, first introduced by Rosenblatt [Ros56]:

α_V(r) = sup |P(A ∩ B) − P(A)P(B)|,

where the supremum is taken over A ∈ σ(V_i, i ≤ 0) and B ∈ σ(V_i, i ≥ r). The idea is that the faster α_V(r) decreases to 0, the less dependent V_i and V_{i+r} are for large r. Assumptions on the rate of decay allow one to prove laws of large numbers and central limit theorems. Various other mixing coefficients were then studied; we refer the reader to [Dou94, Rio00] for more details.
The main problem with mixing coefficients is that they exclude too many processes: it is easy to build a process V satisfying a central limit theorem while α_V(r) remains constant, see [DDL+07], Chapter 1, for an example. This motivated the introduction of weak dependence coefficients. The monograph [DDL+07] provides a comprehensive introduction to the various weak dependence coefficients. Our purpose here is not to define all these coefficients, but rather to introduce some examples that allow condition (2.1) in Theorem 2.1 to be satisfied.
Definition 3.1. We define, for any process (V_i)_{i∈Z}, the weak dependence coefficients (3.1). We show in §3.1.1 and §3.1.2 that suitable decays of these coefficients yield (2.1); those two sections provide quite different forms of the function ψ.
We finally provide some basic properties, proved in [DDL+07]. The following result allows a comparison between different types of coefficients.
Finally, the following property will be useful in this paper.
Proposition 3.2. If V is η-dependent and f is L-Lipschitz and bounded, then

Moment inequalities
In Doukhan and Louhichi [DL99] it is proved that if, for an even integer 2q, there exists C ≥ 1 such that a suitable moment condition holds, then a Marcinkiewicz-Zygmund inequality follows, and thus α = 0 and ψ(t) is of order 1/t^{2q} in (2.1). However, explicit constants are needed in Theorem 2.1. We actually have the following result.
Various inequalities of this type were derived for alternative notions of dependence (see Doukhan [Dou94], Rio [Rio00] and Dedecker et al. [DDL+07] for an extensive bibliography, which also covers the case of non-integer exponents).

Exponential inequalities
Using the previous inequality, Doukhan and Louhichi [DL99] proved exponential inequalities that lead to ψ(t) of order exp(−√t). Doukhan and Neumann [DN07] used alternative cumulant techniques to get ψ(t) of order exp(−t^2) under suitable bounds on the covariances (3.1).
We assume that there exist constants K, L_1, L_2 < ∞, µ ≥ 0, and a nonincreasing sequence of real coefficients (ρ(n))_{n≥0} such that, for all u-tuples (s_1, ..., s_u) and all v-tuples (t_1, ..., t_v) with 1 ≤ s_1 ≤ ... ≤ s_u ≤ t_1 ≤ ... ≤ t_v ≤ n, the covariance inequality (3.4) is fulfilled, where A_n can be chosen as any number greater than or equal to σ_n^2 := Var(V_1 + ... + V_n). Remark 3.3. One can easily check that if V is η-dependent then (3.4) is satisfied with Ψ as in (b), K_2 = M and ρ(r) = η(r); see Remark 9, page 9 in [DN07]. So if V is η-dependent and η(r) decreases to 0 fast enough, then we have an exponential inequality.
This result yields convenient bounds for the function ψ. A recent paper by Olivier Wintenberger [Win10] is also of interest: it directly yields alternative versions of our main result. In this paper, we do not intend to provide the reader with encyclopedic references, but rather to make precise some ideas and techniques that will be developed in further papers.

Gaussian case
In the special case of Gaussian processes (V_i)_i, the tails of S_n are classically described, because S_n ∼ N(0, σ_n^2), and here ψ(t) = exp(−t^2). We may thus obtain simultaneously sub-Gaussian tails and α = (1 − β)/2 > 0.

Non-sub-Gaussian tails
Assume that, for each i, G_i ∼ N(0, 1), and that (G_i)_i is a stationary Gaussian process whose covariance r(i) := cov(G_0, G_i) satisfies, for some B and β,

|r(i)| ≤ B |i|^{−β}.    (3.5)

Let V_i = P(G_i) for a function P with Hermite rank m ≥ 1; since cov(H_m(G_0), H_m(G_i)) = m! (r(i))^m, their covariance series fails to be m-summable in the case β ∈ ]1/m, 1[. The case P(x) = x^2 − 1 with β ∈ ]1/2, 1[ is investigated, using a suitable expansion, in the seminal work of Rosenblatt [Ros61].
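The covariance identity cov(H_m(G_0), H_m(G_i)) = m! (r(i))^m can be checked numerically for m = 2, where H_2(x) = x^2 − 1. The following Monte Carlo sketch (ours, with the arbitrary illustrative choice r = 0.6) estimates the left-hand side and compares it with 2! r^2:

```python
import math
import random

def hermite2(x):
    # Second Hermite polynomial H_2(x) = x^2 - 1 (P(x) = x^2 - 1 has Hermite rank 2)
    return x * x - 1.0

random.seed(0)
r = 0.6          # illustrative correlation between G_0 and G_i
n = 200_000
vals = []
for _ in range(n):
    g0 = random.gauss(0.0, 1.0)
    # build G_i with corr(G_0, G_i) = r
    gi = r * g0 + math.sqrt(1.0 - r * r) * random.gauss(0.0, 1.0)
    vals.append((hermite2(g0), hermite2(gi)))

m0 = sum(a for a, _ in vals) / n
mi = sum(b for _, b in vals) / n
cov = sum((a - m0) * (b - mi) for a, b in vals) / n
# Identity: cov(H_2(G_0), H_2(G_i)) = 2! * r^2, i.e. 0.72 here
```

The sample covariance should land close to 2 r² = 0.72, up to Monte Carlo error of order 1/√n.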
Let R_n denote the covariance matrix of the Gaussian vector (G_1, ..., G_n). Noting that its entries are controlled through (3.5) (B is given by Equation (3.5)), it is thus clear that, for small enough |t|, the exponential moments of S_n are controlled. Here the conditions of the main theorem hold with ψ(t) = e^{−t} and α = 1/2 − β > 0, for any M > 1/τ.

Application to regression estimation
In this section we apply Theorem 2.1 and the various examples of Section 3 to obtain results for regression estimation. Note that the results in the iid setting are already known; they are given here for the sake of completeness, to allow a comparison with the other cases.
Recall that, in the regression case, we want to apply the results of Section 3 to the processes W^{(j)} = (ε_i x_{i,j})_{1≤i≤n}, 1 ≤ j ≤ p. For the sake of simplicity, in this whole section dedicated to regression, let us put max(X) := max_{1≤i≤n, 1≤j≤p} |x_{i,j}|.

Regression in the iid case
Under the usual assumption that the ε_i are iid and sub-Gaussian with some known variance proxy σ^2, we have a sub-Gaussian deviation bound on the empirical means, so we can apply Theorem 2.1 to obtain the following well-known result:

Marcinkiewicz-Zygmund type inequalities
Let us remark that, for any 1 ≤ j ≤ p, the moment condition above applies to the process (ε_i x_{i,j})_i. Thus, we apply Theorem 2.1 and Proposition 3.3 to obtain the following result, under the assumption that there exists C ≥ 1 such that the moment condition holds. Remark 4.1. This result aims at filling a gap for non-sub-Gaussian and non-iid random variables. It still allows us to deal with the sparse case p > n when q > 1: in this case we can handle p = n^{q/2} and we get a rate of convergence in probability of order O(1/√n). If q = 1 and p/n → 0, least squares methods apply, which makes such sparsity algorithms less relevant.
Moreover, if q < 1 the present method is definitely not efficient. Hence, in the case of heavy tails, such as those considered in the paper by Bartkiewicz et al. [BJMW10], our results are useless. In any case, using least squares for heavy-tailed models (without second-order moments) does not seem to be a good idea!

Exponential inequalities
Using Theorem 2.1 and Theorem 3.1 we prove the following result.
Corollary 4.3. Assume that the (ε_i) satisfy the hypotheses of Theorem 3.1: let Ψ : N^2 → N be one of the functions of Theorem 3.1, and assume that there are constants K, L_1, L_2 < ∞, µ ≥ 0, and a nonincreasing sequence of real coefficients (ρ(n))_{n≥0} such that, for all u-tuples (s_1, ..., s_u) and all v-tuples (t_1, ..., t_v) with 1 ≤ s_1 ≤ ... ≤ s_u ≤ t_1 ≤ ... ≤ t_v ≤ n, the covariance inequality of Theorem 3.1 is fulfilled. Let c be a positive constant, and assume that ε > 0, p and n satisfy a suitable growth condition. Then the rate is the same as in the iid case; the only differences are in the constant and a restriction for very large values of p.
Remark that in general we cannot compute the inverse of this function explicitly, but we can upper-bound the relevant range of u. In this case, ψ(t) is of order exp(−√(t/C)), and so ψ^{-1}(y) = C log^2(1/y).
So, following Theorem 2.1, we can take this value of λ as soon as ψ^{-1}(ε/p) < n^{1/(2µ+4)}. For example, for a fixed number of observations n and a fixed confidence level ε, this yields a restriction on p. Under this condition we conclude by Theorem 2.1; this ends the proof.
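As a sanity check on the inversion ψ^{-1}(y) = C log²(1/y): for ψ(t) = exp(−√(t/C)), a tiny script (with the illustrative choice C = 1; the paper's constant depends on the dependence coefficients) verifies the round trip:

```python
import math

C = 1.0  # illustrative constant, not the paper's

def psi(t):
    # tail shape from the exponential inequality: psi(t) = exp(-sqrt(t / C))
    return math.exp(-math.sqrt(t / C))

def psi_inv(y):
    # solving psi(t) = y for t in closed form gives t = C * log(1/y)^2
    return C * math.log(1.0 / y) ** 2

# round trip: psi(psi_inv(y)) recovers y for every y in (0, 1)
```

This is the quantity that enters the choice λ = 4 n^{α−1/2} ψ^{-1}(ε/p), so the logarithmic growth of ψ^{-1} in 1/y is what keeps λ small even for large p.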

Simulations
In order to illustrate the results, we propose a very short simulation study. The purpose of this study is not to show the good performance of the estimator in practice, nor to give recipes for the choice of λ. The aim is rather to show that the performance of the iid setting is likely to be attained in the dependent setting as long as the dependence coefficients are small. We use the model y_i = x_i'θ + ε_i, where the x_i's are treated as a fixed design but are in practice iid vectors in R^p with p = 50, with distribution N_p(0, Σ) where Σ_{i,j} = 0.5^{|i−j|}. The parameter is θ = (3, 1.5, 0, 0, 2, 0, 0, ...)' ∈ R^p. This is the toy example used by Tibshirani [Tib96]. Let ϑ ∈ ]−1, 1[. The noise satisfies ε_i = ϑ ε_{i−1} + η_i for i ≥ 2, where the η_i are iid N(0, 1 − ϑ^2) and ε_1 ∼ N(0, 1). Note that this ensures that E(ε_i^2) = 1 for any i, so the noise level does not depend on ϑ. In the experiments, ϑ ∈ {−0.95, −0.5, 0, 0.5, 0.95}.
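The noise recursion above is easy to simulate. This short Python sketch of ours draws the AR(1) noise with stationary unit variance and checks empirically that E(ε_i²) = 1 whatever the autoregression coefficient (here 0.95, one of the values used in the experiments):

```python
import math
import random

def ar1_noise(n, theta_ar, seed=0):
    """AR(1) noise eps_i = theta_ar * eps_{i-1} + eta_i with
    eta_i iid N(0, 1 - theta_ar^2) and eps_1 ~ N(0, 1), so that
    E[eps_i^2] = 1 for every i (stationary unit variance)."""
    rng = random.Random(seed)
    eps = [rng.gauss(0.0, 1.0)]
    sd_eta = math.sqrt(1.0 - theta_ar ** 2)
    for _ in range(n - 1):
        eps.append(theta_ar * eps[-1] + rng.gauss(0.0, sd_eta))
    return eps

eps = ar1_noise(100_000, 0.95)
var_hat = sum(e * e for e in eps) / len(eps)
# var_hat should be close to 1 regardless of theta_ar
```

Scaling the innovations by √(1 − ϑ²) is exactly what keeps the noise level comparable across the values of ϑ, so that only the dependence, not the variance, changes between experiments.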
We fixed a grid of values G ⊂ ]0, 1.5[ and computed, for every experiment, the LASSO estimator with λ = g √(log(p)/n) for all g ∈ G. We repeated the experiment 25 times for every value of ϑ and report the results in Figure 1.
We can remark that all the curves are very similar. The minimum reconstruction error is obtained for g ≃ 0.2, which corresponds to λ ≃ 0.072. Note that, in the iid case, this is smaller than the theoretical value given by Theorem 2.1, λ = 4σ √(2 log(p/ε)/n) ≃ 2.56 for ε = 1/10, which would correspond to g ≃ 7.10, a value that would not even fit in the figure!

Application to density estimation
Here we apply Theorem 2.1 and Section 3 in the context of density estimation. Recall that in this setting, the relevant processes are built from the variables ϕ_j(Z_i) − E[ϕ_j(Z_i)].

Density estimation in the iid case
If the Z_i are iid with density f, and if ‖ϕ_j‖_∞ ≤ B for any j ∈ {1, ..., p}, then we can apply Hoeffding's inequality [Hoe63] to upper bound the deviations of the empirical means: for any t > 0,

P( | (1/n) Σ_{i=1}^n [ϕ_j(Z_i) − E ϕ_j(Z_i)] | ≥ t ) ≤ 2 exp( − n t^2 / (2 B^2) ).

So we can apply Theorem 2.1.
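A Hoeffding bound of this kind is easy to check by simulation. The following sketch (ours, with the illustrative choice ϕ(z) = cos(2πz) for Z_i uniform on [0, 1], so that B = 1 and E ϕ(Z_i) = 0) compares the empirical deviation frequency with the bound 2 exp(−nt²/(2B²)):

```python
import math
import random

def hoeffding_bound(n, t, B):
    # P(|(1/n) sum_i (phi(Z_i) - E phi(Z_i))| >= t) <= 2 exp(-n t^2 / (2 B^2))
    # for |phi| <= B, since each centered term lies in an interval of length 2B.
    return 2.0 * math.exp(-n * t ** 2 / (2.0 * B ** 2))

rng = random.Random(1)
B, n, t, reps = 1.0, 200, 0.2, 2000
exceed = 0
for _ in range(reps):
    # Z_i uniform on [0, 1]; phi(z) = cos(2 pi z) is bounded by B = 1 with mean 0
    mean = sum(math.cos(2.0 * math.pi * rng.random()) for _ in range(n)) / n
    if abs(mean) >= t:
        exceed += 1
emp = exceed / reps
bound = hoeffding_bound(n, t, B)
```

Since the variance of ϕ(Z_i) here is 1/2, the true deviation probability is far below the worst-case bound, which is the usual picture for Hoeffding-type inequalities.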
Corollary 5.1. In the context of density estimation, under Assumption A(κ), if the Z_i are iid with density f and ‖ϕ_j‖_∞ ≤ B for any j ∈ {1, ..., p}, then the choice λ = 4B √(2 log(2p/ε)/n) leads to the conclusion of Theorem 2.1. This result is essentially known; see [BWT07].

Density estimation in the dependent case
Note that if, as previously, we work with bounded ϕ_j(·), we automatically have moments of any order, so we will only state a result based on an exponential inequality. Using Theorem 2.1 and Theorem 3.1, we obtain: Corollary 5.2. Assume that there are L > 0 and B ≥ 1 such that ϕ_j(·) is L-Lipschitz and ‖ϕ_j‖_∞ ≤ B for any j ∈ {1, ..., p}. Assume that Z_1, ..., Z_n satisfy a weak dependence condition for some L_1, L_2, µ > 0. Let c > 0, and assume that p, n and the confidence level ε satisfy a suitable growth condition. Remark 5.1. The assumption that the ϕ_j are all L-Lipschitz for a common constant L excludes many interesting dictionaries. If we instead assume that the ϕ_j are L(n)-Lipschitz (this would be the case if we used the first n functions of the Fourier basis, for example), then we suffer a loss in (5.1) compared with the iid case. However, note that Equation (5.2) below is the starting point of our proof, so we cannot hope to find a simple way to remove this hypothesis when using η-weak dependence. This will be the object of future work.
So we can apply Theorem 3.1 with Ψ(u, v) = u + v, and we obtain the corresponding exponential inequality with A_n = 4nBLL_1 and B_n = 2^{3+µ} B L_1.
We then put u = t√n to obtain the announced bound.

Conclusion
In this paper, we showed how the LASSO and other ℓ1-penalized methods can be extended to the case of dependent random variables. An open and ambitious question, to be addressed later, is to find a good data-driven way to calibrate the regularization parameter λ when the dependence coefficients of the observations are not known in advance. In any case, this first step towards sparsity in the dependent setting is motivated by concrete applications, and our brief simulations suggest that such techniques are reasonable for time series.
Here again, extensions to random fields or to dependent point processes seem plausible.