PAC-Bayesian Estimation and Prediction in Sparse Additive Models

Abstract: The present paper is about estimation and prediction in high-dimensional additive models under a sparsity assumption (the p ≫ n paradigm). A PAC-Bayesian strategy is investigated, delivering oracle inequalities in probability. The implementation relies on recent advances in high-dimensional MCMC algorithms, and the performance of our method is assessed on simulated data.


Introduction
Substantial progress has been achieved over the last years in estimating high-dimensional regression models. A thorough introduction to this dynamic field of contemporary statistics is provided by the recent monographs Hastie, Tibshirani and Friedman (2009) and Bühlmann and van de Geer (2011). In the popular framework of linear and generalized linear models, the Lasso estimator introduced by Tibshirani (1996) immediately proved successful. Its theoretical properties have been extensively studied and its popularity has never wavered since then; see for example Bunea, Tsybakov and Wegkamp (2006); van de Geer (2008); Bickel, Ritov and Tsybakov (2009); Meinshausen and Yu (2009). However, even though numerous phenomena are well captured within this linear context, restricting high-dimensional statistics to this setting is unsatisfactory. To relax the strong assumptions required in the linear framework, one idea is to investigate a more general class of models, such as nonparametric regression models of the form Y = f(X) + W, where Y denotes the response, X the predictor and W a zero-mean noise. A good compromise between complexity and effectiveness is the additive model, which has been extensively studied and formalized for thirty years now. Amongst many other references, the reader is invited to refer to Stone (1985); Hastie and Tibshirani (1986, 1990); Härdle (1990). The core of this model is that the regression function is written as a sum of univariate functions, f = f_1 + · · · + f_p, easing its interpretation: each covariate's effect is captured by a single function. This class of nonparametric models is a popular setting in statistics, despite the fact that classical estimation procedures are known to perform poorly as soon as the number of covariates p exceeds the number of observations n.
In the present paper, our goal is to investigate a PAC-Bayesian prediction strategy in the high-dimensional additive framework (the p ≫ n paradigm). In that context, estimation is essentially possible only at the price of a sparsity assumption, i.e., most of the f_i functions are zero. Our setting is non-asymptotic. As empirical evidence of sparse representations accumulates, high-dimensional statistics are more and more coupled with a sparsity assumption, namely that the intrinsic dimension p_0 of the data is much smaller than p and n; see e.g. Giraud, Huet and Verzelen (2012). Additive modelling under a sparsity constraint has essentially been studied under the scope of the Lasso in Meier, van de Geer and Bühlmann (2009), Suzuki and Sugiyama (2012) and Koltchinskii and Yuan (2010), or of a combination of a functional grouped Lasso and a backfitting algorithm in Ravikumar et al. (2009). Those papers inaugurated the study of this problem and contain essential theoretical results, consisting of asymptotic (Meier, van de Geer and Bühlmann (2009); Ravikumar et al. (2009)) and non-asymptotic (Suzuki and Sugiyama (2012); Koltchinskii and Yuan (2010)) oracle inequalities. The present article should be seen as a constructive contribution towards a deeper understanding of prediction problems in the additive framework. It should also be stressed that our work attempts to relax as much as possible the assumptions made on the model, such as restrictive conditions on the regressors' matrix, which we consider an unrealistic burden when it comes to prediction problems.
Our modus operandi is based on PAC-Bayesian results, which is original in that context to our knowledge. The PAC-Bayesian theory originates in the two seminal papers Shawe-Taylor and Williamson (1997) and McAllester (1999), and has been extensively formalized in the context of classification (see Catoni (2004, 2007)) and regression (see Audibert (2004a,b); Alquier (2006, 2008); Audibert and Catoni (2010, 2011)). However, the methods presented in these references are not explicitly designed to cover the high-dimensional setting under the sparsity assumption. The PAC-Bayesian theory has thus lately been worked out from the sparsity perspective, by Dalalyan and Tsybakov (2008, 2012); Alquier and Lounici (2011); Rigollet and Tsybakov (2012). The main message of these studies is that aggregation with a properly chosen prior is able to deal effectively with the sparsity issue. Interesting additional references addressing aggregation are Rigollet (2006); Audibert (2009). These aggregation procedures rely on an exponential weights approach and achieve good statistical properties. Our method should be seen as an extension of these techniques, particularly focused on the specificities of additive modelling. Contrary to procedures such as the Lasso, the Dantzig selector and other penalized methods, which are provably consistent only under restrictive assumptions on the Gram matrix associated to the predictors, PAC-Bayesian aggregation requires only minimal assumptions on the model. Our method is supported by oracle inequalities in probability, valid in both asymptotic and non-asymptotic settings. We also show that our estimators achieve the optimal rate of convergence over traditional smoothing classes such as Sobolev ellipsoids. It should be stressed that our work is inspired by Alquier and Biau (2011), which addresses the celebrated single-index model with similar tools and philosophy. Let us also mention that although the use of PAC-Bayesian techniques is original in this context, parallel work has been conducted in the deterministic design case by Suzuki (2012).
A major difficulty when considering high-dimensional problems is to achieve a favorable compromise between statistical and computational performance. The recent and thorough monograph Bühlmann and van de Geer (2011) provides the reader with valuable insights on this issue. The explicit implementation of PAC-Bayesian techniques has so far remained unsatisfactory, as existing routines are only put to the test with small values of p (typically p < 100), contradicting the high-dimensional framework. In the meantime, as the solution of a convex problem, the Lasso proves computable for large values of p in reasonable amounts of time. We therefore focused on improving the computational aspect of our PAC-Bayesian strategy. Markov Chain Monte Carlo (MCMC) techniques have proved increasingly popular in the Bayesian community, as they are probably the best way of sampling from potentially complex probability distributions. The reader willing to find a thorough introduction to such techniques is invited to refer to the comprehensive monographs Marin and Robert (2007); Meyn and Tweedie (2009). While Alquier and Biau (2011); Alquier and Lounici (2011) explore versions of the reversible jump MCMC method (RJMCMC) introduced by Green (1995), Dalalyan and Tsybakov (2008, 2012) investigate a Langevin Monte Carlo-based method, although only a deterministic design is considered. We try to overcome those limitations by considering adaptations of a recent procedure whose comprehensive description is to be found in Petralias (2010); Petralias and Dellaportas (2012). This procedure, called the Subspace Carlin and Chib algorithm, originates in the seminal paper by Carlin and Chib (1995) and is close in philosophy to Hans, Dobra and West (2007), as it favors local moves for the Markov chain. We provide numerical evidence on simulated data that our method is computationally efficient.
The paper is organized as follows. Section 2 presents our PAC-Bayesian prediction strategy in additive models. In particular, it contains the main theoretical results of this paper, which consist of oracle inequalities. Section 3 is devoted to the implementation of our procedure, along with numerical experiments on simulated data, presented in Section 4. Finally, and for the sake of clarity, proofs have been postponed to Section 5.

PAC-Bayesian prediction
Let (Ω, A, P) be a probability space on which we denote by {(X_i, Y_i)}_{i=1}^n a sample of n independent and identically distributed (i.i.d.) random vectors in (−1, 1)^p × R, with X_i = (X_{i1}, . . ., X_{ip}), satisfying the additive model

    Y_i = ψ⋆(X_i) + ξ_i = Σ_{j=1}^p ψ⋆_j(X_{ij}) + ξ_i,   i = 1, . . ., n,

where ψ⋆_1, . . ., ψ⋆_p are p continuous functions (−1, 1) → R and {ξ_i}_{i=1}^n is a set of i.i.d. (conditionally on {X_i}_{i=1}^n) real-valued random variables. Let P denote the distribution of the sample {(X_i, Y_i)}_{i=1}^n, E the expectation computed with respect to P, and ‖·‖_∞ the supremum norm. We make the two following assumptions.
(A1) E[ξ_1 | X_1] = 0 and there exist two positive constants L and σ² such that for any integer k ≥ 2,

    E[|ξ_1|^k | X_1] ≤ (k!/2) σ² L^{k−2}.

(A2) There exists a constant C > 0 bounding the supremum norm of the regression functions under consideration.

Note that A1 implies that E ξ_1 = 0 and that the distribution of ξ_1 may depend on X_1. In particular, A1 holds if ξ_1 is a zero-mean Gaussian with variance γ²(X_1), where x ↦ γ²(x) is bounded. Further, note that the boundedness assumption A2 plays a key role in our approach, as it allows us to use a version of Bernstein's inequality, which is one of the two main technical tools we use to state our results. This assumption is not only a technical prerequisite, since it proves crucial in critical regimes: indeed, even if the intrinsic dimension p_0 of the regression function ψ⋆ is large, the boundedness of the function class allows much faster estimation rates. This point is profusely discussed in Raskutti, Wainwright and Yu (2012).
We are mostly interested in sparse additive models, in which only a few of the {ψ⋆_j}_{j=1}^p are not identically zero. Let {ϕ_k}_{k=1}^∞ be a known countable set of continuous functions R → (−1, 1), called the dictionary. In the sequel, |H| stands for the cardinality of a set H. For any p-tuple m = (m_1, . . ., m_p) ∈ N^p, denote by S(m) ⊂ {1, . . ., p} the set of indices of the nonzero elements of m, i.e.,

    S(m) = {j ∈ {1, . . ., p} : m_j ≠ 0},

and define the associated parameter space Θ_m, equipped with the σ-algebra T = σ(∪_{m∈M} B(Θ_m)), where M = {m = (m_1, . . ., m_p) ∈ N^p} is the collection of models. Consider the span of the set {ϕ_k}_{k=1}^∞, i.e., the set of functions

    ψ_θ = Σ_{j∈S(m)} Σ_{k=1}^{m_j} θ_{jk} ϕ_k,   θ ∈ Θ_m, m ∈ M,

equipped with a countably generated σ-algebra denoted by F. The risk and empirical risk associated to any ψ_θ ∈ F are defined respectively as

    R(ψ_θ) = E (Y_1 − ψ_θ(X_1))²   and   r_n(ψ_θ) = (1/n) Σ_{i=1}^n (Y_i − ψ_θ(X_i))².

Consider the probability η_α on the set M, defined for some α ∈ (0, 1/2). Let us stress the fact that the probability η_α acts as a penalization term over a model m, on the number of its active regressors, through a combinatorial term. Finally, set δ > 0 (which may be interpreted as an inverse temperature parameter) and define the posterior Gibbs transition density with respect to the prior π on (Θ, T) as

    ρ_δ(θ) ∝ exp(−δ r_n(ψ_θ)).

We then consider two competing estimators. The first one is the randomized Gibbs estimator Ψ, constructed with a parameter θ sampled from the posterior Gibbs density, i.e., for any A ∈ F,

    P(Ψ ∈ A | {(X_i, Y_i)}_{i=1}^n) = ∫_A ρ_δ(θ) π(dθ),   (2.2)

while the second one is the aggregated Gibbs estimator Ψ_a, defined as the posterior mean

    Ψ_a = ∫ ψ_θ ρ_δ(θ) π(dθ).   (2.3)

These estimators have been introduced in Catoni (2004, 2007) and investigated in further work by Audibert (2004a); Alquier (2006, 2008); Dalalyan and Tsybakov (2008, 2012).
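To illustrate the exponential-weights mechanism behind ρ_δ, the following sketch computes Gibbs posterior weights over a small finite set of candidate predictors. This is a toy discretization of the continuous-parameter posterior above (the candidate set, prior and data are our own illustrative choices, not the paper's): each candidate is weighted proportionally to exp(−δ r_n) times its prior mass, and the aggregated estimator is the weighted mean.

```python
import numpy as np

def gibbs_posterior_weights(candidates, X, y, delta, log_prior):
    """Discrete sketch of the Gibbs posterior: weight each candidate
    predictor by exp(-delta * empirical risk) times its prior mass."""
    # empirical risk r_n(psi) = mean squared error on the sample
    risks = np.array([np.mean((y - f(X)) ** 2) for f in candidates])
    log_w = -delta * risks + log_prior
    log_w -= log_w.max()            # stabilize before exponentiating
    w = np.exp(log_w)
    return w / w.sum()

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=200)
y = np.sin(np.pi * X) + 0.1 * rng.normal(size=200)

# three candidate predictors; the third is the true regression function
candidates = [np.sin, np.cos, lambda x: np.sin(np.pi * x)]
log_prior = np.log(np.full(3, 1 / 3))   # uniform prior over candidates

# inverse temperature set to the value n / (4 sigma^2) used later in the paper
w = gibbs_posterior_weights(candidates, X, y,
                            delta=200 / (4 * 0.1 ** 2), log_prior=log_prior)

# aggregated Gibbs estimator: posterior-mean predictor
psi_a = lambda x: sum(wi * f(x) for wi, f in zip(w, candidates))
```

With a large inverse temperature, the posterior concentrates sharply on the lowest-risk candidate, which is the behavior the oracle inequalities quantify.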
For the sake of clarity, D denotes a generic numerical constant in the sequel. We are now in a position to state a PAC-Bayesian oracle inequality.
Under mild assumptions, Theorem 2.1 provides inequalities which admit the following interpretation. If there exists a "small" model in the collection M, i.e., a model m such that Σ_{j∈S(m)} m_j and |S(m)| are small, and such that ψ_θ (with θ ∈ Θ_m) is close to ψ⋆, then ψ and ψ_a are also close to ψ⋆, up to log(n)/n and log(p)/n terms. However, if no such model exists, at least one of the terms Σ_{j∈S(m)} m_j/n and |S(m)|/n starts to dominate, thereby deteriorating the global quality of the bound. A satisfying estimation of ψ⋆ is typically possible when ψ⋆ admits a sparse representation.
To go further, we derive from Theorem 2.1 an inequality over Sobolev ellipsoids, and show that our procedure achieves the optimal rate of convergence in this setting. For brevity, we consider Sobolev spaces; however, one can easily derive similar results in other functional spaces such as Besov spaces. See Tsybakov (2009) and the references therein.
The notation {ϕ_k}_{k=1}^∞ now refers to the (non-normalized) trigonometric system on (−1, 1), consisting of the functions t ↦ cos(πjt) and t ↦ sin(πjt), with j ∈ N⋆ and t ∈ (−1, 1). Let us denote by S⋆ the set of indices of the not identically zero regressors, so that the regression function is ψ⋆ = Σ_{j∈S⋆} ψ⋆_j. Assume that for any j ∈ S⋆, ψ⋆_j belongs to the Sobolev ellipsoid W(r_j, d_j), with d_j chosen such that Σ_{j∈S⋆} d_j ≤ C√6/π, for unknown regularity parameters r_1, . . ., r_{|S⋆|} ≥ 1. Let us stress the fact that this assumption casts our results into the adaptive setting. It also implies that ψ⋆ belongs to the Sobolev ellipsoid W(r, d), with r = min_{j∈S⋆} r_j and d = Σ_{j∈S⋆} d_j. It is worth pointing out that, in that setting, the Sobolev ellipsoid is better approximated by the ℓ1-ball B_1^m(0, C) as the dimension of m grows. Further, we make the following assumption.
(A3) The distribution of the data P has a probability density with respect to the corresponding Lebesgue measure, bounded from above by a constant B > 0.
Theorem 2.2. Let ψ and ψ_a be realizations of the Gibbs estimators defined by (2.2)-(2.3), respectively. Let A1, A2 and A3 hold. Set w = 8C max(L, C) and δ = nℓ/[w + 4(σ² + C²)] for ℓ ∈ (0, 1), and let ε ∈ (0, 1). Then, with P-probability at least 1 − 2ε, both estimators satisfy an oracle inequality with excess-risk terms of order log(n) n^{−2r/(2r+1)} and log(p)/n.

A salient fact about Theorem 2.2 is its link with existing work: assume that all the ψ⋆_j belong to the same Sobolev ellipsoid W(r, d). The convergence rate is then log(n) n^{−2r/(2r+1)} + log(p)/n. This rate (up to a log(n) term) is the same as the one exhibited by Koltchinskii and Yuan (2010) in the context of multiple kernel learning (n^{−2r/(2r+1)} + log(p)/n). Suzuki and Sugiyama (2012) even obtain faster rates, which correspond to smaller functional spaces. However, the results presented by both Koltchinskii and Yuan (2010) and Suzuki and Sugiyama (2012) are obtained under stringent conditions on the design, which are not necessary to prove Theorem 2.2.
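To fix ideas, here is one way to build the truncated trigonometric design used in this section. The alternating cosine/sine indexing and the function and parameter names are our own illustrative conventions, not the paper's:

```python
import numpy as np

def trig_features(x, m):
    """First m functions of a (non-normalized) trigonometric system on
    (-1, 1): cos(pi t), sin(pi t), cos(2 pi t), sin(2 pi t), ...
    The alternating indexing is an illustrative convention."""
    cols = []
    for j in range(1, (m + 1) // 2 + 2):
        cols.append(np.cos(np.pi * j * x))
        cols.append(np.sin(np.pi * j * x))
    # build slightly more columns than needed, then truncate at m
    return np.column_stack(cols)[:, :m]

def additive_design(X, model):
    """Design matrix of a sparse additive model.
    X: (n, p) array; model: dict mapping an active covariate j to its
    truncation level m_j, e.g. {0: 2, 2: 3}."""
    blocks = [trig_features(X[:, j], mj) for j, mj in sorted(model.items())]
    return np.hstack(blocks)
```

A model m with S(m) = {0, 2} and truncation levels (2, 3) thus yields a design with 2 + 3 columns, matching the Σ_{j∈S(m)} m_j count appearing in the bounds.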
A natural extension is to consider sparsity in both the regressors and their expansions, instead of sparse regressors with nested expansions as before. That is, we no longer restrict the expansion of regressor j to the first m_j dictionary functions. To this aim, we slightly extend the previous notation. Let K ∈ N⋆ be the length of the dictionary. A model is now denoted by m = (m_1, . . ., m_p) where, for any j ∈ {1, . . ., p}, m_j = (m_{j1}, . . ., m_{jK}) is a K-sized binary vector whose entries are 1 whenever the corresponding dictionary function is present in the model and 0 otherwise. The prior distribution on the model space M is now defined for any α ∈ (0, 1/2).
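With this binary encoding, a model and its prior weight (up to normalization and the combinatorial factor) can be represented directly. The array layout below is our own illustrative convention:

```python
import numpy as np

p, K = 6, 4
# m[j, k] = 1 iff dictionary function k enters the expansion of covariate j
m = np.zeros((p, K), dtype=int)
m[0, [0, 2]] = 1   # covariate 0 uses dictionary functions 0 and 2
m[3, 0] = 1        # covariate 3 uses dictionary function 0

S = np.flatnonzero(m.sum(axis=1))    # active covariates S(m)
n_terms = int(m.sum())               # total number of dictionary functions used
alpha = 0.25
log_prior = n_terms * np.log(alpha)  # alpha**n_terms, up to normalization
```

The penalization thus decreases geometrically in the total number of active dictionary functions, which is what drives the chain towards sparse models.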

MCMC implementation
In this section, we describe an implementation of the method outlined in the previous section. Our goal is to sample from the Gibbs posterior distribution ρ_δ. We use a version of the so-called Subspace Carlin and Chib algorithm. At each iteration t, a move i(t) ∈ {+, −, =} is performed by the algorithm: an addition, a deletion or an adjustment of a regressor. Let {e_1, . . ., e_p} be the canonical basis of R^p. For ease of notation, let I denote the identity matrix. Finally, denote by φ(·; µ, Γ) the density of a Gaussian distribution N(µ, Γ) with mean µ and covariance matrix Γ. A description of the full algorithm is presented in Algorithm 1.
If the proposed move is rejected, set m(t) = m(t − 1) and θ(t) = θ_{m(t−1)}. The transition kernel of the chain defined above is reversible with respect to ρ_δ ⊗ η_α; hence this procedure ensures that {θ(t)}_{t=1}^T is a Markov chain with stationary distribution ρ_δ.
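To make the mechanics concrete, here is a much-simplified Metropolis-Hastings sketch in the spirit of Algorithm 1: a linear dictionary (one coefficient per covariate), the three move types, a Gaussian proposal φ, and a target proportional to α^{Σ_j m_j} exp(−δ r_n). It deliberately omits the subspace bookkeeping and the least-squares-centered proposals of the actual Subspace Carlin and Chib algorithm; all names and tuning values are our own illustrative choices.

```python
import numpy as np

def mh_sparse_regression(X, y, delta, alpha=0.25, s=0.5, n_iter=5000, seed=0):
    """Simplified add/delete/adjust Metropolis-Hastings over sparse
    linear models, targeting alpha**|S| * exp(-delta * empirical risk)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    theta, active = np.zeros(p), set()

    def log_target(th, S):
        return -delta * np.mean((y - X @ th) ** 2) + len(S) * np.log(alpha)

    def log_phi(c):  # log density of the N(0, s^2) coefficient proposal
        return -0.5 * np.log(2 * np.pi * s * s) - c * c / (2 * s * s)

    cur = log_target(theta, active)
    for _ in range(n_iter):
        move = rng.choice(["+", "-", "="])
        prop, S = theta.copy(), set(active)
        log_q = 0.0  # log q(reverse) - log q(forward), proposal correction
        if move == "+" and len(S) < p:
            j = int(rng.choice(sorted(set(range(p)) - S)))
            c = rng.normal(0, s)
            prop[j] = c
            S.add(j)
            log_q = (np.log(1 / len(S))                       # reverse: delete j
                     - np.log(1 / (p - len(S) + 1)) - log_phi(c))
        elif move == "-" and S:
            j = int(rng.choice(sorted(S)))
            log_q = (np.log(1 / (p - len(S) + 1)) + log_phi(prop[j])
                     - np.log(1 / len(S)))                    # reverse: re-add j
            prop[j] = 0.0
            S.discard(j)
        elif move == "=" and S:
            for j in S:           # symmetric random walk: no correction needed
                prop[j] += rng.normal(0, 0.1)
        else:
            continue
        new = log_target(prop, S)
        if np.log(rng.uniform()) < new - cur + log_q:
            theta, active, cur = prop, S, new
    return theta, active

rng = np.random.default_rng(1)
Xs = rng.uniform(-1, 1, (200, 10))
ys = 2 * Xs[:, 0] + 0.1 * rng.normal(size=200)
theta_hat, S_hat = mh_sparse_regression(Xs, ys, delta=200 / (4 * 0.1 ** 2))
```

On this toy data the chain quickly identifies the single active covariate, illustrating why the proposal-density corrections are needed for the add/delete moves to keep the kernel reversible.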

Numerical studies
In this section we assess the effectiveness of our method on simulated data. All our numerical studies have been performed with the software R (see R Core Team (2012)). The method is available on the CRAN website (http://www.cran.r-project.org/web/packages/pacbpred/index.html) under the name pacbpred (see Guedj (2012)). Some comments are in order about how to calibrate the constants C, σ², δ and α. Clearly, too small a value for C will stall the algorithm, preventing the chain from escaping the initial model: most proposed models will be discarded, since the acceptance ratio will frequently be negligible. As the parameter σ² is the variance of the proposal distribution φ, the practitioner should tune it in accordance with the noise level of the data. The parameter requiring the finest calibration is δ: the convergence of the algorithm is sensitive to its choice. Dalalyan and Tsybakov (2008, 2012) exhibit the theoretical value δ = n/(4σ²). This value leads to very good numerical performance, as also noticed by Dalalyan and Tsybakov (2008, 2012); Alquier and Biau (2011). The choice of α is guided by a reasoning similar to the one for C: its contribution to the acceptance ratio is limited to a log(1/α) term. The value α = 0.25 was used in the simulations for its apparent good properties. Although it would be computationally costly, a finer calibration through methods such as cross-validation is possible.
Finally, as a general rule, we strongly encourage practitioners to run several chains of unequal lengths and to adjust the number of iterations by checking whether the empirical risk has stabilized.
The results of the simulations are summarized in Table 1 and illustrated by Figure 1 and Figure 2. The reconstruction of the true regression function ψ⋆ is achieved even in very high-dimensional situations, putting our method on a par with the gold-standard Lasso.

Proofs
To start the chain of proofs leading to Theorem 2.1, Theorem 2.2 and Theorem 2.3, we recall and prove some lemmas used to establish Theorem 5.1, which is a general PAC-Bayesian inequality in the spirit of Catoni (2004, Theorem 5.5.1) for classification and Catoni (2004, Lemma 5.8.2) for regression. Note also that Dalalyan and Tsybakov (2012, Theorem 1) provides a similar inequality in the deterministic design case. A salient fact about Theorem 5.1 is that the validity of the oracle inequalities only involves the distribution of the noise variable ξ_1, and that distribution is independent of the sample size n.
The proofs of the following two classical results are omitted. Lemma 5.1 is a version of Bernstein's inequality which originates in Massart (2007, Proposition 2.19), whereas Lemma 5.2 appears in Catoni (2004, Equation 5.2.1).
For x ∈ R, denote (x)_+ = max(x, 0). Let µ_1, µ_2 be two probabilities. The Kullback-Leibler divergence of µ_1 with respect to µ_2 is denoted KL(µ_1, µ_2) and is defined as

    KL(µ_1, µ_2) = ∫ log(dµ_1/dµ_2) dµ_1 if µ_1 is absolutely continuous with respect to µ_2, and +∞ otherwise.

Finally, for any measurable space (A, A) and any probability π on (A, A), denote by M¹_{+,π}(A, A) the set of probabilities on (A, A) absolutely continuous with respect to π.
Lemma 5.1. Let (T_i)_{i=1}^n be independent real-valued random variables. Assume that there exist two positive constants v and w such that, for any integer k ≥ 2,

    Σ_{i=1}^n E|T_i|^k ≤ (k!/2) v w^{k−2}.

Then for any γ ∈ (0, 1/w),

    E exp[γ Σ_{i=1}^n (T_i − E T_i)] ≤ exp( vγ² / (2(1 − wγ)) ).

Lemma 5.2. Let (A, A) be a measurable space. For any probability µ on (A, A) and any measurable function h such that ∫ exp(h) dµ < ∞,

    log ∫ exp(h) dµ = sup_m [ ∫ h dm − KL(m, µ) ],

where the supremum is taken over probabilities m absolutely continuous with respect to µ, with the convention ∞ − ∞ = −∞. Moreover, as soon as h is upper-bounded on the support of µ, the supremum with respect to m on the right-hand side is reached for the Gibbs distribution g given by

    dg/dµ = exp(h) / ∫ exp(h) dµ.

Theorem 5.1 is valid in the general regression framework. In the proofs of Lemma 5.3, Lemma 5.4, Lemma 5.5 and Theorem 5.1, we consider a general regression function ψ⋆. Denote by (Θ, T) a space of functions equipped with a countably generated σ-algebra, and let π be a probability on (Θ, T), referred to as the prior. Lemma 5.3, Lemma 5.4, Lemma 5.5 and Theorem 5.1 follow from the work of Catoni (2004); Dalalyan and Tsybakov (2008, 2012); Alquier (2008); Alquier and Biau (2011). Let δ > 0 and consider the so-called posterior Gibbs transition density ρ_δ with respect to π, defined as

    ρ_δ(θ) = exp(−δ r_n(ψ_θ)) / ∫ exp(−δ r_n(ψ_{θ'})) π(dθ').   (5.1)

In the following three lemmas, denote by ρ a so-called posterior probability absolutely continuous with respect to π, and let ψ be a realization of a random variable Ψ sampled from ρ.
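The duality in Lemma 5.2 can be checked numerically on a discrete space: at the Gibbs distribution g, the quantity ∫ h dm − KL(m, µ) equals log ∫ e^h dµ exactly, and any other m gives a smaller value. A small sketch with toy numbers of our own choosing:

```python
import numpy as np

mu = np.array([0.2, 0.5, 0.3])        # base probability on three points
h = np.array([1.0, -0.5, 2.0])        # a bounded measurable function

lhs = np.log(np.sum(mu * np.exp(h)))  # log of the exponential moment

g = mu * np.exp(h)
g /= g.sum()                          # Gibbs distribution dg/dmu = e^h / Z

def dual(m):
    """E_m[h] - KL(m, mu): the quantity maximized in Lemma 5.2."""
    return np.sum(m * h) - np.sum(m * np.log(m / mu))

print(lhs - dual(g))                  # 0 up to rounding: supremum attained at g
```

Replacing g by, say, the uniform distribution strictly decreases the dual value, in accordance with the variational characterization.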
Proof. Recall that the randomized Gibbs estimator Ψ is sampled from ρ_δ, and denote by ψ a realization of the variable Ψ. By Lemma 5.3, a first bound holds with P-probability at least 1 − ε; by Lemma 5.2, a second bound holds with P-probability at least 1 − ε; and by Lemma 5.4, both hold simultaneously with P-probability at least 1 − 2ε. Apply Lemma 5.5 with the Gibbs posterior probability defined by (5.1): the corresponding bound holds with P-probability at least 1 − ε; by Lemma 5.2 with P⊗n-probability at least 1 − ε, and by Lemma 5.4 with P⊗n-probability at least 1 − 2ε, the combined bound follows. As R is a convex function, applying Jensen's inequality transfers the bound to the aggregated estimator.

Next, using the elementary inequality log (n choose k) ≤ k log(ne/k) and the definition of η_α, we restrict the set of all probabilities absolutely continuous with respect to π_m to uniform probabilities on the balls B_1^m(x, ζ), with x ∈ B_1^m(0, C) and 0 < ζ ≤ C − ‖θ‖_1. Such a probability is denoted by µ_{x,ζ}. With P-probability at least 1 − 2ε, the resulting bound holds. To proceed, we need to check that the projection of θ⋆ onto model m lies in B_1^m(0, C).

Our procedure relies on the following construction of the probability π, referred to as the prior, in order to promote the sparsity properties of the target regression function ψ⋆: models are penalized on their expansion through the term α^{Σ_{j=1}^p m_j}. For any m ∈ M, ζ > 0 and x ∈ Θ_m, denote by B_1^m(x, ζ) the ℓ1-ball centered at x with radius ζ, and denote by π_m the uniform distribution on B_1^m(0, C). The probability π on (Θ, T) is then defined from the π_m.

Regarding the notation of Section 3: the addition and deletion moves act on the canonical vectors e_j, j ∈ S[m(t)], and V[m(t)] = {k ∈ M : S(k) = S[m(t)]} denotes the set of models sharing the active set of m(t). A move i(t) is chosen with probability q[i(t)]. By convention, if |S[m(t)]| = p (respectively |S[m(t)]| = 1), the probability of performing an addition move (respectively a deletion move) is zero. The map ξ : {+, −} → {−, +} sends a move to its reverse. Denote by D_m the design matrix in model m ∈ M, and by LSE_m the corresponding least squares estimate.

Fig 2. Plot of the responses Y_1, . . ., Y_n against their estimates. The more points on the first bisectrix (solid black line), the better the estimation.

Table 1
Each number is the mean (standard deviation) of the RSS over 10 independent runs