Lasso in infinite dimension: application to variable selection in functional multivariate linear regression

It is increasingly the case in applications that the data we observe come from one or more random variables taking values in an infinite dimensional space, e.g. curves. The need for tools adapted to the nature of these data explains the growing interest in the field of functional data analysis. The model we study in this paper assumes a linear dependence between a quantity of interest and several covariates, at least one of which is infinite dimensional. To select the relevant covariates in this context, we investigate adaptations of the Lasso method. Two estimation methods are defined. The first one minimizes a Group-Lasso criterion on the multivariate functional space H. The second one minimizes the same criterion on a finite dimensional subspace of H whose dimension is chosen by a penalized least squares method. We prove sparsity oracle inequalities in the case where the design is fixed or random. To compute the solutions of both criteria in practice, we propose a coordinate descent algorithm. A numerical study on simulated and real data illustrates the behavior of the estimators.


Introduction
More and more often, the data we observe come from one or more random variables taking their values in a space of infinite dimension. This is the case, for example, for data that can be represented as curves. The need to develop tools adapted to the nature of the data explains the growing interest in the field of functional data analysis (Ramsay and Silverman, 2005; Ferraty and Vieu, 2006; Ferraty and Romain, 2011). It has proven to be very fruitful in many applications, for example in spectrometry (see for example Pham et al., 2010), in the study of electroencephalograms (Di et al., 2009), in biomechanics (Sørensen et al., 2012) and in econometrics (Laurini, 2014).
In some contexts, and more and more often, the data are a finite number of curves. We call this case multidimensional functional data. This is the case in Aneiros-Pérez et al. (2004), where the objective is to predict the next day's ozone concentration from the current day's curves of ozone concentration, NO concentration, NO2 concentration, wind speed and wind direction. Another example comes from nuclear safety problems, where the risk of failure of a nuclear reactor vessel in case of a loss of coolant accident is studied as a function of the evolution of the temperature, pressure and heat transfer parameter in the vessel (Roche, 2018). It can also happen, perhaps more often, that the observed quantities are of different natures (curves and vectors or scalars). This case has motivated the study of partial linear models (see for example Shin 2009; Wang et al. 2021; Xu et al. 2020), where a quantity of interest Y depends both on vector and on functional covariates.
In the case where the number of covariates, especially functional or infinite dimensional covariates, is large, it may be necessary to select the most relevant covariates for prediction, either to solve the computational problems posed by the complexity of the data or to obtain an interpretable prediction procedure.
The objective of this paper is to study the link between a real response Y and a vector of covariates X = (X^1, ..., X^p) which can be of different natures (curves, vectors or scalar quantities). We assume that, for all j = 1, ..., p and i = 1, ..., n, X_i^j ∈ H_j, where (H_j, ∥·∥_j, ⟨·, ·⟩_j) is a separable Hilbert space. Our covariate {X_i}_{1≤i≤n} then lies in the product space H = H_1 × ... × H_p, which is also a separable Hilbert space, equipped with its natural scalar product ⟨f, g⟩ = Σ_{j=1}^p ⟨f^j, g^j⟩_j, for all f = (f^1, ..., f^p), g = (g^1, ..., g^p) ∈ H, and the associated norm ∥f∥ = ⟨f, f⟩^{1/2}. More precisely, we consider the linear model

Y_i = ⟨β*, X_i⟩ + ε_i = Σ_{j=1}^p ⟨β*^j, X_i^j⟩_j + ε_i,  i = 1, ..., n,  (1)

where β* = (β*^1, ..., β*^p) ∈ H is an unknown parameter and the ε_i's are i.i.d. centered errors.
Note that our model does not require the H_j's to be functional spaces: we can have H_j = R or H_j = R^d for some j ∈ {1, ..., p}. The case where H_j = R for all j = 1, ..., p corresponds exactly to the classical multivariate regression model.
The functional linear model, which corresponds to the case p = 1 in equation (1), has been widely studied. It was introduced by Cardot et al. (1999), who proposed an estimator based on principal component analysis. Spline estimators have also been proposed by Ramsay and Dalzell (1991); Cardot et al. (2003); Crambes et al. (2009), as well as estimators based on the decomposition of the slope function β in the Fourier domain (Ramsay and Silverman, 2005; Li and Hsing, 2007; Comte and Johannes, 2010) or in a general basis (Cardot and Johannes, 2010; Comte and Johannes, 2012). In a similar context, we also mention the work of Koltchinskii and Minsker (2014) on the Lasso, where it is assumed that the function β is well represented as a sum of a small number of well separated spikes. In the case where p = 2, H_1 is a functional space and H_2 = R^d, model (1) is called the partial functional linear regression model and has been studied for example by Shin (2009); Shin and Lee (2012), who proposed principal component regression and ridge regression approaches for the estimation of the two coefficients of the model.
Few works have been devoted to the multivariate functional linear model, which corresponds to the case where p ≥ 2 and the H_j are function spaces for all j = 1, ..., p. To the best of our knowledge, the model was first mentioned in the work of Cardot et al. (2007) under the name multiple functional linear model. There, an estimator of β is defined with an iterative backfitting algorithm and applied to the ozone prediction dataset initially studied by Aneiros-Pérez et al. (2004). Variable selection is performed by testing all possible models and selecting the one that minimizes the prediction error on a test sample. Let us also mention the work of Chiou et al. (2016), who consider a multivariate linear regression model with functional output. They define a consistent and asymptotically normal estimator based on the multivariate functional principal components initially proposed by Chiou et al. (2014).
In the case where the covariates are finite dimensional and p is large, the usual approach to select variables is to use an ℓ1-type penalty. This case has been widely studied, with many variations and improvements. One of the most common variable selection methods, the Lasso (Tibshirani, 1996; Chen et al., 1998), consists of the minimization of a least squares criterion with an ℓ1 penalty. The statistical properties of the Lasso estimator are now well understood. Sparsity oracle inequalities have been obtained for predictive losses, in particular in standard multivariate or nonparametric regression models (see for example Bunea et al., 2007; Bickel et al., 2009; Koltchinskii, 2009; Bertin et al., 2011).
The Group-Lasso (Yuan and Lin, 2006; Chesneau and Hebiri, 2008) addresses the case where the set of covariates can be partitioned into a number of groups. To take the group structure of the data into account, our model can be rewritten with H_j = R^{d_j}, j = 1, ..., p, where p is the number of groups and d_j is the cardinality of the j-th group. Huang and Zhang (2010) show that, under certain conditions called strong group sparsity, the Group-Lasso penalty is more efficient than the Lasso penalty. Lounici et al. (2011) proved oracle inequalities for the prediction and ℓ2 estimation errors that are optimal in the minimax sense. Their theoretical results also demonstrate that the Group-Lasso can improve on the Lasso in prediction and estimation. van de Geer (2014) proved sharp oracle inequalities for general weakly decomposable regularization penalties, including Group-Lasso penalties. This approach has proven fruitful in many settings such as time series (Chan et al., 2014), generalized linear models (Blazère et al., 2014), in particular Poisson regression (Ivanoff et al., 2016) and logistic regression (Meier et al., 2008; Kwemou, 2016), the study of panel data (Li et al., 2016), and the prediction of breast or prostate cancer (Fan et al., 2016; Zhao et al., 2016). The theoretical results were extended to the case where the errors are heteroscedastic by Dalalyan et al. (2013).
Drawing inspiration from Lounici et al. (2011), we define the two criteria

β̂_{λ,∞} ∈ argmin_{β ∈ H} { (1/n) Σ_{i=1}^n (Y_i − ⟨β, X_i⟩)² + Σ_{j=1}^p λ_j ∥β^j∥_j },  (2)

β̂_{λ,m} ∈ argmin_{β ∈ H^(m)} { (1/n) Σ_{i=1}^n (Y_i − ⟨β, X_i⟩)² + Σ_{j=1}^p λ_j ∥β^j∥_j },  (3)

where λ = (λ_1, ..., λ_p) is a vector of positive parameters and (H^(m))_{m≥1} is a sequence of nested finite-dimensional subspaces of H, to be specified later.
The case where the product space H is finite dimensional has been widely treated (see the references above). However, few papers deal with the infinite-dimensional case. Most of the literature on functional data analysis naturally focuses on dimension reduction methods (mainly projection onto a spline basis or onto the principal component basis, as in Ramsay and Silverman 2005; Ferraty and Romain 2011) to reduce data complexity. More recently, clustering approaches have been considered (see for example Devijver, 2017), as well as variable selection methods using ℓ1 penalties. Kong et al. (2016) proposed a Lasso-type penalty allowing the selection of the Karhunen-Loève coefficients of the functional variable simultaneously with the coefficients of the vector variable in the partial functional linear model (case p = 2, H_1 = L²(T), H_2 = R^d of Model (1)). Group-Lasso and adaptive Group-Lasso procedures have been proposed by Aneiros and Vieu (2014, 2016) to select the important observation points t_1, ..., t_p (impact points) in a regression model where the covariates are the discrete values (X(t_1), ..., X(t_p)) of a random function X. Bayesian approaches have been proposed by Grollemund et al. (2019) in the case where the β*^j are sparse step functions. The natural extension of the approaches developed in the field of functional data analysis to our context leads to the projected version of the criterion defined in equation (3), with H^(m) generated by a multivariate spline basis or an fPCA basis. However, the projection step induces a bias that must be taken into account.
Some recent contributions (see for example Goia and Vieu, 2016; Sangalli, 2018) emphasize the need to work at the interface between high-dimensional statistics, functional data analysis and machine learning to deal more effectively with specific problems of high-dimensional or infinite-dimensional data. Indeed, the problem of infinite-dimensional variable selection is also considered in the machine learning community, especially in the context of multiple-kernel learning. Bach (2008); Nardi and Rinaldo (2008) proved the consistency of model estimation and selection, as well as prediction and estimation bounds, for the Group-Lasso estimator when the data belong to Reproducing Kernel Hilbert Spaces. In these papers, the criterion is minimized on the whole product space H, leading to (2). However, imposing that the data lie in a Reproducing Kernel Hilbert Space is too restrictive in the domain of functional data, because it implies a constraint on the unknown regularity of the data. To the best of our knowledge, the theoretical study of (2) has not been carried out when the data are in a general Hilbert space.
Our approach also covers the case where Y_i depends on a single functional variable Z_i : T → R and we want to determine whether observing the entire curve {Z_i(t), t ∈ T} is useful to predict Y_i, or whether it is sufficient to observe it on some subsets of T. For this purpose, we define a partition T_1, ..., T_p of the set T into subintervals and consider the restrictions X_i^j : T_j → R of Z_i to T_j. If the corresponding coefficient β*^j is zero, we know that X_i^j is, a priori, irrelevant to predict Y_i and, therefore, that the behavior of Z_i on the interval T_j has no significant influence on Y_i. The idea of using a Lasso-type criterion or a Dantzig selector in this context, called the FLiRTI method (for Functional Linear Regression That is Interpretable), has been developed by James et al. (2009).
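As an illustration of this construction, a curve observed on a discrete grid can be split into p restrictions on consecutive subintervals, each restriction then being treated as one functional covariate. This is a minimal sketch under assumptions not in the text (an equal-length partition; the names t, z and partition_curve are illustrative):

```python
import numpy as np

def partition_curve(t, z, p):
    """Split a discretized curve z, observed on the grid t, into p
    restrictions on consecutive subintervals T_1, ..., T_p.

    Each returned pair (t_piece, z_piece) plays the role of one
    functional covariate X^j in the variable selection problem."""
    pieces = np.array_split(np.arange(len(t)), p)
    return [(t[idx], z[idx]) for idx in pieces]
```

A zero estimated coefficient for piece j then suggests that the behavior of the curve on T_j carries no significant information on Y.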

Contribution of the paper
The properties of the solution of the Group-Lasso problem (2) have been studied, for example, by Lounici et al. (2011) under restricted eigenvalue type assumptions in the finite-dimensional case. Bellec and Tsybakov (2017) improved these results by obtaining sharp versions of the sparsity oracle inequalities. The aim of this paper is to study the case where dim(H) = +∞ and to answer the following questions: are we able to obtain sharp oracle inequalities when dim(H) = +∞? How can the solution of a Lasso problem be computed in this infinite-dimensional context?
To answer the first question, we must first define a restricted eigenvalue condition (or an equivalent). Unfortunately, the restricted eigenvalue assumption in an infinite dimensional space turns out to be a complex issue. Indeed, we first prove in Section 2 that no such hypothesis can hold on the entire space H in infinite dimension, or even when the data dimension is too large. As an alternative, we consider the minimal ratio κ_n^(m)(s) between the empirical norm and the norm of H over a cone of approximately sparse elements. This quantity, supposed to be constant in finite dimension in the works of Lounici et al. (2011); Bellec and Tsybakov (2017), is seen here as a sequence which decreases to 0 as m = dim(H^(m)) increases, at a rate which will determine the convergence rate of the final estimator. This rate thus plays the role of a regularity parameter. This is, to our knowledge, a new approach to the problem.
We prove in Section 3 a sharp oracle inequality for both criteria (2) and (3) without any assumption other than the normality of the noise. The proofs and results are similar to those of Lounici et al. (2011); Bellec and Tsybakov (2017), except that we have to deal with the remainder term due to the violation of the restricted eigenvalue assumption for the solution of (2), and with the bias due to the projection for the solution of (3). The results hold for both fixed and random designs. We find, as expected, that the properties of the projected estimator (3) depend strongly on the choice of the projection dimension m. A data-driven criterion for selecting the dimension m, inspired by the work of Barron et al. (1999) and its adaptation to the functional linear model by Brunel et al. (2016), is proposed. In Section 4, we obtain a sparsity oracle inequality for the theoretical prediction error under certain sub-Gaussianity assumptions on the data distribution. Sections 5 and 6 are devoted to the numerical properties of the solution. While the solution of criterion (3) can be computed directly from the coefficients of the data in a basis of the space H^(m) with tools dedicated to multivariate data, this is not the case for the solution of criterion (2), which requires solving an infinite dimensional optimization problem. We then define a computational algorithm that minimizes criterion (2) directly in the space H, without projecting the data. This algorithm is also used to solve criterion (3), to facilitate comparisons. The properties of the estimators are studied numerically in Section 6 on simulated data sets. We then apply both estimation procedures to the prediction of the energy consumption of household appliances.
We also denote by P_X(·) = P(·|X_1, ..., X_n) the conditional probability with respect to the design if it is random, and P_X(·) = P(·) if the design is fixed.
Discussion on the restricted eigenvalues assumption

The restricted eigenvalues assumption does not hold if dim(H) = +∞

Sparsity oracle inequalities are usually obtained under conditions on the design matrix. One of the most common is the restricted eigenvalues property (Bickel et al., 2009; Lounici et al., 2011). Translated to our context, this assumption may be written as follows.
(A_RE(s)): There exists a positive number κ = κ(s) bounding from below the ratio between the empirical norm and the norm of H over the relevant cone, where

∥f∥_n := ( (1/n) Σ_{i=1}^n ⟨f, X_i⟩² )^{1/2}

is the empirical norm on H naturally associated with our problem. As explained in Bickel et al. (2009, Section 3), this assumption can be seen as a "positive definiteness" condition on the Gram matrix restricted to sparse vectors. In the finite dimensional context, van de Geer and Bühlmann (2009) have proven that this condition covers a large class of design matrices.
The next lemma, proven in Section A.1, shows that this assumption does not hold when dim(H_J) is too large for a subset J of {1, ..., p}.

Finite-dimensional subspaces and restriction of the restricted eigenvalues assumption
The infinite-dimensional nature of the data is the main obstacle here. To circumvent the dimensionality problem, we restrict the assumption to finite-dimensional spaces. In the sequel, we focus on the spaces spanned by the first m elements of an orthonormal basis (φ^(k))_{k≥1}, i.e. H^(m) := span{φ^(1), ..., φ^(m)}. To obtain sparsity oracle inequalities, we suppose that the following support compatibility condition on the basis is fulfilled. We denote by π_j the projection operator onto the j-th coordinate and by Π_m the projection operator onto H^(m).
Condition (C_supp) appears necessary to obtain the sparsity oracle inequality. It is a condition on the basis (φ^(k))_{k≥1}, satisfied for instance if |J(φ^(k))| = 1 for all k ≥ 1, that is, if each basis element is supported by a single coordinate. It is now possible to define a restricted eigenvalues property on the projection of the data onto the finite-dimensional space H^(m). We first emphasize that the viewpoint is different. In finite-dimensional contexts (see e.g. Bickel et al. 2009; Lounici et al. 2011), the restricted eigenvalue property is an assumption on the design matrix. In infinite-dimensional contexts, since there is no a priori dimension for the data, it seems more natural to define a sequence (κ_n^(m))_{m≥1} depending on the sparsity level s ∈ {1, ..., p}. The quantities κ_n^(m)(s) are linked with the spectral radii of the restrictions of the empirical covariance operator Γ̂ to the subspaces H_J, for J ⊆ {1, ..., p} with |J| ≤ s, where H_J is equipped with its usual scalar product ⟨f, g⟩_J = Σ_{j∈J} ⟨f^j, g^j⟩_j.
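Numerically, once the data are projected onto H^(m), a quantity in the spirit of κ_n^(m)(s) can be explored through the restricted empirical covariance matrices. The sketch below takes, over all supports J of cardinality at most s, the smallest eigenvalue of the empirical Gram matrix of the stacked coefficients; this is a simplification (the actual definition involves a minimum of a norm ratio over a cone), intended only to illustrate the link with the spectra of the restrictions of Γ̂:

```python
import numpy as np
from itertools import combinations

def kappa_sketch(coef_blocks, s):
    """coef_blocks[j]: (n, m_j) matrix of coefficients of X_i^j in the
    first m_j basis elements of H_j.  Returns the square root of the
    smallest eigenvalue of the restricted empirical covariance matrices
    over all supports J with |J| <= s (illustrative proxy only)."""
    n = coef_blocks[0].shape[0]
    p = len(coef_blocks)
    smallest = np.inf
    for size in range(1, s + 1):
        for J in combinations(range(p), size):
            Z = np.hstack([coef_blocks[j] for j in J])
            gram = Z.T @ Z / n  # empirical covariance restricted to H_J
            smallest = min(smallest, np.linalg.eigvalsh(gram)[0])
    return np.sqrt(max(smallest, 0.0))
```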
Since it has been proven by Cardot and Johannes (2010) that the rate of decrease of the eigenvalues of the covariance operator influences the minimax rates in functional linear regression, we may expect that the rate of decrease of κ_n^(m)(s) to 0 influences the convergence rate, which is confirmed by our results.

Some examples
In this section, we detail three examples of spaces H on which we will illustrate the theoretical results of the paper. The first two examples are illustrative and the third is close to the electricity consumption case presented in Section 6.5.
Example 1: finite-dimensional space verifying the restricted eigenvalues assumption. We first consider, as an illustrative example, the case where dim(H) = d < +∞. In that case, without loss of generality, we can take H_j = R^{d_j} with d_1 + ... + d_p = d. Moreover, we suppose in this example that the restricted eigenvalues assumption (A_RE(s)) written in Section 2.1 holds with c_0 = 3. This case matches the model described in Bellec and Tsybakov (2017); Lounici et al. (2011) (with, possibly, c_0 = 7 instead of c_0 = 3 in Lounici et al. 2011), and we can easily see that the quantities κ_n^(m)(s) are bounded from below for any nested sequence of subspaces.

Example 2: simple semi-functional linear model. We suppose that p = 2, H_1 = L²([0, 1]) and H_2 = R. We consider a basis (ê_k)_{k≥1} that diagonalizes the empirical covariance operator Γ̂_1 of the functional data (X_1^1, ..., X_n^1), and we denote by (μ̂_k)_{k≥1} the associated nonincreasing sequence of eigenvalues. Remark that the orthonormal system {(ê_k, 0), k ≥ 1; (0, 1)} is a basis of H that diagonalizes the operator Γ̂. For a rank r ∈ N\{0}, we construct the corresponding nested subspaces H^(m). For s = 1 and m < r, κ_n^(m)(1) is driven by μ̂_m, the m-th largest eigenvalue of Γ̂_1. For m ≥ r, we also take into account the interaction between the two variables, and κ_n^(m)(1) is the smallest eigenvalue of the covariance matrix of the data matrix containing the coefficients of the projection of the data onto H^(m).

Example 3: fully multivariate functional linear model. Now we consider the example of an integer p and H_j = L²([0, 1]) for all j. We define, for all j = 1, ..., p, a basis (ê_k^(j))_{k≥1} that diagonalizes the empirical covariance operator Γ̂_j of the data (X_1^j, ..., X_n^j), and we denote by (μ̂_k^(j))_{k≥1} the associated eigenvalue sequence. To simplify the definitions, we set m = Lp, with L ∈ N\{0}, each covariate contributing its first L basis elements. In that case, the matrix Γ̂_{|m} is a block matrix whose off-diagonal block Γ̂_{j,j'}^L is the correlation matrix between the projected coordinates of covariates j and j'. We remark that the matrices Γ̂_{j,j}^L are diagonal, with diagonal coefficients {μ̂_k^(j), k = 1, ..., L}, and that the corresponding bound on κ_n^(m) holds with equality when the extra-diagonal correlation matrices Γ̂_{j,j'}^L are 0 for all j ≠ j'.

Sharp sparsity oracle-inequalities for the empirical prediction error
In this section, the design X_1, ..., X_n may be either fixed or random. The results of the section are obtained under the sole assumption of Gaussianity of the noise, with no assumption on the design. We prove the following sharp sparsity oracle inequality for the solutions of both problems (2) and (3).
Proposition 1. Let q > 0 be fixed and choose the penalty levels λ_j accordingly. Then, with probability larger than 1 − p^{1−q}, for all m ≥ 1, the announced sparsity oracle inequality holds, with β^(*,⊥m) the orthogonal projection of β* onto (H^(m))^⊥ and with the convention 1/0 = +∞ in the case where κ_n^(m)(s) = 0.

The proof of this result can be found in Section A.2. It is based on those of Bellec and Tsybakov (2017, Proposition 5) and Lounici et al. (2011, Theorem 3.1), with some adjustments linked to the infinite-dimensional nature of the data. In particular, we need a concentration inequality that remains true in Hilbert spaces (see Proposition 2).
In the case where dim(H) < +∞ (Example 1 in Section 2.3), we remark that when m = d = dim(H) the problematic remainder term R_{m,n} disappears and the result of Proposition 1 coincides with:

• the result of Bellec and Tsybakov (2017, Proposition 5) in the case λ_j = λ for all j = 1, ..., p, with the same constants;

• the result of Lounici et al. (2011, Theorem 3.1) with better constants (9/4 instead of 96 in the term due to the penalty, and 1 replaced by 2 in the bias term).
However, in the case where dim(H) = +∞, we have to deal either with the remainder term R_{n,m} for β̂_{λ,∞} or with the choice of an optimal dimension m for β̂_{λ,m}. Up to now, it seems difficult to know the exact convergence rate of R_{n,m}. By contrast, the choice of the dimension m for the estimator β̂_{λ,m} corresponds to a classical bias-variance trade-off.
• When m is small, the distance ∥β* − β∥_n between β* and any β ∈ H^(m) is generally large.

• When m is sufficiently large, the distance ∥β* − β∥_n is small, but the term κ_n^(m)(s) is close to 0 when m is close to rk(Γ̂).
To achieve the best trade-off between these two terms, a model selection procedure in the spirit of Barron et al. (1999) is introduced. We select the dimension m̂ minimizing a penalized empirical criterion over m ≤ N_n, where κ > 0 is a constant which can be calibrated by a simulation study or selected from the data by methods stemming from slope heuristics (see e.g. Baudry et al. 2012), and N_n ≤ n.
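The mechanism of this selection rule can be sketched as follows for projected estimators computed on a grid of dimensions. The penalty shape kappa * sigma2 * m / n is an assumption of this sketch (the paper's penalty is not reproduced here); only the bias-variance compromise is illustrated:

```python
import numpy as np

def select_dimension(emp_risk, n, sigma2, kappa=2.0):
    """emp_risk[m-1]: empirical risk of the estimator computed on H^(m).
    Returns the dimension minimizing risk + penalty, the penalty growing
    linearly in m (assumed shape; kappa can be calibrated by slope
    heuristics)."""
    scores = [risk + kappa * sigma2 * (m + 1) / n
              for m, risk in enumerate(emp_risk)]
    return int(np.argmin(scores)) + 1  # dimensions are 1-indexed
```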
We obtain the following sparsity oracle inequality for the selected estimator β̂_{λ,m̂}.
The proof of Theorem 1 can be found in Section A.3. It is based on the control of an empirical process naturally associated with our problem, given in Lemma 2. Both quantities κ_min and C_MS are universal constants.
Theorem 1 implies that, with large probability, the prediction error of β̂_{λ,m̂} achieves the best compromise over m, where, for all m, β^(*,⊥m) is the orthogonal projection of β* onto (H^(m))^⊥. The upper bound in Equation (10) is then the best compromise between two terms:

• an approximation term ∥β^(*,⊥m)∥²_n which decreases to 0 when m → +∞;

• a second term, due to the penalization and the projection, which increases to +∞ when m → +∞.

Oracle-inequality for prediction error
The aim of this section is to prove sparsity oracle inequalities for the theoretical counterpart of the empirical prediction error and to derive convergence rates under appropriate regularity assumptions. We suppose in this section that the design X_1, ..., X_n is a sequence of i.i.d. centered random variables in H. The aim is to control the estimator in terms of the norm associated with the prediction error of an estimator β̂, defined through a pair (X, Y) that follows the same distribution as (X_1, Y_1) and is independent of the sample.

Moment assumptions and definitions
We denote by (μ_k)_{k≥1} the eigenvalues of Γ sorted in decreasing order, and we introduce the following moment assumptions.

(H_Mom^(1)): There exists a constant b > 0 bounding the moments of order ℓ, for all ℓ ≥ 1.

(H_Mom^(2)): There exist two constants v_Mom > 0 and c_Mom > 0 bounding the moments of order ℓ, for all ℓ ≥ 2.

Both assumptions (H_Mom^(1)) and (H_Mom^(2)) are needed to apply exponential inequalities; they are verified e.g. by Gaussian or bounded processes.
Then, there exist C_MS, c_max > 0 (depending only on tr(Γ) and ρ(Γ)) and C_Mom > 0 (depending only on v_Mom and c_Mom) such that the stated oracle inequality holds with large probability, where C > 0 is a universal constant.
The proof is based on concentration inequalities for ratios of norms, which can be found in Proposition 4 and rely mainly on Bernstein's inequality (for real and functional random variables). It can be found in Section A.4.

Convergence rates
From Theorem 2, we derive an upper-bound on the convergence rates of the estimator β λ, m .For this we need some regularity assumptions on β * and Γ.
For a sequence v = (v_j)_{j≥1} of positive real numbers, we define an associated weighted norm on H. We introduce two sequences b = (b_k)_{k≥1} and v = (v_k)_{k≥1} of positive real numbers and constants R, c > 0, and define from them the regularity classes of β* and Γ.
Let us explain the regularity assumptions on β and Γ in the three examples of Section 2.3. In order to simplify the presentation, we replace in Examples 2 and 3, for all j, the basis (ê_k^(j))_{k≥1} by its theoretical counterpart (e_k^(j))_{k≥1}, the basis that diagonalizes Γ_j, and write Example 2' (resp. Example 3') instead of Example 2 (resp. Example 3).
In Example 1, since dim(H) < +∞, we can remark that, for any sequence b, β* belongs to the corresponding regularity class as soon as R is large enough. Similarly, it is easily seen that for every sequence v ∈ (R*₊)^{N\{0}}, there exists c, of the order of ρ(Γ^{1/2})/min_j v_j, such that Γ ∈ N_v(c).

In Example 2', the condition amounts to Σ_{k≥1} b_k ⟨β*^1, e_k⟩² ≤ R′ (and conversely). The assumption is therefore an ellipsoidal regularity assumption on the functional component of the vector β*. This ellipsoidal regularity assumption is very classical in non-parametric minimax estimation (Tsybakov, 2009) and in particular in the functional data framework (Cardot and Johannes, 2010; Comte and Johannes, 2012; Brunel et al., 2016). Concerning the hypothesis on Γ, we have the following characterization: there exists c > 0 such that Γ ∈ N_v(c) if and only if the sequence (μ_k)_{k≥1} of eigenvalues of the covariance operator Γ_1, sorted in nonincreasing order, decreases at the corresponding rate.

Concerning the more complex Example 3', there is a link between the regularity of β and the regularity of its coordinates, obtained by identifying each basis element φ^(jp+k) with the element of H whose only nonzero coordinate is e_k^(j). If each coordinate β_j lies in an ellipsoid of L²([0, 1]), i.e. there exist b_1, ..., b_p > 0 and R > 0 bounding the corresponding weighted sums of squared coefficients, then, denoting b = (k^{max_{j=1,...,p} b_j})_{k≥1}, the regularity condition holds: the index b_j accounting for the regularity of the function β_j, the vector of functions β has the worst regularity of all its coordinates. Regarding the regularity class N_v(c), a similar result may be obtained: writing ⟨Γf, f⟩ coordinatewise for f = (f^1, ..., f^p) ∈ H, a common constant c > 0 controlling the eigenvalue decay of each Γ_j yields Γ ∈ N_v(c).

Corollary 1 (Rates of convergence). We suppose that all assumptions of Theorem 2 are verified and choose, for all j = 1, ..., p, the penalty levels with a numerical constant A > 0. We also suppose that there exist γ ≥ 1/2 and b > 0 quantifying the polynomial decrease of the eigenvalues, and γ(s) ≥ 1/2 playing the same role at sparsity level s. Then, there exist two quantities C, C′ > 0 such that, if |J(β*)| ≤ s, the stated convergence rate holds with large probability. The proof relies on the results of Theorem 2.
The polynomial decrease of the eigenvalues (μ_k)_{k≥1} of the operator Γ is also a usual assumption; the Brownian bridge and the Brownian motion on [0, 1] are classical examples. Remark that the rate of convergence of the selected estimator β̂_{λ,m̂} is the same as that of β̂_{λ,m*}, where m* has the order of the optimal value of m in the upper bound of Equation (7).
We do not know, however, the exact order of the minimax rate when the solution β* is sparse, nor whether it can be achieved by either β̂_{λ,m} or β̂_{λ,∞}.

Computing the Lasso estimator
The purpose of this section is to explain how the estimators β̂_{λ,∞} and β̂_{λ,m̂} are computed. We first describe an algorithm which allows us to obtain an approximation of β̂_{λ,∞} by adapting an existing finite dimensional algorithm to infinite dimension. We then explain, in Subsection 5.2, how we choose the parameter λ (both for β̂_{λ,∞} and β̂_{λ,m̂}), and in Subsection 5.3 how a projection space H^(m) can be chosen to construct the estimator β̂_{λ,m̂}. Finally, we define a method to reduce the usual bias of Lasso-type estimators in Subsection 5.4.

Computational algorithm
We propose the following algorithm to compute an approximation of β̂_{λ,∞}. It can also be adapted to obtain an approximation of β̂_{λ,m}, although, for this estimator, the usual algorithms for the vanilla Group-Lasso can be used directly.
The idea is to update each coordinate β^1, ..., β^p sequentially, in the spirit of the glmnet algorithm (Friedman et al., 2010), by solving a one-group minimisation problem at each step. However, in the Group-Lasso context, this algorithm relies on the so-called group-wise orthonormality condition, which, translated to our context, amounts to supposing that the operators Γ̂_j (or their restrictions Γ̂_{j|m}) are all equal to the identity. This cannot hold if dim(H_j) = +∞, since Γ̂_j is a finite-rank operator. Without this condition, Equation (12) does not admit a closed-form solution and hence is not computable. We therefore propose a variant of the GMD (Groupwise-Majorization-Descent) algorithm, initially defined by Yang and Zou (2015) for Group-Lasso type optimization problems, which does not impose the group-wise orthonormality condition. This algorithm is also based on the principle of coordinate descent, but the minimisation problem (12) is modified in order to relax the group-wise orthonormality condition. We denote by β̂^[k] the value of the parameter at the end of iteration k. During iteration k + 1, we update each coordinate sequentially. Suppose that the first j − 1 coordinates have been updated; to update the j-th coefficient we would ideally like to minimise a criterion γ_n(β^j) written in terms of the current predictions Ŷ_i of the Y_i. If Γ̂_j is not the identity, the minimisation of γ_n(β^j) has no explicit solution. To circumvent the problem, the idea is to upper-bound the quadratic term of γ_n using a bound b_j on the spectral radius ρ(Γ̂_j) of Γ̂_j. Instead of minimising γ_n, we minimise this upper bound γ̄_n, whose minimisation has an explicit solution: a groupwise soft-thresholding step. After an initialisation step, the updates of the estimated coefficients are then given by Equation (13).
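In finite dimension (for instance after expanding each covariate on a basis of H^(m)), one full cycle of these updates can be sketched as below. This is a minimal illustration and not the paper's implementation: the criterion is taken as (1/(2n))∥y − Σ_j X_j β_j∥² + Σ_j λ_j ∥β_j∥, the majorization constant b_j is the spectral radius of X_j^T X_j / n, and all names are illustrative.

```python
import numpy as np

def group_soft_threshold(z, t):
    """Shrink the vector z toward 0 by t in Euclidean norm."""
    nz = np.linalg.norm(z)
    if nz <= t:
        return np.zeros_like(z)
    return (1.0 - t / nz) * z

def gmd_group_lasso(X_blocks, y, lam, n_iter=500, tol=1e-10):
    """Groupwise-majorization-descent for a group-lasso criterion.

    X_blocks[j]: (n, d_j) design block of covariate j; lam[j]: its
    penalty level.  Each group update is a gradient step followed by
    groupwise soft-thresholding (the closed-form minimiser of the
    majorized criterion)."""
    n = len(y)
    betas = [np.zeros(Xj.shape[1]) for Xj in X_blocks]
    # b_j >= spectral radius of X_j^T X_j / n (majorization constant)
    b = [max(np.linalg.norm(Xj, ord=2) ** 2 / n, 1e-12) for Xj in X_blocks]
    pred = np.zeros(n)
    for _ in range(n_iter):
        max_change = 0.0
        for j, Xj in enumerate(X_blocks):
            resid = y - pred                      # current residual
            point = betas[j] + Xj.T @ resid / (n * b[j])
            new_bj = group_soft_threshold(point, lam[j] / b[j])
            delta = new_bj - betas[j]
            max_change = max(max_change, np.linalg.norm(delta))
            pred += Xj @ delta                    # keep predictions in sync
            betas[j] = new_bj
        if max_change < tol:
            break
    return betas
```

Because the update is a proximal step on a majorant of the quadratic term, no group-wise orthonormality is needed: only an upper bound on each block's spectral radius.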
Remark that, in the case of Equation (2), the optimisation is done directly in the space H and does not require the data to be projected. Consequently, it avoids the loss of information and the computational cost due to the projection of the data onto a finite dimensional space, as well as, for data-driven bases such as PCA or PLS, the computational cost of the calculation of the basis itself.
Drawing inspiration from Friedman et al. (2010), we consider a pathwise coordinate descent scheme starting from the smallest value r_max of r that zeroes all the coefficients: it can be proven that, taking r = r_max, the solution of the minimisation problem (2) is β̂_{λ(r_max)} = (0, ..., 0). Starting from this value, we choose a grid decreasing from r_max to r_min = δ r_max of n_r values equally spaced on the log scale.
As pointed out by Friedman et al. (2010), this scheme leads to a more stable and faster algorithm.
In practice, we chose δ = 0.001 and n r = 100. However, when r is too small, the algorithm does not always converge, in particular when the dimension is large or infinite. We believe that this is linked to the fact that the optimisation problem (2) has no solution as soon as dim(H) ≥ rk( Γ) and λ = 0.
In the case where the noise variance is known, Theorem 1 suggests the following value. We recall that Equation (8) is obtained with probability 1 − p^(1−q). Hence, if we want the result to hold with probability at least 1 − α, we take q = 1 − ln(α)/ln(p). However, in practice, the parameter σ 2 is usually unknown. We propose three methods to choose the parameter r in the grid R and compare them in the simulation study.
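The calibration of q from a target confidence level α is a one-line computation; the sketch below only encodes the relation p^(1−q) = α stated above (the function name is ours).

```python
import math

def q_from_alpha(alpha, p):
    """Exponent q such that p**(1 - q) = alpha, i.e. the oracle
    inequality (8) holds with probability at least 1 - alpha."""
    return 1.0 - math.log(alpha) / math.log(p)
```

For instance, with p = 10 covariates and α = 0.05 this gives q ≈ 2.30, so that p^(1−q) = 0.05 as required.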

V -fold cross-validation
We split the sample into V folds. For each observation, let the prediction be the one made with the estimator of β * minimising criterion (2) (or (3)) using only the data outside the fold containing that observation. We choose the value of r minimising the mean of the cross-validated error:
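The splitting logic can be sketched as follows. The `fit` and `predict` callables stand in for the Lasso estimator of criterion (2) or (3) and are placeholders, not the paper's API; only the V-fold mechanics are illustrated.

```python
import numpy as np

def vfold_cv_error(fit, predict, X, Y, r, V=5, seed=0):
    """V-fold cross-validated prediction error for a penalty level r.

    fit(X_train, Y_train, r) returns a fitted model (placeholder),
    predict(model, X_test) returns predictions (placeholder)."""
    n = len(Y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), V)
    errs = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        model = fit(X[train_idx], Y[train_idx], r)
        pred = predict(model, X[test_idx])
        errs.append(np.mean((Y[test_idx] - pred) ** 2))
    return np.mean(errs)
```

The selected r is then the grid value minimising this cross-validated error.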

Estimation of σ 2
We propose the following estimator of σ 2 , where r min = δr max is the smallest element of the grid R.

BIC criterion
We also consider the BIC criterion, as proposed by Wang et al. (2007) and Wang and Leng (2007). The corresponding values of λ will be denoted respectively by λ (CV) , λ ( σ 2 ) and λ (BIC) . The practical properties of the three methods are compared in Section 6.

Construction of the projected estimator
The projected estimator relies mainly on the choice of the basis (φ (k) ) k≥1 . To verify the support stability condition C supp , one possibility is to proceed as follows.
• Choose, for all j = 1, . . ., p, an orthonormal basis of H j , denoted by (e k ) 1≤k≤dim(H j ) .
• Define the basis (φ (k) ) k≥1 from these bases via a bijection σ.
There are many ways to choose the bases (e k ) 1≤k≤dim(H j ) , j = 1, . . ., p, as well as the bijection σ, depending on the nature of the spaces H 1 , . . ., H p . We give here some examples.

If σ
Proceed in a similar way for k = 2, 3, ..., respecting the constraint on σ.

Example 3: PCA basis with data-driven choice of the bijection σ. Let, for j = 1, . . ., p, ( e k ) 1≤k≤dim(H j ) be the PCA basis of {X j i , i = 1, . . ., n}, that is to say a basis of eigenfunctions (if H j is a function space) or eigenvectors (if dim(H j ) < +∞) of the covariance operator Γ j . We denote by ( µ (j) k ) 1≤k≤dim(H j ) the corresponding eigenvalues. This naturally provides a data-driven choice of the bijection σ, which can be defined such that ( µ σ(k) ) k≥1 is sorted in decreasing order. Since the elements of the PCA basis are data-dependent, but depend only on the X i 's, the results of Section 3 hold but not those of Section 4. Similar results for the PCA basis could be derived from the theory developed in Mas and Ruymgaart (2015) and Brunel et al. (2016), at the price of further theoretical considerations which are beyond the scope of the paper. We follow in Section 6 an approach based on the principal components basis (PCA basis). Other data-driven bases such as Partial Least Squares (PLS; Preda and Saporta 2005; Wold 1975) can also be considered in practice.
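The PCA-based construction of Example 3 can be sketched as follows, for covariates observed on a discretization grid. This is an illustrative reading of the construction, not the paper's code: each block's empirical covariance is diagonalized, and σ is defined by merging all eigenvalues into one decreasing order.

```python
import numpy as np

def pca_bases_and_order(X_blocks):
    """For each block X^j, an (n, d_j) array of discretized curves, compute
    the PCA basis of the empirical covariance; then define the bijection
    sigma by sorting all eigenvalues (across blocks) in decreasing order.

    Returns the per-block bases and sigma as a list of
    (eigenvalue, block index j, within-block rank) triples."""
    bases, all_eigs = [], []
    for j, Xj in enumerate(X_blocks):
        Xc = Xj - Xj.mean(0)
        Gamma = Xc.T @ Xc / Xj.shape[0]           # empirical covariance
        eigvals, eigvecs = np.linalg.eigh(Gamma)
        order = np.argsort(eigvals)[::-1]          # decreasing eigenvalues
        bases.append(eigvecs[:, order])
        all_eigs += [(eigvals[k], j, rank) for rank, k in enumerate(order)]
    sigma = sorted(all_eigs, key=lambda t: -t[0])  # global decreasing order
    return bases, sigma
```

The global order puts the directions carrying the most empirical variance first, whatever block they come from.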

Tikhonov regularization step
It is well known that the classical Lasso estimator is biased (see e.g. Giraud, 2015, Section 4.2.5) because the ℓ 1 penalization favors too strongly solutions with small ℓ 1 norm. To remove this bias, one current method, called the Gauss-Lasso, consists in fitting a least-squares estimator on the sparse regression model obtained by keeping only the coefficients in the support of the Lasso estimate.
This method is not directly applicable here because least-squares estimators are not well-defined in infinite-dimensional contexts. Indeed, to compute a least-squares estimator of the coefficients in the support J of the Lasso estimator, we would need to invert the covariance operator Γ J , which is generally not invertible.
To circumvent the problem, we propose a ridge regression approach (also named Tikhonov regularization below) on the support of the Lasso estimate. A similar approach has been investigated by Liu and Yu (2013) in high-dimensional regression; they have shown the unbiasedness of the combination of Lasso and ridge regression. More precisely, we consider the minimisation problem (14), with ρ > 0 a parameter which can be selected e.g. by V-fold cross-validation. An exact solution of problem (14) exists, but it needs the inversion of the operator Γ J + ρI to be calculated in practice. In order to compute the solution of (14), we define below a stochastic gradient descent algorithm. The algorithm is initialised at the solution (where m = ∞ or m = m) of the Lasso and, at each iteration, performs an update involving the gradient of the criterion to minimise.
In practice we choose α k = α 1 k −1 with α 1 tuned in order to achieve convergence at a reasonable speed.
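The debiasing step can be sketched as follows for discretized data restricted to the selected support. This is a deterministic (full-gradient) sketch of the scheme, under our own naming; the paper's algorithm is a stochastic variant, and α 1 is a tuning constant.

```python
import numpy as np

def ridge_on_support(X_J, Y, beta0, rho, alpha1=0.1, n_iter=500):
    """Gradient descent for the ridge criterion on the support J:
        (1/n) ||Y - X_J beta||^2 + rho ||beta||^2,
    with step size alpha_k = alpha1 / k and initial point beta0
    (the Lasso solution restricted to J)."""
    n = len(Y)
    beta = beta0.copy()
    for k in range(1, n_iter + 1):
        grad = -2.0 / n * X_J.T @ (Y - X_J @ beta) + 2.0 * rho * beta
        beta -= (alpha1 / k) * grad
    return beta
```

This avoids forming and inverting Γ J + ρI explicitly, at the price of an iterative scheme whose speed depends on the choice of α 1 .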

Numerical study
In this section, we study the practical properties of both estimators β λ,∞ and β λ, m . We first consider a context where the data are simulated and then an application to the prediction of electricity consumption.

Simulation study
We test the algorithm on two examples, with noise level σ = 0.01. The sample size is fixed to n = 1000. The definitions of β *,1 , β *,2 and X are given in Table 1.

Support recovery properties and parameter selection
In Figure 1, we plot the norm of β λ,∞ j as a function of the parameter r. We see that, for all values of r, we have J ⊆ J * , and, if r is sufficiently small, J = J * . We compare in Table 2 the percentage of times the true model is recovered when the parameter r is selected with the three methods described in Section 5.2. We see that the method based on the estimation of σ 2 has very good support recovery performance, but both the BIC and CV criteria do not perform well. Since the CV criterion minimises an empirical version of the prediction error, it tends to select a parameter for which the method has good predictive performance. However, this is not necessarily associated with good support recovery, which could explain the poor performance of the CV criterion in this respect. As a consequence, the method based on the estimation of σ 2 is the only one considered for the projected estimator β λ, m , and in the sequel we will simply write λ = λ ( σ 2 ) .

Lasso estimators
In Figure 2, we plot the first coordinate β λ,∞ 1 of the Lasso estimator β λ,∞ (right) and compare it with the true function β * 1 . We can see that the shapes of both functions are similar, but their norms are completely different. Hence, the Lasso estimator recovers the true support but gives biased estimates of the coefficients β j , j ∈ J * .
For the projected estimator β λ, m , as recommended in Brunel et al. (2016), we set the value of the constant κ of criterion (9) to κ = 2. The selected dimensions are plotted in Figure 3. We can see that the selected dimension is generally quite large and that it is larger for model 2 than for model 1, which indicates that the dimension selection criterion adapts to the complexity of the model. The resulting estimators are plotted in Figure 4. A conclusion similar to that for the projection-free estimator can be drawn concerning the bias problem.

Final estimator
In Figure 5 we see that the Tikhonov regularization step reduces the bias in both examples.
We can compare this with the effect of the Tikhonov regularization step on the whole sample (i.e. without variable selection). It turns out that, when all the covariates are kept, algorithm (15) converges very slowly, leading to poor estimates. The computation times of the estimators on an iMac 3.06 GHz Intel Core 2 Duo, with non-optimised code, are given in Table 3 for illustrative purposes.

Application to the prediction of energy use of appliances
The aim is to study appliances energy consumption, which is the main source of energy consumption, in a low-energy house situated in Stambruges (Belgium). The data are energy consumption measurements of electrical appliances (Appliances), light (light), temperature and humidity in the kitchen (T1 and RH1), in the living room (T2 and RH2), in the laundry room (T3 and RH3), in the office (T4 and RH4), in the bathroom (T5 and RH5), outside the building on the north side (T6 and RH6), in the ironing room (T7 and RH7), in the teenagers' room (T8 and RH8) and in the parents' room (T9 and RH9), as well as the temperature (T out), the pressure (Press mm hg), the humidity (RH out), wind speed (Windspeed), visibility (Visibility) and dewpoint temperature (Tdewpoint) from the weather station of Chievres, which is the weather station of the nearest airport. Each variable is measured every 10 minutes from 11 January 2016, 5 pm to 27 May 2016, 6 pm.
The data are freely available on the UCI Machine Learning Repository (archive.ics.uci.edu/ml/datasets/Appliances+energy+prediction) and have been studied by Candanedo et al. (2017). We refer to this article for a precise description of the experiment and a method to predict appliances energy consumption at a given time from the measurements of the other variables.
Here, we focus on the prediction of the mean appliances energy consumption of one day from the measurement of each variable the day before (from midnight to midnight). We thus dispose of a dataset of size n = 136 with p = 24 functional covariates. Our variable of interest is the logarithm of the mean appliances consumption. In order to obtain better results, we divide the covariates by their range. More precisely, the j-th curve of the i-th observation X j i is transformed as follows.
Recall that usual standardisation techniques are not possible for infinite-dimensional data since the covariance operator of each covariate is not invertible. The above transformation allows us to obtain covariates of the same order of magnitude. All the variables are then centered.
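The preprocessing described above can be sketched as follows for covariates observed on a common grid. This is our own illustrative reading of the transformation (the exact formula is given in the paper); the array layout `(n, p, T)` is an assumption.

```python
import numpy as np

def range_scale_and_center(X):
    """Divide each functional covariate by its range over the whole
    sample, then center pointwise. X has shape (n, p, T): n observations,
    p covariates, T grid points."""
    Xs = X.astype(float).copy()
    for j in range(X.shape[1]):
        range_j = X[:, j, :].max() - X[:, j, :].min()
        if range_j > 0:
            Xs[:, j, :] /= range_j           # covariates on the same scale
        Xs[:, j, :] -= Xs[:, j, :].mean(axis=0)  # pointwise centering
    return Xs
```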
We first plot the evolution of the norm of the coefficients as a function of r.The results are shown in Figure 6.
The variables selected by the Lasso criterion are the appliances energy consumption (Appliances), the temperature of the laundry room (T3) and the temperature of the teenagers' room (T8) curves. The corresponding slopes are represented in Figure 7. We observe that all the curves take larger values at the end of the day (after 8 pm). This indicates that the values of the three selected covariates that influence the most the mean appliances energy consumption of the following day are those measured at the end of the day.

Concluding remarks
The objective of the paper was to study how the theoretical results obtained for Lasso and Group-Lasso penalties can be adapted when the dimension of the covariates is infinite, which is, in particular, the case of functional data.

Discussion on the theoretical results and open-questions
As in finite dimension, the main issue is to control the relationship between the empirical norm naturally associated with the least squares criterion, related to the covariance matrix of the covariates, and the norm inducing the sparsity appearing in the penalty. The main problem here is that, as we prove in Lemma 1, these two norms cannot be equivalent in infinite dimension. The unprojected estimator seems to have very good performance in practice, and its solution can be computed easily. However, its rate of convergence remains an open question. On the other hand, we prove sharp oracle inequalities for the projected estimator, and we are able to define a data-driven dimension selection criterion that achieves the best trade-off between the bias and variance terms. However, the rate of convergence of this estimator has not been proven to be optimal. Intuitively, it is not, and it seems likely that an adaptive Lasso procedure is needed to attain an optimal rate in the minimax sense. These questions, which seem complex to solve, are left for future work. We could also consider an alternative restricted eigenvalues assumption, as in Jiang et al. (2019), and suppose that there exist two positive numbers κ 1 and κ 2 such that the corresponding inequality holds, where we use the notation defined there. This assumption does not suffer from the curse of dimensionality as the assumption A RE(s) does. However, contrary to the finite-dimensional case, the control of the probability that this assumption holds in the random design case is, to our knowledge, still an open question.
Discussion on the numerical results

From the simulation results, both methods seem to estimate the support of the slope coefficient β * well. However, the projected method, which gives the most accurate theoretical results, is quite difficult to implement in practice, due to the cost of constructing the spaces H (m) . On the contrary, the unprojected estimator seems to give interesting results, both in the simulation study and in the application to real data. However, the theoretical results (see for instance the remark after Corollary 1) argue for the choice of a finite value of m.
Discussion on the linearity assumption

The linearity assumption may be too restrictive in some contexts. A natural way to consider a nonlinear regression model is to assume that Y = m(X) + ε, where m : H → R is an unknown regression function. However, it has been shown by Mas (2012) that, without additional structural assumptions on m, this model suffers from the curse of dimensionality, which manifests itself here by a very slow minimax rate of convergence, typically logarithmic (see also the recent review by Ling and Vieu 2018 and the discussion in Geenens 2011; Chagny and Roche 2016). This is also the case for additive models Y = m 1 (X 1 ) + . . . + m p (X p ) + ε, with unknown functions m j : H j → R, which would be natural models in which to consider the sparsity problem. This is the reason why semi-parametric models have been introduced and widely studied. In this category, we can mention for example the partially linear models (Kong et al., 2016; Wong et al., 2019), where we recall that X p∞+1 , . . ., X p are scalar or vector covariates and X 1 , . . ., X p∞ are functional covariates. The approach developed in this paper could be directly extended to this model by considering estimators by projection of the m j 's, as in Bunea et al. (2007). However, this introduces an additional bias that needs to be handled in the theoretical results and requires careful selection of the projection spaces and their dimensions.
This model has been generalized, for example, to the case of single-index models (see Novo et al. 2021 and references therein), where the g j 's are unknown real functions. This type of model poses theoretical questions more difficult to solve than the previous one, because the coefficients β j do not depend linearly on the observations.

Acknowledgements

I would like to thank Vincent Rivoirard and Gaëlle Chagny for their helpful advice and careful reading of the manuscript. This research is partly supported by the French Agence Nationale de la Recherche (ANR-18-CE40-0014, project SMILES).

A.2 Proof of Proposition 1
Proof. We prove only (8); Inequality (7) follows the same lines. The proof below is largely inspired by the proof of Lounici et al. (2011), using the improvement of Bellec and Tsybakov (2017) to obtain a sharp oracle inequality. First remark that some algebra gives us the following result, true for all β ∈ H (m) . We can easily verify that the function γ is a proper convex function. Hence, β λ,∞ is a minimiser of γ over H if and only if 0 belongs to the subdifferential ∂γ( β λ,∞ ) of γ at the point β λ,∞ .
Since, for all j = 1, ..., p, the subdifferential of ∥•∥ j at the point 0 is the closed unit ball of H j , the subdifferential of γ 2 : β → 2 p j=1 λ j ∥β j ∥ j at a point β ∈ D c is the following set, and hence the subdifferential of γ at a point β = (β 1 , ..., β p ) ∈ H is the set given below. Then, since 0 ∈ ∂γ( β λ,∞ ), we know that there exists δ = ( δ 1 , . . ., δ p ) ∈ ∂γ 2 ( β λ,∞ ) satisfying the stated equation. Now remark that, denoting by [ β λ,∞ ] j the j-th coordinate of β λ,∞ , we have, by definition of δ, the stated bound. Then, inserting (18) and (19) in (16), we obtain the following inequality. The key result proven by Bellec et al. (2018, Lemma A.2) in a finite-dimensional context also holds in our infinite-dimensional context. We now deal with the term involving the ε i 's. Let A = p j=1 A j , with the A j 's defined below. On the set A, gathering equations (20) and (21), we consider two cases. In case 2., the result is obvious. In case 1., we have, by definition of κ, the stated inequality. Then, using twice the fact that, for all x, y ∈ R, 3xy ≤ x 2 + (9/4)y 2 , Equation (22) becomes the following bound, where we recall the definition given below, which implies the expected result.
We turn now to the upper bound on the probability of the complement of the event A. Conditionally on X 1 , . . ., X n , since {ε i } 1≤i≤n ∼ i.i.d. N (0, σ 2 ), the variable 1 n n i=1 ε i X j i is a Gaussian random variable taking values in the Hilbert (hence Banach) space H j . Therefore, from Proposition 2, denoting P X the conditional probability, we obtain the stated bound. This implies the result as soon as r n ≥ 4 √ 2σ q ln(p)/n.
since C(η) ≤ 2. We also know that P(A c ) ≤ p 1−q and P(B c ) ≤ C M S n .
We now define the set C = C 1 ∩ C 2 and prove the stated bound, where r Γ > 0 depends only on Γ and is defined below. We apply Proposition 4 to bound the corresponding probability. We remark that there exists a constant r Γ > 0 such that the stated inequality holds for all m. Then, from Equation (28) and the fact that N n ≤ n, we get the bound on P(C c 1 ). We turn now to the upper bound on the probability of C c 2 and apply again Proposition 4. On the set A ∩ B ∩ C, we then have, for all m = 1, . . ., N n , the stated inequality. We upper-bound the quantities above using Bernstein's inequality (Proposition 3, p. 36).
Proof of Lemma 2. We follow the ideas of Baraud (2000). Let m be fixed, and define S m = {(x 1 , . . ., x n ) ∈ R n : ∃f ∈ H (m) , ∀i, x i = ⟨f , X i ⟩}.
We know that S m is a linear subspace of R n and that the supremum over f ∈ H (m) with ∥f ∥ n = 1 can be expressed via ε = (ε 1 , . . ., ε n ) t and P m , the matrix of the orthogonal projection onto S m . This gives us the stated identity. We now apply Bellec (2019, Theorem 3) with A = P m and obtain the expected results.
Let X be a Gaussian random variable in a Banach space (B, ∥ • ∥).For every t > 0, .
Let Z 1 , . . ., Z n be independent random variables satisfying the moment conditions below for some positive constants v and c. Then the stated bound holds for any positive ε.

Proposition 4 (Norm equivalence in finite subspaces).

Figure 3: Bar charts of the selected dimension m over the 50 Monte Carlo replications for the projected estimator β λ, m .

Table 3 :
Computation time of the estimators.