Bayesian functional linear regression with sparse step functions

The functional linear regression model is a common tool to determine the relationship between a scalar outcome and a functional predictor seen as a function of time. This paper focuses on the Bayesian estimation of the support of the coefficient function. To this aim we propose a parsimonious and adaptive decomposition of the coefficient function as a step function, and a model including a prior distribution that we name Bayesian functional Linear regression with Sparse Step functions (Bliss). The aim of the method is to recover the periods of time which most influence the outcome. A Bayes estimator of the support is built with a specific loss function, as well as two Bayes estimators of the coefficient function: a first one which is smooth and a second one which is a step function. The performance of the proposed methodology is analysed on various synthetic datasets and is illustrated on a black P\'erigord truffle dataset to study the influence of rainfall on production.


Introduction
Consider that one wants to explain the final outcome y of a process along time (for instance the amount of some agricultural production) thanks to what happened during the whole history (for instance the rainfall history, or the temperature history). Among statistical learning methods, functional linear models (Ramsay and Silverman, 2005) aim at predicting a scalar y based on covariates x_1(t), x_2(t), ..., x_q(t) lying in a functional space, say L^2(T), where T is an interval of R. If x_{q+1}, ..., x_p are additional scalar covariates, the outcome y is predicted linearly with
\[
y = \mu + \int_T \beta_1(t)\,x_1(t)\,dt + \cdots + \int_T \beta_q(t)\,x_q(t)\,dt + \beta_{q+1}x_{q+1} + \cdots + \beta_p x_p, \tag{1}
\]
where \mu is the intercept, \beta_1(t), ..., \beta_q(t) the coefficient functions, and \beta_{q+1}, ..., \beta_p the other (scalar) coefficients. In this framework the functional covariates x_j(t) and the unknown coefficient functions \beta_j(t) lie in the L^2(T) functional space; thus we face a nonparametric problem. Standard methods (Ramsay and Silverman, 2005) for estimating the \beta_j(t)'s, 1 ≤ j ≤ q, are based on the expansion onto a given basis of L^2(T); see Reiss et al. (2016) for a comprehensive survey of the methodology. A question which arises naturally in many applied contexts is the detection of the periods of time which influence the final outcome y the most. Note that each integral in (1) is a weighted average of the whole trajectory of x_j(t), and does not identify any specific impact of a local period of the process. These time periods might vary from one covariate to another. For instance, in agricultural science, the final outcome may depend on the amount of rainfall during a given period (e.g., to prevent rotting) and on the temperature during another (e.g., to prevent freezing). Standard methods do not answer the above question, namely to recover the support of the coefficient functions \beta_j(t), with the noticeable exception of Picheny et al. (2016).
Unlike scalar-on-image models, we focus here on one-dimensional functional covariates. When T is not a one-dimensional space, the problem becomes much more complex: the functional covariates and the coefficient functions are all discretized, e.g. via the pixels of the images, see Goldsmith et al. (2014); Li et al. (2015); Kang et al. (2016). In these two- or three-dimensional problems, because of the curse of dimensionality, the points which are included in the support of the coefficient functions follow a parametric distribution, namely an Ising model. One important issue solved by these authors is the sensitivity of the parameter estimate of the Ising model in the neighborhood of the phase transition.
When T is a one-dimensional space, we can build nonparametric estimates. In this vein, using an L^1-penalty to achieve parsimony, the Flirti method of James et al. (2009) obtains an estimate of the β_j(t)'s assuming they are sparse functions with sparse derivatives. Nevertheless, Flirti is difficult to calibrate: its numerical results depend heavily on tuning parameters, which the authors propose to set with cross-validation. From our experience, Flirti's estimate is so sensitive to the values of the tuning parameters that cross-validation can miss the range of good values. Moreover, by definition, cross-validation assesses the predictive performance of a model, see Arlot and Celisse (2010) and the many references therein; and, of course, optimizing the performance regarding the prediction of y does not provide any guarantee regarding the support estimate. Zhou et al. (2013) propose a two-stage method to estimate the coefficient function. Preliminarily, β(t) is expanded onto a B-spline basis to reduce the dimension of the model. The first stage estimates the coefficients of the truncated expansion using a lasso method to find the null intervals. Then, the second stage refines the estimation of the null intervals and estimates the magnitude of β(t) on the rest of the support. Another approach to obtain parsimony is to rely on the Fused lasso (Tibshirani et al., 2005): if we discretize the covariate functions and the coefficient function as described in James et al. (2009), the Fused lasso penalization induces parsimony in the coefficients. But, once again, the calibration of the penalization is performed using cross-validation, which targets predictive performance rather than the accuracy of the support estimate.
In this paper, we propose Bayesian estimates of both the supports and the coefficient functions β_j(t). To keep the dimension of the parameter as low as possible, we stick to the simplest and most parsimonious shape of the coefficient function over its support. Hence, conditionally on the support, the coefficient functions β_j(t) are supposed to be step functions (a piecewise constant function can be described with a minimal number of parameters). We can decompose any step function β(t) as
\[
\beta(t) = \sum_{k=1}^{K} \beta^*_k \,\frac{1\{t \in I_k\}}{|I_k|},
\]
where I_1, ..., I_K are intervals of T, |I_k| is the length of the interval I_k and the β*_k are the coefficients of the expansion. The support is the union of the I_k's whose coefficients β*_k are non-null; periods of time which do not influence the outcome are outside the support. The above model has another advantage: such step functions change values abruptly from 0 to a non-null value, hence their supports are relatively clear. On the contrary, if we have at our disposal a smooth estimate of a coefficient function β_j(t) in the model given by (1), the support of the estimate is the whole of T and we have to find regions where the estimate is not significantly different from 0. Moreover, with a full Bayesian procedure, we can evaluate the uncertainty on the estimates of the support and of the values of the coefficient functions.
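As a minimal numerical illustration, the decomposition above can be evaluated on a grid. The function below is our own sketch (the helper name and the example intervals are not from the paper) of β(t) = Σ_k β*_k 1{t ∈ I_k}/|I_k|, allowing the intervals to overlap, in which case contributions simply add up.

```python
import numpy as np

def step_function(t, intervals, coefs):
    """Evaluate beta(t) = sum_k coefs[k] * 1{t in I_k} / |I_k|.

    intervals: list of (left, right) endpoints; the intervals may overlap,
    in which case the contributions add up.
    """
    t = np.asarray(t, dtype=float)
    beta = np.zeros_like(t)
    for (a, b), c in zip(intervals, coefs):
        beta += c * ((t >= a) & (t <= b)) / (b - a)
    return beta

# Two overlapping intervals of T = [0, 1]; the support is [0.1, 0.4]
vals = step_function([0.05, 0.15, 0.25, 0.35],
                     intervals=[(0.1, 0.3), (0.2, 0.4)],
                     coefs=[2.0, -1.0])
# -> [0., 10., 5., -5.]
```

On the overlap [0.2, 0.3] the two rescaled indicators sum to 2/0.2 − 1/0.2 = 5, which is why no non-overlap constraint is needed in the parametrization.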
The paper is organized as follows. Section 2 presents the Bayesian modelling, including the prior distribution in 2.2, the support estimate in 2.3 and the coefficient function estimate in 2.4. Section 3 is devoted to the study of numerical results on synthetic data, with comparisons to other methods and the sensitivity to the tuning of the hyperparameters of the prior. Section 4 details the results of Bliss on a dataset concerning the influence of rainfall on the growth of the black Périgord truffle.

The Bliss method
We present the hierarchical Bayesian model in Section 2.2, the Bayes estimate of the support in Section 2.3 and two Bayes estimates of the coefficient function in Section 2.4. The implementation and visualization details are given at the end of this second part.

Reducing the model
Assume we have observed n independent replicates y_i (1 ≤ i ≤ n) of the outcome, explained with the functional covariates x_{ij}(t) (1 ≤ i ≤ n, 1 ≤ j ≤ q) and the scalar covariates x_{ij} (1 ≤ i ≤ n, q + 1 ≤ j ≤ p). The whole dataset will be denoted D in what follows. Let us denote by x_i = {x_{i1}(t), ..., x_{iq}(t), x_{i,q+1}, ..., x_{ip}} the set of all covariates, and by θ the set of all parameters, namely {β_1(t), ..., β_q(t), β_{q+1}, ..., β_p, µ, σ²}, where σ² is a variance parameter. We resort to the Gaussian likelihood defined by
\[
y_i = \mu + \sum_{j=1}^{q} \int_T \beta_j(t)\,x_{ij}(t)\,dt + \sum_{j=q+1}^{p} \beta_j x_{ij} + \varepsilon_i, \qquad \varepsilon_i \overset{iid}{\sim} \mathcal{N}(0, \sigma^2).
\]
If we set a prior on the parameter θ which includes all β_j(t)'s, β_j, µ and σ², we can recover the full posterior (both theoretically and practically with a Gibbs sampler) from the conditional distributions
\[
\pi\big(\beta_j \,\big|\, \beta_{-j}, \mu, \sigma^2, D\big), \qquad j = 1, \ldots, p,
\]
where β_{−j} represents the set of β-parameters except β_j or β_j(t). Hence we can reduce the problem to a single functional covariate and no scalar covariate. The model we have to study becomes
\[
y_i = \mu + \int_T \beta(t)\,x_i(t)\,dt + \varepsilon_i, \qquad \varepsilon_i \overset{iid}{\sim} \mathcal{N}(0, \sigma^2), \tag{3}
\]
with a single functional covariate x_i(t).

Model on a single functional covariate
For parsimony we seek the coefficient function β(t) in the following set of sparse step functions:
\[
\mathcal{E}_K = \Big\{ \beta : \beta(t) = \sum_{k=1}^{K} \beta^*_k \,\frac{1\{t \in I_k\}}{|I_k|},\ \ \beta^*_k \in \mathbb{R},\ I_k \text{ an interval of } T \Big\},
\]
where K is a hyperparameter that counts the number of intervals required to define the function. Note that we do not make any assumption regarding the intervals I_1, ..., I_K. First, they need not form a partition of T. As a consequence, a function β(t) in E_K is piecewise constant and null outside the union of the intervals I_k, k = 1, ..., K. This union is the support of β(t), hence the model includes an explicit description of the support. Second, the intervals I_1, ..., I_K can even overlap, which eases the parametrization of the intervals: we do not have to add constraints on the parametrization to remove possible overlaps.
Given this decomposition, the integral of the covariate function x_i(t) against β(t) becomes a linear combination of partial integrals of the covariate function over the intervals I_k, and we predict y_i with
\[
\widehat{y}_i = \mu + \sum_{k=1}^{K} \beta^*_k\, x_i(I_k), \qquad x_i(I_k) = \frac{1}{|I_k|}\int_{I_k} x_i(t)\,dt.
\]
Thus, given the intervals I_1, ..., I_K, we face a multivariate linear model with the usual Gaussian likelihood.
It remains to set a parametrization of E_K and a prior distribution. Each interval I_k is parametrized with its center m_k and its half-length ℓ_k:
\[
I_k = \big[m_k - \ell_k,\; m_k + \ell_k\big].
\]
As a result, when K is fixed, the parameter of the model is θ = (m_1, ..., m_K, ℓ_1, ..., ℓ_K, β*_1, ..., β*_K, µ, σ²).
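Under this parametrization, the conditional design matrix of the multivariate linear model has entries x_i(I_k), the averages of the i-th curve over I_k = [m_k − ℓ_k, m_k + ℓ_k]. Here is a sketch on a discretized grid (the function name and the Riemann-sum approximation of the integral are our own choices):

```python
import numpy as np

def design_matrix(X, grid, centers, half_lengths):
    """Approximate x_i(I_k) = (1/|I_k|) * integral of x_i over I_k on a regular grid.

    X: (n, p) array whose rows are the discretized curves x_i(t_j).
    Returns the (n, K) conditional design matrix.
    """
    Z = np.zeros((X.shape[0], len(centers)))
    for k, (m, l) in enumerate(zip(centers, half_lengths)):
        mask = (grid >= m - l) & (grid <= m + l)
        Z[:, k] = X[:, mask].mean(axis=1)  # Riemann-sum average over I_k
    return Z

grid = np.linspace(0, 1, 101)
X = np.vstack([np.full(101, 1.0), np.full(101, 3.0)])  # two constant curves
Z = design_matrix(X, grid, centers=[0.5], half_lengths=[0.2])
```

Constant curves average to their constant value over any interval, which makes the sketch easy to check.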
We first define the prior on the support, that is to say on the intervals I_k. The prior on the center of each interval is uniform on the whole range of time T, for k = 1, ..., K: this uniform prior does not promote any particular region of T. Furthermore, the prior on the half-length ℓ_k of the interval I_k is the Gamma distribution Γ(a, b). To understand this prior and set the hyperparameters a and b, we introduce the prior probability that a given t ∈ T is in the support, namely
\[
\alpha(t) = \int_{\Theta_K} 1\{t \in S_\theta\}\, \pi_K(\theta)\, d\theta, \tag{7}
\]
where π_K is the prior distribution on the range of parameters Θ_K of dimension 3K + 2, and where S_θ = Supp(β_θ) is the support of β_θ(t), that is to say the union of the I_k's. The value of α(t) depends on the hyperparameters a and b. These parameters should be fixed with the help of prior knowledge on α(t).

Figure 1: The full Bayesian model. The coefficient function β(t) = Σ_{k=1}^{K} β*_k 1{t ∈ I_k}/|I_k| defines both a projection of the covariate functions x_i(t) onto R^K, by averaging the function over each interval I_k, and a prediction of y_i which depends on the vector β* = (β*_1, ..., β*_K) and the intercept µ.
Given the intervals, or equivalently, given the m_k's and ℓ_k's, the functional linear model becomes a multivariate linear model with the x_i(I_k) as scalar covariates. We could set a standard and well-understood prior on β*|(I_k)_{1≤k≤K}, namely the g-Zellner prior with g = n, in order to define a vaguely informative prior. More specifically, the design matrix given the intervals is x·(I·) = (x_i(I_k))_{1≤i≤n, 1≤k≤K}, and the g-Zellner prior with g = n is given by
\[
\beta^* \,\big|\, (I_k)_{1\le k\le K}, \sigma^2 \ \sim\ \mathcal{N}\big(0,\ n\,\sigma^2\, G^{-1}\big), \tag{8}
\]
where β* = (β*_1, ..., β*_K) and G = x·(I·)^T x·(I·) is the Gram matrix. But, depending on the intervals I_k, the covariates x_i(I_k) can be highly correlated. (We recall here that the functional covariate can have autocorrelation and that the intervals can overlap.) That is why, in this setting, the Gram matrix G can be ill-conditioned, that is to say not numerically invertible, and we cannot resort to the g-Zellner prior. To solve this issue we have to decrease the condition number of G and apply a Tikhonov regularization. The resulting prior is a ridge-Zellner prior (Baragatti and Pommeret, 2012), which replaces G by G + ηI in (8), where η is a scalar tuning the amount of regularization and I is the identity matrix. Adding the ηI matrix shifts all eigenvalues of the Gram matrix by η. In order to obtain a well-conditioned matrix, we fix η with the help of the largest eigenvalue of the Gram matrix, λ_max(G), and set η = v λ_max(G), where v is a hyperparameter of the model.
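The regularization step can be sketched as follows (the function name is ours). Shifting the spectrum by η = v λ_max(G) bounds the condition number of G + ηI by (1 + v)/v, whatever the correlation between the x_i(I_k)'s:

```python
import numpy as np

def regularized_gram(Z, v=5.0):
    """Return G + eta*I with G = Z^T Z and eta = v * lambda_max(G)."""
    G = Z.T @ Z
    eta = v * np.linalg.eigvalsh(G).max()  # shift all eigenvalues by eta
    return G + eta * np.eye(G.shape[0])

# Perfectly collinear columns make G singular, yet G + eta*I is invertible
Z = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
G_reg = regularized_gram(Z, v=5.0)
```

With v = 5 (the recommended default) the condition number of the regularized Gram matrix is at most 6/5.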
To sum up the above, the prior distribution on Θ_K factorizes as
\[
\pi_K(\theta) \;=\; \Big[\prod_{k=1}^{K} \pi(m_k)\,\pi(\ell_k)\Big]\; \pi\big(\beta^* \,\big|\, (I_k)_{1\le k\le K}, \sigma^2\big)\; \pi(\mu \mid \sigma^2)\; \pi(\sigma^2), \tag{9}
\]
with π(m_k) uniform on T, π(ℓ_k) = Γ(a, b), the ridge-Zellner prior on β*, a Gaussian prior on the intercept µ whose variance is driven by v_0, and the improper prior π(σ²) ∝ 1/σ². The resulting Bayesian modelling is given in Figure 1 and depends on the hyperparameters v_0, v, a and K. We denote by π_K(θ) and π_K(θ|D) the prior and the posterior distributions. We propose below default values for the hyperparameters v_0, v and a; see Section 3.4 for numerical results that support this proposal.
• The parameter v_0 drives the prior information we put on the intercept µ. This is clearly not the most important hyperparameter since we expect important information regarding µ in the likelihood. We recommend using v_0 = 100 × ȳ², where ȳ is the average of the outcome on the dataset. Even if it may look like we set the prior with the current data, the resulting prior is vaguely non-informative.
• The parameter v is more difficult to set: it tunes the amount of regularization in the g-Zellner prior. Our set of numerical studies indicates, see Section 3 below, that v = 5 is a good value.
• The parameter a sets the prior length of an interval of the support. It should depend on the number K of intervals. We recommend the value a = (5K)^{-1} so that the average length of an interval from the prior distribution is proportional to 1/K. Our numerical studies show that the constant 5 in the above recommendation does not drastically influence the results.
Finally, the hyperparameter K drives the number of intervals, and thus the dimension of Θ_K. We can put an extra prior distribution on K and perform Bayesian model choice, either to infer K or to aggregate posteriors coming from various values of K. There is a ban on the use of improper priors together with Bayesian model choice (or Bayes factors) because of the Jeffreys-Lindley paradox (see, e.g., Robert, 2007, Section 5.2). A careful reader will notice the improper prior on σ² here; but it does not prohibit the use of Bayesian model choice because σ² is a parameter common to all models (i.e., to all values of K here).

Estimation of the support
Regarding the inference of the support, an interesting quantity is the posterior probability that a given t ∈ T is in the support. It can be defined like the prior probability in (7), that is to say
\[
\alpha(t \mid D) = \int_{\Theta_K} 1\{t \in S_\theta\}\, \pi_K(\theta \mid D)\, d\theta. \tag{10}
\]
Both functions α(t) and α(t|D) can be easily computed with a sample from the prior and the posterior respectively. They are also relatively easy to interpret in terms of the marginal distribution of the support: fix t ∈ T, then
• α(t) is the prior probability that t is in the support of the coefficient function, and
• α(t|D) is the posterior probability of the same event.
Now let L_γ(S, S_θ) be the loss function given by
\[
L_\gamma(S, S_\theta) = (1-\gamma)\,\big|S_\theta \setminus S\big| + \gamma\,\big|S \setminus S_\theta\big|,
\]
where S_θ = Supp(β_θ) is the support of β_θ(t), the coefficient function as parametrized in (5), |·| denotes the Lebesgue measure, and γ is a tuning parameter in [0, 1]. Actually, there are two types of errors when estimating the support:
• error of type I: a point t ∈ T which really belongs to the support S_θ has not been included in the estimate;
• error of type II: a point t ∈ T has been included in the support estimate but does not lie in the real support S_θ;
and the tuning parameter γ allows to set different weights on the two types of error. Note that, when γ = 1/2, the loss function is one half of the Lebesgue measure of the symmetric difference S∆S_θ.
Bayes estimates are obtained by minimizing a loss function integrated with respect to the posterior distribution, see Robert (2007). Hence, in this situation, Bayes estimates of the support are given by
\[
\widehat{S}_\gamma(D) \in \operatorname*{arg\,min}_{S} \int_{\Theta_K} L_\gamma(S, S_\theta)\, \pi_K(\theta \mid D)\, d\theta. \tag{12}
\]
The following theorem shows the existence of the Bayes estimate and how to compute it from α(t|D).
Theorem 1. The level set of α(t|D) defined by
\[
\widehat{S}_\gamma(D) = \big\{ t \in T : \alpha(t \mid D) \ge \gamma \big\}
\]
is a Bayes estimate associated to the above loss L_γ(S, S_θ). Moreover, up to a set of null Lebesgue measure, any Bayes estimate that solves the optimisation problem given in (12) coincides with this level set.
The proof of the above theorem is given in Appendix A.1. Although simple-looking, the proof requires some caution because sets should be Borelian sets. Note that, when we try to completely avoid errors of type I (resp. type II) by setting γ = 0 (resp. γ = 1), the support estimate is T (resp. ∅). Additionally, Theorem 1 shows how we should interpret the posterior probability α(t|D), and that its plot may be one important output of the Bayesian analysis proposed in this paper: it measures the evidence that a given point is in the support of the coefficient function. Finally, note that the number of intervals in the support estimate S_γ(D) can be, and often is, different from the value of K (because intervals can overlap).
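In practice α(t|D) is evaluated on a grid, and the level set of Theorem 1 becomes a union of runs of consecutive grid points above the threshold γ. The helper below is our own sketch of that extraction:

```python
import numpy as np

def support_estimate(alpha_post, grid, gamma=0.5):
    """Level-set estimate {t : alpha(t|D) >= gamma}, as a list of intervals."""
    mask = np.asarray(alpha_post) >= gamma
    padded = np.concatenate(([0], mask.astype(int), [0]))
    edges = np.flatnonzero(np.diff(padded))  # starts/ends of the True runs
    return [(grid[a], grid[b - 1]) for a, b in zip(edges[::2], edges[1::2])]

grid = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
alpha = np.array([0.1, 0.6, 0.7, 0.2, 0.9])
est = support_estimate(alpha, grid, gamma=0.5)
# -> [(0.25, 0.5), (1.0, 1.0)]
```

With γ = 0 every grid point is kept (the estimate is all of T) and with γ just above 1 the estimate is empty, matching the remark on the two extreme choices of γ.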

Estimation of the coefficient function
The Bayesian modelling given in Section 2.2 was mainly designed to estimate the support of the coefficient function. We can nevertheless provide Bayes estimates of the coefficient function. We propose here two Bayes estimates of the coefficient function. The first one, given in Equation (13) is a smooth estimate, whereas the second estimate, given in Proposition 3, is a stepwise estimate which is parsimonious and may be more easily interpreted.
With the default quadratic loss, a Bayes estimate is defined as
\[
\widehat{\beta}_{L^2}(\cdot) \in \operatorname*{arg\,min}_{\beta \in L^2(T)} \int_{\Theta_K} \big\| \beta - \beta_\theta \big\|_{L^2}^2\, \pi_K(\theta \mid D)\, d\theta, \tag{13}
\]
where β_θ(t) is the coefficient function as parametrized in (5). At least heuristically, β_{L²}(·) is the average of β_θ(·) over the posterior distribution π_K(θ|D), though the average of functions taking values in L²(T) under some probability distribution is hard to define (using either Bochner or Pettis integrals). In this simple setting we can claim the following (see Appendix A.2 for the proof).
Proposition 2. The function
\[
\widehat{\beta}_{L^2}(t) = \int_{\Theta_K} \beta_\theta(t)\, \pi_K(\theta \mid D)\, d\theta \tag{14}
\]
is in L²(T) and solves the optimization problem (13).
Averages such as (14) belong to the closure of the convex hull of the support E K of the posterior distribution. We can prove (see Proposition 5 in Appendix A.4) that the convex hull of E K is the set E = ∪ ∞ K=1 E K of step functions on T , and the closure of E is L 2 (T ). Hence the only guarantee we have on β L 2 as defined in (14) is that β L 2 lies in L 2 (T ), a much larger space than the set of step functions. Though not shown here, integrating the β θ (t)'s over θ with respect to the posterior distribution has regularizing properties, and the Bayes estimate β L 2 (t) is smooth.
To obtain an estimate lying in the set of step functions, namely E, we can consider the projection of β_{L²} onto the set E_{K₀} for a suitable value of K₀, possibly different from K. However, due to the topological properties of L²(T) and E_{K₀}, the projection of β_{L²} onto the set E_{K₀} does not always exist (see Appendix A.4). To address this problem, we introduce a subset E^ε_{K₀} of E_{K₀}, where ε > 0 is a tuning parameter. Let F_ε denote the set of step functions β(t) ∈ L²(T) which can be written as
\[
\beta(t) = \sum_{k} b_k\, 1\{t \in J_k\},
\]
where the intervals J_k are mutually disjoint and each of length greater than ε.
The set E^ε_{K₀} is now defined as F_ε ∩ E_{K₀}. By considering this set, we remove from E_{K₀} the step functions which have intervals of very small length, and we can prove the following.
Proposition 3. (i) The projection
\[
\widehat{\beta}^{\varepsilon}_{K_0}(\cdot) = \operatorname*{arg\,min}_{\beta \in \mathcal{E}^{\varepsilon}_{K_0}} \big\| \beta - \widehat{\beta}_{L^2} \big\|_{L^2}
\]
of β_{L²}(·) onto the set E^ε_{K₀} always exists.
(ii) The estimate β^ε_{K₀}(·) is a true Bayes estimate, associated with the quadratic loss restricted to E^ε_{K₀}. That is to say,
\[
\widehat{\beta}^{\varepsilon}_{K_0}(\cdot) \in \operatorname*{arg\,min}_{\beta \in \mathcal{E}^{\varepsilon}_{K_0}} \int_{\Theta_K} \big\| \beta - \beta_\theta \big\|_{L^2}^2\, \pi_K(\theta \mid D)\, d\theta.
\]
Finally, one should note that the support of the Bliss estimate given in Proposition 3 provides another estimate of the support, which differs from the Bayes estimate introduced in Section 2.3. Obviously, real Bayes estimates, which optimize the loss integrated over the posterior distribution, are by construction better estimates. Another possible alternative would be the definition of an estimate of the coefficient function whose support is given by one of the Bayes estimates defined in Theorem 1. But such estimates do not account for the inferential error regarding the support. Hence we believe that, when it comes to estimating the coefficient function, the Bayes estimates proposed in this section are better than the other candidates and achieve a trade-off between inferential errors on the support and prediction accuracy on new data.

Implementation
The full posterior distribution can be written explicitly from the Bayesian model given in Equations (9). As usual with hierarchical models, sampling from the posterior distribution π_K(θ|D) can be done with a Gibbs algorithm (see, e.g., Robert and Casella, 2013, Chapter 7). The details of the MCMC algorithm are given in Appendix B.1. Now let θ(s), s = 1, ..., N, denote the output of the MCMC sampler after the burn-in period. The computation of the Bayes estimate S_γ(D) of the support as defined in Theorem 1 depends on the probabilities α(t|D). With the Monte Carlo sample from the MCMC, we can easily approximate these posterior probabilities by the frequencies
\[
\widehat{\alpha}(t \mid D) = \frac{1}{N} \sum_{s=1}^{N} 1\big\{ t \in S_{\theta(s)} \big\}.
\]
What remains to be computed are the approximations of β_{L²}(·) and β^ε_{K₀}(·) based on the MCMC sample. First, the Monte Carlo approximation of (14) is given by
\[
\widehat{\beta}_{L^2}(t) \approx \frac{1}{N} \sum_{s=1}^{N} \beta_{\theta(s)}(t).
\]
And the more interesting Bayes estimate β^ε_{K₀}(·) can be computed by minimizing the L²-distance to this approximation over the set E^ε_{K₀}. To this end we run a simulated annealing algorithm (Kirkpatrick et al., 1983), described in Appendix B.2.
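Given the MCMC sample evaluated on a time grid, both approximations reduce to column-wise frequencies and means. A sketch (the array layout, with one row per draw, is our own convention):

```python
import numpy as np

def posterior_summaries(B):
    """B[s, i] = beta_{theta(s)}(t_i), the s-th posterior draw on a time grid.

    Returns the frequency approximation of alpha(t_i|D) and the Monte Carlo
    average approximating the smooth estimate beta_L2(t_i).
    """
    alpha_hat = (B != 0).mean(axis=0)  # fraction of draws whose support covers t_i
    beta_hat = B.mean(axis=0)          # pointwise posterior mean
    return alpha_hat, beta_hat

B = np.array([[0.0, 1.0],
              [0.0, 2.0],
              [3.0, 0.0]])
alpha_hat, beta_hat = posterior_summaries(B)
# alpha_hat -> [1/3, 2/3], beta_hat -> [1., 1.]
```

The frequency of non-null draws works here because a step function of E_K is exactly zero outside its support.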
We also provide a striking graphical display of the posterior distribution on the set E_K with a heat map. More precisely, the aim is to sketch all marginal posterior distributions π^t_K(·|D) of β_θ(t), for any value of t ∈ T, in one single figure. To this end we introduce the probability measure Q on T × R defined as follows. Its marginal distribution over T is uniform, and given the value t of the first coordinate, the second coordinate is distributed according to the posterior distribution of β(t). In other words,
\[
Q(dt, db) = \mathrm{Unif}_T(dt)\; \pi^t_K(db \mid D).
\]
We can easily derive an empirical approximation of Q from the MCMC sample {θ(s)} of the posterior. Indeed, the first marginal distribution of Q, namely Unif(T), can be approximated by a regular grid t_i, i = 1, ..., M. And, for each value of i, set b_{is} = β_{θ(s)}(t_i), s = 1, ..., N. The resulting empirical measure is
\[
\widehat{Q} = \frac{1}{MN} \sum_{i=1}^{M} \sum_{s=1}^{N} \delta_{(t_i,\, b_{is})},
\]
where δ_{(t,b)} is the Dirac measure at (t, b). The graphical display we propose represents Q with a heat map on T × R: each small area of T × R is colored according to its Q-probability. This should be done cautiously since the marginal posterior distribution π^t_K(·|D) has a point mass at zero: π^t_K(b = 0|D) > 0 by construction of the prior distribution. Finally, the color scale can be any monotone function of the probabilities, in particular a nonlinear function to handle the atom at 0. Examples are provided in Section 3, in Figures 4 and 5.
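The empirical measure can be binned directly into heat-map cells. The sketch below pools the pairs (t_i, b_is) and normalizes; the binning choices (and the function name) are ours:

```python
import numpy as np

def heatmap_measure(B, grid, n_bins=50):
    """Bin the pooled pairs (t_i, beta_{theta(s)}(t_i)) into cell Q-probabilities.

    B: (N, M) array of posterior draws on the M-point grid.
    Returns an (M, n_bins) array summing to 1.
    """
    N, M = B.shape
    t = np.repeat(grid, N)   # each grid point t_i repeated for the N draws
    b = B.T.reshape(-1)      # matching sampled values b_is
    H, _, _ = np.histogram2d(t, b, bins=(M, n_bins))
    return H / H.sum()       # empirical Q-probability of each small cell

B = np.random.default_rng(0).normal(size=(200, 10))
Q = heatmap_measure(B, np.linspace(0, 1, 10))
```

Plotting `Q` (e.g. with `matplotlib.pyplot.imshow`) gives the heat map; as noted above, a nonlinear color scale helps with the atom at 0.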

Simulation study
In this section, the performance of Bliss is evaluated and compared to three competitors: FDA (Ramsay and Silverman, 2005), Fused lasso (Tibshirani et al., 2005) and Flirti (James et al., 2009), using simulated datasets.

Simulation scheme
First of all, we describe how we generate the different datasets on which we applied and compared the methods. The support of the covariate curves x_i is T = [0, 1], observed on a regular grid (t_j)_{j=1,...,p} of T , for p = 100. We simulate p-multivariate Gaussian vectors x_i, i = 1, ..., 100, corresponding to the values of the curves x_i at the observation times (t_j)_j. The covariance matrix Σ of these Gaussian vectors is a stationary autocorrelation matrix, with entries indexed by i and j from 1 to p, where a coefficient ζ tunes the autocorrelation of the x_i(t). Three different shapes are considered for the functional coefficient β, given in Figure 2.
The first one is a step function, the second one is smooth and is null on small intervals of T (Smooth), the third one is nonnull only on small intervals of T (Spiky).
The outcomes y_i are calculated according to (3), with an additional noise following a centred Gaussian distribution with variance σ². The value of σ² is fixed such that the signal-to-noise ratio equals a chosen value r. Datasets are simulated for µ = 1 and for the following values of ζ and r:
• ζ = 1, 1/3, 1/5;
• r = 1, 3, 5.
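The exact covariance formula is not reproduced in this extract, so the sketch below uses a purely illustrative stand-in, Σ_ij = exp(−ζ|i − j|) on the grid indices, chosen only so that a larger ζ yields a weaker autocorrelation, consistent with the text; all names are ours:

```python
import numpy as np

# Hypothetical covariance: the paper's exact formula for Sigma is not
# reproduced in this extract; exp(-zeta*|i-j|) merely mimics
# "larger zeta, weaker autocorrelation" for illustration.
rng = np.random.default_rng(42)
p, n, zeta = 100, 100, 1.0
idx = np.arange(p)
Sigma = np.exp(-zeta * np.abs(idx[:, None] - idx[None, :]))
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)  # n curves on p grid points
```

The three values ζ = 1, 1/3, 1/5 then span three levels of smoothness of the simulated curves.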
Hence, we simulate 27 datasets (3 coefficient shapes × 3 values of ζ × 3 values of r) with different characteristics, which we use in Section 3.3 to compare the methods.

Performances regarding support estimates
We begin by assessing the performance of our proposal in terms of support recovery. We focus here on the datasets simulated with the step function as the true coefficient function: it is the only one of the three functions we have chosen for which the real definition of the support matches the answer a statistician would expect, see Figure 2. The numerical results are given in Table 1, where we evaluate the error with the Lebesgue measure of the symmetric difference between the true support S_0 and the estimated one Ŝ, that is to say 2L_{1/2}(Ŝ, S_0) with the notation of Section 2.3.
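On a regular grid, this error is just the grid measure of the points where the two support indicators disagree. A sketch (helper name and example masks are ours):

```python
import numpy as np

def support_error(mask_est, mask_true, grid):
    """Lebesgue measure of the symmetric difference, approximated on the grid."""
    dt = grid[1] - grid[0]  # regular grid spacing
    return np.count_nonzero(mask_est != mask_true) * dt

grid = np.linspace(0.0, 1.0, 101)
true_support = np.zeros(101, dtype=bool)
true_support[20:41] = True   # roughly [0.2, 0.4]
est_support = np.zeros(101, dtype=bool)
est_support[20:51] = True    # roughly [0.2, 0.5]
err = support_error(est_support, true_support, grid)
# -> 0.1, the measure of the extra piece (0.4, 0.5]
```

The masks can come directly from thresholding α(t|D) at γ = 1/2, so the same code scores the Bayes support estimate of Theorem 1.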
As claimed at the end of Section 2.4, the Bayes estimate defined in Theorem 1 performs much better than relying on the support of a stepwise estimate of the coefficient function. As also expected, the accuracy of the Bayes support estimate worsens when the autocorrelation within the functional covariate x_i(t) increases. The signal-to-noise ratio is the second most influential factor explaining the accuracy of the estimate.
The third interval of the true support, namely [0.8, 0.95], is the most difficult to recover because the true value of the coefficient function over this interval is relatively low (−1) compared to the other values (4 and 3) of the coefficient function. Figure 3 gives two examples of the posterior probability function α(t|D) defined in Eq. (10), where we have highlighted (in red) the Bayes support estimate with γ = 1/2. Among these two examples, the figure shows that the third interval is recovered only when there is low autocorrelation in x_i(t) (i.e., Dataset 1). Figure 3 also exhibits that the support estimate of Dataset 1 (low autocorrelation within the covariate) is more trustworthy than that of Dataset 3 (high autocorrelation within the covariate).
For more complex coefficient functions, see Figure 2, we cannot compare directly the Bayes support estimate with the true support of the coefficient function that generated the data. Nevertheless, in the next section, we will compare the coefficient estimate with the true value of the coefficient function.

Performances regarding the coefficient function
In order to compare the methods for the estimation of the coefficient function, we use the L²-loss, namely ∫₀¹ (β̂(t) − β₀(t))² dt, where β̂(t) is an estimate we compare to the true coefficient function β₀(t). Table 2 shows the results of Bliss and its competitors on these simulated datasets. It appears that the numerical results of the methods have the same order of magnitude, although the methods may have different accuracies depending on the shape of the coefficient function that generated the dataset.
Regarding Fused lasso, we can see in Table 2 that its accuracy worsens when the problem is not sparse, that is to say when the true function is the "smooth" function (the red curve of Figure 2). Next, we observe that Flirti is very sensitive: its numerical results can sometimes be rather accurate, but sometimes the L²-error blows up (exceeding 100) because the method did not manage to tune its parameters. The L²-Bliss estimate defined in Proposition 2 frequently outperforms the other methods. This first conclusion is not surprising because the L²-Bliss estimate is defined to optimize the L²-loss integrated over the posterior distribution.
Even in situations where the true function is stepwise, the stepwise Bliss estimate of Proposition 3 is less accurate than the L²-Bliss estimate, except in two cases (Datasets 6 and 9). Nevertheless, we argue that the stepwise Bliss estimate was built to provide a trade-off between accuracy of the support estimate and accuracy of the coefficient function estimate. Thus the stepwise estimate is a balance between support estimation and coefficient function estimation that can help the statistician get an interpretation of the underlying phenomena that generated the data. In other words, the stepwise Bliss estimate is best neither at estimating the support nor at approximating the coefficient function, but provides a trade-off.
To show more detailed results we have presented the estimate of the coefficient function in two cases.
• Figure 4 displays the numerical results on Dataset 4 (medium level of signal, low level of autocorrelation within the covariates). As can be expected when the true coefficient function is stepwise, the stepwise Bliss estimate behaves nicely. The representation of the marginals of the posterior distribution with a heat map shows the confidence we can have in the Bayes estimate of the coefficient function. The smooth estimate nicely follows the regions of high posterior density. Here, the stepwise estimate clearly highlights two time periods (the first two intervals of the true support) and the sign of the coefficient function on these intervals. We can compare our proposal with its competitors: Flirti did not manage to tune its own parameters, and the Flirti estimate is completely irrelevant, while Fused lasso on a discretized version of the functional covariate provides a relatively good estimate of the coefficient function.
And FDA is not that bad, although the estimate is clearly too smooth to match the true coefficient function.
• Figure 5 displays the numerical results on Dataset 25 (low level of signal and low level of autocorrelation within the covariates). In this example, the true coefficient function is not stepwise but smooth, and is around zero on large time periods. The L²-Bliss estimate of Proposition 2 matches the true coefficient function approximately. The stepwise Bliss estimate is a little poorer (maybe because of the difficult calibration of the simulated annealing algorithm). When comparing these results with the other estimates on this dataset, we see that Flirti and Fused lasso also performed decently, even if they both highlight a third time period (around t = 0.85) where they infer a negative coefficient function instead of 0. Flirti is at its best here and has obviously managed to tune its own parameters in a relevant way. The confidence bands of Flirti are then reliable, but we stress that they are relatively wide around periods where the Flirti estimate is null, and do not reflect high confidence in any support estimate based on Flirti. Finally, the comments on FDA are the same as for Dataset 4: the FDA estimate is clearly too smooth to match the true coefficient function.

Tuning the hyperparameters
We can now discuss our recommendations on the hyperparameters of the model, given at the end of Section 2.2. For this study, we applied our methodology on Dataset 1 and varied the hyperparameters v_0, v and a around the recommended values. We recall that Dataset 1 is a synthetic dataset simulated with a coefficient function that is a step function (the black curve of Figure 2), with a high level of signal over noise (r = 5) and with a low level of autocorrelation within the covariates (ζ = 1). The following values are considered for each hyperparameter: • for a: 0.5/K, 0.2/K, 0.1/K, 0.07/K and 0.05/K.
The numerical results are given in Table 3. The default values we recommend are not the best values here, but we have done numerous other trials on many synthetic datasets and these choices are relatively robust. We do not highlight any particular value for K since this value can (and should) be chosen with the Bayesian model choice machinery.

Application to the black Périgord truffle dataset
We apply the Bliss method on a dataset to predict the amount of production of black truffles given the rainfall curves. The black Périgord truffle (Tuber melanosporum Vitt.) is one of the most famous and valuable edible mushrooms, because of its excellent aromatic and gustatory qualities. It is the fruiting body of a hypogeous Ascomycete fungus, which grows in ectomycorrhizal symbiosis with oak species or hazelnut trees in Mediterranean conditions. Modern truffle cultivation involves the plantation of orchards with tree seedlings inoculated with Tuber melanosporum. The planted orchards can then be viewed as ecosystems that should be managed in order to favour the formation and the growth of truffles. The formation begins in late winter with the germination of haploid spores released by mature ascocarps. Tree roots are then colonised by haploid mycelium to form ectomycorrhizal symbiotic associations. Induction of the fructification (sexual reproduction) occurs in May or June (the smallest truffles have been observed in mid-June). Then the young truffles grow during the summer months and are mature between the middle of November and the middle of March (harvest season). The production of truffles should then be sensitive to climatic conditions throughout the entire year (Le Tacon et al., 2014). However, to our knowledge, few studies focus on the influence of rainfall or irrigation during the entire year (Demerson and Demerson, 2014; Le Tacon et al., 2014). Our aim is then to investigate the influence of rainfall throughout the entire year on the production of black truffles. Knowing this influence could lead to a better management of the orchards, a better understanding of the sexual reproduction, and a better understanding of the effects of climate change. Indeed, concerning sexual reproduction, Le Tacon et al.
(2014, 2016) made the assumption that climatic conditions could be critical for the initiation of sexual reproduction throughout the development of the mitospores, expected to occur in late winter or spring. And concerning climate change, its consequences on the geographic distribution of truffles are of interest (see Splivallo et al., 2012, or Büntgen et al., 2011). The analyzed data were provided by J. Demerson. They consist of the rainfall records on an orchard near Uzès (France) between 1985 and 1999, and of the production of black truffles on this orchard between 1985 and 1999. In practice, to explain the production of year n, we take into account the rainfall between the 1st of January of year n − 1 and the 31st of March of year n. Indeed, we want to take into account the whole life cycle, from the formation of new ectomycorrhizas following ascospore germination during the winter preceding the harvest (year n − 1) to the harvest of year n. The cumulative rainfall is measured every 10 days; hence, between the 1st of January of year n − 1 and the 31st of March of year n, we have the rainfall amounts associated with 45 ten-day periods, see Figure 6. This dataset can be considered reliable, as the rainfall records were made exactly on the orchard, and the orchard was not irrigated.
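As a sanity check on the construction of the functional covariate, the count of 45 ten-day periods can be recovered from the calendar. The daily records below are placeholders, and the aggregation into ten-day cumulative rainfall is our own sketch of the described preprocessing, not the authors' code.

```python
from datetime import date

# Count the days from 1 January of year n-1 to 31 March of year n and
# aggregate placeholder daily rainfall into ten-day cumulative totals.
n = 1999  # any harvest year with a non-leap year n-1 gives the same count
start, end = date(n - 1, 1, 1), date(n, 3, 31)
n_days = (end - start).days + 1      # 365 + 90 = 455 days
n_periods = n_days // 10             # -> 45 ten-day periods
daily_rain = [1.0] * n_days          # placeholder daily records (mm)
covariate = [sum(daily_rain[10 * k:10 * (k + 1)]) for k in range(n_periods)]
print(n_days, n_periods, len(covariate))  # 455 45 45
```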
Biological assumptions at stake From the literature we can spotlight the following periods of time which might influence the growth of truffles.
Period #1: Late spring and summer of year n − 1. This is the (only) period on which all experts unanimously agree that it has a particular effect. Büntgen et al. (2012), Demerson and Demerson (2014) and Le Tacon et al. (2014) all confirm the importance of the negative effect of the summer hydric deficit on truffle production: they found it to be the most important factor influencing the production. Indeed, in summer the truffles need water to survive the high temperatures and to grow. Otherwise they can dry out and die.
Period #2: Late winter of year n − 1, as shown by Demerson and Demerson (2014) and Le Tacon et al. (2014). Indeed, as explained in Le Tacon et al. (2014), consistent water availability in late winter could support the formation of new mycorrhizae, thus allowing a new cycle. Moreover, from results obtained by Healy et al. (2013), they made the assumption that rainfall is critical for the initiation of sexual reproduction throughout the development of mitospores, which is expected to occur in late winter or spring of year n − 1. This remains an assumption, as the occurrence and initiation of sexual reproduction are largely unknown, see Murat et al. (2013) or Le Tacon et al. (2016).
Period #3: November and December of year n − 1, as claimed by Demerson and Demerson (2014) and Le Tacon et al. (2014). Le Tacon et al. explained that rainfall in autumn allows the growth of young truffles which have survived the summer.
Period #4: September of year n − 1, as claimed by Demerson and Demerson (2014). Excess water in this period should be harmful to truffles. The assumption made was that in September the soil temperature is still high, so the micro-organisms responsible for rot are quite active, while a wet truffle has its respiratory system disturbed and cannot defend itself against these micro-organisms.
The challenge is to confirm some of these periods with Bliss, despite the small size of the dataset. In particular, each rainfall curve is discretized with only 45 points (cumulative rainfall every 10 days) and we have at our disposal only 13 observations.

Running Bliss As explained above (in Section 3.2), part of the difficulty of the inference problem comes from autocorrelation within the covariates. Figure 6 shows that the autocorrelation can be considered null for lags of 3 or more ten-day periods. In other words, the rainfall background presents autocorrelation within a period of time of about a month (keeping in mind that the whole history we consider lasts 15 months).
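The lag-based diagnostic above can be reproduced in a few lines. The AR(1) series below is simulated stand-in data (the real input would be the series of ten-day rainfall totals), and `acf` is our own helper, not the paper's code.

```python
import random

# Sample autocorrelation of a series at a given lag; used to find the lag
# beyond which the autocorrelation of the covariate becomes negligible.
def acf(series, lag):
    n = len(series)
    mean = sum(series) / n
    var = sum((s - mean) ** 2 for s in series)
    cov = sum((series[t] - mean) * (series[t + lag] - mean)
              for t in range(n - lag))
    return cov / var

# Stand-in data: an AR(1) series mimicking short-range dependence.
random.seed(0)
x = [0.0]
for _ in range(4999):
    x.append(0.5 * x[-1] + random.gauss(0.0, 1.0))

for lag in (1, 2, 3, 4):
    print(lag, round(acf(x, lag), 2))  # decays roughly like 0.5 ** lag
```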
The first and maybe most important hyperparameter is K, the number of intervals in the coefficient functions from the prior. Because of the discretization of the rainfall and the number of observations, the value of K should stay small to remain parsimonious. Because of the size of the dataset, we have set the hyperparameter a so as to obtain a prior probability of being in the support of about 0.5. The results are given in Figure 7. As can be seen on the left of this figure, the error variance σ² decreases when K increases, because models of higher dimension can more easily fit the data. The main question is: when do they overfit the data? Looking at the right panel of Figure 7, we can consider how the posterior probability α(t|D) depends on the value of K and choose a reasonable value. First, for K = 1 or 2, the posterior probability is high during a long first period of time, until August of year n − 1, and falls to much lower values after that. Thus, these small values of K provide a rough picture of the dependency. Secondly, for K = 4, 5 or 6, the posterior probability α(t|D) varies between 0.2 and 0.7 and shows doubtful variations after November of year n − 1, as well as strong variations during the summer of year n − 1 that are also doubtful. Hence we decided to rely on K = 3, although this choice is rather subjective.
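To fix ideas, the posterior probability α(t|D) can be estimated from MCMC output as the fraction of posterior draws whose support covers t. The draws below are fabricated for illustration (in practice they come from the sampler), and `alpha` is our own helper.

```python
# Estimate alpha(t|D) as the fraction of posterior draws (each a union of
# intervals forming the support of a step function) that cover time t.
def alpha(t, draws):
    covered = sum(any(a <= t <= b for a, b in d) for d in draws)
    return covered / len(draws)

# Fabricated posterior draws of supports, for illustration only.
draws = [
    [(0.10, 0.40), (0.60, 0.70)],
    [(0.15, 0.45)],
    [(0.05, 0.50), (0.65, 0.75)],
]
print(alpha(0.30, draws))   # 1.0: every draw covers t = 0.30
print(alpha(0.68, draws))   # 2/3 of the draws cover t = 0.68
```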
Conclusions on the truffle dataset We begin by noting that, according to the posterior distribution of σ² in the left panel of Figure 7, about half of the variance of the output (the amount of truffle production) is explained by the rainfall. The support estimate S 0.5 (D) with K = 3 is composed of two disjoint intervals: a first one, from May of year n − 1 to the second ten-day period of August, with the highest posterior probability, and a second one, from the third ten-day period of February of year n − 1 to the end of March of year n − 1, with a smaller posterior probability. Thus, as far as we can tell from this analysis, Periods #1 and #2 are validated by the data. Period #3 cannot be validated, although the posterior probability α(t|D) presents small bumps around these periods of time for higher values of K. For K = 3, the value of α(t|D) stays around 0.3 on Period #3. Finally, regarding Period #4, we can see a small bump on the curve α(t|D) around this period of time even for K = 3, but the highest value of the posterior probability on this period is about 0.4. Hence we chose to remain undecided on Period #4.
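The support estimate used above, S_γ(D) = {t : α(t|D) ≥ γ} with γ = 0.5, can be read off a grid of posterior probabilities. The function below is our own sketch of this thresholding step, not the paper's code, and the grid values are made up.

```python
# Threshold the posterior probabilities alpha(t|D) on a time grid and report
# the support estimate S_gamma(D) as a union of disjoint intervals.
def support_estimate(grid, alphas, gamma=0.5):
    intervals, start, prev = [], None, None
    for t, a in zip(grid, alphas):
        if a >= gamma and start is None:
            start = t                       # open a new interval
        elif a < gamma and start is not None:
            intervals.append((start, prev))  # close the current interval
            start = None
        prev = t
    if start is not None:
        intervals.append((start, grid[-1]))
    return intervals

grid = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
alphas = [0.2, 0.6, 0.8, 0.3, 0.1, 0.7, 0.9, 0.2]
print(support_estimate(grid, alphas, 0.5))  # [(0.1, 0.2), (0.5, 0.6)]
```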

Conclusion
In this paper, we have provided a full Bayesian methodology to analyse linear models with time-dependent functional covariates. The main purpose of our study was to estimate the support of the coefficient function, in order to find the periods of time which influence the outcome the most. We rely on piecewise constant coefficient functions to set the prior, which has four benefits. The first benefit is the parsimony of the Bliss model, which devotes two thirds of the parameter dimension to the estimation of the support. The second benefit of our Bayesian setting, which begins by defining the support, is that we can rely on the ridge-Zellner prior to handle the autocorrelation within the functional covariates. This sets Bliss apart from Bayesian methods relying on spike-and-slab priors to handle sparsity. The third benefit is avoiding cross-validation to tune the internal parameters of the method. Indeed, cross-validation methods optimize performance with respect to the model's predictive power, not the accuracy of the support estimate. And, last but not least, the fourth benefit is the ability to compute numerically the posterior probability that a given date is in the support, α(t|D), whose value gives a clear hint on the reliability of the support estimate. Nevertheless, a serious limitation of our Bayesian model is that it can only handle covariate functions of one variable (called time in this paper). Indeed, the support of a function of more than one variable has a much more complex shape than a union of intervals and cannot easily be modelled in a nonparametric yet parsimonious manner.
We have provided numerical results regarding the power of Bliss on a range of synthetic datasets, as well as on a dataset concerning the black Périgord truffle. By presenting some of these examples in detail, we have shown how to interpret the results of Bliss, in particular how to rely on the posterior probabilities α(t|D) or on the heatmap of the posterior distribution of the coefficient function to assess the reliability of our estimates. Bliss provides two main outputs: first, an estimate of the support of the coefficient function without targeting the coefficient function itself, and second, a trade-off between support estimation and coefficient function estimation through the stepwise estimate of Proposition 3. Moreover, our prior can straightforwardly be embedded in a linear model with other functional or scalar covariates.

A.3 Proof of Proposition 3
First, the quantity ‖d(·) − β̂_{L2}(·)‖ is nonnegative, hence the set {‖d(·) − β̂_{L2}(·)‖ : d(·) ∈ E^ε_{K0}} admits an infimum. Let m denote this infimum. We have to prove that m is actually a minimum of the above set, namely that there exists a function d(·) ∈ E^ε_{K0} such that m = ‖d(·) − β̂_{L2}(·)‖.
To this end, we introduce a minimizing sequence {d_n(·)} and show that one of its subsequences admits a limit within E^ε_{K0}. Let d_n(·) ∈ E^ε_{K0} be such that

‖d_n(·) − β̂_{L2}(·)‖ ≤ m + 1/n.

The step function d_n(·) can be written as

d_n(t) = Σ_{k=1}^{L} α_{k,n} 1{t ∈ (a_{k,n}, b_{k,n})},

where the (a_{k,n}, b_{k,n}), k = 1, . . . , L, are non-overlapping intervals. Note that their number L does not depend on n, because all d_n(·) lie in E_{K0} for some fixed value of K0, and we can always choose L = 2K0 − 1. Moreover, because d_n(·) is in F_ε, we can assume that

b_{k,n} − a_{k,n} ≥ ε, for all k, n. (19)
Now the sequence {a_{1,n}}_n has its elements in the compact interval T, hence we can extract a subsequence (still denoted {a_{1,n}}_n) which converges to an element a_{1,∞} of T. Likewise, by extracting subsequences 2L times, we can assume that all the sequences {a_{1,n}}_n, . . . , {a_{L,n}}_n, {b_{1,n}}_n, . . . , {b_{L,n}}_n are convergent, and that a_{k,∞} = lim_{n→∞} a_{k,n}, b_{k,∞} = lim_{n→∞} b_{k,n}, and b_{k,∞} − a_{k,∞} ≥ ε, k = 1, . . . , L, where the last inequalities come from (19).
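The role of the lower bound ε on the interval lengths is to keep the heights bounded, which is precisely what fails in E_K without the ε-constraint. This step can be sketched as follows (our reconstruction, with the notation above):

```latex
% Sketch (our reconstruction): the heights are bounded because the interval
% lengths are bounded below by epsilon.
\[
  |\alpha_{k,n}|\,\sqrt{\varepsilon}
  \;\le\; |\alpha_{k,n}|\,\sqrt{b_{k,n}-a_{k,n}}
  \;\le\; \|d_n(\cdot)\|
  \;\le\; \|\hat\beta_{L^2}(\cdot)\| + m + 1 .
\]
% Up to a further subsequence, the heights converge as well, and d_n converges
% in L^2 to a limit lying in the constrained set:
\[
  d_\infty(t) \;=\; \sum_{k=1}^{L} \alpha_{k,\infty}\,
  \mathbf{1}\{t \in (a_{k,\infty}, b_{k,\infty})\} \;\in\; E^{\varepsilon}_{K_0},
  \qquad
  \|d_\infty(\cdot) - \hat\beta_{L^2}(\cdot)\| \;=\; m .
\]
```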
A.4 Topological properties of E_K

Proposition 5. Let K ≥ 1.
(i) The convex hull of E_K is E.
(ii) Under the L2(T)-topology, the closure of E is L2(T).
Proof. The result of (ii) is rather classical, see, e.g., Rudin (1986). For (i), the convex hull of E_K includes any step function: indeed, any step function can be written as a convex combination of simple functions a 1{t ∈ I}, which all belong to E_K. Moreover, E is convex because it is a linear space, and it contains E_K, so the convex hull of E_K is included in E. Hence claim (i) is proven.
Knowing that β̂_{L2}(·) and d_n(·) belong to L2 for all n, we have d_n(·) ∈ E_K ∩ B_{L2}(R + m + 1) for all n, where B_{L2}(r) is the L2-ball of radius r around the origin. Note that E_K ∩ B_{L2}(R + m + 1) is not a compact set: consider for example d_n(t) = √n 1{t ∈ [0, 1/n]}. Hence it is not possible, in general, to extract a subsequence of {d_n(·)} which converges to some d_∞(·) ∈ E_K such that ‖d_∞(·) − β̂_{L2}(·)‖ = m.
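A quick numerical check (ours) of this counterexample: each d_n has L2-norm 1, while d_n(t) vanishes at any fixed t > 0, so no subsequence can converge in L2 within E_K.

```python
# d_n(t) = sqrt(n) * 1{t in [0, 1/n]}: the L2-norm stays equal to 1 while
# the pointwise values at any fixed t > 0 eventually vanish.
def l2_norm_sq(n):
    height = n ** 0.5
    return height ** 2 * (1.0 / n)  # integral of d_n^2 over [0, 1/n]

def d(n, t):
    return n ** 0.5 if 0.0 <= t <= 1.0 / n else 0.0

print([round(l2_norm_sq(n), 10) for n in (1, 10, 100, 10**6)])  # all 1.0
print([d(n, 0.25) for n in (1, 10, 100)])  # [1.0, 0.0, 0.0]
```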