Functional regression via variational Bayes

We introduce variational Bayes methods for fast approximate inference in functional regression analysis. Both the standard cross-sectional and the increasingly common longitudinal settings are treated. The methodology allows Bayesian functional regression analyses to be conducted without the computational overhead of Monte Carlo methods. Confidence intervals for the model parameters are obtained both from the approximate variational approach and from nonparametric resampling of clusters; the latter approach is possible because our variational Bayes functional regression approach is computationally efficient. A simulation study indicates that variational Bayes is highly accurate in estimating the parameters of interest and in approximating the Markov chain Monte Carlo-sampled joint posterior distribution of the model parameters. The methods apply generally, but are motivated by a longitudinal neuroimaging study of multiple sclerosis patients. Code used in the simulations is made available as a web-supplement.

The work of Goldsmith and Crainiceanu was supported by Award Number R01NS060910 from the National Institute of Neurological Disorders and Stroke. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute of Neurological Disorders and Stroke or the National Institutes of Health. The first author would also like to acknowledge partial support from Training Grant 2T32ES012871 from the US National Institutes of Health, National Institute of Environmental Health Sciences. The work of Wand was partially supported by Australian Research Council Discovery Project DP110100061.


Introduction
Due to ever-expanding methods for the acquisition and storage of information, functional data are often encountered in scientific applications. A common problem in the field of functional data analysis is determining the relationship between a scalar outcome Y and a densely observed functional predictor X(t) [18,22,9]. Increasingly, this problem is longitudinal, in that both the functional predictors and scalar outcomes are observed at several visits for each subject. Bayesian approaches to cross-sectional and longitudinal functional regression possess a number of advantages, including the ability to jointly model the observed functions and scalar outcomes and easily constructed credible intervals [5,7]. However, these approaches require computationally expensive Markov chain Monte Carlo (MCMC) simulations of joint posterior distributions. The goal of this paper is to introduce a fast and scalable alternative that can accommodate new types of data sets.
Variational approximations, now regularly used in computer science, are a collection of techniques for deriving approximate solutions to inference problems [10,11,25]. They have a growing visibility in the statistics literature [16,19,24]. In the Bayesian context, these methods are useful in approximating intractable posterior density functions. While this approximation sacrifices some of MCMC's accuracy, it provides large gains in computational feasibility, especially in large-data settings.
In this article, we derive an iterative algorithm for approximate Bayesian inference in functional regression. Using this algorithm, inference on model parameters can be obtained several orders of magnitude faster than with MCMC sampling methods. Importantly, the construction of credible intervals for the functional coefficient is straightforward. Moreover, this procedure retains the ability to jointly model the predictor process and the scalar outcome. The computational advantage conveyed by the variational methods also allows resampling techniques, such as the nonparametric bootstrap of subjects, to be used. Unlike MCMC sampling, the variational approach cannot be made arbitrarily accurate. However, simulations indicate that the quality of the approximation is high in our setting. Our variational method is not designed to replace MCMC sampling, but it is a useful additional inferential tool in that it provides near-instant and highly accurate approximate posterior distributions. This will become increasingly relevant as functional datasets become larger and more complex.
In particular, we develop variational Bayes methods for two functional regression models: the classic cross-sectional case, in which a single scalar outcome and functional predictor are observed for each subject; and the more recent longitudinal case, in which scalar outcomes and functional predictors are observed repeatedly for each subject. This methodology is based on a penalized approach to functional regression that is flexible and widely applicable [6]. Although variational techniques typically incur initial algebraic and implementation costs, the present article alleviates these costs by providing explicit derivations and algorithms.
We apply the methods developed to a longitudinal neuroimaging study, in which multiple sclerosis patients undergo both tests of cognitive ability and a diffusion tensor imaging scan at each of several visits. From the diffusion tensor imaging scans, we construct functional predictors that provide detailed quantitative information about major white matter fiber bundles (see Figure 1). Because multiple sclerosis results in the degradation of cerebral white matter, researchers hope to use the functional predictors and cognitive disability measures to understand the progression of the disease. This study was previously analyzed in [7].
In Section 2 we introduce variational Bayes and a penalized approach to functional regression. Section 3 combines these ideas and develops a scalable iterative algorithm for approximate Bayesian inference in functional regression. The results of a simulation study are described in Section 4, and a real-data analysis is performed in Section 5. We conclude the main text with a discussion in Section 6. Appendices A and B contain algebraic derivations and expressions used in the construction of the iterative algorithm. All code used in the simulation study is available as a web-supplement to this article.

Background
In the following subsections we introduce variational approximations for Bayesian inference and an approach to functional regression that uses penalized B-splines to estimate the coefficient function.

Variational Bayes
Here we give an overview of variational Bayes; for a more complete treatment see [19] and [3], Chapter 10.
Bayesian inference is based on the posterior density

p(θ|y) = p(y, θ)/p(y),

where θ ∈ Θ is the parameter vector, y is the observed data, p(y) is the marginal likelihood of the observed data, and p(y, θ) is the joint likelihood of the data and model parameters. The goal of the density transform approach is to approximate the posterior density p(θ|y) by a function q(θ) for which the q-specific lower bound on the marginal likelihood (defined below) is more tractable than the marginal likelihood itself. The first step is to restrict q to a more manageable class of densities and choose the element of that class with minimum Kullback-Leibler distance from p(θ|y). More concretely, let q be an arbitrary density function over Θ. Then

log p(y) ≥ ∫ q(θ) log{p(y, θ)/q(θ)} dθ, (2.1)

with equality if and only if q(θ) = p(θ|y) almost everywhere [12]. It follows that p(y) ≥ exp[∫ q(θ) log{p(y, θ)/q(θ)} dθ]; we define the q-specific lower bound on the marginal likelihood as

p(y; q) ≡ exp[∫ q(θ) log{p(y, θ)/q(θ)} dθ]. (2.2)

It can be shown that minimizing the Kullback-Leibler distance between q(θ) and p(θ|y) is equivalent to maximizing p(y; q). Stated generally, the following result holds.
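For intuition, the lower bound (2.1) can be checked numerically in a conjugate toy model where every quantity is available in closed form. The following sketch is ours, for illustration only: it takes y | θ ~ N(θ, 1) with prior θ ~ N(0, 1), so that marginally y ~ N(0, 2) and the posterior is N(y/2, 1/2); the bound holds for any q and is tight exactly when q equals that posterior.

```python
import numpy as np

def elbo(y, m, s2):
    # E_q log p(y, theta) + entropy of q, for q = N(m, s2),
    # under the model y | theta ~ N(theta, 1), theta ~ N(0, 1)
    e_loglik = -0.5*np.log(2*np.pi) - 0.5*((y - m)**2 + s2)
    e_logprior = -0.5*np.log(2*np.pi) - 0.5*(m**2 + s2)
    entropy = 0.5*np.log(2*np.pi*np.e*s2)
    return e_loglik + e_logprior + entropy

y = 1.3
log_evidence = -0.5*np.log(2*np.pi*2.0) - y**2/4.0   # marginally y ~ N(0, 2)

# the bound holds for any q, with equality only at the posterior N(y/2, 1/2)
for m, s2 in [(0.0, 1.0), (1.0, 0.3), (y/2, 0.5)]:
    assert elbo(y, m, s2) <= log_evidence + 1e-9
assert abs(elbo(y, y/2, 0.5) - log_evidence) < 1e-9
```

Any other choice of (m, s2) leaves a strictly positive gap, which is the Kullback-Leibler distance from q to the posterior.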
Result 2.1. Let u and v be continuous random vectors with joint density p(u, v). Then sup_q ∫ q(u) log{p(u, v)/q(u)} du is achieved by q*(u) = p(u|v).
Next, we restrict q to a class of functions for which p(y; q) is more tractable than p(y). While several restrictions are possible, here we focus on the product density transform: we assume that for some partition {θ_1, ..., θ_L} of θ it is possible to write q(θ) = ∏_{l=1}^{L} q_l(θ_l). In the functional regression setting, the posterior dependence of some subsets of the model parameters is weak, and the assumption that q factorizes provides accurate approximate inference. In other settings, where the posterior dependence of parameters is stronger, this assumption may lead to poor approximations and inference because it fails to account for correlations between model parameters. There are three simple strategies for gaining insight into which sets of parameters are a posteriori weakly correlated: 1) theoretical work on asymptotic posterior correlations; 2) Bayesian inference on smaller or simpler data sets; and 3) prior experience. If posterior correlation is potentially problematic, a more flexible component density q_l that allows for this correlation could be used; however, this must be balanced against the simplification desired in the approximating class of functions. While none of the approaches above is infallible, when combined with powerful variational approximations they can provide a valuable alternative for Bayesian inference. The methods provided in this paper are intended as a reasonable and tractable complement to, not a replacement for, MCMC-based Bayesian computation.
Combining the assumption that q factorizes over a partition of θ with Result 2.1, we can derive explicit solutions for each factor q_l(θ_l), 1 ≤ l ≤ L, in terms of the remaining factors. Solving for each factor in terms of the others leads to an iterative algorithm for obtaining a solution for q. The explicit solution for each q_l(θ_l) is derived as follows. Assuming that q is subject to the factorization restriction, it follows that

log p(y; q) = ∫ ∏_{l=1}^{L} q_l(θ_l) log[ p(y, θ) / ∏_{l=1}^{L} q_l(θ_l) ] dθ.

Define the function p̃(y, θ_1) ∝ exp{E_{−θ_1} log p(y, θ)}, so that

log p(y; q) = ∫ q_1(θ_1) log{ p̃(y, θ_1)/q_1(θ_1) } dθ_1 + terms not involving q_1.
Then, using Result 2.1, the optimal q_1 is q_1*(θ_1) ∝ exp{E_{−θ_1} log p(y, θ)}, where E_{−θ_1} log p(y, θ) is the expectation with respect to q_2(θ_2) ... q_L(θ_L). The same argument for each l in 1, ..., L yields optimal densities satisfying

q_l*(θ_l) ∝ exp{E_{−θ_l} log p(θ_l | rest)}, (2.3)

where rest ≡ {y, θ_1, ..., θ_{l−1}, θ_{l+1}, ..., θ_L} is the collection of all remaining parameters and the observed data. We update each factor in turn until the change in p(y; q) is negligible.
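To make the iteration concrete, the following sketch (ours, purely illustrative) implements the scheme for the textbook conjugate model x_i ~ N(μ, τ⁻¹) with a normal prior on μ and a gamma prior on τ, rather than the functional regression model itself; the two factors q(μ) and q(τ) are updated in turn until the change in E_q(τ) is negligible. The hyperparameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.0, size=500)            # data from N(mu = 2, tau = 1)
n, xbar = x.size, x.mean()

mu0, lam0, a0, b0 = 0.0, 1e-2, 1e-2, 1e-2     # diffuse hyperparameters (illustrative)

e_tau = 1.0                                   # current value of E_q(tau)
for _ in range(100):
    # update q*(mu) = N(mu_n, 1/lam_n), holding q(tau) fixed
    mu_n = (lam0*mu0 + n*xbar) / (lam0 + n)
    lam_n = (lam0 + n) * e_tau
    # update q*(tau) = Gamma(a_n, b_n), holding q(mu) fixed
    a_n = a0 + (n + 1) / 2
    e_quad = np.sum((x - mu_n)**2) + n/lam_n + lam0*((mu_n - mu0)**2 + 1/lam_n)
    b_n = b0 + 0.5 * e_quad
    if abs(a_n/b_n - e_tau) < 1e-10:          # negligible change: converged
        e_tau = a_n/b_n
        break
    e_tau = a_n/b_n

# the approximate posterior means recover the data-generating values
assert abs(mu_n - xbar) < 1e-3
assert abs(e_tau - 1/x.var()) < 0.05
```

Each update is available in closed form because each full conditional is conjugate, which is exactly the structure exploited in the functional regression algorithm below.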

Penalized functional regression
Next we introduce penalized approaches to cross-sectional and longitudinal functional regression [5,6,7].
In the cross-sectional case, we observe data of the form [Y_i, X_i(t), z_i] for subjects 1 ≤ i ≤ I, where Y_i is a continuous outcome, X_i(t) is a functional covariate, and z_i is a 1 × p vector of non-functional covariates. The linear functional regression model is given by [4,21]

Y_i = z_i β + ∫ X_i(t) γ(t) dt + ε_i. (2.4)

We call the parameter γ(t) the coefficient function. In practice, the predictor functions X_i(t) are observed over a discrete grid, and often with error; that is, we observe {W_i(t_{ij}) : j = 1, ..., N_i}, where W_i(t_{ij}) is a noisy measurement of X_i(t_{ij}). The sampling scheme on which the functional predictors are observed may take a variety of forms: points may be equally or unequally spaced, sparse or dense at the subject level, identical or different across subjects. For simplicity, we will assume that all subjects are observed over the same grid {t_1, ..., t_N} and at an equal number of visits J. Extensions to different grids and different numbers of visits are straightforward, but come with a considerable increase in notational complexity.
To estimate the parameters in model (2.4), we use the following two-stage procedure. First, the predictor functions X_i(t) are expressed using a principal components (PC) decomposition. Second, the coefficient function γ(t) is estimated using penalized B-splines, with smoothness of γ(t) explicitly induced via a mixed effects model. Specifically, let Σ̂_X(s, t) be an estimator of the covariance operator Cov{X_i(s), X_i(t)} based on the available functional observations. Further, let Σ_{k=1}^{∞} λ_k ψ_k(s) ψ_k(t) be the spectral decomposition of Σ̂_X(s, t), where λ_1 ≥ λ_2 ≥ ... are the non-increasing eigenvalues and ψ(t) = {ψ_k(t) : k ∈ Z^+} are the corresponding orthonormal eigenfunctions. An approximation for X_i(t), based on a truncated Karhunen-Loève expansion, is given by

X_i(t) ≈ μ(t) + Σ_{k=1}^{K_x} c_{ik} ψ_k(t),

where K_x is the truncation lag, the PC loadings c_{ik} = ∫ {X_i(t) − μ(t)} ψ_k(t) dt are uncorrelated random variables with variance λ_k, and μ(t) is the mean function over all subjects and visits.
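This first stage can be sketched numerically, assuming curves observed on a common equally spaced grid. The example below (ours; the two orthonormal eigenfunctions are an illustrative choice, not the MS data) estimates the covariance operator on the grid, takes its spectral decomposition, and rescales by the grid spacing so that matrix eigenanalysis approximates the functional version.

```python
import numpy as np

rng = np.random.default_rng(1)
I, N = 100, 93                                # curves and grid length
t = np.linspace(0, 1, N)
h = t[1] - t[0]

# simulate curves from two known orthonormal eigenfunctions (illustrative)
psi = np.vstack([np.sqrt(2)*np.sin(2*np.pi*t), np.sqrt(2)*np.cos(2*np.pi*t)])
scores = rng.normal(size=(I, 2)) * np.sqrt([4.0, 1.0])   # variances lambda_k
X = scores @ psi                              # I x N matrix of observed curves

# estimate the covariance operator and take its spectral decomposition
Xc = X - X.mean(axis=0)
Sigma_hat = Xc.T @ Xc / I
evals, evecs = np.linalg.eigh(Sigma_hat)
evals, evecs = evals[::-1], evecs[:, ::-1]    # non-increasing eigenvalue order

# rescale so that grid sums approximate integrals over [0, 1]
lam_hat = evals[:2] * h                       # estimated eigenvalues
psi_hat = evecs[:, :2].T / np.sqrt(h)         # estimated eigenfunctions

assert lam_hat[0] > lam_hat[1] > 0
assert abs(np.sum(psi_hat[0] * psi[0]) * h) > 0.9   # aligns with psi_1 up to sign
```

The truncation lag K_x would then be chosen by, for example, the proportion of variance explained by the leading eigenvalues.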
Next, we use a large cubic B-spline basis to smoothly estimate the coefficient function γ(t) via a mixed effects model. Let φ(t) = {φ_1(t), ..., φ_{K_g}(t)} be a cubic B-spline basis of dimension K_g, and expand the coefficient function as γ(t) = Σ_{l=1}^{K_g} g_l φ_l(t). Then the integral in model (2.4) can be written as

∫ X_i(t) γ(t) dt ≈ c_i M g,

where c_i = (c_{i1}, ..., c_{iK_x}) is the row vector of subject i's PC loadings and M is a K_x × K_g matrix with (k, l)th entry ∫_0^1 ψ_k(t) φ_l(t) dt. Smoothness of γ(t) is enforced by assuming a modified first order random walk prior on the vector g [13]. That is, we assume g_l ~ N(g_{l−1}, σ_g^2) for 2 ≤ l ≤ K_g and let g_1 ~ N(0, 0.01 σ_g^2). These are standard assumptions in Bayesian P-spline modeling [23,13]. Taken together, we jointly model the scalar outcome Y_i and the functional exposure X_i(t) using the model

Y_i | β, c_i, g ~ N(z_i β + c_i M g, σ_ε^2), c_i ~ N(0, Λ), g | σ_g^2 ~ N(0, σ_g^2 D), (2.5)

where β are treated as fixed parameters with diffuse priors, D is the covariance matrix induced by the first order random walk prior, and Λ = diag[λ_1, ..., λ_{K_x}].
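The construction of the basis and of the matrix M can be sketched as follows, assuming a clamped cubic B-spline basis on [0, 1] and approximating the entries of M by a Riemann sum on the observation grid; the eigenfunctions here are illustrative stand-ins rather than estimates from data.

```python
import numpy as np

def bspline_basis(x, knots, degree=3):
    # clamped B-spline basis on [knots[0], knots[-1]] via Cox-de Boor recursion
    kn = np.asarray(knots, dtype=float)
    t = np.concatenate([[kn[0]]*degree, kn, [kn[-1]]*degree])
    B = np.zeros((len(x), len(t) - 1))
    for i in range(len(t) - 1):               # degree-0 basis: interval indicators
        B[:, i] = (t[i] <= x) & (x < t[i + 1])
        if t[i] < t[i + 1] == t[-1]:
            B[x == t[-1], i] = 1.0            # close the right endpoint
    for d in range(1, degree + 1):            # build up to the requested degree
        Bn = np.zeros((len(x), len(t) - d - 1))
        for i in range(len(t) - d - 1):
            a = (x - t[i])/(t[i+d] - t[i]) * B[:, i] if t[i+d] > t[i] else 0.0
            b = (t[i+d+1] - x)/(t[i+d+1] - t[i+1]) * B[:, i+1] if t[i+d+1] > t[i+1] else 0.0
            Bn[:, i] = a + b
        B = Bn
    return B                                  # len(x) x K_g design matrix

t = np.linspace(0, 1, 93)
h = t[1] - t[0]
Phi = bspline_basis(t, np.linspace(0, 1, 8))  # K_g = 8 + 3 - 1 = 10 basis functions
assert np.allclose(Phi.sum(axis=1), 1.0)      # B-splines form a partition of unity

# illustrative eigenfunctions; M[k, l] ~ integral psi_k(t) phi_l(t) dt (Riemann sum)
psi = np.vstack([np.sqrt(2)*np.sin(2*np.pi*t), np.sqrt(2)*np.cos(2*np.pi*t)])
M = (psi @ Phi) * h
assert M.shape == (2, 10)
```

With M in hand, the functional term c_i M g in (2.5) is an ordinary linear predictor, which is what makes the mixed-model machinery applicable.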
Inference for the functional regression model is based on the posterior density

p(β, g, C, σ_ε^2, σ_g^2 | Y, W), (2.6)

where C is the matrix of PC loadings constructed by row-stacking the c_i, Y = {Y_i}_{i=1}^{I}, and W is the matrix of observed predictor functions constructed by row-stacking the W_i(t). Because the functional predictors X_i(t) are observed with error, this model extends Bayesian inference for measurement error regression problems to the functional setting. A directed acyclic graph depicting model (2.5) is presented in Figure 2.

In the longitudinal case, we observe data of the form [Y_ij, X_ij(t), z_ij] for 1 ≤ i ≤ I and 1 ≤ j ≤ J_i. Thus we observe a distinct functional predictor and scalar outcome for each subject over several visits, and again note that in place of the true functional predictors X_ij(t) we often observe measured-with-error functions W_ij(t). The longitudinal functional regression model is given by [7]

Y_ij = z_ij β + b_i + ∫ X_ij(t) γ(t) dt + ε_ij; (2.7)

this differs from model (2.4) in the use of subject-specific random effects b_i to account for correlation in the repeated outcomes at the subject level. Moreover, longitudinal data sets tend to be much larger than cross-sectional data sets because of the number of visits.
Given the advent of multiple observational studies collecting dense functional data at multiple visits, the importance of longitudinal functional regression cannot be overstated. Unfortunately, with the exception of the work in [7], no other approach can currently deal with the combination of subject-specific random effects and functional predictors necessary to capture the structure of the data. While a wide array of functional regression methods exists, we contend that the specific modeling choices described here made the extension not only possible, but seamless. Estimation of the parameters in the longitudinal setting extends naturally from the procedure outlined for the cross-sectional setting. Again, we express the functional predictors using a PC basis and use a penalized B-spline expansion for the coefficient function. The joint model for the outcome Y_ij and exposure X_ij(t) becomes the longitudinal analogue of (2.5), with outcome model

Y_ij | β, b_i, c_ij, g ~ N(z_ij β + b_i + c_ij M g, σ_ε^2), b_i ~ N(0, σ_b^2). (2.8)

Again, inference is based on the posterior density

p(β, b, g, C, σ_ε^2, σ_b^2, σ_g^2 | Y, W). (2.9)

A directed acyclic graph of the longitudinal functional regression model appears in Figure 2.

Variational approximations for penalized functional regression
We now combine the ideas introduced above to develop a scalable iterative algorithm for approximate Bayesian inference in functional regression. We will focus on the longitudinal functional regression model (2.8); the cross-sectional case can be obtained as a special case by omitting the vector of subject-specific random effects b. We pause briefly to introduce the following useful notation: for a scalar random variable θ, let μ_q(θ) and σ_q(θ)^2 be the mean and variance with respect to the q-density. For a vector parameter θ, we use the analogously defined μ_q(θ) and Σ_q(θ).
As noted, inference in the longitudinal functional regression model is based on the posterior density (2.9). Using variational Bayes, we approximate this posterior density via a product restriction of the form q(θ) = ∏_l q_l(θ_l) over blocks of the model parameters, and by solving for each factor q(•) in terms of the remaining factors. An additional factorization follows as a consequence of (2.3) and the structure of the current model, as shown in Figure 2 [3, Sec. 10.2.5]. We take advantage of this induced factorization in deriving optimal densities for the variance components in the penalized functional regression model.
To provide an example of how optimal densities are constructed, we derive the optimal densities q*(g) and q*(σ_g^2); derivations for the other parameters are provided in Appendix A. Recall that g ~ N(0, σ_g^2 D) and σ_g^2 ~ IG(A_g, B_g). According to (2.3), the optimal densities are given by q*(g) ∝ exp{E_{−g} log p(g | rest_g)} and q*(σ_g^2) ∝ exp{E_{−σ_g^2} log p(σ_g^2 | rest_{σ_g^2})}, where rest includes both the observed data and all parameters not currently under consideration.
Using the full conditional distribution p(g | rest_g), the optimal density q*(g) is multivariate normal,

q*(g) = N(μ_q(g), Σ_q(g)), with Σ_q(g) = { μ_q(1/σ_ε^2) E_q[(CM)^T (CM)] + μ_q(1/σ_g^2) D^{−1} }^{−1} (3.2)

and μ_q(g) the corresponding mean vector. Similarly, the optimal density q*(σ_g^2) is IG(A_g + K_g/2, B_q(σ_g^2)), where

B_q(σ_g^2) = B_g + (1/2){ μ_q(g)^T D^{−1} μ_q(g) + tr(D^{−1} Σ_q(g)) }.

Note that, when q(σ_g^2) = q*(σ_g^2), the term μ_q(1/σ_g^2) appearing in (3.2) is equal to (A_g + K_g/2)/B_q(σ_g^2). Thus, the optimal densities q*(g) and q*(σ_g^2) belong to parametric families with the parameters explicitly determined by the distributions of the remaining model parameters and the observed data. Similar derivations for the remaining optimal densities are given in Appendix A. Taken together, these solutions lead to Algorithm 1 for approximate Bayesian inference in the functional linear regression setting.
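The flavor of these coupled updates can be sketched for a simplified conjugate Gaussian model y = Ag + ε with g ~ N(0, σ_g^2 I), i.e. taking D = I in place of the random-walk penalty and ignoring the PC-loading and random-effect blocks; all variable names and hyperparameter values are illustrative. The Gaussian factor and the two inverse-gamma factors are cycled exactly as in Algorithm 1.

```python
import numpy as np

rng = np.random.default_rng(2)
n, K = 200, 10
A = rng.normal(size=(n, K))                   # stand-in design matrix (e.g. CM)
g_true = rng.normal(scale=0.5, size=K)
y = A @ g_true + rng.normal(scale=0.3, size=n)

Ag, Bg, Ae, Be = 0.01, 0.01, 0.01, 0.01       # IG hyperparameters (illustrative)
e_inv_sg = e_inv_se = 1.0                     # mu_q(1/sigma_g^2), mu_q(1/sigma_eps^2)

for _ in range(50):
    # q*(g) = N(mu_g, Sigma_g)
    Sigma_g = np.linalg.inv(e_inv_se * A.T @ A + e_inv_sg * np.eye(K))
    mu_g = e_inv_se * Sigma_g @ A.T @ y
    # q*(sigma_g^2) = IG(Ag + K/2, Bg + E_q[g'g]/2)
    e_inv_sg = (Ag + K/2) / (Bg + 0.5*(mu_g @ mu_g + np.trace(Sigma_g)))
    # q*(sigma_eps^2) = IG(Ae + n/2, Be + E_q||y - A g||^2 / 2)
    r = y - A @ mu_g
    e_inv_se = (Ae + n/2) / (Be + 0.5*(r @ r + np.trace(A @ Sigma_g @ A.T)))

assert np.max(np.abs(mu_g - g_true)) < 0.15   # q-mean recovers the coefficients
assert abs(1/e_inv_se - 0.09) < 0.05          # error variance near 0.3^2
```

Each pass touches only small matrix inverses and traces, which is the source of the speed advantage over sampling.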
Further, as shown in Appendix B, the q-specific lower bound on the marginal log-likelihood has a closed form up to an additive constant, const., that remains unchanged over the iterations of Algorithm 1. All parameters denoted A and B and indexed by a subscript are hyperparameters of the inverse gamma prior distributions of the variance components. The quantity log p(Y, W; q) is typically monitored for convergence in place of p(Y, W; q). Note that, because several substitutions are made to simplify the expression, this form for log p(Y, W; q) is only valid at the end of each iteration of Algorithm 1, and only if the parameters are updated in the order given. Finally, posterior credible intervals are readily obtained for all model parameters. However, variational approximations in effect fit a parametric distribution to a mode of the posterior density, which may have consequences when the posterior is multimodal or more diffuse than the approximating parametric distribution; in such cases the credible intervals from variational Bayes and MCMC sampling may not agree. This was not a problem in our simulations, where the agreement between the approximate and MCMC-sampled posterior distributions is high, although in our application the variational credible intervals are slightly narrower than those from MCMC sampling.
Algorithm 1 Iterative scheme for obtaining the parameters in the optimal densities in the longitudinal functional regression model (2.8).
until the increase in p(Y , W ; q) is negligible.

Simulations
In this section we undertake simulation exercises with two goals. First, we evaluate our approach's overall ability to accurately estimate all coefficients in a functional regression model. Second, we compare the individual approximate posterior distributions q*(θ_l) ≈ p(θ_l | rest) to those given by MCMC sampling, in order to examine the quality of the variational approximation in the functional regression setting. We conduct separate simulations for the cross-sectional and longitudinal situations. The MCMC sampling was executed in WinBUGS and the variational Bayes approach was implemented in R.

Cross-sectional functional regression
We generate samples from the cross-sectional functional regression model, denoted (4.1), assuming I ∈ {100, 500} subjects, with z_i ~ Unif[−5, 5] and γ(t) = cos(2πt). To generate our simulated functional predictors X_i^S(t), we use the functional predictors X_i^A(t) from our scientific application in the following way. First, we compute a functional principal components decomposition of the X_i^A(t), with eigenfunctions ψ_1(t), ψ_2(t), ... and corresponding eigenvalues λ_1, λ_2, .... Recall that the application predictors can be approximated using

X_i^A(t) ≈ μ(t) + Σ_{k=1}^{K_x} c_{ik} ψ_k(t),

where μ(t) is a population mean function, K_x is the truncation lag, and the c_{ik} are uncorrelated random variables with variance λ_k. Using this, we construct simulated regressors

X_i^S(t) = μ(t) + Σ_{k=1}^{K_x} c_{ik}^S ψ_k(t),

where the simulated loadings c_{ik}^S are drawn independently with variance λ_k. This parametric construction ties the simulated functional predictors to the application predictors through the mean function μ(t), the eigenfunctions ψ_k(t), and the variance components λ_k. As in our application, the simulated predictors are observed on a grid of length 93.
We generate 100 such datasets for I = 100 and I = 500 and fit model (4.1) using both MCMC simulation and the variational approximation approach. For the MCMC simulation, we use chains of length 2500, with the first 1000 as burn-in. Representative examples of the MCMC model fits were inspected using trace and autocorrelation plots to ensure that the posterior samples were reasonable and that the comparison with variational Bayes was fair. To evaluate the ability of the proposed approach to estimate the functional coefficient γ(t), we use the mean squared error (MSE)

MSE{γ̂(t)} = (1/N) Σ_{j=1}^{N} {γ̂(t_j) − γ(t_j)}^2.

A comparison of MSEs for the variational approach with the more computationally intensive MCMC sampling is given in Table 1. To provide context for this table, in the left panel of Figure 3 we plot the estimated coefficient function resulting in the median MSE for I = 100.
Interestingly, when I = 500 the MSEs for MCMC sampling contain several large outliers, which raises the average MSE for the coefficient function in Table 1. Upon inspection, it was found that these large values corresponded to model fits in which the chains for g were bimodal: a large primary mode surrounded the true parameter value, while a smaller, more diffuse mode corresponded to a model overfit. We refit these models using as initial parameter values the estimates provided by the variational approach, which caused the bimodal behavior to disappear and brought the MSEs (and average MSE) down to levels similar to the remaining model fits.
We also quantify the quality of the variational approximation to the MCMC-sampled posterior by computing, for each parameter in the model, the accuracy

accuracy{q*(θ)} = 1 − (1/2) ∫ |q*(θ) − p̂(θ|y)| dθ,

where p̂(θ|y) denotes the MCMC-sampled posterior density; scores near 1 indicate a high level of agreement between the two densities. Due to the large number of parameters in the model, we present only a subset of the average accuracies in Table 2. As with the MSE, context for this table is given in Figure 3. The accuracy of g is affected by the presence of outliers, attributable to the same bimodal MCMC samples that caused the very large MSEs appearing in Table 1.
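The accuracy score can be computed by direct numerical integration once both densities are evaluated on a common grid. The sketch below is illustrative only: a standard normal stands in for the MCMC-sampled posterior (in practice p̂(θ|y) would be a density estimate from the MCMC draws), and it is compared against a deliberately too-diffuse approximation.

```python
import numpy as np

def npdf(x, m, s):
    # normal density, used here in place of kernel-smoothed MCMC output
    return np.exp(-(x - m)**2 / (2*s*s)) / (s*np.sqrt(2*np.pi))

def accuracy(q_vals, p_vals, grid):
    # accuracy = 1 - 0.5 * integral |q - p|, via a Riemann sum on the grid
    h = grid[1] - grid[0]
    return 1.0 - 0.5 * np.sum(np.abs(q_vals - p_vals)) * h

grid = np.linspace(-8, 8, 4001)
p = npdf(grid, 0.0, 1.0)                      # stand-in "MCMC-sampled" posterior
assert abs(accuracy(p, p, grid) - 1.0) < 1e-6 # identical densities score 1
q = npdf(grid, 0.0, 1.5)                      # a too-diffuse approximation
assert 0.5 < accuracy(q, p, grid) < 1.0       # imperfect but substantial overlap
```

Because the score is one minus half the L1 distance, it lies in [0, 1] and equals 1 only when the two densities coincide almost everywhere.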
Due to the substantial decrease in computation time using variational Bayes over MCMC methods, we are able to construct 95% bootstrap confidence intervals by sampling subjects with replacement and refitting model (4.1) using the variational approach. While credible intervals provided by MCMC or by a single variational fit are overly conservative, the bootstrap intervals are on average 0.46 times narrower and, averaged over the domain, have coverage probability 93.4% for I = 100 and 93.6% for I = 500. The far right panel of Figure 3 displays the coverage probabilities of the various credible and confidence intervals for I = 500.
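The bootstrap-of-subjects construction itself is simple: resample subjects with replacement, refit, and take pointwise percentiles. In the sketch below (ours; all names and dimensions are illustrative) an ordinary least-squares fit stands in for the variational fit, which is what would be slotted in to make each replicate fast.

```python
import numpy as np

rng = np.random.default_rng(3)
I, K = 100, 5                                 # subjects and coefficient length
X = rng.normal(size=(I, K))
beta_true = np.array([1.0, -0.5, 0.0, 0.3, 0.0])
Y = X @ beta_true + rng.normal(scale=0.5, size=I)

def fast_fit(Yb, Xb):
    # stand-in for the variational fit; any fast fitter slots in here
    return np.linalg.lstsq(Xb, Yb, rcond=None)[0]

B = 400                                       # number of bootstrap samples
boot = np.empty((B, K))
for b in range(B):
    idx = rng.integers(0, I, size=I)          # resample subjects with replacement
    boot[b] = fast_fit(Y[idx], X[idx])        # refit on the bootstrap sample

lo, hi = np.percentile(boot, [2.5, 97.5], axis=0)   # pointwise 95% intervals
assert np.all(lo < hi)
assert np.all(hi - lo < 1.0)                  # intervals are informative, not degenerate
```

In the longitudinal setting the same idea applies with whole clusters (all of a subject's visits) resampled together, which preserves the within-subject correlation.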
As demonstrated in Table 2 and Figure 3, the variational approximation performs well in this functional regression setting, both in terms of low MSEs and in terms of agreement with the MCMC-sampled posterior density. This stems from the low posterior dependence between the parameters, which is assumed in the use of the density transform approach. Additionally, the use of the bootstrap allows the construction of confidence intervals that are not overly conservative. Importantly, even in this simulation the computational burden is greatly reduced through the use of variational approximations. For I = 100, the MCMC sampling took on average 315 seconds, while the approximation was computed in on average 0.04 seconds (Dual Core 3.06GHz Processor; 4 GB RAM; OS X 10.6.4). For I = 500, the respective times were 1614 and 0.2 seconds. Constructing the bootstrap confidence intervals, based on 400 bootstrap samples, took on average an additional 20 and 76 seconds for I = 100 and I = 500, respectively.

Longitudinal functional regression
We fit model (2.8) for 100 simulated datasets. As in the cross-sectional case, we use chain lengths of 2500, with 1000 as burn-in, for the MCMC sampling. In Table 3 we display the average MSE of the estimated functional and scalar parameters and the subject-specific random effects. Again, the variational approximation performs as well as the MCMC sampling with a substantial difference in computation time: the MCMC sampling took on average 973 seconds, while the approximation was calculated in on average 0.2 seconds.

Application
In our scientific application, we analyze the association between measures of intracranial white matter and cognitive decline in multiple sclerosis patients. White matter is made up of myelinated axons, the long fibers used to transmit electrical signals in the brain, and is organized into bundles, or tracts. Major examples of white matter tracts are the corpus callosum, the corticospinal tracts, and the optic radiations. Here we focus on the corpus callosum, a collection of white-matter fibers that connects the two hemispheres of the brain.
Myelin, the fatty insulation surrounding white matter fibers, allows electrical signals to be propagated at high speeds along white matter tracts. Multiple sclerosis is a demyelinating autoimmune disease that causes lesions in the white matter. These lesions disrupt electrical signals and, over time, result in severe disability in affected patients. To measure cognitive disability, we use the Paced Auditory Serial Addition Test (PASAT), which assesses auditory processing speed and calculation ability. In this test, a proctor reads aloud a sequence of 60 numbers at three-second intervals, while the subject provides the sum of the previous two numbers spoken. Scores range between 0 and 60, indicating the number of correct sums provided by the subject; a score of 60 indicates the highest level of cognitive ability [8].
To quantify white matter, we use diffusion tensor imaging, a magnetic resonance imaging technique that measures the diffusivity of water in the brain. Because white matter is organized in bundles, water tends to diffuse anisotropically along a tract, which makes the reconstruction of tracts from MRI possible. By measuring diffusivity along several gradients, diffusion tensor imaging is able to produce detailed images of intracranial white matter [1,2,14,17]. Moreover, continuous summaries of individual white matter tracts, parameterized by distance along the tract and called tract profiles, can be constructed from diffusion tensor images.
Here we study the fractional anisotropy tract profile of the right corticospinal tract; this gives a measure of how anisotropic diffusion is along the tract.
Our study consists of 100 multiple sclerosis patients with between two and eight visits each; a total of 334 visits were observed. Study participants were between 21 and 71 years of age, and 63% were women. We fit model (2.8), using age and gender as non-functional covariates and the mean diffusivity tract profile of the corpus callosum as a functional predictor. We include subject-specific random intercepts to account for the repeated observations at the subject level. The model was fit using both the variational approximation and MCMC sampling; the results are shown in Figure 4.
Previous studies have linked damage in the corpus callosum to cognitive decline as measured by PASAT and other tests [15,20]. However, these studies lacked the spatial information present in the functional treatment here, which proves to be important. From the estimated coefficient function and bootstrapped confidence intervals, we see that the region from roughly 0 to 0.2 is negatively associated with the PASAT outcome; that is, subjects with above-average mean diffusivity in this region tend to have lower PASAT scores. A second region, from 0.65 to 0.8, is positively associated with the outcome. We base inference on the bootstrapped intervals due to the overly conservative coverage of the MCMC and variational Bayes credible intervals; however, there is broad agreement among all intervals regarding the location of regions of interest. Note that the interpretation of the coefficient function is marginal, rather than conditional on a subject's random intercept. The random intercepts are an important component of the model: a model including only random intercepts explains roughly 80% of the outcome variability, while adding functional and non-functional covariates raises this to 89%. Finally, age and gender were not found to be statistically significant, but were retained as scientifically important covariates. Their inclusion did not meaningfully affect the shape or significance of the functional predictor.
There is broad agreement between the variational Bayes and MCMC model fits: the point estimates of the coefficient function and the random intercepts are very similar, and the credible intervals indicate the same regions of significance. On the other hand, the credible interval using MCMC is wider than that using variational Bayes. As noted above, the variational method can result in narrower intervals if the approximating density is less diffuse than the MCMC-sampled posterior, which appears to be the case here. In this application, we posit that the lesser importance of the functional predictor in comparison to the random intercepts leads to increased posterior variability in the estimated functional coefficient. Indeed, when we fit a model without the subject-specific random intercept, the confidence intervals from the Bayesian and variational Bayesian approaches became indistinguishable.
Also shown in Figure 4, as a grey band, is the 95% bootstrap confidence interval, constructed by nonparametrically resampling subjects and fitting the longitudinal functional regression model using variational Bayes. Inference for the coefficient function is largely unchanged based on the bootstrap interval, except in the region from 0.2 to 0.4, which does not appear to be significantly associated with the outcome. Although the credible intervals using variational Bayes are likely too narrow, the computational gain and accurate point estimates provided by this method allow for the construction of bootstrap confidence intervals, which performed much better in our simulations.

Discussion
The variational Bayes approach to functional regression was motivated by a pressing need for computationally feasible Bayesian inference in a large-data setting. We have developed iterative algorithms for approximate inference in both the cross-sectional and longitudinal regression settings, and analyzed a longitudinal neuroimaging study. The methods developed: 1) flexibly estimate all parameters in the cross-sectional and longitudinal functional regression models; 2) accurately approximate the posterior distributions of all model parameters; 3) retain the advantages of Bayesian inference, including the ability to jointly model the functional predictors and scalar outcomes and easily constructed credible bands; 4) require orders of magnitude less computational effort than MCMC techniques; and 5) allow the construction of nonparametric bootstrap confidence intervals, which seem to have good coverage probabilities.
A few limitations of the variational Bayes method are apparent. While our simulations indicate that variational techniques can be used with confidence in the functional regression setting, the approximation cannot be made more accurate by increasing computation time. Additionally, the iterative algorithms are based on involved algebraic derivations; those needed for functional regression have been carried out here, but additional work may be needed to adapt these algorithms to specific scientific settings. Lastly, the performance of credible intervals approximated using variational Bayes may not be satisfactory if the posterior distribution is multimodal or more diffuse than the approximating distribution, although the use of the nonparametric bootstrap can alleviate this issue.
Future work may proceed in several directions. The adaptation of the approach to non-Gaussian outcomes will expand the class of applications in which variational Bayes may be used for functional regression. Very large gains in computation time may be found in functional magnetic resonance imaging or other studies where the predictors are sampled at thousands or tens of thousands of points. More generally, variational Bayes has potential applications in several functional data analysis topics, including function-on-function regression and the decomposition of populations of functions.
The full conditional distribution p(g|rest) appearing above is multivariate Normal, with mean µ_q(g) and covariance Σ_q(g) given in (A.2). Thus the optimal density q*(g) is N(µ_q(g), Σ_q(g)).
Further, the full conditional p(σ²_g|rest) is Inverse-Gamma, so that the optimal density q*(σ²_g) is Inverse-Gamma with rate parameter B_q(σ²_g). Note that, when q(σ²_g) = q*(σ²_g), the term µ_q(1/σ²_g) appearing in (A.2) is the mean of the corresponding reciprocal. Therefore, by (A.1), and after taking the expectation above, the optimal density q*(b) is N(µ_q(b), Σ_q(b)). Similarly, by (A.1), the optimal density q*(C) is given in (A.6): a product of Normally distributed random vectors sharing a common covariance matrix Σ_q(C) and with means the rows of µ_q(C).
In the derivation of the optimal density q*(λ_k), 1 ≤ k ≤ K_x, we let C_k denote the kth column of C and µ_q(C),k denote the kth column of µ_q(C). Further, we let (Σ_q(C))_kk denote the (k, k)th element of Σ_q(C). The full conditional p(λ_k|rest) is Inverse-Gamma, so that the optimal density q*(λ_k) is Inverse-Gamma with rate parameter B_q(λ_k). Note that, when q(λ_k) = q*(λ_k), the (k, k)th entry of Λ⁻¹_q appearing in (A.6) is equal to {A_λ + (nJ)/2} / B_q(λ_k).
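The pattern behind these Normal and Inverse-Gamma updates can be illustrated on a stripped-down conjugate Gaussian linear model. The sketch below is ours, not the paper's algorithm: for y | g, σ² ~ N(Zg, σ²I), g ~ N(0, s²_g I), σ² ~ IG(A, B), it alternates the Normal update for the coefficient vector with the Inverse-Gamma update for the error variance. All names are generic stand-ins.

```python
# Coordinate-ascent variational updates for a toy conjugate model:
#   q*(g) = N(mu, Sigma) given E_q[1/sigma^2], then
#   q*(sigma^2) = IG(A + n/2, B_q) given q*(g).
import numpy as np

rng = np.random.default_rng(1)
n, K = 200, 5
Z = rng.normal(size=(n, K))              # design matrix
g_true = rng.normal(size=K)
y = Z @ g_true + rng.normal(0.0, 0.3, n)

s2_g = 100.0        # prior: g ~ N(0, s2_g I)
A, B = 0.01, 0.01   # prior: sigma^2 ~ IG(A, B)
mu_inv_s2 = 1.0     # current value of E_q[1/sigma^2]

for _ in range(50):
    # Normal update for q*(g), given the current E_q[1/sigma^2].
    Sigma = np.linalg.inv(mu_inv_s2 * Z.T @ Z + np.eye(K) / s2_g)
    mu = mu_inv_s2 * Sigma @ Z.T @ y
    # Inverse-Gamma update for q*(sigma^2), given q*(g), using
    # E_q ||y - Z g||^2 = ||y - Z mu||^2 + tr(Z^T Z Sigma).
    resid = y - Z @ mu
    B_q = B + 0.5 * (resid @ resid + np.trace(Z.T @ Z @ Sigma))
    mu_inv_s2 = (A + n / 2) / B_q       # E_q[1/sigma^2] under the new IG

print(np.round(mu, 2))  # approximate posterior mean of g
```

The key device, as in the derivations above, is that each update only needs moments of the other factors (here µ_q(1/σ²) and the first two moments of g), so the cycle closes in closed form.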

Optimal density for σ²_X
Recall that the functional predictors are observed over a grid of length N. The full conditional p(σ²_X|rest) is Inverse-Gamma, and taking expectations as above yields the optimal density q*(σ²_X).
Appendix B: Expression for p(Y, W; q)
In this appendix we derive an expression for the lower bound on the log-likelihood. This quantity is used to monitor convergence in Algorithm 1, and its derivation takes advantage of the order of updates in the algorithm to simplify the expression. We have that

log p(Y, W; q*) = ∫ q*(θ) log{p(Y, W, θ)/q*(θ)} dθ.   (B.1)

The terms appearing in (B.1) are then evaluated one at a time; the fourth term, for example, is given by µᵀ_q(g) D⁻¹ µ_q(g) + tr(D⁻¹ Σ_q(g)) + K_g/2.
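To illustrate how such a bound is monitored in practice, the sketch below computes the lower bound for the same toy conjugate Gaussian linear model used above (ours, not the paper's model). Evaluating the bound immediately after the Inverse-Gamma update makes the E[log σ²] and E[1/σ²] terms cancel exactly, mirroring the simplification from update order described above; all names are generic stand-ins.

```python
# Monitoring convergence via the lower bound log p(y; q) for a toy model
#   y | g, sigma^2 ~ N(Zg, sigma^2 I), g ~ N(0, s2_g I), sigma^2 ~ IG(A, B).
import numpy as np
from math import lgamma, log, pi

rng = np.random.default_rng(2)
n, K = 120, 4
Z = rng.normal(size=(n, K))
y = Z @ rng.normal(size=K) + rng.normal(0.0, 0.5, n)

s2_g = 100.0        # prior: g ~ N(0, s2_g I)
A, B = 0.01, 0.01   # prior: sigma^2 ~ IG(A, B)
a_q = A + n / 2     # shape of q*(sigma^2); fixed across iterations
ZtZ = Z.T @ Z
mu_inv_s2 = 1.0     # starting value of E_q[1/sigma^2]

def lower_bound(mu, Sigma, B_q):
    # Valid when B_q was just updated from this same (mu, Sigma): the
    # E[log sigma^2] and E[1/sigma^2] terms then cancel exactly.
    lp_g = -0.5 * K * log(2 * pi * s2_g) \
           - (mu @ mu + np.trace(Sigma)) / (2 * s2_g)
    ent_g = 0.5 * np.linalg.slogdet(Sigma)[1] + 0.5 * K * (1 + log(2 * pi))
    return (-0.5 * n * log(2 * pi) + lp_g + ent_g
            + A * log(B) - lgamma(A) + lgamma(a_q) - a_q * log(B_q))

bounds = []
for _ in range(30):
    Sigma = np.linalg.inv(mu_inv_s2 * ZtZ + np.eye(K) / s2_g)
    mu = mu_inv_s2 * Sigma @ Z.T @ y
    resid = y - Z @ mu
    B_q = B + 0.5 * (resid @ resid + np.trace(ZtZ @ Sigma))
    mu_inv_s2 = a_q / B_q
    bounds.append(lower_bound(mu, Sigma, B_q))
    # Stop once the bound has effectively stopped increasing.
    if len(bounds) > 1 and bounds[-1] - bounds[-2] < 1e-8:
        break
```

Coordinate ascent guarantees the bound is nondecreasing, so a drop in the recorded values is a useful signal of a coding error in the updates.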

Fig 1. The functional predictor used in our diffusion tensor imaging application. The left and middle panels show the functional predictors observed for individual subjects; the right panel shows the collection of all observed functions.

Fig 2. Directed acyclic graph corresponding to the functional regression model (2.5). Shaded nodes correspond to observed data, and unshaded nodes to model parameters. Arrows indicate conditional dependence. The nodes for σ²_b and b, shown with dashed lines, appear in the longitudinal functional regression model (2.8) but not in the cross-sectional model (2.5).

Fig 3. The top-left panel shows the estimated coefficient function corresponding to the median MSE = 0.049, as well as variational and MCMC 95% credible intervals (dashed lines) and the 95% bootstrap interval (shaded region). The top-right panel displays the coverage probabilities of the credible intervals over the domain of the predictor (note the perfect overlap of the VB and MCMC coverage probabilities). The bottom panels show posterior densities estimated by variational approximation and by MCMC sampling from the same simulated dataset (dashed and solid lines, respectively), and give the accuracy of the approximation expressed as a percentage.

Fig 4. Results of fitting model (2.8) to the diffusion tensor imaging dataset. The left panel shows the estimated coefficient function; credible intervals for both methods are shown in dashed lines, and the nonparametric bootstrap interval is shown in grey. The right panel shows the random intercepts predicted by both variational Bayes and MCMC sampling.

Optimal densities for b and σ²_b
Recall that b ~ N(0, σ²_b I) and σ²_b ~ IG(A_b, B_b). The full conditional distribution p(b|rest) is multivariate Normal.

Table 1
Average integrated MSE for γ(t) and average MSE for the non-functional covariates β_1, β_2 estimated using the variational approximation, taken over 100 simulated datasets. For I = 500, large outliers for the MCMC MSE were removed in the calculation of the average.

Table 2
The accuracy of the variational approximation to the MCMC-sampled posterior, expressed as a percentage, for a subset of parameters in the cross-sectional functional regression model (2.5).

Table 3
Average integrated MSE for γ(t) and average MSE for the non-functional covariates β_1, β_2.