Bayesian Functional Principal Components Analysis via Variational Message Passing

Functional principal components analysis is a popular tool for inference on functional data. Standard approaches rely on an eigendecomposition of a smoothed covariance surface in order to extract the orthonormal functions representing the major modes of variation. This can be a computationally intensive procedure, especially in the presence of large datasets with irregular observations. In this article, we develop a Bayesian approach that determines the Karhunen-Lo\`eve decomposition directly, without the need to smooth and estimate a covariance surface. More specifically, we develop a variational Bayesian algorithm via message passing over a factor graph, which is more commonly referred to as variational message passing. Message passing algorithms are a powerful tool for compartmentalizing the algebra and coding required for inference in hierarchical statistical models. Recently, there has been much focus on formulating variational inference algorithms in the message passing framework because it removes the need to rederive approximate posterior density functions if there is a change to the model. Instead, model changes are handled by changing specific computational units, known as fragments, within the factor graph. We extend the notion of variational message passing to functional principal components analysis. Indeed, this is the first article to address a functional data model via variational message passing. Our approach introduces two new fragments that are necessary for Bayesian functional principal components analysis. We present the computational details, a set of simulations for assessing accuracy and speed, and an application to United States temperature data.


Introduction
Functional principal components analysis (FPCA) is the methodological extension of classical principal components analysis (PCA) to functional data. Within the overarching framework of functional data analysis, FPCA is a central technique. The advantages of using FPCA for functional data are derived from analogous advantages that PCA affords for multivariate data analysis. For instance, PCA in the multivariate data setting is used to reduce dimensionality and identify the major modes of variation of the data set. The modes of variation are determined by the eigenvectors of the sample covariance matrix of the data set, while dimension reduction is achieved by identifying the eigenvectors that maximize variation in the data. In the functional setting, response curves are interpreted as independent realisations of an underlying stochastic process. A covariance operator and its eigenfunctions play the analogous role that the covariance matrix and its eigenvectors play in the multivariate data setting. By identifying the eigenfunctions with the largest eigenvalues, one can reduce the dimensionality of the entire data set by approximating each curve as a linear combination of the reduced set of eigenfunctions.
There are technical issues that arise in the functional setting that are not present for multivariate data. The domain of the functional curves is typically a compact interval [0, T] of the real line. Despite having a continuous domain, the curves are only observed at discrete points over this interval. Furthermore, the points of observation, as well as the total number of observations, need not be the same for each curve. Therefore, approaches that are used in PCA require modifications to extend to the functional framework. In FPCA, we often rely on nonparametric regression to smooth the eigenfunctions and employ an appropriate step to ensure that they are orthonormal with respect to the L2 inner product, which is defined via integration, rather than the Euclidean inner product of vectors.
There have been numerous developments in FPCA methodology throughout the statistical literature. A thorough introduction to the statistical framework and applications can be found in Ramsay and Silverman (2005, Chapter 8) and Wang, Chiou, and Müller (2016, Section 2). Much of this work mirrors the eigendecomposition approach to PCA, in that an eigenbasis is obtained from a covariance surface. Yao, Müller, and Wang (2005) focused on the case of sparsely observed functional data and estimated principal component scores through conditional expectations. Xiao, Zipunnikov, Ruppert, and Crainiceanu (2016) developed a fast covariance estimation method for densely observed functional data. Di, Crainiceanu, Caffo, and Punjabi (2009) extended FPCA to multilevel functional data, extracting within- and between-subject sources of variability, and Greven, Crainiceanu, Caffo, and Reich (2011) developed methods for longitudinal functional data. However, Goldsmith, Greven, and Crainiceanu (2013) noted that these approaches implicitly condition on an estimated eigenbasis to estimate scores, meaning that inference on individual curve estimates can be inaccurate.
Meanwhile, other approaches have built on, or are similar to, the probabilistic PCA framework that was introduced by Tipping and Bishop (1999) and Bishop (1999). Rather than first obtaining eigenfunctions from a covariance and then estimating scores, all quantities are considered unknown and are estimated jointly. James, Hastie, and Sugar (2000) used an EM algorithm for estimation and inference in the context of sparsely observed curves. Variational Bayes for FPCA was introduced by van der Linde (2008) via a generative model with a factorized approximation of the full posterior density function. Goldsmith, Zipunnikov, and Schrack (2015) introduced a fully Bayesian framework for multilevel function-on-scalar regression models, and also considered observed values that arise from exponential family distributions.
In frequentist versions of FPCA, the covariance function is determined through bivariate smoothing of the raw covariances. Eigenfunctions and eigenvalues are then determined from the smoothed covariance function. The key advantage of the Bayesian approach is that the covariance function is not estimated, meaning that complex bivariate smoothing is not required. Indeed, the eigenfunctions and eigenvalues are computed directly as part of a Bayesian hierarchical model. Furthermore, it is unnecessary to compute or store large covariance matrices for dense functional data, and for sparse, irregular functional data, where estimating the raw covariance is difficult or impossible, direct estimation of eigenfunctions in a Bayesian model is straightforward. For these reasons, we pursue a Bayesian approach to FPCA.
Although there have been numerous contributions to Bayesian implementations of FPCA, we argue that there are additional considerations that should be addressed. First, MCMC modeling of FPCA is a computationally expensive procedure and, in some biostatistical applications (Goldsmith et al., 2015), the computational time can reach several hours. Second, current versions of variational Bayes for FPCA, despite being a much faster computational alternative, are difficult to extend to more complex likelihood specifications, such as multilevel data models and binary response outcomes. Minka (2005) presents a unifying view of approximate Bayesian inference under a message passing framework that relies on the notion of messages passed between nodes of a factor graph. Mean field variational Bayes (MFVB) can be incorporated into this framework through an alternate scheme known as variational message passing (VMP) (Winn & Bishop, 2005). Wand (2017) introduced computational units, known as fragments, that compartmentalize the algebraic derivations that are necessary for approximate Bayesian inference in VMP. The notion of fragments within a factor graph is essential for efficient extensions of variational Bayes-based FPCA to arbitrarily large statistical models.
In this article, we propose an FPCA extension of the VMP framework for variational Bayesian inference set out in Wand (2017). Our novel methodology includes the introduction of two fragments that are necessary for computing approximate posterior density functions under an MFVB scheme, as well as a sequence of post-processing steps for estimating the orthonormal eigenfunctions. We provide an introduction to variational Bayesian inference in Section 2, with an overview of VMP in Section 2.2. Section 3 introduces the Bayesian hierarchical model for FPCA and its extensions under a VMP formulation. In Section 4, we outline the post-VMP steps that are required for producing orthonormal eigenfunctions. Simulations, including speed and accuracy comparisons with MCMC algorithms, are presented in Section 5, and an application to United States temperature data is provided in Section 6.

Variational Bayesian Inference
The overarching aim of this article is the identification and derivation of fragments that are necessary for VMP implementations of FPCA. VMP represents a class of methodologies, derived from MFVB approaches, for approximate Bayesian inference over a factor graph. In this section, we provide a brief introduction to the MFVB and VMP frameworks. For an in-depth explanation of MFVB, we refer the reader to Ormerod and Wand (2010) and Blei, Kucukelbir, and McAuliffe (2017); for a comprehensive review of VMP, we refer the reader to Minka (2005) and Wand (2017).
Variational Bayesian inference is based on the notion of minimal Kullback-Leibler divergence to approximate a posterior density function. For arbitrary density functions p_1 and p_2 on R^d, the Kullback-Leibler divergence of p_1 from p_2 is

KL(p_1 || p_2) ≡ ∫_{R^d} p_1(x) log{p_1(x)/p_2(x)} dx. (1)

Note that KL(p_1 || p_2) ≥ 0, with equality if and only if p_1 = p_2 almost everywhere. Consider a generic Bayesian model with observed data vector y and parameter vector θ ∈ Θ, where Θ is a parameter space. We make the assumption that y and θ are continuous random variables with density functions p(y) and p(θ). For the case where some components are discrete, a similar treatment applies with summations replacing integrals. Next, let q(θ) represent an arbitrary density function over the parameter space Θ. The essence of variational Bayesian inference is to restrict q to a class of density functions Q and use the optimal q-density function, defined by

q*(θ) ≡ argmin_{q ∈ Q} KL{q(θ) || p(θ | y)}, (2)

as an approximation to the true posterior density function p(θ | y).
Defining the marginal log-likelihood lower bound log p(y; q) ≡ E_q{log p(y, θ) − log q(θ)}, the non-negativity condition of (1) gives p(y; q) ≤ p(y), showing that p(y; q) is a lower bound on the marginal likelihood. This leads to an equivalent form for the optimisation problem in (2):

q*(θ) ≡ argmax_{q ∈ Q} {log p(y; q)}. (3)

As stated in Rohde and Wand (2016), this alternate expression has the advantage of representing the optimal q-density function as maximising the lower bound on the marginal log-likelihood. For the remainder of this article, we will address variational Bayesian inference with (3), rather than (2).
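As a concrete numerical illustration of (1)-(3) (our own minimal sketch, not part of the article; all variable names are hypothetical), consider the conjugate model y ~ N(θ, 1) with prior θ ~ N(0, 1). The exact posterior is N(y/2, 1/2), the Kullback-Leibler divergence between Gaussians is available in closed form, and the lower bound log p(y; q) = log p(y) − KL(q || p(θ|y)) is visibly maximized when q equals the exact posterior:

```python
import numpy as np

def kl_gaussians(m1, s1, m2, s2):
    """KL( N(m1, s1^2) || N(m2, s2^2) ): the divergence in (1) for Gaussians."""
    return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

# Conjugate model: y ~ N(theta, 1), theta ~ N(0, 1)  =>  p(theta | y) = N(y/2, 1/2)
y = 1.3
post_m, post_s = y / 2, np.sqrt(0.5)

# Marginally y ~ N(0, 2), so log p(y) is available in closed form
log_py = -0.5 * np.log(2 * np.pi * 2.0) - y**2 / (2 * 2.0)

def elbo(m, s):
    """log p(y; q) = log p(y) - KL(q || posterior); maximized at the posterior."""
    return log_py - kl_gaussians(m, s, post_m, post_s)

print(elbo(post_m, post_s) - log_py)   # 0: the bound is tight at the posterior
print(elbo(0.0, 1.0) < log_py)         # True: any other q gives a strict bound
```

The gap log p(y) − log p(y; q) equals the Kullback-Leibler divergence exactly, which is why maximizing the bound in (3) and minimizing the divergence in (2) are equivalent.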

Mean Field Variational Bayes
MFVB is a class of variational Bayesian inference methods that uses a product density (or mean field) restriction on the optimal q-density function. The mean field approximation, which has its roots in statistical physics (Parisi, 1988), imposes the factorization

q(θ) = ∏_{i=1}^{N} q(θ_i), for all q ∈ Q, (4)

where {θ_1, . . ., θ_N} is some partition of θ. The optimal q-density functions that satisfy (3) are given by (e.g. Minka, 2005; Ormerod & Wand, 2010)

q*(θ_i) ∝ exp[E_{q(θ\θ_i)}{log p(y, θ)}], i = 1, . . ., N, (5)

where E_{q(θ\θ_i)} denotes expectation with respect to the optimal posterior density functions of all elements in the partition of θ, defined by (4), except for the optimal posterior density function of θ_i. The parameter vectors that define each of the optimal q-density functions in (5) are interrelated and are updated by a coordinate ascent algorithm (Ormerod & Wand, 2010, Algorithm 1). However, the resulting parameter vector updates are problem-specific and must be rederived if there is a change to the model. For instance, the updates for the optimal posterior density functions of the coefficients in a linear regression model will differ from those in a linear logistic regression model.
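The coordinate ascent iteration implied by (5) can be sketched on a toy conjugate model (our own illustration, not the article's FPCA model; all names are hypothetical): y_i ~ N(μ, τ⁻¹) with priors μ ~ N(0, 1) and τ ~ Gamma(a₀, b₀), under the restriction q(μ, τ) = q(μ) q(τ). Each q-density update depends on moments of the other, so the parameters are cycled until convergence:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=1.0, scale=1.0, size=200)      # simulated data from N(1, 1)
n, ybar, yy = y.size, y.mean(), np.sum(y**2)
a0, b0 = 0.01, 0.01                               # Gamma(a0, b0) prior on tau

E_tau = 1.0                                       # initialize E_q(tau)
for _ in range(100):
    # q(mu) = N(m, v): combine the N(0, 1) prior with the expected precision
    v = 1.0 / (1.0 + n * E_tau)
    m = v * E_tau * n * ybar
    # q(tau) = Gamma(a, b), using E_q{sum (y_i - mu)^2} = yy - 2*m*n*ybar + n*(m^2 + v)
    a = a0 + n / 2.0
    b = b0 + 0.5 * (yy - 2.0 * m * n * ybar + n * (m**2 + v))
    E_tau_new = a / b
    if abs(E_tau_new - E_tau) < 1e-10:            # coordinate ascent has converged
        break
    E_tau = E_tau_new

print(m, 1.0 / E_tau)   # approximate posterior mean of mu, and of 1/tau
```

The interrelatedness noted above is visible in the code: the q(μ) parameters (m, v) involve E_q(τ), and the q(τ) parameters (a, b) involve (m, v), so neither can be computed in closed form without iterating.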

Variational Message Passing
VMP is an alternate computational framework for variational Bayesian inference with a mean field product restriction. The VMP infrastructure is a factor graph representation of the Bayesian model. Wand (2017) advocates for the use of fragments, which are sub-graphs of a factor graph, as a means of compartmentalizing the algebra and computer coding required for variational Bayesian inference. Posterior density estimation is achieved by messages passed within and between factor graph fragments. Here, we give a brief description of the foundations of VMP. For a thorough exposition of the VMP framework, we refer the reader to Wand (2017).
A factor graph representation of the Bayesian model expressed in (6) is presented in Figure 1. The square nodes are the factors, which represent the distributional specifications of the model, and the circular nodes are called stochastic nodes, which represent the parameter vectors of the model. Furthermore, the graph is bipartite, meaning that stochastic nodes can only share an edge with a factor and vice versa. Additionally, notice that the edges of the factor graph respect the distributional dependencies of (6). For instance, the factor for p(θ_2 | θ_4, θ_5) shares edges with the stochastic nodes for θ_2, θ_4 and θ_5 only.
The VMP construction of variational Bayesian inference relies on messages passed between factors and stochastic nodes. Consider the factor for p(θ_2 | θ_4, θ_5) and the messages that it will pass to its neighboring stochastic nodes θ_2, θ_4 and θ_5. The messages passed from this factor are updated according to

m_{p(θ_2|θ_4,θ_5) → θ_2}(θ_2) ← exp[E_{q(θ_4, θ_5)}{log p(θ_2 | θ_4, θ_5)}],
m_{p(θ_2|θ_4,θ_5) → θ_4}(θ_4) ← exp[E_{q(θ_2, θ_5)}{log p(θ_2 | θ_4, θ_5)}],    (7)
m_{p(θ_2|θ_4,θ_5) → θ_5}(θ_5) ← exp[E_{q(θ_2, θ_4)}{log p(θ_2 | θ_4, θ_5)}],

where each expectation is taken with respect to the normalized product of the messages passed to and from the other neighboring stochastic nodes of the factor. Before continuing, we will make a brief note on our notation for computational updates. In (7), we use a left arrow to indicate a computational update, rather than an equality symbol. This reflects the fact that, at each iteration of the VMP algorithm, the algebraic form of the message remains the same, whereas the parameter values of the message will change between iterations as the Kullback-Leibler divergence is minimized. In addition, the subscript of the message in (7) indicates the direction of the message. Finally, each message is a function of the parameter vector that it is sent to or sent from. Next, consider the stochastic node θ_4, which receives messages from the factors p(θ_2 | θ_4, θ_5), p(θ_3 | θ_4) and p(θ_4). The messages that θ_4 returns to these factors are

m_{θ_4 → p(θ_2|θ_4,θ_5)}(θ_4) ← m_{p(θ_3|θ_4) → θ_4}(θ_4) m_{p(θ_4) → θ_4}(θ_4),
m_{θ_4 → p(θ_3|θ_4)}(θ_4) ← m_{p(θ_2|θ_4,θ_5) → θ_4}(θ_4) m_{p(θ_4) → θ_4}(θ_4),    (8)
m_{θ_4 → p(θ_4)}(θ_4) ← m_{p(θ_2|θ_4,θ_5) → θ_4}(θ_4) m_{p(θ_3|θ_4) → θ_4}(θ_4).

In general, the message that a stochastic node passes to a neighbouring factor is simply the product of the messages that it has received from its other neighbouring factors. Note that the form of the message updates in (7) is such that the range of each message is a subset of R. Therefore, the computations in (8) simply involve multiplication of scalars. Now, consider a general Bayesian model with M factors p_1, . . ., p_M and N parameter vectors θ_1, . . ., θ_N. We define the set of stochastic nodes that are connected to the factor p_j as

neighbors(j) ≡ {i = 1, . . ., N : θ_i shares an edge with p_j}.
Then, the message from the factor p_j to the stochastic node θ_i is

m_{p_j → θ_i}(θ_i) ← exp[E_{q(θ_{neighbors(j)\{i\}})}{log p_j}], (9)

and the message from θ_i to p_j is

m_{θ_i → p_j}(θ_i) ← ∏_{j' ≠ j : i ∈ neighbors(j')} m_{p_{j'} → θ_i}(θ_i). (10)

Upon convergence of the messages, the optimal q-density functions, which satisfy (3), take the form

q*(θ_i) ∝ ∏_{j : i ∈ neighbors(j)} m_{p_j → θ_i}(θ_i). (11)

The VMP iterative loop has the following generic steps (Minka, 2005; Wand, 2017):
1. Choose a factor.
2. Update the messages passed from the factor's neighbouring stochastic nodes to the factor.
3. Update the messages passed from the factor to its neighbouring stochastic nodes.
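The three steps above can be sketched on the smallest possible factor graph (our own illustration, with hypothetical variable names): the two-factor model p(y | θ) p(θ) with y ~ N(θ, 1) and θ ~ N(0, 1). Messages are stored as natural parameter vectors for the sufficient statistic T(θ) = (θ, θ²)', and, as in (11), the optimal q-density natural parameter is the sum of the incoming factor-to-node messages:

```python
import numpy as np

y = 1.4

# Natural parameters (eta1, eta2) of Gaussian-form messages in theta,
# with T(theta) = (theta, theta^2): N(mu, s2) corresponds to (mu/s2, -1/(2*s2)).
messages = {
    ("p(theta)", "theta"): np.array([0.0, -0.5]),   # prior factor, N(0, 1)
    ("p(y|theta)", "theta"): np.zeros(2),           # initialized flat
}

for _ in range(5):  # steps 1-3; this conjugate example converges in one pass
    # Node-to-factor message: product of the other incoming messages,
    # i.e. natural parameters add (here only the prior message contributes).
    messages[("theta", "p(y|theta)")] = messages[("p(theta)", "theta")]
    # Factor-to-node message: exp of the expected log factor. For
    # log p(y|theta) = -(y - theta)^2 / 2 + const this is exactly (y, -1/2).
    messages[("p(y|theta)", "theta")] = np.array([y, -0.5])

# Optimal q-density: sum the factor-to-node messages, then invert the mapping.
eta = messages[("p(theta)", "theta")] + messages[("p(y|theta)", "theta")]
q_var = -1.0 / (2.0 * eta[1])
q_mean = eta[0] * q_var
print(q_mean, q_var)   # matches the exact posterior N(y/2, 1/2)
```

Because the model is conjugate, the messages stabilize after a single pass and the q-density is exact; in non-conjugate models the same loop runs until the natural parameters stop changing.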

Exponential Family Form
A key step in deriving and implementing VMP algorithms is the representation of probability density functions in exponential family form. In particular, the messages in (9) and (10) are typically in the exponential family of density functions with vector of sufficient statistics T(θ_i). We have

m_{p_j → θ_i}(θ_i) = exp{T(θ_i)' η_{p_j → θ_i}} and m_{θ_i → p_j}(θ_i) = exp{T(θ_i)' η_{θ_i → p_j}},

where η_{p_j → θ_i} and η_{θ_i → p_j} are the message natural parameter vectors. Wand (2017) explains how natural parameter vectors play a central role in the messages that are passed within and between factor graph fragments. Furthermore, the natural parameter vectors for the optimal q-density functions in (11) take the form

η_{q(θ_i)} = Σ_{j : i ∈ neighbors(j)} η_{p_j → θ_i}. (12)

In addition, we adopt the notation

η_{p_j ↔ θ_i} ≡ η_{p_j → θ_i} + η_{θ_i → p_j}. (13)

Before introducing the exponential family forms for key distributions in the VMP setting, we outline some matrix and vector operators. We define the vec and vech operators, which are well-established (e.g. Gentle, 2007). For a d_1 × d_2 matrix, the vec operator concatenates the columns of the matrix from left to right. For a d_1 × d_1 matrix, the vech operator concatenates the columns of the matrix after removing the above-diagonal elements. For example, suppose that

A = [  2  −1 ]
    [ −3   1 ].

Then vec(A) = (2, −3, −1, 1)' and vech(A) = (2, −3, 1)'. For a d² × 1 vector a, vec⁻¹(a) is the d × d matrix such that vec{vec⁻¹(a)} = a. Additionally, the matrix D_d is the duplication matrix of order d, and it is such that D_d vech(A) = vec(A) for any symmetric d × d matrix A.

The normal distribution is one of the most important distributions within the exponential family, and it plays a major role in VMP versions of variational Bayesian inference. Consider the d × 1 multivariate normal random vector x ~ N(μ, Σ). The probability density function of x can be shown to satisfy

p(x) = (2π)^{−d/2} exp{T(x)' η − A(η)}, (14)

where

T(x) ≡ [ x ; vec(xx') ] and η ≡ [ η_1 ; η_2 ] = [ Σ⁻¹μ ; −(1/2) vec(Σ⁻¹) ]

are, respectively, the vector of sufficient statistics and the natural parameter vector. The function

A(η) ≡ −(1/4) η_1' {vec⁻¹(η_2)}⁻¹ η_1 − (1/2) log|−2 vec⁻¹(η_2)|

is the log-partition function. The inverse mapping of the natural parameter vector is (Wand, 2017, equation S.4)

μ = −(1/2) {vec⁻¹(η_2)}⁻¹ η_1 and Σ = −(1/2) {vec⁻¹(η_2)}⁻¹.

We will refer to the representation of the multivariate normal probability density function in (14) as the vec-based representation. Alternatively, a more storage-economical representation of the multivariate normal probability density function is the vech-based representation, in which the vector of sufficient statistics, the natural parameter vector and the log-partition function are, respectively,

T(x) ≡ [ x ; vech(xx') ], η ≡ [ η_1 ; η_2 ] = [ Σ⁻¹μ ; −(1/2) D_d' vec(Σ⁻¹) ]

and

A(η) ≡ −(1/4) η_1' [vec⁻¹{(D_d⁺)' η_2}]⁻¹ η_1 − (1/2) log|−2 vec⁻¹{(D_d⁺)' η_2}|,

where D_d⁺ ≡ (D_d' D_d)⁻¹ D_d' is the Moore-Penrose inverse of D_d. The inverse mapping of the natural parameter vector under the vech-based representation is

μ = −(1/2) [vec⁻¹{(D_d⁺)' η_2}]⁻¹ η_1 and Σ = −(1/2) [vec⁻¹{(D_d⁺)' η_2}]⁻¹.

The other major distribution within the exponential family that is pivotal for this article is the inverse-χ² distribution. A random variable x has an inverse-χ² distribution with shape parameter ξ > 0 and scale parameter λ > 0 if the probability density function of x is

p(x; ξ, λ) = {(λ/2)^{ξ/2} / Γ(ξ/2)} x^{−(ξ+2)/2} exp{−λ/(2x)}, x > 0,

where Γ(·) is the gamma function defined by Γ(z) ≡ ∫_0^∞ u^{z−1} e^{−u} du. The exponential family representation of the inverse-χ² density function is

p(x; ξ, λ) = exp{T(x)' η − A(η)},

where the vector of sufficient statistics, the natural parameter vector and the log-partition function are, respectively,

T(x) ≡ [ log x ; 1/x ], η ≡ [ η_1 ; η_2 ] = [ −(ξ + 2)/2 ; −λ/2 ] and A(η) ≡ log Γ(−η_1 − 1) + (η_1 + 1) log(−η_2).

The inverse mapping of the natural parameter vector is (Maestrini & Wand, 2020, equation 8)

ξ = −2(η_1 + 1) and λ = −2 η_2.

The generalization of the inverse-χ² distribution is the inverse G-Wishart distribution, written X ~ Inverse G-Wishart(G, ξ, Λ), for a symmetric and positive definite d × d matrix X. The parameter G is an undirected graph with d nodes and edge set E consisting of the pairs of nodes that are connected by an edge. Furthermore, ξ > 0 and Λ is a symmetric and positive definite d × d matrix. We say that the symmetric matrix A respects G if A_{ij} = 0 for all (i, j) ∉ E. Now, impose the additional constraint that X⁻¹ respects G.
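The operators and the vec-based natural parameter mapping above can be checked numerically. The following sketch (our own, in Python) implements vec, vech and vec⁻¹ and verifies the round trip between (μ, Σ) and the natural parameter vector:

```python
import numpy as np

def vec(A):
    """Concatenate the columns of A from left to right."""
    return A.flatten(order="F")

def vech(A):
    """Concatenate the columns of square A after removing the
    above-diagonal elements (column-wise lower triangle)."""
    return np.concatenate([A[j:, j] for j in range(A.shape[1])])

def vec_inv(a):
    """Inverse of vec for square matrices: vec(vec_inv(a)) = a."""
    d = int(round(np.sqrt(a.size)))
    return a.reshape((d, d), order="F")

A = np.array([[2.0, -1.0],
              [-3.0, 1.0]])
print(vec(A))    # [ 2. -3. -1.  1.]
print(vech(A))   # [ 2. -3.  1.]

# vec-based natural parameters of N(mu, Sigma) and the inverse mapping
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)
eta1, eta2 = Sigma_inv @ mu, -0.5 * vec(Sigma_inv)

# Inverse mapping: Sigma = -(1/2) {vec_inv(eta2)}^{-1}, mu = Sigma @ eta1
Sigma_back = -0.5 * np.linalg.inv(vec_inv(eta2))
mu_back = Sigma_back @ eta1
assert np.allclose(Sigma_back, Sigma) and np.allclose(mu_back, mu)
```

The round-trip check at the end is exactly the inverse mapping quoted from Wand (2017, equation S.4), applied to a small numerical example.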
Furthermore, let G_full represent a complete graph, where each node is connected to every other node by an edge, and let G_diag represent a sparse graph whose edge set is empty. The justification for this notation is that a matrix with no zero constraints respects G_full, while a diagonal matrix respects G_diag. The two graph constructions generate two definitions for the inverse G-Wishart distribution (Maestrini & Wand, 2020, Definition 3):

(a) If G = G_full and ξ is restricted such that ξ > 2d − 2, then we say that X has an inverse G-Wishart distribution with graph G, shape parameter ξ and scale matrix Λ if and only if the non-zero values of the density function of X satisfy

p(X) ∝ |X|^{−(ξ+2)/2} exp{−(1/2) tr(Λ X⁻¹)}.

(b) If G = G_diag, then X is diagonal and each diagonal entry of X is an independent inverse-χ² random variable with shape parameter ξ and scale parameter given by the corresponding diagonal entry of Λ.

The exponential family form of the inverse G-Wishart distribution is not important for this article; we refer the reader to Maestrini and Wand (2020, Section 2.2). Instead, our interest lies in the role of the graphical parameter, which must be incorporated as an argument for variational message updates involving inverse G-Wishart random matrices, including inverse-χ² random variables. We follow the advice in Maestrini and Wand (2020, Sections 5 & 6) for message passing updates involving graphical parameters.

Factor Graph Fragments
A factor graph fragment (or fragment, for short) is the computational unit of VMP. A fragment, as defined by Wand (2017), is a subgraph of a factor graph consisting of a single factor and each of its neighboring stochastic nodes. An example of a factor graph fragment is the fragment for p(θ_2 | θ_4, θ_5) in Figure 1, which is highlighted in blue. The factor representing p(θ_2 | θ_4, θ_5), the neighboring stochastic nodes θ_2, θ_4 and θ_5 and the connecting edges comprise the fragment.
The identification and derivation of the algebraic computations in fragments is the key step for extending VMP to arbitrarily large Bayesian statistical models. As explained at the end of Section 2.1, the form of each of the optimal q-density functions is problem-specific, requiring rederivations for any change in the model. However, the fragment-based approach in VMP means that updates are localized within the fragment. This represents enormous savings in algebraic derivations and coding because, once a fragment has been coded and stored, it can simply be called as a function without the need to rederive the updates. Any change in the model is simply handled by removing the fragment that does not fit the new model and replacing it with an appropriate alternative.
The following is a list of some of the major contributions to fragment updates for VMP: • Wand (2017) identified five fundamental fragments for Gaussian response semiparametric regression. These include the Gaussian prior fragment, the inverse G-Wishart prior fragment and the iterated inverse G-Wishart fragment, which are necessary for a VMP construction of Bayesian FPCA. The author also identified several other fragments for direct implementation in various extensions of semiparametric regression.
• Nolan and Wand (2017) developed fast, stable and accurate numerical integration techniques for the logistic likelihood fragment. This fragment had previously been introduced in Wand (2017) using the variational lower bound of Jaakkola and Jordan (2000); however, the performance of this approximation can be poor (Knowles & Minka, 2011). Instead, Nolan and Wand (2017) incorporated the normal scale mixture uniform approximation of the logistic function (Monahan & Stefanski, 1992) into the logistic likelihood fragment for highly accurate inference.
• Maestrini and Wand (2018) derived algorithmic updates for the skew t likelihood fragment with all skew t parameters inferred, rather than being held fixed.
• McLean and Wand (2019) built on previous VMP constructions by focusing on regression models where the response variable is modeled to have an elaborate distribution, such as the negative binomial or t distribution.
• Nolan, Menictas, and Wand (2020) used a set of solutions to sparse multilevel matrix problems (Nolan & Wand, 2020) to streamline the computations of VMP for Gaussian response linear mixed models. This involved the introduction of four new fragments.
• Maestrini and Wand (2020) provide corrections for the matrix versions of the inverse G-Wishart prior fragment and the iterated inverse G-Wishart fragment from Wand (2017). In particular, Section 4.1.3 of Wand (2017), which introduces the iterated inverse G-Wishart fragment, only addresses the case where the graph parameter is G = G_full; it does not account for the case where G = G_diag. The necessary corrections are outlined in Section 6.1 of Maestrini and Wand (2020). Although the scalar versions of these fragments from Wand (2017) are sufficient for the current article, we will use the updated fragments from Maestrini and Wand (2020), since they are the new standard for approximate Bayesian inference on variance and covariance matrix parameters.

Functional Principal Components Analysis
Consider a random sample of i.i.d. smooth random functions y_1, . . ., y_n ∈ L²[0, 1]. We will assume the existence of a continuous mean function μ = E(y_i) and a continuous covariance surface

K(s, t) ≡ Cov{y_i(s), y_i(t)}, s, t ∈ [0, 1].

Then, the covariance operator Σ of y_i is defined as

(Σ v)(t) ≡ ∫_0^1 K(s, t) v(s) ds, v ∈ L²[0, 1]. (17)

Mercer's Theorem implies the spectral decomposition

K(s, t) = Σ_{l=1}^∞ γ_l ψ*_l(s) ψ*_l(t),

where the γ_l are the eigenvalues of Σ in descending order and the ψ*_l are the corresponding orthonormal eigenfunctions. The Karhunen-Loève decomposition is the basis for the FPCA expansion (Yao et al., 2005):

y_i(t) = μ(t) + Σ_{l=1}^∞ ζ*_{il} ψ*_l(t), (18)

where the scores ζ*_{il} are independent random variables with E(ζ*_{il}) = 0 and Var(ζ*_{il}) = γ_l. The asterisk is used as a reminder that the eigenfunctions in (18) are orthonormal and that the scores are independent.
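The decomposition in (18) is easy to explore numerically. The following sketch (our own illustration, not the article's method) simulates curves from a two-term Karhunen-Loève expansion on a grid and then recovers the eigenfunctions and eigenvalues by the classical route, an eigendecomposition of the sample covariance, which is exactly the covariance-based approach that this article seeks to avoid:

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 101)
dt = t[1] - t[0]
n, L = 300, 2

# Orthonormal eigenfunctions on [0, 1] and descending eigenvalues
psi = np.stack([np.sqrt(2.0) * np.sin(2 * np.pi * t),
                np.sqrt(2.0) * np.cos(2 * np.pi * t)])
gamma = np.array([4.0, 1.0])

# Karhunen-Loeve expansion (18) with mu = 0: y_i = sum_l zeta_il psi_l
zeta = rng.normal(0.0, np.sqrt(gamma), size=(n, L))
Y = zeta @ psi                                  # n x 101 matrix of curves

# Classical FPCA: eigendecompose the sample covariance on the grid,
# rescaling by dt so the results live on the L2 (integral) scale.
K_hat = np.cov(Y, rowvar=False)
evals, evecs = np.linalg.eigh(K_hat)
order = np.argsort(evals)[::-1][:L]
gamma_hat = evals[order] * dt
psi_hat = evecs[:, order].T / np.sqrt(dt)

print(gamma_hat)   # close to (4, 1), up to sampling noise
```

Note that the recovered eigenfunctions agree with the truth only up to sign, which is the identifiability issue discussed in the observations that follow.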
Expansion (18) facilitates dimension reduction by providing a best approximation for each curve y_1, . . ., y_n in terms of the truncated sums involving the first L eigenfunctions: for any fixed L, the mean squared error

E || y_i − μ − Σ_{l=1}^L ⟨y_i − μ, φ_l⟩ φ_l ||²

is minimized over orthonormal sets {φ_1, . . ., φ_L} by taking φ_l = ψ*_l, where || · || denotes the L² norm and ⟨·, ·⟩ denotes the L² inner product. For this reason, we define the best estimate of y_i as

ŷ_i(t) ≡ μ(t) + Σ_{l=1}^L ζ*_{il} ψ*_l(t). (19)

Next, we make some observations involving the scores and the orthonormal eigenfunctions in (18) and (19): 1. If γ_l = γ_k, l ≠ k, then the corresponding orthonormal eigenfunctions ψ*_l and ψ*_k are not unique. We will simply assume that the eigenvalues are unique, which is a reasonable assumption for most applications (e.g. climate data and biostatistical problems).
2. If all the eigenvalues are unique, the corresponding orthonormal eigenfunctions are only unique up to a change of sign. Issues of identifiability are always present when one attempts to infer eigenfunctions or eigenvectors. However, choosing one eigenfunction over its opposite sign has no effect on the resulting fits, although one choice of sign may provide a more natural interpretation of the eigenfunction. Here, we simply assume that the inner product of the Bayesian estimate of an eigenfunction, with the desired sign, and the eigenfunction itself is positive.
We state these assumptions formally.
Assumption 1. The eigenvalues of the covariance operator in (17) are unique.
Assumption 2. For each l = 1, . . ., L, the sign of ψ*_l is chosen such that the L² inner product between the estimate of ψ*_l and ψ*_l itself is positive.
Expansions similar to (19) are also possible:

y_i(t) ≈ μ(t) + Σ_{l=1}^L ζ_{il} ψ_l(t), (20)

where the ζ_{il} are correlated across l, but remain independent across i, and the ψ_l are not orthonormal. Theorem 3.1 shows that an orthogonal decomposition of the resulting basis functions and scores is sufficient for establishing the appropriate estimator (19). Its proof is provided in Appendix A.
Theorem 3.1. Consider Assumptions 1 and 2 and the approximations of the response curves y_1, . . ., y_n in (20). Then, there exists a unique set of orthonormal eigenfunctions ψ*_1, . . ., ψ*_L and an uncorrelated set of scores ζ*_{i1}, . . ., ζ*_{iL}, i = 1, . . ., n, such that

μ(t) + Σ_{l=1}^L ζ_{il} ψ_l(t) = μ(t) + Σ_{l=1}^L ζ*_{il} ψ*_l(t), i = 1, . . ., n.

Theorem 3.1 motivates estimation of the Karhunen-Loève decomposition directly to infer the eigenfunctions and scores. In this approach, all components of the Karhunen-Loève decomposition are viewed as unknown, so that scores and eigenfunctions are estimated jointly. The other class of methods uses covariance decompositions to obtain the eigenfunctions and subsequently estimates the scores given the eigenfunctions using the Karhunen-Loève decomposition (e.g. Yao et al., 2005; Di et al., 2009; Xiao et al., 2016). The former method has several advantages, in that it does not require estimation or smoothing of a large covariance and can more directly handle sparse or irregular functional data. The Bayesian model is described in Section 3.1, while new fragments that are relevant for FPCA via VMP are introduced in Sections 3.2 and 3.3.
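A numerical sketch of the orthogonalization that Theorem 3.1 guarantees (our own illustration, with hypothetical names, not the article's post-processing algorithm): given basis functions that are not orthonormal and scores that are correlated, a thin singular value decomposition of the matrix of fitted curves yields orthonormal functions and orthogonal score columns that reproduce the fits exactly.

```python
import numpy as np

rng = np.random.default_rng(2)
t = np.linspace(0.0, 1.0, 101)
dt = t[1] - t[0]
n, L = 200, 2

# Non-orthonormal basis functions on the grid (L x 101) and scores that
# are correlated across l, as in (20).
Psi = np.stack([np.sin(2 * np.pi * t) + 0.3 * np.cos(2 * np.pi * t),
                np.cos(2 * np.pi * t)])
Z = rng.multivariate_normal([0.0, 0.0], [[4.0, 1.0], [1.0, 1.0]], size=n)
fits = Z @ Psi                                  # mean-removed fitted curves

# Orthogonal decomposition via thin SVD; sqrt(dt) converts between the
# Euclidean norm on the grid and the L2 norm on [0, 1].
U, s, Vt = np.linalg.svd(fits, full_matrices=False)
psi_star = Vt[:L] / np.sqrt(dt)                 # orthonormal in L2
zeta_star = U[:, :L] * (s[:L] * np.sqrt(dt))    # orthogonal score columns

# Same fits, now expressed in an orthonormal basis with decorrelated scores
assert np.allclose(zeta_star @ psi_star, fits)
```

Because the fitted curves have rank L, the truncated SVD reproduces them exactly, and the singular values automatically order the new components by the variation they explain.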

Bayesian Model Construction
In practice, the curves y_1, . . ., y_n are indirectly observed as noisy observations at discrete points in time. Furthermore, the observation times are not necessarily the same for each curve. Let the set of design points for the ith curve be summarized by the vector

t_i ≡ (t_{i1}, . . ., t_{iT_i})',

where T_i is the number of observations on the ith curve. In addition, we represent the observations for the ith curve, y_i(t), by the vector

y_i ≡ (y_{i1}, . . ., y_{iT_i})', with y_{ij} = y_i(t_{ij}) + ε_{ij},

where the ε_{ij} are i.i.d. noise terms with E(ε_{ij}) = 0 and Var(ε_{ij}) = σ²_ε. The finite decomposition in (19) takes the form

y_i = μ_i + Σ_{l=1}^L ζ_{il} ψ_{il} + ε_i,

where μ_i ≡ {μ(t_{i1}), . . ., μ(t_{iT_i})}', ψ_{il} ≡ {ψ_l(t_{i1}), . . ., ψ_l(t_{iT_i})}', for l = 1, . . ., L, and ε_i ≡ (ε_{i1}, . . ., ε_{iT_i})' is a vector of measurement errors for the observations on curve y_i(t).
We model continuous curves from discrete observations via nonparametric regression (Ruppert, Wand, & Carroll, 2003, 2009), using the mixed model-based penalized spline basis function representation, as in Durbán, Harezlak, Wand, and Carroll (2005). The representations for the mean function and the FPCA eigenfunctions are

μ(t) = β_{μ,0} + β_{μ,1} t + Σ_{k=1}^K u_{μ,k} z_k(t) and ψ_l(t) = β_{ψ_l,0} + β_{ψ_l,1} t + Σ_{k=1}^K u_{ψ_l,k} z_k(t), l = 1, . . ., L, (23)

where {z_k(·)}_{1≤k≤K} is a suitable set of basis functions. Splines and wavelet families are the most common choices for the z_k. In our simulations, we use O'Sullivan penalized splines, which are described in Section 4 of Wand and Ormerod (2008).
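A minimal sketch of the mixed model-based penalized spline representation (our own; it uses the truncated-line basis z_k(t) = (t − κ_k)₊ as a simple stand-in for the O'Sullivan splines used in the article, and a ridge penalty on the spline coefficients as the mixed-model analogue):

```python
import numpy as np

def spline_design(t, K):
    """Design matrix [1, t, z_1(t), ..., z_K(t)] with truncated-line basis
    functions z_k(t) = (t - kappa_k)_+ at K interior quantile knots."""
    kappa = np.quantile(t, np.linspace(0.0, 1.0, K + 2)[1:-1])
    Z = np.maximum(t[:, None] - kappa[None, :], 0.0)
    return np.column_stack([np.ones_like(t), t, Z])

rng = np.random.default_rng(3)
t = np.sort(rng.uniform(0.0, 1.0, 150))
y = np.sin(2 * np.pi * t) + rng.normal(0.0, 0.2, t.size)

K = 15
C = spline_design(t, K)
# Penalize the spline coefficients u_k only; lam plays the role of the
# variance ratio sigma_eps^2 / sigma_u^2 in the mixed model formulation.
lam = 1e-2
P = np.diag([0.0, 0.0] + [1.0] * K)
nu_hat = np.linalg.solve(C.T @ C + lam * P, C.T @ y)
fit = C @ nu_hat
```

In the Bayesian model that follows, the same linear-in-basis structure appears for μ(t) and each ψ_l(t), with the penalty replaced by Gaussian priors on the spline coefficients whose variances are inferred.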
In order to avoid notational clutter, we collect the spline coefficients for the mean function into the vector ν_μ and, for each eigenfunction, into the vectors ν_{ψ_1}, . . ., ν_{ψ_L}, with ν ≡ (ν_μ', ν_{ψ_1}', . . ., ν_{ψ_L}')', and we let C_i denote the design matrix with columns 1, t_i and z_k(t_i), k = 1, . . ., K. Then simple derivations that stem from (23) show that the vector of observations on each of the response curves satisfies the representation

y_i = C_i (ν_μ + Σ_{l=1}^L ζ_{il} ν_{ψ_l}) + ε_i, where ζ_i ≡ (ζ_{i1}, . . ., ζ_{iL})'.

Next, we present the Bayesian FPCA Gaussian response model (26), in which each y_i is conditionally multivariate normal given ν, ζ_i and σ²_ε, the coefficient vectors and scores have Gaussian prior specifications, and the variance parameters have iterated inverse-χ² prior specifications. Note that the iterated inverse-χ² distributional specification on σ²_ε, which involves an inverse-χ² prior specification on the auxiliary variable a_ε, is equivalent to σ_ε ~ Half-Cauchy(A_ε). This auxiliary variable-based hierarchical construction facilitates arbitrarily non-informative priors on standard deviation parameters (Gelman, 2006). Similar comments also apply to the iterated inverse-χ² distributional specifications for σ²_μ and σ²_{ψ_1}, . . ., σ²_{ψ_L}. Full Bayesian inference requires the determination of the joint posterior density function of ν, ζ_1, . . ., ζ_n, the variance parameters and the auxiliary variables a_ε, a_μ, a_{ψ_1}, . . ., a_{ψ_L}, but this posterior density function is typically analytically intractable. The standard approach for overcoming this deficiency is to employ MCMC methods. However, we raise two major arguments against this approach. First, MCMC simulation is very slow for model (26), even for a moderate dimension of ν, which depends on the number of eigenfunctions (L) and O'Sullivan penalized spline basis functions (K). Second, the mean function μ(t) and the eigenfunctions ψ_1(t), . . ., ψ_L(t) are typically highly correlated, which is expected to lead to poor mixing. A possible remedy for this is to use an inverse G-Wishart prior structure that permits correlations amongst the spline coefficients (Goldsmith & Kitago, 2016). However, this is beyond the scope of this article, which is not concerned with improving MCMC methods for FPCA. Alternatively, variational approximate inference for model (26) involves the mean field restriction (28). The approximation in (28) represents the minimal mean field restriction that is required for approximate variational inference. Here, we have assumed posterior independence between global parameters and response curve-specific parameters, as well as incorporating the notion of asymptotic independence between regression coefficients and variance parameters (Menictas & Wand, 2013, Section 3.1).
From here, we work with the factorization in (29) to minimize the Kullback-Leibler divergence of the right-hand side of (28) from its left-hand side. The factor graph for model (26) that represents the factorization in (29) is presented in Figure 2.
From Figure 2, we identify the following factor graph fragments that are involved in the Bayesian FPCA model (26): • The fragments for p(ζ_1), . . ., p(ζ_n) are Gaussian prior fragments. The updates for this fragment are presented in Section 4.1.1 of Wand (2017), where a vec-based representation of the multivariate normal density function is used.
• The fragments for p(a ε ), p(a µ ) and p(a ψ 1 ), . . ., p(a ψ L ) are scalar versions of the inverse G-Wishart prior fragment, which is presented as Algorithm 1 of Maestrini and Wand (2020).

Functional Principal Component Gaussian Likelihood Fragment
The functional principal component Gaussian likelihood fragment, shown in blue in Figure 2, is defined by the factor representing the likelihood p(y | ν, ζ_1, . . ., ζ_n, σ²_ε). The purpose of this fragment is to provide message updates for the variational posterior density functions of ν, ζ_1, . . ., ζ_n and σ²_ε at every iteration of the VMP algorithm. Here, we outline the construction of the messages that are passed from the factor representing the likelihood p(y | ν, ζ_1, . . ., ζ_n, σ²_ε) to each of its neighbouring stochastic nodes. The derivations of these messages, and of the expected values of the random variables, random vectors and random matrices that these messages depend on, are deferred to Appendix D.
The message from p(y|ν, ζ 1 , . . ., ζ n , σ 2 ε ) to ν can be shown to be proportional to a multivariate normal density function, with the vec-based exponential density function representation in (32). The update for the natural parameter vector in (32) is given in (33). Before proceeding to the other messages in this fragment, we make a brief comment on using the vec-based representation of the message in (32), as opposed to the storage-economical vech-based representation. In preliminary simulations, we found that computations using the vech-based representation were enormously hindered by the need to use a huge Moore-Penrose inverse matrix. For instance, consider the case where there are two basis functions (L = 2) and 25 O'Sullivan penalized spline basis functions (K = 25) for nonparametric regression. In this instance, the vector ν is 81 × 1 (d = 81) and the Moore-Penrose inverse matrix D + 81 has dimension 3321 × 6561, inhibiting the computational speed. For this reason, we have decided to use the vec-based representation of the message in (32), which does not require the use of a Moore-Penrose inverse matrix.
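To make the dimension bookkeeping concrete, the sketch below constructs the duplication matrix D_d that maps vech(A) to vec(A) for symmetric A; its Moore-Penrose inverse D_d^+ is the matrix the vech-based representation requires. This is an illustrative construction with function names of our choosing, not code from the paper:

```python
import numpy as np

def duplication_matrix(d):
    """Build D_d (d^2 x d(d+1)/2) satisfying D_d @ vech(A) == vec(A) for
    symmetric A, with column-major vec and column-wise lower-triangle vech."""
    D = np.zeros((d * d, d * (d + 1) // 2))
    col = 0
    for j in range(d):
        for i in range(j, d):
            D[j * d + i, col] = 1.0  # position of entry (i, j) in vec(A)
            D[i * d + j, col] = 1.0  # position of entry (j, i) in vec(A)
            col += 1
    return D

# For the case d = 81 discussed above, D_81 is 6561 x 3321,
# so its Moore-Penrose inverse D_81^+ is 3321 x 6561.
vech_len, vec_len = 81 * 82 // 2, 81 * 81
```

Even storing D_81^+ densely requires roughly 2 x 10^7 floating-point entries, which illustrates why the vec-based representation is preferred here.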
For each i = 1, . . ., n, the message from p(y|ν, ζ 1 , . . ., ζ n , σ 2 ε ) to ζ i takes the form in (35), which is proportional to a multivariate normal density function; the update for the natural parameter vector in (35) is given in (36). The message from p(y|ν, ζ 1 , . . ., ζ n , σ 2 ε ) to σ 2 ε takes the form in (38), and it is proportional to an inverse-χ 2 density function; the update for the natural parameter vector in (38) is given in (39). We must remember that the inverse-χ 2 density function message that is passed to σ 2 ε is part of the inverse G-Wishart class of density functions for VMP. Within this class of messages, a graph message is also required to specify whether the density function respects a full or a diagonal matrix. This graphical message does not affect inverse-χ 2 density function messages; however, we include a graph message with the aim of providing fragments that are compatible with previously constructed inverse G-Wishart fragments (Maestrini & Wand, 2020, Algorithms 1 & 2). According to Section 7.4 of Maestrini and Wand (2020), the auxiliary-based hierarchical prior specification of σ 2 ε in (26) requires the graphical message in (41). That is, the univariate random variable σ 2 ε is chosen to respect a "full" 1 × 1 matrix. Pseudocode for the functional principal component Gaussian likelihood fragment is presented in Algorithm 1. A derivation of all the relevant expectations and natural parameter vector updates is provided in Appendix D.
Algorithm 1 Pseudocode for the functional principal component Gaussian likelihood fragment.
1: Update all expectations with respect to the optimal posterior distribution (see Appendix D).

Functional Principal Component Gaussian Penalization Fragment
The functional principal component Gaussian penalization fragment, shown in red in Figure 2, is defined by the factor p(ν|σ 2 µ , σ 2 ψ 1 , . . ., σ 2 ψ L ), where all sub-vectors and sub-matrices are defined in (27). The purpose of this fragment is to provide message updates for the variational posterior density functions of ν, σ 2 µ and σ 2 ψ 1 , . . ., σ 2 ψ L at each iteration of the VMP algorithm. Here, as in Section 3.2, we outline the messages that are passed from this factor to its neighbouring stochastic nodes. For detailed derivations of these messages and all relevant expectations, we refer the reader to Appendix E.
First, let us introduce the relevant vector and matrix. Then, the message from p(ν|σ 2 µ , σ 2 ψ 1 , . . ., σ 2 ψ L ) to ν can be shown to be proportional to a multivariate normal density function, with a vec-based exponential density function representation. Once again, we have used a vec-based representation of the message to ν, as opposed to a storage-economical vech-based representation; the major reason for this is outlined in the discussion following (34). The message from p(ν|σ 2 µ , σ 2 ψ 1 , . . ., σ 2 ψ L ) to σ 2 µ , given in (46), is an inverse-χ 2 density function after normalization; the update for its natural parameter vector is derived in Appendix E. For l = 1, . . ., L, the message from p(ν|σ 2 µ , σ 2 ψ 1 , . . ., σ 2 ψ L ) to σ 2 ψ l is similar to the message to σ 2 µ : the message, given in (48), is an inverse-χ 2 density function after normalization, with an analogous natural parameter vector update. Finally, recall the discussion following (40). Each of the messages to the variance parameters σ 2 µ , σ 2 ψ 1 , . . ., σ 2 ψ L must be paired with a graph message. For the same reasons that were used to justify the graphical message in (41), the graph messages received by σ 2 µ , σ 2 ψ 1 , . . ., σ 2 ψ L take the same form as (41). Pseudocode for the functional principal component Gaussian penalization fragment is presented in Algorithm 2. A derivation of all the relevant expectations and natural parameter vector updates is provided in Appendix E.

Post-VMP Steps
The FPCA model for curve estimation (19), which has its genesis in the Karhunen-Loève decomposition (18), relies on orthogonal functional principal component eigenfunctions and independent vectors of scores with uncorrelated entries. However, the variational Bayesian FPCA resulting from a VMP treatment does not enforce any orthogonality restrictions on the resulting eigenfunctions. Although curve estimation is still valid without these constraints, interpretation of the analysis is more straightforward when they are imposed.

Algorithm 2 Pseudocode for the functional principal component Gaussian penalization fragment.

Updates:
1: Update all expectations with respect to the optimal posterior distribution.

Establishing the Optimal Posterior Density Functions
We are primarily concerned with the optimal posterior density functions for the vector of spline coefficients for the mean function and eigenfunctions, ν, and the vectors of principal component scores ζ 1 , . . ., ζ n . Upon convergence of the algorithm, the natural parameter vectors for these optimal posterior density functions are given according to (12). The optimal posterior density for each of these parameters is a Gaussian density function, where the mean vector E q (ν) and covariance matrix Cov q (ν) for q * (ν) can be computed from (15), and the corresponding parameters E q (ζ i ) and Cov q (ζ i ) for q * (ζ i ), i = 1, . . ., n, can be computed from (16). Note that we partition E q (ν) as E q (ν) = {E q (ν µ ) , E q (ν ψ 1 ) , . . ., E q (ν ψ L ) } in a similar fashion to (25).
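The conversion behind these moment computations is the standard one for a multivariate normal density in vec-based natural parameter form, η = (Σ⁻¹µ, −½ vec(Σ⁻¹)). A minimal sketch, with a function name of our choosing:

```python
import numpy as np

def gaussian_natural_to_moments(eta1, eta2, d):
    """Recover (mean, covariance) from vec-based natural parameters
    eta1 = Sigma^{-1} mu (length d) and eta2 = -0.5 * vec(Sigma^{-1})."""
    Sigma_inv = -2.0 * eta2.reshape(d, d)   # symmetric, so vec order is moot
    Sigma = np.linalg.inv(Sigma_inv)
    return Sigma @ eta1, Sigma
```

The mapping is a bijection on the natural parameter space, so the mean vector and covariance matrix can be recovered exactly once the natural parameter vectors have converged.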

Establishing a Vector Version of the Karhunen-Loève Decomposition
In this section, we outline a sequence of steps to establish orthogonal functional principal component eigenfunctions and uncorrelated scores. Note that we will treat the estimated functional principal component eigenfunctions as fixed curves that are estimated from the posterior mean of the spline coefficients E q (ν). As a consequence, the pointwise posterior variance in the response curve estimates results from the variance in the principal component scores alone. This treatment is in line with standard approaches to FPCA, where the randomness in the model is generated by the functional principal component scores (e.g. Yao et al., 2005; Benko, Härdle, & Kneip, 2009). Now, we outline the steps to construct orthogonal functional principal component eigenfunctions and uncorrelated scores. The existence and uniqueness of the eigenfunctions are justified by Theorem 3.1. First, set up an equidistant grid of design points t g = (t g1 , . . ., t gn g ) , where t g1 = 0, t gn g = 1 and n g is the length of the grid. Then define C g in an analogous fashion to (24). Establish the posterior estimates of the mean function E q {µ(t g )} = C g E q (ν µ ) and the functional principal component eigenfunctions E q {ψ l (t g )} = C g E q (ν ψ l ), l = 1, . . ., L. Then define the matrix Ψ whose lth column is E q {ψ l (t g )}. Establish the singular value decomposition Ψ = U ψ D ψ V ψ , where U ψ is an n g × L matrix consisting of the first L left singular vectors of Ψ, V ψ is an L × L matrix consisting of the right singular vectors of Ψ, and D ψ is an L × L diagonal matrix consisting of the singular values of Ψ.
Next, set m ζ to be the L × 1 sample mean vector of the columns of D ψ V ψ Ξ , and absorb it into the mean function. Then set C ζ to be the L × L sample covariance matrix of the columns of D ψ V ψ Ξ − m ζ 1 n and establish its spectral decomposition C ζ = QΛQ , where Λ is a diagonal matrix consisting of the eigenvalues of C ζ in descending order along its main diagonal and Q is the orthogonal matrix consisting of the corresponding eigenvectors of C ζ along its columns. Finally, define the rotated matrices Ψ and Ξ. Notice that Ψ is an n g × L matrix and Ξ is an n × L matrix. Next, partition these matrices into their columns. The columns of Ψ are orthonormal vectors, but we require continuous curves that are orthonormal in L 2 [0, 1]. We can adjust this by finding an approximation of || ψ l ||, l = 1, . . ., L, through numerical integration. This allows us to establish estimates of the orthonormal functions ψ * 1 , . . ., ψ * L in (19) over the vector t g , as well as estimates of the scores. Lemma 4.1 outlines the construction of posterior curve estimation for the response vectors y 1 (t g ), . . ., y n (t g ). Proposition 4.2 shows that the form of the predicted response vectors in Lemma 4.1 is a vector version of the Karhunen-Loève decomposition. Lemma 4.1. The posterior estimate for the response vector y i (t g ) is given by (54). Remark. The posterior estimates y 1 (t g ), . . ., y n (t g ) in (54) are the same as those prior to the post-processing steps outlined above. That is, they can be written in terms of E q (ν µ ), the posterior estimate of ν µ from the VMP algorithm, and similarly for E q (ζ il ) and E q (ν ψ l ). In summary, the post-processing steps simply realign the mean function, orthogonalize and normalize the eigenfunctions and decorrelate the scores, but do not affect the fits to the observed data.
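Under our reading of the steps above, the whole post-processing pass can be sketched as follows. The function name and the use of NumPy are ours, and the trapezoidal rule stands in for whichever quadrature is used to approximate the norms ||ψ l ||; the key property, as the Remark asserts, is that the fitted curves are left unchanged:

```python
import numpy as np

def orthogonalize_fpca(mu_g, Psi, Xi, t_g):
    """Post-VMP steps: rotate to orthonormal eigenfunctions and uncorrelated
    scores while preserving every fitted curve mu_g + Psi @ Xi[i].

    mu_g: (n_g,) posterior mean function on the grid t_g
    Psi:  (n_g, L) posterior mean eigenfunctions, one per column
    Xi:   (n, L) posterior mean score vectors, one per row
    """
    # Step 1: SVD of the stacked eigenfunctions, Psi = U D V'.
    U, dvals, Vt = np.linalg.svd(Psi, full_matrices=False)
    S = (dvals[:, None] * Vt) @ Xi.T                 # L x n rotated scores
    # Step 2: absorb the sample mean of the scores into the mean function.
    m = S.mean(axis=1)
    mu_star = mu_g + U @ m
    Sc = S - m[:, None]
    # Step 3: decorrelate the scores via the spectral decomposition of their
    # sample covariance, eigenvalues taken in descending order.
    lam, Q = np.linalg.eigh(np.cov(Sc))
    Q = Q[:, np.argsort(lam)[::-1]]
    Psi_tilde = U @ Q
    Zeta_tilde = (Q.T @ Sc).T                        # n x L
    # Step 4: rescale so each eigenfunction has unit L2[0, 1] norm,
    # approximated here by the trapezoidal rule on t_g.
    dt = np.diff(t_g)
    w = np.zeros_like(t_g)
    w[:-1] += dt / 2.0
    w[1:] += dt / 2.0
    norms = np.sqrt(w @ Psi_tilde**2)
    return mu_star, Psi_tilde / norms, Zeta_tilde * norms
```

Because each step is an invertible rotation or rescaling applied consistently to the eigenfunctions and the scores, the sum µ + Σ l ζ il ψ l is algebraically unchanged.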
Remark. Proposition 4.2 shows that the sample properties of the posterior estimates for the scores obey the assumptions on the scores in the Karhunen-Loève decomposition in (18). Furthermore, the vectors ψ 1 (t g ), . . ., ψ L (t g ) respect the orthogonality conditions in 2. Therefore, (54) may be interpreted as a vector version of the truncated Karhunen-Loève decomposition. As a consequence, the numerical estimates of || ψ l || 2 , l = 1, . . ., L, are the posterior estimates of the eigenvalues of the covariance operator Σ (see the first paragraph of Section 3).
The proof of Lemma 4.1 is presented in Appendix B, and the proof of Proposition 4.2 is presented in Appendix C.

Simulations
We illustrate the use of Algorithms 1 and 2 through a series of simulations of model (26). Pseudocode for the VMP algorithm is provided in Algorithm 3.

Accuracy Assessment
For model (26), we simulated 36 response curves with the number of observations T i for the ith curve sampled uniformly over {20, 21, . . ., 30}. Furthermore, the time observations within the ith curve {t i1 , . . ., t iT i } were sampled uniformly over the interval (0, 1), while the residual variance σ 2 ε was set to 1. The mean function was taken to be (55) and the eigenfunctions were taken to be (56). Each vector of principal component scores was simulated from a mean-zero Gaussian distribution. Nonparametric regression with O'Sullivan penalized splines for the nonlinear curves was performed with K = 10. Finally, the simulations were conducted by setting L = 3, rather than 2 (the true number of eigenfunctions), to assess the flexibility of the VMP algorithm under slight model misspecification.
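A simulation design of this kind can be sketched as below. The mean function, the two orthonormal eigenfunctions and the score standard deviations are illustrative placeholders of our choosing, not the functions used in the paper's simulations; only the sampling design (20 to 30 uniform time points per curve, unit residual variance) mirrors the setup above.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_fpca_data(n, sigma_eps=1.0):
    """Simulate irregular functional data from a truncated Karhunen-Loeve
    model: y_i(t) = mu(t) + sum_l zeta_il psi_l(t) + eps_it.
    mu and psi below are placeholders, not the paper's choices."""
    mu = lambda t: 3.0 * np.sin(np.pi * t)
    psi = (lambda t: np.sqrt(2.0) * np.sin(2.0 * np.pi * t),
           lambda t: np.sqrt(2.0) * np.cos(2.0 * np.pi * t))
    score_sd = (1.0, 0.5)                  # second mode contributes less
    data = []
    for _ in range(n):
        T_i = int(rng.integers(20, 31))    # 20, ..., 30 observations
        t_i = np.sort(rng.uniform(0.0, 1.0, T_i))
        zeta = rng.normal(0.0, score_sd)
        y_i = mu(t_i) + sum(z * p(t_i) for z, p in zip(zeta, psi))
        y_i = y_i + rng.normal(0.0, sigma_eps, T_i)
        data.append((t_i, y_i))
    return data
```

The placeholder eigenfunctions √2 sin(2πt) and √2 cos(2πt) are orthonormal in L 2 [0, 1], so the simulated curves satisfy the assumptions of the truncated Karhunen-Loève model.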
The results from the simulation are presented in Figure 3, where a random sample of four of the functional responses is selected for visual clarity. In addition, we have included the results from an MCMC treatment of model (26) in blue for comparison with the VMP-based variational Bayes fits in red. MCMC simulations were conducted through Rstan, the R (R Core Team, 2020) interface to the probabilistic programming language Stan (Stan Development Team, 2020). The variational Bayes fits have good agreement with their MCMC counterparts, as well as the simulated data. In particular, the post-VMP procedures that are outlined in Section 4 neatly complement the standard VMP algorithm.
We then incorporated five settings for the number of response curves: n ∈ {10, 50, 100, 250, 500}. For each of these settings, we conducted 100 simulations of model (26) with the aim of analysing the error of the posterior mean estimates of the mean curve in (55) and the functional principal component eigenfunctions in (56). The error of each simulation was determined via the integrated squared error in (58), where, in our simulations, f (•) represents the true function that generated the data, while f̂ (•) represents the VMP-based variational Bayes posterior mean curve.
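On a grid of evaluation points, the integrated squared error can be approximated with the trapezoidal rule; a minimal sketch (the function name is ours):

```python
import numpy as np

def integrated_squared_error(f_hat, f_true, t_g):
    """Trapezoidal approximation of ISE = integral of (f_hat - f_true)^2
    over t_g, where both curves are evaluated on the grid t_g."""
    sq = (np.asarray(f_hat) - np.asarray(f_true)) ** 2
    dt = np.diff(t_g)
    return float(np.sum(dt * (sq[:-1] + sq[1:]) / 2.0))
```

Any quadrature rule on the grid would serve; the trapezoidal rule is accurate enough here because the squared-error curve is smooth.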
The box plots for the logarithm of the integrated squared error values in Figure 4 (a) reflect the excellent results for the settings where n = 50, 100, 250 and 500. Overall, the results for the setting where n = 10 are good; however, there are a few simulations where the posterior estimates of the second functional principal component eigenfunction ψ 2 (•) are poor. This is to be expected because the scores associated with this eigenfunction were generated from a N(0, 0.25) distribution, reflecting its weaker contribution to the data generation process. Also, as expected, the error scores for all curves tend to decline with increasing n. In Figure 4 (b), we present all of the simulated posterior mean curves of the mean function and the functional principal component eigenfunctions for the cases where n = 100 and n = 500, overlaid with the true functions in blue. These plots demonstrate the strength of the VMP algorithm for estimating the underlying curves that generate an observed set of functional data. Furthermore, the variability in the curve estimates is drastically reduced with increasing n. In addition to each of the VMP posterior estimates, we have included the pointwise mean curve and the pointwise 95% confidence interval for the MCMC simulations for each of the generated data sets. Evidently, there is strong agreement between the VMP simulations and the MCMC simulations.

Computational Speed Comparisons
In the previous section, we saw that the mean field product restriction in (29) does not compromise the accuracy of variational Bayesian inference for FPCA. However, the major advantage offered by variational Bayesian inference via VMP is fast approximate inference in comparison to MCMC simulations. Several published articles have addressed the computational speed gains from using variational Bayesian inference. Faes, Ormerod, and Wand (2011) presented speed gains for parametric and nonparametric regression with missing data, Luts and Wand (2015) presented timing comparisons for semiparametric regression models with count responses, and Lee and Wand (2016) and Nolan et al. (2020) established speed gains for multilevel data models with streamlined matrix algebraic results. In all cases, the variational Bayesian inference algorithms had strong accuracy in comparison to MCMC simulations and were far superior in computational speed.
In Table 1, we present a similar set of results for the computational speed of VMP and MCMC for model (26). The simulations were identical to those that were used to generate the results in Figure 4, with 100 simulations over five settings for the number of response curves n ∈ {10, 50, 100, 250, 500}. In addition, the simulations were performed on a laptop computer with 8 GB of random access memory and a 1.6 GHz processor. In Table 1, we present the median elapsed computing time (in seconds), with the first quartile and the third quartile shown in brackets. Notice that most of the VMP simulations are completed within 1 minute, whereas the elapsed computing time for the MCMC simulations tends to vary from approximately 1 minute, for n = 10, to over an hour, for n = 500. The most impressive results are in the fourth column: the median VMP simulation is 19.6 times faster than the median MCMC simulation for n = 10, 34.3 times faster for n = 50, 37.9 times faster for n = 100, 49.0 times faster for n = 250 and 59.8 times faster for n = 500.

Application: United States Temperature Data
We now provide an illustration of our methodology with an application to temperature data collected from various United States weather stations, which is available from the rnoaa package (Chamberlain et al., 2021) in R. The rnoaa package is an interface to the National Oceanic and Atmospheric Administration's climate data. The function ghcnd_stations() provides access to all available global historical climatology network daily weather data for each weather site from 1960 to 1994. The information includes the longitude and latitude for each site, and this was used to determine the state or the federal district of the site. Our analysis focused on maximum daily temperature, averaged over the 25 years of available data.
From this package, we collected full data sets (data available for every day of the year) from 2837 weather stations, covering 49 states and federal districts. For each state or federal district, we took a random sample of 3 of the available sites. In cases where there were fewer than 3 sites available (Rhode Island and the District of Columbia), we used all available sites. This resulted in 145 sites used in our application, with 365 observations for each site.
Chapter 8 of Ramsay and Silverman (2005) conducts a similar analysis of Canadian temperature data from various weather stations. In their application, they uncovered four functional principal component eigenfunctions. Similarly, we conducted VMP simulations with L = 4. The results are presented in Figure 5. In Figure 5 (a), we display the results for four randomly selected weather stations, from four different states. There is relatively small residual variability in the observed dataset because we are using long-term averages. As a consequence, the pointwise 95% credible sets would not be visible in the plots, so we have only included the pointwise variational Bayesian posterior means. In Figure 5 (b), we present the pointwise posterior estimates for each eigenfunction (top panel) and the effect of perturbing the estimated mean function with each eigenfunction (bottom panel): µ(t) ± δ l ψ l (t), l = 1, . . ., 4. The plus (minus) signs indicate the shift that each eigenfunction makes to the mean function with a positive (negative) perturbation. In addition, the value of δ l was simply selected such that the effect of the perturbation would be visibly apparent. Note that the top and bottom panels in Figure 5 (b) should be analysed concurrently when determining the effect of each eigenfunction.
The bottom panel for the first eigenfunction (which accounts for 91.5% of the total variation) shows that it is a mean shift that perturbs the mean function in the positive (negative) direction when it is added (subtracted). The top panel shows that this effect is stronger in the Winter months than the Summer months, indicating that US temperature is most variable in Winter. A similar analysis of the second eigenfunction (which accounts for 6.5% of the total variation) shows that it represents uniformity in the measured temperatures. It perturbs the mean function in the negative (positive) direction in the Summer (Winter) months when it is added. As a consequence, weather stations at locations with larger discrepancies between Winter and Summer temperatures will have a strong and negative score for this eigenfunction. The third and fourth eigenfunctions are harder to interpret, given their weak contributions to the total variation. The scores associated with the first two eigenfunctions for the displayed weather stations are (0.21, -1.26) for the weather station in Louisiana, (0.32, -0.30) for the weather station in Michigan, (-5.24, -0.04) for the weather station in Montana and (-0.66, 0.85) for the weather station in North Dakota. The scores for the first eigenfunction indicate that slightly higher than average temperatures are to be expected in Louisiana and Michigan, slightly lower than average temperatures in North Dakota and well below average temperatures in Montana. Furthermore, the scores for the second eigenfunction show that the greatest variability between Summer and Winter months can be found in Louisiana, whereas North Dakota appears more uniform in comparison. In addition, Michigan experiences larger than average differences in temperature between Summer and Winter, while the difference between Winter and Summer temperatures in Montana is close to the average difference in the United States.

Closing Remarks
We have provided a comprehensive overview of Bayesian FPCA with a VMP-based mean field variational Bayes approach. Our coverage has focused on the Gaussian likelihood specification for the observed data, and it includes the introduction of two new fragments for VMP: 1. the functional principal component Gaussian likelihood fragment (Algorithm 1); and 2. the functional principal component Gaussian penalization fragment (Algorithm 2).
These are directly compatible with the fragment-based computational constructions of VMP outlined in Wand (2017).This is, to our knowledge, the first VMP construction of a Bayesian FPCA model.In addition, we have outlined a sequence of post-VMP steps that are necessary for producing orthonormal functional principal component eigenfunctions and uncorrelated scores.
Simulations were conducted to assess the speed and accuracy of the VMP simulations against MCMC counterparts. The approximate variational posterior density functions were in good agreement with the MCMC estimates, and the VMP algorithm was approximately 20 to 60 times faster than the MCMC algorithm, depending on the number of response curves. An application to a large US temperature dataset showed that the VMP-based FPCA algorithm can be used for strong inference in big data applications.
This study could be extended to other functional data models, such as function-on-scalar or function-on-vector regression models, that are yet to be treated under a VMP-based mean field variational Bayes approach. In addition, extending the likelihood specification to generalized outcomes would address a popular area of research in functional data analysis.
Next, establish the eigendecomposition of Σ ι , such that Σ ι = Q ι Λ ι Q ι , where Λ ι is a diagonal matrix consisting of the eigenvalues of Σ ι in descending order, and the columns of Q ι are the corresponding eigenvectors. Then, it can be easily seen that the transformed vectors are distributed as N(0, Λ ι ), i = 1, . . ., n.
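The claim can be checked empirically: rotating draws from N(0, Σ) by the eigenvector matrix yields coordinates whose sample covariance is approximately the diagonal matrix of eigenvalues. A generic sketch with an arbitrary positive-definite Σ (not the paper's Σ ι ):

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary positive-definite covariance and its eigendecomposition
A = rng.normal(size=(3, 3))
Sigma = A @ A.T + 3.0 * np.eye(3)
lam, Q = np.linalg.eigh(Sigma)
order = np.argsort(lam)[::-1]          # descending eigenvalues, as in the text
lam, Q = lam[order], Q[:, order]
# Rotating x_i ~ N(0, Sigma) by Q' decorrelates the coordinates
x = rng.multivariate_normal(np.zeros(3), Sigma, size=200_000)
z = x @ Q                              # row i holds Q' x_i
C = np.cov(z.T)                        # approximately diag(lam)
```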

B Proof of Lemma 4.1
To prove (54), we first note that the posterior curve estimates from the VMP algorithm satisfy (60). Now, let c be the L × 1 vector with || ψ l || as the lth entry, l = 1, . . ., L. Furthermore, let 1/c be the L × 1 vector with 1/|| ψ l || as the lth entry, l = 1, . . ., L. Recall that we can approximate these values through numerical integration. It is then easy to see that (60) implies (54).

C Proof of Proposition 4.2
The independence of ζ 1 , . . ., ζ n is a consequence of the independence assumption in (28). Let c and 1/c retain their definitions from Appendix B, and recall that m ζ is the mean vector of the columns of D ψ V ψ Ξ . It is then easy to see that the results for the estimated scores follow.
From Lemma 4.1, the sample covariance matrix of y 1 (t g ), . . ., y n (t g ) can be computed directly. Simple rearrangement confirms that this is the eigenvalue decomposition of the sample covariance matrix of y 1 (t g ), . . ., y n (t g ), proving the result for the vectors ψ 1 (t g ), . . ., ψ L (t g ).

D Derivation of the Functional Principal Component Gaussian Likelihood Fragment
From (31), we have the log-likelihood terms for i = 1, . . ., n. First, we establish the natural parameter vector for each of the optimal posterior density functions. These natural parameter vectors are essential for determining expectations with respect to the optimal posterior distribution. According to (13), the natural parameter vectors for q(ν) and q(σ 2 ε ) take standard exponential family forms. Next, we consider the updates for standard expectations that occur for each of the random variables and random vectors in (61). For ν, we need to determine the mean vector E q (ν) and the covariance matrix Cov q (ν). The expectations are taken with respect to the normalization of the product of messages received by ν, which is a multivariate normal density function with natural parameter vector η p(y|ν,ζ 1 ,...,ζ n ,σ 2 ε )↔ν ; from (15), we obtain the mean vector and the covariance matrix. Similarly, for each i = 1, . . ., n, we need to determine the optimal mean vector and covariance matrix for ζ i , which are E q (ζ i ) and Cov q (ζ i ), respectively. The expectations are taken with respect to the normalization of the product of messages received by ζ i , which is a multivariate normal density function with natural parameter vector η p(y|ν,ζ 1 ,...,ζ n ,σ 2 ε )↔ζ i , and follow from (16). Finally, for σ 2 ε , we need to determine E q (1/σ 2 ε ), with the expectation taken with respect to the normalization of the product of messages received by σ 2 ε . This is an inverse-χ 2 density function, with natural parameter vector η p(y|ν,ζ 1 ,...,ζ n ,σ 2 ε )↔σ 2 ε , and the expectation follows from Result 6 of Maestrini and Wand (2020). Now, we turn our attention to the derivation of the message passed from p(y|ν, ζ 1 , . . ., ζ n , σ 2 ε ) to ν. As a function of ν, (61) can be re-written as a quadratic form in ν, where, for each i = 1, . . ., n, ζ i is defined in (34). From (9) and (30), the message from the factor p(y|ν, ζ 1 , . .
., ζ n , σ 2 ε ) to ν is as given in (32), which is proportional to a multivariate normal density function. The update for the message's natural parameter vector, in (33), depends upon the mean vector and covariance matrix of ζ i , where E q (ζ i ) and Cov q (ζ i ) are defined in (65). Note that a standard statistical result allows us to write the expected quadratic forms in terms of these quantities. Next, notice that the design components can be expressed in terms of V ψ , which is defined in (37). Then, for each i = 1, . . ., n, the log-density function in (61) can be represented as a function of ζ i , where h µψ,i and H ψ,i are also defined in (37). From (9) and (30), the message from the factor p(y|ν, ζ 1 , . . ., ζ n , σ 2 ε ) to ζ i is as given in (35), which is proportional to a multivariate normal density function. The message's natural parameter vector update, in (36), is dependent on the following expectations that are yet to be determined: for l = 1, . . ., L, E q (ν ψ l ) is defined by (62) and (63); E q (h µψ,i ) is an L × 1 vector, whose lth component depends on sub-vectors of E q (ν) and sub-blocks of Cov q (ν) that are defined in (63) and (64), respectively; and E q (H ψ,i ) is an L × L matrix, with (l, l ) component determined similarly. The final message to consider is the message from p(y|ν, ζ 1 , . . ., ζ n , σ 2 ε ) to σ 2 ε , where V is defined in (40) and, for each i = 1, . . ., n, ζ i is defined in (34). From (9) and (30), this message is as given in (38), which is proportional to an inverse-χ 2 density function. The message's natural parameter vector, in (39), depends on the mean of the squared norm ||y i − C i V ζ i || 2 , for i = 1, . . ., n. This expectation involves the matrices and vectors h µ,i ≡ ν µ C i C i ν µ , for i = 1, . . ., n.
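The "standard statistical result" invoked above is E[x' A x] = tr(A Σ) + µ' A µ for x ~ N(µ, Σ), which is what allows expected squared norms to be written in terms of posterior means and covariances. A quick numerical check, in generic notation rather than the paper's:

```python
import numpy as np

rng = np.random.default_rng(2)

def expected_quadratic_form(mu, Sigma, A):
    """E[x' A x] for x ~ N(mu, Sigma), via trace(A Sigma) + mu' A mu."""
    return float(np.trace(A @ Sigma) + mu @ A @ mu)

# Monte Carlo confirmation for a small example
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
A = np.array([[1.0, 0.2], [0.2, 3.0]])
x = rng.multivariate_normal(mu, Sigma, size=500_000)
mc = np.mean(np.einsum("ni,ij,nj->n", x, A, x))
```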
For each i = 1, . . ., n, the mean vector E q ( ζ i ) and covariance matrix Cov q ( ζ i ) are defined in (67). However, E q (V ) and E q (H i ), i = 1, . . ., n, are yet to be determined. From (40), E q (V ) = E q (ν µ ) E q (ν ψ 1 ) . . . E q (ν ψ L ) , where the component mean vectors are defined by (63). For each i = 1, . . ., n, the expectation of H i , defined in (73), with respect to the optimal posterior distribution is E q (H i ) ≡ E q (h µ,i ) E q (h µψ,i ) E q (h µψ,i ) E q (H ψ,i ) , where h µ,i is defined in (74). Furthermore, E q (h µψ,i ) and E q (H ψ,i ) are defined in (71) and (72), respectively. The functional principal component Gaussian likelihood fragment, summarized in Algorithm 1, is a proceduralization of these results.
The functional principal component Gaussian penalization fragment, summarized in Algorithm 2, is a proceduralization of these results.
dt are the principal component scores. The ζ * il are independent across i and uncorrelated across l, with E(ζ * il ) = 0 and Var(ζ

Figure 2 :
Figure 2: The factor graph for the Bayesian FPCA model in (26).

Figure 3 :
Figure 3: The results from one simulation of the Gaussian response FPCA model in (26). The simulation parameters are outlined in Section 5. In (a), the simulated data are shown in black, while the VMP-based variational Bayes posterior estimates are presented in red and the corresponding MCMC estimates are shown in blue. In each panel, the solid lines represent the posterior mean, while the dashed lines represent the 95% pointwise credible sets for the mean. In (b), we present the vector of scores for each of the randomly selected response curves, shown in black, as well as the VMP-based variational Bayes posterior estimates, shown in red, and the MCMC-based posterior estimates, shown in blue. The red and blue dots represent the VMP-based variational Bayes posterior means and the MCMC-based posterior means, respectively. The ellipses represent the 95% credible contours.

Figure 4 :
Figure 4: The results from a simulation study of the Gaussian response FPCA model in (26). The simulation parameters are outlined in Section 5.1. The box plots in (a) summarize the logarithm of the integrated squared error values in (58) for 100 simulations of each of the settings n ∈ {10, 50, 100, 250, 500}. In (b), we present the results for the mean function and the eigenfunctions when n = 100 (top row) and n = 500 (bottom row). The true functions are shown in blue in each panel, and the VMP-based posterior mean curve for each of the simulations is presented in red. In addition, we have included the pointwise mean curve and the pointwise 95% confidence intervals resulting from the MCMC posterior estimates for each of the generated datasets in black. Note that, in each panel, the pointwise MCMC mean curve overlaps very tightly with the true curve, making it difficult to see.

Figure 5 :
Figure 5: Application of the VMP algorithm for FPCA to the United States temperature data. The fits in (a) are for four randomly selected weather stations in the dataset. The plots in (b) present the pointwise posterior mean estimates of the eigenfunctions (top panel) and show the estimated mean function with perturbations from each eigenfunction (bottom panel): µ(t) ± δ l ψ l (t), l = 1, . . ., 4.
then we say that X has an inverse G-Wishart distribution with graph G, shape parameter ξ > 0 and scale matrix Λ. The fragments for p(y|ν, ζ 1 , . . ., ζ n , σ 2 ε ) and p(ν|σ 2 µ , σ 2 ψ 1 , . . ., σ 2 ψ L ) have not been addressed in previous literature on VMP, but are crucial for FPCA modelling. We name the fragment for p(y|ν, ζ 1 , . . ., ζ n , σ 2 ε ) the functional principal component Gaussian likelihood fragment and the fragment for p(ν|σ 2 µ , σ 2 ψ 1 , . . ., σ 2 ψ L ) the functional principal component Gaussian penalization fragment.

Table 1 :
Median (first quartile, third quartile) elapsed computing time in seconds for VMP and MCMC with n ∈ {10, 50, 100, 250, 500}. The fourth column presents the ratio of the median elapsed time for MCMC to the median elapsed time for VMP.