High-dimensional Functional Graphical Model Structure Learning via Neighborhood Selection Approach

Undirected graphical models are widely used to model the conditional independence structure of vector-valued data. However, in many modern applications, for example those involving EEG and fMRI data, observations are more appropriately modeled as multivariate random functions rather than vectors. Functional graphical models have been proposed to model the conditional independence structure of such functional data. We propose a neighborhood selection approach to estimate the structure of Gaussian functional graphical models, where we first estimate the neighborhood of each node via a function-on-function regression and subsequently recover the entire graph structure by combining the estimated neighborhoods. Our approach only requires assumptions on the conditional distributions of random functions, and we estimate the conditional independence structure directly. We thus circumvent the need for a well-defined precision operator that may not exist when the functions are infinite dimensional. Additionally, the neighborhood selection approach is computationally efficient and can be easily parallelized. The statistical consistency of the proposed method in the high-dimensional setting is supported by both theory and experimental results. In addition, we study the effect of the choice of the function basis used for dimensionality reduction in an intermediate step. We give a heuristic criterion for choosing a function basis and motivate two practically useful choices, which we justify by both theory and experiments.


Introduction
Multivariate functional data are collected in applications such as neuroscience, medical science, traffic monitoring, and finance. Each observation in a data set is a multivariate random function rather than a random vector. Typically, each observation is recorded as a multivariate time series, but it can be interpreted as a realization of a multivariate stochastic process in continuous time. Such an interpretation provides a unifying approach to the analysis of classical functional data and longitudinal data, where functional data methods can handle sparsely observed, noisy measurements (Chiou and Müller, 2016).
The focus of our work is characterizing the conditional independence structure of multivariate random functions. As a motivation, we consider data gathered from fMRI scans obtained over 116 brain regions with a time-series signal recorded for each region (Milham et al., 2012). The sample consists of two groups: one group is comprised of patients diagnosed with Attention Deficit Hyperactivity Disorder (ADHD), while the other is a control group. Our goal is to understand the functional connectivity patterns between brain regions for both the ADHD and control groups. Functional connectivity can be uncovered by learning the conditional independence structure across the 116 random functions.
Graphical models are widely used to represent the conditional independence structure of multivariate random variables (Lauritzen, 1996). Let G = {V, E} denote an undirected graph, where V is the set of vertices and E ⊂ V² is the set of edges. When the data consist of random vectors X = (X_1, . . . , X_p)⊤, we say that X satisfies the pairwise Markov property with respect to G if X_v ⊥⊥ X_w | {X_u}_{u∈V∖{v,w}} holds if and only if {v, w} ∉ E. This notion has been extended to functional graphical models, where each node represents a random function rather than a random scalar, in order to characterize the conditional independence relationships of multivariate random functions.
We propose a procedure to estimate the functional graphical model when the random functions follow a multivariate Gaussian process (MGP). This setting was considered in Qiao et al. (2019), who proposed the functional graphical lasso to estimate the graph structure. Their procedure first obtains a finite dimensional representation of the observed multivariate functions using functional principal component analysis (FPCA). Subsequently, a precision matrix is computed from the projection scores of the finite dimensional representation using a graphical lasso objective with a group penalty. The graph structure is finally obtained from the non-zero blocks of the estimated precision matrix. When the underlying random functions are infinite dimensional, the corresponding covariance operator is a compact operator, and its inverse, the precision operator, is ill-defined (Hsing and Eubank, 2015). As a result, Qiao et al. (2019) ensure their estimand is well defined by assuming that the random functions lie in a finite dimensional space. However, this assumption excludes infinite dimensional functional data and is restrictive.
In contrast to the functional graphical lasso proposed by Qiao et al. (2019), we propose a neighborhood selection approach to estimate Gaussian functional graphical models. For vector-valued Gaussian graphical models, Meinshausen and Bühlmann (2006) proposed a neighborhood selection procedure that estimates the neighborhood-the set of adjacent nodes in a conditional independence graph-for each node separately by sparse regression. The entire graph structure is then estimated by combining estimates of node-specific neighborhoods. We extend their approach to the functional data setting. This allows us to avoid defining the precision operator, and, as a result, our theory extends to truly infinite dimensional functional data.
We cast the neighborhood selection procedure as a function-on-function regression problem. Due to the infinite dimensional nature of the functional data, we first project all observed random functions onto a finite dimensional basis. Thus, we approximate the function-on-function regression with a vector-on-vector regression problem that is solved by minimizing a squared error loss with a group lasso penalty. We do not require a specific choice of function basis for our methodology and the corresponding theory, and we provide theoretically guided intuition on the choice of function basis under different conditions. Specifically, when estimating the neighborhood of a target node, we project all functions onto a single subspace instead of projecting each function onto its own subspace. In Section 2.4 we discuss the intuition for why this choice is preferable; the observation is supported by both the theory in Section 4.3 and the simulations in Section 5.
Besides the methodological contributions, we also provide nontrivial theoretical contributions. Most importantly, by directly estimating the conditional independence structure without reference to a population "precision operator," we do not require that the functional data is finite dimensional. However, in the infinite-dimensional setting there will be a residual term due to using a finite dimensional approximation, and deriving error bounds requires a careful analysis of this residual term. Finally, our theory is non-asymptotic in nature, and we derive finite-sample guarantees for graph recovery.
In summary, the neighborhood selection approach yields at least three benefits. First, it allows us to define functional graphical models directly from the conditional covariance and does not require the notion of a precision operator. As a result, we can estimate the graph structure even from infinite dimensional data, rather than restricting the data to finite dimensional functions. Second, by estimating the neighborhood of each node separately, we have increased flexibility in choosing the function basis used to represent the random functions, and tailoring the function basis to the specific task at hand results in empirically better estimation. Finally, when estimating the neighborhood of a node, we only need to handle p individual M × M matrices, and the neighborhood estimation for different nodes can be carried out in parallel, leading to a highly efficient procedure. In comparison, fglasso (Qiao et al., 2019) needs to estimate a pM × pM matrix, which is computationally much more expensive. We demonstrate the practical value of our neighborhood selection method on the motivating fMRI data set.

Related Work
Our paper contributes to the growing literature on modeling multivariate functional data. We study the estimation of conditional independence structure from multivariate functional data in the setting of MGPs (Qiao et al., 2019). For each component of the multivariate functional data, Qiao et al. (2019) projected the observed functions onto the corresponding function basis estimated by FPCA. Subsequently, the graph structure is estimated from the projection scores using the multitask extension of the graphical lasso (Kolar et al., 2013, 2014), which estimates the precision matrix with a block structure. However, the precision operator is ill-defined when the functional data is infinite dimensional, and their method is computationally expensive when the number of nodes is large. In the same setting, Qiao et al. (2017) proposed a dynamic functional graphical model that allows the graph structure to change over time. Zhu et al. (2016) proposed a Bayesian approach to functional graphical models. Zapata et al. (2021) studied estimation of the graph structure under the assumption of partial separability. Roughly speaking, partial separability assumes that the time-varying covariance of the MGP can be decomposed node-wise into a time-varying component and a constant. Lynch and Chen (2018) proposed a test to verify the partial separability assumption. However, the assumption can be restrictive and may not hold in many settings. Zhao et al. (2019a) and Zhao et al. (2020) discussed direct estimation of the difference between two Gaussian functional graphical models without the need to estimate each individual structure. Solea and Li (2020) extended the Gaussian functional graphical model to a copula version by allowing monotonic transformations of the FPCA scores.
Our paper is also related to the literature on function-on-function regression, which studies regression problems where both the response and the predictor variables are functions. Luo and Qi (2017) and Qi and Luo (2018) studied function-on-function linear regression, where each predictor function is transformed by a corresponding integral operator defined by a bivariate coefficient function, and the sum of all transformed predictor functions is defined as the signal function. The response function is then assumed to be a simple addition of the signal, intercept, and noise functions. To estimate the coefficient functions, they used the FPCA basis of the signal function to expand the observed functions and transform the original function-on-function regression problem into a function-on-scalar regression problem with uncorrelated predictors. Ultimately, this regression is solved by a penalized least-squares method. In contrast, we focus on variable selection, rather than prediction, and develop non-asymptotic theory. Luo et al. (2016) also converted the function-on-function regression problem to function-on-scalar regression by projecting predictor functions onto basis functions, but they restricted their choice of function basis to the wavelet transform, so their method can be considered a special case of our approach. Ivanescu et al. (2015) discussed a method similar to what we propose, but they did not give any guidance on how to choose a function basis, nor did they offer any theoretical guarantees. These approaches can be treated as special cases within our framework with a specific choice of function basis. In addition to linear function-on-function regression, Qi and Luo (2019) and Scheipl et al. (2015) studied functional additive models, but did not give theoretical results about variable selection in the high-dimensional setting.
While finishing this paper, we were made aware of concurrent work by Solea and Dette (2021) that also uses a neighborhood selection approach to estimate a graphical model from functional data. There are several differences between our approaches. First, when estimating the neighborhood of a particular node, Solea and Dette (2021) project each observed function onto its own eigenfunction basis. We consider several alternative choices for the functional basis expansion and suggest a different approach: when estimating the neighborhood of a node j, we project all observed functions onto the same basis, typically the eigenbasis of the j-th node. However, we may use different bases when estimating different neighborhoods. Second, Solea and Dette (2021) assume a slightly more general setting where the FPCA scores arise from an additive regression model, whereas the scores arise from a linear model in our work. However, their theoretical guarantees require a functional irrepresentability condition (Assumption 4.8 therein), which we do not require.
Notation

We use bold lower case letters (e.g., a and b) to denote vectors and bold upper case letters (e.g., A and B) to denote matrices. For a vector a ∈ R^n, let ‖a‖_q denote its ℓ_q-norm, q ∈ [1, ∞), with the usual extension to q = 0 and q = ∞. For a set of indices I ⊆ [n], we use a_I to denote the vector in R^n with a_{I,i} = a_i for all i ∈ I and a_{I,i} = 0 for all i ∉ I. Let G = {G_1, G_2, . . . , G_{N_G}} be a partitioning of the set [n] into N_G disjoint groups. The mixed norm ‖·‖_{1,q} is defined as ‖a‖_{1,q} = Σ_{l=1}^{N_G} ‖a_{G_l}‖_q. For a symmetric matrix B, we use ρ_max(B) to denote its largest eigenvalue, ρ_min(B) to denote its smallest eigenvalue, and tr(B) to denote its trace. For a matrix A ∈ R^{n_1×n_2}, we use vec(A) to denote the vector in R^{n_1 n_2} formed by stacking the columns of A. For two matrices A ∈ R^{n×m} and B ∈ R^{r×s}, A ⊗ B ∈ R^{nr×ms} denotes their Kronecker product. We use |||A|||_∞ to denote the elementwise maximum absolute value of A, that is, |||A|||_∞ = max_{i,j} |A_{ij}|. For a real-valued differentiable function f : R^n → R, we use ∇f(x) ∈ R^n to denote its gradient at a point x ∈ R^n. For real-valued functions f(·), g(·) defined on T ⊂ R, we use ‖f‖ = (∫_T f²(t) dt)^{1/2} to denote the L²-norm of f and ⟨f, g⟩ = ∫_T f(t) g(t) dt to denote the inner product. For a bivariate function B(t′, t) defined on T × T, we use ‖B‖_HS = ‖B(t′, t)‖_HS = (∫_{T×T} B²(t′, t) dt′ dt)^{1/2} to denote its Hilbert–Schmidt norm. We use f(·) = (f_1(·), f_2(·), . . . , f_n(·))⊤ to denote a vector with function entries.
For any two sequences {a_n}_{n≥1} and {b_n}_{n≥1}, we write a_n ≲ b_n or a_n = O(b_n) (respectively, a_n ≳ b_n or a_n = Ω(b_n)) if there exists a constant c ≥ 0 such that a_n ≤ c · b_n (respectively, a_n ≥ c · b_n) for n large enough. Similarly, we use a_n = Õ(b_n) to ignore logarithmic factors asymptotically; that is, a_n = Õ(b_n) if a_n = O(b_n log^k b_n) for some k ≥ 0. In this paper, Õ(·) ignores log terms in the sample size, but we keep log terms in other quantities, such as the number of vertices and the dimension of the truncated functions.

Outline of the Paper
The rest of the paper is organized as follows. In Section 2, we introduce the functional graphical model and our methodology for estimating the graph structure. In Section 3, we discuss the optimization algorithm used to compute the estimator. We develop theoretical guarantees for our approach in Section 4. Results on simulated and real data are reported in Section 5 and Section 6, respectively. We conclude the paper with a discussion in Section 7. Code to replicate the results in this paper is available at https://github.com/PercyZhai/FGM_Neighborhood.

Methodology
In this section, we briefly review the functional graphical model in Section 2.1. We introduce a neighborhood selection procedure for estimating the functional graphical model in Section 2.2 and discuss a practical implementation in the subsequent subsections.

Functional Graphical Model
Let g(·) = (g_1(·), g_2(·), . . . , g_p(·))⊤ be a p-dimensional multivariate Gaussian process (MGP) with mean zero and a common domain T, where T is a closed subset of the real line. We also assume that each function g_j(·), j ∈ [p], is a random element of a Hilbert space H and E[‖g_j‖²] < ∞. Following Qiao et al. (2019), for two random functions g_j and g_l, the conditional cross-covariance function is defined as

C_jl(t′, t) = Cov( g_j(t′), g_l(t) | {g_u(·)}_{u∈V∖{j,l}} ).

Under the multivariate Gaussian process assumption, if C_jl(t′, t) = 0 for all t′, t ∈ T, then the random functions g_j(·) and g_l(·) are conditionally independent given the other random functions; i.e., g_j(·) ⊥⊥ g_l(·) | {g_u(·)}_{u∈V∖{j,l}}. We encode these relationships in an undirected graph G = {V, E}, where V = [p] is the set of vertices and E ⊂ V² is the set of edges,

E = { (j, l) ∈ V² : C_jl(t′, t) ≠ 0 for some t′, t ∈ T, j ≠ l }.

The edge set E encodes the pairwise Markov property of g(·) (Lauritzen, 1996). Let g_i(·) = (g_i1(·), g_i2(·), . . . , g_ip(·))⊤ denote a random copy of g(·). The goal of this work is to estimate the edge set E given n i.i.d. random copies {g_i(·)}_{i=1}^n. Qiao et al. (2019) proposed to estimate E by a functional graphical lasso procedure. In contrast, we propose a neighborhood selection approach detailed in the subsequent section.

Functional Neighborhood Selection
We develop a neighborhood selection procedure for estimating the functional graphical model. The neighborhood selection approach can be traced back to Besag (1975) and was further developed for Gaussian graphical models in a high-dimensional setting by Meinshausen and Bühlmann (2006). Specifically, Meinshausen and Bühlmann (2006) estimated the conditional independence graph when the vector-valued data X = (X_k)_{k∈[p]} are drawn from a multivariate Gaussian distribution. Properties of the multivariate Gaussian ensure that for each j ∈ [p], there exist {β_jk}_{k≠j} such that

X_j = Σ_{k≠j} β_jk X_k + ε_j,

where ε_j is normally distributed and independent of all X_k, k ≠ j. It can be shown that the set of nodes adjacent to the node j in the conditional independence graph, i.e., the neighborhood of j, which we denote N_j = {k : (j, k) ∈ E}, is equal to the set {k ∈ [p]∖{j} : β_jk ≠ 0}. Thus, Meinshausen and Bühlmann (2006) use the variables selected by a penalized regression of X_j onto all other variables to estimate N_j; specifically, N̂_j = {k ∈ [p]∖{j} : β̂_jk ≠ 0}. After estimating each neighborhood, they combine the estimates into a single estimate of the entire graph G. We consider random functions and assume that each random function has the following representation, which is analogous to the multivariate Gaussian setting:

g_ij(t) = Σ_{k≠j} ∫_T β_jk(t, t′) g_ik(t′) dt′ + e_ij(t),     (3)

where e_ij(·) ⊥⊥ g_ik(·), k ≠ j, and ‖β_jk(t, t′)‖_HS < ∞.

Algorithm 1: Functional neighborhood selection
for j ∈ V do
    Estimate the projection basis φ_j, if it is not known in advance
    Use (5) to calculate projection scores for all observed functions on φ_j
    Given projection scores, solve (7)
    Estimate N̂_j using (8)
end for
Combine all neighborhoods into the estimated edge set using the AND/OR rule
Output: Return Ê

When the index is clear from the context, we will drop the subscript j from β_jk(t, t′). In addition, β_jk(t, t′) = 0 for all t, t′ is equivalent to C_jl(t, t′) = 0 for all t, t′.
Given the representation in (3), we adapt the neighborhood selection approach to functional data and seek to construct an estimate of the graph by first estimating each neighborhood

N_j = { k ∈ [p]∖{j} : β_jk(t, t′) ≠ 0 for some t, t′ ∈ T }.     (4)

We denote the size of the neighborhood as s_j = |N_j|. To estimate the neighborhood of j ∈ V, we regress g_ij on {g_ik : k ∈ [p]∖{j}} using a penalized functional regression approach. Despite the conceptual simplicity and high-level similarity to Meinshausen and Bühlmann (2006), there are numerous technical challenges that need to be addressed in the functional data setting, which we discuss in Section 4.

Vector-on-Vector Regression
The observed functions {g_i(·)}_{i=1}^n may be infinite dimensional objects; thus, the regression problem suggested by (3) cannot be solved directly. As a practical estimation procedure, we first approximate the function-on-function regression problem with a tractable finite dimensional vector-on-vector regression problem.
For a fixed target node j ∈ [p], suppose we seek to estimate N_j. As a first step, we represent the potentially infinite dimensional functions using a finite M-dimensional basis. Let φ_j = {φ_jm}_{m=1}^∞ be an orthonormal basis of H; for now, we assume that it is given, and details on selecting an appropriate basis are discussed in Section 2.4. Using the first M basis functions, we compute the projection scores for each k ∈ [p] and m ∈ [M]:

a_ikm = ⟨g_ik, φ_jm⟩ = ∫_T g_ik(t) φ_jm(t) dt,     (5)

and form the projection score vectors a_{i,k,M} = (a_ik1, . . . , a_ikM)⊤. For each observed function, the scores encode the L² projection onto the first M elements of φ_j, and g_ik(·) ≈ Σ_{m=1}^M a_ikm φ_jm(·). The target node j will typically be fixed, so without loss of generality we assume j = p. In addition, we follow the commonly used regression notation and denote the random function of the target node, g_ij(·), as g_i^Y(·), and denote the other p − 1 random functions as (g_i^{X_1}(·), . . . , g_i^{X_{p−1}}(·))⊤. We let a_im^Y = ⟨g_i^Y, φ_jm⟩ and a_im^{X_k} = ⟨g_i^{X_k}, φ_jm⟩ denote the scores of the observed functions, and let a_{i,M}^Y and a_{i,M}^{X_k} denote the corresponding vectors of scores. At times, we will also use the stacked notation a_{i,M}^X = ((a_{i,M}^{X_1})⊤, . . . , (a_{i,M}^{X_{p−1}})⊤)⊤. As shown in Appendix A.1, a_{i,M}^Y can be represented as

a_{i,M}^Y = Σ_{k=1}^{p−1} B*_{k,M} a_{i,M}^{X_k} + w_{i,M} + r_{i,M},     (6)

where B*_{k,M} ∈ R^{M×M} is a regression matrix parameter corresponding to β_jk(·, ·), w_{i,M} is the noise vector corresponding to e_ij(·), and r_{i,M} is a bias term that arises because only the first M basis functions are used. More details are provided in Section 4.
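As a concrete illustration of (5), the following minimal sketch computes projection scores by trapezoidal quadrature when each function is observed on a common grid; the array layout and the function name are assumptions made for this example, not part of the formal definition.

```python
import numpy as np

def projection_scores(curves, basis_vals, grid):
    """Approximate a_{ikm} = <g_{ik}, phi_{jm}> by trapezoidal quadrature.

    curves:     array (n, p, T) of functions observed on a common grid
    basis_vals: array (M, T) of the first M basis functions phi_{j1}, ..., phi_{jM}
    grid:       array (T,) of observation times
    returns:    array (n, p, M) of projection scores
    """
    # integrate g_{ik}(t) * phi_{jm}(t) over the grid for every (i, k, m)
    integrand = curves[:, :, None, :] * basis_vals[None, None, :, :]
    return np.trapz(integrand, x=grid, axis=-1)
```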
Given n i.i.d. samples {g_i(·)}_{i=1}^n, we estimate B*_{k,M}, and subsequently N_j, using a penalized least squares approach. Let a_im^Y, a_im^{X_k}, a_{i,M}^Y, and a_{i,M}^{X_k} denote the quantities arising from the i-th observed sample. We select B̂_{1,M}, . . . , B̂_{p−1,M} by minimizing the following objective:

(B̂_{1,M}, . . . , B̂_{p−1,M}) = argmin_{B_1, . . . , B_{p−1}} (1/(2n)) Σ_{i=1}^n ‖ a_{i,M}^Y − Σ_{k=1}^{p−1} B_k a_{i,M}^{X_k} ‖_2² + λ_n Σ_{k=1}^{p−1} ‖B_k‖_F,     (7)

where λ_n is a tuning parameter. In Section 3, we propose an efficient optimization algorithm to solve (7), and we give statistical guarantees for B̂_{1,M}, . . . , B̂_{p−1,M} in Section 4. Given B̂_{1,M}, . . . , B̂_{p−1,M}, the estimated neighborhood set is then

N̂_j = { k ∈ [p]∖{j} : ‖B̂_{k,M}‖_F > ε_n },     (8)

where the threshold ε_n is a tuning parameter. Finally, the estimated edge set Ê is obtained by combining the estimated neighborhoods of each node. Following Meinshausen and Bühlmann (2006), the edge set Ê can be computed by one of the following schemes:
• AND: if both j ∈ N̂_l and l ∈ N̂_j hold, then (j, l) ∈ Ê;
• OR: if either j ∈ N̂_l or l ∈ N̂_j holds, then (j, l) ∈ Ê.
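To make the last two steps concrete, the sketch below applies the thresholding rule (8) to estimated blocks and combines node-wise neighborhoods with the AND/OR rules; a solver for (7) itself is discussed in Section 3. The function names and the set-based edge representation are illustrative assumptions.

```python
import numpy as np

def neighborhood_from_blocks(B_hat, j, eps_n=0.0):
    """Apply the thresholding rule (8): keep nodes k with ||B_hat_k||_F > eps_n."""
    others = [k for k in range(len(B_hat) + 1) if k != j]   # node labels of the blocks
    return {others[idx] for idx, b in enumerate(B_hat) if np.linalg.norm(b) > eps_n}

def combine_neighborhoods(neighborhoods, rule="AND"):
    """Combine node-wise neighborhoods into an edge set using the AND/OR rules."""
    p = len(neighborhoods)
    edges = set()
    for j in range(p):
        for k in neighborhoods[j]:
            if rule == "OR" or j in neighborhoods[k]:
                edges.add((min(j, k), max(j, k)))
    return edges
```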
To operationalize the procedure, we discuss the choice of basis functions and the choice of tuning parameters in the following two sections.

Choice of Basis
A key element in the procedure above is the choice of the basis φ j . Throughout the paper, we assume that the basis is orthonormal; if the user specifies a non-orthonormal basis, it can first be orthonormalized with a procedure such as the Gram-Schmidt algorithm (Theorem 2.4.10 of Hsing and Eubank (2015)).
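As an illustration of this preprocessing step, the following sketch orthonormalizes a user-specified basis evaluated on a grid; it uses a QR factorization of quadrature-weighted evaluations, which agrees with Gram-Schmidt up to signs, and it assumes a uniform observation grid.

```python
import numpy as np

def orthonormalize_basis(basis_vals, grid):
    """Orthonormalize grid-evaluated basis functions with respect to the L2 inner product.

    basis_vals : (M, T) array; rows are basis functions evaluated on `grid`
    returns    : (M, T) array whose rows are orthonormal under trapezoidal quadrature
    """
    h = grid[1] - grid[0]                       # assumes a uniform grid
    w = np.full(grid.shape, h)
    w[0] = w[-1] = h / 2                        # trapezoidal quadrature weights
    W = np.sqrt(w)
    # QR on the weighted evaluations spans the same nested subspaces as Gram-Schmidt
    Q, _ = np.linalg.qr((basis_vals * W).T)
    return Q.T / W
```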
At a high level, there are two different approaches: the basis can be fixed in advance, or the basis can depend on the data. In the first approach, one uses a known basis, which could be selected via prior knowledge, or simply a commonly used basis for which projection scores can be computed efficiently (e.g., the Fourier, B-spline, or wavelet bases). The second approach uses a basis that is determined by unobserved population quantities and needs to be estimated before computing the projection scores. For instance, functional PCA (FPCA) can be used to estimate a basis (Ramsay and Silverman, 2005, Chapter 8). In the previous section, we discussed vector-on-vector regression assuming that the basis φ_j was known a priori; here we discuss the case where the basis must be estimated.
For a chosen node j ∈ [p] and any i ∈ [n], suppose that we have an estimate {φ̂_jm}_{m≥1} of the "true" basis {φ_jm}_{m≥1}, and let â_im^Y = ⟨g_i^Y, φ̂_jm⟩, â_im^{X_k} = ⟨g_i^{X_k}, φ̂_jm⟩, â_{i,M}^Y = (â_i1^Y, . . . , â_iM^Y)⊤, and â_{i,M}^{X_k} = (â_i1^{X_k}, . . . , â_iM^{X_k})⊤. Similar to (6), we have

â_{i,M}^Y = Σ_{k=1}^{p−1} B*_{k,M} â_{i,M}^{X_k} + w_{i,M} + r_{i,M} + v_{i,M},     (9)

where the additional term v_{i,M}, defined in (36) in the appendix, arises from using φ̂_j instead of φ_j. When φ̂_j is close to φ_j, the error term v_{i,M} should be small. See the derivation of (9) in Appendix A.1. Based on the relationship in (9), we estimate the graph structure as in the previous section, where B̂_1, . . . , B̂_{p−1} are estimated using (7) with a_{i,M}^Y and a_{i,M}^{X_k} replaced by â_{i,M}^Y and â_{i,M}^{X_k}. The estimated neighborhood sets are then given by (8).
The most popular data-driven basis is the FPCA basis. Consider the functional covariance of g_ij(·), K_jj(t′, t) = Cov(g_ij(t′), g_ij(t)), and the associated Hilbert–Schmidt covariance operator K_j : L²(T) → L²(T),

(K_j f)(t′) = ∫_T K_jj(t′, t) f(t) dt.

Then there exist eigenpairs {σ_jm, φ_jm(·)}_{m∈N} of K_j (Theorem 7.2.6 of Hsing and Eubank (2015)), where {σ_jm}_{m∈N} are the eigenvalues and {φ_jm(·)}_{m∈N} are orthonormal eigenfunctions. Since the covariance operator K_j is symmetric and positive semidefinite, we assume that σ_j1 ≥ σ_j2 ≥ · · · ≥ 0 without loss of generality. By the Karhunen–Loève theorem, g_ij can be represented as

g_ij(·) = Σ_{m=1}^∞ a_ijm φ_jm(·),

where a_ijm = ⟨g_ij, φ_jm⟩ are the FPCA scores and a_ijm is independent of a_ijm′ for m ≠ m′ (Bosq, 2000, Theorem 1.5). We refer to {φ_jm(·)}_{m=1}^∞ as the FPCA basis. Since the basis is orthonormal, the function g_ij^M(·) = Σ_{m=1}^M a_ijm φ_jm(·) is the L²-projection of g_ij(·) onto the subspace spanned by the first M FPCA functions. The main advantage of this basis is that it provides the best approximation in the L² sense when projecting a function onto a fixed number of basis functions.
Unfortunately, the FPCA basis is typically unknown since K_jj(t′, t) is unknown. Therefore, we first estimate the functional covariance with its empirical version:

K̂_jj(t′, t) = (1/n) Σ_{i=1}^n g_ij(t′) g_ij(t).

A subsequent eigen-decomposition of K̂_jj(t′, t) yields the estimated eigenpairs {σ̂_jm, φ̂_jm(·)}_{m=1}^M, which in turn can be used to estimate the FPCA scores â_ijm = ∫_T g_ij(t) φ̂_jm(t) dt. Qiao et al. (2019) and Solea and Dette (2021) also use projection scores from a dimension reduction procedure. However, there are several key differences between our approach and theirs. First, although the FPCA basis is the most commonly used, we do not restrict ourselves to it and instead consider a generic basis. This provides additional flexibility and allows us to explore the effect of the chosen basis on empirical performance; see Section 4.3 for details. Our methodology also differs in a second, more substantial way. Both Qiao et al. (2019) and Solea and Dette (2021) project each random function onto its own FPCA basis and use the resulting projection scores for all subsequent tasks. In contrast, when estimating the neighborhood of a specific node, rather than projecting each random function onto its own subspace, we project all random functions onto the same subspace. Said concretely, the basis φ_j used for estimating N_j may differ from the basis φ_k used for estimating N_k. However, when estimating N_j we use the projection scores of all functions projected onto the single basis φ_j.
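For illustration, the estimation of the FPCA basis described above can be carried out on gridded data roughly as follows; the discretization of the covariance operator and the function name are assumptions of this sketch.

```python
import numpy as np

def estimate_fpca_basis(curves_j, grid, M):
    """Estimate the first M FPCA eigenpairs of node j from grid observations.

    curves_j : (n, T) array of g_{ij}(t_1), ..., g_{ij}(t_T), assumed mean zero
    returns  : (eigvals, eigfuns) with eigfuns of shape (M, T), orthonormal in L2
    """
    n, T = curves_j.shape
    h = grid[1] - grid[0]                      # assumes a uniform grid
    K_hat = curves_j.T @ curves_j / n          # empirical covariance K_hat(t', t)
    # eigenpairs of the discretized covariance operator (K_hat scaled by grid spacing)
    vals, vecs = np.linalg.eigh(K_hat * h)
    order = np.argsort(vals)[::-1][:M]
    eigvals = vals[order]
    eigfuns = vecs[:, order].T / np.sqrt(h)    # rescale so that int phi^2(t) dt = 1
    return eigvals, eigfuns
```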
Intuitively, the advantage of this approach is that we can tailor the finite dimensional representation to maximize the information relevant to selecting the neighborhood of a specific node, N_j. The FPCA basis of each random function maximizes the "retained information" for that random function. While there might be significant features of g_ik(·) that are captured by its FPCA basis, these features may not be relevant for estimating the neighborhood of a specific node N_j. Ultimately, we care more about how g_ik(·) behaves in the subspace spanned by the FPCA basis of g_ij, which captures the variability of g_i^Y, than about how it behaves in the subspace spanned by its own FPCA basis. We give a theoretical justification in Section 4.3 and also illustrate the advantages in simulations.
More concretely, using a single basis for selecting N_j also avoids issues of collinearity which may arise artificially. For instance, suppose g_ik(·) = Σ_{m=1}^∞ a_ikm φ_km(·) and g_il(·) = Σ_{m=1}^∞ a_ilm φ_lm(·) have eigenfunctions {φ_km(·)}_{m≥1} and {φ_lm(·)}_{m≥1} that differ drastically, but a_ikm and a_ilm are highly correlated. When estimating N_j using the projection scores from the FPCA bases of nodes k and l, this would result in a poorly conditioned problem that may violate the irrepresentability condition (e.g., Assumption 4.8 in Solea and Dette (2021) or Condition 5 in Qiao et al. (2019)), despite the fact that the actual random functions g_ik(·) and g_il(·) are not difficult to distinguish. Projecting g_ik(·) and g_il(·) onto the same basis φ_j avoids this concern, and the resulting projection scores would only be collinear if the actual random functions are similar and the problem is intrinsically hard.
While our methodology and theory allow for any orthonormal basis, we show both theoretically and in simulations that a well-chosen basis can improve performance. When choosing a basis, there are at least two objectives to consider. First, we want to minimize the covariance between r_{i,M} and {a_{i,M}^{X_k}}_{k∈[p−1]} in (6). Second, we want to maximize the signal strength min_{k∈N_j} ‖B*_{k,M}‖_F. In general, simultaneously achieving both objectives is practically infeasible, so we focus on achieving at least one of them. Achieving the first objective is generally infeasible without further restrictive assumptions (see Section 4.3). Thus, in practice, we generally focus on the second objective, which leads us to the FPCA basis of g_i^Y; we recommend it as the default choice.

Selection of Tuning Parameters
There are three tuning parameters that need to be chosen to implement our algorithm: the group lasso penalty parameter λ_n in (7); the number of basis functions used for dimension reduction, M; and the thresholding parameter ε_n from (8). While a nonzero thresholding parameter ε_n is needed for the theoretical guarantees in Theorem 1 and Theorem 2, in practice it can be set to zero. The number of basis functions M can be selected using 5-fold cross-validation, similar to the procedure in Qiao et al. (2019). In Section 4, the theoretical guarantees require a choice of the penalty parameter λ_n; however, this value depends on population parameters that are unknown in practice. Here, we give a practical procedure for choosing λ_n.
For a large enough value of λ_n, all of the estimated coefficients B̂_k will be set to zero. Specifically, by Proposition 1, there exists a threshold λ_max,n > 0 that can be calculated from the data such that, for any λ_n > λ_max,n, the resulting B̂_k = 0 for all k ∈ [p − 1]. Thus, we only need to consider λ_n ∈ (0, λ_max,n]. We found empirically that traditional K-fold cross-validation performs poorly in our setting. Therefore, for each j ∈ [p], we select λ_n using selective cross-validation (SCV) (She, 2012), which generally yields better empirical performance. The procedure is as follows:
1. For each λ_n value:
   (a) Use the entire data set to estimate B̂^{λ_n} = (B̂_1, . . . , B̂_{p−1}) by solving (7).
   (b) For each fold l ∈ [K]:
       i. For k ∈ N̂(B̂^{λ_n}), re-estimate B̂_k by minimizing the unpenalized least squares objective on the l-th training set, which we denote I_l, and set B̂_k = 0 for all k ∉ N̂(B̂^{λ_n}). Specifically, we obtain B̂_1, . . . , B̂_{p−1} by solving

       min_{ {B_k : k ∈ N̂(B̂^{λ_n})} } (1/(2|I_l|)) Σ_{i∈I_l} ‖ a_{i,M}^Y − Σ_{k∈N̂(B̂^{λ_n})} B_k a_{i,M}^{X_k} ‖_2².

       ii. Evaluate the estimate on the test data using the error criterion defined in (10), which contains a cross-validation deviance term and a degrees-of-freedom shrinkage term.
   (c) Calculate the mean of the criterion across all K folds.
2. Pick the λ_n value that minimizes the mean criterion.
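The selection loop above can be summarized in code. The sketch below is a minimal illustration, assuming a solver for (7) (such as the ADMM routine of Section 3) and a user-supplied stand-in for the error criterion in (10); the function and variable names are hypothetical and not part of the paper's implementation.

```python
import numpy as np

def scv_select_lambda(A_Y, A_X, lambdas, folds, solver, criterion):
    """Selective cross-validation for lambda_n, following steps 1-2 above.

    solver(A_Y, A_X, lam) returns the blocks B_hat_1, ..., B_hat_{p-1} of (7);
    criterion(Y_test, X_test, B, df) stands in for the SCV-RSS criterion (10).
    """
    M = A_Y.shape[1]
    mean_scores = []
    for lam in lambdas:
        # (a) fit on the full data set to obtain the support
        B_full = solver(A_Y, A_X, lam)
        selected = [k for k, b in enumerate(B_full) if np.linalg.norm(b) > 0]
        fold_scores = []
        for train_idx, test_idx in folds:
            # (b.i) unpenalized least-squares refit of the selected blocks
            B = [np.zeros_like(b) for b in B_full]
            if selected:
                X_tr = np.concatenate([A_X[k][train_idx] for k in selected], axis=1)
                coef, *_ = np.linalg.lstsq(X_tr, A_Y[train_idx], rcond=None)
                for pos, k in enumerate(selected):
                    B[k] = coef[pos * M:(pos + 1) * M].T
            # (b.ii) evaluate the error criterion on the held-out fold
            fold_scores.append(criterion(A_Y[test_idx],
                                         [a[test_idx] for a in A_X],
                                         B, len(selected)))
        # (c) average the criterion across the folds
        mean_scores.append(np.mean(fold_scores))
    # step 2: pick the lambda minimizing the mean criterion
    return lambdas[int(np.argmin(mean_scores))]
```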
We propose an error criterion named SCV-RSS, where RSS stands for residual sum of squares. The criterion performs well in practice; it adds a BIC-type penalty term to the squared Frobenius norm of the empirical estimation error, as given in (10).

Optimization Algorithm
We solve the optimization problem in (7) using the alternating direction method of multipliers (ADMM) (Gabay and Mercier, 1976; Boyd et al., 2011). The pseudo-code is given in Algorithm 2, and we provide additional details below. The problem in (7) can be reformulated as a constrained problem that is minimized by solving a series of subproblems: at the h-th iteration, for all k ∈ [p − 1], we solve the subproblems (11) and (12) and update the dual variable by (13), where ρ is the penalty parameter that we discuss below. For efficiency, the ADMM subproblems can be simplified as follows (see Appendix A.2 for details). The subproblem (11) can be rewritten as (14). From Proposition 1 of Yuan and Lin (2006), P_k^{h+1} can be expressed in terms of an orthonormal matrix Ψ_k and a diagonal matrix Λ_k, both obtained from the eigen-decomposition of (A^{X_k})⊤ A^{X_k}. The value of ν satisfies ν ‖P_k^{h+1}‖_F = λ_n/ρ and is obtained using a bisection line search: given an initial search interval [ν_L, ν_R], we set ν = ν_C := (ν_L + ν_R)/2, calculate ν ‖P_k^{h+1}‖_F, and iterate this procedure until ν_C converges to a point that satisfies ν_C ‖P_k^{h+1}‖_F = λ_n/ρ. Note that we perform one eigen-decomposition of (A^{X_k})⊤ A^{X_k} and cache the result; as a consequence, updating P_k^{h+1} only requires inverting a diagonal matrix. Finally, the solution of (12) is Q^{h+1}, given in (15). Iteratively applying the updates (14), (15), and (13), the matrix P_k^h converges to an optimum P*_k (Boyd et al., 2011), and the solution of (7) is given by B̂_k = P*_k, k ∈ [p − 1]. The stopping criterion for the iterations depends on the primal residual, which indicates how well the constraints are satisfied, and the dual residual, which indicates the stability of the updates between two consecutive iterations (Boyd et al., 2011). The algorithm terminates when both residuals are below their respective thresholds, computed with ε_abs = 10^{-4} and ε_rel = 10^{-3}, as suggested in Boyd et al. (2011). Finally, we discuss the penalty parameter ρ. To prevent the primal and dual residuals from varying significantly across iterations, the parameter ρ can be adjusted at each step; this ensures stability regardless of the initial P_k^0 and U^0. We use Strategy S3 in Table 1 of He et al. (2000) with ϕ = 10, τ_incr = τ_decr = 2, and ρ^0 = 1, which keeps the primal and dual residuals relatively close.
Algorithm 2: ADMM for functional neighborhood selection
for h = 0, 1, 2, . . . do
    Update P_k^{h+1} by (14), Q^{h+1} by (15), and U^{h+1} by (13);
    Check if the primal and dual residuals meet the stopping criteria;
    Update ρ^{h+1} for the next round;
end for
Output: B̂_k for k ∈ [p − 1].
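For readers who want a runnable reference point, the sketch below solves the same group-lasso objective (7) with a textbook ADMM split (Boyd et al., 2011); it is not the exact subproblem decomposition in (11)-(15) (in particular, it keeps ρ fixed and uses a simplified stopping rule), and all names are illustrative.

```python
import numpy as np

def admm_group_lasso(A_Y, A_X, lam, rho=1.0, n_iter=200, eps_abs=1e-4, eps_rel=1e-3):
    """Sketch: ADMM for (7) with the standard split P_k = Q_k (Boyd et al., 2011)."""
    n, M = A_Y.shape
    p1 = len(A_X)
    X = np.concatenate(A_X, axis=1)                  # (n, p1*M) design of scores
    XtX = X.T @ X / n
    XtY = X.T @ A_Y / n
    P = np.zeros((p1 * M, M)); Q = np.zeros_like(P); U = np.zeros_like(P)
    G = np.linalg.inv(XtX + rho * np.eye(p1 * M))    # cached; valid since rho is fixed
    for _ in range(n_iter):
        P = G @ (XtY + rho * (Q - U))                # quadratic (ridge-type) update
        Q_old = Q.copy()
        Q = P + U
        for k in range(p1):                          # blockwise soft-thresholding
            blk = Q[k * M:(k + 1) * M]
            nrm = np.linalg.norm(blk)
            Q[k * M:(k + 1) * M] = max(0.0, 1.0 - lam / (rho * nrm + 1e-12)) * blk
        U = U + P - Q                                # dual update
        r = np.linalg.norm(P - Q)                    # primal residual
        s = rho * np.linalg.norm(Q - Q_old)          # dual residual
        if (r < eps_abs + eps_rel * max(np.linalg.norm(P), np.linalg.norm(Q))
                and s < eps_abs + eps_rel * rho * np.linalg.norm(U)):
            break
    # rows of each block of Q store B_k transposed, so transpose back
    return [Q[k * M:(k + 1) * M].T for k in range(p1)]
```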

Theoretical Properties
We now discuss the statistical properties of the estimator proposed in Section 2. In particular, we give conditions under which the neighborhood of a single variable can be consistently recovered. Using a union bound extends the guarantees to recovery of the entire graph. First, we discuss a procedure that uses a fixed function basis and, subsequently, we discuss a procedure that uses an estimated function basis.
Since we first consider a single node j, we assume without loss of generality that j = p. To simplify the notation, we also drop the subscript j from β_jk(t′, t), φ_j, φ_jm, etc., in this section.
Assumption 2 is quite weak since Proposition 2 in Appendix A.7 implies that (16) holds almost everywhere when ‖β_k(t, t′)‖_HS < ∞. Thus, Assumption 2 simply assumes that the exceptional set of measure zero is actually empty.
The scores of the "error" e_ij(·) projected onto the function basis are collected in the vector w_{i,M} from (6). When n and M are large enough, the term ‖β_{k,>M}(t′, t)‖_HS is small, B̂_k^M is close to B*_{k,M}, and N̂_j will be a good estimator of N_j.
Both w_{i,M} and r_{i,M} are Gaussian vectors with mean zero, and we denote their covariance matrices by Σ^w and Σ^r, respectively; in addition, we define Σ^{r,w} = Cov(r_{i,M}, w_{i,M}) and Σ^{w,r} = (Σ^{r,w})⊤. To simplify the notation, we drop the explicit dependence on M. The following quantities will be used to state the results. The quantity K_0 provides an upper bound on the estimation error of the covariance matrix of a_{i,M}^X. This is subsequently used to prove a lower bound on the restricted eigenvalues, which is a crucial step in proving Theorem 1 and Theorem 2.

Prior Fixed Function Basis
Here e_ij is defined in (3). Note that σ_jr ≤ σ_j0. We introduce several assumptions before stating the main results.
Assumption 3. There exists a constant C > 0 that does not depend on p such that σ_max,0 ≤ C and max_{j∈[p]} max_{k∈N_j} ‖β_jk(s, t)‖_HS < ∞ for all p ≥ 1.
Assumption 3 requires that the norms of the random functions have finite second moments; this is a basic requirement for functional graphical models to be well defined. Note that Ξ_k(M) ≤ max_{j∈[p]} E‖g_ij‖² for all k = 1, 2, 3, 4 and any M. Thus, Ξ_k(M) ≤ C for all M ≥ 1 and p ≥ 1.
Assumption 4. Let Σ^X_{N_j,N_j} = (Σ^{X_k,X_k′})_{k,k′∈N_j} ∈ R^{|N_j|M×|N_j|M} be the submatrix with blocks indexed by the elements of the neighborhood set N_j, and define κ = κ(M) = ρ_min(Σ^X_{N_j,N_j}). For any M, we assume that κ > 0. When N_j is empty, we let κ = ∞ for all M.
Assumption 4 requires that the projection scores of the functions in the neighborhood of node j are linearly independent. As discussed in Section 2.4, because we project all functions onto the same basis, the projection scores would only be collinear if the functions are truly difficult to distinguish.
Let τ(M) be the relevant signal strength,

τ(M) = min_{k∈N_j} ‖β_{k,M}(t′, t)‖_HS,

where β_{k,M}(t′, t) is defined in (18). For any orthonormal basis, the signal strength τ(M) is an increasing function of M. When N_j is empty, we define τ(M) = ∞ for all M. Recall that we use s = s_j to denote the size of the neighborhood, s = |N_j|.
Assumption 5 requires that the function ω(M), a measure of the bias due to truncation, must decay quickly relative to κ(M), which roughly measures the conditioning of the design matrix, after dividing by τ(M), which measures the signal strength. The exact form of λ̃(n, p, M, δ) can be found in (45) in the appendix. The function λ̃(n, p, M, δ) provides theoretical guidance on how to select the regularization parameter λ_n, and the function χ(n, p, M, δ) provides theoretical guidance on the choice of the thresholding parameter ε_n.
We are now ready to state our main result on the consistency of the neighborhood selection procedure.
Theorem 1 (Neighborhood Recovery with Prior Fixed Basis). Suppose that Assumptions 1-5 hold. Fix δ ∈ (0, 1]. Furthermore, suppose M ≥ M*, λ_n = λ̃(n, p, M, δ), and ε_n = χ(n, p, M, δ) + 12 s ω(M)/κ(M). If the sample size n is large enough, then N̂_j = N_j with probability at least 1 − δ.

Proof. See Appendix A.5.

Note that the quantities κ(M), ω(M), ν(M), s, and M all implicitly depend on j. One key difference between Theorem 1 and a typical group lasso result (Chapter 9.6 of Wainwright, 2019) is that the error term is correlated with the covariates: recall that the projection scores a_{i,M}^X are correlated with the residual r_{i,M} due to the finite-dimensional approximation. The effect of this correlation is captured by the function ω(M). When there is no truncation bias, i.e., when the random functions are finite-dimensional and M is large enough, ω(M) is zero.
Using a union bound, the following corollary directly follows from Theorem 1 and provides guarantees for recovery of the entire graph.
Corollary 1 (Graph Recovery with Prior Fixed Basis). Suppose the conditions of Theorem 1 hold for all nodes j ∈ [p]. We write κ_j(M_j), ω_j(M_j), ν_j(M_j), s_j, and M_j in place of κ(M), ω(M), ν(M), s, and M in Theorem 1 to make their dependence on j explicit. Let Ê be the estimated edge set obtained by applying either the AND or the OR rule to the estimated neighborhoods of the nodes. If the sample size n is large enough, then P(Ê = E) ≥ 1 − δ.

Compared with Qiao et al. (2019), we do not assume that the functional data are finite dimensional. Instead, we study truly infinite dimensional functional data and discuss the trade-off between bias, signal strength, and conditioning of the design matrix explicitly. When Condition 2 of Qiao et al. (2019) (which is required for correct graph recovery therein) holds, that is, when g_ij is M(n)-dimensional for all j ∈ [p] and some positive integer M(n), then ω(M(n)) = 0 and ν(M(n)) > 0. However, when Assumption 5 holds, we do not necessarily need g_ij to be finite dimensional. Thus, Condition 2 of Qiao et al. (2019) is strictly stronger than Assumption 5.

Data-Dependent Function Basis
We now consider the setting where the function basis used for dimension reduction is not known in advance; instead, the basis used is an estimate of some population basis. We assume that the estimated basis satisfies the following property.
Assumption 6. There exist constants c_1, c_2 > 0 such that, for all 0 < δ ≤ 1 and with high probability, the estimation error of the basis functions, ‖φ̂_jm − φ_jm‖, is controlled in terms of constants {d_jm}, j ∈ [p], m ≥ 1, that depend on j and m.

Assumption 6 holds when φ_jm(t) are the FPCA eigenfunctions of K_jj(t′, t) and φ̂_jm(t) are the estimated FPCA eigenfunctions of K̂_jj(t′, t); see Lemma 6 and Lemma 8 in the Supplementary Material of Qiao et al. (2019). When Assumption 6 holds, let d_{j,max} = max_{m≥1} d_jm. The regularization level λ̂(n, p, M, δ) used in Theorem 2 is given in (60) in the appendix.
Comparing Theorem 2 with Theorem 1, one key difference is that the regularization parameter is increased by a term that corresponds to the estimation error of the basis functions. As a result, the sample complexity also increases due to this additional error source.
Similar to before, the following corollary provides guarantees for recovery of the whole graph and directly follows from Theorem 2 when applying a union bound.
Corollary 2. Suppose the conditions of Theorem 2 hold for all nodes j ∈ [p]. We write κ_j(M_j), ω_j(M_j), ν_j(M_j), s_j, and M_j in place of κ(M), ω(M), ν(M), s, and M to make their dependence on j explicit. Let Ê be the estimated edge set obtained by applying either the AND or the OR rule to the estimated neighborhoods of the nodes. If the sample size n is large enough, then P(Ê = E) ≥ 1 − δ.

Theoretical Guidance on the Choice of Function Basis
Suppose we use φ = {φ_m}_{m=1}^∞ as the function basis to represent the data, and let Λ(M, φ) denote the quantity appearing in Assumption 5, where ω(M) measures the covariance between the scores of the basis elements we include and the basis elements we truncate, κ(M) is the minimum eigenvalue of the covariance of the scores, and τ(M) measures the signal strength of the basis elements we include. Based on the previous section, a good choice of φ should minimize Λ(M, φ) for all M ≥ 1. Unfortunately, minimizing Λ(M, φ) is typically infeasible as it involves unknown quantities. In this section, we motivate two approaches for selecting the function basis. First, we show that when additional assumptions hold, a basis which minimizes ω(M) can be used. Second, we consider a more general case and suggest approximately minimizing an upper bound on Λ(M, φ).

Minimize ω(M )
One approach to choosing the function basis is to minimize ω(M). To that end, the function basis φ should minimize the covariance between the retained scores {{⟨g_i^{X_k}, φ_m⟩}_{k=1}^{p−1}}_{m=1}^{M} and the truncated scores {{⟨g_i^{X_k}, φ_m⟩}_{k=1}^{p−1}}_{m>M}. While minimizing this covariance is intractable in general, under the assumption of partial separability (Zapata et al., 2021) we can solve the minimization problem exactly.
The partially separable Karhunen–Loève (PSKL) basis of Zapata et al. (2021) is an orthonormal basis φ = {φ_m}_{m=1}^∞ of H such that the score vectors (⟨g_1, φ_m⟩, . . . , ⟨g_p, φ_m⟩)⊤, m ≥ 1, are mutually uncorrelated.
When the PSKL basis exists, {{⟨g_i^{X_k}, φ_m⟩}_{k=1}^{p−1}}_{m=1}^{M} and {{⟨g_i^{X_k}, φ_m⟩}_{k=1}^{p−1}}_{m>M} are uncorrelated for any M ≥ 1, and ω(M) = 0 for all M ≥ 1. Since Λ(M, φ) is nonnegative, the PSKL basis minimizes (32) when the data generating process is partially separable. Lynch and Chen (2018) proposed a test to verify the partial separability assumption. However, the partial separability assumption is strong and may not hold in general settings.

Minimize an Approximate Upper Bound
When the PSKL basis does not exist, we suggest choosing a function basis by approximately minimizing the following upper bound on (32). By the calculation in Appendix A.3, we obtain the bound in (33). Therefore, by maximizing the right-hand side of (33), we approximately minimize Λ(M, φ).
Unfortunately, most of the terms in (33) depend on N_j, which is unknown. As a consequence, we choose to maximize log tr Σ^Y, which does not depend on N_j. This term is maximized when the function basis φ is the FPCA basis of g_i^Y. More intuitively, the FPCA basis of g_i^Y maximizes the signal strength of the response variable. In Section 5, we confirm through extensive simulations that the FPCA basis of g_i^Y indeed performs well.

Simulations
We illustrate the finite sample properties of our neighborhood selection procedure through a simulation study. We generate the simulated data as follows. Let

g_ij(t) = (a_ij)⊤ f(t),   i ∈ [n], j ∈ [p],

where a_ij ∈ R^{M*}, ((a_i1)⊤, . . . , (a_ip)⊤)⊤ ∈ R^{pM*} follows a mean zero Gaussian distribution with covariance matrix Σ = Θ^{−1}, and f(t) is a vector that contains the first M* Fourier basis functions. We consider the following four settings for the precision matrix Θ.
• Model B. We generate a partially block-banded precision matrix Θ ∈ R^{pM*×pM*} with M* = 15. In this setting, every alternating block of 10 nodes is connected according to Model A, and the remaining nodes are fully isolated. Precisely, Θ is a block diagonal matrix with 10M* × 10M* blocks denoted by Θ^{kk}, k = 1, . . . , p/10. For even k, we set Θ^{kk} = I_{10M*}. For odd k, we set Θ^{kk} ∈ R^{10M*×10M*} to be the block-banded precision matrix generated by Model A.
• Model C. We generate a hub-connected precision matrix Θ ∈ R^{pM*×pM*} with M* = 5. We generate the edge set E from a power law distribution as follows. For each node, the number of neighbors m follows a power law distribution p_m ∝ m^{−α} with α = 2, and the exact neighbors are sampled uniformly. A disjoint sequence of edge sets E_1, . . . , E_5 is generated from E, yielding 5 adjacency matrices G_l, l = 1, . . . , 5. The detailed algorithm for this step is given in Section 8 of Zapata et al. (2021). Then we generate p × p precision submatrices Ω_l, l = 1, . . . , 5, whose supports are exactly G_l. First, p × p matrices Ω̃_l are generated with supports G_l. Next, we rescale the rows of Ω̃_l so that the ℓ_2-norm of each row is 1. We then obtain Ω_l by symmetrizing Ω̃_l: we average Ω̃_l with its transpose and set the diagonal entries to one. Let Σ_ps = diag(Σ_1, . . . , Σ_5) with Σ_l = 3l^{−1.8} · Ω_l^{−1}. To break the partial separability condition, we define a block precision matrix Ω̃ ∈ R^{pM*×pM*} whose p × p diagonal blocks are Ω̃_{l,l} = Ω_l and whose off-diagonal blocks are Ω̃_{l,l+1} = Ω̃_{l+1,l} = (Ω*_l + Ω*_{l+1})/2, where Ω*_l = Ω_l − diag(Ω_l). We then calculate the corresponding covariance matrix Σ_nps. Finally, we obtain the covariance matrix as (Σ^{jl})_{st} = (Σ_{nps,st})_{jl}, 1 ≤ j, l ≤ p and 1 ≤ s, t ≤ 5, and the precision matrix Θ = Σ^{−1}.
• Model D. This model is similar to the setting introduced in Rothman et al. (2008), but modified to fit functional data. We generate a random block sparse precision matrix Θ ∈ R^{pM*×pM*} with M* = 15. Each off-diagonal block Θ^{jl}, 1 ≤ j ≠ l ≤ p, is set to 0.5 I_{M*} with probability 0.1, and to 0_{M*} otherwise. The diagonal blocks are set to Θ^{jj} = δ I_{M*}, where δ is chosen to guarantee positive definiteness of the precision matrix, i.e., Θ ≻ 0. It suffices to choose δ to exceed the maximum row sum of the absolute values of the off-diagonal entries of Θ, since diagonal dominance then ensures positive definiteness.

Models A and B are similar to Models 1 and 2 in Section 5.1 of Qiao et al. (2019), but modified so that partial separability is violated. Model C, where partial separability is also violated, is used as a counter-example in Zapata et al. (2021). In Models A, B, and C, the partial separability condition is violated; that is, it is impossible to separate the multivariate and functional aspects of the data. In contrast, Model D satisfies partial separability by construction.
For each setting, we fix n = 100 and let p = 50, 100, 150. Each random function is observed on a grid of T = 100 equally spaced points on [0, 1]. For the T observed time points (t_1, . . . , t_T), the observed values are generated by adding independent N(0, σ²) measurement noise to g_ij(t_s). In Models A, B, and D, we set σ = 0.5, while in Model C, σ² = 0.05 × Σ_{l=1}^{M*} tr(Σ_l)/p. We report results averaged over 30 independent runs.
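A minimal sketch of this data-generating process, specialized to a Model-D-style block sparse precision matrix, is given below; the function names, the random seed handling, and the return of the true edge set are conveniences assumed for this example.

```python
import numpy as np

def fourier_basis(grid, M_star):
    """First M_star Fourier basis functions on [0, 1], evaluated on `grid`."""
    vals, k = [np.ones_like(grid)], 1
    while len(vals) < M_star:
        vals.append(np.sqrt(2) * np.sin(2 * np.pi * k * grid))
        if len(vals) < M_star:
            vals.append(np.sqrt(2) * np.cos(2 * np.pi * k * grid))
        k += 1
    return np.stack(vals)                                   # (M_star, T)

def simulate_model_d(n, p, M_star=15, sigma=0.5, T=100, seed=0):
    """Generate noisy functional observations from a Model-D-style precision matrix."""
    rng = np.random.default_rng(seed)
    Theta = np.zeros((p * M_star, p * M_star))
    edges = set()
    for j in range(p):                                      # off-diagonal blocks 0.5*I w.p. 0.1
        for l in range(j + 1, p):
            if rng.random() < 0.1:
                blk = 0.5 * np.eye(M_star)
                Theta[j*M_star:(j+1)*M_star, l*M_star:(l+1)*M_star] = blk
                Theta[l*M_star:(l+1)*M_star, j*M_star:(j+1)*M_star] = blk
                edges.add((j, l))
    delta = np.abs(Theta).sum(axis=1).max() + 0.1           # diagonal dominance
    Theta += delta * np.eye(p * M_star)
    Sigma = np.linalg.inv(Theta)
    scores = rng.multivariate_normal(np.zeros(p * M_star), Sigma, size=n)
    grid = np.linspace(0, 1, T)
    curves = scores.reshape(n, p, M_star) @ fourier_basis(grid, M_star)
    noisy = curves + sigma * rng.standard_normal(curves.shape)
    return grid, noisy, edges
```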
In all experiments, we set the number of selected principal components to M = 5 for all nodes; this is a typical value selected by the cross-validation procedure described in Section 2.5. We set the threshold parameter ε_n = 0. For the penalty parameter λ_n, we plot the ROC curve as λ_n changes. More specifically, let λ_n = λ_{j,n} explicitly denote the parameter choice for node j. According to Proposition 1, there exists λ_{j,max} such that N̂_j is empty for any λ_{j,n} > λ_{j,max}. Let λ_{j,n,t} = t · λ_{j,max}, where t ∈ [0, 1] is the same for all nodes. We plot the ROC curve as t changes from 1 to 0. We also use the SCV-RSS criterion introduced in Section 2.5 to select λ_{j,n} for all nodes, and then evaluate performance by reporting the precision and recall of the selected graph.
We compare our method with the functional graphical lasso (FGLasso) procedure (Qiao et al., 2019) and the PSKL procedure (Zapata et al., 2021). For FGLasso, we select the parameters as proposed in Qiao et al. (2019). For PSKL, we use the package "fgm" with the default settings (Zapata et al., 2021). As Model D is partially separable, we also implement our method using the PSKL function basis, which we assume is known a priori and call PSKL Basis; in this case, it is a B-spline basis. To demonstrate the advantages of using a single function basis when estimating N_j, as explained in Section 2.4, we implement the following two methods to estimate the FPCA scores and compare their performance. The first method, which we call FPCA-g^X, projects each function onto its own FPCA basis and uses those projection scores for all subsequent tasks. The second method, which we call FPCA-g^Y, projects all other functions onto the FPCA basis of g_j when selecting N_j.
To compare the methods, we plot their respective ROC curves for each model and different values of p. For each value of the tuning parameter λ_n, we compare the estimated edge set to the true edge set. Specifically, we calculate the true positive rate TPR(λ_n) = TP(λ_n)/(TP(λ_n) + FN(λ_n)) and the false positive rate FPR(λ_n) = FP(λ_n)/(FP(λ_n) + TN(λ_n)), where TP(λ_n), FP(λ_n), TN(λ_n), and FN(λ_n) denote the numbers of true positive, false positive, true negative, and false negative edges, respectively. The ROC curves are plotted by varying the penalty parameter λ_n, with TPR(λ_n) on the vertical axis and FPR(λ_n) on the horizontal axis. We also calculate the area under the ROC curve (AUC). The ROC curves are shown in Figure 1 and the average AUC is given in Table 1. The "AND" scheme drastically outperforms the "OR" scheme in the simulations; for brevity, we only include the AUC of the former in Table 1.
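The ROC quantities above can be computed from a path of estimated edge sets as in the following sketch; the representation of edge sets as sets of index pairs is an assumption of this example.

```python
import numpy as np

def roc_points(estimated_paths, true_edges, p):
    """TPR/FPR pairs for a sequence of estimated edge sets (one per lambda value)."""
    all_pairs = {(j, k) for j in range(p) for k in range(j + 1, p)}
    negatives = all_pairs - true_edges
    pts = []
    for E_hat in estimated_paths:
        tp = len(E_hat & true_edges)
        fp = len(E_hat - true_edges)
        tpr = tp / max(len(true_edges), 1)
        fpr = fp / max(len(negatives), 1)
        pts.append((fpr, tpr))
    return sorted(pts)

def auc(points):
    """Area under the ROC curve via the trapezoidal rule."""
    fpr, tpr = zip(*points)
    return float(np.trapz(tpr, fpr))
```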
Although FPCA-g^Y and FPCA-g^X perform similarly, FPCA-g^Y slightly outperforms FPCA-g^X across all settings. Moreover, FPCA-g^Y drastically outperforms FGLasso in Models A and D, and slightly outperforms FGLasso in most settings under Models B and C. In Models A, B, and C, where partial separability does not hold, the PSKL procedure generally underperforms the other procedures. Even in Model D, where partial separability holds, the PSKL procedure has performance that is only comparable to ours. We also note that in Model D, the neighborhood selection procedure that uses the "optimal" PSKL basis, which we assume is known a priori, performs very similarly to the procedure that uses the FPCA basis and does not require prior knowledge. This suggests that using the FPCA basis is a good choice across a variety of settings. Practitioners may want a single graph rather than a series of graphs corresponding to different penalty parameters. Thus, we also evaluate the precision and recall of the final graph selected using the penalty parameter obtained through selective cross-validation. The precision is defined as Precision(λ*_n) = TP(λ*_n)/(TP(λ*_n) + FP(λ*_n)), while the recall is defined as Recall(λ*_n) = TP(λ*_n)/(TP(λ*_n) + FN(λ*_n)); larger values of precision and recall indicate better performance. We applied the SCV-RSS criterion, and the results under Models A, B, and C using the FPCA basis, and under Model D using both the FPCA and PSKL bases, are shown in Table 2. From Table 2 we see that our method achieves satisfactory performance in several of the settings, even when the dimension is high.

Real-world Data Analysis
We apply our procedure to data from the ADHD-200 Consortium (Milham et al., 2012). The sample contains whole-brain fMRI scans for n_ADHD = 70 patients diagnosed with Attention Deficit Hyperactivity Disorder (ADHD), as well as n_control = 95 controls. We use the preprocessed data from Bellec et al. (2017), in which the raw fMRI scans of the whole brain are divided into p = 116 regions of interest (ROIs) (Tzourio-Mazoyer et al., 2002). Our goal is to recover the functional conditional independence graph for each group.
The conditional independence graph for each group is estimated with a penalty parameter λ_n selected by the SCV procedure. The estimated edge sets are depicted in Figure 2. Both the ADHD and control groups have several hub nodes, i.e., ROIs that are highly connected to other ROIs. In our analysis, we define a node as a hub if its number of neighbors exceeds p/2. The ADHD group has 17 hubs, while the control group has only 6. In addition, the adjacency matrix of the ADHD group has a density of 30.4%, which is larger than the control group's 27.7%. This agrees with previous findings that ADHD patients tend to exhibit more abnormal spontaneous connectivity among ROIs (Konrad and Eickhoff, 2010).
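For completeness, the hub and density summaries used above can be computed from an estimated edge set as in the following sketch; the edge-set representation is an assumption of this example.

```python
import numpy as np

def hubs_and_density(edges, p):
    """Hub nodes (degree > p/2) and edge density of an estimated graph."""
    degree = np.zeros(p, dtype=int)
    for j, k in edges:
        degree[j] += 1
        degree[k] += 1
    hubs = np.where(degree > p / 2)[0]
    density = len(edges) / (p * (p - 1) / 2)
    return hubs, density
```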

Conclusion
We propose a neighborhood selection method for estimating the structure of a functional graphical model and show that it can consistently recover the conditional independence graph in the high-dimensional setting. Specifically, we pose the problem of graph selection as a series of function-on-function regressions, and we approximate each function-on-function regression with a vector-on-vector regression obtained through functional dimension reduction. Through extensive simulations, we demonstrate that the proposed method outperforms existing approaches in a variety of settings. Finally, we apply our method to a real-world data set to show its practical utility. We analyze an fMRI data set that includes patients with ADHD and a control group, estimate the connectivity pattern between brain regions, and find results that are consistent with previous research.
A key step in our method is the choice of the basis for dimension reduction. While we suggest using the FPCA basis in most settings, our methodology allows an arbitrary orthonormal basis, and we provide a theoretically motivated procedure for choosing a particular basis. Nonetheless, developing a more rigorous data-driven approach remains an open problem that we hope to study in the future. Another fruitful avenue for future work is the development of methods that allow for inference and hypothesis testing in functional graphical models. For example, Kim et al. (2019) have developed inferential tools for high-dimensional Markov networks, and future work may extend their results to the functional graph setting.

A Technical Proofs
We give proofs of the technical results that appear in the main text.
A.1 Derivation of (6) and (9)

Recall that β_jk(t, t′) is defined in (3) and φ̂_jm is an estimate of the true basis function φ_jm. We focus on a given node j ∈ [p] and drop the index j from the notation to simplify the discussion. We first prove (6). By (3) and (16), we obtain the expansion in (34), and (6) follows directly from (34). To show (9), we only need to redefine the relevant quantities with respect to the estimated basis; in particular, the redefined coefficients b̃_{k,mm′} satisfy b̃_{k,mm′} = 0 for all m, m′ ≥ 1 when k ∉ N_j. By Proposition 2 and an argument similar to (34), we obtain an expansion for â_{i,M}^Y, which, combined with (36), implies (9).

A.2 Simplification of ADMM Optimization Problems
We explain how to obtain the problem in (12). Let φ : R^{n×M} → R be a function that satisfies φ(0) = 0 and ∇φ(0) = 0. The matrix Q_k that minimizes the Lagrangian satisfies a stationarity condition that is equivalent to ρ(Q_k − W_k^{h+1}) being proportional to ∇φ(0). As a result, each entry of Q_k − W_k^{h+1} does not vary with k. Averaging over k and substituting back, we obtain the problem in (12).

A.3 Derivation of (33)
We drop the subscript M. By (6) and the definition of N_j, the relevant vector of scores is a multivariate Gaussian vector. Since the correlation between a_i^Y and a_i^{X_k} is larger than the correlation between r_i and a_i^{X_k}, k ∈ N_j, when M is large enough we obtain the bound in (39). By Lemma 15, we obtain (40). Combining (39), (40), and the definition of Λ(M, φ), we arrive at (33).

A.4 Proposition 1 and its Proof
Proposition 1. If the tuning parameter λ_n satisfies the bound in (41), where A^{X_k} and A^Y are defined in Section 3, then the estimated support set N̂_j is empty.
Proof. The threshold for λ_n can be derived from the KKT conditions. We use the notation introduced in Section 3. The subgradient of the objective in (7) with respect to B_k is given in (42). Suppose, for contradiction, that N̂_j is non-empty; that is, there exists some k such that B̂_k ≠ 0. By (42), we obtain the equality in (44). Since the relevant matrix is positive semi-definite and we have assumed that (41) holds, the left-hand side of (44) can be bounded from below, where the first inequality follows from Lemma 16; the right-hand side of (44) satisfies the opposite bound. Combining the two bounds with (44) yields a contradiction. Thus, we conclude that N̂_j must be empty.

A.5 Proof of Theorem 1
In this section, we prove Theorem 1. We first introduce some useful notation. Let λ̄(n, p, M, δ) be defined as in (45), where c is a universal constant that does not depend on n, p, or M.
To simplify the notation, we omit the basis dimension M and let a^Y_i = a^Y_i,M, a^Xk_i = a^Xk_i,M, and B*_k = B*_k,M for all k ∈ [p − 1]. Then, by (6), for all i ∈ [n] we have (46), where u_i = w_i + r_i and w_i, r_i are defined in Appendix A.1. With this notation, we give a proof of Theorem 1.
Proof. Equation (46) can be rewritten as (47), which we can further formulate in matrix form with Z = A^X ⊗ I_M. We divide the columns of A^X into p − 1 groups of equal size M, that is, A^X = (A^X1, A^X2, . . . , A^X(p−1)), and similarly we divide the columns of Z into p − 1 groups of equal size M², that is, Z = (Z_1, Z_2, . . . , Z_(p−1)), where Z_k ∈ R^(nM×M²) for all k ∈ [p − 1]. Then (Z^⊤u)_k = Z_k^⊤u and, by the definition of Z, it is easy to see that Z_k = A^Xk ⊗ I_M.
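The Kronecker block structure used here is easy to verify numerically; the toy dimensions below are arbitrary.

import numpy as np

# Check that, with Z = A_X kron I_M, the k-th group of M^2 columns of Z equals
# A_Xk kron I_M and that (Z^T u)_k = Z_k^T u. Toy sizes only.
rng = np.random.default_rng(0)
n, p_minus_1, M = 4, 3, 2
A_X = rng.normal(size=(n, p_minus_1 * M))
Z = np.kron(A_X, np.eye(M))                      # shape (n*M, (p-1)*M^2)
u = rng.normal(size=n * M)
for k in range(p_minus_1):
    A_Xk = A_X[:, k * M:(k + 1) * M]
    Z_k = Z[:, k * M * M:(k + 1) * M * M]        # k-th column group of size M^2
    assert np.allclose(Z_k, np.kron(A_Xk, np.eye(M)))
    assert np.allclose(Z_k.T @ u, np.kron(A_Xk, np.eye(M)).T @ u)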
We can also rewrite (7) as a penalized least-squares problem in β̂, with the penalty term R(·) defined in (52). Thus, the support set defined in (4) and its estimator defined in (8) can be expressed in terms of β* and β̂, respectively. We next define the model space M(N_j) with respect to which the penalty term R(·) is decomposable, together with its orthogonal complement M(N_j)^⊥. It is then easy to verify that R(·) defined in (52) is decomposable with respect to (M(N_j), M(N_j)^⊥) (Example 2, Section 2.2 of Negahban et al. (2012)), that is, R(θ + γ) = R(θ) + R(γ) for all θ ∈ M(N_j) and γ ∈ M(N_j)^⊥.
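Written out in vectorized notation (our rendering, following Example 2 of Negahban et al. (2012), with θ_k ∈ R^(M²) denoting the k-th coefficient block), the decomposability amounts to:

\[
R(\theta) = \sum_{k=1}^{p-1} \|\theta_k\|_2, \qquad
\mathcal{M}(N_j) = \{\theta : \theta_k = 0 \ \text{for all } k \notin N_j\},
\]
\[
R(\theta + \gamma) = \sum_{k \in N_j} \|\theta_k\|_2 + \sum_{k \notin N_j} \|\gamma_k\|_2
= R(\theta) + R(\gamma)
\quad \text{for all } \theta \in \mathcal{M}(N_j),\ \gamma \in \mathcal{M}(N_j)^{\perp}.
\]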
When λ_n = λ̄(n, p, M, δ), where λ̄(n, p, M, δ) is defined in (45), Lemma 3 implies that the required condition holds with probability at least 1 − 2δ. Consequently, by Lemma 1 of Negahban et al. (2012), β̂ − β* lies in the constrained set C(N_j) defined in (54) with probability at least 1 − 2δ. Note that C(N_j) depends on the support N_j through the model space M(N_j).
Consider the error term of the first-order Taylor series expansion, where ∆β = β̂ − β*. By Lemma 7 and the fact that ∆β ∈ C(N_j) with probability at least 1 − 2δ, and then by Lemma 4, we obtain (56), where κ(M) is defined in (21) and χ(n, p, M, δ) is defined in (25). Given that the inequality inside the parentheses on the left-hand side of (56) holds, we obtain a bound on ∆β. Next, we prove that P{N̂_j = N_j} ≥ 1 − 3δ. To do so, we only need to prove that the above inequality implies N̂_j = N_j. Suppose the above inequality holds. For any k ∉ N_j we have B*_k = 0 for all M, which controls the corresponding blocks of β̂; on the other hand, for k ∈ N_j, the corresponding blocks are bounded away from zero because M ≥ M*, by the definition of M* in (24) and the definition of ν(M) in (23). Combining the above results, and noting that decreasing 3δ to δ does not affect the asymptotic order of n, we obtain the final result.
A.6 Proof of Theorem 2

As in the proof of Theorem 1, we omit the basis dimension M and let â^Y_i = â^Y_i,M, â^Xk_i = â^Xk_i,M, and B*_k = B*_k,M for all k ∈ [p − 1]. Then, by (9), for all i ∈ [n] we have the analogue of (46) with the estimated scores. We now give the proof of Theorem 2.
Proof. Let Â^X denote the matrix of estimated scores and let Ẑ = Â^X ⊗ I_M. We divide the columns of Â^X into p − 1 groups of equal size M, that is, Â^X = (Â^X1, Â^X2, . . . , Â^X(p−1)), and similarly we divide the columns of Ẑ into p − 1 groups of equal size M², that is, Ẑ = (Ẑ^X1, Ẑ^X2, . . . , Ẑ^X(p−1)), where Ẑ^Xk ∈ R^(nM×M²) for all k ∈ [p − 1]. Then (Ẑ^⊤u)_k = Ẑ_k^⊤u and, by the definition of Ẑ, it is easy to see that Ẑ_k = Â^Xk ⊗ I_M. In addition, let â^Y = ((â^Y_1)^⊤, · · · , (â^Y_n)^⊤)^⊤ ∈ R^(nM), and let Σ̂^X_n = (1/n)(Â^X)^⊤Â^X. Thus, we can rewrite (7) in terms of the estimated quantities and follow the proof of Theorem 1 to prove Theorem 2. The only two modifications needed are new upper bounds for (2/n) max_{k∈[p−1]} ||(Ẑ^⊤ũ)_k||_2 and ||Σ̂^X_n − Σ^X||_∞. In fact, when the two bounds in (63) hold, where χ̃(n, p, M, δ) is defined in (31), then, by the same argument as in the proof of Theorem 1, we can show that N̂_j = N_j. Thus, to show that N̂_j = N_j holds with high probability, we only need to prove that (63) holds with high probability. We define the events A_1, . . . , A_7 used below.
We now prove the above claim. We first prove that ∩_{i=1}^{7} A_i implies (2/n) max_{k∈[p−1]} ||(Ẑ^⊤ũ)_k||_2 ≤ λ̃(n, p, M, δ). Note that, for all k ∈ [p − 1], the decomposition based on (36) gives the three-term bound in (65). By A_5, the second term is bounded by λ̄(n, p, M, δ). To bound the third term, we apply Lemma 6 and then Lemma 14 with Γ_g(j) = σ_max,0 Q²_{n,p,δ} for all j ∈ [p], Γ_e = σ_jr Q¹_{n,δ}, and Γ_φ(m) = d_jm ((1/c_1) log(c_2/δ)/n)^(1/2), which is justified by A_1, A_2, and Assumption 6; this yields a bound involving Γ_v, where Γ_v is defined in (59), and combining it with A_4 bounds the third term. Finally, to bound the first term in (65), arguments similar to those for (66) and (67), together with A_1, Assumption 6, (68), and (69), give the required bound. Thus, combining the bounds on the three terms in (65) and using the definition of λ̃(n, p, M, δ) in (60), we obtain (2/n) max_{k∈[p−1]} ||(Ẑ^⊤ũ)_k||_2 ≤ λ̃(n, p, M, δ).
We then prove the bound on ||Σ̂^X_n − Σ^X||_∞. By A_6, we only need to prove that the remaining difference is bounded by Γ̃(n, p, M, δ) − Γ(n, p, M, δ). By Lemma 12 with Γ_g(j) = σ_max,0 Q²_{n,p,δ} for all j ∈ [p], together with (70) and (71), we obtain (2/n) max_{k∈[p−1]} ||(Ẑ^⊤ũ)_k||_2 ≤ λ̃(n, p, M, δ) and ||Σ̂^X_n − Σ^X||_∞ ≤ Γ̃(n, p, M, δ). Finally, we only need n large enough so that Γ̃(n, p, M, δ) ≤ κ(M)/(32M²s) and χ̃(n, p, M, δ) ≤ ν(M)/3, where χ̃(n, p, M, δ) is defined in (31). After dropping the log(n) term, each of these two conditions yields a requirement on the sample size. Combining the above results, and noting that decreasing 8δ to δ does not affect the asymptotic order of n, we obtain the final result.

A.7 Proposition 2 and its Proof
Let H be a separable Hilbert space of functions defined on a compact domain T ⊆ R. For f, g ∈ H, let ⟨f, g⟩ = ∫_T f(t)g(t)dt. Assume that β(s, t), (s, t) ∈ T², is the kernel function of an integral operator K : H → H, that is, ∫_T β(·, s)f(s)ds ∈ H. Recall that we define the Hilbert–Schmidt norm of β(s, t) as ||β(s, t)||²_HS = ∫_T ∫_T β²(s, t) ds dt, and we assume that ||β(s, t)||_HS < ∞. Let {φ_m(·)}_{m=1}^∞ be an orthonormal basis of H.
Proposition 2. Let {φ_m(·)}_{m=1}^∞ be any orthonormal basis of H and let β(s, t) be a bivariate function on T × T such that ||β(s, t)||_HS < ∞. Then β(s, t) = β̃(s, t) almost everywhere, where β̃(s, t) = Σ_{m≥1} Σ_{m′≥1} b_mm′ φ_m(s) φ_m′(t) and b_mm′ = ∫_T ∫_T β(s, t) φ_m(s) φ_m′(t) ds dt.

To prove the proposition, we only need to show that β(s, t) = β̃(s, t) a.e., or, equivalently, that ||β(s, t) − β̃(s, t)||_HS = 0.
First, we show that {φ_m(s)φ_m′(t)}_{m,m′≥1} is an orthonormal basis of L²(T × T). Then, by Parseval's relation (Theorem 2.4.13 of Hsing and Eubank (2015)), we obtain the required identity, where in the second equality we can change the order of integration and summation using Fubini's theorem (Theorem 1.7.2 of Durrett (2019)) and the assumption that ||β(s, t)||_HS < ∞, while the last equality is again by Parseval's relation.
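One way to spell out the Parseval step (our rendering of the standard computation; the coefficients b_{mm′} are those defined in the statement of Proposition 2):

\[
\|\beta\|_{\mathrm{HS}}^2 \;=\; \sum_{m, m' \ge 1} b_{mm'}^2,
\qquad
\Big\| \beta - \sum_{m \le M}\sum_{m' \le M} b_{mm'}\, \varphi_m \otimes \varphi_{m'} \Big\|_{\mathrm{HS}}^2
\;=\; \sum_{\max(m, m') > M} b_{mm'}^2 \;\longrightarrow\; 0 \quad (M \to \infty),
\]

so the partial sums converge to β in the Hilbert–Schmidt norm and β = β̃ almost everywhere.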
Proof. The result follows directly from Theorem 6.1 of Wainwright (2019) and a union bound.

Proof. We first work on the event that ||Z_k/√n||_2 ≤ C_n,δ1. Recall that u = w + r, where u is defined in (48), w = (w_1^⊤, w_2^⊤, . . . , w_n^⊤)^⊤ ∈ R^(nM), and r = (r_1^⊤, r_2^⊤, . . . , r_n^⊤)^⊤ ∈ R^(nM). Both w_i and r_i are mean-zero Gaussian vectors with covariance matrices Σ^w and Σ^r, respectively, for all i ∈ [n], where we drop the superscript M to simplify the notation. Since both r_i and a^Xk_i are Gaussian vectors and Cov(a^Xk_i, r¹_i) = 0, we have a^Xk_i ⊥⊥ r¹_i. We then upper bound the first term on the right-hand side of (73). Let ξ_i = w_i + r¹_i. Then ξ_i is a Gaussian vector with mean zero and covariance matrix Σ^ξk, with Σ^ξk = Σ^w + Σ^r − Σ^(r,Xk) (Σ^X_kk)^(−1) Σ^(Xk,r).
The above inequality holds for any Z_k such that ||Z_k/√n||_2 ≤ C_n,δ1 and is therefore valid unconditionally as well. Since Ξ_1(M) = max_{k∈[p−1]} ρ_max(Σ^ξk), it follows from the union bound that (75) holds. Next, we upper bound the second term in (73). For any k ∈ [p − 1], it suffices to bound the quantity in (76). We use Theorem 4.1 of Kuchibhotla and Chakrabortty (2018) to bound this term and first check the conditions therein. Since b^Xk_im is a standard normal random variable for all i ∈ [n] and m ∈ [M], (b^Xk_im)² is chi-squared distributed with one degree of freedom. By the moment generating function of the chi-square distribution, the required sub-exponential bound holds for all i ∈ [n], and the variance condition can be checked directly. Therefore, by Theorem 4.1 of Kuchibhotla and Chakrabortty (2018), the tail bound holds for all ∆ > 0, where c is a constant. Thus, combining this bound with (76) and (77), we obtain (78). The result follows by combining (73), (75), and (78).
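The matrix Σ^ξk above has the usual Schur-complement form of a Gaussian residual (or conditional) covariance. A small simulation check of that identity, with toy dimensions and variable names of our choosing:

import numpy as np

rng = np.random.default_rng(1)
M = 3
# Build a random joint covariance for (r, a), both M-dimensional.
S = rng.normal(size=(2 * M, 2 * M))
Sigma = S @ S.T + np.eye(2 * M)
Sigma_rr, Sigma_ra = Sigma[:M, :M], Sigma[:M, M:]
Sigma_ar, Sigma_aa = Sigma[M:, :M], Sigma[M:, M:]

# The residual of the linear projection of r on a has covariance equal to the Schur complement.
samples = rng.multivariate_normal(np.zeros(2 * M), Sigma, size=100000)
r, a = samples[:, :M], samples[:, M:]
resid = r - a @ np.linalg.solve(Sigma_aa, Sigma_ar)
schur = Sigma_rr - Sigma_ra @ np.linalg.solve(Sigma_aa, Sigma_ar)
print(np.max(np.abs(np.cov(resid, rowvar=False) - schur)))   # should be close to zero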
Proof. The result follows directly from Lemma 1, Lemma 2, and the Bonferroni inequality.

Proof. The result follows from Theorem 4.1 of Kuchibhotla and Chakrabortty (2018). In order to apply the theorem, we need to check the conditions therein. First, we bound max_{k∈[p−1], m∈[M]} ||a^Xk_im||_ψ2. Note that for ζ ∼ N(0, σ²), (ζ/σ)² is chi-square distributed with one degree of freedom. By the moment generating function of the chi-square distribution, we obtain a bound on the sub-Gaussian norm and hence, for any i ∈ [n], a bound in terms of K_0, where K_0 is defined in (20). We then bound max_{k,k′∈[p], m,m′∈[M]} Var(a^Xk_im a^Xk′_im′) by a direct calculation. The final result is then derived by applying Theorem 4.1 of Kuchibhotla and Chakrabortty (2018) with these bounds.

Proof. For any m, m′ ∈ [M], a direct computation of the relevant quantity yields the identity, and the result immediately follows.

With probability at least 1 − δ, we have ||Z∆β||²_2 ≥ (κ/4)||∆β||²_2 for all ∆β ∈ C(N_j), where C(N_j) is defined in (54).
Proof. Let {φ_m}_{m≥1} be the orthonormal eigenfunctions of the covariance function of g, and let a_m = ⟨g, φ_m⟩. Then the a_m, m ≥ 1, are independent mean-zero Gaussian random variables with variances σ_m, and σ_0 = Σ_{m≥1} σ_m. We further have g = Σ_{m≥1} a_m φ_m. Using Jensen's inequality for t ↦ t^(2k), we obtain the stated moment bound, which completes the proof.
Lemma 12. Recall that g_i(·) = (g_i1(·), g_i2(·), . . . , g_ip(·))^⊤ is our i-th observation defined in Section 2.1. Recall also that, in Section 2.4, φ_m = φ_jm and φ̂_m = φ̂_jm denote the m-th basis function and its corresponding estimate, respectively, used for the projection at the j-th node, and â^Xk_i,M = (â^Xk_i1, . . . , â^Xk_iM)^⊤ is the projection score vector of g_i obtained using {φ̂_m}_{m=1}^M. Under the assumption that (1/n) times the relevant empirical quantity is bounded as stated, the conclusion of the lemma holds.

Proof. Combining (83)-(84), and noting that b*_k,mm′ = b̃_k,mm′ = 0 for all m, m′ ≥ 1 when k ∉ N_j, we obtain the result.

Lemma 14. Recall that g_i(·) = (g_i1(·), g_i2(·), . . . , g_ip(·))^⊤ is our i-th observation defined in Section 2.1 and e_ij(·) is the error term defined in Assumption 1. Recall also that φ_m = φ_jm and φ̂_m = φ̂_jm are the m-th basis function and its corresponding estimate, respectively, used for the projection at the j-th node, as defined in Section 2.4. The lemma provides the bounds in (85) and (86).

Proof. We decompose the quantity of interest into the terms I_m, II_m, and III_m. By Lemma 13 and our assumption, we bound the first term; by Lemma 12, we bound the second term. In addition, we have III_m ≤ 3Γ_e Γ²_φ(m). Combining the above results, we complete the proof of (85).
To show (86), note that {φ_m}_{m=1}^∞ is an orthonormal function basis. Treating ∫_T β_k(t, t′)(φ̂_m′(t′) − φ_m′(t′)) dt′ as a function of t, we obtain a bound in terms of Γ²_φ(m′). Then (86) follows by combining this inequality with (85).
Lemma 15. Let A ∈ R^(m×n) and B ∈ R^(n×m). Then the stated norm inequality holds.

Proof. Let A = [a_1, a_2, . . . , a_n]. Expanding the relevant squared norm column by column gives the claimed bound, and the final result follows from taking the square root of both sides.
Lemma 16. Let A ∈ R^(n×n) be symmetric and positive semi-definite, let B ∈ R^(n×m), and let C ∈ R^(n×n) be a diagonal matrix with positive diagonal elements. Then the stated inequality holds.

Proof. Since A is symmetric and positive semi-definite and C is a diagonal matrix with positive diagonal elements, the corresponding product of C and A is symmetric and positive semi-definite. Noting that A^⊤A is also symmetric and positive semi-definite, we obtain an inequality between the two quadratic forms, which implies the final result.

C Table of Notations
Notation and meaning (page of first use):
G = (V, E): undirected graph, where V is the set of vertices and E is the set of edges (2)
X: p-dimensional random vector (2)
M: number of basis functions used for dimension reduction (3)
g(·): p-dimensional multivariate Gaussian process (5)
T: domain of the multivariate Gaussian process (5)
C_jl(t, t′): conditional cross-covariance function (5)
{g_i(·)}_{i=1}^n: g_i(·) = (g_i1(·), · · · , g_ip(·))^⊤, random copies of g(·) (5)
β_jk(t, t′): coefficient on g_ij from g_ik (5)
N_j, N̂_j: (estimated) neighborhood set of node j (5)
e_ij(·): error of g_ij(·) (5)
φ_j: = {φ_jm(·)}_{m=1}^∞, orthonormal functional basis on H (6)
a_i,k,M: = (a_ik1, · · · , a_ikM)^⊤, vector of projection scores (6)
g^Y_i(·), g^Xk_i(·): random function of the target node and the other random functions
â^Y_i,M, â^Xk_i,M: vectors of scores projected on the estimated bases φ̂_j (7-8)
K_jj(t′, t): functional covariance of g_ij(·) (8)
K_j(f)(t): Hilbert-Schmidt covariance operator (8)
{σ_jm}_{m∈N}: eigenvalues of K_j (8)
g^M_ij(t): L² projection of g_ij(t) onto the basis spanned by the first M FPCA functions (8)
K̂_jj(t′, t): empirical functional covariance of g_ij(·) (8)
{σ̂_jm, φ̂_jm(t)}_{m=1}^M: eigenpairs of K̂_jj(t′, t) (8)
â_ijm: = ∫_T g_ij(t)φ̂_jm(t)dt, estimated FPCA scores (8)
B̂_λn: = (B̂_1, · · · , B̂_{p−1}), group lasso estimates under a fixed λ_n (9)
B̃_k: estimates of B_k in the selective cross-validation process (10)
â^Y_i,M − Σ_{k=1}^{p−1} B̂_k â^Xk_i,M: residuals of B̂_k (10)
A^Y, A^Xk, A^X: matrices of FPCA scores (10)
ρ, ρ^h: penalty parameter for the ADMM subproblem (on iteration h) (10)
b*_k,mm′