Semi-parametric inference for large-scale data with temporally dependent noise

Abstract: Temporal dependence is frequently encountered in large-scale structured noisy data, arising from scientific studies in neuroscience and meteorology, among others. This challenging characteristic may not align with existing theoretical frameworks or data analysis tools. Motivated by multi-session fMRI time series data, this paper introduces a novel semi-parametric inference procedure suitable for a broad class of "non-stationary, non-Gaussian, temporally dependent" noise processes in time-course data. It develops a new test statistic based on a tapering-type estimator of the large-dimensional noise auto-covariance matrix and establishes its asymptotic chi-squared distribution. Our method not only relaxes the consistency requirement for the noise covariance matrix estimator but also avoids direct matrix inversion without sacrificing detection power. It adapts well to both stationary and a wider range of temporal noise processes, making it particularly effective for handling challenging scenarios involving very large scales of data and large dimensions of noise covariance matrices. We demonstrate the efficacy of the proposed procedure through simulation evaluations and real fMRI data analysis.


Introduction
The advancement of high-throughput technology has transformed research in many fields. Large-scale structured data often exhibit complex forms of dependence, non-stationarity, and non-Gaussianity. Examples include datasets from functional magnetic resonance imaging (fMRI) in neuroinformatics, interferometric synthetic aperture radar images from Earth-orbiting satellites in geophysics [1,8], and air quality measurements from thousands of weather stations in meteorology [25,37,11], among others. These challenging features may not align with existing theoretical frameworks or data analysis tools. Ignoring such dependence and non-Gaussian non-stationarity can lead to inaccurate statistical estimation and inference procedures for large-scale imaging data. This paper aims to develop statistical methodologies and inference tools to effectively address temporal dependence.

Motivation of this work
In recent years, the study of human cognition has greatly benefited from innovations in fMRI, an exciting technique aimed at delineating task-activated regions in the human brain. A typical fMRI dataset for a single subject and session can be viewed as a huge collection of cuboid elements, typically ranging from 10^4 to 10^5 voxels, each voxel v associated with a time series of responses denoted as {Y^(v)(t_i) : i = 1, …, n; v ∈ V}, where the series length, n, can extend to the order of 10^3. Analyzing fMRI data commonly involves two stages. Stage-I (voxelwise analysis): modeling and making inferences for fMRI signals at each voxel. Stage-II (whole-brain inference): detecting regions of brain activity across the entire brain.
A significant challenge in voxelwise inference lies in the fact that the observed signals are contaminated with highly correlated noise in the time domain. Moreover, several factors may induce non-Gaussian non-stationarity in the noise process. For instance, background memory processes or extraneous auditory and visual stimuli can lead to changes in the noise's variance [17,26]. Additionally, abrupt movements by the subject represent another source of non-stationarity [9]. Thus, it is essential to adequately account for non-stationary, non-Gaussian temporal dependence to accurately detect activation in fMRI.
In the literature, several statistical modeling and analysis methods have been developed for single-voxel fMRI, including the general linear model (GLM) [19,20,39] and the semi-parametric model approach [46], each with its own constraints.
(i) GLM approaches usually assume that the noise process is a stationary time series, such as AR(1) [4], AR(m) [39], and ARMA(1, 1) [31], among others. Some variations of non-stationarity, such as independent heterogeneous processes [26,29], AR combined with independent heterogeneous noise [9], and long-range dependent 1/f-like processes [26], have been considered within the GLM framework. However, GLM approaches often require restrictive assumptions on model parameters, such as specific parametric forms for the hemodynamic response function (HRF) and the temporal drift function [39]. These assumptions can lead to biased estimators of true responses and test statistics with limited power for detection.
(ii) Semi-parametric models of the form

Y^(v) = S h^(v) + d^(v) + ε^(v), v ∈ V, (1.1)

were developed in [46] for time-course data Y^(v) = (Y^(v)(t_1), …, Y^(v)(t_n))^T at voxel v, with external stimuli in the matrix S common to all voxels. These models provide more flexibility in modeling assumptions compared to the GLM approach but are restricted to a stationary g-dependent noise process ε^(v) = (ε^(v)(t_1), …, ε^(v)(t_n))^T. Detailed descriptions can be found in Section 2.1. [15] considered two types of stationary noise processes: a g_n-dependent sequence and a linear process with a dependence structure related to the number n of time points. [27] modeled the noise process as a fractional Brownian motion process. These approaches require the noise covariance matrix estimator Σ ∈ R^{n×n} to be consistent for the true Σ_n and involve taking the inverse of Σ, posing significant challenges for statistical inference with large-scale time-course data.
In the literature, semi-parametric inference for stationary time series includes [10], [16], [22], and [23], among others. Specifically, [10] studied the partially linear single-index model; [16] proposed the partial-linear single-index model with ARMA noise; and [22] examined the partially time-varying coefficient model. However, these semi-parametric models differ from the partially linear model presented in our paper. [23] considered the partially linear model and investigated the simultaneous inference of the nonparametric part with stationary noise, but their results do not apply to non-stationary noise. Semi-parametric models for non-stationary time series have also been explored in the literature; see, e.g., [35], [38], and [13]. [35] introduced the semi-parametric factor model, which is distinct from the partially linear model in our work. [38] conducted semi-parametric inference for panel data with non-stationary noise. However, they assumed that the noise terms are independent, so their results are not applicable to non-stationary temporally dependent noise. [13] examined a semi-parametric model that encompasses the partially linear model. Although the response in their model is a non-stationary time series, they assumed that the noise term is stationary. Consequently, their results cannot be used to analyze our model with non-stationary noise. Moreover, if the model in [13] simplifies to the partially linear model, as it does in our paper, their estimation method corresponds to ours, except that we employ the local-linear method while they opt for the local-constant method. Consequently, there is no need to compare our method with the approaches mentioned above.
In stark contrast, very few works have explored the semi-parametric model for fMRI data in the context of non-stationary, non-Gaussian, temporally dependent noise. Given the inherently time-varying nature of fMRI noise, the assumption of stationarity may not be realistically appropriate. Therefore, it is desirable to investigate how the identification of activated brain voxels can benefit from semi-parametric modeling, including parameter learning and inference under a broader range of temporally dependent noise processes.

Outline of this work
To detect activated brain regions, we first conduct voxelwise inference in Stage-I using the data {Y^(v) : v ∈ V}, and subsequently, in Stage-II, we conduct multiple testing using the sequence of voxelwise p-values {p^(v) : v ∈ V}. This paper mainly focuses on Stage-I for theoretical analysis, while both stages are involved in the numerical studies. Regarding Stage-II, exploring alternative approaches, such as the recent multivariate tensor method, could be an interesting topic for future research.
In this paper, we develop a novel semi-parametric inference procedure that accommodates serially correlated noise, which can be non-stationary and non-Gaussian. This has applications in fMRI and other large-scale imaging data with time series noise. We will investigate three key issues.
Concerning the structure of temporally dependent noise, we will initially consider short-range dependence, building on the framework presented in [47] and [45]. This framework encompasses various time series models, including stationary models (e.g., ARMA models [4,39,31] and stationary g-dependent processes [46]) and non-stationary models (e.g., independent heterogeneous processes [26,29] and AR + independent heterogeneous noise models [9]) as special cases. See Section 2 for details. Extensions to other types of dependence will be explored in Section 8.
Inspired by the detection of brain activity within a voxel, we will explore the significance test of the parameter vector in the semi-parametric model (2.1) with temporally dependent noise. Due to non-stationarity and the unique structure of the Toeplitz design matrix, some existing forms of semi-parametric test statistics, particularly well-suited for stationary noise, are no longer applicable in non-stationary scenarios. To accommodate non-stationarity, we will develop a new semi-parametric test statistic by utilizing an appropriate estimator Σ of Σ_n. Unlike existing results, our method circumvents the consistency requirement and avoids direct inversion of Σ in high dimensions while maintaining the nominal size without compromising detection power. This approach strikes a balance between statistical guarantees and computational efficiency, especially in practical large-scale imaging data scenarios. See Section 3 for details.
Regarding the estimation of Σ_n for temporally dependent noise, several regularization methods have been studied in the context of estimating large covariance matrices. These methods have been explored in areas of high-dimensional inference [3,5,7,18,44] and stationary time series analysis [28,40,41,43], and include techniques such as banding, tapering, and thresholding. We adopt the tapering-based method [3,43] for two reasons. (1) Most penalization methods require a diverging (approaching infinity) number of i.i.d. replicates, whereas a typical fMRI experiment involves only a finite number of replicates (e.g., sessions, scans, or runs), making penalization methods unsuitable. (2) The entries of Σ_n decay as they move away from the diagonal, matching the "tapering" structure, and the tapering-based estimator is positive semidefinite, with positive-definiteness further ensured by adding κ_n I_n with a shrinking κ_n > 0. The thresholding method does not exploit this decay structure, and the banding method does not guarantee positive-semidefiniteness. Therefore, the tapering method is the preferred choice for estimating Σ_n. See Section 4 for further details.
Section 5 explores the asymptotic properties of the new inference procedure and develops practical computational strategies. Simulation studies in Section 6, which include 1D, 2D, and 3D datasets, support these theoretical properties and demonstrate the superior performance of the new procedure compared to existing ones. Section 7 analyzes a real fMRI dataset, and Section 8 provides a brief conclusion.
Details of technical derivations and proofs are relegated to the Appendix. We introduce some notation below. Denote 0_q = (0, …, 0)^T ∈ R^q and 1_q = (1, …, 1)^T ∈ R^q for q ∈ Z^+. Let I_q ∈ R^{q×q} be the identity matrix, and e_{i;q} be the ith column of I_q for i = 1, …, q. For a square matrix A ∈ R^{q×q}, A ≻ 0 and A ⪰ 0 mean that A is positive definite and positive semidefinite, respectively. Denote by λ_max(A) the maximum eigenvalue of A. For a generic matrix B ∈ R^{q×q′}, let B(i, j) be the element of B in the ith row and jth column for i = 1, …, q and j = 1, …, q′. The L_1, L_2, and L_∞ norms of B are denoted as ‖B‖_1, ‖B‖_2, and ‖B‖_∞, respectively. For any scalar random variable X and constant a ∈ (0, ∞), define |||X|||_a = {E(|X|^a)}^{1/a}. In the following, C and c represent generic finite constants that are independent of the sample size, denoted as n.

Semi-parametric model
We begin with a brief introduction to the semi-parametric partially-linear model (PLM):

Y_i = S_i^T h + d(t_i) + ε(t_i), i = 1, …, n. (2.1)

Here, the response is denoted as Y_i and the covariates are represented as (t_i, S_i), where t_i ∈ [0, 1]. In this model, we aim to estimate the unknown parameter vector h and the non-parametric function d(·), while ε(t_i) represents the noise terms with E{ε(t_i)} = 0. We impose the assumption t_i ∈ [0, 1] to ensure that d(·) is compactly supported, which is necessary for consistent non-parametric estimation of d(t). It is worth noting that, in the case of real fMRI data, the duration between consecutive observations remains unchanged with n. For temporal data measured at time points {t_i}_{i=1}^n (e.g., equally spaced t_i = i/n, relevant to fMRI and [8]), denoting S = (S_1, …, S_n)^T, the PLM (2.1) can be re-expressed as

Y = S h + d + ε, (2.2)

with

• S = (S_1, …, S_ℓ) ∈ R^{n×(ℓm)} for the design matrix. E.g., S_j ∈ R^{n×m} is a Toeplitz design matrix associated with s_j(t_i), representing the input of the jth type of observed external stimulus at time t_i. Please refer to [46] for the derivation of S from the convolution model.
• h = (h_1^T, …, h_ℓ^T)^T for the vector of unknown regression parameters (relevant to the brain HRF) of primary research interest. For instance, h_j = (h_j(t_1), …, h_j(t_m))^T ∈ R^{m×1} is the parameter vector associated with the jth type of observed external stimulus.
• ε = (ε(t_1), …, ε(t_n))^T for the zero-mean noise process, independent of S.
For fMRI data, ℓ is the number of external stimuli, and m indicates the length of each HRF parameter vector h_j, j = 1, …, ℓ. In real fMRI data, the HRF is typically sparse, with m usually specified as a small integer much less than n; a common rule-of-thumb choice is m = 18. The design matrix S in fMRI data depends on how external stimuli are presented to the subjects [39]. For the sake of presentation, we assume that s_j(·) starts at the same time for all j = 1, …, ℓ. However, our method can also accommodate design matrices that allow different starting times for different types of stimuli, e.g., s_j(t_i) = 0 for some j = 1, …, ℓ and all i = 1, …, n_0, with a fixed integer n_0 > 1. While our method is developed for fMRI data with the Toeplitz design matrix S_j, our theoretical arguments do not rely on the Toeplitz structure and hold for general design matrices that satisfy Conditions A4 and A10. It is worth noting that model (2.2) also has the flexibility to accommodate multiple sessions (i.e., runs or segments). However, for brevity, the discussions in Sections 2-5 will concentrate on a single run.
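To illustrate the Toeplitz structure of each block S_j, the matrix can be built by lagging the stimulus sequence. The sketch below is a minimal illustration, assuming zero-padding before the first scan; the exact indexing convention follows the convolution model in [46].

```python
import numpy as np

def toeplitz_design(stimulus, m):
    """Build an n x m Toeplitz design block S_j whose (i, k) entry is the
    stimulus input lagged by k steps (hypothetical indexing convention):
    column k is the stimulus sequence shifted down by k positions."""
    n = len(stimulus)
    S = np.zeros((n, m))
    for k in range(m):
        S[k:, k] = stimulus[: n - k]
    return S

# toy event-related stimulus: onsets at scans 0, 10, 20
s = np.zeros(30)
s[[0, 10, 20]] = 1.0
S = toeplitz_design(s, m=5)
```

Each diagonal of S is constant, so S h_j realizes the discrete convolution of the stimulus with the HRF vector h_j.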
The smooth function d(t) describes the slowly drifting baseline in the fMRI signal, which can vary between different voxels.The HRF characterizes the stimulus-related variation in the signal for a specific voxel.For each voxel, the HRF is commonly assumed to be identical when the same type of stimulus is presented at different times, as discussed in [39].
This work aims to detect brain activity by testing the significance of the HRF parameters. While inference for the nonparametric part d(·) is feasible (using methods such as [42]), it falls outside the scope of this paper because d(·) is not directly relevant to event-related brain activity.

Assumptions on the noise process
The noise terms {ε(t_i)}_{i=1}^n in either (2.1) or (2.2) exhibit serial correlation, with the covariance structure characterized by a positive-definite auto-covariance matrix Σ_n = cov(ε) ∈ R^{n×n}. Most of the current literature relies on the assumption of stationarity or Gaussianity. Motivated by the framework in [45], we incorporate non-Gaussian non-stationarity into the sequence of noise terms {ε(t_i)} as follows:

ε(t_i) = G(t_i; F_i), i = 1, …, n, (2.3)

where F_i = (…, e_{i−1}, e_i) is generated by i.i.d. innovations {e_j}_{j∈Z} and G(·; ·) is a measurable function such that G(t; F_i) is well defined for each t ∈ [0, 1]. As discussed in the fMRI literature, the noise is time-varying non-stationary; see, e.g., [24] and [26], while our assumption includes common non-stationary linear processes, e.g., time-varying AR and time-varying MA, as special cases. For detailed descriptions and examples of the noise mechanism, please refer to Section 2 of [45] and Section 4 of [47].
To further accommodate non-stationary short-range dependence, we introduce e_0′ as an i.i.d. replicate of e_0 and define the coupled filtration F_i* = (…, e_{−1}, e_0′, e_1, …, e_i). The dependence of G(t; F_i) on e_0 across t ∈ [0, 1] can be quantified by

δ_{i,ω} = sup_{t∈[0,1]} |||G(t; F_i) − G(t; F_i*)|||_ω,

where ω > 0 is a constant. The cumulative dependence of {G(t; F_i)}_{i=q}^∞ on e_0 is represented by

Θ_{q,ω} = Σ_{i=q}^∞ δ_{i,ω}, (2.4)

with a non-negative integer q. As in Condition A3 in Section 5.1, we consider the short-range dependence condition, expressed via the quantity Θ_{q,ω} in (2.4), i.e.,

Θ_{0,ω} < ∞ for a constant ω ∈ [4, ∞). (2.5)

As a remark, the assumption ω ≥ 4 is necessary to derive the error bound for the quadratic form of the noise in Lemma 4. The same condition has been considered for both stationary and non-stationary time series analysis; see, e.g., [41] and [45], respectively.
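As one concrete instance of the mechanism ε(t_i) = G(t_i; F_i), a time-varying AR(1) recursion can be simulated as below; the coefficient function a(·) is an illustrative choice, not taken from the source.

```python
import numpy as np

def tv_ar1_noise(n, a, rng=None):
    """Simulate eps(t_i) = a(t_i) * eps(t_{i-1}) + e_i with t_i = i/n and
    i.i.d. innovations e_i -- a special case of the non-stationary noise
    mechanism eps(t_i) = G(t_i; F_i).  Short-range dependence holds when
    sup_t |a(t)| < 1, so the cumulative dependence measure is finite."""
    rng = np.random.default_rng(rng)
    eps = np.zeros(n)
    prev = 0.0
    for i in range(n):
        prev = a((i + 1) / n) * prev + rng.standard_normal()
        eps[i] = prev
    return eps

# illustrative time-varying coefficient a(t) = 0.3 + 0.4 t
eps = tv_ar1_noise(500, a=lambda t: 0.3 + 0.4 * t, rng=0)
```

The innovations need not be Gaussian; replacing `standard_normal` with a heavy-tailed draw gives a non-Gaussian variant.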

Implications of the noise assumptions
Utilizing assumption (2.5), we demonstrate in Lemma 1 some properties of the noise process {ε(t_i)}. One noteworthy implication is that the absolute autocovariance of the noise diminishes as the time-lag q increases. Under the short-range dependence condition (2.5), the noise mechanism (2.3) encompasses a wide range of stationary and non-stationary time series noise models as special cases. These include stationary ARMA models [4,39,31] and g-dependent processes [46], as well as non-stationary i.ni.d. processes [26,29] and AR + i.ni.d. processes [9].
Lemma 1 only necessitates moment conditions (i.e., Condition A3) for the noise terms {ε(t_i)}, without imposing specific distributional assumptions. Therefore, our method is generally applicable to non-Gaussian scenarios. Additionally, the numerical results in Section 6 demonstrate the robust performance of our method under non-Gaussian conditions.

Semi-parametric inference with temporally dependent noise
The detection of brain activity can be achieved by testing the significance of the parametric part h (associated with the HRF) in model (2.2). If a specific voxel is activated by a certain type of stimulus, then the corresponding HRF is not identically zero; otherwise, the HRF is zero. Moreover, checking the equivalence between HRFs activated by different types of stimuli is also of interest to neuroscientists. This inspires us to consider the linear form of hypotheses on h:

H_0 : A h = 0 versus H_1 : A h ≠ 0. (3.1)

Here, A ∈ R^{k×(ℓm)} is a full row rank matrix of coefficients, and k is a fixed positive integer. In particular, if A = I_{ℓm}, then (3.1) can be employed to detect the activity at a single voxel. If A = (I_m, −I_m, 0_{m×m}, …, 0_{m×m}), we can test the equivalence of two HRFs, h_1 and h_2. Other choices of A are also applicable, depending on the specific structures of the HRFs being tested.
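The two contrast matrices mentioned above can be written down directly; the dimensions ℓ = 3 and m = 18 in this sketch are toy values, not from the source.

```python
import numpy as np

m, ell = 18, 3  # toy HRF length and number of stimulus types (assumed values)

# A = identity: (3.1) tests whether the full HRF parameter vector is zero,
# i.e., whether the voxel is activated by any stimulus.
A_single = np.eye(ell * m)

# A = (I_m, -I_m, 0, ..., 0): (3.1) tests equality of the first two HRFs.
A_equal = np.hstack([np.eye(m), -np.eye(m), np.zeros((m, (ell - 2) * m))])

# sanity check: A_equal annihilates any h whose first two HRF blocks agree
h1 = np.linspace(0.0, 1.0, m)
h = np.concatenate([h1, h1, np.zeros(m)])
```

Both matrices have full row rank, as required for the Wald-type statistic to be well defined.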
To fit the semi-parametric model (2.2) and carry out the semi-parametric inference (3.1) for h, we first develop a semi-parametric estimation for the nonparametric part d in (2.2) using the local-linear estimation method [12]. Denote by S_b ∈ R^{n×n} the local-linear smoothing matrix associated with the design points {t_1, …, t_n}, whose (i, j) entries in (3.2) are determined by a kernel function K(·) and a bandwidth parameter b > 0, for i, j = 1, …, n. Applying the local-linear estimation method to model (2.2), we obtain S_b Y = S_b(S h + d + ε), which leads to the estimator h of h in (3.5). The asymptotic normality of h is demonstrated in Lemma 5 in the Appendix under certain regularity conditions. This result allows us to propose a new Wald-type test statistic W(Σ_n; b) in (3.6) for conducting the significance test of (3.1), which incorporates the noise auto-covariance matrix Σ_n. In contrast, [46] and [15] use R_n^{−1} (the inverse of the noise auto-correlation matrix) in their test statistics because they estimate h using the weighted least-squares estimator, whose asymptotic covariance matrix includes R_n^{−1}. Moreover, we can directly use Σ_n in (3.6) instead of the exact covariance of the smoothed noise, because the auto-correlation induced by S_b is negligible, as shown in Lemma 2.
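As a concrete sketch of the smoothing step, the local-linear smoothing matrix S_b can be formed row by row; the Epanechnikov kernel below is an assumed choice, and the paper's display (3.2) fixes the exact weights.

```python
import numpy as np

def local_linear_smoother(t, b,
                          kernel=lambda u: np.maximum(0.0, 0.75 * (1 - u**2))):
    """Local-linear smoothing matrix S_b over design points t: row i holds
    the weights of the local-linear fit at t_i, i.e. the first row of
    (X^T W X)^{-1} X^T W with X = [1, t_j - t_i] and W = diag K((t_j - t_i)/b).
    The Epanechnikov kernel here is illustrative."""
    n = len(t)
    Sb = np.zeros((n, n))
    for i in range(n):
        u = t - t[i]
        w = kernel(u / b)
        X = np.column_stack([np.ones(n), u])
        WX = X * w[:, None]
        Sb[i] = np.linalg.solve(X.T @ WX, WX.T)[0]
    return Sb

t = np.arange(1, 101) / 100.0   # equally spaced design points t_i = i/n
Sb = local_linear_smoother(t, b=0.1)
```

A defining property of local-linear smoothing is that it reproduces linear functions exactly, so S_b applied to the drift leaves only higher-order curvature as bias.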
We wish to emphasize that, in comparison to certain existing methods like (5.1) (combined with the estimator h in (5.2)), which assume stationarity and necessitate the computation of the inverse covariance matrix Σ_n^{−1} for parameter estimation or significance testing, the new test statistic (3.6) (associated with the estimator h in (3.5)) leverages Σ_n directly, without requiring the computation of its inverse Σ_n^{−1}. In Section 5, we provide statistical guarantees for the broad applicability of W(Σ_n; b). Moreover, the advantages in numerical efficiency become especially evident when analyzing large-scale imaging data. In cases with very large n, such as n = 2800 in the real fMRI data discussed in Section 7, developing a valid empirical estimator for Σ_n^{−1} presents more challenges than estimating Σ_n itself. However, our proposed test statistic (4.2) only necessitates the computation of inverses of fixed-dimensional matrices, making it computationally more efficient.
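The final quadratic form of a Wald-type statistic such as (3.6) can be sketched as follows; `V_hat` is a stand-in for the estimated covariance of the parameter estimator (which in the paper is built from Σ_n without inverting it), so only a k × k matrix is inverted regardless of n.

```python
import numpy as np

def wald_statistic(h_hat, V_hat, A):
    """Wald-type quadratic form W = (A h)^T [A V A^T]^{-1} (A h).
    Under the null A h = 0, W is compared with a chi-squared law with
    k = rank(A) degrees of freedom.  Only the k x k matrix A V A^T is
    inverted, however large the dimension of h."""
    Ah = A @ h_hat
    return float(Ah @ np.linalg.solve(A @ V_hat @ A.T, Ah))

# toy check: A = I_2, V_hat = I_2, h_hat = (3, 0)  ->  W = 3^2 = 9
W = wald_statistic(np.array([3.0, 0.0]), np.eye(2), np.eye(2))
```

In use, W is compared with the upper α-quantile of the χ²_k distribution.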

Auto-covariance matrix estimation for temporally dependent noise
In practical applications, the true noise auto-covariance matrix Σ_n is unknown, leading to the need for estimating Σ_n ∈ R^{n×n} before applying the Wald-type test statistic in (3.6). We propose estimating Σ_n using the tapering-based estimator Σ in (4.1), constructed as the Schur product of a tapering matrix with a raw residual-based covariance estimate, plus κ_n I_n. Here, the tapering matrix T ∈ R^{n×n} is positive-semidefinite and Toeplitz. The operator "•" represents the Schur matrix product [33] (or entrywise product), and κ_n > 0 is a shrinking parameter. The inclusion of κ_n I_n in (4.1) ensures the positive definiteness of Σ. Theoretically, κ_n approaches zero, and it is worth noting that our theoretical results remain valid even if κ_n = 0. In practice, the choice of κ_n has a negligible impact on numerical results, provided κ_n > 0 is sufficiently small. We can set κ_n as a user-defined small constant, e.g., κ_n = 10^{−5}, or adopt a data-driven choice like κ_n = ‖Y − S h‖_2^2 / n^ν with a user-defined constant ν ≥ 2, as motivated by [28] (see Section 3 therein). By substituting the estimator Σ into (3.6), we obtain a practical form of the semi-parametric test statistic W(Σ; b) in (4.2).
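A minimal sketch of the construction in (4.1), assuming the raw estimate is the outer product of a residual vector (the paper's exact raw estimate may differ):

```python
import numpy as np

def tapered_covariance(resid, T, kappa=1e-5):
    """Tapering-based estimator in the spirit of (4.1):
    Sigma = (T o resid resid^T) + kappa * I_n, with 'o' the Schur
    (entrywise) product.  If T is positive semidefinite, the Schur
    product theorem keeps T o (resid resid^T) positive semidefinite,
    and kappa > 0 then guarantees positive definiteness."""
    n = len(resid)
    return T * np.outer(resid, resid) + kappa * np.eye(n)

n = 60
idx = np.arange(n)
# a PSD Toeplitz taper (triangular weights, illustrative banding value 6)
T = np.maximum(0.0, 1.0 - np.abs(idx[:, None] - idx[None, :]) / 6.0)
rng = np.random.default_rng(1)
Sigma_hat = tapered_covariance(rng.standard_normal(n), T)
```

The design choice worth noting is that positive definiteness comes for free from the Schur product plus the small ridge κ_n I_n, so no eigenvalue truncation is needed.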

Theoretical and methodological issues
The tapering-based method is commonly employed for large covariance matrix estimation in high-dimensional inference [3,7] and stationary time series analysis [28,43]. This approach effectively ensures the positive-semidefiniteness of the resulting estimator. However, in the presence of non-stationary noise, directly verifying the consistency of the tapering-based estimator Σ in (4.1) appears challenging. Therefore, conducting the asymptotic analysis of the proposed semi-parametric test statistic W(Σ; b) with non-stationary dependent noise is a nontrivial undertaking. Nonetheless, in Section 5.1, we demonstrate that the proposed W(Σ; b) follows an asymptotic χ²_k distribution under the null hypothesis. Furthermore, we derive the asymptotic power functions of the proposed test statistic under both fixed and contiguous alternatives. Section 5.2 addresses the selection of the tapering matrix T in the estimator Σ, and in Section 5.3, we develop data-driven methods for tuning parameter selection.

Asymptotic distributions of the proposed semi-parametric test statistics
This section rigorously examines the asymptotic properties of W(Σ_n; b) and W(Σ; b) in the presence of non-Gaussian temporally dependent noise. The technical assumptions for the theoretical analysis are provided in Conditions A1-A10. A key step in establishing the null distribution is to show that n^{−1} S^T Σ S approximates n^{−1} S^T Σ_n S (Lemma 6 in the Appendix). In [46], for stationary noise processes, a similar condition (Condition A8 therein) is imposed to derive the asymptotic distributions of their proposed test statistics K and its bias-corrected version K_bc; this is done by assuming consistency of an estimator R_n of the noise auto-correlation matrix R_n, computed from the residual vector r = Y − S h. [15] demonstrated that a banded estimator of R_n satisfies Condition A8 in [46] under a stationary g_n-dependent assumption for ε. In our work, a straightforward approach to prove Lemma 6 would be to establish the consistency of Σ. However, under the non-stationary situation, demonstrating the consistency of Σ is extremely challenging due to the large number of parameters estimated. Here, we establish Lemma 6 by leveraging the special structure of the tapering-based estimator Σ and the assumption of short-range dependence for the noise process {ε(t_i)}. Our assumption on the bandwidth, b → 0 and nb → ∞ in Condition A5, is milder than those required for analyzing the partially linear model in [32] and [21]. This is because we consider fixed design points t_1, …, t_n for the nonparametric component, while they use random design. Fixed design allows us to disregard the estimation error of the conditional expectation of S given the design points of the nonparametric component. In contrast, controlling this estimation error necessitates stringent conditions on b under random design; refer to the first paragraph of Section 3 in [32] and the fourth paragraph on page 278 in [21] for further details.
Remark 1. The primary technical tools employed in the proof of Theorem 1 to address non-stationarity involve a_n-dependent and martingale approximations for linear and quadratic forms. These tools only necessitate the short-range dependence outlined in Condition A3. While Lipschitz continuity is often required, for instance, to approximate Σ_n(i, j) using a specific covariance function for a time-varying model as demonstrated in [45], it is not needed in our paper because we directly employ Σ_n in the test statistic (3.6) without any approximation.
Next, we delve into the asymptotic powers of W(Σ n ; b) and W( Σ; b) against the fixed alternative in (3.1).

Theorem 2. Assume Condition
Theorem 2 suggests that, under any fixed alternative H_1, W(Σ_n; b) and W(Σ; b) diverge to infinity at a rate of n. Hence, the power functions of the proposed test statistics approach one under H_1. In addition to the fixed alternative H_1 in (3.1), we also examine shrinking contiguous alternatives within an n^{−1/2} neighborhood of H_0 in (3.1). Theorem 3 asserts that under such local alternatives, our proposed test statistics follow an asymptotic noncentral χ²_k distribution, where the noncentrality parameter τ² depends on the constants M_1 and M_2.

Choice of the tapering matrix T in (4.1)
The asymptotic properties of the proposed W(Σ; b) in (4.2) depend on certain technical assumptions (Condition A9 in Section 5.1) concerning the tapering matrix T used in (4.1) to construct Σ. A wide class of tapering matrices satisfies these conditions. In our numerical experiments, we employ the triangular tapering matrix T_{g_n} defined in (5.4), where the banding parameter g_n ∈ {1, …, n} depends on n. This triangular tapering matrix T_{g_n}, as used in [43] for estimating the autocovariance matrix in stationary time series, has been shown to be rate-optimal. Therefore, we adopt (5.4) for our method. Various other tapering methods are available in the literature. For instance, [3] considered a class of tapering weights (see Assumption A on page 207 therein) that satisfies our Condition A9 under proper assumptions on τ_n. Another tapering matrix, T_{k_n}, used in both [6] (see equation (5) on page 2121 therein) and [28] (see equation (4) therein), is defined entrywise for i, j = 1, …, n.
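A Bartlett-type triangular taper is consistent with the description of T_{g_n}; the exact display of (5.4) is not reproduced in this excerpt, so the form below is an assumed standard choice.

```python
import numpy as np

def triangular_taper(n, g):
    """Toeplitz triangular (Bartlett-type) tapering matrix with banding
    parameter g: entries (1 - |i - j| / g)_+ .  This weight sequence is
    the autocorrelation of a length-g boxcar, so the resulting Toeplitz
    matrix is positive semidefinite, matching the requirement on T in
    (4.1).  (Assumed form consistent with the text.)"""
    idx = np.arange(n)
    return np.maximum(0.0, 1.0 - np.abs(idx[:, None] - idx[None, :]) / g)

T = triangular_taper(40, g=5)
```

By contrast, the flat-top taper T_{k_n} mentioned in the text is not guaranteed to be positive semidefinite, which is why the triangular form is preferred here.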
While [6] demonstrated that this tapering estimator is rate-optimal for estimating high-dimensional covariance matrices with i.i.d. observations, it is important to note that T_{k_n} does not guarantee positive semidefiniteness. In Proposition 1 below, we discuss the conditions on g_n under which T_{g_n} fulfills Condition A9 as outlined in Section 5.1.
Proposition 1. Under suitable growth conditions on the banding parameter g_n, the tapering matrix T_{g_n} defined in (5.4) satisfies Condition A9 in Section 5.1.
For brevity, denote by Σ_{b;g_n} the noise auto-covariance matrix estimator in (4.1) computed with the tapering matrix T_{g_n} and bandwidth b, as in (5.5), and by W(Σ_{b;g_n}; b) the semi-parametric test statistic calculated with the true Σ_n substituted by the estimator Σ_{b;g_n}.
In [46], semi-parametric test statistics K and K_bc were developed based on a g-banded estimator of the noise auto-correlation matrix R_n, with g as a fixed integer. [15] studied a g_n-banded estimator R_n of R_n for K and K_bc, demonstrating the consistency of R_n by assuming g_n = o(n^{1/14}). Despite the different structures and assumptions, it is important to clarify the distinction between the divergence rate of the banding parameter g_n for R_n in [15] and that for Σ_{b;g_n} in Proposition 1. On the one hand, R_n is a banded estimator whereas Σ_{b;g_n} is a tapering-based estimator. On the other hand, [15] aims to prove the consistency of R_n, which requires stronger conditions on g_n, whereas the consistency of Σ_{b;g_n} is not required in our paper.

Selection of tuning parameters b and g n
This section develops data-driven methods for selecting the bandwidth parameter b for the local-linear smoothing matrix S_b in (3.2) and the banding parameter g_n for the tapering matrix T_{g_n} in (5.4). In practical implementation, we proceed by first calculating the data-driven choice b of the bandwidth, as explained in Section 5.3.2 below. We then compute the data-driven choice g(b) for g_n, as discussed in Section 5.3.1. Additional simulation evaluations (omitted in the paper) demonstrate that the proposed method for selecting the tuning parameters b and g_n is robust against a small percentage of outlier contamination.

Data-driven choice of g n using a given b
Recall that two tuning parameters, b and g n , serve different roles in the estimator Σ b;gn in (5.5).We first develop a data-driven method to choose g n for any given bandwidth parameter b.
In the literature on estimating high-dimensional auto-covariance or autocorrelation matrices of stationary time series [40,15], the risk-minimization method is used to select the banding parameter.Estimating the risk functions in these works involves using the sub-sampling technique [30], which entails "overlapped" data splitting.
In cases involving non-stationary noise processes, selecting g_n by directly minimizing E(‖Σ_{b;g_n} − Σ_n‖²) may not perform well, as the estimator Σ_{b;g_n} may not be consistent for Σ_n under non-stationary noise. However, the result n^{−1} S^T Σ_{b;g_n} S − n^{−1} S^T Σ_n S = o_P(1) (see Lemma 6) enables us to establish a more appropriate approach of selecting g_n by minimizing the risk function proposed in (5.6), which measures the discrepancy between n^{−1} S^T Σ_{b;g_n} S and n^{−1} S^T Σ_n S. To enhance computational efficiency (supported by Proposition 1), g_n is chosen from {1, …, C_0} rather than the entire set {1, …, n}, where C_0 is a predetermined integer much smaller than n.
We will apply the concept of sub-sampling to estimate the proposed risk function in (5.6). Despite the non-stationary nature of the noise process, where the noise covariance structure may vary among sub-samples, a common decaying pattern persists across all sub-samples due to the short-range dependence. Therefore, the sub-sampling method can be employed to estimate the shared banding parameter for all sub-samples. Let res = Y − S h represent the vector of residuals from model (3.4). We use positive integers V and B to denote the total number of blocks (each comprising sub-samples) and the number of sub-samples within each block, respectively. Define G_ν = {(ν − 1)⌊(n − B)/(V − 1)⌋ + 1, …, (ν − 1)⌊(n − B)/(V − 1)⌋ + B} to collect the indices i ∈ {1, …, n} associated with sub-samples in the νth block, for ν = 1, …, V, where ⌊·⌋ represents the floor operator. Extract the rows corresponding to indices i ∈ G_μ from res and S, denoted as res^(μ) and S^(μ), respectively. To estimate (5.6) using sub-samples, replace the unknown Σ_n with a sub-sample analogue built from res^(μ), yielding the estimated risk r(g_n; b). For any given bandwidth b, this leads to the data-driven choice of the banding parameter:

g(b) = arg min_{g_n ∈ {1, …, C_0}} r(g_n; b). (5.7)

Remark 2. The implementation of the data-driven method for selecting g_n involves choices of C_0, V, and B. In our numerical evaluations, we have found that reasonable choices are C_0 = ⌊3 log(10n)⌋, V = 20, and B = ⌊8 n^{1/3}⌋ for the proposed test statistic W(Σ_{b;g(b)}; b) across various settings.
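The construction of the overlapping index sets G_ν can be sketched directly from the definition; the values of n, V, and B below follow the rule-of-thumb choices in Remark 2 (0-based indexing is used for convenience).

```python
import numpy as np

def subsample_blocks(n, V, B):
    """Index sets G_nu, nu = 1..V, each holding B consecutive time points;
    consecutive block starts are separated by floor((n - B) / (V - 1)),
    so the blocks overlap and jointly span the series."""
    step = (n - B) // (V - 1)
    return [np.arange(nu * step, nu * step + B) for nu in range(V)]

n = 300
V, B = 20, int(8 * n ** (1 / 3))   # Remark 2 rule-of-thumb choices
blocks = subsample_blocks(n, V, B)
```

Each block then yields a sub-sample residual vector res^(μ) and design S^(μ) for estimating the risk r(g_n; b).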

Data-driven choice of b
Recall that the estimator ĥ depends on the bandwidth b; its data-driven choice proceeds in three steps.

Step 1: Obtain an initial estimate of h by a preliminary procedure, e.g., by taking the first-order difference of (2.2), which yields D_1 Y = D_1 S h + D_1 d + D_1 ε, where D_1 is an (n − 1) × n first-order differencing matrix. Since d(·) is smooth under Condition A1, the term D_1 d is ignorable, and the ordinary least squares method supplies the initial difference-based estimate ĥ_DBE = {(D_1 S)^T D_1 S}^{-1} (D_1 S)^T D_1 Y.

Our simulation studies indicate that the choice of b_0 in Step 2 does not appreciably affect the performance of the proposed semi-parametric test statistic; other proper choices of b_0 can also be used in practice.
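Step 1 can be sketched as follows under an illustrative setup; the drift d, the noise level, and the sizes below are assumptions for the sketch, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 300, 4
S = rng.standard_normal((n, m))           # design matrix of stimuli
h_true = np.array([0.5, -0.3, 0.0, 0.2])
t = np.linspace(0, 1, n)
d = np.sin(2 * np.pi * t)                 # smooth non-parametric drift (illustrative)
eps = 0.1 * rng.standard_normal(n)
Y = S @ h_true + d + eps

# First-order differencing matrix D1 ((n-1) x n): (D1 x)_i = x_{i+1} - x_i
D1 = np.eye(n - 1, n, k=1) - np.eye(n - 1, n)

# Since d is smooth, D1 @ d is nearly 0; OLS on the differenced data gives h_DBE
h_dbe, *_ = np.linalg.lstsq(D1 @ S, D1 @ Y, rcond=None)
```

Differencing removes the smooth component almost entirely, so ĥ_DBE recovers h without any bandwidth choice, which is exactly why it serves as the bandwidth-free starting point of the procedure.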
• Noise models 2 and 5 (stationary noise process): innovations distributed as N(0, 1) in noise model 2, and as c_3 t_5 in noise model 5, with the constant c_3 adjusted to match the variance of the counterpart in noise model 2.
• Noise models 3 and 6 (non-stationary noise process): time-varying AR(1), with innovations following the distribution N(0, 0.5²) in noise model 3 and c_4 t_5 in noise model 6, where the constant c_4 is adjusted to match the variances.
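Simulating such a time-varying AR(1) process can be sketched as below; the coefficient function rho_fun is a hypothetical stand-in, since the paper's exact specification is not reproduced in this excerpt:

```python
import numpy as np

def tvar1_noise(n, rho_fun, innov, rng):
    """Simulate eps(t_i) = rho(t_i) * eps(t_{i-1}) + e_i on t_i = i/n,
    where rho_fun is a (hypothetical) time-varying AR coefficient."""
    e = innov(rng, n)
    eps = np.empty(n)
    eps[0] = e[0]
    for i in range(1, n):
        eps[i] = rho_fun(i / n) * eps[i - 1] + e[i]
    return eps

rng = np.random.default_rng(1)
rho = lambda u: 0.4 + 0.3 * np.sin(np.pi * u)        # assumed coefficient path
# Gaussian innovations N(0, 0.5^2) as in noise model 3
eps3 = tvar1_noise(500, rho, lambda g, k: 0.5 * g.standard_normal(k), rng)
# Scaled t_5 innovations as in noise model 6 (scale is illustrative)
eps6 = tvar1_noise(500, rho, lambda g, k: 0.25 * g.standard_t(5, k), rng)
```

Because rho varies with i/n, the marginal variance of eps changes over time, which is the non-stationarity the proposed test must accommodate.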

Inference for 1D temporally correlated data
For comparison, we also consider the test statistics K and K_bc from [46], which assume stationarity and employ the auto-correlation matrix estimator R̂_n of [15]. (ii) In the presence of non-stationary correlated noise, the finite-sample distributions of K and K_bc in [46] do not closely resemble the χ²_m distribution, as observed in the right panels of Figures 3 and 5; their agreement with the chi-squared distribution is confirmed only in the stationary case, as displayed in the right panels of Figure 1.
To assess the power of the test statistics, data (S, Y) were generated from model (2.2) with a non-zero true parameter h = δ1_m, where δ ranges from −0.3 to 0.3 in increments of 0.05. The empirical rejection rates of H_0 in (6.1) for the different δ values are plotted in Figures 2, 4, and 6.
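The empirical rejection rates can be computed as below; this sketch draws stand-in statistics from the null χ²_m distribution and approximates the critical value by Monte Carlo so the sketch stays numpy-only:

```python
import numpy as np

def rejection_rate(stats, m, alpha=0.05, n_mc=200_000, rng=None):
    """Fraction of simulated test statistics exceeding the chi^2_m
    critical value (approximated by a large Monte Carlo sample)."""
    rng = rng or np.random.default_rng(0)
    chi2_samples = np.sum(rng.standard_normal((n_mc, m)) ** 2, axis=1)
    crit = np.quantile(chi2_samples, 1 - alpha)
    return float(np.mean(np.asarray(stats) > crit))

# Under H0 a well-calibrated statistic is ~chi^2_m, so the rate should be near alpha
rng = np.random.default_rng(2)
m = 16
null_stats = np.sum(rng.standard_normal((500, m)) ** 2, axis=1)
rate = rejection_rate(null_stats, m)
```

Sweeping δ over the grid and recording `rejection_rate` for each value reproduces the power curves plotted in the figures.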
(iii) As expected, when δ = 0, the powers of W(Σ_n; b_oracle) and W(Σ̂_{b̂;ĝ(b̂)}; b̂) are nearly equal to the nominal significance level of 0.05. As δ deviates from 0, the powers increase towards one. Hence, the proposed test statistics exhibit strong power against non-null h while maintaining control of the type-I error rate. The test statistics K and K_bc of [46], displayed in the right panels of Figure 2, also perform well and show results similar to our proposed method, owing to the stationary nature of the noise process there. (iv) However, unexpectedly, the right panels of Figures 4 and 6 reveal that the empirical powers of K and K_bc in [46] exceed the nominal level even when δ = 0, a consequence of the non-stationary temporally correlated noise process.

Inference for 2D temporally and spatially correlated data
This section illustrates the semi-parametric inference procedure for detecting significant regions in 2D datasets. We simulate 101 replicates of fMRI datasets on a 2D slice formed by 64 × 64 coordinates. The number of scanned voxels in V_0 ∪ V_1 is 578, and the number of truly active scanned voxels in V_1 is 104, where V_0 and V_1 refer to the regions of background and activated voxels, respectively. Within each simulation, at each voxel v ∈ V_0 ∪ V_1, the data are simulated via Y^(v) = S h^(v) + d^(v) + ε^(v), according to the set-up in Section 6.1, containing n = 300 time points within each of 2 runs, with … = 1 and m = 16. Here h^(v) = 0 at v ∈ V_0. For the nonzero h^(v), we adopt [14] to specify the HRF parameters, for i = 1, …, m and j = 1, …, using g_1(t) = t^{a_1} e^{−t/b_1} and g_2(t) = t^{a_2} e^{−t/b_2}, with a_1 = 5, b_1 = 0.9, a_2 = 12, b_2 = 0.7, c = 0.4, and τ_s = 5.5. The region V_1 corresponds to two disks (shown in red in the top left panel of Figure 7), centered at (47, 44) and (20, 27), with radii 5 and 6, respectively. Recall from (1.1) that the stimuli in S are common to all voxels v ∈ V_0 ∪ V_1 but vary across simulations; hence, the 2D data are spatially correlated. The signal-to-noise ratios (SNR) of the simulated fMRI data at the activated voxels v ∈ V_1 are 1.31, 0.59, 1.36, 1.31, 0.59, and 1.35 under noise models 1-6, respectively, where SNR = Σ_{i=1}^n var(S_i^T h^(v)) / Σ_{i=1}^n var{ε^(v)(t_i)}. To compare with the semi-parametric inference methods, we also include the parametric F-test combined with "parametric OLS" estimation, which assumes that the non-parametric component in (2.1) follows a parametric quadratic model r_0 + r_1 t_i + r_2 t_i² with unknown parameters (r_0, r_1, r_2), and that the noise terms {ε(t_i)} are i.i.d. N(0, σ²).
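A sketch of the two-gamma HRF components with the stated parameter values; how g_1, g_2, c, and τ_s are combined is elided in this excerpt, so the canonical difference form used below is an assumption:

```python
import numpy as np

# Stated component parameters (the combination rule below is assumed)
a1, b1, a2, b2, c = 5.0, 0.9, 12.0, 0.7, 0.4

def hrf(t):
    """Canonical-style double-gamma HRF: peak component minus a scaled
    undershoot component, each normalized to unit maximum (assumption)."""
    g1 = t ** a1 * np.exp(-t / b1)
    g2 = t ** a2 * np.exp(-t / b2)
    return g1 / g1.max() - c * g2 / g2.max()

t = np.linspace(0, 30, 601)     # seconds, illustrative grid
h = hrf(t)
peak_time = t[np.argmax(h)]     # near a1*b1 = 4.5 s
```

With these parameters the response peaks a few seconds after stimulus onset and shows the characteristic post-stimulus undershoot (h dips below zero around t ≈ 8-9 s).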
Two false discovery rate (FDR) controlling procedures, the BH procedure [2] and the procedure of Storey et al. [36], are employed to simultaneously test the significance of h^(v) in (6.1) at all 578 voxels, with the FDR control level set at 0.05. Table 1 compares the calculated false discovery proportions (FDP), averaged across 101 simulations. The proposed W(Σ_n; b_oracle) and W(Σ̂_{b̂;ĝ(b̂)}; b̂) both effectively control the FDR, whereas the "parametric OLS" method inflates the FDR due to misspecification of the non-parametric component. For graphical illustration of the detected active regions, Figure 7 presents representative plots for each of the three estimation methods (combined with the Storey et al. procedure). Here, "representative" signifies the median performance, ranked according to the FDP, across the 101 simulations. In Figure 7, the top left panel shows the true active regions (in red), while the top right, bottom left, and bottom right panels display the active regions (in purple) detected by the "parametric OLS" method, the proposed W(Σ_n; b_oracle), and W(Σ̂_{b̂;ĝ(b̂)}; b̂), respectively. The comparison reveals that the "semi-parametric data-driven" method outperforms the "parametric OLS" method in the presence of non-stationary non-Gaussian time series noise.
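The BH procedure referenced above is a simple step-up rule; a compact implementation (with illustrative p-values) is:

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """BH step-up procedure: reject the k smallest p-values, where
    k = max{ i : p_(i) <= i*q/m }. Returns a boolean rejection mask
    aligned with the input order."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresh = q * np.arange(1, m + 1) / m      # i*q/m for the sorted p-values
    below = p[order] <= thresh
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = int(np.max(np.nonzero(below)[0]))  # largest passing index (0-based)
        reject[order[: k + 1]] = True
    return reject

# Illustrative p-values (one per hypothesis), not from the paper
pvals = np.array([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216])
rej = benjamini_hochberg(pvals, q=0.05)
```

Applied voxelwise to the 578 p-values, the mask `rej` marks the voxels declared active at FDR level 0.05.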

Inference for 3D temporally and spatially correlated data
We simulate 101 sets of fMRI data on a 3D cube with 64 × 64 × 8 coordinates; the number of scanned voxels is 4949, and the number of truly active scanned voxels is 773. The simulation setup at each voxel is similar to that for the 2D inference in Section 6.3. In Figure 8, the top left 8 slices show the true active regions (in red), while the top right, bottom left, and bottom right 8 slices depict the active regions (in purple) identified by the "parametric OLS" method, the proposed W(Σ_n; b_oracle), and W(Σ̂_{b̂;ĝ(b̂)}; b̂), respectively. The graphical assessment of the detected regions in Figure 8, together with the numerical comparison of the FDPs summarized in Table 2, provides compelling evidence of the superior performance of the "semi-parametric data-driven" inference method over its parametric counterpart for data contaminated with non-stationary, non-Gaussian, temporally correlated noise.

Real brain fMRI analysis
We analyze the StarPlus fMRI data originally collected at Carnegie Mellon University's CCBI. Detailed information about the experimental design, data description, and subjects can be found at http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-81/www/. Images were captured every 0.5 seconds for 6 subjects, with only a portion of each subject's brain imaged. The data are marked with 25-30 anatomically defined regions referred to as "Regions of Interest" (ROIs). There are n = 2800 images, featuring 3 types of stimuli, and m = 18. Figure 9 shows fMRI series plots of Subject 1 at 6 different voxels, where the non-stationarity pattern is visually evident. Figure 10 compares the active regions (in purple) identified by the "parametric OLS" method and the proposed "semi-parametric data-driven" method W(Σ̂_{b̂;ĝ(b̂)}; b̂). The "semi-parametric data-driven" method outperforms the "parametric OLS" method by effectively capturing significant regions, including the ROIs, whereas the latter fails to detect a majority of the ROIs. Results for the other subjects are similar and omitted due to space limitations.

Discussion
As high-throughput technology advances, new statistical inference methods naturally emerge from the analysis of large-scale structured datasets. These datasets are characterized not only by high dimensionality but also by non-stationarity and temporal dependence in various forms. The development of new tools for stochastic modeling, computational algorithms, and statistical inference procedures for such large-scale data is highly desirable, enabling scientists to analyze data efficiently with significantly increased flexibility and accuracy. This paper proposed a new method for performing semi-parametric inference on large-scale imaging data, inspired by the analysis of fMRI data, where the underlying noise process exhibits non-stationary, non-Gaussian behavior and temporal dependence. Our method departs from existing approaches that rely on the assumption of noise-process stationarity. We propose a semi-parametric test statistic using a tapering-based estimator of the noise auto-covariance matrix and illustrate its advantages over current methods. On the theoretical side, our method relaxes the requirement that the noise covariance matrix estimator be consistent. On the numerical side, it avoids direct inversion of the n × n covariance matrix, preserving efficiency. It applies to both stationary and a broader range of non-stationary, non-Gaussian noise processes, making it particularly effective in addressing challenges posed by high-dimensional data.
In conclusion, we discuss potential directions for future research. First, the class of noise processes in Section 2 does not cover all possible cases; for example, the stationary g_n-dependent process [15] and the long-range dependent 1/f-like non-stationary noise process [26] were not part of our assumptions. Investigating the feasibility of extensions on a case-by-case basis and addressing the associated technical details present interesting avenues for further study. Second, our method for fitting the semi-parametric model estimates the parameter vector without imposing any prior structural knowledge. It would be valuable to explore how incorporating structural information, such as sparsity, grouping, and constraints, impacts parameter learning and the related statistical inference. Third, a comprehensive comparison between our proposed method and conventional approaches, including parametric and semi-parametric models from the statistics, econometrics, and neuroimaging literatures, would be beneficial for assessing its effectiveness on fMRI data.

Appendix A: Proofs of main results
We present below the proofs of Lemma 1 and Lemmas 2-6, which are needed in the proofs of the main results.
The first inequality above follows from the triangle inequality and the Cauchy-Schwarz inequality; the last step is due to (…). Under Condition A4, the term E[{s_{j1}(t_{i−u1}) − p_{j1}}{s_{j2}(t_{k−u2}) − p_{j2}}{s_{j1}(t_{i′−u1}) − p_{j1}}{s_{j2}(t_{k′−u2}) − p_{j2}}] is nonzero in the following four cases: … For part (ii), from (A.3), similar arguments imply that E(I_4) = o(1); hence var(I) = o(1). Following (A.4) and (A.5), we can show that E(II²), E(III²), and E(IV²) are o(1), which finishes the proof from (A.3). …, where the first inequality comes from Fatou's lemma, the second inequality is due to Lemma 7 of [43], and the third inequality follows from (A.7). This completes the proof.

Lemma 4. Let
where Ỹ = (I_n − S_b)Y, S̃ = (I_n − S_b)S, d̃ = (I_n − S_b)d, and ε̃ = (I_n − S_b)ε, with I_n denoting the n × n identity matrix. Owing to the smoothness of the function d(·) (see Condition A1 in Section 5.1), the part d̃ in (3.3) is negligible. Consequently, model (3.3) takes the approximate form Ỹ ≈ S̃ h + ε̃. (3.4) The ordinary least squares method provides the estimator of the parametric part h as ĥ = (S̃^T S̃)^{-1} S̃^T Ỹ. (3.5) Similarly, the non-parametric part d is estimated as d̂ = S_b(Y − S ĥ).
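The projection estimators (3.4)-(3.5) can be sketched in numpy; the running-mean matrix standing in for the local-linear smoother S_b, and the simulated design, are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 200, 3
t = np.linspace(0, 1, n)
S = rng.standard_normal((n, m))
h_true = np.array([1.0, -0.5, 0.25])
d = np.cos(2 * np.pi * t)                    # smooth non-parametric component
Y = S @ h_true + d + 0.1 * rng.standard_normal(n)

# A simple running-mean smoother matrix stands in for the local-linear S_b
# (assumption: any reasonable linear smoother illustrates the projection step)
b = 10
Sb = np.zeros((n, n))
for i in range(n):
    lo, hi = max(0, i - b), min(n, i + b + 1)
    Sb[i, lo:hi] = 1.0 / (hi - lo)

P = np.eye(n) - Sb                            # I_n - S_b
S_t, Y_t = P @ S, P @ Y                       # tilde-S and tilde-Y
h_hat = np.linalg.solve(S_t.T @ S_t, S_t.T @ Y_t)   # OLS as in (3.5)
d_hat = Sb @ (Y - S @ h_hat)                  # non-parametric part
```

Because P annihilates smooth trends up to smoothing bias, the OLS step recovers h while d is absorbed into d_hat, mirroring the partial-linear logic of (3.4)-(3.5).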

As in the proof of Theorem 1, we can demonstrate that under Condition A, for any subsequence {n_l}_{l=1}^∞, there exists a further subsequence {n_{l_j}}_{j=1}^∞ such that n_{l_j}^{-1} S̃^T S̃ →_P M*_1 ≻ 0 and n_{l_j}^{-1} S̃^T Σ_{n_{l_j}} S̃ →_P M*_2 ≻ 0 as j → ∞, provided the limits exist. Given that the limits of W(Σ_n; b) and W(Σ̂; b) are infinity and do not depend on the explicit forms of M_1 and M_2, this prompts us to explore relaxing the conditions n^{-1} S̃^T S̃ →_P M_1 ≻ 0 and n^{-1} S̃^T Σ_n S̃ →_P M_2 ≻ 0 in Theorem 2. This result further enables us to establish the divergence of W(Σ_n; b) and W(Σ̂; b) under H_1 with relaxed conditions.

Corollary 1. Assume Condition A in Section 5.1. Under the fixed alternative H_1 in (3.1), we have W(Σ_n; b) →_P ∞ and W(Σ̂; b) →_P ∞.

Recall from (3.5) that the estimator ĥ of the parameter vector h relies on the bandwidth b of the local-linear smoothing matrix S_b. It is thus natural to choose b to minimize the mean squared error (MSE) of ĥ. In the presence of non-stationary noise, this is expressed as

MSE(ĥ | S) = ‖(S̃^T S̃)^{-1} S̃^T d̃‖₂² + tr{(S̃^T S̃)^{-1} S̃^T (I_n − S_b) Σ_n (I_n − S_b)^T S̃ (S̃^T S̃)^{-1}}. (5.8)

Ideally, if d and Σ_n were known, we could calculate the oracle-constant bandwidth b_oracle, which minimizes E{MSE(ĥ | S)}. In realistic applications with unknown d and Σ_n, we select the data-driven choice of b by minimizing the plug-in estimate of MSE(ĥ | S), which replaces d and Σ_n by their estimates. Detailed procedures are described in Steps 1-3 below.
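The criterion (5.8) can be evaluated directly for any candidate smoother; in this sketch the white-noise covariance and the crude global-mean smoother are stand-ins for Σ_n and the local-linear S_b:

```python
import numpy as np

def mse_h_given_S(S, Sb, d, Sigma_n):
    """Evaluate (5.8): squared bias contributed by the smooth part d,
    plus the trace variance term, for a given smoother matrix S_b."""
    n = S.shape[0]
    P = np.eye(n) - Sb
    St, dt = P @ S, P @ d                       # tilde-S and tilde-d
    G = np.linalg.inv(St.T @ St)
    bias = G @ St.T @ dt                        # (S~'S~)^{-1} S~' d~
    var = np.trace(G @ St.T @ P @ Sigma_n @ P.T @ St @ G)
    return float(bias @ bias + var)

rng = np.random.default_rng(4)
n, m = 60, 2
S = rng.standard_normal((n, m))
d = np.sin(np.linspace(0, np.pi, n))
Sigma_n = 0.25 * np.eye(n)                      # white noise as a simple stand-in
Sb = np.full((n, n), 1.0 / n)                   # crude global-mean smoother (illustrative)
val = mse_h_given_S(S, Sb, d, Sigma_n)
```

Minimizing `mse_h_given_S` over a grid of smoothers parameterized by b is the oracle version of the bandwidth choice; the data-driven version plugs in estimates of d and Σ_n.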

Step 2: Select an initial bandwidth parameter b_init by minimizing the covariance penalty ‖(I_n − S_b)(Y − S ĥ_DBE)‖₂² + 2 tr{S_b Σ̂_{b_0; ĝ(b_0)}}, where ĥ_DBE is obtained in Step 1 and ĝ(b_0) is the banding parameter selected by (5.7) using a predetermined bandwidth b_0 = 0.5.

Step 3: Obtain the data-driven choice b̂ of the bandwidth by minimizing the plug-in estimate of MSE(ĥ | S), with d and Σ_n in (5.8) replaced by their estimates S_{b_init}(Y − S ĥ_DBE) and Σ̂_{b_init; ĝ(b_init)}, respectively, where ĝ(b_init) is the banding parameter selected by (5.7) using the bandwidth b_init from Step 2.
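The Step 2 grid search can be sketched as follows; the running-mean smoother, the plug-in covariance, and the bandwidth grid are all illustrative assumptions in place of the local-linear smoother S_b and the tapering estimator:

```python
import numpy as np

def running_mean_smoother(n, b):
    """Assumed stand-in for the local-linear smoother S_b (window ~ b*n points)."""
    k = max(1, int(b * n))
    Sb = np.zeros((n, n))
    for i in range(n):
        lo, hi = max(0, i - k), min(n, i + k + 1)
        Sb[i, lo:hi] = 1.0 / (hi - lo)
    return Sb

def covariance_penalty(b, Y, S, h_dbe, Sigma_hat):
    """Step 2 criterion: ||(I - S_b)(Y - S h_DBE)||_2^2 + 2 tr(S_b Sigma_hat)."""
    n = len(Y)
    Sb = running_mean_smoother(n, b)
    r = (np.eye(n) - Sb) @ (Y - S @ h_dbe)
    return float(r @ r + 2.0 * np.trace(Sb @ Sigma_hat))

rng = np.random.default_rng(5)
n, m = 120, 2
S = rng.standard_normal((n, m))
h_dbe = np.zeros(m)                         # pretend Step 1 returned h ~ 0
Y = np.sin(np.linspace(0, 2 * np.pi, n)) + 0.3 * rng.standard_normal(n)
Sigma_hat = 0.09 * np.eye(n)                # simple plug-in covariance stand-in
grid = [0.02, 0.05, 0.1, 0.2, 0.4]
b_init = min(grid, key=lambda b: covariance_penalty(b, Y, S, h_dbe, Sigma_hat))
```

The first term rewards smoothers that track the residual trend, while the trace term penalizes over-flexible (nearly-identity) smoothers, so the minimizer balances the two.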
1D temporally correlated data (S, Y) are simulated from model (2.2) to mimic voxelwise fMRI time series. The plots depict the empirical percentiles (from the 1st to the 99th) of the proposed semi-parametric test statistics, W(Σ_n; b_oracle) and W(Σ̂_{b̂;ĝ(b̂)}; b̂), under H_0 in (6.1) versus the theoretical percentiles of the χ²_m distribution. These plots are displayed in Figures 1, 3, and 5, corresponding to temporal noise models 4, 5, and 6, respectively, based on 500 Monte Carlo replications for each setting. Here, the oracle-constant bandwidth b_oracle minimizes E{MSE(ĥ | S)}, where the expectation is approximated by an empirical average across 100 simulated samples. The data-driven choices, b̂ and ĝ(b̂), are as described in Section 5.3.
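The percentile comparison underlying these QQ-type plots can be sketched as follows; stand-in statistics are drawn from χ²_m itself, and the reference percentiles are approximated by a large Monte Carlo sample so the sketch stays numpy-only:

```python
import numpy as np

rng = np.random.default_rng(6)
m, n_sim = 16, 500

# Stand-in "test statistics" drawn from chi^2_m (in the paper these would be
# W(Sigma_n; b_oracle) and W(hat-Sigma; b-hat) computed on simulated series)
stats = np.sum(rng.standard_normal((n_sim, m)) ** 2, axis=1)

# Reference chi^2_m percentiles via a large Monte Carlo sample
ref = np.sum(rng.standard_normal((200_000, m)) ** 2, axis=1)
qs = np.arange(1, 100) / 100.0
emp = np.quantile(stats, qs)        # empirical percentiles, 1st..99th
theo = np.quantile(ref, qs)         # approximate theoretical percentiles

max_rel_dev = float(np.max(np.abs(emp - theo) / theo))
```

Plotting `emp` against `theo` produces the QQ-type diagnostic; points near the 45-degree line indicate agreement with the χ²_m reference distribution.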
(i) Clearly, in both the stationary and non-stationary cases, the finite-sample distributions of W(Σ_n; b_oracle) and W(Σ̂_{b̂;ĝ(b̂)}; b̂) agree reasonably well with the χ²_m distribution. This supports the notion that the proposed test statistic W(Σ̂_{b̂;ĝ(b̂)}; b̂), using the tapering-based estimator of Σ_n and data-driven choices of the tuning parameters, is comparable to its oracle counterpart W(Σ_n; b_oracle), which uses the true Σ_n and the oracle-constant bandwidth parameter. Thus, both W(Σ̂_{b̂;ĝ(b̂)}; b̂) and W(Σ_n; b_oracle) adapt well to the temporal dependence arising from either stationary or non-stationary noise.
These properties further motivate us to develop a tapering-based estimator of the noise auto-covariance matrix Σ_n in Section 4.

Table 1
Comparison of the calculated FDPs, averaged over 101 simulations.

Table 2
Comparison of the calculated FDPs, averaged over 101 simulations.