Estimation of Conditional Mean Operator under the Bandable Covariance Structure

We consider high-dimensional multivariate linear regression models in which the joint distribution of the covariates and the response variables is multivariate normal with a bandable covariance matrix. The main goal of this paper is to estimate the regression coefficient matrix, which is a function of the bandable covariance matrix. Although the tapering estimator of covariance attains the minimax optimal convergence rate for the class of bandable covariances, we show that it has a sub-optimal convergence rate for the regression coefficient; that is, a minimax estimator for the class of bandable covariances may not be a minimax estimator for its functionals. We propose the blockwise tapering estimator of the regression coefficient, which attains the minimax optimal convergence rate for the regression coefficient under the bandable covariance assumption. We also propose a Bayesian procedure, called the blockwise tapering post-processed posterior of the regression coefficient, and show that it attains the same minimax optimal convergence rate. Numerical studies show that the proposed methods outperform existing methods.


Introduction
Consider the multivariate linear regression model
\[ Y_i = C X_i + \epsilon_i, \qquad i = 1, 2, \ldots, n, \]
where Y_i ∈ R^q is a response vector, X_i ∈ R^{p_0} is a covariate vector, C ∈ R^{q×p_0} is a regression coefficient matrix, and ε_i ∈ R^q, i = 1, 2, ..., n, are independent and identically distributed error vectors from a q-dimensional distribution with mean zero. The multivariate linear regression model has been used in various fields of application. For example, [36] analyzed atmospheric data using the model to forecast PM2.5 concentration, and [29] used the model to analyze genomics data.
For the estimation of the multivariate linear regression coefficient C ∈ R^{q×p_0}, one of the most commonly used approaches is the penalized least squares method, which finds the minimizer of an objective function of the form
\[ \frac{1}{n}\,\mathrm{tr}\big\{ (Y - XC^T)\,\Omega\,(Y - XC^T)^T \big\} + P(C, \Omega), \]
where P(C, Ω) is a penalty term, X ∈ R^{n×p_0} and Y ∈ R^{n×q} are defined so that the ith row vectors of X and Y are X_i and Y_i, respectively, and Ω is the q × q positive definite precision matrix of the error vector in the multivariate regression model. The penalized least squares method penalizes the objective function when the estimate of C deviates from the low-dimensional structure that the true coefficient matrix C is assumed to have, which is beneficial in high-dimensional settings, where p_0 and q can grow to infinity as n → ∞.
For example, weighted sums of the absolute values of the elements of C and Ω have been suggested as the penalty term P(C, Ω) in [30], [35] and [24]. These penalty terms yield sparse estimates of the regression coefficient C and the precision matrix Ω.
There is also research on penalized estimation under the additional rank condition rank(C) = r for r < q. For example, in [10], the coefficient C is decomposed as C^T = BA^T, where B ∈ R^{p_0×r} and A ∈ R^{q×r} with A^T A = I_r. They then place a group lasso penalty on the row vectors of B to drop irrelevant predictor variables from the regression model. [9] suggested the sum of the singular values of C as the penalty term, and [33] penalized the singular vectors of C as well as the sum of its singular values.
Employing a covariance estimation method is another approach to the estimation of the regression coefficient. The coefficient matrix C can be considered as a function of the joint covariance matrix of the covariate vector X ∈ R^{p_0} and the response vector Y ∈ R^q. Assume that Z = (X^T, Y^T)^T ∈ R^{p_0+q} follows a joint distribution with mean vector μ = (μ_X^T, μ_Y^T)^T and covariance matrix
\[ \Sigma = \begin{pmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY} \end{pmatrix}, \]
where μ_X ∈ R^{p_0}, μ_Y ∈ R^q, Σ_{XX} ∈ R^{p_0×p_0} and Σ_{YY} ∈ R^{q×q}. Then, we have Ψ_0 = Σ_{YX}Σ_{XX}^{-1} and μ_0 = μ_Y − Ψ_0 μ_X, and μ_0 + Ψ_0 x is the conditional mean of Y given X = x if Z follows the multivariate Gaussian distribution. Note that μ_0 is the zero vector if we assume that μ_X and μ_Y are zero vectors. In this case, the coefficient matrix C in the multivariate regression model corresponds to Ψ_0 = Σ_{YX}Σ_{XX}^{-1}, which is a function of the covariance matrix Σ and is called the conditional mean operator. Thus, estimators of covariance matrices can be used for the estimation of the conditional mean operator, which can be seen as a functional of a covariance matrix. Estimation methods for various functionals of covariance matrices have also been investigated in the literature. [13] suggested an optimal estimator for quadratic and ℓ_r functionals of sparse covariance matrices. [14] considered optimal estimation of μ^T Σ^{-1} μ, where μ is the mean vector and Σ is the covariance matrix, under a sparsity assumption on Σ^{-1}μ.
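The form of Ψ_0 and μ_0 is the standard conditioning identity for a partitioned Gaussian vector: if Z = (X^T, Y^T)^T is multivariate normal, then
\[ Y \mid X = x \;\sim\; N_q\big( \mu_Y + \Sigma_{YX}\Sigma_{XX}^{-1}(x - \mu_X), \; \Sigma_{YY} - \Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY} \big), \]
so the conditional mean is linear in x with slope Ψ_0 = Σ_{YX}Σ_{XX}^{-1} and intercept μ_0 = μ_Y − Ψ_0 μ_X.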
We need a high-dimensional covariance estimation method when we use a covariance estimator for the multivariate regression model in high-dimensional settings. Suppose Z_1, Z_2, ..., Z_n are independently and identically generated from a p-dimensional distribution with mean zero and covariance matrix Σ. We refer to the estimation of the covariance Σ as high-dimensional covariance estimation when p is assumed to go to infinity as n → ∞. Since traditional covariance estimation methods, such as the sample covariance matrix and the Bayesian method based on the inverse-Wishart prior, are not consistent when p is larger than n [18, 22], various structural assumptions on covariance matrices have been used to reduce the number of effective parameters. For example, banded covariances [23], bandable covariances [2], sparse covariances [5] and sparse spiked covariances [4] have been considered. These structural assumptions can be imposed on the joint covariance matrix of the covariates and response variables when we employ covariance estimation for multivariate regression in high-dimensional settings.
In this paper, we consider the multivariate linear regression model where the joint covariance of the covariate vector and the response vector is a bandable covariance matrix. Under the bandable covariance assumption, the farther apart two variables are, the smaller their covariance is. On the frequentist side, [6, 7] proved that the tapering estimator of covariance attains the minimax optimal convergence rates for the class of bandable covariances under the spectral norm, the Frobenius norm and the matrix ℓ_1 norm. Therefore, a naive approach would be to estimate the conditional mean operator based on the tapering estimator of covariance (or another minimax covariance estimator).
Unfortunately, even if a covariance estimator \hat{Σ} attains the minimax optimal convergence rate for the covariance Σ, this does not imply that f(\hat{Σ}) attains the minimax optimal convergence rate for f(Σ), where f is a function on the space of covariances. Thus, the estimator of Σ_{YX}Σ_{XX}^{-1} based on the tapering estimator of covariance may not have the minimax optimal convergence rate. Furthermore, no existing Bayesian method achieves the minimax posterior convergence rate for the class of bandable covariances. Note that [31], [20] and [23] proposed Bayesian procedures for banded covariances, but the class of bandable covariances considered in this paper is larger than the class of banded covariances.
We investigate the decision-theoretic properties of the tapering estimator when the parameter of interest is the conditional mean operator Σ_{YX}Σ_{XX}^{-1} rather than the covariance itself, under the normality assumption. As a naive plug-in estimator for the conditional mean, we first define the tapering estimator of regression coefficient, obtained by plugging a tapering estimator of covariance into the conditional mean operator. However, we show that the tapering estimator of the regression coefficient is sub-optimal under the bandable covariance assumption. This implies that a naive plug-in estimator is not enough to achieve the minimax rate. To resolve this issue, we propose a modified plug-in estimator, the blockwise tapering estimator of regression coefficient, and show that it attains the minimax rate.
As a Bayesian procedure for the conditional mean operator under the bandable covariance assumption, we adopt the post-processed posterior method [23]. A post-processed posterior is constructed by transforming posterior samples from an initial posterior, which is typically a computationally convenient posterior. This idea is especially useful when it is difficult to impose a prior distribution on a restricted parameter space due to an unknown normalizing constant. For a given parameter space Θ*, suppose that we are interested in a restricted parameter space Θ ⊂ Θ*. A post-processed posterior can be obtained by generating samples from an initial posterior on Θ* and post-processing the posterior samples so that the transformed samples belong to Θ. When the post-processing function is a projection map from Θ* to Θ, the method is called the posterior projection method. The posterior projection method has been suggested for various settings, including [12], [17], [26] and [8], and was investigated in general terms by [27]. The idea of transforming posterior samples was also used for inference on covariance or precision matrices in [23] and [1].
We suggest two post-processed posteriors for the conditional mean operator. Both use the inverse-Wishart distribution as the initial prior on the unconstrained covariance matrix space, and use the tapering function and the blockwise tapering function, respectively, as post-processing functions for the conditional mean operator Σ_{YX}Σ_{XX}^{-1}. We present an asymptotic analysis to justify the proposed post-processed posteriors, and show that the post-processed posterior based on the blockwise tapering function attains the minimax optimal convergence rate.
The rest of the paper is organized as follows. In Section 2, we introduce the blockwise tapering estimator for the inference of the conditional mean operator under the bandable covariance assumption and show that this estimator attains the minimax convergence rate. In Section 3, we introduce the post-processed posteriors for the conditional mean operator and present their posterior convergence rates. Simulation studies and a real data analysis are given in Section 4. Proofs of the theorems in Sections 2 and 3 are given in Section 5 and Appendix A, respectively. We conclude the paper with a discussion section.

Notation
Let q, k and l be positive integers with l ∨ k ≤ q. For a q × q matrix Σ and real numbers σ_{ij}, 1 ≤ i, j ≤ q, we write Σ = (σ_{ij})_{1≤i,j≤q} = (σ_{ij}) when σ_{ij} is equal to the (i, j) element of Σ. We define the sub-matrix operator
\[ M^{(k)}_l(\Sigma) = (\sigma_{ij})_{l \le i, j \le l+k-1}, \]
the k × k principal sub-matrix of Σ whose rows and columns start at index l. For a q × q matrix Σ = (σ_{ij})_{1≤i,j≤q} and a positive integer k with k ≤ q, define the tapering function T_k(Σ), which was first defined in [6], as
\[ T_k(\Sigma) = (w^{(k)}_{ij} \sigma_{ij}), \qquad w^{(k)}_{ij} = \begin{cases} 1, & |i-j| \le k/2, \\ 2 - 2|i-j|/k, & k/2 < |i-j| < k, \\ 0, & \text{otherwise.} \end{cases} \]
For any sequences a_n and b_n of positive real numbers, we write a_n = o(b_n) if lim_{n→∞} a_n/b_n = 0, and a_n = O(b_n) if lim sup_{n→∞} a_n/b_n = C for a positive constant C. We write a_n ≲ b_n if a_n ≤ C b_n for all sufficiently large n and a positive constant C.
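As a concrete reference for T_k, here is a minimal NumPy sketch of the tapering function; it assumes the weighting scheme of [6] stated above, and the function name is illustrative rather than from the paper:

```python
import numpy as np

def taper(sigma: np.ndarray, k: int) -> np.ndarray:
    """Tapering function T_k of [6]: weight 1 for |i-j| <= k/2,
    2 - 2|i-j|/k for k/2 < |i-j| < k, and 0 for |i-j| >= k."""
    q = sigma.shape[0]
    d = np.abs(np.subtract.outer(np.arange(q), np.arange(q)))  # |i - j|
    return sigma * np.clip(2.0 - 2.0 * d / k, 0.0, 1.0)
```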

Blockwise tapering estimator
Let n, p and p_0 be positive integers with p_0 < p. Suppose Z_1, Z_2, ..., Z_n are independent and identically distributed from a p-dimensional distribution with mean zero and covariance matrix Σ_0, where Z_i = (X_i^T, Y_i^T)^T with X_i ∈ R^{p_0}, and Σ_0 is partitioned into the blocks Σ_{0,XX} ∈ R^{p_0×p_0}, Σ_{0,XY}, Σ_{0,YX} and Σ_{0,YY}. When only the first p_0 elements of Z_i, i.e. X_i, are given, the conditional mean vector for the other p − p_0 variables is
\[ \Sigma_{0,YX} (\Sigma_{0,XX})^{-1} X_i. \qquad (1) \]
The conditional mean operator Σ_{0,YX}(Σ_{0,XX})^{-1} in (1) is the estimand we focus on in this paper. We define the transformation ψ from a covariance to the conditional mean operator as
\[ \psi(\Sigma) = \Sigma_{YX} \Sigma_{XX}^{-1} \qquad (2) \]
for Σ ∈ C_p, where C_p is the set of all p × p-dimensional positive definite matrices. We assume Σ_0 belongs to a class of bandable covariances F_α, defined as
\[ F_\alpha = F_{p,\alpha}(M, M_0, M_1) = \Big\{ \Sigma \in C_p : \max_j \sum_{i: |i-j| > k} |\sigma_{ij}| \le M k^{-\alpha} \text{ for all } k, \; \lambda_{\max}(\Sigma) \le M_0, \; \lambda_{\min}(\Sigma) \ge M_1 \Big\} \]
for some positive constants α, M > 0 and 0 < M_1 < M_0, where λ_min(Σ) and λ_max(Σ) are the minimum and maximum eigenvalues of Σ. [2] and [6] also considered the same class of bandable covariances, except for the minimum eigenvalue condition. A natural estimator for ψ(Σ_0) is the plug-in estimator, the tapering estimator of covariance plugged into ψ, because the tapering estimator attains the minimax optimal convergence rate for the class of bandable covariances under the spectral norm loss [6]. Since positive definiteness is necessary for the covariance estimator, we modify the tapering estimator of covariance so that it is positive definite, and call the result the adjusted tapering estimator of covariance:
\[ T_k^{(\epsilon_n)}(S_n) = T_k(S_n) + \Lambda^{(\epsilon_n)}\{T_k(S_n)\}\, I_p, \qquad \Lambda^{(\epsilon)}(A) = \{\epsilon - \lambda_{\min}(A)\} \vee 0, \]
where ε_n > 0 is the positive-definite adjustment parameter, S_n = Σ_{i=1}^n Z_i Z_i^T / n is the sample covariance matrix, and I_p is the p × p identity matrix. We call the plug-in estimator based on the adjusted tapering estimator of covariance the tapering estimator of regression coefficient, in short the tapering estimator.
Since no column of the tapering estimator is the zero vector with probability one, the tapering estimator uses all the variables in a given covariate vector when it is used as the regression coefficient. In other words, from the variable selection perspective, all variables are selected when the tapering estimator is used. Note that selecting out negligible covariates can increase the accuracy of a regression estimator, and partial correlations between covariates and responses have been used as a criterion for variable selection [25, 3]. Theorem 2.1 shows that if a covariance matrix is bandable, then so is its inverse. It essentially means that, under the bandable covariance assumption, variables far away from each other tend to have weak partial correlations. The proof of Theorem 2.1 is given in Section 5.1.
for all a > 0 and all sufficiently large integers k with p > k ∨ (ak log k).
Note that [21] showed that the partial correlation between the ith and the jth variables is −ω_{ij}/(ω_{ii}ω_{jj})^{1/2} for all i, j ∈ [p], where (ω_{ij}) = Σ^{-1}. By Theorem 2.1, each element of the response vector (Z_i)_{(p_0+1):p} therefore has a negligibly weak partial correlation with a remote covariate (Z_i)_j for j ≤ p_0 and large |j − p_0|. Thus, selecting out these negligible covariates could yield a more accurate estimator of the conditional mean operator.
Based on the above argument, we propose the blockwise tapering estimator of regression coefficient, in short the blockwise tapering estimator. Let Z denote the set of all integers and ⌊x⌋ = max{z ∈ Z : z ≤ x}. For positive real numbers a and ε_n, and a positive integer k with 2⌊ak log k⌋ ≤ p_0, define the blockwise tapering estimator as φ(S_n; 2⌊ak log k⌋, ε_n) := φ(S_n; p_0, 2⌊ak log k⌋, ε_n), where O_{c×d} is the c × d zero matrix for positive integers c and d. The definition of the blockwise tapering estimator gives the formula (3), where T_k(S_n)_{YX^{(2)}} = {T_k(S_n)}_{(p_0+1):p, (p_0−2⌊ak log k⌋+1):p_0} and S_{n,X^{(2)}X^{(2)}} = (S_n)_{(p_0−2⌊ak log k⌋+1):p_0, (p_0−2⌊ak log k⌋+1):p_0}, and we examine the blockwise tapering estimator using the formula (3). The first p_0 − 2⌊ak log k⌋ columns of the estimator (3) are zero column vectors; thus, given a covariate vector x ∈ R^{p_0}, the blockwise tapering estimator drops the first p_0 − 2⌊ak log k⌋ covariate variables x_{1:(p_0−2⌊ak log k⌋)}. The remaining part of the estimator operates as the regression coefficient for the remaining covariate variables x_{(p_0−2⌊ak log k⌋+1):p_0}, and (4) is almost the same as ψ(T_k^{(ε_n)}(S_{n,(X^{(2)},Y)(X^{(2)},Y)}); 2⌊ak log k⌋) except for the positive-definiteness adjustment part, where S_{n,(X^{(2)},Y)(X^{(2)},Y)} = (S_n)_{(p_0−2⌊ak log k⌋+1):p, (p_0−2⌊ak log k⌋+1):p}.
Note that ψ(T_k^{(ε_n)}(S_{n,(X^{(2)},Y)(X^{(2)},Y)}); 2⌊ak log k⌋) is the tapering estimator of regression coefficient based on the data {((X_i)^T_{(p_0−2⌊ak log k⌋+1):p_0}, Y_i^T)^T : i = 1, ..., n}. These observations show that the blockwise tapering estimator combines variable selection with the idea of the tapering estimator of the regression coefficient.
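The following sketch implements both plug-in estimators under the assumptions stated above: the tapering weights of [6], the eigenvalue-floor adjustment Λ^{(ε)}, and block size 2⌊ak log k⌋. The blockwise estimator is implemented through the tapering plug-in on the retained block, which, as noted above, coincides with the definition up to the positive-definiteness adjustment; all names are illustrative.

```python
import numpy as np

def taper(sigma, k):
    """Tapering function T_k of [6]."""
    q = sigma.shape[0]
    d = np.abs(np.subtract.outer(np.arange(q), np.arange(q)))
    return sigma * np.clip(2.0 - 2.0 * d / k, 0.0, 1.0)

def adjust_pd(sigma, eps):
    """Positive-definiteness adjustment: add {eps - lambda_min} vee 0 times I."""
    lam_min = np.linalg.eigvalsh(sigma)[0]
    return sigma + max(eps - lam_min, 0.0) * np.eye(sigma.shape[0])

def psi(sigma, p0):
    """Conditional mean operator psi(Sigma) = Sigma_YX Sigma_XX^{-1}."""
    return np.linalg.solve(sigma[:p0, :p0].T, sigma[p0:, :p0].T).T

def tapering_estimator(S, p0, k, eps):
    """Tapering estimator of the regression coefficient (plug-in)."""
    return psi(adjust_pd(taper(S, k), eps), p0)

def blockwise_tapering_estimator(S, p0, k, a, eps):
    """Blockwise tapering estimator: keep only the last
    2*floor(a k log k) covariates, zero out the rest."""
    p = S.shape[0]
    m = 2 * int(np.floor(a * k * np.log(k)))
    keep = np.arange(p0 - m, p)              # X^(2) block followed by Y block
    C = np.zeros((p - p0, p0))
    C[:, p0 - m:] = tapering_estimator(S[np.ix_(keep, keep)], m, k, eps)
    return C
```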

Minimax analysis of blockwise tapering estimator
In this section, we give the convergence rates of the tapering and blockwise tapering estimators under the normality assumption, i.e., Z_1, Z_2, ..., Z_n are independent and identically distributed from a p-dimensional Gaussian distribution with mean zero and covariance matrix Σ_0, which is denoted by N_p(0, Σ_0). We also show that the blockwise tapering estimator attains the minimax convergence rate under the normality assumption. For a pair of a parameter ψ(Σ_0) and an estimator Ĉ on R^{(p−p_0)×p_0}, we use the spectral norm loss
\[ L(\psi(\Sigma_0), \hat{C}) = \| \hat{C} - \psi(\Sigma_0) \|^2. \qquad (5) \]
The loss function gives an upper bound on the estimation error of E(Y | X = x) given x ∈ R^{p_0}, because the definition of the operator norm gives
\[ \| \hat{C}x - \psi(\Sigma_0)x \|_2 \le \| \hat{C} - \psi(\Sigma_0) \| \, \|x\|_2. \]
We show that the tapering estimator has a sub-optimal convergence rate under the loss function (5), while the blockwise tapering estimator has the minimax optimal convergence rate. Theorem 2.2 gives the convergence rate of the tapering estimator. If we set ε_n such that p^{1/2} 5^{k/2} ε_n exp(−λn) ∨ ε_n^2 ≲ (k + log p)/n, then the convergence rate is (k + log p)/n + k^{−2α}, which is the same as the convergence rate of the tapering estimator of covariance [6]. The proof of this theorem is given in Section 5.2.
for all sufficiently large n.
Next, we show the convergence rate of the blockwise tapering estimator. The blockwise tapering estimator is designed to estimate φ(Σ_0; 2⌊ak log k⌋, 0), which approximates ψ(Σ_0). Lemma 2.3 gives the approximation error, which is negligible when k is large enough. Based on the approximation error, the convergence rate of the blockwise tapering estimator is given in Theorem 2.4. If we set ε_n such that (ak log k)^{1/2} 5^{k/2} ε_n exp(−λn) ∨ ε_n^2 ≲ k/n, the convergence rate of the blockwise tapering estimator is k/n + k^{−2{α∧(aτ−1)}}. The proofs of the lemma and theorem are given in Section 5.3.
There exist some positive constants C and τ depending only on M, M_0 and M_1 such that for all a > 0 and all sufficiently large integers k and p_0 with ⌊ak log k⌋/2 ≥ k and 2⌊ak log k⌋ < p_0. (1), then there exist some positive constants C, λ and τ depending only on M, M_0, M_1 and α such that for all a > 0 and all sufficiently large n, k and p_0 with ⌊ak log k⌋/2 ≥ k and p_0 > 2⌊ak log k⌋.
Next, we give the lower bound of the minimax risk for the conditional mean operator under the bandable covariance assumption, to show that the blockwise tapering estimator is a minimax optimal estimator. Let Ĉ = Ĉ(X_1, X_2, ..., X_n) be an estimator on R^{(p−p_0)×p_0}. The minimax risk is defined as
\[ \inf_{\hat C} \sup_{\Sigma_0 \in F_\alpha} E_{\Sigma_0} \| \hat C - \psi(\Sigma_0) \|^2 . \]
Theorem 2.5 gives a lower bound of the minimax risk of order n^{−2α/(2α+1)}. If we set k, a and ε_n of the blockwise tapering estimator such that k = n^{1/(2α+1)}, a > (α + 1)/τ and (ak log k)^{1/2} 5^{k/2} ε_n exp(−λn) ∨ ε_n^2 ≲ k/n, then the convergence rate matches the lower bound asymptotically. Thus, the minimax convergence rate is n^{−2α/(2α+1)}, and the blockwise tapering estimator attains it. The proof of Theorem 2.5 is given in Section 5.4.

Blockwise tapering post-processed posterior
We propose the Bayesian counterparts of the tapering estimator and the blockwise tapering estimator using the post-processed posterior method [23]. The algorithm for the post-processed posteriors consists of the following two steps.
(a) (Initial posterior sampling step) First, we obtain the initial conjugate posterior distribution on the unconstrained parameter space. We take the inverse-Wishart distribution IW_p(B_0, ν_0) as the initial prior distribution, whose density function is
\[ \pi^i(\Sigma) \propto |\Sigma|^{-\nu_0/2} \exp\{ -\mathrm{tr}(\Sigma^{-1} B_0)/2 \}, \]
where B_0 ∈ C_p and ν_0 > 2p. Then, using the likelihood of the Gaussian distribution, we obtain the initial posterior distribution π^i(Σ | Z_n) as IW_p(B_0 + nS_n, ν_0 + n), and generate samples Σ^{(1)}, ..., Σ^{(N)} from the initial posterior distribution. (b) (Post-processing step) Second, we post-process the samples from the initial posterior distribution with ψ{T_k^{(ε_n)}(·)} or φ(·; 2⌊ak log k⌋, ε_n), which are called the tapering function and the blockwise tapering function, respectively.
We call the post-processed posteriors obtained from the post-processing functions the tapering post-processed posterior (tapering PPP) and the blockwise tapering post-processed posterior (blockwise tapering PPP).
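A minimal sketch of the two-step algorithm follows. It reuses blockwise_tapering_estimator from the sketch above, and assumes scipy's inverse-Wishart parameterization, which matches the density convention above via df = ν_0 − p − 1 and scale = B_0; this mapping, like all names here, is an assumption for illustration.

```python
import numpy as np
from scipy.stats import invwishart

def ppp_samples(Z, nu0, B0, post_process, n_samples=1000):
    """Post-processed posterior: draw from the conjugate inverse-Wishart
    initial posterior IW_p(B0 + n S_n, nu0 + n), then push each draw
    through the post-processing function."""
    n, p = Z.shape
    S = Z.T @ Z / n                                   # sample covariance
    post = invwishart(df=nu0 + n - p - 1, scale=B0 + n * S)
    return [post_process(post.rvs()) for _ in range(n_samples)]

# Blockwise tapering PPP for the conditional mean operator:
# draws = ppp_samples(Z, nu0, B0,
#           lambda Sig: blockwise_tapering_estimator(Sig, p0, k, a, eps))
```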
We use the decision-theoretic framework of [22, 23] to prove the minimax optimality of the blockwise tapering post-processed posterior. We define the P-loss L(·, ·) and the P-risk R(·, ·) for the conditional mean operator as
\[ L(\Sigma_0, (\pi^i, f)) = E^{\pi^{pp}} \big( \| C - \psi(\Sigma_0) \|^2 \,\big|\, Z_n \big), \qquad R(\Sigma_0, (\pi^i, f)) = E_{\Sigma_0} L(\Sigma_0, (\pi^i, f)), \]
where C ∼ π^{pp}(· | Z_n; f), π^{pp}(· | Z_n; f) is the post-processed posterior distribution derived from the initial prior π^i and the post-processing function f, and (π^i, f) is a pair of an initial prior and a post-processing function. Theorems 3.1 and 3.2 give the P-risk convergence rates of the tapering and the blockwise tapering post-processed posteriors, respectively. The convergence rates are the same as those of their frequentist counterparts. The proofs of these theorems are given in Section A.
Let k be a positive integer, and let the prior π^i of Σ be IW_p(A_n, ν_n) for A_n ∈ C_p and ν_n > 2p.
for all sufficiently large n and k.
then for all a > 0 and all sufficiently large n, k and p_0 with ⌊ak log k⌋/2 > k and p_0 > 2⌊ak log k⌋.
Note that the P-risk minimax lower bound is n^{−2α/(2α+1)}, since the P-risk convergence rate is slower than or equal to the frequentist minimax rate [22]. Thus, if we set k, a and ε_n of the blockwise tapering post-processed posterior such that k = n^{1/(2α+1)}, (ak log k)^{1/2} 5^{k/2} ε_n exp(−λn) ∨ ε_n^2 ≲ k/n and a > (α + 1)/τ, then the P-risk convergence rate matches the lower bound asymptotically. Thus, the P-risk minimax convergence rate is n^{−2α/(2α+1)}, and the blockwise tapering post-processed posterior attains it.

Simulation
We compare the blockwise tapering estimator with the tapering estimator using simulated data. We define the true covariance matrix Σ_0 ∈ R^{p×p} as Σ_0 = Σ*_0 + {0.5 − λ_min(Σ*_0)} I_p, which guarantees that the minimum eigenvalue of Σ_0 is bounded away from zero. We set ρ = 0.6 and α = 0.1 for Σ_0 and generate data Z_1, ..., Z_n from N_p(0, Σ_0) independently, where p ∈ {500, 1000} and n = p/2. We let p_0 = 0.8p and fix the positive-definite adjustment parameter ε_n at 0.5.

We define the error reduction value obtained by choosing the blockwise tapering estimator over the tapering estimator as the difference of their estimation errors. We repeat generating the simulation data T times, and let Z_n^{(i)} and S_n^{(i)} denote the data and the sample covariance matrix, respectively, in the ith repetition for i ∈ {1, 2, ..., T}. We summarize the error reduction values from the repetitions by their t-value t_f(k, a; T), which is the performance measure for the comparison between the tapering and blockwise tapering estimators. We also compare the blockwise tapering PPP with the tapering PPP on the same simulated data. We define the error reduction value obtained by choosing the blockwise tapering PPP analogously, where Ĉ^{(TPPP)} and Ĉ^{(bTPPP)} denote the posterior means of the tapering PPP and the blockwise tapering PPP, respectively, and we define the corresponding t-value t_b(k, a; T) for T repetitions.

We evaluate t_f(k, a; 100) and t_b(k, a; 100) for k ∈ {2, 3, ..., 10} and a ∈ {5, 10, 20}. For the post-processed posteriors, we generate 1000 posterior samples in each setting. The results are shown in Figure 1. When p is large and k is small, the error reductions achieved by the blockwise tapering estimator and the blockwise tapering post-processed posterior increase. Note that the convergence rates of the tapering estimator and the tapering PPP contain the additional log p/n term. The effect of this additional term increases when the other term in the convergence rate, k/n, is relatively small; thus, the error reduction is effective when p is large compared with k. The figure also shows that the tapering estimator is slightly better otherwise. If log p is not relatively large, one does not need to abandon the covariates X_{1:(p_0−2⌊ak log k⌋)} by using the blockwise tapering estimator or the blockwise tapering PPP.
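Assuming the t-value is the usual one-sample t-statistic of the error reduction values over the T repetitions, the summary computation is:

```python
import numpy as np

def t_value(err_reductions):
    """One-sample t-statistic of error reductions e_1, ..., e_T;
    large positive values favor the blockwise method."""
    e = np.asarray(err_reductions, dtype=float)
    return np.sqrt(len(e)) * e.mean() / e.std(ddof=1)
```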
Next, we compare the tapering estimator, the blockwise tapering estimator and their Bayesian versions with two other classes of methods: covariance estimation methods and multivariate regression methods. A covariance estimator can be used for the estimation of the conditional mean operator by applying the transformation (2). We use the banding estimator [2], the dual maximum likelihood estimator [19] and the banding post-processed posterior [23] as covariance estimators for comparison. Multivariate regression methods are also used for comparison, since the multivariate linear regression coefficient is the conditional mean operator. We adopt the reduced-rank regression [9], the sparse reduced-rank regression [10] and the sparse orthogonal factor regression (SOFAR) [33].
We need to select tuning parameters for the conditional mean operator estimators. Based on the tuning parameter selection process, we divide the estimation methods into three categories: frequentist covariance-based methods, post-processed posterior methods and multivariate regression methods.
The tapering and blockwise tapering estimators belong to the frequentist covariance-based methods, and their tuning parameters are selected as follows. Given a covariance estimator, the conditional mean operator and the conditional variance are derived, which yield the conditional distribution under the normality assumption. The log-likelihood of this conditional distribution is used for leave-one-out cross-validation. Let \hat{Σ}(Z_{n,−i}, τ) be a frequentist covariance estimator based on Z_{n,−i} = (Z_1, ..., Z_{i−1}, Z_{i+1}, ..., Z_n) given a tuning parameter vector τ. The tuning parameter is τ = (k, ε) for the tapering estimator and τ = (k, ε, a) for the blockwise tapering estimator. The derived conditional mean operator is ψ{\hat{Σ}(Z_{n,−i}, τ)}, and the conditional variance is
\[ \nu\{\hat\Sigma(Z_{n,-i},\tau)\} = \hat\Sigma_{YY} - \hat\Sigma_{YX} \hat\Sigma_{XX}^{-1} \hat\Sigma_{XY}. \]
We select τ as the minimizer of
\[ \sum_{i=1}^n -\log f\big( Y_i \mid \psi\{\hat\Sigma(Z_{n,-i},\tau)\} X_i, \; \nu\{\hat\Sigma(Z_{n,-i},\tau)\} \big), \]
where f(· | μ, Σ) is the density function of the multivariate normal distribution with mean μ and covariance Σ. Since the conditional variance cannot be derived from the blockwise tapering estimator, we use the conditional variance from the tapering estimator in this case.
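A sketch of this leave-one-out selection rule, assuming a generic cov_estimator(data, tau) returning a covariance estimate (for example, the adjusted tapering estimator sketched earlier) and scipy for the Gaussian log-density; names are illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal

def loo_score(Z, p0, cov_estimator, tau):
    """Negative leave-one-out conditional log-likelihood of tau;
    the selected tuning parameter minimizes this score."""
    score = 0.0
    for i in range(Z.shape[0]):
        Sig = cov_estimator(np.delete(Z, i, axis=0), tau)
        Sxx, Sxy = Sig[:p0, :p0], Sig[:p0, p0:]
        C = np.linalg.solve(Sxx.T, Sig[p0:, :p0].T).T     # psi(Sig)
        V = Sig[p0:, p0:] - C @ Sxy                       # conditional variance
        x, y = Z[i, :p0], Z[i, p0:]
        score -= multivariate_normal.logpdf(y, mean=C @ x, cov=V)
    return score

# select tau as the argmin of loo_score over a grid of candidate values
```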
For the tuning parameter selection of the post-processed posterior methods, we apply the Bayesian leave-one-out cross-validation method [15] to the log-likelihood of the conditional distribution. Let Σ^{(i,1)}, ..., Σ^{(i,N)} be leave-one-out initial posterior samples generated from the initial posterior given Z_{n,−i}, for i ∈ {1, 2, ..., n}. We select the tuning parameter vector τ as the minimizer of
\[ \sum_{i=1}^n -\log\Big\{ \frac{1}{N} \sum_{j=1}^N f\big( Y_i \mid \psi^*(\Sigma^{(i,j)}) X_i, \; \nu^*(\Sigma^{(i,j)}) \big) \Big\}, \]
where ψ* and ν* are the post-processing functions for the conditional mean operator and the conditional variance given the tuning parameter τ, respectively. For the post-processing function of the conditional variance ν*, the banding PPP uses ν{B_k^{(ε_n)}(·)}, where B_k^{(ε_n)} is the positive-definite adjusted banding operator, and the tapering and blockwise tapering PPPs use ν{T_k^{(ε_n)}(·)}. For the multivariate regression methods, we use 10-fold cross-validation, as suggested in [10], [9] and [33]. Note that all these methods include the rank among their tuning parameters. We select the rank from {0, 1, ..., 10} for the reduced-rank regression, while {1, 2, ..., 10} is considered for the others. Because the three methods coincide when the rank is zero, we consider only the method of [9] for the zero-rank case.
We set p = 200, ρ = 0.6 and α ∈ {0.1, 0.3} for Σ_0 and generate Z_1, Z_2, ..., Z_n from N_p(0, Σ_0) independently for n ∈ {100, 200}. We repeat generating the simulation data 100 times for each simulation setting. The performance of each method is measured by the average of the errors ||Ĉ_s − ψ(Σ_0)|| over the repetitions, where Ĉ_s is the point estimator of the conditional mean operator in the sth repetition. For the post-processed posterior methods, we use the posterior mean as the point estimator. Table 1 gives the simulation errors. The tapering and blockwise tapering estimators and their Bayesian counterparts are the best in all settings. The multivariate regression methods, i.e. the reduced-rank regression, the sparse reduced-rank regression and the sparse orthogonal factor regression, are the worst in all settings. Unlike the other covariance-based methods, the multivariate regression methods do not take the bandable or banded covariance structure into account. It appears that the multivariate regression framework does not perform well under the high-dimensional bandable covariance assumption.

Application to forecasting traffic speed
We apply the proposed methods to multivariate regression analysis for small-area spatio-temporal data, and use this application to forecast traffic speed on Yeoui-daero, a road in Seoul. Suppose spatio-temporal data are observed at S spatial locations and T time points, where S and T are positive integers. Let X_{s,t} be a random variable at the sth spatial index and the tth time index, s = 1, ..., S and t = 1, ..., T. We assume
\[ |\mathrm{Cov}(X_{s,t}, X_{s',t'})| \le r(|t - t'|), \qquad (6) \]
where r is a real-valued function on the non-negative integers and is assumed to be decreasing. Rearranging (X_{s,t})_{s=1,...,S, t=1,...,T}, we define Z ∈ R^{TS} as
\[ Z = (X_{1,1}, X_{2,1}, \ldots, X_{S,1}, X_{1,2}, X_{2,2}, \ldots, X_{S,T})^T, \qquad (7) \]
and let E(ZZ^T) = Σ_0 = (σ_{0,ij}). We show that Σ_0 is a bandable covariance if the decreasing rate of r(x) is x^{−α−1}. Note that if mS ≤ |i − j| < (m + 1)S for m ∈ {0, 1, ..., T − 1} and i, j ∈ {1, 2, ..., TS}, then the time index difference between Z_i and Z_j is at least m. This observation and assumption (6) give |σ_{0,ij}| ≤ r(⌊|i − j|/S⌋) and
\[ \max_j \sum_{i: |i-j| > k} |\sigma_{0,ij}| \le C k^{-\alpha} \]
for some positive constant C. Thus, Σ_0 is a bandable covariance, and the proposed methods for the conditional mean operator under the bandable covariance assumption can be used to predict X_{1:S,(t_0+1):T} given X_{1:S,1:t_0}. Based on the rearrangement (7) and the proposed methods for the conditional mean operator under the bandable covariance assumption, we forecast traffic speed on Yeoui-daero using data from TOPIS [32]. In the Yeoui-daero traffic speed data, a daily data set consists of observations at 8 spatial indexes and 24 time indexes. Let a daily traffic speed record be (X_{s,t})_{1≤s≤8,1≤t≤24}, where the time index t indicates the time interval from (t − 1) o'clock to t o'clock, and the allocation of the spatial index s is given in Figure 2. We rearrange (X_{s,t})_{1≤s≤8,1≤t≤24} as in (7), and apply the proposed estimators to forecast X_{1:8,18:24} given X_{1:8,1:17}.
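The rearrangement (7) is a simple reshape in which the spatial index varies fastest; a sketch for a daily record X with shape (S, T) = (8, 24), so that p = 192 and forecasting uses p_0 = 17 × 8 (function name illustrative):

```python
import numpy as np

def vectorize_day(X):
    """Rearrange (X_{s,t}) as in (7):
    Z = (X_{1,1}, ..., X_{S,1}, X_{1,2}, ..., X_{S,T})."""
    S, T = X.shape
    return X.T.reshape(S * T)   # block t of length S holds X_{1:S,t}

# With t0 = 17, the first p0 = 17 * 8 entries of Z are the conditioning
# variables X_{1:8,1:17}, and the remaining entries are the forecast
# targets X_{1:8,18:24}.
```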
From TOPIS, we use data from January to October 2020, excluding weekend data sets and missing data sets. This gives observations for 172 days, denoted by Z_1, Z_2, ..., Z_{172} ∈ R^{192}. To apply the proposed methods, we use the mean-centered observations Z̄_1, Z̄_2, ..., Z̄_{172} ∈ R^{192}. For the performance measure, let the training data be Z_train = (Z̄_1, Z̄_2, ..., Z̄_{86}) and the test data be Z_test = (Z̄_{87}, Z̄_{88}, ..., Z̄_{172}). The forecast errors are summarized as the average prediction error over the test data, where p_0 = 17 × 8 and Ĉ(Z_train) is a point estimator of the conditional mean operator based on Z_train. The summarized forecast errors are reported in Table 2, which shows that the tapering and blockwise tapering estimators and their Bayesian versions are the best among all methods.

Proofs of theorems and lemma
In this section, we give the proofs of Theorems 2.1-2.5 and Lemma 2.3. The proofs of the other theorems, Theorems 3.1 and 3.2, are given in Section A. We first introduce notation for the proofs. Let ||Σ||_F = {tr(ΣΣ^T)}^{1/2} and ||Σ||_r denote the Frobenius norm and the matrix r-norm of a matrix Σ, respectively. We also let W_p(B_0, ν_0) be the Wishart distribution whose density function is
\[ f(W) \propto |W|^{(\nu_0 - p - 1)/2} \exp\{ -\mathrm{tr}(B_0^{-1} W)/2 \}, \]
where B_0 ∈ C_p and ν_0 > p − 1.

Proof of Theorem 2.1
In this subsection, we give the proof of Theorem 2.1. First, we present Lemmas 5.1 and 5.2 which are necessary for the proof of Theorem 2.1. Proofs of these lemmas are given in Section A.
for all sufficiently large k.
Now we prove Theorem 2.1.
for all sufficiently large k.

Proof of Theorem 2.2
In this subsection, we prove Theorem 2.2 which gives the convergence rate of the tapering estimator. First, we present Lemmas 5.3-5.6 necessary for the proof of Theorem 2.2. Proofs of these lemmas are given in Section A.
for all sufficiently large n.
for all sufficiently large n.
Now, we give the proof of Theorem 2.2.

Proof of Theorem 2.2.
We have where the last inequality holds since λ_min{T_k^{(ε_n)}(S_n)} ≥ ε_n. We have that there exists some positive constant C_1 depending only on M, M_0, M_1 and α such that for all sufficiently large n. The second and last inequalities hold by Lemma 5.3 and Theorem 2 of [6], respectively. By Lemmas 5.4 and 5.5, there exist some positive constants C_2 and λ_1 depending only on M_0 and M_1 such that for all sufficiently large n. Finally, by Lemma 5.6, we get the upper bound of (12) as for some positive constants C_3 and λ_2 depending only on M, M_0, M_1 and α.

Proofs of Lemma 2.3 and Theorem 2.4
In this subsection, we show the convergence rate of the blockwise tapering estimator by proving Lemma 2.3 and Theorem 2.4. First, we present Lemma 5.7.
Lemma 5.7. Let q and k be positive integers with k < q, and let A ∈ C_q be a k-band matrix. For positive real numbers a and b with q > ⌊ak log k⌋ + ⌊bk log k⌋, we express A and A^{−1} in block form, where A_{22}, Ω_{22} ∈ R^{(⌊ak log k⌋+⌊bk log k⌋)×(⌊ak log k⌋+⌊bk log k⌋)}. There exist some positive constants λ and C depending only on ||A|| and ||A^{−1}|| such that for all sufficiently large k and q with ⌊ak log k⌋ ≥ k and q > ⌊ak log k⌋ + ⌊bk log k⌋.
The proof of this lemma is given in Section A. Next, we provide the proofs of Lemma 2.3 and Theorem 2.4.
Proof of Lemma 2.3. There exist some positive constants C_1 and C_2 depending only on M, M_0 and M_1 such that for all sufficiently large k. The third inequality holds by Lemma 5.2 since Σ_{0,XX} ∈ F_{p_0,α}(M, M_0, M_1). Then, we have For this upper bound, we have First, we show the upper bound of (13). Since T_k(Σ_0) is a k-band matrix, Since ⌊ak log k⌋ ≥ k, the definition of M^{*(⌊ak log k⌋)} since ⌊ak log k⌋/2 ≥ k. Thus, by Lemma 5.1, there exist some positive constants C_3 and λ_1 depending only on M_0 and M_1 such that for all sufficiently large k with p_0 > k ∨ ⌊ak log k⌋/2. Next, we show the upper bound of (14). Since T_k(Σ_0)_{YX} is expressed as (15), we have Note where D^*_1, D^*_2 and D^*_3 are k × ⌊ak log k⌋ matrices with We have First, we show the upper bound of ||D^*_2 − D^*_3||. Let By Lemma 5.7, there exist some positive constants C_4 and λ_2 depending only on M_0 and M_1 such that for all sufficiently large k with ⌊ak log k⌋ ≥ k and p_0 > 2⌊ak log k⌋.
Next, we show the upper bound of ||D^*_1||. Let M since ⌊ak log k⌋/2 ≥ k. Thus, Lemma 5.1 gives that there exist some positive constants C_5 and λ_3 depending only on M_0 and M_1 such that for all sufficiently large k with p_0 > k ∨ (ak log k). By combining this inequality with (17), we have By collecting the inequalities (16) and (18), the proof is completed.
Proof of Theorem 2.4. We have By Lemma 2.3, there exist some positive constants C_1 and λ_1 depending only on M, M_0, M_1 and α such that for all sufficiently large k with ⌊ak log k⌋/2 ≥ k and p_0 > 2⌊ak log k⌋. Next, we show the upper bound of (20). Let M̃ := M^{(2⌊ak log k⌋)}_{p_0−2⌊ak log k⌋+1} in this proof. By the definition of Λ^{(ε_n)}, we have for some positive constants C_2 and λ_2 depending only on M, M_0, M_1 and α. The first inequality holds since The second inequality holds since T_k(A)_{YX} for a matrix A is determined only by M^{(2k+1)}_{p_0−k}(A). Note that T_k(A) is a k-banded matrix and T_k(A)_{YX} is the off-diagonal block matrix. The third inequality holds since The last inequality holds by Lemmas 5.4, 5.5 and Theorem 2 of [6]. For the upper bound of (21), we have that there exists some positive constant C_3 depending only on M, M_0 and M_1 such that for all sufficiently large k. The first equality is satisfied by the definition of Λ^{(0)} and Λ^{(ε_n)}. The last inequality holds by Lemma 5.2 since M̃(Σ_{0,XX}) ∈ F_{2⌊ak log k⌋,α}(M, M_0, M_1). We apply Lemma 5.6 and obtain that there exist some positive constants C_4 and λ_3 depending only on M, M_0, M_1 and α such that for all sufficiently large n. Thus, there exists some positive constant C_5 depending only on M, M_0, M_1 and α such that for all sufficiently large n and k. Collecting the upper bounds of (19), (20) and (21), we complete the proof.

Proof of Theorem 2.5
In this subsection, we prove Theorem 2.5, which gives the lower bound of the minimax risk for the conditional mean operator under the bandable covariance assumption. For probability measures P_θ and P_{θ'}, we define the total variation affinity
\[ \| P_\theta \wedge P_{\theta'} \| = \int (p_\theta \wedge p_{\theta'}) \, d\nu, \]
where p_θ and p_{θ'} are the probability density functions of P_θ and P_{θ'}, respectively, with respect to a reference measure ν. First, we provide Lemma 5.8, which is a reformulation of the proof of Lemma 6 in [6].
Lemma 5.8. Suppose Σ and Σ' are p × p positive definite matrices. Let P_Σ be the joint distribution of X_1, X_2, ..., X_n, which are independently and identically generated from N_p(0, Σ). The proof of this lemma is given in Section A.
Proof of Theorem 2.5. Let C_{p,p_0} = {A ∈ R^{p×p} : A_{1:p_0,1:p_0} ∈ C_{p_0}}, and let (C_{p,p_0})^χ be the space of estimators on C_{p,p_0}, where χ is the sample space. Since for an which is a semimetric on C_{p,p_0}. For a positive integer k with k < p_0/2, let Θ = {0, 1}^k, and let H(θ, θ') be the Hamming distance on Θ. We define ||P_θ ∧ P_{θ'}|| (24), where P_θ is the joint distribution of n independent observations from the multivariate normal distribution with mean zero and covariance Σ(θ). First, we show the lower bound of

Discussion
We have considered the estimation of the conditional mean operator under the bandable covariance assumption, which is useful for multivariate linear regression when there is a natural order among the variables. We showed that the plug-in estimator based on the tapering estimator of covariance, which is minimax optimal for the class of bandable covariances, is sub-optimal for the conditional mean operator. This observation implies that when a function of the covariance matrix is to be estimated, the plug-in estimator based on a minimax optimal covariance estimator may not be optimal. We have proposed the blockwise tapering estimator and the blockwise tapering post-processed posterior as minimax optimal estimators for the conditional mean operator under the bandable covariance assumption. We constructed the estimators by modifying the tapering estimator and the tapering post-processed posterior to exclude the covariates that have small partial correlations with the response variables. Our numerical studies also showed that the blockwise tapering estimator and the blockwise tapering post-processed posterior have smaller errors when p is large enough.
The tapering and blockwise tapering estimators are constructed from the tapered sample covariance, and similar estimators for the regression coefficient can be obtained by replacing the tapered sample covariance with the banded sample covariance. Furthermore, the banding versions of these estimators have the same convergence rates as their tapering versions. When we derive the convergence rates of the tapering and blockwise tapering estimators (Theorems 2.2 and 2.4), we rely on a concentration inequality for the tapered sample covariance T_k(S_n) with some positive constants C_1, C_2 and λ. According to inequality (5.5) in [34], the banded sample covariance B_k(S_n) satisfies a concentration inequality of the same form. Thus, the same convergence rates are obtained for the banding versions of the proposed frequentist methods.

Appendix A: Proofs of remaining lemmas and theorems
In this section, we give the proofs of Theorems 3.1 and 3.2 and lemmas given in Section 5.

A.1. Proofs of Theorems 3.1 and 3.2
In this section, we prove Theorems 3.1 and 3.2 which give the P-risk convergence rates of the tapering and blockwise tapering post-processed posteriors, respectively. First, we present Lemmas A.1-A.4 necessary for the proofs of Theorems 3.1 and 3.2.
for all sufficiently large n.
for sufficiently large n.
for all sufficiently large n.
for all sufficiently large k. Lemma A.4 gives that there exist some positive constants C 4 and λ 3 depending only on M , M 0 , M 1 and α such that for all sufficiently large n. Thus, there exists some positive constant C 5 depending only on M , M 0 , M 1 and α such that for all sufficiently large n and k. Collecting the upper bounds of (30), (31) and (32), we complete the proof.

A.2. Proofs of lemmas in Section 5.1
In this section, we give the proofs of lemmas in Section 5.1.
Proof of Lemma 5.1. Since Σ is a k-band matrix, Thus, we get for all sufficiently large k.
This inequality condition holds for all sufficiently large k, since ||B_k(Σ_0)^{−1}||_1 is bounded above by a constant [23, Lemma 4.14]. Thus, Theorem 2.4 of [11] gives and κ_1 is the spectral condition number of B_k(Σ_0). To show that (36) is bounded, it suffices to show that ||B_k(Σ_0)^{−1}||_2 is bounded. By Lemma 4.14 in [23], we have which is bounded for all sufficiently large k. Thus, for all sufficiently large k, ||B_k(Σ_0)^{−1}{Σ_0 − B_k(Σ_0)}||_1 < 1 and the inequality (35) is satisfied. By combining (35) and (36), we have that there exists some positive constant C_2 depending only on M, M_0 and M_1 such that for all sufficiently large k.
Next, we consider the upper bound of ||Σ_0^{−1} − T_k(Σ_0)^{−1}||_1. In the same way as deriving the inequalities (36) and (37), we have where κ_2 is the spectral condition number of T_k(Σ_0). Thus, we have ||T_k(Σ_0)^{−1}(Σ_0 − T_k(Σ_0))||_1 < 1 for all sufficiently large k, and apply Theorem 2.3.4 of [16]. By Theorem 2.3.4 of [16], there exists some positive constant C_4 depending only on M, M_0 and M_1 such that for all sufficiently large k. The last inequality holds by the inequalities (38) and (39).

A.3. Proofs of lemmas in Section 5.2
In this section, we give the proofs of the lemmas in Section 5.2. First, we present Lemmas A.5 and A.6 which are necessary for the proofs of these lemmas.
Lemma A.5. Let p and k be positive integers with k ≤ p. For an arbitrary covariance matrix Σ ∈ C p , where || · || r is the matrix r-norm.

A.6. Proofs of lemmas in Section A.1
In this section, we give the proofs of the lemmas in Section A.1. First, we present Lemmas A.7 and A.8 which are necessary for the proofs of these lemmas.
Lemma A.7. Let n, k and r be positive integers and M_0 a positive real number. Suppose Σ_0 ∈ C_k, and let the prior distribution π on C_k be IW_k(A_n, τ_n) for A_n ∈ C_k and τ_n > 2k. If (τ_n − 2k) ∨ ||A_n|| ∨ k = o(n) and ||Σ_0|| ≤ M_0, then there exists some positive constant C depending only on M_0 and r such that for all sufficiently large n.
Proof. Let Σ̄ = (nS_n + A_n)/(n + τ_n − k − 1). We have For the upper bound of (55), we have Since (τ_n − k − 1)/n = O(k/n) and ||A_n|| = o(n), there exists some positive constant C_1 depending only on M_0 and r such that for all sufficiently large n. Lemma A.6 gives
\[ E_{\Sigma_0} \Big\| \frac{n}{n + \tau_n - k - 1} (S_n - \Sigma_0) \Big\|^r \le C_2 \Big( \frac{5^k}{n} \Big)^{r/2} \]
for some positive constant C_2 depending only on M_0 and r. Thus, we have
\[ (55) \le 4^{r-1} C_1 \Big( \frac{k}{n} \Big)^r + 4^{r-1} C_2 \Big( \frac{5^k}{n} \Big)^{r/2}. \]