Projection inference for high-dimensional covariance matrices with structured shrinkage targets

Analyzing large samples of high-dimensional data under dependence is a challenging statistical problem as long time series may have change points, most importantly in the mean and the marginal covariances, for which one needs valid tests. Inference for large covariance matrices is especially difficult due to noise accumulation, resulting in singular estimates and poor power of related tests. The singularity of the sample covariance matrix in high dimensions can be overcome by considering a linear combination with a regular, more structured target matrix. This approach is known as shrinkage, and the target matrix is typically of diagonal form. In this paper, we consider covariance shrinkage towards structured nonparametric estimators of the bandable or Toeplitz type, respectively, aiming at improved estimation accuracy and statistical power of tests even under nonstationarity. We derive feasible Gaussian approximation results for bilinear projections of the shrinkage estimators which are valid under nonstationarity and dependence. These approximations especially enable us to formulate a statistical test for structural breaks in the marginal covariance structure of high-dimensional time series without restrictions on the dimension, and which is robust against nonstationarity of nuisance parameters. We show via simulations that shrinkage helps to increase the power of the proposed tests. Moreover, we suggest a data-driven choice of the shrinkage weights, and assess its performance by means of a Monte Carlo study. The results indicate that the proposed shrinkage estimator is superior for non-Toeplitz covariance structures close to fractional Gaussian noise.


Introduction
Many modern applications encounter large data sets of very high dimension d, which requires new approaches to modeling as well as statistical inference, since classical methods usually fail. A particularly prominent example is the covariance matrix Σ of a random vector of d variables, which has d^2 unknown parameters. It is well known that the sample covariance matrix Σ̂_n based on a sample of size n is singular if n < d, and consistency of Σ̂_n for growing dimension d requires control of the O(d^2) covariances and otherwise generally fails. Specifically, for n i.i.d. samples of d-dimensional random vectors with finite fourth moments, consistency in terms of the expected squared Frobenius norm of Σ̂_n − Σ_n requires diag(Σ_n) and the variance of the average of the squared variables to be of the order o(n/d) (Ledoit and Wolf, 2004). In terms of eigenvalues and eigenvectors, classic inconsistency results for the largest eigenvalue and the top eigenvector under the regime d/n → c ∈ (0, 1) are due to Johnstone (2001) and Johnstone and Lu (2009), even when Σ = I. Since, however, it is generally believed that many high-dimensional data sets are governed by, say, r ≪ d strong signals, spiked covariance models, with r leading eigenvalues strictly larger than 1 while the remaining bulk of the spectrum consists of unit eigenvalues, have been studied quite extensively. Here the leading eigenvalues of Σ̂_n are biased, thus requiring de-biasing procedures which, however, depend on the loss function, and the eigenvectors are inconsistent as well; see Donoho et al. (2018) and the references therein. Further potential remedies are to impose additional structural assumptions on Σ, e.g. sparsity (Bickel and Levina, 2008a), Toeplitz shape (Cai et al., 2013), or assuming the entries further from the diagonal to decay rapidly, also known as a bandable covariance matrix (Bickel and Levina, 2008b). In these frameworks, it is possible to derive consistent estimators of Σ even if d ≫ n; see Cai et al. (2016) for a survey of suitable estimators and matching minimax bounds.
The structured covariance matrix estimators admit optimal rates of convergence within the specified class of matrices. However, if the true Σ does not satisfy the structural assumptions, the estimators might perform poorly. For instance, if we use an estimator of Toeplitz form but Σ is not a Toeplitz matrix, then this estimator will be inconsistent. In order to strike a compromise between structured and model-free estimation, one may consider a shrinkage estimator of the form Σ̂^w_n = (1 − w) Σ̂_n + w Σ̃_n, w ∈ [0, 1], where Σ̂_n is the usual sample covariance matrix and the shrinkage target Σ̃_n is a structured estimator. The weight w may be interpreted as quantifying the confidence in the structural assumption underlying the target Σ̃_n. A particularly useful feature of shrinkage estimators is that Σ̂^w_n will be regular, as Σ̃_n is usually chosen as a regular matrix. This makes shrinkage estimation especially interesting for situations where the inverse of the covariance matrix is required, e.g. in portfolio optimization. Ledoit and Wolf (2004) suggest using a multiple of the identity matrix as the target, and Steland (2018) studies more general diagonal matrices. Linear shrinkage towards a structured target with respect to the scaled Frobenius loss has been studied in Ledoit and Wolf (2003), but the results therein on consistent estimators of the optimal shrinkage weights consider classical fixed-d asymptotics only. In the present paper, we go beyond a convex combination of the sample covariance matrix with a single shrinkage target. Instead, we study convex combinations with a structured covariance estimator with known minimax optimality properties for bandable Σ, and a structured covariance estimator which is minimax-optimal if Σ is of Toeplitz type.
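To fix ideas, the convex-combination form of such an estimator can be sketched in a few lines. This is an illustrative sketch, not the implementation used in the paper; the diagonal target below is one simple choice, and all function names are ours:

```python
import numpy as np

def sample_cov(X):
    # Sample covariance of an (n, d) data matrix, assumed centered
    n = X.shape[0]
    return X.T @ X / n

def shrink(S, target, w):
    # Convex combination (1 - w) * S + w * target for w in [0, 1]
    assert 0.0 <= w <= 1.0
    return (1.0 - w) * S + w * target

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 100))   # n = 50 < d = 100, so S is singular
S = sample_cov(X)
target = np.diag(np.diag(S))         # simple diagonal shrinkage target
Sw = shrink(S, target, w=0.3)
```

Because the target is positive definite, the shrunken matrix `Sw` is nonsingular even though `S` has rank at most n < d, which is exactly the regularity property exploited above.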
Theoretical analysis of these structured covariance estimators typically focuses on their consistency and rate of convergence. Here, we contribute by showing that the estimators proposed for bandable and Toeplitz Σ, respectively, are still consistent under a general nonstationary nonlinear time series model, whereas the known results assume iid d-dimensional random vectors. Moreover, to perform statistical inference, distributional approximations are required. Here, we consider inference about the high-dimensional covariance matrix in terms of projections v^T Σ v for some vector v ∈ R^d. The vector v may either be determined by the application, or chosen at random, independently of the sample. For example, in risk management, v may represent the weights of a portfolio of risky assets, and v^T Σ v corresponds to the variance of this portfolio. As another example, the hypothesis Σ = Σ_0 may be studied via random projections: if v is drawn from some continuous distribution, then Σ ≠ Σ_0 implies v^T Σ v ≠ v^T Σ_0 v almost surely. It is thus reasonable to employ a test statistic of the form v^T (S − Σ_0) v for some estimator S, either structured, unstructured, or shrunken. Further applications (lasso prediction, sparse principal components) are discussed below. The idea of studying high-dimensional covariance matrices via bilinear forms has been suggested in Steland and von Sachs (2017, 2018), and applied to changepoint testing (Steland, 2020) and to a K-sample problem (Mause and Steland, 2020). In the latter references, as well as in the present paper, the important assumption is that the projection vectors v have a bounded ℓ1-norm. This condition is satisfied in many applications and enables the methodology to work even for very high dimensions d.
In this paper, we derive Gaussian approximation results for shrinkage estimators of the covariance matrix with structured shrinkage targets Σ̃_n. Our results are valid for nonstationary, nonlinear, high-dimensional time series X_t, and we demonstrate how to perform inference based on a suitable bootstrap scheme. Notably, we only impose assumptions on the marginal time series, while the dependency between the components may be arbitrary, and we need no restrictions on the dimension d. This is a beneficial consequence of the projection technique, as described in Section 2. As we allow the dimension to grow with the sample size, the considered models are arrays of time series. Thus, classical central limit theory is not applicable because a limiting distribution in general does not exist. Instead, we formulate our results in terms of so-called strong approximations, or sequential couplings, in the sense of Komlós et al. (1975). Compared to the existing work of Steland and von Sachs (2017, 2018), we consider a much broader class of nonlinear time series models, which also allows us to study non-diagonal, structured shrinkage targets. Underlying our mathematical results are recent Gaussian couplings for nonstationary time series established in Mies and Steland (2022) for the regime d ≪ n^{1/3}. We extend these results to projections v^T X_t of very high-dimensional time series X_t. In particular, it is shown that for ℓ1-bounded projection vectors v, the Gaussian approximation is valid irrespective of the ambient dimension d and of the dependence structure between the d components. It turns out that this general projection framework is also applicable to bilinear forms of sample covariance matrices, as demonstrated in Section 3.
The presented distributional approximation also holds sequentially, and we demonstrate how to apply it to a test for structural breaks in the covariance matrix. An attractive feature of the proposed test is that it is robust against general nonstationarity under the null hypothesis, and only detects changes in the target parameter, i.e. the marginal covariance structure. Furthermore, we discuss the optimal choice of shrinkage weights and provide a data-driven criterion. Simulation results demonstrate that the asymptotic theory is also applicable in finite samples, and that shrinkage may improve the performance both for estimation and for change testing.
The rest of this paper is structured as follows. In Section 2, we present a general sequential Gaussian approximation result for projections of high-dimensional nonstationary time series, which is applied to bilinear forms of covariance matrices in Section 3. The data-driven choice of shrinkage weights is discussed in Section 4. Section 5 describes the application of our distributional approximations to tests for changes in the covariance structure of a high-dimensional time series, and presents a feasible bootstrap scheme. Simulation results are presented in Section 6.

Gaussian approximations for projections
We consider a d-variate time series X_t which may be nonlinear and nonstationary, given by

X_t = G_t(ǫ_t, ǫ_{t−1}, . . .), t = 1, . . ., n, (1)

for measurable mappings G_t : R^∞ → R^d, where we endow R^∞ with the σ-algebra generated by all finite projections. The ǫ_i are iid U[0, 1] random variables, which may be regarded as random seeds for the time series. The model formulation (1) allows for a convenient formulation of ergodicity conditions via the physical dependence measure introduced by Wu (2005). To formulate our assumptions on the kernels G_t, introduce a second sequence ǫ̃_i of iid U[0, 1] random variables, independent of the sequence ǫ_i. For any t ∈ N and j ≥ 0, denote by X̃_{t,j} = G_t(ǫ_t, . . ., ǫ_{t−j+1}, ǫ̃_{t−j}, ǫ_{t−j−1}, . . .) the coupled random vector obtained by replacing the seed ǫ_{t−j}. We assume that the impact of past random seeds decays to zero polynomially, such that for some β > 1, q > 2, and a factor Θ < ∞,

max_{l=1,. . .,d} sup_t ( E |X_{t,l} − X̃_{t,j,l}|^q )^{1/q} ≤ Θ (j + 1)^{−β}, j ≥ 0. (P.1)

The time series X_t may be nonstationary, and the nonstationarity is made explicit by letting the kernel G_t depend on t. Nevertheless, to obtain stronger asymptotic results, we require some regularity in time. Here, we do not choose classical smoothness conditions, but rather formulate the regularity in terms of the total variation norm of the mapping t → G_t, measured in L_2(P). In particular, we suppose that for some Γ > 0,

max_{l=1,. . .,d} Σ_{t=2}^n ( E |G_t(ǫ_t, ǫ_{t−1}, . . .)_l − G_{t−1}(ǫ_t, ǫ_{t−1}, . . .)_l|^2 )^{1/2} ≤ Γ. (P.2)

A few results additionally require X_t to have finite and stationary eighth moments and β > 2. (P.3)

We highlight that (P.1) and (P.2) only impose conditions on the individual components l = 1, . . ., d, and are thus rather easy to verify even for high-dimensional models. To emphasize: we impose no assumption on the dependency among the d components.
Now consider the projection of the d-variate time series X_t to an m-dimensional time series Z_t, with m ≪ d. That is, for a matrix V ∈ R^{m×d}, we consider the time series Z_t = V X_t. We measure the size of the matrix V by the operator norm ‖V‖_∞ with respect to the maximum vector norm, i.e.

‖V‖_∞ = max_{l=1,. . .,m} ‖V_{l,•}‖_1,

where V_{l,•} denotes the l-th row of the matrix V. This choice of norm uniformly controls the ℓ1-norms of the projection vectors and is motivated by high-dimensional statistical problems. Indeed, as is well known, if d is large compared to n, classical statistical methods usually fail, because even strong signals are overlaid by too many random noise sources (noise accumulation), and spurious correlations occur. To overcome the curse of dimensionality, sparse methods typically use ℓ_r-norms with 1 ≤ r < 2, often leading to s-sparse solutions, i.e. solutions with only s active (non-zero) coordinates. For example, sparse PCA constructs s-sparse directions u onto which the data vectors are projected by calculating u^T X_t. The variance of such a coefficient u^T X_t, given by the quadratic form u^T Σ_n u and coinciding with the associated eigenvalue when u is an eigenvector, provides information about the importance of that direction and is thus of interest.
Similarly, when predicting a future response Ỹ_t by a lasso regression on X_t, one calculates β̂_n^T X̃_t for future regressors X̃_t, where the lasso estimate β̂_n is s-sparse and thus selects s ≪ d variables, so that it behaves like a low-dimensional statistic, circumventing the issues arising for large dimension. The variance β̂_n^T Σ_n β̂_n of the prediction β̂_n^T X̃_t is a natural measure of the prediction accuracy. When considering slightly more general vectors v with bounded ℓ1-norm (uniformly in d and n), we have the bound |v^T X_t| ≤ ‖v‖_1 max_{1≤j≤d} |X_{tj}| at our disposal. Indeed, such projections have bounded moments under the weak assumptions guaranteed by (P.1), as summarized in the following lemma, whose assertions no longer hold if ‖v‖_2 is uniformly bounded but ‖v‖_1 → ∞.
Lemma 2.1. Let q ≥ 1 and suppose that max_{1≤j≤d} E|X_{tj}|^q ≤ Θ for all t, d, n, cf. (P.1). Then for any v ∈ R^d with ‖v‖_1 uniformly bounded by some constant C_v, it holds that sup_t E|v^T X_t|^q ≤ C_v^q Θ.

The formulation of our first main result requires two rates of approximation, appearing in (2) and (3) below.

Theorem 2.2. Let X_t = G_t(ǫ_t, ǫ_{t−1}, . . .) with E(X_t) = 0 be such that (P.1) holds for some q > 2, β > 1, and let Z_t = V X_t for some V ∈ R^{m×d} such that m ≤ cn for some c > 0. Then, on a potentially different probability space, there exist random vectors equal in distribution to the Z_t and independent, mean zero, Gaussian random vectors Y′_t ∈ R^m satisfying the coupling bound (2), for some universal constant C = C(q, β, c).
If, in addition, (P.2) holds, there exist independent, mean zero, Gaussian random variables for which the analogous approximation (3) holds. The rate of the approximation (2) is faster than that of (3). The difference is that in the first case, the covariance structure of the approximating Gaussian random vectors Y′_t is not explicit. In the second case, the distribution of the approximating random vectors is explicit, at the price of assuming some temporal regularity of the stochastic process X_t. In Section 5, we will describe a bootstrap scheme to perform inference based on Theorem 2.2.
Neglecting the high-dimensional context for a moment, the projection Z_t = V X_t may be analyzed as a multivariate, nonstationary time series of fixed dimension m. In this setting, Wu and Zhou (2011) and Karmakar and Wu (2020) study sequential Gaussian approximations, the latter obtaining optimal rates in n which are faster than the rates of Theorem 2.2. However, they require a lower bound on the covariance matrices of Z_t, while our result does not need this assumption. This is particularly relevant because we do not impose any conditions on the dependency among the components of X_t. As an extreme example, if all d components are independent, then an ℓ1-bounded projection will lead to some concentration due to averaging, so that a lower bound on the variance of the projection cannot be guaranteed in general. This prevents an application of the result of Karmakar and Wu (2020) in the present situation. Moreover, the covariance of the approximating random vectors in Karmakar and Wu (2020) is not explicit. Instead, we employ a result on Gaussian couplings presented in Mies and Steland (2022), which is in turn enabled by a recent result of Eldan et al. (2020).

Asymptotics of high-dimensional shrinkage covariance matrix estimators and their projections
For a d-variate centered time series X_t as above, denote the sample covariance matrix by Σ̂_{n,n} = (1/n) Σ_{t=1}^n X_t X_t^T, the partial-sum versions by Σ̂_{k,n} = (1/n) Σ_{t=1}^k X_t X_t^T, k = 1, . . ., n, and the mean of the former by Σ_n = E(Σ̂_{n,n}) = (1/n) Σ_{t=1}^n Cov(X_t).
Hence, we account for nonstationarity of the time series X t by averaging the covariance matrices at all times t = 1, . . ., n.
The central observation enabling our subsequent analysis is that the matrix-valued time series X_t X_t^T fits within the framework of Section 2.

Proposition 3.1. If the time series X_t = G_t(ǫ_t, ǫ_{t−1}, . . .) satisfies (P.1) resp. (P.2) with power 2q and factor Θ, then the d^2-variate time series X̌_t = X_t X_t^T = Ǧ_t(ǫ_t, ǫ_{t−1}, . . .) satisfies (P.1) resp. (P.2) with power q and factor Θ^2.
It has been suggested by Steland and von Sachs (2017) to perform inference on the covariance matrix via the quadratic form v^T Σ_n v for some vector v ∈ R^d, i.e., the variance of the projection v^T X_t. As already discussed above, such projections are ubiquitous in statistical problems, arising for example in optimal portfolio selection, as linear predictors in regression models, or when reducing dimensionality by principal component analysis. The associated empirical version, v^T Σ̂_{n,n} v, may be interpreted as the estimated variance of the univariate projection X̃_t = v^T X_t. The results of Steland and von Sachs (2017) and Steland and von Sachs (2018) on Gaussian approximations of v^T Σ̂_n v, related change-point procedures, and shrinkage estimators impose the assumption that the process X_t is a linear time series with a single innovation process, which may be interpreted as a special type of single-factor model, although approximate vector autoregressions and spiked covariance models are included under regularity conditions, see Steland (2020). Extensions to multi-factor models with infinitely many factors and general multivariate linear processes with finite-dimensional innovations are studied in Bours and Steland (2021). Although such processes, especially factor models, provide a good description of many application scenarios and are widely used, they nevertheless restrict the dependence structure of the d component processes and rule out phenomena such as conditional heteroscedasticity and nonlinearity. Here, by relying on the results of the previous section, we allow for nonstationary nonlinear time series and impose much weaker assumptions on the dependence of the components.
To estimate Σ_n and related quadratic forms v^T Σ_n v, we shall shrink the nonparametric estimator Σ̂_{n,n} towards structured targets, in order to overcome the various issues of the sample covariance matrix arising if d is large relative to n. The first structure we consider is bandability, where the entries are assumed to decay away from the diagonal, |Σ_{i,j}| ≤ g(|i − j|), for some function g(x) → 0 as x → ∞. In this situation, optimal rates of convergence can be achieved via thresholding, see (Cai et al., 2016, Thm. 7) and references therein. In particular, if g(x) ∝ |x|^{−α−1} for some α > 0, then a rate-optimal estimator is given by the tapering estimator

Σ̂†_{k,n} = ( ω_τ(|i − j|) (Σ̂_{k,n})_{i,j} )_{i,j=1,. . .,d},

where the tapering weights ω_τ(x) equal one for x ≤ τ/2, decay linearly to zero on (τ/2, τ), and vanish for x ≥ τ. The optimal rate is achieved by the threshold τ = min(n^{1/(2α+c)}, d) with c = 2 under the Frobenius norm ‖•‖_F and with c = 1 under the spectral norm ‖•‖; the corresponding optimal rates can be found in Cai et al. (2010). Another approach is to impose a Toeplitz structure, i.e. Σ_{l,l′} = σ_{|l−l′|} for some sequence σ_0, σ_1, . . .. The corresponding tapered Toeplitz estimator of Σ is obtained by averaging Σ̂_{k,n} along its diagonals,

Σ̂⋄_{k,n} = ( ω_τ(|i − j|) σ̂_{|i−j|} )_{i,j=1,. . .,d}, σ̂_m = (1/(d − m)) Σ_{l=1}^{d−m} (Σ̂_{k,n})_{l,l+m},

with tapering weights ω_τ(x) as above. If |σ_m| ≤ C m^{−α−1}, then the optimal rate under the spectral norm is achieved by the threshold choice τ = (nd/ log(nd))^{1/(2α+1)}, see (Cai et al., 2016, Thm. 9) and Cai et al. (2013).
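The two structured targets are easy to illustrate in code. The sketch below implements entrywise tapering and diagonal averaging; the piecewise-linear weights follow the common choice in Cai et al. (2010), and the function names are ours:

```python
import numpy as np

def taper_weights(lags, tau):
    # Piecewise-linear tapering weights: 1 up to lag tau/2,
    # linear decay on (tau/2, tau), zero beyond (cf. Cai et al., 2010)
    k = np.abs(lags)
    return np.clip(2.0 - 2.0 * k / tau, 0.0, 1.0)

def tapering_estimator(S, tau):
    # Bandable target: down-weight entries far from the main diagonal
    d = S.shape[0]
    lags = np.subtract.outer(np.arange(d), np.arange(d))
    return taper_weights(lags, tau) * S

def toeplitz_estimator(S, tau):
    # Toeplitz target: average S along its diagonals, then taper
    d = S.shape[0]
    sigma = np.array([np.diagonal(S, m).mean() for m in range(d)])
    lags = np.abs(np.subtract.outer(np.arange(d), np.arange(d)))
    return taper_weights(lags, tau) * sigma[lags]
```

Both functions map a (possibly singular) sample covariance matrix to a structured matrix of the same size; applying them to a diagonal input leaves it unchanged, since the central band receives weight one.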
The following theorem shows that the full-sample estimators Σ̂†_n = Σ̂†_{n,n} and Σ̂⋄_n = Σ̂⋄_{n,n} are consistent in the (rescaled) Frobenius norm, under mild regularity conditions. To the best of our knowledge, these estimators have not yet been studied under such a general nonstationary nonlinear time series model as given by (P.1) and (P.2). Moreover, Σ̂†_n is rate-optimal for bandable covariance matrices. Note that the optimal rates and thresholds have been obtained under the assumption that the X_t are iid random vectors with mean zero and a bandable or Toeplitz covariance matrix Σ satisfying further regularity conditions.
Denote by Σ†_n and Σ⋄_n the theoretical oracle estimators associated to the estimators Σ̂†_{n,n} and Σ̂⋄_{n,n}, respectively, obtained by replacing in their definitions the estimates (Σ̂_{n,n})_{i,j} by the true covariances (Σ_n)_{i,j}. Recall also that Σ_n = E(Σ̂_{n,n}). Denote for A, B ≥ 0 the scaled inner product by ⟨A, B⟩_* = tr(A^T B)/d, and let ‖A‖_{F*} = ⟨A, A⟩_*^{1/2} be the associated scaled Frobenius matrix norm. This scaling ensures that the unit matrix has norm 1 whatever the dimension d. The following theorem studies the above estimators under the general time series model given by (P.1) and (P.2) and provides sufficient conditions on the growth of the dimension d relative to the sample size when measuring the estimation error in terms of the scaled Frobenius risk.
Theorem 3.2. Suppose that {X_t} satisfies (P.1) with q ≥ 4 and β > 1, and that Cov(X_t) is bandable with exponent α > 0 uniformly for all t, i.e. |Cov(X_t)_{i,j}| ≤ C|i − j|^{−α−1}. (i) For any threshold τ, the scaled Frobenius risk of Σ̂†_n admits an explicit bound; choosing τ = min(n^{1/(2α+2)}, d) balances its terms. (ii) Choosing τ = min(n^{1/(2α+1)}, d) yields the corresponding bound under the spectral norm. In particular, the optimal rate of convergence of the tapering estimator Σ̂†_n carries over to the dependent, nonstationary case, with identical threshold values.
A shrinkage estimator may be defined for any weight vector w = (w_1, w_2, w_3) with w_i ≥ 0 and w_1 + w_2 + w_3 = 1 as Σ̂^w_{k,n} = w_1 Σ̂_{k,n} + w_2 Σ̂†_{k,n} + w_3 Σ̂⋄_{k,n}. It turns out that the theory presented in Section 2 may be applied to the quadratic form v^T Σ̂^w_{k,n} v for any shrinkage weight w, and hence also to the individual statistics v^T Σ̂_{k,n} v, v^T Σ̂†_{k,n} v, and v^T Σ̂⋄_{k,n} v. The crucial observation is that these quadratic forms can be regarded as linear projections of the (d × d)-variate time series X_t X_t^T with respect to the Frobenius inner product ⟨A, B⟩_F = Σ_{i,j} A_{i,j} B_{i,j}, which coincides with the usual inner product on R^{d^2} of the vectorized matrices. Indeed, we have the representations v^T Σ̂_{k,n} v = ⟨Σ̂_{k,n}, ν_1⟩_F, v^T Σ̂†_{k,n} v = ⟨Σ̂_{k,n}, ν_2⟩_F, and v^T Σ̂⋄_{k,n} v = ⟨Σ̂_{k,n}, ν_3⟩_F for projection weighting matrices ν_1, ν_2, ν_3 ∈ R^{d×d} ≡ R^{d^2}, where ν_1 = v v^T and ν_2, ν_3 are obtained from v v^T by applying the tapering weights and the additional averaging along diagonals, respectively. In particular, we may write v^T Σ̂^w_{k,n} v = ⟨Σ̂_{k,n}, w_1 ν_1 + w_2 ν_2 + w_3 ν_3⟩_F as a Frobenius inner product. With Theorem 2.2, we obtain the following result.
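The identity v^T Σ̂ v = ⟨Σ̂, v v^T⟩_F behind this projection viewpoint is easy to verify numerically; the following small sketch uses hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 20
X = rng.standard_normal((n, d))        # hypothetical centered data
v = rng.standard_normal(d)
v /= np.abs(v).sum()                   # normalize to unit l1-norm

S = X.T @ X / n                        # sample covariance
quad = v @ S @ v                       # bilinear form v^T S v

# The same number, written as an average of Frobenius inner
# products <X_t X_t^T, nu1>_F with weighting matrix nu1 = v v^T:
nu1 = np.outer(v, v)
proj = np.mean([np.sum(np.outer(x, x) * nu1) for x in X])
```

Here `quad` and `proj` agree up to floating-point error, which is exactly the reduction of the bilinear form to a linear projection of the d^2-variate series X_t X_t^T.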
Theorem 3.3. Suppose that the time series X_t = G_t(ǫ_t, ǫ_{t−1}, . . .) satisfies (P.1) resp. (P.2) with power 2q and factor Θ. Then, the time series X_t may be defined on a different probability space, such that there exist independent, mean zero, Gaussian random vectors η_t approximating the partial sums of the statistics above. The covariance matrices Ξ_t of the Gaussian vectors η_t ∼ N(0, Ξ_t) are explicit. It is worth mentioning that the bound in Theorem 3.3 allows ‖v‖_1 to increase as the dimension d or the sample size n grows, as long as ‖v‖_1 = o((log n)^{1/4} n^{−ξ(q,β)/2}).

Optimal shrinkage weights
The question arises how the shrinkage weights relate to the theoretical (oracle) performance of the resulting shrinkage estimator. To pursue such a study, we first consider the problem of combining the nonparametric estimator with oracles of the two targets within a framework proposed by Ledoit and Wolf (2004), aiming to minimize the squared Frobenius risk. This leads to a convex but box-constrained optimization problem allowing for an explicit interior solution (if it exists), which we briefly discuss. We propose consistent estimators of the unknown quantities determining the optimization problem and its solution(s), leading to data-adaptive bona fide estimates.
For simplicity of presentation, we elaborate this for the full sample of size n; the corresponding sequential estimates and oracles based on the first k observations can then be derived easily. Write Σ̂_n = Σ̂_{n,n} and recall that Σ†_n and Σ⋄_n are the theoretical oracle estimators associated to the estimators Σ̂†_{n,n} and Σ̂⋄_{n,n}, respectively, obtained by replacing in their definitions the estimates (Σ̂_{n,n})_{i,j} by the true covariances (Σ_n)_{i,j}. Equivalently, Σ⋄_n = E Σ̂⋄_n and Σ†_n = E Σ̂†_n. We wish to determine the optimal ideal shrinkage weights w = (w_1, w_2, w_3) such that the resulting (scaled) Frobenius risk is minimized when combining the nonparametric estimator Σ̂_n with the oracle estimators Σ†_n and Σ⋄_n. The scaled mean squared error of the sample covariance matrix is given by MSE(Σ̂_n) = MSE(Σ̂_n; Σ_n) = E ‖Σ̂_n − Σ_n‖²_{F*}, and we consider the risk minimization problem

min_w E ‖w_1 Σ̂_n + w_2 Σ†_n + w_3 Σ⋄_n − Σ_n‖²_{F*} subject to w_1 + w_2 + w_3 = 1.

Let E†_n = ‖Σ†_n − Σ_n‖²_{F*} and E⋄_n = ‖Σ⋄_n − Σ_n‖²_{F*} be the squared approximation errors corresponding to the oracles, and D_n = ⟨Σ_n − Σ†_n, Σ_n − Σ⋄_n⟩_* their inner product. Adding the constraints 0 ≤ w_1, w_2 ≤ 1 leads to a quadratic programming problem under box constraints, which generally requires numerical algorithms for its solution. We discuss the special case of an interior solution, where explicit formulas can be derived, at the end of this section. The optimal weights depend on the unknown values MSE(Σ̂_n), E†_n, E⋄_n, and D_n. To obtain a data-driven solution for the optimal weights, we need consistent estimators of these quantities, which are also interesting in their own right.
Under the assumptions of Proposition 3.1, MSE(Σ̂_n) = (nd)^{−1} Σ_{i,j} Var(√n (Σ̂_n)_{i,j}). Sancetta (2008) proposed to estimate MSE(Σ̂_n) by estimating the long-run variance parameters Var(√n (Σ̂_n)_{i,j}) by kernel-type long-run variance estimators, where K is a positive weighting function and b > 0 is a bandwidth, applied to the centered products X_{t,i} X_{t,j}. The mean squared error may then be estimated as M̂SE(Σ̂_n) = (nd)^{−1} Σ_{i,j} V̂ar(√n (Σ̂_n)_{i,j}).
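A minimal sketch of a kernel long-run variance estimator of this type follows, using the Bartlett weighting function for concreteness; Sancetta's exact weighting K and normalization may differ, and the function name is ours:

```python
import numpy as np

def long_run_variance(y, b):
    # Bartlett-kernel estimate of the long-run variance of sqrt(n) * mean(y):
    # gamma_0 + 2 * sum_k (1 - k/(b+1)) * gamma_k over lags k = 1, ..., b
    y = np.asarray(y, dtype=float)
    y = y - y.mean()
    n = len(y)
    lrv = np.dot(y, y) / n
    for k in range(1, b + 1):
        gamma_k = np.dot(y[k:], y[:-k]) / n
        lrv += 2.0 * (1.0 - k / (b + 1)) * gamma_k
    return lrv
```

Applied to the products y_t = X_{t,i} X_{t,j}, this yields an estimate of Var(√n (Σ̂_n)_{i,j}); summing over all (i, j) and dividing by n·d then gives a plug-in estimate of the scaled MSE.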
The result is as follows, see (Sancetta, 2008, Lemma 1). In practice, the bandwidth b needs to be chosen in a data-driven way, e.g. by the method described in Newey and West (1994).
Let us now derive consistent estimators for the approximation errors E†_n and E⋄_n and the inner product D_n. These quantities can be estimated by plugging in the estimators Σ̂⋄_n and Σ̂†_n; hence, define Ê⋄_n and Ê†_n accordingly. Estimation of ∆†_n is delicate, since a naive approach involves estimating terms which would lead to the unsatisfactory condition d^2 = o(n). For the class of bandable covariance matrices under consideration, we can proceed by introducing a second threshold σ with τ ≪ σ ≪ d and estimating only the covariances on the diagonals τ/2 + 1, . . ., σ. Consistency of D̂_n, however, requires an extra condition compared to the estimators of the approximation errors. The reason is that Σ_n − Σ†_n tends to zero in the ‖•‖_{F*}-norm within the class of bandable matrices, whereas the error Σ_n − Σ⋄_n may diverge. But within the subclass of bandable matrices which are located in a neighborhood of their Toeplitz approximation, consistency can be established, where Σ⋄_n is determined using the threshold n^{1/(2α+c)}; the thresholds for the estimators are given below. We formulate the results for the high-dimensional case d ≥ n^{1/(2α+c)}.
Theorem 4.2. Suppose that {X_t} satisfies (P.1) and (P.2) and Var(X_t) is bandable for all t. Further, assume that d ≥ n^{1/(2α+c)}, and that the estimators Σ̂†_n, ∆̂†_n and Σ̂⋄_n are calculated with the thresholds τ and σ specified above. Then these estimators are consistent in probability.
If, for example, α = 1, then the dimension needs to satisfy d^{3/4} = o(n), i.e. it may grow almost as fast as n^{4/3}; for α = 2 it may grow almost as fast as n^{6/5}.
Having consistent estimators at our disposal, we can propose bona fide estimators ŵ* of the optimal weights, obtained by replacing MSE(Σ̂_n), E†_n, E⋄_n and D_n in the optimization problem by their estimates. This yields the feasible shrinkage estimator Σ̂^{ŵ*}_n = ŵ*_1 Σ̂_n + ŵ*_2 Σ̂†_n + ŵ*_3 Σ̂⋄_n, whose performance is assessed via simulations in Section 6.
Let us now briefly discuss the case of an interior solution of the minimization problem (5). Writing the objective as a quadratic form in the weights with matrix Q_n, we are given a quadratic minimization problem under box constraints, and a natural condition is to assume that det(Q_n) > 0. If an interior solution w* ∈ (0, 1)^3 exists (or the box constraint is omitted), an explicit solution formula is easily derived. We see from the formula for w*_1 that the optimal solution prefers the nonparametric estimator Σ̂_n if the truth Σ_n cannot be well approximated by a banded or a Toeplitz matrix. The banded estimator receives a large weight if the distance between the Toeplitz oracle and Σ_n is large. Analogously, the optimal solution assigns a large weight to the Toeplitz estimator if Σ_n is not well approximated by the banded oracle, so that ‖Σ_n − Σ†_n‖²_F is large. As a result, it is optimal to distribute the weights across the estimators Σ̂_n, Σ̂†_n and Σ̂⋄_n according to how well they approximate Σ_n, where the approximation accuracy is measured by the squared distances of the oracles to the truth and, in the case of the nonparametric estimator, by the MSE. Moreover, the ratio w*_2/w*_3 of the oracle shrinkage weights does not depend on MSE(Σ̂_n). Further, w*_2 > w*_3 (banding preferable to Toeplitz) if and only if the banded oracle is the better approximation. Even more is true: the above derivations remain formally valid without any changes if we use an arbitrary unbiased estimator Σ̄_n of Σ_n, i.e., an arbitrary measurable function Σ̄_n of the data with E(Σ̄_n) = Σ_n. Therefore, we have the following interesting result: the ratio w*_2/w*_3 is a universal characteristic of the risk minimization problem (4) in the sense that it is independent of the nonparametric estimator and completely determined by Σ_n.
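For reference, the interior solution of a generic equality-constrained quadratic program of this shape can be sketched via a Lagrange multiplier. Here Q_n and c_n stand for the matrix and vector of the quadratic objective (our notation for the sketch), and the box constraints are ignored, which is valid exactly when the unconstrained minimizer lies in (0, 1)^3:

```latex
% Generic problem:
\min_{w \in \mathbb{R}^3} \; w^\top Q_n w - 2\, c_n^\top w
\quad \text{s.t.} \quad \mathbf{1}^\top w = 1 .
% Stationarity of L(w,\lambda) = w^\top Q_n w - 2 c_n^\top w + \lambda(\mathbf{1}^\top w - 1)
% gives 2 Q_n w = 2 c_n - \lambda \mathbf{1}, and solving for \lambda via the constraint:
w^* \;=\; Q_n^{-1} c_n
  \;+\; \frac{1 - \mathbf{1}^\top Q_n^{-1} c_n}{\mathbf{1}^\top Q_n^{-1} \mathbf{1}}\; Q_n^{-1} \mathbf{1}.
```

The condition det(Q_n) > 0 mentioned above ensures that Q_n is invertible, so this formula is well defined.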

Testing for structural breaks
The sequential approximation result of Theorem 3.3 may be applied to test for changes in the covariance matrix Σ_t = Cov(X_t, X_t) ∈ R^{d×d} of a time series X_t. To be precise, we want to test the null hypothesis H_0 : Σ_1 = Σ_2 = . . . = Σ_n against the alternative of a change in the projected covariance structure, and we propose CUSUM-type statistics T_n(w) and T*_n based on the sequential quadratic forms v^T Σ̂^w_{k,n} v. To determine critical values for the statistic T_n(w) resp. T*_n, we employ a bootstrap scheme suggested in Mies and Steland (2022). In particular, let η_t ∼ N(0, Â_t) be independent Gaussian random vectors in R^3, where Â_t is estimated blockwise from the data,
and Â_t = 0 for t < b. Here, b is a block-length tuning parameter which needs to be chosen suitably, see below. We now define the bootstrapped version of the changepoint statistics by applying the same CUSUM functional to the partial sums of the η_t. Finally, for α ∈ (0, 1), introduce the conditional quantiles of the bootstrap statistics given the data. We suggest rejecting the null hypothesis if the test statistic exceeds the corresponding bootstrap quantile. The following theorem shows that this scheme is asymptotically valid. Let X_{t,n} = G^n_t(ǫ_t, ǫ_{t−1}, . . .) be an array of d_n-variate time series, such that each kernel G^n_t satisfies (P.1) and (P.2), for some q > 8, β > 2, and with factors Θ and Γ not depending on n. Choose the block length b such that b = b_n ≍ n^ζ for some ζ ∈ (0, 1). If Cov(X_{t,n}) = Σ_0 for all t = 1, . . ., n and some covariance matrix Σ_0, then the asymptotic rejection probability is bounded by the nominal level. Hence, the proposed bootstrap scheme indeed maintains the specified size of the test for a change in covariance. A particular feature of this test is that it is robust against nuisance changes, because even under the null hypothesis the time series X_t may be nonstationary, apart from its constant marginal covariance matrix. The relevance of this kind of robustness was first highlighted by Zhou (2013) for changes in the mean, see also Górecki et al. (2018) and Pešta and Wendler (2020). Changes in the second-moment structure in the presence of a non-constant mean have been studied by Dette et al. (2019) and Schmidt et al. (2021), and a robust CUSUM test for a broad class of parameters is introduced in Mies (2021). Nevertheless, all these references focus on the low-dimensional case, whereas our new test is also applicable in high dimensions.
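The logic of the test (a CUSUM of a projected statistic, with critical values from blockwise Gaussian multipliers) can be sketched generically as follows. This is a simplified univariate stand-in for the actual scheme, with our own function names and a plain per-block standard-deviation estimate in place of the local long-run covariance:

```python
import numpy as np

def cusum_stat(y):
    # max_k |S_k - (k/n) * S_n| / sqrt(n) for partial sums S_k of y
    n = len(y)
    s = np.cumsum(y)
    k = np.arange(1, n + 1)
    return np.max(np.abs(s - k / n * s[-1])) / np.sqrt(n)

def bootstrap_quantile(y, b, alpha, n_boot=500, seed=0):
    # Gaussian block multiplier bootstrap: per non-overlapping block of
    # length b, draw Gaussians with the locally estimated scale, and
    # recompute the CUSUM statistic on the synthetic series
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    blocks = [y[i:i + b] for i in range(0, len(y) - b + 1, b)]
    scales = np.array([blk.std() for blk in blocks])
    stats = [
        cusum_stat(np.concatenate([rng.standard_normal(b) * s for s in scales]))
        for _ in range(n_boot)
    ]
    return np.quantile(stats, 1.0 - alpha)
```

Because the Gaussian multipliers inherit the local scale of each block, the bootstrap distribution adapts to nonstationary nuisance variation, mirroring the robustness property discussed above.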
The power of the proposed test and bootstrap scheme is hard to investigate analytically, because the model framework is rather broad.Instead, we analyze the behavior of the proposed changepoint test via simulations in the subsequent section.

Simulation
To demonstrate the implementation of our proposed changepoint test and the benefits of shrinkage, we assess our methodology for three examples of high-dimensional time series.

Model A
First, we simulate a high-dimensional time series according to the vector ARMA(1,1) model X_t = a Π X_{t−1} + ǫ_t + b Π ǫ_{t−1}. The matrix Π ∈ R^{d×d} is a permutation matrix, such that the model is non-explosive. Furthermore, for any t, the (ǫ_t)_i, i = 1, . . ., d, are chosen as independent, zero-mean random variables having a symmetrized Gamma distribution with shape parameter α(t/n), standardized to unit variance. We set α(u) = 2 + sin(2πu), u ∈ [0, 1]. Thus, the autocovariance structure is stationary, but the time series is nonstationary. In particular, the long-run covariance matrices Ξ_t which occur in Theorem 3.3 are non-constant.
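A sketch of this data-generating process follows; the VARMA(1,1) recursion X_t = aΠX_{t−1} + ǫ_t + bΠǫ_{t−1} is our reading of the model description and should be treated as an assumption, as should the function name:

```python
import numpy as np

def simulate_model_a(n, d, a=0.5, b=0.5, seed=0):
    # Sketch of Model A: X_t = a * P X_{t-1} + eps_t + b * P eps_{t-1},
    # with P a fixed permutation (applied here as an index map) and
    # symmetrized Gamma innovations with time-varying shape alpha(t/n),
    # standardized to unit variance.
    rng = np.random.default_rng(seed)
    perm = rng.permutation(d)
    X = np.zeros((n, d))
    x_prev = np.zeros(d)
    eps_prev = np.zeros(d)
    for t in range(n):
        alpha = 2.0 + np.sin(2.0 * np.pi * t / n)     # shape parameter alpha(t/n)
        g = rng.gamma(alpha, 1.0, size=d)
        sign = rng.choice([-1.0, 1.0], size=d)
        eps = sign * g / np.sqrt(alpha + alpha ** 2)  # Var(sign * g) = alpha + alpha^2
        X[t] = a * x_prev[perm] + eps + b * eps_prev[perm]
        x_prev, eps_prev = X[t], eps
    return X

X = simulate_model_a(500, 20)
```

The random sign makes the innovations symmetric around zero, and dividing by √(α + α²) standardizes them to unit variance, so the second-moment structure stays constant while the innovation law varies over time.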
We assess our changepoint procedure under the null hypothesis of no change, and under the alternative where b changes at time t = ⌈n/2⌉. The projection vector v is chosen randomly as N/‖N‖_1 for a standard Gaussian random vector N ∼ N(0, I_(d×d)). The threshold for the tapering estimator is chosen as τ†_n, and the threshold for the Toeplitz estimator as τ⋄_n = (nd/log(nd))^(1/(2α+1)). Note that these thresholds are optimal for estimation under the spectral norm in the class of bandable matrices with decay rate α. In practice, α is unknown, and we use the threshold for α = 2 in our simulations. To determine the shrinkage weights, we also use this threshold. For the bootstrap scheme, the additional offset is chosen as δ_n = log(n)^(−4), and the block length is b_n = ⌈5 n^0.2⌉.
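The tuning choices just listed can be computed directly; the following snippet collects them. The tapering threshold τ†_n is omitted because its formula is not reproduced in the text.

```python
import math

def tuning_parameters(n, d, alpha=2):
    """Tuning choices from the text (alpha = 2 in the simulations)."""
    tau_diamond = (n * d / math.log(n * d)) ** (1.0 / (2 * alpha + 1))
    delta_n = math.log(n) ** (-4)        # bootstrap offset
    b_n = math.ceil(5 * n ** 0.2)        # block length
    return tau_diamond, delta_n, b_n

tau, delta, b = tuning_parameters(n=500, d=100)
```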
We also consider the data-driven shrinkage weight w*, as described in Section 4, which is not covered by our Gaussian approximation results. Regarding the choice of the bandwidth for the estimator MSE(Σn), we employ the method of Newey and West (1994) with the Bartlett kernel, as implemented in the R package sandwich (see Zeileis et al. (2020)), and we determine a separate bandwidth for each component Var(√n (Σn)_i,j).
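The Bartlett-kernel long-run variance underlying this bandwidth method can be sketched as follows for a scalar series and a fixed bandwidth. The automatic bandwidth selection of Newey and West (1994), as implemented in the R package sandwich, is not replicated here.

```python
import numpy as np

def bartlett_lrv(y, bandwidth):
    """Long-run variance estimate of a scalar series using the Bartlett
    kernel with a fixed bandwidth (automatic selection not replicated)."""
    y = np.asarray(y, dtype=float) - np.mean(y)
    n = len(y)
    lrv = np.mean(y ** 2)
    for k in range(1, int(bandwidth) + 1):
        w = 1.0 - k / (bandwidth + 1.0)        # Bartlett weights
        gamma_k = np.sum(y[k:] * y[:-k]) / n   # lag-k autocovariance
        lrv += 2.0 * w * gamma_k
    return lrv
```

For an i.i.d. series with unit variance, the estimate should be close to 1, since all weighted autocovariance terms vanish asymptotically.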
When simulating the process, we set a = 0.5 and b = 0.5 (resp. b = 0.75 post-change). Based on our simulation results, we find that the test maintains the specified size of 10% and is indeed slightly conservative, as established theoretically in Section 5. Moreover, the power of the changepoint test increases with the sample size, as expected, but also with the dimension. This may be explained by the fact that the change affects all coordinates, so that the additional information can be used to improve the detection performance. It is also interesting to observe that the additional use of the tapered and the Toeplitz estimator further improves the power of the test. This is noteworthy because the shrinkage estimator was originally proposed as a method to improve the performance of a point estimate. Our simulation results demonstrate that shrinkage can also be useful for testing.

Models B & C
As a second example, we consider the vector ARMA(1,1) process X_t with a and b as above and ε_t ∼ N(0, A) i.i.d. random vectors, and we initialize X_0 ∼ N(0, A(1 + b²)/(1 − a²)) to ensure weak stationarity. The symmetric positive semidefinite matrix A ∈ R^(d×d) is chosen with entries determined by values t_i through the autocovariance function of fractional Gaussian noise. For the values t_i, we consider the scenarios (B) t_i = i, such that A is a Toeplitz matrix, and (C) t_i = √i, such that A is not a Toeplitz matrix. For the bootstrap scheme and for the determination of the shrinkage weights, we use the same settings as in Model A.
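The construction of the innovation covariance A can be sketched as follows, assuming entries A_ij = γ(t_i − t_j) for the fractional Gaussian noise autocovariance γ(k) = (|k+1|^(2H) − 2|k|^(2H) + |k−1|^(2H))/2. The Hurst parameter H is not specified in the text, so its value here is an assumption.

```python
import numpy as np

def fgn_autocov(k, H=0.7):
    """Autocovariance of fractional Gaussian noise at (real-valued) lag k.
    The Hurst parameter H = 0.7 is an assumption."""
    k = np.abs(np.asarray(k, dtype=float))
    return 0.5 * (np.abs(k + 1) ** (2 * H)
                  - 2 * k ** (2 * H)
                  + np.abs(k - 1) ** (2 * H))

def innovation_cov(d, scenario="B", H=0.7):
    """A_ij = gamma(t_i - t_j) with t_i = i (Toeplitz, Model B)
    or t_i = sqrt(i) (non-Toeplitz, Model C)."""
    i = np.arange(1, d + 1, dtype=float)
    t = i if scenario == "B" else np.sqrt(i)
    return fgn_autocov(t[:, None] - t[None, :], H=H)
```

By construction, Model B yields constant diagonals, while the square-root spacing in Model C destroys the Toeplitz structure.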
The size and power of the changepoint test are presented in Table 1. As for Model A, the test turns out to be conservative, and gains power with increasing sample size and higher dimension. In our simulations, we also find that the deterministic weight w^(2) leads to higher power than the data-driven weight w*. Thus, future work could explore the optimal choice of shrinkage weights from a testing perspective.
Under the null hypothesis of no change, the marginal covariance matrix of X_t is given by Cov(X_t) = A(1 + 2ab + b²)/(1 − a²). Thanks to this explicit formula, we are able to analyze the estimation error of the proposed shrinkage estimator with data-driven shrinkage weights, see Table 2. As benchmarks, we evaluate the performance of the estimators Σn, Σ†n, and Σ⋄n individually. For Model B, the Toeplitz estimator Σ⋄n is found to perform best, which could be expected because the true covariance matrix is of Toeplitz type. On the other hand, the shrinkage estimator always improves upon the sample covariance matrix Σn, and is often close to the performance of the tapered estimator Σ†n. For Model C, the off-diagonal entries of the true covariance matrix decay more slowly, hence the tapered estimator Σ†n also performs worse. Indeed, its estimation error is larger than that of the sample covariance Σn. Yet, for n ≥ 500, the error of the shrinkage estimator is lower than that of all other estimators. This demonstrates that our proposed shrinkage estimator does not merely choose among the three estimators, but may indeed improve the performance further through the convex combination.
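The benefit of a convex combination can be seen in a toy example: a noisy sample covariance combined with a structured (here, diagonal) target often attains a smaller Frobenius error than either endpoint. This is only an illustration of the effect discussed above, not the paper's shrinkage estimator or its weight-selection rule.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 30, 40
# AR(1)-type ground truth covariance (illustrative choice)
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(d), np.arange(d)))

X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
S = np.cov(X, rowvar=False)           # noisy sample covariance (n close to d)
T = np.diag(np.diag(S))               # structured diagonal target

def ferr(M):
    """Frobenius-norm estimation error relative to the truth."""
    return np.linalg.norm(M - Sigma)

# Error of the convex combination w*T + (1-w)*S over a grid of weights.
weights = np.linspace(0, 1, 101)
errs = [ferr(w * T + (1 - w) * S) for w in weights]
best = weights[int(np.argmin(errs))]
# Typically the minimizing weight lies strictly between 0 and 1,
# so the combination beats both individual estimators.
```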

Proofs
Proof of Lemma 2.1. The result follows from Jensen's inequality.

Proof of Theorem 2.2. The time series Z_t may be written as Z_t = G̃_t(ε_t), with kernel G̃_t = V G_t, such that we may apply Theorem 3.1 of Mies and Steland (2022). It can be verified that G̃_t satisfies conditions (G.1) and (G.2) therein, with Γ̃ = Γ and Θ̃ = √m Θ ‖V‖_∞. To see this, note that the corresponding norm bound holds for any m-variate random variable X and any vector x ∈ R^m. Moreover, an analogous bound holds for any l = 1, …, m. The same argument can be used to establish (G.1) and (G.2).
Proof of Proposition 3.1. Observe that the following holds for four random variables. Inequality (P.1) may be obtained by setting the quantities accordingly.

Proof of Theorem 3.2. By (Mies and Steland, 2022, Th. 3.2), we have the stated bound under Assumption (P.1), for all d and n. To show the bound for Σ†n, observe the above decomposition. The optimal rate is attained for τ = min(n^(1/(2α+2)), d). To show the bound for Σ⋄n, observe that Jensen's inequality yields the analogous estimate. Moreover, G_n = O(min(d, τ)/n). We obtain the same bound as in case (i), i.e.
Optimal choice of τ yields the same rates as in case (i).
Proof of Theorem 4.2. The first assertion follows from rearranging terms and the triangle inequality; hence, the result follows from Theorem 3. We further use the bounds which follow from the triangle inequality. Therefore,

in probability, which in turn gives Ê⋄n − E⋄n → 0 in probability. To show (ii), observe that, similarly as above, analogous bounds hold. Thus, it suffices to study E‖∆̂†n − ∆†n‖²_F*. Both terms are o(1), and the first term dominates if and only if s > c/(2α). (iii) The claim follows directly by assumption. This completes the proof.

Proof of Theorem 5.1. Note that the statistic T_n and the critical values a_n(α, w) are linear in ‖v‖_1, such that we may suppose without loss of generality that ‖v‖_1 = 1. If we replace the χ_s by the centered random vectors χ̄_s = [ν_j • (X_s X_s^T − Σ_0)]_(j=1)^3, then Theorem 5.1 is a special case of (Mies and Steland, 2022, Prop. 4.3). We briefly show that the estimation error of Σn relative to the true covariance matrix is negligible. To this end, denote the sequential estimators of the asymptotic covariance, for k = b, …, n, where ‖A‖_tr = tr((A A^T)^(1/2)) denotes the trace norm of a symmetric matrix. Since the matrices Q̂(k) and Q̄(k) are always of fixed dimension 3 × 3, the choice of norm is in fact irrelevant, as all norms are equivalent, and we may use an arbitrary matrix norm ‖·‖. If (7) holds, then both Q̂(k) and Q̄(k) satisfy Theorem 5.1 of Mies and Steland (2022), such that the bootstrap scheme based on Q̂(k) is valid.
To establish (7), denote ∆_n = [ν_j • (Σ_0 − Σn,n)]_(j=1)^3 = χ_s − χ̄_s ∈ R^3, for all s. Via a quadratic expansion, we find that, for some universal constant C which may change from line to line, the stated bounds hold. By virtue of Proposition 3.1 and Lemma 2.1, the centered time series χ̄_s satisfies condition (G.1) of Mies and Steland (2022), such that the Rosenthal-type inequality, Theorem 3.2 therein, is applicable. This yields the desired rate, because Θ is fixed in the asymptotic regime of Theorem 5.1. An analogous bound holds for the remaining term. Hence, the last inequality holds because b ≤ n, and this establishes (7).
For the second assertion, use the fact that if A_n − B_n ≤ C_n on {A_n ≥ B_n} and B_n − A_n ≤ D_n on {A_n < B_n}, for random variables A_n, B_n and bounds C_n, D_n, then E|A_n − B_n| ≤ E(A_n − B_n) 1_(A_n ≥ B_n) + E(B_n − A_n) 1_(A_n < B_n) ≤ E C_n + E D_n. Apply this bound with A_n = ‖Σ̂⋄_n − Σ̂†_n‖_F* (the estimators) and B_n = ‖Σ⋄_n − Σ†_n‖_F* (their population counterparts), and use the bound ‖Σ̂⋄_n − Σ̂†_n‖_F* − ‖Σ⋄_n − Σ†_n‖_F* ≤ ‖(Σ̂⋄_n − Σ̂†_n) − (Σ⋄_n − Σ†_n)‖_F* = C_n, as well as

Table 2 :
Mean squared estimation error of the marginal covariance Σ = Cov(X_t) under stationarity, measured in the scaled Frobenius norm ‖·‖_F*. Errors are reported, in this order, for (i) the sample covariance Σn, (ii) the tapered estimator Σ†n, (iii) the Toeplitz estimator Σ⋄n, and (iv) the shrinkage estimator with data-driven weights.