Penalized estimation of threshold auto-regressive models with many components and thresholds

Thanks to their simplicity and interpretable structure, autoregressive processes are widely used to model time series data. However, many real time series data sets exhibit non-linear patterns, requiring non-linear modeling. The threshold Auto-Regressive (TAR) process provides a family of non-linear auto-regressive time series models in which the process dynamics change according to a step function of a thresholding variable. While estimation and inference for low-dimensional TAR models have been investigated, high-dimensional TAR models have received less attention. In this article, we develop a new framework for estimating high-dimensional TAR models, and propose two different sparsity-inducing penalties. The first penalty corresponds to a natural extension of the classical TAR model to high-dimensional settings, where the same threshold is enforced for all model parameters. Our second penalty yields a more flexible TAR model, where different thresholds are allowed for different auto-regressive coefficients. We show that both penalized estimation strategies can be utilized in a three-step procedure that consistently learns both the thresholds and the corresponding auto-regressive coefficients. However, our theoretical and empirical investigations show that the direct extension of the TAR model is not appropriate for high-dimensional settings and is better suited for moderate dimensions. In contrast, the more flexible extension of the TAR model leads to consistent estimation and superior empirical performance in high dimensions.


Introduction
The threshold Auto-Regressive (TAR) model [44,46] allows regime-specific auto-regressive parameters, where the regimes are governed by a thresholding random variable, typically some previous lag of the time series (see formal definition in Section 2). Thanks to its flexibility, the TAR model has become a popular framework for analyzing non-linear time series from diverse application domains, from economics [27] and finance [11] to genomics [24] and epidemiology [53]. Applications in macroeconomics have been particularly diverse: Enders et al. [15] modeled the U.S. GDP growth, and constructed confidence intervals for the parameters; Juvenal and Taylor [25] explored the validity of the law of one price in nine European countries; and Aslan et al.
[1] applied a TAR model to commodity prices, and used it to represent abrupt changes, time-irreversibility, and regime-shifting behavior. See Hansen [20] for a selective review of threshold autoregression in economics.
TAR models have been extensively studied in univariate and fixed-dimensional settings. For example, Chan [8] investigated the asymptotic properties of least squares estimation for TAR models with two regimes, Chen [12] proposed an estimation procedure when the thresholding variable is unknown, Bruce [6] derived the asymptotic distribution of general TAR models, and Li et al. [29] developed the asymptotic theory of the least squares estimator for a moving average TAR model. In other related work, Chan and Kutoyants [9] proved the consistency of a Bayesian estimator of the TAR model, while Chan et al. [10] proposed a novel modified LASSO approach for threshold estimation and established its consistency in multiple threshold models. Tsay [48] first extended univariate TAR models to multivariate settings, and proposed to use grid search based on the Akaike information criterion (AIC) to select the thresholds. Later, Lo and Zivot [33], Hansen and Seo [21], Dueker et al. [14], and Li and Tong [28] used grid-search-based methods to study multivariate TAR models, assuming either a known number of thresholds or an upper bound on the number of thresholds. However, these approaches may not work in practice, as the number of thresholds is often unknown. More recently, Calderón V and Nieto [7] and Orjuela and Villanueva [37] introduced Bayesian methodologies for the estimation of thresholds in multivariate TAR models with an unknown number of thresholds. These methods bypass assumptions on the number of thresholds, but do not establish consistency of the estimated number of thresholds. Another limitation of existing approaches is that they are not applicable in high dimensions. The advantages and limitations of existing approaches are summarized in Table 3 in Appendix 6. See also Tong [45] for a review of threshold models in time series analysis.
High-dimensional time series models have received considerable attention in recent years [2,26,19]. In this setting, the ambient dimension is of the same order as, or larger than, the sample size. This poses numerous practical and theoretical challenges. While a number of theoretical results have been established for linear time series models in high dimensions, with few exceptions [e.g., 13,42], their non-linear counterparts have received less attention. In the context of threshold models, the recent work by Liu and Chen [32] investigates the estimation of threshold factor models with a growing number of variables. However, this work assumes a single threshold, which limits the flexibility of the model. Moreover, while the number of time series components is allowed to grow, it is assumed to be smaller than the sample size (see Theorem 1 in [32]). In fact, to the best of our knowledge, methods and theory for high-dimensional TAR models are currently lacking.
Given the paucity of the literature on high-dimensional TAR models, in this paper we propose two estimators for detecting the (unknown) number and values of thresholds and estimating regime-specific auto-regressive parameters in multivariate TAR models with many components. Both approaches are based on a three-step estimation framework and utilize similar penalized estimation strategies, but they differ in one key aspect. The first approach is a natural extension of the classical TAR model and enforces all auto-regressive parameters to change at the same thresholds. As we discuss in Section 3, this assumption may be too restrictive in high-dimensional settings with many components. In fact, our theoretical and empirical investigations indicate that the extension of the classical TAR is not appropriate for high-dimensional settings and is better suited for moderate dimensions. As such, we refer to this first version as the multivariate TAR (mvTAR) model. To mitigate the limitation of the mvTAR model, we then propose a more flexible high-dimensional TAR model (hdTAR) where different auto-regressive parameters are allowed to change at different thresholds. This flexibility seemingly introduces a new challenge, as the model may have many thresholds. However, our theoretical and empirical investigations show that this flexibility is indeed necessary in high dimensions and leads to improved theoretical guarantees and empirical performance. We develop efficient algorithms for both methods and establish the consistency of the thresholds and auto-regressive parameters under certain mixing conditions.
To establish our theoretical results, we address two key challenges that arise in penalized estimation of high-dimensional TAR models. The first challenge involves verifying appropriate concentration inequalities, including two main ingredients in high-dimensional statistics: (1) a restricted eigenvalue condition and (2) a deviation bound condition [34]. These conditions are crucial in deriving consistency results in high-dimensional settings, as hinted in Bickel et al. [4]. The conditions have been previously verified in the setting of i.i.d. observations and, more recently, studied in certain linear time series models [2,41]. However, extending these results to non-linear TAR models is challenging. This is primarily due to the random ordering of the design matrix based on the threshold (switching) variable (see e.g. Equation (3)). To address this challenge, we develop a bracketing argument [50,10] specifically designed to handle the threshold-type structure (see Lemmas 5 and 7 in the Appendix). These results are verified under certain mixing conditions (see Assumption A2 in Section 4) and are of independent interest in the context of non-linear high-dimensional time series models. The second challenge concerns our screening step to consistently estimate the number of thresholds. Many theoretical results in the context of TAR models assume that the number of thresholds is known [33,32]. This assumption may not be realistic in practice; in fact, it is appealing to infer the number of thresholds from data. To that end, the second step of our proposed algorithms utilizes an information criterion that screens candidate thresholds identified in the first step and removes redundant ones. This step successfully resolves the challenge by consistently estimating the number of thresholds with high probability (see Theorem 2).
The rest of the paper is organized as follows. After formally defining the multivariate TAR model in Section 2, we describe our algorithms in Section 3 and establish their theoretical properties in Section 4. In Section 5, we propose data-driven methods to select the hyper-parameters. While the required hyper-parameters are characterized in our asymptotic results, these rates involve unknown constants and cannot be used in practice. The empirical performance of the proposed methods is investigated using both simulated and real data sets, in Section 6 and Section 7, respectively. We conclude with a brief summary in Section 8.

Multivariate TAR formulations
The classical TAR model, proposed by Tong and Lim [46], is defined as
$$x_t = a_0^{(j)} + \sum_{k=1}^{K} a_k^{(j)}\, x_{t-k} + \sigma_j\, \varepsilon_t, \qquad \text{if } r_{j-1} < z_t \le r_j, \qquad (1)$$
where $m_0$ denotes the number of thresholds; the $r_j$'s are the threshold parameters, which partition the time series into $m_0+1$ regimes (with $r_0 = -\infty$ and $r_{m_0+1} = +\infty$); $K$ is the number of lags considered in the model; $z_t$ is a switching variable (possibly a function of some components of $x_t$); the $\sigma_j$'s are regime-specific error standard deviations; and $a_0^{(j)}$ and $a_k^{(j)}$ are the coefficients in regime $j$, for $j = 1, \dots, m_0+1$ (they are allowed to differ across regimes). The noise, or innovation, $\varepsilon_t$ is an i.i.d. sequence of random variables with zero mean and unit variance. The original TAR model was restricted to univariate time series, but can be extended to multivariate settings, as described in [47]. Formally, a multivariate time series $\{x_t\}$ follows a TAR model with one switching variable if
$$x_t = \sum_{k=1}^{K} A^{(k,j)}\, x_{t-k} + \Sigma_j^{1/2}\, \varepsilon_t, \qquad \text{if } r_{j-1} < z_t \le r_j, \qquad (2)$$
where $p$ is the number of time series components and $K$ is the number of lags considered in the model. Here $\varepsilon_t = (\varepsilon_{t,1}, \varepsilon_{t,2}, \dots, \varepsilon_{t,p})' \in \mathbb{R}^p$ is a multivariate i.i.d. sequence with zero mean in all components. The covariance matrix $\Sigma_j$ for the $j$-th regime is allowed to differ across regimes. To simplify the notation, when there is no ambiguity, we denote the error term by $\varepsilon_t$ instead of $\Sigma_j^{1/2}\varepsilon_t$; for simplicity, we also suppress the $n$-index. Finally, let $\varepsilon_{t,l}$ be the error term of the $l$-th time series, and recall that $\varepsilon_t = (\varepsilon_{t,1}, \varepsilon_{t,2}, \dots, \varepsilon_{t,p})'$. Throughout the paper, positive constants $C, C_1, C_2, \dots$ denote universal constants, $A'$ denotes the transpose of a matrix $A$, and $\|A\|_1$ and $\|A\|_2$ denote its $\ell_1$ and Frobenius norms, respectively. We denote the $\ell_1$ and $\ell_2$ norms of a vector $v$ by $\|v\|_1$ and $\|v\|$, respectively.
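For intuition, the univariate two-regime case (a self-exciting TAR, where the switching variable is $z_t = x_{t-1}$) can be simulated in a few lines. The lag order, threshold, and coefficients below are illustrative choices, not values taken from the paper.

```python
import numpy as np

def simulate_setar(T, r=0.0, a_low=0.6, a_high=-0.4, sigma=1.0, seed=0):
    """Simulate a self-exciting two-regime TAR(1) process (m0 = 1, K = 1):
    x_t = a_low  * x_{t-1} + sigma * e_t   if z_t = x_{t-1} <= r,
    x_t = a_high * x_{t-1} + sigma * e_t   otherwise."""
    rng = np.random.default_rng(seed)
    x = np.zeros(T)
    for t in range(1, T):
        a = a_low if x[t - 1] <= r else a_high
        x[t] = a * x[t - 1] + sigma * rng.standard_normal()
    return x

x = simulate_setar(500)
```

Since both regime coefficients satisfy $|a| < 1$, the simulated path stays stable even though the dynamics switch at the threshold.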

Regularized estimation of high-dimensional TARs
The number of parameters in the TAR model (2), $(m_0+1)Kp^2$, increases with the number of time series components $p$ and the number of thresholds $m_0$. Estimating these parameters becomes especially challenging when the model has more than one threshold, i.e., $m_0 > 1$, and the number of thresholds is unknown. This is because identifying the thresholds would require a search over all possible values of the threshold levels $z_t$, which is infeasible. To overcome these challenges, in Section 3.1 we first reformulate the TAR estimation problem via a non-parametric model with $(T-K)p^2K$ parameters. This over-parameterization allows us to use regularized estimation strategies to efficiently obtain an initial estimate of the thresholds by solving a penalized least squares estimation problem. In particular, we use a total variation penalty [43] to obtain estimates of $A^{(k,j)}$ for regime $j$ that are piecewise constant with respect to the threshold variable $z_t$.
The classical multivariate TAR model (2) requires the parameters of the transition matrices $A^{(k,j)}$ to change at the same threshold values of $z_t$. To obtain such an estimate, we consider a grouped fused lasso penalty in Section 3.2. The resulting estimate, referred to as mvTAR, is suitable for low- to moderate-dimensional problems, where $p$ is fixed or small compared to the number of observations $T$. However, for problems with large $p$, especially when $p \gg T$, requiring that all transition matrix parameters change at the same threshold value becomes restrictive. Moreover, the theoretical advantages of the group lasso penalty dissipate when grouped parameters do not follow the same sparsity pattern [23]. These limitations are reflected in our theoretical and numerical analyses in Sections 4 and 6. To achieve efficient estimation in high dimensions, we also propose in Section 3.2 a more flexible high-dimensional TAR model, named hdTAR, in which transition matrix parameters are allowed to change at different thresholds. As we show, this flexibility results in theoretical and empirical advantages. The difference between the flexible TAR model and the original version is illustrated in Figure 1.
Both our group and regular fused lasso penalties over-estimate the number of thresholds. This is because a key requirement for the consistency of $\ell_1$-regularized estimation strategies, namely the restricted eigenvalue property [4], is not guaranteed to hold in our setting (see Section 4). To remove the redundant selected thresholds, we introduce a screening criterion in Section 3.3 that consistently estimates the (many) unknown thresholds. In Section 3.4, we obtain consistent estimates of the high-dimensional auto-regressive parameters within each estimated regime.

Reparametrization of the TAR model
In this section, we reparametrize the TAR model (2) by considering n transition matrices for each value of the ordered switching variable z t (assuming, without loss of generality, that z t assumes unique values).
Let $n = T - K$ and let $\pi(i)$ be the time index of the $i$-th smallest element of $z_t$, for $i = 1, \dots, n$. Then the TAR model (2) with lag $K$ can be written as
$$x_{\pi(i)} = \sum_{k=1}^{K} \Theta_i^{(k)}\, x_{\pi(i)-k} + \varepsilon_{\pi(i)}, \qquad i = 1, \dots, n, \qquad (3)$$
where $\Theta_i^{(k)}$ is the transition matrix for the $i$-th ordered observation at lag $k$. Denote the response matrix, the design matrix, the model parameters and the error term in Equation (3) by $Y$, $\mathcal{X}$, $\Theta$ and $E$, respectively. Then, (3) can be written as $Y = \mathcal{X}\Theta + E$, and, with $\otimes$ denoting the tensor product, (3) can be written in vector form as
$$\mathrm{vec}(Y) = (I_p \otimes \mathcal{X})\,\mathrm{vec}(\Theta) + \mathrm{vec}(E). \qquad (4)$$
Writing $\theta_1 = \Theta_1$ and $\theta_i = \Theta_i - \Theta_{i-1}$ for $i > 1$, the over-parametrization in (3), while redundant, has an important benefit: $\theta_i \neq 0$ if and only if the auto-regressive coefficients of the TAR process change at time $\pi(i)$. Thus, finding the thresholds is equivalent to finding the non-zero $\theta_i$'s for $i > 1$. In other words, the problem of threshold estimation can be translated into a high-dimensional variable selection problem in (4).
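To make the reparametrization concrete, the following sketch orders the observations by the switching variable and checks that, under the standard difference parametrization $\theta_1 = \Theta_1$ and $\theta_i = \Theta_i - \Theta_{i-1}$ (our assumption here), the $\theta_i$'s are non-zero exactly where the regime changes along the ordered $z_t$. The coefficients and threshold are illustrative.

```python
import numpy as np

# Toy illustration of the reparametrization (hypothetical coefficients):
# sort observations by the switching variable z_t, form the per-observation
# coefficient sequence Theta_i, and check that the successive differences
# theta_i = Theta_i - Theta_{i-1} are nonzero only where the threshold is crossed.
T, K = 200, 1
rng = np.random.default_rng(1)
x = rng.standard_normal(T)
r = 0.0                                  # single true threshold
z = x[:-1]                               # switching variable z_t = x_{t-1}
n = T - K
pi = np.argsort(z)                       # pi[i] = time index of i-th smallest z_t
Theta = np.where(z[pi] <= r, 0.6, -0.4)  # regime-specific AR(1) coefficient
theta = np.diff(Theta)                   # theta_i for i > 1
change_idx = np.nonzero(theta)[0] + 1    # ordered positions where coefficients change
```

Because the coefficient sequence is a step function of the ordered $z_t$, exactly one difference is non-zero, and it sits at the position where $z_{\pi(i)}$ crosses the threshold.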

Penalties for moderate-and high-dimensional TARs
Sparsity-inducing penalties, such as the lasso, are particularly suitable for estimating $\Theta$ in (4): a sparse estimate $\hat\theta_1$ gives an interpretable estimate of the transition matrices for the smallest value of $z_t$, while sparsity in $\hat\theta_i$ for $i > 1$ implies no change in the transition matrices over $z_t$. Such a strategy corresponds to a fused lasso, or total variation, penalty [43,38]. In this paper, we consider a similar strategy and obtain an estimate of $\Theta$ by solving
$$\hat{\Theta} \in \arg\min_{\Theta}\; \frac{1}{n}\,\big\|Y - \mathcal{X}\Theta\big\|_2^2 \;+\; \lambda_1 \sum_{i=2}^{n} \|\theta_i\|_\diamond \;+\; \lambda_2 \sum_{i=1}^{n} \Big\|\sum_{j=1}^{i}\theta_j\Big\|_1. \qquad (5)$$
The first penalty in (5), $\|\cdot\|_\diamond$, encodes either an $\ell_2$, or grouped fused lasso, penalty $\|\cdot\|_2$, or an $\ell_1$, or fused lasso, penalty $\|\cdot\|_1$. The group fused lasso penalty encourages all entries of the transition matrices to change at the same threshold values. In contrast, the fused lasso penalty provides a more flexible TAR model in which different transition matrix parameters are allowed to change at different thresholds. As discussed earlier, the group fused lasso penalty is only suitable for low- to moderate-dimensional problems (where $p$ is allowed to grow, but $p < T$), whereas the more flexible fused lasso penalty is appropriate for both moderate- and high-dimensional problems (where $p \gg T$); see also Figure 1. In both cases, the magnitude of the penalty is controlled by the tuning parameter $\lambda_1$, which is chosen data-adaptively via cross-validation; see Section 5 for more details.
The second penalty in (5), controlled by tuning parameter λ 2 , further encourages the overall sparsity of the estimated transition matrices by penalizing changes in transition matrices after each potential threshold index i. While often not needed in practice, this additional sparsity results in improved estimation and allows us to obtain better rates of convergence for the proposed estimator in Section 4.
With either the $\ell_2$ or the $\ell_1$ penalty, the optimization problem in (5) is convex and can be solved efficiently. With the $\ell_2$ penalty, the problem can be solved using a sub-gradient descent algorithm. However, the problem simplifies further when $\lambda_2 = 0$, in which case we can instead use a more efficient proximal gradient descent algorithm; see Algorithm 2 in the Appendix. With the $\ell_1$ penalty, the problem can be solved efficiently using a path-wise coordinate descent algorithm [17] regardless of the value of $\lambda_2$. This is because, by Proposition 1 in [17], it suffices to first find the solution for $\lambda_2 = 0$ and then apply an element-wise soft-thresholding operator; see Algorithm 1 in the Appendix.
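As a minimal sketch of the first step, consider the scalar case ($p = K = 1$) with $\lambda_2 = 0$. In the difference parametrization, the fused penalty becomes a plain $\ell_1$ penalty on $\theta_2, \dots, \theta_n$, and the problem can be solved by proximal gradient descent (ISTA). This is a simplified stand-in for the algorithms in the Appendix; all dimensions and constants are illustrative.

```python
import numpy as np

def ista_fused(y, x, lam, n_iter=5000):
    """Scalar sketch of the first-step estimator with lambda_2 = 0.
    In the difference parametrization (theta_1 = Theta_1, theta_i =
    Theta_i - Theta_{i-1}), the fused-lasso penalty becomes a plain l1
    penalty on theta_2, ..., theta_n, and the problem
    (1/2n)||y - Z theta||^2 + lam * sum_{i>1} |theta_i|
    is solved by proximal gradient descent (ISTA)."""
    n = len(y)
    Z = np.tril(np.ones((n, n))) * x[:, None]  # y_i = x_i * sum_{j<=i} theta_j
    L = np.linalg.norm(Z, 2) ** 2 / n          # Lipschitz constant of the gradient
    theta = np.zeros(n)
    for _ in range(n_iter):
        theta = theta - Z.T @ (Z @ theta - y) / (n * L)
        # soft-threshold all but the first coordinate (theta_1 is unpenalized)
        theta[1:] = np.sign(theta[1:]) * np.maximum(np.abs(theta[1:]) - lam / L, 0.0)
    return theta

# Demo on a single coefficient change (all values illustrative)
rng = np.random.default_rng(2)
n = 100
x = rng.standard_normal(n) + 2.0           # regressor kept mostly away from zero
Theta_true = np.where(np.arange(n) < n // 2, 0.5, -0.5)
y = x * Theta_true + 0.1 * rng.standard_normal(n)
theta_hat = ista_fused(y, x, lam=0.05)
Theta_hat = np.cumsum(theta_hat)
```

Recovering $\hat\Theta_i$ as the cumulative sum of $\hat\theta$ yields an approximately piecewise-constant coefficient path that tracks the true step.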

Threshold selection
Using Equation (5), we can define a set of candidate threshold estimates as
$$\hat{\mathcal{A}}_n = \big\{ z_{\pi(i)} : \hat\theta_i \neq 0, \; i > 1 \big\}. \qquad (6)$$
Let $\hat r_j$ be the $j$-th sorted (from lowest to highest) estimated threshold in the set $\hat{\mathcal{A}}_n$, and let $\hat m$ be the cardinality of $\hat{\mathcal{A}}_n$. As we show in Section 4, the fused lasso is likely to over-estimate the number of thresholds [22]; we thus need to remove the redundant thresholds. In our screening step, we aim to keep exactly $m_0$ points in $\hat{\mathcal{A}}_n$ that are close enough to the true threshold values. To that end, we develop an information criterion by modifying the screening procedure of [41] to make it more suitable for the threshold structure of model (2). Essentially, this step consists of estimating the transition parameters within each estimated regime $\{t : \hat r_j < z_t \le \hat r_{j+1}\}$ for $j = 0, 1, \dots, \hat m$, with $\hat r_0 = -\infty$ and $\hat r_{\hat m + 1} = +\infty$, and comparing the total sum of squared errors (SSE) before and after excluding a given estimated threshold $\hat r_j$. The basic idea is to keep the estimated thresholds whose removal increases the SSE significantly. More specifically, for a given set of estimated thresholds $\{-\infty, s_1, s_2, \dots, s_m, +\infty\}$ with $1 \le m \le \hat m$, and for the $j$-th estimated threshold $s_j$, denote by $\mathcal{T}_{(s_{j-1}, s_j)} = \{ i : s_{j-1} < z_{\pi(i)} \le s_j \}$ the set of indices whose corresponding ordered switching variable $z_{\pi(i)}$ falls into the interval $(s_{j-1}, s_j]$. Now, given a fixed number of thresholds $m$, we obtain the estimator $\hat\theta_{s_1, s_2, \dots, s_m}$ of $\theta_{s_1, s_2, \dots, s_m}$ by minimizing the following penalized regression problem:
$$\hat\theta_{s_1, \dots, s_m} \in \arg\min_{\theta}\; \sum_{j=1}^{m+1} \Big[ \sum_{i \in \mathcal{T}_{(s_{j-1}, s_j)}} \big\| x_{\pi(i)} - Y_{\pi(i)}\, \theta_{(s_{j-1}, s_j)} \big\|^2 \;+\; \eta_{(s_{j-1}, s_j)} \big\| \theta_{(s_{j-1}, s_j)} \big\|_1 \Big], \qquad (7)$$
where $Y_{\pi(i)} = \big( x_{\pi(i)-1}, \dots, x_{\pi(i)-K} \big)$, $\theta_{s_1, s_2, \dots, s_m} = \big( \theta_{(s_0, s_1)}, \theta_{(s_1, s_2)}, \dots, \theta_{(s_m, s_{m+1})} \big)$ with $s_0 = -\infty$ and $s_{m+1} = +\infty$, and the tuning parameters are $\eta_n = \big( \eta_{(-\infty, s_1)}, \dots, \eta_{(s_m, +\infty)} \big)$. The glmnet package [18] readily solves this problem. Denoting the resulting minimum sum of squared errors by $L_n(s_1, \dots, s_m; \eta_n)$, we construct our information criterion as
$$\mathrm{IC}(s; \eta_n) = L_n(s_1, \dots, s_m; \eta_n) + m\, \omega_n,$$
where $\omega_n$ is a carefully chosen sequence defined in Section 4. We then select a subset of the initial $\hat m$ candidate threshold values by solving
$$(\hat m, \hat r_1, \dots, \hat r_{\hat m}) = \arg\min_{0 \le m \le \hat m,\; s = (s_1, \dots, s_m) \in \hat{\mathcal{A}}_n} \mathrm{IC}(s; \eta_n). \qquad (10)$$
Practical choices for tuning parameters η n and ω n are discussed in Section 5.
The over-estimation of the thresholds and the effect of the screening step are illustrated in Figure 2. The left panel of Figure 2, obtained for one replicate of simulation Scenario 1 in Section 6, clearly shows that the first step of our procedure detects more threshold values than the true number. The middle panel shows that the second step successfully screens out the extra threshold estimates and keeps a single value very close to the true threshold (here, the true threshold value is 4). The right panel of Figure 2 confirms that the final estimated thresholds across all 200 replicates are indeed close to the true thresholds.
When the number of estimated thresholds selected in Step 1 is large, it may be computationally demanding to find the minimizer of the IC. In such cases, we propose to approximate the optimal thresholds using the backward elimination algorithm (BEA) proposed in [41]. Starting with the set of initial thresholds $\hat{\mathcal{A}}_n$, the algorithm reduces the computational cost by removing one threshold at a time until the IC no longer decreases.
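The screening idea and its backward-elimination approximation can be sketched as follows. For simplicity, we fit a constant per segment instead of the full per-regime regression, and `omega` plays the role of $\omega_n$; the data and candidate set are illustrative, not from the paper.

```python
import numpy as np

def seg_sse(y, cuts):
    """Total SSE when fitting a constant in each segment of the ordered
    response: a simplified stand-in for the per-regime regressions."""
    edges = [0] + sorted(cuts) + [len(y)]
    return sum(float(((y[a:b] - y[a:b].mean()) ** 2).sum())
               for a, b in zip(edges[:-1], edges[1:]) if b > a)

def backward_eliminate(y, cuts, omega):
    """Greedy BEA-style screening: remove one candidate threshold at a time
    as long as IC = SSE + m * omega keeps decreasing."""
    cuts = sorted(cuts)
    best = seg_sse(y, cuts) + len(cuts) * omega
    improved = True
    while improved and cuts:
        improved = False
        for c in list(cuts):
            trial = [u for u in cuts if u != c]
            ic = seg_sse(y, trial) + len(trial) * omega
            if ic < best:
                best, cuts, improved = ic, trial, True
    return cuts

# One true change at index 50, two spurious candidates (values illustrative)
rng = np.random.default_rng(3)
y = np.concatenate([np.zeros(50), np.ones(50)]) + 0.1 * rng.standard_normal(100)
kept = backward_eliminate(y, cuts=[30, 50, 70], omega=1.0)
```

Removing the spurious candidates at 30 and 70 barely changes the SSE, so dropping them lowers the IC, whereas removing the true threshold at 50 raises the SSE far more than `omega`, so it is kept.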

Estimation of auto-regressive parameters
Given the estimated thresholds, we simply take each estimated regime $\mathcal{T}_{(\hat r_{j-1}, \hat r_j)} = \{ i : \hat r_{j-1} < z_{\pi(i)} \le \hat r_j \}$, with $\hat r_0 = -\infty$ and $\hat r_{\hat m + 1} = +\infty$, for $j = 1, \dots, \hat m + 1$, and estimate the transition matrices in each regime separately. More specifically, for a fixed $j = 1, \dots, \hat m + 1$ we solve
$$\hat\beta_{(\cdot, j)} \in \arg\min_{\beta}\; \sum_{i \in \mathcal{T}_{(\hat r_{j-1}, \hat r_j)}} \big\| x_{\pi(i)} - Y_{\pi(i)}\, \beta \big\|^2 + \alpha_j \|\beta\|_1, \qquad (11)$$
where $\alpha_j$ is the tuning parameter for the $j$-th regime, $j = 1, 2, \dots, \hat m + 1$. This problem can be solved efficiently using existing software, and HBIC can be used to select $\alpha_j$. As an alternative to the separate estimation in (11), if the distances between consecutive threshold values are of the same order, the auto-regressive parameters can also be jointly estimated [40].
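The per-regime estimation step is an ordinary lasso fit on the observations falling in that regime. A minimal self-contained sketch, with a hand-rolled coordinate-descent solver so no external packages are needed; the design and coefficients are synthetic stand-ins for one regime's data.

```python
import numpy as np

def lasso_cd(X, y, alpha, n_iter=300):
    """Plain coordinate-descent lasso minimizing
    (1/2n)||y - X b||^2 + alpha * ||b||_1,
    a stand-in for the per-regime problem in Equation (11)."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    r = y - X @ b
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * b[j]                 # remove j-th contribution
            rho = X[:, j] @ r / n
            b[j] = np.sign(rho) * max(abs(rho) - alpha, 0.0) / col_sq[j]
            r -= X[:, j] * b[j]                 # add updated contribution back
    return b

# Synthetic data for one (hypothetical) regime with a sparse coefficient vector
rng = np.random.default_rng(4)
n_obs, p = 200, 5
X = rng.standard_normal((n_obs, p))
beta_true = np.array([0.6, 0.0, 0.0, 0.0, 0.0])
y = X @ beta_true + 0.1 * rng.standard_normal(n_obs)
beta_hat = lasso_cd(X, y, alpha=0.05)
```

In the full procedure this fit would be repeated once per estimated regime, each with its own tuning parameter.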

Theoretical properties
In this section, we establish the consistency of the procedure proposed in Section 3. Recall that in the first step of our procedure we use either the $\ell_2$ or the $\ell_1$ penalty in Equation (5), corresponding to the classical (mvTAR) and flexible (hdTAR) TAR models, respectively. More specifically, in the following, $\hat\Theta$ is the estimator defined in Equation (5) with either the $\ell_1$ or the $\ell_2$ penalty, $\hat\theta_{s_1, s_2, \dots, s_m}$ is the estimator defined in Equation (7), and, finally, $\hat\beta_{(\cdot, j)}$ is the estimator defined in Equation (11). We make the following assumptions.
Assumption A2. For each $j = 1, 2, \dots, m_0+1$, the process is sub-Weibull with sub-Weibull parameter $\kappa_1 > 0$ and $\beta$-mixing stationary with a geometrically decaying mixing coefficient $b(n)$; specifically, there exist constants $c_b > 0$ and $\kappa_2 > 0$ such that $b(n) \le \exp(-c_b n^{\kappa_2})$ for all $n \in \mathbb{N}$. Moreover, there exist constants $l$ and $u$ such that $r_j \in [l, u]$ for $1 \le j \le m_0$. In addition, there exists a vanishing positive sequence $\gamma_n$ satisfying the detection-rate condition as $n \to \infty$.

Assumption A5. $\{z_t\}$ is a $\beta$-mixing stationary process with a geometrically decaying mixing coefficient and a positive density. In addition, $E|z_t|^{2+\iota} < \infty$ for some $\iota > 0$.
The above assumptions are natural in high-dimensional settings and commonly used in the literature. Assumptions A1 and A2 are utilized to derive the appropriate concentration inequalities needed to verify the asymptotic properties of the proposed methodology, and have been used in the literature [29,54]. The sub-Weibull distribution of the error terms controls the tail behavior, while the $\beta$-mixing condition ensures that the dependence structure can be controlled appropriately; the latter is specifically needed due to the temporal correlation among observations. We can relax the $\beta$-mixing assumption to $\alpha$-mixing if we restrict attention to Gaussian distributions rather than sub-Weibull processes. However, to keep the distributional assumption more general, we consider the $\beta$-mixing assumption. In Appendix 4, we also develop a sufficient condition for $\beta$-mixing processes by imposing constraints on the operator norm of the transition matrices; this suggests that the $\beta$-mixing condition is not overly restrictive. The assumption $\kappa_0 \ge 2/3$ ensures a sharp consistency rate for estimating the thresholds and can be removed at the cost of a worse consistency rate (see additional details in Remark 1). Assumption A3 ensures the sparsity of the model and is needed to quantify the effect of model misspecification, since exact recovery of the threshold values is not possible; a similar assumption has been used in [41] in the context of change-point detection. Further, Assumption A4 puts a minimum jump size on the transition matrices, ensuring that a detectable change occurs at threshold $r_j$; it also puts certain conditions on the detection rate, which is related to $\gamma_n$. Assumption A4 can be seen as an extension of Assumption H4 in [10] to high dimensions. The assumption is more stringent for mvTAR, rendering this procedure less suitable for high dimensions. Finally, Assumption A5 is used to relate the length of each regime to the number of observations in that regime.
Our first theoretical result concerns the first step, i.e., the initial estimation of thresholds using group or regular fused lasso penalties. The penalized estimation (5) in this step does not guarantee parameter estimation consistency since the design matrix Z in Equation (4) may not satisfy the restricted eigenvalue condition [2], which is critical for establishing the parameter estimation consistency in high-dimensions [4]. However, with either penalty, the estimator over-estimates the true number of thresholds, as established next.
Let $\mathcal{A}_n = \{r_1, r_2, \dots, r_{m_0}\}$ be the set of the sorted true thresholds. For two countable sets $A$ and $B$, define
$$d_H(A, B) = \max_{b \in B}\, \min_{a \in A}\, |a - b|.$$
Though not a distance (it is not symmetric), $d_H(A, B)$ proves useful in Theorem 1: $d_H(\hat{\mathcal{A}}_n, \mathcal{A}_n)$ is small when every true threshold has an estimated threshold nearby.
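A small sketch of the one-sided Hausdorff discrepancy, assuming the definition $d_H(A, B) = \max_{b \in B} \min_{a \in A} |a - b|$ (one common choice, consistent with how $d_H$ is used in the consistency statements):

```python
import numpy as np

def d_hausdorff(A, B):
    """One-sided Hausdorff discrepancy d_H(A, B) = max_{b in B} min_{a in A} |a - b|:
    small iff every point of B has some point of A nearby. It is not
    symmetric, hence not a true distance."""
    A = np.asarray(A, dtype=float)
    return max(float(np.min(np.abs(A - b))) for b in B)
```

With this convention, $d_H(\hat{\mathcal{A}}_n, \mathcal{A}_n) \le \gamma_n$ says that every true threshold has an estimated threshold within $\gamma_n$; the asymmetry arises because extra points in the first argument are never penalized.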

Theorem 1. Under Assumptions A1 to A5, there exist large enough constants $C_1, C_2 > 0$ such that, for tuning parameters $\tilde\lambda_{1,n}, \tilde\lambda_{2,n} \asymp 1/\sqrt{n\gamma_n}$, where for hdTAR $\lambda_{1,n} = \tilde\lambda_{1,n}$ and $\lambda_{2,n} = \tilde\lambda_{2,n}$, whereas for mvTAR $\lambda_{1,n} = p^2K\,\tilde\lambda_{1,n}$ and $\lambda_{2,n} = p^2K\,\tilde\lambda_{2,n}$, we have
$$P\Big( \hat m \ge m_0 \ \text{ and } \ d_H\big(\hat{\mathcal{A}}_n, \mathcal{A}_n\big) \le \gamma_n \Big) \to 1 \quad \text{as } n \to \infty.$$

Theorem 1 shows that the number of estimated thresholds $\hat m$ in Step 1 is no less than the true number of thresholds $m_0$ with high probability. In addition, there exists at least one estimated threshold in the $\gamma_n$-radius neighborhood of each true threshold. The rate of consistency for threshold detection, $\gamma_n$, depends on the number of time series components $p$, the maximum considered lag $K$, and the minimum distance between consecutive true thresholds in the model. In addition, the convergence rate for using $\hat r_j$ to estimate $r_j$ can be as low as $(\log\log n)\,(\log p^2K)^2/n$ when $m_0$ is finite.
The rate of consistency for threshold detection, $\gamma_n$, for mvTAR also depends on the number of time series components $p$, the maximum considered lag $K$, and the minimum distance between consecutive true thresholds in the model. However, the assumptions on $\gamma_n$ for hdTAR and mvTAR are different, so the consistency rate for threshold detection differs between the two methods. In addition, when using the $\ell_2$ penalty, the convergence rate for using $\hat r_j$ to estimate $r_j$ can be as low as $(\log\log n)\,(\log p^2K)^2\, p^2K/n$ when $m_0$ is finite. Thus, convergence of mvTAR is only guaranteed in low to moderate dimensions, and not in high dimensions. Finally, the minimum sample size requirement depends on the sub-Weibull parameter $\kappa_1$ and the $\beta$-mixing parameter $\kappa_2$: if the sub-Weibull parameter $\kappa_1$ increases (i.e., the tail probability decays faster), the minimum sample size decreases; similarly, the minimum sample size decreases as the $\beta$-mixing parameter $\kappa_2$ increases.
Next, we state Theorem 2 which shows the screening procedure (10) consistently estimates the number and values of thresholds. For that, we need two additional assumptions.
Assumption A7. There exist positive constants $c, c_1, c_2$ and $c_3$ such that, for indices $j-1$ and $j$ and the corresponding estimated thresholds $s_{j-1}$ and $s_j$, the tuning parameters $\eta_{(s_{j-1}, s_j)}$ satisfy rate conditions that depend on the distance of $s_{j-1}$ and $s_j$ from the true thresholds.

Assumption A6 makes a unique connection between three important quantities: (1) the minimum spacing between consecutive thresholds, $\Delta_n$; (2) the consistency rate for estimating the threshold values, $\gamma_n$; and (3) the penalty term in the definition of the information criterion, $\omega_n$. This connection helps quantify the consistency rate for estimating the threshold values, as discussed after Theorem 2.
Assumption A7 specifies three different tuning parameter rates for the screening step. Although this assumption may seem technical, it is needed to obtain the sharpest consistency rate. It is possible to use a single tuning parameter rate for all cases in Assumption A7, but the consistency results would then worsen. Remark 5 in [41] sheds some light on this issue.
In addition, there exists a constant $B > 0$ such that the conclusion of Theorem 2 holds. When $p = cn^{\kappa}$, where $c > 0$ and $\kappa \in (0, 1)$, the proposed procedures, both hdTAR and mvTAR, can also be applied to low-dimensional time series; the consistency results would be similar to those in Theorem 2. It is challenging to select the $\eta$'s in practice, since the distance between the estimated thresholds and the true thresholds is unknown. Instead, we set all $\eta$'s to be the same and apply BIC/HBIC to select them.
Although the consistency rates for mvTAR and hdTAR are both functions of $\gamma_n$, the assumptions on $\gamma_n$ for the two methods are different, leading to different rates of consistency. To illustrate this point, consider the case when $m_0$ is finite. When using the $\ell_1$ penalty in the first step, we can set $\gamma_n = (\log n)^{\rho}\,(\log p^2K)^{2+2\rho}/n$ for some $\rho > 0$. With this rate, the consistency rate for hdTAR becomes of order $(\log n)^{\frac{5}{2}\rho}\,(\log p^2K)^{3+5\rho}/n$. In comparison, when using the $\ell_2$ penalty, we can set $\gamma_n = (\log n)^{\rho}\,(\log p^2K)^{2+2\rho}\,(p^2K)^{1+\rho}/n$ for some $0 < \rho < 1$ to ensure that Assumption A3 is satisfied. Using a similar calculation, the consistency rate for mvTAR then carries additional polynomial factors of $p^2K$, further highlighting that mvTAR is not suitable in high dimensions, when $p = cn^{\kappa}$ with $c > 0$ and $\kappa \ge 1$.
Remark 1. If we remove the assumption $\kappa_0 \ge 2/3$ and only keep $\kappa_0 < 1$, then, according to Lemma 7, the choice of $\gamma_n$ also depends on $\kappa_0$. For the hdTAR model, we can set $\gamma_n = (\log n)^{\rho}\,(\log p^2K)^{2/\kappa_0 - 1 + 2\rho}/n$ for some $\rho > 0$, keeping the total sparsity the same as above. The consistency rate then becomes of order $(\log n)^{\frac{5}{2}\rho}\,(\log p^2K)^{3/\kappa_0 - 3/2 + 5\rho}/n$. Similarly, for the mvTAR model, we can choose $\gamma_n$ analogously, keeping the total sparsity the same as above; the consistency rate for mvTAR again carries additional polynomial factors of $p^2K$.

Our last theorem establishes the consistent estimation of the regime-specific transition matrices in the third step.

Theorem 3. Under Assumptions A1 to A7, and selecting the regime-specific tuning parameters $\alpha_j$ with a large enough constant $C > 0$, with probability approaching 1 there exists a positive constant $C_1$ such that, for each fixed regime $j$, the estimation error of $\hat\beta_{(\cdot, j)}$ is controlled. The consistency rate derived in Theorem 3 is similar to that of Wong et al. [54] and Basu and Michailidis [2] for high-dimensional vector auto-regressive models.

Tuning parameter selection
We next provide guidance on selecting the tuning parameters for our three-step procedure.
$\lambda_{1,n}$: We choose $\lambda_{1,n}$ by cross-validation. We first select a set $\mathbb{T}$ of time points whose switching-variable values are equally spaced along the ordering of $z_t$. We use the remaining observations to estimate $\Theta$ in the first step for a range of $\lambda_{1,n}$ values. To choose the optimal value of $\lambda_{1,n}$, we use the estimated $\Theta$ to predict the series at the time points in $\mathbb{T}$; the optimal $\lambda_{1,n}$ is the value minimizing the mean squared prediction error over $\mathbb{T}$.

$\lambda_{2,n}$: The rate for $\lambda_{2,n}$ vanishes quickly as $n$ increases. Thus, to lower the computational cost, we set $\lambda_{2,n}$ to zero. It is possible to select $\lambda_{2,n}$ by cross-validation as well, at the cost of increased computation time; however, the sensitivity analysis reported in [41] indicates that setting $\lambda_{2,n} = 0$ is a reasonable choice.

$\eta_n$: Selecting $\eta_n$ is in general difficult. For $0 \le m \le \hat m$ (where $\hat m$ is the number of thresholds estimated in Step 1), we choose different $\eta$'s for different regimes, and use HBIC and eBIC [52] across all regimes. For each time series $l \in \{1, 2, \dots, p\}$ and $j = 1, 2, \dots, m+1$, let $\eta^l_j$ be the tuning parameter for the $l$-th time series in the $j$-th regime. The HBIC for the interval $[s_{j-1}, s_j]$ uses the constant $\gamma_1 = 2.8$, which is within the recommended range in [52]; similarly, the eBIC for the interval $[s_{j-1}, s_j]$ uses $\gamma_2 = 1.4$, also within the recommended range in [52]. If the number of observations in the regime, $|\mathcal{T}_{(s_{j-1}, s_j)}|$, is at least $pK$, $\eta^l_j$ is selected by minimizing the HBIC; otherwise, it is selected by minimizing the eBIC.

$\omega_n$: We first perform the backward elimination algorithm (BEA) until no break points are left. Then, we cluster the resulting decreases in the objective function $L_n$ into two subgroups, small and large. If removing a threshold only leads to a small decrease in $L_n$, the removed threshold is likely redundant; in contrast, removing true thresholds leads to larger decreases. We choose the smallest decrease in the second group as the optimal value of $\omega_n$.
To this end, we first calculate the minimum sum of squared errors after removing the thresholds in Â_n one by one, denoted L_0, L_1, . . . , L_m̂. Then, ω_n is selected as the maximum value among L_{j+1} − L_j for j = 0, 1, . . . , m̂ − 1.

α_i: For simplicity, we let all time series share the same α_i, denoted by α_n.
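The two-group clustering rule for ω_n can be sketched as follows. Splitting the sorted changes at their widest gap is one simple way to form the small and large subgroups; the paper does not prescribe a specific clustering algorithm, so this choice is illustrative.

```python
import numpy as np

def select_omega(L):
    """Cluster the successive changes L[j+1] - L[j] from the backward
    elimination path into a 'small' and a 'large' group (split at the
    widest gap in the sorted changes) and return the smallest change
    in the large group, as described above."""
    diffs = np.diff(np.asarray(L, dtype=float))
    order = np.sort(diffs)
    gaps = np.diff(order)
    split = np.argmax(gaps)            # widest gap separates the two groups
    large_group = order[split + 1:]
    return large_group.min()
```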
For low to moderate dimensions, the tuning parameter α_n for parameter estimation is selected as the minimizer of the combined HBIC over all regimes. For j = 1, 2, . . . , m̂ + 1, the HBIC on the interval [r̂_{j−1}, r̂_j] is defined in terms of the residual sample covariance matrix Σ̂_j, computed with the β̂ estimated in Equation (11), and γ = 2.8. For high dimensions, we choose α_n by 10-fold cross-validation.

Empirical evaluations
In this section, we present simulation results evaluating the performance of the proposed procedure in both moderate and high dimensions; the first four simulation scenarios are moderate-dimensional, while the last one is high-dimensional. Details of the simulation settings are presented in Appendix 7. All results are averaged over 200 replicates. We compare our method with Tsay [48], Li and Tong [28], and the threshold vectorized auto-regressive method [33], denoted Tsay (1998), Li (2016) and TVAR, respectively. These methods assume a known number of thresholds, or at least a known upper bound on the number of thresholds, when establishing the asymptotic properties of their estimators. In practice, Tsay [48] proposes a grid search that selects the number of thresholds, when unknown, by minimizing the AIC. They are also restricted to low dimensions. For instance, TVAR estimates the number of thresholds and the threshold values in two separate steps and assumes the number of thresholds is at most 2. This is in contrast to our mvTAR and hdTAR methods, which make no assumptions about the number of thresholds. Although Calderón V and Nieto [7] and Orjuela and Villanueva [37] do not require a known number of thresholds, we did not include a comparison with these methods, as they cannot handle larger dimensions, e.g., p = 20.
We compare the estimated thresholds and the percentage of simulations in which the thresholds are correctly estimated, defined as cases where the selected thresholds are close to the true thresholds. More specifically, a selected threshold is considered close to the first true threshold, z_1, if it lies in the interval [−∞, z_1 + 0.5(z_2 − z_1)); similarly, a selected threshold is considered close to the second true threshold, z_2, if it lies in the interval [z_1 + 0.5(z_2 − z_1), ∞). Note that the number of thresholds is set to be known for Tsay (1998), Li (2016) and TVAR, since the first two require a known number of thresholds and TVAR does not perform well in selecting the number of thresholds in its first step. (When the number of thresholds is not provided, TVAR's rates of correctly identifying the number of thresholds are 84%, 87%, 65%, and 16% (11.5% for T = 300) in the first four scenarios, and the method is not applicable in the last scenario, Scenario 5, due to the high dimensionality of the model.)
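The midpoint rule for matching a selected threshold to a true threshold can be written down directly (the function name is ours, and the rule generalizes to any number of true thresholds by splitting at consecutive midpoints):

```python
def match_threshold(est, true_thresholds):
    """Index of the true threshold that an estimated threshold is 'close'
    to under the midpoint rule above: the matching bins are delimited by
    the midpoints between consecutive true thresholds."""
    z = sorted(true_thresholds)
    # midpoints between consecutive true thresholds define the bin edges
    edges = [zi + 0.5 * (zj - zi) for zi, zj in zip(z, z[1:])]
    return sum(est >= e for e in edges)
```

For two true thresholds z_1 < z_2, this reproduces the intervals [−∞, z_1 + 0.5(z_2 − z_1)) and [z_1 + 0.5(z_2 − z_1), ∞) described above.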

Simulation results
We next compare the performance of the proposed hdTAR and mvTAR methods with Tsay (1998), Li (2016) and TVAR. Here, the selection rate of Tsay (1998), Li (2016) and TVAR is based on whether the estimated thresholds are within one standard deviation of the true threshold. Table 1 summarizes the results of threshold estimation. In all simulations, if a method does not select any threshold, we set the minimum value of the threshold variable as the selected threshold. The results indicate that Tsay (1998) and Li (2016) do not work well even in the first three scenarios, while hdTAR, mvTAR, and TVAR perform well there; however, the estimation error and standard deviation of TVAR are larger than those of hdTAR and mvTAR. In Scenario 4, in which only a portion of the time series components change at the threshold values, the detection rate of both mvTAR and TVAR drops significantly, while hdTAR still achieves a 100% threshold detection rate. This is expected, since hdTAR is more flexible, whereas mvTAR only works well in scenarios where the auto-regressive components change at the same threshold values. In Scenario 4, mvTAR tends to choose a large λ_1, which leads to selecting fewer threshold values in the first step than needed. Nonetheless, when the changes in the transition matrices are large enough, the threshold values can still be detected using the ℓ_2 penalty. Finally, hdTAR continues to offer excellent threshold detection in the high-dimensional setting of Scenario 5; in contrast, the other methods are not well suited for this scenario and are not included. Table 2 summarizes the performance of the five methods in terms of auto-regressive parameter estimation. Since Tsay (1998) does not provide coefficient estimates, we use the method in our Step 3 to estimate the parameters given the thresholds obtained by Tsay (1998).
The results indicate that both hdTAR and mvTAR perform well in the first three scenarios, as measured by their high true positive rates and low false positive rates. Since TVAR does not perform variable selection, all estimated values of transition matrices using this method are non-zero. This leads to true positive and false positive rates that are both equal to 1, which are not meaningful and are hence excluded from the table.
The results also indicate that in Scenario 4 with T = 600 and in Scenario 5, hdTAR performs satisfactorily, while in Scenario 4 with T = 300 its FPR increases to around 20%. This is primarily due to the smaller sample size available for each of the three regimes in this scenario. Recall from Table 1 that the selection rate of mvTAR in both of these scenarios was very low; as a result, in many simulation replicates there were fewer estimated regimes than needed to obtain estimates of the auto-regressive parameters. Consequently, mvTAR is not included in the comparisons for Scenarios 4 and 5. These findings underscore the advantages of hdTAR in settings with complex patterns of changes in the auto-regressive parameters, as well as in high dimensions. Box plots summarizing the results in Table 1 are presented in Figure 3.

Real data application
We demonstrate the utility of our penalized estimation framework in financial econometric applications by analyzing bank balance sheet data. The data consist of the total balances of the 10 largest US banks over time, each measured in thousands of dollars (available from www.fdic.gov).
To assess the relationship between the state of the banking sector and overall economic conditions, we fit a multivariate TAR model to the quarterly bank balance sheet data over the period 1995 to 2018, with the growth rate of US GDP as the switching variable. For the quarterly GDP data y_t, t = 1, 2, . . . , T over T observations, the growth rate is defined as z_t = 100(log y_t − log y_{t−1}), t = 2, 3, . . . , T. To reduce non-stationarity, the bank balance sheet data v_t, t = 1, 2, . . . , T, is transformed analogously. We applied hdTAR to the entire time series, consisting of T = 98 quarterly observations from 1995 to 2018. To examine how the results change with smaller sample sizes, we also analyzed the shorter period of quarterly observations from 2005 to 2015. The detected threshold for both time periods is shown in Figure 4. Although hdTAR does not force the coefficients to change at the same threshold value, irrespective of the sample size it identifies a single threshold corresponding to the great recession of 2008. This further highlights the flexibility and adaptability of hdTAR for both moderate- and high-dimensional TAR models. As a comparison, we also applied mvTAR to the same two data sets, but exclude the results due to the inconsistency of the thresholds estimated by mvTAR across the two data sets.
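The growth-rate transformation z_t = 100(log y_t − log y_{t−1}) is a one-liner; as a worked example, a series that grows by exactly one log-percent per quarter has z_t = 1.

```python
import numpy as np

def growth_rate(y):
    """Quarterly growth rate z_t = 100 * (log y_t - log y_{t-1})."""
    y = np.asarray(y, dtype=float)
    return 100.0 * np.diff(np.log(y))
```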
The Granger causal networks [3] of interactions among these ten banks during the recession and non-recession periods of 1995-2018 are shown in Figure 5. The red links in each panel represent interactions that occur only in that economic period. The results show strong interactions between Citibank and Harris Bank, and a comparably strong interaction between PNC and JPMorgan Chase, during the recession. The interactions become weaker during the non-recession period, but more interactions appear among banks. A similar observation was made in [31].
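A Granger causal network like the one in Figure 5 can be read off the estimated regime-specific transition matrices: series j Granger-causes series i in a regime if some lag coefficient of j in the equation for i is nonzero. The sketch below assumes a hypothetical column layout (lag blocks stacked side by side); the paper's internal storage may differ.

```python
import numpy as np

def granger_edges(A_list, tol=1e-8):
    """Directed Granger edges (j, i), meaning j -> i, implied by
    regime-specific transition matrices. A_list holds one (p, p*K)
    coefficient matrix per regime, with the K lag blocks stacked
    column-wise (an assumed layout)."""
    edge_sets = []
    for A in A_list:
        p = A.shape[0]
        K = A.shape[1] // p
        lags = A.reshape(p, K, p)              # (equation i, lag k, series j)
        active = np.abs(lags).max(axis=1) > tol
        edge_sets.append({(j, i) for i in range(p) for j in range(p) if active[i, j]})
    return edge_sets

def regime_specific(edge_sets):
    """Edges present in exactly one regime (the red links in Figure 5)."""
    return [s.difference(*[t for t in edge_sets if t is not s]) for s in edge_sets]
```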
For the 2005-2015 period, we only plot the estimated network structure during the non-recession regime, because the detected threshold is very close to the lower boundary of the sorted values of the switching variable, leaving very few observations in the recession regime. From Figure 5, the interactions among banks in the non-recession period from 2005 to 2015 are similar to the structures detected using the full data set. This further confirms the satisfactory performance of hdTAR on both larger and smaller data sets.

Discussion
We developed a three-step algorithm to estimate the number and values of the thresholds, as well as the auto-regressive parameters, in a possibly high-dimensional TAR model. The proposed algorithm can utilize either an ℓ_2 or an ℓ_1 penalty, more specifically a grouped or regular fused lasso penalty, respectively. The ℓ_2 penalty corresponds to the natural extension of the original multivariate TAR model, in which all coefficients are forced to change at the same thresholds. The ℓ_1 penalty, in contrast, is more flexible, allowing each coefficient to potentially change at different thresholds. Although this flexibility potentially comes at the cost of a larger number of thresholds in the TAR model, our theoretical and empirical results indicate that mvTAR is not appropriate for high-dimensional settings and is better suited for moderate dimensions. In contrast, the more flexible hdTAR leads to consistent estimation and superior empirical performance in both moderate and high dimensions.
We established that both versions of our algorithm, termed mvTAR and hdTAR, consistently estimate the model parameters under natural conditions on the distribution and on the level of temporal correlation in the model. The consistency rates for both models depend explicitly on several model characteristics. Specifically, when the total number of thresholds, m_0, is finite, the rate of consistency for detecting the thresholds depends on: (1) the effective number of time points, n; (2) the number of time series components, p; (3) the number of lags, K; and (4) the total sparsity of the model, d*_n. For mvTAR, for small 0 < ρ < 1, the consistency rate is of order (log n)^{5ρ/2} (log(p²K))^{3+5ρ} (p²K)^{3/2+5ρ/2}/n. This confirms that mvTAR is suitable for moderate dimensions but may not work in high dimensions. In contrast, for hdTAR, setting d*_n = o((log n (log(p²K))²)^{ρ/2}) for some small positive ρ, the consistency rate becomes of order (log n)^{5ρ/2} (log(p²K))^{3+5ρ}/n.
The first component of the rate, i.e., (log n)^{5ρ/2}, is similar to existing consistency rates for univariate TAR models [10], while the additional term (log(p²K))^{3+5ρ} quantifies the difficulty of estimating the thresholds in high dimensions.
A limitation of the proposed procedure is that it requires several hyperparameters, especially in the second step. To lower the computational cost, we chose similar tuning parameters in the second step according to eBIC/ HBIC. However, regime-specific tuning parameters may improve the estimation performance in finite samples. Fast selection of regime-specific tuning parameters is an interesting future research direction. Identifying the switching variable in the TAR model is another challenge, specifically in applications. For example, in the bank data, we selected the GDP as the switching variable. However, it is not obvious whether this is an optimal choice; for example, the unemployment rate or the inflation rate could also serve as the switching variable. Selecting optimal (data-driven) switching variable is another fruitful future research direction.

Appendix
Notation. We first describe some notation used throughout the proofs. For a symmetric matrix X, let λ_max(X) and λ_min(X) denote its maximum and minimum eigenvalues, and let |||X||| denote its operator norm, √(λ_max(X′X)).

Appendix 1: Some definitions
Sub-Weibull random variable: A random variable U is sub-Weibull [39] if there exist constants K_U > 0 and κ > 0 such that P(|U| ≥ u) ≤ 2 exp{−(u/K_U)^κ} for all u ≥ 0. K_U is called the sub-Weibull constant, and κ > 0 the sub-Weibull parameter.

Mixing conditions:
We follow the definitions in [5]. Given the probability space (Ω, F, P), for any σ-field A ⊂ F, define L²(A) to be the family of all square-integrable A-measurable random variables. For any two σ-fields A, B ⊂ F, we define

α(A, B) = sup_{A∈A, B∈B} |P(A ∩ B) − P(A)P(B)|,
β(A, B) = (1/2) sup Σ_{i_1=1}^{I} Σ_{i_2=1}^{J} |P(A_{i_1} ∩ B_{i_2}) − P(A_{i_1})P(B_{i_2})|,

where the supremum in the second definition is taken over all pairs of (finite) partitions {A_1, A_2, . . . , A_I} and {B_1, B_2, . . . , B_J} of Ω such that A_{i_1} ∈ A for each i_1 and B_{i_2} ∈ B for each i_2. The stochastic process is said to be α-mixing (strongly mixing) if α_n → 0, and β-mixing if β_n → 0, where α_n and β_n denote the corresponding coefficients between the past and future σ-fields separated by n time points. Note that β-mixing implies α-mixing.
Combined with the above statement and based on Theorem 1 in [51], there exists K_C > 0 such that for all x, y ≥ 0, we have: By Theorem 1 in [51] again, x_{(t−k),l} I(z_t ≤ x) ε_{(t,l′)} is sub-Weibull with parameter (1/κ_1 + 1/κ_c)^{−1}. By a similar argument, x_{(t−k),l} I(z_t ≤ x) x_{(t,l′)} is sub-Weibull with sub-Weibull parameter κ_1/2. With probability at least 1 − c_3 η_1 − η_2, we have: where η_1 = exp(−c_1 log(p²K)) and η_2 = exp(−c_2 n^{κ_1κ_c/(2(κ_1+κ_c))} (log n)^{2κ_1κ_c/(κ_1+κ_c)} + log(np²K)).

Proof of Lemma 5: First, we rewrite Equation (21) with respect to the switching variable z_t and t, as in Equation (22). The main goal is to find a proper rate for Equation (22). The indicator term I(z_t < z_{π(i)}) complicates the proof, since we need to maximize Equation (22) over the threshold location, as in Equation (23). Similar to [10], we use the bracketing technique to bound Equation (23).
To simplify the notation, we denote x^k_{t,l} = x_{(t−k),l} and define W as below. For any x_1 < x_2, we have: where Then, for fixed l′, l, k, we can construct a pseudo-metric. For any 0 < δ < 1, the integral of the bracketing entropy satisfies the following, where N(ε, F, L_2) denotes the bracketing number, that is, the minimum number of ε-brackets needed to cover F. Choose a fixed integer q_0 such that 4δ ≤ 2^{−q_0} ≤ 8δ. Then, choose a nested sequence of partitions F_q^u of F indexed by the integers q ≥ q_0, as in the Chaining Lemma [49]. For each level q > q_0 and each partition F_q^u, and for x ∈ B_q^u, select a fixed x_q^u ∈ B_q^u and define: where To bound H_1^{(l′,l,k)}, we apply Proposition 7 in [54] and take the union over the N_{q_0} balls (a finite number). Let K, c_0, c_1, and c_2 be positive constants. Then, for n ≥ c_0 (log(p²K))^{2/κ_0−1}, we can get: To bound H_2^{(l′,l,k)}, note that EH can be decomposed into three parts. By Lemma 4 and Proposition 3 from [51], the relevant product is sub-Weibull, so there exist a constant C_0 and κ_1 > 0 such that

for all c ≥ 1. Then, using the sub-Weibull property (first part of Theorem 2.1 in [51]), for any ε > 0 we can get: where c_3 is a positive constant and κ_1, κ_c > 0. Now, take the union over p²K. Then, for a positive constant c_4, we have: The second inequality holds since q_0 is fixed; its value will be determined later. To bound S^{(l′,l,k)}_{n2} and S^{(l′,l,k)}_{n3}, we apply a procedure similar to Lemma A.1 in [10]. Specifically, for any q ≥ q_0, the process Y(q, Γ^{(x)}) is a centered β-mixing process. Since β-mixing implies α-mixing, we can use the results of Lemma A.1 in [10]. For any y ≥ 0, there exists a positive constant c_5 such that: where h_{q_0} = Σ_{q=q_0}^{∞} 2^{−q/2} log N_q, and q_0, n ≥ 3. Since i and j are fixed, we take the union over p²K again and obtain the bound on S^{(l′,l,k)}_{n2}.

K. Zhang et al.
Using the same argument as for S^{(l′,l,k)}_{n2}, we have: When n is large enough, we can combine S^{(l′,l,k)}_{n1}, S^{(l′,l,k)}_{n2}, and S^{(l′,l,k)}_{n3}. Thus, we have: Now, take y = C_0 √(log(p²K)/n) and take q_0 to be the smallest integer such that q_0 ≥ 3 and h_{q_0} ≤ 1. Note that N_{q_0} is a constant, since q_0 is fixed. Equation (28) and Equation (36) give Lemma 5 as desired. Specifically, if we choose C_0 large enough, there exist positive constants C, c_6, c_7, c_8, c_9 > 1, and c_{10}, with c_5 y²/(2+y) > 3, such that Lemma 6 holds. Lemma 6. Under Assumptions A1 to A4, there exist positive constants C, c_0, c_1, c_2, and c_3 such that, with probability at least 1 − c_3 η_1 − η_2, we have: where η_1 = exp(−c_1 log(p²K)) and η_2 = exp(−c_2 n^{κ_1κ_c/(2(κ_1+κ_c))} (log n)^{2κ_1κ_c/(κ_1+κ_c)} + log(np²K)).
When s ∈ [γ n , D], we want to partition the interval into small pieces, and prove the consistency in each piece.
Recall that 1 > κ_0 > 0, and that a function of a β-mixing process is also β-mixing. Since x_t and z_t are β-mixing, the processes involved are β-mixing as well. To bound H_{ng} and I_{ng}, we apply Proposition 7 from [54]. By Equation (51), Equation (50), and Equation (47), we then obtain: with probability 1 − δ_3, where Thus, this proves Equation (44); Equation (43) follows similarly. Note that E(x_{(t−k)} x′_{(t−k)} I(r_j − s < z_t ≤ r_j)) is non-decreasing in s, and E|x_t|² is positive and bounded by Assumption A2. For s ≥ D ≥ γ_n, we first fix l, l′ in 1, 2, . . . , p and k in 1, 2, . . . , K. Recall that: where C is a constant greater than 0. Note that κ_0 = (1/(κ_1/2) + 1/κ_2)^{−1} < 1.

For s ∈ [γ_n, D], we have: Note that E(x_{(t−k)} x′_{(t−k)} I(r_j − s < z_t ≤ r_j)) is non-decreasing in s, and E|x_t|² is positive and bounded by Assumption A2. By Lemma 4, Lemma 13 in Wong et al. [54], and taking the union over p²K, for nγ_n > 4 we can obtain Σ_{g=0}^{M} I_{1ng} ≤ nγ_n exp(−c_8 (nγ_n y²)^{κ_0} + log(p²K)) + exp(−c_9 nγ_n y²/2 + log(p²K)).

Lemma 8.
Set σ²(s) = E[x_{(t−k,g)} I(r_j − s < z_t < r_j)]² for any given 1 ≤ g ≤ p. Let I_s ∈ R^{np×np} be the diagonal matrix with diagonal blocks I_1(s), . . . , I_n(s), where I_t(s) is a p × p diagonal matrix with all diagonal elements equal to I(r_j − s < z_t ≤ r_j), for t = 1, 2, . . . , n. Under Assumptions A1 to A4, there exist positive constants c_3, c_4, c_5, C′, C_1 such that, for any given j-th threshold, the stated bound holds with probability at least 1 − δ_2. Proof of Lemma 8: By Lemma 7, we have: where C′, c_3, c_4, c_5 are positive constants. Note that the supremum over 1 ≤ l, l′ ≤ p and 1 ≤ k ≤ K can be written as sup_{|s|≥γ_n} ‖(nσ²(s))^{−1} Z′ I_{r_j}(s) E‖_∞ for a given r_j.

Proof of Lemma 9:
The road-map for the proof of Lemma 9 is similar to that of Lemma 4 in [41], once adapted to the TAR modeling framework.
Denote by b_{j′} the order of the given j′-th estimated threshold s_{j′}. Since m̂ < m_0, there exists a threshold r_j such that |s_{j′} − r_j| > Δ_n/4. To find a lower bound on the sum of squared errors plus the penalty term, without loss of generality we consider the following three cases: (a) |s_{j′} − s_{j′−1}| ≤ γ_n; (b) there exist two true thresholds r_j, r_{j+1} such that |s_{j′−1} − r_j| ≤ γ_n and |s_{j′} − r_{j+1}| ≤ γ_n; and (c) otherwise.
By Assumption A5, {z_t} is a β-mixing process, so I(u_0 < z_t ≤ u_1) is a β-mixing process for fixed u_0 and u_1. By the second inequality of Theorem 1 in [35], there exists a positive constant c_B such that, with high probability: Recall that, according to Assumptions A1 and A5, the densities of {ε_t} and z_t are positive, so: where c_0 is a certain positive constant. Use θ̂_{s_{j′−1}, s_{j′}} to denote the estimated parameter on the estimated regime (s_{j′−1}, s_{j′}]. Recall that b_{j′} represents the order of the given j′-th estimated threshold s_{j′}. To simplify the notation, set θ̂ = θ̂_{s_{j′−1}, s_{j′}}. For case (a), consider the case where r_j < s_{j′−1} < s_{j′} < r_{j+1}; then |s_{j′} − s_{j′−1}| ≤ γ_n. Based on Lemma 5 and Equation (69), we can get: According to Assumption A7, we obtain: For case (b), consider the case where s_{j′−1} < r_j and s_{j′} < r_{j+1}.
By rearrangement, there exist constants c > 0, c_1 > 0, c_2 > 0, c_3 > 0 such that: In addition, we can bound ‖Â_{(·,j+1)} − θ̂‖ as in Equation (75). Recall that w_j denotes the order of the j-th true threshold. By Equation (75), we can use the same procedure as in case (a). For some constants c_h > 0, h = 5, 6, . . . , 11, we have: For the threshold interval (s_{j′−1}, r_j), there exist positive constants C_h, h = 1, 2, . . . , 9, such that: By Equation (76) and Equation (77), for a certain constant C_1 > 0, we obtain (78). In case (c), s_{j′−1} < r_j < s_{j′} with |s_{j′−1} − r_j| > Δ_n/4 and |s_{j′} − r_j| > Δ_n/4. In this case, the restricted eigenvalue condition does not hold, since the distance between two consecutive true thresholds is very large, which leads to a large distance to the intersection of the estimated thresholds. However, if the tuning parameters are chosen properly, the deterministic part of the deviation bound argument holds. Consider the threshold intervals (s_{j′−1}, r_j) and (r_j, s_{j′}). According to the results of Lemma 7 and Equation (79), we have: where c_1, c_2 are some positive constants. By Assumption A4: According to Equation (82) and Equation (83), we obtain (84). When both |s_{j′−1} − r_j| > γ_n and |s_{j′} − r_j| > γ_n, we can follow steps similar to those leading to Equation (84) and obtain: Combining the above three cases (a), (b), and (c) proves the result. Proof of Lemma 10: The proof of Lemma 10 is similar to that in Appendix E.1 of [54]. We mainly need to apply Proposition 1 and Proposition 2 in [30], together with the fact that any measurable function of a stationary β-mixing process is also β-mixing. Proposition 1 in [30] gives that the sequence is geometrically ergodic under certain conditions; using Proposition 2 in [30], we can then show that the sequence is β-mixing with geometrically decaying mixing coefficients.
Finally, we verify the sub-Weibull assumption by using the definition of sub-Weibull distributions.
To apply Proposition 1 in [30], we check its three conditions, setting E = R^p and μ the Lebesgue measure. Condition (i) is satisfied by taking the parameter m in Proposition 1 of [30] equal to 1. (Note that here m is not the number of thresholds.) For condition (ii), we set m̃ = inf_{u∈C, v∈A} ‖u − v‖_2, the minimum "distance" between the compact set C and the set A in [30], where A is any set in B, the σ-algebra of Borel sets of E, and C ⊂ E is any compact set. Since C is bounded and A is Borel, m̃ is finite. For condition (iii), take the function Q(·) = ‖·‖ and set K_c = {x ∈ R^p : ‖x‖ ≤ 4C_ac/c}, where c = 1 − B_max and C_ac := E‖Ũ_t‖. Since max_{j=1,2,...,m_0+1} ‖B^{(j)}‖ < 1: • for all ỹ ∈ E \ K_c, i.e., ỹ such that ‖ỹ‖ > 4C_ac/c; • for all ỹ ∈ K_c, 0 ≤ ‖ỹ‖ ≤ 4C_ac/c.
By Proposition 1 in [30], X̃_t is geometrically ergodic and stationary. By Proposition 2 in [30], the sequence is β-mixing with geometrically decaying mixing coefficients.
Next, we verify the sub-Weibull property. Let κ be the sub-Weibull parameter associated with Ũ_t in (106). Since ‖X̃_t‖_ψ ≤ B_max ‖X̃_{t−1}‖_ψ + ‖Ũ_{t−1}‖_ψ and B_max < 1, the norms remain uniformly bounded. Now, given that X̃_t is an equivalent representation of x_t, it follows that x_t is also sub-Weibull. Therefore, Assumption A2 holds.

Algorithm 1: The fused lasso algorithm
Initialize θ_i = 0 for i = 1, . . . , n.
while h < maximum number of iterations do
    for i = 1, . . . , n do
        Compute the (h + 1)-th iterate θ_i^{h+1}.
    end for
end while

In this section, we summarize the existing methods for estimating multivariate TAR models, along with their treatment of the number of thresholds m_0 and the dimension of the TAR model. Table 3 highlights the limitations of existing approaches and the fact that our methods are the only available approach that can handle both low- and high-dimensional settings while allowing for an unknown number of thresholds that could diverge with the number of observations T. Allowing for an unknown number of thresholds significantly complicates the problem, as previous approaches for multivariate TAR models need to first estimate the number of thresholds and then proceed with estimating the threshold locations. An incorrect estimate of the number of thresholds in the first step may result in biased threshold estimates due to misspecified components in the estimation procedure.
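The schematic updates in Algorithm 1 above can be made concrete. The sketch below solves a one-dimensional fused lasso problem via ADMM on the generalized-lasso form; this is a generic solver, not necessarily the paper's exact update scheme.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding operator."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fused_lasso_1d(y, lam, rho=1.0, n_iter=500):
    """Minimize 0.5*||y - theta||^2 + lam * sum_i |theta_{i+1} - theta_i|
    by ADMM on the generalized-lasso form with first-difference matrix D.
    A generic solver sketch, not the paper's exact algorithm."""
    n = len(y)
    D = np.diff(np.eye(n), axis=0)          # (n-1) x n first-difference matrix
    M = np.eye(n) + rho * D.T @ D           # theta-update system matrix
    z = np.zeros(n - 1)
    u = np.zeros(n - 1)
    theta = np.asarray(y, dtype=float).copy()
    for _ in range(n_iter):
        theta = np.linalg.solve(M, y + rho * D.T @ (z - u))
        z = soft_threshold(D @ theta + u, lam / rho)
        u = u + D @ theta - z
    return theta
```

On a noisy piecewise-constant signal, the solution is essentially flat within segments while the jump between segments is preserved (shrunk slightly by the penalty).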
As seen in Table 3, many methods assume that the number of thresholds is known, even though this information is often not available in practice. Thus, in the remainder of this section we discuss how existing approaches treat the number of thresholds.

Table 3: Comparison of existing methods for estimating multivariate TAR models. Here m_0 represents the number of thresholds and T the length of the time series.

Paper | m_0 | m_0 assumed known? | Dimension
Tsay [48] | finite (at most three) | No | low
Lo and Zivot [33] (TVAR) | at most 2 | Yes | low
Hansen and Seo [21] | 1 | Yes | low (bi-variate)
Nieto [36] | finite | No | low (bi-variate)
Dueker et al. [14] | finite | Yes | low
Li and Tong [28] | 1 | Yes | low
Calderón V and Nieto [7] | at most 3 | No | low
Orjuela and Villanueva [37] | finite | No | low
Our method | diverging with T | No | moderate & high

Utilizing this assumption, Tsay [48] performs a grid search, estimating the coefficients with a simple linear model on each interval and selecting the threshold based on the Akaike information criterion (AIC). Lo and Zivot [33] instead assume the model has at most 2 thresholds. While a relaxation compared to a known number of thresholds, this assumption still considerably simplifies the problem. Using this assumption, Lo and Zivot [33] use nested hypothesis tests (testing whether the data can be modeled by a linear model versus a TAR model) to detect the thresholds, and apply the grid search method to estimate the threshold values based on the results of the hypothesis tests. As an alternative, Hansen and Seo [21] couple the grid search with maximum likelihood estimation (MLE) of the model parameters. However, the algorithm is difficult to implement in higher dimensions, and the consistency and/or distribution of the MLE estimator is not investigated. Dueker et al. [14] restrict the switching variable to be constructed from lags of the original time series being modeled and perform a grid search with respect to a certain log-likelihood function. The key advantage of this method is that it allows for multiple switching variables, but with only one threshold per switching variable. Li and Tong [28] provide a nested sub-sample search algorithm to reduce the time complexity of the grid search.
A few methods have recently tried to estimate multivariate TAR models under less restrictive assumptions on the number of thresholds. However, these methods can only handle a finite number of thresholds, or only work in low-dimensional settings. To our knowledge, Nieto [36], Calderón V and Nieto [7], and Orjuela and Villanueva [37] are the only methods that do not require a known number of thresholds or a bound on the number of thresholds. This is achieved by utilizing a Bayesian estimation framework. However, the consistency of the number of estimated thresholds is not investigated for these Bayesian methods, which could be a challenging problem. Our proposed methods and the corresponding theory thus bridge a gap in the existing literature, as the only methods that allow for an unknown and diverging number of thresholds, m_0, while also facilitating estimation of moderate- and high-dimensional time series.

Simulation Scenario 5 (Simple high-dimensional A with uncorrelated error)
In this scenario, T = 80, p = 100, and K = 2. There is only one threshold value r 1 = 5. The auto-regressive coefficients are chosen to have the same structure as in Scenario 1 but with different values (see Figure 6).
The auto-regressive coefficients for the above simulation scenarios are visualized in Figure 6.
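For reference, the data-generating mechanism of a scenario like Scenario 5 can be sketched as follows. The diagonal coefficient values below are illustrative stand-ins chosen to keep both regimes stable; the paper's exact matrices are those shown in Figure 6.

```python
import numpy as np

def simulate_tar(T=80, p=100, K=2, r1=5.0, seed=0):
    """Simulate a two-regime threshold VAR in the spirit of Scenario 5:
    sparse (here diagonal) transition matrices that switch when an
    exogenous switching variable z_t exceeds the threshold r1.
    Coefficient values are illustrative, not the paper's."""
    rng = np.random.default_rng(seed)
    # illustrative sparse coefficients: one lag-1 and one lag-2 matrix per regime
    A = {0: [0.5 * np.eye(p), 0.2 * np.eye(p)],
         1: [-0.4 * np.eye(p), 0.1 * np.eye(p)]}
    z = rng.uniform(0.0, 10.0, size=T)            # exogenous switching variable
    x = np.zeros((T, p))
    x[:K] = rng.normal(size=(K, p))
    for t in range(K, T):
        regime = int(z[t] > r1)
        mean = sum(A[regime][k] @ x[t - 1 - k] for k in range(K))
        x[t] = mean + 0.1 * rng.normal(size=p)    # uncorrelated errors
    return x, z
```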