Rank Determination in Tensor Factor Model

Factor model is an appealing and effective analytic tool for high-dimensional time series, with a wide range of applications in economics, finance and statistics. This paper develops two criteria for the determination of the number of factors for tensor factor models where the signal part of an observed tensor time series assumes a Tucker decomposition with the core tensor as the factor tensor. The task is to determine the dimensions of the core tensor. One of the proposed criteria is similar to information based criteria of model selection, and the other is an extension of the approaches based on the ratios of consecutive eigenvalues often used in factor analysis for panel time series. Theoretically results, including sufficient conditions and convergence rates, are established. The results include the vector factor models as special cases, with an additional convergence rates. Simulation studies provide promising finite sample performance for the two criteria.


Introduction
Factor models have become a popular dimensional reduction tool in economics and statistics, especially for analyzing high dimensional time series. In practice a few common factors can often capture a large amount of variations and dynamics among a large pool of variables and time series. In the finance literature, Chamberlain and Rothschild (1983) exploited factor analysis to extend classical arbitrage pricing theory. In macroeconomics, Bai (2003), Bai and Ng (2002), Stock and Watson (2002) considered static factor models for modeling macroeconomic time series. Forni et al. (2005) studied the identification of economy-wide and global shocks using generalized dynamic factor models. Mincheva (2011, 2013), Fan, Wang and Zhong (2019) established large covariance matrix estimation based on the static factor model. Factor models are also used to evaluate the impacts of various policies; see, e.g., Bai, Li and Ouyang (2014), Ouyang and Peng (2015) and Li and Bell (2017). Recently, large matrix or tensor (multi-dimensional array) data has become ubiquitous. Wang, Liu and Chen (2019) proposed a matrix factor model and applied it to matrixvalued financial data. Chen, Yang and Zhang (2021) analyzed the multi-category import-export network data via tensor factor model. A critical step in building a factor model is to correctly specify the number of factors used in the model. Estimation and forecasting procedures are all depended on the number of factors. Moreover, in some cases the number of factors may have some crucial economic interpretations and important theoretical consequences. For example, in finance and macroeconomics, it provides the number of sources of nondiversifiable risk or the fundamental shocks driving the macroeconomic dynamics. See, Forni and Reichlin (1998), Stock and Watson (2016), Giannone and Reichlin (2006), Forni et al. (2009), among others.
Over the past decades, many methods have been developed to determine the number of common factors needed for modelling high dimensional vector time series. The most widely studied approach is to utilize the behavior of the eigenvalues of the covariance matrix (see, e.g., Bai and Ng (2002)), or the singular values of the autocovariance matrix (see, e.g., Lam and Yao (2012)). By the definition of factor models, the eigenvalues or the singular values corresponding to the systematic components must increase with the number of cross-sectional units. The rest of the eigenvalues, which represents idiosyncratic components, stay bounded or remain to be zero. In the static factor model, Bai and Ng (2002) proposed to estimate the number of factors by separating diverging eigenvalues from the rest using threshold functions, in the form of an information criterion. Alternative criteria based on random matrix theory have been studied in Kapetanios (2010) and Onatski (2010) for the static factor model. Specifically Kapetanios (2010) developed sequential tests and employed a subsampling method to obtain an approximation of the asymptotic distribution of the estimated eigenvalues. Onatski (2010) constructed tests based on the empirical distribution of the eigenvalues. In addition, Onatski (2012) proposed an alternative estimator using the difference of consecutive eigenvalues. Bai and Ng (2007) and Amengual and Watson (2007) extended the work of Bai and Ng (2002) to the restricted dynamic factor model. Hallin and Liška (2007) further extended the framework to the generalized dynamic factor model through thresholding eigenvalues of the spectral density matrix. They also proposed a data-dependent method to adjust the multiplicative constant of the penalty function. Alessi, Barigozzi and Capasso (2010) introduced a tuning multiplicative constant in the penalty for dealing with approximate factor models. Kong (2017) employed similar ideas to study continuous time factor model with high frequency data. Li, Li and Shi (2017) modified Bai and Ng (2002)'s procedure to the case that the number of factors is allowed to increase with the sample size. Trapani (2018) proposed a randomized sequential test procedure to determine the number of factors.
An alternative approach is to study the ratio of each pair of adjacent eigenvalues, with the insight that ratio of the smallest eigenvalue among these cor-responding to the system component and the largest eigenvalue among these corresponding to the idiosyncratic component goes to infinity. Under stationary conditions, Ahn and Horenstein (2013) developed such an estimator based on the sample covariance matrix. Lam and Yao (2012) used such a ratio based estimator based on singular values of the aucovariance matrix, under an alternative definition of factor models proposed in Pan and Yao (2008) and Lam, Yao and Bathia (2011).
Other than the eigenvalue-based methods, Ye and Weiss (2003) developed an eigenvector based order determination procedure. Luo and Li (2016) proposed a new estimator that combines both the eigenvalues and the bootstrap eigenvector variability. Jung, Lee and Ahn (2018) suggested to sequentially test skewness of the squared lengths of residual scores that are obtained by removing leading principal components. However, these works assumed that the data are temporally independent, which are unlikely to hold for economic data.
These studies all focus on panel (vector) time series. Recently there is a growing interest in analyzing matrix-or tensor-valued time series, as such time series is encountered more and more frequently in applications, including Fama-French 10 by 10 series (Wang, Liu and Chen, 2019), a set of economic indicator series among a set of countries (Chen, Xiao and Yang, 2021), multi-category international trading volume series Chen, 2019, Hoff, 2011), multitype international action counts among a group of countries (Hoff, 2015), sequence of realized covariance matrices (Kim andFan, 2019, Lunde, Shephard andSheppard, 2016), sequence of gray-scale face recognition images (Chen and Fan, 2021), dynamic networks (Barabási andAlbert, 1999, Jiang, Li andYao, 2020), dynamic human brain transcriptome data (Liu, Yuan and Zhao, 2017), multivariate spatial-temporal climate series , neuroimaging data (Zhang, 2019, Zhou, Li andZhu, 2013). Factor model is again developed as an effective dimension reduction tool (Chen, Yang and Zhang, 2021, Wang, Liu and Chen, 2019. Same as for the vector factor models, it is important to determine the number of factors in these models. In this paper, we consider the determination of the dimension of the core tensor factor in the tensor factor model in Chen, Yang and Zhang (2021) and Han et al. (2020), which assumes the form Similar to Lam and Yao (2012), the noise tensor E t is assumed to be a white tensor process with potentially strong contemporary correlations among the elements of the noise tensor, and all common dynamics is absorbed in the signal process M t . This model setting is different from the approximate factor model in Bai and Ng (2002) and the dynamic factor model in Hallin and Liška (2007), in which the noise process is allowed to have weak auto-correlations, but with strong restriction on the contemporary correlation. Chen, Yang and Zhang (2021) and Han et al. (2020) studied the estimation procedures of the tensor factor model, assuming the ranks of the core tensor F t is given, with some ad hoc rank determination suggestions. In this paper we formally propose two criteria for specifying the ranks of the core factor process, which we name "the information criterion" (IC) and "the eigenvalue ratio" (ER). They are all based on examining the eigenvalues of the sample cross-automoment of the observed tensor time series, utilizing the whiteness property of the noise process. The IC estimators aim at truncating eigenvalues, which is similar to the information criteria in vector factor models (e.g., Bai and Ng (2002) and Hallin and Liška (2007)). The ER estimators are obtained by minimizing the ratio of two adjacent eigenvalues arranged in ascending order, extending the standard ER estimator in Lam and Yao (2012) and Wang, Liu and Chen (2019) with an added small penalty term in both the numerator and denominator of the ratio. The penalty term behaves like a lower bound correction to the true zero eigenvalues. We adopt similar ideas of the TOPUP and TIPUP procedures of Chen, Yang and Zhang (2021), and their corresponding iterative versions, iTOPUP and iTIPUP, of Han et al. (2020), to construct sample auto-crossmoments. Our theoretical and empirical investigations show that estimators based on the iterative algorithms are much better than that based on the noniterative ones, as the iterative algorithms significantly improve the estimation accuracy of the eigenvalues. The finite sample properties of the IC and ER criteria are also good. The empirical evidences show that the best estimators in tensor factor model are the IC and ER estimators based on iTIPUP, under some mild conditions on the level of signal cancellation typically associated with the TIPUP based procedures. This paper is organized as follows. Section 3.1 briefly describes the tensor factor model and the corresponding estimation procedures proposed in Chen, Yang and Zhang (2021) and Han et al. (2020). Section 3.2 introduces the criteria for determining the ranks of the core tensor factor process, and their iterative versions. Section 4 investigates theoretical properties of the proposed estimators. Section 5 presents simulation studies of the finite sample properties of the proposed methods. Real data analysis is given in Section 6. Discussions are provided in Section 7. All technical details are relegated to the Supplementary Material.

General order determination criteria of semipositive definite matrices
In this section, we first propose two general order determination criteria based on the properties of the estimated eigenvalues of a semipositive matrix. Let W be a p × p symmetric and non-negative definite matrix, which is a sample version of a true p × p symmetric and non-negative definite matrix. We assume W = E W . Also letλ j be the eigenvalues of W such thatλ 1 ≥λ 2 ≥ . . . ≥λ p . Let λ 1 ≥ . . . ≥ λ r > λ r+1 = . . . = λ p = 0 be the eigenvalues of W . Note that the rank of W is r.
Let m * < p be a predefined upper bound and functions G( W ) and H( W ) be some appropriate positive penalty functions. We propose the following two quantities IC( W ) = argmin (1) The first criterion in (1) is similar to an information criterion as its first term mimics the residual sum of squares of using a rank m matrix to approximate the matrix W while the second term mG( W ) penalizes the model complexity m. We will call it the information criterion (IC). The second criterion in (2) uses the ratio of two adjacent eigenvalues of W , with a small penalty term H( W ) added to both the numerator and denominator. We will call it the eigen-ratio criterion (ER).
Remark 1 (The information criterion). Note that, for a given m, the principle components can be viewed as solutions of an optimization problem in which the "sum of squared residuals" is minimized, Note that tr I − U m U m W = p j=m+1λ k . It plays the role of residual sum of squares classically appearing in information criterion methods. Criterion (1) has a structure comparable to that of Bai and Ng (2002) and Hallin and Liška (2007). For vector factor models, the method proposed in Bai and Ng (2002) is the same of the IC criterion with W being the sample covariance matrix, while that in Hallin and Liška (2007) used spectral density matrix estimation. The penalty mG( W ) is intimately related to the rate of convergence of the nondivergent eigenvalues, when W is estimated from a set of data with diverging dimensions, and balances between overestimation and underestimation.
Remark 2 (Eigen-ratio criterion). Different from the standard ER estimator in Lam and Yao (2012), we add a penalty term H( W ) to both the numerator and denominator. The intuition behind H( W ) is as follows. Since W is a noisy version (an estimator) of W of rank r, all estimated eigenvaluesλ j (r+1 ≤ j ≤ p) correspond to the zero eigenvalues of W . Hence the ratioλ j+1 /λ j (j > r) theoretically can be arbitrary small. The penalty H( W ) provides a lower bound correction toλ j (r +1 ≤ j ≤ p). When it is of a proper order, we can ensure that the ratio (λ m+1 +H( W ))/(λ m +H( W )) goes to zero when m = r (the true rank), while all other such ratios are asymptotically bounded from below. In vector factor models, Ahn and Horenstein (2013) exploited the ratio of eigenvalues of sample covariance matrix to determine the number of factors. Non-divergent eigenvalues therein are bounded below by a positive number asymptotically, as long as the eigenvalues of covariance matrix of idiosyncratic noises are bounded away from zero. Our criterion (2) has a similar flavor.
Here we show a consistency result for the general estimator. To be more precise, let W (n) and W (n) be two sequences of semi-positive symmetric matrices, with W (n) = E W (n) and n be an index associated with the sample size and dimension. Also letλ (n) j be the eigenvalues of W (n) such thatλ The following proposition provides the sufficient conditions for the consistency of the IC and ER estimators in (1) and (2), respectively. It provides a guideline of choosing proper penalty functions G(·) and H(·) in order determination for any generic W (n) .
In the conditions of Proposition 1, γ n represents the convergence rate of the sample eigenvalues corresponding to the non-zero eigenvalues of W (n) , and β n represents the rate of the sample eigenvalues corresponding to the zero eigenvalues of W (n) . For example, under the strong factor model of Lam and Yao (2012)'s setting, γ n = p 2 T −1/2 and β n = p 2 T −1 , where T is the sample size.
Remark 3. In our model setting, λ (n) r+1 = . . . = λ (n) p = 0 and our objective is to separate the zero and non-zero eigenvalues. The proposition holds for general spiked eigenvalue detection as well. Specifically, let λ are called non-spiked eigenvalues (Cai, Han and Pan, 2020). Again it is assumed that λ (n) r → ∞ as n → ∞. Then Proposition 1 holds when γ n and β n are the convergence rate of the sample eigenvalues corresponding to the spiked and non-spiked eigenvalues of W (n) , respectively. Note that the approaches of Bai and Ng (2002), Amengual and Watson (2007), Hallin and Liška (2007), Lam and Yao (2012) and Ahn and Horenstein (2013) all fit in this generic setting or its variants, with various forms of the penalty functions G( W ) and H( W ) to distinguishλ (n) r fromλ (n) r+1 . For example, Bai and Ng (2002) suggest to use where T is the sample size and p is the number of variables.
Remark 4. When the dimensions d k are large, estimating eigenvalues of a matrix using its sample version is in general very difficult and potentially inaccurate. However, to determine the number of factors, only the leading eigenvalues need to be estimated relatively accurately to achieve the purpose, which requires relatively mild conditions on the sample version of the matrix.
For the specific problems such as the tensor factor model problem we focus there, a detailed analysis of the rates γ n and β n is needed to construct the penalty functions G(·) and H(·) and to establish the consistency of the rank estimators. In fact one can establish the convergence rate with a more detailed analysis beyond the simple consistency results in Proposition 1, as we will do for the tensor factor model.

The model
Here we briefly introduce the tensor factor model setup in Chen, Yang and Zhang (2021) and Han et al. (2020). A tensor factor model can be written as where X t ∈ R d1×···×d K is the observed tensor at time t, the core tensor F t is the unobserved latent tensor factor process of dimension r 1 × . . . × r K , A k are the deterministic loading matrix of size d k × r k and r k d k , and E t is the idiosyncratic noise components of X t , which is assumed to be a white process.
The core tensor F t is usually much smaller than X t in dimension. We also assume that the rank of A k is r k . Otherwise X t in (4) may be expressed equivalently with a lower-dimensional factor process. The parameters r 1 , ..., r K are assumed to be fixed but unknown. For more details of the tensor factor model (4), see Chen, Yang and Zhang (2021) and Han et al. (2020).
It is obvious that the loading matrices A k are not identifiable in Model (4). Model (4) is unchanged if we replace (A 1 , ..., However, the linear space spanned by the columns of A k , called the factor loading space, is uniquely defined. Assume A k has a SVD representation A k = U k Λ k V k . Then, the factor loading space of A k can be represented by the orthogonal projection P k ,

Rank selection criteria for tensor factor models
The two criteria introduced in Section 2 can be used to estimate the number of factors in the tensor factor model (4), using properly constructed matrices W and W . Particularly, we will study the following four constructions.
Since they have been proposed and used for loading space estimation in Chen, Yang and Zhang (2021) and Han et al. (2020), we adopt the same names to represent them.
where ⊗ is the tensor product such that, for any A ∈ R m1×m2×···×m K and B ∈ R r1×r2×···×r N , and mat k is the tensor unfolding (into a matrix) operation along mode-k of a tensor. Here we emphasize that W k is constructed using X 1:T = (X 1 , . . . , X T ). The constant h 0 is a (small) predetermined integer and the sum over h in W k is to accumulate the information from different time lags h. The rank of its population version can be shown to be r k under certain conditions, hence we can use the IC and ER estimators presented in Section 2 to determine r k .
(III and IV) iTOPUP and iTIPUP: Han et al. (2020) proposed an iterative procedure to estimate U k , k = 1, ..., K in (5), based on either TOPUP or TIPUP procedure. Briefly, at i-th iteration, suppose we have obtained an estimate of the ranks r (i−1) k (k = 1, . . . , K) and their corresponding U Note that Z (i) k,t uses projection of X t on all modes, except mode-k. The initial ranks r through the non-iterative TOPUP and TIPUP procedure. Let Z The iterative procedure is motivated by the observation that Z k,t is a r 1 . . . r k−1 d k r k+1 . . . r K tensor, much smaller than X t , which is d 1 . . . d k tensor. Hence A k can be estimated more accurately if all A j (j = k) are given in advance or can be estimated accurately, since the convergence rate now depends on The IC and ER estimators are constructed by replacing W in (1) and ( . This yields eight different criteria, summarized in Table 1. Again, we use the same names of the procedures as that in Chen, Yang and Zhang (2021) and Han et al. (2020) to represent the various constructions of W . For the iterative procedures, we start with an initial rank estimate r (0) k , k = 1, . . . , K and estimate the ranks through iteration until convergence. See remark below for setting the initial ranks and the stopping criteria.
Remark 5. In our theories, we fix m * in (1) and (2) as a finite constant. However, in practice, we may use, for example, m * = p/2. We do not recommend to extend the search up to p, as the minimum eigenvalue is likely to be practically 0, especially when T is small and d k is large.
The choice of the penalty function G(·) and H(·): Both criteria essentially try to distinguish the smallest (true) non-zero eigenvalue from the true zero eigenvalue using noisy estimators of the eigenvalues. Hence the penalty function is closely related to the amount of error in the eigenvalue estimation and the strength of the smallest (true) non-zero eigenvalue. We consider the following penalty functions G(·) = g k (d, T ): where d = Π K k=1 d k and ν is a tuning parameter. Ideally ν should be chosen to be the strength of the weakest factor (see Assumption IV in Section 4.1), though in practice we usually do not know its precise value. A more thorough discussion on this issue will be given later in Remark 10. Note that only g k,5 involves k.
For the eigen-ratio criterion, we consider the following penalty function H(·) = h k (d, T ): where c 0 is a small constant, e.g. c 0 = 0.1. Note that the penalty functions scale with h 0 , because the strength of divergent eigenvalues increases with h 0 . Our theoretical analysis indicates that a better penalty function H(·) should also involve the strengths of the factors, similar to G(·). However, the function G(·) has a much wider allowable range, and in most of the situations a simple constant function h k,1 is sufficient. A more detailed discussion will be given later in Remark 10.
More considerations of the iterative procedure: The non-iterative procedures estimate r k (k = 1, . . . , K) individually. The accuracy ofr k does not depend on the accuracy of the estimation of the ranks in other directions. On the other hand, the iterative procedures estimate all the ranks simultaneously, hence the accuracy of estimated rank in one direction depends on that in all other directions. The iterative algorithm improves the estimation accuracy of the eigenvalues and the principal subspace because the projected tensor Z (i) k,t in (6) is of lower dimensional than X t . Figure 1 numerically shows that the iterative algorithms improve the accuracy of estimated true zero eigenvalues over the non-iterative algorithms. In iterative algorithms, one would need to specify all the ranks r (i) k (k = 1, . . . , K) in each iteration. Intuitively, an overestimated r (i) k > r k would still produce consistent estimators, since the non-iterative procedure, using r (i) k = d k for all other directions, is consistent. Our theoretical results shown later confirm that, if r (i−1) k used is larger than the true r k , the iterative algorithm warrant the consistency of the IC and ER estimators at i-th iteration. However, an underestimated r (i) k < r k would potentially result in loss of signal strength hence negatively impacting the estimation in other dimensions. A precise quantification of the impact requires a more detailed investigation. But the numerical studies show that for iteration i > 1, the performance is relative robust by using the order obtained by the IC or ER criteria. This is partially due to the theoretical justification that, for fixed ranks r k , iterative algorithm only needs one iteration to achieve the ideal convergence rate for the estimation of the eigenvalues (see Theorem 2 later). In fact, if one has a priori information about a possible maximum (fixed) ranks of the core factor process, one could use such ranks in the iterative algorithms accordingly.
For iTOPUP procedure, we use r k (ER) after the initial iteration i ≥ 1. However, one needs to use a more conservative estimator of the rank for the initial step since r (0) k (IC) and r (0) k (ER) tend to be inaccurate. We suggest to user k (ER) + 3} by default, unless one has prior knowledge of the number of the factors. iTIPUP procedure is similar. Although it is safer to use larger initial ranks r (0) k , it is often not necessary to be extremely conservative, as the initial loss of signal strength of using a rank too small can be corrected later through iterations.
In the iterative algorithms, iteration is not stopped until the convergence of both the rank estimators and the loading space estimators. Theoretical prop-erties in Section 4 only state consistency results in each iteration step. This stopping rule is mainly suggested by simulation study. In practice, we may stop the algorithm when the estimated number of factors in current iteration is the same as that in previous iteration.

Some discussions
Remark 6. When the means of the factor processes deviate from zeros by a large margin, there is often one or several dominating factors corresponding to these non-zero means, as observed by Brown (1989), while the factors associated with the covariances become weak factors and are more difficult to identify. Assuming that the factor tensor process does not change its dimension after the deterministic means are removed, i.e. the factor tensor process is not a constant in any of its dimensions, then one should always demean the data in practice for the determination of the dimension of the factors, though not necessary for the estimation of the loading spaces, as shown by Chen and Fan (2021) that aggregating the first and second moments of the data may improve the estimation accuracy of the factor loading matrices.
Remark 7. When some r k = 1, although the factor process has a reduced number of tensor modes, the proposed IC and ER methods should work well in identifying r k = 1 cases. If there is no factor structure (r 1 = ... = r K = 0), the proposed IC methods still can select the zero rank. But the ER methods need a slight modification, by constructing a new mock eigenvalueλ k,0 using the rate of the spiked eigenvalues. In the current paper, we focus on the case that r k is fixed and d k diverges. If r k = d k along one (or even all) dimension(s), the theoretical results in Section 4 can be extended. In this case we will need to modify the IC and ER methods using the developed convergence rates of zero eigenvalues.
Remark 8 (Improved penalization for IC approaches). The information criterion (1) has the property, exploited by Hallin and Liška (2007) in the context of dynamic factor models, that a penalty function G(·) = g k (d, T ) leads to a consistent estimate of r k if and only if cg k (d, T ) does, where c is an arbitrary positive real number. Thus, multiplying the penalty by c has no influence on the asymptotic performance of the identification method. However, for given finite d and T , the value of a penalty function g k (d, T ) satisfying (1) can be arbitrarily small or arbitrarily large, and this indeterminacy can affect the actual result quite dramatically. The procedures in Hallin and Liška (2007) and Alessi, Barigozzi and Capasso (2010) can also be used in tensor factor model to robustify the IC approach with an empirically optimal choice of c.
Specifically, following Hallin and Liška (2007) and Alessi, Barigozzi and Capasso (2010), we generate a sequence of subsamples of sizes (d 1,j , ..., d K,j , T j ) with j = 0, ..., J such that d k,0 = 0 < d k,1 < d k,2 < · · · < d k,J = d k and T 0 = 0 < T 1 ≤ T 2 ≤ · · · ≤ T J = T , where d k is the original data dimension of tensor mode k and T is the original sample size, 1 ≤ k ≤ K. For any j, we The optimal c is then chosen to minimizes S k,c .

Assumptions and notation
We introduce some notations first.
The tensor Hilbert Schmidt norm for a tensor A ∈ R m1×m2×···×m K is defined as Define the tensor operator norm for an order-4 tensor A ∈ R m1×m2×m3×m4 , where ⊗ is the tensor product and U k is from the SVD form of as the canonical version of the auto-covariance of the factor process. Similarly define For simplicity, we write U k = U k,r k and U k = U k,r k . To facilitate consistency properties of the proposed procedures, we impose the following assumptions.
Assumption I. The error process E t are independent Gaussian tensors, condition on the factor process {F t , t ∈ Z}. In addition, there exists some constant Assume the factor process F t satisfies the strong α-mixing condition such that for some constant c 0 > 0 and 0 < θ 1 ≤ 1, where Han et al. Assumption III. For any u k ∈ R r k with u k 2 = 1 and 1 ≤ k ≤ K, where c 1 , c 2 are some positive constants and 0 < θ 2 ≤ 2.
Assumption IV. Assume r 1 , ..., r K are fixed. There exist some constants k,1:h0 )] is of rank r k for 1 ≤ k ≤ K. Assumption I is the same assumption used in Chen, Yang and Zhang (2021) and Han et al. (2020). This assumption corresponds to the white noise assumption of Lam, Yao and Bathia (2011), Lam and Yao (2012). It allows substantial contemporaneous correlation among the entries of E t . Note that the normality assumption, which ensures fast convergence rates in our analysis, is imposed for technical convenience. In fact we only need to impose the sub-Gaussian condition. Assumption II allows a very general class of time series models, including causal ARMA processes with continuously distributed innovations; see also Bradley (2005), Fan and Yao (2003), Rosenblatt (2012), Tong (1990), Tsay (2005), Tsay and Chen (2018), among others. The restriction θ 1 ≤ 1 is introduced only for presentation convenience. Assumption III requires that the tail probability of any orthonormal projection of F t decay exponentially fast. In particular, when θ 2 = 2, F t is sub-Gaussian.
Assumption IV is similar to the signal strength condition of Lam and Yao (2012), and the pervasive condition on the factor loadings (e.g., Stock and Watson (2002) and Bai (2003)). It plays a key role in identifying the common factors and idiosyncratic noises in (4). Indices δ 0 , δ 1 are measures of the strength of factors, or the rate of signal strength growth as the dimension d k grows. When δ 0 = δ 1 = 0, the factors are called strong factors; otherwise, the factors are called weak factors. In particular, δ 0 represents the strength of the strongest factors and δ 1 the strength of the weakest factors.
Remark 9 (Signal cancellation). Assumption V guarantees that there is no redundant tensor direction in F t when combined with A k 's. It is related to certain signal cancellation phenomenon which is rare for TOPUP procedures but may occur among TIPUP related procedures. Consider the case of k = 1 and K = 2. We write the factor process in the canonical form as F as the time average cross product between fibers f * i1,j1,1:T and f * i2,j2,1:T of the factor process (in canonical form). Then Θ 1,h i1,j,i2,j,h is subject to potential cancellation among its terms for h > 0. In the extreme cases, E[mat 1 (Φ * (cano) k,1:h0 )] may not have full rank r k and thus the signal strength τ * k,r k can be much smaller than the order d 1−δ1 . In Assumption V(b), we rule out the possibility of such severe signal cancellation. In practice, Han et al. (2020) suggest to examine the patterns of the estimated singular values under different lag h values. If there is no severe signal cancellation, we would expect that the pattern of h Here we emphasize that τ k,r k ,h0 and τ * k,r k ,h0 depend on h 0 , though in other places when h 0 is fixed we will omit h 0 in the notation. Severe signal cancellation would make the patterns different, Han et al. (2020). On the other hand, Assumption V(a) is sufficient to guarantee that E[Θ k,1:h0 ] and A k have the same rank r k .
Instead of Assumptions II to V, Chen, Yang and Zhang (2021) and Han et al. (2020) imposed conditions on Θ k,0 op , Θ * k,0 2 , τ k,r k and τ * k,r k in order to allow r k to increase with d k . The following proposition establishes a connection between these two types of assumptions when the rank r k is fixed.

Theoretical properties for IC and ER estimators
In this section, we shall present theoretical properties of the IC and ER estimators using non-iterative TOPUP, non-iterative TIPUP, iTOPUP and iTIPUP. We first introduce some quantities related to the estimation errors of the estimated eigenvalues of the four different methods. For non-iterative TOPUP, define For non-iterative TIPUP, define Similarly, we useβ k ,γ k andβ * k ,γ * k for iTOPUP and iTIPUP, respectively, whereβ correspond to the β n and γ n sequence in Proposition 1.
We impose the following set of conditions to ensure that the estimation error of the divergent eigenvalue is much smaller than the true smallest non-zero eigenvalue, and the estimation error of the zero eigenvalues is relatively small. The γ's above are part of the estimation errors of the divergent eigenvalues, and the β's are the estimation errors of the (true) zero eigenvalues, for the four different estimation method. Note that d 2−2δ1 is the growth rate of the smallest non-zero eigenvalues corresponding to the weakest factors. See also Theorem 2 below.
The sequences a k and b k will be one of the γ k and β k sequences defined above, respectively, based on the estimators.
We will impose the following sufficient conditions on the penalty function g k (·) and h k (·).
Assumption VII (Sufficient condition on the penalty functions).
where n ≺ n indicates that there is a constant C such that n < C n uniformly, and n ≺ ≺ n indicates n = o( n ). The sequence b k will be specified for different estimators.
Remark 10 (Penalty functions). The penalty functions g k and h k enter the consistency theorem below through Assumption VII. They do not have direct impact on the convergence rate of the rank estimators, as long as the condition is satisfied. Indirectly their choice interacts with the required sample size T and dimension d. Roughly speaking, Assumption VII(a) dictates that the penalty g k (d, T ) should be less than the smallest diverging eigenvalue, but large enough to correctly truncate the estimated true zero eigenvalues. The b k sequence in the assumption is taken to be one of the β k , β * k ,β k andβ * k defined above, according to the procedure used. In practice we generally do not know the the factor strengths δ 0 and δ 1 , hence may not always be able to specify a g k (d, T ) that satisfies the condition. However, the range between the upper and lower bounds is quite wide in most of the cases, especially with the additional T −1 term in the lower bound. All of the suggested g k (d, T ) listed in (7) satisfy the condition, if ν = δ 1 . Our experiments shows that setting ν = 0 in (7) is sufficient in most of the cases. Only when the true δ 1 is very large (extreme weak factors), the results become sensitive to the selection of ν. In such cases, a data driven procedure similar to that in Hallin and Liška (2007) for vector factor models may be used to estimate δ 1 . Its property for tensor factor model may need further investigation. One can also study the pattern of the rank estimates under different ν.
The condition imposed on the penalty function h k (d, T ) in Assumption VII(b) is even weaker. The upper bound goes to infinity but the lower bound goes to zero, except when both δ 0 and δ 1 are large, and T is of smaller order than d. Hence in most of the cases a (small) constant function is sufficient. If r k is the true rank, the function h k (d, T ) is designed to adjust the ratio of eigenvalueŝ λ k,j+1 /λ k,j , j > r k to be bounded below and to be around 1 (as we add h k (d, T ) on both the numerator and denominator) so that they do not accidentally be smaller thanλ k,r k +1 /λ k,r k (a number that goes to 0). Hence intuitively we do not expect the impact of h k (d, T ) to be large, which is confirmed by our empirical study. The suggested functions in (8) all satisfy the Assumption VII(b), except the extreme weak factor cases, for which the ER estimators do not perform well under any penalty function.
Assumption VII only provides broard guidance asymptotically. There is no general unique optimal penalty function. Note that if g k (d, T ) is an appropriate penalty function, then cg k (d, T ) is appropriate as well asymptotically. The same property holds for the approaches of Bai andNg (2002, 2007), Amengual and Watson (2007), Hallin and Liška (2007) and Li, Li and Shi (2017). This creates potential problems in practice with given d and T . Under certain circumstances, the empirical performance of IC estimators may heavily depend on the threshold function chosen among many alternatives; see the discussion in Hallin and Liška (2007).
The following is a set of different sample size conditions for different settings. They ensure sufficiently large sample size T so that the non-iterative (true rank) factor loading space estimator based on TOPUP or TIPUP is consistent (for (a) and (b)), or has a relatively small error (for (c) and (d)).
Assumption VIII (Condition on the sample size).
Theorem 1 presents the asymptotic properties of the IC and ER estimators in (1) and (2) based on non-iterative TOPUP, non-iterative TIPUP, iTOPUP and iTIPUP.
Theorem 1. Let 1/ϑ = 1/θ 1 +2/θ 2 as in Proposition 2. For the various rank determination estimators, if their corresponding conditions listed in Table 2 hold, In addition to the consistency of the rank estimators, we also have the following more detailed properties of the estimated eigenvalues. Letλ k,j be the eigenvalues of W k (defined in Section 3.1) such thatλ k, Theorem 2. Suppose the same conditions (I-V, VIII) in Theorem 1 hold. In an event with probability approaching 1 (as T → ∞ and d → ∞), the following holds.
(i). For estimating the true zero eigenvalues, we have, for j ≥ r k and i ≥ 1, For estimating the non-zero eigenvalues, we have, for all 1 ≤ j ≤ r k and i ≥ 1, Remark 11 (Strong factor cases). To illustrate the theorem, we consider the strong factor case δ 0 = δ 1 = 0. Here the quantities (11)-(18) can be simplified to The sample size conditions in Assumption VIII all reduces to T → ∞. And the penalty function conditions (Assumption VII) is equivalent to ( Thus, we shall expect similar performance for non-iterative TIPUP, iTOPUP and iTIPUP, but the non-iterative TOPUP may be worse. Remark 12 (The vector factor models). Theorem 1 and 2 hold for vector factor models by setting K = 1 and d = d 1 . In such a case, TOPUP is the same as TIPUP. More specifically, assuming all factors have the same strength (δ 0 = δ 1 ), a common assumption used in the literature, Theorem 2 reduces to for the vector factor model case. This is the same as the convergence rate of the estimated eigenvalues derived in Lam and Yao (2012), though our improved technical proof removed the restrictive conditions that T = O(d) and all the non-zero eigenvalues are distinct. In addition, Theorem 1 provides the rate of convergence of the rank estimators.
Remark 13. Note that our model setting is different from that used in Bai and Ng (2002) and Hallin and Liška (2007) where covariance matrix or spectral density matrix are used, instead of the auto-co-moment we use here. It is possible to extend our approach to identify the number of factors in these models, by setting h 0 = 0 in the construction of W and using an extension of Proposition 1 discussed in Remark 3. The main difference is that, in our model and with auto-co-moments, we are trying to separate non-zero and zero eigenvalues in the underlying W , while in approximate factor model and h 0 = 0, one would be trying to separate spiked and non-spiked eigenvalues. Hence a detailed analysis of the corresponding γ n and β n in Proposition 1 will be needed.
Remark 14 (Sample size requirement comparison). The sample size required for the non-iterative estimators as shown in Assumptions VIII(a,b) is of higher order than that for the iterative estimators as in Assumptions VIII(c,d). This is because, when the true ranks are used, TOPUP and TIPUP require a larger sample size to consistently estimate the true loading spaces A k than the iTOPUP and iTIPUP procedures which require only a sufficiently "good" initial estimator of the loading space, but not necessarily a consistent one . Similarly, the required sample size condition for TIPUP in Assumption VIII(b) is much weaker than that for TOPUP in Assumption VIII(a). In this regard, iterative procedures are better than the non-iterative ones, and TIPUP based procedures are better than TOPUP based ones.
Remark 15 (Convergence rate comparison). The convergence rates of the estimated eigenvalues in the iterative methods are faster than that in the non- Table 3 Comparison of convergence rate for estimating the true zero eigenvalues δ 0 condition δ 1 and T condition comparison iterative methods, especially when there are weak factors in the model. Moreover, the rate of TIPUP related procedures is also faster than that of TOPUP related procedures. This can be seen by comparing β k , γ k , β * k , γ * k withβ k ,γ k , β * k ,γ * k . For example, consider the case that all d k are of the same order and K > 1. The following table shows the comparison of the convergence rate (β's) of the estimated true zero eigenvalues, where ≺ and ≺ ≺ are defined in Assumption VII.
From Tables 3, it is clear thatβ * k is always the smallest, and β k is the largest. Similarly, Table 4 shows thatγ * k andγ k are always the smallest among these four γ's.
For the IC estimators with a fixed penalty functions g k (d, T ), a faster convergence rate of the eigenvalue estimators make the sufficient condition Assumption VII(a) easier to satisfy with smaller sample size T and/or dimension d k , 1 ≤ k ≤ K. Similarly, for the ER estimators, faster rates for estimating the true zero eigenvalue increase the gap between the estimated divergent eigenvalues and the estimated true zero eigenvalue, leading to better performance of the ER estimators.
Combining the discussion in Remarks 14 and 15, we can conclude that in general the iterative procedues are better than the non-iterative ones and the TIPUP based procedures are better than the TOPUP ones, assuming no signal cancellation when TIPUP is used (see Remark 9).

Simulation Study
In this section, we compare the empirical performance of the proposed methods and their variants under various simulation setups. We consider the identification of the number of factors based on the non-iterative TIPUP and TOPUP methods (denoted as initial estimators), the one step iterative methods (denoted as one-step estimators) and the iterative procedures after convergence (denoted as final estimator). In the iterative algorithm, at i-th iteration (i > 1), we use the order obtained by the IC or ER criteria. Iteration is stopped when both the rank estimators and the loading space estimators converge. We also check the performance of different choices of penalty function g k (d, T ) and h k (d, T ). Specifically, we consider penalty functions (7) and (8), and denote them as IC1-IC5, ER1-ER5, respectively. The empirical performance of IC1-IC5 (resp. ER1-ER5) are very similar, thus we only present IC2 and ER1 in this section. The detailed comparison are shown in Appendix B.
The simulation study consists of three parts. The first part is designed to investigate the overall performance of our methods and their comparisons under models with different factor strength. As the strength of the weakest factors is unknown, we by default set ν = 0 for all the penalty function g k (d, T ) in (7). In the second part, we investigate the case in which some factors have a dominantly strong explanatory power. The third part is the case in which we use ν = δ 1 in (7) for all IC estimators, when some factors are weak. For each case, we compute the proportion of correct identification of the rank of the factor processes or the root mean squared errors (RMSEs) of the rank estimates from 1000 simulated data sets. In Section 5.4, we study the selection of the optimal constant c in IC criteria, using the method proposed in Remark 8.
The simulation uses the following matrix factor model: Here, E t is white and is generated according to E t = Ψ 1/2 where Ψ 1 , Ψ 2 are the column and row covariance matrices with the diagonal elements being 1 and all off diagonal elements being 0.2. All of the elements in the d 1 × d 2 matrix Z t are i.i.d N (0, 1). This type of model of E t has been proposed and studied in the literature, see, for example Hafner, Linton and Tang (2020), Hoff (2011), Linton and Tang (2020). The entries f ijt in the factor matrix F t were drawn from independent univariate AR(1) model f ijt = φ ij f ij(t−1) + ijt with standard N (0, 1) innovation.

Part I: Determining strong and weak factors, using ν = 0 in g k (·)
In the first part, the following three models are studied: (M1). Set r 1 = r 2 = 5. The univariate f ijt follows AR(1) with AR coefficient   (7), under the assumption that all factors are strong, even though some of factors simulated are weak (e.g., true δ 1 = 0.4 in Model M2).
For Model M1, the results in Table 5 show clearly that, using TOPUP and IC, the initial estimator behaves very poorly even for large sample sizes. On the other hand, the one step estimator uniformly and significantly outperforms the non-iterative initial estimator. In addition, the final estimator performs the best over all choices of d 1 , d 2 and T . We also observe that the performance improves as the dimension increases, except that d 1 = 40 is not as good as d 1 = 20 when T = 100. This improvement is due to the fact that when with strong factors (δ 0 = δ 1 = 0), larger dimension (d 1 , d 2 ) provides more data points and information on the rank r k . With the same settings, IC criterion based on TIPUP determines the ranks perfectly, indicating that it is uniformly better than IC criterion based on TOPUP. This is partially due to the fact that noniterative and iterative TIPUP procedures estimate the loading matrices and the eigenvalues more accurately than the corresponding TOPUP procedures. More interestingly, with the same setting, the ER estimators based on both TOPUP and TIPUP procedures perform perfectly. For Model M2, Tables 6 reports the proportion of correct rank identification using IC and ER estimators based on TOPUP and TIPUP procedures. It is seen that, for small sample sizes (T = 100 and 300), the performance of IC estimators deteriorate when we use iterative procedures, which may indicate that the sample size T and dimension d do not meet the required Assumption VII(a) in Theorem 1. In addition, for T = 100, the IC estimators using TIPUP procedures do not work at all, though they are better when T = 300. This is due to the existence of weak factors and the fact that we use the default ν = 0 in (7). When the sample size is small, the estimators tend to identify the strong factors while miss the weak factors as their corresponding eigenvalues are relatively small and comparable to the penalty function g k (·). The IC estimators using TOPUP procedures performed much better for small sample sizes. The performance also becomes worse as d 1 and d 2 increases, also due to the existence of weak factors. The ER estimators based on TIPUP procedures show almost perfect accuracy, even there are weak factors in the model. We note that, even with weak factors, the performance remains almost the same with larger (d 1 , d 2 ). Again, accuracy improves by using the iterative procedure. The ER estimator based on TIPUP procedures are much better than all the other estimators for initial estimator one step estimator final estimator (r1,r2) IC2-TOP IC2-TIP IC2-TOP IC2-TIP IC2-TOP IC2- Model M2. Table 7 shows the more detailed identification results using IC2 and ER1 estimators for Model M2 over 1000 replications, based on TOPUP and TIPUP procedures. The sample size is T = 300 and the data dimension is (d 1 , d 2 ) = (80, 80). The true rank pair is (5,5) for Model M3. From the table, it is seen that the IC procedures tend to under-estimate the number of factors, with the correspoding iterative procedures perform the worst. The ER estimators using the TOPUP procedures are likely to pick up only the strong factors, as the gap between strong factors (δ 1 = 0) and weak factors (δ 1 = 0.4) may be larger than that between weak factors and true zero eigenvalue estimations. We note that the outstanding performance of the ER estimators using the TIPUP procedures is quite different from the performance of a similar ER estimator in vector factor models under similar mixed strong and week factor cases (see e.g. Lam and Yao (2012)). The main reason is that the other tensor modes provide additional information and in certain sense serve as additional samples. Then, for each k ≤ K, the signals of all divergent eigenvalues depend on d instead of d k , leading to larger gap between weak factors and true zero eigenvalue estimations.
For Model M3 with all very weak factors (δ 0 = δ 1 = 0.6), Table 8 reports RM-SEs of the ER1 estimators. The results of using IC estimators (not shown here) are significantly worse than that of the ER estimators due to the difficulty of IC estimators in dealing with weak factors when ν = 0 is used. It is seen from the table that ER estimators based on TIPUP procedure outperform that based on TOPUP procedure. We also see that the iterative algorithms improves the performance very significantly under this very weak factor case. The performance varies with the change of (d 1 , d 2 ) in a non-standard way, as the performance with (40, 40) seems to be better than that with (20, 20) and (80, 80). We note that in weak factor cases, a large d k will potentially reduce the accuracy of the estimator of r k , since the signal level on the k-th dimension becomes weaker. On the other hand, a larger d i (i = k) in the other dimension potentially improves the estimation of r k , since we have more "repeated" observations to be used for   estimating r k . Table 9 shows the relative frequency of different estimated ranks of the ER1 estimator based on both TIPUP and TOPUP procedures, for the case of T = 300 and (d 1 , d 2 ) = (80, 80) under Model M3 with the true rank (5, 5). It is seen that the ER estimators tend to overestimate the number of factors, when all the factors are weak. All ER1-TOPUP estimators essentially identify (6, 6) as the rank. The non-iterative ER1 estimator based on TIPUP procedure can overestimate the ranks by a large margin. However, the iterations can gradually correct the over-estimation.
In summary, the first part of simulation shows that the ER estimators and IC estimator based on TIPUP procedures perform very well when all the factors are strong. The ER estimators significantly outperform the IC estimators when some or all factors are weak. Different from the results shown in Lam and Yao (2012) for vector factor models with both strong and weak factors, in tensor factor models, the ER estimators (based on TIPUP procedure) are able to determine the correct number of factors in many cases. The results also show that the iterative procedure significantly improves the performance except the IC estimators based on TOPUP in Model M2, which may due to that the sample size T and dimension d do not meet the required Assumption VII(a) in Theorem 1. It also shows that the estimators based on TIPUP perform better than that based on TOPUP in general. Hence, when some factors are weak, the iterative ER estimators based on TIPUP are the choice.

Part II: The case of dominating strong factors
The second part of our simulation examines the effects of dominate strong factors on the IC and ER estimators. The data are generated from the following model, (M4). Set r 1 = r 2 = 2. The univariate f ijt follows AR(1) with AR coefficient φ 11 = 0.98 and φ 12 = φ 21 = φ 22 = 0.15; The elements of the loading matrices A 1 and A 2 are i.i.d N (0, 1).
We fix d 1 = d 2 = 40 and T = 200. Again, we use IC2 and ER1 for demonstration, and assume ν = 0 in the penalty function in (7). Although all of the four factors are strong factors, the strongly imbalanced signal strength in F t makes one of factors dominating the others in explanatory power. Table 10 reports the relative frequencies of estimated rank pairs over 1000 replications. It is seem that the ER estimators are very likely to pick up only the dominate factor, although iterations significantly improves the accuracy. The IC estimators performs much better in this case. And over all, estimators based on TIPUP perform better than the corresponding estimators using TOPUP. Overall, IC-TIPUP performs the best in this case.

Part III: Using the correct penalty function in the IC estimators
The third part of the simulation considers the impact of penalty function selection. We consider Model M2 and M3 again, with d 1 = d 2 = 40, but use the true weakest factor strength δ 1 as ν in the penalty function g k (T, d) in (7) instead assuming strong factor and use ν = 0 as in Section 5.1. To compare with Tables 6 for M2 and 8 for M3, we report the proportion of correct rank identification using IC2 estimators in Table 11 for M2 and RMSEs of the IC2 estimators in Table 12 for M3. Comparing Tables 6 and 11, it is seen that the performance of IC2 estimators using TIPUP improve by using the correct ν in the penalty function. We notice that the initial and one step IC estimators using TOPUP with the correct ν in the penalty term actually under-perform the ones with the incorrect ν. The reason for this unusual behavior of TOPUP is unclear. It might be that the magnitude of the estimation of true zero eigenvalue is much larger in this mixed weak and strong factor case, as shown in Figure 1. Hence, reducing the penalty term by using ν = δ 1 = 0.4 resulting in severe over-estimation. However, the TIPUP estimators estimate the true zero eigenvalues more accurately, hence are able to take advantage of the more accurate penalty term.
For Model M3 with all weak factors (δ 0 = δ 1 = 0.6), all IC estimators using the wrong penalty function g k (T, d) with ν = 0 in (7) identified (1, 1) rank pair in all 1000 simulations for all sample sizes. Comparing it with the result shown in Table 12, the importance of using the right ν in the penalty function is obvious in this all-weak factor case. Table 12 also shows that IC estimator  using TIPUP outperforms that using TOPUP, when the right penalty function is used. Comparing Tables 8 and 12, the iterative IC estimators out-perform the ER estimators when T = 300, 500, 1000. It shows the great potential of IC estimators using a proper ν in the penalty function under the weak factor cases. As mentioned earlier, more investigation is needed to determine the proper value for ν.

Selection of the optimal c in IC criteria
To study the empirical property of the optimal constant c discussed in Remark 8, we simulate data from Model M1 with r 1 = r 2 = 5, and set d 1 = d 2 = 80, T = 300 and set the upper bound of the rank as m * = 10. In Figure 2, we show respectively the behavior of r k,c,j as a function of (d 1,j , d 2,j ), and of r k,c,J and of S k,c as functions of c, when setting T 1 = T 2 = · · · = T J = T and d 1,j = d 2,j . The rank r k of the factor process can be determined by considering the mapping c → S k,c and choosing r k,c,J = r k, c,J , where c belongs to an interval of c implying S k,c ≈ 0 and therefore the value of r k,c,J is a constant function of c. Similar to Hallin and Liška (2007) and Alessi, Barigozzi and Capasso (2010), we see that the second stability interval always delivers an estimated number r k,c,J which is closer to the true r k = 5 than the number suggested by the other intervals. That is, the smallest values of c for which r k,c,j is also close to a constant function of j, j ≤ J. Note that the first stability interval always corresponds to the predefined upper bound m * and it is thus a non-admissible solution.

Real data analysis
In this section, we illustrate the proposed procedures using the Fama-French 10 by 10 monthly return series as an example. According to ten levels of market capital (size) and ten levels of book to equity ratio (BE), stocks are grouped into 100 portfolios. The sampling period used in this excises is from January 1964 to December 2015 for a total of 624 months. There are overall 62,400 observed monthly returns used in this analysis. The data is from http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data library.html. Similar to Wang, Liu and Chen (2019), we subtract from each of the series their corresponding monthly excess market return.
The data was used by Wang, Liu and Chen (2019) to demonstrate the estimation of a matrix factor model using a non-iterative TOPUP procedure. They used an estimator similar to our non-iterative TOPUP based ER estimator without a penalty term to estimate the number of factors and found the estimator suggested (1, 1) as the rank -a single factor in the model. In the end they demonstrate the model using (2, 2) as the ranks for demonstration of the estimation procedures. Figure 3 shows the 1st to 3rd largest eigenvalues of using the non-iterative TIPUP and TOPUP procedures, h −1 0τ * 2 k,m k and h −1 0τ 2 k,m k (m k ≤ 3), under different lag values h 0 . The pattern of eigenvalues using the iterative TIPUP and TOPUP are similar, thus is omitted. It can be seen from panel (a) of Figure 3 that, using TIPUP procedure, the 1st and 2nd largest eigenvalues reach their maximum value at h 0 = 1, and tends to decrease as h 0 increases. However, the 3nd largest eigenvalue reaches its maximum at h 0 = 2. In contrast, from panel (b) of Figure 3, using TOPUP procedure, the 1st to 3nd largest eigenvalues reach the maximum at h 0 = 1. The difference of the patterns of estimated singular values indicates possible severe signal cancellation when using h 0 = 1, according to the suggestions in Han et al. (2020). Hence, we choose h 0 = 2, and consider IC and ER estimators based on the iterative TIPUP procedure. Figure 4 shows the estimated rankr k of the core factor process with different number of iterations, using IC2(TIPUP) with ν = 0 (left figure) and ER1(TIPUP) (right figure). It is seen that the iterative algorithms converge very quickly.   Table 13 shows the estimated rank pairs using different IC(TIPUP) estimators and different ν parameter for the penalty function. It is seem that IC1 and IC3 tend to select larger models. These rank estimates do not change when we use h 0 = 3 and 4.
On the other hand, ER1-ER5 in (8) produce exactly the same rank estimate (1, 3) using h 0 = 2. But these rank estimates change to (1, 1) when we use h 0 = 3 and 4. Figure 5 shows the estimated eigenvalues τ * 2 k,m k , 1 ≤ m k ≤ d k , using the non-iterative initial TIPUP procedure, for k = 1 (size factor) and k = 2 (BE factor). It is seen that for the size factor, the largest eigenvalue is more than 20 folds larger than the second largest eigenvalue. As simulation results in Section 5.2 show, in such an unbalanced case, the ER estimator may find it difficult to find the gap between the true non-zero eigenvalues and the true zero eigenvalues, based on the ratio of the eigenvalues. On the other hand, the IC estimator may fare better in such cases since it is based on the level of estimation error of the true zero eigenvalue.
Overall, it seems that (2, 2) or (3, 3) are possibly good choices. More detailed analysis, include goodness-of-fit measures, prediction performance and result interpretation, is needed.

Discussions
In this paper, we develop two rank identification estimators, in an attempt to fill a gap on modelling tensor factor model in the literature. Non-iterative and iterative IC and ER estimators, based on similar ideas of the TOPUP and TIPUP procedures of Chen, Yang and Zhang (2021) and Han et al. (2020) are considered. Theoretical analysis shows that the iterative estimators are much better than the non-iterative estimators. We show that in general the estimators based on TIPUP procedures are better than that based on the TOPUP procedure, due to its fast convergence rate of the estimated eigenvalues under proper conditions. However, in situations when TIPUP procedures also lead to significant signal cancellation, extra care needs to be taken, including increasing the maximum lag h 0 in the procedure.
Simulation studies are conducted to compare the finite sample performance of the estimators using the non-iterative and iterative estimation procedures. The results show that the ER estimators based on both TIPUP and TOPUP procedures, and the IC estimators based on TIPUP procedures generally perform very well when all the factors are strong. The ER estimators are better than the IC estimators when some factors are weak, unless one chooses the precise tuning parameter ν in the IC penalty function, which is a difficult task. When some dominant factors have unrealistically high explanatory power, the ER estimators may not perform well. But the IC estimators still work very well, since the factors are strong. In summary, IC estimator based on iTIPUP shall be used to estimate the number of strong factors, while ER estimators based on iTIPUP are likely to capture weak factors.

Appendix A: Proofs
It suffices to consider K = 2 as the TOPUP and TIPUP begin with mode-k matrix unfolding. We observe a matrix time series with X t = A 1 F t A 2 + E t = G t + E t ∈ R d1×d2 . Let U 1 , U 2 be the left r 1 , r 2 singular vectors of A 1 and A 2 , respectively. Recall is kronecker product and ⊗ is tensor product. Without loss of generality, we only consider the case k = 1. Denote E(·) = E(·|{F 1 , ..., F T }). For simplicity, write Although we present Theorem 1 for the IC and ER estimators together, the proofs of the IC and ER estimators using the same estimation procedure are very similar. Thus, we first prove Theorem 1 based on TOPUP, and then move to iTOPUP, TIPUP and iTIPUP in sequence.
Proof. Under Assumptions I, II, III, IV, V(a) and Proposition 2, as the derivation of Theorem 1 in Han et al. (2020), in an event Ω 11 ∩ Ω 0 with P( Applying Lemma 17 and Theorem 1 in Chen, Yang and Zhang (2021), for sufficient large T , there exists a matrix Q ∈ R (d1−r1)×r1 such that is an estimator for U 1 with U 1Ũ 1 =Û 1Û 1 . Elementary calculation shows that It follows that In view of (27), we have (25). For (26), notice that in the event Ω 11 ∩ Ω 0 , for all r 1 + 1 ≤ j ≤ d 1 ,

Lemma 3. Suppose Assumptions I, II, III, IV and V(a) hold. Then, in an event
for some constant positive C depending on K only.

Lemma 4. Suppose Assumptions I, II, III, IV and V(a) hold. In an event
As m 1 is fixed, it is sufficient to prove for each m with m > r 1 , in the event Ω 1 , In the following, we shall only work on Ω 1 . Elementary calculation shows that := I + II + III + IV.
Note that III = IV.
We first consider II.
We consider each term in turn. Note that whereŨ 1 is defined in Lemma 2. Then, by Lemma 2 and 3, by Lemma 2 and 3. Combing the bounds of II 1 , II 2 and II 3 , we have Next, we consider III.
Note that III = IV.
We first consider I.
For I 1 , It follows that where e j is a r 1 × 1 vector with 1 at j-th element and 0 otherwise. Thus, we can obtain By Proposition 2, in the event Ω 0 , σ r1 U 1 E mat 1 (TOPUP 1 ) = τ 1,r1 d 1−δ1 , with 0 ≤ δ 0 ≤ δ 1 < 1. Hence, there exists a constant c m > 0 such that in the event Ω 1 , By Lemma 3, This implies that Next, we consider II.
By Lemma 3, Using the same decomposition (42), Similarly, Combing II 1 , II 2 and II 3 , we have Next, we consider III.
Again, we bound each term in turn. Using the decomposition (42), adopting same arguments in the proof of II, we have, Lipschitz function in (E 1 , . . . , E T ). Then, by Gaussian concentration inequalities for Lipschitz functions, As T ≥ 4h 0 and K = 2, this implies with x √ d 1 that in an event Ω a with at least probability 1 − e −d1 /6, with a constant C 0 depending on K only. Then, by Proposition 2, in the event Ω a ∩ Ω 0 , (62) follows. Similar arguments yield (63), (64) and (65) in the event Ω a ∩ Ω 0 . We split the sum into two terms over the index sets, Note that III = IV.
We first consider II.
Note that III = IV.
We first consider I.
Note that by Han et al. (2020), in the event Ω 4 , at i-th iteration (i ≥ 1), for the iTIPUP estimator, assume that Û (i) The proofs of Lemma 13, 14 and 15 are omitted, since they are similar to Lemma 6, 7 and 8.
Proof of Theorem 1 for iTIPUP . Theorem 1 for iTIPUP part follows by similar arguments in the proofs of Theorem 1 for non-iterative iTIPUP.

A.5. Proof of propositions and Theorem 2
Proof of Proposition 2.

Appendix B: Techinical lemmas
Then, the following equivalent forms hold Proof. The singular values σ j ofV V correspond to eigenvalues ± 1 − σ 2 j for VV − V V .
Lemma 17. Suppose A and A + E are n × n symmetric matrices and that is an orthogonal matrix such that span(Q 1 ) is an invariant subspace for A, where Q 1 ∈ R n×r and Q 2 ∈ R n×(n−r) . Partition the matrices Q AQ and Q EQ as follows: If sep(D 1 , D 2 ) = min λ∈λ(D1),μ∈λ(D2) |λ − μ| > 0, where λ(M ) denotes the set of eigenvalues of the matrix M , and E 2 ≤sep(D 1 , D 2 )/5, then there exists a matrix P ∈ R (n−r)×r with with at least probability 1 − 2e −x 2 /2 for all x ≥ 0.

Appendix C: Additional simulation results
In this section, we show detailed comparison among IC1-IC5, and ER1-ER5 for the first part and third part simulation in Section 5. We also study a strong factor model with r 1 = r 2 = 2.
For Model M0 with small r 1 = r 2 = 2 and all strong factors, all versions of our methods show perfect accuracy for all sample sizes except T = 100 when some versions show error rates less than 1%. For Model M1, the results in Table  14 show clearly that, among all five penalty functions, IC2 and IC4 seem to be slightly better than the others. With the same settings, the determination of  T  IC1  IC2  IC3  IC4  IC5  IC1  IC2  IC3  IC4  IC5  IC1  IC2  IC3  IC4 Tables 15 and 16 report the proportion of correct rank identification using IC estimators based on TOPUP and TIPUP procedures, respectively. Tables 17 and 18 report the proportion of correct rank identification using ER estimators based on TOPUP and TIPUP procedures, respectively. 0 . 9 9 6 1 0 . 9 9 5 1 1 0 . 9 9 5 1 0 . 9 9 3 1  T  IC1  IC2  IC3  IC4  IC5  IC1  IC2  IC3  IC4  IC5  IC1  IC2  IC3  IC4       In the third part of simulation, Table 21 shows the proportion of correct rank identification of the IC estimators in (7) with known δ 1 for Model M2. Moreover, table 22 reports the RMSEs of the IC estimators in (7) with known δ 1 for Model M3. Again, given δ 1 , the performance of all IC1-IC5 is also very similar. In some cases, IC2 and IC4 are slightly better.