Regularized high dimension low tubal-rank tensor regression

Abstract: Tensor regression models are of emerging interest in diverse fields of social and behavioral sciences, including neuroimaging analysis, neural networks, image processing and so on. Recent theoretical advancements of tensor decomposition have facilitated significant development of various tensor regression models. The focus of most of the available literature has been on the Canonical Polyadic (CP) decomposition and its variants for the regression coefficient tensor. A CP decomposed coefficient tensor enables estimation with relatively small sample size, but it may not always capture the underlying complex structure in the data. In this work, we leverage the recently developed concept of tubal rank and develop a tensor regression model, wherein the coefficient tensor is decomposed into two components: a low tubal rank tensor and a structured sparse one. We first address the issue of identifiability of the two components comprising the coefficient tensor and subsequently develop a fast and scalable Alternating Minimization algorithm to solve the convex regularized program. Further, we provide finite sample error bounds under high dimensional scaling for the model parameters. The performance of the model is assessed on synthetic data, and the model is also used in an application involving data from an intelligent tutoring platform.

A typical setting for tensor regression is as follows: one has access to predictors whose structure can be succinctly represented by a tensor and an outcome variable of interest. For example, [22] consider three-dimensional Magnetic Resonance Imaging (MRI) scans from subjects in an Alzheimer's Disease study, together with their corresponding mini-mental state exam scores. A regression model to associate the information contained in the scans and the outcome scores takes the form y = ⟨B, X⟩ + ε, wherein B ∈ R^{d1×d2×d3} is the regression coefficient tensor that captures the association between the score y and the tensor predictor (image scan) X ∈ R^{d1×d2×d3}. Note that the inner product between the coefficient tensor and the predictor tensor is defined as

⟨B, X⟩ = Σ_{i1=1}^{d1} Σ_{i2=1}^{d2} Σ_{i3=1}^{d3} B_{i1,i2,i3} X_{i1,i2,i3},   (1)

where X_{i1,i2,i3} and B_{i1,i2,i3} are the (i1, i2, i3)-th elements of X and B, respectively. The number of parameters in B increases rapidly and can easily become larger than the available sample size, thus requiring some form of regularization to estimate the regression coefficient. To that end, [42] proposed a generalized linear model with a tensor predictor in the systematic component and utilized a CP decomposition to reduce the high dimension of the coefficient tensor B. [22] considered a regression model with a multivariate (vector) response and a tensor predictor and assumed that the regression coefficients admit a sparse CP decomposition. Some other work dealing with sparse CP decompositions includes [2], [10] and [31]. [34] extended the regression model to accommodate a tensor response, assumed the previously mentioned sparse CP decomposition on the regression coefficient ([22]) and also developed an alternating updating algorithm to obtain a (local) minimizer of the underlying non-convex optimization problem. [21] extended the work of [42] by using a Tucker decomposition, which is a generalized version of the CP decomposition.
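For concreteness, the inner product (1) can be computed in a few lines of numpy; the dimensions below are illustrative choices, not values from the paper.

```python
import numpy as np

# Inner product between a coefficient tensor B and a predictor tensor X:
# <B, X> = sum over (i1, i2, i3) of B[i1, i2, i3] * X[i1, i2, i3].
rng = np.random.default_rng(0)
d1, d2, d3 = 4, 3, 2                       # illustrative dimensions
B = rng.standard_normal((d1, d2, d3))
X = rng.standard_normal((d1, d2, d3))

inner = np.sum(B * X)                      # elementwise product, then sum
# equivalently: np.tensordot(B, X, axes=3)
```

The elementwise-product form makes clear that the tensor inner product is just the Euclidean inner product of the vectorized tensors.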
[32] presented a general convex optimization approach for solving tensor regression problems by applying convex and weakly decomposable regularizers on the regression coefficient. They established the weak-decomposability of the sparsity regularizer (both element-wise and group-wise) and low-rankness regularizer (nuclear norm, defined through its dual norm) and derived their theoretical results for these two special cases. While a sparse CP decomposed structure on B enables estimation with relatively small sample size, in many applications it may not be particularly suitable, since the data may exhibit more complex structure. For example, imaging data over different time points can be considered as a three-dimensional predictor, while the response can be a clinical outcome (see [22]). The images may have similar baseline effects across all the time points, while there might be selected segments of the images showing some additional effects, which are specific to selected time points only. A sparse CP decomposed regression model does not account for this structure, that in turn may not help researchers obtain good scientific insights from their data.
Another example relates to online commerce data, where the predictor can be organized as a three-dimensional tensor with one dimension corresponding to product categories (books, electronics and clothing accessories), the second dimension to ratings of these products (different types of ratings, including "fashionable", "durable" and so on) and the third dimension to age groups of the customers, as shown in Figure 1 (see Example 2 in [1]). The response can be the gross sales of the company. It is meaningful to decompose the effects of the coefficient tensor into two parts. In the first component, all the "rating type -age group" combinations share a similar baseline effect across the product categories. On the other hand, there might be some of the "rating type -age group" combinations, for which additional effects are present only in some specific categories. For instance, the combination of rating type label "fashionable" and age group range "16 years -30 years" should have additional effects in the clothing accessories category. Similarly the combination of label "durable" and "31 years -50 years" demands additional effect in the electronics category.
In this paper, we propose a tensor regression model with a scalar response and a third-order tensor predictor, and consider a low-dimensional structure on the coefficient tensor suitable for the previously mentioned examples. The key technical assumption is that one component of the regression coefficient B exhibits low tubal rank, a concept introduced for tensor decompositions in [14] and [13] and briefly explained next. As depicted in Figure 2, in the case of a third-order tensor, the lateral slices, horizontal slices and tube fibers (see [16] for a pictorial illustration) play the roles of columns, rows and elements of a matrix, respectively. The question addressed in the aforementioned papers is: "How does one extend the well known concepts of matrix algebra, e.g., linear combination, linear dependence, rank and so on, to this case?" Their key technical development lies in the novel concept of the t-product [Definition C.1] between a lateral slice (or horizontal slice) and a tube, which is the tensor counterpart of the product between a column (or row) and a scalar. Along the same lines, the t-linear combination [Definition C.7] of lateral slices and the range of the tensor are defined, which can be conceptualized as the tensor counterparts of the linear combination of the columns and the column space, respectively. Finally, [14] shows that the number of elements needed to generate any element in the range is the same as the tubal rank [Definition C.9]. Thus, just like the rank of a matrix, the tubal rank of a third-order tensor determines the number of lateral (and horizontal) slices that are t-linearly independent (see the discussion after Definition C.7, Figure 3 and Section 2 for more details).
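The t-product between a lateral slice and a tube can be sketched numerically: each tube of the result is the circular convolution of the corresponding tube of the slice with the given tube, which is conveniently computed in the Fourier domain. The dimensions and variable names below are illustrative.

```python
import numpy as np

# t-product of a lateral slice A (d1 x 1 x d3) with a tube c (length d3):
# each tube of the result is the circular convolution of the matching tube
# of A with c, computed here via an FFT along the third mode.
rng = np.random.default_rng(1)
d1, d3 = 3, 4
A = rng.standard_normal((d1, 1, d3))   # lateral slice
c = rng.standard_normal(d3)            # tube

A_hat = np.fft.fft(A, axis=2)
c_hat = np.fft.fft(c)
out = np.real(np.fft.ifft(A_hat * c_hat, axis=2))   # d1 x 1 x d3 result

# Direct circular convolution of the first tube, for comparison:
direct = np.array([sum(A[0, 0, k] * c[(t - k) % d3] for k in range(d3))
                   for t in range(d3)])
```

The FFT route is the standard way the t-product is implemented, since circular convolution becomes pointwise multiplication in the Fourier domain.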
In addition to the low-dimensional structure expressed through a low tubal rank, and in order to capture idiosyncratic effects, we assume that the coefficient tensor B can be decomposed as B = L + S, wherein the first component L is a low tubal-rank tensor, whose baseline effects are shared across the slices, and the second component S is sparse, reflecting additional idiosyncratic effects (see Figure 1).
The key novel contribution of this paper is to develop the algorithm and technical tools to obtain estimates of the two components of the regression coefficient B and establish a non-asymptotic upper bound to their estimation error. This type of decomposition has been employed earlier in another line of research that deals with tensor recovery (see [23]). However, to the best of our knowledge, the proposed methodology and the subsequent theoretical analysis are new in the context of tensor regression. [20] considered a tensor regression model with sparse tubal-regularized penalization. However, the latter paper is motivated by a different perspective vis-a-vis the current work. As mentioned in the previous paragraph, the current work characterizes the baseline effects as the low tubal-rank tensor and the idiosyncratic effects as the sparse tensor. Thus, this can be conceptualized as the third-order generalization of the matrix low-rank plus sparse approach in [1]. On the other hand, [20] does not consider any such decomposition of their coefficient tensor into baseline (L in our case) and idiosyncratic parts (S in our case). Rather, in order to reduce the dimension of the regression coefficient, they simply assume both low tubal-rank and sparse structure on the coefficient tensor W. Consequently, the objective functions considered in the two papers are different. Finally, in terms of theoretical developments, a novel incoherence condition is needed for identifiability of the two components in our case and we also provide a detailed derivation and interpretation of the non-asymptotic upper bound of the estimation error. The bounds, as discussed in Section 3, are in line with the matrix case results [1]. On the other hand, [20] did not make any such attempt in their paper.
The remainder of the paper is organized as follows: In section 2, we develop and provide intuition on how to interpret the low tubal rank regression model. Section 2.1 discusses the convex relaxation and presents a block alternating minimization algorithm to estimate the unknown model parameters from data. Theoretical results related to estimation error bounds are discussed in Section 3. In Sections 4 and 5, we evaluate the performance of our model on synthetic and real data, respectively. Finally, we conclude with a discussion in Section 6. Background material on the t-product and the corresponding t-Singular Value Decomposition and related concepts are provided in Appendix C.

Notation:
The order of a tensor is the number of its dimensions. For a tensor of order N, we use d1, d2, ..., dN to denote the size of the tensor along each of the N dimensions. Throughout the paper, tensors of order three or higher are denoted by boldface Euler script letters, e.g., X. We write X ∈ R^{d1×d2×···×dN} to represent an N-th order tensor of size d1 × d2 × ··· × dN. For any third-order tensor X, X_{ijk} denotes its (i, j, k)-th element. One-dimensional sections of a third-order tensor X, namely, the column fiber, row fiber and tube fiber (see Figure 2.1 of [16]), are denoted by x_{:jk}, x_{i:k} and x_{ij:}, respectively. Similarly, the two-dimensional sections, namely, the horizontal slice, lateral slice and frontal slice (see Figure 2.2 of [16]), are denoted by X_{i::}, X_{:j:} and X_{::k}, respectively. As illustrated in Figure 2, the lateral slices, horizontal slices and tube fibers can be visualized as columns, rows and elements of a matrix. For any matrix A ∈ R^{d1×d2}, whose (i, j)-th element is denoted by a_{ij}, the Frobenius norm is defined as

‖A‖_F = ( Σ_{i=1}^{d1} Σ_{j=1}^{d2} a_{ij}² )^{1/2}.

S. Roy and G. Michailidis
Similarly, the ℓ_{2,∞} norm of A is given by

‖A‖_{2,∞} = max_{1≤j≤d2} ‖a_{:j}‖_2,

that is, the maximum of the ℓ2 norms of the columns of A.

Low tubal-rank tensor regression model and its estimation
Let y ∈ R be a scalar response and X ∈ R^{d1×d2×d3} be a third-order tensor predictor. We propose the following tensor regression model:

y = ⟨B*, X⟩ + ε.

The coefficient tensor B* ∈ R^{d1×d2×d3} captures the association between the scalar response and the tensor predictor. The inner product is defined as in (1). Further, it is assumed that ε ∼ N(0, σ²). To deal with the large number of regression coefficients, we posit that B* = L* + S*, where L* and S* are characterized by two complementary types of low-dimensional structure, discussed next.
The L* component corresponds to a low tubal-rank tensor. As shown in [13] and discussed in Section 1, the tubal rank of a third-order tensor is analogous (in the appropriate algebra, see Appendix C) to the rank of a matrix. Hence, analogously to the fact that low rankness of a matrix implies linear dependence among its columns and rows, a low value of the tubal rank characterizes a similar type of dependence, namely t-linear dependence (see Definition C.7 and the ensuing discussion), among the lateral and horizontal slices. For the posited model, we select the dimension across the lateral slices to impose low-rankness. However, one can always reorient the tensor and thus impose the low-rankness assumption across any of the dimensions. The aforementioned dependence is governed by the concepts of the t-product and the t-linear combination introduced in [13]. Although the formal definitions are deferred to Appendix C, Figures 3 and 4 illustrate the key ideas. As Figure 3 depicts, a lateral slice is said to be t-dependent on another if the former can be expressed as the t-product between the latter and a suitable tube. Using the definition of the t-product, Figure 3 also provides an alternative representation of t-dependence in terms of a block-circulant matrix [see Notation B.2]. Figure 4 provides some numerical examples to show the relation between the tubal rank and t-dependence among the lateral slices. Based on this brief discussion, the purpose of the first component L* is to capture similar baseline effects.
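A small numerical illustration of the connection between t-linear dependence and the tubal rank, using the standard FFT-based characterization of the t-product (in the Fourier domain the t-product becomes slice-wise matrix multiplication, and the tubal rank equals the maximum rank of the frontal slices of the Fourier-transformed tensor). The dimensions below are illustrative.

```python
import numpy as np

# A tensor built as the t-product L = U * V of a (d1 x r x d3) tensor with an
# (r x d2 x d3) tensor has tubal rank at most r: in the Fourier domain the
# t-product reduces to d3 independent matrix products of rank <= r.
rng = np.random.default_rng(2)
d1, d2, d3, r = 6, 5, 4, 2
U = rng.standard_normal((d1, r, d3))
V = rng.standard_normal((r, d2, d3))

U_hat = np.fft.fft(U, axis=2)
V_hat = np.fft.fft(V, axis=2)
L_hat = np.stack([U_hat[:, :, k] @ V_hat[:, :, k] for k in range(d3)], axis=2)
L = np.real(np.fft.ifft(L_hat, axis=2))

# tubal rank = max rank of the frequency-domain frontal slices
tubal_rank = max(np.linalg.matrix_rank(np.fft.fft(L, axis=2)[:, :, k])
                 for k in range(d3))
```

With generic Gaussian factors the tubal rank equals r exactly, mirroring the matrix fact that a product of d1 × r and r × d2 Gaussian matrices has rank r.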
The second component, S*, consists of the additional effects, which might be present only in some specific lateral slices; they can be at the element, row or column level within each slice. For the posited model, it is assumed without loss of generality that the slice-specific effects are present column-wise in S*. However, one may also assume either of the other two possibilities with a few minor adjustments, as discussed in the sequel. It is interesting to note that one can have an alternative representation of S*, which is easier to work with, namely that its frontal slices are column-wise sparse.
Based on the above discussion, we rewrite the regression model in more explicit form as follows:

y = ⟨L* + S*, X⟩ + ε,   (2)

where L* is a low tubal-rank tensor, S* is a tensor whose frontal slices are column-wise sparse and ε ∼ N(0, σ²). The goal becomes to estimate both L* and S* based on available data. Note that, similarly to the matrix case [1], an additional constraint is needed in order to make the model identifiable. Specifically, L* needs to be "incoherent" with S*, an issue addressed in Section 2.1.

Estimation of the tensor regression coefficient
Analogously to the matrix case, the tubal rank is non-convex. Hence, for estimation purposes we aim to leverage a convex relaxation. To that end, from equation (31) and the related discussion in Appendix B, it can be seen that (1/d₃)‖Circ(L*)‖_* corresponds to such a convex relaxation, where Circ(L*) is the block-circulant matrix associated with the tensor L* (see Notation B.2 in the Appendix) and ‖·‖_* denotes the matrix nuclear norm. Indeed, it is an alternative form of the Tensor Nuclear Norm defined in [26] (see Definition 7 in [26] and Equation (12) in [25]), and imposing a restriction on this norm translates into an analogous restriction on the tubal rank. Further, since S* consists of column-wise sparse frontal slices, one can simply treat each frontal slice as a matrix and employ the usual column-wise ℓ_{2,1} norm for each of them. Hence, we consider the regularizer Σ_{k=1}^{d₃} ‖S_{::k}‖_{2,1} to constrain the sparse component. The objective function for estimating the tensor regression coefficient B, based on the posited low tubal-rank and structured sparse decomposition, is given by:

min_{L,S} f(L, S) = (1/2n) Σ_{i=1}^{n} ( y_i − ⟨L + S, X_i⟩ )² + (λ_L/d₃) ‖Circ(L)‖_* + λ_S Σ_{k=1}^{d₃} ‖S_{::k}‖_{2,1},   (4)

wherein λ_L and λ_S are non-negative regularization parameters corresponding to the low tubal-rank and sparse components, respectively. The factor 1/d₃ can be interpreted as follows: for any third-order tensor in R^{d1×d2×d3}, the associated block-circulant matrix consists of d₂ blocks of columns. Each block corresponds to a particular lateral slice and contains all of the d₃ possible block-circulant arrangements of that lateral slice. Hence, besides inducing inter-slice t-dependence, the penalty on ‖Circ(L)‖_* also induces intra-slice dependence among the d₃ block-circulant arrangements of the slice. The factor 1/d₃ thus adjusts for this additional penalization.
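The two regularizers can be sketched in numpy. The block-circulant construction and the Fourier-domain shortcut below follow the standard identity that the DFT block-diagonalizes block-circulant matrices, so the nuclear norm of Circ(L) equals the sum of the nuclear norms of the Fourier-domain frontal slices; the dimensions are illustrative.

```python
import numpy as np

# Penalty terms of the objective: (1/d3)||Circ(L)||_* and the summed
# column-wise (2,1) norms of the frontal slices of S.
rng = np.random.default_rng(3)
d1, d2, d3 = 3, 4, 3
L = rng.standard_normal((d1, d2, d3))
S = rng.standard_normal((d1, d2, d3))

def circ(T):
    """Block-circulant matrix (d1*d3 x d2*d3): block (i, j) is frontal
    slice (i - j) mod d3 of T."""
    d1, d2, d3 = T.shape
    blocks = [[T[:, :, (i - j) % d3] for j in range(d3)] for i in range(d3)]
    return np.block(blocks)

nuc_direct = np.linalg.norm(circ(L), ord='nuc') / d3

# Fourier-domain shortcut: the DFT block-diagonalizes Circ(L), so its
# nuclear norm is the sum of nuclear norms of the FFT frontal slices.
L_hat = np.fft.fft(L, axis=2)
nuc_fft = sum(np.linalg.norm(L_hat[:, :, k], ord='nuc')
              for k in range(d3)) / d3

# column-wise (2,1) norm, summed over the frontal slices of S
l21 = sum(np.linalg.norm(S[:, :, k], axis=0).sum() for k in range(d3))
```

The FFT route avoids ever forming the d₁d₃ × d₂d₃ matrix, which matters for scalability.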
Before presenting an algorithm for estimating (L*, S*), we address the issue of identifiability of these parameters. In the matrix case, an incoherence condition is required and is usually operationalized through conditions on the singular vectors of the low rank component obtained from the SVD (see, e.g., [6], [4] and [37]). We adapt the approach used in [1] to the low tubal rank tensor L* and the structured sparse tensor S*, and require that the columns of Circ(L*) satisfy a "spikiness" bound governed by some fixed parameter α > 0. Note that, based on Proposition A.1, a low tubal rank for L* translates to a low matrix rank for Circ(L*) and vice versa. Hence, the nature of the posited incoherence constraint follows from that for the matrix case. Specifically, by imposing this "spikiness" restriction on the columns of Circ(L*), one can ensure a sufficient number of non-zero columns in Circ(L*) and thus in each of the frontal slices of L*. However, each of the last d₂(d₃ − 1) columns of Circ(L*) can be obtained by rearranging the elements of one of its first d₂ columns. Hence, by restricting the spikiness of only the first d₂ columns, one can essentially control the spikiness of all the columns of Circ(L*), which leads to the posited incoherence condition. Note that the objective function (4), denoted by f(L, S), is jointly convex, and hence the alternating block minimization procedure summarized in Algorithm 1 will obtain the desired minimizer. The details of Steps 1 and 2, which update L and S alternately, are discussed next.

Algorithm 1 Alternating Block Minimization
Procedure for minimizing f(L, S):
Step 1: Given S⁽ᵗ⁾, update L⁽ᵗ⁺¹⁾ = argmin_L f(L, S⁽ᵗ⁾).
Step 2: Given L⁽ᵗ⁺¹⁾, update S⁽ᵗ⁺¹⁾ = argmin_S f(L⁽ᵗ⁺¹⁾, S).

Step 1: Step 1 updates the value of L given S. For a given value of S, letting u_i = y_i − ⟨S, X_i⟩, the problem then reduces to minimizing g(L) = (1/2n) Σ_{i=1}^{n} (u_i − ⟨L, X_i⟩)² + (λ_L/d₃)‖Circ(L)‖_*. Denoting Circ(L) by W and Circ(X_i) by V_i, some simple algebraic steps show that minimizing g(L) with respect to L is equivalent to minimizing a smooth convex quadratic function of W plus a nuclear norm penalty on W. This minimization problem shows up in various applications of machine learning, such as matrix classification, multi-task learning and matrix completion (see [3, 35]). [12] consider a general class of optimization problems that includes the above formulation: specifically, for a matrix variable M, the objective function of interest is min_M {F(M) + λ‖M‖_*}, wherein F(·) is a smooth convex function. [12] proposed an Extended Gradient Algorithm and an Accelerated Gradient Algorithm to obtain the minimizer M and also addressed convergence issues. A direct application of the aforementioned algorithms provides the optimal W and thus, eventually, the optimal L.
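As a hedged illustration of Step 1, the sketch below uses generic proximal gradient descent (not the Extended Gradient Algorithm of [12] itself): the proximal operator of the tensor nuclear norm amounts to singular value thresholding of each frontal slice in the Fourier domain. All dimensions, step sizes and helper names are illustrative assumptions.

```python
import numpy as np

# Proximal-gradient sketch of the L-update: smooth squared loss plus the
# tensor nuclear norm penalty lam * (1/d3) * ||Circ(L)||_*.
rng = np.random.default_rng(4)
d1, d2, d3, n = 3, 3, 2, 200
Xs = rng.standard_normal((n, d1, d2, d3))
L_true = rng.standard_normal((d1, d2, d3))
u = np.einsum('nijk,ijk->n', Xs, L_true)       # noiseless partial residuals

def svt_tnn(T, tau):
    """Prox of tau * TNN: soft-threshold singular values of each
    Fourier-domain frontal slice."""
    T_hat = np.fft.fft(T, axis=2)
    out = np.empty_like(T_hat)
    for k in range(T.shape[2]):
        U, s, Vh = np.linalg.svd(T_hat[:, :, k], full_matrices=False)
        out[:, :, k] = (U * np.maximum(s - tau, 0.0)) @ Vh
    return np.real(np.fft.ifft(out, axis=2))

lam, step = 0.05, 0.2
L = np.zeros((d1, d2, d3))
for _ in range(200):
    resid = np.einsum('nijk,ijk->n', Xs, L) - u       # fitted minus target
    grad = np.einsum('n,nijk->ijk', resid, Xs) / n    # gradient of the loss
    L = svt_tnn(L - step * grad, lam * step)          # gradient step + prox
```

This is only a conceptual sketch; in the paper the update is carried out on the block-circulant matrix W via the algorithms of [12].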
Step 2: In Step 2, for a given value of L, letting z_i = y_i − ⟨L, X_i⟩, the problem boils down to minimizing h(S) = (1/2n) Σ_{i=1}^{n} (z_i − ⟨S, X_i⟩)² + λ_S Σ_{k=1}^{d₃} ‖S_{::k}‖_{2,1} with respect to S. We now construct the matrix S_Mat of dimension d₁ × d₂d₃ (and similarly X_{i,Mat}) by placing the frontal slices of S (and of X_i) side by side. One can easily check that ⟨S, X_i⟩ = Tr(S_Matᵀ X_{i,Mat}) and Σ_{k=1}^{d₃} ‖S_{::k}‖_{2,1} = ‖S_Mat‖_{2,1}, and thus h(S) = (1/2n) Σ_{i=1}^{n} (z_i − Tr(S_Matᵀ X_{i,Mat}))² + λ_S ‖S_Mat‖_{2,1}. Finally, vectorizing S_Mat and X_{i,Mat}, the problem takes the form of a group-lasso penalized learning problem, as discussed in [38]. Assuming that the loss function admits the Quadratic Majorization condition, [38] develops the Groupwise-Majorization-Descent (GMD) algorithm in order to solve such problems. We directly employ that method to obtain the optimal S.
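The group structure of Step 2 can be illustrated with the proximal operator of the column-wise (2,1) norm, which scales every column of S_Mat toward zero and zeroes out the small ones. This is a minimal sketch of the penalty's effect; the paper itself uses the GMD algorithm of [38] rather than this operator directly, and the arrays below are illustrative.

```python
import numpy as np

# Flatten the frontal slices of S side by side into S_Mat (d1 x d2*d3),
# then apply the prox of tau * ||.||_{2,1}: column-wise soft thresholding.
rng = np.random.default_rng(5)
d1, d2, d3 = 4, 3, 2
S = rng.standard_normal((d1, d2, d3))
S_mat = np.concatenate([S[:, :, k] for k in range(d3)], axis=1)

def prox_l21(M, tau):
    """Prox of tau * sum_j ||M[:, j]||_2: shrink each column's norm by tau,
    zeroing columns whose norm is below tau."""
    norms = np.linalg.norm(M, axis=0)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return M * scale

shrunk = prox_l21(S_mat, 0.5)
```

Columns whose ℓ2 norm falls below the threshold are set exactly to zero, which is what produces column-wise sparsity in the estimate.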
The above algorithm requires a slight modification to accommodate sparsity at the element level for the component S*. Specifically, in Step 2 the objective function h(S) then penalizes the ℓ₁ norm of vec(S_Mat), and thus the optimal S is obtained by solving a lasso problem.

Theoretical results
We start by defining the estimation error e²(L̂, Ŝ) as follows:

e²(L̂, Ŝ) = ‖L̂ − L*‖²_F + ‖Ŝ − S*‖²_F.   (5)

Next, we introduce some additional notation needed in the sequel.
Additional notation: For L* with tubal rank r ≪ min{d₁, d₂}, the rank of the associated block-circulant matrix is denoted by R and is bounded above by r × d₃ (see Proposition A.1). S*_Mat is a matrix of dimension d₁ × d₂d₃ that is constructed by placing the frontal slices of S* side by side. We assume that S*_Mat has s ≪ d₂d₃ non-zero columns, and we let E denote the index set of those non-zero columns; M(E) denotes the subspace of matrices whose non-zero columns lie in E. As shown in [1, 28], one can easily verify that the regularizer ‖·‖_{2,1} is decomposable (see [28]) with respect to the subspace pair (M(E), M⊥(E)). Simplifying the notation, π_M denotes the projection onto the subspace M. We define Δ̂_L = L̂ − L* and Δ̂_S = Ŝ − S*; Δ̂_{S,Mat} is the matrix constructed by placing the frontal slices of Δ̂_S side by side, Δ̂^M_{S,Mat} = π_M(Δ̂_{S,Mat}) and Δ̂^{M⊥}_{S,Mat} = π_{M⊥}(Δ̂_{S,Mat}), with Δ̂^M_S and Δ̂^{M⊥}_S denoting their tensor counterparts. The roadmap of the technical developments is as follows: Lemmas 3.1 and 3.2 characterize the set to which the errors (Δ̂_L, Δ̂_S) belong. In addition, we assume that Restricted Strong Convexity of the loss function holds on this set (Assumption 1). For deterministic realizations of the predictors and the error terms, and under certain regularity conditions, Lemma 3.3 establishes the bound on the estimation error e²(L̂, Ŝ). Theorem 3.4 extends the result to random realizations of the predictors and the errors, while Corollary 3.4.1 presents the error bound when S* has element-wise sparse frontal slices.
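A toy numpy check of the decomposability property invoked above, with a hypothetical column support E: the (2,1) norm of a matrix splits additively across the pair (M(E), M⊥(E)), since the two components occupy disjoint sets of columns.

```python
import numpy as np

# Projection onto M(E): keep the columns indexed by E, zero out the rest.
rng = np.random.default_rng(6)
d1, cols = 4, 6
E = [1, 4]                       # hypothetical column support

def proj_M(M, E):
    out = np.zeros_like(M)
    out[:, E] = M[:, E]
    return out

M = rng.standard_normal((d1, cols))
A = proj_M(M, E)                 # component in M(E)
B = M - A                        # component in the orthogonal complement

l21 = lambda Z: np.linalg.norm(Z, axis=0).sum()   # column-wise (2,1) norm
```

Decomposability is exactly the property that makes the regularized M-estimation framework of [28] applicable here.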
The proofs of the results are delegated to Appendix A.

Lemma 3.1. Let R denote the rank of Circ(L*), and let C(L, S) be the following weighted combination of the nuclear norm and ℓ_{2,1} norm regularizers:

C(L, S) = (λ_L/d₃) ‖Circ(L)‖_* + λ_S ‖S_Mat‖_{2,1}.

Then, for any R = 1, 2, ..., min{d₁d₃, d₂d₃}, the error pair (Δ̂_L, Δ̂_S) satisfies the deviation inequality established in Appendix A.

Lemma 3.2. Suppose the predictors X_i's and the errors ε_i's are deterministic. Define a third-order tensor D ∈ R^{d1×d2×d3} in terms of the predictors X_i's and the errors ε_i's.

As mentioned earlier, the above lemmas characterize a set in which the error (Δ̂_L, Δ̂_S) lies. Given this set, we are now in a position to summarize all the assumptions that we make. We first provide the list of assumptions and then provide further details on each of them.
Assumption 1 (Restricted Strong Convexity). The loss function satisfies the Restricted Strong Convexity condition over the restricted set defined in equation (7).

Assumption 2 (Incoherence). The spikiness of the columns of Circ(L*) is bounded, for some fixed parameter α > 0.

Assumption 3. When the predictors X_i's and the errors ε_i's are deterministic, the regularization parameters (λ_L, λ_S) satisfy lower bounds determined by D and D_Mat, where D and D_Mat are as defined in Lemma 3.2.
• Assumption 1 ensures that the loss function exhibits strong convexity over some restricted set of interest, as defined in equation (7). This is a fairly standard assumption in the high-dimensional literature [1].
• Assumption 2 is aimed to ensure that the low-tubal rank component L * is incoherent with the sparse component S * , as discussed in Section 2.1.
It is worth recalling that this assumption is a straightforward adaptation of the 'spikiness' restriction on the columns of the low-rank matrix, as introduced in [1]. We directly impose that restriction on Circ(L*), which is a reasonable low-rank matrix counterpart of our low tubal-rank component L*. The reader may revisit Section 2.1 for more details. This assumption is milder than other tensor incoherence conditions, including those in [24, 40], which involve the components of the t-SVD.
• Assumption 3 imposes a certain lower bound on the two regularization parameters, a common requirement in the high-dimensional literature.
The following lemma establishes an upper bound on e²(L̂, Ŝ) in the case of deterministic predictors and errors.

Lemma 3.3. Suppose the predictors X_i's and the errors ε_i's are deterministic. Then, under Assumptions (1), (2) and (3), the estimation error e²(L̂, Ŝ) satisfies the following bound, where the notation '≲' denotes an upper bound, ignoring all constant factors.
Note that the result is broadly in line with Theorem 1 in [1]; specifically, when the loss function satisfies Restricted Strong Convexity and the parameters of interest are exactly (not approximately) low rank and sparse, a similar form of error bound is obtained. In the current setting, the tubal rank (instead of the matrix rank) enters the bound, as well as the column-wise sparsity s, reflecting the nature of S*_Mat.
Next, the above result is extended to the case of stochastic errors and predictors. To that end, we assume that the ε_i's are i.i.d. N(0, σ²) and use the notation X^(i) = Vec(X_i) to denote the vectorized form of the i-th predictor X_i. We write X = (X^(1), ..., X^(n)) to denote the combined predictors from all the n samples in vectorized form. As in [32], we assume that

X ∼ N(0, Σ).   (11)

Note that this assumption does not require the data tensors X_i's to be independent. We assume that Σ has bounded eigenvalues. Let λ_min(·) and λ_max(·) denote the smallest and largest eigenvalues of a matrix, respectively. We assume in the sequel that

c_l ≤ λ_min(Σ) ≤ λ_max(Σ) ≤ c_u,   (12)

for some constants 0 < c_l ≤ c_u < ∞. As mentioned in [32], if all the covariates {X^(i) : i = 1, 2, ..., n} are independent and identically distributed, then Σ will have a block-diagonal structure, and in that case the condition in equation (12) reduces to a similar condition on Cov(X^(i)).
With this Gaussian assumption on the predictors and errors, we establish the following result.
Theorem 3.4. Suppose the errors ε_i's are i.i.d. N(0, σ²) and the predictors follow a Gaussian distribution, characterized by equations (11) and (12). Suppose that Assumption 2 holds. Then it can be shown that the conditions in Assumption 1 and Assumption 3 are satisfied with high probability, and the estimation error e²(L̂, Ŝ) admits a corresponding high-probability bound.

The bound is analogous to the matrix case; for the latter, with m₁ rows, m₂ columns and rank w, the bound involves the expression σ²w(m₁ + m₂)/n. This comprises two parts: w(m₁ + m₂) corresponds to the degrees of freedom, which is of the order of the number of free elements, and a multiplicative factor σ²/n corresponding to the error variance. Analogously, the term rd₁ (or rd₂) corresponds to the r t-independent lateral (or horizontal) slices and the estimation of the d₁ (or d₂) tubes in each such slice. The multiplicative factor remains the same, except for the term c_u², which appears additionally in this case in order to accommodate the variability in the predictors.
The second part of the error bound is related to the sparse component and can be interpreted as follows: the first term σ²c_u²sd₁/n arises as a result of estimating the sd₁ non-zero parameters in S*_Mat, while the second term σ²c_u²s log(d₂d₃)/n is devoted to the selection of the s positions of the non-zero columns in S*_Mat. This selection problem induces the term log (d₂d₃ choose s) ≈ s log(d₂d₃). Finally, the last term α²s/d₂ appears due to the non-identifiability of the model.
When the sparsity in the regression coefficient tensor arises element-wise, instead of column-wise, the estimation error bound can be obtained in an analogous manner, with the column-wise ℓ_{2,1} norm replaced by the element-wise ℓ₁ norm, which imposes element-wise sparsity on S*_Mat. Further, instead of restricting the spikiness of the columns, the incoherence condition now controls the element-wise spikiness of Circ(L*), and Assumption 2 is modified accordingly. With these modified versions of the assumptions, and denoting the number of non-zero elements in S*_Mat by s, we next present the following corollary to Theorem 3.4, which provides the estimation error bound in the case of element-wise sparsity in S*.

Corollary 3.4.1. Suppose the errors ε_i's are i.i.d. N(0, σ²), the predictors follow a Gaussian distribution, characterized by equations (11) and (12), and the modified version of Assumption 2 holds. Then it can be shown that the conditions in Assumption 1 and in the modified version of Assumption 3 are satisfied with high probability, and the corresponding error bound holds with probability greater than 1 − exp(−9 log(d₁d₂d₃)), where c_u² is defined in (12).

Performance evaluation
We illustrate the performance of our estimation procedure described in Section 2.1, based on synthetic data under different settings. We start by describing how the true L * and S * are generated. For the L * , we start by generating a third-order tensor in R d1×d2×d3 with Uniform (0, 1) entries and then obtain its t-SVD (see Definition C.8) using the rTensor R package ( [19]). Let U and V denote the two orthogonal tensors (Definition C.5) and let K be the f -diagonal tensor (Notation C.1) of the t-SVD. For any r = 1, 2, · · · , d = min{d 1 , d 2 }, we randomly select d − r diagonal tubes of K and make them zero, whereas the remaining tubes remain non-zero.
Denoting the resulting f-diagonal tensor by K₁, the tensor L*, with tubal rank r, is then generated as L* = U ∗ K₁ ∗ V⊤, with ∗ denoting the t-product. To generate S*, we start with a third-order tensor with Uniform(0, 1) entries as before. Then, for the k-th frontal slice, with k = 1, 2, ..., d₃, we randomly choose s_k (≪ d₂) columns and set all the remaining d₂ − s_k columns to zero. With this construction, and denoting Σ_{k=1}^{d₃} s_k by s, S*_Mat will have s (≪ d₂d₃) non-zero columns, as assumed in Section 3. However, for simplicity, in the simulations we assume that s₁ = s₂ = ··· = s_{d₃} = s*.
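A numpy sketch of this generating mechanism (the paper uses the rTensor R package; the FFT-based t-SVD below is an equivalent construction, with illustrative dimensions): keep r of the diagonal tubes of K and zero out the remaining d − r, then build a column-wise sparse S*.

```python
import numpy as np

# Generate L* with tubal rank r by truncating the t-SVD of a random tensor,
# working frequency-by-frequency in the Fourier domain.
rng = np.random.default_rng(7)
d1, d2, d3, r, s_star = 5, 5, 3, 2, 1
d = min(d1, d2)

T_hat = np.fft.fft(rng.uniform(size=(d1, d2, d3)), axis=2)
L_hat = np.empty_like(T_hat)
keep = rng.choice(d, size=r, replace=False)     # tube indices kept non-zero
for k in range(d3):
    U, s, Vh = np.linalg.svd(T_hat[:, :, k], full_matrices=False)
    s_trunc = np.zeros_like(s)
    s_trunc[keep] = s[keep]                     # zero out the other tubes
    L_hat[:, :, k] = (U * s_trunc) @ Vh
L_star = np.real(np.fft.ifft(L_hat, axis=2))    # tubal rank <= r

# S*: keep s_star random columns per frontal slice, zero out the rest.
S_star = rng.uniform(size=(d1, d2, d3))
for k in range(d3):
    zero_cols = rng.choice(d2, size=d2 - s_star, replace=False)
    S_star[:, zero_cols, k] = 0.0
```

Zeroing the same tube indices across all frequencies is what makes the truncation act on whole diagonal tubes of K, matching the construction described above.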
The predictors X_i's are sampled independently, with i.i.d. N(0, 1) entries within each predictor. Finally, we simulate independent and identically distributed Gaussian noise for the error term and generate the responses based on model (2). Given the simulated data, we employ the algorithm discussed in Section 2.1 to obtain L̂ and Ŝ. The regularization parameters λ_L and λ_S are selected by a two-dimensional grid search: we run the algorithm and obtain the estimates over a grid of pairs (λ_L, λ_S), and select the pair for which the rank of Circ(L̂) and the positions of the non-zero columns in Ŝ_Mat are as close as possible to the rank of Circ(L*) and the positions of the non-zero columns in S*_Mat, respectively. It is worth mentioning that later we develop an AIC criterion in order to select the optimal values of the regularization parameters when the true rank and sparsity level are unknown.

Performance Evaluation: We use the Relative Error, the rank of Circ(L̂), Sensitivity and Specificity as the evaluation criteria. Small values of the relative error, along with closeness of the rank of Circ(L̂) to the rank of Circ(L*), characterize the quality of the estimation. In addition, sensitivity and specificity together assess the ability of support recovery. Below we provide the precise definitions of these criteria.

1. Relative Error (RE): Considering the definition of the estimation error provided in equation (5), the relative error is defined by scaling the estimation error by the magnitude of the true parameters (L*, S*).

2. Specificity (SP): Specificity is defined as 1 − FPR, where the False Positive Rate (FPR) is the number of non-zero elements in Ŝ which are actually zero in S*, divided by the number of elements that are zero in S*.

3. Sensitivity (SN): Sensitivity, also known as the True Positive Rate (TPR), is the number of non-zero elements in Ŝ which are actually non-zero in S*, divided by the number of non-zero elements in S*.

Using the above criteria, we evaluate the performance of our method under four different scenarios. Each scenario corresponds to specific values of the triplet (d₁, d₂, d₃). Furthermore, within each scenario we obtain the estimates under four different sub-cases, where each sub-case corresponds to a particular combination of the true tubal rank and sparsity level. We first describe the scenarios and the sub-cases and then summarize the results under all these cases. The results reported in the following tables are based on 100 replicates.
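The support-recovery criteria can be written out directly; the small arrays below are illustrative toy values, not estimates from the paper's simulations.

```python
import numpy as np

# Sensitivity (TPR) and specificity (1 - FPR) of a sparse estimate S_hat
# against the true S_star, computed over element-wise supports.
S_star = np.array([[0.0, 1.2, 0.0],
                   [0.0, 0.7, 0.0]])
S_hat  = np.array([[0.0, 1.0, 0.3],
                   [0.0, 0.6, 0.0]])

true_nz, est_nz = S_star != 0, S_hat != 0
sensitivity = (est_nz & true_nz).sum() / true_nz.sum()   # true positive rate
fpr = (est_nz & ~true_nz).sum() / (~true_nz).sum()       # false positive rate
specificity = 1.0 - fpr
```

Here the estimate recovers both true non-zeros (sensitivity 1) but also flags one spurious entry, which lowers the specificity.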
In all of these cases, L* has been generated in such a way that the bound in Proposition A.1 holds with equality; more specifically, the rank of the true block-circulant matrix is exactly r × d₃. For example, in Table 1, since d₃ is 8, R equals 16 and 24 when r is 2 and 3, respectively.
As depicted in Table 1, in all four sub-cases the relative error decreases as the sample size increases. Moreover, as the estimation is equipped with more samples, it becomes easier to achieve the rank of the true block-circulant matrix. Finally, the values of the specificity and sensitivity approach 1 as the sample size increases. The reader may also note that, for a fixed value of the true rank, increasing the true sparsity level increases the relative error, which is in accordance with the theoretical finding in Lemma 3.3. For example, with sample size 800 and tubal rank 3 (block-circulant matrix rank 24), when one increases s* from 1 to 2, the relative error increases from 0.37 to 0.42. The same argument applies when the true rank is increased for a fixed level of imposed sparsity. The remaining tables (Table 2, Table 3 and Table 4) display the results under the remaining three scenarios. As expected, more samples are required to achieve good performance as the number of parameters increases. However, in all these cases, as in Scenario 1, both estimation and support recovery performance become stronger as the sample size increases.

Predictive Performance: To assess the out-of-sample predictive performance of our model, we split the data into two parts: the model is fit on the first part (training data), and the second part (test data) is used to assess its performance. We use the Root Mean Square Error as the measure, defined as RMSE = sqrt( (1/n_test) Σ_{i=1}^{n_test} (y_i − ŷ_i)² ), where n_test is the number of observations in the test data, y_i is the i-th actual observation in the test data and ŷ_i is the fitted value from the model trained on the training data. Thus, lower RMSE values imply better performance.
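As a sketch (the function name is hypothetical), the RMSE over held-out observations can be computed as:

```python
import numpy as np

def rmse(y_test, y_pred):
    # Root Mean Square Error over the held-out test observations
    y_test, y_pred = np.asarray(y_test, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_test - y_pred) ** 2)))
```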
We compare our model with four relevant benchmarks from the literature: 1) Vectorized Lasso (La-vec): the third-order tensor predictor X_i ∈ R^{d1×d2×d3} is vectorized and the resulting vector of length d_1 d_2 d_3 is used to fit a Lasso regression with ℓ1-norm penalization. 2) Vectorized Elastic Net (EN-vec): similar to benchmark 1, except that we employ Elastic Net regularization on the vectorized X instead of the Lasso. 3) Sparse CP regression (Sp-CP): based on the method developed in [42], which uses a sparse CP decomposition of the coefficient tensor. [42] developed a Block Relaxation algorithm to solve the problem, implemented in the Matlab toolbox TensorReg, which we use to generate the results. 4) Sparse Tucker regression (Sp-Tu): similar to benchmark 3, except that [21] applies a Tucker decomposition ([16]) on the coefficient tensor instead of a CP decomposition; as in benchmark 3, we use the TensorReg toolbox to generate the results. Note that competitors (1) and (2) are oblivious to the tensor nature of the problem and treat it as a large-scale regularized regression. Table 6 summarizes the RMSE values for our Low Tubal Rank model (Low TR) and for the four benchmarks, La-vec, EN-vec, Sp-CP and Sp-Tu. The first column specifies the true data generating procedure: the first four entries, TR(r = 3, s = 2), TR(r = 3, s = 1), TR(r = 2, s = 2) and TR(r = 2, s = 1), characterize the four sub-cases (based on the values of the tubal rank and the number of non-zero columns) and the corresponding data generating procedures discussed in Section 4 of our paper. The last entry of the first column, Sp-data, corresponds to the data generating procedure in the sparse CP tensor regression [42], which is also provided in their toolbox documentation.
For each of the true data generating processes, Table 6 reports the RMSE values for our model and the benchmarks for three different sample sizes, n = 400, 800, 1100. While calculating the RMSEs, the number of test observations n_test was taken as 100, and the dimensions d_1, d_2 and d_3 of the tensor predictor were fixed at 10, 10 and 8, respectively. As depicted in Table 6, our model outperforms the benchmarks for the first four data generating procedures, as expected, since the posited low tubal rank plus sparse model corresponds to the true data generating mechanism. For the Sp-data generating mechanism, though the RMSE values from our model are initially slightly higher than or on par with the Sp-CP and Sp-Tu ones (which are in accordance with the true generating mechanism), they become smaller than those of Sp-CP and Sp-Tu as the sample size increases. This is probably because the optimization problem for our model is convex, as opposed to the sparse CP/Tucker decompositions, and also involves a smaller number of parameters to be estimated; both features are advantageous in settings with moderate sample sizes. As mentioned earlier, when working with real data, the true rank and sparsity level are unknown. In such situations, we choose the values of λ_L and λ_S so that the AIC, as defined below, is minimized.
AIC: We define AIC as n log(RSS/n) + 2 Rank(Circ(L̂)) + 2k, where RSS = Σ_{i=1}^{n} (y_i − ⟨L̂ + Ŝ, X_i⟩)² and k is the number of non-zero elements in Ŝ. This formulation is quite common in the literature: it rewards goodness of fit while penalizing model complexity to guard against overfitting. Below we provide a numerical analysis that justifies the performance of the posited AIC criterion.
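Under the definition above, the AIC of a fitted pair (L̂, Ŝ) could be computed as in the sketch below. The function and variable names are hypothetical, and the rank of Circ(L̂) is assumed to be supplied separately (e.g. from a numerical rank computation on the block-circulant matrix).

```python
import numpy as np

def aic(y, X_list, L_hat, S_hat, rank_circ_L):
    # Fitted values <L_hat + S_hat, X_i> for each predictor tensor
    B = L_hat + S_hat
    fitted = np.array([np.sum(B * X) for X in X_list])
    rss = float(np.sum((np.asarray(y) - fitted) ** 2))
    k = int(np.count_nonzero(S_hat))   # number of non-zero elements in S_hat
    n = len(y)
    # AIC = n log(RSS/n) + 2 Rank(Circ(L_hat)) + 2k
    return n * np.log(rss / n) + 2 * rank_circ_L + 2 * k
```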
We consider the synthetic data generated in Scenario 1, sub-case 1. Recall that for this data set, the true rank of the block-circulant matrix is 16 and there is only one non-zero column in each frontal slice of the sparse component. The goal of this experiment is to check whether the values of the rank and sparsity obtained by minimizing the AIC match the true rank and sparsity level. To that end, we obtain the values of the AIC over a grid of pairs (λ_L, λ_S) with sample size n = 800. Figure 5 depicts the relevant part of the grid containing the minimum AIC value (see the additional tables in Appendix D for the AIC values). The pair (λ_L, λ_S) corresponding to this minimum AIC in Figure 5 produces L̂ with R = 17, and Ŝ comprising one frontal slice with two non-zero columns and the remaining frontal slices with only one non-zero column, in the desired positions. Hence, the rank and sparsity selected by the AIC match the true values quite well. Moreover, the AIC-based rank and sparsity pattern are exactly in line with the ones obtained in the simulations (see the rank, SP and SN for n = 800 in Scenario 1, sub-case 1, depicted in Table 1). To gain further assurance of the parity between the AIC-based results and the earlier simulation results, we recalculate the estimation results for all the sub-cases of Scenario 1 with the AIC-based tuning parameters and summarize them in Table 7. As can be seen, the AIC-based results in Table 7 are closely in line with the Scenario 1 simulation results in Table 1.

Application to educational data
In this section, we apply our proposed method to educational data from an Intelligent Tutoring System (ITS). The ITS under consideration is an online video-based tutoring program launched in 2013 to help prepare students for an End-of-Course basic algebra test. The platform offers videos on various algebra topics, recorded by different tutors. Students can assess their progress by taking practice tests. The platform also offers a monitored discussion area where students can pose questions to peers and volunteer tutors. The data we consider consist of student records for four consecutive academic years, from 2014-15 to 2017-18. Students who logged into the tutoring platform at least five times in a particular academic year were considered users of the platform for that year. For each year, we gather information on the following variables for each user.
• Socioeconomic variables: 1) Ethnicity: Hispanic/Latino or not, 2) FRL: reduced-price (or free) meals at school or not, 3) Gender: male or female.
• Scores in maths tests: 1) Pre-Score: score in the state standard assessment maths test, which determines the maths preparedness of the students, 2) EoC Score: score in the end-of-course maths test, which students must take and pass to establish their maths proficiency.

In the next step, we aggregate this user-level data to the school level. To that end, for each school, we process the data as follows:
• For each academic year, we first divide the teachers into three categories based on their overall teaching experience, namely E1, E2 and E3. E1 is the group of teachers with the least experience (teaching experience of at most 5 years), while E3 is the most experienced group (at least 14 years of teaching experience).
• Thus, for each school, we arrive at a third-order tensor predictor of dimension 8 × 4 × 3, as illustrated in Figure 6. The four lateral slices correspond to the four academic years, where in each lateral slice, 8 variables are captured across the 3 levels of teaching experience. It is worth mentioning that these variables are now measured at the school level, by averaging over the relevant student-level data. As an example, the very first element of the 2014-15 slice represents the average number of videos watched by the students of that school in that year.

Due to lack of information for many schools, we restrict our analysis to n = 38 schools. Using the observed data {(y_i, X_i)}_{i=1}^{n}, we fit the tensor regression model given by (2). We assume that B = L + S, where L is the low tubal-rank component and S is a tensor whose frontal slices are elementwise sparse. In other words, L captures the baseline effects, which are shared across the academic years. In addition, some specific combinations of variables and experience levels may show an additional effect in a specific year.
The sparse component S is devoted to determining such additional effects.
We first select suitable values of λ_L and λ_S using the AIC criterion discussed in Section 4 and then employ the Alternating Block Minimization Algorithm (Algorithm 1) to estimate the low tubal-rank and sparse components of the coefficient tensor. Figure 7 provides the estimated low tubal-rank component, whose tubal rank is 1. Thus, there is only one academic year for which the corresponding lateral slice is t-linearly independent; in all the remaining academic years, the effects of the variables are t-linearly dependent on it. [30] analyzed educational attainment data and found that the effects of school students' prior scores on their BA degree attainment are similar across two different "experimental" groups: the first group corresponds to students who live within 10 miles of the nearest 4-year college, while the second group of students live at least 10 miles away from the nearest 4-year college. Similarly, in our study, the effects of students' preparedness or prior knowledge on their future academic achievements are similar across different years. In addition to the Pre-Score or prior knowledge, the effects of the socioeconomic variables are also similar across years.
However, as opposed to the Pre-Score and socioeconomic variables, the effects of the ITS platform usage variables are not similar across the years. There are some usage variables for which additional effects are captured by the estimated sparse component Ŝ. The binary heatmap in Figure 8 depicts the variables for which there are non-zero values in Ŝ, whereas the actual values of the estimates are tabulated in the Appendix (Table 9). As shown in Figure 8, the two key variables for which additional effects are present are the number of videos viewed and the number of logins to the platform. Although it is difficult to establish the ground truth behind this, intuition suggests that the level of platform usage may vary significantly across different segments, leading to additional effects in the estimated coefficient. For instance, as the platform gains popularity over time, more students are expected to become acquainted with it and watch tutorial videos. Consequently, additional effects of videos become more prominent in Year-3 and Year-4 compared to Year-1 and Year-2, which is reflected in the heatmap. Regarding other tensor regression models, as mentioned earlier, our work is the first to explore the decomposition of the total effect into a baseline effect (low tubal-rank tensor) and an idiosyncratic effect (sparse tensor). This decomposition reflects an intrinsic pattern of the type of educational data considered here: the effects of prior knowledge (Pre-Score) and socioeconomic variables are shared across the years ([30]), which constitutes the baseline component, while the platform usage variables (number of logins to the platform, number of tutorial videos watched and so on) display additional effects as the platform gains popularity. No other tensor regression model in the literature provides an algorithm to estimate such decomposed effects.
As an example, we apply the sparse CP regression model proposed by [42] to our data. As previously discussed, sparse CP regression first uses the CP decomposition to represent the coefficient tensor and then imposes a sparsity assumption on the components of the decomposition. The sole purpose of this method is to reduce dimensionality, and it provides a sparse estimate of the coefficient tensor. Figure 9 depicts the heatmap of the estimated sparse coefficient tensor (represented in matrix form) using sparse CP regression. As can be seen, the sparsity pattern in the estimated coefficient tensor appears arbitrary and hardly reveals any meaningful interpretation of the underlying effects.

Discussion
In this paper, we propose a tensor regression model with a scalar response and a third-order tensor predictor, wherein the third-order coefficient tensor is decomposed into two components. The first is a low tubal-rank tensor, which captures baseline effects shared across the lateral (or horizontal) slices. The second is a third-order tensor whose frontal slices are either elementwise or columnwise sparse; its role is to capture additional idiosyncratic effects. This decomposition of the coefficient tensor, as opposed to the Canonical Polyadic (CP) decomposition used in the related literature, expands the scope for exploring more complex structure in the data. We develop a fast and scalable Alternating Minimization algorithm to solve the resulting convex regularized program. On the theoretical side, we extend work from the multivariate regression literature [1] to third-order tensors and establish an interpretable non-asymptotic upper bound on the estimation error. The efficacy of the methodology is illustrated on synthetic data and real educational data.

Acknowledgments
The research reported here was supported by the Institute of Education Sciences, U.S. Department of Education, through Grant R305C160004 to the University of Florida. The opinions expressed are those of the authors and do not represent views of the Institute or the U.S. Department of Education.

Appendix A: Proofs
In this section, we prove the results presented in Section 3. We start by establishing a simple proposition, followed by Lemmas 3.1 and 3.2, and then a simple inequality, termed the Basic Inequality. Based on these results, we prove Lemma 3.3, which provides an upper bound on the estimation error in the case of deterministic noise and predictors. Finally, using Lemma 3.3, we prove Theorem 3.4 and Corollary 3.4.1, which provide the upper bound in the case of random realizations of the errors and predictors.
Proof. Note that a block-circulant matrix of a third-order tensor in R^{d1×d2×d3} can be expressed as [B_1 | B_2 | ⋯ | B_{d2}], where the j-th block B_j is a matrix of dimension d_1 d_3 × d_3, j = 1, 2, …, d_2. In each B_j, the first column is the vectorized version of the j-th lateral slice of the tensor, and the remaining (d_3 − 1) columns are circulant rearrangements of the first column. Since the tubal rank of the tensor is r, there are r blocks among these d_2 blocks such that:
• any column of any of the remaining d_2 − r blocks can be written as a linear combination of the columns of the aforementioned r blocks, and
• any column of the j_1-th block is linearly independent of any column of the j_2-th block, where j_1 ≠ j_2, j_1, j_2 = 1, 2, …, r.
So the rank of the block-circulant matrix will depend on the intra-block linear dependence of these r blocks. If all the columns within each of the r blocks are linearly independent, then there will be r × d 3 linearly independent columns in the full block-circulant matrix. At the other extreme, if there is only one linearly independent column in each of the r blocks, then there will be r linearly independent columns in the block-circulant matrix. Hence the proof.
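The bounds in this proposition are easy to verify numerically. The sketch below (NumPy; function name hypothetical) builds the block-circulant matrix of a tensor by circularly shifting its frontal slices, so that its rank can be checked against r and r·d_3.

```python
import numpy as np

def block_circulant(A):
    # A has shape (d1, d2, d3); block (i, j) is the frontal slice (i - j) mod d3
    d1, d2, d3 = A.shape
    C = np.zeros((d1 * d3, d2 * d3))
    for i in range(d3):
        for j in range(d3):
            C[i*d1:(i+1)*d1, j*d2:(j+1)*d2] = A[:, :, (i - j) % d3]
    return C
```

For instance, a tensor whose frontal slices are all the same rank-one matrix has tubal rank r = 1 and its block-circulant matrix has rank 1, attaining the lower bound; generic random slices typically attain the upper bound r·d_3.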

Proof of Lemma 3.1
Proof. Note that Circ(L*) and Circ(Δ_L) are two matrices of the same dimension. Using Lemma 3.4 of [33], it is possible to decompose Circ(Δ_L) as Circ(Δ_L^A) + Circ(Δ_L^B), such that Rank(Circ(Δ_L^A)) ≤ 2 Rank(Circ(L*)) = 2R and Circ(L*)^T Circ(Δ_L^B) = 0, Circ(L*) Circ(Δ_L^B)^T = 0. The reader may consult [33] for the details of how to derive such a decomposition; it is worth mentioning that [1] uses the same tool in proving their Lemma 1. Moreover, as Lemma 2.3 of [33] shows, the last two equalities are a sufficient condition for the additivity of the nuclear norm; we use this finding later in the proof. It remains to show that inequality (6) holds for such a decomposition. This follows by a chain of steps: the definition of C(L, S) and the additivity of Circ(·), the aforementioned decomposition and the projection property, the triangle inequality, and finally equation (15) together with the decomposability of ‖·‖_{2,1}. The remainder of the proof follows from the resulting inequality and the definition of C(L*, S*).

Proof of Lemma 3.2
Proof. We start by defining a function f on pairs of tensors in R^{d1×d2×d3}, wherein, as before, L(L, S) denotes the loss function. Since f(0, 0) = 0 and (Δ_L, Δ_S) is the optimal error, one must have f(Δ_L, Δ_S) ≤ f(0, 0) = 0. Recall that we have already established a lower bound on C(L* + Δ_L, S* + Δ_S) − C(L*, S*) in equation (6). It remains to find a lower bound on L(L* + Δ_L, S* + Δ_S) − L(L*, S*). These two bounds, along with the fact that f(Δ_L, Δ_S) ≤ 0, prove the result.
One can think of C as an alternative regularizer and λ_L as the associated parameter for our problem. Now, as [28] derives in the proof of their Lemma 1, using the convexity of the loss function and the dual-norm inequality, we obtain (17), where C* is the dual norm associated with the regularizer C. It is easy to check that ∇L(L*, S*) = [−D, −D]. From the given conditions on the regularization parameters, we get ‖D_Mat‖_{2,∞} / λ_S ≤ 1/4 and (1/d_3) ‖Circ(D)‖_sp ≤ λ_L / 4, and hence, using a similar argument as in the proof of Lemma 1 in [1], C*(∇L(L*, S*)) can be shown to be bounded above by λ_L / 2. After a further straightforward check, (17) reduces to (18). Finally, the rest of the proof follows from (6), (16), (18) and the fact that f(Δ_L, Δ_S) ≤ 0.

Basic Inequality
Proof. By the optimality of (L̂, Ŝ) and the feasibility of (L*, S*), we have inequality (20). Next, from y_i = ⟨L* + S*, X_i⟩ + ε_i, we obtain the stated decomposition. Using this decomposition along with (20), we arrive at the desired inequality. This completes the proof of the Lemma.

Proof of Lemma 3.3
Proof. To avoid complex notation, in this proof we initially ignore the term d_3 in the definition of C(L, S) and adjust for it towards the end of the proof. The reader may note that Lemma 3.1, Lemma 3.2 and the Basic Inequality continue to hold with this modification, where the earlier assumption λ_L ≥ (4/d_3) ‖Circ(D)‖_sp is now replaced by λ_L ≥ 4 ‖Circ(D)‖_sp.
Using Assumption 1 and (8), we obtain inequality (22). We derive a lower bound for its right-hand side and an upper bound for its left-hand side. Starting with the lower bound, a direct calculation together with Assumption 3 yields (23), and hence (24). Next, we derive the upper bound for the left-hand side of inequality (22); to that end, using inequalities (6) and (19), we get (25). By the definition of C(L, S) and Assumption 3 (both ignoring d_3), substituting the resulting bound into inequality (25) gives inequality (26). Combining inequalities (22), (24) and (26) yields inequality (27). Again, using Lemma 3.2 and the bound on C(Δ_L, Δ_S), and replacing the resulting inequality in (27), we arrive at (29). Recall from Lemma 3.1 that the rank of Circ(Δ_L^A) is at most 2R. This fact, along with the Compatibility Constant defined in [1], yields a bound involving ‖Circ(Δ_L)‖_F. Next, we adjust for the term d_3 that we ignored at the beginning of the proof by adding the factor 1/d_3 in front of ‖Circ(Δ_L)‖_F in this expression. Then, using the properties of Circ(Δ_L) and the fact that R ≤ r d_3, the inequality simplifies further. The reader may note that the resulting inequality is exactly the same as the one obtained in [1] towards the very end of the proof of their Theorem 1. Hence, as done in [1], we substitute it into inequality (29) and, following the exact same steps as in [1], complete the proof.

Proof of Theorem 3.4
Proof. To prove this result, we follow the same technique used by [32] while proving their Lemma 11.
First, we prove that the condition λ_L ≥ (4/d_3) ‖Circ(D)‖_sp is satisfied with high probability. It is easy to express ‖Circ(D)‖_sp in the required form; comparing this expression with the one in the statement of Lemma 11 of [32], one can claim that, by choosing λ_L greater than (2σ c_max / √n) E[‖G‖_sp], the condition λ_L ≥ ‖Circ(D)‖_sp holds with the stated probability, where G is a matrix with i.i.d. N(0, 1) entries and c_max is such that λ_max(Σ_Circ) ≤ c²_max, with Σ_Circ = Cov((Vec(Circ(X_1)))^T, ⋯, (Vec(Circ(X_n)))^T)^T.
It is easy to check that, with suitable permutations of the rows and columns of Σ_Circ, one can arrive at a block matrix C with blocks C_ij = Σ (see (11) and (12)) for all i = 1, 2, …, d_3 and j = 1, 2, …, d_3. Hence C is a block matrix, each of whose d_3² blocks equals Σ, so that λ_max(Σ_Circ) ≤ c_u² d_3, using (12). Also, using Lemma H.1 of [27], we obtain a bound on E[‖G‖_sp]. Hence, for a suitably chosen constant c*_1, λ_L should be chosen accordingly, and with such a choice of λ_L, the condition λ_L ≥ (4/d_3) ‖Circ(D)‖_sp is satisfied with probability greater than 1 − exp(−d_3(d_1 + d_2)).
Next, we prove that the condition λ_S ≥ 4 ‖D_Mat‖_{2,∞} is satisfied with high probability. We note that D_Mat can be expressed in terms of X_i,Mat, the d_1 × d_2 d_3 matrix constructed by placing the frontal slices of X_i side by side. As before, using Lemma 11 of [32], one can claim that, by choosing λ_S greater than (2σ c* / √n) E[‖G*‖_{2,∞}], the condition λ_S ≥ ‖D_Mat‖_{2,∞} holds with the stated probability. It is easy to see that Σ_Mat is the same as Σ, and thus λ_max(Σ_Mat) ≤ c_u², using (12). Accordingly, λ_S can be chosen so that the condition λ_S ≥ 4 ‖D_Mat‖_{2,∞} is satisfied with probability greater than 1 − exp(−9 log(d_2 d_3)). Hence, we have shown that the regularization parameters satisfy the conditions in Assumption 3 with high probability.
Finally, we complete the proof by using Lemma 12 of [32] to show that Assumption 1 holds with high probability. That lemma is based on the assumption that for any c > 0 there exists an n such that √S λ ≤ c, where S is the compatibility constant (see Section 3 of [32]) of the regularizer in the cone set and λ is the associated parameter. Note that in our case, Lemma 3.2 characterizes the cone set, and C(L, S) and λ_L are the regularizer and the parameter, respectively. Hence, keeping in mind the choice of λ_L made in the first part of the proof, and after some simple algebra, it can be shown that √S λ takes the required form. Hence, we need to assume that for any c > 0 there exists an n such that this quantity is at most c; note, however, that this requirement is in line with the choices of λ_L and λ_S that we make. Thus, using Lemma 12 of [32], we conclude that Assumption 1 is satisfied with high probability. The proof now follows using Lemma 3.3 and employing similar steps as in the proof of Corollary 4 of [1].

Proof of Corollary 3.4.1
Proof. Since the condition on λ_L does not change, we arrive at the same choice of λ_L as in the proof of Theorem 3.4. Using the result on Gaussian maxima ([18]) for the matrix G* defined in the proof of Theorem 3.4, we obtain the corresponding bound on E[‖G*‖_{2,∞}]. The rest of the proof follows along the same lines as the proof of Theorem 3.4.

Appendix B: Matrix-type view of a third-order tensor
We first present the matrix-type view of a third-order tensor. The three basic elements of interest are the lateral slices, horizontal slices and tube fibers, defined rigorously in the Notations section. Recalling the definitions, the lateral slices of X ∈ R^{d1×d2×d3} are d_2 laterally oriented matrices of dimension d_1 × d_3. As mentioned in [13], by looking at these laterally oriented matrices straight from the front, one sees them as column vectors of length d_1. Hence, the reader can envisage a three-dimensional tensor as a display of such lateral slices, placed side by side, playing the role of columns in a matrix. Similarly, the horizontal slices can be visualized as row vectors of length d_2, playing the roles of the rows of a matrix. Finally, by viewing the tube fibers of the tensor from the front, one visualizes them as the elements of a matrix. Figure 2 provides a pictorial representation of this discussion. Note that the lateral and horizontal slices, although matrices, can be considered third-order tensors in R^{d1×1×d3} and R^{1×d2×d3}, respectively. Similarly, a tube fiber, although a vector, can be considered a third-order tensor in R^{1×1×d3}; [13] refer to such elements of R^{1×1×d3} as Tubal Scalars.
Given this matrix-type view of a third-order tensor, we next discuss Block Circulant Matrices. For any vector a = [a_0, a_1, a_2, a_3]^T, the Circulant Matrix associated with a, denoted by Circ(a), is defined as

Circ(a) =
[ a_0  a_3  a_2  a_1
  a_1  a_0  a_3  a_2
  a_2  a_1  a_0  a_3
  a_3  a_2  a_1  a_0 ]

In the same spirit, one can construct the Block Circulant Matrix using the frontal slices of a third-order tensor. To avoid complications, here we slightly modify our earlier notation for frontal slices: the earlier notation X_::k for the k-th frontal slice is now simply replaced by X_k. The operator MatVec(A) stacks the frontal slices of A on top of one another into a d_1 d_3 × d_2 matrix, and fold reverses this operation; hence fold(MatVec(A)) = A.
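The MatVec and fold operators can be sketched as follows (NumPy; function names hypothetical):

```python
import numpy as np

def matvec(A):
    # Stack the frontal slices of A (shape d1 x d2 x d3) into a (d1*d3) x d2 matrix
    d1, d2, d3 = A.shape
    return np.vstack([A[:, :, k] for k in range(d3)])

def fold(M, d3):
    # Inverse of matvec: split the stacked matrix back into frontal slices
    d1 = M.shape[0] // d3
    return np.stack([M[k*d1:(k+1)*d1, :] for k in range(d3)], axis=2)
```

By construction, fold(matvec(A), d3) recovers A exactly.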

Appendix C: Background on t-product and t-SVD
Equipped with the aforementioned notations in Appendix B, we present next the t-product between two tensors and the corresponding t-SVD decomposition.
The idea of the t-product was introduced in [15], and some of its important theoretical properties used in this work were developed in [14]; they are summarized next.
Definition C.1. Given A ∈ R^{d1×d2×d3} and B ∈ R^{d2×l×d3}, the t-product A * B is defined to be the tensor C ∈ R^{d1×l×d3} given by C = A * B = fold(Circ(A) · MatVec(B)). Next, we discuss the notions of the identity tensor, the inverse and transpose of a tensor, and orthogonal tensors.
Definition C.2. The n × n × l Identity Tensor, denoted by I_nnl, is defined to be the tensor whose first frontal slice is the n × n identity matrix and whose remaining frontal slices are all zero. Suppose an orthogonal set of elements in R^{m×1×l} contains m elements. Comparing this framework to usual matrix algebra, it would be of great use if one could reconstruct any element of R^{m×1×l} from those m elements. As discussed in [13], one can achieve this by extending the concept of the usual linear combination to the t-linear combination, where lateral slices act as columns and tubal scalars play the role of scalars.
Employing the definition of the t-linear combination, one can now define the range of a tensor A ∈ R^{d1×d2×d3}, denoted by R(A), as the set of all possible t-linear combinations of its lateral slices. Similarly, extending the notion of linear dependence of two columns, one can say that the lateral slice A_:j2: is t-linearly dependent on the lateral slice A_:j1: if there exists a tubal scalar c such that A_:j2: = A_:j1: * c. Figure 3 clarifies this idea further with a simple example. Keeping this framework in mind, it is very useful to know the minimum number of elements of R^{d1×1×d3} required to construct any arbitrary element of R(A). As described in [13], this number is characterized by the Tubal Rank, which is the last topic of our discussion in this section. Before moving on to that discussion, we need one more notation.
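The t-product of Definition C.1 can be implemented directly as fold(Circ(A) · MatVec(B)); equivalently, and more cheaply, it amounts to slice-wise matrix multiplication in the Fourier domain along the tubes. A sketch (NumPy; function name hypothetical):

```python
import numpy as np

def t_product(A, B):
    # A: d1 x d2 x d3, B: d2 x l x d3  ->  C = A * B: d1 x l x d3
    Af = np.fft.fft(A, axis=2)            # FFT along the tube dimension
    Bf = np.fft.fft(B, axis=2)
    # multiply corresponding frontal slices in the Fourier domain
    Cf = np.einsum('ijk,jlk->ilk', Af, Bf)
    return np.real(np.fft.ifft(Cf, axis=2))
```

Multiplying by the identity tensor of Definition C.2 leaves a tensor unchanged, matching the matrix intuition.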
Notation C.1. An f-diagonal tensor, denoted by F, is a third-order tensor each of whose frontal slices is a diagonal matrix. In terms of notation, F_ijk = 0 for i ≠ j, ∀k.
Definition C.8. For any A ∈ R^{d1×d2×d3}, the t-SVD of A is given by A = U * S * V^T, where U and V are orthogonal tensors in R^{d1×d1×d3} and R^{d2×d2×d3}, respectively, and S is an f-diagonal tensor in R^{d1×d2×d3}.

Definition C.9. For any third-order tensor, the Tubal Rank, denoted by r, is defined to be the number of non-zero tubes in the f-diagonal tensor S of its t-SVD factorization. Hence, r = #{i : s_ii: ≠ 0}, where s_ii: denotes the i-th diagonal tube of S.
As with the matrix singular value decomposition, here too R(A) can be characterized unambiguously by the lateral slices of U. Also, the number of elements of R^{d1×1×d3} required to construct any element of R(A) is the same as the tubal rank of A; the reader may consult [13] for the proofs and further details. Hence, just as the rank of a matrix determines the number of linearly independent columns of the matrix, the tubal rank plays a similar role for a third-order tensor. Indeed, the lower the tubal rank, the higher the number of t-linearly dependent lateral slices. Figure 4 displays the tubal ranks and lateral slices of three different tensors. In the first case, only the first slice from the left is t-linearly independent; both remaining slices are t-linear combinations of the first one, so the tubal rank is one. Similar reasoning applies to the other two cases.
It is possible to compute the t-SVD by performing the matrix SVD d_3 times in the Fourier domain; see [14] for more details. However, [26] recently proposed a more efficient algorithm for computing the t-SVD, which requires performing the matrix SVD only ⌈(d_3 + 1)/2⌉ times instead of d_3 times. [26] defines the elements of the first frontal slice of the f-diagonal tensor S, that is S_::1, as the singular values of the tensor A and argues that the number of non-zero singular values equals the tubal rank of Definition C.9. In the notation used here, r = #{i : s_ii: ≠ 0} = #{i : S_ii1 ≠ 0}. So, by penalizing large values of Σ_{i=1}^{r} S_ii1, one can restrict the tubal rank to an upper bound. In [26], the quantity Σ_{i=1}^{r} S_ii1 is defined as the Tensor Nuclear Norm of the tensor A. As a side note, this definition of the Tensor Nuclear Norm is slightly different from the one in [41]. However, using Definition 7 of [26] and equation 12 of [25], one can derive the following relationship between the Tensor Nuclear Norm and the block-circulant matrix: ‖A‖_TNN = (1/d_3) ‖Circ(A)‖_*.
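The computation described above — matrix SVDs of the Fourier-domain frontal slices — can be sketched as follows (NumPy; function name hypothetical). The diagonal of the first frontal slice of S is recovered as the average of the Fourier-domain singular values, since the inverse FFT of a tube at index 0 equals the mean of its Fourier coefficients.

```python
import numpy as np

def tubal_rank_and_tnn(A, tol=1e-10):
    Af = np.fft.fft(A, axis=2)                    # Fourier-domain frontal slices
    d3 = A.shape[2]
    svals = np.stack([np.linalg.svd(Af[:, :, k], compute_uv=False)
                      for k in range(d3)], axis=1)
    s_first = svals.mean(axis=1)                  # diagonal entries S_ii1
    tubal_rank = int(np.sum(s_first > tol))       # number of non-zero S_ii1
    tnn = float(s_first.sum())                    # sum_i S_ii1, the TNN
    return tubal_rank, tnn
```

The returned tnn equals (1/d_3)·‖Circ(A)‖_*, since the FFT block-diagonalizes the block-circulant matrix into the Fourier-domain frontal slices.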
It is evident that, in order to restrict the tubal rank of a tensor, one can impose a penalty on the right-hand side of the above equation. We utilize this fact when discussing the convex relaxation of our proposed model in Section 2.1.