Tensor factorization recommender systems with dependency

Abstract: Dependency structure in recommender systems has been widely adopted in recent years to improve prediction accuracy. In this paper, we propose an innovative tensor-based recommender system, namely, the Tensor Factorization with Dependency (TFD). The proposed method utilizes shared factors to characterize the dependency between different modes, in addition to pairwise additive tensor factorization to integrate information among multiple modes. One advantage of the proposed method is that it provides flexibility for different dependency structures by incorporating shared latent factors. In addition, the proposed method unifies both binary and ordinal ratings in recommender systems. We achieve scalable computation for scarce tensors with high missing rates. In theory, we show the asymptotic consistency of estimators with various loss functions for both binary and ordinal data. Our numerical studies demonstrate that the proposed method outperforms the existing methods, especially on prediction accuracy.


Introduction
A recommender system aims to provide recommendations for items users might prefer, and has been widely used to boost sales by targeting different people with different products and giving personalized recommendations. Traditionally, a recommender system can be formulated as a matrix of user-item interactions, such as movie ratings, sales of products, or the frequency of users visiting certain locations.
Nowadays, high-order information beyond user-item pairs is collected, such as time, location, and product features [34,25,5], and incorporating such information can increase predictive power. The additional information defining the underlying situation in which a recommendation is provided is referred to as a context. Recommender systems with contextual information are called context-aware recommender systems (CARS) [2]. The applications of CARS include personalized marketing strategies for retail stores, product recommendations for different seasonalities, and individualized medical therapies. Ignoring context information may result in a loss of predictive power. For example, the preferences of customers buying new clothes can be significantly different between summer and winter [2].
Tensor representation and decomposition are newly developed tools to handle context information in recommender systems [17,4,36,37]. Tensors are generalizations of matrices for higher-order relational data and provide useful high-order data representation formats [4]. In recommender systems, a tensor is an effective and efficient structure for incorporating contextual information in CARS. Specifically, in addition to users and items, CARS can be formulated as a tensor X, as illustrated in Figure 1.
One advantage of utilizing tensor representation is capturing complex user-item-context interactions through tensor decompositions such as the CANDECOMP/PARAFAC (CP) decomposition [16] and the Tucker decomposition [33]. Tensor decompositions can be viewed as higher-order latent factor modeling, where a set of latent factor vectors is used to encode users, items, and contexts, and to capture their relations in a latent space.
The existence of dependency structures is quite common in recommender systems [4,36]. A dependency may be induced because the tensor entries are not necessarily independent; for example, users may be influenced by other users who share similar preferences. Existing methods mainly focus on a specific dependency structure. For example, the recommendation engine of multi-layers [4] and the double core tensor factorization model [32] consider the heterogeneity across subgroup structures in each mode. As another example, when time is regarded as contextual information, the recommender system often exhibits strong temporal patterns, which also introduces a dependency structure among user and item interactions over time. Accordingly, the temporal dynamics factor models [18] and the Bayesian probabilistic tensor factorization method [36] aim to address this issue. However, in many real-world applications, dependency can also exist among users, items, and other contextual variables. For example, in beer recommendation, the beer style is entirely controlled by the brewery, and users' preference for a brewery highly depends on the beer styles the brewery produces. In tag recommendation, tags are highly associated with items, as users select similar tags for a specific item. Therefore, between-mode dependency occurs frequently, and developing a tensor-based method to capture dependency among user, item, and context is necessary in practice.
A key feature of recommender systems is that the entries of the utility matrix or tensor are commonly binary or ordinal, representing yes-or-no responses or product ratings. Studies considering high-order binary or rating data include link function approaches [20,24,35,12,19] and Boolean tensor decomposition, which decomposes a binary tensor into a series of binary factors without a population assumption [22,11,28,13,10]. For a model-free approach such as Boolean tensor decomposition, no distributional assumption is specified for the tensor entries. In contrast, methods utilizing a link function take advantage of a population model assumption [35,28], which can distinguish the algorithmic error from the statistical error [35].
In this paper, we propose a new tensor factorization with dependency (TFD) model. The main novelty of our method is to incorporate the dependency structures among modes by utilizing shared latent factors across pairwise interactions. Furthermore, the proposed method unifies the binary and ordinal ratings in recommender systems and is scalable in computation by exploiting the scarcity of the observed utility tensor. In addition, we establish the asymptotic consistency and convergence rate of the proposed estimator for both binary and ordinal cases.
The proposed tensor factorization method has the following advantages. First, our method is able to incorporate a wide range of dependency structures and requires no additional assumptions on latent factors or variables. In contrast, existing methods generally target a specific dependency structure, such as the Markov chain on temporal data or the multivariate copula [36,15]. In addition, by introducing shared latent factors, the proposed model can accommodate a weak dependency among modes, thereby improving prediction accuracy. Here, weak dependency indicates that only a few latent factors are shared between the user-item interaction and the item-context interaction. This also implies that the latent factors specific to the user-item interaction do not affect the item-context interaction.
Second, the proposed model is effective for high-order CARS with a high missing rate of observations. By utilizing a summation of pairwise interactions, we can capture most of the information in the user-item interaction and its dependency with other interactions via a parsimonious model. Our method is less stringent in its sample size requirements for recovering the tensor compared to existing tensor decomposition methods under the same framework. Pairwise interaction tensor factorization (PITF) has been applied in tag recommendations and movie recommendations [27,8]. However, the PITF is not able to model the dependency structure among interactions, since the latent factors modeling interactions across different modes are assumed to be independent of each other.
In addition, we propose a mode-wise coordinate descent (MCD) algorithm to reduce the computational cost via utilizing the sparsity of the observed tensor. In both of our simulations and real data applications, the proposed method outperforms the competing methods on prediction accuracy.
The rest of the paper is organized as follows. Section 2 provides the background of CARS and tensor factorization. Sections 3 and 4 introduce the proposed method and the optimization algorithm. Theoretical properties are provided in Section 5. Section 6 presents simulation studies to investigate the performance of our method. In Section 7, we apply the proposed method to two real data applications: User-Location-Activity data and Beer Review data. Section 8 concludes with a discussion.

Background and notation
In this section, we introduce the notation, the background of tensor representation and the related tensor decompositions.
A tensor is a multidimensional array, a generalization of vectors and matrices, where the order of a tensor is the number of dimensions of the array; each dimension is also known as a mode. For example, a vector and a matrix can be viewed as a first-order and a second-order tensor, respectively. In this paper, vectors are denoted by boldface letters, e.g., a; matrices are denoted by boldface capital letters, e.g., A; and tensors are denoted by boldface Euler script letters, e.g., X.
Traditional recommender systems can be formulated by a utility matrix X ∈ R^{n1×n2} representing user-item interactions, such as ratings or purchase status, where the element x_ij indicates the interaction between user i and item j. For example, in brewery recommendations, the element x_ij represents the rating of brewery j given by user i. However, more sophisticated recommender systems also collect other potentially useful contextual information, such as time of year or beer style. Instead of using a utility matrix X to represent user-item interactions, a tensor structure X ∈ R^{n1×n2×n3} is widely adopted to incorporate additional contextual information, as illustrated in Figure 1. The element x_ijk of a third-order tensor indicates the interaction among user i, item j, and context k. For instance, in the previous brewery recommendation example, the element x_ijk represents the rating on beer style k of brewery j given by user i. By fixing the last index at k, the frontal slice X_··k represents the utility matrix under context k.
Tensor representation and decomposition are newly developed tools to deal with context-aware recommender systems (CARS) [2,4,6], due to their capability of reducing model complexity through utilizing a low-rank structure. One commonly used tensor decomposition is the Canonical Polyadic Decomposition (CPD) [16], which factorizes a tensor into a sum of R rank-one tensors. For example, the CPD for a third-order tensor X ∈ R^{n1×n2×n3} is

X = Σ_{r=1}^R a_r^{(1)} ∘ a_r^{(2)} ∘ a_r^{(3)} + E,   (1)

where ∘ represents the vector outer product, E = {ε_ijk} is a noise tensor with independent and identically distributed (i.i.d.) entries, and the latent factor a_r^{(k)} ∈ R^{n_k×1} (r = 1, ..., R; k = 1, 2, 3) corresponds to the k-th mode. The rank of a tensor is defined as rank(X) = min{R : X admits an exact decomposition (1) with R components}. We denote A^{(m)} = [a_1^{(m)}, ..., a_R^{(m)}] ∈ R^{n_m×R}, m = 1, 2, 3, as factor matrices, where each row of A^{(1)} represents the R-dimensional latent factor for a user, and each row of A^{(2)} and A^{(3)} is a latent factor for an item and a contextual variable, respectively.
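To make the CPD concrete, the following minimal numpy sketch reconstructs a third-order tensor from its factor matrices; the factor matrices here are randomly generated for illustration only.

```python
import numpy as np

def cp_reconstruct(A1, A2, A3):
    """Sum of R rank-one tensors: X_ijk = sum_r A1[i,r] * A2[j,r] * A3[k,r]."""
    return np.einsum('ir,jr,kr->ijk', A1, A2, A3)

rng = np.random.default_rng(0)
n1, n2, n3, R = 4, 5, 6, 2
A1 = rng.normal(size=(n1, R))
A2 = rng.normal(size=(n2, R))
A3 = rng.normal(size=(n3, R))
X = cp_reconstruct(A1, A2, A3)

# entry-wise check of one element against the definition
manual = sum(A1[1, r] * A2[2, r] * A3[3, r] for r in range(R))
```

The `einsum` call sums the outer products over the shared index r, exactly matching the component-wise definition of the CPD.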
Pairwise interaction tensor factorization (PITF) has gained considerable attention due to its simplicity and high prediction quality, and has been applied in tag recommendation, sequential recommender systems, and movie recommendation [26,27,8]. Here, tag recommendation refers to a recommender system that suggests useful tags for a specific item, while a sequential recommender system predicts what a user would purchase based on their previous purchase experience. PITF models multi-way interactions through a summation of a series of two-way interactions. For example, to model a user's ratings for breweries over different beer styles, the PITF model assumes that each rating is determined by three two-way interactions: the user's inherent preference for a certain brewery, the brewery's specific recipes for different beer styles, and the user's preference over different beer styles. That is, each entry x_ijk of tensor X is modeled by

x_ijk = Σ_{r=1}^R a_ir^{(b)} b_jr^{(a)} + Σ_{r=1}^R a_ir^{(c)} c_kr^{(a)} + Σ_{r=1}^R b_jr^{(c)} c_kr^{(b)} + ε_ijk.   (2)

For each mode, two sets of factors are modeled to account for the interactions with the other two modes. For example, the latent factor a_ir^{(b)} characterizes the i-th user's interaction with items, while a_ir^{(c)} represents the i-th user's interaction with contexts. One advantage of the PITF is that it only requires O{nr log^2(n)} observations to fully recover the tensor [8], while CP tensor factorization requires O{r^5 n^{3/2} log^4(n)}, where n is the maximum size of the mode dimensions and r is the dimension of the latent vectors [14]. The explicit form of the tensor structure in (2) ensures that the parameter space of the latent factors is smaller than that of the regular CP decomposition, and therefore the prediction accuracy can be improved [27]. However, the PITF does not consider the dependency among the pairwise interactions of user, item, and context, since two separate latent factors are used for each mode; the user-item interaction is therefore modeled independently of the user-context interaction. In practice, however, the user-item interaction is highly associated with the item-context interaction. For example, in brewery recommendation, the user's preference for a certain brewery also depends on the main beer style of the brewery.
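The three two-way terms of PITF can be sketched in numpy as follows; the factor matrix names (`A_b` for the user factors interacting with items, etc.) are our own labels, and the matrices are randomly generated for illustration.

```python
import numpy as np

def pitf_predict(A_b, B_a, A_c, C_a, B_c, C_b):
    """PITF: x_hat_ijk = sum_r A_b[i,r]B_a[j,r] + A_c[i,r]C_a[k,r] + B_c[j,r]C_b[k,r].
    Each mode carries two factor matrices, one per interacting mode."""
    ui = A_b @ B_a.T   # (n1, n2) user-item term
    uc = A_c @ C_a.T   # (n1, n3) user-context term
    ic = B_c @ C_b.T   # (n2, n3) item-context term
    return ui[:, :, None] + uc[:, None, :] + ic[None, :, :]

rng = np.random.default_rng(0)
n1, n2, n3, R = 3, 4, 5, 2
shapes = [(n1, R), (n2, R), (n1, R), (n3, R), (n2, R), (n3, R)]
mats = [rng.normal(size=s) for s in shapes]
X_hat = pitf_predict(*mats)
```

Note that no factor matrix appears in more than one term, which is exactly why PITF cannot express dependency between the pairwise interactions.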
Another well-known tensor factorization method is the higher-order singular value decomposition, also referred to as the Tucker decomposition (TD) [33], which decomposes a tensor into a core tensor multiplied by a factor matrix along each mode. In contrast to the CP decomposition, the rank of a tensor in the Tucker decomposition is not as well defined as the matrix rank in singular value decomposition. However, in recommender system applications, the tensor rank is essential for achieving dimension reduction of the original tensor data and providing interpretability of the latent factors [4]. Another drawback of the TD is that its theoretical computational complexity is O(R^3), while that of the CP decomposition is only O(R). Thus, the CP decomposition is more popular in recommender systems [22,4,12]. More details on tensors can be found in [17,4].

Tensor factorization with dependency
In this section, we introduce the proposed tensor dependency modeling and the corresponding factorization method. The dependency between the user-item relation and the item-context relation is very common in practice. Therefore, we model a context-aware rating tensor via a series of two-way interactions among latent factors corresponding to user, item, and context, respectively. Specifically, we propose

x_ijk = Σ_{r=1}^R a_ir b_jr + Σ_{r=1}^R b_jr c_kr + ε_ijk,   (3)

where a_ir is the r-th latent factor for user i, and b_jr and c_kr are the corresponding latent factors for item j and context k. The underlying dependency among different modes of the tensor is incorporated via sharing mode-specific latent factors across the two-way interactions, e.g., {b_jr} are shared in both interaction terms in (3). The proposed method differs from existing tensor modeling of recommender systems such as the PITF [27], in that the proposed model imposes a latent-factor-sharing structure, which incorporates a multi-way interaction among user, item, and context through a summation of two-way interactions. In model (3), user i's preference under context k is implicitly encoded in the dependency between a_ir b_jr and b_jr c_kr via the shared latent factor b_jr of an item, which serves as a "bridge" between user i and context k. Sharing latent factors can accommodate different types of dependency. For example, we can incorporate dependency among users, items, or contexts from the same group, which has been studied before [4,32].
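The noiseless part of model (3) can be sketched in a few lines of numpy; the item factor matrix B enters both two-way terms, which is how the between-mode dependency is carried. The factor matrices here are randomly generated for illustration.

```python
import numpy as np

def tfd_scores(A, B, C):
    """Model (3) without noise: theta_ijk = sum_r a_ir b_jr + sum_r b_jr c_kr.
    B is shared by both two-way terms, which is how the between-mode
    dependency enters the model."""
    ab = A @ B.T   # (n1, n2) user-item interactions
    bc = B @ C.T   # (n2, n3) item-context interactions
    return ab[:, :, None] + bc[None, :, :]

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 2))
B = rng.normal(size=(4, 2))
C = rng.normal(size=(5, 2))
theta = tfd_scores(A, B, C)
```

Contrast this with the PITF sketch: here only three factor matrices are estimated rather than six, and perturbing a row of B changes both the user-item and the item-context interactions.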
For example, consider a third-order tensor X = {x_ijk} ∈ R^{100×100×100} representing the user-brewery-style relations. We generate two groups of latent factors for each mode, where the first 50 users, breweries, and styles have latent factors a_1, b_1, and c_1, respectively, while the last 50 users, breweries, and styles have latent factors a_2, b_2, and c_2, respectively. By averaging ratings over styles and over users, each row of X_ij· and each column of X_·jk represent the ratings on breweries by each user, and the ratings on breweries for each style, respectively. Figure 2 shows the correlation between each row of X_ij· and each column of X_·jk, representing the correlation between the user-brewery interaction and the brewery-style interaction. The top-left red block shows that the first group of users' ratings on breweries are positively correlated with the first group of styles' ratings for breweries, while the bottom-left blue block shows that the second group of users' ratings on breweries are negatively correlated with the first group of styles' ratings for breweries. This type of dependency may be caused by users' preferences for different tastes of beer styles. If the first group of users like a sweet taste, and the first group of styles' high ratings are due to the sweet flavor of the beer, then the users' ratings on breweries are positively correlated with the styles' ratings for breweries. On the other hand, if the second group of users prefer a strong and malty taste, they may tend to give low ratings to breweries with high ratings for the first group of styles, which leads to the negative correlations at the bottom left. The above simulation shows how the proposed method is able to capture the dependency between the user-item interaction and the item-context interaction in the beer review context.
For the real data, Figure 3 shows the dependency between the user-location interaction and the location-activity interaction from the user-location-activity data. There are only positive or zero correlations in these data. This can be explained by the fact that people tend to go to places offering activities they prefer. Although one might capture the dependency between interactions by increasing the rank in a tensor decomposition such as CP or Tucker, this could be infeasible, as non-identifiability, unstable estimation, and higher computational cost arise as the tensor rank increases. By specifying a more explicit structure for the tensor as in (3), we can reduce the number of observations required to recover the tensor, especially when the tensor size is large. In addition, a three-way interaction could introduce spurious correlation and noise, especially if there is no correlation between the user and the context variable. For example, in tag recommendation, a tag is used to describe the characteristics of items, which are not likely to be correlated with users.
In general, the user-context interaction can be ignored in a recommender system; that is, the interaction between user {a_ir} and context {c_kr} is less influential when a user's primary interest is the item ranking [27]. This can be explained from the Bayesian Personalized Ranking (BPR) perspective, where optimization is achieved through maximizing the likelihood of item rankings conditioning on user-context pairs (i, k). Specifically, let j_1 >_{i,k} j_2 denote that j_1 ranks higher than j_2 given a user-context pair (i, k), and let the matrix R^{(i,k)} ∈ {0, 1}^{n2×n2} denote the rankings of all items for user i under context k, where each entry is R^{(i,k)}_{j1,j2} = I(j_1 >_{i,k} j_2), and I(·) is the indicator function. The problem of finding the best ranking R^{(i,k)} given the observed data can be formulated as maximizing the following likelihood of independent observations:

∏_{(i,k) ∈ I×K} p(R^{(i,k)} | Θ),

where I, K are the sets of users and contexts, respectively, and Θ is a model parameter vector. Assuming that each event j_1 >_{i,k} j_2 follows a Bernoulli distribution, we have

p(R^{(i,k)} | Θ) = ∏_{(j1,j2)} p(j_1 >_{i,k} j_2 | Θ)^{I(j_1 >_{i,k} j_2)} {1 − p(j_1 >_{i,k} j_2 | Θ)}^{1 − I(j_1 >_{i,k} j_2)}.

Suppose that we have a model predicting a scoring function ŷ : I × J × K → R. We derive an estimator for p(j_1 >_{i,k} j_2 | Θ) by plugging in ŷ:

p(j_1 >_{i,k} j_2 | Θ) = σ(ŷ_{i,j1,k} − ŷ_{i,j2,k}),

where σ(x) = 1/(1 + e^{−x}) is the logistic function. Notice that ŷ is an arbitrary real-valued function obtained from a model parameter vector Θ, e.g., the proposed factorization model.
The Bayesian Personalized Ranking (BPR) perspective shows that if the primary interest is item ranking, then, when estimating the scoring ŷ, we only care about the difference between two items' scores, ŷ_{i,j1,k} − ŷ_{i,j2,k}. Although ignoring the user-context interaction may lead to inaccurate scoring predictions, the ranking of items based on the estimated scoring is not affected. For example, if a user-context term of the form Σ_{r=1}^R a_ir c_kr were added to the model, it would contribute equally to ŷ_{i,j1,k} and ŷ_{i,j2,k} for any fixed pair (i, k), so the third term for the user-context interactions is canceled out in the difference. In addition, discarding the user-context interaction does not lead to poor rating predictions, as shown in our simulations. This may be because the value-based loss functions for binary ratings and ordinal ratings behave similarly to the rank-based loss function.
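The cancellation argument above can be verified numerically: adding a user-context term shifts every item's score by the same constant for a fixed (user, context) pair, so the item ranking is unchanged. This is a sketch with randomly generated factors; `U` denotes hypothetical user-side factors for the extra user-context term.

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, n3, R = 3, 6, 4, 2
A = rng.normal(size=(n1, R))
B = rng.normal(size=(n2, R))
C = rng.normal(size=(n3, R))
U = rng.normal(size=(n1, R))  # hypothetical factors for a user-context term

theta = (A @ B.T)[:, :, None] + (B @ C.T)[None, :, :]  # proposed scores
theta_full = theta + (U @ C.T)[:, None, :]             # add a user-context term

# For each fixed (user, context) pair, the extra term shifts every item's
# score by the same constant, so item rankings coincide.
rank = np.argsort(theta, axis=1)
rank_full = np.argsort(theta_full, axis=1)
```

Sorting along the item axis (axis 1) returns identical orderings for `theta` and `theta_full`, which mirrors the BPR argument in the text.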
Another advantage of the proposed model (3) is its flexibility in modeling different dependency structures among modes. Existing methods for characterizing dependency assume a specific dependency structure, such as the Markov chain on temporal data or the multivariate copula [36,15]. In the Markov chain model, the probability of each event depends only on the state of the previous event, while in the multivariate copula model, data are assumed to follow a specific multivariate distribution. These model assumptions could be violated in practice. In contrast, the proposed model utilizes latent factor modeling and is able to approximate complex unstructured dependency between tensor modes.
In this paper, we consider the most common cases where x_ijk is binary or ordinal. We first introduce the proposed model for a binary rating tensor as follows:

logit{P(x_ijk = 1)} = Σ_{r=1}^R a_ir b_jr + Σ_{r=1}^R b_jr c_kr,   (7)

where P(x_ijk = 1) denotes the probability of x_ijk being 1, and logit(p) = log{p/(1 − p)}.
To estimate the latent factor matrices A = {a_ir}, B = {b_jr}, and C = {c_kr}, we minimize the following squared loss on x_ijk with an L2 penalty [5,3,29], where the L2 penalty controls the model complexity to avoid over-fitting and scale indeterminacy [1]:

min_{A,B,C} Σ_{(i,j,k)∈Ω} {x_ijk − σ(θ_ijk)}^2 + λ(‖A‖_F^2 + ‖B‖_F^2 + ‖C‖_F^2),   (8)

where θ_ijk = Σ_{r=1}^R a_ir b_jr + Σ_{r=1}^R b_jr c_kr, Ω denotes the set of indices corresponding to the observed ratings, σ(x) = exp(x)/{1 + exp(x)} is the logistic function, λ is the penalty tuning parameter, and ‖·‖_F denotes the Frobenius norm. Note that minimizing the squared loss is equivalent to maximizing the likelihood of the observed {x_ijk} if the data are fully observed. In recommender systems, the first squared loss term and the remaining penalty term in (8) could be under different domain spaces; therefore, (8) is more computationally efficient in practice.
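A minimal numpy sketch of the penalized squared loss in (8), evaluated only on the observed index set Ω (passed as a tuple of index arrays); the helper names are our own.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def binary_loss(A, B, C, x_obs, idx, lam):
    """Objective (8): squared loss over observed entries plus L2 penalty.
    idx is a tuple (i, j, k) of index arrays for the observed set Omega."""
    i, j, k = idx
    theta = np.sum(A[i] * B[j], axis=1) + np.sum(B[j] * C[k], axis=1)
    fit = np.sum((x_obs - sigmoid(theta)) ** 2)
    pen = lam * (np.sum(A ** 2) + np.sum(B ** 2) + np.sum(C ** 2))
    return fit + pen

# sanity check: with all factors at zero, each predicted probability is 0.5
A0, B0, C0 = np.zeros((3, 2)), np.zeros((4, 2)), np.zeros((5, 2))
idx = (np.array([0, 1, 2]), np.array([0, 1, 3]), np.array([0, 2, 4]))
x_obs = np.array([1.0, 1.0, 1.0])
loss0 = binary_loss(A0, B0, C0, x_obs, idx, lam=0.0)  # 3 * (1 - 0.5)^2
```

Only |Ω| values of θ are ever formed, so the cost of one evaluation is O(|Ω| R) rather than O(n1 n2 n3 R).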

Ordinal rating tensor
In the following, we extend the proposed tensor modeling to the case of ordinal ratings. Given an ordinal rating tensor X = {x_ijk} with L levels, we propose a proportional-odds-based model on rating x_ijk as follows:

logit{P(x_ijk ≤ l)} = α_l − θ_ijk,  l = 1, ..., L − 1,   (9)

where θ_ijk = Σ_{r=1}^R a_ir b_jr + Σ_{r=1}^R b_jr c_kr and α_1 < α_2 < ... < α_{L−1} are the cutoff parameters. The model (9) has an equivalent representation as follows:

x_ijk = l  if  α_{l−1} < x*_ijk ≤ α_l,   (10)

where {x*_ijk} is a noisy version of {θ_ijk}, that is, x*_ijk = θ_ijk + ε_ijk with i.i.d. logistic noises ε_ijk whose cumulative distribution function is P(ε ≤ t) = 1/(1 + e^{−t}) [19]. According to (10), we assume that the ordinal outcome x_ijk can be formulated as a categorized version of a latent continuous variable x*_ijk. For example, a 5-point ordinal rating can be treated as a discrete surrogate of a continuous variable indicating the degree of likeness.
In the following, we propose a proportional odds model where the distances between pairs of levels are the same for all entries; therefore, (9) implies L − 1 parallel classification hyperplanes for the L-level ratings as follows:

log{P(x_ijk ≤ l) / P(x_ijk > l)} = α_l − θ_ijk,  l = 1, ..., L − 1.   (11)

Note that the right-hand side of (11) does not involve x_ijk; that is, the log odds are modeled in the same form across observations. If there is evidence showing that the log odds are heteroscedastic across entries (i, j, k), then a multinomial logistic model is more desirable.
Let Ω represent the set of indices corresponding to the observed ratings. To simplify the notation, let α_0 = −∞ and α_L = ∞. We estimate A, B, C, and α = {α_l}, l = 1, ..., L − 1, by minimizing the following penalized negative log likelihood:

−Σ_{(i,j,k)∈Ω} Σ_{l=1}^L I(x_ijk = l) log{σ(α_l − θ_ijk) − σ(α_{l−1} − θ_ijk)} + λ(‖A‖_F^2 + ‖B‖_F^2 + ‖C‖_F^2),   (12)

where θ_ijk = Σ_{r=1}^R a_ir b_jr + Σ_{r=1}^R b_jr c_kr and σ(·) is the logistic function. Notice that we do not penalize α, as the α_l are the cutoff points for the rating levels. When L = 2, the proportional odds loss function (12) degenerates to a penalized logistic loss function for binary ratings.
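The unpenalized part of the proportional-odds likelihood can be sketched as follows, using sentinel cutoffs α_0 = −∞ and α_L = +∞ exactly as in the text; the function name is our own.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def ordinal_nll(theta, x, alpha):
    """Unpenalized part of (12): P(x = l) =
    sigmoid(alpha_l - theta) - sigmoid(alpha_{l-1} - theta),
    with sentinel cutoffs alpha_0 = -inf and alpha_L = +inf."""
    a = np.concatenate(([-np.inf], np.asarray(alpha, float), [np.inf]))
    x = np.asarray(x)
    p = sigmoid(a[x] - theta) - sigmoid(a[x - 1] - theta)
    return -np.sum(np.log(p))

# a single 2-level rating with theta = 0 and cutoff alpha_1 = 0:
# P(x = 1) = sigmoid(0) - sigmoid(-inf) = 0.5, so the NLL is log(2)
nll = ordinal_nll(theta=np.array([0.0]), x=np.array([1]), alpha=[0.0])
```

With L = 2 this reduces to the logistic log likelihood, matching the degeneracy noted after (12).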

Computation
In this section, we propose a scalable algorithm to estimate the latent factors A, B, C and cutoff parameter α, and provide implementation strategies to improve the stability and efficiency of the optimization.
The proposed algorithm is based on mode-wise gradient descent, and the gradients of the latent factors can be derived from the loss functions (8) and (12). Let Ω := {(i, j, k) : x_ijk is observed} be the set of indices corresponding to the observed ratings, and define Ω_i·· := {(j, k) : (i, j, k) ∈ Ω} as the subset of Ω collecting entries whose first index is i. We similarly define Ω_·j· := {(i, k) : (i, j, k) ∈ Ω} and Ω_··k := {(i, j) : (i, j, k) ∈ Ω} corresponding to the second and third modes. For the binary rating case, the gradients of the proposed loss function (8) with respect to the latent factors are

∂L/∂a_ir = Σ_{(j,k)∈Ω_i··} y_ijk b_jr + 2λ a_ir,  ∂L/∂b_jr = Σ_{(i,k)∈Ω_·j·} y_ijk (a_ir + c_kr) + 2λ b_jr,  ∂L/∂c_kr = Σ_{(i,j)∈Ω_··k} y_ijk b_jr + 2λ c_kr,   (13)

where

y_ijk = −2{x_ijk − σ(θ_ijk)} σ(θ_ijk){1 − σ(θ_ijk)}.   (14)

For the ordinal rating case, the gradients of the penalized negative log likelihood (12) with respect to the latent factors A, B, C take the same form:

∂L/∂a_ir = Σ_{(j,k)∈Ω_i··} y_ijk b_jr + 2λ a_ir,  ∂L/∂b_jr = Σ_{(i,k)∈Ω_·j·} y_ijk (a_ir + c_kr) + 2λ b_jr,  ∂L/∂c_kr = Σ_{(i,j)∈Ω_··k} y_ijk b_jr + 2λ c_kr,   (15)

where now

y_ijk = {σ′(α_{x_ijk} − θ_ijk) − σ′(α_{x_ijk − 1} − θ_ijk)} / {σ(α_{x_ijk} − θ_ijk) − σ(α_{x_ijk − 1} − θ_ijk)},   (16)

with σ′(t) = σ(t){1 − σ(t)}; the gradient with respect to each cutoff α_l can be derived from (12) in a similar fashion. Instead of updating only one entry of A, B, C, α at each iteration, we can utilize matrix and tensor operations to jointly update each mode of the rating tensor at each iteration using parallel computing, thus reducing computation time [12].
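As a sketch of the mode-wise update for the binary case, the gradient of the loss (8) with respect to A can be accumulated over only the observed triples; a central finite-difference check on one coordinate is included to confirm the formula. All names and the toy data are our own.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def grad_A(A, B, C, x_obs, idx, lam):
    """Gradient of the binary objective (8) with respect to A,
    accumulated only over observed triples (i, j, k)."""
    i, j, k = idx
    theta = np.sum(A[i] * B[j], axis=1) + np.sum(B[j] * C[k], axis=1)
    s = sigmoid(theta)
    y = -2.0 * (x_obs - s) * s * (1.0 - s)   # nonzero entries of Y
    g = 2.0 * lam * A                        # penalty part (new array)
    np.add.at(g, i, y[:, None] * B[j])       # scatter-add per observed triple
    return g

def loss(A, B, C, x_obs, idx, lam):
    i, j, k = idx
    theta = np.sum(A[i] * B[j], axis=1) + np.sum(B[j] * C[k], axis=1)
    return np.sum((x_obs - sigmoid(theta)) ** 2) + lam * (
        np.sum(A ** 2) + np.sum(B ** 2) + np.sum(C ** 2))

rng = np.random.default_rng(2)
A = rng.normal(size=(3, 2))
B = rng.normal(size=(4, 2))
C = rng.normal(size=(5, 2))
idx = (np.array([0, 1, 2]), np.array([0, 1, 3]), np.array([0, 2, 4]))
x_obs = np.array([1.0, 0.0, 1.0])
lam = 0.1

g = grad_A(A, B, C, x_obs, idx, lam)
eps = 1e-6
Ap, Am = A.copy(), A.copy()
Ap[0, 0] += eps
Am[0, 0] -= eps
num = (loss(Ap, B, C, x_obs, idx, lam) - loss(Am, B, C, x_obs, idx, lam)) / (2 * eps)
```

`np.add.at` performs an unbuffered scatter-add, so repeated row indices in the observed set accumulate correctly; the gradients for B and C follow the same pattern with their respective index sets.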
Due to the non-convex nature of the proposed objective functions (8) and (12), gradient-descent-based optimization can only guarantee convergence to a stationary point and is, in general, sensitive to the initial value. However, the proposed algorithm is not sensitive to the initial value, since our method has a low model complexity compared to standard tensor decomposition methods. This low model complexity has the advantage of aiding the convergence of the proposed algorithm, in that it is more stable and less likely to be trapped at saddle points [9]. We set the initial values for the latent factors randomly as a_i, b_j, c_k ~iid N(0, I_R), where I_R is an R × R identity matrix, and α_l ~ Uniform(0, 1) for the cutoffs in the ordinal rating case. After initializing α, we sort its entries in ascending order. We summarize the proposed optimization in the following Algorithm 1.
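The initialization step described above can be sketched as follows; the function name and seed handling are our own choices.

```python
import numpy as np

def init_params(n1, n2, n3, R, L, seed=0):
    """Random starting values: factor rows drawn from N(0, I_R); cutoffs
    drawn from Uniform(0, 1) and sorted ascending (ordinal case, L levels)."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(n1, R))
    B = rng.normal(size=(n2, R))
    C = rng.normal(size=(n3, R))
    alpha = np.sort(rng.uniform(0.0, 1.0, size=L - 1))  # L - 1 ordered cutoffs
    return A, B, C, alpha

A, B, C, alpha = init_params(10, 8, 6, R=3, L=5)
```

Sorting after sampling guarantees the monotonicity constraint α_1 < ... < α_{L−1} required by the proportional odds model.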

In addition, we tune the rank R and λ by minimizing the mean squared error (MSE) on a validation set, where the MSE on a set Ω is defined as (1/|Ω|) Σ_{(i,j,k)∈Ω} (x_ijk − x̂_ijk)^2, with x̂_ijk the predicted rating. High computational complexity is a common issue for tensor-based modeling, especially when the dimension of the tensor is large. One difficulty arises from the trade-off between memory cost and computational efficiency. In addition to the rating tensor X, we also need to store Y = {y_ijk} in (14) and (16) when computing gradients. When X is sparse, we can store X efficiently by storing only its nonzero values and the corresponding indices. However, even when X is sparse, Y can still be dense and might cause memory issues.
The proposed algorithm achieves higher computational efficiency when the dimension of each mode is large and X is scarce [12]. In practice, X is likely scarce, as the vast majority of its entries are missing. Since each missing element of X corresponds to a zero entry of Y, the tensor Y becomes sparse when X is scarce. Our strategy is to store only the nonzero entries of Y to save memory, which makes our algorithm scalable.
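The storage strategy can be sketched as follows: Y is kept in coordinate (COO) form, with one value per observed triple, so its memory footprint is O(|Ω|) rather than O(n1 n2 n3). The function name and toy data are our own.

```python
import numpy as np

def residual_coo(A, B, C, x_obs, idx):
    """Compute the nonzero entries of Y (binary case, (14)) in COO form:
    one value per observed triple (i, j, k); missing entries of X
    contribute exact zeros to Y and are never materialized."""
    i, j, k = idx
    theta = np.sum(A[i] * B[j], axis=1) + np.sum(B[j] * C[k], axis=1)
    s = 1.0 / (1.0 + np.exp(-theta))
    vals = -2.0 * (x_obs - s) * s * (1.0 - s)
    return idx, vals   # O(|Omega|) memory instead of O(n1 * n2 * n3)

rng = np.random.default_rng(3)
A = rng.normal(size=(3, 2))
B = rng.normal(size=(4, 2))
C = rng.normal(size=(5, 2))
idx = (np.array([0, 2, 1]), np.array([3, 0, 1]), np.array([4, 2, 0]))
x_obs = np.array([1.0, 0.0, 1.0])
coords, vals = residual_coo(A, B, C, x_obs, idx)
```

All gradient accumulations in the MCD algorithm only touch these |Ω| values, which is what makes the per-iteration cost scale with the number of observed ratings.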

Theory
In this section, we develop the theoretical properties of the proposed method. Specifically, we provide the consistency and convergence rate of the proposed estimator for both the binary and ordinal rating cases. In addition, we extend the convergence property of the proposed estimator under the L2 loss to a general loss function, given additional smoothness conditions. We first provide the asymptotic properties of the proposed method for the binary rating case. Since our primary goal is the prediction of ratings, we focus on the convergence of the rating values instead of the recovery of the latent factors. Suppose the entries of Y = {y_{i1i2i3}} ∈ R^{n1×n2×n3} independently follow Bernoulli distributions with probability tensor Θ. Let |Ω| be the number of observed ratings and J(Θ) be a non-negative penalty function; for example, J(Θ) = ‖A‖_2^2 + ‖B‖_2^2 + ‖C‖_2^2 for the L2 penalty on the tensor factors. The overall objective function is then

L(Θ | Y) = (1/|Ω|) Σ_{(i1,i2,i3)∈Ω} l(y_{i1i2i3}, θ_{i1i2i3}) + λ_{|Ω|} J(Θ),   (17)

where l(·, ·) is a loss function and λ_{|Ω|} is a tuning parameter for the penalization. To establish the convergence rate, we introduce the following assumption. Assumption 1 assumes that the latent factors are bounded, since, in practice, the underlying probabilities of binary utilities are finite. We define the parameter space as S(k) = {Θ : J(Θ) ≤ k^2} ⊆ R^γ, where γ = (n1 + n2 + n3)R is the total number of parameters. We introduce S(k) to ensure that the proposed estimator can reach the best possible convergence rate. Let S ⊆ R^{(n1+n2+n3)R} be the true underlying parameter space. An estimator Θ̂_{|Ω|} defined on S may not achieve the best possible convergence rate, since the size of S is too large when the n_i go to infinity [31]. In contrast, S(k) imposes constraints on the parameters. When k increases, the constraint J(Θ) ≤ k^2 becomes less stringent, and the parameter space S(k) converges to S as the n_i go to infinity [5,30].
Let Θ_0 be the true probability tensor of the binary ratings and Θ̄ = argmin_{Θ∈S} L(Θ | Y). We denote Θ̂_{|Ω|} as a sample estimator of Θ_0 satisfying

L(Θ̂_{|Ω|} | Y) ≤ L(Θ̄ | Y) + τ_{|Ω|},   (19)

where lim_{|Ω|→∞} τ_{|Ω|} = 0. Since finding the global minimizer is not always feasible due to the non-convexity of L, we only require that Θ̂_{|Ω|} be close to a global minimizer of L(Θ | Y) as |Ω| → ∞.

Theorem 1.
Suppose Θ̂_{|Ω|} is a sample estimator satisfying (19). Then, under Assumption 1, the estimation error of Θ̂_{|Ω|} is of order O_p(ε_{|Ω|}), and ε_{|Ω|} ∼ |Ω|^{−1/2} is the best possible rate, achieved when λ_{|Ω|} ∼ ε^2_{|Ω|}. Theorem 1 indicates that, if the penalty term shrinks to zero at an appropriate rate, the proposed method leads to a convergence rate of |Ω|^{−1/2} for Θ. Next, we extend Theorem 1 to a more general loss function. Notice that the loss function l(·, ·) in (17) can be replaced by a loss other than the L2 loss to model complex and nonlinear relations between the ratings and Θ. For example, if the loss function l(·, ·) is a log-likelihood loss, then K(·, ·) reduces to the Kullback-Leibler pseudo-distance. Let W_p^α[a, b]^{n1×n2×n3} be a Sobolev space with a finite L_p norm, where a and b are constants and α is a parameter associated with the degree of smoothness of the loss difference l_Δ.

is a constant and ‖·‖ is the Euclidean norm.
The regularity condition defined in Assumption 2 is a restriction on the smoothness of the loss function l(·, ·). Assumption 3 requires that, in a neighborhood of the true parameter Θ_0, the metric ρ(Θ_0, ·) dominates a certain order of the Euclidean distance in Theorem 2. Conversely, if ρ(Θ_0, ·) is bounded by the Euclidean distance and Assumption 2 is satisfied, then Theorem 1 remains valid in terms of the metric ρ(Θ_0, ·).
We also consider the ordinal rating case. Let Θ denote the vectorization of (α, A, B, C). We represent the penalized likelihood-based loss function (12) as

L(Θ | X) = −(1/|Ω|) Σ_{(i,j,k)∈Ω} Σ_{l=1}^L I(x_ijk = l) log{σ(α_l − θ_ijk) − σ(α_{l−1} − θ_ijk)} + λ_{|Ω|} J(Θ).

Then Assumption 1 and the definition of S(k) are the same as in the binary case, except that the total number of parameters γ becomes (n1 + n2 + n3)R + L.
To measure the difference between the two parameters Θ̂_{|Ω|} and Θ_0, we introduce the Hellinger metric h(·, ·) on S(k). Theorem 3 establishes the convergence rate of Θ̂_{|Ω|} with the Hellinger distance as the metric on the parameter space. The result still holds if the L2 distance is used, although additional assumptions on the local and global behavior of Var{L(Θ̂_{|Ω|} | Y) − L(Θ_0 | Y)} are needed; Corollary 2 in [30] can then be applied. Notice that the number of levels L also affects the convergence rate, and the required number of observations |Ω| increases as L grows.

Simulation studies
In this section, we conduct simulations to compare the proposed method (TFD) with existing tensor factorization methods for both binary and ordinal rating cases.

Binary rating tensor
We first focus on binary rating tensors and compare the proposed method (TFD) with five competing methods; namely, the Boolean Tensor Decomposition (BTD) [28], Canonical Polyadic Decomposition (CP) [16], Generalized Canonical Polyadic Decomposition (GCP) [12], Bayesian probabilistic tensor factorization (BPTF) [36], and the pairwise interaction tensor factorization (PITF) [27]. Among these models, CP and PITF are considered as baseline tensor factorization methods which model multi-way interactions and two-way interactions, respectively. Both the BTD and GCP are able to deal with binary data, and the BPTF is a tensor factorization method that can characterize dependency across subjects within the last mode.
In the first experiment, we evaluate the performance of the different methods via prediction accuracy under different ranks and tensor sizes. We consider a 3-order tensor of size n_1 × n_2 × n_3 = 200 × 100 × 50 or 2000 × 1000 × 50, with the rank of the latent factors set as R = 3, 5 or 8. Each latent factor is generated as a_i, b_j, c_k iid ∼ N(0, 4·I_R) for i = 1, ..., n_1, j = 1, ..., n_2, and k = 1, ..., n_3, the underlying rating probability is formulated as p_ijk = f(∑_{r=1}^{R} (a_ir b_jr + b_jr c_kr)), where f is the logistic function, and the binary rating tensor is generated via x_ijk ∼ Bernoulli(p_ijk). We denote the missing rate of the observations in the rating tensor as π_0. For the small tensor size 200 × 100 × 50, the missing rate is π_0 = 90%, while for the large tensor size 2000 × 1000 × 50, π_0 = 99%. For the observed ratings, we assign 60%, 20% and 20% of the data to the training, validation and testing sets, respectively. All simulations are replicated 100 times, and the true rank R is used. Table 1 compares the proposed method with the competing methods, where performance is measured by the mean square error (MSE) and the area under the receiver operating characteristic curve (AUC). The proposed method performs the best in all cases except for the large tensor of 2000 × 1000 × 50 with R = 5 in terms of the AUC. This is because the observed sample size for the large tensor is sufficient for recovering the tensor via the CP decomposition: compared with the small tensor, the large tensor has 100 times more entries while its observed proportion is only 10 times smaller, resulting in 10 times more observations. The improvement of the proposed method over the other methods is at least 78% in the MSE. Note that the performance of all methods generally worsens as the rank increases, due to the growing total number of parameters. The Boolean Tensor Decomposition (BTD) is not able to handle the large tensor of 2000 × 1000 × 50, as it imposes no sparsity constraint.
The CP method provides comparable performance in terms of the AUC, but its MSE is rather poor. This is because CP is applied directly to the binary ratings, and the estimate of each rating given by CP is around 0.5, which leads to a high MSE. However, the AUC mainly measures how well high and low binary ratings are distinguished, and the CP can learn such ordering information from binary ratings, thus giving a comparable AUC. In contrast, the proposed method improves the MSE by utilizing the link function, transforming a binary rating into a continuous outcome.
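This AUC-versus-MSE gap can be checked with a toy experiment (synthetic data, unrelated to the simulations above): shrinking the true probabilities toward 0.5 by a strictly monotone map leaves the rank-based AUC unchanged while inflating the MSE.

```python
import numpy as np

def auc(y_true, score):
    """Rank-based AUC: probability a random positive outscores a random negative."""
    pos = score[y_true == 1]
    neg = score[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties   # ties count as one half

rng = np.random.default_rng(1)
p_true = rng.uniform(0.05, 0.95, size=2000)
y = rng.binomial(1, p_true)

# A CP-style estimate: correctly ordered but compressed toward 0.5.
p_cp = 0.5 + 0.1 * (p_true - 0.5)
# A well-calibrated estimate, as obtained through a link function.
p_link = p_true

print(auc(y, p_cp) - auc(y, p_link))   # 0.0: the monotone map preserves ranking
print(np.mean((y - p_cp) ** 2), np.mean((y - p_link) ** 2))  # MSE clearly worse for p_cp
```

Since the AUC depends only on the ordering of the scores, any strictly increasing transformation of the probabilities leaves it unchanged, while the squared error penalizes the miscalibration directly.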
The second experiment investigates the prediction performance on binary ratings when no dependency exists among the pairwise interactions of latent factors. Specifically, we generate two separate item latent factors, b^(1)_j for the user-item interaction and b^(2)_j for the item-context interaction, from two independent normal distributions. Thus, the true probability tensor becomes p_ijk = f(∑_{r=1}^{R} a_ir b^(1)_jr + ∑_{r=1}^{R} b^(2)_jr c_kr).
All other simulation settings are similar to the first experiment. Table 2 reports the performance of the different methods when no dependency exists between interactions. The proposed model remains the best among the competing models, indicating that it is quite robust against model misspecification when no dependency among pairwise interactions exists. There are two major reasons for the improved performance of the proposed method over the PITF. First, we treat the binary response as a random variable following an underlying probability, while the PITF treats it as a continuous variable. Second, the proposed loss function includes a penalty on the latent factors while the PITF's does not. Therefore, the PITF may suffer from low prediction accuracy due to overfitting and the scale indeterminacy of the latent factors.

Ordinal rating tensor
In the following simulation, we compare the proposed method (TFD) with five competing methods on ordinal rating prediction. Three of them are the CP, PITF and BPTF. The other two are extensions of the CP and PITF methods; although neither is designed for ordinal ratings, their outputs can be plugged into the proportional odds model by replacing θ_ijk in (12). We refer to these two as the Ordinal Canonical Polyadic Decomposition (OCP) and the Ordinal Pairwise Interaction Tensor Factorization (OPI).
The underlying probability of the rating being l is P(y_ijk = l) = f(α_l − θ_ijk) − f(α_{l−1} − θ_ijk), where f is the logistic function, and the ordinal rating tensor is sampled from these level probabilities. We set the missing rate π_0 = 90% for the small tensor size, and π_0 = 99% for the large tensor size. All simulations are replicated 100 times, using the true rank R. Table 3 shows that for both small- and large-scale tensors, the proposed method outperforms the other methods by more than 30% in the MSE when the number of levels is 3, and the improvement increases to more than 88% when the number of levels is 5 or 7. As the number of levels increases, the MSE of the proposed model increases as well, which is consistent with Theorem 3.
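This generation scheme can be sketched as follows. The sketch assumes the standard proportional odds convention P(Y ≤ l) = f(α_l − θ) with ordered cutpoints (the paper's sign convention for θ may differ); dimensions and cutpoint values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n1, n2, n3, R, L = 50, 40, 30, 3, 5

A = rng.normal(size=(n1, R))
B = rng.normal(size=(n2, R))
C = rng.normal(size=(n3, R))
theta = (np.einsum('ir,jr->ij', A, B)[:, :, None]
         + np.einsum('jr,kr->jk', B, C)[None, :, :])

# Ordered cutpoints alpha_1 < ... < alpha_{L-1}; values here are illustrative.
alpha = np.linspace(-2.0, 2.0, L - 1)

# Proportional odds: P(Y <= l) = f(alpha_l - theta), f the logistic function.
cdf = 1.0 / (1.0 + np.exp(-(alpha[:, None, None, None] - theta)))
cdf = np.concatenate([cdf, np.ones((1, n1, n2, n3))], axis=0)  # P(Y <= L) = 1

# Level probabilities P(Y = l) = P(Y <= l) - P(Y <= l-1).
pmf = np.diff(cdf, axis=0, prepend=np.zeros((1, n1, n2, n3)))

# Sample ratings in {1, ..., L} by inverse-CDF sampling.
u = rng.random((n1, n2, n3))
Y = 1 + (u[None, ...] > cdf).sum(axis=0)
```

Because the cutpoints are ordered, the cumulative probabilities are monotone in l, so every level probability is non-negative and the L probabilities sum to one at each entry.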
We also compare the performance of the proposed TFD model with the PITF, OPI, CP, OCP and BPTF under different observation missing rates. Specifically, we fix the tensor size at 2000 × 1000 × 50, the rank of the latent factors at R = 5, and the number of rating levels at L = 3, and consider two missing rates, π_0 = 0.95 and π_0 = 0.97. Table 4 shows that although all methods perform worse as π_0 increases, the proposed method and the PITF-based methods are more robust against missing observations than the CP-based methods. This is because the proposed method pursues parsimonious tensor modeling, and is less demanding on the sample size to recover the data information than CP-based modeling.

Real data application
In this section, we apply the proposed method to two real application datasets, the User-Location-Activity dataset [38] and the Beer Review dataset [21].

User-location-activity data
The User-Location-Activity dataset [38] provides the counts of each activity performed by each user at a specific location. After removing the users with no records and the locations with no counts, 146 users and 85 locations remain, where only 2% of the entries have counts larger than 0. This is because each user only went to a few locations and performed a few specific activities during the experiment. Therefore, most entries are missing, which results in sparse tensor data. Figure 4 illustrates the dependency between location and activity in the data: the activities people perform are correlated with their locations. Thus, it is reasonable to consider a dependence between location and activity when building the recommender system.
We define the tensor X as x_ijk = 1 if user i does activity k at location j, and x_ijk = 0 otherwise.
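As a toy sketch of this binarization (the record format and dimensions below are hypothetical, standing in for the preprocessed user-location-activity counts; unrecorded entries are treated as missing, consistent with the high missing rate described above):

```python
import numpy as np

# Hypothetical raw records: (user, location, activity, count) tuples.
records = [(0, 1, 2, 5), (0, 3, 0, 1), (1, 1, 2, 2), (2, 0, 1, 7)]
n_users, n_locations, n_activities = 3, 4, 3

# x_ijk = 1 if user i does activity k at location j, 0 otherwise;
# entries with no record at all are kept as missing (NaN), not as 0.
X = np.full((n_users, n_locations, n_activities), np.nan)
for user, loc, act, count in records:
    X[user, loc, act] = 1.0 if count > 0 else 0.0
```

Only the non-missing entries of X then enter the loss, matching the sparse-tensor strategy used for estimation.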
The goal of our study is to predict each user's probability of choosing a certain activity at a specific location. Based on the predictions, we can recommend location-activity pairs to each user. The preprocessed data are randomly split into training, validation and testing sets with proportions of 60%, 20% and 20%, and the experiments are replicated 100 times. We compare the proposed method with the competing methods in Section 6.1. Each method's rank is tuned by minimizing the mean square error (MSE) on the validation set, with the largest possible rank being 10. Table 5 shows that the proposed model outperforms the other methods in both the AUC and the MSE. The improvement of the proposed method in the AUC is more than 6% compared with the second-best model, the GCP method. The proposed model, CP and BTD perform similarly on the prediction MSE due to the imbalanced data.

Beer review data
In this subsection, we apply the proposed method to the Beer Review dataset [21] and recommend breweries with preferred beer styles to customers.
This dataset consists of 1.5 million beer reviews collected by BeerAdvocate from 2002 to 2011. Each review includes a rating from 0 to 5 on a 0.5-point incremental scale, giving 11 levels in total. The dataset includes 33,388 users, 5,743 breweries and 104 beer styles. Note that each brewery has only one beer for each beer style. Here, we consider the brewery as the item and the beer style as the contextual variable. Regarding the dependence structure, the beer style is entirely controlled by the brewery, and users' preferences for breweries depend highly on the features of different beer styles at different breweries. As illustrated in Figure 5, users' ratings on different beer styles vary across breweries, indicating a dependency between the user-brewery interaction and the brewery-beer style interaction. More specifically, the brewery affects the distribution of ratings on beer styles.
We randomly split the data into training, validation and testing sets with proportions of 60%, 20% and 20%, respectively, and adopt the sparse tensor strategy in Section 4 to handle the large tensor size. We compare our method with the CP decomposition, the Ordinal CP decomposition, the PITF, the Ordinal PITF and the BPTF. Each method's rank is tuned by minimizing the mean square error (MSE) on the validation set, with the largest possible rank being 10. Table 6 shows that the proposed model outperforms the other methods with respect to the prediction MSE. Specifically, neither the CP nor the PITF handles ordinal ratings with 11 levels well, resulting in large MSEs; their modified versions, the OCP and OPI, perform better. The comparison between the proposed model and the Ordinal PITF indicates that incorporating the dependency structure improves the prediction MSE by 40%. The proposed model also performs better than the BPTF, showing the advantage of incorporating the dependency between interactions of modes. Without any additional optimization of the implementation, the proposed algorithm has computation time similar to that of the existing methods. Note that, except for the BPTF, all methods including the proposed one are implemented in Python. Due to memory limitations, we store a large tensor in a sparse data structure where only the observed entries' information is stored. To further optimize the implementation, one could parallelize the gradient computations for the different factors.
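A minimal sketch of this sparse (COO-style) storage and of computing a factor gradient over observed entries only. For illustration it uses a squared-error working loss and random indices; the arrays, dimensions, and initialization below are ours, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(3)
n1, n2, n3, R = 1000, 500, 100, 5

# Store only observed entries: index triplets (COO format) plus rating values.
n_obs = 20_000
idx = np.stack([rng.integers(0, n, n_obs) for n in (n1, n2, n3)], axis=1)
vals = rng.integers(1, 6, n_obs).astype(float)   # ordinal ratings 1..5

A = 0.1 * rng.normal(size=(n1, R))
B = 0.1 * rng.normal(size=(n2, R))
C = 0.1 * rng.normal(size=(n3, R))

i, j, k = idx[:, 0], idx[:, 1], idx[:, 2]

# Predictions on observed entries only: theta = sum_r (a_ir b_jr + b_jr c_kr).
theta = (A[i] * B[j]).sum(axis=1) + (B[j] * C[k]).sum(axis=1)
resid = theta - vals   # residuals of the squared-error working loss

# Gradient of 0.5 * sum(resid^2) w.r.t. A, accumulated entry by entry.
grad_A = np.zeros_like(A)
np.add.at(grad_A, i, resid[:, None] * B[j])
```

Memory and computation thus scale with the number of observed entries |Ω| rather than with n_1 n_2 n_3, which is what makes the 99%-missing tensors above tractable; the per-entry accumulation into `grad_A` is also the natural unit to parallelize across factors.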

Discussion
In this paper, we propose a new tensor-based recommender system that incorporates the dependency among modes and achieves tensor completion by utilizing shared factors in addition to pairwise interactions. The proposed decomposition is capable of capturing the dependency between the user-item and item-context interactions, and leads to significant advantages in incorporating context information that introduces dependency among modes. In addition, the proposed method can deal with ordinal ratings in addition to binary recommendations. We also propose a mode-wise coordinate descent (MCD) algorithm to accelerate computation and reduce memory storage. We demonstrate the superiority of the proposed method in both simulations and real data applications. In theory, we show that the estimated parameter achieves asymptotic consistency in both the binary and ordinal rating cases. There are several potential research directions to extend our method. One is to develop a unified framework for different response types jointly, such as binary, ordinal, count and non-negative continuous data. Another possible direction is to extend the proposed method to higher-order tensors, and to develop an inference procedure to test whether certain dependencies between two specific interactions are needed.
We verify several conditions of Corollary 2 in [30]. First, we verify Assumption B: by definition, we have β_1 = 0. In the rest of this section, all c_i's with i ∈ N are assumed to be non-negative constants.

A.3. Proof of Theorem 3
Here the density follows the ordinal model, with θ_ijk = ∑_{r=1}^{R} (A^(1)_ir A^(2)_jr + A^(2)_jr A^(3)_kr). Note that the displayed bound holds for some constants c_16 and c_17; the last step uses the mean value theorem and the boundedness of the function f. We then verify the condition of Lemma 2.1 of [23]. Based on the above inequality, the required bound follows, where γ = (n_1 + n_2 + n_3)R + L is the total number of parameters. With the |Ω| and λ_|Ω| above, Assumption A of [30] is satisfied, and the result follows from Corollary 1.