Deviance Matrix Factorization

We investigate a general matrix factorization for deviance-based data losses, extending the ubiquitous singular value decomposition beyond squared error loss. While similar approaches have been explored before, our method leverages classical statistical methodology from generalized linear models (GLMs) and provides an efficient algorithm that is flexible enough to allow for structural zeros via entry weights. Moreover, by adapting results from GLM theory, we provide support for these decompositions by (i) showing strong consistency under the GLM setup, (ii) checking the adequacy of a chosen exponential family via a generalized Hosmer-Lemeshow test, and (iii) determining the rank of the decomposition via a maximum eigenvalue gap method. To further support our findings, we conduct simulation studies to assess robustness to decomposition assumptions and extensive case studies using benchmark datasets from face image recognition, natural language processing, network analysis, and biomedical studies. Our theoretical and empirical results indicate that the proposed decomposition is more flexible, general, and robust, and can thus provide improved performance when compared to similar methods. To facilitate applications, an R package with efficient model fitting and family and rank determination is also provided.


Introduction
Matrix factorization has a long history in statistics. In general, the problem is that, given an n-by-p data matrix X where w.l.o.g. n ≥ p, one aims to decompose X into the product of two lower rank matrices that jointly summarize most of the information in X. With the rows of X seen as samples and the columns as sample conditions or experiments, such decompositions hope to interpret the information in X in terms of factors. For instance, consider the (thin) singular value decomposition [SVD; 14] of X, X = U D V^T, where U is n-by-p with orthonormal columns, that is, U belongs to the Stiefel manifold S_{n,p}, D = Diag{d_1, ..., d_p} is a diagonal matrix of order p with d_1 ≥ ... ≥ d_p, and V ∈ O(p) is orthogonal. The Eckart-Young theorem [11] then states that, within the space of rank-q decompositions V_{n,p}(q) = R^{n×q} × R^{p×q}, one solution to

(Λ̂, V̂) = argmin_{(Λ,V) ∈ V_{n,p}(q)} ‖X − ΛV^T‖_F    (1)

is given by Λ̂ = U_{1:q} Diag{d_1, ..., d_q} and V̂ = V_{1:q}, where U_{1:q} has the first q columns of U, that is, by a rank-truncated version of X according to its SVD. This result has broad applications in statistics, the closest one being principal component analysis (PCA). In this case, regarding the rows of X as observations, we can center and decompose X by solving

(v̂_0, Λ̂, V̂) = argmin_{v_0 ∈ R^p, (Λ,V) ∈ {V_{n,p}(q) : Λ^T 1_n = 0}} ‖X − 1_n v_0^T − ΛV^T‖_F.

It is known that v̂_0 contains the column means and that Λ̂V̂^T can be analogously obtained from the SVD of X − 1_n v̂_0^T. Here, Λ̂ corresponds to the first q principal components of X.
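As a concrete illustration of the Eckart-Young construction above, the following sketch forms the rank-q truncation from the thin SVD (the function name is illustrative, not from the paper's package):

```python
import numpy as np

def svd_truncate(X, q):
    """Best rank-q approximation of X in Frobenius norm (Eckart-Young)."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)  # thin SVD: X = U D V^T
    Lam = U[:, :q] * d[:q]          # Lambda = U_{1:q} Diag{d_1, ..., d_q}
    V = Vt[:q].T                    # V = V_{1:q}
    return Lam, V

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))
Lam, V = svd_truncate(X, 3)
approx = Lam @ V.T                  # rank-3 truncation of X
```

Any other rank-3 matrix is at least as far from X in Frobenius norm, and taking q equal to the full rank recovers X exactly.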

arXiv:2110.05674v3 [stat.ML] 30 Jun 2023
The formulation in (1) is based on squared error loss, the minimization of which is equivalent to maximizing a Gaussian log-likelihood. However, such a model implicitly assumes that X ∈ R^{n×p} and often leads to spurious results in more constrained cases, for example, if X ∈ F^{n×p} for a restricted domain F, such as F = N, F = {0, 1}, or F = R_+.
In the case that X stores positive counts, that is, when F = N_+, one proposed improvement is non-negative matrix factorization [NMF; 30], which solves

argmin_{(Λ,V) ∈ V_{n,p}(q), Λ, V ≥ 0} Σ_{i,j} [X_ij log(X_ij / (ΛV^T)_ij) − X_ij + (ΛV^T)_ij],    (2)

and is thus based on a Kullback-Leibler (KL) loss, or equivalently a Poisson log-likelihood with an identity link.
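A minimal sketch of the KL objective in (2), evaluated at a candidate non-negative factorization (the function name and example values are illustrative):

```python
import numpy as np

def kl_loss(X, Lam, V):
    """Poisson/KL divergence between counts X and the NMF fit Lam @ V.T.

    Entries with X_ij = 0 contribute only the fitted mean M_ij,
    using the convention 0 * log(0) = 0.
    """
    M = Lam @ V.T                       # fitted means; entries must be > 0
    with np.errstate(divide="ignore", invalid="ignore"):
        term = np.where(X > 0, X * np.log(X / M), 0.0)
    return np.sum(term - X + M)

X = np.array([[3.0, 1.0], [0.0, 2.0]])
Lam = np.array([[1.5], [1.0]])
V = np.array([[1.2], [1.0]])
loss = kl_loss(X, Lam, V)
```

The loss is non-negative and equals zero exactly when Lam @ V.T reproduces X.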
In the same way that the NMF has been successful in computer vision research, extending the factorization model with more appropriate likelihood (family) assumptions and link transformations has generated a series of benchmark methods in various applications. To name a few: in natural language processing, the Skip-gram model [36, 31] can be considered a multinomial factorization, and the Global Vectors for Word Representation [GloVe; 36] model can be considered a log-transformed PCA model with entry-wise weight specifications; in network analysis, the random dot product model [22, 52] can be considered a Bernoulli factorization. These generalizations, along with their popularity in follow-up applications, provide empirical evidence for the importance of more realistic factorization assumptions. However, these models are tailored to specific distributions and lack generality, especially with respect to over-dispersed or heavy-tailed data, limiting their exploration and broader, more suitable use in modern applications.

Related Work: Exponential Family PCA
A more general factorization model, exponential family principal component analysis [EPCA; 9], proposes to minimize the loss Σ_{i,j} B(X_ij ‖ μ_ij) over rank-q parameterizations of the means μ_ij, where B is the Bregman distance, uniquely determined by the specification of the exponential family likelihood. EPCA has generated significant interest; for example, [51, 37, 32, 33, 16, 48] focus on extending it using hierarchical structures to improve likelihood fitness and automatic rank determination. These contributions, however, have the following drawbacks:

• The hierarchical extensions tend to suffer from heavy computational costs and identifiability issues (e.g., label switching [42] and posterior collapse [49]).
• The properties of the estimator have not yet been investigated. Understanding convergence behaviour, even under certain conditions, would provide stronger statistical assurances for applications.
• The model provides the freedom of choosing among exponential families without providing sufficient justification on how to choose the appropriate factorization family given the data.
• An efficient and/or robust implementation that allows entry-wise factorization priority has been wanting from practitioners. Assuming prior weights on the factorization can greatly improve model performance; as examples, see [53] for missing data and [21] for a face-centered NMF.

Our Contributions
To address the gaps in EPCA, we here revisit this factorization task under a generalized linear model [GLM; 35] framework: we assume implicitly that the data entries X_ij ind∼ F(μ_ij, φ_ij), where F belongs to the exponential family with mean E(X_ij) = μ_ij and variance Var(X_ij) = φ_ij V(μ_ij) for a variance function V(·) : R → R. More specifically, F specifies a density/pmf f of X_ij that can be expressed parametrically as a function of canonical parameters θ_ij and dispersions φ_ij as

f(x_ij; θ_ij, φ_ij) = exp{ [x_ij θ_ij − b(θ_ij)] / φ_ij + c(x_ij, φ_ij) },

where the cumulant function b is such that b'(θ_ij) = μ_ij and b''(θ_ij) = V(μ_ij). The terms φ_ij are usually specified as φ_ij = φ/w_ij, where φ is now a single dispersion parameter and the w_ij are known weights. Good references for exponential families, GLMs, and factor models are, respectively, the well-known textbook [35] and Chapter 12 of [4].
As discussed above, we relate the entry means μ_ij to factors Λ and V by defining the linear predictor η_ij = (ΛV^T)_ij and by adopting a link function g such that η_ij = g(μ_ij). That is, we assume element-wise that

g(E(X)) = ΛV^T.    (4)

If ℓ(X_ij; η_ij) is the log-likelihood under F for data entry X_ij and parameter η_ij, we can then obtain the factorization (Λ, V) by minimizing the negative likelihood function L(X_ij, η_ij) = ℓ(X_ij; X_ij) − ℓ(X_ij; η_ij), i.e., the half-deviance. With the most commonly used members of the exponential family provided in Table 1, the overall loss is then

L(X, ΛV^T; w) = Σ_{i,j} w_ij [ℓ(X_ij; X_ij) − ℓ(X_ij; η_ij)],    (5)

where we made the dependency on dispersion weights explicit. To emphasize these connections to GLM methodology, we denote our method as deviance matrix factorization (DMF).
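As an illustration, the loss (5) under a Poisson family with log link can be sketched as follows (the function name is illustrative; entries of X are assumed to be non-negative counts):

```python
import numpy as np

def poisson_half_deviance(X, eta, w=None):
    """Weighted half-deviance sum_ij w_ij [l(X_ij; X_ij) - l(X_ij; eta_ij)]
    for a Poisson family with log link, so mu = exp(eta)."""
    if w is None:
        w = np.ones_like(X)
    mu = np.exp(eta)
    with np.errstate(divide="ignore", invalid="ignore"):
        # saturated log-likelihood l(x; x) and model log-likelihood l(x; eta),
        # both up to the same constant, which cancels in the difference
        sat = np.where(X > 0, X * np.log(X) - X, 0.0)
    ll = X * eta - mu
    return np.sum(w * (sat - ll))

X = np.array([[2.0, 5.0], [1.0, 3.0]])
loss0 = poisson_half_deviance(X, np.log(X))  # perfect fit (mu = X): zero loss
```

The loss vanishes at the saturated fit and is strictly positive otherwise, mirroring the deviance in standard GLMs.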
While the implicit assumption of a data generative process might seem too restrictive, here it only serves to define the data loss L in (5). In practice, as we show here, the factorization only really requires the specification of L up to the second moments of the log-likelihood, as usual in GLMs under quasi-likelihood assumptions [50]. In this way, our proposed approach inherits robustness to model misspecification.
While we regard the methods we discuss here as descriptive, similarly to PCA, there are close connections to factor analysis, which, in turn, aims at inferring covariance structure. We benefited from recent advances in this field, under a Gaussian error assumption, to investigate factorization consistency [25, 2] and to determine the factorization rank q [38, 46, 12]. Here we extend these results and propose a maximum eigengap rule to specify q. Compared to related EPCA methods, our methods are simpler and more efficient while achieving similar statistical performance. To summarize, as we show later, the DMF enjoys the following advantages:

• To appropriately model a wider variety of real world data, we extend PCA and NMF to a factorization under a general exponential family assumption.
• To avoid identifiability issues, we constrain the factors Λ and V to be orthogonal frames and thus unique, yielding a more robust and efficient algorithm.
• Under some extra conditions, we extend GLM consistency [8] to DMF factors.
• We adapt a generalized Hosmer-Lemeshow test [45] to determine the adequacy of a factorization family/link choice.
• By specifying a link function g and entry weights w_ij, we can better accommodate factorizations for various applications such as matrix completion and residual boosting [18, 26].

Organization of the Paper
We start in Section 2 by formally introducing the deviance matrix factorization and discussing how we tackle identifiability issues and centering. Next, in Section 3, we devise a flexible and robust algorithm to produce DMFs. In Section 4 we offer theoretical guarantees on issues that would lead to spurious results if left unanswered: we first show that the DMF exists and is consistent, and then discuss how to determine the factorization rank and test the adequacy of family and link choices. In Section 5 we supplement these developments with extensive simulations and case studies using benchmark datasets from many fields. As we show, the proposed methods yield good results and better performance than traditional methods in practice, with real-world datasets favoring over-dispersed distributions such as the negative binomial. Finally, Section 6 concludes with a discussion and offers directions for future work.
For example, using a Gaussian loss, L(x, y) = (x − y)²/2, and an identity link we recover (1), while a Poisson loss, equivalent here to the KL loss, again with an identity link, yields the non-negative matrix decomposition (2). Table 1 lists more examples with correspondences between distributions in the exponential family and their respective half-deviance loss functions, canonical links g(μ), and variance functions V(μ). The link function g(μ) is given in canonical form, but in general it can be any smooth, invertible function appropriate for the domain F.

Identifiability
Denote by GL(q, R) the general linear group of invertible matrices of order q over R. If Λ and V are a solution to (6), then an arbitrary M ∈ GL(q, R) yields a new solution with Λ̃ = ΛM^{-T} and Ṽ = VM, since then Λ̃Ṽ^T = ΛV^T. To guarantee identifiability so that, as in the SVD, our decomposition is unique, we then require any solution (Λ, V) to be such that: (i) V has orthonormal columns, i.e., V^T V = I_q, and so V belongs to the Stiefel manifold S_{p,q}; (ii) Λ has scaled pairwise orthogonal columns, that is, Λ^T Λ = D² for a diagonal matrix D with decreasing positive entries. We denote this space for Λ as S̃_{n,q}.
In this case, if (Λ, V) and (Λ̃, Ṽ) = (ΛM^{-T}, VM) are both identified, Ṽ ∈ S_{p,q} implies that Ṽ^T Ṽ = M^T V^T V M = M^T M = I_q, that is, M ∈ O(q) is orthogonal, and so Λ̃^T Λ̃ = M^T Λ^T Λ M = M^T D² M = D̃². But the only way that an orthogonal M commutes with diagonal matrices in this fashion is if M = I_q and D̃ = D (up to sign changes, as in the SVD). Thus, we have just shown that

min_{(Λ,V) ∈ V_{n,p}(q)} L(X, ΛV^T) = min_{Λ ∈ S̃_{n,q}, V ∈ S_{p,q}} L(X, ΛV^T),

with the solution to the right-hand side being unique. In practice, we can always identify a unique solution (Λ̃, Ṽ) ∈ S̃_{n,q} × S_{p,q} from a general solution (Λ̂, V̂) ∈ V_{n,p}(q) by obtaining the rank-q SVD decomposition Λ̂V̂^T = U D Ṽ^T and setting Λ̃ = U D.
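The identification step can be sketched as follows (the function name is illustrative): any two factorizations with the same product map to the same identified representative, up to column signs.

```python
import numpy as np

def identify(Lam, V):
    """Map any factorization (Lam, V) with product Lam @ V.T to the unique
    representative with orthonormal V and Lam having orthogonal scaled columns."""
    q = Lam.shape[1]
    U, d, Vt = np.linalg.svd(Lam @ V.T, full_matrices=False)
    return U[:, :q] * d[:q], Vt[:q].T

rng = np.random.default_rng(1)
Lam = rng.normal(size=(10, 3))
V = rng.normal(size=(6, 3))
M = rng.normal(size=(3, 3))                          # generic invertible mixing matrix
L1, V1 = identify(Lam, V)
L2, V2 = identify(Lam @ M.T, V @ np.linalg.inv(M))   # same product, different factors
```

Both calls recover the same product, with V1 orthonormal and L1 having pairwise orthogonal columns, as required of the identified solution.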

Centering
We can capture a known prior structure Λ_0 ∈ R^{n×q_0} of rank q_0 in the decomposition by solving

min_{V_0, (Λ,V) ∈ V_{n,p}(q)} L(X, Λ_0 V_0^T + ΛV^T)  subject to  Λ^T Λ_0 = 0,    (7)

where the q_0 orthogonality constraints Λ^T Λ_0 = 0 avoid redundancies. As an example, the most common use of this setup is when Λ_0 = 1_n, so that V_0 serves as the center of X, similar to PCA [20] under Gaussian loss.
We can obtain a solution (V̂_0, Λ̃, Ṽ) to (7) directly from a solution (Λ, V) to (6): if H_0 = Λ_0 (Λ_0^T Λ_0)^{-1} Λ_0^T is the projection matrix onto the space spanned by the columns of Λ_0, then V̂_0^T = (Λ_0^T Λ_0)^{-1} Λ_0^T ΛV^T, and, as in the previous section, Λ̃ ∈ S̃_{n,q} and Ṽ ∈ S_{p,q} are computed from the rank-q SVD decomposition of (I_n − H_0)ΛV^T. Note that since Λ_0^T Λ̃Ṽ^T = Λ_0^T (I_n − H_0)ΛV^T = 0 and Ṽ is full rank, we have Λ_0^T Λ̃ = 0 as required.
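Under Gaussian loss with Λ_0 = 1_n, this recipe reduces to ordinary column centering; a minimal sketch (variable names illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, q = 20, 5, 2
X = rng.normal(size=(n, p))
Lambda0 = np.ones((n, 1))                 # prior structure: the all-ones column
H0 = Lambda0 @ np.linalg.inv(Lambda0.T @ Lambda0) @ Lambda0.T  # projection onto span(Lambda0)

# V0 plays the role of the center; under Gaussian loss it holds the column means
V0 = (np.linalg.pinv(Lambda0) @ X).T
# identified centered factors from the SVD of the projected-out fit
U, d, Vt = np.linalg.svd((np.eye(n) - H0) @ X, full_matrices=False)
Lam, V = U[:, :q] * d[:q], Vt[:q].T
```

By construction Lam lives in the orthogonal complement of span(Lambda0), so the constraint Λ^T Λ_0 = 0 holds automatically.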
From the last section, we know that we just need to solve the unconstrained version in (6) and post-process the solution to identify it, even after centering. Moreover, this post-processing ensures a unique deviance factorization. In this section, we describe a method to deviance-decompose the input matrix X. Incorporating the entry-wise factorization weights w_ij, we modify the GLM iteratively reweighted least squares [IRLS; 35] algorithm with a normalization step to achieve higher numerical stability and to avoid overflows. The optimization problem in (6) is clearly not jointly convex in both factors Λ and V but, as in the SVD, it is convex in each factor individually, so we alternate Fisher scoring updates for one factor conditional on the other.

Deviance Factorization
Let us first define the linear predictor η_ij = Σ_{k=1}^q Λ_ik V_jk. From the distribution F, we have the log-likelihood in terms of the canonical parameter θ_ij and up to a constant as

ℓ(X_ij; θ_ij) = w_ij [X_ij θ_ij − b(θ_ij)] / φ,

where the w_ij are weights and b and φ are the cumulant function and dispersion parameter of F, respectively. The cumulant allows us to define μ_ij = b'(θ_ij) as the mean of X_ij and V(μ_ij) = b''(θ_ij) as the variance function. Finally, the means and linear predictors are related by the link function g, η_ij = g(μ_ij).
With β being an arbitrary entry of either Λ or V, the chain rule gives

∂ℓ(X_ij; θ_ij)/∂β = G_ij ∂η_ij/∂β,  with  G_ij = w_ij (X_ij − μ_ij) / [φ V(μ_ij) g'(μ_ij)].

For k = 1, ..., q, we then have the scores

∂L/∂Λ_ik = −2 Σ_j G_ij V_jk  and  ∂L/∂V_jk = −2 Σ_i G_ij Λ_ik,

that is, ∂L/∂Λ = −2GV and ∂L/∂V = −2G^T Λ. Similarly, taking both β and γ to be arbitrary entries of either Λ or V, we obtain the expected information

E[∂²L/∂β∂γ] = 2 Σ_{i,j} S_ij (∂η_ij/∂β)(∂η_ij/∂γ),  with  S_ij = w_ij / [φ V(μ_ij) g'(μ_ij)²].

Thus, for example, with β = Λ_ik and γ = Λ_lm we have

E[∂²L/∂Λ_ik ∂Λ_lm] = 2 δ_il Σ_j S_ij V_jk V_jm,

where δ is the Kronecker delta.

Λ-step
Now, denoting λ = vec(Λ), the Fisher scoring update for Λ at the t-th iteration is

λ^{(t+1)} = λ^{(t)} − H_λ(λ^{(t)})^{-1} ∇_λ L(λ^{(t)}).    (8)

The Hessian matrix H_λ is block diagonal over the rows of Λ, so that (H_λ λ)_{ik} = 2 Σ_j Σ_m V_jk V_jm S_ij Λ_im. After pre-multiplication by H_λ(λ^{(t)}) in (8), the update then becomes, in matrix form,

H_λ(λ^{(t)}) λ^{(t+1)} = 2 vec( (S^{(t)} • Z^{(t)}) V^{(t)} ),    (9)

where • is the Hadamard elementwise product and Z^{(t)} is the working response,

Z^{(t)}_ij = η^{(t)}_ij + (X_ij − μ^{(t)}_ij) g'(μ^{(t)}_ij).

To simplify the notation, let us drop all t iteration superscripts with the exception of Λ terms, define D_i• := Diag{S^{(t)}_i•}, where S_i• denotes the i-th row of S, and denote by λ_i the i-th row of Λ. Taking the i-th row of (9), i = 1, ..., n, and transposing, we see that we just need to solve

(V^T D_i• V) λ^{(t+1)}_i = V^T D_i• Z_i•,

that is, to obtain λ^{(t+1)}_i we just regress Z_i• on V with weights S_i•. Moreover, since each λ^{(t+1)}_i does not depend on φ after solving this system, we can take the variance weights S_ij to be independent of φ when finding λ^{(t+1)}.
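To make the row-wise update concrete, here is a minimal sketch of one Λ-step for a Poisson family with log link, where V(μ) = μ and g'(μ) = 1/μ, so S_ij = w_ij μ_ij and Z_ij = η_ij + (X_ij − μ_ij)/μ_ij (the function name and unit default weights are illustrative, not the package API):

```python
import numpy as np

def lambda_step(X, Lam, V, w=None):
    """One Fisher scoring update of Lambda (Poisson family, log link):
    each row i is obtained by weighted least squares, regressing the
    working response Z[i, :] on V with weights S[i, :]."""
    if w is None:
        w = np.ones_like(X)
    eta = Lam @ V.T
    mu = np.exp(eta)
    S = w * mu                       # working weights w_ij / (V(mu) g'(mu)^2) = w_ij mu_ij
    Z = eta + (X - mu) / mu          # working response
    new_Lam = np.empty_like(Lam)
    for i in range(X.shape[0]):
        W = np.diag(S[i])
        new_Lam[i] = np.linalg.solve(V.T @ W @ V, V.T @ W @ Z[i])
    return new_Lam
```

A V-step is symmetric, regressing the columns Z[:, j] on Λ with weights S[:, j]. Note that when X already equals its fitted mean, the update returns Λ unchanged, as a fixed point should.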

V -step
Similarly, the Fisher scoring update for V at the t-th iteration is

v^{(t+1)} = v^{(t)} − H_v(v^{(t)})^{-1} ∇_v L(v^{(t)}).    (13)

Denoting by v_j the j-th row of V, our goal now is to obtain v^{(t+1)}_j. We again drop the iteration superscripts and denote D_•j := Diag{S^{(t)}_•j}, where S_•j denotes the j-th column of S. The update in (13) then becomes, after transposition, the columnwise systems

(Λ^T D_•j Λ) v^{(t+1)}_j = Λ^T D_•j Z_•j,  for j = 1, ..., p,

that is, to obtain v^{(t+1)}_j we just regress Z^{(t)}_•j on Λ^{(t)} with weights S^{(t)}_•j.

Algorithm DMF
Having both scoring updates, we just need to alternate between Λ- and V-steps until convergence of the deviance loss in (5). Convergence can be assessed by the relative change in L between iterations. To kickstart the procedure, we can follow another common practice in GLM fitting and initialize μ^{(0)} from the data; then, with η^{(0)}_ij = g(μ^{(0)}_ij), we set V^{(0)} as the first q columns of the transpose of [η^{(0)}_ij]_{i,j}. Note that we do not need to initialize Λ^{(0)}, since Λ^{(1)} only needs V^{(0)}. Finally, because for an arbitrary diagonal matrix D of order q we have ΛV^T = ΛD^{-1}DV^T = (ΛD^{-1})(VD)^T, to achieve higher numerical stability and avoid overflows we fix the scale of the columns of Λ and V by normalizing the columns of V^{(t)} before a Λ-step and the columns of Λ^{(t+1)} before a V-step.
We summarize this specialized version of iteratively reweighted least squares (IRLS) in Algorithm 1. Given a continuous and differentiable link g, we can thus see that the algorithm converges to at least a stationary point by Proposition 2.7.1 in [3]. The general remedy for converging to a poor stationary point is either running the algorithm multiple times with random initializations or initializing the parameters Λ and V with information from the data X. Since the GLM initialization above leverages the information in X, we adopt it as our remedy. A concrete implementation is available as the R package dmf in the supplement.
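As a compact illustration of the structure of Algorithm 1, the sketch below specializes to the Gaussian family with identity link, where both scoring steps reduce to ordinary least squares; the function name, iteration cap, and tolerance are illustrative, and this is not the dmf package implementation:

```python
import numpy as np

def dmf_gaussian(X, q, n_iter=500, tol=1e-10):
    """Alternating least squares DMF under Gaussian loss / identity link."""
    # GLM-style initialization: eta^(0) = g(mu^(0)) = X, and V^(0) is taken
    # from the first q columns of the transpose of eta^(0)
    V = X.T[:, :q]
    loss_old = np.inf
    for _ in range(n_iter):
        V = V / np.linalg.norm(V, axis=0)                 # normalize before Lambda-step
        Lam = X @ V @ np.linalg.inv(V.T @ V)              # Lambda-step: rowwise OLS on V
        Lam = Lam / np.linalg.norm(Lam, axis=0)           # normalize before V-step
        V = X.T @ Lam @ np.linalg.inv(Lam.T @ Lam)        # V-step: columnwise OLS on Lambda
        loss = 0.5 * np.sum((X - Lam @ V.T) ** 2)         # Gaussian half-deviance
        if loss_old - loss < tol:                         # relative-change stopping rule
            break
        loss_old = loss
    # identify the solution via the rank-q SVD of the product
    U, d, Vt = np.linalg.svd(Lam @ V.T, full_matrices=False)
    return U[:, :q] * d[:q], Vt[:q].T
```

In this Gaussian case the alternation converges to the best rank-q approximation of X, matching the SVD truncation, which provides a useful sanity check for the general algorithm.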

Theoretical and Practical Guarantees
While Algorithm 1 enables us to find deviance matrix factorizations, there are still two issues to tackle: finding adequate exponential family distributions in the formulation of the DMF and a suitable, parsimonious decomposition rank. In this section, we provide solutions for these two problems, along with theoretical guarantees for the good performance of the DMF. We start by extending the consistency of GLM estimators to the DMF, and then discuss methods to determine the factorization family and rank.

Strong Consistency of DMF Estimator
In this section, we justify our DMF estimator η̂ = Λ̂V̂^T fitted using Algorithm 1 as a consistent estimator of ΛV^T under mild conditions. Our setup is motivated by those in [38] and [7], which assume a fixed rank q but diverging p/n → c for some constant c > 0. The difference is that these references assume i.i.d. Gaussian errors, while here we assume that the error is non-Gaussian and not necessarily additive on η ≡ ΛV^T. To simplify the notation, from now on we denote the mean (inverse link) function by m(·) = g^{-1}(·).

Theorem 4.1. Let λ_{l,n,p}(M) be the l-th largest eigenvalue of a matrix M ∈ R^{n×p}; since the input M is of dimension n × p, λ_{l,n,p}(M) can take limits in n and p, and we abbreviate λ_l(M) ≡ λ_{l,n,p}(M). Suppose conditions C1-C4 hold (see Remark 4.1 for discussion), and let e_ij denote the fitted residual. Then, for any random variables c_ij satisfying Σ_{j=1}^p c²_ij < ∞ for fixed i and Σ_{i=1}^n c²_ij < ∞ for fixed j, the weighted fitted residuals c_ij e_ij vanish asymptotically, and the DMF estimator is strongly consistent in the sense that the estimated factors converge almost surely to Λ and V, defined as infinite-dimensional matrices.

Remark 4.1. Notice that condition C1 restricts the significance of the principal components; when we impose the identifiability of the DMF, the condition simplifies to a limit on the leading eigenvalues. Similar conditions have been frequently used in [7, 38, 12]. Condition C2 puts restrictions on the magnitude of the factorization noise. For instance, a similar condition, but with the c_i being deterministic, has been assumed in [8]. It essentially implies that we need a careful balance so that the principal components are significant enough for condition C1 while the variance does not grow unbounded. Conditions C3 and C4 are additional assumptions required to compensate for the non-linearity imposed by the GLM-type link function g(·). A similar condition is also assumed in [8]. Compared to condition C1, condition C4 asks for non-dominating principal components Λ and projection matrix V.

Factorization family determination
Due to the setup of matrix factorization, the number of parameters in the DMF increases with the sample size. This prevents us from directly applying the asymptotic distribution of the χ² or deviance statistics to assess goodness of fit, as in regular GLMs. With degrees of freedom specified from stratified residual groups, [43] first generalized the Hosmer-Lemeshow test to other GLM families by deriving its asymptotic distribution, but such a generalized test statistic requires determining a kernel bandwidth for variance estimation. To overcome bandwidth selection and the curse of dimensionality, [45] provided a more straightforward generalized Hosmer-Lemeshow test statistic with an empirical power simulation study to justify its performance. Although these test statistics were proposed within a usual GLM regression framework, we have been able to justify the use of a naive generalized Hosmer-Lemeshow test within the context of our matrix decomposition. The generalized Hosmer-Lemeshow test is based on the idea of grouping the model fitted values η̂_ij according to their magnitude and standardizing their sums into a chi-square statistic [44]. Consequently, to ensure we have enough normalized samples to judge normality, the general guideline is to choose the number of groups G such that G ≥ 15 and to have each group contain at least ten observations; taking a dataset with 150 observations as an example, we could choose the k-th group threshold u_k according to the corresponding empirical quantile of η̂.
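The grouping idea can be sketched as follows, assuming raw residuals standardized by their summed group variance (a simplification of the statistic in the text; the function name and default G are illustrative):

```python
import numpy as np

def hosmer_lemeshow_stat(X, mu, var, G=15):
    """Naive generalized Hosmer-Lemeshow statistic: group entries by the
    magnitude of their fitted values, then sum the squared standardized
    group totals; compared against a chi-square reference with roughly
    G - 1 degrees of freedom."""
    order = np.argsort(mu, axis=None)     # sort entries by fitted magnitude
    groups = np.array_split(order, G)     # G groups of (nearly) equal size
    T = 0.0
    for idx in groups:
        resid = X.ravel()[idx] - mu.ravel()[idx]
        T += resid.sum() ** 2 / var.ravel()[idx].sum()
    return T
```

Under a correctly specified family, each standardized group total is approximately standard normal, so T behaves like a chi-square variable and stays moderate; misspecification inflates it.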
We adopt a notation similar to [44].

Remark 4.2. Condition F1 is satisfied under conditions C1-C4 of our Theorem 4.1, condition F2 is the Lindeberg CLT condition required for asymptotic normality, and condition F3 is the condition required for a strong law of large numbers; together with Slutsky's theorem, these justify the convergence in distribution.
The proof is available in the supplement. In comparison to the Hosmer-Lemeshow test [23], where T can be asymptotically approximated by a χ²_{G−2} distribution, our test statistic is a cruder estimator, with condition F1 assumed to simplify the derivation. In comparison to [45], our method coincides with their naive estimator, but we further justify, via Theorem 4.2, the use of the naive estimator in large-sample settings. This conclusion is also consistent with the empirical findings in [45] that differences between the naive estimator and the more rigorous generalized Hosmer-Lemeshow test are only noticeable in rather small samples. Typical datasets for matrix factorization are large enough that condition F1 holds after the convergence of Algorithm 1.

Factorization rank determination
Rank determination is a difficult and evolving research field. As in many other statistical problems, the number of factorization parameters and the likelihood objective increase with the factorization rank, so the solution typically requires a judicious balance between model complexity and model fit. Generally, within our DMF framework, there are two main types of methods we can leverage.
The first method is "out of sample" cross validation, where the "out of sample" concept is achieved by assigning zero weights (and thus effectively "holding out" observations) in Algorithm 1. To improve computational efficiency, [39] proposed block-wise cross validation by assigning zero weights to a partitioned Gaussian data matrix, and [27] compared the performance of element-wise and block-wise holdouts. However, the analysis in [41] concluded that cross validation in linear regression (the Gaussian case) requires a large validation set (asymptotically equal to the sample size). A more expensive predictive sequential cross validation [40] was thus proposed and proven to be consistent under the GLM setup. In our experiments, probably due to the consideration of bi-way correlation [2] and a relatively larger hold-out sample size, we found that the block-wise cross validation method performs better than the element-wise hold-out method, but we failed to generalize it to GLM families other than Gaussian and Poisson.
The second type of rank determination method is based on estimation. [46] proposed Stein Unbiased Risk Estimators (SURE), which were proven to be unbiased, under mean squared error, for factor models. However, the derivation of SURE requires explicit solutions to dμ_ij/dX_ij, which are not available in closed form under our iterative Algorithm 1. Recent results using random matrix theory have gained deserved attention by providing strong empirical and theoretical justifications for conventional rank determination problems. For example, [38] demonstrated that, under conditions similar to C1-C2 in our Theorem 4.1, the conventional "elbow" rule is justified by the divergence of the maximum eigengap at the true rank threshold. With additional assumptions, [12] improved the consistency of this rank determination method by replacing the covariance eigengap with a correlation eigenvalue threshold. Although we have been able to justify the use of both [38] and [12], we hereby adopt the covariance maximum eigengap method of [38] because its conditions permit a more general setup and are closely related to the consistency conditions C1-C2 assumed in our Theorem 4.1. We also compare the ACT method with the maximum eigengap method in simulation studies under different setups in Section 5, to verify the generality of the former.

Proposition 4.3. In addition to conditions C1-C2, suppose the following conditions hold:

R1. The mean function m(·) = g^{-1}(·) is continuous with respect to its input.

R2. The fitting error e_ij satisfies a bound analogous to condition C2.

Then, for moderate n, p, there exist A ∈ R^{n×n}, B ∈ R^{p×p}, and a decomposition error ε ∈ R^{n×p} such that, given the DMF full-rank estimate η̂, the matrices A, B, and ε satisfy Assumption 2 in [38]. Thus, as a corollary, we have [38]'s rank determination theorem: given a maximum rank q_max ≤ min{n, p} and some δ > 0, the following rank estimate converges in probability to the true rank:

q̂(δ) = max{ j : λ_j(η̂^T η̂) − λ_{j+1}(η̂^T η̂) ≥ δ, j ≤ q_max }.    (17)

Remark 4.3. Conditions R1-R2 are fairly mild and satisfied by commonly used GLM link functions. Also notice that the ε_ij in this definition is different from our previously defined fitting error e_ij; the relationship between e_ij and ε_ij is elaborated in Appendix B.3.
In practice, we need an algorithm to determine δ and thus output q̂. One obvious solution is to plot the eigenvalues up to q_max and look for the maximum eigengap. A more automatic solution is to adopt the original algorithm proposed in Section IV of [38]: it simply sets q_max = p − 5 and iteratively checks the convergence of the slope of a local linear regression based upon five neighboring points, thus determining δ automatically. The algorithm is by no means rigorous compared to the theorem, but it automatically estimates the rank by leveraging the convergence information of the local linear regression. To guard against numerical artifacts in the δ calibration, plotting the eigenvalue gaps of Cov(η̂) is always recommended.
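The eigengap rule (17) itself is straightforward to implement; below is a sketch with an illustrative simulated η̂ of true rank 4 (the function name, noise level, and choice of δ are all for illustration):

```python
import numpy as np

def eigengap_rank(eta_hat, q_max, delta):
    """Rank estimate (17): the largest j <= q_max whose eigengap of
    eta_hat^T eta_hat is at least delta."""
    lam = np.sort(np.linalg.eigvalsh(eta_hat.T @ eta_hat))[::-1]  # descending
    gaps = lam[:-1] - lam[1:]
    candidates = [j + 1 for j in range(min(q_max, len(gaps))) if gaps[j] >= delta]
    return max(candidates) if candidates else 0

rng = np.random.default_rng(3)
n, p, q = 200, 30, 4
eta = rng.normal(size=(n, q)) @ rng.normal(size=(p, q)).T + 0.1 * rng.normal(size=(n, p))
```

With a clear spectral gap between the signal and noise eigenvalues, the rule recovers the true rank for a wide range of δ.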

Empirical Studies
In this section, we first conduct deviance matrix factorization on simulated datasets to validate the use of the family test (Theorem 4.2) and the rank estimation (Proposition 4.3). Then, after this initial validation, we perform real-world data analyses on benchmark datasets from computer vision, natural language processing, network community detection, and genomics. All these datasets have known labels/classes, and the factorization results are thus compared through the separability of the different classes in the lower dimensional factorized components. To organize our findings compactly, we refer to Appendix A for the eigenvalue gap plots that justify the factorization ranks for each of the empirical benchmark datasets.

Family Simulation studies
To empirically validate the family determination theorem, we assume X_ij independently distributed as F(g^{-1}(η_ij)), as before, and simulate matrix data to check the significance and power of tests of a null factorization family H_0 against an alternative family H_1. The first simulation study is designed to validate the significance levels of the generalized Hosmer-Lemeshow test in Theorem 4.2, through which we also check the test's sensitivity to a dispersion parameter φ under alternative families in H_1. The second, more extensive simulation study investigates type-II error and power.

Significance Analysis
The negative binomial distribution can be considered a generalization of the Poisson distribution, since it adds a precision parameter to address additional dispersion in count data. Similarly, the gamma distribution better accounts for dispersed data via its shape parameter when compared to the log-Gaussian distribution. In logistic regression, the complementary log-log link, when compared to the canonical logit link, is better suited for data characterized by a heavy-tailed distribution. We here demonstrate that factorizations based on (over)dispersed families are in general more robust and better fit the data when compared to non-dispersed family factorizations, as expected.
Given a dispersed exponential factorization family F_0 with link g_0, we first simulate data X ∼ F_0(g_0^{-1}(η)) conditional on some η = Λ_0 V_0^T + ΛV^T. Then we conduct DMF estimation to obtain (1) η̂_0 using the simulation factorization family F_0 with link g_0 and (2) η̂_1 using a comparison factorization family F_1 with link g_1. Based upon the fitted η̂_0 and η̂_1, we compute the test statistics in Eq. (16). To maintain numerical stability after link function transformations, we simulated data with a center (offset) Λ_0 V_0^T. To validate the generality of our proposed family test, we varied the null family F_0 across gamma (φ = 1), negative binomial (φ = 5), and binomial (w = 90) distributions, while also varying the link functions among log, complementary log-log (cloglog), and logit. Here we highlight the results using q = 5, p = 20, n = 1000, Λ_ij ∼ N(1, 0.5/q), and V_ij ∼ N(0, 0.5/q). Simulations with larger sample sizes and various q/p ratios were also conducted, but the results were similar, so we omit them for brevity. Table 2 summarizes the three simulation cases, listing simulation (null) and comparison (alternative) distributions and links. By comparing Pearson residuals (third column in Figure 1), we can conclude that a DMF under correct family and link parameterizations indeed yields better results. Since we generate the data from the given family (in green), it is not surprising to observe that all χ²_{G−1} test statistics from correct family/link specifications have p-values equal to 1 and all χ²_{G−1} statistics from wrong family/link assumptions have p-values equal to 0.
Notice that for the family test in Figure 1 we employed the gamma and negative binomial families with the true dispersion parameter φ, which is unknown in real-world datasets. We here suggest that a naive moment estimator of the dispersion (e.g., φ̂ = max{0.1, μ̂_X² / (σ̂²_X − μ̂_X)} for the negative binomial) can still effectively distinguish the true simulation factorization family from a comparison non-dispersed factorization family. With the same q, p, n and Λ_0, V_0 setup, we simulate gamma and negative binomial data and compare χ² family test p-values among three different factorization families: (1) the dispersed family with true dispersion φ, (2) the dispersed family with estimated dispersion φ̂, and (3) the non-dispersed family. We generate data matrices and evaluate family test p-values for 10,000 simulations. The results from a Poisson versus negative binomial example are shown in Figure 2.
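The moment estimator above can be sketched as follows (the function name and simulation parameters are illustrative; note that numpy's negative binomial sampler is parameterized by size n and success probability p, with mean n(1 − p)/p):

```python
import numpy as np

def nb_dispersion_moment(X, floor=0.1):
    """Naive moment estimator of the negative binomial dispersion:
    phi_hat = max{floor, mean^2 / (var - mean)}, as suggested in the text."""
    m, v = X.mean(), X.var()
    return max(floor, m * m / (v - m)) if v > m else floor

rng = np.random.default_rng(4)
# negative binomial with dispersion phi = 5 and mean 10, so var = mu + mu^2/phi = 30
X = rng.negative_binomial(n=5, p=5 / (5 + 10), size=(1000, 20))
phi_hat = nb_dispersion_moment(X)
```

With 20,000 entries, phi_hat lands near the true dispersion of 5; the floor guards against near-Poisson samples where the variance barely exceeds the mean.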
From Figure 2, we can conclude that the p-values do not differ significantly between factorizations with estimated and true dispersion parameters for dispersed families. Both dispersed factorizations are clearly different from the Poisson factorization family. The high p-values for the dispersed families indicate that there is insufficient evidence to reject the null hypothesis that the negative binomial (dispersed family) is the true factorization family. In fact, because the negative binomial distribution is more flexible and better represents the variance in the data entries, it should consistently show higher p-values than the Poisson distribution, even when the data already show good agreement with the non-dispersed (Poisson) family. We have empirically confirmed this observation in a separate simulation study.

Power Analysis
We investigate the power of our test according to two sets of parameters: n and p, specifying the sample size, and ‖η‖_F, controlling the effect size. Power is, in general, sensitive to these parameters, increasing with them. However, in our setup we expect a more complex, combined effect: small data matrices with overall small entry magnitudes can be well explained by multiple families, while larger matrices provide more opportunities to spot discrepant entries under the null, especially with larger entries, as they impact the variance function V more pronouncedly.
Since deriving the power of the test analytically is challenging under the DMF setup, we estimate it empirically under four commonly used factorization families in simulation studies. The four selected factorization families and links are Poisson (log), binomial (logit) with weight w = 90, gamma (log) with φ = 1, and negative binomial (log) with φ = 5. For each combination of factorization family and link, we simulate data under n = 200, 600 and vary the dimension according to p = [0.2n, 0.3n, . . ., 0.7n]. The magnitude ‖η‖_F is controlled by a σ parameter with η_ij = Λ_i V_j, Λ_i ∼ N(0, σ² I_q), V_j ∼ N(0, σ² I_q). The σ parameter is taken to be {0.1, 0.2, 0.3, 0.4} for the four families. We simulate data X according to each of the four families indexed by j = 1, . . ., 4 under each n, p, σ setup. The simulation (sampling) is repeated S = 30 times for each family index. Conditional on each simulated data matrix X, we then choose G = np/50 to ensure each group has 50 residual observations. Next, we compute the p-value for the generalized Hosmer-Lemeshow test using each of these four families indexed by i = 1, . . ., 4 as the null family, with i = j indicating a DMF fit under the true generating family/link assumption. For each combination of simulated hypothesis family j and fitted hypothesis family i, we denote by P_s(i, j) the p-value for sample s. This simulation provides power estimates 1 − β(i, j) = (1/S) Σ_{s=1}^S 1{P_s(i, j) ≤ α} at significance level α whenever i ≠ j, and empirical size when i = j. In the study we set α = 0.05. As an initial sanity check, the power for two extreme cases of n, p, σ is shown in Figure 3. As we can see from the left matrix in Figure 3, with small sample sizes (n = 200, p = 40) and magnitude (σ = 0.1) we generally have low power, as expected, shown here by 1 − β(i, j) ≈ 0 on the off-diagonal elements. Truly dispersed families such as gamma and negative binomial, however, show better discriminatory power even under these challenging conditions. Moving to the right matrix, with increased sample sizes n, p and effect size σ, we observe higher power, again as expected. The only exception to 1 − β(i, j) ≈ 1 in the off-diagonal entries of the right matrix is the true Poisson versus compared negative binomial case, which is also expected since the former is a particular case of the latter. This observation additionally confirms the robustness of the negative binomial distribution in explaining variation within Poisson data.
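The power bookkeeping above reduces to a rejection-rate average over a p-value array. A minimal sketch, with hypothetical p-value distributions standing in for actual DMF fits:

```python
import numpy as np

def empirical_power(pvals, alpha=0.05):
    """Empirical rejection rate (1/S) * sum_s 1{P_s(i, j) <= alpha}
    from an S x F x F array of p-values, where entry (s, i, j) holds
    the p-value of fitting null family i to data drawn from family j
    in replicate s. Off-diagonal entries estimate power; diagonal
    entries estimate the empirical size of the test."""
    return (pvals <= alpha).mean(axis=0)

# Hypothetical illustration with S = 30 replicates and F = 4 families:
# under a correct null (i == j) p-values are ~Uniform(0, 1); under a
# misspecified null we let them concentrate near zero.
rng = np.random.default_rng(1)
S, F = 30, 4
pvals = rng.beta(0.05, 1.0, size=(S, F, F))   # misspecified fits
for i in range(F):
    pvals[:, i, i] = rng.uniform(size=S)      # correctly specified fits
power = empirical_power(pvals)                # F x F power/size matrix
```

The diagonal of the returned matrix is exactly the empirical size discussed below Figure 3, so the same helper summarizes both quantities.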
Despite the fact that the designed statistic is able to distinguish the correct families and links, we observe that the diagonals of the two matrices in Figure 3 do not converge to the advertised α = 0.05 significance level. From our empirical explorations, the desired α = 0.05 convergence can be achieved by requiring 1) asymptotically larger n, p, 2) a carefully chosen number of groups G, and 3) a good initialization point for optimization convergence. The consideration of these factors further depends on the family and link being simulated, which would make the design of the experiment unnecessarily lengthy. We decided to leave this question open for follow-up research since this experiment already provides the first assurance (to the best of our knowledge) for practitioners to assess the appropriateness of the DMF family/link. That is, the experiments designed here clearly demonstrate that the test has high sensitivity and specificity with large sample sizes.
Next, to observe the effect of size on power, we estimate it from 10 replicates under different simulation scenarios of increasing overall size n × p with fixed σ = 0.4. The results are summarized in Figure 4. As we can see, the power of our family test generally increases as the sample size np increases while keeping the nominal significance level. As in the previous simulation study, negative binomial fitted models fail to reject Poisson data, as expected. Interestingly, negative binomial generated data are harder to fit under other families, even dispersed ones, while Poisson generated data are more flexible and can be well fit even with a binomial family at a low sample size. Lastly, we are also interested in the impact of ‖η‖_F through the parameter σ. In Figure 5 we plot how power varies as a function of σ for fixed n = 200 and p = 80 over 10 replicates. As expected, power increases monotonically with the magnitude σ, indicating a clear effect of ‖η‖_F in distinguishing the variance structure and thus providing effective family selection.

Rank Simulation Study
To determine the effectiveness of factorization rank estimation, we continued to simulate data given a specific family and link function, X ∼ F(g⁻¹(η)), where η = Λ_0 V_0 + ΛV is of a known true rank q*. To test and compare the generality and accuracy of two possible rank estimation methods, adjusted correlation thresholding [ACT; 12] and [38]'s maximum eigenvalue gap, we generate both random and deterministic Λ and V according to the conditions in our Theorem 4.1 and Proposition 4.3.
For the implementation of the ACT method, we simply take η = g(X) and use the Stieltjes transform of the eigenvalues of its correlation matrix. For the maximum eigenvalue gap method, we adopt the δ calibration algorithm in [38] with q_max = p − 5. For each simulation scenario, we set p = 50, n = 500, and vary the true rank q*. We set Λ_0 V_0 = 5_{n×p} (a constant matrix of fives) to keep the linear predictor positive, as required for the gamma, Poisson, and negative binomial distributions, but set Λ_0 V_0 = 0_{n×p} to avoid perfect separation in the binomial case. Here we highlight a simulation scenario with n = 500 and p = 50; experiments under different aspect ratios p/q ∈ [0.1, 0.2, . . ., 1] were repeated and resulted in similar conclusions, so they are not shown for brevity. We simulate the rank determination process 1,000 times under the different scenarios listed in Table 3. Case 1 was designed with the conditions of both the ACT and the maximum eigenvalue gap method in our Proposition 4.3 satisfied. Case 2 was conducted with the true V matrix being deterministic, which is permitted in Proposition 4.3 but violates an assumption of the ACT method. Cases 3 and 4 were designed with q* relatively large compared to the data dimension p, which violates the assumption q* = O(p^δ) of the ACT method for large δ. Cases 5 and 6 were designed to demonstrate the failure of both methods in the case of weak signal (condition C1 in Theorem 4.1 is violated). The simulation results are shown in Figure 6.
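A stripped-down version of the eigenvalue gap selection, without [38]'s δ calibration, can be sketched as follows; the rank-3 simulated example is hypothetical:

```python
import numpy as np

def max_eigengap_rank(X, q_max):
    """Pick the rank at the largest gap among the leading q_max
    eigenvalues of X'X / n. This is a simplified sketch that omits
    the delta-calibration step of [38] discussed in the text."""
    n = X.shape[0]
    evals = np.linalg.eigvalsh(X.T @ X / n)[::-1]  # descending order
    gaps = evals[:q_max] - evals[1:q_max + 1]
    return int(np.argmax(gaps)) + 1                # ranks are 1-based

# Hypothetical check on a noisy rank-3 matrix with n = 500, p = 50
rng = np.random.default_rng(2)
n, p, q = 500, 50, 3
Lam = rng.normal(size=(n, q))
V = rng.normal(size=(p, q))
X = Lam @ V.T + 0.1 * rng.normal(size=(n, p))
```

With a signal this strong the gap after the third eigenvalue dominates, which is the regime where the method is reliable; the δ calibration in [38] is what handles the weaker-signal cases discussed below.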
We can conclude that although the ACT method works almost perfectly when the conditions in [12] are satisfied, it consistently fails to estimate the true rank when those conditions are violated (e.g., in case 2 with V deterministic, or case 3 with q* relatively large compared to p). On the other hand, the maximum eigenvalue gap method estimates the true rank q* with few mistakes in the first four simulation scenarios.
If we take a closer look at the maximum eigenvalue gap plot, we notice that most, if not all, of the overestimations from the maximum eigenvalue gap method are caused by the crude δ calibration algorithm in [38]. That is, the algorithm attempts to capture the maximum eigenvalue gap by estimating the slope locally up to convergence, which may suffer from high variance under a limited local data sample. A typical overestimation effect is shown in Figure 7, which clearly indicates ranks of 6 and 15, whereas the δ calibration algorithm output ranks of 9 and 17 for cases 2 and 4, respectively. We thus encourage the user to always plot the eigenvalue gaps to supervise the determination of the factorization rank. We also refer to Onatski's δ calibration algorithm for a potential improvement in automatic robustness. In conclusion, we adopt the maximum eigenvalue gap method for rank determination in our empirical data studies, as it is more flexible and robust.

Face Recognition
Since pattern recognition is closely related to matrix factorization, we tried our method on a computer vision benchmark dataset. The CBCL dataset is a two-class dataset with 2,429 human face images and 4,548 non-face images. Together with follow-up research, the CBCL dataset has been extensively employed in computer vision research [6]. By presenting the first comparison among PCA, NMF, and factor models on the CBCL dataset, the work of [30] brought considerable attention to NMF in computer vision.
Here we compare factorization results on the CBCL dataset with four potential factorization families: 1) the constrained Poisson with identity link (NMF), 2) the unconstrained Poisson with log link, 3) the negative binomial with log link and MoM-estimated φ = 0.462, and 4) the binomial family with probit link and weights w = 255 (the maximum grayscale value in the dataset images). The binomial factorization with weight 255 is particularly interesting because it offers a statistical view of computer images as binomial allocations.
We first determine the factorization rank by fitting η with full rank (q = 361) and by performing the maximum eigenvalue gap method (with q_max = 30). The results indicate a rank of three for the negative binomial, Poisson (log link), and binomial families, and a rank of four for the Poisson with identity link (NMF). Eigenvalue plots can be found in Appendix A. We then perform the family test with G = 165, with results shown in Figure 8. As can be seen, the negative binomial family is closest to normality in its components and thus attains the highest p-value for the χ²_{G−1} statistic. It is followed by the binomial with probit link, NMF, and log-link Poisson. Note that even though the negative binomial is the best among the four candidate families, we still observe tail deviations from the 45° line in the QQ-plot. We attribute this deviation to the unusual prevalence of zeros as the background color. The tail deviation thus points to a potential research direction in which the generation of background pixels is modeled by a separate specification, as in zero-inflated models.
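The grouping step behind the family test can be sketched with a simplified Pearson-residual statistic; the paper's exact statistic and its χ²_{G−1} calibration may differ in detail, and the Poisson-versus-overdispersed example below is hypothetical:

```python
import numpy as np

def family_test_stat(X, mu, var_fun, G):
    """Simplified generalized Hosmer-Lemeshow statistic: sort Pearson
    residuals by fitted mean, split them into G groups, standardize
    each group sum, and return the sum of squares (compared against a
    chi-squared reference with G - 1 degrees of freedom in the text)."""
    r = (X - mu) / np.sqrt(var_fun(mu))
    order = np.argsort(mu, axis=None)
    groups = np.array_split(r.ravel()[order], G)
    t = np.array([g.sum() / np.sqrt(g.size) for g in groups])
    return float(t @ t)

# Hypothetical check: Poisson residuals of Poisson data stay near the
# chi-squared scale, while Poisson residuals of heavily overdispersed
# (negative binomial) data inflate the statistic.
rng = np.random.default_rng(4)
mu = np.full((100, 50), 10.0)
X_pois = rng.poisson(mu).astype(float)
X_over = rng.negative_binomial(0.5, 0.5 / 10.5, size=mu.shape).astype(float)
stat_ok = family_test_stat(X_pois, mu, lambda m: m, G=20)
stat_bad = family_test_stat(X_over, mu, lambda m: m, G=20)
```

An adequate family keeps the statistic near its degrees of freedom, while a misspecified variance function inflates it by roughly the overdispersion factor, which is the mechanism behind the rankings reported in Figure 8.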
Next, we compare the projected three-dimensional Λ for these four families, as shown in Figure 9. Even though the negative binomial family is the best among the four in the family test, the projected three-dimensional plot indicates better separability between the face and non-face classes under the NMF factorization. This result suggests that the negative binomial and binomial factorizations fit the image data better by accommodating the dispersion induced by the different classes, but yield a less clear separation between faces and non-faces. This observation is related to the positivity constraint imposed by the NMF method, which has been shown in [30] to promote learning faces by parts. A potential research direction would thus be to fit DMFs under a classification scenario, for instance a negative binomial or binomial factorization for faces and non-faces, with different factors Λ and V for each class.

Natural Language Processing
Besides applications in computer vision, natural language processing (NLP) has also been a very successful application area for matrix factorization. The factorization is usually based on a keyword co-occurrence matrix, termed latent semantic analysis [LSA; 10], or on a keyword-document matrix, known as the hyperspace analogue to language [HAL; 34]. One recent breakthrough is GloVe [24], which motivates the use of a log link on the probabilities of a word co-occurrence matrix for word analogy tasks.
We hereby compare the NMF factorization against our DMF factorizations with negative binomial and Poisson log links on one of the NLP benchmark datasets, known as "20Newsgroup" [29]. The dataset contains 20,000 articles, each associated with one of 20 news topics. For visualization convenience, we focus on four topics (auto, electronics, basketball, mac.hardware) that are sufficiently different and admit preprocessing similar to [17]. Specifically, to avoid overfitting on specific documents, we filtered out the headers, footers, and quoted information within the news articles and removed English stop words during tokenization. For the tokenization of the keywords, we also filter out words that show up in 95% of the articles and keywords that appear in only one article to promote effective keyword identification. After preprocessing, we end up with 3,784 documents and 1,000 keywords. We use the frequency matrix as the input of our matrix factorization with n = 3,784 and p = 1,000.
Since the data form a positive count matrix, we compare the factorization results of NMF, Poisson log, and negative binomial with MoM-estimated φ = 0.06. For both the Poisson log and negative binomial log families we found a rank of q = 5, while for the NMF we found q = 3. To better visualize factorization results, we adopt a factorization with q = 3 for all families to ensure a fair comparison to NMF. We then conduct a family test with G = 20 in Figure 10. With a unit p-value, we take the negative binomial family with log link and MoM-estimated dispersion (φ = 0.06) as an adequate distribution, while rejecting the Poisson family with log link and NMF with Poisson identity link, both of which have zero p-values.
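The document-frequency filtering described above can be sketched in plain Python; tokenization here is a naive whitespace split, and the thresholds mirror the text (drop words in at least 95% of articles or in fewer than two):

```python
from collections import Counter

def build_vocab(docs, max_df=0.95, min_df=2):
    """Keep tokens appearing in at least min_df documents and in at
    most max_df (as a fraction) of all documents."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return sorted(w for w, c in df.items()
                  if c >= min_df and c / n_docs <= max_df)

def frequency_matrix(docs, vocab):
    """Document-by-keyword count matrix used as the DMF input."""
    index = {w: j for j, w in enumerate(vocab)}
    X = [[0] * len(vocab) for _ in docs]
    for i, doc in enumerate(docs):
        for w in doc.lower().split():
            if w in index:
                X[i][index[w]] += 1
    return X

# Tiny hypothetical corpus: "the" occurs in every document and is dropped
docs = ["the cat sat", "the dog sat", "the cat ran", "the dog ran"]
vocab = build_vocab(docs)
X = frequency_matrix(docs, vocab)
```

The resulting n-by-p count matrix plays the role of the 3,784-by-1,000 frequency matrix in the experiment above.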
Lastly, we adopt (7) for all three factorizations to better visualize the projected three-dimensional plot in Figure 11. After projecting the 3,784 documents onto the three-dimensional factor space, the negative binomial factorization most clearly separates the four topic classes. Compared to the negative binomial factorization, the Poisson log factorization can still separate the four classes, albeit in a more compact way, which suggests that dispersion is indispensable to account for keyword frequency variation. The NMF with Poisson identity link performs the worst, with the four topics overlapping in the center, at the origin.

Network Analysis
Another important application of our deviance matrix factorization is network data analysis; here we focus on community detection [13]. Since the adjacency matrix of a network has binary entries, we can adopt a Bernoulli family for the factorization. To demonstrate the effectiveness of our DMF method on network data, we use the political blogs and Zachary's karate club datasets as examples, both of which have been benchmarks for network community detection since [13] and [1].
We thus compare the binomial logit link factorization with the commonly used Laplacian spectral clustering method, using the separation of the projected principal components in Λ to assign communities.
As before, we first employ the maximum eigenvalue gap method to determine the factorization rank. With q_max = 200 and q_max = 29, we found a rank of three for the political blogs network and a rank of two for the karate club network (see Figure 19). The projected two-dimensional plot in Figure 12 suggests good separation between the two groups, with the leaders being far from the centroids of their respective groups. Notice that both methods can identify the two communities with a single principal component, but the DMF method provides more information about group variation in the second principal component. Although our method can take asymmetric and undirected network data as input for the factorization, the Laplacian clustering method depends heavily on the positive definiteness of a connected symmetric graph. To ensure a fair comparison between our method and Laplacian clustering, we pre-process the data by focusing on the largest connected component, which consists of 1,222 nodes. As the projected three-dimensional plot in Figure 13 indicates, compared to the spectral clustering method, our method shows a much clearer separation across the two known communities and better captures within-group variation by exploiting all three factorized principal components.
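For reference, the spectral baseline can be sketched in a few lines; the two-clique toy graph is hypothetical and merely illustrates the sign split of the Fiedler vector used for community assignment:

```python
import numpy as np

def laplacian_embedding(A, k):
    """Embedding from the k eigenvectors of the symmetric normalized
    Laplacian L = I - D^{-1/2} A D^{-1/2} with smallest eigenvalues.
    As noted in the text, this baseline needs a connected symmetric
    graph, unlike the DMF factorization."""
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    _, evecs = np.linalg.eigh(L)   # eigh returns ascending eigenvalues
    return evecs[:, :k]

# Toy graph: two triangles joined by a single edge between nodes 2 and 3
A = np.zeros((6, 6))
A[:3, :3] = 1.0
A[3:, 3:] = 1.0
np.fill_diagonal(A, 0.0)
A[2, 3] = A[3, 2] = 1.0
emb = laplacian_embedding(A, 2)
labels = (emb[:, 1] > 0).astype(int)  # sign of the Fiedler vector
```

The comparison in the text replaces this embedding with the projected Λ from the binomial DMF, which does not require symmetry or connectedness.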

Biomedical studies
The leukemia dataset reported in [15] has been a staple benchmark for gene classification. The dataset contains 5,000 features and 38 observations with two known class labels: acute myelogenous leukemia (AML) and acute lymphoblastic leukemia (ALL). Recent research [5, 54] has explored factorizations based on PCA and NMF with ranks ranging from 2 to 5.
For similar dispersion reasons, we compare factorization results among NMF, Poisson with log link, and negative binomial with φ = 1.93. As in the previous experiments, we first employed the maximum eigenvalue gap method with a full-rank fit to determine the factorization rank. The results indicate a rank-two factorization for the negative binomial with log link and rank one for the Poisson log-link model. For the NMF rank determination, we adopt a similar procedure by fitting the NMF with full rank and then plotting the maximum eigenvalue gap. The results indicate possible ranks of two and four for the NMF (Poisson with identity link) factorization. We then conduct the family test to determine the appropriate factorization family, as shown in Figure 14. We can see that Poisson (log), NMF, and PCA (identity) all fail the normality test, while the negative binomial family with log link demonstrates significantly better normality. This result justifies the negative binomial family with log link as a reasonable factorization family. Lastly, we compare the factorized projected matrix V in Figure 15. Since in this particular factorization the three families all have different ranks, we plot the projected V along sample indices with the leukemia types (ALL versus AML) separated by a vertical line. As Figure 15 shows, the negative binomial family with log link can correctly distinguish the two types using just two principal components. In contrast, the non-negative matrix factorization gives spurious results even with a higher factorization rank. The rank-two NMF factorization is better than the rank-four one, but the sample types are not as well separated as in the negative binomial factorization, especially for samples 6, 17, and 29. Thus we conclude that the NMF (Poisson with identity link) and the Poisson family (with log link) do not completely capture the essential information in the matrix; for instance, the factorized principal scores could not effectively distinguish the correct types. By better taking data dispersion into consideration, the negative binomial factorization is in general better suited for positive count gene data factorizations.

Case Studies -Classification
In the previous case studies we observed notable separability even from DMFs of low ranks (2 and 3) when compared to other benchmark methods. Here we aim to further quantify this separability with objective downstream classification tasks. We conducted these comparisons under three inherently different classification models (tree, multinomial regression, and k-nearest neighbors (KNN) with k = 9) using 20 factorized components. For each classification model we replicated training/testing for 10 runs with an equal split into training (50%) and testing (50%). Performance is summarized using boxplots of the 10 replicated multi-class AUCs [19], a measure that has been shown to be invariant to class skewness. As we can see from Figure 16, the DMF model performs better under all three classification models for the NLP dataset and for the polblog dataset. Perhaps due to a limited sample size (38 labels for leukemia), the performance on the leukemia dataset varies significantly with the classification model, but overall we see that DMF and NMF perform similarly with 20 PCs. For the CBCL face/non-face dataset, the NMF model performs best using multinomial regression and tree classification. However, both are outperformed by KNN, under which the binomial DMF achieves a nearly perfect score.
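The multi-class AUC of [19] averages pairwise binary AUCs, which is what makes it insensitive to class skewness. A minimal sketch, assuming score columns ordered by sorted class label (our convention, not necessarily the paper's implementation):

```python
import numpy as np

def auc_binary(scores, labels):
    """Mann-Whitney AUC: fraction of (positive, negative) pairs ranked
    correctly, counting ties as 1/2."""
    pos, neg = scores[labels], scores[~labels]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

def multiclass_auc(scores, y):
    """Hand-Till style multi-class AUC: average A(i|j) over all ordered
    class pairs, where A(i|j) uses class-i scores to rank class-i
    against class-j samples."""
    classes = list(np.unique(y))
    total, pairs = 0.0, 0
    for ai, a in enumerate(classes):
        for bi, b in enumerate(classes):
            if ai == bi:
                continue
            mask = (y == a) | (y == b)
            total += auc_binary(scores[mask, ai], y[mask] == a)
            pairs += 1
    return total / pairs

# Hypothetical sanity checks: one-hot scores give a perfect AUC of 1,
# constant scores give the chance level of 0.5.
y = np.array([0, 0, 1, 1, 2, 2])
one_hot = np.eye(3)[y]
```

Because each pairwise AUC conditions on its two classes, adding or removing samples from a third class leaves the summary unchanged, which is the skewness invariance exploited in Figure 16.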

Conclusion
We have proposed a new approach to matrix factorization. The factorization method generalizes the commonly applied PCA and NMF methods to accommodate more general statistical families that take data features such as overdispersion into consideration. To this end, we have generalized a fitting algorithm with justified convergence to at least a stationary point, and have provided methods, supported by theoretical guarantees, to determine the factorization family and rank.
We also obtained encouraging experimental results in our simulated data studies and in many case studies using benchmark datasets from different fields. As we show in Section 5, our method can better explain sparsity patterns in keyword frequency and gene expression matrices through over-dispersed families such as the negative binomial. It also provides an alternative factorization perspective on image recognition by recasting the original problem as an optimal binomial weight allocation problem. It further enriches community detection research in network analysis by providing a general factorization method that is applicable to asymmetric and weighted networks.
Our work offers an important perspective on statistical inference for matrices by extending the probabilistic model from Gaussian and Poisson to more general exponential families with different link functions. In fact, we plan to explore zero-inflated models to improve fit and interpretability in the face of matrices with excess zeros. For future research, we intend to enrich our deviance matrix factorization with a hierarchical structure to model matrix mixtures, to effectively incorporate row or column classification, and to investigate more efficient optimization procedures for very large datasets.
We thank two anonymous referees and the associate editor, who provided many valuable comments that significantly improved the paper. We also thank Weixuan Xia, who provided suggestions and validation on our consistency proof. The consistency proof is based upon Theorem 1 in [8] with some additional inequalities imposed. We start with the following lemma:

A Factorization Ranks
Proof.The proof is detailed in the appendix of [8].

B.1.1 The first order conditions
Denote η_ij = Λ_i V_j, where Λ_i is the i-th row of the factorized matrix Λ and V_j is the j-th column of the factorized matrix V. As a notational convention, all vectors are column vectors and their inner products are scalars. Under an exponential-family GLM setup, with parameter of interest θ_ij = θ(η_ij), dispersion parameter a(φ), mean function µ_ij = m(η_ij), and cumulant function b_ij = b(θ_ij), the log-likelihood can be written in the following form: l(η_ij) = [X_ij θ_ij − b(θ_ij)] / a(φ) + c(X_ij, φ). After convergence of Algorithm 1, denote l_ij = l(η_ij). The following first order conditions are then satisfied: Σ_j ∂l_ij/∂Λ_i = 0 for all i = 1, . . ., n, and Σ_i ∂l_ij/∂V_j = 0 for all j = 1, . . ., p. Specifically for the components, with V(µ_ij) the variance function evaluated at the mean µ_ij, we have ∂l_ij/∂η_ij = (X_ij − µ_ij) / [a(φ) V(µ_ij) g′(µ_ij)]. Substituting into (19), and observing that θ_ij = η_ij under the canonical link, so that g′(µ_ij) = 1/V(µ_ij), we obtain the normal equations from least squares estimation for fixed Λ_i or V_j, for all i = 1, . . ., n and j = 1, . . ., p. As we assume that there exists an η of rank q that generates the data and maximizes the likelihood, our DMF algorithm uses the data initialization to find the corresponding estimate of η as the global maximum of the likelihood. However, the solution (Λ, V) is not unique up to a rotation matrix until we impose identifiability constraints. Together with our identifiability constraints and the convexity of the exponential log-likelihood, we observe that Λ_i is unique conditional on V and that V_j is unique conditional on Λ. At final algorithm convergence, our estimators are thus designed to maximize the likelihood and to satisfy the first order conditions in Eq. (22):
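To make the first order conditions concrete, here is a minimal alternating-IRLS sketch for the Poisson family with (canonical) log link; it omits the entry weights, identifiability constraints, and data-driven initialization of the paper's Algorithm 1, and all names are illustrative:

```python
import numpy as np

def dmf_poisson(X, q, n_iter=20, seed=0):
    """Block-coordinate fitting of X ~ Poisson(exp(Lam V')): each row of
    Lam (then of V) is updated by one IRLS/Newton step of the Poisson
    GLM score equations, i.e. the first order conditions above."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Lam = rng.normal(scale=0.1, size=(n, q))
    V = rng.normal(scale=0.1, size=(p, q))
    for _ in range(n_iter):
        for M, N, Y in ((Lam, V, X), (V, Lam, X.T)):
            eta = np.clip(M @ N.T, -20.0, 20.0)  # guard against overflow
            mu = np.exp(eta)
            for i in range(M.shape[0]):
                w = mu[i]                         # IRLS weights
                z = eta[i] + (Y[i] - mu[i]) / w   # working response
                A = (N * w[:, None]).T @ N
                b = (N * w[:, None]).T @ z
                M[i] = np.linalg.solve(A, b)      # weighted least squares
    return Lam, V

# Hypothetical check: the fitted log-likelihood should beat the trivial
# eta = 0 model on data simulated from a rank-2 Poisson log model.
rng = np.random.default_rng(3)
L0 = rng.normal(scale=0.5, size=(150, 2))
V0 = rng.normal(scale=0.5, size=(40, 2))
X = rng.poisson(np.exp(L0 @ V0.T)).astype(float)
Lam, V = dmf_poisson(X, q=2)
eta = Lam @ V.T
loglik = float((X * eta - np.exp(eta)).sum())
```

Each inner solve is the weighted normal equation mentioned above, so a stationary point of this loop satisfies the first order conditions row-wise and column-wise.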

B.1.2 Sketch of the proof
Intuitively speaking, if either Λ or V is known in Equation (23), the problem reduces to a multiple generalized linear regression model [8]. When both estimators are unknown, it is thus plausible that consistency can still be attained by placing suitable constraints on the space of the estimators Λ, V. In this proof, we first demonstrate that by restricting the L2 norm of the estimator V with the orthogonal design (V′V = I_q), the estimator of Λ_i, for all i = 1, . . ., n, is strongly consistent. We can then derive the consistency of the estimator V_j, for all j = 1, . . ., p, in a similar manner. We start by defining an auxiliary q-by-1 vector G(·): Notice that the vectors G_{Λn}(V_j), G_{Vp}(Λ_i) are defined in a manner similar to the normal equations in Eq. (23). In comparison to the GLM setup, which has either Λ = Λ_n ∈ R^{n×q} or V = V_p ∈ R^{p×q} fixed as observed data, [8] justified consistency through inequality relationships based upon the following vectors: For the precise consistency inequality, with G_{Vp}(Λ_i) as an example, Lemma B.1 implies the following. Our detailed proof then builds an inequality relationship between Eq. (24) and Eq. (25), after which the consistency inequality in [8] can be readily applied. To facilitate understanding, we give a summary of the proof in this section, postponing the inequality proofs to Section B.1.3.

1). The Inequalities.
With ‖·‖₂ denoting the spectral norm for matrices and the Euclidean norm for vectors, we will show that for large p the following inequality holds, where 0 < Z_V^δ(Λ_i) < ∞ is a positive and finite constant introduced later in our detailed inequality proof. If the above inequality holds, we can readily apply Lemma B.1 to obtain the following, where δ_{Λi} > 0 is an arbitrarily small positive scalar. That is, we can find the estimator of Λ_i as the pre-image of the injection G_{Vp}(·). The obtained estimator is sufficiently close to the true parameter Λ_i, with Euclidean distance controlled by δ_{Λi}.
For arbitrary vectors V_j^1, V_j^2 and Λ_i^1, Λ_i^2, there exist Λ_i ∈ (Λ_i^1, Λ_i^2) and V_j ∈ (V_j^1, V_j^2) that satisfy the following. Now, expanding the Euclidean norm of G_{Λn}(V_j): the first and only equality holds due to the mean value theorem in (32) and the expansion of the L2 norm, and the last and only inequality follows from the definition of Z_Λ^δ. For the other lower bound, we can adopt a similar strategy to derive:

2). Detailed proof of the Inequality †

Proof. First, we denote the following. With V′V = I_q, the desired inequality becomes:

B.2 Proof of Theorem 4.2

Under conditions F1-F4 and using a Taylor expansion, for all k = 1, . . ., G: Now note that 1. Under conditions C1-C4, after convergence of Algorithm 1, the consistency result of Theorem 4.1 guarantees that the second term is absorbed for large n and p.

2. Under H_0, it has zero mean and variance given by s²_{n,p}(k).

3. Under condition F3, we have from the SLLN that the convergence holds element-wise.

Proof. Under our DMF setup, we assume X_ij ∼ F(µ_ij = m(η_ij)), where m(·) is the inverse link g⁻¹(·) and F is a member of the exponential family that specifies the relationship between m(η_ij) and V(m(η_ij)). From Theorem 4.1, we defined e_ij = X_ij − m(η_ij) with the following properties:
• η_ij = Λ_i V_j is deterministic;
• e_ij is random with E[e_ij] = 0 and VAR(e_ij) = V(m(η_ij));
• e²_ij is random and converges to 0 a.s.; in fact, due to the squaring, it converges to 0 faster than e_ij for large n, p.

When n and p are large enough that condition C2 implies the convergence result, we directly have e_ij → 0 a.s. as n, p → ∞. For moderate n, p, we should expect the higher-order term m′(η_ij) g′(µ_ij) e²_ij to converge to 0 faster than e_ij, which leaves us with the leading term, where E[ε_ij] = 0, E[ε²_ij] = 1, and E[ε⁴_ij] < ∞ by condition R2. To adapt [38]'s rank theorem to the case of moderate n, p, we thus only need to justify the existence of A and B such that A_i B_j = (√V(µ_ij)/m′(η_ij)) ε_ij, where A, B, and ε satisfy the setup and Assumption 2 of [38]. Now observe that condition (i) in [38] is essentially our condition R1 with the definition of the Pearson residual ε_ij; conditions (ii) and (iii) in [38] are extremely mild in the sense that condition (ii) requires B′B (as well as AA′) to be the auto-covariance matrix of a covariance-stationary process with bounded spectral density, and condition (iii) prevents a few linear combinations of idiosyncratic terms from having unusually large variation.
The existence of A, B, ε satisfying the above assumptions can then be justified as follows. Let h be a variance stabilizing link function, that is, h(µ) ∝ ∫^µ dt/√V(t). The quality of the approximation depends on how close the general link is to a variance stabilizing link; schemes for this Kronecker approximation are discussed in [47]. For our rank determination simulations described in Section 5, we empirically verified that the application of the eigengap theorem performs reasonably well.
Figure 1 shows these results.

Figure 1: Goodness of fit tests: green indicates matched family/link fits. Columns (from left to right) represent the QQ, density, and Pearson residual plots, respectively.

Figure 2: Sensitivity analysis on φ estimation; the red dot indicates the median. All simulations use the log link.

Figure 4: Sample size effect on the power of the family test.

Figure 5: Effect of σ on the power of the family test.

4. Under condition F2, applying Slutsky's theorem, we have from the Lindeberg CLT that T(n, p) = (T¹_{n,p})′ (D¹_{n,p})⁻¹ T¹_{n,p} converges in distribution to a χ² random variable.

h(µ) ∝ ∫^µ dt/√V(t); e.g., h(µ) = arcsin √µ for binomial data and h(µ) = √µ for Poisson data. If h is the true link, then h(µ_ij) = η_ij and m′(η_ij) = √V(µ_ij), leaving ε_ij = e_ij, and thus the assumptions are fully satisfied with A = I_n and B = I_p. Under a general link, similar to [38], we assume that the residuals e_ij = (√V(µ_ij)/m′(η_ij)) ε_ij can be approximated with either A or B diagonal and the other unrestricted, such that E[vec e (vec e)′] = (BB′) ⊗ (AA′).

Table 1: Distributions and their respective canonical links and half-deviance loss functions.

Moreover, we denote by R_{n,p}, T_{n,p}, D_{n,p} the respective processes under the true parameters η_ij instead of their estimates; V(·) here is the variance function as defined in Table 1. Based on this notation, we propose Theorem 4.2 to assess a factorization family.

Theorem 4.2. Assume that F is the correct factorization family. Under conditions (F1)-(F3):
F1. The estimate of η converges to η a.s. element-wise, which can be justified by Theorem 4.1;
F2. For δ s_{n,p} . . .

Here R¹_{n,p}(u_k) is the empirical residual process given the estimated η and cutoff point u_k; based upon R¹_{n,p}(u_k), we define a G-long vector T¹_{n,p} with k-th element being the empirical residual w.r.t. the cutoffs (u_{k−1}, u_k); D¹_{n,p} is a diagonal matrix with entries equal to the empirical variances of T¹_{n,p}; and s²_{n,p}(k) is the (summed) variance for group k under the true parameter η.

Table 2: Family simulation cases.

For each of the simulation cases, we examine the test statistics conditional on η_0 and η_1 from Eq. (16) for both the simulation and the comparison families. More specifically, we examine: (i) normal QQ plots of the normalized residuals

Table 3: Rank simulation cases.