Double fused Lasso regularized regression with both matrix-valued and vector-valued predictors

In many contemporary applications such as longitudinal studies, neuroimaging or civil engineering, a dataset can contain high-dimensional measurements on both matrix-valued and vector-valued variables. Such structure demands statistical tools that can extract information from both types of measurements. In this paper, we propose a double fused Lasso regularized method to handle both matrix-valued and vector-valued predictors in the context of linear regression and logistic regression. An efficient and scalable sGS-ADMM (symmetric Gauss-Seidel based alternating direction method of multipliers) algorithm is derived to obtain the estimator. Global convergence and the Q-linear rate of convergence of the algorithm are established. Consistency of the double fused Lasso estimators holds under mild conditions. Numerical experiments and examples show that the double fused Lasso estimators achieve efficiency gains in estimation and better prediction performance compared to existing estimators.


Introduction
In the era of big data, many datasets with complex structures emerge, which may contain matrix-valued and vector-valued variables simultaneously. For example, bike sharing schemes have gained increasing popularity in recent years and have become an integral part of the transportation network in many cities. Bike rental demand depends on weather conditions and social factors [13], and a method to estimate or forecast the demand is important for bike sharing systems. The two-year historical log from the Capital Bikeshare system in Washington D.C. contains daily bike rental counts from January 1st, 2011 to December 31st, 2012. It also includes a $24 \times 6$ matrix containing hourly weather information such as temperature, humidity and wind speed for each day. Additional information such as month, year, day of the week and a holiday indicator is also recorded, which constitutes a vector-valued predictor. The daily bike rental count is taken to be the response. Multiple linear regression has been used to tackle such data [15] with vector-valued predictors. Since we have both matrix-valued predictors (weather conditions) and vector-valued predictors, traditional regression methods are not directly applicable. New regression tools that can be adapted to such data structures are needed. [50] proposed a matrix regression model
$$y = \langle X, B\rangle + \langle z, \gamma\rangle + \varepsilon, \qquad (1.1)$$
where $y \in \mathbb{R}$ is a continuous response, $X \in \mathbb{R}^{m\times q}$ is a matrix-valued predictor and $z \in \mathbb{R}^p$ is a vector-valued predictor. The matrix $B \in \mathbb{R}^{m\times q}$ is a coefficient matrix of the same size as $X$, and $\gamma \in \mathbb{R}^p$ contains the coefficients for $z$. The inner product $\langle X, B\rangle$ is defined as $\mathrm{tr}(X^TB)$, and $\varepsilon \in \mathbb{R}$ is the noise. Without the matrix-valued predictor $X$, (1.1) reduces to the standard linear regression $y = \langle z, \gamma\rangle + \varepsilon$. Without the vector-valued predictor $z$, (1.1) reduces to the matrix regression model with only a matrix-valued predictor, $y = \langle X, B\rangle + \varepsilon$.
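To make the model concrete, the trace inner product and data generation under (1.1) can be sketched in a few lines of Python. The dimensions, distributions and seed below are illustrative choices, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
m, q, p, n = 6, 4, 3, 50

B = rng.normal(size=(m, q))       # matrix-valued coefficient
gamma = rng.normal(size=p)        # vector-valued coefficient

def inner(X, B):
    """Trace inner product <X, B> = tr(X^T B)."""
    return np.trace(X.T @ B)

# generate n observations from model (1.1): y_i = <X_i, B> + <z_i, gamma> + eps_i
X = rng.normal(size=(n, m, q))
Z = rng.normal(size=(n, p))
eps = rng.normal(size=n)
y = np.array([inner(X[i], B) for i in range(n)]) + Z @ gamma + eps

# <X, B> coincides with the dot product of the vectorized matrices
assert np.isclose(inner(X[0], B), X[0].ravel() @ B.ravel())
```

The last assertion is why later sections may pass between the matrix form $\langle X, B\rangle$ and the vectorized form involving $\mathrm{vec}(B)$ interchangeably.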
In some applications, the response variable can be binary. For example, the diabetes dataset contains physical exam information of 2476 staff members of Beijing Jiaotong University from 2016 to 2018. During each annual physical exam, 62 features are measured, including concentration and volume of erythrocytes, leukocytes and platelets, blood sugar concentration, kidney and liver function tests, facial features and dietary preferences, giving a $62 \times 3$ matrix of physical exam results. In addition, seven covariates including gender, education, occupation and disability status are recorded for each patient. In 2018, 237 staff members were diagnosed with diabetes, and 2239 were not. The association between diabetes and potential predictors is of special interest. In this case, we have a logistic regression model with both a matrix-valued predictor (physical exam results) and a vector-valued predictor. The response is the diabetes indicator, which takes value 1 if the patient has diabetes and 0 otherwise. The logistic regression model is formulated as
$$\mathrm{logit}\, P(y=1) = \log\frac{P(y=1)}{1-P(y=1)} = \langle X, B\rangle + \langle z, \gamma\rangle, \qquad (1.2)$$
where $X \in \mathbb{R}^{m\times q}$ is a matrix-valued predictor and $z \in \mathbb{R}^p$ is a vector-valued predictor. The matrix $B \in \mathbb{R}^{m\times q}$ contains the coefficients for the matrix-valued predictor $X$ and $\gamma \in \mathbb{R}^p$ contains the coefficients for the vector-valued predictor $z$. Regular logistic regression and matrix-variate logistic regression are both special cases of (1.2), obtained by dropping the $X$ term or the $z$ term, respectively. Due to the complex structure of models (1.1) and (1.2) and the high dimensionality in many applications, it is common to assume that the coefficients in (1.1) and (1.2) have a sparse structure, or can be approximated by a low-rank structure. To induce such structure, a variety of regularization methods have recently been proposed.
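A hedged sketch of model (1.2): the code below (arbitrary dimensions, with coefficients scaled down to keep probabilities away from 0 and 1) checks that the log-odds of the simulated probability recovers the linear predictor.

```python
import numpy as np

rng = np.random.default_rng(1)
m, q, p = 6, 4, 3

B = 0.3 * rng.normal(size=(m, q))     # matrix coefficient (scaled for stability)
gamma = 0.3 * rng.normal(size=p)      # vector coefficient

def prob_y1(X, z, B, gamma):
    """P(y = 1) under model (1.2): logit P(y=1) = <X, B> + <z, gamma>."""
    eta = np.trace(X.T @ B) + z @ gamma
    return 1.0 / (1.0 + np.exp(-eta))

X = rng.normal(size=(m, q))
z = rng.normal(size=p)
prob = prob_y1(X, z, B, gamma)
eta = np.trace(X.T @ B) + z @ gamma
# the log-odds reproduces the linear predictor
assert np.isclose(np.log(prob / (1 - prob)), eta)
```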
In the context of linear regression (1.1), if the regression model only has vector-valued predictors, popular penalization methods include the Lasso [44], the fused Lasso [45], the elastic net [53], and the smoothly clipped absolute deviation (SCAD) [12]. Regularization methods for matrix-valued parameters include, but are not limited to, nuclear norm regularization [8,30,34], the multivariate group Lasso [36], the multivariate sparse group Lasso [28] and the matrix regression model based on singular values [50]. In the logistic regression context (1.2), if we only have the vector-valued predictor, [1], [38] and [43] introduced logistic regression models with sparsity constraints. [31] dealt with the group Lasso for the logistic regression model. [22] proposed matrix-variate logistic regression for data containing only a matrix-valued variable.
In this paper, we propose a double fused Lasso regularized method which imposes a nuclear norm penalty on $B$ and a fused Lasso penalty on the rows of $B$. This induces a low-rank structure in $B$ as well as sparsity in the differences of successive rows, since in a longitudinal or image processing application, coefficients may change smoothly over a particular period of time. An $L_1$ norm and a fused Lasso penalty are imposed on $\gamma$, which encourages sparsity in $\gamma$ as well as in the differences of its successive elements. The formulation is
$$\min_{B,\gamma}\ \frac{1}{2n}\sum_{i=1}^n \big(y_i - \langle X_i, B\rangle - \langle z_i, \gamma\rangle\big)^2 + \lambda_1 \|B\|_* + \lambda_2 \sum_{j=2}^m \|B_{j\cdot} - B_{(j-1)\cdot}\|_1 + \lambda_3 \|\gamma\|_1 + \lambda_4 \sum_{k=2}^p |\gamma_k - \gamma_{k-1}| \qquad (1.3)$$
for linear regression, and
$$\min_{B,\gamma}\ \frac{1}{n}\sum_{i=1}^n \Big[\log\big(1+e^{\langle X_i, B\rangle + \langle z_i, \gamma\rangle}\big) - y_i\big(\langle X_i, B\rangle + \langle z_i, \gamma\rangle\big)\Big] + \lambda_1 \|B\|_* + \lambda_2 \sum_{j=2}^m \|B_{j\cdot} - B_{(j-1)\cdot}\|_1 + \lambda_3 \|\gamma\|_1 + \lambda_4 \sum_{k=2}^p |\gamma_k - \gamma_{k-1}| \qquad (1.4)$$
for logistic regression. Here $\gamma_k$ denotes the $k$th element of $\gamma$, $B_{j\cdot}$ denotes the $j$th row of $B$ and $n$ denotes the sample size. We impose the $L_1$ norm $\|\gamma\|_1$ and the fused Lasso term $\sum_{k=2}^p |\gamma_k - \gamma_{k-1}|$ on the coefficients $\gamma$ of the vector-valued predictor, and impose the nuclear norm $\|B\|_*$ and the matrix-type fused Lasso term $\sum_{j=2}^m \|B_{j\cdot} - B_{(j-1)\cdot}\|_1$ on the coefficients $B$ of the matrix-valued predictor. We call model (1.3) the double fused Lasso regularized matrix regression, or DFMR for simplicity. DFMR degenerates to the fused Lasso [45] when $B = 0$. It becomes the matrix-type fused Lasso when $\gamma = 0$, which is an extension of the regularized matrix regression [8,10,50]. We call model (1.4) the double fused Lasso regularized matrix logistic regression, or DFMLR for simplicity. DFMLR degenerates to the fused Lasso regularized logistic regression [29] when $B = 0$. It becomes the matrix-type fused Lasso regularized logistic regression when $\gamma = 0$.
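The double fused Lasso penalty is straightforward to evaluate. The sketch below assumes the pairing of tuning parameters with penalty terms in the order the terms are listed ($\lambda_1$ nuclear norm, $\lambda_2$ row-wise fused term, $\lambda_3$ $L_1$ norm, $\lambda_4$ element-wise fused term); the toy $B$ and $\gamma$ are rank-1 and piecewise constant, exactly the structures the penalty favors.

```python
import numpy as np

def double_fused_penalty(B, gamma, lam1, lam2, lam3, lam4):
    """Evaluate the double fused Lasso penalty.

    lam1: nuclear norm of B              lam2: fused Lasso on the rows of B
    lam3: L1 norm of gamma               lam4: fused Lasso on the elements of gamma
    """
    nuclear = np.linalg.svd(B, compute_uv=False).sum()
    row_fused = np.abs(np.diff(B, axis=0)).sum()    # sum_j ||B_j. - B_(j-1).||_1
    l1 = np.abs(gamma).sum()
    vec_fused = np.abs(np.diff(gamma)).sum()        # sum_k |gamma_k - gamma_(k-1)|
    return lam1 * nuclear + lam2 * row_fused + lam3 * l1 + lam4 * vec_fused

# rank-1 B with piecewise-constant rows; sparse piecewise-constant gamma
B = np.outer([0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 1.0])
gamma = np.array([0.0, 0.0, 2.0, 2.0])
val = double_fused_penalty(B, gamma, 1.0, 1.0, 1.0, 1.0)
# nuclear = sqrt(6), row_fused = 6, l1 = 4, vec_fused = 2
assert np.isclose(val, 12.0 + np.sqrt(6.0))
```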
To solve optimization problems similar to (1.3) or (1.4), first-order methods such as the alternating direction method of multipliers (ADMM) and the augmented Lagrangian method (ALM) are widely used, see e.g., [17,21]. Specifically, [26] proposed a linearized ADMM algorithm for sparse group Lasso and fused Lasso models. [48] considered ALM as a solver for the fused Lasso signal approximator problem. [49] developed an efficient ALM for large-scale non-overlapping sparse group Lasso problems. [50] proposed the Nesterov optimal gradient method for spectral regularized matrix regression. Due to the coupled variables in the double fused Lasso penalties, a natural choice for solving (1.3) and (1.4) is the ADMM algorithm. It is more efficient than ALM because ADMM solves for $B$ and $\gamma$ alternately instead of solving for $B$ and $\gamma$ simultaneously. However, in high-dimensional scenarios, the inversion of high-dimensional matrices is the computational bottleneck for scalability. To reduce the computational cost, we consider the duals of (1.3) and (1.4), in which such inversions are avoided.
We propose an efficient and scalable symmetric Gauss-Seidel based ADMM (sGS-ADMM) algorithm to solve the duals of (1.3) and (1.4). In particular, every subproblem for the dual of (1.3) has a closed-form solution. The global convergence and Q-linear rate of convergence of the algorithm are established. The resulting DFMR and DFMLR estimators show superior estimation and prediction performance compared with the matrix Lasso and Lasso methods [50]. We also investigate the theoretical properties of the DFMR and DFMLR estimators.
The rest of the paper is organized as follows. Section 2 proposes an efficient sGS-ADMM algorithm to obtain the DFMR and DFMLR estimators, and establishes the global convergence and Q-linear rate of convergence of the algorithm. In Section 3, we investigate the consistency for the DFMR and DFMLR estimators. Section 4 demonstrates the performance of the DFMR estimator and the DFMLR estimator through simulations. Examples are included in Section 5. We conclude this paper and discuss future directions in Section 6. Proofs of theorems and other technical details are deferred to the Appendix.
We first introduce some notation which will be useful for further discussion. Given a vector $x \in \mathbb{R}^n$, its $L_1$ norm, $L_2$ norm and $L_\infty$ norm are defined as $\|x\|_1 = \sum_{i=1}^n |x_i|$, $\|x\|_2 = (\sum_{i=1}^n x_i^2)^{1/2}$ and $\|x\|_\infty = \max\{|x_i|, i = 1, 2, \cdots, n\}$. The closed balls centered at $0$ with radius $r \ge 0$ based on the $L_2$ norm and the $L_\infty$ norm are defined by $B_{\|\cdot\|_2}(0;r) = \{x : \|x\|_2 \le r\}$ and $B_{\|\cdot\|_\infty}(0;r) = \{x : \|x\|_\infty \le r\}$. For a random variable $x$, we denote its sub-Gaussian norm as $\|x\|_{\psi_2} = \sup_{p\ge 1} (E|x|^p)^{1/p}/\sqrt{p}$. For any matrix $A \in \mathbb{R}^{m\times n}$, its singular value decomposition is denoted by $A = U\Sigma V^T$, where $U \in \mathbb{R}^{m\times r}$ and $V \in \mathbb{R}^{n\times r}$ have orthonormal columns, $r$ is the rank of $A$, and $\Sigma \in \mathbb{R}^{r\times r}$ is a diagonal matrix. The diagonal elements of $\Sigma$ are called the singular values of $A$, and we denote them by $\sigma_i(A)$, $i = 1, 2, \cdots, r$. The subdifferential of $\|A\|_*$ is $\partial\|A\|_* = \{UV^T + W \mid W \in \mathbb{R}^{m\times n},\ U^TW = 0,\ WV = 0,\ \|W\|_2 \le 1\}$. The Frobenius norm ($F$-norm), nuclear norm and spectral norm are $\|A\|_F = (\sum_{i=1}^r \sigma_i^2(A))^{1/2}$, $\|A\|_* = \sum_{i=1}^r \sigma_i(A)$ and $\|A\|_2 = \max\{\sigma_i(A), i = 1, 2, \cdots, r\}$. We use $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$ to denote the largest and smallest eigenvalues of $A$, respectively. The closed ball centered at $0$ with radius $r \ge 0$ based on the spectral norm is defined as $B_{\|\cdot\|_2}(0;r) = \{A : \|A\|_2 \le r\}$.

Estimation algorithm
In this section, we propose an efficient sGS-ADMM algorithm to obtain the DFMR estimator, and then generalize the algorithm to solve for the DFMLR estimator. Convergence of the algorithm is discussed.

Model reformulation and dual
We first reformulate model (1.3) to find its dual. Let $(y_i, X_i, z_i)$, $i = 1, \ldots, n$, be independent and identically distributed copies of $(y, X, z)$. For simplicity, we denote $y = (y_1, y_2, \cdots, y_n)^T$, $X = (\mathrm{vec}(X_1), \cdots, \mathrm{vec}(X_n))^T$ and $Z = (z_1, \cdots, z_n)^T$, where the vec operator stacks a matrix into a vector columnwise. Let $A_i$ be the $(i-1)\times i$ first-difference matrix
$$A_i = \begin{pmatrix} -1 & 1 & & \\ & \ddots & \ddots & \\ & & -1 & 1 \end{pmatrix}, \qquad (2.1)$$
and let
$$C = I_q \otimes A_m \in \mathbb{R}^{(m-1)q \times mq}, \qquad (2.2)$$
so that $C\,\mathrm{vec}(B)$ stacks the successive row differences $B_{j\cdot} - B_{(j-1)\cdot}$, $j = 2, \ldots, m$. Then (1.3) can be reformulated as
$$\min_{B,\gamma}\ \frac{1}{2n}\|y - X\mathrm{vec}(B) - Z\gamma\|_2^2 + \lambda_1\|B\|_* + \lambda_2\|C\mathrm{vec}(B)\|_1 + P(\gamma), \qquad (2.3)$$
where $P(\gamma) = \lambda_3\|\gamma\|_1 + \lambda_4\|A_p\gamma\|_1$. We introduce two slack variables $\xi \in \mathbb{R}^n$ and $\eta \in \mathbb{R}^{(m-1)q}$ with $\xi = y - X\mathrm{vec}(B) - Z\gamma$ and $\eta = C\mathrm{vec}(B)$. Then (2.3) can be written as
$$\min_{B,\gamma,\xi,\eta}\ \frac{1}{2n}\|\xi\|_2^2 + \lambda_1\|B\|_* + \lambda_2\|\eta\|_1 + P(\gamma) \quad \text{s.t.} \quad \xi = y - X\mathrm{vec}(B) - Z\gamma,\ \eta = C\mathrm{vec}(B). \qquad (2.4)$$
The objective function in (2.4) is convex with respect to every variable $B$, $\gamma$, $\xi$ and $\eta$, but it is smooth only with respect to $\xi$. The two constraints are both linear, so (2.4) is a convex and nonsmooth optimization problem. Let $u \in \mathbb{R}^n$ and $v \in \mathbb{R}^{(m-1)q}$ be the Lagrange multipliers, and let $P^*(\cdot)$ be the Fenchel conjugate function of $P(\cdot)$. Then the dual of (2.4) is equivalent to
$$\min_{u,v,D,w,t}\ \frac{n}{2}\|u\|_2^2 - \langle y, u\rangle + P^*(w) + \delta\big(D; B_{\|\cdot\|_2}(0;\lambda_1)\big) + \delta\big(t; B_{\|\cdot\|_\infty}(0;\lambda_2)\big) \quad \text{s.t.} \quad X^Tu - C^Tv = \mathrm{vec}(D),\ Z^Tu = w,\ v = t. \qquad (2.5)$$
Based on the duality theorem, we can use the ADMM algorithm to get the solutions of (2.4) and (2.5). The augmented Lagrangian function of (2.4) is given by
$$L_\sigma(B,\gamma,\xi,\eta;u,v) = \frac{1}{2n}\|\xi\|_2^2 + \lambda_1\|B\|_* + \lambda_2\|\eta\|_1 + P(\gamma) + \langle u,\ y - X\mathrm{vec}(B) - Z\gamma - \xi\rangle + \langle v,\ C\mathrm{vec}(B) - \eta\rangle + \frac{\sigma}{2}\|y - X\mathrm{vec}(B) - Z\gamma - \xi\|_2^2 + \frac{\sigma}{2}\|C\mathrm{vec}(B) - \eta\|_2^2.$$
If we directly optimize $L_\sigma(B,\gamma,\xi,\eta;u,v)$ using the ADMM algorithm, it involves the computation of the inverses of $X^TX$ and $Z^TZ$, which can be computationally expensive in high-dimensional problems. Furthermore, note that (2.4) is a multi-block convex composite optimization problem with linear equality constraints. For such problems, convergence cannot be achieved by ADMM in general [4]. An augmented ADMM algorithm is proposed in [52]. However, its convergence is guaranteed only for two-block optimization problems and it may not achieve convergence for multi-block optimization problems. Therefore we consider the optimization problem (2.5), and employ the sGS-ADMM algorithm to solve it. The sGS-ADMM algorithm was first proposed in [5], which combines the sGS technique with the ADMM algorithm.
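The operator $C$ can be checked numerically. The sketch below assumes the columnwise vec convention stated above and realizes $C$ as the Kronecker product $I_q \otimes A_m$, i.e. as a matrix of size $(m-1)q \times mq$ mapping $\mathrm{vec}(B)$ to the stacked row differences.

```python
import numpy as np

m, q = 5, 3
rng = np.random.default_rng(2)
B = rng.normal(size=(m, q))

# (m-1) x m first-difference matrix: row j is e_{j+1}^T - e_j^T
A_m = np.diff(np.eye(m), axis=0)

# with columnwise vec, the row differences of B are C @ vec(B) with C = I_q kron A_m
C = np.kron(np.eye(q), A_m)
vecB = B.flatten(order="F")               # columnwise stacking
eta = C @ vecB

# compare against the row differences B_j. - B_(j-1). computed directly
direct = np.diff(B, axis=0).flatten(order="F")
assert np.allclose(eta, direct)
```

The sign convention of $A_m$ is immaterial for the penalty, since only the $L_1$ norm of the differences enters the objective.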
A brief introduction to the sGS technique and the sGS-ADMM algorithm for a general convex composite programming model is included in Appendix A.2.

Algorithm analysis
Now we employ the sGS-ADMM algorithm to solve the optimization problem (2.5). For convenience, let $\alpha = (\mathrm{vec}(D)^T, w^T, t^T)^T$. Although (2.5) contains five variables $u$, $v$, $D$, $w$ and $t$, it can be considered as a 3-block optimization problem in $u$, $v$ and $\alpha$, and it contains only one nonsmooth term, which involves the variable $\alpha$. The augmented Lagrangian function is
$$L_\sigma(u,v,\alpha;x) = \frac{n}{2}\|u\|_2^2 - \langle y, u\rangle + P^*(w) + \delta\big(D; B_{\|\cdot\|_2}(0;\lambda_1)\big) + \delta\big(t; B_{\|\cdot\|_\infty}(0;\lambda_2)\big) + \langle x_1,\ X^Tu - C^Tv - \mathrm{vec}(D)\rangle + \langle x_2,\ Z^Tu - w\rangle + \langle x_3,\ v - t\rangle + \frac{\sigma}{2}\Big(\|X^Tu - C^Tv - \mathrm{vec}(D)\|_2^2 + \|Z^Tu - w\|_2^2 + \|v - t\|_2^2\Big),$$
where $\sigma > 0$, $x_1 \in \mathbb{R}^{mq}$, $x_2 \in \mathbb{R}^p$ and $x_3 \in \mathbb{R}^{(m-1)q}$ are Lagrange multipliers, and $x = (x_1^T, x_2^T, x_3^T)^T$. Note that the augmented Lagrangian function is strongly convex with respect to each of the variables $u$, $v$ and $\alpha$. Hence, the majorization step given in [5] is not necessary. Moreover, every subproblem in the sGS-ADMM algorithm has a closed-form solution; the iterative scheme for solving (2.5) is summarized in Table 1.
Note that each subproblem in Table 1 has a closed-form solution, which follows from the properties of the augmented Lagrangian function or the properties of the proximal mapping (see Appendix A.1 for properties of the proximal mapping and the Appendix for the derivations of the solutions in Table 1). The stopping criterion is derived from the KKT conditions; see Appendix A.3 for details.

Convergence analysis
In this section, we establish the global convergence and the Q-linear rate of convergence of the sGS-ADMM algorithm in Table 1. While general results for the convergence of the sGS-ADMM algorithm are available in [5] and [20], we verify that the assumptions for global convergence and the Q-linear rate of convergence are satisfied in our context; see Appendix A.5 for details. Let $\theta^k = (u^k, v^k, \alpha^k, x^k)$ be the value of $\theta$ after the $k$th iteration.
To investigate the convergence rate, we first give a brief review of the Q-linear rate of convergence. Let $\{x^k\}$ be the sequence of iterates and $x^*$ the optimal solution. Suppose that
$$\lim_{k\to\infty} \frac{\|x^{k+1} - x^*\|}{\|x^k - x^*\|^r} = q$$
for some $q \ge 0$; then we say that the Q-order of convergence of $\{x^k\}$ to $x^*$ is $r$. In particular, if $r = 1$ and $0 < q < 1$, then the convergence is said to be Q-linear. If $r = 1$ and $q = 0$, the convergence is Q-superlinear. The prefix "Q" means "quotient". Another type of linear convergence is R-linear convergence, where the prefix "R" is for "root". We say that $\{x^k\}$ converges to $x^*$ R-linearly if $\|x^k - x^*\| \le \nu_k$ for all $k$, where $\nu_k$ converges Q-linearly to zero. More details are included in [35]. Now we introduce some notation. For any self-adjoint positive semi-definite linear operator $M_0 : \mathcal{X} \to \mathcal{X}$, let $\mathrm{dist}_{M_0}(x, \Omega) = \inf_{x' \in \Omega} \|x - x'\|_{M_0}$ for any $x \in \mathcal{X}$ and any set $\Omega \subseteq \mathcal{X}$. Recalling that $\alpha = (\mathrm{vec}(D)^T, w^T, t^T)^T$, we define $h(\alpha) = P^*(w) + \delta(t; B_{\|\cdot\|_\infty}(0;\lambda_2)) + \delta(D; B_{\|\cdot\|_2}(0;\lambda_1))$. According to Theorem 12.17 in [41], there exists a self-adjoint and positive semi-definite linear operator $\Sigma_\alpha$ such that for all $\alpha$, $\alpha'$, $\zeta \in \partial h(\alpha)$ and $\zeta' \in \partial h(\alpha')$, $\langle \zeta - \zeta',\ \alpha - \alpha'\rangle \ge \|\alpha - \alpha'\|^2_{\Sigma_\alpha}$. Let $\Phi$ be a linear operator acting on $(u, v, \alpha, x)$ with adjoint $\Phi^*$, and define a self-adjoint linear operator as follows:
$$M := \mathrm{Diag}\big(I,\ \sigma(I + CC^T),\ \sigma\Sigma_I + \Sigma_\alpha,\ (\tau\sigma)^{-1}I_x\big) + s\tau\sigma\Phi\Phi^*,$$
where $\Sigma_I = I_x = \mathrm{Diag}(I_{mq}, I_p, I_{(m-1)q})$. Let $\bar\Omega$ be the optimal solution set satisfying the KKT conditions.

Theorem 2.2. There exists $0 < \mu < 1$ such that for all $k \ge 1$,
$$\mathrm{dist}^2_M(\theta^{k+1}, \bar\Omega) \le \mu\, \mathrm{dist}^2_M(\theta^k, \bar\Omega).$$

Theorem 2.2 establishes the Q-linear rate of convergence for the sGS-ADMM algorithm, which guarantees that the sequence $\{(u^k, v^k, \alpha^k, x^k)\}$ generated by our algorithm converges to the optimal solution.
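The distinction between Q-linear and R-linear convergence can be illustrated numerically; the sequences below are textbook-style examples, not iterates of the algorithm.

```python
# Q-linear: x_k = x* + q^k has every successive quotient equal to q < 1.
x_star, q = 1.0, 0.5
xs = [x_star + q ** k for k in range(1, 12)]
ratios = [abs(xs[k + 1] - x_star) / abs(xs[k] - x_star) for k in range(len(xs) - 1)]
assert all(abs(r - q) < 1e-12 for r in ratios)

# R-linear but not Q-linear: errors e_k = q^k, with every other term damped by 0.1.
# The quotients alternate between 0.05 and 5 (so the quotient test fails), yet
# e_k <= q^k for all k, i.e. the errors are dominated by a Q-linear sequence.
es = [q ** k * (1.0 if k % 2 == 0 else 0.1) for k in range(12)]
ratios_r = [es[k + 1] / es[k] for k in range(len(es) - 1)]
assert max(ratios_r) > 1.0
assert all(e <= q ** k for k, e in enumerate(es))
```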

Model reformulation and dual
To obtain the DFMLR estimator, we first reformulate the optimization problem (1.4) as (2.6), with the matrices $C$ and $A_p$ defined in (2.1) and (2.2). We introduce two slack variables $\xi \in \mathbb{R}^n$ and $\eta \in \mathbb{R}^{(m-1)q}$ with $\xi = X\mathrm{vec}(B) + Z\gamma$ and $\eta = C\mathrm{vec}(B)$. Then (2.6) can be written as
$$\min_{B,\gamma,\xi,\eta}\ \frac{1}{n}\sum_{i=1}^n \big[\log(1 + e^{\xi_i}) - y_i\xi_i\big] + \lambda_1\|B\|_* + \lambda_2\|\eta\|_1 + P(\gamma) \quad \text{s.t.} \quad \xi = X\mathrm{vec}(B) + Z\gamma,\ \eta = C\mathrm{vec}(B). \qquad (2.7)$$
The objective function in (2.7) is convex and nonsmooth. The dual of (2.7), denoted by (2.8), is derived in the same way as (2.5).

Algorithm analysis
The optimization problem (2.8) can also be solved by the sGS-ADMM algorithm, with the augmented Lagrangian function defined analogously. The iterative scheme of the sGS-ADMM algorithm for solving (2.8) is similar to the scheme in the context of DFMR as described in Table 1. The objective functions of the $D$, $w$, $t$ subproblems are the same as those for DFMR, so the $D$, $w$, $t$ subproblems have the same solutions as for DFMR. Now we look at the remaining subproblems. The objective functions of the $u$-subproblem and the $v$-subproblem are strongly convex and smooth, and their closed-form solutions can be obtained directly from the derivative of the augmented Lagrangian function (see Appendix A.4).
Finally, the $s$-subproblem at the $k$th iteration has no closed-form solution. However, its objective function is smooth, so we take the derivative of the objective function with respect to each component of $s$ and set it to zero; the resulting one-dimensional equations are then solved by the bisection method. More details are included in Appendix A.4.
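The componentwise root-finding step can be sketched generically. The function g below is a hypothetical stationarity equation of the logistic-conjugate type, not the paper's exact derivative; the bisection routine itself only requires a continuous g with a bracketed root.

```python
import math

def bisect(g, lo, hi, tol=1e-10, max_iter=200):
    """Bisection for a root of a continuous g on [lo, hi], assuming g(lo)*g(hi) <= 0."""
    glo = g(lo)
    assert glo * g(hi) <= 0, "root must be bracketed"
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        gm = g(mid)
        if abs(gm) < tol or hi - lo < tol:
            return mid
        if glo * gm > 0:        # root lies in the right half
            lo, glo = mid, gm
        else:                   # root lies in the left half
            hi = mid
    return 0.5 * (lo + hi)

# hypothetical strictly increasing derivative on (0, 1), solved for its root
c = 0.3
root = bisect(lambda s: s + math.log(s / (1 - s)) - c, 1e-12, 1 - 1e-12)
assert abs(root + math.log(root / (1 - root)) - c) < 1e-6
```

Each halving of the bracket gains one bit of accuracy, so 200 iterations are far more than enough to reach the $10^{-10}$ interval tolerance on $(0, 1)$.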

Convergence analysis
To explore the convergence result, we define the operator $M$ analogously to the DFMR case; note that the operator $M$ here differs from that in DFMR. Let $\bar\Omega$ denote the optimal solution set satisfying the KKT conditions.
The limit of the sequence generated by the algorithm is an optimal solution of (2.8), and $\bar x$ is an optimal solution of (2.6). Moreover, there exists $0 < \mu < 1$ such that for all $k \ge 1$,
$$\mathrm{dist}^2_M(\theta^{k+1}, \bar\Omega) \le \mu\, \mathrm{dist}^2_M(\theta^k, \bar\Omega).$$

DFMR
In this section, we investigate the consistency of $\hat B$ and $\hat\gamma$. Let $B^*$ and $\gamma^*$ be the true values of $B$ and $\gamma$. We state the following conditions.

Condition 1. There exist two positive constants $\underline{C}_X$ and $\overline{C}_X$ such that

Condition 2.
There exist two positive constants $\underline{C}_Z$ and $\overline{C}_Z$ that bound all eigenvalues of $n^{-1}Z^TZ$ from below and above, respectively.

Condition 3. The error $\varepsilon_i$ satisfies $E\varepsilon_i = 0$ and follows a sub-Gaussian distribution, i.e., its tail probabilities are controlled by two constants $k$ and $\sigma_0$.

Condition 4. The tuning parameters $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$ are taken at appropriate rates in $n$, $m$, $q$ and $p$.

Condition 1 is the matrix version of the restricted isometry condition suggested by [39]. It is widely used in the analysis of high-dimensional low-rank matrices, for example [24,25,42]. It indicates that the smallest eigenvalue of $n^{-1}X^TX$ is lower bounded by $\underline{C}_X$, and its largest eigenvalue is upper bounded by $\overline{C}_X$. Conditions 2-3 are commonly used in variable selection for high-dimensional linear regression, for example [2,9,54]. Theorem 3.1 establishes the consistency of the estimated matrix $\hat B$ and vector $\hat\gamma$.

Theorem 3.1. Assume that Conditions 1-4 hold. Suppose that
where the symbol $\otimes$ denotes the Kronecker product. Then $\|\hat B - B^*\|_F^2 + \|\hat\gamma - \gamma^*\|_2^2 = O_p(\max\{mq, p\}/n)$. Assumption (3.1) requires that $X$ and $z$ are only weakly correlated. A similar condition is used to study the consistency of the bridge estimator for sparse regression with quadratic measurements in [11]. From Theorem 3.1, we learn that $\|\hat B - B^*\|_F^2 + \|\hat\gamma - \gamma^*\|_2^2$ converges at the rate of $O(\max\{mq, p\}/n)$.

DFMLR
Now we discuss the consistency of $\hat B$ and $\hat\gamma$ for DFMLR. Let $B^*$ and $\gamma^*$ be the true values of $B$ and $\gamma$.
Condition 5. There exist two positive constants $\underline{C}_X$ and $\overline{C}_X$ such that

Condition 6.
There exist two positive constants $\underline{C}_Z$ and $\overline{C}_Z$ that bound all eigenvalues of $n^{-1}\sum_{i=1}^n K_i z_i z_i^T$ from below and above, respectively.
The tuning parameters are again taken at appropriate rates. Condition 7 assumes that $\mathrm{vec}(X)$ and $z$ both follow sub-Gaussian distributions, which is also assumed by [10] in high-dimensional trace regression with a nuclear norm penalty. Theorem 3.2 provides the convergence rate and consistency results for $\hat B$ and $\hat\gamma$.

Theorem 3.2. Assume that Conditions 5-8 hold, and suppose that an assumption analogous to (3.1) holds. Then $\|\hat B - B^*\|_F^2 + \|\hat\gamma - \gamma^*\|_2^2$ also converges at the rate of $O(\max\{mq, p\}/n)$ in the context of DFMLR.

Simulation
In this section, we demonstrate the performance of DFMR and DFMLR with numerical experiments. To evaluate estimation performance, we computed the average root mean squared errors (RMSEs) for each estimator of $B$ and $\gamma$, denoted by RMSE($B$) and RMSE($\gamma$), based on 100 repetitions. To evaluate prediction performance, we use a testing dataset with the same sample size as the training dataset. For the DFMR estimator, the prediction error is the root mean squared error over the testing dataset, denoted by RMSE($y$); specifically, $\mathrm{RMSE}(y) = \big(n^{-1}\sum_{i=1}^n (y_i - \hat y_i)^2\big)^{1/2}$, where $\hat y_i$ is the fitted value for the $i$th observation. For the DFMLR estimator, the prediction error is the misclassification rate, denoted by MCR, since we treat prediction as classification [23] on the testing dataset. The experiments were implemented in MATLAB 2017b on a desktop computer with an i5-8250U 1.80 GHz CPU and 8 GB of RAM.
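The two evaluation metrics can be stated precisely in code; the classification threshold of 0.5 below is the conventional choice and an assumption on our part.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error over a testing set."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mcr(y_true, prob_pred, threshold=0.5):
    """Misclassification rate: predict 1 when the fitted probability exceeds the threshold."""
    labels = (np.asarray(prob_pred) > threshold).astype(int)
    return np.mean(labels != np.asarray(y_true))

assert np.isclose(rmse([3.0, -1.0, 2.0], [2.0, 0.0, 2.0]), np.sqrt(2.0 / 3.0))
assert np.isclose(mcr([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.7]), 0.5)   # two of four wrong
```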

DFMR
We compare the following three estimators: the DFMR estimator obtained by solving (2.5), and two estimators obtained by the regularization methods of [50], i.e., the matrix Lasso regression estimator (MLR) and the Lasso regression estimator (LR). The MLR and LR estimators are computed by the MATLAB toolbox TensorReg from [50]. For the DFMR estimator, we adopt a common choice of tuning parameters. We fixed $m = 100$, $q = 50$, $p = 250$ and varied $n$ from 200 to 1600. Each element in the matrix-valued predictor $X$ and the vector-valued predictor $z$ was drawn independently from the standard normal distribution. We generated $B_0 = b_1b_2^T$, where $b_1 \in \mathbb{R}^m$ and $b_2 \in \mathbb{R}^q$ are two vectors. The first $m/4$ elements and last $m/4$ elements of $b_1$ were 0 and the middle $m/2$ elements were 1, and $b_2$ had a similar structure. Thus the elements of $B_0$ were 0 except for the elements in the center square, which were 1, and the rank of $B_0$ is 1. Let $R < \min(m, q)$ denote the rank of the matrix coefficient $B$. Then $B$ was constructed by adding an $(R-1)\times(R-1)$ identity matrix above the left corner of the center square in $B_0$. Let $s$ denote the sparsity level of $\gamma$, $0 \le s \le 1$, meaning that the proportion of nonzero elements in $\gamma$ is $s$; the nonzero elements of $\gamma$ all took value 1. We fixed $R = 5$ and $s = 0.01$. The error $\varepsilon$ was drawn from a standard normal distribution. The estimation and prediction performance of each estimator is summarized in Table 2. The numbers in parentheses are the standard deviations computed from the 100 repetitions.
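The construction of the coefficient matrix B can be reproduced as follows. The exact placement of the identity block ("above the left corner of the center square") is our reading of the description, so the offsets below are an assumption; the rank bookkeeping, however, is checkable.

```python
import numpy as np

def make_B(m, q, R):
    """Rank-R coefficient matrix: a rank-1 center block of ones (B0) plus an
    (R-1) x (R-1) identity block placed just above the block's upper-left corner
    (hypothetical placement)."""
    b1 = np.zeros(m); b1[m // 4 : m // 4 + m // 2] = 1.0
    b2 = np.zeros(q); b2[q // 4 : q // 4 + q // 2] = 1.0
    B = np.outer(b1, b2)                  # B0 = b1 b2^T, rank 1
    r0, c0 = m // 4 - (R - 1), q // 4
    B[r0 : r0 + R - 1, c0 : c0 + R - 1] += np.eye(R - 1)
    return B

assert np.linalg.matrix_rank(make_B(100, 50, 1)) == 1   # R = 1 leaves the rank-1 B0
assert np.linalg.matrix_rank(make_B(100, 50, 5)) == 5   # identity block raises rank to R
```

The added identity rows are disjoint from the row support of the center block, so each of the $R-1$ unit rows contributes one to the rank.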
The DFMR estimator has the best estimation and prediction performance for all sample sizes. For example, at sample size 400, the DFMR estimator reduces RMSE($B$) and RMSE($\gamma$) by 92% and 96% compared to the MLR estimator, and reduces RMSE($B$) and RMSE($\gamma$) by 93% and 97% compared to the LR estimator. For prediction performance, the DFMR estimator reduces RMSE($y$) by 92% compared to the MLR estimator and by 93% compared to the LR estimator. The MLR estimator has very large RMSEs when the sample size is small, but it performs much better as the sample size increases. The LR estimator also has large RMSEs in estimation and prediction, but in contrast to the MLR estimator, its performance does not improve much as the sample size increases.

We then fixed $n = 300$, $m = 100$, $q = 50$ and varied the dimension of $\gamma$ under the setting of Table 2. The results are summarized in Table 3. The DFMR estimator still has the best estimation and prediction performance. Both MLR and LR have very large RMSEs in the estimation of $\gamma$, because their objective functions do not account for the sparsity structure of $\gamma$.
We also fixed n = 300, p = 250 and varied the dimension of B under the setting of Table 2. The results are presented in Table 4. DFMR again shows the best estimation and prediction performance for different m and q.

DFMLR
In this section, we compare the numerical performance of three estimators: the DFMLR estimator obtained by solving (2.8), and two estimators obtained by the regularization methods of [50], i.e., the matrix Lasso penalized logistic regression estimator (MLLR) and the Lasso penalized logistic regression estimator (LLR). The MLLR and LLR estimators are computed by the MATLAB toolbox TensorReg from [50]. For the DFMLR estimator, we choose $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$ following [18], where $n_+$ is the number of $y_i$'s taking value 1 and $n_- = n - n_+$ is the number of $y_i$'s taking value 0.
We fixed m = 100, q = 50, p = 250, R = 5 and s = 0.01 and varied n from 200 to 1600. The coefficients B and γ, the matrix-valued predictor X and the vector-valued predictor z were generated in the same way as in Section 4.1. The measures of estimation performance RMSE(B) and RMSE(γ) as well as the misclassification rate MCR of the DFMLR, MLLR and LLR estimators are summarized in Table 5. The numbers in the parentheses are the standard deviations computed from the 100 repetitions.
Based on the results in Table 5, the DFMLR estimator has the best estimation and classification performance for all sample sizes. Taking sample size 400 as an example, the DFMLR estimator reduces RMSE($B$) and RMSE($\gamma$) by 54% and 93% compared to the MLLR estimator, and reduces RMSE($B$) by 54% and RMSE($\gamma$) by 92% compared to the LLR estimator. For prediction performance, DFMLR reduces the misclassification rate by 66% compared to the MLLR estimator and by 71% compared to the LLR estimator. When the sample size is 1600, the MATLAB program for computing the LLR estimator fails, and thus no results are recorded.
We then fixed the sample size $n$ at 300 and varied the dimension of $\gamma$ under the setting that produced Table 5. Finally, we fixed $n = 300$ and $p = 250$ but varied the dimension of $B$. The results are shown in Tables 6 and 7. In both settings, the DFMLR estimator has the smallest RMSEs for the estimation of $B$ and $\gamma$ and the smallest misclassification rate among the compared estimators.

Signal shapes
In this section, we illustrate the effect of the nuclear norm penalty in estimation. For this purpose, we introduce a new estimator, the fused matrix Lasso regression estimator, denoted by FMR. The FMR estimator is obtained from the optimization problem whose objective function is the same as that in (1.3), but without the nuclear norm penalty. Thus the effect of the nuclear norm penalty can be illustrated by comparing the DFMR estimator and the FMR estimator. We generated the data from (1.1) without the vector-valued predictor $z$, i.e., $y = \langle X, B\rangle + \varepsilon$, where $\varepsilon$ followed a standard normal distribution. The matrix predictor $X$ was a $64\times 64$ matrix with entries independently drawn from the standard normal distribution. The elements of the coefficient matrix $B$ were binary, taking value 1 according to a variety of signal shapes displayed in the first column of Figure 1. The sample size $n$ was fixed at 500. The DFMR, FMR, MLR and LR estimators are plotted in the second, third, fourth and fifth columns of Figure 1, and their RMSE($B$) values are arranged in Table 8. The rank of the true signals varies from 9 to 20. Regardless of the rank, the DFMR estimator has the best visual recovery of the true signal, followed by the FMR estimator (with only the fused Lasso penalty) and the MLR estimator (with only the nuclear norm penalty). The LR estimator (with the $L_1$ penalty, but neither the nuclear norm nor the fused Lasso penalty) has the worst visual recovery. This trend is confirmed by the RMSEs in Table 8: the DFMR estimator has the smallest RMSE($B$), followed by the FMR and MLR estimators, with the LR estimator trailing behind. Taking the Cross shape as an example, the DFMR estimator reduces RMSE($B$) by 50% compared to the FMR estimator, by 46% compared to the MLR estimator and by 67% compared to the LR estimator. Clearly, the nuclear norm penalty is helpful in recovering the structure of the coefficient matrix $B$.

Bike sharing dataset
Bike sharing systems play an increasingly important role in urban traffic, due to their positive effects on energy consumption, environmental protection and public health. Currently, there are nearly 900 bike-sharing programs worldwide operating about 1,000,000 bicycles. Renting and returning bicycles has become quite convenient for users: through automated stations, users can rent a bike at one location and return it at another. Bike-sharing demand is highly dependent on weather conditions and social factors such as temperature, precipitation and holidays. The bike sharing dataset [13] consists of bike rental records from Capital Bikeshare, the metro Washington D.C. bike share system, for the two-year period from January 1st, 2011 to December 31st, 2012. The weather conditions, including weather type (sunny, mist or others), temperature, apparent temperature, humidity and wind speed, are measured every hour. We took these measurements as a $24\times 6$ matrix-valued predictor $X$. The vector-valued predictor $z$ contains indicators of month (January to December), day of the week (Sunday to Saturday), year (2011 or 2012) and holiday (work day or not). The response $y$ is the daily aggregated count of rented bikes. The DFMR, MLR and LR estimators were computed. Due to the Lasso and fused Lasso penalties on $\gamma$, the DFMR estimator shows that Mondays have the least demand for bike rental; the demand then increases on Tuesdays and Wednesdays, with Wednesdays reaching the same level as Sundays, increases further on Thursdays, reaches its peak on Fridays and Saturdays, and then falls on Sundays. The MLR estimator and the LR estimator show the same weekly pattern but without a sparse pattern. For example, the coefficients for Fridays and Saturdays are very similar under MLR and LR, while they are identical under DFMR.
Due to the fused Lasso penalty on $B$, the DFMR estimator reveals that the coefficients for temperature are very similar before 9am and after 7pm, indicating that the time of day has an effect on rental demand. Without the fused Lasso penalty, the MLR estimator of $B$ varies substantially and does not exhibit an obvious pattern. Without both the fused Lasso penalty and the nuclear norm penalty, the LR estimator of $B$ is zero. The prediction performance measured by the average RMSE($y$) was computed by 5-fold and 10-fold cross-validation (CV) with 100 random splits. The results are in Table 9; the numbers in parentheses are standard deviations of RMSE($y$). The DFMR estimator again has the smallest prediction error. Taking 5-fold CV as an example, it reduces RMSE($y$) by 36.5% and 28.1% compared to the MLR estimator and the LR estimator, respectively.

Diabetes dataset
The physical examination information for 2476 staff members at Beijing Jiaotong University was collected by the university clinic from 2016 to 2018. Out of the 2476 staff members, 237 were diagnosed with diabetes in 2018. During the physical exam each year, a total of 62 measurements are recorded for each patient, including blood sugar concentration, dietary preferences, concentration and volume of erythrocytes, leukocytes and platelets, and facial features, which yields a $62\times 3$ matrix. In addition, information on seven characteristics that are relatively stable over the three-year period, including gender, education, occupation and disability status, is available for each staff member. We took the characteristics as the vector-valued predictor and the physical exam results as the matrix-valued predictor. The response $y_i$ is a binary variable which takes value 1 if the patient has diabetes. The misclassification rates of the DFMLR, MLLR and LLR estimators were computed by 5-fold and 10-fold CV with 100 random splits. The results are included in Table 10; the numbers in parentheses are standard deviations. For both 5-fold and 10-fold CV, the DFMLR estimator reduces the misclassification rate by more than 50% compared to the LLR estimator or the MLLR estimator.

COVID-19 dataset
The COVID-19 dataset [47] consists of daily measurements related to COVID-19 for 138 countries around the world. We focus on the period from June 13, 2020 to July 12, 2020. The response $y$ is the total count of newly confirmed cases in this 30-day period for each country. The matrix predictor $X$ records the COVID-19 related government policies on each day. The policies include school closing, restrictions on gatherings, stay-at-home requirements, income support and so on. Each of these policies may have several levels; for example, school closing includes no closing, recommend closing, require some closing (e.g. just high school) or

Concluding remarks
In this paper, we propose a regularized method for linear regression and logistic regression which can incorporate high-dimensional matrix-valued and vector-valued predictors. The proposed method can be extended to models with tensor-valued predictors, which have many applications in neuroimaging and signal processing. In addition, the proposed method can be adapted to other generalized linear regression models such as Poisson regression, which is widely used in medical insurance, business statistics and geography [6].
Let p : R^n → (−∞, +∞] be a closed proper convex function. For a given ν > 0, the Moreau envelope function ψ_p(·) of p [32] is defined by

ψ_p(x) := min_{z ∈ R^n} { p(z) + (1/(2ν)) ‖z − x‖² },

and the corresponding minimizer is called the proximal mapping:

Prox_{νp}(x) := argmin_{z ∈ R^n} { p(z) + (1/(2ν)) ‖z − x‖² }.

Let p : R^n → (−∞, +∞] be a closed proper convex function; the Fenchel conjugate function of p is defined as

p*(x) := sup_{z ∈ R^n} { ⟨x, z⟩ − p(z) }.

Proposition A.1. [33] Let p : R^n → (−∞, +∞] be a closed proper convex function, and p*(x) be its Fenchel conjugate function. Then for any t > 0,

x = Prox_{tp}(x) + t Prox_{p*/t}(x/t).    (A.2)

The equality (A.2) is often referred to as the Moreau identity. If Ω = B_{‖·‖∞}(0; r), the proximal mapping of δ(x; Ω) is the componentwise projection Π(x; Ω) = min{max{x, −r}, r}. If p(z) = λ‖z‖₁, the proximal mapping of p is

Prox_p(x) = shrink(x, λ) := sign(x) · max{|x| − λ, 0},    (A.3)

which is called the soft-thresholding operator in [16].
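As a quick numerical illustration (not part of the original derivation), the soft-thresholding operator and the projection onto B_{‖·‖∞}(0; r) can be checked against the Moreau identity (A.2), since the Fenchel conjugate of λ‖·‖₁ is the indicator of B_{‖·‖∞}(0; λ); the vector x and the parameters t, λ below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def soft_threshold(x, lam):
    """Prox of p(z) = lam * ||z||_1: sign(x) * max(|x| - lam, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def project_inf_ball(x, r):
    """Prox of the indicator delta(.; B_inf(0; r)): componentwise clipping."""
    return np.clip(x, -r, r)

# Moreau identity: x = Prox_{t p}(x) + t Prox_{p*/t}(x/t), with
# p = lam*||.||_1 and p* the indicator of B_inf(0; lam).
x = rng.standard_normal(6)
t, lam = 2.0, 0.7
rhs = soft_threshold(x, t * lam) + t * project_inf_ball(x / t, lam)
print(np.allclose(x, rhs))  # True
```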
If p(z) = λ₁‖z‖₁ + λ₂‖A_p z‖₁, where λ₁, λ₂ ≥ 0 are given parameters and A_p is as in (A.4), then the proximal mapping of p is given in (A.5). If λ₁ = 0 in (A.5), we denote the proximal mapping of p(z) = λ₂‖A_p z‖₁ by z_{λ₂}(x).

Proposition A.2. [16] Let p(z) = λ₁‖z‖₁ + λ₂‖A_p z‖₁, where A_p ∈ R^{(p−1)×p} has the form in (A.4). Then

Prox_p(x) = shrink(z_{λ₂}(x), λ₁),

i.e., the proximal mapping of the double penalty is obtained by soft-thresholding the fused Lasso proximal mapping z_{λ₂}(x).

Now we present the proximal mapping for the matrix-form function. Let p(M) = ‖M‖_*. Then Prox_{νp}(M) has a closed-form solution, which is given by

Prox_{νp}(M) = U Diag(max{ς − ν, 0}) V^T,

where M = U Σ_D V^T is the singular value decomposition of M and ς is the vector containing the diagonal elements of Σ_D. The proof can be found in [3].
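A minimal sketch of the nuclear norm proximal mapping via singular value soft-thresholding, as stated above; the test matrix M and threshold ν = 0.5 are illustrative:

```python
import numpy as np

def prox_nuclear(M, nu):
    """Prox of nu * ||M||_*: soft-threshold the singular values of M."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - nu, 0.0)) @ Vt

rng = np.random.default_rng(1)
M = rng.standard_normal((5, 4))
P = prox_nuclear(M, 0.5)

# The singular values of the prox equal max(sigma_i - nu, 0).
s_M = np.linalg.svd(M, compute_uv=False)
s_P = np.linalg.svd(P, compute_uv=False)
print(np.allclose(s_P, np.maximum(s_M - 0.5, 0.0)))  # True
```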
Let ν > 0 and p(M) = δ(M; Ω_*) with Ω_* = B_{‖·‖₂}(0; λ). The proximal mapping of p also has a closed-form solution, which is

Prox_{νp}(M) = Π(M; Ω_*) = U Diag(min{ς, λ}) V^T,

where M = U Σ V^T is the singular value decomposition of M and ς contains the diagonal elements of Σ. In such special cases, the proximal mapping is a projection and plays an important role in solving the problem. Based on these proximal mappings, we derive an efficient sGS-ADMM algorithm to solve DFMR and DFMLR.
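The projection onto the spectral-norm ball Ω_* can likewise be sketched by clipping singular values; the test matrix and radius λ = 1 below are illustrative:

```python
import numpy as np

def project_spectral_ball(M, lam):
    """Projection onto {M : ||M||_2 <= lam}: clip the singular values at lam."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.minimum(s, lam)) @ Vt

rng = np.random.default_rng(2)
M = rng.standard_normal((4, 6))
P = project_spectral_ball(M, 1.0)
print(np.linalg.norm(P, 2) <= 1.0 + 1e-9)  # True
```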

A.2. An introduction to sGS-ADMM algorithm
We now give a brief introduction to the sGS technique and the sGS-ADMM algorithm for a general convex composite programming model, as discussed in [5]. Here sGS stands for the symmetric Gauss-Seidel method, an extension of the Gauss-Seidel (GS) method. The GS method has been applied to linear and nonlinear systems and to unconstrained and constrained optimization problems; see [19,37,46]. Chen et al. [5] designed the sGS method to solve convex composite conic programming. We now illustrate the GS and sGS methods with an example. Consider solving the linear system Ax = b, where A ∈ R^{n×n} is the coefficient matrix and b ∈ R^n is the right-hand side.
GS: update the elements of x successively in a fixed order, returning to the first element once the last one has been updated.
sGS: update the elements of x successively in a fixed order, and once the last one has been updated, sweep back through the elements in the reverse order before starting the next cycle.
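The two sweep orders can be sketched as follows. This toy solver for a small, strictly diagonally dominant system only illustrates the forward (GS) and forward-then-backward (sGS) update orders; it is not the sGS-ADMM algorithm of [5]:

```python
import numpy as np

def gs_sweep(A, b, x, order):
    """One Gauss-Seidel sweep over the indices in `order` (in place)."""
    for i in order:
        # x[i] = (b[i] - sum_{j != i} A[i, j] * x[j]) / A[i, i]
        x[i] = (b[i] - A[i] @ x + A[i, i] * x[i]) / A[i, i]
    return x

def solve(A, b, symmetric=False, iters=200):
    n = len(b)
    x = np.zeros(n)
    for _ in range(iters):
        x = gs_sweep(A, b, x, range(n))                      # forward sweep
        if symmetric:
            x = gs_sweep(A, b, x, range(n - 1, -1, -1))      # backward sweep (sGS)
    return x

A = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 5.0]])
b = np.array([1.0, 2.0, 3.0])
x_gs = solve(A, b)
x_sgs = solve(A, b, symmetric=True)
print(np.allclose(A @ x_gs, b), np.allclose(A @ x_sgs, b))  # True True
```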
Let m and n be two nonnegative integers, and let X, Y_i (1 ≤ i ≤ m) and Z_j (1 ≤ j ≤ n) be finite-dimensional Euclidean spaces. Define Y := Y_1 × ··· × Y_m and Z := Z_1 × ··· × Z_n. Consider the following general convex composite programming model:

min { p_1(y_1) + f(y) + q_1(z_1) + g(z) | A* y + B* z = c },    (A.8)

where p_1 : Y_1 → (−∞, +∞] and q_1 : Z_1 → (−∞, +∞] are two closed proper convex functions, and f : Y → (−∞, +∞) and g : Z → (−∞, +∞) are continuously differentiable convex functions whose gradients are Lipschitz continuous. The linear mappings A* : Y → X and B* : Z → X are given by A* y = Σ_{i=1}^{m} A_i* y_i for y = (y_1, ..., y_m) ∈ Y and B* z = Σ_{j=1}^{n} B_j* z_j for z = (z_1, ..., z_n) ∈ Z, where A_i* : Y_i → X and B_j* : Z_j → X are the adjoints of the linear mappings A_i : X → Y_i and B_j : X → Z_j, respectively.
The augmented Lagrangian function of problem (A.8) is defined as

L_σ(y, z; x) := p_1(y_1) + f(y) + q_1(z_1) + g(z) + ⟨x, A* y + B* z − c⟩ + (σ/2) ‖A* y + B* z − c‖²,

where σ > 0 is the penalty parameter and x ∈ X is the Lagrange multiplier.

A.3. Iterative scheme of sGS-ADMM algorithm for DFMR
Each of the subproblems in Table 11 has a closed-form solution, obtained either by setting the derivative of the augmented Lagrangian function to zero or from the properties of the proximal mapping. Let us now examine the resulting subproblems in Table 11 one by one. The u-subproblem is quadratic in u and has a unique closed-form solution. Similarly, the v-subproblem has a unique closed-form solution. For the D-subproblem, the closed-form solution is D^{k+1} = U Diag(e*) V^T, where E = U Σ V^T is the singular value decomposition of E, the vector e contains the diagonal elements of Σ, and e* = Π(e; B_{‖·‖∞}(0; λ₁)).
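Since the matrix I + σXX^T + σZZ^T appearing in the quadratic u-subproblem is positive definite and does not change across iterations, its Cholesky factor can be computed once and reused in every u-update. The sketch below is only illustrative: the data matrices are random stand-ins for the stacked predictors, and `rhs` is a placeholder for the right-hand side collected from the augmented Lagrangian terms:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d1, d2 = 8, 5, 4
X = rng.standard_normal((n, d1))   # stand-in for the matrix-predictor design
Z = rng.standard_normal((n, d2))   # stand-in for the vector-predictor design
sigma = 1.0

# Factor I + sigma*X X^T + sigma*Z Z^T once; reuse the factor each iteration.
M = np.eye(n) + sigma * X @ X.T + sigma * Z @ Z.T
L = np.linalg.cholesky(M)          # lower-triangular factor, L L^T = M

rhs = rng.standard_normal(n)       # placeholder right-hand side for one iteration
y = np.linalg.solve(L, rhs)        # forward substitution
u = np.linalg.solve(L.T, y)        # backward substitution
print(np.allclose(M @ u, rhs))  # True
```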
For the t-subproblem, by (A.3) the closed-form solution is obtained from the soft-thresholding operator. For the w-subproblem, applying the Moreau identity (A.2) yields the closed-form solution for Prox_{σP}(σZ^T u^{k+1/2} − x_2^k). Finally, the stopping criterion η for the DFMR estimator is derived from the KKT conditions.
The maximum number of iterations is set to 5000.

A.4. Iterative scheme of sGS-ADMM algorithm for DFMLR
The iterative scheme of sGS-ADMM algorithm for solving (2.8) is summarized in Table 12.
The D-, w- and t-subproblems have the same solutions as in the DFMR case. Now we give the closed-form solutions of the u-subproblem and the v-subproblem. The solution of the u-subproblem is
The maximum number of iterations is set to 5000.

A.5. Proofs
Proof of Theorem 2.1. The global convergence of the sGS-ADMM algorithm was established in [5]. Problem (2.5) in our paper is a convex optimization problem with favorable structure: the matrices I + σXX^T + σZZ^T and I + CC^T are positive definite. Hence the conditions for global convergence of the sGS-ADMM algorithm applied to problem (2.5) are easily verified. To prove Theorem 2.2, we establish two lemmas, which give an upper bound on the KKT residual at the iterates and bound the distance between the iterates and the optimal solution, respectively. We first recall the definition of metric subregularity from [7]. Let F : X → Y be a multi-valued mapping and denote its inverse by F^{-1}. The graph of the multi-valued mapping F is defined as graph F := {(x, y) ∈ X × Y | y ∈ F(x)}.
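For completeness, the definition can be stated in its standard form (following [7]): F is said to be metrically subregular at x̄ for ȳ ∈ F(x̄) if there exist a constant κ ≥ 0 and a neighborhood U of x̄ such that

```latex
\operatorname{dist}\bigl(x,\, F^{-1}(\bar{y})\bigr)
  \;\le\; \kappa \,\operatorname{dist}\bigl(\bar{y},\, F(x)\bigr)
  \quad \text{for all } x \in U .
```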
Lemma A.5. Let {θ^k := (u^k, v^k, α^k, x^k)} be the sequence generated by the sGS-ADMM algorithm. Then (A.9) holds for any k ≥ 1. Proof. We write down the optimality condition for each subproblem in Table 11. From the definition of R(·), we obtain the corresponding bounds. By the update schemes of the Lagrange multipliers and the definition of Λ^{k+1}, we obtain the remaining estimates. Combining these, (A.9) follows immediately.
Combining the above, let ϕ(B*, γ*) = e^{⟨X_i, B*⟩ + ⟨z_i, γ*⟩} / (1 + e^{⟨X_i, B*⟩ + ⟨z_i, γ*⟩}) − y_i, which appears in the first-order gradient of the loss function. Noting that |ϕ(B*, γ*)| ≤ 1, we obtain from (A.26) a bound on ‖B − B*‖_F² + ‖γ − γ*‖². We have thus proved that ‖B − B*‖_F² + ‖γ − γ*‖² is bounded above. Next we prove that the estimation error tends to zero. By (A.24), Condition 7 shows that vec(X_i) and z_i follow sub-Gaussian distributions; hence vec(X_i)vec(X_i)^T and z_i z_i^T follow sub-exponential distributions. Applying the Bernstein inequality for every t ≥ 0 and taking a suitable t, we have, with probability at least 1 − c₁e^{−c₂mq} − c₃e^{−c₄p}, the desired bound. Further, under Condition 8 and the assumption that max{mq, p}/n → 0 as n → ∞, it follows that ‖B − B*‖_F² + ‖γ − γ*‖₂² converges in probability to zero.