CDPA: Common and Distinctive Pattern Analysis between High-dimensional Datasets

A representative model in integrative analysis of two high-dimensional correlated datasets is to decompose each data matrix into a low-rank common matrix generated by latent factors shared across datasets, a low-rank distinctive matrix corresponding to each dataset, and an additive noise matrix. Existing decomposition methods claim that their common matrices capture the common pattern of the two datasets. However, their so-called common pattern only denotes the common latent factors but ignores the common pattern between the two coefficient matrices of these common latent factors. We propose a new unsupervised learning method, called the common and distinctive pattern analysis (CDPA), which appropriately defines the two types of data patterns by further incorporating the common and distinctive patterns of the coefficient matrices. A consistent estimation approach is developed for high-dimensional settings, and shows reasonably good finite-sample performance in simulations. Our simulation studies and real data analysis corroborate that the proposed CDPA can provide better characterization of common and distinctive patterns and thereby benefit data mining.


Introduction
Modern biomedical studies often collect multiple types of large-scale datasets on a common set of objects [8,23]. For example, The Cancer Genome Atlas (TCGA) [18] collected multi-platform genomic data, such as mRNA expression and DNA methylation, for tumor samples; the Human Connectome Project (HCP) [52] acquired multi-modal brain imaging data, including structural MRI and functional MRI, from healthy adults. The use of multiple data types allows us to enhance our understanding of the mechanisms underlying complex diseases like cancers [26,5] and neurodegenerative diseases [55,43], and to improve performance in various learning tasks such as clustering and classification [51,46].
The most straightforward approach to the integrative analysis of multi-type datasets is to concatenate all their data matrices into one matrix and then apply standard data analysis tools. One such example is the simultaneous component analysis (SCA) [47], which applies principal component analysis (PCA) to the concatenated data matrix and thus is also known as SUM-PCA. These methods are simple to implement, but they are unable to explore or interpret the relationships among datasets. As pioneering efforts to overcome this drawback, the canonical correlation analysis (CCA) [20,33] and its various generalizations [6,24,49] measure the correlations and extract the most correlated components among datasets. However, the CCA methods only account for correlated features and fail to reveal a more detailed relationship in terms of the common and distinctive patterns across datasets.
A family of data integration methods has emerged recently to identify and separate the common and distinctive variations across datasets, including orthogonal n-block partial least squares (OnPLS) [31], distinctive and common components with SCA (DISCO-SCA) [44], common orthogonal basis extraction (COBE) [58], joint and individual variation explained (JIVE) [30], angle-based JIVE (AJIVE) [13], and decomposition-based CCA (D-CCA) [45]. Consider the case with two datasets. All these methods decompose each data matrix into a low-rank common matrix generated by latent factors shared across datasets,¹ a low-rank distinctive matrix corresponding to each dataset, and an additive noise matrix. Despite different constraints in the decomposition, these methods equate the common pattern of the two datasets with the common latent factors, but ignore the common pattern between the two coefficient matrices of these common latent factors. It may be more appropriate to call their "common" and "distinctive" matrices common-source and distinctive-source matrices.
We propose a new unsupervised learning method, called the common and distinctive pattern analysis (CDPA), to improve the delineation of the common and distinctive patterns between two correlated datasets. The CDPA method defines the common pattern by incorporating both the common latent factors and the common pattern of their coefficient matrices, and determines each distinctive pattern as the residual part of the corresponding signal dataset. In factor analysis [17], a coefficient matrix of latent factors is called a factor pattern matrix, containing the factor loadings (i.e., coefficients) that represent the contributions of the latent factors to the signal data. A coefficient matrix is also known as a mixing channel in signal processing [39,41], which introduces correlations into the uncorrelated source variables to generate the output data. Hence, the two coefficient matrices of the common latent factors for two correlated datasets may contain common and distinctive patterns of the ways in which these common latent factors form their corresponding common-source matrices. Such common and distinctive patterns in the two coefficient matrices are also important and should be separated into the common and distinctive patterns of the two datasets. Our common-pattern and distinctive-pattern matrices, together with the aforementioned common-source and distinctive-source matrices, constitute a more comprehensive picture of the relationship between two datasets.
Three challenging issues arise in the construction and estimation of common-pattern and distinctive-pattern matrices: (i) there exists a row matching problem for the two coefficient matrices, or equivalently a variable pairing problem for the two datasets, if the rows of either observed data matrix can be ordered arbitrarily and independently of the other matrix; (ii) the common pattern of the two coefficient matrices must be identified; (iii) recovering the high-dimensional common-pattern and distinctive-pattern matrices confronts the curse of dimensionality, where the unknown large covariance matrices may not be consistently estimated by the traditional sample covariance matrices [56]. We convert the row matching problem (i) into the classic graph matching problem [32]. We extract the common pattern in (ii) by an extended analogue of the state-of-the-art D-CCA. To address challenge (iii), we develop consistent estimators of the proposed common-pattern and distinctive-pattern matrices under the high-dimensional spiked covariance model [12,53,45], which has been widely used in various fields such as signal processing [36], machine learning [21], and economics [7].
The rest of this article is organized as follows. Section 2 introduces the CCA and D-CCA methods as preliminaries. Our CDPA method and its consistent estimation are established in Section 3. Section 4 examines the finite-sample performance of the proposed estimators via simulations. We also compare CDPA with six D-CCA-type methods through simulated data in Section 4 and through two real-data examples from the HCP and TCGA in Section 5. Section 6 concludes with a discussion. All theoretical proofs and additional simulation and real-data results are provided in the Appendices. A Python package for the proposed CDPA method is available at https://github.com/shu-hai/CDPA.

Preliminaries
Let Y_k ∈ R^{p_k × n} for k ∈ {1, 2} be the k-th dataset obtained on a common set of n objects, where p_k is the number of variables. The decomposition model considered in the aforementioned existing methods (e.g., D-CCA) is

Y_k = X_k + E_k = C_k + D_k + E_k,   k = 1, 2,   (1)

for which the n columns of each matrix are assumed to be independent and identically distributed (i.i.d.) copies of the corresponding mean-zero random vector in

y_k = x_k + e_k = c_k + d_k + e_k,   k = 1, 2,

where {X_k, x_k}_{k=1}^2 and {E_k, e_k}_{k=1}^2 are signals and noises, respectively, {C_k}_{k=1}^2 and {c_k}_{k=1}^2 are the common-source matrices and random vectors generated from the common latent factors of the two datasets, and D_k and d_k are the distinctive-source matrix and random vector from the distinctive latent factors of the k-th dataset. Write each k-th common-source random vector as c_k = B_k (c_1, ..., c_{r_12})^T, where c_1, ..., c_{r_12} are the common latent factors and B_k is their coefficient matrix. The common pattern of B_1 and B_2 is not considered by the existing methods, which motivates our current research.
We start with the signal vectors {x_k}_{k=1}^2 for simplicity and introduce the CCA and D-CCA methods in the following two subsections. The signal estimation is deferred to Section 3.3.
Notation. For any matrix M = (M_ij)_{1≤i≤p,1≤j≤n} ∈ R^{p×n}, denote by σ_ℓ(M) and λ_ℓ(M) the ℓ-th largest singular value and the ℓ-th largest eigenvalue (if p = n), respectively, the spectral norm ‖M‖_2 = σ_1(M), and the Frobenius norm ‖M‖_F = (Σ_{i,j} M_ij^2)^{1/2}. Denote the angle between two elements v_1 and v_2 in an inner product space by θ(v_1, v_2). Let (L_0^2, cov) be the inner product space composed of all real-valued random variables with zero mean and finite variance, endowed with the covariance operator as the inner product. Let both x := y and y =: x denote that x is defined by y. Write Σ_k = cov(x_k) and Σ_12 = cov(x_1, x_2), with r_k = rank(Σ_k), r_min = r_1 ∧ r_2, r_max = r_1 ∨ r_2, and r_12 = rank(Σ_12). Throughout the paper, our asymptotic arguments are by default under n → ∞.
A method similar to CCA is the principal angle analysis (PAA) [3], which investigates the closeness of any two subspaces, denoted by F and G, in the Euclidean dot product space (R^p, ·). For 1 ≤ ℓ ≤ q := min{dim(F), dim(G)}, the ℓ-th principal angle θ_ℓ ∈ [0, π/2] between F and G is defined by cos θ_ℓ = max u^T v over unit vectors u ∈ F and v ∈ G that are orthogonal to the earlier principal vectors {u_j, v_j}_{j<ℓ}; the maximizing vectors {u_ℓ, v_ℓ} are called the ℓ-th pair of principal vectors of F and G. Let Q_F and Q_G be matrices whose columns form orthonormal bases of F and G, respectively. The principal angles and principal vectors can be obtained from the SVD of Q_F^T Q_G:

Q_F^T Q_G = U Λ V^T,   cos θ_ℓ = σ_ℓ(Q_F^T Q_G),   u_ℓ = Q_F U_{:,ℓ},   v_ℓ = Q_G V_{:,ℓ},   (6)

where Λ has nonincreasing diagonal elements. The PAA and CCA methods are essentially the same except for their respective inner product spaces (R^p, ·) and (L_0^2, cov). The principal vectors and the cosines of principal angles in PAA correspond to the canonical variables and the canonical correlations in CCA. The cosines of principal angles are also called canonical correlations in PAA [3]. Similar to (4), the bi-orthogonality between different pairs of principal vectors also holds.
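As a concrete illustration of (6), the minimal NumPy sketch below computes principal angles and principal vectors of two subspaces given by arbitrary spanning matrices; the function and variable names are illustrative and not from the paper.

```python
import numpy as np

def principal_angles(A, B):
    """Principal angles/vectors between colsp(A) and colsp(B) via the SVD in (6)."""
    QF, _ = np.linalg.qr(A)                 # orthonormal basis of colsp(A)
    QG, _ = np.linalg.qr(B)                 # orthonormal basis of colsp(B)
    U, s, Vt = np.linalg.svd(QF.T @ QG)
    s = np.clip(s, 0.0, 1.0)                # cosines of the principal angles
    theta = np.arccos(s)                    # principal angles in [0, pi/2]
    return theta, QF @ U, QG @ Vt.T         # angles, principal vectors of F and G

# toy example: two 3-dimensional subspaces of R^10
rng = np.random.default_rng(0)
theta, UF, VG = principal_angles(rng.normal(size=(10, 3)), rng.normal(size=(10, 3)))
```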

Decomposition-based canonical correlation analysis
For the random vectors {x_k}_{k=1}^2, the D-CCA method [45] aims to decompose each x_k into a common-source vector c_k and a distinctive-source vector d_k,

x_k = c_k + d_k,   k = 1, 2,   (7)

subject to three desirable constraints in (L_0^2, cov). To this end, guided by the bi-orthogonality (4) of the augmented canonical variables z_k = [z_{kℓ}]_{ℓ=1}^{r_k}, k = 1, 2, D-CCA divides the decomposition problem (7) of span([x_1; x_2]) into r_max subproblems, each within one of the mutually orthogonal subspaces span({z_{1ℓ}, z_{2ℓ}}), ℓ = 1, ..., r_max, where z_{kℓ} = 0 for ℓ > r_k and c_ℓ = 0 for ℓ > r_min. For ℓ ≤ r_min, the common variable c_ℓ is defined by (10) subject to (11) and (12), such that c_ℓ increases as θ_{zℓ} := θ(z_{1ℓ}, z_{2ℓ}) decreases on [0, π/2].
Constraint (12) equivalently says that c_ℓ indicates the correlation strength of z_{1ℓ} and z_{2ℓ}. The unique solution of (10) subject to (11) and (12) is given in (13). Combining the solutions of the subproblems yields the D-CCA decomposition

x_k = c_k + d_k   with   c_k = Σ_{ℓ=1}^{r_12} β_{kℓ} c_ℓ,   (14)

for k = 1, 2, where β_{kℓ} = cov(x_k, z_{kℓ}). Here, {c_ℓ}_{ℓ=1}^{r_12} are the common latent factors of x_1 and x_2, and {d_{kℓ}}_{ℓ=1}^{r_k} are the distinctive latent factors of x_k. Figure 1(b) shows the decomposition structure of D-CCA.

Common and distinctive pattern analysis
The CDPA method aims to more comprehensively define the common and distinctive patterns of two datasets by incorporating the common and distinctive patterns of the two coefficient matrices of the common latent factors. We use a graph matching approach to match the unpaired rows between the coefficient matrices. Consistent estimators are established for the CDPA-defined common-pattern and distinctive-pattern matrices.

Common and distinctive patterns
As shown in Figure 1(b), D-CCA focuses only on the common latent factors {c_ℓ}_{ℓ=1}^{r_12} and ignores the common pattern of their coefficient matrices B_k = (β_{k1}, ..., β_{k r_12}) for k = 1, 2. So do the five other methods mentioned in Section 1. In factor analysis [17], the coefficient matrix B_k of latent factors is called their factor pattern matrix, and its (i, ℓ)-th entry is the factor loading of c_ℓ for the i-th variable of c_k. In signal processing, B_k is called a mixing channel [39,41], which introduces correlations into the uncorrelated input sources {c_ℓ}_{ℓ=1}^{r_12} to generate the output signal c_k with cov(c_k) = B_k diag(var(c_1), ..., var(c_{r_12})) B_k^T. Thus, B_1 and B_2 may possess common and distinctive patterns of the respective ways in which the common latent factors {c_ℓ}_{ℓ=1}^{r_12} constitute c_1 and c_2. In CDPA, we define a common-pattern vector c for {x_k}_{k=1}^2 which takes into account both the common latent sources {c_ℓ}_{ℓ=1}^{r_12} and the common pattern of their mixing channels {B_k}_{k=1}^2. The distinctive-pattern vector of signal x_k is then defined as the residual part of the signal after removing c.
In the representation c_k = Σ_{ℓ=1}^{r_12} β_{kℓ} c_ℓ, the ℓ-th column β_{kℓ} of the mixing channel B_k is the sub-channel transmitting c_ℓ, and the linear mixture of the sub-channel outputs {β_{kℓ} c_ℓ}_{ℓ=1}^{r_12} reflects the "mixing" performance of the channel B_k. We disentangle the common and distinctive latent structures of the two sub-channel spaces {colsp(B_k)}_{k=1}^2 in a similar way as D-CCA does for the two signal spaces {span(x_k)}_{k=1}^2. Two issues need to be resolved before the analysis. First, the sub-channel vectors {β_{kℓ}}_{k≤2, ℓ≤r_12} may have unequal lengths p_1 and p_2. Without loss of generality, we let p_1 ≥ p_2 throughout the paper. When p_1 > p_2, we zero-pad B_2 to be the p_1 × r_12 matrix B_2A = [B_2; 0_{(p_1−p_2)×r_12}]. This zero padding is equivalent to adding (p_1 − p_2) zeros into x_2; in other words, we are now equivalently studying the patterns between x_1 and x_2A = [x_2; 0_{(p_1−p_2)×1}]. Second, sometimes the rows of B_1 and B_2A, or equivalently the entries of x_1 and x_2A, are not one-to-one matched due to their arbitrary ordering. For this scenario, we match their rows by permuting the rows of B_2A with a permutation matrix P. The permutation can be chosen so that colsp(B_1) and colsp(PB_2A) are closest to each other by maximizing Σ_{ℓ=1}^{r_12} cos^2 θ_{Bℓ}, where θ_{Bℓ} is their ℓ-th principal angle. This row-matching procedure is discussed in detail in Section 3.2. To allow the generalization of our results to other row-matching criteria, we assume that the permutation matrix P is prespecified in the following text. For notational simplicity, we use the superscript "◦" to indicate applying the zero padding and P to a given vector or matrix when necessary; for example, B_2^◦ = PB_2A and x_2^◦ = Px_2A. We now consider the latent structure of the two sub-channel spaces {colsp(B_k^◦)}_{k=1}^2 by using an analogy of D-CCA in (R^{p_1}, ·). Let (v_{B_1ℓ}, v_{B_2ℓ}) denote the ℓ-th pair of principal vectors of colsp(B_1^◦) and colsp(B_2^◦). We define the common and distinctive components of {v_{B_1ℓ}, v_{B_2ℓ}} using a decomposition similar to that in (9) and (13), for k = 1, 2 and ℓ = 1, ..., r_12. Because the principal vectors (v_{B_k1}, ..., v_{B_k r_12}) =: V_{B_k} form an orthonormal basis of colsp(B_k^◦), the mixing-channel matrix can be written as B_k^◦ = V_{B_k} V_{B_k}^T B_k^◦ in (16). The part of x_k^◦ that contains the common latent factors (source variables) {c_ℓ}_{ℓ=1}^{r_12} and the common mixing-channel basis is the vector c_k^* given in (17). The difference between c_1^* and c_2^* lies in the matrices S_k := V_{B_1}^T B_k^◦, k = 1, 2, in the middle of their formulas, which contain the dual weights of the common basis and the mixing channels. We define the common part S of the two dual weight matrices in (18). To avoid overweighting a dataset when the signals x_1 and x_2 have different scales, we weight S_k by the scale factor [tr(Σ_k)]^{−1/2} in (18). This is equivalent to rescaling each x_k by the factor [tr(Σ_k)]^{−1/2} at the very beginning, as in [30]. Our definition of S in (18) is motivated by the consensus configuration in generalized Procrustes analysis [15,16], which minimizes the sum of squared Euclidean distances to the transformed configurations of interest (i.e., the scaled {S_k}_{k=1}^2 in our case). This minimization is equivalent to that of the sum of Kullback-Leibler divergences [35, Lemma 17.4.3] and yields a closed-form solution.
We combine the three types of common parts, namely the common latent factors {c_ℓ}_{ℓ=1}^{r_12}, the common mixing-channel basis, and the common dual weight matrix S, to define the common-pattern vector c of {x_k}_{k=1}^2 in (19), where S is defined as the common pattern of B_1 and B_2. For each individual unscaled signal vector x_k, we rescale c to be c^{(k)} = [tr(Σ_k)]^{1/2} c and express the CDPA decomposition as

x_k = c^{(k)} + h_k + d_k = c^{(k)} + δ_k,   k = 1, 2,

where h_k := c_k − c^{(k)} and δ_k := h_k + d_k. For signal vector x_k, the vector h_k represents the distinctive pattern retained within the common-source vector c_k, and the vector δ_k characterizes the total distinctive pattern by incorporating both h_k and the distinctive-source vector d_k. The matrices C, {C^{(k)}}_{k=1}^2, and {Δ_k}_{k=1}^2, whose n columns are the i.i.d. copies of the common-pattern vector c given in (19), the scaled common-pattern vectors c^{(k)}, and the distinctive-pattern vectors δ_k, are called the common-pattern, the scaled common-pattern, and the distinctive-pattern matrices of {X_k}_{k=1}^2, respectively. The population CDPA decomposition is summarized in Algorithm 1, and its uniqueness is given in Theorem 1.

Theorem 1. Given any p_1 × p_1 permutation matrix P, the common-pattern vector c defined in (19) for (x_1, x_2) is unique.

Remark 2. The common-pattern vector c differs only in its sign for
We assume that the sign of each individual entry in y_k or x_k cannot be arbitrarily changed, but that the sign of y_k as a whole, or equivalently that of x_k, may change. This assumption is generally true if each dataset represents a single data type. For example, let y_2 be mRNA expression data with its entry y_2^{[i]} measuring the mRNA expression level of the i-th gene; arbitrary entry-wise sign changes would amount to two different measurements being applied within y_2. Regarding the different c's due to the sign change (if allowed) of the entire y_2 or x_2, we suggest choosing the one with larger variance tr{cov(c)} or, in practice, larger (1/n)‖Ĉ‖_F^2 = tr((1/n) Ĉ Ĉ^T), where Ĉ is the estimate of C introduced in Section 3.3. It will be shown in Theorem 2 that (1/n)‖Ĉ‖_F^2 →_P tr{cov(c)} under mild conditions. A confidence interval (CI) for (1/n)‖Ĉ‖_F^2 can be constructed by bootstrapping samples [11] once the ranks {r_1, r_2, r_12} and the permutation matrix P are determined, as sketched below.
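A minimal sketch of such a percentile bootstrap, assuming a hypothetical helper `estimate_C(Y1, Y2)` that returns the CDPA common-pattern matrix estimate Ĉ with the ranks and P held fixed (this helper is not part of the paper's package API):

```python
import numpy as np

def bootstrap_ci_common_variance(Y1, Y2, estimate_C, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap CI for (1/n)||C_hat||_F^2 by resampling objects (columns)."""
    rng = np.random.default_rng(seed)
    n = Y1.shape[1]
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)             # sample objects with replacement
        C_hat = estimate_C(Y1[:, idx], Y2[:, idx])   # re-estimate C on the bootstrap sample
        stats.append(np.sum(C_hat ** 2) / n)         # (1/n) * squared Frobenius norm
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])
```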

Row matching of coefficient matrices
When the rows of the coefficient matrices B_1 and B_2A are not one-to-one matched, we match them by permuting the rows of B_2A with the permutation matrix

P* ∈ arg max_{P ∈ Π_{p_1}} Σ_{ℓ=1}^{r_12} cos^2 θ_{Bℓ},   (21)

where θ_{Bℓ} is the ℓ-th principal angle of colsp(B_1) and colsp(PB_2A), and Π_{p_1} is the set of all p_1 × p_1 permutation matrices. This optimization is equivalent to minimizing the Frobenius distance (2 Σ_{ℓ=1}^{r_12} sin^2 θ_{Bℓ})^{1/2} between the two subspaces. Commonly used distances between vector spaces [9] also include the geodesic distance, the Asimov distance θ_{B1}, and the gap distance sin θ_{B1}. These alternative distances appear more difficult to minimize, but our criterion based on the Frobenius distance can be converted to the well-known graph matching problem [32].
Specifically, by the equations in (6), the optimization problem in (21) is equivalent to

max_{P ∈ Π_{p_1}} tr(Q_1^T P Q_2A (Q_1^T P Q_2A)^T) = max_{P ∈ Π_{p_1}} tr(Q_1 Q_1^T P Q_2A Q_2A^T P^T),   (22)

where Q_k ∈ R^{p_k × r_12} is a matrix whose columns form an orthonormal basis of colsp(B_k), which can be taken to be the r_12 left singular vectors of B_k, and Q_2A is the zero-padded version of Q_2. The last objective function is formula (4) of [32] for the graph matching problem. Graph matching is known to be NP-hard to solve exactly. We use the doubly stochastic projected fixed-point (DSPFP) algorithm of [32] to obtain an efficient approximation of P*, which has time complexity of only O(p_1^3) per iteration and space complexity O(p_1^2). For ultra-large p_1, one may further apply the approximation procedure of [37], which employs a clustering method before DSPFP.
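As an illustration of (22), the sketch below solves the row-matching problem approximately with SciPy's FAQ-based quadratic assignment solver instead of the DSPFP algorithm used in the paper; the function and variable names are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import quadratic_assignment

def match_rows(Q1, Q2A):
    """Approximate P* in (22): maximize tr(A P B P^T) with A = Q1 Q1^T, B = Q2A Q2A^T."""
    A = Q1 @ Q1.T                      # p1 x p1
    B = Q2A @ Q2A.T                    # p1 x p1 (Q2A is zero-padded to p1 rows)
    res = quadratic_assignment(A, B, method="faq", options={"maximize": True})
    P = np.zeros_like(A)
    P[np.arange(len(res.col_ind)), res.col_ind] = 1.0   # permutation matrix
    return P

# usage: B2A_matched = match_rows(Q1, Q2A) @ B2A
```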

Estimation
Often in practice, the data matrices {Y_k}_{k=1}^2 are high-dimensional and are the only observable data in decomposition (1). The literature on (1) regularly assumes high-dimensional {Y_k}_{k=1}^2 to be "low-rank plus noise". Indeed, big data matrices are often approximately low-rank in many real-world applications [50], so their low-rank approximations provide feasible or more efficient computation while preserving the major information [25]. Moreover, the low-rank plus noise structure can circumvent the curse of dimensionality [56,27] in recovering the common-source and distinctive-source matrices {C_k, D_k}_{k=1}^2 from which our common-pattern and distinctive-pattern matrices are derived. Following the D-CCA paper [45], we consider the low-rank plus noise structure

Y_k = X_k + E_k = B_{f_k} F_k + E_k,   y_k = x_k + e_k = B_{f_k} f_k + e_k,   k = 1, 2,   (24)-(25)

where B_{f_k} ∈ R^{p_k × r_k} is a real deterministic matrix, the columns of F_k and E_k are respectively the n i.i.d. copies of the mean-zero random vectors f_k and e_k, the columns of F = [F_1; F_2] are also statistically independent, and the vector f_k ∈ R^{r_k} contains r_k latent factors such that cov(f_k) = I_{r_k × r_k}, cov(f_k, e_k) = 0_{r_k × p_k}, and span(f_k) is a fixed subspace of (L_0^2, cov) that does not depend on {n, p_1, p_2}. Hence, r_1, r_2 and r_12 are fixed numbers. We can choose f_k to be the augmented canonical variables z_k. The covariance matrix cov(y_k) = B_{f_k} B_{f_k}^T + cov(e_k) is assumed to be a spiked covariance matrix whose largest r_k eigenvalues are significantly larger than the rest; that is, signals are distinguishably stronger than noises.
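For intuition, a toy data-generating sketch under the low-rank plus noise structure is given below; the dimensions, scales, and the way a shared factor is injected are illustrative assumptions, not the paper's simulation setups.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p1, p2, r1, r2 = 300, 100, 80, 3, 2

# Deterministic loading matrices, scaled up so the top r_k eigenvalues of
# cov(y_k) dominate the noise (a spiked covariance structure).
Bf1 = 5.0 * rng.normal(size=(p1, r1))
Bf2 = 5.0 * rng.normal(size=(p2, r2))

# Latent factors with cov(f_k) = I; one factor is shared to correlate the signals.
shared = rng.normal(size=(1, n))
F1 = np.vstack([shared, rng.normal(size=(r1 - 1, n))])
F2 = np.vstack([shared, rng.normal(size=(r2 - 1, n))])

E1, E2 = rng.normal(size=(p1, n)), rng.normal(size=(p2, n))   # noise matrices

Y1 = Bf1 @ F1 + E1        # Y_k = X_k + E_k with X_k = B_fk F_k
Y2 = Bf2 @ F2 + E2
```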
Before recovering our common-pattern and distinctive-pattern matrices, we introduce D-CCA's estimators of X_k and C_k. For simplicity, we write all estimators with the true matrix ranks {r_1, r_2, r_12}. In practice, as implemented in D-CCA, the ranks {r_k}_{k=1}^2 and r_12 can be well selected by the edge distribution (ED) method of [38] and the minimum description length information-theoretic criterion (MDL-IC) of [48], respectively; see Appendix A2. The estimator X̂_k of X_k is defined by applying the soft-thresholding method of [53] to the singular values of Y_k, as given in (26). From X̂_k, define the estimator of Σ_k by Σ̂_k = (1/n) X̂_k X̂_k^T, and denote its SVD by Σ̂_k = V̂_k Λ̂_k V̂_k^T. Following the D-CCA solution, the estimators of C_k and D_k are constructed from {X̂_k, Σ̂_k}_{k=1}^2 and shrinkage coefficients â_ℓ computed from the singular values σ_ℓ(Θ̂) of the estimated canonical-correlation matrix, with â_ℓ = 0 for ℓ > r̂_θ; here Ĉ_0 denotes the estimated sample matrix of (c_1, ..., c_{r_12})^T.
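A minimal sketch of the singular-value soft-thresholding step, with a user-supplied threshold `tau` standing in for the data-driven threshold of [53] used in (26), and reusing the toy Y1 from the sketch above:

```python
import numpy as np

def soft_threshold_signal(Y, tau):
    """Estimate X_k by shrinking the singular values of Y_k: (sigma_l - tau)_+."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

X1_hat = soft_threshold_signal(Y1, tau=np.sqrt(max(Y1.shape)))   # illustrative threshold
Sigma1_hat = X1_hat @ X1_hat.T / Y1.shape[1]                     # estimate of Sigma_1
```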
We now derive the estimators of our common-pattern and distinctive-pattern matrices. Recall that we assume the permutation matrix P is prespecified. If row matching of B_1 and B_2A is necessary, one may choose P to be the matrix P* in the NP-hard problem (22), approximated by the DSPFP method with data samples. Note that P*, as a permutation matrix, is either obtained exactly or approximated with at least two wrong entries. To ease the theoretical analysis, we assume that P is well determined, without such misspecification. Let B̂_k be the estimated coefficient matrix of the common latent factors, Q̂_k a matrix whose columns form an orthonormal basis of colsp(B̂_k) (with zero-padded version Q̂_2A for k = 2), and Θ̂_B = Q̂_1^T P Q̂_2A. Write the full SVD of Θ̂_B as Θ̂_B = Û_B1 Λ̂_B Û_B2^T, where Λ̂_B has nonincreasing diagonal elements, and define V̂_B1 = Q̂_1 Û_B1 and V̂_B2 = P Q̂_2A Û_B2. It follows from (6) that the ℓ-th columns of V̂_B1 and V̂_B2 form the ℓ-th pair of principal vectors of colsp(B̂_1) and colsp(P B̂_2A). In analogy to (19), we define the estimator Ĉ of C in (27), with the scale factors [tr(Σ̂_k)]^{−1/2} replacing [tr(Σ_k)]^{−1/2}. Given {r_1, r_2, r_12, P}, the computational complexity of obtaining Ĉ and Ĉ^{(k)} is O(np_1^2 ∧ n^2 p_1), mainly due to the SVDs of {Y_k}_{k=1}^2. We define the estimators Ĥ_k = Ĉ_k − Ĉ^{(k)} and Δ̂_k = Ĥ_k + D̂_k of H_k and Δ_k, respectively.
The estimation approach for the CDPA decomposition is summarized in Algorithm 2.

Algorithm 2 CDPA estimation
Input: observed datasets Y_k ∈ R^{p_k × n}, k = 1, 2.
1: Select the ranks r_k = rank(Σ_k) and r_12 = rank(Σ_12) by the ED method and the MDL-IC method, respectively (Appendix A2).
2: Obtain the denoised data X̂_k by the soft thresholding in (26).
3: Obtain the coefficient matrix estimates {B̂_k}_{k=1}^2 and the sample matrix Ĉ_0 of the common latent factors by the sample D-CCA (Section 3.3).
4: If necessary, zero-pad and/or row-match {B̂_k}_{k=1}^2 by the graph-matching based approach (Section 3.2).
5: Compute the common-pattern matrix estimate Ĉ by (27).
6: Obtain the scaled common-pattern matrix estimate Ĉ^{(k)} = [tr(Σ̂_k)]^{1/2} Ĉ with Σ̂_k = X̂_k X̂_k^T / n, and the distinctive-pattern matrix estimate Δ̂_k = X̂_k − Ĉ^{(k)}.
Output: common-pattern matrix estimate Ĉ, scaled common-pattern matrix estimates {Ĉ^{(k)}}_{k=1}^2, and distinctive-pattern matrix estimates {Δ̂_k}_{k=1}^2.

The following assumption, given in [53,45] and guaranteeing the consistency of {X̂_k}_{k=1}^2, is also used to derive our asymptotic results. Readers are referred to [53,45] for detailed discussions of this assumption.

Assumption 1. We assume the following conditions for the model given in (24) and (25).
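Steps 5-6 of Algorithm 2 translate directly into code; the sketch below assumes the common-pattern estimate C_hat from (27) and the D-CCA outputs X_hat (denoised data) and C_k_hat (common-source estimate) are already available for one dataset.

```python
import numpy as np

def cdpa_pattern_estimates(X_hat, C_hat, C_k_hat):
    """Scaled common-pattern, distinctive-pattern, and H_k estimates for one dataset."""
    n = X_hat.shape[1]
    Sigma_hat = X_hat @ X_hat.T / n            # Sigma_k estimate
    scale = np.sqrt(np.trace(Sigma_hat))       # [tr(Sigma_k)]^{1/2}
    C_scaled = scale * C_hat                   # C^(k): scaled common-pattern matrix
    Delta_hat = X_hat - C_scaled               # distinctive-pattern matrix
    H_hat = C_k_hat - C_scaled                 # distinctive pattern kept in the common source
    return C_scaled, Delta_hat, H_hat
```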

Remark 3. From Theorem 3 and Corollary 1 of [45], we have
Additionally, by our Theorem 2 and the triangle inequality of norms, the same error bound holds for M_k ∈ {H_k, Δ_k}. Note that the scaled squared error in the Frobenius norm indicates the scaled loss in matrix variation (sum of squares).

Theorem 3. Let P̂* = arg max_{P ∈ Π_{p_1}} tr(Q̂_1^T P Q̂_2A (Q̂_1^T P Q̂_2A)^T). Suppose that Assumption 1 holds and r_12 ≥ 1. Then,

tr(Q_1^T P̂* Q_2A (Q_1^T P̂* Q_2A)^T) − tr(Q_1^T P* Q_2A (Q_1^T P* Q_2A)^T) = O_P(δ_θ).

For the row matching problem of B_1 and B_2A, Theorem 3 provides an asymptotically vanishing bound on the change in the objective function value of (22) when the optimal solution P* is replaced by P̂*.

Simulation studies
In this section, we evaluate the finite-sample performance of the proposed CDPA estimation via simulations, comparing with the six D-CCA-type methods mentioned in Section 1.

Finite-sample performance of CDPA estimators
We numerically evaluate the finite-sample performance of proposed CDPA estimators by comparing to the asymptotic results given in Section 3.3. Since the signal-to-noise ratio SNR k = 1500/(p k σ 2 e ) in the above simulation setups, for simplicity we examine the trend of estimation errors with respect to (p k , σ 2 e ) instead of (p k , SNR k ) in the theorems. We use the true {r k } 2 k=1 , r 12 , and P in our matrix estimation here to exclude the error induced by their misspecification. The ranks {r k } 2 k=1 and r 12 can be well selected by the ED and MDL-IC methods, respectively, as shown in [45]. The selection of P by the DSPFP-based row-matching method in Section 3.2 is evaluated later in this subsection.
We first investigate the performance of our common-pattern matrix estimator Ĉ defined in (27). The first two rows of Figures 2 and 3 summarize the scaled squared errors of Ĉ as studied in Theorem 2, together with its relative squared errors, under Setup 1 with θ ∈ {15°, 75°}. The squared errors in the Frobenius norm represent the scaled or relative losses in matrix variation (sum of squares). The average estimation errors increase as the dimension p_1 or the noise variance σ_e^2 grows, and remain well controlled below 0.1 even for many cases with large p_1 ≥ 900 and large σ_e^2 ≥ 4 (or SNR_k ≤ 0.42). These results are consistent with the influence of p_1 and σ_e^2 (= 1500/(p_k SNR_k) here) on the convergence rates given in Theorem 2. Similar numerical results are observed for the scaled version Ĉ^{(k)} = [tr(Σ̂_k)]^{1/2} Ĉ and the distinctive-pattern matrix estimator Δ̂_k = X̂_k − Ĉ^{(k)} for k ∈ {1, 2}, and are hence omitted for brevity.
As a similarity indicator of the signals x_1 and x_2, the common-pattern explained proportion of signal variance, tr{cov(c)}, is estimated by tr((1/n) Ĉ Ĉ^T) = (1/n)‖Ĉ‖_F^2. The third rows of Figures 2 and 3 plot the average absolute error and the average relative error of this estimator for Setup 1 with θ ∈ {15°, 75°}. Consistent with Theorem 2, this row shows that the average estimation errors grow with increasing p_1 or σ_e^2 and have a larger magnitude than the squared errors of Ĉ shown in the first two rows of the figures. The errors are controlled below 0.1 even for some cases with large p_1 ≥ 900 or σ_e^2 ≥ 4.
For the row-matching approach of the coefficient matrices {B_k}_{k=1}^2 described in Section 3.2, its theoretical performance stated in Theorem 3 is numerically investigated with the intractable P* and P̂* replaced by their DSPFP approximations, denoted P_a and P̂_a. The fourth rows of Figures 2 and 3 display the average absolute and relative errors of the graph-matching objective value in (22) evaluated at P̂_a. Although the absolute error appears larger than that of the oracle version (with P̂*) expected from Theorem 3, the relative error is controlled under or around 0.1 even for some cases with large p_1 ≥ 900 or σ_e^2 ≥ 4; moreover, both types of errors follow the influence of p_1 and σ_e^2 (= 1500/(p_k SNR_k) here) on the convergence rate shown in the theorem.
The above result patterns also generally hold for settings with more different values of θ (or equivalently tr{cov(c)}) and for those under Setup 2 where p 1 = p 2 , which are provided in Appendix A3.

Performance of related methods
We now investigate the numerical performance of the six competing methods: D-CCA, OnPLS, COBE, JIVE, AJIVE, and DISCO-SCA. Unlike our CDPA, all six D-CCA-type methods are developed without taking into account the common and distinctive patterns between the two coefficient matrices, and thus may yield different values of θ_{B1} under our simulation setups. The ground-truth θ_{B1} is θ ∧ 30° for D-CCA in our simulation, but may not be easy to determine theoretically for the other five methods and is thus estimated from simulated data. Table 1 summarizes the first principal angle θ_{B1} and its cosine cos θ_{B1} of colsp(B_1^0) and colsp(B_2^0) estimated by the six methods under the two simulation setups with (θ, p_1, σ_e^2) = (75°, 300, 1). We see that COBE and AJIVE give zero common-source matrix estimates and thus fail to discover any common pattern of the two correlated signal datasets. The average estimates of θ_{B1} and cos θ_{B1} from the other four methods are all close to 30° and 0.866, respectively, with very small standard deviations. Therefore, there is significant statistical evidence that their θ_{B1} ∉ {0°, 90°} and cos θ_{B1} ∉ {1, 0}. This indicates the non-negligible existence of both common and distinctive patterns between their coefficient matrices, which are unfortunately not considered by these D-CCA-type methods.²
² We implement the OnPLS with the post-processing step in footnote 1 to obtain the coefficient matrices {B_k}_{k=1}^2 of its common latent factors.

Real data analysis
We apply our CDPA to two real-world data examples, respectively, from the HCP and TCGA, comparing with the six D-CCA-type methods mentioned in Section 1. We focus on the comparison with the state-of-the-art D-CCA in this section, and present the results of the other five methods in Appendix A4.

Application to HCP motor-task functional MRI data
We consider the HCP motor-task functional MRI data obtained from 1080 healthy young adults [2]. All participants were asked by visual cues to perform five motor tasks during image scanning: tapping left and right fingers, squeezing left and right toes, and moving the tongue. From the acquired brain images, for every participant and each task, the HCP computed a z-statistic map of the task's contrast against the fixation baseline at 91,282 grayordinates, including 59,412 cortical surface vertices and 31,870 subcortical gray matter voxels. The z-statistic maps of all participants for each individual task constitute a 91,282 × 1080 data matrix. We focus on the left-hand and right-hand tasks, and apply the proposed CDPA to discover their common pattern on the brain, in comparison with the D-CCA method. Each of the two observed data matrices is row-centered by subtracting the average within each row. Since all z-statistic maps of the two motor tasks are obtained from the same measurement and at the same set of grayordinates, there is no need to choose the signs or match the rows of the two data matrices. We consider the variance maps of {x_L, x_R, c_L, c_R, c} on the brain, which are estimated by the sample variances computed from the sample matrix estimates {X̂_L, X̂_R, Ĉ_L, Ĉ_R, Ĉ} obtained by D-CCA and CDPA. Here, the subscripts L and R denote the left-hand and right-hand tasks. The ranks {r_L, r_R} and r_12 are all selected as two by the ED and MDL-IC methods, respectively. The proportions of the corresponding signal variances explained by the common-source vectors c_L and c_R are 0.111, and the common-pattern explained proportion of signal variance is tr{cov(c)} ≈ (1/n)‖Ĉ‖_F^2 = 0.077. Figure 4 presents the estimated variance maps of D-CCA and CDPA. For all five maps, the estimated variances of cortical surface vertices overall dominate those of subcortical voxels. We hence focus on the cortical-surface part of each variance map. From the estimated signal variance maps v̂ar(x_L) and v̂ar(x_R) shown in Figure 4(a) and (b), we see that the right hemisphere is more active, with larger variances on the cortical surface, for the left-hand task, while the pattern is almost the opposite for the right-hand task. In particular, the contralateral pattern is clearly seen in the somatomotor cortex annotated by green circles, a brain region known to be linked with hand tasks [4]. A similar contralateral pattern is also observed for D-CCA's v̂ar(c_L) and v̂ar(c_R) in Figure 4(c) and (d), indicating that each common-source vector still retains some distinctive pattern of x_k for k ∈ {L, R}. This is not surprising because c_L and c_R have different coefficient matrices for the common latent factors, which are r_12 columns in the coefficient matrices of the canonical variables for x_L and x_R, respectively, as shown in equation (14). In contrast, our CDPA's common-pattern vector c in Figure 4(e) has an estimated variance map that is nearly symmetric across the two hemispheres, and thus is more reasonable than D-CCA's common-source vectors c_L and c_R for representing the common pattern of the left-hand and right-hand tasks on the brain.

Application to TCGA breast cancer genomic datasets
With the aim to discover new breast cancer subtypes, we apply the proposed CDPA to two TCGA breast cancer genomic datasets [26], and compare the results with the D-CCA. We consider the DNA methylation data and mRNA expression data obtained from a common set of 703 tumor samples. Following the preprocessing procedure of [29], we select the top 1100 variable probes for the DNA methylation dataset and the top 896 variably expressed genes for the mRNA expression dataset. The tumor samples are categorized by the classic PAM50 model [40] into four intrinsic subtypes, including 124 Basal-like, 58 HER2-enriched, 348 Luminal A, and 173 Luminal B tumors.
The two data matrices of interest have sizes 1100 × 703 and 896 × 703, and are row-centered before analysis. The ranks (r_DNA, r_mRNA, r_12) are selected by the ED and MDL-IC methods. Since only 126 (11.5%) DNA methylation probes can be mapped to the genes of the considered mRNA expression data, for simplicity we match the rows of the two data matrices by the graph-matching based approach described in Section 3.2 before implementing CDPA. The CDPA method then provides the common-pattern explained proportion of signal variance tr{cov(c)} ≈ (1/n)‖Ĉ‖_F^2.

We explore new cancer subtypes by conducting clustering analysis on each observed or recovered matrix from the CDPA and D-CCA methods. We use Ward's hierarchical clustering method [54] with the Euclidean distance, and simply set the number of clusters to four, the same number as the PAM50 intrinsic subtypes (see the code sketch at the end of this subsection). Table 2 compares the differences in survival curves of the identified clusters or given subtypes using the two most popular methods, the log-rank test [34] and the Peto-Peto Wilcoxon test [42], where the latter test is more sensitive to early survival differences. Our CDPA's Δ̂_mRNA-identified clusters and the PAM50 intrinsic subtypes both show very significantly distinct survival behaviors, with the two smallest p-values (≤ 0.009) in both tests, whereas the other matrices generate much less pronounced clusters; in particular, the matrices {Ĉ_k, D̂_k}_{k∈{DNA,mRNA}} of D-CCA all have large p-values (≥ 0.290). Comparing the p-values of Ĉ, X̂_k and Δ̂_k for each k, the improved discriminative power of the distinctive-pattern matrix estimate Δ̂_k can be attributed to removing the less sensitive common-pattern matrix estimate Ĉ from the denoised data matrix X̂_k.

Let Δ̂_mRNA-i denote the i-th cluster identified from Δ̂_mRNA. Figure 5 displays the Kaplan-Meier survival curves of the Δ̂_mRNA-identified clusters and the PAM50 subtypes. With the worst survival curve among the four identified clusters, Δ̂_mRNA-3 behaves similarly to the HER2-enriched subtype, but is notably different from all other identified clusters and intrinsic subtypes. This is further confirmed in Table 3 by the minimum p-value of the corresponding log-rank test and Peto-Peto Wilcoxon test. Also seen in the table, the other three Δ̂_mRNA-identified clusters have no significant survival differences, with large p-values (≥ 0.320). Moreover, the matching matrix in Table 4 shows that most Δ̂_mRNA-1 and Δ̂_mRNA-2 samples belong to the Basal-like and Luminal A subtypes, respectively. Hence, the other three Δ̂_mRNA-identified clusters are of less interest as new subtypes, and we focus on Δ̂_mRNA-3, which has the poorest survival, and further compare it with the HER2-enriched subtype. From Table 4, we see that the Δ̂_mRNA-3 cluster (118 samples) and the HER2-enriched subtype (58 samples) share only 10 samples and have substantially distinct clinical features in terms of the three important receptors' status. In particular, the Δ̂_mRNA-3 cluster primarily includes samples that are ER+ and/or PR+, whereas the HER2-enriched subtype contains those that are HER2+ and/or PR−. To conclude, the Δ̂_mRNA-3 cluster, with a low survival rate, is remarkably different from the four PAM50 subtypes and appears to be an important new breast cancer subtype worth further investigation.

Notes to Table 4: The columns of the matching matrix are reordered such that the diagonal sum is maximized. Receptor status for estrogen (ER), progesterone (PR) and human epidermal growth factor 2 (HER2) includes positive (+), negative (−), and N/A or equivocal.
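For reference, a minimal sketch of the clustering-and-survival-comparison step described above, assuming a recovered matrix `Delta_mRNA` (features × samples) and per-sample survival times and event indicators; scikit-learn and lifelines are used here as convenient stand-ins for the tools cited in the text ([54], [34]), and the Peto-Peto test is omitted.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from lifelines.statistics import multivariate_logrank_test

def cluster_and_compare_survival(Delta_mRNA, surv_time, event_observed, n_clusters=4):
    """Ward hierarchical clustering on samples (columns), then a log-rank test
    for survival differences among the identified clusters."""
    samples = Delta_mRNA.T                                   # samples x features
    labels = AgglomerativeClustering(n_clusters=n_clusters,
                                     linkage="ward").fit_predict(samples)
    test = multivariate_logrank_test(surv_time, labels, event_observed)
    return labels, test.p_value
```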

Discussion
In this paper, we propose a new decomposition method, called CDPA, to extract the common and distinctive patterns of two correlated datasets by incorporating the conventionally ignored common and distinctive patterns between the two coefficient matrices of common latent factors. We also develop a graph-matching based approach to match the unpaired rows between the coefficient matrices. Consistent CDPA matrix estimation is established under high-dimensional settings and is supported by simulations. Our simulation studies and two real-data examples show that CDPA can better delineate the common and distinctive patterns between datasets than D-CCA-type methods, thereby benefiting data mining applications. There are two possible extensions of the CDPA. The first is to extend it to three or more datasets. One may construct a multi-set CDPA method by first developing a multi-set D-CCA from the generalized CCA [24]. The next challenge is how to appropriately match the rows of the multiple coefficient matrices of the resulting common latent factors. The second extension is to incorporate the nonlinear patterns between the two datasets. The CDPA only considers the linear patterns extracted from the inner product spaces (L 2 0 , cov) and (R p1∨p2 , ·). A nonlinear version of our row-matching approach and a nonlinear D-CCA may be expected in this extension, where the latter is possibly developed from the kernel CCA [14] or the deep CCA [1].

A1.1. Proof of Theorem 1
For k = 1, 2, denote by z_k^{[1:r_12]} and z̃_k^{[1:r_12]} the vectors containing two different sets of the first r_12 canonical variables associated with x_k. By the first paragraph of page 5 in the supplement of [45], there exists an orthogonal matrix O_{z_k} such that z̃_k^{[1:r_12]} = O_{z_k} z_k^{[1:r_12]}. Hence, [c_{Bℓ}]_{ℓ=1}^{r_12} and V_{B_k} are unique for k = 1, 2. By Theorem 2 in [45], c_k in (14) is unique, and together with (17), we obtain the uniqueness of c_k^* for k = 1, 2. Hence, c defined in (19) is unique.

A1.2. Proof of Theorem 2
Letr k = rank( X k ). From (S.17) in [45], we haver k = r k with probability tending to 1 as n → ∞. Due to Lemma S.1 in [45], we simply assumer k = r k in the rest of the proof. Thus, Λ k is rank-r k , and then B k = V k Λ Let Q k ∈ R p k ×r12 be the left singular matrix of B k . Note that B k 2 ≤ V k Λ . By Lemma 1 of [28] and then Theorem 3 of [57], there exists an orthogonal matrix O k such that Here and in the following text, we write A P B if and only if A = O P (B). Note that for any real matrices M 1 and M 2 , we have and Note that the columns of Q k form an orthonormal basis of colsp(B k ), and those of PQ 2A also form an orthonormal basis of colsp(PB 2A ). Let Θ B = Q 1 PQ 2A . Then by (A4) and (A2), we have By Weyl's inequality (see Theorem 3.3.16(c) in [19]), is an orthogonal matrix with column dimension equal to the repetition number of σ B, , such that and By (A3), (A5) and the above two inequalities, By the above inequality, Θ B − Θ B 2 P δ θ , and the triangular inequality of matrix norms, we have It follows that Combining (A10), (A7) and (A9) yields By (6), we have that the -th columns of V B1 := Q 1 U B1 and V B2 := PQ 2A U B2 are the -th pair of principal vectors of colsp(B 1 ) and colsp(PB 2A ). By (A4), (A2) and (A8), we have Similarly, by (A11) we obtain Then, together with (A4) and (A1), we have and similarly, By the results given in (S.16), (S.17) and (S.7) of [45], we have |λ . Then by the mean value theorem, we obtain Hence, By (A4), (A14) and (A17), Similarly, by (A15), Then by (A4), From (S.23) in [45], Θ − Θ F δ θ . Using the same proof technique for (A8) and (A11), we have U From the results given in (S.9), (S.13), (S.15) and (S.32) of [45], we have that max{ X k F , X k F } P nλ 1 (Σ k ), X k − X k F P min{ λ 1 (Σ k )/n + √ p k log p k , nλ 1 (Σ k )}, and A C − A C F P δ and thus, From (A18), (A19), (A20) and (A4), we obtain Combining (A16) and (A21) yields By (S.14) in [45], there exists a constant κ 3 ∈ (0, 1] such that X k F ≥ X k 2 ≥ [κ 3 + o P (1)] nλ 1 (Σ k ). Hence, Hence, Then, By (A21), we have

Combining the above two inequalities and applying Weyl's inequality (see Theorem 3.3.16(c) in [19]) yields the desired bound. The proof is complete.

A1.3. Proof of Theorem 3
By (A2), there exists a matrix Q̃_k whose columns form an orthonormal basis of colsp(B_k) such that ‖Q̃_k − Q̂_k‖_F = O_P(δ_θ). Note that tr(Q_1^T PQ_2A (Q_1^T PQ_2A)^T) = ‖Q_1^T PQ_2A‖_F^2. Then by (A4), for any P ∈ Π_{p_1}, the corresponding objective values differ by O_P(δ_θ). Hence, tr(Q_1^T P̂* Q_2A (Q_1^T P̂* Q_2A)^T) − tr(Q_1^T P* Q_2A (Q_1^T P* Q_2A)^T) = O_P(δ_θ).

Fig A5. The variance maps estimated by the DISCO-SCA, JIVE, and AJIVE methods for the HCP motor-task functional MRI data. The notation v̂ar denotes the sample variance vector obtained from the corresponding recovered sample matrix. In each subfigure, the left part displays the cortical surface, with the outer side shown in the first row and the inner side in the second row; the right part shows the subcortical area on 20 xy slices along the z axis. The somatomotor cortex is annotated by green circles.

A4.2. Additional results of TCGA breast cancer genomic datasets
We also apply the same clustering method used in Section 5.2 to each recovered matrix from the five D-CCA-type methods: OnPLS, COBE, JIVE, AJIVE, and DISCO-SCA. Table A1 reports the p-values of the log-rank test and the Peto-Peto Wilcoxon test for the survival differences among the clusters obtained from each of these matrices. All five methods have p-values above 0.05 and thus fail to discover breast cancer subtypes with significant survival differences.