Heterogeneous large datasets integration using Bayesian factor regression

Two key challenges in modern statistical applications are the large amount of information recorded per individual, and that such data are often not collected all at once but in batches. These batch effects can be complex, causing distortions in both mean and variance. We propose a novel sparse latent factor regression model to integrate such heterogeneous data. The model provides a tool for data exploration via dimensionality reduction while correcting for a range of batch effects. We study the use of several sparse priors (local and non-local) to learn the dimension of the latent factors. Our model is fitted in a deterministic fashion by means of an EM algorithm for which we derive closed-form updates, contributing a novel scalable algorithm for non-local priors of interest beyond the immediate scope of this paper. We present several examples, with a focus on bioinformatics applications. Our results show an increase in the accuracy of the dimensionality reduction, with non-local priors substantially improving the reconstruction of factor cardinality, as well as the need to account for batch effects to obtain reliable results. Our model provides a novel approach to latent factor regression that balances sparsity with sensitivity and is highly computationally efficient.


Introduction
A first important task when dealing with large datasets is to conduct an exploratory analysis.
Dimensionality reduction techniques have proven a highly popular tool for this purpose. They provide a lower-dimensional representation that can give insights into the underlying structure to visualise, denoise or extract meaningful features from the data. See Johnson and Wichern (1988, chap. 3) or Hastie et al. (2001, chap. 14) for a gentle introduction and Burges (2010); Cunningham and Ghahramani (2015) for more recent reviews.
Large datasets are common in modern statistical applications. For instance, technological advances in bioinformatics such as high-throughput sequencing, microarrays, mass spectrometry and single cell genomics allow the gathering of a vast amount of biological data, enabling researchers to create models to explain the complex processes and interactions of biological systems (see Bersanelli et al. (2016) for a recent review). Cancer is a prominent example. Largescale projects such as The Cancer Genome Atlas (TCGA), Cancer Genome Project (CGP) and the International Cancer Genome Consortium (ICGC), as well as many individual laboratories are generating extensive amounts of biological data (e.g. gene expression, mutation annotation, DNA methylation profiles, copy number changes) in addition to recording other covariates (e.g. gender, tumour stage, medical treatment and patient history). These projects aim to give a better understanding of the disease and improve prognosis, prevention and treatment. However, the large and heterogeneous nature of the data make the analyses and interpretations challenging. Furthermore, such data are often generated under different experimental conditions, when new samples are incrementally added to existing samples, or in analyses coming from different projects, laboratories, or platforms; collecting data in this matter often produces batch effects (Rhodes et al., 2004). Which, unless properly adjusted for, may lead to incorrect conclusions (Leek et al., 2010;Goh et al., 2017). In the context of bioinformatics, several approaches have been developed for removing batch effects (see Scherer (2009) for a review and examples). These include data "normalization" methods using control metrics or regression methods (Schadt et al., 2001;Yang et al., 2002), matrix factorisation (Alter et al., 2000;Benito et al., 2004) and locationscale methods (Leek and Storey, 2007;Johnson and Li, 2009;Parker et al., 2014;Hornung et al., 2016). Our examples focus on cancer-related gene expression; nonetheless, batch effects are also present in many other settings, e.g. structural magnetic resonance imaging (MRI) data from Alzheimer's disease (Shinohara et al., 2014;Fortin et al., 2016), multiple sclerosis (Shah et al., 2011), attention deficit hyperactivity disorder (Olivetti et al., 2012) or even different tissues of marine mussels (Avio et al., 2015).
We address dimensionality reduction via a model-based framework relying on Bayesian factor analysis and latent factor regression. Our model builds on the approaches introduced by Lopes and West (2004); Lucas et al. (2006); Carvalho et al. (2008) and Ročková and George (2017).
An important practical extension of these works is to increase the flexibility to account for systematic biases or sources of variation that do not reflect any underlying patterns of interest, i.e. batch effects. Our other main contribution is developing a scalable non-local prior based formulation to induce sparsity and learn the underlying number of factors. Strategies for batch effect correction include data pre-processing, for example via the so-called ComBat empirical Bayes approach (Johnson et al., 2007) or via singular value decomposition (SVA) (Leek and Storey, 2007). Another possibility is to use factor models to learn on the one hand the biological patterns via common factors shared across the different data sources and on the other hand the non-common sources of variation via data-specific factors (De Vito et al., 2018b,a). Those approaches while useful, do not provide a model-based approach for dimensionality reduction while correcting mean and variance batch effects that returns sparse loadings and ease of interpretation. We model observations with a regression on latent factors, observed covariates and batch effects that can alter the mean and intrinsic variance structures. Model fitting is done via an Expectation-Maximisation (EM) algorithm to obtain maximum posterior mode parameter estimates in a computationally efficient manner. We focus on three different continous prior formulations for the loadings: flat, Normal-spike-and-slab (George and McCulloch, 1993) and a novel Normal-spike-and-MOM-slab, based on a continuois relaxation of the non-local prior configuration by Rossell (2010, 2012). We also discuss non-local Laplace-tailed extensions, along the lines in Ročková and George (2017). Spike-and-slab priors provide sparse loadings, effectively performing model selection on the number of required factors and non-zero loadings. We obtain closed-form updates, a novel contribution to the non-local prior literature.
As we will discuss later, the main advantage of non-local priors in this setting is to help achieve a better balance between sparsity and sensitivity in inferring non-zero loadings. To our knowledge this is the first adaptation of these non-local priors to factor models. See also Bar et al. (2018) who argued for improved sensitivity via 3-component mixture priors that resemble non-local priors in generalised linear models.
The outline of this paper is as follows. d Section 2 reviews latent factor regression and introduces our extension, which includes a variance batch effect adjustment. Section 3 proposes prior formulations including non-local priors on the loadings and important aspects related to prior parameter elicitation. Section 4 describes several EM algorithms for model fitting, parameter initialisation and post-processing steps required for effective model selection and dimension reduction. Section 5 presents applications on simulations and on cancer datasets under unsupervised and supervised settings. Section 6 concludes. The supplementary material contains the derivation of the EM algorithm and additional results. Software implementing our methodology is available at https://github.com/AleAviP/BFR.BE.

Latent factor regression with batch effects
Consider vectors x i = (x i1 , x i2 , . . . , x ip ) ∈ R p , observed for i = 1, . . . , n individuals. The factor regression model defines x i as a regression on p v observed covariates denoted by v i ∈ R pv , and q low-dimensional latent variables denoted z i ∈ R q , also known as latent coordinates or factors.
Let X be the n × p matrix with the i th row equal to x i , V the n × p v matrix of known covariates with the i th row equal to v i and Z the n × q matrix of latent coordinates, containing z i in the i th row. The standard factor regression model is where θ ∈ R p×pv is the matrix of regression coefficients, M ∈ R p×q is the matrix of factor loadings and e i ∈ R p is the error, distributed as e i ∼ N (0, Σ) independently across i = 1, . . . , n, where Σ is a diagonal matrix. Factors are assumed to be standard normal, z i ∼ N (0, I), independent across i = 1, . . . , n and also independent of e i .
Equation (1) regresses the observed data X on known covariates and on a latent factor structure. In particular, it allows additive batch effects to be accounted for by incorporating the variables recording the batches into v i . However, in practice one often observes more complex batch effects; specifically in bioinformatics it is common to observe multiplicative effects on the variance (Johnson et al., 2007). We will later describe an example of this, shown in Figure 5.
Such artefacts cannot be captured by (1) given that Σ is assumed constant across all individuals.
To address this issue we extend (1) by allowing Σ to depend on i. Suppose the data were obtained in p b batches, e.g. from different days, laboratories or instrumental calibrations, with n l individuals in batch l, for l = 1, . . . , p b , such that n 1 + n 2 + · · · + n p b = n. Let b i be the indicator vector of length p b defined as b il := 1 if individual i is in batch l, b il := 0 otherwise.
We incorporate batch effects by adding a mean and variance adjustment. We let where θ, v i , M and z i are as (1), β ∈ R p×p b captures additive batch effects and the variance of e i captures multiplicative batch effects. We denote by τ jl , j = 1, . . . , p and l = 1, . . . , p b as the j th idiosyncratic precision element in batch l. Then, given b il = 1, the errors are independently distributed as e ij ∼ N (0, τ −1 jl ). Further, denote by T the p × p b matrix that has τ jl as its (j, l) element.
To help interpret the practical implications of the model, suppose that one has orthonormal factor loadings M M = I. Then (2) implies and thus, That is, the mean of the latent coordinates is the projection M x i plus a translation given by the batch effect adjustment and (potentially) the observed covariates. An interesting observation is that their covariance Cov(z i | depends on the multiplicative batch-dependent noise. As an example, the middle-left panel in Figure 5 show the two first factors of an ovarian dataset preprocessed by ComBat. Relative to the unadjusted upper-left panel, ComBat removes systematic differences in mean and variance accross the 2 batches, however the latent coordinates exhibit distinct covariances. To obtain suitably-adjusted low-dimension coordinates one should estimate T jointly with (M, θ, β).
Model (2) can be represented in matrix notation as where E ∈ R n×p is the matrix of errors.
The latent factor model is non-identifiable up to orthogonal transformations, of the form M * = A M and Z * = ZA, where A is any orthogonal q × q matrix. Thus, the factor model in (4) can equivalently be rewritten as X = V θ + Z * M * + Bβ + E. To obtain unique point estimates of M and Z, several alternative prior specifications have been developed. One option is restricting the parameter space. Seber (1984) constrained M such that M ΩM is diagonal. Lopes and West (2004) restricted M to be lower-triangular with a strictly positive diagonal, m jj > 0, and assumed M to be full-rank. More recently, Frühwirth-Schnatter and Lopes (2018) suggested a factor reordering via a Generalized Lower Triangular loading matrix. However, under this approach the interpretation of M depends on the arbitrary ordering of the columns in X, and it gives special roles to the first factors. Another option is to encourage sparsity in M , e.g. the classical varimax solution (Kaiser, 1958) maximises the variance in the squared rotated loadings. A more modern strategy is to favour sparse solutions containing exact zero loadings, e.g. Ročková and George (2017) proposed an EM algorithm that seeks rotations based on a so-called Parameter Expansion (PX) that aims to avoid local suboptimal regions. We adopt a similar strategy where sparse solutions are prefer by the introduced non-local penalties.

Prior formulation
To complete Model (2) we set priors for the loadings M , precisions τ jl , and regression parameters (θ, β). Through our proposed default prior formulation we assume that the columns in X have been centred to zero mean and unit variance. For the idiosyncratic precisions τ jl we set independently across j = 1, . . . , p and l = 1, . . . , p b . By default in our examples we set the fairly informative values η = ξ = 1, leading to diffuse though proper priors.
For the regression parameters we set (θ j , β j ) ∼N (0, ψI), j = 1, . . . , p where ψ is a user-defined prior dispersion that in our examples by default we set to ψ = 1. The choice of ψ = 1 assigns the same marginal prior variances to elements in (θ j , β j ) as the unit information prior often adopted as a default for linear regression (Schwarz, 1978).
We remark that this prior does not encourage sparsity in the regression parameters (θ, β) or factor loadings, which we view as reasonable provided the number of variables p v and batches p b are moderate. For large p v or p b , a direct extension of our prior on the loadings M could be adopted.
The loadings matrix M plays an important role in improving shrinkage and simplifying interpretation. Some recent strategies include a LASSO-based method (Witten et al., 2009), horseshoe priors (Carvalho et al., 2009), an Indian buffet process (Knowles and Ghahramani, 2011), an infinite factor model (Dunson and Bhattacharya, 2011) among others. In this paper, we consider three priors on the loadings: an improper flat prior p(M ) ∝ 1, a Normal spike-and-slab and a novel non-local pMoM spike-and-slab. The local and non-local spike-and-slab prior formulations are detailed bellow, along with Laplace-based extensions. These build on the approach by George (2014, 2017), our main contribution being the introduction of non-localbased variations.

Local spike-and-slab prior
A traditional Bayesian approach to variable selection is the spike-and-slab prior, a two-component mixture prior (Mitchell and Beauchamp, 1988;George and McCulloch, 1993). This prior aims to discriminate those loadings that warrant inclusion, modelled by the slab component, from those that should be excluded, modelled by the spike component.
Specifically, a spike-and-slab prior density for the loadings M has the form where p(m jk | λ 0 , γ jk = 0) is a continuous density, λ 0 is a given dispersion parameter of the spike component and λ 1 > λ 0 is that of the slab component. The indicators γ jk ∈ {0, 1} signal which m jk were generated by each component, and serve as a proxy for which loadings are significantly non-zero. We take as a base formulation the Normal-spike-and-slab prior by George and McCulloch (1993) were the spike is a Normal density with a small variance λ 0 and the slab a Normal distribution with large variance λ 1 . Although Laplace-Spike-and-Slab priors have been shown to possess better properties for sparse inference (Ročková and George, 2018), as discussed bellow the introduction of non-local penalties improves certain undesirable features of the Normal-based prior. The elicitation of λ 0 and λ 1 is an important aspect of the formulation and will be discussed in Section 3.3. Specifically, the Normal-spike-and-slab is The continuity of the spike distribution gives closed form expressions for the EM algorithm, making it computationally appealing. We refer to (8) as Normal-SS.
Furthermore, note that a ζ k encourages increasingly sparse solutions in subsequent factors. That is, related to our earlier discussion of non-identifiability (Section 2), we encourage loadings where the first factors have larger importance, leading to solutions that are sparse both in the rank of M and its non-zero entries.
A potential concern with Normal-SS is that the slab density assigns non-negligible probability to regions of the parameter space that are also consistent with the spike, namely when m jk lies close to zero. We will address this via non-local priors and show that these, by enforcing separation between two components, help increase sensitivity.

Non-local spike-and-slab prior
Non-local priors (NLPs) are a family of distributions that assign vanishing prior density to a neighbourhood of the null hypothesis (Johnson and Rossell, 2010). Definition 3.1 is an adaptation of the definition in Johnson and Rossell (2010) to (7).
We call any prior not satisfying Definition 3.1 a local prior. Non-local priors possess appealing properties for Bayesian model selection. They discard spurious parameters faster as the sample size n grows, but preserve exponential rates to detect important coefficients (Johnson and Rossell, 2010;Fúquene et al., 2018) and can lead to improved parameter estimation shrinkage (Rossell and Telesca, 2017). To illustrate the motivation for NLPs in our setting consider Figure 1.
As an alternative, we consider a product moment (pMOM) prior (Johnson and Rossell, 2012).
We denote (10) as MOM-SS. This prior assigns zero density to m jk = 0 given γ jk = 1, which implies p(γ jk = 1 | m jk = 0) = 0 ( Figure 1). Prior elicitation forλ 0 andλ 1 is discussed in Section 3.3. From a computational point of view, the EM algorithm can accommodate this extension by using a trivial extra gradient evaluation at negligible additional cost relative to the Normal-SS.
Parameter estimation and algebraic details are described in Section 4. The prior on the inclusion indicators is set as in (9).
Beyond (8) and (10), another natural extension is to use Laplace-based priors based on the Spike-and-Slab LASSO by Ročková and George (2018) with a slab component with variance 2λ 2 0 , and a spike component with 2λ 2 1 , where Laplace(m jk ; 0, λ) = 1 2λ exp −|m jk | λ . We refer to (11) as Laplace-SS. As illustrated in Figure 1 (right panels) this prior can help encourage sparsity, setting p(γ jk = 1 | m jk = 0) to substantially smaller values (though still non-zero) than the Normal-SS.
As an extension, akin to (10), one could set a moment penalty on the Laplace density.
We discuss prior elicitation for Laplace-MOM-SS in Section 3.3 and derive an EM algorithm in Section 4.2 but in our examples we focus on the MOM-SS for simplicity. However, the Laplacebased (12) can also be shown to lead to closed-form EM updates.

Prior elicitation for the variance of the spike-and-slab priors
A crucial aspect in a spike-and-slab prior is the choice of the prior scale parameters. It is common to fix the variance of the spike distribution λ 0 to a value close to zero. Regarding λ 1 , one option and Laplaced-based (right) priors. Scales (λ 0 , λ 1 ) are set to the defaults from Section 3.3.
is to set a hyper-prior or to try to estimate it from the data McCulloch, 1993, 1997;George, 2014, 2018). Setting a hyper-prior does not bypass prior elicitation, as one then needs to set the hyper-prior parameters, whereas estimating λ 1 from the data increases the cost of computations. Instead, we capitalise on the fact that factor loadings have a natural interpretation in terms of the fraction of explained variance in X. Thus, we propose default values that dictate which coefficients are considered as meaningfully different from zero. These defaults are guidelines in the absence of a priori knowledge. A convenient feature of such an elicitation is that it can be easily extended to local priors and other non-Gaussian spike-and-slab priors.
Regarding the Normal-SS prior,we set λ 0 =λ 0 and λ 1 such that it is comparable to the MOM-SS in terms of informativeness, namely it matches the variance of the MOM-SS, obtaining

Parameter estimation
Parameter estimation in factor analysis is usually conducted using Expectation-Maximisation (EM, Dempster et al. (1977)), MCMC algorithms (Lopes and West, 2004) or approximated via variational inference (Ghahramani and Beal, 2000). At the core of these algorithms is the fact that, conditional on the data and all other model parameters, we can setx i = x i − θv i − βb i and express the model in (2)

EM algorithm under a uniform prior
We outline an EM algorithm to fit Model (2)  iteratively. For convenience we denote by T b i the idiosyncratic precision matrix in batch l, i.e. if b il = 1 by τ jl , then the errors are distributed as e i ∼ N (0, T −1 b i ). We also denote witĥ ∆ = (M ,θ,β,T ) the current value of the parameters We briefly describe the algorithm; see Supplementary Section A for its full derivation.
The E-step takes the expectation of log p(M, θ, β, T | X, Z) with respect to p(Z |∆, X) where C is a constant. Expression (13) only depends on Z through the conditional posterior and the conditional second moments is the conditional covariance matrix of the latent factors.
We emphasise that (14) and (15) depend on batch-specific precisions T b i .
The M-step maximises Q(∆) with respect to M, θ, β, T . Setting its partial derivatives to 0 gives the updateŝ The updates for (θ j , β j ) are Equation (18) has the form of a ridge regression estimator with penalty ψ.
Algorithm 3 summarises the EM algorithm. The stopping criteria is reaching a tolerance * in the log-posterior change, a maximum number of iterations T or a change * M on the loadings. By default we set * = 0.001, T = 100 and * M = 0.05. Parameter initialisation is an important aspect that helps obtain better local modes and reduce computational time; its discussion is deferred to Section 4.3.
Algorithm 1: EM algorithm for factor regression model with uniform p(M ) jk | and t = t + 1 end

EM algorithm for spike-and-slab priors
The algorithm is derived analogously to Section 4.1. The expected complete-data log-posterior with C a constant and E[z i |∆, X] and E[z i z i |∆, X] as in (14) and (15).
Q 1 (θ, M, β, T ) resembles the E-step for the flat prior in Section 4.1, plus an extra conditional . Q 2 (ζ) arises from the Beta-Binomial prior on γ jk and the E[γ jk | ·] are straightforward to compute. In the M-step we maximise Q 1 w.r.t. (θ, M, β, T ), this can be done in a completely independent fashion from optimising Q 2 w.r.t. ζ.
Further the conditional expectation of E[γ jk |∆] =p jk iŝ For the Normal-SS prior, Equation (21) iŝ for the MOM-SSp for the Laplace-SSp and for the Laplace-MOM-SŜ Equations (22) and (24) are analogous to the EM posterior update for m jk in a two-component Gaussian or Laplace mixture (Ročková and George, 2014). Equations (23) and (25) are simular to ther local counterparts, but incorporate a penalty for small m 2 jk . The main difference between the local and non-local priors lies in updating the loadings and the idiosyncratic variances. We discuss these separately for each prior later in this section.
The updates for the precision T l and the regression parameters (θ, β) are given in Equations (17) and (18) respectively.
Algorithm 2 summarises the algorithm. It is initialised with the two-stage least-squares method described in Section 4.3 and ζ k = 0.5 for k = 1, . . . , q. The stopping criteria are as in Algorithm 3. The different updates for M are outlined below, separately for each prior specification.
Algorithm 2: EM algorithm for factor regression model with spike-and-slab p(M ) jk | and t = t + 1 end + see Section 4.2, Supplementary Sections B and C for details.
wherep jk is as in (22). Thus, the EM update for the j th row of matrix M is, For the MOM-SS wherep jk is given in (23) For the M-step, we use a coordinate descent algorithm (CDA) that performs successive univariate optimisation on (19) with respect to each m jk . An advantage is that the updates have a closed-form that is computationally inexpensive. As a potential drawback it could require a larger number of iterations to converge relative to performing joint optimisation with respect to multiple elements in M . However, we have not found this to be a practical problem in our examples.
Viewed as a function of only m jk , it is possible to express Q 1 (m jk ) as where See Supplementary Section C. The global maximum of (30) is summarised in Lemma 1.
Thus the solution for m jk is given by setting ∂Q 1 ∂m jk = 0 as given in Lemma 3.
We remark that if either x i or v i are continuous, the event of b = 0 has zero probability. If both x i and v i are discrete and in presence of the rare event of b = 0, then the sign of the update for m jk is set to the previous one.

Initialisation of parameters
The EM algorithm can be sensitive to parameter initialisation. We propose two different strategies: least-squares and least-squares with rotation.
The first option is a simple two-step least-squares that is computationally efficient and performs well in many of our examples.
Step 3: varimax rotation for the loadings obtained in Step 2.
The reason for this extra step is to help escape local modes. The EM algorithm does not guarantee convergence to a global maximum, but it increases the log-posterior at each iteration.
This local maxima issue is intensified by the non-identifiability of the factor model through the rotational ambiguity of the likelihood and the strong association between the updates of loadings and factors.

Post-processing for model selection and dimensionality reduction
The EM algorithm gives point estimates (M ,θ,T ,ζ). Under Laplace-SS one can obtain exact sparsity viam jk = 0, however this is not the case for our other priors. To address this, we define γ as the solution of the following optimisation problem γ = argmax γ p(γ | X,M ,θ,T ,ζ) = argmax γ jk p(γ jk |m jk ,ζ k ) where the right-hand side follows from the assumed conditional independence of m jk . That is, we setγ jk = 1 if p(γ jk = 1|m jk ,ζ k ) > 0.5 and γ jk = 0 otherwise. Whenγ jk = 0 we setm jk = 0 effectively selecting the number of factors and the non-zero loadings within each factor.
As an alternative post-processing step we consider that in some applications one may want to select only the number of factors. We then consider to settingγ jk = 1 if p j=1γ jk = 0 and γ jk = 0 otherwise. Finally we re-order of the factors so that p j=1 γ jk is decreasing in k, which under our prior (9) is guaranteed to increase the log-posterior. This is the so-called left-ordered inclusion matrix of Griffiths and Ghahramani (2011). This facilitates the interpretation of latent factors.
Latent factors are also post-processed for data visualisation purposes. The aim of this is to obtain new standardised factorsz i = [Cov(z i |∆, X)] −1 E[z i |∆, X], with Cov(z i |∆, X) = (I q +M T b iM ) −1 , whose covariance does not depend on their batch.

Results
We assess our approach on simulated and on experimental datasets. Section 5.1 studies simulations under two different loading matrices M (truly sparse and dense) and two different scenarios (without and with batch effects). We compare our methods with the Fast Bayesian Factor Analysis via Automatic Rotations to Sparsity (FastBFA) of Ročková and George (2017) and the Penalized Likelihood Factor Analysis with a LASSO penalty (LASSO-BIC) of Hirose and Yamamoto (2015). We also use the ComBat empirical Bayes batch effect correction of Johnson et al. (2007) for scenarios with batch effects, doing an MLE estimation of the factor analysis model (ComBat-MLE). In Section 5.2 we analyse a high-dimensional gene expression data under a supervised and an unsupervised framework. We use the clinically annotated data for the ovarian cancer transcriptome from R package curatedOvarianData 1.16.0 (Ganzfried et al., 2013) and the lung cancer data from The Cancer Genome Atlas (TCGA) from R package TCGA2STAT 1.2 (Wan et al., 2015).
The R code for our model is available at https://github.com/AleAviP/BFR.BE. We used R function FACTOR ROTATE of Ročková and George (2017) for FastBFA, the R package fanc 2.2 for LASSO-BIC (Hirose et al., 2016) and package sva 3.26.0 for ComBat (Leek et al., 2017). Hyper-parameters for the Normal-SS and MOM-SS were set as in section 3.3, the hyperparameters for FastBFA were set via Dynamic Posterior Exploration as in Ročková and George (2017) with 1/λ 0 = 0.001 and 1/λ 1 ∈ {5, 10, 20, 30} and using varimax robustifications. For the LASSO-BIC we selected the model with smallest BIC to set the regularization parameter.
Finally, for scenarios with batch effects, we adjusted the data via a ComBat correction and performed a Factor Analysis via EM algorithm to maximise likelihood with the fa.em function in the cate package (Wang and Zhao, 2015).

Simulation study
To assess the precision of the parameter estimates returned by the EM algorithm, we simulated data from two different data-generating truths: truly sparse and dense for the loadings M . In both, the truth was set to q * = 10 factors. The dense loadings matrix has a grid of elements set uniformly between (−1, 1), whereas the truly sparse M has a banded-diagonal structure with m jk = 1 for the non-zero elements, as shown in Figure 2.
Some visual representations of our findings are display in the Supplementary Figures 3-13.

No batch effect
We simulated n = 100 observations from x i = M * z i , +e i , with growing p = 1, 000 and 1, 500, where the factors z i ∼ N (0, I q ), the errors e i ∼ N (0, T −1 ) with T −1 = I p , and the loadings M * are set as dense or sparse as in Figure 2. For comparison, FastBFA was initialised as our models via two-step least-squares (Section 4.3). Table 1  mean but performed poorly on the covariance, whereas FastBFA outperformed all the models to estimate the covariance but performed poorly for the mean. However, MOM-SS had a good balance in terms of estimating the expected value and the covariance, being the second best in both cases.
We further illustrate our model under the arguably more interesting case of truly sparse loadings. First we set q = 10 the true cardinality. In this scenario MOM-SS and Normal-SS presented the best results both for mean and covariance. This example reflects the advantages of shrinkage and the varimax rotation for the initialisation in the loadings, leading to good sparse solutions. Finally we considered the same scenario with q = 100. LASSO-BIC was best to estimate the mean at the cost of reduced precision in the covariance reconstruction. MOM-SS displayed the lowest error for the covariance and second smallest for the mean, showing a good balance between those metrics.
In general, MOM-SS achieved a good balance between estimating the mean, which is useful for dimensionality reduction, and sparse covariance estimation. Recall that we used a coordinate descent algorithm for the non-local prior, which as a potential drawback could require a larger number of iterations than performing jointly optimising multiple elements in M . However,

Batch effect
We evaluate our method in our main setting of interest where there are mean and variance batch effects. We emphasise that, to our knowledge, there are no other model-based methods to learn the latent structure while correcting for non-biological variance simultaneously. Thus, this is not a fair comparison with the competing methods but rather an illustration of how much inference when not properly accounting for batch effects in a model-based manner.
We simulated data with a mean and variance batch effect, sample size n = 200 and growing p = 250 or p = 500. We set q * = 10, p v = 1 and p b = 2 batches and considered the truly sparse and dense loadings M * in Figure 2. Factors z i were drawn from N (0, I q ), errors e i from N (0, T −1 b i ), where τ −1 j1 = 0.5 and τ −1 j2 = 1.5τ −1 j1 for j = 1, . . . , p; v i from a continuous Uniform(0,3) and b i from a discrete Uniform{0,1}. We set the first p/2 values of θ * ∈ R p to -2 and the other p/2 to 2 and β * j1 = 0, β * j2 = 2 for j = 1, . . . , p we fixed to 2 for the first batch and 0 for the second. We compared our models with FastBFA and LASSO-BIC

Applications to cancer datasets
We applied our method to two high-dimensional cancer datasets, related to ovarian and lung cancer. For the ovarian cancer we combined information from two datasets from the package curatedOvarianData 1.16.0. The first was the Illumina Human microRNA array expression dataset E.MTAB.386, formed by Angiogenic mRNA and microRNA gene expression signature with n 1 = 129 patients (Bentink et al., 2012). The second was the NCI-60 GEO dataset GSE30161 and consisted of multi-gene expression predictors of single drug responses to adjuvant chemotherapy in ovarian carcinoma for n 2 = 52 patients (Ferriss et al., 2012). For the lung cancer, we used microarray and mRNA-array, data from two different high-throughput platforms: Affymetrix Human Genome U133A 2.0 Array with n 1 = 133 patients and Affymetrix Human Exon 1.0 ST Array with n 2 = 112 (Wan et al., 2016).
We considered two main tasks: to give a visual representation of the latent factors of the data, i.e. an unsupervised dimension reduction task and a supervised survival analysis using the factors obtained in our method as predictions. Prior to our analyses, we selected the 10% genes with highest total variance across all samples obtaining p = 1, 007 for ovarian and p = 1, 198 for lung. All data sets have been normalised to zero mean and unit variance. We included the age at initial pathologic diagnosis as a covariate.

Unsupervised: Data visualisation
Our first goal was to demonstrate the usefulness of our method as a data visualisation tool. We remark that there are no other model-based approaches to jointly adjust for batch effects and estimate latent factors. Thus, for comparison we first corrected the data using ComBat and then estimated the parameters via MLE akin to Section 5.1.2. To decide the number of factors for ComBat-MLE, we carried a principal component analysis to the corrected data prior to factor analysis and chose a number of componentsq that explained 90% or 70% of the total variance.
It is important to notice that we are doing a over-optimistic assessment of ComBat-MLE as we are doing a cross-validated factor analysis over the whole ComBat-corrected data, as opposed to running ComBat in an out-of-sample fashion.

Supervised: Survival analysis
We also illustrate the potential of our method as a surpervised tool, performing a survival analysis that aims to predict the time until death. To do that, we applied a Cox proportional hazards model (Cox, 1972) using as covariates the latent coordinates obtained in our models. We used the coxph function of the R package survival 2.38 (Therneau, 2015). We then used the concordance index to asses the quality of our predictions. This index is a non-parametric metric to quantify the power of a prediction rule via a pair-wise comparison that measures the probability of concordance between the predicted and the observed survival time (Harrell Jr. et al., 1982).
To obatain the concordance index we used the function concordance.index in the R package survcomp (Schröeder et al., 2011). The presented results are from 10 independent runs of 10-fold cross-validation.
We initialised MOM-SS with the values obtained for the Flat model along with the other initialisations discussed in Section 4.3 and chose the one with smallest leave-one-out cross-validated concordance index.
For the ovarian cancer data set, Table 3 shows that MOM-SS achieved a concordance index similar to ComBat-MLE 90% with considerably less factors (4 instead of 101) and a bit higher than Normal-SS. In the lung data MOM-SS achieved a high concordance index, particularly relative to Normal-SS and ComBat-MLE 70%. Interestingly, in these data a non-sparse solution is recoverd, and this results in an increase in predictive accuracy. ComBat-MLE despite having a good performance, its concordance index proved to be sensitive to the number of factors.  We also remark that our novel MOM-SS and its closed-form EM updates can be extended to frameworks of interest beyond factor models such as: linear regression, multivariate regression models, generalised linear models as well as graphical models.
It should be emphasised that our model assumes common factors across the datasets being integrated. Thus, another direct extension for future research is to consider more complex settings where some of the factors differ across data sources or where one wishes to integrate datasets by adding variables (as opposed to adding individuals as we did here), or where potentially same variables were only recorded for a subset of the individuals.
Roots are m jk : If f (m jk ) − f (m jk ) > 0 then the global max ism jk , else the global max is m jk . After trivial For ease of notation let z = √ b 2 − 16ac. Note that z > 0 and that, since a < 0, c > 0, Proof. Our purpose is to find the maximum of f (m jk ) = am 2 jk + bm jk + c|m jk |, where a < 0, and c < 0. Setting ∂Q 1 ∂m jk = 0, we obtain ∂Q 1 ∂m jk = 2am jk + b + c · sign(m jk ) = 0.
The roots of this polynomial are . Note that (b + c) 2 − 16ad > |b + c| since a < 0 and d > 0. Hence, the only acceptable root is m + jk := as the other one is negative.
The roots of this polynomial are Hence, the only acceptable root is m − jk := < 0, as the other one is positive.
• Suppose b = 0. Then clearly f (m jk ) = f (−m jk ) for all m jk , i.e. the function is even.
Therefore, m + jk and m − jk are opposite and both arg maxima.
To maximise (34) the EM algorithm make use of complete-data log-posterior associated to For simplicity we will denote by p(Z |∆, We first outline the E-step, which is based on taking the expectation of Expression (35) with respect to p(Z |∆, X), namely: where C is a constant, and we have defined as usual T l := diag(τ 1l , . . . , τ pl ) and l i to be the unique l = 1, . . . , p b such that b il = 1.
Expression (36) only depends on Z through the conditional posterior mean and the conditional second moments The M-step consists in maximising Equation (36) with respect to ∆. To this end, we set its partial derivatives to 0, as shown below.
The maximum of the j th row of matrix M can be found solving (39) as: for j = 1, . . . , p.
Maximisation of T for a fixed batch l is obtained by taking the derivative with respect to T l Solving Equation (41) and using the diagonal constraint we obtain: To maximise with respect to (θ, β) we set Taking the j th row of matrix (θ,β) and solving Equation (43): Equation (44) has the form of a ridge regression estimator with penalty ψ, inducing an equal shrinkage to each coefficient of (θ,β).

B EM algorithm under Normal-SS
Akin to the Flat prior, we first take the expectation of the complete-data log-posterior with respect to the latent variables and conditioning upon the current ∆ = (M, θ, β, T , ζ): Due to the conjugate Normal-SS hierarchical construction, Expression (45) can be split in order to simplify the EM algorithm as Q(∆) = C + Q 1 (θ, M, β, T ) + Q 2 (ζ), where: The E-step for Q 1 resembles the one for the flat prior model shown in Supplement A, plus an extra conditional expectation: withp jk = p(γ jk = 1 |∆) given bŷ The first and second moments E[z i |∆, X] and E[z i z i |∆, X] respectively are given in Supplement A.
In the M-step we proceed by optimising Q 1 and Q 2 independently, in 2 steps: a maximisation of Q 1 with respect to M ,T l and (θ, β), followed by a maximisation of Q 2 with respect to ζ.
Setting to 0 the partial derivative with respect to M gives: with D γ ∈ R p×q , d jk =

1
(1−γ jk )λ 0 +γ jk λ 1 and A • B being the Hadamard (element-wise) product of two matrices A and B. Taking the j th row of matrix M and solving equation (48) we obtain: for j = 1, . . . , p. Updates forT l and (θ,β) are the same ones given in Supplement A. Finally, Solving Equation (49) and substituting E[γ jk |∆]:

C EM algorithm under MOM-SS
Analogous to Normal-SS, we first take the expected complete-data log-posterior Q(∆) = C + Q 1 (θ, M, β, T b i ) + Q 2 (ζ). By construction Q 2 is of the same form than in Equation (47) and Q 1 is given by For the M-step of the loadings, we consider using a coordinate descent algorithm (CDA) that leads to closed-form expressions for m jk . The partial derivative of (51) is: with D γ ∈ R p×q , d jk = ((1 − γ jk )λ 0 + γ jk λ 1 ) −1 ,M inv a matrix with elements 1/m jk and A • B being the Hadamard (element-wise) product of two matrices A and B.
Finally, the updates forT l , (θ,β) j andζ k are equivalent to the ones obtained for Normal-SS.

D EM algorithm under Laplace-SS
Now, the part of the expected comple-data log-posterior Q 1 for the Laplace-SS is In Q 2 the conditional expectations E[γ jk |∆] =p jk iŝ The M-step update for M is obtain by setting to 0 the partial derivative with respect to M , for m jk = 0 with D γ,M ∈ R p×q with element jk: d γ,M jk = sign(m jk )E[d jk |∆]. Taking the partial derivative of (54) with respect to m jk , when m jk = 0 and setting it to 0 we obtain:

E EM algorithm under Laplace-MOM-SS
Finally, Q 1 for Laplace-MOM-SS is given by: The new conditional expectation for the inclusion probability E[γ jk |∆] =p jk iŝ For the M-step of the loadings, we consider using a coordinate descent algorithm (CDA) that performs successive univariate optimisation with respect to each m jk .
Notice that when m jk = 0, the value of Q(m jk = 0) = −∞, thus the solution for the optimisation is given by setting ∂Q 1 ∂m jk = 0. The partial derivative of (57) w.r.t. M is with D γ,M ∈ R p×q with element jk: d γ,M jk = sign(m jk )E 1 (1−γ jk )λ 0 +γ jkλ1 |∆ ,M inv a matrix with elements 1/m jk and A • B being the Hadamard (element-wise) product of two matrices A and

B.
Taking the partial derivative of (57) with respect to m jk , when m jk = 0 and setting it to 0 we obtain:

F Weighted 10-fold cross-validation
We aim to a pseudo-code-algorithm for the weighted 10-fold cross-validation used through this paper.
Algorithm 3 i − (θv [r] i +Mẑ i +βb [r] i ) T b i || F end set X = X 10 G Plots simulations of truly sparse loadings M scenario and no batch effect The following plots present the heatmaps of the reconstruction ofM , Cov(x i | ·) −1 andγ for the scenario with truly sparse loadings M , setting q = 100 and without batch effects.

H Plots simulations dense loadings M and no batch effect
The following plots show the heatmaps of the reconstruction ofM , Cov(x i | ·) −1 andγ for the scenario with dense loadings M , setting q = 100 and without batch effects.