Biclustering via Semiparametric Bayesian Inference

Abstract
Motivated by classes of problems frequently found in the analysis of gene expression data, we propose a semiparametric Bayesian model to detect biclusters, that is, subsets of individuals sharing similar patterns over a set of conditions. Our approach is based on the well-known plaid model by Lazzeroni and Owen (2002). By assuming a truncated stick-breaking prior we also find the number of biclusters present in the data as part of the inference. Evidence from a simulation study shows that the model is capable of correctly detecting biclusters and performs well compared to some competing approaches. The flexibility of the proposed prior is demonstrated with applications to the analysis of gene expression data (continuous responses) and histone modifications data (count responses).


Introduction
Assume we record measurements {y ij } corresponding to a sample of i = 1, . . . , n individuals (e.g. genes) on each of j = 1, . . . , J conditions. The data structure can thus be summarized in matrix form. Assume also the main interest is in identifying subsets of individuals sharing consistent patterns over a subset of conditions. In other words, we are concerned with the detection of biclusters (Pontes et al., 2015). Such problems arise frequently in the analysis of gene expression data (Tanay et al., 2002).
Several methods have been proposed in the literature to tackle the problem of finding biclusters in data coming in this matrix format. Biclustering was first discussed by Hartigan (1972) in the context of creating a method for simultaneously grouping rows and columns of a matrix (coining for this the term direct clustering). A generic way of approaching this problem consists of clustering analysis, in which individuals and conditions are each separately partitioned and the combined subsets later assessed. This idea could be criticized because of the potentially large number of clusters being created, including possibly meaningless subsets, and also because no overlapping is allowed among selected subsets. Nevertheless, this has been traditionally a common way to tackle the problem of biclustering. Approaches based on this idea can be found in Getz et al. (2000) and in Tang and Zhang (2005). The case of latent block models (LBM, see, e.g. Govaert and Nadif, 2014) is one such instance, where independent partitions are considered for rows and columns of the data matrix. These partitions are inferred from the data in a model-based fashion. For a review of computational approaches to LBM see also Bouveyron et al. (2019). A different approach was considered by Cheng and Church (2000), who proposed an algorithm based on identifying submatrices with similar entries, as measured by a mean squared residue. Tanay et al. (2002) proposed SAMBA (Statistical-algorithmic method for bicluster analysis), which is based on viewing the data structure as a bipartite graph with edge weights assigned according to a certain probability model, in such a way that heavy subgraphs coincide with biclusters with high probability. Factor analysis for bicluster acquisition (FABIA), developed by Hochreiter et al. (2010), considers instead a class of multiplicative models that exploits a sparse factorization of the data matrix allowing for heavy-tailed data, paired with a model selection approach to detect biclusters under a Laplace prior that enforces sparsity. For a recent review on these and other biclustering methods for gene expression data, see Pontes et al. (2015), and for the case of biclustering discrete multivariate data, see Fernández et al. (2019).

* AM was partially funded by a Discovery grant 2019-05444 from the Natural Sciences and Engineering Research Council of Canada, and by a Fundamental Research Projects grant PRF-2017-20 from the Institute for Data Valorization (IVADO) of Quebec, Canada. FQ was partially supported by grant FONDECYT 1180034 and by ANID Millennium Science Initiative Program NCN17 059.
Other model-based approaches devised for detection of biclusters are available in the literature. One traditional alternative (on which the approach to be discussed later is based) is the plaid model (Lazzeroni and Owen, 2002). In the plaid model, the expected value for each y ij is computed by summing contributions from each bicluster, and biclusters need not be disjoint, i.e. they may have nonempty overlapping. An improvement of their original algorithm for finding biclusters is described in Turner et al. (2005a). Extensions of this approach were considered, with different forms of prior construction, in Zhang (2010), Chekouo and Murua (2015a), and Chekouo et al. (2015). Under a parametric prior framework for the plaid model, Caldas and Kaski (2008) discuss an efficient implementation using a collapsed Gibbs sampler. Gu and Liu (2008) present an approach that, based on identifiability considerations, restricts biclusters to overlap on either rows or columns. Many of these approaches involve specifying a fixed number of biclusters. A different approach that does not involve this restriction was discussed in Xu et al. (2013). They proposed a nonparametric Bayesian Poisson model for histone modifications (HMs) by means of a zero-enriched Pólya urn model (Sivaganesan et al., 2011) that allows for clustering of HMs and also accounting for the fact that many HMs are idle and play no role for clustering. They also allow for each subset of HMs to define their own clustering of genomic locations, thus constructing a nested structure of biclusters. We stress here that by construction, the nested biclustering structure does not allow for overlapping. Ni et al. (2020) proposed a model for feature allocation that can also produce overlapping biclusters of patient-disease and symptom-disease, but using a completely different approach than ours, based on matrix factorizations. 
Their model is in reality specifically designed for the case of categorical entries in the data matrix. Li et al. (2020) described a mixed effects model for periodontal data that takes into account the spatial configuration of teeth, using a non-overlapping bicluster construction that features repulsion to induce sparsity by way of a determinantal point process. Ren et al. (2020) propose a method that identifies biclusters of patients sharing similar patterns over time, with applications to monitoring data on 24-hr ambulatory blood pressure. They construct the biclusters by means of a Dirichlet process prior at the level of patients and a baseline distribution with time change points. And recently, Zhou et al. (2021) consider Dirichlet-multinomial mixtures and matrix factorizations for the analysis of taxon abundance data.
In this work, we revisit the plaid model with a nonparametric prior that allows us to detect biclusters that may have nonempty intersections, while at the same time lifting the restriction of fixing a priori the number of biclusters. At the heart of the plaid model definition is the introduction of two binary matrices (details will be given later in Section 2) that indicate whether genes and conditions form part of a certain bicluster. These matrices also imply a null cluster that has the purpose of concentrating the background noise. Our hierarchical model construction specifies the prior distribution of these matrices by way of two separate stick-breaking processes on latent variables that define these binary quantities via thresholding. Furthermore, we model the collection of latent variables defining the binary matrix for genes by way of a conditional autoregressive specification. This allows us to introduce information on which genes are a priori more likely to be part of a given bicluster. A byproduct of the double stick-breaking construction is that we can infer the number of biclusters. In practice, we assume a certain maximum K of biclusters, which can be large enough to pose no practical restriction. At the same time, we introduce a penalization term that discourages the formation of very large biclusters that may otherwise be detected. Large biclusters are usually hard to interpret and, many times, meaningless. The prior construction is general and can be coupled with specific sampling models such as a Gaussian, unimodal or Poisson likelihood, depending on particular needs and/or data features. We illustrate and implement all these instances, and compare results with other competitor approaches that were designed for data structures matching our inferential target. The comparisons are carried out in the context of extensive simulations studies and also with datasets already analyzed elsewhere.
The main novelties of this paper can then be summarized as follows: (1) we free the traditional plaid model from the restriction of a pre-specified number of biclusters, while allowing for a wide range of possible sampling models to accommodate for various data formats; (2) we define the binary indicator matrices by way of a novel approach based on thresholding a double stick-breaking prior that allows us to provide inference on the number and conformation (i.e. genes and conditions) of possibly overlapping biclusters; and (3) we introduce a penalty prior that controls the size of biclusters. This paper is organized as follows. Section 2 describes the proposed hierarchical model and the corresponding prior construction. The model has several components, and we describe each of these, discussing their role and impact on the resulting inference. A detailed posterior simulation strategy is described in an accompanying Supplementary Materials (Murua and Quintana, 2021) file, and for convenience, a summary of some of its less standard aspects is presented in Section 2.3. A simulation study aimed at evaluating and comparing model performance is described in Section 3. Data illustrations for both continuous and count outcomes are described in Section 4. The article concludes with a discussion in Section 5.

The modeling approach
The proposed model is of hierarchical type and builds on the plaid model. As discussed below, the sampling model adopts either an additive or multiplicative form depending on the type of data available. The prior involves a double stick-breaking construction that is used to overcome the restrictions posed by fixing a priori the number of biclusters.
In what follows, we specify all of the model components, discussing their role for the desired inference problem, namely, uncovering meaningful biclusters.

Sampling model
We start our discussion with the definition of the mean response, and then proceed to the specification of sampling distribution. We deal here mainly with the continuous and count response cases, which are the most frequently found in practical applications from fields such as genetics.

Plaid model for signals in sampling model
Let $y_{ij}$ denote the score collected for individual (say, gene) $1 \le i \le n$ and condition $1 \le j \le J$. The data structure is then of matrix form, with individuals typically represented as rows and conditions as columns. We consider below the case of continuous and count scores, which may naturally arise in various applications. Our approach for the detection of biclusters, i.e., subsets of individuals exhibiting similar behavior across a subset of conditions or columns, is based on the plaid model introduced by Lazzeroni and Owen (2002). In the continuous response case, this model expresses each $y_{ij}$ as the sum of signals coming from potentially several biclusters plus a noise term. To explain the main idea, we start by introducing the bicluster indicators. Let $\rho = \{\rho_{i\ell}\}$ and $\kappa = \{\kappa_{j\ell}\}$ be matrices with binary entries defining the biclusters in the following way: gene $i$ and condition $j$ are part of bicluster $1 \le \ell \le K$ provided that $\rho_{i\ell} = \kappa_{j\ell} = 1$. Note that a maximum number $K$ of possible biclusters is assumed (this number can be taken large enough to pose no practical restriction). Also define $\gamma_{ij} = \prod_{\ell=1}^{K}(1 - \rho_{i\ell}\kappa_{j\ell})$, and note that $\gamma_{ij} = 1$ only when gene $i$ and condition $j$ are not part of any bicluster. We refer to this cluster as the null cluster, representing what we term the noise present in the data. The null cluster was first introduced by Chekouo and Murua (2015a), and was also used in Chekouo et al. (2015). Xu et al. (2013) also used a similar null cluster, but they defined it as a collection of subclusters nested inside larger biclusters. The plaid model assumes a sampling model where responses are conditionally independent given the bicluster matrices $\rho$ and $\kappa$. In our work, depending on the nature of the data, the plaid model is either an additive or a multiplicative model, the latter case arising as an additive specification in the log-scale. Details follow.
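To fix ideas, the indicator bookkeeping just described can be sketched in a few lines of Python; the matrices below are small hypothetical examples, not data from the paper.

```python
from itertools import product

# Toy dimensions: n genes, J conditions, K candidate biclusters (illustrative only).
n, J, K = 4, 3, 2

# rho[i][l] = 1 if gene i belongs to bicluster l; kappa[j][l] likewise for conditions.
rho   = [[1, 0], [1, 1], [0, 1], [0, 0]]
kappa = [[1, 0], [0, 1], [0, 0]]

def in_bicluster(i, j, l):
    """Cell (i, j) contributes to bicluster l iff rho_il = kappa_jl = 1."""
    return rho[i][l] * kappa[j][l]

def gamma(i, j):
    """Null-cluster indicator: gamma_ij = prod_l (1 - rho_il * kappa_jl)."""
    g = 1
    for l in range(K):
        g *= 1 - in_bicluster(i, j, l)
    return g

# Cells belonging to no bicluster form the null (background) cluster.
null_cells = [(i, j) for i, j in product(range(n), range(J)) if gamma(i, j) == 1]
```

Note that biclusters may overlap: cell (1, 0) above belongs to bicluster 1, while row 1 also participates in bicluster 2 through column 1.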

Sampling model for continuous responses:
In the case of continuous data, the mean response under the plaid model, $\mu_{ij} \doteq E(y_{ij})$, is assumed to be additive in the row and column effects. Specifically, it is expressed as
$$\mu_{ij} = \sum_{\ell=1}^{K} (\mu_\ell + \alpha_{i\ell} + \beta_{j\ell})\,\rho_{i\ell}\,\kappa_{j\ell} + \mu_N\,\gamma_{ij}, \qquad i = 1, \ldots, n, \; j = 1, \ldots, J. \tag{2.1}$$

Sampling model for count responses:
In the case of count responses (e.g., Poisson data), the mean response in the plaid model is assumed to have multiplicative row and column effects (or equivalently, additive in a logarithmic scale). It is expressed as
$$\mu_{ij} = \exp\Big\{\sum_{\ell=1}^{K} (\mu_\ell + \alpha_{i\ell} + \beta_{j\ell})\,\rho_{i\ell}\,\kappa_{j\ell} + \mu_N\,\gamma_{ij}\Big\}, \qquad i = 1, \ldots, n, \; j = 1, \ldots, J. \tag{2.2}$$
The rightmost term in (2.1) or (2.2) represents the background noise or null cluster, i.e., everything that does not belong to a bicluster. Under either case, the sampling level parameters $\alpha_{i1}, \ldots, \alpha_{iK}$ and $\beta_{j1}, \ldots, \beta_{jK}$ are not identified and need to be constrained. A standard procedure for doing so consists of imposing the restrictions
$$\sum_{i=1}^{n} \alpha_{i\ell}\rho_{i\ell} = 0 \quad\text{and}\quad \sum_{j=1}^{J} \beta_{j\ell}\kappa_{j\ell} = 0, \qquad \ell = 1, \ldots, K.$$
This constraint can be easily implemented with a change of variables. Concretely, write $\alpha_\ell = (\alpha_{1\ell}, \alpha_{2\ell}, \ldots, \alpha_{n\ell})'$, and denote the number of rows in bicluster $\ell$ as $r_\ell = \sum_{i=1}^{n} \rho_{i\ell}$. The constraint on $\alpha_\ell$ can be expressed as $\alpha_\ell = V_\ell a_\ell$, for an $n$-dimensional vector $a_\ell$ and an $n \times n$ matrix $V_\ell$ whose components are given by $(V_\ell)_{ii} = \rho_{i\ell}(1 - 1/r_\ell)$, and $(V_\ell)_{ij} = -\rho_{i\ell}\rho_{j\ell}/r_\ell$ if $i \neq j$. In other words, $V_\ell = \mathrm{diag}(\rho_\ell) - \rho_\ell\rho_\ell'/r_\ell$, where $\rho_\ell = (\rho_{1\ell}, \ldots, \rho_{n\ell})'$. It is easily seen that with these transformations, $\sum_{i=1}^{n} \alpha_{i\ell}\rho_{i\ell} = \rho_\ell'\alpha_\ell = 0$, for all $\ell$. So, instead of working with $\alpha_\ell$, we can work with $a_\ell$. In a similar manner, we can replace the vector $\beta_\ell$ with a vector $b_\ell$, so that $\beta_\ell = W_\ell b_\ell$, where $(W_\ell)_{jj} = \kappa_{j\ell}(1 - 1/c_\ell)$, and $(W_\ell)_{jj'} = -\kappa_{j\ell}\kappa_{j'\ell}/c_\ell$ if $j \neq j'$, and where $c_\ell = \sum_{j=1}^{J} \kappa_{j\ell}$ is the number of columns in bicluster $\ell$. Prior choices for these parameters are discussed below.
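As a quick sanity check on this change of variables, the sketch below (pure Python, with hypothetical illustrative vectors) builds $V_\ell$ entrywise and verifies that $\alpha_\ell = V_\ell a_\ell$ satisfies the sum-to-zero constraint.

```python
# Sketch of the change of variables enforcing sum_i alpha_il * rho_il = 0.
rho_l = [1, 1, 0, 1, 0]            # illustrative row-membership column of rho
r_l = sum(rho_l)                   # number of rows in bicluster l

def V_entry(i, k):
    """(V_l)_ik = rho_il * (1 - 1/r_l) if i == k, else -rho_il * rho_kl / r_l."""
    if i == k:
        return rho_l[i] * (1.0 - 1.0 / r_l)
    return -rho_l[i] * rho_l[k] / r_l

def constrain(a):
    """alpha_l = V_l a_l: map unconstrained effects onto the constrained space."""
    n = len(a)
    return [sum(V_entry(i, k) * a[k] for k in range(n)) for i in range(n)]

a = [2.0, -1.0, 5.0, 0.5, 3.0]     # arbitrary unconstrained effects a_l
alpha = constrain(a)
check = sum(alpha[i] * rho_l[i] for i in range(len(a)))   # numerically zero
```

In effect the transform centers the effects within the bicluster rows and zeroes out rows with $\rho_{i\ell} = 0$.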

Error distribution for sampling model
So far we have defined the mean of the signals through (2.1) or (2.2), depending on the data type under consideration. To properly define a sampling model, we consider two versions of the likelihood function: Gaussian and Poisson. The first is appropriate for signals on a continuous scale, while the second is suitable for count data such as in the histone modifications (HMs) example discussed in Section 4. In defining the sampling model we recall that the distribution of observed responses should combine the signals from potentially overlapping biclusters with those coming from the background noise. The likelihood models we use are as follows. (a) Gaussian likelihood model: In this case we combine the Gaussian signals with three different specifications for the noise term that we have found useful in practice. These choices result in the sampling models $p(y_{ij} \mid \rho, \kappa)$ specified in Table 1, and are motivated by the need to assess sensitivity to specific distributional assumptions for the background noise. The third choice models the noise as a unimodal distribution based on the Rayleigh distribution. More precisely, by employing the representation of unimodal distributions derived by Khintchine (1938), and following the work of Paez and Walker (2018), any density $f_Y(y)$ that is symmetric and unimodal about $\mu$ can be expressed as
$$f_Y(y) = \int_0^\infty \frac{1}{2s}\, I\{|y - \mu| < s\}\, f_S(s)\, ds,$$
that is, as a mixture of uniform distributions on $(\mu - s, \mu + s)$ with mixing density $f_S$. Even though the Khintchine representation just described is general, we are mainly interested in assessing how departures from normality assumptions for the background noise component affect the likelihood model. We thus consider a parametric mixture specification in the above unimodal representation and leave the nonparametric aspects of the model for the definition of the prior distribution of the binary matrices $\rho$ and $\kappa$, as detailed below. Under this view, we have found in practice the Rayleigh distribution to be a useful choice of mixing distribution. This is because by taking $f_S(s) = (s/\sigma_{0e}^2) \exp(-s^2/[2\sigma_{0e}^2])$ we get the closed-form expression
$$f_Y(y) = \sqrt{\pi/2}\; \sigma_{0e}^{-1}\, \big\{1 - \Phi\big(|y - \mu|/\sigma_{0e}\big)\big\},$$
which simplifies to what is indicated in Table 1. The prior specification of this part of the model is completed by assuming standard distributions for the mean and variance parameters; see (2.5)-(2.7).
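The closed form arising from the Rayleigh mixing choice can be checked numerically. The sketch below assumes the symmetric Khintchine form with mode $\mu = 0$ and an arbitrary value of $\sigma_{0e}$, and verifies that the resulting density integrates to one.

```python
import math

def Phi(t):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def unimodal_rayleigh_pdf(y, mu=0.0, sigma=1.0):
    """Density obtained by mixing Uniform(mu - s, mu + s) over s ~ Rayleigh(sigma):
       f(y) = sqrt(pi/2) / sigma * (1 - Phi(|y - mu| / sigma))."""
    return math.sqrt(math.pi / 2.0) / sigma * (1.0 - Phi(abs(y - mu) / sigma))

# Numerical check that the density integrates to one (simple Riemann sum).
lo, hi, m = -30.0, 30.0, 120000
h = (hi - lo) / m
total = sum(unimodal_rayleigh_pdf(lo + k * h, sigma=2.0) for k in range(m + 1)) * h
```

The density is symmetric and decreasing in $|y - \mu|$, with heavier shoulders than a normal with comparable scale.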

(b) Poisson likelihood model:
In this case, we have found it useful to assume that a realization in the null cluster comes from a beta negative binomial distribution with parameters $(r, \alpha_N, \beta_N)$. Under this assumption, the density of an observation $y$ in the null cluster is given by
$$p(y) = \frac{\Gamma(r + y)}{y!\,\Gamma(r)}\, \frac{B(\alpha_N + r, \beta_N + y)}{B(\alpha_N, \beta_N)}, \qquad y = 0, 1, 2, \ldots,$$
where $B(\cdot,\cdot)$ stands for the beta function. We suppose that $\beta_N$ is fixed (for example, we can fix $\beta_N$ to a small positive value). The likelihood model for count data then combines Poisson signals for cells in biclusters with this null-cluster distribution. Unlike the previous Gaussian case, we find it convenient to assume shifted log-gamma priors on the bicluster means $\{\mu_\ell\}$ and the unconstrained row and column effect coefficients $\{a_{i\ell}\}$ and $\{b_{j\ell}\}$, centered via the digamma function $\psi(\cdot)$, e.g. $\mu_\ell = \log G_\ell - \psi(\beta_G)$ with $G_\ell \sim \mathrm{Gamma}(\beta_G, 1)$, where $\beta_G$ is a constant. Note that this choice gives a zero-mean prior to each of the $\{\mu_\ell\}$, $\{a_{i\ell}\}$ and $\{b_{j\ell}\}$ parameters inside the exponential expression in (2.2).
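The beta negative binomial null-cluster density is easy to evaluate on the log scale. The following sketch (with arbitrary illustrative parameter values) implements the pmf above via log-gamma functions and checks numerically that it sums to one.

```python
import math

def bnb_log_pmf(y, r, a, b):
    """log P(Y = y) for the beta negative binomial with parameters (r, a, b):
       P(Y = y) = Gamma(r+y) / (y! Gamma(r)) * B(a + r, b + y) / B(a, b)."""
    def log_beta(x, z):
        return math.lgamma(x) + math.lgamma(z) - math.lgamma(x + z)
    return (math.lgamma(r + y) - math.lgamma(y + 1) - math.lgamma(r)
            + log_beta(a + r, b + y) - log_beta(a, b))

# Sanity check: probabilities sum to (almost) one for illustrative parameters.
# The tail decays like y^{-(a+1)}, so a finite truncation suffices here.
r, a_N, b_N = 2.0, 5.0, 3.0
total = sum(math.exp(bnb_log_pmf(y, r, a_N, b_N)) for y in range(20000))
```

Working on the log scale avoids overflow in the gamma and beta functions for large counts.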

Hierarchical prior structure
An essential component of the plaid model is the pair of binary matrices indicating bicluster membership. Recall that gene $i$ and condition $j$ form part of bicluster $1 \le \ell \le K$ provided that $\rho_{i\ell} = \kappa_{j\ell} = 1$. The prior distribution we construct, explained in detail below, has three distinct goals: (1) to allow for overlapping biclusters by adopting the plaid model, but without a pre-specified number of biclusters; (2) to express prior beliefs on how genes or rows interact with each other when forming biclusters; and (3) to use information on the expected proportion of data that will form the biclusters, without restricting the number of detected biclusters in any way. The third goal is specifically designed to help avoid the detection of too many (potentially spurious or hard to interpret) biclusters. We describe all these elements next.

Prior for matrix κ
We assume first a maximum number $K$ of biclusters, taken to be large enough so that it does not restrict the desired inference in any practical way. Next, the probabilities of bicluster membership are defined by way of a stick-breaking prior (Ishwaran and James, 2001) truncated at $K$, which frees us from the restriction just described. The main idea here is to use an ordered set of latent thresholds (the $\zeta_\ell$'s below), computed by deterministically transforming the probability masses (the $t_\ell$'s below) induced by a stick-breaking process. This way, elements are assigned to biclusters according to a thresholding process that employs a collection of latent scores (the $T_j$'s below). Specifically, for the case of the column membership variables $\kappa$ we assume a truncated construction of the form
$$t_\ell = U_\ell \prod_{m < \ell} (1 - U_m), \qquad U_\ell \mid M_c \overset{iid}{\sim} \mathrm{Beta}(1, M_c), \quad \ell = 1, \ldots, K.$$
Setting $U_K = 1$ we have the same weight distribution as a truncated version of the Dirichlet process (Sethuraman, 1994) conditional on $M_c$, and in particular, $P\big(\sum_{\ell=1}^{K} t_\ell = 1 \mid M_c\big) = 1$. Based on the previously constructed stick-breaking process we specify the latent thresholds as $\zeta_1 = 0 = \Phi^{-1}(1/2)$ and
$$\zeta_{\ell+1} = \Phi^{-1}\Big(\frac{1 + t_1 + \cdots + t_\ell}{2}\Big), \qquad \ell = 1, \ldots, K - 1,$$
where $\Phi^{-1}(\cdot)$ denotes the inverse of the standard normal CDF. The binary matrix $\kappa$ is then defined by thresholding the latent scores, $\kappa_{j\ell} = I\{\zeta_\ell \le T_j < \zeta_{\ell+1}\}$ (with $\zeta_{K+1} = \infty$), where $I\{A\}$ denotes the indicator function of the set $A$. This defines a categorical assignment of bicluster memberships with values ranging from 1 to $K$, and with a stochastic reduction in the underlying probabilities, to discourage the formation of small spurious biclusters.
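A minimal sketch of this construction follows: truncated stick-breaking weights with $U_K = 1$, mapped to ordered thresholds through normal quantiles of the cumulative weights. The quantile map written out here is one natural reading of the construction, and the inverse CDF is implemented by bisection to keep the sketch self-contained.

```python
import math, random

def inv_Phi(p):
    """Inverse standard normal CDF via bisection (adequate for a sketch)."""
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if 0.5 * (1.0 + math.erf(mid / math.sqrt(2.0))) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def truncated_stick_breaking(K, M, rng):
    """Weights t_1, ..., t_K with U_l ~ Beta(1, M) and U_K = 1 (truncation)."""
    t, remaining = [], 1.0
    for l in range(K):
        u = 1.0 if l == K - 1 else rng.betavariate(1.0, M)
        t.append(u * remaining)
        remaining *= 1.0 - u
    return t

rng = random.Random(7)
K, M_c = 10, 2.0
t = truncated_stick_breaking(K, M_c, rng)

# Thresholds: zeta_1 = 0 = inv_Phi(1/2), zeta_{l+1} = inv_Phi((1 + t_1 + ... + t_l) / 2).
zeta, cum = [0.0], 0.0
for l in range(K - 1):
    cum += t[l]
    zeta.append(inv_Phi((1.0 + cum) / 2.0))
```

Since the weights sum to one and are stochastically decreasing, the thresholds form an increasing sequence, so later biclusters receive less probability mass.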

Prior for matrix ρ
For the gene selection matrix ρ we set up a prior that uses available information describing which genes are more likely to form part of the same group. This information is commonly available in biclustering for genetic problems. See the discussion below. We translate this information into a prior distribution in terms of a neighboring structure, which is a natural way of encoding known relations among genes. We construct the joint prior distribution for ρ by resorting to a sequence of latent multivariate normal random vectors that follow a conditionally autoregressive (CAR) specification, and another stick-breaking process similar to the one describing the prior on κ. These elements are described next.
We start with the first column of $\rho$, i.e. $(\rho_{11}, \ldots, \rho_{n1})'$. Let $Z_1 = (Z_{11}, \ldots, Z_{n1})'$ be a corresponding random vector of latent scores. We assume that
$$\rho_{i1} = I\{Z_{i1} > 0\}, \qquad i = 1, \ldots, n, \tag{2.11}$$
which is a common way to define a set of binary outcomes in terms of a deterministic link involving the latent scores. See, e.g. Albert and Chib (1993). To build dependence in the joint distribution of the binary variables, we consider a multivariate normal distribution, expressed conditionally as
$$Z_{i_1 1} \mid Z_{-i_1 1} \sim N\Big(\sum_{r \neq i_1} q_{i_1 r} Z_{r1},\; 1/d_{i_1}\Big), \tag{2.12}$$
where $Z_{-i_1 1}$ is $Z_1$ with the $i_1$th component removed, $q_{i_1 r}$ is non-zero if and only if $i_1 \sim r$ and $r \neq i_1$, that is, if the distinct genes $i_1$ and $r$ are a priori believed to belong to the same bicluster (more on this below), and $d_i$ is a precision (reciprocal of the variance) parameter. Note then that $q_{ii} = 0$ by definition. Let $Q$ be the $n \times n$ matrix with $Q_{ii} = d_i$ and $Q_{ir} = -d_i q_{ir}$. Besag (1974) proved that the CAR assumption (2.12) defines a valid joint multivariate normal distribution provided that $Q$ is symmetric and positive definite, in which case we have that
$$Z_1 \sim N_n(0, Q^{-1}). \tag{2.13}$$
An additional consideration is that (2.11) implies lack of identifiability under changes of scale of $Z_{i1}$. To overcome this deficiency, we choose $d_i = 1$ for all $1 \le i \le n$ in the definition of $Q$. The prior belief membership relation "$\sim$" induces a graph with vertices $1, \ldots, n$ and edges connecting $i_1$ and $i_2$ if and only if $i_1 \sim i_2$. The degree $\nu_i$ of any vertex $i$ in the graph is given by its neighborhood size, that is, $\nu_i = \#\{r : r \sim i\}$; the matrix $A$ with $A_{ir} = 1$ if $i \sim r$, and zero otherwise, is known as the adjacency matrix. The matrix $L = \mathrm{diag}(\nu_1, \ldots, \nu_n) - A$ is the so-called Laplacian of the graph. When $d_i = 1$ for all $1 \le i \le n$, and $q_{ir} = \nu_i^{-1}$, the matrix $Q$ coincides with the random-walk normalized Laplacian matrix. We work with a slight modification of this matrix by setting $q_{ir} = \lambda/\nu_i$, for $\lambda \in \mathbb{R}$. Thus, our model for $Q$ corresponds to a generalized Laplacian. Moreover, we assume that all degrees $\nu_i$ are the same (that is, that all vertices have the same number of neighbors in the graph).
Our generalized Laplacian is strictly diagonally dominant provided that $|\lambda| < 1$; this condition ensures the invertibility of $Q$. We therefore assume a prior distribution
$$\lambda \sim \mathrm{Beta}(a_\lambda, b_\lambda) \tag{2.14}$$
for this parameter, which in turn defines the prior covariance structure for the latent scores $\{Z_1\}$.
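The diagonal dominance argument can be illustrated directly. The sketch below builds $Q = I - (\lambda/\nu)A$ for a small cycle graph, a hypothetical example in which every vertex has common degree $\nu = 2$, and checks strict diagonal dominance for a value with $|\lambda| < 1$.

```python
# Sketch: generalized Laplacian Q = I - (lambda/nu) * A for a graph where every
# vertex has the same degree nu.  |lambda| < 1 makes Q strictly diagonally
# dominant, hence invertible (and positive definite when symmetric).
n, lam = 8, 0.6

# Cycle graph on n vertices: every vertex has exactly nu = 2 neighbours.
nu = 2
A = [[1 if abs(i - j) in (1, n - 1) else 0 for j in range(n)] for i in range(n)]

Q = [[(1.0 if i == j else 0.0) - lam / nu * A[i][j] for j in range(n)]
     for i in range(n)]

# Strict diagonal dominance: |Q_ii| > sum_{j != i} |Q_ij| on every row.
dominant = all(abs(Q[i][i]) > sum(abs(Q[i][j]) for j in range(n) if j != i)
               for i in range(n))
```

Here each off-diagonal row sum equals $|\lambda| = 0.6 < 1 = Q_{ii}$, so dominance holds on every row.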
The remaining columns of $\rho$ are defined through additional latent scores. Denote these columns by $Z_2, Z_3, \ldots$. We assume these to be i.i.d. with the same distribution as $Z_1$, given in (2.13). Our prior construction aims at encouraging a change in the number of genes participating in subsequent biclusters, with row sizes stochastically sorted in decreasing order, so that the largest biclusters (in number of rows) appear first. To this effect, we move the cutoff in (2.11) and consider a stick-breaking construction of the form
$$w_\ell = V_\ell \prod_{m < \ell} (1 - V_m), \qquad V_\ell \mid M_r \overset{iid}{\sim} \mathrm{Beta}(1, M_r), \quad \ell = 1, 2, \ldots \tag{2.15}$$
As in the case of the column variables $\kappa$, this prior structure implies $P\big(\sum_{\ell=1}^{\infty} w_\ell = 1 \mid M_r\big) = 1$. With this, we assume the additional columns of $\rho$ to still be defined via thresholding, but change (2.11) to
$$\rho_{i\ell} = I\{Z_{i\ell} > \theta_\ell\}, \tag{2.16}$$
where $\theta_1 = 0 = \Phi^{-1}(1/2)$, and
$$\theta_{\ell+1} = \Phi^{-1}\Big(\frac{1 + w_1 + \cdots + w_\ell}{2}\Big). \tag{2.17}$$
Note that, by construction, $0 = \theta_1 < \theta_2 < \theta_3 < \cdots$ and $\theta_\ell \uparrow \infty$ as $\ell \to \infty$ with probability 1.
The above definition does not enforce a maximum number of biclusters as described earlier. To set this maximum number to $K$, we simply truncate the stick-breaking construction as was done when defining the prior for $\kappa$: we let $V_K = 1$, so that $w_1 + \cdots + w_K = 1$ surely, and furthermore set $\theta_{K+1} = \infty$ in (2.16). Of course, we can set a large enough value of $K$ so as to represent no practical limitation on our capacity to detect biclusters.
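The effect of the increasing thresholds can be seen with a few illustrative numbers: the marginal inclusion probability $P(Z > \theta_\ell)$ decreases with $\ell$, while a single row with several large latent scores may belong to several biclusters at once. The threshold and score values below are hypothetical, and the CAR dependence across rows is ignored for simplicity.

```python
import math

def Phi(t):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

# Illustrative increasing thresholds theta_1 = 0 < theta_2 < ... (values assumed).
theta = [0.0, 0.55, 1.10, 1.75, 2.60]

# Under a standard normal latent score, the marginal inclusion probability of a
# row in bicluster l is P(Z > theta_l) = 1 - Phi(theta_l), decreasing in l.
p = [1.0 - Phi(t) for t in theta]

# Overlap: one latent score per (row, bicluster); a row whose scores exceed the
# threshold in several columns of Z belongs to several biclusters.
Z_row = [1.3, 0.8, 1.5, -0.2, 0.1]       # hypothetical scores Z_i1, ..., Z_i5
rho_row = [1 if Z_row[l] > theta[l] else 0 for l in range(len(theta))]
```

This is exactly the mechanism by which early biclusters are stochastically larger in rows than later ones.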

The neighborhood structure of genes
There are several ways to incorporate prior beliefs in the gene graph edges. A simple solution consists of computing distances between genes (e.g., correlation-based or Euclidean distances) and placing an edge between each gene and its $k_{nn}$ nearest neighbors. A more involved alternative uses Lin's pairwise similarities (Lin, 1998; Resnik, 1995) between genes obtained from gene ontologies (Ashburner et al., 2000). These latter distances were used in Chekouo et al. (2015). We adopt here a symmetrized version of the $k_{nn}$ graph. In practice $k_{nn}$ depends on the size of the graph. In our applications and simulations we used Euclidean distances, and set $k_{nn}$ to 30 as recommended by authors using similar graphs (Stanberry et al., 2008; Chekouo et al., 2015).
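A symmetrized $k_{nn}$ graph of this kind can be built directly from pairwise distances. The sketch below uses one-dimensional toy "profiles" and $k_{nn} = 2$ purely for illustration; in practice the rows of the expression matrix and $k_{nn} = 30$ would be used.

```python
# Sketch of the symmetrized k-nearest-neighbour gene graph (Euclidean distance).
genes = [0.0, 0.1, 0.15, 2.0, 2.05, 5.0]   # toy 1-d "expression profiles"
k_nn = 2

def dist(a, b):
    return abs(a - b)

n = len(genes)
edges = set()
for i in range(n):
    # Indices of the k_nn nearest neighbours of gene i (excluding i itself).
    nbrs = sorted((j for j in range(n) if j != i),
                  key=lambda j: dist(genes[i], genes[j]))[:k_nn]
    for j in nbrs:
        # Symmetrization: keep edge i ~ j if either gene selects the other.
        edges.add((min(i, j), max(i, j)))
```

The symmetrization guarantees an undirected graph, which is required for the CAR precision matrix $Q$ to be symmetric.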

A penalty on the number of elements in biclusters
We have found it useful to incorporate into the model a prior belief on the proportion of data that actually forms biclusters, under each of the likelihood models described earlier. This is done so as to avoid the formation of biclusters that are too large when there is little evidence that they are present in the data. In principle, the prior belief is passed as a geometric penalty on the total bicluster size,
$$p(\rho, \kappa) \propto \exp\Big\{-\lambda_p \sum_{i=1}^{n} \sum_{j=1}^{J} \sum_{\ell=1}^{K} \rho_{i\ell}\kappa_{j\ell}\Big\}.$$
However, note that
$$\exp\Big\{-\lambda_p \sum_{i,j,\ell} \rho_{i\ell}\kappa_{j\ell}\Big\} = \prod_{i=1}^{n} \prod_{j=1}^{J} \exp\Big\{-\lambda_p \sum_{\ell=1}^{K} \rho_{i\ell}\kappa_{j\ell}\Big\}.$$
Thus, we may see the penalty as a product of $nJ$ distributions, one per cell $(i, j)$. Taking into consideration the normalizing constants of these probability mass functions, we get $nJ$ identical $\mathrm{Binomial}\big(K, e^{-\lambda_p}/(3 + e^{-\lambda_p})\big)$ distributions for the per-cell counts $\sum_{\ell} \rho_{i\ell}\kappa_{j\ell}$. This result comes from the following counting argument, applied for each pair $(i, j)$. Let $R_m$ denote the set of configurations with $\sum_{\ell} \rho_{i\ell}\kappa_{j\ell} = m$, for $m = 0, \ldots, K$. To have exactly $m$ biclusters for which $\rho_{i\ell}\kappa_{j\ell} = 1$, we need to have $\rho_{i\ell} = 1$ and $\kappa_{j\ell} = 1$ simultaneously exactly $m$ times. For each $\ell = 1, \ldots, K$, among the $2^J 2^n$ configurations of $(\rho_\ell, \kappa_\ell)$ we have $\rho_{i\ell}\kappa_{j\ell} = 1$ in $2^{J-1} 2^{n-1}$ of them, and $\rho_{i\ell}\kappa_{j\ell} = 0$ in the remaining $2^J 2^n - 2^{J-1} 2^{n-1} = 3 \times 2^{J-1} 2^{n-1}$. So $|R_m| = \binom{K}{m} 3^{K-m} \big(2^{J-1} 2^{n-1}\big)^K$, and normalizing $e^{-\lambda_p m} |R_m|$ over $m$ yields the binomial distribution above. The desired penalty is now introduced by assuming independent, identically distributed binomial counts for each point $(i, j)$. With this, the prior on the number of observations in biclusters is specified through a binomial distribution with parameters $nJK$ and $e^{-\lambda_p}/(3 + e^{-\lambda_p})$, that is,
$$\sum_{i=1}^{n} \sum_{j=1}^{J} \sum_{\ell=1}^{K} \rho_{i\ell}\kappa_{j\ell} \sim \mathrm{Binomial}\big(nJK,\; e^{-\lambda_p}/(3 + e^{-\lambda_p})\big).$$
The mean and variance of this distribution depend on the maximum number of biclusters $K$. To lessen this dependency, we replace $\lambda_p$ by $\log K + \lambda_p$. This change results in an expected number of observations in biclusters equal to $nJ e^{-\lambda_p}/(3 + e^{-\lambda_p}/K) \approx e^{-\lambda_p} nJ/3$, and a variance of $3nJ e^{-\lambda_p}/(3 + e^{-\lambda_p}/K)^2 \approx e^{-\lambda_p} nJ/3$ for moderate to large values of $K$. This can also be derived from the Poisson approximation to the binomial, noting that the binomial mean $nJK \cdot \frac{e^{-\lambda_p}/K}{3 + e^{-\lambda_p}/K}$ converges to $e^{-\lambda_p} nJ/3$, which is independent of $K$.
That is, by replacing $\lambda_p$ by $\log K + \lambda_p$ we obtain a distribution on the number of observations in biclusters very close to a Poisson with mean $e^{-\lambda_p} nJ/3$. To choose a reasonable value for $\lambda_p$, we can set the mean of the binomial distribution to a value close to the prior belief on the number of points that form biclusters: $\bar\lambda_p = -\log(3\,EB/(nJ))$, where $EB$ is the number of observations that are expected to form part of biclusters. Alternatively, one can also set a prior on $\lambda_p$, for example, $\lambda_p \sim \mathrm{Exp}(\theta_p)$, with $\theta_p^{-1} = \bar\lambda_p$. In our experiments, this prior works well.
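The calibration of $\lambda_p$ can be sketched as follows, using illustrative values of $n$, $J$, $K$ and the prior guess $EB$; the snippet also checks that, after the $\log K$ shift, the implied binomial mean stays close to $EB$.

```python
import math

# Sketch: pick lambda_p so that the prior expected number of cells in biclusters
# matches a prior guess EB (all values illustrative).
n, J, K = 355, 17, 20
EB = 1000                                  # prior guess: cells inside biclusters

lam_p = -math.log(3.0 * EB / (n * J))      # from EB = e^{-lambda_p} * n * J / 3

# With the K-stabilizing shift lambda_p -> log K + lambda_p, the Binomial(nJK, p)
# prior on the number of bicluster cells has mean approximately EB.
shifted = math.log(K) + lam_p
p = math.exp(-shifted) / (3.0 + math.exp(-shifted))
expected_cells = n * J * K * p
```

The relative gap between `expected_cells` and `EB` shrinks as $K$ grows, which is the point of the $\log K$ shift.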

Implementation
To implement inference for the models discussed earlier we use a hybrid MCMC posterior simulation scheme. A complete description of all the required steps is given in the Supplementary Materials file. This also includes the specific changes needed for each of the likelihood models adopted here. Nevertheless, we offer in this section a general discussion and summary of the main steps involved in the implementation.
The full model considers likelihood location-level parameters $\{\mu_\ell\}$, $\{a_{i\ell}\}$, $\{b_{j\ell}\}$, $\{\rho_{i\ell}\}$, $\{\kappa_{j\ell}\}$ and $\mu_N$. In addition to these, the model involves various variance parameters (some of which change with the assumptions on the background noise distribution), the CAR matrix parameter $\lambda$, the penalty parameter $\lambda_p$, and the stick-breaking parameters $M_c$ and $M_r$. We discuss next some of the special moves required to sample those parameters defining the binary matrices $\rho$ and $\kappa$, which, as described earlier, are constructed via thresholding. For the row labels, $\rho_{i\ell} = I\{Z_{i\ell} > \theta_\ell\}$ together with (2.17) imply that the full conditional distribution of $Z_{i\ell}$ can be written as a mixture of two truncated normal distributions. The full conditionals of the stick-breaking variables $V_\ell$ are more difficult to handle: they involve the likelihood of the model (either Gaussian or Poisson) evaluated at the mean with and without the corresponding bicluster contribution, with all other model parameters fixed at their currently imputed values, and the total bicluster size $\sum_\ell \rho_{i\ell}\kappa_{j\ell}$ enters through the penalty term. As shown in Section A.4.1 of the Supplementary Materials file, the resulting expression involves a product with about $2^{nJ}$ possible combinations (depending on the values of the $\kappa$-labels). Therefore, it is more efficient to consider Metropolis-Hastings (MH) sampling for the stick-breaking variables. For example, we could sample proposal updates from the associated beta prior distribution, or consider transformations to near-normality of the stick-breaking variables followed by MH moves that mimic the suggestion of Roberts and Rosenthal (2009). More details are provided in Section A.4 of the Supplementary Materials file.
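Although the actual sampler is described in the Supplementary Materials, its basic ingredient, drawing from a two-component mixture of truncated normals, can be sketched generically as below. The draws use the inverse-CDF method; the mixture weight, mean, and threshold are arbitrary illustrative values, not the model's actual full conditionals.

```python
import math, random

def Phi(t):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def inv_Phi(p):
    """Inverse standard normal CDF via bisection (adequate for a sketch)."""
    lo, hi = -12.0, 12.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        (lo, hi) = (mid, hi) if Phi(mid) < p else (lo, mid)
    return 0.5 * (lo + hi)

def sample_trunc_normal(mu, sd, a, b, rng):
    """Inverse-CDF draw from N(mu, sd^2) truncated to (a, b)."""
    pa, pb = Phi((a - mu) / sd), Phi((b - mu) / sd)
    u = rng.uniform(pa, pb)
    return mu + sd * inv_Phi(u)

def sample_mixture(w0, mu, sd, theta, rng):
    """Draw from w0 * TN(-inf, theta) + (1 - w0) * TN(theta, inf)."""
    if rng.random() < w0:
        return sample_trunc_normal(mu, sd, -1e9, theta, rng)
    return sample_trunc_normal(mu, sd, theta, 1e9, rng)

rng = random.Random(3)
draws = [sample_mixture(0.3, 0.0, 1.0, 0.8, rng) for _ in range(2000)]
below = sum(d < 0.8 for d in draws) / len(draws)
```

In the model, the component weights would come from the likelihood evaluated with the row label switched on and off, so the latent score and its label are updated coherently.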
A similar type of full conditional is obtained for the latent T j scores, related to the column labels. Here, we again obtain that the full conditionals are mixtures of truncated normal distributions and MH moves are considered when updating the stick-breaking variables U (see Sections A.1 and A.4.2 in the Supplementary Materials file).
Finally, the full conditional distribution of $\lambda$ implied by assumptions (2.12) through (2.14) has a complicated form, involving the normalizing constant of the multivariate normal distribution (2.13), which we evaluate numerically; nevertheless, we are able to derive a good approximation to it. The approximation depends on $\nu$, the common degree of the graph, and on the adjacency matrix $A$ associated with the data graph (see Section 2.2).

Simulation study
We next describe a simulation study designed to evaluate the performance of the proposed model, and to compare it against some alternative methods designed to uncover biclusters in the same setting we have described. These methods are (1) the algorithm by Cheng and Church (2000); (2) the plaid method of Lazzeroni and Owen (2002), as improved by Turner et al. (2005a); and (3) the penalized biclustering method described in Chekouo and Murua (2015a). The first two methods are implemented in the R package biclust (Kaiser and Leisch, 2008), while the third has been implemented in Java and is publicly available on the web-sites of the authors (Chekouo and Murua, 2015b). While other methods are available, as described in Kasim et al. (2017), we have limited this comparison to some of the model-based approaches, discarding purely algorithmic alternatives.
Data were generated according to a number of different scenarios based on the number of biclusters $K \in \{2, 4, 8, 16\}$, the type of overlap between biclusters, either conditions-only overlap (C, or columns-only overlap) or conditions-and-genes overlap (C&G, or rows-and-columns overlap), and the type of noise, either normal (Gaussian), uniform, or unimodal. Three data sizes were considered. The smaller datasets consist of 355 rows and 17 columns, a choice that mimics the well-known yeast cycle dataset recording the gene expressions of yeast over seventeen different conditions (Cho et al., 1998; Mewes et al., 1999; Tavazoie et al., 1999; Yeung et al., 2001). Two larger datasets were also considered, consisting respectively of 2000 rows and 38 columns, and 2000 rows and 76 columns. These mimic the retinal detachment dataset GSE28133 found at NCBI/GEO (Edgar et al., 2002) and used by several researchers (Delyfer et al., 2011; Chekouo et al., 2015). The synthetic data were generated following the setup used by Chekouo et al. (2015). Sizes and positions of biclusters were chosen by visual inspection so as to have sufficient but not extensive overlap. Some datasets are illustrated in Figure 1. For each bicluster $k \in \{1, \ldots, K\}$, the overall bicluster mean was set to a draw from a normal distribution with mean $k + 1$ and unit variance. Row effects were generated as realizations of a multivariate normal with mean vector given by $\alpha_{ki} = 2/(1 + \exp\{-i\}) - \bar\alpha_k$, where $\bar\alpha_k = \mathrm{mean}\{2/(1 + \exp\{-i\}) : i = 1, \ldots, r_k\}$ (recall that $r_k$ and $c_k$ denote, respectively, the number of rows and the number of columns in bicluster $k$). The column effects were generated similarly, replacing $r_k$ by $c_k$. We devised scenarios corresponding to low and to high variance. The low variance scenario has the row and column effects variances, as well as the bicluster mean variances, set to 2.0; the error variance was drawn from a gamma distribution with parameters 2.0 and 5.0.
The high variance scenario has the row and column effects variances, as well as the bicluster mean variances, set to 10.0. The error variance was drawn from a gamma distribution with parameters 1.5 and 15.0. For the larger datasets (with 76 columns), we also considered a moderate variance scenario. This has the row and column effects variances, as well as the bicluster mean variances, set to 6.0. The error variance was drawn from a gamma distribution with parameters 2.0 and 12.0. For the three scenarios, the noise parameters were set as follows. Normal noise: the data were generated from a zero-mean normal distribution with variance defined as twice the error variance. Uniform noise: each data point in the null cluster is generated as a draw from a uniform distribution on (−5|u 1 |, 5|u 2 |), where, for each data point, the pair (u 1 , u 2 ) is a draw from a standard bivariate normal distribution. Unimodal noise: each point is a draw from a Rayleigh-standard-Normal unimodal distribution with Rayleigh parameter σ 0e ² = 20.
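As an illustration, the signal and noise generation just described can be sketched as follows. This is a minimal sketch, not the authors' code: the function names, the random seed, and the way the effect variance enters the draws are our own choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def bicluster_signal(n_rows, n_cols, k, effect_var=2.0):
    """Signal for bicluster k: overall mean drawn around k + 1, plus row and
    column effects drawn around the centered logistic means in the text."""
    mu_k = rng.normal(loc=k + 1, scale=1.0)                 # overall bicluster mean
    raw_r = 2.0 / (1.0 + np.exp(-np.arange(1, n_rows + 1))) # 2/(1 + exp(-i))
    alpha = rng.normal(raw_r - raw_r.mean(), np.sqrt(effect_var))  # row effects
    raw_c = 2.0 / (1.0 + np.exp(-np.arange(1, n_cols + 1)))
    beta = rng.normal(raw_c - raw_c.mean(), np.sqrt(effect_var))   # column effects
    return mu_k + alpha[:, None] + beta[None, :]

def uniform_null_noise(shape):
    """Null-cluster 'uniform noise': each cell ~ U(-5|u1|, 5|u2|), where
    (u1, u2) is a draw from a standard bivariate normal."""
    u = rng.standard_normal(shape + (2,))
    return rng.uniform(-5.0 * np.abs(u[..., 0]), 5.0 * np.abs(u[..., 1]))
```

Here `effect_var` would be 2.0, 6.0, or 10.0 under the low, moderate, and high variance scenarios, respectively.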
We considered five replications for each data scenario described above, except for scenarios associated with the larger dataset (76 columns), which were replicated according to the Graeco-Latin square design displayed in Table 2. Three factors were simultaneously varied: variance scenario {Low, Moderate (Mod), High}, noise distribution used to generate the data {normal, uniform, unimodal}, and number of true biclusters {4, 8, 16}. A fourth factor, the noise model assumed by our model, denoted here by SbCAR for the stick-breaking-CAR prior, was also varied. Because the larger datasets are considerably more computationally demanding, this design was conceived so as to measure the performance of the different algorithms over several factors, but with fewer runs.
The biclustering results were evaluated using the F 1 -measure (Santamaria et al., 2007; Turner et al., 2005b; Chekouo et al., 2015) of agreement between the true biclustering and the estimated biclustering yielded by the different methods that were compared. The F 1 -measure is the harmonic mean of the so-called "recall" and "precision." For given biclusters B 1 and B 2 , these are defined as follows: recall = |B 1 ∩ B 2 |/|B 2 | and precision = |B 1 ∩ B 2 |/|B 1 |, so that F 1 (B 1 , B 2 ) = 2(1/recall + 1/precision)⁻¹. Given two biclustering collections A and B, for each A i ∈ A we set F 1 (A i , B) = max B∈B F 1 (A i , B), and similarly, for each B j ∈ B, F 1 (B j , A) = max A∈A F 1 (B j , A). We measure the similarity between the two biclustering collections using a symmetrized version of the F 1 -measure, obtained by averaging these best-match scores over both collections.

Table 2: Simulation designed to vary three factors simultaneously: Variance, Noise, and Number of true biclusters. The notation SbCAR = xyz means that the model SbCAR was run assuming a noise distribution equal to xyz.

Choosing the biclusters

To choose the biclusters and their number we look at the posterior means of the labels p(ρ i = 1 | y), p(κ j = 1 | y), and also at p(ρ i κ j = 1 | y).
We set each bicluster to be the submatrix formed by the ensemble of rows i and columns j for which p(ρ i = 1 | y) and p(κ j = 1 | y) are large enough, that is, larger than a predetermined threshold such as 0.5, which is the threshold used in our simulations. The same procedure can also be carried out by looking only at p(ρ i κ j = 1 | y). Our simulations yielded the same or similar results with either choice of posterior means. Besides empty biclusters, we also discarded biclusters containing only a handful of cells (fewer than five).
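For concreteness, the F 1 computations above can be sketched as follows. This is a Python sketch that assumes each bicluster is represented as a set of (row, column) cells; the averaging in `symmetric_f1` is one natural symmetrization, since the exact displayed formula did not survive in this text.

```python
import numpy as np

def f1(b1, b2):
    """F1 between two biclusters, each a set of (row, col) cells:
    recall = |B1 ∩ B2|/|B2|, precision = |B1 ∩ B2|/|B1|."""
    inter = len(b1 & b2)
    if inter == 0:
        return 0.0
    recall = inter / len(b2)
    precision = inter / len(b1)
    return 2.0 / (1.0 / recall + 1.0 / precision)  # harmonic mean

def symmetric_f1(A, B):
    """Symmetrized F1 between two collections of biclusters: average of the
    mean best-match scores computed in both directions."""
    a_to_b = np.mean([max(f1(a, b) for b in B) for a in A])
    b_to_a = np.mean([max(f1(b, a) for a in A) for b in B])
    return 0.5 * (a_to_b + b_to_a)
```

The measure equals 1 when the two collections coincide and decreases toward 0 as the best-matching biclusters overlap less.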
Since, as discussed earlier, our normal-likelihood model allows three types of noise in the data, namely normal, uniform, and unimodal noise distributions, our model was run assuming each of the three noise distributions on each dataset, regardless of the actual noise used to generate the data. For the cases K ∈ {2, 4, 8}, we also compare the results with our Poisson-likelihood model. For this, we applied the exponential transform to the data (that is, x → exp(x)), and scaled the data so as to avoid numerical problems with the exponentiation. In our experiments, we transform a data point x to y = exp(scale · x/max x ), where max x denotes the maximum over the absolute values of the data, and scale is chosen so that log(y) ≤ 30. The hyperparameters M r and M c were given Gamma priors with shapes and rates equal to (2, 20) and (2, 100), respectively. The inverse-variance hyperparameters were set as follows: a μ = a α = a β = 2.1, and b μ = b α = b β = 1.1. Finally, a symmetrized 10-nearest-neighbor graph structure with Euclidean distance was employed when running these simulations.
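Under our reading of the scaling rule, the exponential transform can be written as follows (a minimal sketch; `exp_transform` is a hypothetical name):

```python
import numpy as np

def exp_transform(x, cap=30.0):
    """Transform x -> exp(scale * x / max|x|), with scale chosen so that the
    log of the transformed values never exceeds `cap` (30 in the text)."""
    x = np.asarray(x, dtype=float)
    max_abs = np.max(np.abs(x))
    # Since |x| / max|x| <= 1, taking scale = cap guarantees log(y) <= cap.
    scale = cap
    return np.exp(scale * x / max_abs)
```

This keeps the largest transformed value at exp(30), avoiding overflow in the subsequent Poisson-likelihood fit.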
Figures 2 and 3 display barplots with means and standard deviations of the F 1 measure comparing the true collections of biclusters to what is estimated by each of the methods studied, for the case of sixteen biclusters and for the case of eight biclusters, respectively (the results for two and four biclusters for the smaller datasets are similar and are shown in Section B of the Supplementary Materials file). Our proposed model is denoted here as SbCAR for the stick-breaking-CAR prior. In Figure 3 only the normal-noise fit (SbCARn) results are displayed. For the small datasets, the unimodal noise model's performance is very similar to that of the normal noise model; the performance of the uniform noise model is also similar, but not as good as that of the normal and unimodal noise models. This is in contrast with the performance of the uniform fit on the large datasets (see Figure 2).

Figure 2: F 1 comparison results for the large datasets (with 38 conditions/columns). "Low" and "High" refer to the low and high variance scenarios, respectively. "C" and "C&G" refer to the conditions-only and conditions-and-genes overlap scenarios. The barplot titles "normal", "uniform" and "unimodal" refer to the type of noise used to generate the data. The methods displayed under the bars in the figure are, from left to right, Cheng & Church (CC), penalized-plaid (PenPld), plaid (Plaid), and the proposed normal-likelihood model with normal (SbCARn), uniform (SbCARu) and unimodal (SbCARm) noise fits (sampling models). Error bars represent variability over five repeated simulations; see the text for further explanation.
Overall, the results for the small datasets show that the best performing methods are the proposed normal-likelihood model with either normal or unimodal (not shown) noise fit, the proposed Poisson-likelihood model, and the Penalized-Plaid model of Chekouo et al. (2015). However, for four biclusters (not shown here), the best performer is the Penalized-Plaid model, although its performance is closely followed by those of the Poisson-likelihood and normal-likelihood models. For both types of datasets, our proposed models are clearly the best performers when the number of biclusters is large (eight or sixteen). For the large datasets, the performance of our models is by far the best. The specific type of noise (i.e. sampling model) assumed when fitting our model does not appear to have much impact on inference or, generally speaking, on model performance. The model with uniform noise has a slightly better performance, but this advantage is overshadowed by the typically longer computational time it requires to carry out posterior inference when compared to the other two alternatives we tried. In the accompanying Supplementary Materials file, Section C, we report summary statistics obtained from some of the runs we have carried out. Taking all of this into account, our practical recommendation is to choose either the unimodal or the normal sampling model.

Figure 3: F 1 comparison results for the small datasets. "Low" and "High" refer to the low and high variance scenarios, respectively. "C" and "C&G" refer to the conditions-only and conditions-and-genes overlap scenarios. The barplot titles "normal", "uniform" and "unimodal" refer to the type of noise used to generate the data. The methods displayed under the bars in the figure are, from left to right, Cheng & Church (CC), penalized-plaid (PenPld), plaid (Plaid), the proposed Poisson-likelihood model with negative-binomial noise (Poisson), and the proposed normal-likelihood model with normal noise fit (SbCARn). Error bars represent variability over five repeated simulations; see the text for further explanation.
The results from the simulation with the larger datasets (with 76 conditions/columns) are displayed in Table 3. All statistics were computed from three runs of each model, as specified in the Graeco-Latin design of the simulation shown in Table 2. For this simulation we added the method BBC, for Bayesian biclustering, described in Gu and Liu (2008). This method adopts a parametric prior for the plaid model and, based on identifiability considerations, imposes a certain restriction on biclusters, namely, that they can overlap either on rows or on columns but not both. However, this method did not yield any results when the number of biclusters was set to 8 or to 16. Note that in all the runs of our model SbCAR, we imposed no constraints on the biclusters found except for the minimum number of cells which, as explained earlier, was set to five. Furthermore, all competing methods were run with the true number of biclusters to be found supplied to them. So in this simulation we also experimented with setting a minimum number of genes and conditions per bicluster in the results of SbCAR. It seems reasonable to set the minimum numbers of genes and conditions forming a bicluster in proportion to their dimensions in the data. So, for the larger dataset with 76 conditions, we asked for 5% of the genes and 10% of the conditions as minimum sizes to form a bicluster. These translate to a minimum of 100 genes and a minimum of 8 conditions to form an "interesting" bicluster. We remark here that these constraints are considered only as part of the posterior processing required to actually determine the biclusters, and are not related to other aspects of the method such as the definition of K or the penalty term discussed earlier. The results with these constraints are denoted by SbCARx in the corresponding table. The last row of Table 3 gives the overall performance of each method over all nine scenarios of the design.

Data illustrations
We next consider applications to continuous and to count data, using the models developed in Section 2. Recall that, as explained in Section 2.2, we adopt here the procedure of defining the graph structure in terms of Euclidean distances obtained from gene ontologies, specifically by resorting to the structure induced by the symmetrized 30-nearest-neighbor graph of every gene.

Yeast cell data
We consider gene expression data for yeast cells, discussed in and made available by Eisen et al. (1998), and analyzed by several authors; see Chekouo and Murua (2015a) and references therein. The data are the result of combining information on yeast expression levels from microarray data coming from shock experiments and some previously published microarray data originating in time-courses from various cell processes, including the mitotic cell division cycle, diauxic shift, and sporulation. Specifically, we have the fluctuation of the log (base 2) expression levels of 2467 genes over 79 time-points (conditions). A small fraction of the data (about 1.9%) is missing. To make our results comparable with previous analyses of these data, we imputed the missing data following the procedure applied in Lazzeroni and Owen (2002) and Chekouo and Murua (2015a). This consists of replacing each missing value by the sum of the corresponding row (gene) mean plus the column (time-point) mean, minus the overall mean. The main goal of the analysis is to find genes that exhibit consistent patterns over a subset of time-points. A close study of the noise cluster yielded by the penalized plaid model of Chekouo and Murua (2015a) hinted that a unimodal Rayleigh distribution such as the one described in Section 2.1 would fit the noise in the data better than a normal distribution. For more details, see the Supplementary Materials.
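The imputation rule can be sketched as follows (a minimal sketch; `impute_plaid` is a hypothetical name, and missing entries are assumed to be coded as NaN):

```python
import numpy as np

def impute_plaid(y):
    """Replace missing values (NaN) by row mean + column mean - overall mean,
    the rule used in Lazzeroni and Owen (2002); means are computed over the
    observed cells only."""
    row_means = np.nanmean(y, axis=1, keepdims=True)  # per-gene means
    col_means = np.nanmean(y, axis=0, keepdims=True)  # per-time-point means
    overall = np.nanmean(y)
    fill = row_means + col_means - overall            # broadcasts to y.shape
    return np.where(np.isnan(y), fill, y)
```

Observed entries are left untouched; only the NaN cells receive the additive fill.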
Previously published work on these data reports a fair number of large biclusters, with a large proportion of the data forming part of at least one bicluster. Based on these studies and on our observation about the noise cluster, we fit a unimodal-noise model with K = 16, and use a log-Normal proposal with mean 0.05 for the penalty parameter λ p in the Metropolis step of the MCMC procedure described in the Supplementary Materials file. The hyperparameters were set as in the simulation studies of Section 3.
We found 14 biclusters. In particular, at least three of the most important biclusters found by Lazzeroni and Owen (2002) and Chekouo and Murua (2015a) are also found by the proposed model, although one of them was in our case split into two. These are displayed in Figure 4, with genes shown in the rows and time-points in the columns. Note that these biclusters overlap with each other, which implies that a model imposing a nested structure would be inadequate to properly capture this behavior. The similarity between the conditions selected in these biclusters and those found in Chekouo and Murua (2015a) is F 1 = 0.72, while the similarity between the genes selected is F 1 = 0.62. The overall similarity is F 1 = 0.48. The drop in overall similarity is mostly due to the splitting of the Penalized-Plaid bicluster 3 into two biclusters, and to the fact that, in general, our model selected fewer genes in the biclusters. This latter fact likely entails better enrichment in the biclusters found by the proposed model.

Histone modifications data
We consider here a dataset on histone modifications (HMs), previously analyzed in Xu et al. (2013). Histones are proteins that package and order the DNA into structural units called nucleosomes. Histones have been found to play a role in gene regulation, and combinations of HMs have also been linked to cancer prognosis and DNA repair, among other things. See the discussion in Xu et al. (2013). The data come from a ChIP-Seq experiment for CD4+ T lymphocytes (Barski et al., 2007; Wang et al., 2008) in which 39 HMs are reported, over a total of 300 genomic locations including both promoter and insulator regions. The main goal of the application is to identify subsets of genomic locations with similar patterns of HMs.
As in the paper of Xu et al. (2013), we reduced the data to 100 genomic locations: 50 from promoter regions and 50 from insulator regions. These genomic locations are exactly the same ones selected by Xu et al. (2013). The data consist of counts of HMs over 100 genomic locations and 39 types of HMs. We applied our Poisson-likelihood model with row and column multiplicative effects and Negative Binomial noise to these ChIP-Seq data. We set K = 16, and use an exponential prior of intensity 10 for the penalty parameter λ p . The hyperparameters were set as in the simulation studies of Section 3.
The model found three biclusters. These are displayed in Figure 5. They are very similar to those reported by Xu et al. (2013). In fact, the F 1 similarity measure between the two associated clusterings formed by the genomic locations is 0.61. Essentially, bicluster 1 corresponds to active-HM set 1 of Xu et al. (2013), bicluster 2 to active-HM set 3, and bicluster 3 to active-HM set 2. Thus, the conclusions from our analysis are very similar to those in Xu et al. (2013). In particular, they corroborate the findings of the study of histone patterns in the human genome of Wang et al. (2008).

Discussion
We have proposed a model to detect biclusters from a sample in the form of a matrix of data, typically with subjects in the rows and conditions in the columns. Our proposal can be used with continuous or count responses and may be regarded as a nonparametric extension of the plaid model by Lazzeroni and Owen (2002). The hierarchical prior structure features a CAR model on a latent scale to incorporate prior information on which genes are more likely to form part of the same group, teamed with a stick-breaking prior for encouraging changes in the number of genes that constitute subsequent biclusters. A suitable MCMC posterior simulation procedure was devised to make inference under this model, particularly with regard to detecting biclusters. Extensive simulation studies were also carried out to test the model and to compare its performance against other competitors. The results showed that our proposal performs well for our bicluster detection goals.

Our model included a particular definition of the columns in the gene selection matrix ρ based on a stick-breaking construction. An alternative way to define the joint prior model p(ρ) could consider an Ising model with a neighboring structure constructed from prior information on gene interactions, just as in the approach described in Section 2. Developing this idea further is part of work to be carried out in the future.