Bayesian Learning of Graph Substructures

Graphical models provide a powerful methodology for learning the conditional independence structure in multivariate data. Inference is often focused on estimating individual edges in the latent graph. Nonetheless, there is increasing interest in inferring more complex structures, such as communities, for multiple reasons, including more effective information retrieval and better interpretability. Stochastic blockmodels offer a powerful tool to detect such structure in a network. We thus propose to exploit advances in random graph theory and embed them within the graphical models framework. A consequence of this approach is the propagation of the uncertainty in graph estimation to large-scale structure learning. We consider Bayesian nonparametric stochastic blockmodels as priors on the graph. We extend such models to consider clique-based blocks and to multiple graph settings, introducing a novel prior process based on a Dependent Dirichlet process. Moreover, we devise a tailored computational strategy for Bayes factors for block structure, based on the Savage-Dickey ratio, to test for the presence of larger structure in a graph. We demonstrate our approach in simulations as well as in real data applications in finance and transcriptomics.


Introduction
Graphical models provide a flexible tool to describe the conditional independence structure in multivariate data: the nodes of the graph represent variables and the edges amongst them define conditional dependence (Lauritzen, 1996). Most inferential approaches focus on estimation of individual edges rather than on identification of informative structure in a graph on a larger scale. This is despite the fact that such large-scale structure is often present and of interest in multivariate data (Ravasz et al., 2002; Yook et al., 2004). Moreover, estimation of a single edge is (often extremely) sensitive to the number of observations as well as to the presence of specific nodes in the graph. We therefore propose graphical models that also enable learning of large-scale structure. These models build on extensive work in random graphs and networks (Newman, 2011; Fortunato and Hric, 2016; Lee and Wilkinson, 2019) such as the stochastic blockmodel (Holland et al., 1983).

It is important to stress the distinction between random graph theory and graphical models as two parallel, large research areas with only limited interplay, such as the work by Bornn and Caron (2011) and Peixoto (2019). Within the first field, models for random graphs have evolved substantially from initial approaches such as the Erdős-Rényi model (Erdős and Rényi, 1959) to methods for large-scale structure. They usually involve the description of network formation/evolution. See Fienberg (2012) and Barabási (2016) for an overview. Such developments contrast with the literature on graphical models (Friedman et al., 2007; Armstrong et al., 2009; Zhou et al., 2011; Maathuis et al., 2019; Ni et al., 2022), which aims to infer a graph from multivariate data. In this context, the focus of inference is usually on determining the presence of an edge between two nodes, whereas modelling of large-scale graph structures is often neglected (Bornn and Caron, 2011).
The rationale underpinning our work derives from the following consideration. In the random graph literature, there is major interest in large-scale structures as they often arise in applications. A common example is provided by scale-free networks, which imply a hub structure (Yook et al., 2004). More recently, Newman (2011) advocates for more complex structures such as modularity. This consideration motivates the need to investigate such components also when inferring graphs from multivariate data. On the other hand, the Bayesian graphical model literature commonly focuses on single edges, and specification of a prior on graph space is achieved by assuming the same probability of inclusion for each edge, with all edges being independent. This prior corresponds to the Erdős-Rényi model.
Focusing on single edges can be restrictive in many applications, often preventing detection of important data features. For instance, assume we are interested in estimating a graph from gene expression data. It could be of biological interest (e.g. disease aetiology) to group genes in co-expression modules (i.e. blocks, a larger structure) (Yook et al., 2004). Similarly in metabolomics, it is of interest to identify metabolites that are involved in the same biochemical reaction/pathway (Ravasz et al., 2002). Social networks provide another area where such graph substructures are relevant. For instance, we might estimate a graph from voting records of members of parliament with the goal of identifying political factions.
The increasing interest in estimating large structures in multivariate data is reflected in recent work. For example, Zhang (2018) first estimates a graph from the data and, then, identifies large-scale structure using random graph methods. Such an approach is suboptimal, for instance because it does not propagate the uncertainty from network estimation to the estimation of large-scale structure. In the machine learning literature, methods for identification of graph substructure can be found in Marlin et al. (2009).
Our work is positioned in this new line of research. Exploiting advances from random graph theory, we propose graphical models able to accommodate single-edge as well as block structure. The benefits of joint graph and structure recovery compared to a two-step approach are multiple: (i) if present, large-scale structure can guide graph estimation; (ii) ad hoc specification of a graph estimator (e.g. through an edge inclusion probability threshold) is not required; (iii) detection of structure, or lack thereof, is data-driven; (iv) uncertainty in graph estimation propagates to large-scale structure learning; (v) extension to complex set-ups (e.g. different biological conditions) is in principle straightforward, which leads to (vi) effective use of information, as the developed framework allows combining data from multiple sources in a principled way. We consider both single and multiple graph scenarios as well as different blockmodels, namely the usual stochastic blockmodel and also one where blocks are cliques. Here we focus on blockmodels because of their popularity, but the setting is general and other priors could be employed.
One of our contributions is an algorithm (derived as a by-product of the MCMC) to compute Bayes factors to test for the presence of block structure, which is equivalent to the presence of clusters in a nonparametric partition model. Our approach, based on the Savage-Dickey ratio (Dickey, 1971), offers computational advantages over existing methods (Basu and Chib, 2003; Legramanti et al., 2022a). To define blocks in multiple graphs, we introduce a novel Bayesian nonparametric prior. Specifically, we propose a Dependent Dirichlet process that does not enforce exchangeability within groups as in previous approaches (e.g. MacEachern, 1999; De Iorio et al., 2004; Müller et al., 2004; Camerlenghi et al., 2019; Quintana et al., 2022).
The paper is structured as follows. Sections 1.1 and 1.2 review related work on blockmodels and graphical models, respectively. Section 2 introduces Gaussian graphical models (GGMs, Dempster, 1972). In Section 3, we propose various priors on graphs that allow recovery of large-scale structure. Section 4 introduces Bayes factors for testing for block structure. We demonstrate the proposed approach in simulation studies in Section 5 and on real data applications in Section 6. We conclude the paper in Section 7. The paper is accompanied by Supplementary Material which contains further details on the methods and results.

Stochastic Blockmodels
Arguably, the most widely used model for large-scale structure in graphs is the blockmodel (Fienberg, 2012), which is therefore our starting point. A stochastic blockmodel (Holland et al., 1983) consists of a partition of the set of nodes into blocks or communities, where we use both terms interchangeably. Then, nodes in the same block are more likely to be connected than nodes from different blocks. Thus, the structure of interest is the clustering of nodes and the connectivity within and between these clusters. Introducing block structure in graph estimation allows highlighting macro-organisation (instead of focusing on single edges) and important hubs/connectivity clusters, which ultimately will aid interpretation of the results and hypothesis generation.
To this end, the key modelling strategy that we adopt is to employ tools from the Bayesian nonparametric literature for estimation of block structure in a graph, as such a strategy provides uncertainty propagation across the full graphical model and data-driven determination of the blocks' number and membership. We limit this literature review of blockmodels to Bayesian nonparametric approaches though many others exist (e.g. Fortunato and Hric, 2016; Abbe, 2018; Funke and Becker, 2019; Lee and Wilkinson, 2019; Gao and Ma, 2021). See Schmidt and Mørup (2013) for an introduction to Bayesian nonparametric modelling of graphs including blockmodels. Kemp et al. (2006) introduce a blockmodel where the prior on the partition of nodes is a Chinese restaurant process (CRP, Pitman, 2006), which closely relates to the Dirichlet process (DP, Ferguson, 1973). Geng et al. (2018), Gao et al. (2020) and Jiang and Tokdar (2021) place a mixture prior with a random number of components (Miller and Harrison, 2017; Argiento and De Iorio, 2022) on the partition: Geng et al. (2018) obtain posterior consistency results for the number of blocks, and Jiang and Tokdar (2021) do so for the partition and the edge probabilities. Gao et al. (2020) provide posterior concentration rates for the edge probabilities and show that their posterior mean achieves the minimax rate. Legramanti et al. (2022b) employ Gibbs-type partition priors which generalise both the CRP and the mixture with a random number of components.
In general, these approaches also require specification of prior edge inclusion probabilities jointly with the block structure prior. For instance, Kemp et al. (2006), Geng et al. (2018) and Legramanti et al. (2022b) place Beta distributions on the edge probabilities, Reyes and Rodríguez (2016) and Jiang and Tokdar (2021) add structure by using different priors for within- and between-block edge probabilities, while Tan and De Iorio (2019) use a DP to build a joint prior on the partition of nodes and edge probabilities. Additionally, they extend the model to a degree-corrected blockmodel, i.e. they introduce a popularity parameter for each node. Passino and Heard (2020) and Loyal and Chen (2022) consider Bayesian blockmodels where edge probabilities derive from a latent space embedding. Caron and Fox (2017) model edge probabilities by associating edges with realisations from a Poisson process with rate described by a random measure. Herlau et al. (2016) and Todeschini et al. (2020) extend this approach, respectively, to (i) blockmodels, with blocks corresponding to subsets of the support of a random measure; and (ii) overlapping community detection, with each community corresponding to an element of a compound random measure. The Bayesian nonparametric blockmodel by Peixoto (2017) defines a generative process based on a random partition prior on the block configuration, where structural constraints are imposed on the number of edges across blocks. This approach avoids explicit modelling of edge inclusion probabilities.

Learning Block Structure in Graphical Models
Proposals for the estimation of large-scale structures in graphical models can be categorised into two main strategies: (i) regularisation methods; (ii) imposing structure on the precision matrix. Within the first framework, Ambroise et al. (2009) and Marlin et al. (2009) do not model graphs explicitly, but learn a block structure as part of a shrinkage estimator for the precision matrix as in the graphical lasso (Friedman et al., 2007), where every block is characterised by its own regularisation parameter.
Within the second framework, Sun et al. (2014) consider a GGM with a CRP as prior on the partition of nodes. Then, the partition informs the sparsity pattern of the scale matrix of the Wishart prior on the precision matrix, rather than of the precision matrix itself as is commonly done in GGMs. See Section S3 of Supplementary Material (van den Boom et al., 2022b) for details. Marlin and Murphy (2009) impose sparsity in the precision matrix of a GGM by first approximating the joint distribution of the nodes via the specification of the conditional distribution of each node given the others. Then, they impose a continuous spike-and-slab prior on "edge weights" that capture the association of a node with the others. Finally, the prior on edge weights incorporates a block structure. Sun et al. (2015) fix the number of blocks, place a Dirichlet prior on the partition of the nodes in an exponential graphical model and compute a point estimate of the partition using an expectation-maximisation algorithm. Bornn and Caron (2011) consider decomposable graphs, which allow modelling of cliques and separators separately, and use a product partition model as prior on the graph. Their prior can induce large cliques and, as such, allows the identification of larger structures than edges. Peixoto (2019) uses a stochastic blockmodel as prior for network reconstruction in two discrete-valued graphical models, i.e. the Ising model and an epidemic model of infection status across time. He shows empirically that joint estimation of the graph and block structure increases accuracy as compared to two-step approaches. Colombi et al. (2022) and Cremaschi et al. (2022) consider inference in GGMs under a known block structure with either all or no edges present between a pair of blocks.

Gaussian Graphical Models
Let the graph G = (V, E) be defined by a set of edges E ⊂ {(i, j) | 1 ≤ i < j ≤ p} that represent links among the nodes in V = {1, . . ., p}. The data are represented by an n × p matrix Y with independent and identically distributed rows corresponding to p-dimensional random vectors whose elements are represented by nodes on the graph. A graphical model (Lauritzen, 1996) is a family of distributions on the rows which is Markov over G. That is, the distribution p(Y | G) is such that the i-th and j-th columns of Y are independent conditionally on the other columns if and only if (i, j) ∉ E.
While our development for learning large-scale structure applies to graphical models in general, here we focus on GGMs (Dempster, 1972), which consider a Gaussian law for p(Y | G). Then, each row of Y is distributed according to a multivariate Gaussian distribution N(0_{p×1}, Ω⁻¹) with precision matrix Ω. The conditional independence structure encoded by G implies that Ω_ij = 0 if and only if nodes i and j are not connected. For the complete matrix Y, Ω_ij = 0 implies that the i-th and j-th columns of Y are independent conditionally on the others. In this context, a blockmodel on G enables learning of sparse block-structured precision matrices where the block structure is unknown.
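The correspondence between zeros in Ω and conditional independence can be checked numerically. The following sketch (an illustrative example with arbitrary matrix values, not part of the model) computes the partial correlations −Ω_ij/√(Ω_ii Ω_jj) and shows that a zero in Ω need not produce a zero in the marginal covariance Σ = Ω⁻¹:

```python
import numpy as np

# Illustrative 3-node GGM: nodes 1 and 3 are conditionally independent given
# node 2, encoded by Omega[0, 2] = 0 (0-based indices).
Omega = np.array([[2.0, -0.8, 0.0],
                  [-0.8, 2.0, -0.8],
                  [0.0, -0.8, 2.0]])

# Partial correlation between i and j given the rest: -Omega_ij / sqrt(Omega_ii * Omega_jj).
d = np.sqrt(np.diag(Omega))
partial_corr = -Omega / np.outer(d, d)
np.fill_diagonal(partial_corr, 1.0)

# The marginal covariance Sigma = Omega^{-1} is dense: marginal dependence remains.
Sigma = np.linalg.inv(Omega)
```

Here `partial_corr[0, 2]` is exactly zero while `Sigma[0, 2]` is not, which is precisely the Markov structure the graph encodes.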
A popular choice as prior p(Ω | G) for the precision matrix Ω conditional on the graph G is the G-Wishart distribution W_G(δ, D), as it induces conjugacy and allows working with non-decomposable graphs (Giudici, 1996; Roverato, 2002). It is parameterised by degrees of freedom δ > 2 and a positive-definite rate matrix D. Then (e.g. Atay-Kayis and Massam, 2005), the density is

p(Ω | G) = I_G(δ, D)⁻¹ |Ω|^{(δ−2)/2} exp{−tr(DΩ)/2},

supported on positive-definite matrices Ω with Ω_ij = 0 whenever (i, j) ∉ E, where I_G(δ, D) denotes the normalising constant.

Graph Priors for Large-Scale Structure Recovery

Key to learning large structure in graphs is the specification of a prior p(G) on graphs. To this end, we borrow ideas from random graph theory, adapting them effectively in our context.
Moreover, our approach is based on the Dirichlet process (Ferguson, 1973), a probability model for random probability distributions. Readers familiar with the DP can skip to the next subsection. If a random measure H ∼ DP(ν, H_0), then H is almost surely discrete. H_0 is the base measure, a distribution around which the DP is centred, while ν > 0 denotes the precision parameter. Due to its discreteness, H admits the well-known "stick-breaking" construction (Sethuraman, 1994) and can be represented as a countable mixture of point masses:

H = ∑_{k=1}^∞ w_k δ_{β_k},  w_k = ξ_k ∏_{l=1}^{k−1} (1 − ξ_l).

Here δ_{β_k} is a point mass at β_k, the weights w_k are generated by rescaled Beta distributions, ξ_k ∼ Beta(1, ν), and the locations {β_k}_{k=1}^∞ are i.i.d. samples from the base measure H_0. Finally, the sequences {β_k}_{k=1}^∞ and {ξ_k}_{k=1}^∞ are independent.
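A truncated version of this construction is easy to simulate. The sketch below (the function name and truncation level are ours, for illustration only) draws the weights and atoms of an approximate DP(ν, H_0) realisation with a Gaussian base measure:

```python
import numpy as np

def stick_breaking_dp(nu, base_sampler, n_atoms, rng):
    """Truncated stick-breaking draw from DP(nu, H0):
    w_k = xi_k * prod_{l<k} (1 - xi_l) with xi_k ~ Beta(1, nu),
    and atom locations beta_k drawn i.i.d. from the base measure H0."""
    xi = rng.beta(1.0, nu, size=n_atoms)
    w = xi * np.concatenate(([1.0], np.cumprod(1.0 - xi)[:-1]))
    beta = base_sampler(n_atoms)
    return beta, w

rng = np.random.default_rng(0)
beta, w = stick_breaking_dp(nu=1.0,
                            base_sampler=lambda n: rng.normal(0.0, 1.0, n),
                            n_atoms=1000, rng=rng)
```

Under truncation the weights sum to just below one, with the leftover mass shrinking geometrically in the number of atoms.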

Degree-Corrected Stochastic Blockmodel
The fundamental idea behind our strategy is the following. Each node i in the graph forms a connection with another node j according to (i) its own propensity (or popularity), captured by the parameter θ_i; (ii) its block membership, captured by the "interaction" parameter β_ij, with nodes in the same block having a higher probability of sharing an edge. The popularity parameter can be thought of as the node-specific propensity to form connections with other nodes. To guarantee parsimony, popularity parameters are modelled assuming a DP prior. On the other hand, prior specification on block-specific parameters is more complex, as it depends on the prior on the partition of nodes, which defines the block structure. We exploit the discreteness of the DP to define the blocks, where each component in the DP discrete mixture (see the "stick-breaking" construction) corresponds to a block and nodes allocated to the same component share the same block-specific interaction parameter. Figure 1 summarises the modelling strategy.
Our starting point is the Bayesian nonparametric degree-corrected stochastic blockmodel by Tan and De Iorio (2019), who propose a probit model for the edge inclusion probabilities. More specifically, Pr{(i, j) ∈ E} = Φ(µ_ij), independently over distinct pairs (i, j) for 1 ≤ i < j ≤ p, where Φ(·) is the cumulative distribution function of the standard Gaussian distribution N(0, 1). Then,

µ_ij = θ_i + θ_j + β_ij.   (2)

The allocation variable z_i denotes the community node i belongs to. The parameter β_ij measures the strength of interaction among members of the same community, with nodes in the same block expected to share more edges among themselves than with nodes outside. The popularity parameter θ_i allows for degree correction in the blockmodel. That is, nodes have varying popularity as captured by the number of their neighbours, i.e. the number of nodes they are connected to via an edge.
Figure 1: Visualisation of the degree-corrected stochastic blockmodel. Each node is characterised by its own popularity parameter θ_i. Block membership is captured by the allocation parameter z_i. Nodes in the same block share the same value for z_i as well as for the interaction parameter β_i = β_{z_i}. There are more edges within (solid lines) than between (dashed lines) blocks, represented by square boxes. The resulting graph G constrains the precision matrix Ω of the data Y. The direction of the arrows reflects the hierarchical specification of the model.
To specify a prior on β_ij, we introduce an auxiliary variable β_i for each node i and assume

β_i | H ∼ H,  H ∼ DP(ν, H_0),

independently for i ∈ V. The discreteness of the DP implies a positive probability of ties in a sample from H = ∑_{k=1}^∞ w_k δ_{β_k} and this, in turn, induces a clustering structure, so that the nodes will be grouped together in K clusters, where the number K of clusters is unknown and learnt from the data through the posterior distribution. In our context, each cluster corresponds to a block, the parameters w_k denote the prior probabilities of belonging to each mixture component, and β_k denotes the block-specific interaction parameter. Nodes are clustered based on their edge inclusion probability, so that nodes in the same block share a common value: β_ij = β_i = β_j when nodes i and j belong to the same block. A posteriori, the set {β_i}_{i=1}^p in (2) reduces to the set of unique values β_1, . . ., β_K assigned to within-block edges. We denote block membership with the variable z_i ∈ {1, . . ., K}, i ∈ V.
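A forward simulation of the model so far can be sketched as follows (the additive combination inside Φ(·) and the single between-block value `beta_between` are illustrative assumptions; in the model above the block-specific values arise from the DP):

```python
import numpy as np
from math import erf, sqrt

def phi(x):
    """Standard Gaussian CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def simulate_graph(theta, z, beta_block, beta_between, rng):
    """Draw a graph from a probit degree-corrected blockmodel sketch:
    Pr{(i, j) in E} = Phi(theta_i + theta_j + beta_ij), where beta_ij is the
    block-specific value if z_i = z_j and an assumed between-block value otherwise."""
    p = len(theta)
    E = set()
    for i in range(p):
        for j in range(i + 1, p):
            beta_ij = beta_block[z[i]] if z[i] == z[j] else beta_between
            if rng.random() < phi(theta[i] + theta[j] + beta_ij):
                E.add((i, j))
    return E

rng = np.random.default_rng(0)
theta = [-0.5, 0.0, 0.5, -0.5]    # node popularities
z = [0, 0, 1, 1]                  # block memberships
E = simulate_graph(theta, z, beta_block={0: 1.5, 1: 1.5}, beta_between=-1.5, rng=rng)
```

With positive within-block and negative between-block values, within-block pairs are far more likely to be connected, mimicking the intended block structure.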

Each node is characterised by its own popularity parameter θ_i, which allows for higher flexibility, but can also challenge identifiability of large-scale structures. This is due to the fact that some connection patterns might be explained nearly as well (in terms of graph likelihood) by either a granular specification of the vector θ or by blocks. To avoid such identifiability issues and associated inflated posterior uncertainty, we reduce the number of unique elements of θ through clustering via a DP. Specifically,

θ_i | H_θ ∼ H_θ,  H_θ ∼ DP(ν_θ, H_{θ0}),

independently for i ∈ V. The proposed model has wide applicability as blocks can represent communities. Algorithm S1 in Supplementary Material (van den Boom et al., 2022b) details an MCMC algorithm for posterior inference. We remark that the block-specific parameter β_k does not appear in the likelihood for a block k of size one, i.e. when |{i | z_i = k}| = 1, differently from conventional applications of the DP. Therefore, the Metropolis-Hastings proposal to add a block of size one does not require a proposal for the parameter specific to the new block. This property causes Algorithms 2 and 8 in Neal (2000), which we use to update the DP parameters, to coincide.

Southern Italian Community Structure
In the stochastic blockmodel, nodes in the same block are not necessarily connected. This level of flexibility is particularly desirable when the network is observed directly, and the focus is on understanding network formation and evolution. On the other hand, in graphical models, the graph is a latent variable and, in applications, it might be appropriate to impose a more restrictive definition of block/community. More restrictions can also provide the benefit of improving computations as the space to explore gets smaller.
Here we assume that nodes in a block form a clique, i.e. all pairs of nodes in the block are connected by an edge. Indeed, the earliest approaches for community structure in graphs consider cliques (Luce and Perry, 1949; Festinger, 1949), the rationale being that a community is strongest if each pair of its members is connected. Cliques have, for instance, biological significance in protein-protein interaction networks (Yu et al., 2006). We thus introduce a Bayesian nonparametric graph prior where each block is a clique.
In this context, the main objects of inference are the cluster allocation variables z_i, i ∈ V, as in Section 3.1, since the block structure is then given: Pr{(i, j) ∈ E} = 1 if z_i = z_j. We name such a construction the Southern Italian community structure (SICS), with reference to traditional Southern Italian communities where everybody knows each other. Still, there can be connections between nodes from different blocks and, for these edges, we assume a prior inclusion probability ρ: Pr{(i, j) ∈ E} = ρ for 1 ≤ i < j ≤ p and z_i ≠ z_j. This construction defines a prior p(G | z).
A prior on z completes the graph prior p(G) = ∫ p(G | z) p(z) dz. We assume a priori that z follows the Chinese restaurant process (Pitman, 2006) with concentration parameter ν, for concreteness and consistency with the DP in Section 3.1. We note that our approach is flexible and other priors on the partition of nodes z = (z_1, . . ., z_p) can be straightforwardly adopted borrowing from the rich Bayesian nonparametric literature. The CRP assumption implies

Pr(z_i = k | z_1, . . ., z_{i−1}) = n_k / (ν + i − 1),  Pr(z_i = K⁻ + 1 | z_1, . . ., z_{i−1}) = ν / (ν + i − 1),

where n_k is the number of nodes among the first i − 1 assigned to block k and K⁻ is the number of blocks among them.

We now highlight some of the implications of the SICS prior on the overall graph structure and on the corresponding MCMC algorithm. In the standard GGM framework, moves on the posterior space of graphs usually involve a single edge and, consequently, when updating the graph, we only need to integrate out one element of the Cholesky decomposition of the precision matrix (assumed to have a G-Wishart prior), leading to efficient computation (van den Boom et al., 2022a). In such a context, updating more than one edge at a time is extremely challenging. On the other hand, under the SICS framework, a change of block membership for a single node can affect a number of edges in the graph:

• The node joins a new block, and forms connections with every node already present in that block; thus a number of edges are introduced in the graph.

• The node is removed from the current block (which is represented by a clique). Thus, as many edges as the number of nodes left in that block are removed. Some of these edges might be re-added as there is a positive probability ρ of connection between nodes in different communities.
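Sampling from the SICS prior p(G | z) given a partition is straightforward; the sketch below (an illustrative function, not the paper's MCMC) makes the two cases explicit:

```python
import numpy as np

def sics_graph(z, rho, rng):
    """Draw a graph from the SICS prior p(G | z): pairs in the same block are
    connected with probability one (each block is a clique), while pairs in
    different blocks are connected independently with probability rho."""
    p = len(z)
    E = set()
    for i in range(p):
        for j in range(i + 1, p):
            if z[i] == z[j] or rng.random() < rho:
                E.add((i, j))
    return E

rng = np.random.default_rng(1)
z = [0, 0, 0, 1, 1]          # two blocks: {0, 1, 2} and {3, 4}
G = sics_graph(z, rho=0.1, rng=rng)
```

Every within-block pair is guaranteed to appear in G, which is exactly why a single change of z_i touches many edges at once.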
From a computational point of view, the structure in SICS poses a challenge to posterior inference using MCMC, due to such multiple-edge changes. Updates of z conditional on G are not possible. Instead, we devise MCMC steps to update z and G jointly, in addition to updating G conditional on z. Section S2.2 of Supplementary Material (van den Boom et al., 2022b) details the MCMC.
The SICS prior is a limiting case of the stochastic blockmodel, obtained by taking the limit β_k → ∞ in the model described in Section 3.1. In the limit, blocks become cliques with probability one. Palla et al. (2012) consider a latent factor model that is, conditionally on the factor loadings, equivalent to a GGM with the SICS prior where ρ = 0. That is, there is an edge between two nodes if and only if they belong to the same block.

Multiple Graphs
In applications, it is common that data (i.e. the rows of the observation matrix Y) are naturally clustered due to experimental conditions. For instance, Y could represent gene expression measurements with different groups of rows corresponding to different cancer types. One way to deal with such heterogeneity is a multiple graphical model (e.g. Peterson et al., 2015; Ma and Michailidis, 2016; Mitra et al., 2016; Tan et al., 2017), where each graph corresponds to an experimental condition (e.g. case/control status).
Consider multiple graphs G x = (V, E x ) and associated data (n x , Y x ) for x = 1, . . ., q.
Here, x indexes the groups such that we have n_x observations in Y_x, with each group characterised by its own graph and data-generating process p(Y_x | G_x). We model the graphs G_1, . . ., G_q jointly through the specification of a prior p(G_1, . . ., G_q). The goal is to identify common patterns, as well as idiosyncratic edge/block structures. We now introduce the multiple graph extension of the degree-corrected stochastic blockmodel described in Section 3.1. For each graph G_x, we have Pr{(i, j) ∈ E_x} = Φ(µ_xij) and (2) becomes

µ_xij = θ_i + θ_j + β_xij,

where z_xi is the allocation variable for group x, x = 1, . . ., q, and node i ∈ V. Similarly to Section 3.1, we introduce an auxiliary variable β_xi | H ∼ H for i ∈ V, marginally for each x. Then, β_xij = β_xi = β_xj when i, j belong to the same community under condition x. The other parameters of the blockmodel are shared across graphs and have priors as specified in Section 3.1. Thus, marginally for each x, we recover the blockmodel of Section 3.1.
We treat one group (and thus one graph) as the baseline and the other graphs as offsets from the baseline group. For ease of notation, we set x = 1 as the baseline group and, for clarity, denote the corresponding parameters by a subscript 'b'.
There is a vast literature on Dependent Dirichlet processes (DDPs, see, e.g., MacEachern, 1999; De Iorio et al., 2004; Müller et al., 2004; Camerlenghi et al., 2019; Quintana et al., 2022), where the goal is to cluster subjects based also on group information. These tools are not directly applicable to our context as we are actually clustering variables (i.e. nodes on the graph) observed on n_x subjects under each of q experimental conditions (groups). Since we are assuming that β_xi ∼ H marginally for each group x, the same node i under different groups can belong either to a different cluster (block) or to the same one. In the multiple graph context, it is then desirable to have Pr(z_bi = z_xi) > Pr(z_bi = z_xj) for i ≠ j and x ≥ 2 to reflect that node i in G_b and in G_x corresponds to the same variable and to encourage sharing of large-scale structures across graphs. In a multiple networks context, Reyes and Rodríguez (2016) require the same property. See also the discussion on identifiability for multiple blockmodels in Section 2.2 of Matias and Miele (2016) and arguments for the related concept of separate exchangeability in Lin et al. (2021). On the other hand, DDP models typically assume exchangeability within each group x across subjects, which implies that Pr(z_xi = k) = Pr(z_xj = k) a priori. We thus consider the following set-up, where the block structures {z_xi}_{i=1}^p, x ≥ 2, are estimated as offsets from the baseline {z_bi}_{i=1}^p. Let β_bi | H ∼ H for i ∈ V and H ∼ DP(ν, H_0) where H_0 = N(0, s²_β). Then, set β_xi = β_bi with probability γ ∈ (0, 1) and β_xi | H ∼ H with probability 1 − γ, independently for i ∈ V and x = 2, . . ., q.

We note that our construction is invariant to any relabelling of the blocks. This is due to the fact that the distribution of β_xi is a mixture of a point mass at β_bi and H:

β_xi | β_bi, H ∼ γ δ_{β_bi} + (1 − γ) H,

where H defines the overall block structure (across multiple graphs). Hence, our prior specification allows keeping the same block labels across multiple graphs, while biasing the probability that a node belongs to the same block across conditions. Moreover, the implied dependence across block structures closely resembles the type of dependence across partitions described in Page et al. (2022). In a dynamic clustering framework, they avoid the use of cluster labels by specifying a prior directly on random partitions, inflating the probability of belonging to the same cluster across time.
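The offset mechanism can be sketched directly, truncating H to finitely many atoms (the function and variable names are ours, for illustration only):

```python
import numpy as np

def offset_blocks(beta_b, atoms, weights, gamma, rng):
    """Sketch of the offset construction for condition x: keep the baseline
    value beta_b[i] with probability gamma (genealogical indicator g_xi = 1),
    otherwise redraw from the shared discrete measure H = sum_k w_k delta_{beta_k},
    here truncated to the given atoms. Returns the new values and the indicators."""
    p = len(beta_b)
    g = rng.random(p) < gamma                      # g_xi = 1: inherit from baseline
    redraw = rng.choice(atoms, size=p, p=weights)  # g_xi = 0: fresh draw from H
    return np.where(g, beta_b, redraw), g

rng = np.random.default_rng(2)
atoms = np.array([-1.0, 0.5, 2.0])              # hypothetical atoms of a truncated H
weights = np.array([0.5, 0.3, 0.2])
beta_b = rng.choice(atoms, size=6, p=weights)   # baseline node-level values
beta_x, g = offset_blocks(beta_b, atoms, weights, gamma=0.8, rng=rng)
```

Because H is discrete, even a redrawn value can coincide with the baseline one, which is why the probability that β_xi = β_bi exceeds γ.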
Posterior computations are greatly simplified by the introduction of binary "genealogical indicators" g_xi ∼ Bernoulli(γ), such that β_xi = β_bi if g_xi = 1 and β_xi | H ∼ H if g_xi = 0. Note that even in the case g_xi = 0, there is a positive probability that β_xi = β_bi due to the discrete nature of H. This implies that the probability of β_xi = β_bi, x ≥ 2, is greater than Pr(g_xi = 1) = γ a priori conditionally on H. Section S2.3 of Supplementary Material (van den Boom et al., 2022b) details an MCMC algorithm for inference which involves a joint update for (g_xi, β_xi). The prior dependence among z_1i, . . ., z_qi enables learning of block structure both within and across graphs. The indicators g_xi capture the extent to which structure in G_x is shared with G_b. At the same time, the cluster indicators {z_xi}_{i=1}^p capture the within-graph block structure. Thus, the proposed prior construction allows for borrowing of large-scale information across graphs, as well as the detection of graph-specific blocks.
We want to highlight that our construction differs from the hierarchical Dirichlet process (HDP, Teh et al., 2006), as assuming an HDP-type prior would imply that β_xi | H_x ∼ H_x with H_x ∼ DP(·, H) and H ∼ DP(ν, H_0). This means that the β_xi have a positive probability of being equal across groups x (and obviously nodes i), but the same node would not have a higher probability of belonging to the same block across groups.
There are proposals in the graphical model literature where a graph is considered baseline (e.g. Mukherjee and Speed, 2008; Telesca et al., 2012; Mitra et al., 2016; Tan et al., 2017), but their focus is on differences in individual edges instead of blocks. Moreover, Paul and Chen (2020) consider blockmodels with multiple graphs in a frequentist framework where the number of blocks is known and assume a hierarchical structure for the block memberships under each experimental condition, linking block membership to an unknown baseline membership. Reyes and Rodríguez (2016) and Stanley et al. (2016) induce dependence among graphs by assuming that they either share the same block structure or have unrelated block structures, which leads to a less flexible modelling tool than our approach. Amini et al. (2021) use an HDP to link block structure across graphs, which presents the limitation described in the previous paragraph. Ma and Michailidis (2016) assume that multiple graphs share the same known block structure and that only edges within and across blocks might differ across experimental conditions. Edges are estimated using regularised nodewise regression (Zhou et al., 2011), instead of working directly on graph space. In a different context, previous work on hidden Markov models (Fox et al., 2008), including time-varying blockmodels (Ishiguro et al., 2010; Fan et al., 2015), involves a similar dependence for cluster indicators across time, but this dependence is induced through a more involved construction with a "spiked" base measure for the DP on the transition probability vector of the Markov chain.
In our construction, the distribution of {z_xi}, i = 1, …, p, for x ≥ 2 is defined conditionally on {z_bi} such that {g_xi} captures differences from {z_bi}. Alternative dependence structures could easily be considered within our framework. For instance, instead of setting a group as baseline, we could specify a latent block structure {z_0i} and then define {z_xi}, x ≥ 1, as deviations from {z_0i}, for which a prior process needs to be specified (e.g. simply assume H as prior). Finally, the multiple graph set-up can be straightforwardly extended to the SICS prior from Section 3.2.

Testing for Large-Scale Structure
In this section, we describe a strategy to test whether there is any block structure in an individual graph. Although the description below involves only one graph for ease of explanation, the same techniques can be employed to test for the presence of structure in multiple graphs.
In model (2), block structure is represented through the indicator vector z. Thus, testing for the presence of block structure is equivalent to testing whether K ≥ 2. In the Bayesian paradigm, we can use Bayes factors, and here we describe a computational method for their evaluation based on the Savage-Dickey ratio (Dickey, 1971), as they are not available analytically. Consider a prespecified block structure z'. For instance, z' can consist of a single (K = 1) block, i.e. no large-scale structure, such that the test assesses the evidence for any block structure. Our method will often be computationally infeasible for other choices of z', as discussed later, but the idea applies to any z' in principle.
The computation of Bayes factors for DP-based models has received attention in the literature, though with some drawbacks: the method from Basu and Chib (2003) requires an extra MCMC run with z fixed to z' and the use of sequential importance sampling, resulting in an involved strategy that is not easily integrated into an existing MCMC implementation. Legramanti et al. (2022a) evaluate the marginal likelihood of each model using the harmonic mean approach (Newton and Raftery, 1994; Raftery et al., 2007), which can be unstable or slow to converge. Application of their method in our (and others') context would benefit from direct evaluation of p(Y | z), which is not available in closed form. Without an analytical form for p(Y | z), the harmonic mean approach requires an extra MCMC run with z fixed to z' to approximate p(Y | z'). Indeed, one of the main advantages of our method is that Bayes factors can be evaluated directly from the MCMC output for the model of interest.
More in detail, the Bayes factor for the relative evidence of the model M' with z fixed to z' versus the model M with z ∼ p(z) is

B = p(Y | M') / p(Y | M) = p(z' | Y) / p(z'),

where the last ratio is the Savage-Dickey ratio. Note that the second equality uses the property that the priors on all remaining parameters in the model, such as the β_i, are the same under M and M'.
This scheme can be employed to compute B from an MCMC chain as long as p(z') is not too small. In that case, reliably estimating p(z' | Y) might be hard, as the MCMC chain could visit z' only rarely after convergence. Furthermore, p(z') will often be too small if z' corresponds to multiple blocks, due to the combinatorially many ways to assign p nodes to K ≥ 2 blocks, but it usually assumes reasonable values for z' corresponding to absence of block structure (K = 1), which refers to the conventional null hypothesis of no structure (i.e. "no effect"). For instance, we test for K = 1 in the examples considered in this work. We remark that the methods from Basu and Chib (2003) and Legramanti et al. (2022a) do not have such a limitation for small p(z'), but in general require additional MCMC runs. Note that p(z' | Y) being estimated as (close to) zero is not problematic, but leads to an accurate estimate of B ≈ 0 (as long as p(z') is sufficiently far from zero). In Section S4 of Supplementary Material (van den Boom et al., 2022b), we show empirically that the proposed Bayes factor estimation converges faster than the harmonic mean approach.
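To make the estimator concrete, the following sketch computes B for z' corresponding to K = 1 under a DP partition prior, using the EPPF p(z' | ν) = ν (p − 1)!/(ν)_p, with (ν)_p the rising factorial, and ν ∼ Gamma(a_ν, b_ν). The function names are ours and the snippet is an illustration, not the paper's implementation.

```python
import numpy as np
from math import lgamma
from scipy.integrate import quad
from scipy.stats import gamma

def log_p_single_block(nu, p):
    # DP EPPF evaluated at the one-block partition of p nodes:
    # p(z' | nu) = nu * (p-1)! / (nu)_p, with (nu)_p = Gamma(nu+p)/Gamma(nu)
    return np.log(nu) + lgamma(p) + lgamma(nu) - lgamma(nu + p)

def prior_prob_single_block(p, a_nu=2.0, b_nu=2.0):
    # p(z') = integral of p(z' | nu) * Gamma(nu | a_nu, b_nu) dnu, by quadrature
    integrand = lambda nu: np.exp(log_p_single_block(nu, p)) * gamma.pdf(nu, a_nu, scale=1.0 / b_nu)
    value, _ = quad(integrand, 0.0, np.inf)
    return value

def savage_dickey_bf(z_samples, p, a_nu=2.0, b_nu=2.0):
    # Savage-Dickey estimate B = p(z' | Y) / p(z'), with the numerator
    # replaced by the sample frequency of K = 1 in the MCMC output
    post = np.mean([len(np.unique(z)) == 1 for z in z_samples])
    return post / prior_prob_single_block(p, a_nu, b_nu)
```

When the chain never visits z', the estimate is simply B ≈ 0, which is accurate provided p(z') is sufficiently far from zero.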

Simulation Studies
We demonstrate the performance of our approach in two simulation scenarios. For all empirical results, we set the hyperparameters as γ = 0.5, s²_β = s²_θ = 1, a_ν = b_ν = a_α = b_α = 2, δ = 3 and D = I_p unless otherwise stated. See Section S1 of Supplementary Material (van den Boom et al., 2022b) for an overview of the models.

Karate Club Network
We investigate the importance of uncertainty propagation when learning community structure in a graphical model. As true underlying graph G, we consider the karate club network (Zachary, 1977), which Tan and De Iorio (2019) analyse using the degree-corrected stochastic blockmodel of Section 3.1. The network's p = 34 nodes correspond to members of a karate club while its 78 edges signify friendships between members. Conditionally on G, we sample a precision matrix Ω | G ∼ W_G(δ, D). The n rows of Y are sampled independently from N(0_{p×1}, Ω⁻¹) according to the GGM in Section 2. Finally, we fit the model from Section 3.1 using 6000 MCMC iterations, discarding the first 1000 as burn-in. In this case, we set a_ν = b_ν = a_α = b_α = 5 as in Tan and De Iorio (2019) for a fair comparison.
We repeat the simulation for n = 10⁴, 10³, 10², 10 while keeping Ω the same and present the estimated community structure in Figure 2. For n = 10⁴, the results are very close to those in Tan and De Iorio (2019), where the underlying network is known, with the two main blocks corresponding to the karate instructor Mr Hi and the club's president John A. The increased uncertainty in the estimation of G for smaller values of n obviously affects inference on the block structure, with too little information present in the data with only n = 10 observations to recover the two main blocks. This is also reflected in the estimate of the Bayes factor comparing the model with no block structure versus the model with z ∼ p(z): B = 0 for n = 10⁴, 10³, B = 0.28 for n = 10² and B = 0.40 for n = 10. These results show that uncertainty in graph estimation can have a major impact on community estimation, and this uncertainty should not be ignored as is often done in applications where a two-step approach is adopted (first graph estimation and then blocks).

Block Structure Recovery
We now investigate how accurately the proposed methodology can recover block structure. We assign p = 20 nodes to K clusters by sampling z_i with replacement from {1, …, K} for i ∈ V. Then, we generate a graph G according to the SICS prior from Section 3.2 with between-block edge inclusion probability ρ = 0.2. Data Y corresponding to G are sampled as in Section 5.1. We consider the following scenarios: K = 4 for n = 20, 100, 500, 1000, and n = 500 for K = 2, 3, 4, 5. The performance of the algorithms is assessed over 50 replicates for each scenario.
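The graph-generation step of this scenario can be sketched as follows, assuming the SICS construction reduces to within-block cliques plus independent between-block edges with probability ρ; the helper name is ours, and sampling Ω | G from the G-Wishart and Y from the GGM is omitted.

```python
import numpy as np

def simulate_sics_graph(p=20, K=4, rho=0.2, seed=0):
    """Sample block labels uniformly with replacement from {1,...,K} and
    build a SICS-style graph: nodes in the same block form a clique, and
    between-block edges are included independently with probability rho."""
    rng = np.random.default_rng(seed)
    z = rng.integers(1, K + 1, size=p)
    A = np.zeros((p, p), dtype=int)
    for i in range(p):
        for j in range(i + 1, p):
            if z[i] == z[j]:
                A[i, j] = 1                     # within-block edge: always present
            else:
                A[i, j] = rng.random() < rho    # between-block edge with prob. rho
    return z, A + A.T
```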
We fit both models from Sections 3.1 and 3.2, as well as the model by Sun et al. (2014) (see Section S3 of Supplementary Material (van den Boom et al., 2022b) for a description) for comparison. We run the MCMC for 1000 iterations after a burn-in of 500 for the stochastic blockmodel and the model by Sun et al. (2014), while we record 5000 iterations after a burn-in of 1000 for the SICS model to account for the slower convergence and mixing of its MCMC.
The cluster allocation vector z informs the block structure. As point estimate for z, we report the configuration that minimises the posterior expectation of Binder's (1978) loss function under equal misclassification costs, which is a common choice in the applied Bayesian nonparametrics literature (Lau and Green, 2007). See Appendix B of Argiento et al. (2014) for computational details. Briefly, this expected loss measures, over all possible pairs of nodes, the difference between the posterior probability of co-clustering and the estimated cluster allocation. Following Sun et al. (2014), we use the Rand (1971) index to quantify the difference between the true allocation and its Binder point estimate. A Rand index of one corresponds to a perfect match while a lower value indicates worse block structure recovery.
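A minimal sketch of this point-estimation step, restricting the minimisation of the posterior expected Binder loss to the sampled configurations (a common practical shortcut) and computing the Rand index; the helper names are ours.

```python
import numpy as np
from itertools import combinations

def similarity_matrix(z_samples):
    # Posterior co-clustering probabilities Pr(z_i = z_j | Y)
    z = np.asarray(z_samples)
    return np.mean(z[:, :, None] == z[:, None, :], axis=0)

def binder_loss(z, sim):
    # Sum over pairs i < j of |1{z_i = z_j} - Pr(z_i = z_j | Y)| (equal costs)
    co = (z[:, None] == z[None, :]).astype(float)
    iu = np.triu_indices(len(z), k=1)
    return np.abs(co - sim)[iu].sum()

def binder_point_estimate(z_samples):
    # Pick the sampled configuration minimising the Binder loss
    sim = similarity_matrix(z_samples)
    losses = [binder_loss(np.asarray(z), sim) for z in z_samples]
    return np.asarray(z_samples)[int(np.argmin(losses))]

def rand_index(z_true, z_est):
    # Fraction of node pairs on which the two allocations agree
    agree = sum((z_true[i] == z_true[j]) == (z_est[i] == z_est[j])
                for i, j in combinations(range(len(z_true)), 2))
    return agree / (len(z_true) * (len(z_true) - 1) / 2)
```

Note that the Rand index is label-invariant: relabelling the blocks leaves it unchanged.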
Figure 3 shows that the proposed methodology recovers the block structure comparably to, or substantially more accurately than, the model by Sun et al. (2014). The superior performance of the stochastic blockmodel over SICS, when occurring, is most likely due to the fact that the SICS model imposes more stringent assumptions on the correlation structure of the data, which might not be captured with small sample sizes (left panel of Figure 3). This is in line with power considerations for detecting correlation in a frequentist framework, with large sample sizes usually required, especially for partial correlations (see, e.g., Castelo and Roverato, 2006; Knudson and Lindsey, 2014). Secondly, the SICS structure is more easily recovered when fewer nodes belong to a block (right panel of Figure 3), as this relaxes the assumption on the overall dependence structure among the random variables. Finally, posterior inference for the stochastic blockmodel is performed through an exact MCMC (van den Boom et al., 2022a) while, for the SICS, we employ a Laplace approximation for the graph likelihood to update z and G jointly. See Section S2 of Supplementary Material (van den Boom et al., 2022b).

Applications
We apply the proposed models to two real data sets. We discuss MCMC mixing and convergence in Section S5 of Supplementary Material (van den Boom et al., 2022b).

Mutual Fund Data
We consider data on monthly returns of p = 59 mutual funds described in Scott and Carvalho (2008). The funds are divided into four types by the sectors they invest in, with 13 funds investing in U.S. bonds, 30 in U.S. stocks, 7 in both U.S. stocks and bonds, and 9 in international stocks. The data contain observations on n = 86 months. Here, we ignore the dependence of the returns across time and focus on the dependence between funds, as in Scott and Carvalho (2008) and Marlin et al. (2009). Note that time dependence could easily be incorporated through a mean term. The returns are quantile-normalised so that they marginally follow a standard Gaussian distribution. We fit both the degree-corrected stochastic blockmodel from Section 3.1 and SICS from Section 3.2. We run the MCMC chain for 15000 iterations discarding the first 5000 as burn-in for the stochastic blockmodel, and for 110000 iterations, discarding an initial burn-in, for the SICS model.

The stochastic blockmodel identifies clear blocks of funds per Figures 4 (left panel) and 5. Specifically, the U.S. bonds, U.S. stocks and bonds, and international stocks funds are each blocked together, without overlap between these fund types except for two international stocks funds that are grouped with the U.S. stocks and bonds. The other funds, which invest in U.S. stocks, are mostly blocked with the U.S. stocks and bonds, or with the international stocks, but not with the U.S. bonds. These results are intuitive as funds with a mixture of U.S. stocks and bonds make investments which overlap with funds with only U.S. stocks, and correlation between the returns of U.S. and international stocks is likely. This identified block structure is notably more in line with the fund types than the blocking results presented in Marlin et al. (2009), obtained by shrinkage estimation and optimisation, where only a clear separation of the U.S. bonds funds from the others is detected.
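The quantile-normalisation step can be sketched as follows; the rank/(n + 1) rescaling is our assumption to avoid infinite Gaussian quantiles at the extremes.

```python
import numpy as np
from scipy.stats import norm, rankdata

def quantile_normalise(Y):
    """Map each column of Y to scores that marginally follow a standard
    Gaussian: replace values by their within-column ranks and apply the
    Gaussian quantile function."""
    Y = np.asarray(Y, dtype=float)
    n = Y.shape[0]
    ranks = np.apply_along_axis(rankdata, 0, Y)  # ranks 1..n per column
    return norm.ppf(ranks / (n + 1))             # avoids quantiles 0 and 1
```

The transformation is monotone within each column, so rank-based dependence between funds is preserved.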
The SICS prior leads to a large number of blocks, with a posterior mode at K = 24 blocks. A larger number of blocks with SICS than with the stochastic blockmodel is expected, as SICS' definition of a block as a clique is more stringent, such that larger blocks are less likely to appear. The large-scale pattern of the similarity matrix for SICS is still similar to that of the stochastic blockmodel in Figure 4.
Generally, the posterior inference contains information on whether the stochastic blockmodel or SICS is the more appropriate model for the data at hand. We assess this by computing the Bayes factor of the stochastic blockmodel versus SICS using the harmonic mean approach. This results in a log Bayes factor of 235, indicating strong evidence that the stochastic blockmodel fits the data better than SICS.

Gene Expression Data
As a second application, we consider gene expression levels, the interactions between which are often represented as networks. An important concept in the gene network literature is that of a module, which is a densely connected subgraph. Thus, learning the block structure of a graph allows for module detection. Typically, a two-step approach is adopted: first the graph is estimated from the gene expressions, and then the modules are derived from the graph estimate (see, e.g., Zhang, 2018), which underestimates uncertainty and often leads to false positives.
We analyse data on gene expressions from n₁ = 590 breast cancer tissue samples and n₂ = 561 ovarian cancer samples from The Cancer Genome Atlas. We focus on p = 44 genes identified by Zhang (2018) as spread across four estimated modules (Modules 6, 14, 36 and 39 in Table 2 of the cited paper) which are highly enriched in terms of Gene Ontology (GO, Ashburner et al., 2000) annotations. For each cancer, the gene expressions are quantile-normalised to marginally follow a standard Gaussian distribution. We apply the proposed multiple graph methodology from Section 3.3 with q = 2 separate groups, corresponding to the two different cancers. We run the MCMC algorithm for 55000 iterations, discarding the first 5000 as burn-in.
Posterior inference on block structure, shown in Figure 6, carries strong similarities with the modules identified by Zhang (2018). For ease of discussion, we refer to the modules from Zhang (2018) as Module 1 (comprising Nodes 1 through 7), 2 (Nodes 8 through 32), 3 (Nodes 33 through 38) and 4 (Nodes 39 through 44), and highlight them in Figure 6. The proposed methodology finds differences in block structure between breast and ovarian cancer (see middle panel of Figure 6) as well as differences from Zhang's modules, which are forced to be the same across both cancers by construction. Across both breast and ovarian cancer, we find that Nodes 39 and 40 (the GSTM3 and BCAR3 genes, respectively) are grouped with genes from Module 2, which has GO annotations relating to inflammatory response. For breast cancer, we cluster Nodes 42 through 44 (GSTM1, GSTM2 and GSTM5) with Module 2, while we put them together with Module 1, which has GO annotations relating to pattern specification, for ovarian cancer. Note that these three genes are paralogs of each other, which suggests that they have similar function. Finally, Nodes 35 and 41 (HOXB13 and GSTM4) are grouped with nodes in Module 1 for both cancers. These results show the flexibility of the proposed model to capture differences as well as commonalities in large-scale dependence structure across multiple biological conditions.

Discussion
In this work, we combine advances from random graph theory with graphical models to obtain joint estimation of the graph and its large-scale structure. The resulting graphical models are able to go beyond estimation of individual edges to provide inference on the community structure of the graph, while appropriately propagating uncertainty in the estimation. We introduce a novel DDP prior process tailored to the multiple graph setting and propose a convenient computation of Bayes factors in partition models. Advantages of the proposed approach include interpretability, flexibility (due to the nonparametric component) and wide applicability. We focus on two different block structures: stochastic blockmodels and SICS. We note that the SICS prior is more suitable in applications where strong partial correlation between a small number of nodes is expected.
Alternative priors on the block structure could be considered. For instance, Gibbs-type priors (Gnedin and Pitman, 2005) and microclustering priors (Betancourt et al., 2022) are drop-in replacements for the respective DP terms used in (3), and cover a wide range of partition priors such as mixtures with a random number of components (Miller and Harrison, 2017; Geng et al., 2018; Argiento and De Iorio, 2022). In general, standard Bayesian nonparametric priors assume that the location parameters are i.i.d. draws from a base measure. Note that, in our case, the location parameters β_k correspond to the within-block interaction strength, and they can be arbitrarily close or far apart under the prior assumptions. On the other hand, in the context of mixtures with a random number of components, it is easier to introduce constraints on the locations if the application or interpretability requires it. For instance, we might impose that the β_k vary significantly across blocks and, to that end, assume a repulsive mixture prior (Petralia et al., 2012).
We focus on GGMs for convenience and because of their popularity. Our methodological contribution is however not constrained to this specific set-up and can be extended to other graphical models, e.g. to graphs with discrete or mixed-type nodes. Moreover, our computational strategy to estimate Bayes factors finds general applicability in the context of Bayesian nonparametric models to test the presence or absence of a partition structure.
We conclude the discussion with a few remarks on the asymptotic properties of the proposed methodology. Posterior contraction on the graph space does not imply contraction for the block structure and vice versa: posterior contraction of the graph requires the number of observations n → ∞. See Lee and Cao (2021) and Niu et al. (2021), and references therein. On the other hand, asymptotic results for Bayesian learning of block structure involve the number of nodes p → ∞ (e.g. Geng et al., 2018; Gao et al., 2020; Jiang and Tokdar, 2021). For large n, the graph might be estimated with high precision. In that scenario, considerable posterior uncertainty about the block structure can remain, as such uncertainty can be present even if the graph is observed. Furthermore, for large p, block structure can be recovered accurately even if the estimates of the overall graph connection pattern are characterised by high uncertainty.

S2 Markov Chain Monte Carlo Algorithms
This section discusses the Markov chain Monte Carlo (MCMC) algorithms used in the main text.

S2.1 Degree-Corrected Stochastic Blockmodel
Algorithm S1 details an MCMC step to sample from the joint posterior on (G, θ, β, z) with the degree-corrected stochastic blockmodel as prior on the graph space. It uses the G-Wishart weighted proposal algorithm (WWA) of van den Boom et al. (2022a) to update the graph G. We use the latent variable update from Albert and Chib (1993) for β and θ. The update of the block labels z uses that (3) also applies to the Dirichlet process. Define the cluster label c_i for the popularity parameter θ analogously to z_i for β, and let M denote the number of unique elements in θ such that 1 ≤ c_i ≤ M. Then, analogously to (3),

p(c_i = k | c_{-i}) ∝ n^θ_{-i,k} for k = 1, …, M_{-i}, and p(c_i = M_{-i} + 1 | c_{-i}) ∝ α,

where n^θ_{-i,k} = |{1 ≤ j ≤ p | c_j = k, j ≠ i}| and M_{-i} is the number of unique elements in c_{-i} = {c_j | j ≠ i}. Also, denote the M-dimensional vector with the unique values of θ by θ*. The updates for the Dirichlet precision parameters ν and α follow Escobar and West (1995).
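The prior part of this urn update can be sketched as follows. This is only the Pólya urn weighting; Algorithm S1 additionally multiplies in the graph likelihood, which we omit, and the function name is ours.

```python
import numpy as np

def sample_label_polya_urn(i, c, alpha, rng):
    """Draw c_i from the DP Polya urn given the other labels c_{-i}:
    an existing value k is chosen with weight n_{-i,k} (its current
    occupancy), a new value with weight alpha."""
    c_minus = np.delete(np.asarray(c), i)
    labels, counts = np.unique(c_minus, return_counts=True)
    weights = np.append(counts.astype(float), alpha)  # existing blocks, then "new"
    probs = weights / weights.sum()
    k = rng.choice(len(weights), p=probs)
    if k < len(labels):
        return labels[k]          # join an existing cluster
    return labels.max() + 1       # open a new cluster
```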
Algorithm S1 is similar to Algorithm 1 of Tan and De Iorio (2019). A difference is that we marginalise over the latent variables ζ_ij when updating the block labels z_i.
Algorithm S1 MCMC step for the degree-corrected stochastic blockmodel.
2. For all 1 ≤ i < j ≤ p: Update the block labels z. For i = 1, …, p: (a) Set K_{-i} equal to the number of unique values in z_{-i} and relabel z_{-i} such that 1 ≤ z_j ≤ K_{-i} for j ≠ i.
(b) Denote the µ resulting from z_i = k by µ_k. Here, µ_{K_{-i}+1} does not involve the not yet specified β_{K_{-i}+1}. The neighbourhood of node i is N_i = {j | (i, j) ∈ E}. Sample z_i according to its conditional posterior, in which the Pólya urn weights π_ν for z_i and π_α for c_i multiply the graph likelihood implied by µ_k.

S2.2 Southern Italian Community Structure
As mentioned in Section 3.2, the joint update for z and G is a Metropolis-within-Gibbs step where we update (z_i, G) conditionally on z_{-i} for i ∈ V sequentially. The remainder of this subsection considers this conditional update, where we do not always make the conditioning on z_{-i} explicit. The target distribution is proportional to the conditional posterior of (z_i, G). We choose a Metropolis-Hastings proposal that factorises as q(z̃_i, G̃ | z_i, G) = q(z̃_i | z_i) q(G̃ | z̃_i, G), where (z̃_i, G̃) denotes the proposed value. We set q(z̃_i | z_i) equal to a uniform distribution over the (K_{-i} + 1) possible values for z̃_i such that q(z̃_i | z_i) = 1/(K_{-i} + 1). The terms q(z̃_i | z_i) and q(z_i | z̃_i) cancel out in the acceptance probability, since z_i and z̃_i lead to the same value of K_{-i}. We define q(G̃ | z̃_i, G) by 1. adding all edges {(i, j) | z_j = z̃_i}, if any, following from z̃_i = z_j = k implying (i, j) ∈ G̃, 2. if z̃_i = z_i, resampling all edges {(i, j) | z_j ≠ z_i} with probability ρ, and 3. if z̃_i ≠ z_i, resampling all edges {(i, j) | z_j = z_i} with probability ρ.
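A simplified sketch of this proposal is given below: we propose z̃_i uniformly over the K_{-i} + 1 candidate blocks, enforce the clique constraint towards the proposed block, and resample all remaining edges incident to i with probability ρ. This collapses the two resampling cases above into one, so it is an illustration under our own simplifying assumption, and the names are ours.

```python
import numpy as np

def propose_sics_move(i, z, A, rho, rng):
    """Metropolis-within-Gibbs proposal for (z_i, G) under SICS:
    uniform proposal for z_i over existing blocks plus a new one;
    edges from i to the proposed block are set (clique constraint),
    other edges incident to i are resampled with probability rho."""
    z = np.asarray(z).copy()
    A = np.asarray(A).copy()
    others = np.delete(np.arange(len(z)), i)
    existing = np.unique(z[others])
    candidates = np.append(existing, existing.max() + 1)  # K_{-i} + 1 options
    z[i] = rng.choice(candidates)                         # q(z~_i | z_i) uniform
    for j in others:
        if z[j] == z[i]:
            A[i, j] = A[j, i] = 1                         # within-block clique
        else:
            A[i, j] = A[j, i] = int(rng.random() < rho)   # between-block edge
    return z, A
```

The proposed pair (z̃_i, G̃) is then accepted with probability min{1, R} as in Algorithm S2.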
Algorithm S2 MCMC step for the Southern Italian community structure. (a) Sample a proposal (z̃_i, G̃) from q(z̃_i, G̃ | z_i, G) as defined in the text.

Compute n
(b) Set (z_i, G) = (z̃_i, G̃) with probability min{1, R}, where R is given by (S2).

Update ν as in Step 7 of Algorithm S1.
Since this construction follows the prior p(G | z̃), part of the proposal cancels with the prior, which simplifies the Metropolis-Hastings acceptance probability. The acceptance probability equals min{1, R}, where R is given in (S2).

S2.3 Multiple Graphs
Algorithm S3 details the MCMC for the proposed multiple graph model. For notational convenience, we define g_xi also for x = 1 by g_1i = 1. Conditionally on g_xi, it largely mimics Algorithm S1, using a modified version of (3) in which the urn weights are defined conditionally on the baseline labels.

Algorithm S3 MCMC step for the multiple graph stochastic blockmodel.
2. For x = 1, …, q, for all 1 ≤ i < j ≤ p: p(Y_x | Ω_x) is the density resulting from the rows of Y_x being independently distributed according to N(0_{p×1}, Ω_x⁻¹). Analogously to Algorithm S1, Ω_x is sampled as part of Step 1 of Algorithm S3.
Figure S2 contains the resulting trace plots. They do not suggest deficient MCMC convergence or mixing.
where δ* = δ + n, D* = D + YᵀY and I_G(δ, D) is the normalising constant of the density of W_G(δ, D). The constant I_G(δ, D) is not analytically available for general, non-decomposable G. Thus, we make use of the Markov chain Monte Carlo (MCMC) methodology from van den Boom et al. (2022a) and of a Laplace approximation of I_G(δ, D) from Moghaddam et al. (2009) to perform posterior inference on G.
The priors on the remaining parameters are the same under M and M', in such a way that we recover the same model specification as M' when z = z' in M. Now, an estimate B̂ for B is obtained by plugging in the usual (sample frequency) estimate of p(z' | Y) derived from the MCMC chain, while p(z') is readily computed by numerical quadrature: p(z') = ∫ p(z' | ν) p(ν) dν, where p(ν) = Gamma(ν | a_ν, b_ν) (e.g. Legramanti et al., 2022b).

Figure 2: Karate club network: posterior similarity matrices for the simulation studies. The nodes are ordered as in Tan and De Iorio (2019).

Figure 3: Block structure recovery: Rand index versus the number of observations (left) and the number of clusters (right). The lines represent means over the 50 replicates for the stochastic blockmodel, SICS and the model by Sun et al. (2014). The shaded areas are 95% bootstrapped confidence intervals.

Figure 4: Mutual fund data: posterior similarity matrices. The red dotted lines demarcate the fund types.
The similarity values Pr(z_i = z_j | Y) under SICS are, however, much lower. Still, the posterior fit indicates strong evidence for the presence of a block structure, as measured by the Bayes factor in favour of absence of block structure.

Figure 5: Mutual fund data: median probability graph (Barbieri and Berger, 2004), which consists of edges with posterior inclusion probability greater than 0.5, from the stochastic blockmodel. Colour of nodes refers to fund type. Boxes group nodes belonging to blocks estimated by minimising Binder's loss function.
Figure 6: Gene expression data: posterior similarity matrices. The three matrices visualise the posterior probabilities of z_1i = z_1j, z_1i = z_2j and z_2i = z_2j, respectively, where x = 1 corresponds to breast cancer and x = 2 to ovarian cancer. The red dotted lines demarcate the modules identified by Zhang (2018).

Figure S2: Trace plots of the log-likelihood for the recorded MCMC iterations of the applications in Section 6.

Table S1: Overview of model notation.
z_i: Community that node i belongs to
β_i: Interaction strength of community z_i
θ_i: Popularity parameter of node i
H: Almost surely discrete random measure for the block structure
F: Almost surely discrete random measure for the popularity parameters
ν: Concentration parameter of the DP for the block structure
α: Concentration parameter of the DP for the popularity parameters
Y_x: Data matrix of group x
G_x: Graph corresponding to group x
Ω_x: Precision matrix corresponding to graph G_x
E_x: Set of edges of the graph corresponding to group x
z_xi: Community that node i belongs to in group x
β_xi: Interaction strength of community z_xi