Hierarchical Stochastic Block Model for Community Detection in Multiplex Networks

Multiplex networks have become increasingly more prevalent in many fields, and have emerged as a powerful tool for modeling the complexity of real networks. There is a critical need for developing inference models for multiplex networks that can take into account potential dependencies across different layers, particularly when the aim is community detection. We add to a limited literature by proposing a novel and efficient Bayesian model for community detection in multiplex networks. A key feature of our approach is the ability to model varying communities at different network layers. In contrast, many existing models assume the same communities for all layers. Moreover, our model automatically picks up the necessary number of communities at each layer (as validated by real data examples). This is appealing, since deciding the number of communities is a challenging aspect of community detection, and especially so in the multiplex setting, if one allows the communities to change across layers. Borrowing ideas from hierarchical Bayesian modeling, we use a hierarchical Dirichlet prior to model community labels across layers, allowing dependency in their structure. Given the community labels, a stochastic block model (SBM) is assumed for each layer. We develop an efficient slice sampler for sampling the posterior distribution of the community labels as well as the link probabilities between communities. In doing so, we address some unique challenges posed by coupling the complex likelihood of SBM with the hierarchical nature of the prior on the labels. An extensive empirical validation is performed on simulated and real data, demonstrating the superior performance of the model over single-layer alternatives, as well as the ability to uncover interesting structures in real networks.


Introduction
Networks, which are used to model interactions among a set of entities, have emerged as one of the most powerful tools for modern data analysis. The last few decades have witnessed explosions in the development of models, theory, and algorithms for network analysis. Modern network data are often complex and heterogeneous. To model such heterogeneity, multiplex networks (Boccaletti et al., 2014;Kivelä et al., 2014), which also go by various other qualifiers (multiple-layer, multiple-slice, multi-array, multirelational), have arisen as a useful representation of complex networks.
A multiplex network typically consists of a fixed set of nodes but multiple types of edges, often representing heterogeneous relationships as well as the dynamic nature of the edges. These networks have become increasingly prevalent in many applications. Typical examples include various types of dynamic networks (Berlingerio et al., 2013;Majdandzic et al., 2014) including temporal networks, dynamic social networks (Carley et al., 2009) and dynamic biological networks such as protein networks . An example of a multiplex social network is the Twitter network where different edge information is available such as "mention", "follow" and "retweet" (Greene and Cunningham, 2013). Another example is the co-authorship network where the authors are recorded on relationships such as "co-publishes", "convenes" and "cites" (Hmimida and Kanawati, 2015). Many other examples can be found in economics (Poledna et al., 2015), neuroscience (De Domenico et al., 2016), biology (Didier et al., 2015) and so on.
Due to the ubiquity of multiplex network data, there is a critical need for developing realistic statistical models that can handle their complex structures. There has already been growing efforts in extending static (or single-layered) statistical network models to the multiplex setting (Mucha et al., 2010). Many statistical metrics have already been extended from basic notions such as degree and node centrality (Bródka et al., 2011;Battiston et al., 2014) to the clustering coefficient and modularity (Cozzo et al., 2013;Blondel et al., 2008). Popular latent space models (Raftery et al., 2002) have also been extended to dynamic (Sarkar and Moore, 2005) and multiplex (Gollini and Murphy, 2016;Salter-Townshend and McCormick, 2017) networks. Based on such models, one can perform learning tasks such as link probability estimation or link prediction.
Another task that has been extensively studied in single-layer networks is community detection, that is, clustering nodes into groups or communities. There have already been a few works proposing models or algorithms for community detection in multiplex and multilayer networks (Mucha et al., 2010;Berlingerio et al., 2011;Kuncheva and Montana, 2015;Wilson et al., 2017;De Bacco et al., 2017;Bhattacharyya and Chatterjee, 2018). The general strategy adopted is to either transform a multiplex network into a single-layer and then apply the existing community detection algorithms, or extend a model from static networks to the multiplex setting. One of the key shortcomings of many existing methods is that communities across different layers are assumed to be the same, which is clearly restrictive and often unrealistic. Instead, there is interest in monitoring or exploring how the communities vary across different layers. Note that there is a literature on community detection for dynamic or temporal networks mainly through developing a dynamic stochastic block model or a dynamic spectral clustering algorithm, see e.g., Liu et al. (2018), Pensky and Zhang (2019) and Matias and Miele (2017), which typically require smoothing networks over time thus assuming networks are observed over a significant number of time points.
We propose a novel Bayesian community detection model, namely, the hierarchical stochastic block model (HSBM) for multiplex networks that is fundamentally different from the existing approaches. Specifically, we impose a random partition prior based on the hierarchical Dirichlet process (Teh et al., 2006) on communities (or partitions) across different layers. Given these communities, a stochastic block model is assumed for each layer. One of the appealing features of our model is allowing the communities to vary across different layers of the network while being able to incorporate potential dependency across them. The hierarchical aspect of HSBM allows for effective borrowing of information or strength across different layers for improved estimation. Our approach has the added advantage of being able to handle a broad class of networks by allowing the number of nodes to vary, and not necessarily imposing the nodes to be fixed across layers. In addition, HSBM inherits the desirable property of hierarchical Dirichlet priors that allow for automatic and adaptive selection of the number of communities (for each layer) based on the data.
Our simulation study in Section 4 confirms that HSBM significantly outperforms a model that assumes independence of layers, especially when it comes to matching community labels across layers. In particular, the superior performance of our model over its independent single-layer counterpart is manifested by the significant improvement in the slicewise or aggregate NMI (Normalized Mutual Information) measures.
Although a hierarchical Dirichlet process is a natural choice for modeling dependent partition structures, extending the ideas from simple mixture models to community structured network models is not that straightforward. In particular, leveraging ideas from one of our recent work (Amini et al., 2019), we develop a new and efficient slice sampler for exact sampling of the posterior of HSBM for inference. We will discuss some of the technical difficulties in Remark 1.
The rest of the paper is organized as follow. Section 2 introduces the HSBM in details. Section 3 is devoted to describing a novel MCMC algorithm for inference of HSBM. Simulation study and real data analysis are presented in Sections 4 and 5. The code for all the experiments is available at Amini et al. (2021). We conclude with a discussion in Section 6.

Hierarchical stochastic block model (HSBM)
Consider a multiplex network with T layers (or T types of edge relations) and n t nodes with labels in [n t ] = {1, . . . , n t } for each layer t = 1, . . . , T . Denote by A t the adjacency matrix of the network at layer t, so that an observed multiplex network consists of the collection A t ∈ {0, 1} nt×nt for t = 1, . . . , T . We let A denote this collection and view it as a partial (or irregular) adjacency tensor. That is, where A tij = 1 if nodes i and j in layer t are connected. Our goal is to estimate the clustering or community structure of the nodes in each layer, given A.
Specifically, to each node i, at each layer t, we assign a community label, encoded in a variable Z = (z ti ) ∈ N T ×nt , where N = {1, 2, . . . , } is the set of natural numbers. Let be a matrix of link probabilities between communities, indexed by N 2 , for t = 1, . . . , T . At times, we will use the equivalent notation η t (x, y) ≡ η txy to increase readability.
For example, we interpret η t12 = η t (1, 2) as the probability of an edge being formed between a node from community 1 and a node from community 2 at layer t. Note that we have assumed that the total number of community labels is infinite. However, for a given adjacency matrix A t observed at layer t, the number of community labels is finite, unknown, and will be denoted by K t .
For each layer t, we model the distribution of the adjacency matrix A t ∈ {0, 1} nt×nt as a stochastic block model (SBM) with membership vector z t = (z ti ) ∈ N nt , and edge probability matrix η t , that is, In an SBM, the link probability between nodes i and j is uniquely determined by which communities these nodes belong to. In our notation, at a layer t, the link probability between nodes i and j is given by η t (z ti , z tj ). Note that our SBM notation is slightly different from the traditional stochastic block model where the set of community labels is random. In writing SBM(z; η), we assume that z is given and nonrandom.
If there is belief or prior information that the community structure in one layer of the network is independent of the others, independent stochastic block models (SBM) could be assumed for the A t 's. This assumption is, however, too restrictive when we are dealing with networks of different kinds of relations but among the same set of nodes; or with different sets of nodes but similar types of relations. The other extreme is to assume that all layers in the network share the same partition-meaning that z t = z for all t and the A t 's are conditionally independent given z.
We believe, however, that a model that can incorporate various dependencies between these two extremes is more appropriate in many applications. In other words, it may be desirable to allow some change in the partition structure between layers, but also impose some kind of dependence among them. Here, we propose a model that achieves this goal by allowing the community structures at various layers to be different but dependent, using a hierarchical specification for the distribution of the partitions.

Hierarchical community prior
Before stating the prior on the labels, let us introduce a simplification, namely, that η t does not vary by layer. Therefore, we drop index t, and denote the matrix of link probabilities by η and its elements by η xy ≡ η(x, y). This assumption is not necessarily restrictive since, as will become clear shortly, our model allows for an infinite number of communities. By imposing this restriction, we are simply stating that cluster i at a certain layer corresponds to the same cluster i at every other layer. We then assume independent Beta prior distributions for each element of η, given by: where Beta stands for the usual Beta distribution.
Our key idea is to impose a dependent random partition prior on the membership labels at different layers, i.e., z t , t ∈ [T ]. We adopt the prior based on a hierarchical Dirichlet process (HDP).  Figure 1: Toy example of the nodes of a two-layer network. The numbers are the indices of the nodes in each layer. The shape of a node represents its group and the letter symbol inside represents the community it belongs to.
We assume that the reader is familiar with the HDP and its various interpretations, and in particular, the Chinese Restaurant Franchise process (CRF); see for example (Teh et al., 2006). In the CRF interpretation, each layer t of the network corresponds to a restaurant, and nodes are gathered around tables, or groups, in each restaurant. Let g ti denote the group of node i in restaurant (i.e., layer) t. All the nodes in the same group g, at the same layer t, share the same dish (i.e., community) which is denoted by k tg . Thus, given all the groups g t = (g ti ) i and group-communities k t = (k tg ) g , the label of node i (at layer t) is uniquely determined: To help the reader understand the notation of the label part of the model, we present a simple toy example: Example 2.1. Consider a two-layer network with n 1 = 8 and n 2 = 6 nodes. At each layer (restaurant), the nodes are presented in a line, and are numbered from left to right, as illustrated in Figure 1. The shape of each node represents the group (table) it belongs to and the fill, as well as its label ∈ {a, b, c}, represents the community (dish). For example, in layer 1, nodes 1 and 2 are grouped together, represented by a circle, and similarly for {3, 4, 5} and {6, 7, 8} represented by square and diamond, respectively. In this first layer, each group is associated with a different community. In layer 2, we have the following groups: {1, 4, 5}, {2, 6} and {3}, represented by a pentagon, triangle and hexagon, respectively. Also, note that the groups represented by a triangle and a hexagon share the same community (dish). In general, groups in the same layer or in two different layers can be linked via their group-community assignment, which is encoded by k tg . For example, the circular group in the first layer shares the same community (dish) with the triangular and hexagonal groups in the second layer. This information is carried through the definition of k tg : According to Figure 1, This, in turn, implies that all the nodes in these groups (nodes {1, 2} in layer 1 and nodes {2, 6} ∪ {3} in layer 2) share the same community, that is, we have z 11 = z 12 = z 22 = z 26 = z 23 = a.
As an example of how (2.4) is used to determine the node community labels, we have g 26 = and k 2, = a, hence z 26 = k 2,g26 = a.
A prior on z ti can be obtained via priors on k tg and g ti . To impose dependence among layers, we assume that group-communities k t , across layers t, are drawn from the same prior: where γ t , t = 1, . . . , T are categorical distributions with infinite categories, with the weight of each category representing the fraction of the nodes in layer t that would end up in group g (eventually).
To complete the prior specification, we impose (2.8) where GEM stands for Griffiths, Engen and McCloskey (Picard and Pitman, 2006); it is a distribution for a random measure on N which has the well-known stick-breaking construction (Sethuraman, 1994). Equations (2.7), (2.8), (2.5), (2.6) and (2.4) together specify our hierarchical label prior on the node labels (z t ) T t=1 . The hierarchical label prior above, together with prior on η in (2.3) and the SBM of (2.2) provide the full specification of the HSBM.
Remark It turns out that the label prior described above is exactly the label prior implicit in the well-known Hierarchical Dirichlet Process of Teh et al. (2006). Since this equivalence is not central to the model we present, we do not go into further details here and refer the interested reader to Amini et al. (2019).

Joint distribution
The joint density for the label part of HSBM, that is, equations (2.4)-(2.7), can be expressed as: The new variables γ t and π are related to γ t and π via the stick-breaking construction for the GEM distribution (Sethuraman, 1994;Ishwaran and James, 2001). The idea behind this construction is to imagine a stick with length 1, which will be successively broken into smaller pieces.
. Then, [F (x)] j is the length of the piece broken at iteration j after successive fractions x 1 , x 2 , . . . , x j are broken off. Denote which is the density for Beta(α, β) up to a normalization constant. Both π and γ j above have stick-breaking representations of this form: where γ t = (γ tg ) and π = (π k ).
Adding the network part to the label part of the model, we obtain the full joint density of HSBM: is the Bernoulli likelihood Note that we have used the alternative notation η(x, y) = η xy for readability. In the next section, we derive a novel MCMC algorithm for sampling the posterior distribution of our model.

Slice sampling for HSBM
We propose a slice sampler for HSBM, based on a slice sampling algorithm we recently developed for HDP (Amini et al., 2019). Recall that in slice sampling from a density f (x), we introduce the nonnegative variable u, and look at the joint density g( . Then, we perform Gibbs sampling on the joint g. In the end, we only keep samples of x and discard those of u. This idea has been employed in Kalli et al. (2011) to sample from the classical DP mixture and extended in Amini et al. (2019) to sample HDPs.
In order to perform the slice sampling for HSBM, we introduce (independent) variables u = (u ti ) and v = (v tg ) so that the augmented joint density for (2.9) is where for example and similarly for p(k, v | π). Note that marginalizing out u from p(g, u | γ) gives back p(g | γ) = t i γ t,gti as before. The full augmented joint density is now with the support understood to be restricted to u ≥ 0 and v ≥ 0. We then perform block Gibbs sampling on the augmented density. Note that marginalizing variables (u t ) and (v t ) out, we get back original joint density (2.11). The idea is to sample (γ , u) jointly given the rest of the variables, and similarly for (π , v). The updates for variables u, γ , v and π are similar to those in Amini et al. (2019). However, the updates for the underlying latent groups g and group-communities k require some care due to the coupling introduced by the SBM likelihood. As can be seen from the derivation below, these updates will be quite nontrivial in the case of SBM relative to the case where the data follows a simple mixture model given the partition.

Sampling (v, π )
First, we sample (v | π , Θ −vπ ) which factorizes and coordinate posteriors are Next, we sample from (π | Θ −vπ ). As in the case of γ , we first marginalize v which leads to the usual block Gibbs sampler updates: The posterior factorizes over k, and where n k (k) = |{(t, g) : k tg = k}| and similarly for n >k (k).

Sampling g
This posterior factorizes over t (but not over i). From (3.1), we have Let G ti := sup{g : u ti ≤ γ tg }. According to the above equation, g ti given everything else will be distributed as where, using k t,gtj = z tj and L(p; a) = p a (1 − p) 1−a , (3.5)

Sampling k
This posterior also factorizes over t (but not over g). First, note that since we are conditioning on g, we can simplify as (3.6) and the summations are over 1 ≤ i < j ≤ n t . Then, the posterior of k | · · · factorizes over t, and for any fixed t, Note that ξ tgg is not symmetric in g and g , because of the condition i < j in the summation. Letting h tgg (k, ) := h tgg (k, ) h tg g ( , k), it follows that for any fixed t and g: By the symmetry of η(k, ) in its arguments, h tgg (k, ) is also symmetric in (k, ), and Let us simplify the last factor in (3.8) further. Using (3.9), we have (3.10) According to the above, k tg given everything else will be distributed as

Sampling η
Recalling that z ti = k t,gti , the relevant part of (3.1) is Using the fact that η(k, ) = η k is symmetric in its two arguments (that is, we treat η k and η k as the same variable), we have, for k ≤ , (3.11) and the (i, j) summations are over 1 ≤ i < j ≤ n t for k = and over 1 ≤ i = j ≤ n t for k = . We conclude that for k ≤ , Remark 1. Comparing with the updates in Amini et al. (2019), we observe that, under HSBM, the updates for k and g (which is equivalent to t in Amini et al. (2019)) and η (which is equivalent to parameters of f in Amini et al. (2019)) are much more complex. For the usual HDP, the data are assumed to follow a simple mixture model where observations are independent given the labels, and each, only depends on its own label. This causes the posterior for k, g and the parameters of the mixture components to factorize over their coordinates. In contrast, the SBM likelihood makes each observation A tij dependent on two labels z ti and z tj . This causes the posterior for k, g and η to remain coupled under HSBM. Nevertheless, the Gibbs sampling scheme above allows us to effectively sample from these coupled multivariate posteriors.

Computational speedup
Let us discuss some ideas that lead to the implementation of a fast sampler. We consider three ideas: Spare matrix computations, parallel versus sequential label updates and truncation versus slice sampling.
Many of the key computations for the slice sampler can be sped up for sparse networks, using fast sparse matrix-vector operations. Consider for example the computation of τ t = (τ i ). This can be done by defining the operator "row-compress(A, z)" that takes a A ∈ R n×n and a label vector z ∈ [K] n and compresses each row of matrix A by summing over entries having the same label according to z, producing an n × K matrix. Then, we will have τ t = row-compress(A t , z t ). (3.12) Let M t be the number of nonzero elements of A t ∈ R nt×nt . The above operator can be implemented in O(M t ) operation by iterating only over the nonzero entries of A t once. When A t is sparse, this allows for a significant computational saving relative to the naive approach which takes O(n 2 t ), since in the sparse case, usually M t = O(n t ). The operation (3.12) is thus extremely fast if implemented in parallel, i.e., the entire matrix τ t is computed all at once. To have a valid Gibbs sampler, however, one needs to perform row compression one row at a time in a sequential manner, since once one element of z t is updated, the row compression for subsequent rows will be affected. The sequential row compression can be implemented with the same complexity of O(M t ), but cannot benefit from parallelism, hence will potentially be slower than the the parallel version. We refer to the two versions of the sampler as HSBM-par and HSBM-seq, respectively, based on whether the label updates are done in parallel or sequential.
To summarize, HSBM-par is doing an approximation of a valid Gibbs sampler (not an exact Gibbs sampler). HSBM-seq computes the current row compression, samples the corresponding label and then uses this new label when computing the compression for the next row. HSBM-par, however, computes the row compressions all in parallel and samples the labels all in parallel, using the current snapshot the labels. HSBM-par does not take into account the effect of sequentially sampling of labels on subsequent rowcompressions. In practice, we have found that HSBM-par performs as good as HSBM-seq and will be the default implementation used in simulations.
Similarly, computing ξ t = ( ξ tgg ) and λ = (λ k ) which are the key computations in updating k and η, can be done extremely fast. To see this, consider the operator "block-compress(A, z)" that returns the block compression of A according to labels z, by summing all the entries of A over blocks having the same row-column label pair. If z ∈ [K] n , the output will be a K × K matrix and it can be computed by traversing only the nonzero entries of A once. We have ξ t = block-compress(A t , g t ) and λ = t block-compress(A t , z t ), ignoring the minor modifications needed depending on whether diagonal entries need to be accounted for or not. Both of these calculations can be done in O( t M t ) operations.
Truncation sampler. Finally, it is possible to speed up the convergence by switching to a truncation sampler : Instead of introducing auxiliary variables u and v to truncate the infinite measures automatically, one can truncate them at a fixed sufficiently large index. That is, we sample the (approximate) joint density (3.13) where K and G are pre-specified large integers. The resulting Gibbs sampler is almost identical to the slice sampler with minor modifications. In particular, in the g-update, we need to replace ρ ti (g) with ρ ti (g)γ t,g and in the k-update, δ tg (k) with δ tg (k)π k . Otherwise, everything else remains the same. Empirically, we have found that the truncation sampler mixes faster than the slice sampler, hence will be our default choice in the simulations. Figure 2 illustrates the convergence speed of the sampler following these implementation choices. Figure 2(a) shows the aggregate NMI (Section 4.4) versus iteration for 15 realizations of the HSBM-par and HSBM-seq chains. The underlying network is a multilayer personality-friendship network (Section 4.1) with T = 5 layers and n t = 200 nodes per layer. Figure 2(a) is the absolute deviation between the estimated total number of communities and the truth (i.e., 3 communities), averaged over the 15 realizations. Both plots indicate fast mixing, with convergence achieved under 50 iterations for most realizations. We refer to Section 4 for more details on the simulation setup. The code for the sampler(s) is available at Amini et al. (2021).

Simulation study
We consider networks generated from a multilayer SBM with a single connectivity matrix, but potentially varying labels across layers. We start with an example to illustrate how such assumptions naturally arise in some applications.

Multilayer personality-friendship network
Consider, as an illustration, the group of young women who run, every year, for the Miss World title. Each country holds a preliminary competition to choose their delegate to the main race. After meeting each other in the competition, some of the girls tend to bond and become friends at a social networking website. Suppose that we are interested in identifying groups of competitors by their personalty types: extrovert, introvert or ambivert (a balance between the former two). Let us also suppose that around 40% of the female population can be classed as extrovert, 35% introvert, and 25% is in the ambivert group. Assume that the distribution of personality types among the contestants in the Miss World competition is similar to that of the general population.
It is natural to assume that extroverts have a higher chance than introverts to form Extrovert Ambivert Introvert Extrovert 90% 75% 50% Ambivert 75% 60% 25% Introvert 50% 25% 10% Table 1: Hypothetical probability of friendship between personality types bonds with other contestants. For this experiment, we consider the probabilities of friendship between and within groups listed in Table 1. We consider each year of the competition as a layer and the competitors represent the nodes. Alternatively, since each country is represented by a single competitor, we can consider nodes as representing different countries. The membership group of each node (i.e., its personality type) varies from layer to layer since the same country is represented by different girls in different years. However, the connection probability, i.e., the η matrix in our notation, remains the same across layers, assuming that there is a fundamental pattern of connections among personality types rooted in human nature and invariant across years. This example thus naturally conforms to our setup of fixed η and potentially variable z t over t = 1, . . . , T .

Markovian labels and random connectivity
In order the systematically study the performance of HSBM and its competitors, we introduce additional degrees of freedom in generating simulated networks: 1. In addition to the η introduced in Table 1, we also consider a random symmetric connectivity matrix with entries on and above diagonal generated i.i.d. from Unif(0.1, 0.9). In the experiments where the random connectivity matrix is used, all layers share the same η, however, different replications of the experiment use different randomly generated η. Note that we do not restrict η to be assortative and the resulting random ensemble captures the full complexity possible in an SBM connectivity matrix.
2. We generate the labels z t = (z ti ), t = 1, . . . , T based on the following Markovian process: For simplicity, we assume that all layers have the same number of nodes n. The elements of z 1 are generated i.i.d. from a Categorical distribution with probability vector π 0 , that is, z 1i ∼ Cat(π 0 ), i = 1, . . . , n. Subsequent labels are held fixed at previous layer value with probability 1 − τ or randomly generated from Cat(π 0 ), that is, for t = 2, . . . , T and i = 1, . . . , n, where δ x is the point-mass measure at x. We refer to τ as the transition probability.
3. We vary the number of layers, T , say from 2 to 12.

Competing methods
We compare the performance of HSBM with a wide variety of methods. The first natural competitor is to apply DP-SBM separately to each layer. Here, DP-SBM is a SBM with a DP prior on its labels. It can be thought of as HSBM with a single layer and with w 1 set equal to π. This model is essentially the same as the one considered in Mørup and Schmidt (2012).
We also compare with various spectral approaches: SC-sliced that applies spectral clustering separately to each slice; SC-avg that applies spectral clustering to the average adjacency matrixĀ = 1 T T t=1 A t ; SC-ba, the debiased spectral approach of Lei and Lin (2020); SC-omni, the omnibus spectral approach of Levin et al. (2017). In addition, we consider two versions of the PisCES algorithm (Liu et al., 2018) that solves an optimization problem that smooths out spectral projection matrices across time. PisCES implements the version described in Liu et al. (2018). PisCES-sh is a modification we made where we use the same initialization to start the k-means clustering algorithm across different layers. Without the shared initialization, there is no guarantee of a matching between clusters obtained by k-means applied to the spectral representation of different layers.
HSBM, DP-SBM, SC-sliced, SC-omni and PisCES naturally produce labels for all layers. SC-avg and SC-ba are designed to produce a single set of labels for all layers; their underlying assumption is that the labels remain the same across layers. For these two methods, we repeat the estimated label vector for all layers to obtain a multilayer label estimate.

Measuring performance
We measure the accuracy of the estimated cluster labels using the normalized mutual information (NMI), a well-known measure of similarity between two cluster assignments. NMI takes values in [0, 1] where 1 corresponds to a perfect match. A random assignment against the truth is guaranteed to map to NMI ≈ 0. The NMI penalizes mismatch quite aggressively. In our setting, an NMI ≈ 0.5 corresponds to a roughly 90% match.
In the multilayer setting, we can compute at least two NMIs: (1) The slicewise NMI where one takes the average of NMIs computed separately for each layer, and (2) the aggregate NMI where we consider the labels for all the layers together and compute a single NMI between competing label assignments. Here, we focus mostly on the aggregate NMI, since achieving a high aggregate NMI is more challenging, requiring consistency both within and across layers.
For the HSBM and DP-SBM the estimated labels used in the NMI calculations will be the MAP estimates, calculated as detailed below.
MAP estimate. For HSBM, we compute the maximum a posteriori (MAP) label assignment by finding, for each node, the label which is most likely according to the posterior: argmax k P(z ti = k | · · · ). Associated with the MAP estimate, there is a confidence which is the value of the posterior probability. To compute the MAP estimate we use the posterior estimate given by 1 it ) is the label assignment at MCMC iteration j, N is the total number of iterations and N 0 the length of the burn-in. Here, we ignore the potential mismatch betweenẑ (j) andẑ (j ) due to the potential label-switching. In practice, all labels after MCMC convergence are coming from a single mode of the posterior as can be verified by computing the NMI between consecutive samples z (j) and z (j+1) .

Results
We ran the HSBM and DP-SBM samplers for N = 100 iterations with a burn-in of N 0 = N/2. We consider two main settings, one where we fix the number of layers at T = 5 and change the label transition probability τ , and one where we fix the transition probability at τ = 0.25 and vary the number of layers T . In both cases, there are n t = 200 nodes in each layer. The results are shown in Figures 3 and 4 respectively. In each case, the expected aggregate NMI is computed by averaging over 500 replications. For each of the two settings, we have considered both a fixed η, namely the personality-friendship connectivity of Table 1, and a random η generated as discussed in Section 4.2. Table 2 provides numerical values for the mean and the standard deviation of the aggregate and slicewise NMIs in a typical setting.
The results clearly show the superior performance of HSBM relative to the competing methods. In particular, Figure 3 shows that the performance of HSBM remains almost constant as one varies the transition probability (τ ). Most spectral methods in contrast deteriorate as τ is increased. Note that at τ = 1, the labels across layers are completely independent. Nevertheless the aggregate NMI for HSBM remains high even in this case. What allows HSBM to match the clusters correctly across layers is the shared connectivity matrix η. The personality-friendship connectivity (Figure 3(a)) is especially challenging for spectral methods, mainly due to the fact the connectivity matrix is nearly rank-deficient and non-assortative. In contrast, HSBM performs almost perfectly in this case, due to the very different connectivity patterns across clusters.
A similar qualitative behavior is observed as one varies T in Figure 4. The performance of most methods deteriorates as T is increased while that of HSBM remains roughly the same. Interestingly, in both experiments, SC-omni is overall the most performant among the spectral methods.
Remark Note that aggregate NMI is used as a measure of performance in most of our figures. For a method to have a high aggregate NMI, it has to also properly match the communities across layers. This is the main reason why most of the SC methods fail when looking at the aggregate NMI. For example, the SC-sliced method, which is applied to each layer separately, does not have any such capability.
SC-ba method is designed for the same nodes having exactly the same communities across all layers (with only the connectivity matrix changing across layers). Thus, it fails to have high aggregate NMI in multiplex networks with varying community structures over time.
SC-omni and PiCES allow for variable community across layers and the SC-omni is doing quite well in terms of the aggregate NMI if the transition probability is not quite high. See for example Figure 3 Table 2: Mean and standard deviation for aggregate and slicewise NMIs as well as runtimes for a typical experiment. The setting is that of random η with Markov labels having transition probability τ ≈ 0.57 and T = 5 layers, corresponding to roughly the midpoint of the horizontal axis in Figure 3.
If we only want to measure how the method performs in each layer separately, slicewise NMI is a more appropriate measure. This measure for example is reported in Table 2 and one can verify that it is quite high for SC-sliced. On the other hand, slicewise NMI ignores the multiplex nature of the network, and that is why we focused on aggregate NMI which is a much more natural measure if one believes shared communities exist across layers.

Real data analysis: FAO Trade Network
In this section, we illustrate the performance of our model and algorithm on one real data example. We consider the multilayer FAO trade network provided by De Domenico et al. (2015). The data contains trade connections between 214 countries, treated as nodes, across 314 product categories considered as layers. We sorted the layers according to their total edge count, and chose the 20 most dense layers. Figure 5(b) shows the distribution of the edge counts, suggesting there is a natural cut-off at about 20 layers. We then selected the nodes that had a degree greater than 20 in the sum network, obtained by summing all the adjacency matrices. After filtering, we were left with a multilayer network on n = 172 nodes and 20 layers.
We then ran the HSBM sampler for 2500 iterations with a burn-in of 1250 and obtained the MAP labels. Figure 5(a) shows the sequential NMI plot for the sampler, obtained by computing the aggregate NMI between consecutive labels across the iterations of the chain. The plot suggests that by iteration 1000 the chain is already well-mixed.
The estimated labels uncover a rich community structure. Figure 9 shows the community assignment across some of the layers. The nodes are laid out using a forcedirected algorithm that puts highly connected nodes closer together. That the com- munities inferred by HSBM align with the physical proximity in these layouts is an indication of the quality of the inferred communities. There are a total of 15 estimated communities. Figure 6 shows the (MAP) estimated connectivity matrix η and Figure 7 shows the distribution of the 15 communities across the 20 layers. Community 9 which is the most frequent across a majority of the layers corresponds to countries that sparsely trade, as evidenced by the corresponding row (and column) in η.  The connectivity matrix is not assortative-as evidenced, for example, by the near loop in Figure 6 among communities 10-14. This figure also suggests that communities 8 and 10 are rather interesting. Figure 8 shows the full community assignment for all the countries that are assigned communities 8 or 10 in at least one layer. The figure shows the complex nature of the inferred communities, with countries allowed to change communities across layers. Nevertheless, the countries that belong to group 10 across the majority of layers tend to form the tightly knit group {Germany, USA, Netherlands, UK, Italy, France, China, China (mainland)}. This is a reasonable group since these countries are known to have extensive trade relationships with each other. Interestingly, some of these countries behave like group 8 countries in some layers. Spain also has an interesting position, with equal assignments to group 10 and 8, respectively. Looking at the layers, group 8 seem to have something to do with the pattern of trade in crude materials and group 5 to the pattern of trade in green coffee. Perhaps the best way to interpret the inferred groups (or communities) is to note that they correspond to patterns of trade, hence a country can exhibit multiple of these patterns across different product.
Since under HSBM, the same country can be assigned a different label for each layer, to visualize the community assignments better, we consider the following reduction: For each label, we collect all the nodes that have that label across at least 8 out of 20 layers. Figure 10 shows the word cloud of major groups obtained by this procedure. The size and color of a country indicate the frequency of the label across layers. For example, in Group 6, Canada has been assigned label 6 for 13 out of 20 layers, more than any other country in that group. Switzerland, Australia and Belgium were all assigned label 6 in   Figure 8: FAO network: Community assignments across layers for the countries that are assigned to one of the communities 8 or 10 in at least one layer.
11 layers. In addition to the groups shown in the figure, we have three singleton groups: Chile, UAE and "unspecified".
The groups in Figure 10 make intuitive sense for the most part, with some of the groups roughly corresponding to geographical regions. The analysis, however, reveals many unintuitive connections, with seemingly unrelated countries grouped together. Examples include: Macao and Rwanda, Iraq and Guinea, Guam and Grenada, Fiji and Uganda, Bahrain and Mauritius, and so on.
To verify these connections and the overall quality of clustering, we consider the following quantitative evaluation metric: For each pair of nodes, we compute the normalized Hamming distance for their corresponding rows in the adjacency matrices, and then take the average over layers. That is, we consider the following distance with H(·, ·) denoting the Hamming distance. We refer to d(i, j) as the Average Normalized Hamming (ANH) distance and note that it measures how dissimilar the connection patterns of two nodes are across layers. We expect nodes that are grouped together to have a lower value of d(i, j) relative to a randomly selected pair.
Figure 11(a) shows plots of the distributions of d(i, j) when the pairs are selected randomly within the groups versus the case were they are selected completely at random. The figure shows clearly that the ANH is much lower on average for within-group pairs relative to random pairs. This confirms that HSBM has uncovered groups based on trading patterns. The median for "Within" and "Random" distributions are 0.027 and nes., Chocolate products nes. The layouts are generated by the Fruchterman-Reingold (FR) algorithm, applied separately to each layer. FR positions the nodes according to forces exerted along the edges, resulting in spatial proximity being correlated with network connectivity. The size of a node reflects its degree. More precisely, si ∝ log(di + 3) where si and di are the size and degree of node i, respectively. 0.21, respectively. The unintuitive pairs mentioned earlier have ANH much lower than the average for random pairs. For example, the ANH for Macao-Rwanda, Iraq-Guinea and Guam-Grenada, Fiji-Uganda and Bahrain-Mauritius pairs are 0.074, 0.019, 0.0088, 0.14 and 0.14 respectively. This shows that these seemingly unrelated countries have similar trading pattern across multiple food product categories.
In Figure 11(a) the ANH is computed in-sample, that is, using the same 20 most dense layers used for fitting the HSBM. Figure 11(b) shows the corresponding plot when ANH is computed out-of-sample, using the remaining 344 layers. These layers are generally sparser, as reflected in the absolute value of ANH, but the plot shows a similar separation among the two distributions.

Discussion
In this work, we proposed a novel Bayesian model for community detection in multiplex networks by adopting the well-known HDP as a prior distribution for community assignments. Under the random partition prior, a block model is assumed. This model facilitates flexible modeling of community structure as well as link probabilities with its ability of incorporating potential dependency and borrowing strength among networks from different layers. For posterior inference of HSBM, we develop an efficient slicer sampler. The principles behind the slice sampler can be applied to developing sampling algorithms for many other models. Future work will be focused on developing models for community detection in networks with covariates, and for inference of network-valued objects.