Mixtures of multivariate generalized linear models with overlapping clusters

With the advent of ubiquitous monitoring and measurement protocols, studies have started to focus more and more on complex, multivariate and heterogeneous datasets. In such studies, multivariate response variables are drawn from a heterogeneous population often in the presence of additional covariate information. In order to deal with this intrinsic heterogeneity, regression analyses have to be clustered for different groups of units. Up until now, mixture model approaches assigned units to distinct and non-overlapping groups. However, not rarely these units exhibit more complex organization and clustering. It is our aim to define a mixture of generalized linear models with overlapping clusters of units. This involves crucially an overlap function, that maps the coefficients of the parent clusters into the the coefficient of the multiple allocation units. We present a computationally efficient MCMC scheme that samples the posterior distribution of the parameters in the model. An example on a two-mode network study shows details of the implementation in the case of a multivariate probit regression setting. A simulation study shows the overall performance of the method, whereas an illustration of the voting behaviour on the US supreme court shows how the 9 justices split in two overlapping sets of justices.


Introduction
Clustering approaches are popular in many fields as they allow to identify unknown grouping structures from multivariate data. In a regression context, mixtures of Generalized Linear Models (GLMs) provide a natural model-based approach to account for the heterogeneity in the data due to the presence of heterogeneous clusters. These models have been developed extensively in the case of a single response variable, i.e., via a mixture of univariate GLMs (Grün and Leisch, 2008a). However, with the advent of complex multivariate data both at the level of predictors and responses, multivariate regression models are now common in many fields (Fahrmeir and Tutz, 2013). Mixtures of multivariate GLMs were introduced by Wedel and DeSarbo (1995) with an implementation provided in the R package FlexMix (Grün and Leisch, 2008b). Further extensions to the high dimensional case have been studied recently by Price and Sherwood (2017) using penalised inferential approaches.
In the traditional formulation of a mixture model, a unit can belong to only one of the clusters -and therefore be modeled by only one component -thus limiting the way we can build and discover more complex grouping structures of the units. In some applications, one can imagine that a statistical unit can be assigned simultaneously to more than one cluster (multiple allocation) or perhaps even to none of them. Some extensions to allow for overlapping clusters have been proposed with respect to conventional mixtures of exponential family distributions and mixtures of univariate regression models. For example, in Banerjee et al. (2005) and Fu and Banerjee (2008) the model-based clustering strategy is modified to allow for components in the mixture that arise as a product of densities from the exponential family, inducing a resulting density that is still within the same exponential family. Heller and Ghahramani (2007) extended this approach in a nonparametric Bayesian fashion to allow for the selection of the number of components of the mixture.
In both cases, however, the way the multiple allocation is handled is reflected into a strict definition of the parameters of the resulting cluster, which are limited by the mathematical derivation of the product of the densities. This hinders the flexibility of the model and the interpretability of the regression coefficients and related parameters. An alternative view to account for overlap is that of re-parameterizing the model in such a way that parameters of the overlapping clusters are linked to those of the originating clusters. This idea was explored by Ranciati et al. (2017b) in the context of univariate mixture models and by Ranciati et al. (2017a) in the context of network data.
In this paper, we take this alternative view forward and propose a novel definition of and inferential strategy for a mixture of multivariate GLMs with overlapping clusters of units. In detail, Section 2 introduces multivariate GLMs and the definition of overlapping mixture components. We discuss the Bayesian implementation of our proposal, including cluster allocation and model selection. In Section 3 our general proposal is worked out for the case of multivariate binary observations, motivated by actor-event network data where explanatory variables may be available at the level of actors and/or events, and where overlapping clusters of actors may occur naturally. Section 4 shows the computational and inferential performance of our method through a simulation study, whereas Section 5 illustrates the method on a two-mode network related to rulings from the Supreme Court of the United States of America.

Mixture of multivariate GLMs with overlap
In this section, we describe our proposal for a mixture of multivariate GLMs that can accommodate overlapping clusters. We call this miro, mixture of regression models with overlap. We start by defining a mixture of multivariate generalized linear models. Suppose that we have n multivariate observations measured on a d-dimensional space, so that each response vector y i = (y 1i , . . . , y id ) corresponds to a row of the n × d data matrix y. The aim is to cluster these n observations based on their d features, while accounting for possible additional information, either at the level of the units i = 1, . . . , n or of the variables j = 1, . . . , d. In a mixture of regression framework, the assumption is that the response variables come from a mixture of K-components (also called clusters) from the same multivariate exponential family, i.e., where α = (α 1 , . . . , α K ) are the prior cluster probabilities, θ k (x i , W ) is the natural parameter of the GLM and a function of cluster-specific parameters and covariate information, which could be unit-specific, x i = (x i1 , . . . , x iL ), or response variable-specific, w j = (w j1 , . . . , w jd ). Furthermore, depending on the distribution, Σ k refers to additional nuisance parameters, such as the dispersion parameters. An alternative hierarchical representation of the mixture model is achieved by introducing a unit-specific binary latent vector z i = (z i1 , . . . , z iK ), made up of all zeros with the exception of a single element z ik = 1 for unit i belonging to cluster k. In particular, using this notation, the hierarchical formulation of a mixture of multivariate GLMs is given by For example, if the response variables are binary outcomes denoting the attendances of n units to d events, the n × L matrix X, with i-th row x i , collects L characteristics of the n units, such as age, gender and education, whereas the d × Q matrix W describes the Q features of the d events, such as time, location and type of event. This is indeed the specific case that will be discussed more in detail in Sections 3 and 5.
In finite mixtures of GLMs (Fahrmeir and Tutz, 2013), the parameters of each component are linked to the covariates via an appropriate link function applied to a linear combination of the predictors. Here we consider both individual covariates x i and responsespecific covariates w j . In particular, the natural parameter for individual i part of cluster k is given by θ kij ≡ θ k (x i , w j ) = g(η kij ) with g(·) a link function and η kij the linear predictor.
In a multivariate general setting, where covariates are available both at the level of units and responses, we specify each component of the d-dimensional vector η i by where {µ k , β k , γ k } are cluster-specific vectors of parameters. In particular, for each cluster k, the model includes an intercept µ k , an L-dimensional vector of regression coefficients β k for the unit covariates and a Q-dimensional vector of regression coefficients γ k pertaining to the response-specific covariates. According to (2), the effect of unit-specific covariates x i is the same for all the d response variables within a particular cluster k. The elements {η kij } of the linear predictor differ, with respect to j, only due to the effect of the response-specific covariates w j . We note that in the case d = 1 and K > 1, the model reverts back to a mixture of univariate GLMs (Wedel and DeSarbo, 1995;Grün and Leisch, 2008a); when d > 1 but K = 1, we obtain a multivariate GLM (Fahrmeir and Tutz, 2013) and if both K = 1 and d = 1, we are back in a simple GLM (McCullagh and Nelder, 1989).

Miro: mixture of regressions with overlap
Differently from the traditional setting descibed above, we are interested in defining a heterogeneous regression setting where units may belong simultaneously to more than a single cluster. Using similar ideas to Ranciati et al. (2017a), we modify the hierarchical model in (1) by relaxing conditions on the allocation vectors z i in order to allow for a multiple classification of the units. In particular, we will allow z i ∈ {0, 1} K .
If there are K primary clusters, then there are K = 2 K multiple cluster allocations.
Each of these K allocations is a non-overlapping heir cluster, which defines a new Kdimensional allocation vector z i for each unit i. This vector satisfies K h=1 z hi = 1, and has a 1-to-1 correspondence with the original z i , which allocates units into the overlapping parent clusters. The z re-parametrization can now be used as the basis of a traditional hierarchical model, namely The key questions are (i) how to connect the new model parameters θ and Σ to the original parameters θ and Σ; (ii) how to connect the new mixture parameters α to the old mixture parameters α and (iii) how to interpret the resulting multiple allocation framework.
Overlap function: connecting heir parameters to parent parameters. In this paragraph we will describe a generic function for linking the new parameters θ and Σ with the original θ and Σ, which we call the overlap function. Various overlap functions are possible and each one of them determines the way we can interpret and build the potential overlap between the components of the mixture. Each specific choice leads to different models that have their own computational and inferential considerations. Although the connecting function can be directly at the level of the parameters θ, it is computationally more advantageous to define this at the level of the linear predictors and regression coefficients. Furthermore, the overlap functions will be presented explicitly for the location parameters, but they are also natural choices for the nuisance parameters Σ.
Given the linear predictor η kij = µ k + β k x i + γ k w j , three natural choices to define the function η ij = ψ(η .ij , z i ) are the pointwise average, the minimum or the maximum of the linear predictors η 1ij , . . . , η Kij of the multiple allocations z i . In particular, using small, mean and maximum as subscripts. Figure 1 shows an example of the three overlap functions applied to a single covariate in a situation with three overlapping clusters. The mean overlap function is itself linear, whereas both the maximum and minimum overlap functions are only piecewise linear. In fact, the mean overlap function can be seen as an average of the intercepts and regression coefficients across those clusters, to which the unit is allocated, where µ = (µ 1 , µ 2 , . . . , µ K ) is the vector containing all the K intercepts, B the K × L matrix whose rows are β 1 , . . . , β K , and Γ the K × Q matrix with rows given by γ 1 , . . . , γ K .
So, units in multiple allocation clusters have the average effects for each covariate with respect to their parent clusters.
For the case of being allocated to none of the primary parent clusters, z i = 0, the overlap function requires a special definition. This definition depends on the individual situation. Sometimes it might be possible to define the overlap as an a priori interpretable constant, whereas most other times it can be defined in a data driven way. In particular, one can use the overall minimum, ψ(η .ij , 0) = min {η kij (x, w)}, on the observed range for x and w. They are particularly recommended for the ψ x , ψ m and ψ s , respectively.
For the rest of this manuscript, we focus on the mean overlap function ψ m . This function offers computational advantages. Assuming that the d response variables are independent, the mean overlap function translates the multivariate problem into a univariate GLM regression and allows estimations of all the coefficients of the K clusters simultaneously. We give an intuition about the technical details of this implementation in the Appendix of the manuscript.
Connecting primary allocation parameters to heir parameters. It is more complicated and restrictive to connect the allocation probabilities α of the primary clusters z to the allocation probabilities α of the heir clusters z . In fact, the only meaningful, but restrictive choice is to use independent allocations, whereby the probability α z = k:z k =1 α k .
In practical applications this is generally too restrictive and, therefore, the heir probabilities are estimated without any restrictions.

Bayesian implementation
We approach inference by following a Bayesian paradigm, which requires specification of prior distributions for all parameters in our model. First, we assume the prior cluster sizes α to come from a Dirichlet distribution with hyper-parameters a 1 , . . . , a K . The intercepts µ k and regression coefficients (β k , γ k ) are assumed to be a priori independent normally distributed centered at zero and with scalar variance parameters (σ 2 µ , σ 2 β , σ 2 γ ) treated as hyper-parameters. The updated hierarchical formulation of the model is the following: Once one writes the complete joint posterior distribution of all the parameters and latent quantities in the model, an MCMC algorithm can be defined to sample from it. This will depend on the specific distributional choices, as we shall see with a specific example in Section 3. In general, the pseudo-code for miro is an MCMC algorithm with Gibbs samplers, that iterates the following instructions across t = 1, . . . , T iterations: ..,n , and then compute the updated cluster Sample new values for {α h } from their full conditional α |y, z, a ∼ Dir(a 1 + n 1 , . . . , a h + n h , . . . , a K + n K ); 3. Sample intercepts and regression coefficients (µ, B = {β}, Γ = {γ}), by using the stacked design matrices X and W , from their full conditional distributions. The instructions inside this step are dependent on distributional assumptions for the data matrix y; we will see a specific example in Section 3;

Model selection, posterior allocation, and label switching
In the context of our approach, model selection is equivalent to choosing the number of primary clusters K. A fully Bayesian specification would prescribe a prior on this quantity and, due to the nature of the model, would require the implementation of a transdimensional MCMC version of our algorithm, such as a reversible jump MCMC (Green, 1995). For two reasons we avoid this approach. First, we expect the 'true' number of primary clusters K to be small in practice, as even K = 4 can accommodate for up to K = 16 clusters. Second, sampling K from its prior distribution might lead to computational issues. Given that K scales exponentially with K, the support of the prior distribution would have to be rather narrow to avoid unfeasible values of K , which defeats the main purpose of using a distribution for K.
On the basis of these considerations, we opt for a more heuristic model selection approach, in which we fit our model for different values of K and then select the optimal one through an information criterion. In particular, we rely on the BIC-MCMC (Frühwirth-Schnatter, 2011), recently employed to select infinite mixtures of infinite factor analyzers models (Murphy et al., 2019). The BIC-MCMC criterion typically encourages the selection of parsimonious models, and it is defined as where l max is the maximum value of the log-likelihood across the MCMC chain (after burnin) and p θ is the number of parameters in the model.
After the choice of K and, implicitly, K , units are allocated into clusters according to their average posterior probabilities and using the Maximum-A-Posteriori (MAP) rule. In particular, unit i will be assigned to the cluster h that attains the highest value for computed after the initial burn-in window.
A well-known problem with mixture models in the Bayesian paradigm is the label switching phenomenon (Celeux, 1998;Stephens, 2000;Sperrin et al., 2010). available as a function in the R package label.switching (Papastamoulis, 2016). The procedure needs a pivotal labelling, which we select according to the strategy proposed in (Carmona et al., 2018). In particular, we first compute the matrix of co-occurences C (t) at each iteration t = 1, . . . , T of the MCMC (after burn-in). This is an n × n matrix where a generic element c ij is equal to one if unit i is in the same cluster of unit j for that iteration, and zero otherwise. Then, an average of these matrices is computed, denoted byC. Finally, we select the labelling of iteration t min as our pivotal quantity, where Once samples have been relabelled according to the algorithm, we can compute posterior quantities of interest.
3 Two-mode binary networks with covariates Two-mode networks (Wasserman and Faust, 1994, Chapter 8), also known as bipartite or affiliation networks, are networks consisting of two types of nodes, where links can occur only between nodes of different type. An example of a two-node network is a group of actors attending or not attending a set of events. From an agent-based point of view, a two-mode network can be seen as a collection of d-dimensional multivariate binary response variables for n actors. For each of the actors, the response vector indicates whether he or she attended or did not attent each of the d events.
One of the interests in two-mode networks is detecting clusters of actors that represent organizational structures inside the network itself. Recent studies have shown how allowing for overlap can improve the characterization of the clusters as well as leading to model parsimony (Ranciati et al., 2017a). With additional data often available, both at the level of actors (units) and events (responses), and with a natural expectation of clusters to be potentially overlapping, we present this setting as a prime example of our mixture of multivariate GLMs (miro) approach presented in Section 2.
We organize our data in an n × d matrix y of observations y ij , recording attendances of i = 1, . . . , n units or actors to j = 1, . . . , d events. Each y ij is a realization from a binary random variable, where y ij = 1 indicates that individual i attended event j, and zero otherwise. We assume {y ij } to be (conditionally) independent for all i, j. The hierarchical formulation of the model is therefore α ∼ Dir(a 1 , . . . , a K ), where probabilities of attendance {π kij } play the role of {θ hi } in (4). A natural choice for overlap function of the zero cluster, z i = (0, . . . , 0), is to set π = 0 and use it to allocate units for which we record no attendances. Moreover, for the binomial setting, we note that there are no additional dispersion parameters Σ.
In this two-mode binary network setting, covariates in X could be characteristic related to an actor, such as gender, age, etc, and those in W could be features of an event, i.e., type of event, date, duration, and so forth. We bridge the linear predictors and their corresponding probabilities of attendance via g(·), which we choose to be the Gaussian cumulative distribution function π hij = Φ η hij . This choice of a link function induces a probit regression model formulation, which, combined with the mean overlap function, allows one to sample the parameters of interest, {µ, β, γ}, directly from a single probit regression model instead of working separately for each cluster k = 1, . . . , K. To be more specific, we can write the likelihood as and pair this with the same prior distributions shown in (4), in order to derive the full joint posterior of the parameters in our model.
For inferential purposes, we employ the Bayesian probit regression framework proposed by Holmes and Held (2006) and we implement the MCMC scheme previously described in Section 2.2. More specifically, in Step 3 of the pseudo-code, the set of regression coefficients, µ, B, and Γ, is sampled together in a single probit regression step. We adopt the framework suggested by Holmes and Held (2006), where a random latent utility r is introduced, such that y ij = 1 if r ij > 0, otherwise y ij = 0, for i =, . . . , n and j = 1, . . . d. Then utility is defined as r ij = η ij + ij with ij ∼ N (0, 1) and the linear predictor η ij is the same of a probit regression on the original y ij . The approach leads to the following update mechanism for the regression coefficients θ = [β γ] in η ij : ii) for i = 1, . . . , n and j = 1, . . . , d with truncNorm(·) denoting sample values from a truncated Normal distribution; where V is the prior block-covariance matrix of the coefficients in θ; A is the full design matrix obtained by stacking all the columns of both X and W , S = V A and Chol(·) extracts the lower-triangular matrix from the Cholesky decomposition.

Simulation study
We investigate the performance of our model under two different data generating processes:

Synthetic data from miro model
Data are simulated from K = 2 overlapping clusters, using the hierarchical formulation of the miro model as the data generating process, namely a mixture model with overlapping components and a probit regression formulation. We consider 5 scenarios in total, according to the type of covariates considered, and by varying either the sample size n or the number of events d. In particular, we consider: • Settings with Event covariates only: sample size n = {50, 150} actors, d = 15 events, Q = 2 binary covariates (W 1 , W 2 ) from one categorical covariate with three levels; • Setting with Actor and Event covariates: sample size n = 250 actors, d = 21 events, L = 1 continuous covariate X sampled from a Standard Normal distribution, Q = 2 binary covariates (W 1 , W 2 ) from one categorical covariate with three levels.
The results are reported in Table 1. With the exception of the scenario with the lowest sample size and number of events, n = 50 and d = 5, respectively, we are able to obtain appreciable values for both MER and ARI. This is indeed expected, given that we are simulating from the same model which we use for inference. However, we further notice that not only sample size but also increasing the number of events leads to better performances.
In particular, increasing the number of events d has a positive effect for the setting with only actor covariates: this is due to the fact that, having more attendances is analogous to having more time points in a repeated measures model framework. A similar argument can be made for the effect of sample size n on the performances in scenarios where there are only event covariates.

Synthetic data from misspecified model
We now simulate data from K = 4 non-overlapping groups via a mixture model of binary regression models, where probabilities {π kij } are computed as a logit transformation of the linear predictor Here, we simulate n = 300 units and d = 20 events. This data generating process differs from miro due to: (i) logit instead of probit link function; (ii) units belong only to one component at a time, as in a conventional mixture model. As covariates, we use: L = 1 continuous actor covariate X sampled from a standard Normal distribution; Q = 1 binary event covariate W. We perform inference using four competing models: (i) mixtbern, a conventional mixture of Bernoulli distributions; (ii) manet, a mixture of Bernoulli distributions with overlapping clusters (Ranciati et al., 2017a); (iii) mixtprobit, a classical mixture of probit regression models; (iv) miro, our proposed model. For manet and miro, prior on the cluster sizes P(α 1 , α 2 , . . . , α h , . . . , α K ) = Dir(a 1 , a 2 , . . . , a h , . . . , a K ) are set to have the following hyper-parameters: a h = K if K h=1 u h = 1, otherwise a h = 1. The results are reported in Table 2  to the previous section, due to the misspecification of the models we fit. However, the results for miro with K = 3 and K = 4 are less affected than those for the two competing models, mixtbern and manet. As expected, being the closest to the data generating process, mixtprobit performs on par or slightly better than miro in terms of ARI and MER.

Agreement and polarization in U.S. Supreme Court
We illustrate the performance of our proposed method on the 26 "important decisions" handed down by the US Supreme Court in their 2000-2001 term with the aim of clustering its nine justices. Originally described in Greenhouse (2001), these data were analyzed in Doreian and Fujimoto (2003) and then further explored in Doreian et al. (2004). The n = 9 actors in our data are justices Breyer, Ginsburg, Souter, Stevens, O'Connor, Kennedy, Rehnquist, Scalia and Thomas, whereas the decisions considered are d = 26. According to Greenhouse (2001), the decisions can be categorized into 7 main topics, which can be used as a categorical covariate W with the following levels: "Presidential Election", "Criminal law", "Federal authority", "Civil rights", "Immigration law", "Speech and Press", "Labor and Properties". Each observation is coded as y ij = 1 when justice i was part of the majority decision j, while y ij = 0 stands for the situation where the justice voted with the minority.
On these data we apply four clustering algorithms: mixtbern, manet, mixtprobit and miro. For all of them, we opt for 10,000 MCMC iterations with a generous 5,000 burn-in window. Overall model fit comparison is done quantitatively in terms of BIC-MCMC and qualitatively using the clustering output. In Table 3 we report the selected number of clusters K according to the BIC-MCMC value. We also provide a proxy for the complexity of the models by reporting the number of parameters.
We gather from the summary that miro stands out as the best model according to the BIC-MCMC. Moreover, manet and miro produce the same classification of the justices, which suggests that no real better fit is provided by manet at the expense of increasing model Unlike any of the other methods, miro is able to make use of additional covariate information to aid the clustering. Figure 2 visualizes the results for miro in terms of clustered data and posterior means of regression coefficients for the two primary clusters.
Coherently with the allocations of the model, there are some decision types that better discriminate between the voting behaviour of the two primary clusters. In particular, those decisions belonging to categories "Federal Authority", "Presidential Election" and "Labor and Properties" on the right side of bottom plot in Figure 2 clearly discriminate the liberal judges in cluster 2 from the conservative judges in cluster 1. On the other four decision categories, the 9 justices are in much closer agreement.

Conclusions
In this manuscript we have presented an approach to perform model-based clustering on multivariate data, via a mixture of generalized linear models that allows for units to be where the simulated environment is not the same as the fitted model. The comparison was favorable also against some close competitors, such as mixtures of GLMs without overlapping clusters or mixtures with overlapping clusters but without covariates. Finally, we adopted our proposed methodology to the voting records of the US supreme court and identified interesting clusters of voting behaviour for its 9 justices in the 2000-2001 term.
The data and R code for the analysis can be found in https://github.com/savranciati/ miro.
When considering possible future extensions of the method, the definition of the linear predictor could be further enriched by adding, for example, random effects for knowing grouping structures, or specifying regression coefficients that vary for each possible combination of cluster k, unit i, and variable j. However, this will have the drawback of significantly increasing the number of parameters to be inferred. Also, further extensions could revolve around the idea of introducing dependence between the response variables. In this direction, Nikoloulopoulos and Karlis (2010) adopted Copula models in the context of multivariate GLMs, while other authors explored the case of mixtures of bivariate Poisson GLMs (Bermúdez and Karlis, 2012).
withη i being an element ofη =Xβ +Wγ . The design matricesX andW are built by filtering the corresponding matrices of covariates' values X and W through the allocation of each unit, and building a block structure to reflect the possible configurations of z i . In particular, we collect in a matrix X [h] the predictors' recorded values for the n h = n i=1 z hi units allocated into cluster h, and we stack them vertically d times; also, we append a column unitary vector of length n h to account for the intercept. The matrix W [h] is simply built by stacking W exactly n h times, in order to have conforming dimensions. All the resulting matrices have n h × d rows and number of columns, respectively, (L + 1) and Q.
where u h is the h-th row of U , and ⊗ is the Kronecker product. The final design matrices X andW haveñ = n · d rows and, respectively, (L + 1) × K and Q × K columns. The sub-matrices involving h = 1 are not used to sample the regression coefficients. The matrix U is defined to contain all the possible configurations of 1s and 0s of length K, so that we have z hi = 1 [u h =z i ] with u h denoting the h-th row of U and 1 [·] the indicator function.