Probability Based Independence Sampler for Bayesian Quantitative Learning in Graphical Log-Linear Marginal Models

Bayesian methods for graphical log-linear marginal models have not been developed in the same extent as traditional frequentist approaches. In this work, we introduce a novel Bayesian approach for quantitative learning for such models. These models belong to curved exponential families that are difficult to handle from a Bayesian perspective. Furthermore, the likelihood cannot be analytically expressed as a function of the marginal log-linear interactions, but only in terms of cell counts or probabilities. Posterior distributions cannot be directly obtained, and MCMC methods are needed. Finally, a well-defined model requires parameter values that lead to compatible marginal probabilities. Hence, any MCMC should account for this important restriction. We construct a fully automatic and efficient MCMC strategy for quantitative learning for graphical log-linear marginal models that handles these problems. While the prior is expressed in terms of the marginal log-linear interactions, we build an MCMC algorithm that employs a proposal on the probability parameter space. The corresponding proposal on the marginal log-linear interactions is obtained via parameter transformation. By this strategy, we achieve to move within the desired target space. At each step, we directly work with well-defined probability distributions. Moreover, we can exploit a conditional conjugate setup to build an efficient proposal on probability parameters. The proposed methodology is illustrated by a simulation study and a real dataset.


Introduction
Statistical models which impose restrictions on marginal distributions of categorical data have received considerable attention especially in social and economic sciences; see, for example, in Bergsma et al. (2009). A particular appealing class is that of log-linear marginal models introduced by Bergsma and Rudas (2002), that includes as special cases log-linear and multivariate logistic models. The marginal log-linear interactions are estimated using the frequencies of appropriate marginal contingency table, and expressed in terms of log-odds ratios. This setup is important in cases where information is available for specific marginal associations via odds ratios (i.e. marginal log-linear interactions) or when partial information (i.e. marginals) is available.
Log-linear marginal models have been used to provide parameterisations for discrete graphical models; see Lupparelli et al. (2009), Rudas et al. (2010) and Evans and Richardson (2013). In particular, Lupparelli et al. (2009) used them to define a parameterisation for discrete graphical models of marginal independence represented by a bi-directed graph. The absence of an edge in the bi-directed graph indicates marginal independence, and the corresponding marginal log-linear interactions (i.e. the corresponding log-odds ratio) are constrained to zero.
Despite the increasing interest in the literature for graphical log-linear marginal models, Bayesian analysis has not been developed as much as traditional methods. Some context specific results have been presented by e.g. Silva and Ghahramani (2009b), Bartolucci et al. (2012) and Ntzoufras and Tarantola (2013). For graphical log-linear marginal models, no conjugate analysis is available. Therefore, Markov chain Monte Carlo (MCMC) methods must be employed. Nevertheless, the likelihood of the model cannot be analytically expressed as a function of the marginal log-linear interactions. This creates additional difficulties on the implementation of MCMC methods since, at each step, an iterative procedure needs to be applied in order to calculate the cell probabilities and consequently the model likelihood. Moreover, in order to have a well-defined model of marginal independence, we need to construct an algorithm which generates parameter values that lead to a joint probability distribution with compatible marginals. To achieve this, we need an MCMC scheme which moves within the restricted space of parametrisations satisfying the conditions induced by the compatibility of marginal distributions.
In this paper we construct a novel, fully automatic, efficient MCMC strategy for quantitative learning for graphical log-linear marginal models that handles the previously discussed problems. We assign a suitable prior distribution on the marginal log-linear parameter vector, while the proposal is expressed in terms of the probability parameters. The proposal distribution of marginal log-linear interactions is constructed by simply transforming generated candidate values of probability parameters. The corresponding proposal density is directly available by implementing standard theory about functions of random variables. The advantages of this strategy are clear: the joint distribution factorises under certain conditional independence models, and the likelihood can be directly expressed in terms of probability parameters. Furthermore, efficient proposal distributions can be constructed applying the conditional conjugate approach of Ntzoufras and Tarantola (2013), that exploit the representation of the model in terms of an augmented Direct Acyclic Graph (DAG). We present two probability based samplers: the probability-based independence sampler (PBIS) and the prior adjustment algorithm (PAA). The first one is an augmented Metropolis-Hasting algorithm, while the latter it is only an approximation of an independence Metropolis-Hastings algorithm. Hence, related theoretical results cannot be invoked directly for it. Nevertheless, empirical comparisons between the two methods indicate that PAA provides results similar to the ones obtained via PBIS in a faster and in more efficient way.
Assigning a prior distribution on the marginal log-linear interactions rather than on the probability parameters represents a novel approach and is particularly handy in the presence of informative prior about odds for specific marginal associations. For instance, symmetry constraints, vanishing high-order associations or further prior information about the joint and marginal distributions can be easily specified by setting linear constraints on marginal log-linear terms instead of non-linear multiplicative constraints on the probability space. Finally, marginal log-linear parameters are more appealing than probabilities because they are interpretable measures of association between observed variables.
The plan of the paper is as follows. In Section 2, we introduce discrete graphical models of marginal independence and the marginal log-linear parameterisation. In Section 3, we describe the considered prior set-up. Section 4 is devoted to the proposed MCMC strategies. The methodology is illustrated in Section 5 which presents a simulation study and a real data analysis. In Section 6, we conclude with a brief discussion and ideas for future research.

Model Specification and Parameterisation
In this section we briefly introduce discrete graphical models of marginal independence, the related notation and terminology, and the corresponding marginal log-linear parameterisation.
A bi-directed graph G = (V, E), is a graph with vertex set V, and edge set E, such that (u, v) ∈ E if and only if (v, u) ∈ E. Following Richardson (2003) edges are represented via bi-directed arrows. An alternative representation, proposed by Cox and Wermuth (1993), is by undirected dashed edges. The skeleton G of a bi-directed graph G is the graph obtained by making all edges undirected. A ∨ configuration is every triplet of vertices (u, v, z) in G with edges (u, v) and (v, z) and with no edge connecting u and z. A vertex set is connected if there is a path between every pair of vertices belonging to it. We consider a set of random variables where I v is the set of possible levels for variable v. The cross-tabulation of variables Y V produces a |V|-way contingency table with cell frequencies n = (n(i), i ∈ I) where I = × v∈V I v . We further assume that n ∼ M ultinomial(p, N) with p = (p(i), i ∈ I); p(i) is the joint probability for cell i ∈ I, and N = i∈I n(i). A bi-directed graph G is used to represent marginal independencies between variables Y V which are expressed as non-linear constraints over the set of the joint probabilities p. The list of independencies implied by a bi-directed graph can be obtained using the pairwise Markov property (Cox and Wermuth, 1993) and the connected set Markov property (Richardson, 2003). For discrete variables the connected set Markov property implies the pairwise Markov property, whereas the converse is not generally true. Following Drton and Richardson (2008), we define a discrete graphical model of marginal independence as the family of probability distributions for Y V that satisfy the connected set Markov property. For example the bi-directed graph in Figure  The marginal log-linear parameterisation for bi-directed graphs has been proposed by Lupparelli (2006) and Lupparelli et al. (2009); it is based on the class of log-linear marginal models of Bergsma and Rudas (2002). According to Bergsma and Rudas (2002) the parameter vector λ, containing the marginal log-linear interactions, can be obtained as Figure 1: Example of bi-directed graph.
where vec(p) is a vector of dimension |I| obtained by rearranging the elements p in a reverse lexicographical ordering of the corresponding variable levels, with the level of the first variable changing first. Each marginal log-linear interaction satisfies identifiability constraints (here sum-to-zero constraints), and this is achieved via an appropriate contrast matrix C. The interactions are calculated from a specific marginal table identified via the marginalisation matrix M . They are characterised by two sets of variables: one set that refers to the marginal table in use and a second set (subset of the first one) that identifies which variables are involved in the interaction of interest. Finally, the first order interactions correspond to the main effects. We denote with λ Mm the parameter vector containing all interactions estimated from the marginal probability table p Mm . Matrices C and M define a smooth mapping between the probability space and the marginal log-linear space; see Forcina et al. (2010) for technical details about the smoothness. Matrix C controls the induced log-linear parameterisation. In this work we have used the sum-to-zero parameterisation which is one of the possible configurations leading to different sets of log-linear parameters. Details for the construction of C and M are available in Section 3.1 and in the Supplementary Material of this paper; see Appendices B and C (Ntzoufras et al., 2018).
A graphical model of marginal independence is defined by zero constraints on specific marginal log-linear interactions. More precisely, we apply the following procedure presented by Lupparelli (2006) and Lupparelli et al. (2009): (i) define a hierarchical ordering (see Bergsma and Rudas, 2002) of the marginals corresponding to disconnected sets of the bi-directed graph; (ii) append the marginal corresponding to the full table at the end of the list if it is not already included; (iii) for every marginal table estimate all interactions that have not been already obtained from the marginals preceding it in the ordering; (iv) for every marginal table corresponding to a disconnected set of G, restrict the highest order log-linear interaction to zero.
The graphical structure imposes constraints of the type K log M P = 0 with K being the sub-matrix of C for which the corresponding elements of λ are restricted to zero. This parameterisation depends on the ordering of the marginals selected in step (i). Furthermore, it does not always satisfy variation independence; see Lupparelli et al. (2009), Rudas et al. (2010, and Evans and Richardson (2013). If the marginal selected in step (i) are order decomposable then variation independence is guaranteed (Bergsma and Rudas, 2002) so that the marginal-log linear parameter space is rectangular. Hence, in this paper we focus on models based on an order decomposable set of marginals. Graphical log-linear marginal models based on bi-directed graphs with three and four vertices are always variation independent whichever the chosen ordering of the marginals.

Prior Specification for Marginal Log-Linear Interactions
In order to specify our prior setting for graphical marginal log-linear models it is useful to rewrite the model described by (1) in the following extended form where Every λ Mm may contain interactions that are constrained to zero due the graphical structure G and the induced contrast matrix. In the following we focus only on nonzero elements of λ, on which we assign a suitable prior distribution. We denote by λ the set of elements of λ not restricted to zero, that is where r Cm is the number of rows of the contrast matrix C m for marginal M m ∈ M.
When no information is available about λ, we can work separately on each element of λ, assigning suitable independent normal prior distributions with large variance to express ignorance, i.e.
where d λ is the number of elements of λ.
A more sophisticated approach can be based on the prior suggestion of Dellaportas and Forster (1999) for standard log-linear models of the form log μ = Xβ. They considered the following prior distribution where μ is the vector of the expected number of cell frequencies and |I| = v∈V |I v | is the number of cells of the contingency table. Matrix X represents the design matrix of the model; see Supplementary Material (Appendix C) for more details and full specification.
This prior was suggested as a default choice for model comparison of log-linear models and it arises naturally for log-linear marginal models since within each marginal we essentially work by fitting standard log-linear models. In fact, following the procedure described in Section 2, for every marginal we fit a saturated log-linear model, and then we keep only interactions that have not been already estimated from the marginal preceding it in the examined ordering.
Dellaportas and Foster prior was obtained after matching the prior moments of other default priors used for the probability parameters in conjugate analysis of graphical models such as the Jeffreys and the Perks prior distributions, for more details see Sections 3.2 and 3.3 in Dellaportas and Forster (1999). Moreover, the prior distribution induced for the log-odds ratios remains the same even if extra, marginally independent, categorical variables are added in the model. Finally, such prior arises naturally in our context since the adopted parameterisation allows us to work by fitting standard loglinear models within each marginal. These are the main reasons why we recommend adopting this prior here.
The prior mean vector θ has all its elements equal to zero except the first one which is equal to the logarithm of the average number of observations per cell n. Under sumto-zero constraints, X T X is a block diagonal matrix resulting to a set of independent priors for interactions referring to different set of variables.
In order to construct the prior distribution on λ, we work separately on each single set λ Mm obtained from marginal M m and we proceed as follows. Let λ Mm S be the parameter vector for the saturated model that can be estimated from marginal M m ; by construction, it coincides with the parameter vector of the saturated standard loglinear model obtained from this marginal. In terms of Dellaportas and Forster (1999) parameterisation, λ Mm S can be written as λ Mm The prior of λ Mm is obtained via marginalisation from (4) since λ Mm is a subset of λ Mm S . When using sum-to-zero constraints, then X −1 Mm 1 = (1, 0, . . . , 0) T resulting to a prior mean equal to θ j = 0 for all λ Mm j except for the intercept for which the prior mean is given by log n − log(N ). This prior is greatly simplified to a product of independent N (0, 2) for all marginal log-linear interactions (except for the intercept) in the case of binary variables. Finally, the prior for the full parameter vector λ is obtained as a product of the priors on λ Mm .

Likelihood Specification and Posterior Inference
The likelihood cannot directly be expressed in terms of λ (or equivalently λ) but only as a function of the probability parameter Unfortunately, in order to obtain ℘ i (λ), or equivalently P , from (1), we need to implement an iterative algorithm, and therefore the likelihood cannot be written in a closed form expression; see, for example, Bergsma and Rudas (2002) and Lupparelli et al. (2009). As a consequence of this, the corresponding posterior distribution of λ (or equivalently λ), cannot be evaluated straight away. Hence, MCMC methods are needed for posterior inference of λ.
Metropolis-Hastings scheme on the marginal log-linear interactions can be implemented to estimate the posterior distributions of the parameter vector. Nevertheless, such an algorithmic strategy will be inefficient since in every MCMC iteration, the joint probabilities corresponding to the proposed values of λ need to be calculated using an iterative algorithm. This will considerably slow down the MCMC sampler. The use of the iterative procedure may also result to unstable solutions for specific combinations of lambda values. During simulations, we have encountered several inconsistencies in the estimates for specific configurations that may have unpredicted effect on the MCMC runs and the corresponding estimated posterior distributions. Another important issue is the fact that the constructed algorithm should generate parameter values that lead to well-defined joint probability distributions with compatible marginals. Defining a proposal distribution that allows moving within the space of variation independent log-linear marginal models is still an open issue.
For the above reasons, in the following section, we propose a novel MCMC strategy based on the probability representation of the model. Such probabilities are compatible by construction, the corresponding marginal log-linear interactions will be also compatible. For comparative purposes, we have implemented a "vanilla" random walk algorithm MCMC on λ (referred to as RW-λ), which proposes to change each log-linear parameter vector independently for every single marginal based on a decomposable ordering.

Initial Set-Up and Data Augmentation for MCMC
Following the notation of Ntzoufras and Tarantola (2013), we can divide the class of graphical log-linear marginal models in two major categories: homogeneous and nonhomogeneous models. A bi-directed graph is named homogeneous if it does not include a bi-directed 4-chain or a chordless 4-cycle as sub-graphs. Both type of models are shown to be compatible, in terms of independencies, with a certain DAG representation (augmented DAG); see Pearl and Wermuth (1994), Drton and Richardson (2008) and Silva and Ghahramani (2009b,a). Nevertheless, while homogeneous models can be generated via a DAG with the same vertex set, for the non-homogeneous ones it is necessary to include some additional latent variables which makes Bayesian inference challenging. The advantage of the augmented DAG representation is that the joint probability over the augmented variable space (including both observed and latent variables) can be written using the standard DAG factorisation. An example of the DAG representation with latent variables for a non-homogeneous bi-directed graph is provided in Figure 2. In order to construct the augmented DAG, with the minimal set of latent variables, we apply the following procedure presented in Pearl and Wermuth (1994). Given the skeleton G of the examined graph, we assign arrows constructing in this way the sink orientation of G. If no edge in the sink orientation is bi-directed we consider an acyclic orientation of the undirected edges. If the sink orientation contains bi-directed edges, we substitute every bi-directed edge v 1 ←→ v 2 with the directed configuration v 1 ←− −→ v 2 , where vertex represents a hidden or latent variable. Finally, a DAG is constructed via an acyclic orientation of the undirected edges which are present in the sink orientation of the graph. Any DAG obtained via the previously described procedure is called augmented DAG of G. In case of homogeneous models the augmented DAG will not involve latent variables, while for non homogeneous ones we add as many latent variables as the number of bi-directed edges of the corresponding sink orientation. Ntzoufras and Tarantola (2013) exploited the connection between bi-directed graphs and DAGs to develop a Gibbs sampler based on a probability parameterisation of the model. They parameterised the augmented DAG model in terms of a set of marginal and conditional probability parameters Π, on which they implemented a conjugate analysis based on products of Dirichlet distributions. For homogeneous bi-directed graphs, the posterior distribution of the probability vector for the observed variables can be obtained analytically. On the other hand, for non-homogenous graphs, a Gibbs sampler can be obtained by calculating appropriate contrasts of the logarithms of the probability parameters within the proposed data augmentation scheme. The Gibbs sampler for the non-homogenous models is essentially composed by three steps: Step 1: Generate random splits for each bi-directed edge in the sink orientation of G by using a multinomial distribution (i.e. we generate the latent data).
Step 2: Generate random samples from the induced Dirichlet posterior conditional distributions for each set of probability parameters of the augmented DAG, given the augmented table of Step 1.
Step 3: Calculate the joint and the marginal probabilities for the observed variables by implementing the appropriate marginalising transformations to the probability parameter values of Step 2.
Additionally, the posterior distribution of the log-linear interactions can be obtained as by-product of this approach. This can be achieved by calculating appropriate contrasts of the logarithms of the joint probability parameters obtained in Step 3. We may implement this by either adding a step to the above algorithm or by implementing the induced transformation on the final MCMC output of the algorithm.
However, this approach is open to the following critics. First of all, prior information is usually available about the marginal association of the observed variables. This is typically expressed by the size of log-linear interactions which correspond to log-odds. Any prior information available for log-odds cannot be incorporated in a probability based method in a trivial manner. For instance, symmetry constraints, context specific independence assumptions, vanishing high-order associations or further prior information about the joint and marginal distributions can be specified by setting zero or linear constraints on the marginal log-linear parameter space, instead of non-linear multiplicative constraints on the probability space. Specifying sub-models based on a priori knowledge is an important issue for discrete bi-directed graph models. In fact, they generally require to estimate a higher number of parameters compared to models of conditional independencies, such as undirected graph and DAG models; see Lupparelli et al. (2009) for a discussion which compares the number of parameters in discrete bi-directed and undirected graph models with the same skeleton. From this perspective, the marginal log-linear parameterisation is definitively more appealing than the probability based approach.
In this paper, we introduce a novel MCMC strategy, in which the model and the prior are expressed in terms of the marginal log-linear interactions, while the proposal is defined on the probability parameter space. By this way, we can incorporate any source of prior information that might be available regarding marginal associations by specifying the prior values of selected log-odds ratios (i.e. interactions). Moreover, we can exploit the advantages of the conditional conjugacy of the algorithm of Ntzoufras and Tarantola (2013) in order to obtain efficient proposals based on probability parameters. This approach allows us to control the induced values and constraints on interactions in the log-linear space in more natural way than working in the probability space of the (augmented) DAG.
Our approach does not account for any additional non-independence constraints which may arise by marginalising the latent variables from the augmented DAGs. Defining the exact class of marginal DAGs still represents an open issue; see Evans (2016) who discusses possible approximations. Moreover, these inequality constraints are not of great interest as they do not have a straightforward interpretation. Intuitively, if non-independence constraints are included after marginalising over the latent variable, we expect the resulting posterior distribution of the marginal log-linear interactions to be a close approximation of the posterior distribution of the corresponding interactions obtained by the true model.
Using the methodology of Ntzoufras and Tarantola (2013), we construct an MCMC sampler based on proposing values in terms of joint probabilities, avoiding compatibility problems. If the model is homogeneous the dimension of Π is the same as the dimension of λ. This is not true for non-homogeneous models since the dimension of Π is greater than the dimension of λ. In this case we need to augment the parameter space in order to implement Metropolis-Hastings algorithm. In the following, we denote the augmented set of marginal log-linear interactions by with λ A . If the model is homogeneous, then The vector ξ can be thought as an auxiliary set of parameters which are used to retain the dimension balance in the Metropolis-Hastings algorithm.
More precisely, for any graphical log-linear marginal model G we can obtain a Markov equivalent DAG over the observed margins, denoted by D G , with augmented vertex set A = {V, L}, where L is the set of additional latent variables; if G is homogeneous then L = ∅ and A = V. Under this approach, the joint probabilities can be written as and the probability parameter set is given by is the joint probability for the observed variables Y V and the latent variables Y L , pa(v) stands for the parents set of v in graph D G , is the conditional probability of each variable Y v given variables Y pa (v) in the parent set pa(v) of v. By using the induced augmented likelihood representation, we are able to construct a Gibbs sampler based on the conditional conjugate Dirichlet prior distributions on Π (Ntzoufras and Tarantola, 2013).

The General Algorithm
The posterior distribution of the augmented set of marginal log-linear interactions is given by where f (ξ) is a pseudo prior used for the additional parameters.
We consider a Metropolis-Hastings algorithm which can be summarised by the following steps.
3. From p , calculate λ using (1) and then obtain the corresponding non-zero elements λ .

Set
; in our implementation we have used as ξ the last d ξ terms of Π ξ .

Accept the proposed move with probability α = min(1, A) with
where abs(·) stands for the absolute value, Π ξ = ξ, and J = J (Π, λ, ξ) is the determinant of the Jacobian matrix of the transformation Π = g( λ, ξ) specified by (6), (7), and (1). Similarly to Step 1, λ (t) , ξ (t) and λ (t) are used to denote the values of the corresponding parameters in the current iteration t of the algorithm.
The pseudo-parameter vector ξ is used only to retain the dimension balance between the marginal log-linear parameterisation and the probability parameterisation used in Ntzoufras and Tarantola (2013). Furthermore, it is directly matched to specific probability parameters of the bi-directed graph G. Hence, we can indirectly "eliminate" the effects of ξ on the algorithm by assuming that its elements are uniformly distributed in the zero-one interval. Under this view, we set f (ξ ı ) = I {0<ξı<1} having as a result the elimination of the ratio f (ξ )/f (ξ (t) ) from (8). In the following, we consider this choice in order to simplify all proposed algorithms.
A good choice of the proposal q(Π |Π (t) ) will lead to high (close to one) acceptance rates. Therefore, an efficient proposal for the Metropolis Hastings scheme described in this section, intuitively seems to be the following where f (n A |Π, n) is the distribution of counts n A given the observed frequency table n and the probability parameter set Π of the augmented table induced by A; and f q (Π|n A ) is the conditional posterior distribution of the probability parameter vector Π given a proposed set of augmented data n A . Considering all possible configurations in (9) within each MCMC iteration is cumbersome and time-consuming. One solution can be obtained by using a random sub-sample of n A . Hence, we construct our MCMC by employing just one realisation of n A which will play the role of intermediate nuisance parameter that facilitates the construction of a sensible proposal distribution. This approach corresponds to an MCMC scheme with a target posterior distribution given by The parameter vector n A plays the role of (nuisance) augmented data with f (n A ) being a pseudo-prior. More precisely, we introduce n A in order to augment the target posterior and to achieve a dimension balance between the two parameterisations: the marginal log-linear parameterisation and the probability based one. This trick assists us to retain the balance between the two parametrisations and it enables us to implement standard MCMC methods.
In order to simplify the MCMC configuration, we consider a uniform distribution as pseudo-prior over all possible configurations of n A . Under this formulation, the acceptance probability in the Metropolis-Hastings is equal to α = min(1, A) with A given by In (10), the probability functions f (n A |Π, n) and f (n A(t) |Π , n) are readily available from the model construction and the likelihood representation of the augmented table. We only need to specify f q (Π |n A ) which is the first component of (9) and it has the form of a posterior conditionally on the frequencies of the augmented table. For this component, we can exploit the conditional conjugate approach of Ntzoufras and Tarantola (2013). In order to do so, we consider as a "prior" f q (Π) a product of Dirichlet distributions in order to obtain a conjugate "posterior" distribution f q (Π |n A ). Under this approach, and by further considering that f (n|Π)f (n A |Π, n) = f (n A |Π), then (10) is further simplified to In the following of this manuscript, we will refer to this approach as the probability-based independence sampler (PBIS).

Prior Adjustment Algorithm
As we have already stated, although PBIS simplifies the MCMC scheme, the parameter space is still considerably extended by considering the augmented frequency table n A . We can further simplify PBIS by using the following two-step procedure: Step 1: Run the Gibbs sampler of Ntzoufras and Tarantola (2013) to obtain a sample from Π (see Section 4.1 for more details).
Step 2: Use the sample of step 1 (or sub-sample of it) as a proposal in the general Metropolis-Hastings algorithm with acceptance rate (8).
Since the sample of Step 1 is obtained from an MCMC algorithm, auto-correlation will be present. A random (independent) sub-sample from the posterior distribution can be obtained by following different strategies. The most common approach can be obtained by using thinning, where we keep one observation for every set of K iterations. The thinning interval K can be easily defined by monitoring autocorrelation function or using the convergence diagnostic of Raftery and Lewis (1992) which is available in R packages such as CODA (Plummer et al., 2006) or BOA (Smith, 2007). A drawback of this approach is that we may end up having a considerably lower number of simulated observations. For this reason, we suggest considering a random permutation of the full MCMC sample to "destroy" the induced autocorrelation. We believe that the effect of this strategy will be minimal, since this sample is only used as a proposal in the second MCMC procedure. As a referee pointed out, by following either of these approaches, the sample variance will get close to the independent variance asymptotically. Although intuitively this approach will lead to a sample from a close an approximation of the target posterior distribution, to our knowledge no mathematical proof is available. Indeed, empirical comparisons (see Section 5.1) indicate no differences between the posterior distribution obtained by a correctly thinned MCMC sample and the re-ordered sample or even with the one-shot MCMC sampler of Section 4.2. Therefore, we believe that this sampler provides an efficient approximation of an independence Metropolis-Hastings algorithm.
Using such sample (or sub-sample) as proposed values within the Metropolis-Hasting algorithm is equivalent to using the posterior distribution f q (Π|n) as proposal in (8), that is q(Π |Π (t) ) = f q (Π |n). Under this proposal, (8) now simplifies to We will refer to this algorithm as the prior-adjustment algorithm (PAA) due to its characteristic to correct for the differences between the prior distributions used under the two parameterisations.
The "prior" distribution f q (Π) is only used to build the proposal. Therefore, it can be considered as a pseudo-prior. It does not influence the target posterior distribution but only affects the convergence rates of PAA. We can choose the parameters of this pseudo-prior in such a way that (12) is maximized and an optimal acceptance rate is achieved. When a non-informative prior distribution for λ is used, all Dirichlet parameters involved in f q (Π) can be set equal to one. Under this choice, the effect of the pseudo-prior is eliminated from the proposal, leaving the data-likelihood to guide the MCMC algorithm. PAA is less computationally demanding that the single-run MCMC algorithm introduced in Section 4.2, since in we avoid four additional likelihood evaluations at each iteration required in the later.

The Jacobian
We conclude this section by providing analytical expressions of the Jacobian required in the acceptance probabilities within each MCMC step in Sections 4.2 and 4.3; see (11) and (12). Specifically, the Jacobian terms are given by J (Π, λ, ξ) = | ∂Π ∂( λ,ξ) | and are simplified to where Π \ξ is obtained from Π excluding the elements of Π ξ .
The elements of the Jacobian matrix are given by where P ı denote the ı element of P , and c C is the number of columns of the contrast matrix C. For the saturated model, the above equation simplifies to More details on the calculation of (13) are provided in Appendix A in the Supplementary Material.
In order to complete the specification of (13), we need to calculate the derivative terms Δ ıj . Let us now denote by i the index of vector P such that P i ≡ p(i). Moreover, the index j corresponds to a variable u j ∈ A such that Π j ≡ π uj|pa(uj) (j uj |j pa(uj) ) for a specific cell j of the augmented table I A . Therefore, for any ı = i , terms Δ ıj are given by (uj) for every i → i and j → (u j , j). (14) For the computation of each Δ ıj we consider two different cases: (A) u j ∈ V and (B) u j ∈ L. In the following, to simplify notation, we denote u j by u. Furthermore, we indicate with L u = L ∩ pa(u) the latent variables that are parents of u, and with For case A, when u is an observed variable, we obtain with where For case B, when u is a latent variable, pa(u) = ∅ due to the structure of the DAG representation. Hence, the derivative is given by Detailed derivation of expressions (15)-(17) are available in Appendix A in the Supplementary Material.

Simulation Study
In this section, we evaluate the performance of the proposed methodology via a simulation study. We generated 100 samples from the marginal association model represented by the bi-directed graph of Figure 1 and true log-linear interactions given in Table 1 We compare the methods introduced and discussed in this article in terms of acceptance rate, effective sample (ESS) per second of CPU time and the Monte Carlo Error (MCE). In addition to the algorithms described in Section 4 (PBIS and PAA) we also consider random walks on marginal log-linear interactions λ and on logits of probability (2, 2) = −0.30, λ ABCD ABC (2, 2, 2) = 0.15, λ ABCD BCD (2, 2, 2) = −0.10, λ ABCD ABCD (2, 2, 2) = 0.07. Table 1: True effect values used for the simulation study.
parameters π (RW-λ and RW-π respectively). For all interactions and for each method under consideration, we calculated the ESS per second of CPU time using both CODA and RSTAN (Stan Development Team, 2017) packages in R. We calculated MCEs for the mean and standard deviation of all interactions via the batch mean method using 50 batches of equal size. The previous quantities have been adjusted for computational time by fixing the number of iterations of PAA and then considering as number of iterations for the remaining methods the ones which correspond to the computational time of PAA.
In order to proceed with a more rigorous analysis, we first present the results for a single randomly selected sample (see Table 2 for the specific dataset under consideration) and then we discuss the main findings for all samples.  In terms of acceptance rate, PAA achieves a 50% rate which is satisfactory for an independence sampler and considerably higher than the corresponding one achieved by the PBIS algorithm (≈ 15%). This could be justified by the fact that, in the later method, increased uncertainty was introduced by the data augmentation approach, which expands the parameter space by introducing the latent counts as proposed in Section 4.2. On the other hand, the RW algorithms (either for λ or for π) were tuned in order to achieve an acceptance rate close to 35%. The acceptance rates of the independence samplers (PBIS and PAA) and of the RW algorithms cannot be compared due to the different nature of the proposals. Table 3, while the detailed ESS for all interactions are depicted in Figure 3. These results indicate that PAA is the most efficient method followed by PBIS and RW-λ having similar values, while the RW-π appears as a rather inefficient way to approach the problem.     Table 2. higher order interaction terms λ ABCD ABC , λ ABCD BCD and λ ABCD ABCD estimated from the ABCD marginal table. More specifically, for λ ABCD ABC only RW-π seems to estimate exactly the true value while PBIS slightly overestimates this effect and the rest of the methods underestimate it. For λ ABCD BCD , all methods provide similar posterior distributions with the PBIS having lower dispersion. Finally, for the four-way interaction, all methods correctly identify the true effect except for RW-λ that overestimates the true value. Generally, the estimated posterior distribution obtained by our proposed method, PAA, seems that correctly identifies the true values for all interactions.

Summary statistics of ESS per second of CPU time are presented in
In Figure 5 we represent the time adjusted MCEs for the posterior means for all methods and all marginal log-linear interactions. We notice that PAA performs better than all competing methods, since the corresponding MCEs are lower for almost all interactions. In contrast, more dispersion is observed for the posterior distribution of RW-π that is clearly performing worse than the other methods.    Table 2. across all generated data sets. In most cases, PAA achieves values lower or at least of comparable size to the corresponding one of the other methods.
If we examine the distribution of the ESS per second (over 100 simulated datasets), we confirm the results found in the single-sample analysis which indicate that PAA is clearly the most efficient between the four methods under consideration. PBIS and the RW on λ are equally efficient (in terms of ESS/minute) while the RW-π is the least efficient method. For a graphical representation see Figure   We conclude this section with a comparison of the dispersion of PAA and RW-λ. Figure 9 illustrates the distributions of the ratio of the posterior standard deviations of PAA versus RW-λ across all simulated datasets. We observe that for most interactions this ratio is distributed around one. For the last four interactions, where the latent is involved, we observe that PAA has systematically lower standard deviation.

Torus Mandibularis in Eskimoid Groups Data
We illustrate the proposed methodology by using the dataset of Muller and Mayhall (1971) studying the incidence of the morphological trait torus mandibularis in different Eskimo groups. Torus mandibularis is a bony growth in the mandible along the surface nearest to the tongue. This morphological structure of the month is frequently used by anthropologist to study differences among populations and among groups within the same population. This data have been previously analysed by Bishop et al. (1975) via log linear models, and by Lupparelli (2006)   For our analysis, we consider the data presented in Table 4, cross-classifying age (A), incidence of torus mandibularis (I), sex (S) and population (P). The dataset is a dichotomized version of the original data of Muller and Mayhall (1971). The examined Eskimo groups refers to different geographical regions, Igloolik and Hall Beach groups are from Foxe Basin area of Canada whereas Aleut are from Western Alaska. Furthermore, the data of the Aleuts group were collected by an investigator different from the one who collected the data for the first two groups, with a time difference between investigations of about twenty years. For the previous reasons we decided to reclassify the data in two groups: the first one including Igloolik and Hall Beach and the second one Aleut. Finally, variable age has been classified in two groups according to the median value.
From the analysis of Lupparelli (2006) we know that the model represented in Figure 10 fits well the original data, hence, in the following, we concentrate on this specific four-chain graph.  Table 5 reports the posterior means and standard deviations for the marginal loglinear interactions obtained via 10000 iteration with a burn in of 1000 with the proposed PAA algorithm and the RW-λ. The maximum likelihood estimates (MLEs) and corresponding approximate standard errors are also reported for comparative purposes. MLEs are obtained implementing an algorithm for the optimization of the Lagrangian log-likelihood which includes a set of zero constraints satisfying the marginal independence model; see Lupparelli (2006) for details. Standard errors for the parameter estimates can be simply derived from the estimate of the Hessian matrix of the Lagrangian log-likelihood; see Aitchison and Silvey (1958).
From Table 5 we observe that the posterior estimates (for both MCMC methods) and the MLEs coincide for all interactions and main effects obtained by marginals where no latent variable is involved; also Figure D.3 in the Supplementary Material. This is not the case for λ AIP S IP (2, 2), λ AIP S IP S (2, 2, 2), λ AIP S AIP S (2, 2, 2), where the latent variable is involved. More specifically, the posterior standard deviations are lower by 6.5%, 34% and 26%, respectively. This result is intuitively expected since PAA moves across the correct posterior distribution defined on the space of parameterisation with compatible marginal probabilities. On the other hand, both the RW-λ and the approximate MLEs standard errors are obtained without considering the restrictions imposed in order to obtain parameterisations leading to compatible marginal probabilities.
Finally, differences are also observed for interaction λ AIP S AIP (2, 2, 2) where RW-λ provides posterior means far away from the corresponding MLEs with PAA being quite closely and standard deviance slightly higher than both the corresponding values of RW-λ and the MLEs standard errors.  In terms of computational time, PAA was found to be faster with elapsed CPU time lower by 46% compared with the corresponding one for RW-λ (32.6 versus 60.8 seconds for 1000 iterations). Equivalently, the reported user CPU time was 48% lower for PAA than the one for RW-λ (29.9 versus 57.2 seconds for 1000 iterations). All runs were performed in a Windows 10 PC Intel Core i7-5500U CPU 2.4GHz with 16GB memory; all timings were obtained with the system.time function in R.
For 11,000 iterations, the effective sample size (ESS) of the methods are of similar size ranging from 43% in favour of RW-λ (for S main effect) to 36% in favour of our method (for the four-way interaction). Taking into consideration also the computational time of the two algorithms, the MCMC efficiency (ESS/Elapsed CPU time) for the proposed algorithm is considerably higher for all interactions ranging from 2% increase to 141%. The average relative MCMC efficiency is higher for our method by 65% compared with the one of the random walk MCMC. The above statistics were calculated with the effectiveSize function of CODA package in R. The overall picture is similar if we consider results using the monitor function of rstan package in R; see Figure 11 for a visual representation of the relative efficiency for all interactions.
Finally, the MCMC errors were lower for the majority of the interactions by using the naive estimator of CODA package with relative values ranging from 0.61 up to 1.13 while the corresponding relative values for the time series based estimator are ranging from 0.66 up to 1.30. The naive estimator of MCE is simply given by the usual standard error ignoring autocorrelation (i.e. the MCMC estimated posterior standard deviation over the square root of the number of iterations).

Discussion
A possible way to parametrise discrete graphical models of marginal independence is by using the log-linear marginal models of Bergsma and Rudas (2002). The marginal log-linear interactions are calculated from specific marginals of the original table, and independencies imply zero constraints on specific set of interactions in a similar manner as in conditional log-linear graphical models.
In this work we focus on the Bayesian estimation of the log-linear interactions for graphical models of marginal independence. In particular, the method we propose allows us to assign prior directly on marginal log-linear interactions rather than on the probability interactions. This facilitates the incorporation of prior information since several models of interest can be specified by zero/linear constraints on log-linear terms. Bayesian analysis of such models is not widespread mainly due to the computational problems involved in the derivation of their posterior distribution. More specifically, MCMC methods need to be used since no conjugate analysis is available. Major difficulties arise from the fact that we need to sample from a posterior distribution defined on the space of parameterisations with compatible marginal probability distributions. In the proposed algorithms (PBIS and PAA), we satisfy such restrictions by sampling from the probability space of the graphical model of marginal independence under consideration. Then we transform the probability parameter values to the corresponding marginal log-linear ones avoiding the iterative procedure needed for the evaluation of the likelihood. In order to achieve this, we exploit the augmented DAG representation of the model. This not only facilitates the prior elicitation but also the construction of the Jacobian matrix involved in the acceptance probability of the induced Metropolis steps. PBIS is a more elaborate method but it leads to an automatic setup for sampling log-linear interactions by the posterior distribution. On the other hand, PAA can be implemented in a more straightforward manner. It leads to an efficient and fully automatic setup, which can be thought as a approximate method for sampling from the posterior distribution. It further avoids any time-consuming and troublesome tuning of MCMC parameters.
For future research, the authors would like to exploit and study the connections between the prior and the posterior distributions for the two different parameterisa-tions (probability versus marginal log-linear). Moreover, extension of the method to accommodate fully automatic selection, comparison and model averaging techniques is an intriguing topic for further investigation.