Dependence in elliptical partial correlation graphs

The Gaussian model is equipped with strong properties that facilitate studying and interpreting graphical models. Specifically, it reduces conditional independence and the study of positive association to determining partial correlations and their signs. When Gaussianity does not hold, partial correlation graphs are a useful relaxation of graphical models, but it is not clear what information they contain (besides the obvious lack of linear association). We study elliptical and transelliptical distributions as a middle ground between the Gaussian family and other families that are more flexible but either do not embed strong properties or do not lead to simple interpretation. We characterize the meaning of zero partial correlations in the (trans)elliptical family and show that it retains much of the dependence structure of the Gaussian case. Regarding positive dependence, we prove impossibility results for learning (trans)elliptical graphical models, including that an elliptical distribution that is multivariate totally positive of order two for all dimensions must be essentially Gaussian. We then show how to interpret positive partial correlations as a relaxation, and obtain important properties related to faithfulness and Simpson's paradox. We illustrate the potential of the transelliptical model to study tail dependence in SP500 data, and of positivity to help regularize inference.


Introduction
Several papers study graphical models for elliptical and transelliptical distributions in the standard setting, Finegold & Drton (2009); Vogel & Fried (2011), and in the high-dimensional setting, Barber & Kolar (2018); Bilodeau (2014); Liu et al. (2012b); Zhao & Liu (2014). These models have found applications in many fields, such as finance and biology, Behrouzi & Wit (2019); Stephens (2013); Vinciotti & Hashem (2013), and (implicitly) wherever Gaussian graphical models were used but the underlying distribution is likely to depart from normality, e.g. by being heavy-tailed or asymmetric. In the elliptical setting the usual definition of graphical models mimics the Gaussian case: the model is given by zeros in the inverse covariance or, equivalently, by vanishing partial correlations. Although this is a reasonable relaxation, the partial correlation graph (PG) cannot be interpreted in terms of conditional independence, since outside of the normal case no elliptical distribution allows for conditional independence (c.f. Proposition 2.6). It is therefore unclear what type of dependence information is embedded in the PG.
For general distributions partial correlations inform only about linear dependence. Missing edges in the PG must then be interpreted with great care and, in some cases, they can fail to capture interesting dependence information. For example, in an aircraft data set from Bowman & Foster (1993), we can model the dependence between the speed of an airplane and its wingspan. Although the sample correlation is negligible, more flexible dependence tests reveal that the variables are strongly related; see e.g. Székely & Rizzo (2009). The reason is that for very fast (military) airplanes there is a negative dependence between speed and wingspan, while this dependence is positive for regular aircraft.
The main theme of this paper is that for (trans)elliptical distributions there is more information in the partial correlation graph beyond presence/absence of linear dependence. We introduce definitions and notation to aid the exposition.
Definition 1.1. A random vector X = (X_1, ..., X_d) has an elliptical distribution if there exist µ ∈ R^d and a positive semi-definite matrix Σ such that the characteristic function of X is of the form t ↦ φ(t^T Σ t) exp(iµ^T t) for some φ : [0, ∞) → R. We write X ∼ E(µ, Σ), leaving φ implicit in this notation. Important examples include the multivariate normal, the multivariate Laplace and the multivariate t-distributions. Elliptical graphical models have been extended to transelliptical distributions (also known as elliptical copulas or meta-elliptical distributions, Fang et al. (2002); Liu et al. (2012b)).
Here the additional challenge is that the marginal transformations f = (f_1, ..., f_d) are unknown. An elegant approach to learning partial correlations relies on directly estimating the correlation matrix of f(X) without actually learning f (see Liu et al. (2012b); Lindskog et al. (2003)), and then proceeding as in the elliptical case (Section 3.3).
Throughout we denote K = Σ^{-1}, the set of vertices by V = {1, ..., d}, by X_(i) the (d−1)-dimensional vector obtained by removing X_i from X, and by X_(ij) the vector obtained by removing (X_i, X_j) from X. Given I, J ⊆ V, denote by X_I and µ_I the subvectors of X and µ with coordinates in I, and by Σ_IJ the corresponding subblock of Σ with rows in I and columns in J. The partial correlation between (X_i, X_j) is

ρ_{ij·V\{i,j}} = −K_ij / √(K_ii K_jj),   (1)

and so ρ_{ij·V\{i,j}} = 0 if and only if K_ij = 0. Finally, we denote that (X_i, X_j) are independent by X_i ⊥⊥ X_j.
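To fix ideas, here is a minimal R sketch (ours, not part of the paper's supplementary code) that recovers the partial correlations (1) from a precision matrix; the 3 × 3 matrix below is an arbitrary illustrative example.

    partial_correlations <- function(K) {
      P <- -cov2cor(K)   # -K_ij / sqrt(K_ii * K_jj), c.f. (1)
      diag(P) <- 1
      P
    }
    K <- matrix(c( 2, -1,  0,
                  -1,  2, -1,
                   0, -1,  2), nrow = 3)  # K_13 = 0: edge 1-3 is absent
    partial_correlations(K)               # entry (1,3) is zero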
A usual interpretation of PGs in elliptical distributions is that, since the conditional expectation E(X_i | X_(i)) is linear in X_(i) and the conditional correlation is equal to the partial correlation, the condition ρ_{ij·V\{i,j}} = 0 implies that (X_i, X_j) are conditionally uncorrelated. That is, zero partial correlation implies zero conditional correlation. As we show in Theorem 3.4, something much stronger is true. It is possible to fully characterize the PG in elliptical distributions: the partial correlation ρ_{ij·V\{i,j}} = 0 if and only if cov(g(X_i), X_j | X_(ij)) = 0 for every function g for which the covariance exists.
A similar characterization extends to transelliptical distributions. The usual interpretation of ρ_{ij·V\{i,j}} = 0 is that f_i(X_i) and f_j(X_j) are conditionally uncorrelated given f_(ij)(X), which is not very informative, since f is unknown. We show in Theorem 3.5 that, equivalently, cov(f_i(X_i), g(X_j) | X_(ij)) = 0 for any g, provided the covariance exists. In particular, cov(f_i(X_i), X_j | X_(ij)) = 0, which is more explicit dependence information in terms of X.
These findings are practically relevant. Recall that two variables (X_i, X_j) with a general distribution are independent if and only if cov(g(X_i), h(X_j)) = 0 for all L²(R) functions g, h; see e.g. (Feller, 1971, page 136). That is, X_i ⊥⊥ X_j if and only if there is no way to transform (X_i, X_j) such that the new variables are correlated. Our characterization of ρ_{ij·V\{i,j}} = 0 has an analogous interpretation: in elliptical families there is no way to transform X_i such that the new variable is correlated with X_j. Further, interpreting ρ_{ij·V\{i,j}} = 0 can be important in applications. In particular, (trans)elliptical models are often used to capture second-order or tail dependencies, and even when ρ_{ij·V\{i,j}} = 0 such dependencies can be practically significant.
As an example, let X ∼ E(0, Σ) and consider θ_ij = corr(X_i², X_j²) as a simple measure of marginal tail (or second-order) dependence. Using the representation X = µ + τ^{−1/2} Y, where Y ∼ N(0, Σ) and τ > 0 is random (Section 2), it is possible to show that

θ_ij = corr(X_i², X_j²) = (1 − λ) ρ_ij² + λ,   (2)

where λ ≥ 0 does not depend on (i, j) and λ = 0 if and only if X is Gaussian. This measure is minimized for ρ_ij = 0, where corr(X_i², X_j²) = λ, which can be non-negligible, e.g. λ = 1/(ν − 1) for the t-distribution with ν > 4 degrees of freedom. Figure 1 shows this quantity and, for comparison, also the normalized mutual information (a standard measure of deviation from independence). Both measures converge to zero as ν → ∞, but this convergence is slow. Similarly, one may measure conditional tail dependence via

θ_{ij·V\{i,j}} = corr(X_i², X_j² | X_(ij)) = (1 − λ) ρ_{ij·V\{i,j}}² + λ,   (3)

where the right-hand side follows from Proposition 2.2 below. When ρ_{ij·V\{i,j}} = 0 the conditional tail dependence is λ. See Section 5 for further discussion and an illustration on stock market data.
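As a quick sanity check of (2) (a simulation sketch of ours, not from the supplementary material): for a bivariate t-distribution with ν = 8 degrees of freedom and ρ_ij = 0, the empirical corr(X_i², X_j²) should be close to λ = 1/(ν − 1) = 1/7 ≈ 0.143, even though the correlation itself is zero.

    set.seed(1)
    nu <- 8; n <- 1e6
    tau <- rchisq(n, df = nu) / nu       # mixing variable: tau ~ chi^2_nu / nu
    Y <- matrix(rnorm(2 * n), ncol = 2)  # Y ~ N(0, I_2), so rho_12 = 0
    X <- Y / sqrt(tau)                   # X = tau^{-1/2} Y is bivariate t_nu
    cor(X[, 1], X[, 2])                  # approx 0: no linear dependence
    cor(X[, 1]^2, X[, 2]^2)              # approx 1/(nu - 1) = 0.143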
Our other main contributions relate to PGs in settings where one wishes to study positive forms of association. Two standard ways to define positive dependence are via the notions of multivariate total positivity of order two (MTP_2) and conditionally increasing (CI, Section 4). Although these concepts are different, in the Gaussian case they are equivalent and reduce to constraining partial correlations to be non-negative.

Figure 1. corr(X_i², X_j²) and normalized mutual information for a multivariate t-distribution with ν degrees of freedom, in the ρ_ij = 0 case.

It is less clear how to interpret these concepts in general elliptical families. A first contribution is showing several impossibility results: within the elliptical family with at least one partial correlation zero there exist no conditionally increasing distributions (other than the normal), implying the same result for MTP_2 distributions. That is, if one wants to remove edges in the PG under an additional positive dependence structure, one cannot rely on the standard notions of positive dependence.
A natural relaxation is to learn a PG under the constraint that ρ_{ij·V\{i,j}} ≥ 0, as proposed by Agrawal et al. (2019). We refer to this strategy as positive partial correlation graphs (PPG). We contribute to understanding how one should interpret missing edges in the PPG, and to characterizing embedded positivity properties, such as the positive correlation of each X_i with any increasing function of the vector X. In Section 5 we illustrate how positivity constraints induce a type of regularization that can help improve inference relative to other standard forms of regularization, such as the graphical LASSO, specifically attaining a higher log-likelihood with a sparser graph. This again testifies that our theoretical results have practical relevance. For further examples in risk modelling see Abdous et al. (2005); Rüschendorf & Witting (2017), and in psychology see Epskamp & Fried (2018); Lauritzen et al. (2019b), for example. This paper also contributes to recent research aimed at understanding multivariate total positivity in a wide variety of contexts; see, for example, Fallat et al. (2017); Lauritzen et al. (2019a,b); Robeva et al. (2018); Slawski & Hein (2015). We provide in Theorem 4.15 a complete characterization of elliptical MTP_2 distributions in terms of their density generator. In Theorem 4.18 this characterization is used to show a remarkable result: a density generator may induce a d-variate MTP_2 distribution for each d ≥ 2 if and only if the underlying distribution is Gaussian.
The paper is organized as follows. In Section 2 we review basic results on elliptical distributions. In Section 3 we characterize partial correlation graphs for elliptical and transelliptical distributions, giving a refined understanding of their encoded dependence information. In Section 4 we study positive elliptical distributions and their alternative characterizations. In Section 5 we illustrate our main results with examples.

Elliptical distributions
We review some elliptical distribution results; for more information see Fang (2018) or Kelker (1970), for example.
2.1. Stochastic representation. If X ∼ E(µ, Σ) then X admits the representation

X = µ + ξ Σ^{1/2} U,   (4)

where Σ^{1/2} denotes the square root of Σ, ξ is a random variable taking positive values, U ∈ R^d is uniformly distributed on the unit (d − 1)-dimensional sphere, and ξ ⊥⊥ U.
Elementary arguments show that, if Eξ < ∞, then EX = µ and, if Eξ² < ∞, the covariance matrix is cov(X) = (E(ξ²)/d) · Σ.

Remark 2.1. Throughout this paper we assume that Eξ² < ∞, that ξ admits a density function with respect to the Lebesgue measure and that Σ is positive definite. Then X has a density of the form

f(x) = c_d |Σ|^{−1/2} ϕ((x − µ)^T Σ^{−1} (x − µ)),   (5)

where ϕ is called the density generator and does not depend on the dimension d (c.f. Kelker (1970)).
There is a useful equivalent representation in terms of normal variables. Let D² ∼ χ²_d be independent of (ξ, U) and Y = √(D²) Σ^{1/2} U. From (4) it follows that

X = µ + τ^{−1/2} Y,   (6)

where τ = D²/ξ² and Y ∼ N(0, Σ) after marginalizing with respect to D². Note that, in general, we do not assume that τ ⊥⊥ Y. The case τ ⊥⊥ Y corresponds to the scale mixture of normals sub-family, which includes the most popular elliptical distributions. The normal distribution corresponds to τ ≡ 1. If τ ∼ χ²_ν/ν for ν > 2 then X has a multivariate t-distribution. If ν = 1 we get the multivariate Cauchy, and if τ ∼ Exp(1) the multivariate Laplace distribution.
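The representation (6) with τ ⊥⊥ Y gives a simple sampler for these examples; below is a hedged R sketch (the function name rmix is ours).

    rmix <- function(n, mu, Sigma, rtau) {
      d <- length(mu)
      Y <- matrix(rnorm(n * d), n, d) %*% chol(Sigma)  # Y ~ N(0, Sigma)
      tau <- rtau(n)                                   # mixing variable
      sweep(Y / sqrt(tau), 2, mu, "+")                 # X = mu + tau^{-1/2} Y
    }
    Sigma <- matrix(c(1, 0.5, 0.5, 1), 2, 2)
    X_t   <- rmix(1e4, c(0, 0), Sigma, function(n) rchisq(n, 5) / 5)  # t, nu = 5
    X_cau <- rmix(1e4, c(0, 0), Sigma, function(n) rchisq(n, 1))      # Cauchy
    X_lap <- rmix(1e4, c(0, 0), Sigma, function(n) rexp(n))           # Laplace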
The elliptical family is closed under taking margins and under conditioning.
Proposition 2.2. Let X = (X_I, X_J) ∼ E(µ, Σ) be any split of X into subvectors X_I and X_J. Then

X_I ∼ E(µ_I, Σ_II)   and   X_I | X_J = x_J ∼ E(µ_{I|J}, Σ_{I|J}),

where µ_{I|J} = µ_I + Σ_IJ Σ_JJ^{−1}(x_J − µ_J) and Σ_{I|J} = Σ_II − Σ_IJ Σ_JJ^{−1} Σ_JI.

For the proof see, for example, (Fang, 2018, Theorem 2.18). The conditional mean µ_{I|J} has the same form as in the Gaussian case, where Σ above can be replaced by cov(X) (a scalar multiple of Σ). Moreover, the conditional correlations corr(X_i, X_j | X_(ij)) are the normalized entries of K = Σ^{−1} (the partial correlations ρ_{ij·V\{i,j}}), and do not depend on the value of the conditioning variable X_(ij); see also Lemma 2.4 below.
2.2. Characterization of Gaussianity within the elliptical family. If X is Gaussian then each marginal distribution and each conditional distribution is Gaussian. Moreover, the conditional covariances do not depend on the conditioning variable and independence is equivalent to zero correlations. These properties characterize the Gaussian distribution in the class of elliptical distributions. We recall these basic results.
Lemma 2.3 (Lemma 4 and 8 in Kelker (1970)). Let X ∼ E(µ, Σ). If X I is Gaussian for some I ⊆ V then X is Gaussian. Further, if X I given X J is Gaussian for some I, J ⊆ V then X is Gaussian.
We noted earlier that conditional correlations do not depend on the conditioning variable. For conditional covariances this is only true in the Gaussian case.
Lemma 2.4 (Theorem 7 in Kelker (1970)). Let X = (X I , X J ) ∼ E(µ, Σ). The conditional covariance of X I given X J is independent of X J if and only if X is Gaussian.
The standard definition of graphical models uses density factorizations that link to conditional independence through the Hammersley-Clifford theorem (Lauritzen, 1996). However, conditional independence cannot hold in the elliptical family outside of the Gaussian case. The next two characterizations are the most consequential for this article.
Lemma 2.5 (Lemma 5 in Kelker (1970)). Let X ∼ E(µ, Σ). If Σ is a diagonal matrix, then the components of X are independent if and only if X has a normal distribution.

Proposition 2.6. Let X ∼ E(µ, Σ). If X_i ⊥⊥ X_j | X_(ij) for some i ≠ j, then X is Gaussian.
The fact that Gaussianity is needed to conclude independence in Lemma 2.5, or conditional independence in Proposition 2.6, is also seen from (2) and (3), respectively. For example, if τ is not constant then corr(X_i², X_j²) > 0, proving that (X_i, X_j) cannot be independent. This gives an alternative way of proving these two results.

Graphs for (trans)elliptical distributions
3.1. Partial correlation graph and dependence. By Proposition 2.6 it is not possible to do structural learning in (non-normal) elliptical graphical models under the conditional independence definition. It is then natural to look for relaxations that may be useful from the modelling point of view. A common strategy is to model zeros in the inverse covariance matrix, mimicking the Gaussian case; see Vogel & Fried (2011).

Definition 3.1. The partial correlation graph (PG) of X is the undirected graph G = (V, E) in which the edge (i, j) is missing if and only if ρ_{ij·V\{i,j}} = 0 or, equivalently, K_ij = 0.

In general, zero partial correlations do not imply conditional independence, only the absence of linear association. The aim of this section is to understand what additional information the PG carries in elliptical distributions. Proposition 2.2 and standard matrix algebra give

E(X_i | X_(i)) = µ_i − (1/K_ii) ∑_{j≠i} K_ij (X_j − µ_j),   (7)

hence K_ij = 0 if and only if E(X_i | X_(i)) does not depend on X_j. This immediately gives the following standard result.

Proposition 3.2. Let X ∼ E(µ, K^{−1}). Then K_ij = 0 if and only if cov(X_i, X_j | X_(ij)) = 0.
Our first main results offer a stronger characterization for elliptical distributions. Lemma 3.3 relates to marginal covariances, and immediately gives Theorem 3.4 on conditional covariances.

Lemma 3.3. Let X ∼ E(µ, Σ) and let I, J ⊆ V be disjoint. Then Σ_IJ = 0 if and only if cov(g(X_I), X_J) = 0 for every function g for which the covariance exists.

Proof. Clearly if cov(g(X_I), X_J) = 0 for all g then, taking g to be the identity function on R^{|I|}, it follows that Σ_IJ = 0. To prove the reciprocal implication, let τ > 0 and Y ∼ N(0, Σ) be as in (6). Since X_{I∪J} ∼ E(µ_{I∪J}, Σ_{I∪J,I∪J}), it follows that X_{I∪J} = µ_{I∪J} + τ^{−1/2} Y_{I∪J} and X_{I∪J} | τ ∼ N(µ_{I∪J}, (1/τ) Σ_{I∪J,I∪J}). By the law of total covariance

cov(g(X_I), X_J) = E[cov(g(X_I), X_J | τ)] + cov(E(g(X_I) | τ), E(X_J | τ)),

where both the expectation and covariance are with respect to the random variable τ. The first term on the right is zero because Σ_IJ = 0 and so X_I ⊥⊥ X_J | τ. The second term is zero because E(X_J | τ) = µ_J, which does not depend on τ.
Theorem 3.4. Let X ∼ E(µ, K −1 ). Then K ij = 0 if and only if cov(g(X i ), X j |X (ij) ) = 0 for any function g for which this covariance exists.
Proof. The equivalence follows by applying Lemma 3.3 to the conditional distribution of (X_i, X_j) given X_(ij).

Lemma 3.3 and Theorem 3.4 are if-and-only-if statements; that is, they characterize the presence of zero marginal and partial correlations, respectively. In particular, Theorem 3.4 characterizes the meaning of elliptical PGs: if (X_i, X_j) are conditionally uncorrelated then so are X_j and any function of X_i. For instance, there is no linear association between X_j and the higher powers X_i², X_i³, etc.

3.2. Transelliptical distributions.
Recall that X has a transelliptical distribution, denoted X ∼ TE(µ, Σ), if and only if f(X) = (f_1(X_1), ..., f_d(X_d)) ∼ E(µ, Σ) for some strictly increasing functions f_1, ..., f_d. If f(X) is Gaussian (the nonparanormal sub-family) the PG gives conditional independence of X and so it is highly interpretable (Liu et al., 2012a). More generally, a missing edge in the PG means that cov(f_i(X_i), f_j(X_j) | f_(ij)(X)) = 0, but this interpretation is not very interesting given that f is unknown, and simply refers to the lack of linear association between the latent variables (f_i(X_i), f_j(X_j)). The focus should be directly on the dependence structure of X. Our second main result shows that a weaker version of Theorem 3.4 holds for transelliptical distributions.
Theorem 3.5. Let X ∼ TE(µ, K^{−1}) with f(X) ∼ E(µ, K^{−1}). Then K_ij = 0 if and only if cov(f_i(X_i), g(X_j) | X_(ij)) is zero for every function g for which the covariance exists.
Proof. Write Y = f(X) ∼ E(µ, K^{−1}). If K_ij = 0 then, by Theorem 3.4 applied to Y, cov(Y_i, h(Y_j) | Y_(ij)) = 0 for every function h for which the covariance exists. Taking h = g ∘ f_j^{−1} gives cov(f_i(X_i), g(X_j) | X_(ij)) = cov(Y_i, h(Y_j) | Y_(ij)) = 0, where in the last equation we used the fact that X_(ij) is a one-to-one function of Y_(ij) and so they both define the same sigma-field. To prove the reverse implication, note that taking g = f_j gives cov(Y_i, Y_j | Y_(ij)) = 0, which implies K_ij = 0 by Theorem 3.4.
Theorem 3.5 helps interpret the PG as follows. If K ij = 0 then f i (X i ) is conditionally uncorrelated with any function of X j . Hence learning a single element f i within f (rather than the whole f ) describes (local) aspects of conditional dependence of X i on X (i) (and functions thereof).
Taking g to be the identity function in Theorem 3.5 we get the following result.

Corollary 3.6. Let X ∼ TE(µ, K^{−1}). If K_ij = 0 then there exists a strictly increasing function g such that cov(g(X_i), X_j | X_(ij)) = 0, and similarly with the roles of i and j exchanged.
The function g in this corollary is precisely the function f i in Theorem 3.5. Corollary 3.6 gives the following interpretation. If K ij = 0 then X i is conditionally uncorrelated with some strictly increasing transformation of X j and also X j is conditionally uncorrelated with some strictly increasing transformation of X i .
3.3. Kendall's tau. Theorem 3.5 characterizes PGs using covariances between any function of X_j and the latent f_i(X_i). Kendall's tau gives an interesting alternative characterization that can be interpreted without any reference to f. Let X = (X_1, ..., X_d) be a continuous random vector and X̃ = (X̃_1, ..., X̃_d) an independent copy. Kendall's tau for (X_i, X_j) is defined as

τ(X_i, X_j) = E[sign((X_i − X̃_i)(X_j − X̃_j))].

In elliptical distributions the following beautiful result relates the Pearson correlations ρ(X_i, X_j) = corr(X_i, X_j) = Σ_ij/√(Σ_ii Σ_jj) with Kendall's tau.
Lemma 3.7 (Lindskog et al. (2003)). If X ∼ E(µ, Σ) then

τ(X_i, X_j) = (2/π) arcsin(ρ(X_i, X_j)).

Since Kendall's tau is invariant under strictly increasing transformations, for X ∼ TE(µ, Σ) we have τ(X_i, X_j) = τ(f_i(X_i), f_j(X_j)), and one can learn the correlation matrix associated to Σ without learning f.
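Lemma 3.7 suggests the rank-based estimator below (a sketch of ours; the package huge implements a refined version as the "skeptic" estimator): compute pairwise Kendall's tau and invert the arcsine relation.

    kendall_correlation <- function(X) {
      Tau <- cor(X, method = "kendall")  # pairwise Kendall's tau
      R <- sin(pi * Tau / 2)             # invert tau = (2/pi) arcsin(rho)
      diag(R) <- 1
      R
    }

The result may fail to be positive semi-definite in finite samples, in which case one can project it onto the set of correlation matrices, e.g. with Matrix::nearPD(R, corr = TRUE).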
Below is another basic corollary of Lemma 3.7.
Define the conditional Kendall's correlation as

τ(X_i, X_j | X_(ij)) = E[sign((X_i − X̃_i)(X_j − X̃_j)) | X_(ij)],

where (X̃_i, X̃_j) is an independent copy of (X_i, X_j) from the conditional distribution given X_(ij). If X ∼ E(µ, K^{−1}) then, by Lemma 3.7 applied to the conditional distribution of (X_i, X_j) given X_(ij), we have that

τ(X_i, X_j | X_(ij)) = (2/π) arcsin(ρ_{ij·V\{i,j}}).

This suggests an obvious plug-in estimator for conditional Kendall's correlations.
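A sketch of this plug-in estimator (ours), reusing kendall_correlation from the previous snippet: estimate the latent correlation matrix, invert it, read off the partial correlations and map them back through Lemma 3.7.

    conditional_kendall <- function(X) {
      R <- kendall_correlation(X)     # latent correlation via Kendall's tau
      K <- solve(R)                   # precision matrix (up to scaling)
      P <- -cov2cor(K); diag(P) <- 1  # partial correlations, c.f. (1)
      (2 / pi) * asin(P)              # conditional Kendall's correlations
    }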
Corollary 3.8. Let X ∼ TE(µ, K^{−1}). Then K_ij = 0 if and only if τ(X_i, X_j | X_(ij)) = 0, or equivalently, τ(g(X_i), h(X_j) | X_(ij)) = 0 for all strictly increasing functions g and h.

Proof. The last equivalence follows from the invariance of τ under strictly increasing transformations. Let Y = f(X). If τ(X_i, X_j | X_(ij)) = 0 then τ(Y_i, Y_j | Y_(ij)) = 0, which by Lemma 3.7 applied to the conditional distribution of (Y_i, Y_j) given Y_(ij) implies that Σ_{ij·V\{i,j}} = 0 (c.f. Proposition 2.2), or equivalently, K_ij = 0. To prove the reverse implication, suppose that K_ij = 0; then, by Lemma 3.7, τ(Y_i, Y_j | X_(ij)) = 0 and hence τ(g(X_i), h(X_j) | X_(ij)) = 0 for all strictly increasing g, h.

Positive dependence in elliptical distributions
In this section we study PGs in elliptical distributions when one imposes positive dependence. We begin by recalling two important notions of multivariate positive dependence. We show that neither of them is meaningful for learning structure in elliptical PGs. This leads to relaxations given by elliptical distributions whose partial correlations are all nonnegative, which we refer to as positive partial correlation graphs (PPG). We then complement the interpretation of PPGs offered by the characterizations in Section 3 by studying positive dependence properties embedded within PPGs.

4.1. Positive dependence. Let X be a d-variate continuous random vector with density function f.

Definition 4.1. A random vector X (or its density function f) is multivariate totally positive of order two (MTP_2) if and only if

f(x) f(y) ≤ f(min(x, y)) f(max(x, y))   for all x, y ∈ R^d,   (10)

where min(x, y) = (min(x_1, y_1), ..., min(x_d, y_d)) is the coordinatewise minimum and max(x, y) = (max(x_1, y_1), ..., max(x_d, y_d)) the coordinatewise maximum of x and y.

Definition 4.2. A random vector X is conditionally increasing (CI) if, for every i ∈ V, every C ⊆ V \ {i} and every t ∈ R, the conditional probability P(X_i > t | X_C = x_C) is non-decreasing in x_C.
Both notions are closed under taking margins and under conditioning.

Proposition 4.3. If X is MTP_2 (respectively, CI) then, for any disjoint I, J ⊆ V, the marginal vector X_I and the conditional distribution of X_I given X_J = x_J are also MTP_2 (respectively, CI).

The proof for the marginal distribution in the CI case follows from the definition. For the MTP_2 property it relies on smart combinatorial arguments; see Karlin & Rinott (1980). The statement for conditional distributions follows from the definitions.
Another well-known result is that these positivity notions are closely related, and closed under monotone transforms; see Theorem 3.3 and Proposition 3.5 in Müller & Scarsini (2001) as well as Proposition 3.1 in Fallat et al. (2017).
Theorem 4.4. If a random vector is MTP_2 then it is conditionally increasing.

Proposition 4.5. If X is MTP_2 (respectively, CI) and g_1, ..., g_d : R → R are strictly increasing, then (g_1(X_1), ..., g_d(X_d)) is MTP_2 (respectively, CI).
It is possible to prove that in the Gaussian case both condition (10) and CI simplify to an explicit constraint on the inverse covariance K. A symmetric positive definite matrix K is called an M-matrix if K_ij ≤ 0 for all i ≠ j. Denote the set of inverse M-matrices (inverses of M-matrices) by IM. Directly from (1), Σ ∈ IM if and only if all partial correlations ρ_{ij·V\{i,j}} are nonnegative.
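In R, this condition can be checked in one line (an illustrative sketch; the tolerance guards against numerical noise).

    is_M_matrix <- function(K, tol = 1e-10) {
      # Sigma = K^{-1} is in IM iff K_ij <= 0 for all i != j,
      # i.e. iff all partial correlations are nonnegative
      all(K[row(K) != col(K)] <= tol)
    }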
Proposition 4.6 (Proposition 3.6 in Müller & Scarsini (2001)). Suppose X is a Gaussian vector with covariance Σ. Then X is MTP_2 if and only if X is CI, if and only if Σ ∈ IM.

4.2. Positive elliptical distributions. We first show in Theorem 4.8 that the positive dependence notions reviewed in Section 4.1 are not useful in our setting. If K has any zeros then X cannot be CI (hence neither MTP_2, by Theorem 4.4). The same impossibility result applies to transelliptical families (outside the nonparanormal subfamily). As a consequence, it is not possible to learn structure (remove edges) of a non-normal elliptical graphical model under these positivity constraints.
Even if one were to forsake structural learning and focus on the fully dense graph with no missing edges, it is not possible to find MTP 2 /CI transelliptical distributions, except in very restrictive cases. Proposition 4.9 shows that there are no MTP 2 /CI t-distributions. We defer a deeper analysis to Section 4.3, where we fully characterize the elliptical MTP 2 class and show that it is highly restrictive, particularly as d grows.
We conclude the current section by defining positive transelliptical distributions as those for which ρ_{ij·V\{i,j}} ≥ 0 for all i, j ∈ V (equivalently, K being an M-matrix, following Agrawal et al. (2019)) and by showing basic properties such as closedness under margins, conditioning and increasing transforms. We also give properties important for inference, such as positivity of partial correlations given any conditioning set, graph faithfulness, and the fact that Simpson's paradox cannot occur.
Remark 4.7. From (7) it follows that K is an M-matrix if and only if for every i ∈ V the conditional expectation E(X_i | X_(i)) is increasing in X_(i). Note that, if X is CI then E(X_i | X_(i)) must be increasing in X_(i) and so, in particular, ρ_{ij·V\{i,j}} ≥ 0 for all i, j ∈ V. This shows that nonnegativity of all partial correlations is a necessary condition for X to be CI, and so also for X to be MTP_2.
Theorem 4.8. Let X ∼ E(µ, K^{−1}) with K_ij = 0 for some i ≠ j. If X is CI then X is Gaussian. Similarly, if X ∼ TE(µ, K^{−1}) with K_ij = 0 for some i ≠ j is CI, then X is nonparanormal.

Proof. Let X ∼ E(µ, K^{−1}) and suppose K_ij = 0. By Proposition 3.2, the conditional covariance cov(X_i, X_j | X_(ij)) is zero. Since X is CI, by Proposition 4.3, the conditional distribution of (X_i, X_j) given X_(ij) is also CI. It is well known that CI distributions are also associated; c.f. Colangelo et al. (2005). By Corollary 3 in Newman (1984) applied to this conditional distribution we get that cov(X_i, X_j | X_(ij)) = 0 implies X_i ⊥⊥ X_j | X_(ij). From Proposition 2.6 we know that the latter is only possible if X is Gaussian. Consider now X ∼ TE(µ, K^{−1}). From Proposition 4.5, Y = f(X) is CI and, since Y is elliptical, by our earlier result Y must be Gaussian.
Zeros in the inverse covariance matrix are not the only obstacle for the CI property.
Proposition 4.9. If X has a multivariate t-distribution then X is not CI.
Proof. Since the CI property is closed under taking margins, it is enough to show that no bivariate t-distribution is conditionally increasing. Suppose (X_1, X_2) has a bivariate t-distribution with ν degrees of freedom. Without loss of generality assume that the mean is zero and that the scale matrix Σ satisfies Σ_11 = Σ_22 = 1, Σ_12 = ρ. By Remark 4.7, necessarily ρ ≥ 0. Moreover, if ρ = 0 the statement follows from Theorem 4.8, so assume ρ > 0. The conditional distribution of X_1 given X_2 = x_2 is a Student t-distribution with ν* = ν + 1 degrees of freedom, location µ* = ρ x_2 and a scale parameter depending on x_2 (c.f. Section 5 in Roth (2012)). The proof then exhibits an increasing function f such that E(f(X_1) | X_2 = x_2) is not increasing in x_2. Using the formula (Johnson et al., 1994, (28.4a)) for the c.d.f. of the Student t-distribution, for x_2 ≤ ν/ρ one can express E(f(X_1) | X_2 = x_2) in terms of the incomplete beta function evaluated at a point α(x_2), with the resulting expression strictly increasing in α(x_2). It is then enough to show that α(x_2) is not an increasing function of x_2, which follows by showing that α is strictly decreasing for all x_2 ≤ −ρ in some neighborhood of −ρ.
Proposition 3.3 of Rüschendorf & Witting (2017) states that for an elliptical distribution Σ ∈ IM if and only if X is CI. Unfortunately, this result is not true, as illustrated by both Theorem 4.8 and Proposition 4.9.
Our results show that the CI/MTP_2 properties are too restrictive in connection with PGs. As a natural alternative, we study the following relaxation proposed by Agrawal et al. (2019).

Definition 4.10. We say that X has a positive elliptical (respectively, positive transelliptical) distribution if X ∼ E(µ, Σ) (respectively, X ∼ TE(µ, Σ)) with Σ ∈ IM, that is, with all partial correlations ρ_{ij·V\{i,j}} nonnegative.

Proposition 4.11. The classes of positive elliptical and positive transelliptical distributions are closed under taking margins and under conditioning; the positive transelliptical class is also closed under strictly increasing marginal transformations.

Proof. If X ∼ E(µ, Σ) then, by Proposition 2.2, for every I ⊂ V, X_I ∼ E(µ_I, Σ_II). If Σ ∈ IM then Σ_II ∈ IM by (Johnson & Smith, 2011, Corollary 2.3.2). Similarly, if Σ ∈ IM then Σ_II − Σ_IJ Σ_JJ^{−1} Σ_JI ∈ IM by (Johnson & Smith, 2011, Corollary 2.3.1), proving that the conditional distribution of X_I given X_J is a positive elliptical distribution. The same argument after replacing X with f(X) works for transelliptical distributions. The last statement follows directly from the definition of transelliptical distributions.
The next proposition shows that positive elliptical distributions retain some strong properties of MTP 2 Gaussian distributions.
Proposition 4.12. If X has a positive elliptical distribution then for all i ∈ V and C ⊆ V \ {i} the conditional mean E(X_i | X_C) is an increasing function of X_C. Moreover, for any two i, j ∈ V and C ⊆ V \ {i, j} it holds that

corr(X_i, X_j | X_C) ≥ 0, and corr(X_i, X_j | X_C) = 0 implies corr(X_i, X_j | X_D) = 0 for every D with C ⊆ D ⊆ V \ {i, j}.

Proof. These results are well known for Gaussian MTP_2 distributions; c.f. Fallat et al. (2017). It is convenient to translate them to equivalent statements in terms of Σ; c.f. (Drton et al., 2009, Proposition 3.1.13). The statement about the conditional mean and the first statement about conditional correlations follow from the fact that IM-matrices are closed under taking principal submatrices. In consequence, for all i, j ∈ V and C ⊆ V \ {i, j} it holds that (Σ_{C∪{i,j},C∪{i,j}})^{−1}_{ij} ≤ 0. The last part states that if det Σ_{C∪{i},C∪{j}} = 0 for some C ⊆ V \ {i, j} then det Σ_{D∪{i},D∪{j}} = 0 for every D ⊇ C. This statement is given in (Johnson & Smith, 2011, Theorem 3.3).
These properties are pivotal in the interpretation and application of classical positive dependence measures. Briefly, the first part says that for positive elliptical distributions conditional correlations are nonnegative, regardless of which subset of variables one conditions upon. The second part says that if a covariance conditional on X_C is zero, then it remains zero when conditioning upon larger sets. In particular, zero marginal correlation implies zero partial correlation, hence Simpson's paradox cannot occur.
The following result offers an extension of Theorem 3.4 to the positive case.
Proposition 4.13. If X has a positive elliptical distribution then corr(X i , g(X)|X C ) ≥ 0 for every i ∈ V , any increasing function g : R d → R, and any conditioning set C ⊂ V .
Proof. Consider first C = ∅. Using the representation (6) and the law of total covariance,

cov(X_i, g(X)) = E[cov(X_i, g(X) | τ)] + cov(E(X_i | τ), E(g(X) | τ)).

The second term is zero because E(X_i | τ) does not depend on τ. To argue that the first term is nonnegative we note that conditionally on τ the vector X is MTP_2 and so also associated (c.f. Colangelo et al. (2005)). This implies that cov(h(X), g(X) | τ) ≥ 0 for any two increasing functions h, g : R^d → R, which holds in particular if h(x) = x_i. If C ≠ ∅ the same proof holds after conditioning on X_C, because the MTP_2 property is closed under conditioning.
Many constraint-based structure learning algorithms, like the PC algorithm (Spirtes et al., 2000), rely on the assumption that the dependence structure in the data-generating distribution reflects the graph faithfully. We say that the distribution of X is faithful to a graph G if X_i ⊥⊥ X_j | X_C holds if and only if the subset of vertices C separates vertices i and j in G, for any C ⊆ V \ {i, j}. In words, any independence obtained by conditioning on subsets C is reflected in the graph. Under faithfulness one can consistently learn the underlying graph from data by conditioning on potentially smaller subsets than the full set of vertices, and benefit from simpler computation. One may extend this definition to partial correlation graphs (Spirtes et al., 2000): the distribution of X is linearly faithful to an undirected graph G if corr(X_i, X_j | X_C) = 0 if and only if C separates i and j in G. Bühlmann et al. (2010) proposed a related convenient notion: the distribution of X is partially faithful to a graph G if corr(X_i, X_j | X_C) = 0 for some C ⊆ V \ {i, j} implies that corr(X_i, X_j | X_(ij)) = 0. It is easy to see that linear faithfulness implies partial faithfulness: if corr(X_i, X_j | X_C) = 0 then i and j are separated in G by C and so also by V \ {i, j}, implying that corr(X_i, X_j | X_(ij)) = 0. An important property of positive elliptical distributions is given by the following result.
Theorem 4.14. Every positive elliptical distribution is linearly faithful to its partial correlation graph and so also partially faithful.
Proof. The proof of linear faithfulness (which also implies partial faithfulness) follows the same ideas as in Fallat et al. (2017), Theorem 6.1. A direct proof of partial faithfulness follows from Proposition 4.12.
Using partial faithfulness Bühlmann et al. (2010) developed a simplified version of the PC algorithm that is computationally feasible even with thousands of variables and was reported to be competitive to standard penalty-based approaches.
4.3. Characterization of MTP_2 elliptical distributions. We finish our discussion of positive dependence for elliptical distributions with a complete characterization of MTP_2 distributions. Proposition 1.2 in Abdous et al. (2005) gives a necessary and sufficient condition for bivariate elliptical distributions. With a bit of matrix algebra their proof generalizes to arbitrary dimension. Recall from Remark 2.1 that the density of X, up to a normalizing constant, is uniquely given by the density generator ϕ.
Theorem 4.15. Suppose X has a d-dimensional elliptical distribution with partial correlations ρ_{ij·V\{i,j}} ≥ 0. Let ρ* = min_{i≠j} ρ_{ij·V\{i,j}}. If the logarithm of the density generator, φ(t) = log ϕ(t), is twice differentiable, then X is MTP_2 if and only if: φ'(t) ≤ 0; φ'(t) = 0 implies φ''(t) = 0; and

−ρ*/(1 + ρ*) ≤ t φ''(t)/φ'(t) ≤ ρ*/(1 − ρ*)   (11)

for all t ∈ T = {t : φ'(t) < 0}. In particular, inf_{t∈T} t φ''(t)/φ'(t) > −1/2 is necessary for X to be MTP_2.

Proof. Without loss of generality assume X has mean zero and K = Σ^{−1} satisfies K_11 = ... = K_dd = 1. In this case K_ij = −ρ_{ij·V\{i,j}} for all i ≠ j. If X admits a strictly positive density function f(x) then X is MTP_2 if and only if ∂² log f(x)/∂x_i ∂x_j ≥ 0 for every x ∈ R^d and every i ≠ j. This result, found in Bach (2019), can be proved by elementary means, for example, by applying a second-order mean value theorem (Theorem 9.40 in Rudin (1964)). In our case log f(x) = φ(x^T K x) up to an additive constant. Basic calculus gives ∇φ(x^T K x) = 2 φ'(x^T K x) Kx and

∇²φ(x^T K x) = 4 φ''(x^T K x) (Kx)(Kx)^T + 2 φ'(x^T K x) K.

Denoting t = x^T K x and v = (1/√t) Kx, the (i, j)-th entry of the Hessian is nonnegative if and only if

2t φ''(t) v_i v_j + φ'(t) K_ij ≥ 0.   (12)

We will now check explicit conditions so that (12) is satisfied for all x ∈ R^d.

Taking x = 0 we get that φ'(0) ≤ 0. Fixing t > 0, note that v satisfies v^T K^{−1} v = 1 but otherwise it is arbitrary. In particular, taking v such that v_i = 0, (12) implies that φ'(t) ≤ 0 for all t ≥ 0, which gives the first condition. If φ'(t) = 0, (12) cannot hold for all x ∈ R^d satisfying x^T K x = t unless φ''(t) = 0, which gives the second condition in the theorem. Now suppose t is such that φ'(t) < 0; then (12) becomes

2 v_i v_j · t φ''(t)/φ'(t) ≤ ρ_{ij·V\{i,j}}.   (13)

To study the bounds on 2v_i v_j subject to v^T Σ v = 1 we use Lagrange multipliers; all stationary points must satisfy v_i² = v_j². The maximal value of 2v_i v_j subject to v^T Σ v = 1 is 2α², obtained at a point where v_i = v_j = α, and since v^T Σ v = 1 we get 2α² = 1 − ρ_{ij·V\{i,j}}. In a similar way we show that the minimal value of 2v_i v_j is −(1 + ρ_{ij·V\{i,j}}). This gives that (13) holds for all admissible v if and only if

−ρ_{ij·V\{i,j}}/(1 + ρ_{ij·V\{i,j}}) ≤ t φ''(t)/φ'(t) ≤ ρ_{ij·V\{i,j}}/(1 − ρ_{ij·V\{i,j}}).

This inequality must be satisfied for every i ≠ j. However, the functions ρ/(1 + ρ) and ρ/(1 − ρ) are increasing for ρ ∈ [0, 1), and so min_{ij} ρ_{ij·V\{i,j}}/(1 − ρ_{ij·V\{i,j}}) = ρ*/(1 − ρ*) and min_{ij} ρ_{ij·V\{i,j}}/(1 + ρ_{ij·V\{i,j}}) = ρ*/(1 + ρ*). Thus we arrive at (11).

Now suppose that φ is such that φ'(t) ≤ 0; φ'(t) = 0 implies that φ''(t) = 0; and (11) holds for all t ∈ T. By reversing the argument above we conclude that (12) holds for all t ∈ T. For all the remaining t the inequality (12) also holds because then both sides are equal to zero. However, as we argued before, (12) holds for all t if and only if X is MTP_2. This concludes our proof.
We illustrate Theorem 4.15 with two examples.
Example 4.16. By Proposition 4.9, if X has a t-distribution then X is not conditionally increasing, and so in particular it is not MTP_2. Theorem 4.15 provides an easy way to see this. For the d-dimensional t-distribution with ν degrees of freedom, φ(t) = −((ν+d)/2) log(1 + t/ν). Since φ'(t) = −(1/2)(ν+d)/(ν+t) < 0, condition (11) implies that

−ρ*/(1 + ρ*) ≤ t φ''(t)/φ'(t) = −t/(ν + t)

must be satisfied for all t ≥ 0. Taking the limit t → ∞ shows that this is impossible irrespective of ρ* ∈ (−1, 1). Similarly, in the case of a zero-mean multivariate Laplace distribution the density generator is ϕ(t) = (t/2)^{ν/2} K_ν(√(2t)), where ν = (2−d)/2 and K_ν(·) is the modified Bessel function of the second kind. Irrespective of d, t φ''(t)/φ'(t) ∈ (−1, −1/2) and so these distributions are never MTP_2.
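Both computations can be verified numerically; the sketch below (ours, using finite-difference derivatives) evaluates t φ''(t)/φ'(t) for the two generators.

    ratio <- function(phi, t, h = 1e-4 * max(t, 1)) {
      d1 <- (phi(t + h) - phi(t - h)) / (2 * h)           # phi'(t)
      d2 <- (phi(t + h) - 2 * phi(t) + phi(t - h)) / h^2  # phi''(t)
      t * d2 / d1
    }
    nu <- 5; d <- 3
    phi_t <- function(t) -(nu + d) / 2 * log(1 + t / nu)  # multivariate t
    phi_lap <- function(t) {                              # multivariate Laplace
      nus <- (2 - d) / 2
      nus / 2 * log(t / 2) + log(besselK(sqrt(2 * t), abs(nus)))  # K_{-nu} = K_nu
    }
    sapply(c(1, 10, 100, 1000), function(t) ratio(phi_t, t))    # tends to -1
    sapply(c(1, 10, 100, 1000), function(t) ratio(phi_lap, t))  # in (-1, -1/2)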
The constraints on possible ρ* in Example 4.17 did not take into account one more important aspect of the problem, namely that K is a d × d positive definite matrix. To illustrate this, suppose that all off-diagonal entries of K are equal, that is, ρ_{ij·V\{i,j}} = ρ* = −K_ij > 0 for all i ≠ j. Such K is positive definite if and only if ρ* < 1/(d − 1). In Example 4.17 this gives an upper bound on ρ* that interplays with the lower bound ρ* ≥ |1 − 1/α|. These two bounds define a non-empty set if and only if |1 − 1/α| < 1/(d − 1). If d = 2 this holds for any α > 1/2. It is remarkable that this simple example generalizes and yields the following characterization of elliptical families with a fixed density generator that contain MTP_2 distributions. Recall from Remark 2.1 that a d-variate elliptical distribution with density generator ϕ admits the density function f(x) = c_d |Σ|^{−1/2} ϕ((x − µ)^T Σ^{−1} (x − µ)), and the density generator in this representation does not depend on d.
Theorem 4.18. Consider the family of all elliptical distributions with density generator ϕ(t) and let φ(t) = log ϕ(t). Then there exists a scale matrix parameter Σ such that the density (5) is MTP_2 if and only if: φ'(t) ≤ 0; φ'(t) = 0 implies φ''(t) = 0; and

−1/d ≤ inf_{t∈T} t φ''(t)/φ'(t)   and   sup_{t∈T} t φ''(t)/φ'(t) ≤ 1/(d − 2),

where T = {t : φ'(t) < 0}. The Gaussian distribution is the only elliptical distribution for which this condition holds for every d ∈ N.
For the last statement, write β_* = inf_{t∈T} t φ''(t)/φ'(t) and β^* = sup_{t∈T} t φ''(t)/φ'(t), and note that −1/d ≤ β_* ≤ β^* ≤ 1/(d − 2) for every d ∈ N if and only if β_* = β^* = 0. But this implies that φ''(t) = 0 for all t ≥ 0 and so φ(t) = at + b for some a, b ∈ R. Hence the density generator ϕ is the exponential function, giving the Gaussian distribution.
To illustrate this result, consider the elliptically symmetric logistic distribution as defined in Fang (2018), Section 3.5, whose density generator is ϕ(t) = e^{−t}/(1 + e^{−t})². Theorem 4.15 gives that a bivariate logistic distribution is MTP_2 if and only if ρ_12 ≥ 1/2. However, if d ≥ 3, Theorem 4.18 implies that there are no MTP_2 distributions of this form.

Examples
We illustrate the application of the transelliptical PG and PPG, and the interpretation afforded by our characterizations, with SP500 stock market data. The R code to reproduce our analyses is provided as supplementary material. We downloaded the daily log-returns of SP500 stocks for the 10-year period ranging from 2010-04-29 to 2020-04-14 (n = 2,514 observations). For illustration we selected the first d = 100 stocks, hence the graphical model has 4,950 potential edges. We used the R package huge (Zhao et al., 2012) to apply univariate transformations aimed at improving the marginal normal fit (function huge.npn). Despite these transformations, we observed departures from multivariate normality. Let the observed and transformed n × d data matrices be X and Y (respectively), both with zero column means and unit variances. The empirical distribution of the Mahalanobis distances (y_i1, ..., y_id) S^{−1} (y_i1, ..., y_id)^T, where S is the sample covariance, had significantly thicker tails than the χ²_d distribution expected under multivariate normal data with S = Σ (Figure 2, top left).
We studied the dependence structure in these data via several models. First we fit a transelliptical model TE(0, Σ) to X, where Σ is estimated by first computing Kendall's τ and then exploiting its connection to Σ in Lemma 3.7. This procedure can be performed with option npn.func = "skeptic" in function huge.npn; see Liu et al. (2012b) for details. Second, we also fit an elliptical model E(0, Σ) to Y. In both models we estimated K = Σ^{−1} via the graphical LASSO (Friedman et al., 2008) with the regularization parameter set via the EBIC (Chen & Chen (2008), function huge.select), in the transelliptical case using the pseudo-likelihood defined by τ̂_ij; see Foygel & Drton (2010). The transelliptical model is in principle more robust, in that it does not require estimating the marginal transformations. However, both models provided similar results: the Spearman correlation between the estimated K̂_ij was 0.911, the selected PGs agreed on 93.0% of the 4,950 edges, and there were no disagreements in the signs of K̂_ij for any (i, j).
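The following sketch outlines the transelliptical part of this pipeline with huge; the input file name and column selection are placeholders, not the paper's supplementary code.

    library(huge)
    X <- as.matrix(read.csv("sp500_logreturns.csv"))[, 1:100]  # hypothetical file
    S_tau <- huge.npn(X, npn.func = "skeptic")  # latent correlations via Kendall's tau
    fit <- huge(S_tau, method = "glasso")       # graphical LASSO path
    sel <- huge.select(fit, criterion = "ebic") # EBIC model selection
    K_hat <- as.matrix(sel$opt.icov)            # estimated precision matrix
    G_hat <- as.matrix(sel$refit)               # selected graph (adjacency)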
To illustrate the interpretation of the PG implied by K̂, relative to a Gaussian graphical model, we focus on the elliptical model for Y. Figure 2 (bottom left) shows that the marginal tail dependence θ̂_ij in (2) is significantly larger than the ρ̂²_ij expected under normality. The magnitude of these departures is practically significant. For comparison, the figure also displays θ̂_ij estimated from simulated normal data with zero mean and sample covariance matching that of Y.
In practical terms, θ_ij measures the predictability of a variable's variance (also called volatility) from that of other variables. A natural question is what predictability remains after conditioning upon other variables, i.e. what is the conditional tail dependence θ_{ij·V\{i,j}} in (3). To address this question, for each variable pair (i, j) we computed the non-parametric estimate θ̂_{ij·V\{i,j}} = corr(e_i², e_j² | X_(ij)), where e_i = x_i − µ̂_{i|V\{i,j}}, x_i is the i-th column in X and µ̂_{i|V\{i,j}} the least-squares prediction given X_(ij) (analogously for e_j). These estimates were significantly larger than the ρ̂²_{ij·V\{i,j}} expected under normality (Figure 2, bottom right). As a further check, from (3) the elliptical model predicts θ_{ij·V\{i,j}} to be linear in ρ²_{ij·V\{i,j}}. Figure 2 (top right) suggests that they are indeed roughly linearly related. Admittedly, one never expects a model to describe the data perfectly, but the elliptical model appears reasonable to study volatility in these data.
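A sketch of this residual-based estimate (the function name is ours); X is the n × d matrix of log-returns with standardized columns.

    cond_tail_dep <- function(X, i, j) {
      Z <- X[, -c(i, j), drop = FALSE]  # conditioning variables X_(ij)
      ei <- residuals(lm(X[, i] ~ Z))   # x_i minus its least-squares prediction
      ej <- residuals(lm(X[, j] ~ Z))
      cor(ei^2, ej^2)                   # estimate of theta_{ij.V\{i,j}}
    }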
The estimated partial correlation graph had 1,600 out of the 4,950 possible edges. Our results from Section 3 help strengthen the interpretation of the missing edges, e.g. K̂_ij = 0 suggests that, conditionally on X_(ij), one cannot predict the variance, asymmetry or kurtosis of x_j linearly from x_i. Further, it also implies zero conditional Kendall's tau between increasing transforms of x_i and x_j, e.g. if daily returns are not conditionally positively/negatively correlated (according to Kendall's tau) then neither are log-returns.
An interesting point is that among the 1,600 edges the estimated partial correlations were positive for 1,481 edges and negative for only 119. That is, the partial correlation graph was very close to being a PPG; see Agrawal et al. (2019) for a discussion of why this may be frequently encountered in stock data, and Epskamp & Fried (2018); Lauritzen et al. (2019b) for examples in psychology. To compare the PPG fit with our earlier graphical LASSO fit we estimated the precision matrix under the constraint that K is an M-matrix, using the R package mtp2 available at GitHub (Lauritzen et al., 2019a, Algorithm 1). The maximized constrained log-likelihood was substantially higher than for the graphical LASSO fit (−266,361.3 versus −268,773.2) and the graph was sparser (1,228 versus 1,600 edges), hence the EBIC (and any other L_0-type model selection criterion) strongly favored the PPG model. Note that, from its Lagrangian interpretation, the graphical LASSO constrains the size of |K_ij|. In contrast, the M-matrix constraint allows for arbitrarily large |K_ij|, provided K̂_ij ≤ 0. That is, these two constraints induce quite different regularization, and the latter appears more appropriate for these SP500 data, illustrating the potential value of positivity constraints in certain applications.
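The paper fits this constraint with Algorithm 1 of Lauritzen et al. (2019a) via the mtp2 package; since we do not reproduce that package's interface here, the hedged sketch below solves the same M-matrix constrained MLE as a generic convex program with CVXR. A dedicated coordinate-wise algorithm scales much better, but this formulation makes the constraint set explicit.

    library(CVXR)
    mmatrix_mle <- function(S) {
      d <- nrow(S)
      K <- Variable(d, d, PSD = TRUE)     # symmetric positive semi-definite
      cons <- list()
      for (i in 1:(d - 1)) for (j in (i + 1):d)
        cons <- c(cons, K[i, j] <= 0)     # M-matrix: K_ij <= 0 for i != j
      # Gaussian log-likelihood up to constants: log det K - tr(S K)
      prob <- Problem(Maximize(log_det(K) - matrix_trace(S %*% K)), cons)
      solve(prob)$getValue(K)
    }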
The selected graph being a PPG strengthens its interpretation. By Proposition 4.12, the finding suggests that all partial correlations are nonnegative regardless of the conditioning set, and that Simpson's paradox does not occur in these data, i.e. stocks with zero marginal correlation also have zero partial correlation. By our earlier discussion, this implies that if ρ_ij = 0 marginally then x_i is uncorrelated with higher moments of x_j, both marginally and conditionally on X_(ij). Further, the conditional expectation of x_i can only be increasing as a function of the other variables (or increasing transformations thereof), and missing edges indicate the lack of such association.

Discussion
When studying multivariate dependence in applications it is often convenient to strike a balance between models that come equipped with strong theoretical properties (e.g. the Gaussian, nonparanormal, MTP_2 and CI classes) at the cost of imposing potentially restrictive conditions, and models that are more flexible but do not provide such strong characterizations and/or lead to complex interpretations. We studied a natural strategy based on the transelliptical family and partial correlation graphs. We showed that the interpretation remains simple yet goes far beyond mere linear dependence.
This work is also relevant in the context of Gaussian graphical models. Although the partial correlation graph in the Gaussian case translates into conditional independence statements, it is important to understand how robust this interpretation is with respect to the Gaussianity assumption. Our analysis shows that in the elliptical case a lot of this dependence information is retained. We also illustrate how simple tail dependence measures, like the one in (2), characterize the Gaussian distribution within the elliptical family and can help assess whether the transelliptical class is useful to capture second-order (variance) dependencies in the data.
An important part of this paper is the study of positive dependence. The notion of positivity can be quite useful in regularizing inference relative to unrestricted penalized likelihood, as we illustrated in the SP500 example. However, we also showed that, strictly speaking, some standard notions of positive dependence are meaningless for structural learning in elliptical partial correlation graphs. One of our main contributions is a remarkable result that characterizes MTP_2 elliptical distributions and shows that MTP_2 becomes very restrictive in high dimensions, in that only the Gaussian satisfies this constraint in every dimension. It is therefore important to study relaxations, such as positive elliptical distributions, which require all partial correlations to be nonnegative. We showed that this family retains strong positive dependence properties that are important from the applied point of view.
In conclusion, we hope that our results help motivate the study of other suitable relaxations of Gaussianity and positivity in graphical models, as well as strengthen the use of transelliptical graphical models in practice.