A Global Homogeneity Test for High-Dimensional Linear Regression

This paper is motivated by the comparison of genetic networks based on microarray samples. The aim is to test whether the differences observed between two inferred Gaussian graphical models reflect real differences or arise from estimation uncertainty. Adopting a neighborhood approach, we consider a two-sample linear regression model with random design and propose a procedure to test whether these two regressions are the same. Relying on multiple testing and variable selection strategies, we develop a testing procedure that applies to high-dimensional settings where the number of covariates $p$ is larger than the numbers of observations $n_1$ and $n_2$ of the two samples. Both type I and type II errors are explicitly controlled from a non-asymptotic perspective, and the test is proved to be minimax adaptive to the sparsity. The performance of the test is evaluated on simulated data. Moreover, we illustrate how this procedure can be used to compare genetic networks on the breast cancer microarray dataset of Hess \emph{et al.}


Introduction
The recent flood of high-dimensional data has motivated the development of a vast range of sparse estimators for linear regression, in particular a large variety of derivatives of the Lasso. Although theoretical guarantees have been provided in terms of prediction, estimation and selection performance (among many others [7,34,47]), the research effort has only recently turned to the construction of high-dimensional confidence intervals or parametric hypothesis testing schemes [8,23,29,31,49]. Yet, quantifying the confidence surrounding coefficient estimates and selected covariates is essential in areas of application where these will nourish further targeted investigations.
In this paper we consider the two-sample linear regression model with Gaussian random design.

Connection with two-sample Gaussian graphical model testing
This testing framework is mainly motivated by the validation of differences observed between Gaussian graphical models (modelling regulation networks) inferred from transcriptomic data from two samples [12,17,32] when looking for new potential drug or knock-out targets [24]. Following the development of univariate differential analysis techniques, there is now a surging demand for the detection of differential regulations between pairs of conditions (treated vs. placebo, diseased vs. healthy, exposed vs. control, ...). Given two gene regulation networks inferred from two transcriptomic data samples, it is however difficult to disentangle differences in the estimated networks that are due to estimation errors from real differences in the true underlying networks.
We suggest building upon our two-sample high-dimensional linear regression testing scheme to derive a global test for the equality of high-dimensional Gaussian graphical models inferred under pairs of conditions. Formally speaking, the global two-sample GGM testing problem is defined as follows. Consider two Gaussian random vectors $Z^{(1)} \sim \mathcal{N}(0, [\Omega^{(1)}]^{-1})$ and $Z^{(2)} \sim \mathcal{N}(0, [\Omega^{(2)}]^{-1})$. The dependency graphs are characterized by the non-zero entries of the precision matrices $\Omega^{(1)}$ and $\Omega^{(2)}$ [26]. Given an $n_1$-sample of $Z^{(1)}$ and an $n_2$-sample of $Z^{(2)}$, the objective is to test $H_0^G: \Omega^{(1)} = \Omega^{(2)}$ versus $H_1^G: \Omega^{(1)} \neq \Omega^{(2)}$, where $\Omega^{(1)}$ and $\Omega^{(2)}$ are assumed to be sparse (most of their entries are zero). This testing problem is global, as the objective is to assess a statistically significant difference between the two distributions. If the test is rejected, a more ambitious objective is to infer the entries where the precision matrices differ (i.e. $\Omega^{(1)}_{i,j} \neq \Omega^{(2)}_{i,j}$). Adopting a neighborhood selection approach [32], as recalled in Section 6, high-dimensional GGM estimation can be solved by multiple high-dimensional linear regressions. As such, two-sample GGM testing (4) can be solved via multiple two-sample hypothesis tests of the form (3) in the usual linear regression framework. This extension of two-sample linear regression tests to GGMs is described in Section 6.
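Concretely, the neighborhood approach estimates the graph around each node by a sparse regression of that variable on all the others. The following is a minimal Python sketch of this nodewise step (an illustrative reimplementation with scikit-learn, not the paper's R code; the penalty `alpha` and the zero-threshold are arbitrary choices):

```python
import numpy as np
from sklearn.linear_model import Lasso

def neighborhood_supports(Z, alpha=0.1):
    """Estimate each node's neighborhood by regressing that variable on
    all the others with the Lasso (nodewise regression)."""
    n, p = Z.shape
    supports = []
    for a in range(p):
        others = np.delete(np.arange(p), a)
        fit = Lasso(alpha=alpha).fit(Z[:, others], Z[:, a])
        # covariates with a non-zero coefficient form the estimated neighborhood
        supports.append(set(others[np.abs(fit.coef_) > 1e-10]))
    return supports
```

Comparing two inferred graphs then amounts to comparing, node by node, the two regressions estimated on the two samples, which is exactly the two-sample linear regression problem studied in this paper.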

Related work
The literature on high-dimensional two-sample tests remains scarce. In the context of high-dimensional two-sample comparison of means, [4,11,30,39] have introduced global tests to compare the means of two high-dimensional Gaussian vectors with unknown variance. Recently, [9,27] developed two-sample tests for the covariance matrices of two high-dimensional vectors.
In contrast, the one-sample analog of our problem has recently attracted a lot of attention, offering as many theoretical bases for extension to the two-sample problem. In fact, high-dimensional linear regression tests for the nullity of coefficients can be interpreted as a limit of the two-sample test in the case where $\beta^{(2)}$ is known to be zero and the sample size $n_2$ is considered infinite, so that the distribution of the second sample is perfectly known.
There are basically two different objectives in high-dimensional linear testing: a local and a global approach. In the local approach, one considers the $p$ tests for the nullity of each coefficient, $H_{0,i}: \beta^{(1)}_i = 0$ ($i = 1, \ldots, p$), with the purpose of controlling error measures such as the false discovery rate of the resulting multiple testing procedure. In a way, one aims to assess the individual statistical significance of each of the variables. This can be achieved by providing a confidence region for $\beta^{(1)}$ [8,23,29,31,49]. Another line of work derives p-values for the nullity of each of the coefficients. Namely, [48] suggests a screen-and-clean procedure based upon half-sampling: model selection is first applied to a random half of the sample, and the significance of each selected coefficient is then tested on the remaining half, using the usual combination of ordinary least squares and Student t-tests on a model of reasonable size. To reduce the dependency of the results on the splitting, [33] advocate repeating the half-sampling $B$ times and then aggregating the $B$ p-values obtained for variable $j$ in a way that controls either the family-wise error rate or the false discovery rate.
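The screen-and-clean idea can be sketched as follows. This is a hedged Python illustration, not the code of [48]: `LassoCV` is one possible screening rule, and a single random split is used.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LassoCV

def screen_and_clean(X, y, seed=0):
    """Screen: Lasso selection on a random half of the sample.
    Clean: OLS + Student t-tests for the selected variables on the other half."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    half1, half2 = idx[: n // 2], idx[n // 2:]
    selected = np.flatnonzero(LassoCV(cv=5).fit(X[half1], y[half1]).coef_)
    pvals = dict.fromkeys(range(p), 1.0)  # unselected variables get p-value 1
    if selected.size == 0:
        return pvals
    Xs, ys = X[half2][:, selected], y[half2]
    n2, k = Xs.shape
    beta = np.linalg.lstsq(Xs, ys, rcond=None)[0]
    resid = ys - Xs @ beta
    sigma2 = resid @ resid / (n2 - k)          # OLS residual variance
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Xs.T @ Xs)))
    tstats = beta / se
    for j, t in zip(selected, tstats):
        pvals[int(j)] = 2 * stats.t.sf(abs(t), df=n2 - k)
    return pvals
```

The low-dimensional "clean" step is valid because only the small selected model is refitted, on data that played no role in the selection.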
In the global approach, the objective is to test the null hypothesis $H_0: \beta^{(1)} = 0$. Although global approaches are clearly less informative than approaches providing individual significance tests like [8,33,49], they can reach better performance from smaller sample sizes. Such a property is of tremendous importance when dealing with high-dimensional datasets. The idea of [46], based upon the work of [6], is to approximate the alternative $H_1: \beta^{(1)} \neq 0$ by a collection of tractable alternatives $\{H_1^S: \exists j \in S,\ \beta^{(1)}_j \neq 0,\ S \in \mathcal{S}\}$, working on subsets $S \subset \{1, \ldots, p\}$ of reasonable sizes. The global null hypothesis is rejected if the null hypothesis $H_0^S$ is rejected for at least one of the subsets $S \in \mathcal{S}$. Admittedly, the resulting procedure is computationally intensive. Nonetheless, it is non-asymptotically minimax adaptive to the unknown sparsity of $\beta^{(1)}$, that is, it achieves the optimal rate of detection without any assumption on the population covariance $\Sigma^{(1)}$ of the covariates. Another series of works relies on higher criticism. This last testing framework was originally introduced for orthonormal designs [15], but has been proved to reach optimal detection rates in high-dimensional linear regression as well [3,22]. In the end, higher criticism is highly competitive in terms of computing time and achieves the asymptotic rate of detection with the optimal constants. However, these nice properties require strong assumptions on the design.
While writing this paper, we came across the parallel work of Städler and Mukherjee [40], which adopts a local approach in an elegant adaptation of the screen and clean procedure in its simple-split [48] and multi-split [33] versions to the two-sample framework. Interestingly, their work also led to an extension to GGM testing [41].
In contrast we build our testing strategy upon the global approach developed by [6] and [46]. A more detailed comparison of [40,41] with our contribution is deferred to simulations (Section 5) and discussion (Section 7).

Our contribution
Our suggested approach stems from the fundamental assumption that either the true supports of $\beta^{(1)}$ and $\beta^{(2)}$ are sparse or their difference $\beta^{(1)} - \beta^{(2)}$ is sparse, so that the test can be successfully carried out on a subset $S \subset \{1, \ldots, p\}$ of variables of reasonable size compared to the sample sizes $n_1$ and $n_2$. Of course, this low-dimensional subset $S$ is unknown. The whole objective of the testing strategy is to achieve similar rates of detection (up to a logarithmic constant) as an oracle test that would know in advance the optimal low-dimensional subset $S$.
Concretely, we proceed in three steps:
1. We define algorithms to select a data-driven collection $\widehat{\mathcal{S}}$ of subsets identified as most informative for our testing problem, in an attempt to approximate the optimal subset $S$.
2. New parametric statistics related to the likelihood ratio statistic between the conditional distributions $Y^{(1)}|X^{(1)}_S$ and $Y^{(2)}|X^{(2)}_S$ are defined for $S \in \widehat{\mathcal{S}}$.
3. We define two calibration procedures which both guarantee a control of the type-I error:
• we use a Bonferroni calibration, which is both computationally and conceptually simple, allowing us to prove that this procedure is minimax adaptive to the sparsity of $\beta^{(1)}$ and $\beta^{(2)}$ from a non-asymptotic point of view;
• we define a calibration procedure based upon permutations to reach a fine tuning of the multiple testing calibration in practice, for an increase in empirical power.
The resulting testing procedure is completely data-driven and its type I error is explicitly controlled. Furthermore, it is computationally amenable in a large-$p$, small-$n$ setting. Interestingly, the procedure does not require any half-sampling step, which is known to decrease robustness when the sample size is small. The procedure is described in Section 2, while Section 3 is devoted to technical details, among which theoretical controls on the type I error, as well as some useful empirical tools for interpretation. Section 4 provides a non-asymptotic control of the power. Section 5 provides simulated experiments comparing the performance of the suggested procedures with the approach of [40]. In Section 6, we detail the extension of the procedure to handle the comparison of Gaussian graphical models. The method is illustrated on transcriptomic breast cancer data. Finally, all the proofs are postponed to Section 8.
The R codes of our algorithms are available at [1].

Notation
In the sequel, $\ell_p$ norms are denoted $|\cdot|_p$, except for the $\ell_2$ norm, which is denoted $\|\cdot\|$ to alleviate notations. For any positive definite matrix $\Sigma$, $\|\cdot\|_\Sigma$ denotes the Euclidean norm associated with the scalar product induced by $\Sigma$: for every vector $x$, $\|x\|^2_\Sigma = x^\top \Sigma x$. Besides, for every set $S$, $|S|$ denotes its cardinality. For any integer $k$, $I_k$ stands for the identity matrix of size $k$. For any square matrix $A$, $\varphi_{\max}(A)$ and $\varphi_{\min}(A)$ denote respectively the maximum and minimum eigenvalues of $A$; when the context makes it obvious, we may omit $A$ and write $\varphi_{\max}$ and $\varphi_{\min}$ instead. Moreover, $\mathbf{Y}$ refers to the size-$(n_1+n_2)$ concatenation of $Y^{(1)}$ and $Y^{(2)}$, and $\mathbf{X}$ refers to the size-$(n_1+n_2)\times p$ concatenation of $X^{(1)}$ and $X^{(2)}$. Finally, $L$ refers to a positive numerical constant that may vary from line to line.

Description of the testing strategy
Likelihood ratio statistics used to test hypotheses like $H_0$ in the classical large-$n$, small-$p$ setting are intractable on high-dimensional datasets, for the simple reason that the maximum likelihood estimator is itself not defined when the number of covariates exceeds the sample size.
Our approach approximates the intractable high-dimensional test by a multiple testing construction, similar to the strategy developed by [6] to derive statistical tests against non-parametric alternatives, and adapted to one-sample tests for high-dimensional linear regression in [46].
For any subset $S$ of $\{1, \ldots, p\}$ satisfying $2|S| \leq n_1 \wedge n_2$, denote $X^{(1)}_S$ and $X^{(2)}_S$ the restrictions of the random vectors $X^{(1)}$ and $X^{(2)}$ to the covariates indexed by $S$; their covariance matrices are denoted $\Sigma^{(1)}_S$ and $\Sigma^{(2)}_S$. Since $(Y^{(i)}, X^{(i)})$ is jointly Gaussian, the conditional distribution of $Y^{(i)}$ given $X^{(i)}_S$ follows a linear regression model $Y^{(i)} = (X^{(i)}_S)^\top \beta^{(i)}_S + \epsilon^{(i)}_S$, where the noise variable $\epsilon^{(i)}_S \sim \mathcal{N}(0, (\sigma^{(i)}_S)^2)$ is independent of $X^{(i)}_S$. We now state the test hypotheses in reduced dimension: $H_{0,S}: (\beta^{(1)}_S, \sigma^{(1)}_S) = (\beta^{(2)}_S, \sigma^{(2)}_S)$ against $H_{1,S}: (\beta^{(1)}_S, \sigma^{(1)}_S) \neq (\beta^{(2)}_S, \sigma^{(2)}_S)$. Of course, there is no reason in general for $\beta^{(1)}_S$ and $\beta^{(2)}_S$ to coincide with the restrictions of $\beta^{(1)}$ and $\beta^{(2)}$ to $S$, even less so in high dimension, since variables in $S$ can in all likelihood be correlated with covariates in $S^c$. Yet, as exhibited by Lemma 2.1, there is still a strong link between the collection of low-dimensional hypotheses $H_{0,S}$ and the global null hypothesis $H_0$: under $H_0$, the hypothesis $H_{0,S}$ holds for every subset $S$. Proof. Under $H_0$, the random vectors of size $p+1$, $(Y^{(1)}, X^{(1)})$ and $(Y^{(2)}, X^{(2)})$, follow the same distribution. Hence, for any subset $S$, $Y^{(1)}$ conditionally on $X^{(1)}_S$ follows the same distribution as $Y^{(2)}$ conditionally on $X^{(2)}_S$. In other words, $(\beta^{(1)}_S, \sigma^{(1)}_S) = (\beta^{(2)}_S, \sigma^{(2)}_S)$. By contraposition, it suffices to reject at least one of the $H_{0,S}$ hypotheses to reject the global null hypothesis. This fundamental observation motivates our testing procedure. As summarized in Algorithm 1, the idea is to build a well-calibrated multiple testing procedure that considers the testing problems of $H_{0,S}$ against $H_{1,S}$ for a collection of subsets $\mathcal{S}$. Obviously, it would be prohibitive in terms of algorithmic complexity to test $H_{0,S}$ for every $S \subset \{1, \ldots, p\}$, since there would be $2^p$ such sets. As a result, we restrict ourselves to a relatively small collection of hypotheses $\{H_{0,S},\ S \in \widehat{\mathcal{S}}\}$, where the collection of supports $\widehat{\mathcal{S}}$ is potentially data-driven. If the collection $\widehat{\mathcal{S}}$ is judiciously selected, then we can manage not to lose too much power compared to the exhaustive search.
We now turn to the description of the three major elements required by our overall strategy (see Algorithm 1):
1. a well-targeted data-driven collection of models $\widehat{\mathcal{S}}$, as produced by Algorithm 2;
2. a parametric statistic to test the hypotheses $H_{0,S}$ for $S \in \widehat{\mathcal{S}}$; we actually resort to a combination of three parametric statistics $F_{S,V}$, $F_{S,1}$ and $F_{S,2}$;
3. a calibration procedure guaranteeing the control of the type I error, as in Algorithm 3 or 4.

Algorithm 1 Overall Adaptive Testing Strategy
Require: Data $X^{(1)}, X^{(2)}, Y^{(1)}, Y^{(2)}$, collection $\overline{\mathcal{S}}$ and desired level $\alpha$
Step 1 -- Choose a collection $\widehat{\mathcal{S}}$ of low-dimensional models (e.g. $\widehat{\mathcal{S}}_{Lasso}$ in Algorithm 2)
  procedure ModelChoice($X^{(1)}, X^{(2)}, Y^{(1)}, Y^{(2)}, \overline{\mathcal{S}}$)
    Define the model collection $\widehat{\mathcal{S}} \subset \overline{\mathcal{S}}$
  end procedure
Step 2 -- Compute p-values for each test in low dimension
  procedure Test($X^{(1)}, X^{(2)}, Y^{(1)}, Y^{(2)}, \widehat{\mathcal{S}}$)
    for each subset $S$ in $\widehat{\mathcal{S}}$ do
      Compute the p-values $\tilde q_{V,S}, \tilde q_{1,S}, \tilde q_{2,S}$ associated to the statistics $F_{S,V}, F_{S,1}, F_{S,2}$
    end for
  end procedure
Step 3 -- Calibrate decision thresholds as in Algorithms 3 (Bonferroni) or 4 (Permutations)
  procedure Calibration($X^{(1)}, X^{(2)}, Y^{(1)}, Y^{(2)}, \widehat{\mathcal{S}}, \alpha$)
    for each subset $S$ in $\widehat{\mathcal{S}}$ and each $i = V, 1, 2$ do
      Define a threshold $\alpha_{i,S}$
    end for
  end procedure
Step 4 -- Final Decision
  if there is at least one model $S$ in $\widehat{\mathcal{S}}$ and one $i$ for which $\tilde q_{i,S} < \alpha_{i,S}$ then
    Reject the global null hypothesis $H_0$
  end if
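The four steps of Algorithm 1 can be sketched as a generic skeleton. Here `model_choice`, `test_stats` and `calibrate` are placeholder callables standing for Steps 1-3; all names are illustrative, not the paper's implementation.

```python
def adaptive_test(X1, X2, Y1, Y2, model_choice, test_stats, calibrate, alpha=0.05):
    """Overall strategy: choose models, compute p-values, calibrate
    thresholds, reject H_0 if any p-value falls below its threshold."""
    S_hat = model_choice(X1, X2, Y1, Y2)                        # Step 1
    pvals = {S: test_stats(X1, X2, Y1, Y2, S) for S in S_hat}   # Step 2: S -> {i: q~_{i,S}}
    thresh = calibrate(X1, X2, Y1, Y2, S_hat, alpha)            # Step 3: (i, S) -> alpha_{i,S}
    return any(pvals[S][i] < thresh[(i, S)]                     # Step 4
               for S in S_hat for i in ("V", "1", "2"))
```

Subsets are represented as tuples so that they can index dictionaries; the return value `True` means the global null hypothesis is rejected.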

Choices of Test Collections (Step 1)
The first step of our procedure (Algorithm 1) amounts to selecting a collection $\widehat{\mathcal{S}}$ of subsets of $\{1, \ldots, p\}$, also called models. A good collection $\widehat{\mathcal{S}}$ must satisfy a trade-off between including the maximum number of relevant subsets $S$ and keeping a reasonable computing time for the whole testing procedure, which is linear in the size $|\widehat{\mathcal{S}}|$ of the collection. The construction of $\widehat{\mathcal{S}}$ proceeds in two steps: (i) one chooses a deterministic collection $\overline{\mathcal{S}}$ of models; (ii) one defines an algorithm (called ModelChoice in Algorithm 1) mapping the raw data $(\mathbf{X}, \mathbf{Y})$ to some collection $\widehat{\mathcal{S}}$ satisfying $\widehat{\mathcal{S}} \subset \overline{\mathcal{S}}$. Even though the introduction of $\overline{\mathcal{S}}$ as an argument of the mapping may appear artificial at this point, this quantity will be used in the calibration step of the procedure. Our methodology can be applied to any fixed or data-driven collection. Still, we focus here on two particular collections. The first one is useful for undertaking the first steps of the mathematical analysis; for practical applications, we advise using the second collection.
Deterministic Collections $\mathcal{S}_{\leq k}$. By deterministic, we mean that the model choice step is trivial, in the sense that ModelChoice$(\mathbf{X}, \mathbf{Y}, \overline{\mathcal{S}}) = \overline{\mathcal{S}}$. Among deterministic collections, the most straightforward ones consist of all size-$k$ subsets of $\{1, \ldots, p\}$, which we denote $\mathcal{S}_k$. This kind of family provides collections that are independent of the data, thereby reducing the risk of overfitting. However, as the model size $k$ or the total number of candidate variables $p$ grows, these deterministic families rapidly reach unreasonable sizes. Admittedly, $\mathcal{S}_1$ always remains feasible, but reducing the search to models of size 1 can be costly in terms of power. As a variation on size-$k$ models, we introduce the collection of all models of size at most $k$, denoted $\mathcal{S}_{\leq k} = \bigcup_{j=1}^{k} \mathcal{S}_j$, which will prove useful in theoretical developments.
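The collection $\mathcal{S}_{\leq k}$ can be enumerated directly, which also makes its combinatorial growth $\sum_{j=1}^{k} \binom{p}{j}$ concrete (an illustrative Python sketch):

```python
from itertools import combinations

def deterministic_collection(p, k):
    """All subsets of {0, ..., p-1} of size at most k: the collection S_{<=k}."""
    return [set(S) for j in range(1, k + 1) for S in combinations(range(p), j)]
```

Already for moderate $p$ and $k$ the collection becomes huge (e.g. $p = 1000$, $k = 5$ gives more than $8 \times 10^{12}$ models), which is precisely why a data-driven restriction is needed.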
Lasso-type Collection $\widehat{\mathcal{S}}_{Lasso}$. Among all data-driven collections, we suggest the Lasso-type collection $\widehat{\mathcal{S}}_{Lasso}$. Before proceeding to its definition, let us informally discuss the subsets that a "good" collection $\widehat{\mathcal{S}}$ should contain. Let $\mathrm{supp}(\beta)$ denote the support of a vector $\beta$. Intuitively, under the alternative hypothesis, good candidates for the subsets are either subsets of $S^*_\vee := \mathrm{supp}(\beta^{(1)}) \cup \mathrm{supp}(\beta^{(2)})$ or subsets of $S^*_\Delta := \mathrm{supp}(\beta^{(1)} - \beta^{(2)})$.
The first subset $S^*_\vee$ is convenient because the low-dimensional coefficients $\beta^{(1)}_{S^*_\vee}$ and $\beta^{(2)}_{S^*_\vee}$ coincide with the restrictions of $\beta^{(1)}$ and $\beta^{(2)}$, so that no signal is lost by the restriction. The second subset has a smaller size than $S^*_\vee$ and focuses on the covariates whose parameters differ in the full regression. However, the divergence between effects might only appear conditionally on other variables with similar effects; this is why the first subset $S^*_\vee$ is also of interest. Obviously, both subsets $S^*_\vee$ and $S^*_\Delta$ are unknown. This is why we consider a Lasso methodology that amounts to estimating both $S^*_\vee$ and $S^*_\Delta$ in the collection $\widehat{\mathcal{S}}_{Lasso}$. Details on the construction of $\widehat{\mathcal{S}}_{Lasso}$ are postponed to Section 3.1.

Parametric Test Statistic (Step 2)
Given a subset $S$, we consider the three statistics $F_{S,V}$, $F_{S,1}$ and $F_{S,2}$, defined in Equations (5,6), to test $H_{0,S}$ against $H_{1,S}$. As explained in Section 3, these three statistics derive from the Kullback-Leibler divergence between the conditional distributions $Y^{(1)}|X^{(1)}_S$ and $Y^{(2)}|X^{(2)}_S$. While the first term $F_{S,V}$ evaluates the discrepancies in terms of conditional variances, the last two terms $F_{S,1}$ and $F_{S,2}$ address the comparison of $\beta^{(1)}$ to $\beta^{(2)}$.
Because the null distributions of the statistics $F_{S,i}$, for $i = V, 1, 2$, depend on the size of $S$, the only way to calibrate the multiple testing step over a collection of models of various sizes is to convert the statistics to a common scale. The most natural choice is to convert the observed $F_{S,i}$'s into p-values. Under $H_{0,S}$, the conditional distributions given $\mathbf{X}_S$ of the $F_{S,i}$ for $i = V, 1, 2$ are parameter-free and explicit (see Proposition 3.1 in the next section). Consequently, one can define the exact p-values associated to the $F_{S,i}$, conditionally on $\mathbf{X}_S$. However, the computation of these p-values requires a function inversion, which can be computationally prohibitive. This is why we introduce explicit upper bounds $\tilde q_{i,S}$ (Equations (13,17)) of the exact p-values.

Combining the parametric statistics (Step 3)
The objective of this subsection is to calibrate a multiple testing procedure based on the sequence of p-values $\{(\tilde q_{V,S}, \tilde q_{1,S}, \tilde q_{2,S}),\ S \in \widehat{\mathcal{S}}\}$, so that the type-I error remains smaller than a chosen level $\alpha$. In particular, when using a data-driven model collection, we must take good care to prevent the risk of overfitting that results from using the same dataset both for model selection and for hypothesis testing.
For the sake of simplicity, we assume in the two following paragraphs that $\emptyset \notin \overline{\mathcal{S}}$, which merely means that we do not include in the collection of tests the raw comparison of $\mathrm{Var}(Y^{(1)})$ to $\mathrm{Var}(Y^{(2)})$.
Testing Procedure. Given a model collection $\widehat{\mathcal{S}}$ and a sequence $\alpha = (\alpha_{i,S})_{i=V,1,2,\ S \in \widehat{\mathcal{S}}}$, we define a test function that rejects the global null if there exists at least one model $S \in \widehat{\mathcal{S}}$ such that at least one of the three p-values falls below the corresponding threshold $\alpha_{i,S}$. In Section 3.3, we describe two calibration methods for choosing the thresholds $(\alpha_{i,S})_{S \in \widehat{\mathcal{S}}}$. We first define a natural Bonferroni procedure, whose conceptual simplicity allows us to derive non-asymptotic type II error bounds for the corresponding tests (Section 4). However, this Bonferroni correction proves too conservative in practice, in part paying the price for resorting to data-driven collections and to upper bounds on the true p-values. This is why we introduce as a second option the permutation calibration procedure. This second procedure controls the type I error at the nominal level and therefore outperforms the Bonferroni calibration in practice. Nevertheless, the mathematical analysis of the corresponding test becomes more intricate and we are not able to provide sharp type II error bounds.
Remark: In practice, we advocate the use of the Lasso collection $\widehat{\mathcal{S}}_{Lasso}$ (Algorithm 2) combined with the permutation calibration method (Algorithm 4). Henceforth, the corresponding procedure is denoted $T^P_{\widehat{\mathcal{S}}_{Lasso}}$.
Discussion of the procedure and Type I error

In this section, we provide the remaining details on the three steps of the testing procedure. First, we describe the collection $\widehat{\mathcal{S}}_{Lasso}$ and provide an informal justification of its definition. Second, we explain the ideas underlying the parametric statistics $F_{S,i}$, $i = V, 1, 2$, and we define the corresponding p-values $\tilde q_{i,S}$. Finally, the Bonferroni and permutation calibration methods are defined, which allows us to control the type I error of the corresponding testing procedures.
For a suitable choice of the tuning parameter $\lambda$ and under assumptions on the design, it is proved [7,34] that $\widehat\theta_\lambda$ estimates $\theta^*$ well and that $\widehat V_\lambda$ is a good estimator of $\mathrm{supp}(\theta^*)$. The Lasso parameter $\lambda$ tunes the amount of sparsity of $\widehat\theta_\lambda$: the larger the parameter $\lambda$, the smaller the support $\widehat V_\lambda$. As the optimal choice of $\lambda$ is unknown, the collection $\widehat{\mathcal{S}}_{Lasso}$ is built using the collection of all estimators $(\widehat V_\lambda)_{\lambda > 0}$, also called the Lasso regularization path of $\theta^*$. Below we provide an algorithm for computing $\widehat{\mathcal{S}}_{Lasso}$ along with some additional justifications.

Algorithm 2 Construction of the Lasso-type Collection $\widehat{\mathcal{S}}_{Lasso}$
Compute the function $f: \lambda \mapsto \widehat V_\lambda$ (defined in (9,10)) using the Lars-Lasso Algorithm [16]
Compute the decreasing sequence $(\lambda_k)_{1 \leq k \leq q}$ of jumps of $f$

It is known [16] that the function $f: \lambda \mapsto \widehat V_\lambda$ is piecewise constant. Consequently, there exist thresholds $\lambda_1 > \lambda_2 > \ldots$ such that $\widehat V_\lambda$ changes at the $\lambda_k$'s only. The function $f$ and the collection $(\lambda_k)$ are computed efficiently using the Lars-Lasso Algorithm [16]. We build two collections of models using $(\widehat V^{(1)}_{\lambda_k})_{k \geq 1}$ and $(\widehat V^{(2)}_{\lambda_k})_{k \geq 1}$. Following the intuition described above, for a fixed $\lambda_k$, $\widehat V^{(2)}_{\lambda_k}$ is an estimator of $\mathrm{supp}(\beta^{(1)} - \beta^{(2)})$, while $\widehat V^{(1)}_{\lambda_k} \cup \widehat V^{(2)}_{\lambda_k}$ is an estimator of $\mathrm{supp}(\beta^{(1)}) \cup \mathrm{supp}(\beta^{(2)})$. This is why we define the corresponding model collections, where $k_{\max}$ is the smallest integer $q$ such that the size of the selected support exceeds the maximal model dimension. In the end, we consider the data-driven family $\widehat{\mathcal{S}}_{Lasso}$ gathering these Lasso-based models together with $\mathcal{S}_1$; recall that $\mathcal{S}_1$ is the collection of the $p$ models of size 1. Recently, data-driven procedures have been proposed to tune the Lasso and find a parameter $\lambda$ in such a way that $\widehat\theta_\lambda$ is a good estimator of $\theta^*$ (see e.g. [5,42]). We use the whole regularization path instead of the sole estimator $\widehat\theta_\lambda$ because our objective is to find subsets $S$ such that the statistics $F_{S,i}$ are powerful. Consider an example where $\beta^{(2)} = 0$ and $\beta^{(1)}$ contains one large coefficient and many small coefficients. If the sample size is large enough, a well-tuned Lasso estimator will select several variables. In contrast, the best subset $S$ (in terms of power of $F_{S,i}$) contains only one variable. Using the whole regularization path, we hope to find the best trade-off between sparsity (small size of $S$) and differences between $\beta^{(1)}_S$ and $\beta^{(2)}_S$. This last remark is formalized in Section 4.4. Finally, the size of the collection $\widehat{\mathcal{S}}_{Lasso}$ is generally linear in $(n_1 \wedge n_2) \vee p$, which makes the computation of $(\tilde q_{i,S})_{S \in \widehat{\mathcal{S}}_{Lasso},\ i=V,1,2}$ reasonable.
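Assuming scikit-learn's `lars_path` as a stand-in for the Lars-Lasso algorithm of [16], the distinct supports $\widehat V_{\lambda_k}$ along the regularization path can be collected as follows (an illustrative sketch, not the paper's R code):

```python
import numpy as np
from sklearn.linear_model import lars_path

def lasso_path_supports(X, y):
    """Distinct non-empty supports V_lambda along the Lasso regularization
    path, read off at the jump points lambda_k of the LARS-Lasso algorithm."""
    _, _, coefs = lars_path(X, y, method="lasso")
    supports, seen = [], set()
    for col in coefs.T:  # one coefficient vector per jump lambda_k
        S = frozenset(np.flatnonzero(np.abs(col) > 1e-10))
        if S and S not in seen:
            seen.add(S)
            supports.append(S)
    return supports
```

Running this once per sample (and once on a difference-oriented regression) yields the candidate models from which a collection such as $\widehat{\mathcal{S}}_{Lasso}$ can be assembled.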

Symmetric conditional likelihood
In this subsection, we explain the intuition behind the choice of the parametric statistics $(F_{S,V}, F_{S,1}, F_{S,2})$ defined in Equations (5,6). Let us denote by $L^{(1)}$ (resp. $L^{(2)}$) the log-likelihood of the first (resp. second) sample normalized by $n_1$ (resp. $n_2$). Given a subset $S \subset \{1, \ldots, p\}$ of size smaller than $n_1 \wedge n_2$, $(\widehat\beta^{(1)}_S, \widehat\sigma^{(1)}_S)$ stands for the maximum likelihood estimator of $(\beta^{(1)}, \sigma^{(1)})$ among vectors $\beta$ whose supports are included in $S$. Similarly, we write $(\widehat\beta^{(2)}_S, \widehat\sigma^{(2)}_S)$ for the maximum likelihood estimator corresponding to the second sample. The statistics $F_{S,V}$, $F_{S,1}$ and $F_{S,2}$ appear as the decomposition of a two-sample likelihood ratio, measuring the symmetrical adequacy of the sample-specific estimators to the opposite sample. To do so, let us define the likelihood ratio $\mathcal{D}_{n_i}(\beta, \sigma)$ in sample $i$ between an arbitrary pair $(\beta, \sigma)$ and the corresponding sample-specific estimator $(\widehat\beta^{(i)}_S, \widehat\sigma^{(i)}_S)$. With this definition, $\mathcal{D}_{n_1}(\widehat\beta^{(2)}_S, \widehat\sigma^{(2)}_S)$ measures how far $(\widehat\beta^{(2)}_S, \widehat\sigma^{(2)}_S)$ is from $(\widehat\beta^{(1)}_S, \widehat\sigma^{(1)}_S)$ in terms of likelihood within sample 1. The symmetrized likelihood statistic (12), which sums the two cross-sample discrepancies, can be decomposed into the sum of $F_{S,V}$, $F_{S,1}$ and $F_{S,2}$. Instead of the three statistics $(F_{S,i})_{i=V,1,2}$, one could use the symmetric likelihood (12) to build a testing procedure. However, we did not manage to obtain an explicit and sharp upper bound on the p-values associated to the statistic (12), which makes the resulting procedure either computationally intensive, if one estimates the p-values by a Monte-Carlo approach, or less powerful, if one uses a non-sharp upper bound. In contrast, we explain below how, by considering separately $F_{S,V}$, $F_{S,1}$ and $F_{S,2}$, one can sharply upper bound the exact p-values.

Definition of the p-values
Denote by $g(x) = x + 1/x - 2$ the non-negative function defined on $(0, +\infty)$. Since the restriction of $g$ to $[1, +\infty)$ is a bijection, we denote by $g^{-1}$ the corresponding reciprocal function.
1. Let $Z$ denote a Fisher random variable with $(n_1 - |S|, n_2 - |S|)$ degrees of freedom. Then, under the null hypothesis, the conditional distribution of $F_{S,V}$ given $\mathbf{X}_S$ is described in terms of $g(Z)$.
2. Let $Z_1$ and $Z_2$ be two centered and independent Gaussian vectors with covariance matrices determined by $\mathbf{X}_S$ and $I_{n_1 - |S|}$. Then, under the null hypothesis, the conditional distribution of $F_{S,1}$ given $\mathbf{X}_S$ can be expressed in terms of $Z_1$ and $Z_2$. A symmetric result holds for $F_{S,2}$.
Although the distributions identified in Proposition 3.1 are not all familiar distributions with ready-to-use quantile tables, they all share the advantage that they do not depend on any unknown quantity, such as the design covariances $\Sigma^{(1)}$ and $\Sigma^{(2)}$, the noise variances $\sigma^{(1)}$ and $\sigma^{(2)}$, or even the true signals $\beta^{(1)}$ and $\beta^{(2)}$. For any $i = V, 1, 2$, we write $Q_{i,|S|}(u|\mathbf{X}_S)$ for the conditional probability that $F_{S,i}$ is larger than $u$.
By Proposition 3.1, the exact p-value $\tilde q_{V,S} = Q_{V,|S|}(F_{S,V}|\mathbf{X}_S)$ associated to $F_{S,V}$ is easily computed from the distribution function of a Fisher random variable; here, $\bar F_{m,n}(u)$ denotes the probability that a Fisher random variable with $(m, n)$ degrees of freedom is larger than $u$.
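As a rough illustration of how a Fisher-type p-value for the variance part can be computed in practice, here is a sketch based on the classical two-sided variance-ratio test between the two within-sample OLS fits on a common subset $S$. This is an illustrative stand-in for the exact $\tilde q_{V,S}$, not the paper's formula (which involves $g$ and $g^{-1}$):

```python
import numpy as np
from scipy import stats

def variance_pvalue(X1_S, Y1, X2_S, Y2):
    """Two-sided variance-ratio (Fisher) p-value comparing the residual
    variances of the two within-sample OLS fits on a common subset S."""
    def resid_var(X, y):
        # OLS residual variance with its degrees of freedom
        n, s = X.shape
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        r = y - X @ beta
        return (r @ r) / (n - s), n - s
    v1, d1 = resid_var(X1_S, Y1)
    v2, d2 = resid_var(X2_S, Y2)
    ratio = v1 / v2  # ~ Fisher(d1, d2) under equal conditional variances
    return 2 * min(stats.f.sf(ratio, d1, d2), stats.f.cdf(ratio, d1, d2))
```

Small values indicate that the conditional variances of the two samples on $S$ are unlikely to be equal.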
Since the conditional distribution of $F_{S,1}$ given $\mathbf{X}_S$ only depends on $|S|$, $n_1$, $n_2$, and $\mathbf{X}_S$, one could estimate the p-value $Q_{1,|S|}(u|\mathbf{X}_S)$ associated with an observed value $u$ by Monte-Carlo simulations. However, this approach is computationally prohibitive for large collections of subsets $\mathcal{S}$. This is why we use instead an explicit upper bound $\bar Q_{1,|S|}(u|\mathbf{X}_S)$ based on the Laplace method, as given in the definition below and justified in the proof of Proposition 3.3.
For any $u \leq |a|_1$, define $\bar Q_{1,|S|}(u|\mathbf{X}_S) := 1$. For any $u > |a|_1$, $\bar Q_{1,|S|}(u|\mathbf{X}_S)$ is defined through an explicit exponential bound parametrized by a quantity $\lambda^*$, defined as follows: if all the components of $a$ are equal, $\lambda^*$ has a closed form; if $a$ is not a constant vector, $\lambda^*$ is the solution of an explicit equation. $\bar Q_{2,|S|}$ is defined analogously by exchanging the roles of the two design matrices.
Finally, we define the approximate p-values $\tilde q_{1,S}$ and $\tilde q_{2,S}$ by $\tilde q_{1,S} := \bar Q_{1,|S|}(F_{S,1}|\mathbf{X}_S)$ and $\tilde q_{2,S} := \bar Q_{2,|S|}(F_{S,2}|\mathbf{X}_S)$. Although we use similar notations for $\tilde q_{i,S}$ with $i = V, 1, 2$, this must not mask an essential difference: $\tilde q_{V,S}$ is the exact p-value of $F_{S,V}$, whereas $\tilde q_{1,S}$ and $\tilde q_{2,S}$ are only upper bounds on the p-values of $F_{S,1}$ and $F_{S,2}$. The consequences of this asymmetry for the calibration of the test are discussed in the next subsection.

Bonferroni Calibration (B)
Recall that a data-driven model collection $\widehat{\mathcal{S}}$ is defined as the result of a fixed algorithm mapping a deterministic collection $\overline{\mathcal{S}}$ and $(\mathbf{X}, \mathbf{Y})$ to a subcollection $\widehat{\mathcal{S}}$. The collection of thresholds $\alpha^B = (\alpha_{i,S})_{i=V,1,2,\ S \in \overline{\mathcal{S}}}$ is required to satisfy the Bonferroni condition $\sum_{S \in \overline{\mathcal{S}}} \sum_{i=V,1,2} \alpha_{i,S} \leq \alpha$ (18). For the collection $\mathcal{S}_{\leq k}$, or any data-driven collection derived from $\mathcal{S}_{\leq k}$, a natural choice is (19), which puts as much weight on the comparison of the conditional variances ($F_{S,V}$) as on the comparison of the coefficients ($F_{S,1}$, $F_{S,2}$). Similarly, for the collection $\widehat{\mathcal{S}}_{Lasso}$, a natural choice is (19) with $k$ replaced by $D_{\max}$ (which equals $(n_1 \wedge n_2)/2$ in practice).
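One concrete instance of the Bonferroni condition (18) splits the budget $\alpha$ evenly across model sizes and, within each size, across subsets, with half the weight on the variance statistic. This is an illustrative choice, not necessarily the paper's exact weighting (19):

```python
from math import comb

def bonferroni_thresholds(p, k, alpha):
    """Thresholds alpha_{i,S} for S_{<=k}: half the budget on the variance
    statistic F_{S,V}, a quarter each on F_{S,1} and F_{S,2}, shared evenly
    across the k sizes and across the C(p, s) subsets of each size s."""
    thr = {}
    for s in range(1, k + 1):
        thr[("V", s)] = alpha / (2 * k * comb(p, s))
        thr[("1", s)] = alpha / (4 * k * comb(p, s))
        thr[("2", s)] = thr[("1", s)]
    return thr
```

By construction the thresholds sum exactly to $\alpha$ over the whole collection $\mathcal{S}_{\leq k}$, so condition (18) holds.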

Algorithm 3 Bonferroni Calibration for a collection S ⊂ S ≤Dmax
Given any data-driven collection $\widehat{\mathcal{S}}$, denote by $T^B_{\widehat{\mathcal{S}}}$ the multiple testing procedure calibrated by the Bonferroni thresholds $\alpha^B$ (18).
Remark 3.1 (Bonferroni correction on $\overline{\mathcal{S}}$ and not on $\widehat{\mathcal{S}}$). Note that even though we only compute the statistics $F_{S,i}$ for models $S \in \widehat{\mathcal{S}}$, the Bonferroni correction (18) must be applied to the initial deterministic collection $\overline{\mathcal{S}}$ containing $\widehat{\mathcal{S}}$. Indeed, if we replace condition (18) by the condition $\sum_{S \in \widehat{\mathcal{S}}} \sum_{i=V,1,2} \alpha_{i,S} \leq \alpha$, then the size of the corresponding test is no longer constrained to be smaller than $\alpha$. This is due to the fact that we use the same data set to select $\widehat{\mathcal{S}} \subset \overline{\mathcal{S}}$ and to perform the multiple testing procedure. As a simple example, consider any deterministic collection $\overline{\mathcal{S}}$ and the data-driven collection containing the single subset $S \in \overline{\mathcal{S}}$ that minimizes the p-values of the parametric tests. Computing $T^B_{\widehat{\mathcal{S}}}$ for this particular collection $\widehat{\mathcal{S}}$ is then equivalent to performing a multiple testing procedure on $\overline{\mathcal{S}}$.
Although the procedure $T^B_{\widehat{\mathcal{S}}}$ is computationally and conceptually simple, the size of the corresponding test can be much lower than $\alpha$ because of three difficulties:
1. Independently of our problem, Bonferroni corrections are known to be too conservative when the test statistics are dependent.
2. As emphasized by Remark 3.1, whereas the Bonferroni correction needs to be based on the whole collection $\overline{\mathcal{S}}$, only the subsets $S \in \widehat{\mathcal{S}}$ are considered. Unless we can afford the computational cost of testing all subsets within $\overline{\mathcal{S}}$, this loss cannot be compensated for under the Bonferroni correction.
3. As underlined in the subsection above, for computational reasons we do not consider the exact p-values of $F_{S,1}$ and $F_{S,2}$ but only the upper bounds $\tilde q_{1,S}$ and $\tilde q_{2,S}$. We therefore overestimate the type I error due to $F_{S,1}$ and $F_{S,2}$.
In fact, the three aforementioned issues are addressed by the permutation approach.

Calibration by permutation (P).
The collection of thresholds $\alpha^P = \{\alpha_{i,S},\ S \in \widehat{\mathcal{S}}\}$ is chosen such that each $\alpha_{i,S}$ remains inversely proportional to $\binom{p}{|S|}$, in order to put all subset sizes on an equal footing. In other words, we choose a collection of thresholds of the form $\alpha_{i,S} = C_i / \binom{p}{|S|}$ (20), where the constants $C_i$ are calibrated by permutation to control the type I error of the global test. Given a permutation $\pi$ of the set $\{1, \ldots, n_1+n_2\}$, one gets $\mathbf{Y}^\pi$ and $\mathbf{X}^\pi$ by permuting the components of $\mathbf{Y}$ and the rows of $\mathbf{X}$. This yields a new sample $(\mathbf{Y}^{\pi,(1)}, \mathbf{Y}^{\pi,(2)}, \mathbf{X}^{\pi,(1)}, \mathbf{X}^{\pi,(2)})$, from which we compute a new collection $\widehat{\mathcal{S}}^\pi$, parametric statistics $(F^\pi_{S,i})_{i=V,1,2}$ and p-values $(\tilde q^\pi_{i,S})_{i=V,1,2}$. Denote $\mathbb{P}_\pi$ the uniform distribution over the permutations of size $n_1+n_2$.
We define $C_V$ as the $\alpha/2$-quantile with respect to $\mathbb{P}_\pi$ of the minimal scaled p-value for the variance statistic over $\widehat{\mathcal{S}}^\pi$ (21). Similarly, $C_1 = C_2$ is defined as the $\alpha/2$-quantile with respect to $\mathbb{P}_\pi$ of the minimal scaled p-value for the coefficient statistics (22). In practice, the quantiles $C_i$ are estimated by sampling a large number $B$ of permutations. The permutation calibration procedure for the Lasso collection $\widehat{\mathcal{S}}_{Lasso}$ is summarized in Algorithm 4. Given any data-driven collection $\widehat{\mathcal{S}}$, denote by $T^P_{\widehat{\mathcal{S}}}$ the multiple testing procedure calibrated by the permutation method (20).
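The permutation estimate of a constant $C_i$ can be sketched as follows; `min_scaled_pvalue` is a hypothetical user-supplied callable that recomputes the minimal binomial-scaled p-value on a relabelled sample (everything here is an illustrative sketch of the calibration loop, not the paper's Algorithm 4):

```python
import numpy as np

def permutation_constant(Y, X, n1, min_scaled_pvalue, alpha, B=200, seed=0):
    """Estimate C as the (alpha/2)-quantile, over B random permutations of
    the pooled sample, of the minimal scaled p-value recomputed on the
    relabelled samples of sizes n1 and n - n1."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    stats_pi = []
    for _ in range(B):
        pi = rng.permutation(n)
        Ypi, Xpi = Y[pi], X[pi]
        # relabel: first n1 rows form sample 1, the rest form sample 2
        stats_pi.append(min_scaled_pvalue(Ypi[:n1], Xpi[:n1], Ypi[n1:], Xpi[n1:]))
    return np.quantile(stats_pi, alpha / 2)
```

Because permuting the pooled sample mimics the exchangeability that holds under $H_0$, thresholding at the resulting quantile keeps the type I error close to the nominal level.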
Remark 3.2. Through the three constants $C_V$, $C_1$ and $C_2$ (Eq. (21, 22)), the permutation approach simultaneously corrects for the losses mentioned earlier due to the Bonferroni correction, in particular the restriction to a data-driven class $\hat{S}$ and the approximate p-values $\hat{q}_{1,S}$ and $\hat{q}_{2,S}$. Yet, the level of $T^P_{\hat{S}}$ is not exactly $\alpha$, because we treat separately the statistics $F_{S,V}$ and $(F_{S,1}, F_{S,2})$ and apply a Bonferroni correction. It would be possible to calibrate all the statistics simultaneously in order to constrain the size of the corresponding test to be exactly $\alpha$. However, this last approach would favor the statistic $F_{S,V}$ too much, because we would put on the same level the exact p-value $\hat{q}_{V,S}$ and the upper bounds $\hat{q}_{1,S}$ and $\hat{q}_{2,S}$.

Interpretation tools
Empirical p-value. When using a calibration by permutations, one can derive an empirical p-value $\tilde{p}$ to assess the global significance of the test. In contrast with the model- and statistic-specific p-values $\hat{q}_{i,S}$, this p-value provides a nominally accurate estimation of the type I error rate associated with the global multiple testing procedure, every model in the collection and test statistic being taken into account. It can be directly compared to the desired level $\alpha$ to decide on the rejection of the global null hypothesis. This empirical p-value is obtained as the fraction of the permuted values of the statistic that are less than the observed test statistic. Keeping the notation of Algorithm 4, the empirical p-values for the variance and coefficient parts are given respectively by
$$\tilde{p}_V = \frac{1}{B} \sum_{b=1}^{B} \mathbf{1}\Big\{ \min_{S \in \hat{S}^{\pi_b}} \hat{q}^{\pi_b}_{V,S}\, p^{|S|} \le \min_{S \in \hat{S}} \hat{q}_{V,S}\, p^{|S|} \Big\}, \qquad \tilde{p}_c = \frac{1}{B} \sum_{b=1}^{B} \mathbf{1}\Big\{ \min_{S \in \hat{S}^{\pi_b}} (\hat{q}^{\pi_b}_{1,S} \wedge \hat{q}^{\pi_b}_{2,S})\, p^{|S|} \le \min_{S \in \hat{S}} (\hat{q}_{1,S} \wedge \hat{q}_{2,S})\, p^{|S|} \Big\}.$$
The empirical p-value for the global test is then $\tilde{p} = 2\,(\tilde{p}_V \wedge \tilde{p}_c)$.
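The empirical p-value described above, the fraction of permuted statistics falling below the observed one (small values of the statistic being the significant ones here), can be sketched as follows:

```python
import numpy as np

def empirical_pvalue(observed, permuted):
    """Empirical p-value of the global test: the fraction of permuted values
    of the statistic that are less than the observed one (small values of
    the statistic are the significant ones here)."""
    permuted = np.asarray(permuted, dtype=float)
    return np.mean(permuted < observed)
```

A small observed statistic relative to the permutation distribution yields a small empirical p-value, to be compared directly with the desired level $\alpha$.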
Rejected model. Moreover, one can keep track of the model responsible for the rejection, unveiling sensible information on which particular coefficients most likely differ between samples. The rejected models for the variance and coefficient parts are given respectively by
$$\hat{S}^R_V = \arg\min_{S \in \hat{S}} \hat{q}_{V,S}\, p^{|S|}, \qquad \hat{S}^R_c = \arg\min_{S \in \hat{S}} (\hat{q}_{1,S} \wedge \hat{q}_{2,S})\, p^{|S|}.$$

Power and Adaptation to Sparsity
Let us fix some number $\delta \in (0, 1)$. The objective is to investigate the set of parameters $(\beta^{(1)}, \sigma^{(1)}, \beta^{(2)}, \sigma^{(2)})$ that ensure that the power of the test exceeds $1 - \delta$. We focus here on the Bonferroni calibration (B) procedure because the analysis is easier. Section 5 will illustrate that the permutation calibration (P) outperforms the Bonferroni calibration (B) in practice. In the sequel, $A \lesssim B$ (resp. $A \gtrsim B$) means that for some positive constant $L(\alpha, \delta)$ that only depends on $\alpha$ and $\delta$, $A \le L(\alpha, \delta) B$ (resp. $A \ge L(\alpha, \delta) B$).
We first define the symmetrized Kullback-Leibler divergence as a way to measure the discrepancies between $(\beta^{(1)}, \sigma^{(1)})$ and $(\beta^{(2)}, \sigma^{(2)})$. Then, we consider tests with deterministic collections in Sections 4.2-4.3. We prove that the corresponding tests are minimax adaptive to the sparsity of the parameters or to the sparsity of the difference $\beta^{(1)} - \beta^{(2)}$. Sections 4.4-4.5 are devoted to the analysis of $T^B_{\hat{S}_{Lasso}}$. Under stronger assumptions on the population covariances than for deterministic collections, we prove that the performances of $T^B_{\hat{S}_{Lasso}}$ are nearly optimal.

Symmetrized Kullback-Leibler divergence
Intuitively, the test $T^B_S$ should reject $H_0$ with large probability when $(\beta^{(1)}, \sigma^{(1)})$ is far from $(\beta^{(2)}, \sigma^{(2)})$ in some sense. A classical way of measuring the divergence between two distributions is the Kullback-Leibler discrepancy. In the sequel, we denote $\mathcal{K}(P_{Y^{(1)}|X};\, P_{Y^{(2)}|X})$ the Kullback discrepancy between the conditional distribution of $Y^{(1)}$ given $X^{(1)} = X$ and the conditional distribution of $Y^{(2)}$ given $X^{(2)} = X$. Then, we denote $K_1$ the expectation of this Kullback divergence when $X \sim N(0_p, \Sigma^{(1)})$. Exchanging the roles of $\Sigma^{(1)}$ and $\Sigma^{(2)}$, we also define $K_2$. The sum $K_1 + K_2$ forms a semidistance with respect to $(\beta^{(1)}, \sigma^{(1)})$ and $(\beta^{(2)}, \sigma^{(2)})$, as proved by the following decomposition:
$$K_1 + K_2 = \frac{\big(\sigma_{(1)}^2 - \sigma_{(2)}^2\big)^2}{2\,\sigma_{(1)}^2 \sigma_{(2)}^2} + \frac{\|\beta^{(1)} - \beta^{(2)}\|^2_{\Sigma^{(1)}}}{2\,\sigma_{(2)}^2} + \frac{\|\beta^{(1)} - \beta^{(2)}\|^2_{\Sigma^{(2)}}}{2\,\sigma_{(1)}^2}.$$
When $\Sigma^{(1)} \neq \Sigma^{(2)}$, we quantify the discrepancy between these covariance matrices by $\varphi_{\Sigma^{(1)}, \Sigma^{(2)}}$. Observe that the quantity $\varphi_{\Sigma^{(1)}, \Sigma^{(2)}}$ can be considered as a constant if we assume that the smallest and largest eigenvalues of the $\Sigma^{(i)}$ are bounded away from zero and infinity.
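For concreteness, the expected Kullback divergence between the two conditional Gaussian distributions admits a closed form that can be evaluated numerically. The sketch below applies the standard Gaussian KL formula averaged over a design $X \sim N(0, \Sigma)$; function names are ours:

```python
import numpy as np

def expected_kl(beta1, sigma1, beta2, sigma2, Sigma):
    """E_X[ KL( N(X.beta1, sigma1^2) || N(X.beta2, sigma2^2) ) ] for
    X ~ N(0, Sigma): standard Gaussian KL, averaged over the design."""
    d = np.asarray(beta1) - np.asarray(beta2)
    pred = d @ Sigma @ d                   # ||beta1 - beta2||_Sigma^2
    return np.log(sigma2 / sigma1) + (sigma1**2 + pred) / (2 * sigma2**2) - 0.5

def symmetrized_kl(beta1, sigma1, beta2, sigma2, Sigma1, Sigma2):
    """K_1 + K_2, the semi-distance used in the power statements."""
    return (expected_kl(beta1, sigma1, beta2, sigma2, Sigma1)
            + expected_kl(beta2, sigma2, beta1, sigma1, Sigma2))
```

The quantity vanishes if and only if $(\beta^{(1)}, \sigma^{(1)}) = (\beta^{(2)}, \sigma^{(2)})$, in agreement with the semidistance property.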

Power of $T^B_{S_{\le k}}$
First, we control the power of $T^B_S$ for a deterministic collection $S = S_{\le k}$ (with some $k \le (n_1 \wedge n_2)/2$) and the Bonferroni calibration weights $\alpha_{i,S}$ as in (19). For any $\beta \in \mathbb{R}^p$, $|\beta|_0$ refers to the size of its support and $|\beta|$ stands for the vector $(|\beta_i|)_{i=1,\ldots,p}$. We consider the two following assumptions. Remark 4.1. Condition A.1 requires that the type I and type II errors under consideration are not exponentially smaller than the sample size. Condition A.2 tells us that the number of non-zero components of $\beta^{(1)}$ and $\beta^{(2)}$ has to be smaller than $(n_1 \wedge n_2)/\log(p)$. This requirement has been shown [44] to be minimal to obtain fast rates of testing of the form (24) in the specific case $\beta^{(2)} = 0$, $\sigma^{(1)} = \sigma^{(2)}$ and $n_2 = \infty$.
Remark 4.2. The condition $\Sigma^{(1)} = \Sigma^{(2)}$ is not necessary to control the power of $T^B_{S_{\le k}}$ in terms of $|\beta^{(1)} - \beta^{(2)}|_0$ as in (25). However, the expression (25) would become far more involved.

Remark 4.4. We only managed to prove the minimax lower bound for $\Sigma^{(1)} = \Sigma^{(2)} = I_p$, implying that, even though the detection rate (24) is unimprovable uniformly over all $(\Sigma^{(1)}, \Sigma^{(2)})$, some improvement is perhaps possible for specific covariance matrices. To our knowledge, no such results of adaptation to the population covariance of the design exist, even in the one-sample problem.
Remark 4.5. Equation (25) together with (27) tells us that $T^B_{S_{\le k}}$ simultaneously achieves (up to a constant) the optimal rates of detection over $s$-sparse differences $\beta^{(1)} - \beta^{(2)}$.

Remark 4.6 (Informal justification of the introduction of the collection $\hat{S}_{Lasso}$). If we look at the proof of Theorem 4.1, we observe that the power (24) is achieved by the statistics $(F_{S_\vee,V}, F_{S_\vee,1}, F_{S_\vee,2})$, where $S_\vee$ is the union of the supports of $\beta^{(1)}$ and $\beta^{(2)}$. In contrast, (25) is achieved by the statistics $(F_{S_\Delta,V}, F_{S_\Delta,1}, F_{S_\Delta,2})$, where $S_\Delta$ is the support of $\beta^{(1)} - \beta^{(2)}$. Intuitively, the idea underlying the collection $\hat{S}_L$ in the definition (11) of $\hat{S}_{Lasso}$ is to estimate $S_\vee$, while the idea underlying the collection $\hat{S}^{(2)}_L$ is to estimate $S_\Delta$.

Power of $T^B_S$ for any deterministic $S$

Theorem 4.3 below extends Theorem 4.1 from deterministic collections of the form $S_{\le k}$ to any deterministic collection $S$, unveiling a bias/variance-like trade-off linked to the cardinality of the subsets $S$ of the collection $S$. To do so, we need to consider the Kullback discrepancies between the conditional distribution of $Y^{(1)}$ given $X_S = X_S$ and the conditional distribution of $Y^{(2)}$ given $X_S = X_S$; for short, we respectively denote them $K_1(S)$ and $K_2(S)$. Intuitively, $K_1(S) + K_2(S)$ corresponds to some distance between the regression of $Y^{(1)}$ given $X_S$ and that of $Y^{(2)}$ given $X_S$. Denoting $\Sigma^{(1)}_S$ (resp. $\Sigma^{(2)}_S$) the restriction of $\Sigma^{(1)}$ (resp. $\Sigma^{(2)}$) to the indices in $S$, we define the corresponding restricted quantities.

Theorem 4.3 (Power of $T^B_S$ for any deterministic $S$). For any $S \in S$, let $\alpha_S = \min_{i=V,1,2} \alpha_{i,S}$. The power of $T^B_S$ is larger than $1 - \delta$ as long as there exists $S \in S$ such that $|S| \lesssim n_1 \wedge n_2$ and conditions (29) and (30) hold.

Remark 4.7. Let $\Delta(S)$ denote the right-hand side of (30). According to Theorem 4.3, the term $\Delta(S)$ plays the role of a variance term and therefore increases with the cardinality of $S$. Furthermore, the term $K_1 - K_1(S) + K_2 - K_2(S)$ plays the role of a bias. Let $S^*$ denote the subcollection of $S$ made of the sets $S$ satisfying (29). According to Theorem 4.3, $T^B_S$ is powerful as long as the Kullback loss exceeds the trade-off (31). Such a result is comparable to oracle inequalities obtained in estimation, since the test $T^B_S$ is powerful when the Kullback loss $K_1 + K_2$ is larger than the trade-off (31) between a bias-like term and a variance-like term, without requiring the knowledge of this trade-off in advance. We refer to [6] for a thorough comparison between oracle inequalities in model selection and type II error terms of this form.

Power of $T^B_{\hat{S}_{Lasso}}$
For the sake of simplicity, we restrict this subsection to the case $n_1 = n_2 := n$, more general results being postponed to the next subsection. The test $T^B_{S_{\le n/2}}$ is computationally expensive (non-polynomial with respect to $p$). The collection $\hat{S}_{Lasso}$ has been introduced to alleviate this burden. We consider $T^B_{\hat{S}_{Lasso}}$ with the prescribed Bonferroni calibration weights $\alpha_{i,S}$ (as in (19), with $k$ replaced by $(n_1 \wedge n_2)/2$). In the statements below, $\psi_{\Sigma^{(1)}, \Sigma^{(2)}}, \ldots$ refer to positive quantities that only depend on the largest and smallest eigenvalues of $\Sigma^{(1)}$ and $\Sigma^{(2)}$. Remark 4.8. The rates of detection (32) and the sparsity condition A.3 are analogous to (24) and Condition A.2 in Theorem 4.1 for $T^B_{S_{\le (n_1 \wedge n_2)/2}}$. The second result (33) is also similar to (25). As a consequence, $T^B_{\hat{S}_{Lasso}}$ is minimax adaptive to the sparsity of $(\beta^{(1)}, \beta^{(2)})$ and of $\beta^{(1)} - \beta^{(2)}$. The dependencies on $\Sigma^{(1)}$ and $\Sigma^{(2)}$ are unavoidable because the collection $\hat{S}_{Lasso}$ is based on the Lasso estimator, which requires design assumptions to work well [10]. Nevertheless, one can improve all these dependencies using restricted eigenvalues instead of largest eigenvalues. This and other extensions are considered in the next subsection.

Refined analysis of $T^B_{\hat{S}_{Lasso}}$
Given a matrix $X$, an integer $k$, and a number $M$, one respectively defines the largest and smallest restricted eigenvalues of order $k$ and the compatibility constants $\kappa[M, k, X]$ and $\eta[M, k, X]$ (see [43]), which measure the closeness to orthogonality of $\Sigma^{(1)}$ and $\Sigma^{(2)}$. Theorem 4.4 is a straightforward consequence of the two following results.

Numerical Experiments
This section evaluates the performance of the suggested test statistics, combined with the aforementioned test collections and calibrations, on simulated linear regression datasets.

Synthetic Linear Regression Data
In order to calibrate the difficulty of the testing task, we simulate our data according to the rare and weak parametrization adopted in [3]. We choose a large but still reasonable number of variables $p = 200$, and restrict ourselves to cases where the number of observations $n = n_1 = n_2$ in each sample remains smaller than $p$. The sparsity of the sample-specific coefficients $\beta^{(1)}$ and $\beta^{(2)}$ is parametrized by the number $p^{1-\eta}$ of non-zero common coefficients and the number $p^{1-\eta_2}$ of non-zero coefficients specific to $\beta^{(2)}$. The magnitude of all non-zero coefficients is set to a common value $\mu_r = \sqrt{2 r \log p}$, where we let the magnitude parameter range from $r = 0$ to $r = 0.5$. We consider three sample sizes $n = 25, 50, 100$, and generate two sub-samples of equal size $n_1 = n_2 = n$ according to the sample-specific linear regression models. Design matrices $X^{(1)}$ and $X^{(2)}$ are generated from multivariate Gaussian distributions, $X^{(j)}_i \sim N(0, \Sigma^{(j)})$, with varying choices of $\Sigma^{(j)}$, as detailed below. Noise components $\epsilon^{(1)}_i$ and $\epsilon^{(2)}_i$ are generated independently of $X^{(1)}$ and $X^{(2)}$ according to a standard centered Gaussian distribution.
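A minimal sketch of this rare-and-weak generation scheme, restricted to the orthogonal design $\Sigma = I_p$; the function name and default values are ours:

```python
import numpy as np

def simulate_two_sample(n, p=200, eta=0.8, eta2=0.9, r=0.3, seed=None):
    """Rare-and-weak two-sample design: p^(1-eta) non-zero coefficients
    shared by both samples, p^(1-eta2) extra coefficients specific to
    beta2, all of magnitude sqrt(2 r log p); orthogonal design Sigma = I_p.
    Function name and defaults are ours."""
    rng = np.random.default_rng(seed)
    s_common, s_spec = int(p ** (1 - eta)), int(p ** (1 - eta2))
    mu = np.sqrt(2 * r * np.log(p))
    beta1 = np.zeros(p)
    beta1[:s_common] = mu
    beta2 = beta1.copy()
    beta2[s_common:s_common + s_spec] = mu   # sample-2-specific signal
    X1 = rng.normal(size=(n, p))
    X2 = rng.normal(size=(n, p))
    Y1 = X1 @ beta1 + rng.normal(size=n)     # standard Gaussian noise
    Y2 = X2 @ beta2 + rng.normal(size=n)
    return (Y1, X1), (Y2, X2), (beta1, beta2)
```

Correlated designs are obtained by replacing the standard Gaussian rows with draws from $N(0, \Sigma)$ for the structures described below.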
The next two paragraphs detail the different design scenarios under study as well as test statistics, collections and calibrations in competition. Each experiment is repeated 1000 times.
Design Scenarios Under Study.
Sparsity Patterns. We study six different sparsity patterns, as summarized in Table 1. The first two are meant to validate type I error control. The last four allow us to compare the performances of the various test statistics, collections and calibrations under different sparsity levels and proportions of shared coefficients. In all cases, the choices of the sparsity parameters $\eta$ and $\eta_2$ lead to strong to very strong levels of sparsity. The last column of Table 1 illustrates the signal sparsity patterns of $\beta^{(1)}$ and $\beta^{(2)}$ associated with each scenario. In scenarios 1 and 2, sample-specific signals share few, if any, non-zero coefficients. In scenarios 3 and 4, sample-specific coefficients show some overlap. Scenario 4 is the most difficult one, since the number of sample-2-specific coefficients is much smaller than the number of common non-zero coefficients: the difference between $\beta^{(1)}$ and $\beta^{(2)}$ is much sparser than $\beta^{(2)}$ itself. This explains why the illustration in the last column might be misleading: the two patterns are not equal but actually differ by only one covariate.
Beyond those six varying sparsity patterns, we consider three different correlation structures Σ (1) and Σ (2) for the generation of the design matrix. In all three cases, we assume that Σ (1) = Σ (2) = Σ. On top of the basic orthogonal matrix Σ (1) = Σ (2) = I p , we investigate two randomly generated correlation structures.
Power Decay Correlation Structure. First, we consider a power decay correlation structure such that $\Sigma_{i,j} = \rho^{|i-j|}$. Since the sparsity pattern of $\beta^{(1)}$ and $\beta^{(2)}$ is linked to the order of the covariates, we randomly permute at each run the columns and rows of $\Sigma$ in order to make sure that the correlation structure is independent of the sparsity pattern.
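A sketch of the power decay structure with the simultaneous random permutation of rows and columns; the function name is ours:

```python
import numpy as np

def power_decay_cov(p, rho=0.75, seed=None):
    """Sigma_{ij} = rho^{|i-j|}, followed by a simultaneous random
    permutation of rows and columns, so that the correlation structure is
    independent of the index-ordered sparsity pattern."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    perm = rng.permutation(p)
    return Sigma[np.ix_(perm, perm)]
```

Permuting rows and columns by the same permutation preserves symmetry and positive definiteness, so the result is still a valid covariance matrix.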
Gaussian Graphical Model Structure. Second, we simulate correlation structures with the R package GGMselect. The function simulateGraph generates covariance matrices corresponding to Gaussian graphical model structure made of clusters with some intracluster and extra-cluster connectivity coefficients. See Section 4 of [18] for more details. A new structure is generated at each run.
Both random correlation structures are calibrated such that, on average, each covariate is correlated with 10 other covariates with correlations above 0.2 in absolute value. This corresponds to fixing ρ at a value of 0.75 in the power decay correlation structure and the intra-cluster connectivity coefficient to 5% in the Gaussian graphical model structure.
With the default option of the function simulateGraph the extra-cluster connectivity coefficient is taken five times smaller.
Test statistics, collections and calibrations in competition. In the following, we present the results of the proposed test statistics combined with two test collections, namely a deterministic and a data-driven model collection, respectively $S_1$ and $\hat{S}_{Lasso}$, as well as with a Bonferroni (B) or permutation (P) calibration (computed with 100 random permutations). Furthermore, to put those results in perspective, we compare our suggested test statistics to the usual Fisher statistic, and our approach to the parallel work of [40].
Fisher statistic. For a given support $S$ of reduced dimension $|S|$, the usual likelihood ratio statistic for the equality of $\beta_S$ follows a Fisher distribution with $|S|$ and $n_1 + n_2 - 2|S|$ degrees of freedom:
$$Fi_S = \frac{n_1 + n_2 - 2|S|}{|S|} \cdot \frac{\|Y - X_S \hat{\beta}_S\|_2^2 - \|Y^{(1)} - X^{(1)}_S \hat{\beta}^{(1)}_S\|_2^2 - \|Y^{(2)} - X^{(2)}_S \hat{\beta}^{(2)}_S\|_2^2}{\|Y^{(1)} - X^{(1)}_S \hat{\beta}^{(1)}_S\|_2^2 + \|Y^{(2)} - X^{(2)}_S \hat{\beta}^{(2)}_S\|_2^2},$$
where $\hat{\beta}_S$ is the maximum likelihood estimator restricted to covariates in the support $S$ on the concatenated sample $(X, Y)$. While this statistic $Fi_S$ is able to detect differences between $\beta^{(1)}$ and $\beta^{(2)}$, it is not really suited for detecting differences between the standard deviations $\sigma^{(1)}$ and $\sigma^{(2)}$. The Fisher statistic $Fi_S$ is adapted to the high-dimensional framework in the same way as the suggested statistics $(F_{S,V}, F_{S,1}, F_{S,2})$, except that exact p-values are available. The corresponding test with a collection $S$ and a Bonferroni (resp. permutation) calibration is denoted $T^{B,Fisher}_S$ (resp. $T^{P,Fisher}_S$).

Procedure of Städler and Mukherjee [40]. The DiffRegr procedure of Städler and Mukherjee performs two-sample testing between high-dimensional regression models. The procedure is based on sample splitting: the data is split into two parts, the first used to reduce dimensionality (screening step) and the second used to compute p-values based on a restricted log-likelihood-ratio statistic (cleaning step). To increase the stability of the results, the splitting step is repeated multiple times and the resulting p-values are aggregated. We choose to use the p-value calculations based on permutations, as they remain computationally reasonable in the regression case; see [40]. The single-splitting and multi-splitting procedures are denoted respectively SS(perm) and MS(perm).
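The classical Fisher (Chow-type) statistic restricted to a support $S$ can be computed as below. This is a sketch of the low-dimensional benchmark only, not of the paper's statistics $(F_{S,V}, F_{S,1}, F_{S,2})$; the function name is ours:

```python
import numpy as np

def fisher_stat(Y1, X1, Y2, X2, S):
    """Classical F statistic for H0: beta_S^(1) = beta_S^(2), restricted to
    the covariates in S; under H0 (and equal noise variances) it follows a
    Fisher distribution with |S| and n1 + n2 - 2|S| degrees of freedom."""
    X1s, X2s = X1[:, S], X2[:, S]

    def rss(Y, X):
        beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
        return np.sum((Y - X @ beta) ** 2)

    # pooled fit constrains a common beta_S, separate fits do not
    rss_pooled = rss(np.concatenate([Y1, Y2]), np.vstack([X1s, X2s]))
    rss_sep = rss(Y1, X1s) + rss(Y2, X2s)
    k, n = len(S), len(Y1) + len(Y2)
    return ((rss_pooled - rss_sep) / k) / (rss_sep / (n - 2 * k))
```

Since the pooled model is nested in the pair of separate models, the numerator is non-negative by construction.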

Validation of Type I Error Control
Control Under the Global Null Hypothesis $H_{00}$. Table 2 presents estimated type I error rates, that is, the percentage of simulations for which the null hypothesis is rejected, based upon 1000 simulations under the restricted null hypothesis $H_{00}$, where $\beta^{(1)} = \beta^{(2)} = 0$, and under the orthogonal correlation structure. The desired level is $\alpha = 5\%$, and the estimated levels are given with a 95% Gaussian confidence interval.
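The reported estimated levels with their 95% Gaussian confidence intervals amount to a Wald interval for a binomial proportion; a minimal sketch (function name ours):

```python
import numpy as np

def level_ci(rejections, nsim=1000, z=1.96):
    """Estimated type I error with a 95% Gaussian (Wald) half-width:
    phat +/- z * sqrt(phat (1 - phat) / nsim)."""
    phat = rejections / nsim
    return phat, z * np.sqrt(phat * (1 - phat) / nsim)
```

For example, 50 rejections out of 1000 simulations give an estimated level of 5% with a half-width of about 1.4 percentage points.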
As expected under independence, the combination of the $S_1$ collection with the Bonferroni correction yields an accurate $\alpha$-level when applied to the usual Fisher statistic. On the contrary, when applied to the suggested statistics, the use of upper bounds on the p-values leads to a strong decrease in the observed type I error. This decrease is exacerbated when using the $\hat{S}_{Lasso}$ collection, since we are accounting for many more models than the number actually tested in order to prevent overfitting. This effect can be seen both on the Fisher statistic and on our suggested statistics. Even with the usual Fisher statistic, for which we know the exact p-value, the Bonferroni calibration becomes untenable as soon as we adopt data-driven collections instead of deterministic ones.
On the contrary, a calibration by permutations restores a control of type-I error at the desired nominal level, whatever the test statistic or model collection.
As noted by [40], the multi-splitting procedure yields conservative results in terms of type I error control at level 5%.

Table 2 (excerpt). Estimated type I error rates (%) under $H_{00}$:

Model collection: $S_1$ (B) | $S_1$ (P) | $\hat{S}_{Lasso}$ (B) | $\hat{S}_{Lasso}$ (P)
n = 25: 1 ± 0.6 | 6.9 ± 1.6 | 0 ± 0 | 6.9 ± 1.6
n = 50: 1.8 ± 0.

Control Under the Global Equality of Non Null Coefficients $H_0$. Figures 1 and 3 present level checks under $H_0$ but with non-null $\beta^{(1)} = \beta^{(2)} \neq 0$, under respectively orthogonal and non-orthogonal correlation structures. Conclusions are perfectly similar to the case $H_{00}$: all methods behave well, except the multi-split DiffRegr procedure and the Bonferroni calibration-based procedures $T^B_S$ (for any collection $S$) and $T^{B,Fisher}_{\hat{S}_{Lasso}}$. In particular, the Fisher statistic combined with $S_1$ and the Bonferroni calibration is more conservative than the desired nominal level under correlated designs. In the figures, the deterministic collection $S_1$ is drawn with empty points, while the data-driven collection $\hat{S}_{Lasso}$ is drawn with filled points. Green circles represent the DiffRegr procedure, plain and empty for single-splitting and multi-splitting respectively.

Power Analysis. We do not investigate the power of the Bonferroni-based procedures $T^B_S$ and $T^{B,Fisher}_S$, as they were shown to be too conservative in the type I error analysis above. Figure 4 represents the power performance of the test $T^P_{\hat{S}}$ and of the usual likelihood ratio test $T^{P,Fisher}_{\hat{S}}$ combined with either the $S_1$ or the $\hat{S}_{Lasso}$ test collection, using a calibration by permutation under an orthogonal covariance matrix $\Sigma$, as well as the power performance of the DiffRegr procedure. Figure 5 represents equivalent results for the power decay and GGM covariance structures when n = 50.
In the absence of common coefficients (scenarios 1 and 2), the test $T^P_{\hat{S}}$ reaches 100% power from very low signal magnitudes and small sample sizes. Compared with the test based on the usual likelihood ratio statistic, which does not reach more than 40% power when n = 25 for the signal magnitudes under consideration, the suggested statistics prove extremely efficient. Under these settings, any subset of size 1 containing one of the variables activated only in $\beta^{(2)}$ suffices to reject the null, which is why the collection $S_1$ actually performs very well when associated with $(F_{S,V}, F_{S,1}, F_{S,2})$, and not so badly when associated with $Fi_S$. However, in the more complex settings 3 and 4, where larger subsets are required to correct for strong and numerous common effects, the subset collection $\hat{S}_{Lasso}$ yields a higher power than the collection $S_1$.
For small n, the test $T^P_{\hat{S}_{Lasso}}$ outperforms the DiffRegr procedure, whose limitation likely stems from the half-sampling step. This limitation of sample splitting approaches has already been noticed by [40]. However, for n = 100 the DiffRegr procedure performs better than our procedure in the highly challenging setting 4. Figure 5 provides similar results under power decay correlated designs and GGM-like correlated designs for a sample size of n = 50, leading to similar conclusions as in the uncorrelated case.

Application to GGM
The following section details the extension of the two-sample linear regression testing framework to the two-sample Gaussian graphical model testing framework. We describe tools and guidelines for a correct interpretation of the results and illustrate the approach on a typical two-sample transcriptomic dataset.

How to Apply this Strategy to GGM Testing
Neighborhood Selection Approach. The procedure developed in Section 2 can be adapted to the case of Gaussian graphical models as in [45]. We quickly recall why the estimation of a Gaussian graphical model amounts to the estimation of p independent linear regressions when adopting a neighborhood selection approach [32].
Consider two Gaussian random vectors $Z^{(1)} \sim N(0, (\Omega^{(1)})^{-1})$ and $Z^{(2)} \sim N(0, (\Omega^{(2)})^{-1})$. Their respective conditional independence structures are represented by the graphs $G^{(1)}$ and $G^{(2)}$, which consist of a common set of nodes $\Gamma = \{1, \ldots, p\}$ and their respective sets of edges $E^{(1)}$ and $E^{(2)}$. When speaking of gene regulation networks, each node represents a gene, and edges between genes are indicative of potential regulations. In contrast with gene co-expression networks, edges in Gaussian graphical models do not reflect correlations but partial correlations between gene expression profiles. Formally, an edge $(i, j)$ belongs to the edge set $E^{(1)}$ (resp. $E^{(2)}$) if and only if $Z^{(1)}_i$ and $Z^{(1)}_j$ (resp. $Z^{(2)}_i$ and $Z^{(2)}_j$) are dependent conditionally on all other variables $Z^{(1)}_{\setminus i,j}$ (resp. $Z^{(2)}_{\setminus i,j}$). When the precision matrix $\Omega^{(k)}$ is nonsingular, the edges are characterized by its non-zero entries.
The idea of neighborhood selection is to circumvent the intricate issue of estimating the precision matrix by recovering the sets of edges neighborhood by neighborhood, through the conditional distribution of $Z_i$ given $Z_{\setminus i}$. Indeed, this distribution is again Gaussian, with a mean that is a linear combination of $Z_{\setminus i}$, so that $Z_i$ can be decomposed into the following linear regression: $Z_i = Z_{\setminus i} \beta^{(i)} + \epsilon_i$, where $\beta^{(i)}_j = -\Omega_{ij}/\Omega_{ii}$ and $\epsilon_i$ is independent of $Z_{\setminus i}$. Given an $n_1$-sample of $Z^{(1)}$ and an $n_2$-sample of $Z^{(2)}$, we recall that our objective is to test the equality of the two graphical models, as formalized in (4). As a result of Equation (41), testing for the equality of the rows of $\Omega^{(1)}$ and $\Omega^{(2)}$ is equivalent to testing for the equality of the corresponding neighborhood regressions $\beta^{(1,i)}$ and $\beta^{(2,i)}$. Consequently, we can translate the GGM hypotheses given in Equation (4) into a conjunction of two-sample linear regression tests. Concretely, we apply the previous two-sample linear regression procedure with $X^{(1)} = Z^{(1)}_{\setminus i}$, $Y^{(1)} = Z^{(1)}_i$, $X^{(2)} = Z^{(2)}_{\setminus i}$, and $Y^{(2)} = Z^{(2)}_i$ for every gene $i$, and combine the multiple neighborhood tests using a Bonferroni calibration, as presented in Algorithm 5. The equality of the $\sigma^{(k)}$'s in $H_0$ models the equality of the diagonal terms of $\Omega^{(1)}$ and $\Omega^{(2)}$. Interpretation. Because we need $\Omega^{(1)}_{ii} = \Omega^{(2)}_{ii}$ and $\Sigma^{(1)}_{\setminus i} = \Sigma^{(2)}_{\setminus i}$ for every neighborhood in the two-sample GGM null hypothesis $H^N_0$ (42), the assumptions that $\sigma^{(1)} = \sigma^{(2)}$ and $\Sigma^{(1)} = \Sigma^{(2)}$ in the two-sample linear regression null hypothesis $H_0$ (3) are crucial for each neighborhood test to be interpreted correctly. As a result, only the global test can, strictly speaking, be interpreted in a statistically correct sense.
However, in practice, when the global null hypothesis is rejected, our construction of neighborhood tests provides helpful clues on the location of disruptive regulations. In particular, for each rejected neighborhood test $i$, one can keep track of the rejected model $\hat{S}^i_R$, retaining sensible information on which particular regulations are most likely altered between samples.

Algorithm 5 Gaussian Graphical Model Testing Strategy
Require: Data $Z^{(1)}$, $Z^{(2)}$, maximum model dimension Dmax, and desired level $\alpha$
for each gene i = 1, . . . , p do
    procedure Neighborhood Test
        Apply the Adaptive Testing Strategy of Algorithm 1 to $X^{(1)}, X^{(2)}, Y^{(1)}, Y^{(2)}$
    end procedure
end for
Reject the global null hypothesis if at least one Neighborhood Test is rejected at level $\alpha/p$

Illustration on Real Transcriptomic Breast Cancer Data
We apply this strategy to the full (training and validation) breast cancer dataset studied by [21] and [35], whose training subset was originally published in [37]. The full dataset consists of microarray gene expression profiles from 133 patients with stage I-III breast cancer undergoing preoperative chemotherapy. A majority of patients (n=99) presented residual disease (RD), while 34 patients demonstrated a pathologic complete response (pCR). The common objective of [21] and [35] was to develop a predictor of complete response to treatment from gene expression profiling. In particular, [21] identified an optimal predictive subset of 30 probes, mapping to 26 distinct genes.
[2] inferred Gaussian graphical models among those 26 genes on each patient class using weighted neighborhood selection. The corresponding graphs of conditional dependencies for medium regularization are presented in Figure 6. Those two graphs happen to differ dramatically from one another. The question we tackle is whether those differences remain when taking into account estimation uncertainties.

Pathologic Complete Response (pCR)
Residual Disease (RD)

Figure 6. Graphs of conditional dependencies among the 26 genes selected by [21] on patients with pathologic complete response or residual disease, with medium regularization, as presented in Figure 3 of [2].
We run for each of the p = 26 genes a neighborhood test $T^P_{\hat{S}_{Lasso}}$ at level 0.05/26. We associate to each neighborhood test the empirical p-value defined in (23), which is to be compared to $\alpha/p$.
Most of the graph estimation methods proposed in the literature, such as the procedure of [2] leading to Figure 6, rely on the assumption that observations are i.i.d. Yet the training and validation datasets were collected and analysed separately by two different clinical centers. We therefore start by checking whether the pooled sample can be considered homogeneous. Within each group of patients (RD and pCR), we carry out a test for the homogeneity of the Gaussian graphical models between the training and validation subsets.
Within pCR patients (Table 3), two neighborhood tests, corresponding to CA12 and PDGFRA, are rejected at level 0.05/26. Within RD patients (Table 4), half of the neighborhoods happen to differ significantly between the training and validation datasets. Genes CA12 and JMJD2B are responsible for the rejection of respectively seven and six neighborhoods. Because of these surprisingly significant divergences between the training and validation subsets, we restrict the subsequent analysis to the training set (n = 82 patients, among which 61 RD and 21 pCR patients).
To roughly check that we got rid of the underlying heterogeneity, we create an artificial dataset under H 0 by permutation of the patients, regardless of their class. No neighborhood test is rejected at a level corrected for multiple testing. We also cut the group of patients with residual disease artificially in half. When testing for the difference between the two halves, no significant heterogeneity remains, whatever the neighborhood.
Within the training set, the comparison of Gaussian graphical structures between pCR and RD patients leads to the rejection of all neighborhood tests after Bonferroni correction for multiple testing of the 26 neighborhoods, as summarized in Table 5. The RRM2, MAPT and MELK genes appear responsible for the rejection of respectively nine, nine and four of these neighborhood tests. Quite interestingly, these three genes have all been described in the clinical literature as new promising drug targets. [20] exhibited inhibitors of RRM2 expression, which reduced in vitro and in vivo cell proliferation. [38] carried out functional biology experiments validating the relationship between MAPT expression levels and response to therapy, suggesting to inhibit its expression to increase sensitivity to treatment. More recently, [13] developed a therapeutic candidate inhibiting MELK expression that was proved to suppress the growth of tumour-initiating cells in mice with various cancer types, including breast cancer. For comprehensiveness, we add that a similar analysis of the validation set (n = 51 patients, among which 38 RD and 13 pCR patients) leads to the identification of only 9 significantly altered neighborhoods between pCR and RD patients (Table 6). This difference in the number of significantly altered neighborhoods can be explained by the reduced sample size. Yet, the genes responsible for the rejection of the tests differ from those identified on the training set. In particular, five of the significant tests are rejected because of SCUBE2, which has recently been recognised as a novel tumor suppressor gene [28].

Discussion
Design hypotheses. In this work, we have made two main assumptions on the design matrices: (i) the design matrices $X^{(1)}$ and $X^{(2)}$ are random; (ii) under the null hypothesis (3), we further suppose that the population covariances $\Sigma^{(1)}$ and $\Sigma^{(2)}$ are equal. Although this setting is particularly suited to two-sample GGM testing (Section 6), one may wonder whether these two restrictions can be circumvented. We doubt that this is possible without making the testing problem much more difficult. First, the formulation (3) allows the null hypothesis to be interpreted as a relevant intermediate case between two extreme fixed-design settings: design equality ($X^{(1)} = X^{(2)}$) and arbitrarily different designs ($X^{(1)} \neq X^{(2)}$). In the first case, the two-sample problem reduces to a one-sample problem by considering $\tilde{Y} = Y^{(1)} - Y^{(2)}$, and it has therefore already been thoroughly studied. The second case is, on the contrary, extremely difficult, as illustrated by the proposition below.
Comparison with related work [40,41]. Städler and Mukherjee propose a very general approach to high-dimensional two-sample testing, applicable to a wide range of models. In particular, this approach allows for the direct comparison of two-sample Gaussian graphical models without adopting a neighborhood selection approach. This avoids the burden of multiple neighborhood linear regressions and the multiple testing correction that follows.
Because they estimate the supports of sample-specific estimators and joint estimator separately in the screening step, they resort to an elegant estimation of the p-values for the non-nested likelihood ratio test in the cleaning step. Yet, they do not provide any theoretical controls on type I error rate or power for their overall testing strategy.
Finally, as the numerical experiments show, their approach is based on half-sampling and can thus suffer from an acute reduction of power on small samples. On the bright side, the multi-split procedure shows stable results and performs well even in difficult scenarios as soon as n is sufficiently large.
Non asymptotic bounds and constants. In the spirit of [6], our type II error analysis is completely non-asymptotic. However, the numerical constants involved in the bounds are clearly not optimal. Another line of work initiated by [15] considers an asymptotic but high-dimensional framework and aims to provide detection rates with optimal constants. For instance [3,22] have derived such results in the one-sample high-dimensional linear regression testing problem under strong assumptions on the design matrices. In our opinion, both analyses are complementary. While deriving sharp detection rates (under perhaps stronger assumptions on the covariance) is a stimulating open problem, it is beyond the scope of our paper.
Loss functions and interpretation. The Kullback discrepancies considered in the power analysis depend on $\beta^{(1)}$ and $\beta^{(2)}$ through the prediction distances $\|\beta^{(1)} - \beta^{(2)}\|_{\Sigma^{(i)}}$, $i = 1, 2$, rather than through the $\ell_2$ distance $\|\beta^{(1)} - \beta^{(2)}\|_2$. On the one hand, such a dependency on the prediction abilities is natural, as our testing procedure relies on the likelihood ratio. On the other hand, it is possible to characterize the power of our testing procedures as in Theorems 4.1 and 4.4 in terms of the distance $\|\beta^{(1)} - \beta^{(2)}\|_2$ by inverting $\Sigma^{(1)}$ and $\Sigma^{(2)}$ at $\beta^{(1)} - \beta^{(2)}$. However, this inversion would introduce an additional factor of the form $\Phi^{-1}_{|\beta^{(1)} - \beta^{(2)}|_0,-}\big(\sqrt{\Sigma^{(i)}}\big)$ in the testing rates. In terms of interpretation, even though our procedure adopts a global testing approach through prediction distances, our real dataset example illustrates that identifying which subset in the collection is responsible for rejecting the null hypothesis provides clues as to which specific coefficients are most likely to differ between samples.
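The relation between the two distances can be made concrete. Below is a minimal numerical sketch with a hypothetical $2 \times 2$ covariance $\Sigma$ and difference vector $\delta = \beta^{(1)} - \beta^{(2)}$ (values chosen for illustration only): the prediction distance $\|\delta\|_\Sigma^2 = \delta^\top \Sigma \delta$ is sandwiched between the extreme eigenvalues of $\Sigma$ times the squared $\ell_2$ distance, which is why converting prediction rates into $\ell_2$ rates introduces a spectral factor.

```python
import math

# Illustration (not code from the paper): the prediction distance
# ||delta||_Sigma^2 = delta' Sigma delta is sandwiched between the
# extreme eigenvalues of Sigma times the squared l2 distance.

def quad_form(sigma, delta):
    """Compute delta' Sigma delta for a 2x2 symmetric matrix Sigma."""
    return (sigma[0][0] * delta[0] ** 2
            + (sigma[0][1] + sigma[1][0]) * delta[0] * delta[1]
            + sigma[1][1] * delta[1] ** 2)

def eigen_2x2(sigma):
    """Eigenvalues (min, max) of a symmetric 2x2 matrix."""
    tr = sigma[0][0] + sigma[1][1]
    det = sigma[0][0] * sigma[1][1] - sigma[0][1] * sigma[1][0]
    disc = math.sqrt(tr * tr / 4.0 - det)
    return tr / 2.0 - disc, tr / 2.0 + disc

sigma = [[2.0, 0.5], [0.5, 1.0]]      # hypothetical covariance
delta = [1.0, -1.0]                   # hypothetical beta^(1) - beta^(2)
l2_sq = delta[0] ** 2 + delta[1] ** 2
pred_sq = quad_form(sigma, delta)
lam_min, lam_max = eigen_2x2(sigma)
# lam_min * ||delta||_2^2 <= ||delta||_Sigma^2 <= lam_max * ||delta||_2^2
assert lam_min * l2_sq <= pred_sq <= lam_max * l2_sq
```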
Gene network inference. In the context of gene network inference by Gaussian graphical modeling, the high levels of correlation encountered within transcriptomic datasets and the potentially large number of missing variables result in highly unstable graphical estimations. Our global testing approach provides a way to assess whether sample-specific graphs actually share comparable predictive abilities or disclose genuine structural changes. Such a statistical validation is obviously crucial before translating any graphical analysis into further biological experiments. Interestingly, the three main genes pointed out by our testing strategy have been validated as promising therapeutic targets by functional biology experiments.
Finally, this test should also facilitate the validation of the fundamental i.i.d. assumption across multiple samples, paving the way to pooled analyses when legitimate. In that respect, we draw attention to the significant heterogeneity detected between the training and validation subsets of the well-known Hess \emph{et al} dataset, suggesting that these samples should be used separately, as originally intended. Methods which require i.i.d. observations should only be applied with caution to this dataset if it is considered as a single large homogeneous sample.

Upper bounds of the quantiles
Proof of Proposition 3.3. For the sake of simplicity, we write $N = n_1 - |S|$, let $(Z_1, \ldots, Z_{|S|})$ be a standard Gaussian random vector and $W_N$ a $\chi^2$ random variable with $N$ degrees of freedom. We apply the Laplace method to upper-bound $\mathbb{P}[F_{S,1} \geq u]$. The sharpest upper bound is given by the value $\lambda^*$ which minimizes $\psi_u(\lambda)$. We obtain an approximation of $\lambda^*$ by cancelling the second-order approximation of its derivative: differentiating $\psi_u$ gives an expression which admits a second-order approximation, and cancelling this quantity amounts to solving a polynomial equation of the second degree. The smallest solution of this equation leads to the desired $\lambda^*$.
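The optimization over $\lambda$ can be illustrated on the simpler case of a plain $\chi^2_d$ tail, where the Chernoff exponent $\psi(\lambda) = -\lambda u - (d/2)\log(1-2\lambda)$ is minimized in closed form at $\lambda^* = (1 - d/u)/2$. This is a simplified analogue of the paper's $\psi_u$ (which involves a weighted combination of $\chi^2$ variables), written only to make the Laplace-method step concrete:

```python
import math

# Generic Chernoff (Laplace-transform) bound for a chi-square tail:
#   P[chi2_d >= u] <= min_{0 < lam < 1/2} exp(psi(lam, u, d)),
# a simplified analogue of the psi_u minimization in the text.

def psi(lam, u, d):
    """log of the Chernoff bound: -lam*u + log E[exp(lam * chi2_d)]."""
    return -lam * u - 0.5 * d * math.log(1.0 - 2.0 * lam)

def chernoff_bound(u, d):
    lam_star = 0.5 * (1.0 - d / u)   # stationary point of psi(., u, d)
    return math.exp(psi(lam_star, u, d))

u, d = 20.0, 5
lam_star = 0.5 * (1.0 - d / u)       # = 0.375 here
# A grid search over (0, 1/2) confirms lam_star minimizes psi:
grid = [i / 1000.0 for i in range(1, 500)]
best = min(grid, key=lambda l: psi(l, u, d))
assert abs(best - lam_star) < 1e-2
```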
Additional notations. Given a subset $S$, let us consider the regression of $Y^{(1)}$ (resp. $Y^{(2)}$) with respect to $X_S$, written $X_S \beta_S + \epsilon_S$.
Under the null hypothesis $H_{0,S}$, we have $\beta_S^{(1)} = \beta_S^{(2)}$ and $\sigma_S^{(1)} = \sigma_S^{(2)}$. For the sake of simplicity, we write $\beta_S$ and $\sigma_S$ for these two common quantities. Define the random variables $T_1$ and $T_2$ as below. Given $X$, $T_1/T_2$ follows a Fisher distribution with $(n_1 - |S|, n_2 - |S|)$ degrees of freedom.
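The Fisher distribution of $T_1/T_2$ can be sanity-checked by simulation. The sketch below uses hypothetical sample sizes (not the paper's data) and exploits the fact that an $F(d_1, d_2)$ variable, i.e. a ratio of two independent normalized $\chi^2$ variables, has mean $d_2/(d_2 - 2)$ when $d_2 > 2$:

```python
import random

random.seed(0)

def chi2_draw(df):
    """Draw a chi-square variable as a sum of squared standard normals."""
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(df))

# Under H_{0,S}, T_1/T_2 is Fisher F(n1 - |S|, n2 - |S|) given X.
# Hypothetical sizes, for illustration only:
n1, n2, s = 40, 50, 5
d1, d2 = n1 - s, n2 - s              # degrees of freedom
reps = 4000
fs = [(chi2_draw(d1) / d1) / (chi2_draw(d2) / d2) for _ in range(reps)]
emp_mean = sum(fs) / reps
# The F(d1, d2) distribution has mean d2/(d2 - 2) for d2 > 2.
assert abs(emp_mean - d2 / (d2 - 2)) < 0.1
```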
This observation, valid under the null hypothesis, allows us to prove the first assertion of Proposition 3.1. Let us turn to the second statistic:

Calibrations
Proof of Proposition 3.4. By definition of the p-values $Q_{i,|S|}$, the following holds under $H_0$ for each $S \in \widehat{\mathcal{S}}$ and each $i \in \{V, 1, 2\}$. Applying a union bound and integrating with respect to $X$ allows us to control the type I error, where we have upper-bounded the sum over the random collection $\widehat{\mathcal{S}}$ by the sum over $\mathcal{S}$. Moreover, the p-values $Q_{i,|S|}$ are invariant with respect to the permutation $\pi$. Hence, applying a union bound and integrating with respect to $X$ again allows us to conclude.
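The union-bound calibration can be sketched as follows. The uniform split of the level across (statistic, subset) pairs is an illustrative simplification of the weights $\alpha_{i,S}$ used in the paper, and the p-values are hypothetical:

```python
# Sketch of a Bonferroni-type calibration: each (statistic, subset)
# pair receives a share of the level alpha, and H_0 is rejected when
# any p-value falls below its share.  Uniform shares are used here
# for illustration; the paper's weights alpha_{i,S} may differ.

def bonferroni_test(pvalues, alpha=0.05):
    """pvalues: dict mapping (statistic, subset) pairs to p-values.

    Rejects H_0 iff some p-value is below alpha / (number of tests);
    by the union bound the overall type I error stays below alpha.
    """
    share = alpha / len(pvalues)
    return any(p < share for p in pvalues.values())

pvals = {("V", "S1"): 0.30, ("1", "S1"): 0.004, ("2", "S1"): 0.20,
         ("V", "S2"): 0.08, ("1", "S2"): 0.50, ("2", "S2"): 0.60}
assert bonferroni_test(pvals, alpha=0.05)        # 0.004 < 0.05/6 ~ 0.0083
assert not bonferroni_test(pvals, alpha=0.01)    # 0.004 > 0.01/6 ~ 0.0017
```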

Proof of Theorem 4.3
The objective is to exhibit a subset for which the power of $T^B_{\mathcal{S}}$ is larger than $1 - \delta$. This subset is such that the distance between the two sample-specific distributions is large enough that we can reject the null hypothesis with large probability. As stated in Theorem 4.3, we rely on the semi-distances $K_1(S) + K_2(S)$ for $S \in \mathcal{S}$. The proof is split into five main lemmas. First, we upper-bound $Q^{-1}_{V,|S|}(x|X_S)$, $Q^{-1}_{1,|S|}(x|X_S)$, and $Q^{-1}_{2,|S|}(x|X_S)$ in Lemmas 8.1, 8.2 and 8.3. Then, we control the deviations of $F_{S,V}$, $F_{S,1}$, and $F_{S,2}$ under $H_{1,S}$ in Lemmas 8.4 and 8.5. In the sequel, we call $\overline{\mathcal{S}}$ the subcollection of $\mathcal{S}$ made of the subsets $S$ satisfying $|S| \leq (n_1 \wedge n_2)/2$, where the numerical constants $L^{\bullet}_1$, $L^{\bullet}_2$, and $L^{\bullet}_3$ only depend on $L^*_2$ in (53) and on the constants introduced in Lemmas 8.1-8.5. These conditions allow us to fix the constants in the statement (29) of Theorem 4.3.
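Schematically, the five lemmas combine as follows; the shorthands $f_{\mathrm{low}}$ and $q_{\mathrm{up}}$ are ours, introduced only for this sketch of the argument:

```latex
% Schematic form of the power argument (notation as in the text):
% the test based on F_{S,i} rejects with probability at least 1 - \delta
% as soon as the H_{1,S} deviation lower bound on the statistic exceeds
% the upper bound on the corresponding quantile.
\mathbb{P}\Big[ F_{S,i} \ge Q^{-1}_{i,|S|}(\alpha_S \mid X_S) \Big] \ge 1 - \delta
\quad \text{as soon as} \quad
\underbrace{f_{\mathrm{low}}(\delta)}_{\text{Lemmas 8.4--8.5}}
\;\ge\;
\underbrace{q_{\mathrm{up}}(\alpha_S)}_{\text{Lemmas 8.1--8.3}} .
```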
Lemma 8.1. There exists a positive universal constant $L$ such that the following holds. Consider some $0 < x < 1$ such that $16 \log(2/x) \leq n_1 \wedge n_2$. For any subset $S$ of size smaller than $(n_1 \wedge n_2)/2$, we have:
We recall that $a = (a_1, \ldots, a_{|S|})$ denotes the vector of positive eigenvalues introduced in Definition 3.2.
Lemma 8.2. There exist two positive universal constants $L_1$ and $L_2$ such that the following holds. If $|a|_1 < u \leq (n_1 - |S|)\,|a|_\infty$ and if $|S| \leq L_1 n_1$, then for any $0 < x < 1$ satisfying the condition below, we have the following upper bound:
Lemma 8.3 (Upper bound of $|a|_\infty$). There exist two positive universal constants $L_1$ and $L_2$ such that the following holds. Consider a positive number $\delta$ satisfying $L_1 \log(4/\delta) < n_1 \wedge n_2$. With probability larger than $1 - \delta/2$, we have:
Lemma 8.4 (Deviations of $F_{S,V}$). There exist three positive universal constants $L_1$, $L_2$ and $L_3$ such that the following holds. Assume that $L_1 \log(1/\delta) \leq n_1 \wedge n_2$. With probability larger than $1 - \delta$, we have:
Lemma 8.5 (Deviations of $F_{S,1}$). There exist two positive universal constants $L_1$ and $L_2$ such that the following holds. Under the stated assumption, with probability larger than $1 - \delta/2$, we have the following lower bound, where $\varphi_S$ is defined in (28):
Since we assume in (46) that $4 L^*_2 \log(6/\delta) \leq n_1 \wedge n_2$, the last condition is fulfilled. We now proceed to the proofs of the five previous lemmas.
Proof of Lemma 8.1. Let $u \in (0,1)$ and let $\bar{F}^{-1}_{D,N}(u)$ denote the $1-u$ quantile of a Fisher random variable with $D$ and $N$ degrees of freedom. According to [6], we have: Let us assume that $(8/N)\log(1/u) \leq 1$. By convexity of the exponential function, it holds that: Recall $T_1$ and $T_2$ defined in (44). Under hypothesis $H_0$: Consider some $x > 0$ such that $[8/(n_1 - |S|) \vee 8/(n_2 - |S|)]\log(2/x) \leq 1$. Then, with probability larger than $1 - x/2$, we have, since $|S| \leq (n_1 \wedge n_2)/2$: Similarly, with probability at least $1 - x/2$: To control the remaining term, we apply one of the two following identities: Combining the different bounds, we conclude that, with probability larger than $1 - x$:
Proof of Lemma 8.2. As in the proof of Proposition 3.3, we write $N = n_1 - |S|$. Recall that $Q_{1,|S|}(u|X_S)$ is defined as $\exp \psi_u(\lambda^*)$ (see Definition 3.2). We start by upper-bounding $\psi_u(\lambda^*)$, which yields the first upper bound on the logarithm of the tail probability $\log Q_{1,|S|}(u|X_S)$. We then exhibit a value $u_x$ such that $\psi_{u_x}(\lambda^*) \leq \log x$.
Upper bound of the tail probability. Since Equation (43) is increasing with respect to $\lambda$ and with respect to $N$, $\lambda^*$ decreases with $N$. Consequently:
By convexity of $x \mapsto 1 - \sqrt{1-x}$, we have $1 - \sqrt{1-x} \geq x/2$ for any $0 \leq x \leq 1$. Applying this inequality, we upper-bound $\sqrt{\Delta}$ and derive a bound valid for any $0 < x < 1/2$. Since $\log(1+x) \geq x - x^2$ for any $x > 0$, we derive:
Upper bound of the quantile. Let us turn to the upper bound of $Q^{-1}_{1,|S|}(x|X_S)$. Consider $u_x$ the solution larger than $|a|_1$ of the equation below, and observe that: Choosing $L_1$ and $L_2$ large enough in the condition $|S| \leq L_1 n_1$ and in condition (48) leads to $u_x \leq N |a|_\infty$. We now prove that $\psi_{u_x \vee 2|a|_1}(\lambda^*) \leq \log x$. If $u_x \geq 2|a|_1$, then $u_x^3 \leq 8(u_x - |a|_1)^3$ and the claim follows from (55) if we take $L_2$ large enough in condition (48). If $u_x \leq 2|a|_1$, then $|a|_1^2/(|a|_\infty |a|_1 + |a|_2^2) \geq 8 \log(1/x)$ and the claim follows if we take $L_1$ and $L_2$ large enough in the two aforementioned conditions, since $|S| \leq 2^{-6} n_1$. Thus, we conclude.
Proof of Lemma 8.3. The rescaled entries under consideration follow standard Gaussian distributions. In order to conclude, we control the largest and the smallest eigenvalues of standard Wishart matrices by applying Lemma 8.12.
Proof of Lemma 8.4. By symmetry, we can assume that $\sigma_S \geq 1$. Recall the definition of $T_1$ and $T_2$ in the proof of Proposition 3.1 (56). We need to control the deviations of $T_2/T_1$. Using bound (54), we get, with probability larger than $1 - \delta$: Since $|S| \leq (n_1 \wedge n_2)/2$, we derive the following for $L_1$ large enough in the statement of the lemma: In conclusion, we have, with probability larger than $1 - \delta$, the desired bound as long as the stated condition holds. Combining (56), (57), and (58), we conclude with probability larger than $1 - \delta$.
Proof of Lemma 8.5. We want to lower-bound the random variable $F_{S,1}$, where $R$ is defined by: Let us first work conditionally on $X^{(1)}_S$ and $X^{(2)}_S$. Upon defining a suitable Gaussian vector $W$, we have the following lower bound: The random variable involved is proportional to a $\chi^2$-distributed random variable with one degree of freedom and its variance is smaller than $(\sigma_S)^2$. Applying Lemma 8.11, we derive that, with probability larger than $1 - x/6$: Using the upper bound $|S| \leq (n_1 \wedge n_2)/2$ and Lemma 8.12, we control the last term with probability larger than $1 - 2\exp[-(n_1 \wedge n_2) L]$. If we take the constant $L_1$ large enough in condition (51), then we get, with probability larger than $1 - \delta/3$:
If $S_\vee = \emptyset$, then we can consider any subset of size 1 to prove the first result. If $S_\Delta = \emptyset$, then $\beta^{(1)} = \beta^{(2)}$ and the second result does not tell us anything.
The proof is divided into two main steps. First, we prove that, with large probability, the collection $\mathcal{S}_{\mathrm{Lasso}}$ contains some set $\widehat{S}_\lambda$ close to the union $S_\vee$ of the supports of $\beta^{(1)}$ and $\beta^{(2)}$. Then, we show that the statistics $(F_{\widehat{S}_\lambda,V}, F_{\widehat{S}_\lambda,1}, F_{\widehat{S}_\lambda,2})$ allow us to reject $H_0$ with large probability.
Recall that the collection $\mathcal{S}_{\mathrm{Lasso}}$ is based on the Lasso regularization path of the following heteroscedastic Gaussian linear model, which we write for short $Y = W\theta^* + \epsilon$. Given a tuning parameter $\lambda$, $\widehat{\theta}_\lambda$ refers to the Lasso estimator of $\theta^*$. In order to analyze the Lasso solution $\widehat{\theta}_\lambda$, we need to control how $W$ acts on sparse vectors.
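As a reminder of what $\widehat{\theta}_\lambda$ computes, here is a minimal coordinate-descent sketch of the Lasso on a toy design. This is a pure illustration: it ignores the heteroscedasticity of the model above, does not compute the full regularization path, and the data are hypothetical:

```python
# Minimal coordinate-descent sketch for the Lasso estimator
#   theta_lambda = argmin ||Y - W theta||_2^2 / (2n) + lambda ||theta||_1.
# Toy illustration only (no column scaling, no path computation).

def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    return max(z - t, 0.0) if z > 0 else min(z + t, 0.0)

def lasso_cd(W, Y, lam, iters=200):
    n, p = len(W), len(W[0])
    theta = [0.0] * p
    for _ in range(iters):
        for j in range(p):
            # partial residual excluding coordinate j
            r = [Y[i] - sum(W[i][k] * theta[k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(W[i][j] * r[i] for i in range(n)) / n
            norm_j = sum(W[i][j] ** 2 for i in range(n)) / n
            theta[j] = soft_threshold(rho, lam) / norm_j
    return theta

W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]   # toy design
Y = [2.0, 0.0, 2.0, 0.0]                               # toy response
# A large penalty kills every coefficient; a small one recovers signal.
assert lasso_cd(W, Y, lam=10.0) == [0.0, 0.0]
small = lasso_cd(W, Y, lam=0.01)
assert small[0] > 1.0 and abs(small[1]) < 0.5
```

With the toy data above, the least-squares solution is $(2, 0)$, so the small-penalty Lasso estimate is close to it while the large-penalty estimate is exactly zero, which is what the assertions check.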
Lemma 8.6 (Control of the design $W$). If we take the constants $L^*$, $L^*_1$, and $L^*_2$ in Proposition 4.5 small enough, then the following holds. The event $\mathcal{A}$ has probability larger than $1 - \delta/4$. Furthermore, on the event $\mathcal{A}$, the following holds for any $k \leq k^*$.
Then, we closely follow the arguments of the proof of Theorem 4.3 to show that $T^B_{\mathcal{S}_{\mathrm{Lasso}}}$ rejects $H_0$ with large probability as long as $K_1(\widehat{S}_\lambda) + K_2(\widehat{S}_\lambda)$ is large enough.
Lemma 8.9. On the event $\mathcal{A} \cap \mathcal{B}$, if $K_1(\widehat{S}_\lambda) + K_2(\widehat{S}_\lambda)$ is large enough, then $\min_{i \in \{V,1,2\}} Q_{i,|\widehat{S}_\lambda|}(F_{\widehat{S}_\lambda,i}|X_{\widehat{S}_\lambda}) < \alpha_{i,\widehat{S}_\lambda}$ with probability larger than $1 - \delta/2$.
Gathering the last inequality with (67) yields the result.
Proof of Lemma 8.9. For any non-empty set $S$ of size smaller than or equal to $k^*$, define $\delta_S = \delta\, 2^{-|S|} \binom{p}{k^*}^{-1}$. If we take $L^*$ and $L^*_1$ in (35-36) small enough, then $1 + \log[1/(\alpha_S \delta_S)]/(n_1 \wedge n_2)$ is smaller than some constant $L$ small enough that we can apply Theorem 4.3. Arguing as in the proof of this theorem, we derive that $\mathbb{P}\big[\min_{i \in \{V,1,2\}} Q_{i,S}(F_{S,i}|X_S) < \alpha_S\big] \geq 1 - \delta_S$ if
$$K_1(S) + K_2(S) \geq L\,\varphi_S \left(\frac{1}{n_1} + \frac{1}{n_2}\right)\left(|S|\log(p) + \log\frac{1}{\alpha\delta} + \log(p)\right).$$
Applying a union bound over all sets $S$ of size smaller than or equal to $k^*$ allows us to prove that $\mathbb{P}\big[\min_{i \in \{V,1,2\}} Q_{i,|\widehat{S}_\lambda|}(F_{\widehat{S}_\lambda,i}|X_{\widehat{S}_\lambda}) < \alpha_{\widehat{S}_\lambda}\big] \geq 1 - \delta$.
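The cost of this union bound is driven by the number of candidate subsets, $\sum_{k \leq k^*} \binom{p}{k}$. A toy count with hypothetical $p$ and $k^*$ (the uniform split of the level across subsets and the three statistics is an illustrative choice, not the paper's exact weights $\alpha_S$):

```python
import math

# The union bound runs over all subsets of size at most k*, so a
# worst-case uniform calibration divides the level by the number of
# such subsets (times the 3 statistics).  Hypothetical sizes:

def per_set_level(alpha, p, k_star):
    """Uniform per-(statistic, subset) level for a crude union bound."""
    n_sets = sum(math.comb(p, k) for k in range(1, k_star + 1))
    return alpha / (3 * n_sets)

p, k_star, alpha = 50, 3, 0.05
n_sets = sum(math.comb(p, k) for k in range(1, k_star + 1))
assert n_sets == 50 + 1225 + 19600        # C(50,1) + C(50,2) + C(50,3)
assert per_set_level(alpha, p, k_star) == alpha / (3 * n_sets)
```

The count grows like $p^{k^*}$, which is the combinatorial source of the $|S|\log(p)$ term in the detection rates.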
We now suppose that the event $\mathcal{A} \cap \mathcal{B}$ (defined in the last proof) holds; recall that $\mathbb{P}[\mathcal{A} \cap \mathcal{B}] \geq 1 - \delta/4 - 1/p$. We consider the set $\widehat{S}^{(2)}_\lambda$ defined as the support of $\widehat{\theta}^{(2)}_\lambda$; note that $\widehat{S}^{(2)}_\lambda \in \mathcal{S}^{(2)}_L \subset \mathcal{S}_{\mathrm{Lasso}}$.
Lemma 8.10. If we take the constants $L^*$ and $L^*_2$ in Proposition 4.6 small enough, then the following holds. There exists an event $\mathcal{C}$ of probability larger than $1 - 1/p$ such that, under $\mathcal{A} \cap \mathcal{B} \cap \mathcal{C}$, we have:
It follows that on $\mathcal{A} \cap \mathcal{B} \cap \mathcal{C}$: Arguing as in the proof of Lemma 8.7 and taking $L^*_2$ small enough, we derive an upper bound on $\widehat{\theta}$ under $\mathcal{A} \cap \mathcal{B}$. The Pythagorean inequality then gives a further bound, where we use the two previous upper bounds in the last line. Consequently, applying Lemma 8.9 to $\widehat{S}^{(2)}_\lambda$, using (71) and taking $L^*_3$ large enough allows us to conclude.

Technical lemmas
In this section, we recall some useful deviation inequalities for $\chi^2$ random variables [25] and for Wishart matrices [14].
Lemma 8.11. For any integer d > 0 and any positive number x,
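In its standard form, the $\chi^2$ deviation inequality of [25] states that for $X \sim \chi^2_d$ and any $x > 0$, $\mathbb{P}[X \geq d + 2\sqrt{dx} + 2x] \leq e^{-x}$. A seeded Monte Carlo sanity check of this upper bound (loose by construction, since the true tail lies well below it):

```python
import math
import random

random.seed(1)

# Chi-square deviation inequality (standard form of the result in [25]):
#   P[X >= d + 2*sqrt(d*x) + 2*x] <= exp(-x)  for X ~ chi2 with d d.o.f.
# Monte Carlo sanity check on a small grid of x values.

def chi2_draw(d):
    """Draw a chi-square variable as a sum of squared standard normals."""
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(d))

d, reps = 5, 20000
draws = [chi2_draw(d) for _ in range(reps)]
for x in (0.5, 1.0, 2.0):
    threshold = d + 2.0 * math.sqrt(d * x) + 2.0 * x
    emp_tail = sum(v >= threshold for v in draws) / reps
    assert emp_tail <= math.exp(-x)
```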