Generating knockoffs via conditional independence

Let $X$ be a $p$-variate random vector and $\widetilde{X}$ a knockoff copy of $X$ (in the sense of \cite{CFJL18}). A new approach for constructing $\widetilde{X}$ (henceforth, NA) has been introduced in \cite{JSPI}. NA has essentially three advantages: (i) building $\widetilde{X}$ is straightforward; (ii) the joint distribution of $(X,\widetilde{X})$ can be written in closed form; (iii) $\widetilde{X}$ is often optimal under various criteria. However, for NA to apply, $X_1,\ldots, X_p$ should be conditionally independent given some random element $Z$. Our first result is that any probability measure $\mu$ on $\mathbb{R}^p$ can be approximated by a probability measure $\mu_0$ of the form $$\mu_0\bigl(A_1\times\ldots\times A_p\bigr)=E\Bigl\{\prod_{i=1}^p P(X_i\in A_i\mid Z)\Bigr\}.$$ The approximation is in total variation distance when $\mu$ is absolutely continuous, and an explicit formula for $\mu_0$ is provided. If $X\sim\mu_0$, then $X_1,\ldots,X_p$ are conditionally independent. Hence, with a negligible error, one can assume $X\sim\mu_0$ and build $\widetilde{X}$ through NA. Our second result is a characterization of the knockoffs $\widetilde{X}$ obtained via NA. It is shown that $\widetilde{X}$ is of this type if and only if the pair $(X,\widetilde{X})$ can be extended to an infinite sequence so as to satisfy certain invariance conditions. The basic tool for proving this fact is de Finetti's theorem for partially exchangeable sequences. In addition to the quoted results, an explicit formula for the conditional distribution of $\widetilde{X}$ given $X$ is obtained in a few cases. In one such case, it is assumed $X_i\in\{0,1\}$ for all $i$.


Introduction
One of the main problems, both in statistics and machine learning, is to identify the explanatory variables which can be discarded because they have no meaningful effect on the response variable. To formalize, let $X_1,\ldots,X_p,Y$ be real random variables, where $Y$ is regarded as the response variable and $X_1,\ldots,X_p$ as the explanatory variables. A Markov blanket is a minimal subset $S\subset\{1,\ldots,p\}$ such that
$$Y\perp\!\!\!\perp (X_i: i\notin S)\mid (X_i: i\in S).$$
Under mild conditions, a Markov blanket $S$ exists, is unique, and $\{1,\ldots,p\}\setminus S$ can be written as
$$\{1,\ldots,p\}\setminus S=\bigl\{i: Y\perp\!\!\!\perp X_i\mid (X_1,\ldots,X_{i-1},X_{i+1},\ldots,X_p)\bigr\};$$
see e.g. [9, p. 558] and [11, p. 8]. The problem mentioned above is to identify $S$.
To any selection procedure concerned with this problem, we can associate the false discovery rate
$$E\Bigl\{\frac{|\widehat{S}\setminus S|}{|\widehat{S}|\vee 1}\Bigr\},$$
where $\widehat{S}$ denotes the estimate of $S$ provided by the procedure. As in the Neyman-Pearson theory, those selection procedures which keep the false discovery rate under control deserve special attention.
Roughly speaking, Barber and Candès' idea is to create an auxiliary vector $\widetilde{X}=(\widetilde{X}_1,\ldots,\widetilde{X}_p)$, called a knockoff copy of $X$, which is able to capture the connections among $X_1,\ldots,X_p$. Once $\widetilde{X}$ is given, each $X_i$ is selected or discarded based on a comparison between $X_i$ and $\widetilde{X}_i$. Intuitively, $\widetilde{X}_i$ plays the role of a control for $X_i$, and $X_i$ is selected if it appears to be considerably more associated with $Y$ than its knockoff copy $\widetilde{X}_i$. This procedure is a recent breakthrough in variable selection.
In addition to keeping the false discovery rate under control, it has other merits. In particular, it works whatever the conditional distribution of $Y$ given $X$ may be. More precisely, for the knockoff procedure to apply, one must assign $\mathcal{L}(X)$ but is not forced to specify $\mathcal{L}(Y\mid X)$. (Here and in the sequel, for any random elements $U$ and $V$, we denote by $\mathcal{L}(U)$ and $\mathcal{L}(U\mid V)$ the probability distribution of $U$ and the conditional distribution of $U$ given $V$, respectively.)
Let us make precise the conditions required of $\widetilde{X}$. For each $i\in\{1,\ldots,p\}$ and each point $x\in\mathbb{R}^{2p}$, define $f_i(x)\in\mathbb{R}^{2p}$ by swapping $x_i$ with $x_{p+i}$ and leaving all other coordinates of $x$ fixed. Then, $f_i:\mathbb{R}^{2p}\rightarrow\mathbb{R}^{2p}$ is a permutation. For instance, for $p=2$, one obtains $f_1(x)=(x_3,x_2,x_1,x_4)$ and $f_2(x)=(x_1,x_4,x_3,x_2)$. In this notation, $\widetilde{X}$ is a knockoff copy of $X$, or merely a knockoff, if (i) $f_i(X,\widetilde{X})\sim(X,\widetilde{X})$ for each $i\in\{1,\ldots,p\}$ and (ii) $\widetilde{X}\perp\!\!\!\perp Y\mid X$.
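For concreteness, here is a minimal sketch of the swap maps $f_i$ (0-based indexing; the code and names are our own illustration, not part of the paper):

import numpy as np

def swap(x, i, p):
    # Return f_i(x): swap coordinates i and p+i (0-based i) of a 2p-vector.
    y = x.copy()
    y[i], y[p + i] = y[p + i], y[i]
    return y

x = np.array([1.0, 2.0, 3.0, 4.0])   # p = 2, x = (x_1, x_2, x_3, x_4)
print(swap(x, 0, 2))                 # [3. 2. 1. 4.], i.e. f_1(x)
print(swap(x, 1, 2))                 # [1. 4. 3. 2.], i.e. f_2(x)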
For the knockoff procedure to apply, one must select $\mathcal{L}(X)$ and construct $\widetilde{X}$. However, obtaining $\widetilde{X}$ is not easy. Condition (ii) does not create any problems, for it is automatically true whenever $\widetilde{X}$ is built based only on $X$, neglecting any information about $Y$. On the contrary, condition (i) is quite difficult to realize. Current tractable methods to achieve (i) require conditions on $\mathcal{L}(X)$. To our knowledge, such methods are available only when $X$ is Gaussian [9], or the set of observed nodes in a hidden Markov model [19], or conditionally independent given some random element [8], [14]. The third condition (conditional independence) is discussed in Section 1.1 and includes the other two as special cases. There are also some universal algorithms, such as Sequential Conditional Independent Pairs [9] and the Metropolized Knockoff Sampler [5], which are virtually able to cover any choice of $\mathcal{L}(X)$. However, these algorithms do not provide a closed formula for $\widetilde{X}$. More importantly, they become computationally intractable as soon as $\mathcal{L}(X)$ is complex; see [5] and [14]. As a matter of fact, they work effectively only for some choices of $\mathcal{L}(X)$ (such as graphical models) but not for all. A last remark is that, even if one succeeds in building $\widetilde{X}$, the joint distribution of the pair $(X,\widetilde{X})$ could be unknown. This is a further shortcoming. In fact, after observing $X=x$, it would be natural to sample a value $\widetilde{x}$ for $\widetilde{X}$ from the conditional distribution $\mathcal{L}(\widetilde{X}\mid X=x)$, which is unavailable if $\mathcal{L}(X,\widetilde{X})$ is unknown. In a nutshell, the above remarks may be summarized as follows. If $X$ is not conditionally independent (in the sense of Section 1.1), then:
• How to build a reasonable knockoff $\widetilde{X}$ is unknown.
• The existing numerical algorithms are computationally heavy and may fail to work.
• Even if one succeeds in building $\widetilde{X}$, the joint distribution of the pair $(X,\widetilde{X})$ is unknown.
1.1. A new approach to knockoffs construction. As noted above, while powerful and effective, the knockoff procedure suffers from some shortcomings due to the difficulty of building a reasonable knockoff $\widetilde{X}$. Such shortcomings are partially overcome by a new method for constructing $\widetilde{X}$, based on conditional independence, introduced in [8]. Similar ideas were previously developed in [14]. Another related reference is [4]. In this section, we recall the main features of this method.
Suppose that $X_1,\ldots,X_p$ are conditionally independent given some random element $Z$. Denote by $\Theta$ the set where $Z$ takes values and by $\gamma$ the probability distribution of $Z$. Moreover, let $\mathcal{B}$ be the Borel $\sigma$-field on $\mathbb{R}$ and
$$P_i(A\mid\theta)=P(X_i\in A\mid Z=\theta)\quad\text{for all }A\in\mathcal{B}\text{ and }\theta\in\Theta.$$
Note that $\gamma$ is a probability measure on $\Theta$ and each $P_i(\cdot\mid\theta)$ is a probability measure on $\mathbb{R}$. Since $X_1,\ldots,X_p$ are conditionally independent given $Z$,
$$P\bigl(X_1\in A_1,\ldots,X_p\in A_p\bigr)=\int_\Theta\prod_{i=1}^p P_i(A_i\mid\theta)\,\gamma(d\theta).\qquad(1)$$
Hence, one can define a probability measure $\lambda$ on $\mathbb{R}^{2p}$ as
$$\lambda\bigl(A_1\times\ldots\times A_p\times B_1\times\ldots\times B_p\bigr)=\int_\Theta\prod_{i=1}^p P_i(A_i\mid\theta)\,P_i(B_i\mid\theta)\,\gamma(d\theta),\qquad(2)$$
where $A_i\in\mathcal{B}$ and $B_i\in\mathcal{B}$ for all $i$. In [8, Th. 12], it is shown that any $p$-variate random vector $\widetilde{X}$ such that $\mathcal{L}(X,\widetilde{X})=\lambda$ is a knockoff copy of $X$.
Thus, arguing as above, not only does one build $\widetilde{X}$ in a straightforward way, but one also obtains the joint distribution of $(X,\widetilde{X})$, namely $\mathcal{L}(X,\widetilde{X})=\lambda$ with $\lambda$ given by (2). The price to be paid is to assign $\mathcal{L}(X)$ so as to satisfy (1). (Recall that the choice of $\mathcal{L}(X)$ is a statistician's task.) But this price is not expensive, for two reasons. The first one is quite practical: the probability measures satisfying (1) are flexible enough to cover most real situations. Modeling $X_1,\ldots,X_p$ as conditionally independent (given some $Z$) is actually reasonable in a number of practical problems. The second reason is theoretical and is based on the results of this paper. Indeed, even if (1) fails, $\mathcal{L}(X)$ can be approximated arbitrarily well by probability measures satisfying (1); see Theorems 3 and 4 below.
The previous approach has two further advantages. First, $\widetilde{X}$ is often optimal under some criteria, such as mean absolute correlation and reconstructability. This is discussed in Example 1. However, we note right away that $\mathrm{cov}(X_i,\widetilde{X}_i)=\mathrm{var}\{E(X_i\mid Z)\}$ for each $i$. Second, even if it is not Bayesian from the conceptual point of view, the previous approach largely exploits Bayesian tools. Hence, to construct $\widetilde{X}$ and evaluate $\mathcal{L}(X,\widetilde{X})$, all the Bayesian machinery can be recovered.
To illustrate, suppose that $P_i(\cdot\mid\theta)$ admits a density $f_i(\cdot\mid\theta)$ with respect to some dominating measure $\lambda_i$. For instance, $\lambda_i$ could be Lebesgue measure or counting measure. Then, the densities of $X$ and $(X,\widetilde{X})$ are, respectively,
$$h(x)=\int_\Theta\prod_{i=1}^p f_i(x_i\mid\theta)\,\gamma(d\theta)\quad\text{and}\quad f(x,\widetilde{x})=\int_\Theta\prod_{i=1}^p f_i(x_i\mid\theta)\,f_i(\widetilde{x}_i\mid\theta)\,\gamma(d\theta),$$
where $x$ and $\widetilde{x}$ denote points of $\mathbb{R}^p$. In turn, assuming $h(x)>0$ for the sake of simplicity, the conditional density of $\widetilde{X}$ given $X=x$ can be written as
$$g(\widetilde{x}\mid x)=\frac{f(x,\widetilde{x})}{h(x)}.$$
Therefore, we have an explicit formula for $\mathcal{L}(\widetilde{X}\mid X=x)$.
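In practice, this suggests a two-step sampling scheme: draw $\theta$ from the posterior of $Z$ given $X=x$, then draw the $\widetilde{X}_i$ independently from $P_i(\cdot\mid\theta)$. Below is a minimal sketch under a toy normal-normal model of our own choosing (not one from the paper): $X_i\mid Z=\theta$ i.i.d. $\mathcal{N}(\theta,1)$ with $\theta\sim\mathcal{N}(0,1)$.

import numpy as np

rng = np.random.default_rng(0)

def sample_cik(x):
    # Step 1: draw theta from the posterior L(Z | X = x).
    # Normal-normal conjugacy: posterior is N(sum(x)/(p+1), 1/(p+1)).
    p = len(x)
    post_var = 1.0 / (p + 1.0)
    theta = rng.normal(post_var * x.sum(), np.sqrt(post_var))
    # Step 2: draw the knockoff coordinates i.i.d. from P_i(. | theta).
    return rng.normal(theta, 1.0, size=p)

x = rng.normal(size=5)      # pretend this is the observed X
x_tilde = sample_cik(x)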
In the rest of this paper, to make the exposition easier, a knockoff $\widetilde{X}$ obtained as above (i.e., a knockoff $\widetilde{X}$ satisfying equation (2)) is said to be a conditional independence knockoff (CIK). To highlight the connection between $X$ and $\widetilde{X}$, we also say that $\widetilde{X}$ is the CIK of $X$.
1.2. Content of this paper. This paper is basically a follow-up of [8]. It consists of two results, two examples, and a numerical experiment. The results are theoretical. They aim to characterize the CIKs, to show that they can be applied to virtually any real situation, and to highlight some of their optimality properties. The examples provide an explicit formula for $\mathcal{L}(\widetilde{X}\mid X=x)$ in two meaningful cases: mixtures of 2-valued (or 3-valued) distributions and mixtures of centered normal distributions. In particular, the first example deals with the case $X_i\in\{0,1\}$ for all $i$. Such a case is important in applications, mainly in a genetic framework. Nevertheless, apart from our example, we are not aware of any theoretical investigation of this case. Finally, in the numerical experiment, the CIKs are tested against simulated and real data.
In the sequel, for any $d\ge 1$, a probability measure on $\mathbb{R}^d$ is called absolutely continuous if it admits a density with respect to Lebesgue measure on $\mathbb{R}^d$. Moreover, $\mathcal{P}$ is the class of all probability measures on $\mathbb{R}^p$ and $\mathcal{P}_0\subset\mathcal{P}$ is the subclass consisting of those $\mu_0\in\mathcal{P}$ of the form
$$\mu_0\bigl(A_1\times\ldots\times A_p\bigr)=\int_\Theta\prod_{i=1}^p P_i(A_i\mid\theta)\,\gamma(d\theta)$$
for some choice of $\Theta$, $\gamma$ and $P_i(\cdot\mid\theta)$ such that $P_i(\cdot\mid\theta)$ is absolutely continuous for all $i$ and $\theta$.
We next briefly describe our two results. Moreover, by means of an example, we point out some optimality properties of the CIKs.
Our first result (henceforth, R1) is that, for all $\mu\in\mathcal{P}$ and $\epsilon>0$, there is $\mu_0\in\mathcal{P}_0$ such that $d_{BL}(\mu,\mu_0)<\epsilon$ and, if $\mu$ is absolutely continuous, $d_{TV}(\mu,\mu_0)<\epsilon$. In addition, an explicit formula for $\mu_0$ is provided. Here, $d_{BL}$ and $d_{TV}$ are the bounded Lipschitz metric and the total variation metric, respectively. Their definitions are recalled in Section 2.
The motivation for R1 is that, to build a CIK, one needs $\mathcal{L}(X)\in\mathcal{P}_0$. This is not guaranteed, however, since the choice of $\mathcal{L}(X)$ is not subject to any constraint. Hence, it is natural to investigate whether $\mathcal{L}(X)$ can at least be approximated by elements of $\mathcal{P}_0$. Because of R1, this is actually true. Roughly speaking, R1 aims to support $\mathcal{P}_0$ by showing that its elements are (approximately) able to model any real situation.
In addition to the previous motivation, R1 also has some practical utility. Suppose $\mathcal{L}(X)=\mu$ for some $\mu\in\mathcal{P}$. To fix ideas, suppose $\mu$ is absolutely continuous. If $\mu$ is arbitrary, how to build a reasonable knockoff $\widetilde{X}$ is unknown. However, given $\epsilon>0$, there is $\mu_0\in\mathcal{P}_0$ such that $d_{TV}(\mu,\mu_0)<\epsilon$. Such a $\mu_0$ can be built explicitly (recall that R1 provides an explicit formula for $\mu_0$). Denote by $T$ a $p$-variate random vector such that $\mathcal{L}(T)=\mu_0$. Since $\mu_0\in\mathcal{P}_0$, the CIK $\widetilde{T}$ of $T$ can be obtained straightforwardly. Then,
$$d_{TV}\bigl(\mathcal{L}(\widetilde{X}),\mathcal{L}(\widetilde{T})\bigr)=d_{TV}(\mu,\mu_0)<\epsilon$$
for any knockoff copy $\widetilde{X}$ of $X$. Hence, by the robustness properties of the knockoff procedure [3], $\widetilde{T}$ should be a reasonable approximation of $\widetilde{X}$.
Our second result (henceforth, R2) is a characterization of the CIKs. Let $\mathcal{K}$ denote the class of the CIKs, that is,
$$\mathcal{K}=\bigl\{\widetilde{X}:\mathcal{L}(X,\widetilde{X})\text{ admits representation (2) for some }\Theta,\ \gamma\text{ and }P_i(\cdot\mid\theta)\bigr\}.$$
Moreover, for any knockoff $\widetilde{X}$, say that $(X,\widetilde{X})$ is infinitely extendable if there is an (infinite) sequence $V=(V_1,V_2,\ldots)$ of real random variables such that:
• $(V_1,\ldots,V_{2p})\sim(X,\widetilde{X})$;
• $V$ satisfies the same invariance condition as $(X,\widetilde{X})$ (this condition is formalized in Section 2.2).
Then, R2 states that $\widetilde{X}\in\mathcal{K}$ if and only if $(X,\widetilde{X})$ is infinitely extendable.
Hence, if $(X,\widetilde{X})$ is required to be infinitely extendable, then $X$ must be conditionally independent (given some $Z$) and $\widetilde{X}$ must be the CIK of $X$. The proof of R2 is based on de Finetti's theorem for partially exchangeable sequences.
Based on R2, a question is whether infinite extendability of $(X,\widetilde{X})$ is a reasonable condition. To answer, two facts are to be stressed. Firstly, by de Finetti's theorem, infinite extendability of $(X,\widetilde{X})$ essentially amounts to conditional independence of $X$ and $\widetilde{X}$. Secondly, for the knockoff procedure to have a low type II error rate, it is desirable that $X$ and $\widetilde{X}$ be "as independent as possible"; see e.g. [9, p. 563] and [20]. Now, to have $X$ and $\widetilde{X}$ as independent as possible, a reasonable strategy is to take $X$ and $\widetilde{X}$ conditionally independent, or equivalently to require $(X,\widetilde{X})$ to be infinitely extendable.
Example 1. (Optimality of the CIKs). Suppose $E(X_i^2)<\infty$ and $\mathrm{var}(X_i)>0$ for all $i$. Obviously, $\widetilde{X}$ should be selected so as to make the power of the knockoff procedure as high as possible. To this end, two criteria are to minimize the mean absolute correlation
$$\frac{1}{p}\sum_{i=1}^p\bigl|\mathrm{cor}(X_i,\widetilde{X}_i)\bigr|$$
and to minimize a reconstructability index; see [5] and [20] for the latter. The first criterion (mean absolute correlation) is quite popular in the machine learning community. At least in some cases, however, it is outperformed by the second (reconstructability); see [5] and [20]. Note also that $\mathrm{cov}(X_i,\widetilde{X}_i)=\mathrm{var}\{E(X_i\mid Z)\}$ whenever $\widetilde{X}$ is a CIK (see the derivation after this example). Suppose now that $X_1,\ldots,X_p$ are conditionally independent, given some random element $Z$, and $\widetilde{X}$ is the CIK of $X$. Suppose also that $E(X_i\mid Z)=0$ for all $i$. Then, $\mathrm{cov}(X_i,\widetilde{X}_i)=0$ for all $i$, so that the mean absolute correlation vanishes. Therefore, $\widetilde{X}$ is optimal under the first criterion. Moreover, $X$ and $\widetilde{X}$ are conditionally independent given $Z$, and the reconstructability index attains its minimum value. Therefore, $\widetilde{X}$ is optimal under the second criterion as well.
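For completeness, here is a short verification of the covariance identity used above (our own derivation; it uses only the fact that, under (2), $X_i$ and $\widetilde{X}_i$ are conditionally i.i.d. given $Z$):
$$E(X_i\widetilde{X}_i)=E\bigl\{E(X_i\mid Z)\,E(\widetilde{X}_i\mid Z)\bigr\}=E\bigl\{E(X_i\mid Z)^2\bigr\},$$
and, since $E(\widetilde{X}_i)=E(X_i)=E\{E(X_i\mid Z)\}$,
$$\mathrm{cov}(X_i,\widetilde{X}_i)=E\bigl\{E(X_i\mid Z)^2\bigr\}-\bigl[E\{E(X_i\mid Z)\}\bigr]^2=\mathrm{var}\bigl\{E(X_i\mid Z)\bigr\}\ge 0.$$
In particular, $\mathrm{cov}(X_i,\widetilde{X}_i)=0$ whenever $E(X_i\mid Z)=0$.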

Theoretical results
We first recall some (well known) definitions. A function $f:\mathbb{R}^p\rightarrow\mathbb{R}$ is Lipschitz if
$$|f(x)-f(y)|\le b\,\|x-y\|\quad\text{for some constant }b\text{ and all }x,y\in\mathbb{R}^p,$$
where $\|\cdot\|$ is the Euclidean norm. In this case, we also say that $f$ is $b$-Lipschitz or that $b$ is a Lipschitz constant for $f$.
Recall that $\mathcal{P}$ denotes the class of all probability measures on $\mathbb{R}^p$. The bounded Lipschitz metric $d_{BL}$ and the total variation metric $d_{TV}$ are two distances on $\mathcal{P}$. If $\mu,\nu\in\mathcal{P}$, they are defined as
$$d_{BL}(\mu,\nu)=\sup_g\,\Bigl|\int g\,d\mu-\int g\,d\nu\Bigr|\quad\text{and}\quad d_{TV}(\mu,\nu)=\sup_A\,|\mu(A)-\nu(A)|,$$
where $\sup_g$ is over the 1-Lipschitz functions $g:\mathbb{R}^p\rightarrow[-1,1]$ and $\sup_A$ is over the Borel subsets $A\subset\mathbb{R}^p$. Among other things, $d_{BL}$ has the property that
$$\mu_n\rightarrow\mu\ \text{weakly}\quad\Longleftrightarrow\quad d_{BL}(\mu_n,\mu)\rightarrow 0,$$
where $\mu_n,\mu\in\mathcal{P}$. We also note that $d_{BL}$ and $d_{TV}$ are connected through the inequality $d_{BL}\le 2\,d_{TV}$.
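The constant 2 in this inequality can be checked directly (a one-line verification of ours): for any 1-Lipschitz $g:\mathbb{R}^p\rightarrow[-1,1]$, writing $\mu-\nu=\eta^+-\eta^-$ via the Jordan decomposition, with $\eta^+(\mathbb{R}^p)=\eta^-(\mathbb{R}^p)=d_{TV}(\mu,\nu)$,
$$\Bigl|\int g\,d\mu-\int g\,d\nu\Bigr|\le\|g\|_\infty\,\bigl\{\eta^+(\mathbb{R}^p)+\eta^-(\mathbb{R}^p)\bigr\}\le 2\,d_{TV}(\mu,\nu),$$
and taking the supremum over $g$ yields $d_{BL}\le 2\,d_{TV}$.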
We next turn to our main results.

2.1. $\mathcal{P}_0$ is dense in $\mathcal{P}$. Let $\mathcal{P}_0$ be the class of those probability measures $\mu_0\in\mathcal{P}$ which can be written as
$$\mu_0\bigl(A_1\times\ldots\times A_p\bigr)=\int_\Theta\prod_{i=1}^p P_i(A_i\mid\theta)\,\gamma(d\theta)$$
for some choice of $\Theta$, $\gamma$ and $P_i(\cdot\mid\theta)$. To avoid trivialities, $P_i(\cdot\mid\theta)$ is assumed to be absolutely continuous for all $i=1,\ldots,p$ and $\theta\in\Theta$. The latter assumption is motivated by the next example.

Example 2. (Why absolute continuity of the $P_i(\cdot\mid\theta)$ is needed). Given any $\mu\in\mathcal{P}$, take $\Theta=\mathbb{R}^p$, $\gamma=\mu$ and
$$P_i(A\mid\theta)=\delta_{\theta_i}(A),\qquad(3)$$
where $\theta=(\theta_1,\ldots,\theta_p)$ denotes a point of $\mathbb{R}^p$ and $\delta_{\theta_i}$ is the point mass at $\theta_i$. Then,
$$\int_\Theta\prod_{i=1}^p P_i(A_i\mid\theta)\,\gamma(d\theta)=\mu\bigl(A_1\times\ldots\times A_p\bigr).$$
Hence, without some further constraint (such as $P_i(\cdot\mid\theta)$ absolutely continuous), one would obtain $\mathcal{P}_0=\mathcal{P}$ with $\Theta$, $\gamma$ and $P_i(\cdot\mid\theta)$ as in (3). However, this is not practically useful. In fact, under (3), the CIK $\widetilde{X}$ of $X$ is the trivial knockoff $\widetilde{X}=X$, which is unsuitable to perform the knockoff procedure.
If $\mathcal{L}(X)\in\mathcal{P}_0$, it is straightforward to obtain the CIK $\widetilde{X}$ of $X$ and to write $\mathcal{L}(X,\widetilde{X})$ in closed form. But clearly it may be that $\mathcal{L}(X)\notin\mathcal{P}_0$. In this case, it is quite natural to investigate whether $\mathcal{L}(X)$ can be approximated by elements of $\mathcal{P}_0$. This is actually possible, and the approximation is very strong if $\mathcal{L}(X)$ is absolutely continuous.
Theorem 3. For all $\mu\in\mathcal{P}$ and $\epsilon>0$, there is $\mu_0\in\mathcal{P}_0$ such that $d_{BL}(\mu,\mu_0)<\epsilon$. In particular, one such $\mu_0$ is
$$\mu_0(A)=\int_{\mathbb{R}^p}\mathcal{N}_p(x,cI)(A)\,\mu(dx)\quad\text{for all Borel sets }A\subset\mathbb{R}^p,\qquad(4)$$
where $c=\epsilon^2/(2p)$ and $\mathcal{N}_p(x,cI)$ denotes the Gaussian law on $\mathbb{R}^p$ with mean $x$ and covariance matrix $cI$, i.e., the law with density $t\mapsto(2\pi c)^{-p/2}\exp\bigl\{-\|t-x\|^2/(2c)\bigr\}$.
Theorem 4. Suppose $\mu\in\mathcal{P}$ is absolutely continuous. Then, for each $\epsilon>0$, there is $\mu_0\in\mathcal{P}_0$ such that $d_{TV}(\mu,\mu_0)<\epsilon$. Moreover, if $\mu$ has a Lipschitz density, one such $\mu_0$ can be defined by (4) with a suitable constant $c$ depending only on $\epsilon$, $b$ and $m(B)$, where $b$ is a Lipschitz constant for the density of $\mu$, $m$ is the Lebesgue measure on $\mathbb{R}^p$ and $B\subset\mathbb{R}^p$ is any Borel set satisfying $\mu(B^c)<\epsilon/2$ and $0<m(B)<\infty$.
Theorems 3 and 4 are proved in the Supplementary Material.
It is worth noting that, in addition to (4), there are other laws $\mu_0\in\mathcal{P}_0$ satisfying the inequalities $d_{BL}(\mu,\mu_0)<\epsilon$ or $d_{TV}(\mu,\mu_0)<\epsilon$. Moreover, in the second part of Theorem 4, the Lipschitz condition on the density of $\mu$ can be weakened at the price of making $\mu_0$ slightly more involved.
The motivation for Theorems 3-4 has been mentioned in Section 1.2. In short, if $\mathcal{L}(X)\notin\mathcal{P}_0$, the CIK $\widetilde{X}$ of $X$ cannot be built. However, Theorems 3-4 imply that $\mathcal{L}(X)$ can be approximated by elements of $\mathcal{P}_0$. Hence, with a negligible error, it can be assumed that $X\sim\mu_0$, and the CIK $\widetilde{X}$ of $X$ can be easily obtained. This is our main motivation. However, Theorems 3-4 have a practical implication as well. Suppose $\mathcal{L}(X)=\mu$ for some $\mu\in\mathcal{P}$. To fix ideas, suppose $\mu$ is absolutely continuous with a Lipschitz density. Fix $\epsilon>0$, define $\mu_0\in\mathcal{P}_0$ as in Theorem 4, and call $T$ a $p$-variate vector such that $\mathcal{L}(T)=\mu_0$. Since $\mu_0\in\mathcal{P}_0$, the CIK $\widetilde{T}$ of $T$ can be easily built. Moreover, given any knockoff copy $\widetilde{X}$ of $X$, since $X\sim\widetilde{X}\sim\mu$ and $T\sim\widetilde{T}\sim\mu_0$, Theorem 4 yields
$$d_{TV}\bigl(\mathcal{L}(\widetilde{X}),\mathcal{L}(\widetilde{T})\bigr)=d_{TV}(\mu,\mu_0)<\epsilon.$$
Therefore, by the robustness properties of the knockoff procedure [3], $\widetilde{T}$ is expected to be a reasonable approximation of $\widetilde{X}$. (Obviously, the latter claim should be supported by a numerical comparison of the power and the false discovery rate corresponding to $\widetilde{X}$ and $\widetilde{T}$. Such a comparison is not trivial, however, since $\widetilde{X}$ is unknown for arbitrary $\mu$.)
Finally, two remarks are in order. The first is summarized by the following lemma.

Lemma 5. Let $\mu_0$ be as in (4), $T$ a $p$-variate random vector with $\mathcal{L}(T)=\mu_0$, and $\widetilde{T}$ the CIK of $T$. Then,
$$\mathcal{L}(T,\widetilde{T})\bigl(A\times B\bigr)=\int_{\mathbb{R}^p}\mathcal{N}_p(x,cI)(A)\,\mathcal{N}_p(x,cI)(B)\,\mu(dx)\quad\text{for all Borel sets }A,B\subset\mathbb{R}^p.$$

Proof. For any Borel sets $A,B\subset\mathbb{R}^p$, one obtains the above identity by applying (2) with $\Theta=\mathbb{R}^p$, $\gamma=\mu$ and $P_i(\cdot\mid\theta)=\mathcal{N}_1(\theta_i,c)$. □

Lemma 5 makes clear the structure of $\mathcal{L}(T,\widetilde{T})$ and may be useful for sampling from this distribution.
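As an illustration, here is a minimal sketch of the sampling scheme suggested by Lemma 5: draw a latent point from $\mu$, then draw $T$ and $\widetilde{T}$ conditionally i.i.d. from the corresponding Gaussian. The target law $\mu$ (a two-component Gaussian mixture) and all parameter values are our own illustrative choices.

import numpy as np

rng = np.random.default_rng(1)
p, eps = 3, 0.5
c = eps**2 / (2 * p)       # the constant of Theorem 3

def sample_mu():
    # Toy target mu: mixture of N(-2, I) and N(2, I) on R^p.
    center = 2.0 if rng.random() < 0.5 else -2.0
    return rng.normal(center, 1.0, size=p)

z = sample_mu()                              # latent draw, Z ~ mu
T = rng.normal(z, np.sqrt(c), size=p)        # T ~ mu_0 as in (4)
T_tilde = rng.normal(z, np.sqrt(c), size=p)  # CIK: cond. i.i.d. given Z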
The second remark is that, if $\mathcal{L}(X,\widetilde{X})$ is absolutely continuous and has a Lipschitz density, the pair $(T,\widetilde{T})$ can be taken such that $d_{TV}\bigl(\mathcal{L}(X,\widetilde{X}),\mathcal{L}(T,\widetilde{T})\bigr)<\epsilon$. In the notation $\mu^*=\mathcal{L}(X,\widetilde{X})$ and $\mu_0^*=\mathcal{L}(T,\widetilde{T})$, it suffices to let
$$\mu_0^*(A)=\int_{\mathbb{R}^{2p}}\mathcal{N}_{2p}(x,cI)(A)\,\mu^*(dx)\quad\text{for all Borel sets }A\subset\mathbb{R}^{2p},$$
where $c$ is a suitable constant. Thus, $\mathcal{L}(X,\widetilde{X})$ can be approximated in total variation by $\mathcal{L}(T,\widetilde{T})$ for any knockoff $\widetilde{X}$ which makes $\mathcal{L}(X,\widetilde{X})$ absolutely continuous with a Lipschitz density. While this fact is theoretically meaningful and supports the CIKs further, the above formula for $\mu_0^*$ has little practical use, since $\mu^*$ is generally unknown (it is even unknown how to obtain $\widetilde{X}$).

2.2. A characterization of the CIKs. Recall that
$$\mathcal{K}=\bigl\{\widetilde{X}:\mathcal{L}(X,\widetilde{X})\text{ admits representation (2) for some }\Theta,\ \gamma\text{ and }P_i(\cdot\mid\theta)\bigr\}$$
is the class of the CIKs of $X$. Such a $\mathcal{K}$ does not include all possible knockoffs. Here is a trivial example.

Example 6. (Not every knockoff is a CIK). Suppose that $X_1,\ldots,X_p$ are i.i.d. with $P(X_1=0)=P(X_1=1)=1/2$. In this case, it would be natural to take $\widetilde{X}$ as an independent copy of $X$. But suppose we let $\widetilde{X}_i=1-X_i$ for all $i$. Then, for all $a,b\in\{0,1\}^p$, $P(X=a,\widetilde{X}=b)=P(X=a)$ if $b_i=1-a_i$ for each $i=1,\ldots,p$, while $P(X=a,\widetilde{X}=b)=0$ otherwise. Based on this fact, it is straightforward to verify that $\widetilde{X}$ is a knockoff copy of $X$. However, since $\mathrm{cov}(X_i,\widetilde{X}_i)<0$, while any CIK satisfies $\mathrm{cov}(X_i,\widetilde{X}_i)=\mathrm{var}\{E(X_i\mid Z)\}\ge 0$, one obtains $\widetilde{X}\notin\mathcal{K}$.

Based on Example 6, a question is how to identify the members of $\mathcal{K}$ among all possible knockoffs $\widetilde{X}$. To answer this question, we recall that $(X,\widetilde{X})$ is said to be infinitely extendable if there exists an (infinite) sequence $V=(V_1,V_2,\ldots)$ of real random variables such that $(V_1,\ldots,V_{2p})\sim(X,\widetilde{X})$ and $V$ satisfies the same invariance condition as $(X,\widetilde{X})$.
Formally, the latter requirement should be understood as follows. Given three integers $i,j,k$ with $1\le i\le p$ and $j,k\ge 0$, define a new sequence $V^*=(V^*_1,V^*_2,\ldots)$ by swapping $V_{kp+i}$ with $V_{jp+i}$ and leaving all other elements of $V$ fixed, that is,
$$V^*_{kp+i}=V_{jp+i},\qquad V^*_{jp+i}=V_{kp+i},\qquad V^*_n=V_n\ \text{ for }n\notin\{kp+i,\,jp+i\}.$$
Then, $V$ is required to satisfy
$$V^*\sim V\quad\text{for all such }i,\,j,\,k.\qquad(5)$$
Condition (5) is nothing but a form of partial exchangeability; see [1] and [10]. In fact, the main tool for proving the next result is de Finetti's theorem for partially exchangeable sequences.
Theorem 7. Let $\widetilde{X}$ be a knockoff copy of $X$. Then, $\widetilde{X}\in\mathcal{K}$ if and only if $(X,\widetilde{X})$ is infinitely extendable.
The essence of Theorem 7 is that, if $(X,\widetilde{X})$ is required to be infinitely extendable, then $X$ must be conditionally independent (given some $Z$) and $\widetilde{X}$ must be the CIK of $X$. One reason for requiring infinite extendability has been given in Section 1.2. Essentially, infinite extendability of $(X,\widetilde{X})$ amounts to conditional independence between $X$ and $\widetilde{X}$, which in turn implies optimality of $\widetilde{X}$ under various criteria for increasing the power of the knockoff procedure; see Example 1.

2-valued and 3-valued covariates
In applications, an important special case is $X_i\in\{0,1\}$. In a genetic framework, for instance, $X_i=0$ or $X_i=1$ according to whether the $i$-th gene is absent or present. Another meaningful case is $X_i\in\{0,1,2\}$, where $X_i=2$ can be given various interpretations. For instance, $X_i=2$ could mean that the absence/presence of the $i$-th gene cannot be established. Despite their practical significance, to our knowledge, these cases have not received much theoretical attention to date. In this section, we try to fill this gap. We aim to build a CIK $\widetilde{X}$ when $X$ is a vector of 2-valued or 3-valued random variables.
There are obviously various intermediate cases: for instance, some covariates may be 2-valued, others 3-valued, and the remaining ones may have a continuous distribution function. Here, we only focus on two extreme situations: either all covariates are 2-valued or all are 3-valued.
To this end, in the 2-valued case, take $\Theta=(0,1)^p$ and define
$$f_i(x_i\mid\theta)=\theta_i^{x_i}(1-\theta_i)^{1-x_i}\quad\text{for }x_i\in\{0,1\},$$
so that $P(X_i=1\mid Z=\theta)=\theta_i$. Then, by the formulas of Section 1.1,
$$h(x)=\int_\Theta\prod_{i=1}^p\theta_i^{x_i}(1-\theta_i)^{1-x_i}\,\gamma(d\theta)\quad\text{and}\quad f(x,\widetilde{x})=\int_\Theta\prod_{i=1}^p\theta_i^{x_i+\widetilde{x}_i}(1-\theta_i)^{2-x_i-\widetilde{x}_i}\,\gamma(d\theta).$$
Finally,
$$g(\widetilde{x}\mid x)=\frac{f(x,\widetilde{x})}{h(x)}=\frac{\int_\Theta\prod_{i=1}^p\theta_i^{x_i+\widetilde{x}_i}(1-\theta_i)^{2-x_i-\widetilde{x}_i}\,\gamma(d\theta)}{\int_\Theta\prod_{i=1}^p\theta_i^{x_i}(1-\theta_i)^{1-x_i}\,\gamma(d\theta)}.$$
The 3-valued case is analogous, with $\theta_i$ replaced by a probability vector on $\{0,1,2\}$.
We now have an explicit formula for $\mathcal{L}(\widetilde{X}\mid X=x)$. In a sense, this is the best we can do. In fact, after observing $X=x$, a value $\widetilde{x}$ for $\widetilde{X}$ can be drawn directly from $\mathcal{L}(\widetilde{X}\mid X=x)$, as in the sketch below.
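For illustration, here is a minimal sketch assuming a product prior $\gamma=\bigotimes_{i=1}^p\mathrm{Beta}(\alpha_i,\beta_i)$, a choice of ours and not necessarily the paper's. Under this prior, conjugacy gives $\theta_i\mid x_i\sim\mathrm{Beta}(\alpha_i+x_i,\,\beta_i+1-x_i)$, and hence $P(\widetilde{X}_i=1\mid X=x)=(\alpha_i+x_i)/(\alpha_i+\beta_i+1)$ with the $\widetilde{X}_i$ conditionally independent.

import numpy as np

rng = np.random.default_rng(2)

def sample_binary_cik(x, alpha, beta):
    # Predictive probability under the (assumed) Beta product prior:
    # P(X_tilde_i = 1 | X = x) = (alpha_i + x_i) / (alpha_i + beta_i + 1).
    prob = (alpha + x) / (alpha + beta + 1.0)
    return rng.binomial(1, prob)

p = 6
x = rng.binomial(1, 0.5, size=p)    # observed binary covariates
x_tilde = sample_binary_cik(x, np.full(p, 1.0), np.full(p, 1.0))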
Mixtures of centered normal distributions

Mixtures of centered normal distributions allow one to model various real situations while preserving some properties of the Gaussian laws. For this reason, they are quite popular in applications; see e.g. [15] and references therein. Among other things, they are involved in Bayesian inference for logistic models [16] and they arise as the limit laws in the CLT for exchangeable random variables [7, Sect. 3].
Throughout this section, $X_1,\ldots,X_p$ are conditionally independent given $Z=\theta$, with $X_i\mid Z=\theta\sim\mathcal{N}(0,\theta_i)$ and $\theta\in\Theta=(0,\infty)^p$. A further motivation for this type of model is that $E(X_i\mid\theta)=0$. Hence, the CIKs are optimal and, in particular, $\mathrm{cov}(X_i,\widetilde{X}_i)=0$ for all $i$; see Example 1.
To build a CIK, a "prior" $\gamma$ on $\Theta=(0,\infty)^p$ is to be selected. Quite surprisingly, to our knowledge, the choice of $\gamma$ seems to be almost neglected in the Bayesian literature (apart from the special case $\theta_1=\ldots=\theta_p$); see e.g. [13]. We next propose two choices of $\gamma$. As in Section 3, we let $\theta=\lambda c$, where $\lambda>0$ is a scalar and $c=(c_1,\ldots,c_p)$ a vector such that $c_i>0$ for all $i$.
4.1. First choice of $\gamma$. We first assume that $\lambda$ is random but $c$ is not. Equivalently, we suppose that the ratios $\theta_i/\theta_j=c_i/c_j$ are non-random and known. While simple, this assumption makes sense in various applications, for instance in a financial framework.
The random variable $\lambda$ is given an inverse gamma distribution with parameters $a>0$ and $b>0$, that is, $\lambda$ has density
$$\psi(\lambda)=\frac{b^a}{\Gamma(a)}\,\lambda^{-a-1}\,e^{-b/\lambda},\qquad\lambda>0.$$
In this case, the density of $(X,\widetilde{X})$ is
$$f(x,\widetilde{x})=\int_0^\infty\prod_{i=1}^p\mathcal{N}(x_i;0,\lambda c_i)\,\mathcal{N}(\widetilde{x}_i;0,\lambda c_i)\,\psi(\lambda)\,d\lambda=\frac{\Gamma(a+p)\,b^a}{\Gamma(a)\,(2\pi)^p\prod_{i=1}^p c_i}\,\Bigl(b+\frac{s(x)+s(\widetilde{x})}{2}\Bigr)^{-(a+p)},$$
where $x$ and $\widetilde{x}$ are points of $\mathbb{R}^p$ and $s(x)=\sum_{i=1}^p x_i^2/c_i$.
Similarly, the density of $X$ is
$$h(x)=\frac{\Gamma(a+p/2)\,b^a}{\Gamma(a)\,(2\pi)^{p/2}\prod_{i=1}^p c_i^{1/2}}\,\Bigl(b+\frac{s(x)}{2}\Bigr)^{-(a+p/2)}.$$
It is worth noting that $f$ and $h$ are densities of Student's-$t$ distributions. We recall that the $m$-variate Student's-$t$ distribution with $k$ degrees of freedom is the absolutely continuous distribution on $\mathbb{R}^m$ with density
$$t(v)=\frac{\Gamma\bigl((k+m)/2\bigr)}{\Gamma(k/2)\,(k\pi)^{m/2}\,|\Sigma|^{1/2}}\,\Bigl(1+\frac{v^{\mathsf T}\Sigma^{-1}v}{k}\Bigr)^{-(k+m)/2},$$
where $\Sigma$ is a symmetric positive definite $m\times m$ matrix. Hence, one obtains that $h$ is the density of the $p$-variate Student's-$t$ distribution with $k=2a$ and $\Sigma=(b/a)\,\mathrm{diag}(c_1,\ldots,c_p)$, while $f$ is the density of the $2p$-variate Student's-$t$ distribution with $k=2a$ and $\Sigma=(b/a)\,\mathrm{diag}(c_1,\ldots,c_p,c_1,\ldots,c_p)$. Finally, the conditional density of $\widetilde{X}$ given $X=x$ can be written as
$$g(\widetilde{x}\mid x)=\frac{f(x,\widetilde{x})}{h(x)}.$$
Once again, $g(\cdot\mid x)$ is the density of a Student's-$t$ distribution (with parameters depending on $x$). To see this, it suffices to let $m=p$, $k=2a+p$, and
$$\Sigma=\frac{2b+s(x)}{2a+p}\,\mathrm{diag}(c_1,\ldots,c_p).$$
Thus, we have an explicit formula for $g(\cdot\mid x)$, and this is quite useful in applications.
A numerical example is in Section 5.
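Meanwhile, here is a minimal sketch of sampling from $g(\cdot\mid x)$, following the Student's-$t$ form derived above (the function names and parameter values are our own illustration):

import numpy as np

rng = np.random.default_rng(3)

def mvt(df, scale_diag):
    # Centered multivariate t with diagonal scale matrix: a Gaussian
    # vector divided by a common sqrt(chi^2_df / df) factor.
    z = rng.normal(0.0, np.sqrt(scale_diag))
    return z / np.sqrt(rng.chisquare(df) / df)

def sample_cik_t(x, a, b, c):
    p = len(x)
    k = 2 * a + p                                # degrees of freedom
    scale = (2 * b + np.sum(x**2 / c)) / k * c   # diagonal of Sigma
    return mvt(k, scale)

a, b = 6.0, 10.0
p = 4
c = np.arange(1, p + 1, dtype=float)             # c_i = i, as in Section 5
x = mvt(2 * a, (b / a) * c)                      # X ~ h
x_tilde = sample_cik_t(x, a, b, c)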

4.2. Second choice of $\gamma$. Suppose now that $c$ is random and independent of $\lambda$. Let $c$ be given an absolutely continuous distribution with density $q$. Then, $f$, $h$ and $g$ turn into
$$f(x,\widetilde{x})=\int\!\!\int\prod_{i=1}^p\mathcal{N}(x_i;0,\lambda c_i)\,\mathcal{N}(\widetilde{x}_i;0,\lambda c_i)\,\psi(\lambda)\,q(c)\,d\lambda\,dc,$$
$$h(x)=\int\!\!\int\prod_{i=1}^p\mathcal{N}(x_i;0,\lambda c_i)\,\psi(\lambda)\,q(c)\,d\lambda\,dc,\qquad g(\widetilde{x}\mid x)=\frac{f(x,\widetilde{x})}{h(x)}.$$
As an example, $c_1,\ldots,c_p$ could be taken i.i.d. according to a uniform distribution on some bounded interval $B\subset(0,\infty)$, i.e., $q(c)=\prod_{i=1}^p 1_B(c_i)/m(B)$. In general, the above integrals cannot be explicitly evaluated. Hence, sampling from $g(\cdot\mid x)$ is not easy, but it is still possible by computational methods. For instance, we could proceed as follows. Since $g(\cdot\mid x)$ is proportional to $f(x,\cdot)$, we focus on $f(x,\cdot)$. Then, to sample from $f(x,\cdot)$, we adopt a data augmentation strategy where $\lambda$ and $c$ are treated as auxiliary variables. The idea is to consider the density function
$$f(x,\widetilde{x},\lambda,c)\propto\prod_{i=1}^p\mathcal{N}(x_i;0,\lambda c_i)\,\mathcal{N}(\widetilde{x}_i;0,\lambda c_i)\,\psi(\lambda)\,q(c)$$
and perform a Gibbs sampling on the variables $(\widetilde{x},\lambda,c)$. We conclude this section by listing the full conditional distributions required to run the algorithm; a sketch of the resulting sampler is given after the list.
• The full conditional distribution of $\lambda$ given $(x,\widetilde{x},c)$ is proportional to
$$\lambda^{-p}\exp\Bigl\{-\frac{1}{2\lambda}\sum_{i=1}^p\frac{x_i^2+\widetilde{x}_i^2}{c_i}\Bigr\}\,\psi(\lambda).$$
Hence, since $\psi$ is the inverse gamma density with parameters $a$ and $b$, the full conditional of $\lambda$ is still an inverse gamma, with parameters
$$a+p\quad\text{and}\quad b+\frac{1}{2}\sum_{i=1}^p\frac{x_i^2+\widetilde{x}_i^2}{c_i}.$$
Obviously, $\lambda$ could also be given a different distribution. In this case, the corresponding full conditional is probably more involved, but one may use a Metropolis-within-Gibbs step.
• Let $c_{-i}=(c_1,\ldots,c_{i-1},c_{i+1},\ldots,c_p)$. The full conditional distribution of $c_i$ given $(x,\widetilde{x},\lambda,c_{-i})$ is proportional to
$$\frac{1}{c_i}\,\exp\Bigl\{-\frac{x_i^2+\widetilde{x}_i^2}{2\lambda c_i}\Bigr\}\,q(c)$$
as a function of $c_i$. Sampling from the above is not straightforward and may require a Metropolis-within-Gibbs step.
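The following is a hedged sketch of the resulting sampler, assuming $q$ is the uniform density on $(l,u)^p$ and using a random-walk Metropolis step of our own choosing for the $c_i$; all names and tuning constants are illustrative.

import numpy as np

rng = np.random.default_rng(4)

def gibbs_cik(x, a, b, l, u, n_iter=2000, step=0.1):
    # Data-augmentation Gibbs sampler targeting f(x, ., lambda, c).
    p = len(x)
    lam = b / (a - 1.0)                  # crude initial values
    c = np.full(p, (l + u) / 2.0)
    x_t = np.zeros(p)
    for _ in range(n_iter):
        # x_tilde | lambda, c : independent N(0, lambda * c_i).
        x_t = rng.normal(0.0, np.sqrt(lam * c))
        # lambda | x, x_tilde, c : inverse gamma(a + p, b + s/2).
        s = np.sum((x**2 + x_t**2) / c)
        lam = 1.0 / rng.gamma(a + p, 1.0 / (b + s / 2.0))
        # c_i | rest : random-walk Metropolis on (l, u); target is
        # proportional to c_i^{-1} exp{-(x_i^2 + x_tilde_i^2)/(2 lam c_i)}.
        prop = c + rng.normal(0.0, step * (u - l), size=p)
        ok = (prop > l) & (prop < u)
        safe = np.where(ok, prop, c)     # avoid log of invalid proposals
        w = x**2 + x_t**2
        log_ratio = (np.log(c) - np.log(safe)) + w / (2 * lam) * (1 / c - 1 / safe)
        accept = ok & (np.log(rng.random(p)) < log_ratio)
        c = np.where(accept, prop, c)
    return x_t

x = rng.normal(size=5)
x_tilde = gibbs_cik(x, a=4.0, b=3.0, l=0.5, u=5.0)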

A numerical experiment
In this section, the CIKs are tested numerically against both simulated and real data. To this end, $X$ is assumed to be as in Section 4.1. Hence, $\mathcal{L}(X)$ is a mixture of centered normal distributions and $\theta=\lambda c$, where the scalar $\lambda$ has an inverse gamma distribution with parameters $a$ and $b$ while $c=(c_1,\ldots,c_p)$ is a vector of strictly positive constants.
To learn something about the impact of the parameters, the experiment has been repeated for various choices of $a$, $b$ and $c$. The obtained results are quite stable with respect to $a$ and $b$ but exhibit a notable variability with respect to $c$. In the sequel, $a$ and $b$ have been selected so as to control the mean and the variance of $\lambda$ (which equal $b/(a-1)$ and $b^2/\{(a-1)^2(a-2)\}$, respectively, for $a>2$). In the case of real data (Section 5.2), $a$ and $b$ have also been tuned based on the observed value of $X$. The choice of $c$ is certainly more delicate. As in Section 4.2, one option could be modeling $c$ as a random vector (rather than a fixed vector). For instance, $c_1,\ldots,c_p$ could be i.i.d. according to a uniform distribution on some interval, and independent of $\lambda$. However, in this section, $c$ is taken to be non-random. This choice has essentially three motivations. First, it may be convenient in real problems, in order to account for the different roles of the various covariates. Second, it is practically simpler, since computational methods are not required. Third, if $c$ is non-random, a direct comparison with Section 6.3 of [18] is easier.
One more remark is in order. To compare different knockoff procedures, three popular criteria are the power, the false discovery rate, and the observed correlations between the $X_i$ and their knockoffs $\widetilde{X}_i$. However, as regards the CIKs of Section 4.1, the third criterion is superfluous, since $\mathrm{cov}(X_i,\widetilde{X}_i)=0$ for all $i$. Indeed, under the third criterion (as well as under the reconstructability criterion), the CIKs of Section 4.1 are superior to any other knockoff procedure; see Example 1.
5.1. Simulated data. According to the usual format (see e.g. [9] and [18]), the simulation experiment has been performed as follows.
• A subset $I\subset\{1,\ldots,p\}$ such that $|I|=60$ has been randomly selected, and the coefficients $\beta_1,\ldots,\beta_p$ have been defined as $\beta_i=u/\sqrt{n}$ if $i\in I$ and $\beta_i=0$ otherwise. Here, $n$ is a positive integer and $u>0$ a parameter called signal amplitude.
• $n$ i.i.d. observations $X^{(1)},\ldots,X^{(n)}$ have been generated from a $p$-variate Student's-$t$ distribution with $2a$ degrees of freedom and matrix $\Sigma=(b/a)\,\mathrm{diag}(c_1,\ldots,c_p)$. Given $X^{(j)}$, the corresponding response variable $Y^{(j)}$ has been defined as
$$Y^{(j)}=\sum_{i=1}^p\beta_i X_i^{(j)}+e_j,$$
where $e_1,\ldots,e_n$ are i.i.d. standard normal errors.
• For each $j=1,\ldots,n$, we sampled $m$ CIKs, say $\widetilde{X}^{(1,j)},\ldots,\widetilde{X}^{(m,j)}$, from the conditional distribution of $\widetilde{X}^{(j)}$ given $X^{(j)}=x^{(j)}$, where $x^{(j)}$ is the observed value of $X^{(j)}$. Precisely, for each $k=1,\ldots,m$, the value of $\widetilde{X}^{(k,j)}$ was sampled from the $p$-variate Student's-$t$ distribution with $2a+p$ degrees of freedom and matrix
$$\Sigma=\frac{2b+\sum_{i=1}^p (x_i^{(j)})^2/c_i}{2a+p}\,\mathrm{diag}(c_1,\ldots,c_p).$$
• For each $k=1,\ldots,m$, the knockoff selection procedure has been applied to the data $\bigl(Y^{(j)},X^{(j)},\widetilde{X}^{(k,j)}\bigr)$, $j=1,\ldots,n$, so as to calculate the power and the false discovery rate, say $\mathrm{pow}(k)$ and $\mathrm{fdr}(k)$. To do this, we exploited the R package knockoff: https://cran.r-project.org/web/packages/knockoff/index.html. This package is based on the comparison between the lasso coefficient estimates of each covariate and its knockoff.
• The final outputs are the arithmetic means of the powers and the false discovery rates, i.e., $\mathrm{pow}=(1/m)\sum_{k=1}^m\mathrm{pow}(k)$ and $\mathrm{fdr}=(1/m)\sum_{k=1}^m\mathrm{fdr}(k)$.
To run the simulation experiment, we took $m=n=p=1000$ and a theoretical value of the false discovery rate equal to 0.1. As already noted, the experiment has been repeated for various choices of the parameters $a$, $b$, $c$, $u$. Overall, the results have been quite stable with respect to all parameters but $c$. The specific results reported here correspond to $a=6$, $b=10$, $c_i=i$ and $u=0.15,0.2,0.25,0.3,0.35,0.4,0.45,0.5,1,1.5,2,2.5,3$. (A condensed sketch of the data-generating step is given below.)
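The following sketch reproduces the data-generating step under our reconstruction of the bullets above, with reduced dimensions for a quick run; it stops before the knockoff filtering, which used the R package knockoff.

import numpy as np

rng = np.random.default_rng(5)
n = p = 200                                      # reduced from 1000
a, b, u = 6.0, 10.0, 0.3
c = np.arange(1, p + 1, dtype=float)             # c_i = i

support = rng.choice(p, size=60, replace=False)  # the set I
beta = np.zeros(p)
beta[support] = u / np.sqrt(n)

def mvt(df, scale_diag):
    # Centered multivariate t with diagonal scale (common chi^2 factor).
    z = rng.normal(0.0, np.sqrt(scale_diag), size=len(scale_diag))
    return z / np.sqrt(rng.chisquare(df) / df)

X = np.stack([mvt(2 * a, (b / a) * c) for _ in range(n)])
X_tilde = np.stack([mvt(2 * a + p, (2 * b + np.sum(x**2 / c)) / (2 * a + p) * c)
                    for x in X])
Y = X @ beta + rng.standard_normal(n)
# The triples (Y, X, X_tilde) would then be passed to the knockoff filter.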
The observed results, in terms of pow and fdr, are summarized in Figure 1. The performance of the CIKs appears to be excellent, even if it deteriorates slightly for small values of the amplitude $u$. It is worth noting that, as regards the power, the behavior of the CIKs is even optimal. This was to be expected, however, because of the optimality of the CIKs discussed in Example 1.
5.2. Real data. We next turn to real data. In this case, the CIKs can be compared with some other knockoff procedures, namely: the Benjamini and Hochberg method [6], denoted by BHq; the fixed-X knockoff [2], denoted by Fixed-X; the model-X Gaussian knockoff [9], denoted by Model-X; and the second-order knockoff [9, 18], denoted by Second-order. The comparison is based on the power, the false discovery rate, and the number of false and true discoveries. The results reported here correspond to $a=4$, $b=3$ and $c_i=i$.
We focus on the human immunodeficiency virus type 1 (HIV-1) dataset [17], which has been used in several papers on the knockoff procedure; see e.g. [2, 18]. The dimension of our dataset is $n=846$ and $p=341$, where $n$ denotes the number of observations. The knockoff filter is applied to detect the mutations associated with drug resistance. In fact, the HIV-1 dataset provides drug resistance measurements. Furthermore, it includes genotype information from samples of HIV-1, with separate data sets for resistance to protease inhibitors, nucleoside reverse transcriptase inhibitors, and non-nucleoside RT inhibitors. We deal with resistance to protease inhibitors, and we analyze separately the following drugs: amprenavir (APV), atazanavir (ATV), indinavir (IDV), lopinavir (LPV), nelfinavir (NFV), ritonavir (RTV) and saquinavir (SQV).
Figure 2 summarizes the performances of the five methods across different drugs in terms of power and false discovery rate. It turns out that, in most cases, the CIKs perform well. Compared to the other procedures, the CIKs perform better in terms of power for APV, IDV and LPV, while they perform worse for SQV. In terms of false discovery rate, the CIKs perform better than the others for RTV, while they perform worse for LPV, NFV and SQV. Figure 3 shows the performances of the five methods for each drug in terms of their discoveries. We note that the number of true discoveries with the CIKs is higher compared to BHq and Fixed-X for all the drugs, and similar to Second-order and Model-X. We also highlight the performance of the CIKs for RTV with respect to the other methods.
To sum up, though the CIKs are not uniformly the best, they guarantee a good balance between power and false discovery rate, and their performance is analogous to that of the other methods. For instance, as regards APV, ATV, IDV, LPV and NFV, the CIKs have a similar number of true discoveries with respect to Second-order and Model-X, but also a smaller number of false discoveries.



Figure 2. Results from the real data example: false discovery rate (left) and power (right) across methods and drugs.
Figure 3. Results from the real data example: false and true discoveries across methods and drugs.