Estimation of a non-parametric variable importance measure of a continuous exposure



1 Introduction
Consider the following statistical problem: one observes the data structure O = (W, X, Y) on an experimental unit of interest, where W ∈ W stands for a vector of baseline covariates, and X ∈ R and Y ∈ R respectively quantify an exposure and a response; the exposure features a reference level x_0 with positive mass (there is a positive probability that X = x_0) and a continuum of other levels (a first source of difficulty); one wishes to investigate the effect of X on Y, accounting for W (a second source of difficulty) and making few assumptions on the true data-generating distribution (a third source of difficulty). Taking W into account is desirable when one knows (or cannot rule out the possibility) that it contains confounding factors, i.e., common factors upon which the exposure X and the response Y may simultaneously depend.
We illustrate our presentation with an example where the experimental unit is a set of cancer cells, the relevant baseline covariate W is a measure of DNA methylation, and the exposure X and response Y are, respectively, the DNA copy number and expression level of a given gene. Here, the reference level is x_0 = 2, that is, the expected copy number in a normal cell. The fact that there is no clear biological indication that X and Y can be interpreted as an exposure and a response, respectively, is not problematic. Associations between DNA copy numbers and expression levels in genes have already been considered in the literature (see, e.g., [11, 26, 1, 17, 10]). In contrast to these earlier contributions, we do exploit the fact that X features both a reference level and a continuum of other levels, instead of discretizing it or considering it as a purely continuous exposure.
We focus on the case that there is very little prior knowledge on the true data-generating distribution P_0 of O, although we know/assume that (i) O takes its values in a bounded set O (we will denote ‖O‖ = max{|W|, |X|, |Y|}), (ii) P_0(X = x_0) > 0, and finally (iii) P_0(X = x_0 | W) > 0 P_0-almost surely. Accordingly, we see P_0 as a specific element of the non-parametric set M of all possible data-generating distributions of O satisfying the latter constraints. We define the parameter of interest as Ψ(P_0), for the non-parametric variable importance measure Ψ : M → R characterized by

Ψ(P) = arg min_{β∈R} E_P{(E_P(Y|X, W) − E_P(Y|X = x_0, W) − β(X − x_0))^2}   (1)

for all P ∈ M. The methodology presented in this article straightforwardly extends to situations where one would prefer to replace the expression β(X − x_0) in (1) by βf(X) for any f such that f(x_0) = 0 and E_P{f(X)^2} > 0 for all P ∈ M. We emphasize that we do not assume a semi-parametric model (which would write here as Y = β(X − x_0) + η(W) + U with unspecified η and U such that E_P(U|X, W) = 0), in contrast to [15, 14, 28, 21, 20]. This fact bears important implications. The parameter of interest, Ψ(P_0), is universally defined (therefore justifying the expression "non-parametric variable importance measure of a continuous exposure" in the title), no matter what properties the unknown true data-generating distribution P_0 enjoys, or does not enjoy. Parameter Ψ quantifies the influence of X on Y on a linear scale, using the reference level x_0 as a pivot (note that this phrasing conveys the notion that the roles of X and Y are not symmetric). As its name suggests, Ψ belongs to the family of variable importance measures (a family that includes the excess risk), which was introduced in [21]. However, its case is not covered by the latter article because X is continuous (we will see how Ψ naturally relates to an excess risk when X takes only two distinct values). Our purpose here is to fully develop, for the estimation of Ψ(P_0), the semi-parametric estimation methodology called targeted minimum loss estimation (TMLE) [23, 22]. We cover the whole spectrum of its theoretical study, practical implementation, simulation study, and application to the aforementioned genomic example.
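Since the criterion in (1) is quadratic in β, the minimizer admits a closed form; the following display is a direct computation from (1) (with the normalization x_0 = 0 adopted below, it is the equality (2) invoked in the sequel):

```latex
% Closed-form minimizer of the quadratic criterion in (1).
% Writing \theta(P)(X,W) = E_P(Y|X,W), the first-order condition in \beta reads
%   E_P\{(X - x_0)(\theta(P)(X,W) - \theta(P)(x_0,W) - \beta (X - x_0))\} = 0,
% whence
\Psi(P) \;=\; \frac{E_P\{(X - x_0)\,(\theta(P)(X,W) - \theta(P)(x_0,W))\}}{E_P\{(X - x_0)^2\}} .
```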
In Section 2, we study the fundamental properties of parameter Ψ. In Section 3, we provide an overview of the TMLE methodology tailored for the purpose of estimating Ψ(P_0). In Section 4, we state and comment on important theoretical properties enjoyed by the TMLE (convergence of the iterative updating procedure at the core of its definition; its consistency and asymptotic normality). The specifics of the TMLE procedure are presented in Section 5. The properties considered in Section 4 are illustrated by a simulation study inspired by the problem of assessing the importance of DNA copy number variations on expression levels in genes, accounting for their methylation (the real data application we are ultimately interested in), as described in Section 6. All proofs are postponed to the appendix.

2 The non-parametric variable importance parameter
It is of paramount importance to study the parameter of interest in order to better estimate it. Parameter Ψ actually enjoys the following properties (see [25, Chapter 25] for definitions).
The proof of Proposition 1 is relegated to Section A.2.
Let us emphasize again that we do not assume a semi-parametric model Y = βX + η(W) + U (with unspecified η and U such that E_P(U|X, W) = 0). Setting R(P, β)(X, W) = θ(P)(X, W) − θ(P)(0, W) − βX for all (P, β) ∈ M × R, the latter semi-parametric model holds for P ∈ M if there exists a unique β(P) ∈ R such that R(P, β(P)) = 0. Note that β is always a solution to the equation βE_P{X^2} = E_P{X(θ(P)(X, W) − θ(P)(0, W) − R(P, β)(X, W))}. In particular, if the semi-parametric model holds for a certain P ∈ M, then β(P) = Ψ(P) by (2). On the contrary, if the semi-parametric model does not hold for P, then it is not clear what β(P) could even mean, whereas Ψ(P) is still a well-defined parameter worth estimating. We discuss in Section 4.2 what happens if one estimates β(P) when wrongly assuming that the semi-parametric model holds (the discussion allows us to identify the awkward non-parametric extension of parameter β(P) that one therefore estimates).
Equality (2) also teaches us that

Ψ(P) = F(P) − E_P{μ(P)(W) θ(P)(0, W)} / E_P{X^2}   (3)

for the functional F : M → R characterized by

F(P) = E_P{XY} / E_P{X^2}   (4)

(all P ∈ M). In that view, the second term in the right-hand side of (3) is a correction term added to F(P) in order to take W into account for the purpose of quantifying the influence of X on Y on a linear scale. Whereas the roles of X and Y are symmetric in the numerator of F(P), they are obviously not in that of the correction term. Less importantly, (2) also makes clear that there is a connection between Ψ and an excess risk. Indeed, consider P ∈ M such that P(X ∈ {0, x_1}) = 1 for some x_1 ≠ 0. Then Ψ(P) satisfies

Ψ(P) = E_P{h(P)(W)(θ(P)(x_1, W) − θ(P)(0, W))} / (x_1 E_P{h(P)(W)})

for h(P)(W) = P(X = x_1|W), i.e., Ψ(P) appears as a weighted excess risk (the classical excess risk would be here E_P{θ(P)(x_1, W) − θ(P)(0, W)}). Since Ψ is pathwise differentiable, the theory of semi-parametric estimation applies, providing a notion of asymptotically efficient estimation. Remarkably, the asymptotic variance of a regular estimator of Ψ(P_0) is lower-bounded by the variance Var_{P_0} D^⋆(P_0)(O) under P_0 of the efficient influence curve at P_0 (a consequence of the convolution theorem). The TMLE procedure takes advantage of the properties of Ψ described in Proposition 1 in order to build a consistent and possibly asymptotically efficient substitution estimator of Ψ(P_0). In view of (3), this is a challenging statistical problem because, whereas estimating F(P_0) is straightforward (the ratio of the empirical means of XY and X^2 is an efficient estimator of F(P_0)), estimating the correction term in (3) is more delicate, notably because it necessarily involves estimating the infinite-dimensional features θ(P_0)(0, •) and μ(P_0).
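To make the first claim concrete, the efficient estimator of F(P_0) announced above is a one-liner; a minimal sketch on hypothetical toy data:

```r
# Efficient (substitution) estimation of F(P_0) = E{XY} / E{X^2}
# by the ratio of the empirical means of XY and X^2.
F_hat <- function(x, y) mean(x * y) / mean(x^2)

# Toy illustration:
set.seed(1)
x <- rnorm(200)
y <- 0.5 * x + rnorm(200)
F_hat(x, y)  # close to 0.5 on this toy example
```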
3 Overview of the TMLE procedure tailored to the estimation of the non-parametric variable importance measure

We assume from now on that we observe n independent copies O^{(1)} = (W^{(1)}, X^{(1)}, Y^{(1)}), ..., O^{(n)} = (W^{(n)}, X^{(n)}, Y^{(n)}) of the observed data structure O ∼ P_0 ∈ M. The empirical measure is denoted by P_n. The TMLE procedure iteratively updates an initial substitution estimator ψ_n^0 = Ψ(P_n^0) of Ψ(P_0) (based on an initial estimator P_n^0 of the data-generating distribution P_0), building a sequence {ψ_n^k = Ψ(P_n^k)}_{k≥0} (with P_n^k the kth update of P_n^0) which converges to the targeted minimum loss estimator (TMLE) ψ_n^* as k increases. This iterative scheme is visually illustrated in Figure 1, and we invite the reader to consult its caption now.
We determine what initializing the TMLE procedure boils down to in Section 3.1. A general one-step targeted updating procedure is described in Section 3.2. How to carry out this initialization and these updates specifically (as well as two alternative tailored two-step updating procedures) is addressed in Section 5.

3.1 Initial estimator
In this subsection, we describe what it takes to construct an initial substitution estimator of Ψ(P_0). Of course, how one derives the substitution estimator Ψ(P) from the description of (certain features of) P is relevant even if P is not literally an initial estimator of P_0.
By (2), building an initial substitution estimator Ψ(P_n^0) of Ψ(P_0) requires the estimation of θ(P_0), of σ^2(P_0), and of the marginal distribution of (W, X) under P_0. Given P_n^0, an initial estimator of P_0 with known θ(P_n^0), σ^2(P_n^0) > 0 and marginal distribution of (W, X) under P_n^0, Ψ(P_n^0) can indeed be obtained (or, more precisely, evaluated accurately) by the law of large numbers, as discussed below. We emphasize that such an initial estimator may very well be biased. In other words, one would need strong assumptions on the true data-generating distribution P_0 (which we are not willing to make; typically, assuming that P_0 belongs to a given regular parametric model) and to adapt the construction of P_n^0 to those assumptions (typically, by relying on maximum likelihood estimation) in order to obtain the consistency of Ψ(P_n^0). For B a large integer (say B = 10^5), evaluating accurately (rather than computing exactly) the initial substitution estimator Ψ(P_n^0) of Ψ(P_0) boils down to simulating B independent copies (W̃^{(b)}, X̃^{(b)}) of (W, X) under P_n^0, then using the approximation

Ψ(P_n^0) ≈ B^{−1} Σ_{b=1}^B X̃^{(b)} (θ(P_n^0)(X̃^{(b)}, W̃^{(b)}) − θ(P_n^0)(0, W̃^{(b)})) / σ^2(P_n^0).   (5)

Knowing the marginal distribution of (W, X) under P_n^0 amounts to knowing (i) the marginal distribution of W under P_n^0, (ii) the conditional distribution of Z ≡ 1{X ≠ 0} given W under P_n^0, and (iii) the conditional distribution of X given (W, X ≠ 0) under P_n^0. Firstly, we advocate estimating the marginal distribution of W under P_0 initially by its empirical version or, put in terms of likelihood, building P_n^0 in such a way that the marginal distribution of W under P_n^0 is the empirical marginal distribution of W. Secondly, the conditional distribution of Z given W under P_n^0 is the Bernoulli law with parameter 1 − g(P_n^0)(0|W), so it is necessary that g(P_n^0) be known too (and be such that, P_n^0-almost surely, g(P_n^0)(0|W) ∈ (0, 1)). Thirdly, the conditional distribution of X given (W, X ≠ 0) under P_n^0 can be any (finite variance) distribution whose conditional mean can be deduced from μ(P_n^0):

E_{P_n^0}(X | X ≠ 0, W) = μ(P_n^0)(W) / (1 − g(P_n^0)(0|W)),   (6)

and whose conditional second order moment E_{P_n^0}(X^2 | X ≠ 0, W) satisfies

E_{P_n^0}{(1 − g(P_n^0)(0|W)) E_{P_n^0}(X^2 | X ≠ 0, W)} = σ^2(P_n^0).   (7)

In particular, it is also necessary that μ(P_n^0) be known too. In summary, the only features of P_n^0 we really care about in order to evaluate accurately (rather than compute exactly) ψ_n^0 = Ψ(P_n^0) are θ(P_n^0), μ(P_n^0), g(P_n^0), σ^2(P_n^0), and the marginal distribution of W under P_n^0, which respectively estimate θ(P_0), μ(P_0), g(P_0), σ^2(P_0), and the marginal distribution of W under P_0. We could for instance rely on a working model where the conditional distribution of X given (W, X ≠ 0) is chosen as the Gaussian distribution with conditional mean as in (6) and any conditional second order moment (which is nothing but a measurable function of W) such that (7) holds. Let us emphasize that we purposely use here expressions from the semantic field of choice, and not from that of assumption; a working model is just a tool we use in the construction of the initial estimator, and we do not necessarily assume that it is well-specified. Although such a Gaussian working model would be a perfectly correct choice, we advocate using another one for computational convenience, as presented in Section 5.1.
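To fix ideas, here is a minimal sketch of the Monte Carlo evaluation (5), under the assumption that one can simulate from the (W, X)-marginal of P_n^0 and evaluate θ(P_n^0); the sampler and regression function below are hypothetical toy placeholders:

```r
# Monte Carlo evaluation of Psi(P0n) following (5):
#   Psi(P0n) ~ B^{-1} sum_b Xb * (theta(Xb, Wb) - theta(0, Wb)) / sigma2,
# where (Wb, Xb) are independent draws from the (W, X)-marginal of P0n.
evaluate_psi <- function(sample_WX, theta, sigma2, B = 1e5) {
  sim <- sample_WX(B)  # list with numeric components W and X, both of length B
  mean(sim$X * (theta(sim$X, sim$W) - theta(0, sim$W))) / sigma2
}

# Hypothetical toy placeholders (not the estimators of the paper):
sample_WX <- function(B) {
  W <- runif(B)
  Z <- rbinom(B, 1, 0.7)       # Z = 1{X != 0}: here 1 - g(0|W) = 0.7
  X <- Z * rnorm(B, mean = W)  # conditional law of X given (W, X != 0)
  list(W = W, X = X)
}
theta  <- function(x, w) 1 + 2 * x - w  # theta(x, w) - theta(0, w) = 2x
sigma2 <- 0.7 * (1/3 + 1)               # E{X^2} under the toy marginal
evaluate_psi(sample_WX, theta, sigma2)  # close to 2 here
```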

3.2 A general one-step updating procedure of the initial estimator
The next step consists in iteratively updating ψ_n^0 = Ψ(P_n^0). Assuming that one has already built (k − 1) updates P_n^1, ..., P_n^{k−1} of P_n^0, resulting in (k − 1) updated substitution estimators ψ_n^1 = Ψ(P_n^1), ..., ψ_n^{k−1} = Ψ(P_n^{k−1}), it is formally sufficient to describe how the kth update P_n^k is derived from its predecessor P_n^{k−1} in order to fully determine the iterative procedure. Note that the values of ψ_n^1, ..., ψ_n^{k−1} are derived, as ψ_n^0 = Ψ(P_n^0) is, by following (5) in Section 3.1 with P_n^1, ..., P_n^{k−1} substituted for P_n^0. We present here a general one-step updating procedure (two alternative tailored two-step updating procedures are also presented in Section 5.2). We invite again the reader to refer to Figure 1 for its visual illustration.
Figure 1: Illustration of the TMLE procedure (with its general one-step updating procedure). We purposely represent the initial estimator P_n^0 closer to P_0 than its kth and (k+1)th updates P_n^k and P_n^{k+1}, heuristically because P_n^0 is as close to P_0 as one can possibly get (given P_n and the specifics of the super-learning procedure) when targeting P_0 itself. However, this obviously does not necessarily imply that Ψ(P_n^0) performs well when targeting Ψ(P_0) (instead of P_0), which is why we also purposely represent Ψ(P_n^{k+1}) closer to Ψ(P_0) than Ψ(P_n^0). Indeed, P_n^{k+1} is obtained by fluctuating its predecessor P_n^k "in the direction of Ψ", i.e., taking into account the fact that we are ultimately interested in estimating Ψ(P_0). More specifically, the fluctuation {P_n^k(ε) : |ε| < η_n^k} of P_n^k is a one-dimensional parametric model (hence its curvy shape in the large model M) such that (i) P_n^k(0) = P_n^k, and (ii) its score at ε = 0 equals the efficient influence curve D^⋆(P_n^k) at P_n^k (hence the dotted arrow). An optimal stretch ε_n^k is determined (e.g., by maximizing the likelihood on the fluctuation), yielding the update P_n^{k+1} = P_n^k(ε_n^k).

Set ρ ∈ (0, 1) a constant close to 1 and consider the path {P_n^{k−1}(ε) : |ε| ≤ ρ‖D^⋆(P_n^{k−1})‖_∞^{−1}} characterized by

dP_n^{k−1}(ε) = (1 + ε D^⋆(P_n^{k−1})) dP_n^{k−1},

where D^⋆(P_n^{k−1}) is the current estimator of the efficient influence curve at P_0, obtained as the efficient influence curve at P_n^{k−1}. The path is a one-dimensional parametric model that fluctuates P_n^{k−1} (i.e., P_n^{k−1}(0) = P_n^{k−1}) in the direction of D^⋆(P_n^{k−1}) (i.e., the score of the path at ε = 0 equals D^⋆(P_n^{k−1})). Here, we choose minus the log-likelihood function as loss function (i.e., we choose L : M × O → R characterized by L(P)(O) = − log P(O)). Consequently, the optimal update of P_n^{k−1} is indexed by the maximum likelihood estimator (MLE)

ε_n^{k−1} = arg max_{|ε| ≤ ρ‖D^⋆(P_n^{k−1})‖_∞^{−1}} Σ_{i=1}^n log P_n^{k−1}(ε)(O^{(i)}).

The MLE ε_n^{k−1} is uniquely defined (and possibly equal to ±ρ‖D^⋆(P_n^{k−1})‖_∞^{−1}, hence the introduction of the constant ρ in the definition of the path) provided, for instance, that P_n D^⋆(P_n^{k−1})^2 > 0 (this statement is to be understood conditionally on P_n, i.e., it is a statement about the sample). Under mild assumptions on P_0, the update P_n^k = P_n^{k−1}(ε_n^{k−1}) is a well-defined element of M, thus concluding the description of the iterative updating step of the TMLE procedure. Finally, the TMLE ψ_n^* is defined as ψ_n^* = lim_{k→∞} ψ_n^k, assuming that the limit exists, or more generally as ψ_n^{k_n} for a conveniently chosen sequence {k_n}_{n≥0} (see Sections 4.1 and 4.2 regarding this issue). This is a very general way of dealing with the updating step of the TMLE methodology. The key is that it is possible to determine how the fundamental features of P_n^k(ε) (i.e., the components of P_n^k(ε) involved in the definition of D^⋆(P_n^k(ε)) and in the definition of Ψ) behave (exactly) as functions of ε relative to their counterparts at ε = 0 (i.e., with respect to (wrt) P_n^k), as shown in the next lemma (its proof is relegated to Section A.2).
Lemma 1. Set s ∈ L_0^2(P) with ‖s‖_∞ < ∞ and consider the path {P(ε) : |ε| < ‖s‖_∞^{−1}} characterized by

dP(ε) = (1 + εs) dP.   (8)

The path has score function s at ε = 0. For all |ε| < ‖s‖_∞^{−1} and all measurable functions f of W,

θ(P(ε))(X, W) = (θ(P)(X, W) + ε E_P(Y s(O)|X, W)) / (1 + ε E_P(s(O)|X, W)),   (10)
μ(P(ε))(W) = (μ(P)(W) + ε E_P(X s(O)|W)) / (1 + ε E_P(s(O)|W)),   (11)
g(P(ε))(0|W) = (g(P)(0|W) + ε E_P(1{X = 0} s(O)|W)) / (1 + ε E_P(s(O)|W)),   (12)
σ^2(P(ε)) = σ^2(P) + ε E_P{X^2 s(O)},   (13)
E_{P(ε)}{f(W)} = E_P{(1 + ε E_P(s(O)|W)) f(W)}.   (14)

Regarding the computation of Ψ(P_n^k), it is also required to know how to sample independent copies of (W, X) under P_n^k(ε), see Section 3.1. Finally, we emphasize that, by (14), the marginal distribution of W under P_n^k typically deviates from its counterpart under P_n^0 (i.e., from its empirical counterpart).
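For illustration, the maximization defining the MLE of the fluctuation parameter can be carried out numerically; a minimal sketch, where Dstar_values stands for a hypothetical vector of evaluations of the current D^⋆(P_n^{k−1}) at the n observations:

```r
# MLE of eps along the path dP(eps) = (1 + eps * Dstar) dP, i.e. the maximizer
# of eps -> sum_i log(1 + eps * Dstar(O_i)) over |eps| <= rho / max|Dstar|.
fluctuation_mle <- function(Dstar_values, rho = 0.99) {
  bound <- rho / max(abs(Dstar_values))
  loglik <- function(eps) sum(log(1 + eps * Dstar_values))
  optimize(loglik, interval = c(-bound, bound), maximum = TRUE)$maximum
}

# Toy illustration: values with a positive empirical mean pull eps above 0.
set.seed(2)
Dstar_values <- rnorm(200, mean = 0.1)
fluctuation_mle(Dstar_values)
```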

3.3 TMLE and one-step estimation methodologies
By being based on an iterative scheme, the TMLE methodology naturally evokes the one-step estimation methodology introduced by Le Cam [8] (see [25, Sections 5.7 and 25.8] for a recent account). The latter estimation methodology draws its inspiration from the method of Newton-Raphson in numerical analysis, and basically consists in updating an initial estimator by relying on a linear approximation to the original estimating equation.
Yet, some differences between the TMLE and one-step estimation methodologies are particularly striking. Most importantly, because the TMLE methodology only involves substitution estimators, how one updates (in the parameter space R) the initial estimator ψ_n^0 = Ψ(P_n^0) into ψ_n^1 = Ψ(P_n^1) is the consequence of how one updates (in the model M) the initial estimator P_n^0 of P_0 into P_n^1. In contrast, the one-step estimator is naturally presented as an update (in the parameter space R) of the initial estimator, for the sake of solving a linear approximation (in Ψ(P)) to the estimating equation P_n D^⋆(P) = 0. The TMLE methodology does not involve such a linear approximation; it nevertheless guarantees by construction that P_n D^⋆(P_n^k) ≈ 0 for large k (see Section 4.1 on that issue). Furthermore, on a more technical note, the asymptotic study of the TMLE ψ_n^* does not require that the initial estimator ψ_n^0 = Ψ(P_n^0) be √n-consistent (i.e., that √n(ψ_n^0 − Ψ(P_0)) be uniformly tight), whereas that of the one-step estimator typically does.
However, there certainly exist interesting relationships between the TMLE and one-step estimation methodologies too. Such relationships are not obvious, and we will investigate them in future work.

4 Convergence and asymptotics
In this section, we state and comment on important theoretical properties enjoyed by the TMLE. In Section 4.1, we study the convergence of the iterative updating procedure which is at the core of the TMLE procedure. In Section 4.2, we derive the consistency and asymptotic normality of the TMLE. Building on the statement of consistency, we also argue why it is more interesting to estimate our non-parametric variable importance measure Ψ(P_0) than its semi-parametric counterpart.

4.1 On the convergence of the updating procedure
Studying the convergence of the updating procedure has several aspects to it. We focus on the general one-step procedure of Section 3.2. All proofs are relegated to Section A.4.
On one hand, the following result (very similar to Result 1 in [23]) trivially holds.

Lemma 2. Assume (i) that all the paths we consider are included in M′ ⊂ M such that sup_{P∈M′} ‖D^⋆(P)‖_∞ = M < ∞, and (ii) that their fluctuation parameters ε are restricted to [−ρ‖D^⋆(P)‖_∞^{−1}, ρ‖D^⋆(P)‖_∞^{−1}]. If lim_{k→∞} ε_n^k = 0, then lim_{k→∞} P_n D^⋆(P_n^k) = 0.

Condition (i) is weak, and we refer to Lemma 4 for a set of conditions which guarantee that it holds. Lemma 2 is of primary importance. It teaches us that if the TMLE procedure "converges" (in the sense that lim_{k→∞} ε_n^k = 0) then its "limit" is a solution of the efficient influence curve equation (in the sense that for any arbitrarily small deviation from 0, it is possible to guarantee P_n D^⋆(P_n^k) ≈ 0 by choosing k large enough). This is the key to the proofs of consistency and asymptotic linearity, see Section 4.2. Actually, the condition lim_{k→∞} ε_n^k = 0 can be replaced by a more explicit condition on the class of the considered data-generating distributions, as shown in the next lemma.

Lemma 3. Under the assumptions of Lemma 2, let us suppose additionally that the sample satisfies (iii) inf_{k≥0} P_n D^⋆(P_n^k)^2 > 0, and (iv) that the log-likelihood of the data is uniformly bounded on M′: sup_{P∈M′} Σ_{i=1}^n log P(O^{(i)}) < ∞. Then it holds that lim_{k→∞} ε_n^k = 0 and lim_{k→∞} P_n D^⋆(P_n^k) = 0.

On the other hand, it is possible to obtain another result pertaining to the "convergence" of the updating procedure directly put in terms of the convergence of the sequences {P_n^k}_{k≥0} and {ψ_n^k}_{k≥0}, provided that {ε_n^k}_{k≥0} goes to 0 quickly enough; this is the content of Lemma 4, whose statement requires g(P_n^k)(0|W) and σ^2(P_n^k) to be bounded away from 0 and the series Σ_{k≥0} |ε_n^k| to converge. It is necessary to bound g(P_n^k) and σ^2(P_n^k) away from 0 because conditions (i) and (ii) of Lemma 2 only imply that g(P_n^k)(0|W) and σ^2(P_n^k) stay positive. Now, it makes perfect sense from a computational point of view to resort to lower-thresholding in order to ensure that g(P_n^k)(0|W) and σ^2(P_n^k) cannot be smaller than a fixed constant. Assuming that the series Σ_{k≥0} |ε_n^k| converges ensures that {P_n^k}_{k≥0} converges in total variation rather than weakly only. Interestingly, we do draw advantage from this stronger type of convergence in order to derive the second part of the lemma. In conclusion, note that Newton-Raphson-type algorithms converge quadratically, which suggests that the condition Σ_{k≥0} |ε_n^k| < ∞ is not too demanding.

4.2 Consistency and asymptotic normality
Let us now investigate the statistical properties of the TMLE ψ_n^*. We actually consider a slightly modified version of the TMLE in order to circumvent the issue of the convergence of the sequence {ψ_n^k}_{k≥0} as k goes to infinity. The modified version is perfectly fine from a practical point of view. All proofs are relegated to Section A.5.
Merit of the non-parametric variable importance measure over its semi-parametric counterpart.
Let us repeat that we do not assume a semi-parametric model Y = βX + η(W) + U (with unspecified η and U such that E_P(U|X, W) = 0). However, if P ∈ M is such that θ(P)(X, W) = β(P)X + θ(P)(0, W) (i.e., if the semi-parametric model holds under P), then Ψ(P) = β(P). Let us denote by M_SP ⊂ M the set of all such data-generating distributions.
It is known (see for instance [28]) that β : M_SP → R is a pathwise differentiable parameter (wrt the corresponding maximal tangent space), and that its efficient influence curve at P ∈ M_SP is proportional to

D^⋆_SP(P)(O) ∝ (Y − β(P)X − θ(P)(0, W)) / v^2(P)(X, W) × (X − E_P(X/v^2(P)(X, W)|W) / E_P(1/v^2(P)(X, W)|W)),   (15)

where v^2(P)(X, W) is the conditional variance of Y given (X, W) under P. Note that the second factor in the right-hand side expression reduces to (X − μ(P)(W)) whenever v^2(P)(X, W) only depends on W.
For the purpose of emphasizing the merit of the non-parametric variable importance measure over its semi-parametric counterpart, say that one estimates β(P_0) assuming (temporarily) that P_0 ∈ M_SP (hence Ψ(P_0) = β(P_0)). Say that one builds P^*_{n,SP} ∈ M_SP such that (i) v^2(P^*_{n,SP})(X, W) does not depend on (X, W), and (ii) θ(P^*_{n,SP})(0, •) and μ(P^*_{n,SP}) converge to limits θ_1(0, •) and μ_1, and finally that one solves in the limit the efficient influence curve equation P_0 D^⋆_SP = 0, with solution β_1 (this is typically derived from (ii) above; see the proof of Proposition 2 for a typical derivation). Then (by double-robustness of D^⋆_SP), the estimator β(P^*_{n,SP}) of β(P_0) is consistent (i.e., β_1 = β(P_0)) if either θ_1 = θ(P_0) (that is obvious) or μ_1 = μ(P_0). For example, let us suppose that μ_1 = μ(P_0). In particular, one can deduce from (15) that β_1 is uniquely determined (provided that X does not coincide with μ(P_0)(W) under P_0). Equivalently, β_1 = b(P_0) for the functional b : M → R characterized by

b(P) = E_P{Y(X − μ(P)(W))} / E_P{(X − μ(P)(W))^2}.

Note that one can interpret parameter b as a non-parametric extension of the semi-parametric parameter β (non-parametric, because its definition does not involve a semi-parametric model anymore). Now, we want to emphasize that b arguably defines a sensible target if θ_1(0, •) = θ(P)(0, •) (in addition to μ_1 = μ(P_0)), but not otherwise! This illustrates the danger of relying on a semi-parametric model when it is not absolutely certain that it holds, thus underlining the merit of targeting the non-parametric variable importance measure rather than its semi-parametric counterpart.
Corollary 1 covers a simple case in the sense that, by being o_P(1/√n), the second right-hand side term in (16) does not significantly contribute to the linear asymptotic expansion of the TMLE, i.e., the influence curve actually is D^⋆(σ^2(P_0), θ_0, μ_0, g_0, Ψ(P_0)). Depending on how θ(P_n^0), μ(P_n^0) and g(P_n^0) are obtained (again, we recommend relying on super-learning), the contribution of this term to the linear asymptotic expansion may be significant (but determining it would be a very difficult task to address on a case-by-case basis when relying on super-learning).
5 Specifics of the TMLE procedure tailored to the estimation of the non-parametric variable importance measure

In this section, we present practical details on how we conduct the initialization and updating steps of the TMLE procedure described in Section 3. We introduce in Section 5.1 a working model for the conditional distribution of X given (W, X ≠ 0) which proves very efficient in computational terms. In Section 5.2, we introduce two alternative two-step updating procedures which can be substituted for the general one-step updating procedure presented in Section 3.2. Finally, we describe carefully all the features of interest of P_0 that must be considered for the purpose of targeting the parameter of ultimate interest, Ψ(P_0), via the construction of the TMLE.
5.1 Working model for the conditional distribution of X given (W, X ≠ 0)

The working model for the conditional distribution of X given (W, X ≠ 0) under P_n^0 that we build relies on two ideas:

- we link the conditional second order moment E_{P_n^0}(X^2 | X ≠ 0, W) to the conditional mean E_{P_n^0}(X | X ≠ 0, W) (both under P_n^0) through the equality

E_{P_n^0}(X^2 | X ≠ 0, W) = φ_{n,λ}(E_{P_n^0}(X | X ≠ 0, W)),   (17)

where φ_{n,λ} interpolates, with weight λ, between the parabola x ↦ x^2 and its chord over [m_n, M_n] (m_n and M_n denoting the smallest and largest observed values of X), and λ ∈ [0, 1] is a fine-tuning parameter;

- under P_n^0 and conditionally on (W, X ≠ 0), X takes its values in the set {X^{(i)} : i ≤ n} \ {0}.

Since the conditional distribution of X given (W, X ≠ 0) under P_n^0 is subject to two constraints, X cannot take fewer than three different values in general. Elegantly, it is possible (under a natural assumption on P_n^0) to fine-tune λ and to select three values in {X^{(i)} : i ≤ n} \ {0} in such a way that X only takes the latter values.

Lemma 5. Assume that P_n^0 guarantees that σ^2(P_n^0) > 0, P_n^0(X = 0) > 0, g(P_n^0)(0|W) ∈ (0, 1) P_n^0-almost surely, and X ∈ [m_n + c, M_n − c] for some c > 0 when X ≠ 0. It is possible to construct P_n^{00} ∈ M in such a way that (i) W has the same marginal distribution under P_n^{00} as under P_n^0, and (ii) for all W ∈ W, there exist three different values x^{(1)}, x^{(2)}, x^{(3)} ∈ {X^{(i)} : i ≤ n} \ {0} and three non-negative weights p_1, p_2, p_3 summing up to 1 such that, conditionally on (W, X ≠ 0) under P_n^{00}, X = x^{(k)} with conditional probability p_k.
Hence, we directly construct a P_n^0 of the same form as P_n^{00}. Note that, by (8), because the conditional distribution of X given (W, X ≠ 0) under P_n^0 has its support included in {X^{(i)} : i ≤ n} \ {0}, so do the conditional distributions of X given (W, X ≠ 0) under P_n^k (all k ≥ 1) obtained by following the general one-step updating procedure of Section 3.2. Similarly, because we initially estimate the marginal distribution of W under P_0 by its empirical counterpart, the marginal distributions of W under P_n^0 and P_n^k (all k ≥ 1) have their supports included in {W^{(i)} : i ≤ n}.
We discuss in Section 5.4 why it is computationally more interesting to consider such a working model (instead of a Gaussian working model, for instance). We emphasize that assuming X ∈ [m_n + c, M_n − c] when X ≠ 0 (for a possibly tiny c > 0) is hardly a constraint, and that the latter must be accounted for while estimating μ(P_0), g(P_0), and σ^2(P_0). The proof of the lemma is relegated to Section A.3.
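To give a concrete feel for the construction underlying Lemma 5, here is a minimal brute-force sketch of the moment-matching step: given a target conditional mean m and second order moment s lying in the convex envelope of {(x, x^2)}, it scans triples of observed non-zero exposure values for non-negative weights matching both moments (the fine-tuning of λ is not reproduced; all names are illustrative):

```r
# Find three support points among the observed non-zero X's, with weights
# p1, p2, p3 >= 0 summing to 1, such that sum(p*x) = m and sum(p*x^2) = s.
match_two_moments <- function(x_values, m, s, tol = 1e-9) {
  support <- unique(x_values[x_values != 0])
  for (trio in combn(support, 3, simplify = FALSE)) {
    # 3x3 linear system: weights sum to 1 and match the two moments.
    A <- rbind(rep(1, 3), trio, trio^2)
    p <- tryCatch(solve(A, c(1, m, s)), error = function(e) NULL)
    if (!is.null(p) && all(p >= -tol)) {
      return(list(points = trio, weights = pmax(p, 0) / sum(pmax(p, 0))))
    }
  }
  NULL  # (m, s) lies outside the convex envelope of {(x, x^2)}
}

# Toy illustration:
match_two_moments(c(-1, 0.5, 1, 2, 3), m = 1.2, s = 2.5)
```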

5.2 Two tailored alternative two-step updating procedures
We present in Section 3.2 a general one-step updating procedure. Alternatively, it is also possible to decompose each update into a first update of the conditional distribution of Y given (W, X), followed by a second update of the marginal distribution of (W, X).

First update: fluctuating the conditional distribution of Y given (W, X). We actually propose two different fluctuations for that purpose: a Gaussian fluctuation on one hand and a logistic fluctuation on the other hand, depending on what one knows or wants to impose.

Gaussian fluctuation. In this case too, minus the log-likelihood function is used as loss function. Specifically, we first fluctuate only the conditional distribution of Y given (W, X), by introducing the path {P_{n,1}^{k-1}(ε) : ε ∈ R} such that (i) (W, X) has the same distribution under P_{n,1}^{k-1}(ε) as under P_n^{k-1}, and (ii) under P_{n,1}^{k-1}(ε) and given (W, X), Y is distributed from the Gaussian law with conditional mean θ(P_n^{k-1})(X, W) + εH(P_n^{k-1})(X, W) and conditional variance 1, where the so-called clever covariate H(P) is characterized for any P ∈ M by

H(P)(X, W) = (X − 1{X = 0} μ(P)(W) / g(P)(0|W)) / σ^2(P).

This definition guarantees that the path fluctuates P_n^{k-1} in the direction of the component of the efficient influence curve that pertains to the conditional distribution of Y given (W, X).

Logistic fluctuation. There is yet another interesting option in the case that Y ∈ [a, b] is bounded (or in the case that one wishes to impose Y ∈ [a, b], typically then with a = min_{i≤n} Y^{(i)} and b = max_{i≤n} Y^{(i)}), which allows one to incorporate this known fact (or wish) into the procedure. Let us assume that θ(P_0) takes its values in ]a, b[ and also that θ(P_n^{k-1}) is constrained in such a way that θ(P_n^{k-1})(X, W) ∈ ]a, b[. Introduce for clarity the function on the real line characterized by F_{a,b}(t) = (t − a)/(b − a). Here, we choose the loss function L_{a,b} characterized by

−L_{a,b}(P)(O) = F_{a,b}(Y) log F_{a,b}(θ(P)(X, W)) + (1 − F_{a,b}(Y)) log(1 − F_{a,b}(θ(P)(X, W))).

Note that the loss L_{a,b}(P) depends on the conditional distribution of Y given (W, X) under P only through its conditional mean θ(P). This straightforwardly implies that, in order to describe a fluctuation of the conditional distribution of Y given (W, X), it is only necessary to detail how θ(P_{n,1}^{k-1}(ε)) depends on θ(P_n^{k-1}) and ε. Specifically, we first fluctuate only the conditional distribution of Y given (W, X), by making P_{n,1}^{k-1}(ε) be such that (i) (W, X) has the same distribution under P_{n,1}^{k-1}(ε) as under P_n^{k-1}, and (ii)

F_{a,b}(θ(P_{n,1}^{k-1}(ε))(X, W)) = expit( logit F_{a,b}(θ(P_n^{k-1})(X, W)) + ε H(P_n^{k-1})(X, W) ).

Now, introduce the L_{a,b}-minimum loss estimator

ε_{n,1}^{k-1} = arg min_{ε∈R} P_n L_{a,b}(P_{n,1}^{k-1}(ε)),

which finally yields the first intermediate update P_{n,2}^{k-1} = P_{n,1}^{k-1}(ε_{n,1}^{k-1}). The following lemma (whose proof is relegated to Section A.2) justifies our interest in the loss function L_{a,b} and fluctuation {P_{n,1}^{k-1}(ε) : ε ∈ R}.

Lemma 6. Assume that the conditions stated above are met. Then L_{a,b} is a valid loss function for the purpose of estimating θ(P_0), in the sense that θ(P_0) = arg min_{P∈M} P_0 L_{a,b}(P). Moreover, it holds that the derivative at ε = 0 of ε ↦ −L_{a,b}(P_{n,1}^{k-1}(ε))(O) is proportional to H(P_n^{k-1})(X, W)(Y − θ(P_n^{k-1})(X, W)).

The second equality is the counterpart of the fact that, when using the Gaussian fluctuation, the score of the path at ε = 0 equals H(P_n^{k-1})(X, W)(Y − θ(P_n^{k-1})(X, W)).

Second update: fluctuating the marginal distribution of (W, X).
Next, we preserve the conditional distribution of Y given (W, X) and only fluctuate the marginal distribution of (W, X), by introducing the path {P_{n,2}^{k-1}(ε) : |ε| ≤ ρ‖D^⋆_1(P_{n,2}^{k-1})‖_∞^{−1}} such that (i) the conditional distribution of Y given (W, X) is the same under P_{n,2}^{k-1}(ε) as under P_{n,2}^{k-1}, and (ii) the marginal distribution of (W, X) under P_{n,2}^{k-1}(ε) is characterized by

dP_{n,2}^{k-1}(ε)_{(W,X)} = (1 + ε D^⋆_1(P_{n,2}^{k-1})(X, W)) dP_{n,2,(W,X)}^{k-1},   (18)

D^⋆_1(P) denoting the component of the efficient influence curve D^⋆(P) that corresponds to the marginal distribution of (W, X). This second path fluctuates P_{n,2}^{k-1} (i.e., P_{n,2}^{k-1}(0) = P_{n,2}^{k-1}) in the direction of D^⋆_1(P_{n,2}^{k-1}) (i.e., the score of the path at ε = 0 equals D^⋆_1(P_{n,2}^{k-1})). Consider again minus the log-likelihood as loss function, and introduce the MLE

ε_{n,2}^{k-1} = arg max_{|ε| ≤ ρ‖D^⋆_1(P_{n,2}^{k-1})‖_∞^{−1}} Σ_{i=1}^n log P_{n,2}^{k-1}(ε)(O^{(i)});

the second update is P_n^k = P_{n,2}^{k-1}(ε_{n,2}^{k-1}), concluding the description of how we can alternatively build P_n^k based on P_n^{k-1}. Note that, by (18), because the conditional distribution of X given (W, X ≠ 0) under P_n^0 has its support included in {X^{(i)} : i ≤ n} \ {0} (a consequence of our choice of working model, see Section 5.1), so do the conditional distributions of X given (W, X ≠ 0) under P_n^k (all k ≥ 1) obtained by following either one of the tailored two-step updating procedures. Furthermore, it still holds that the marginal distributions of W under P_n^0 and P_n^k (all k ≥ 1) have their supports included in {W^{(i)} : i ≤ n} (because we initially estimate the marginal distribution of W under P_0 by its empirical counterpart).
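To fix ideas, here is a minimal sketch of one two-step update with the Gaussian fluctuation, using the expression of the clever covariate given above (all inputs are hypothetical current estimates evaluated at the observations; for the Gaussian fluctuation with conditional variance 1, the MLE of ε is a no-intercept least squares coefficient):

```r
# First update: Gaussian fluctuation of theta in the direction of the clever
# covariate H(P)(X, W) = (X - 1{X = 0} * mu(W) / g(0|W)) / sigma2.
clever_covariate <- function(x, mu_w, g0_w, sigma2) {
  (x - (x == 0) * mu_w / g0_w) / sigma2
}
update_theta <- function(y, theta_xw, H) {
  eps <- sum(H * (y - theta_xw)) / sum(H^2)  # MLE for N(theta + eps * H, 1)
  theta_xw + eps * H                         # updated fitted values
}

# Second update: reweight the (W, X)-marginal by (1 + eps * Dstar1) as in (18),
# with eps maximizing the log-likelihood along the second path.
update_marginal_weights <- function(Dstar1_values, rho = 0.99) {
  bound <- rho / max(abs(Dstar1_values))
  eps <- optimize(function(e) sum(log(1 + e * Dstar1_values)),
                  interval = c(-bound, bound), maximum = TRUE)$maximum
  1 + eps * Dstar1_values  # relative weights attached to the observations
}
```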

5.3 Super-learning of the features of interest
It still remains to specify how we wish to carry out the initial estimation and updating of the features of interest θ(P_0), μ(P_0), g(P_0), and σ^2(P_0). As for σ^2(P_0) = E_{P_0}{X^2}, we simply estimate it by its empirical counterpart, i.e., construct P_n^0 in such a way that σ^2(P_n^0) = n^{−1} Σ_{i=1}^n (X^{(i)})^2. The three other features θ(P_0), μ(P_0) and g(P_0) are estimated by super-learning, and P_n^0 is constructed in such a way that θ(P_n^0), μ(P_n^0) and g(P_n^0) equal their corresponding estimators. Super-learning is a cross-validation-based aggregation method that builds a predictor as a convex combination of base predictors [24, 22] (we briefly describe in Section 6.5 the specifics of the super-learning procedure that we implement for our applications to simulated and real data). The weights of the convex combination are chosen so as to minimize the prediction error, estimated by V-fold cross-validation, the constrained minimization being carried out by non-negative least squares (NNLS) [7]. Heuristically, the obtained predictor is by construction at least as good as the best of the base predictors (this statement has a rigorous form involving oracle inequalities, see [24, 22]).
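The NNLS step can be sketched as follows, with Z a hypothetical n × K matrix of V-fold cross-validated predictions from the K base predictors (the nnls R-package provides the solver):

```r
library(nnls)

# Meta-learning step of the super learner: find non-negative weights a
# minimizing ||y - Z a||^2, then normalize them into a convex combination.
superlearner_weights <- function(Z, y) {
  fit <- nnls(Z, y)   # Lawson-Hanson non-negative least squares
  a <- fit$x
  a / sum(a)
}
```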
Lemma 1 teaches us what additional features of P_n^{k-1} must be known in order to derive the kth update P_n^k from its predecessor P_n^{k-1}, starting from k = 1. Specifically, if we rely on the general one-step updating procedure of Section 3.2, then we need to know:

- E_{P_n^{k-1}}(Y D^⋆(P_n^{k-1})(O)|X, W) and E_{P_n^{k-1}}(D^⋆(P_n^{k-1})(O)|X, W) for the update of θ(P_n^{k-1}) (see the right-hand side numerator and denominator in (10));
- E_{P_n^{k-1}}(X D^⋆(P_n^{k-1})(O)|W) for the update of μ(P_n^{k-1}) (see the right-hand side numerator in (11));
- E_{P_n^{k-1}}(1{X = 0} D^⋆(P_n^{k-1})(O)|W) for the update of g(P_n^{k-1}) (see the right-hand side numerator in (12));
- E_{P_n^{k-1}}(D^⋆(P_n^{k-1})(O)|W) and the marginal distribution of W under P_n^{k-1} (see the right-hand side denominators in (11), (12) and the right-hand side of (14));
- E_{P_n^{k-1}}{X^2 D^⋆(P_n^{k-1})(O)} for the update of σ^2(P_n^{k-1}) (see (13)).

It is noteworthy that if either one of the two-step updating procedures of Section 5.2 is used, then the first two conditional expectations do not need to be known, because updating θ(P_n^{k-1}) relies on the clever covariate H(P_n^{k-1}), which is entirely characterized by the current estimators μ(P_n^{k-1}), g(P_n^{k-1}), and σ^2(P_n^{k-1}) of the features μ(P_0), g(P_0), and σ^2(P_0), respectively. In the sequel of this subsection, we focus on the general one-step updating procedure of Section 3.2. How to proceed when relying on either of the two-step updating procedures of Section 5.2 can be easily deduced from that case.
Once θ(P_n^0), μ(P_n^0), g(P_n^0), and σ^2(P_n^0) are determined (see the first paragraph of this subsection), hence D^⋆(P_n^0) is known, we therefore also estimate by super-learning the conditional expectations of the above list taken at P_n^0 (as for the expectation E_{P_n^0}{X^2 D^⋆(P_n^0)(O)}, we simply estimate it by its empirical counterpart). Then we constrain P_n^0 in such a way that these conditional expectations and the expectation E_{P_n^0}{X^2 D^⋆(P_n^0)(O)} equal their corresponding estimators. This completes the construction of P_n^0, and suffices for characterizing the features θ(P_n^1), μ(P_n^1), g(P_n^1) and σ^2(P_n^1) of the first update P_n^1. Now, if one wished to follow exactly the conceptual road consisting in relying on Lemma 1 in order to derive the second update P_n^2 from its predecessor P_n^1, one would have to describe how each conditional (and unconditional) expectation of the above list behaves, as a function of ε, along the path fluctuating P_n^1. This would in turn enlarge the above list of the features of interest of P_0 that one would have to consider in the initial construction of P_n^0. Note that the length of the list would increase quadratically in the number of updates. Instead, once D^⋆(P_n^{k-1}) is known, we estimate by super-learning the conditional expectations of the above list taken at P_n^{k-1} (and the expectation E_{P_n^{k-1}}{X^2 D^⋆(P_n^{k-1})(O)} by its empirical counterpart). Then we proceed as if the conditional expectations were equal to their corresponding estimators. By doing so, the length of the list of the features of interest of P_0 is fixed, no matter how many steps of the updating procedure are carried out. Arguably, following this alternative road has little if any effect relative to following exactly the conceptual road consisting in relying on Lemma 1, because only second (or higher) order terms in ε are involved.

5.4 Merit of the working model for the conditional distribution of X given (W, X ≠ 0)
Let us explain here why (a) initially estimating the marginal distribution of W under P_0 by its empirical counterpart and (b) relying on the working model for the conditional distribution of X given (W, X ≠ 0) that we described in Section 5.1 is computationally very interesting. The key is that, under P_n^0 and its successive updates P_n^k (all k ≥ 1), the distributions of (W, X) have their supports included in {(W^{(i)}, X^{(j)}) : i, j ≤ n} (we say they are "parsimonious").
Indeed, Lemma 1 and a simple induction yield that, for each k ≥ 1, a single call to θ(P_n^k), μ(P_n^k) or g(P_n^k) involves a number of (nested) calls to the "past" features of interest that grows with k. Furthermore, the evaluation of Ψ(P_n^k) (following (5) with P_n^k substituted for P_n^0) requires in turn B calls (assuming for simplicity that the functions are not vectorized) to θ(P_n^k) (in order to evaluate the numerator of the right-hand side term of (5)), and to μ(P_n^k) and g(P_n^k) (in order to simulate {(W̃^{(b)}, X̃^{(b)}) : b ≤ B}). Overall, at least O(Bk) calls to the set of all features of interest are performed at the kth updating step of the TMLE procedure. In practice (even if functions are vectorized), this leads to a large memory footprint and prohibitive running time of the algorithm, as each of these calls consists in the prediction of the corresponding feature, as described in Section 5.3.
By taking advantage of the "parsimony" of the distributions of (W, X) under the successive P_n^k (k ≥ 0), we manage to alleviate dramatically the time and memory requirements of our implementation. Indeed, the "parsimony" implies that, at the kth step of the TMLE procedure (k ≥ 0), it is only required to compute and store O(n^2) quantities (including, but not limited to, θ(P_n^k)(X^{(i)}, W^{(j)}), μ(P_n^k)(W^{(i)}) and g(P_n^k)(0|W^{(j)}) for all 1 ≤ i, j ≤ n; see Section 5.3). In particular, the evaluation of Ψ(P_n^k) now requires retrieving O(B) values from a handful of vectors instead of performing O(Bk) memory- and time-consuming (nested) function calls.
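In code, this amounts to evaluating each feature once on the grid of observed values and then only ever reading the stored values; a minimal sketch with hypothetical, vectorized evaluators:

```r
# Precompute the O(n^2) feature evaluations needed at a given update step:
# theta on every pair (X^(j), W^(i)), mu and g on every observed W^(i).
precompute_features <- function(theta, mu, g0, W_obs, X_obs) {
  n <- length(W_obs)
  theta_grid <- outer(seq_len(n), seq_len(n),
                      function(i, j) theta(X_obs[j], W_obs[i]))
  list(theta  = theta_grid,              # theta_grid[i, j] = theta(X_obs[j], W_obs[i])
       theta0 = theta(rep(0, n), W_obs), # theta(0, W^(i)): numerator of (5)
       mu     = mu(W_obs),
       g0     = g0(W_obs))               # g(0 | W^(i))
}
```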

6 Application
We first present the genomic problem that motivated this study in Section 6.1, and earlier contributions on the same topic in Section 6.2. Two real datasets, which play a central role in this article, are described in Section 6.3. We both (a) draw inspiration from one of them and (b) use it in order to set up our simulation study, as presented in Section 6.4; we also apply the TMLE methodology directly to the other. The specifics of the TMLE procedures that we undertake both on simulated and real data are given in Section 6.5, and their results are summarized in Section 6.6, for the simulation study, and in Section 6.7, for the real data application.

6.1 Association between DNA copy number and gene expression in cancers
The activity of a gene in a cell is directly related to its expression level, that is, the number of messenger RNA (mRNA) fragments corresponding to this gene. Cancer cells are characterized by changes in their gene expression patterns. Such alterations have been shown to be caused directly or indirectly by genetic events, such as changes in the number of DNA copies, and epigenetic events, such as DNA methylation. Some changes in DNA copy number have been reported to be positively associated with gene expression levels [11]. Conversely, DNA methylation is a chemical transformation of cytosines (one of the four types of DNA nucleotides) which is thought to lead to gene expression silencing [5]. Therefore, DNA methylation levels are generally negatively associated with gene expression levels.
We propose to apply the methodology developed in the previous sections to the search for genes for which there exists an association between DNA copy number variation and gene expression level, accounting for DNA methylation.

6.2 Related works
In the context of cancer studies, various methods have been proposed in order to find associations between DNA copy number and gene expression at the level of genes. Because we cannot cite all of them, we try here to cite one relevant publication for each broad type of method. Most of them can be classified into two groups, depending on whether DNA copy number is viewed as a continuous or a discrete variable. When DNA copy number is viewed as a continuous variable, associations between X and Y are generally quantified using a correlation coefficient [11]. When it is viewed as a discrete variable, associations are typically quantified using a test of differential expression between DNA copy number states [26]. A common limitation of these two types of methods is that they are generally good at identifying genes that were already known, but less so at finding novel candidates. This is not surprising: for correlation-based methods, high correlation between X and Y requires both X and Y to vary substantially, in which case it is likely that these (marginal) variations have already been reported. For methods based on differential expression between copy number states, the latter often correspond to biological or clinical groups which are already known and for which differential expression analyses have already been carried out.
In the present paper, we acknowledge the fact that while DNA copy number is observed as a quantitative variable, the copy neutral state (two copies of DNA) generally has positive mass, in the sense that for a given gene, a positive proportion of samples have two copies of DNA.
Another major difference between our method and the ones cited above is that we explicitly incorporate DNA methylation into the analysis. Several papers where DNA copy number, gene expression and DNA methylation are combined have been published recently, but they typically analyze one dimension of (W, X, Y) at a time, and then use an ad hoc rule to merge or intersect the results [1, 17]. The CNAmet method [10] relies on two scores: a score of differential expression between copy number levels on the one hand, and between DNA methylation levels on the other hand; both scores are then summed. In the method proposed here, the three dimensions are studied jointly.

6.3 Datasets
We exploit glioblastoma multiforme (GBM, the most common type of primary adult brain cancer) and ovarian cancer (OvCa, a cancerous growth arising from the ovary) data from The Cancer Genome Atlas (TCGA) project [2], a collaborative initiative to better understand several types of cancers using existing large-scale whole-genome technologies. TCGA has recently completed a comprehensive genomic characterization of these types of tumors, including DNA copy number (X), gene expression (Y), and DNA methylation (W) microarray experiments [18, 19].
Probe-level normalized GBM and OvCa data can be downloaded from the TCGA repository at http://tcga-data.nci.nih.gov/tcga/. In order to study associations between X, Y and W at the level of genes, these probe-level measurements first need to be aggregated into gene-level summaries. We choose to define X, Y and W as follows for a given gene:

- DNA methylation W is the proportion of "methylated" signal at a CpG locus in the gene's promoter region;
- DNA copy number X is a locally smoothed total copy number relative to a set of reference samples;
- expression Y is the "unified" gene expression level across three microarray platforms, as defined by [27].
After this pre-processing step, each gene is represented by a 3 × n matrix, where 3 is the number of data types and n is the number of samples. Figure 2(a) represents DNA methylation, DNA copy number, and gene expression data for one particular gene, EGFR, which is known to be altered in GBM. The association between copy number and expression is non-linear, and high methylation levels are associated with low expression levels.

6.4 Simulation scheme
Figure 2: Illustrating DNA methylation, DNA copy number, and gene expression data. In both graphics, we represent kernel density estimates (diagonal panels), pairwise plots (lower panels), and report the pairwise Pearson correlation coefficients (upper panels). (a) Real dataset corresponding to the EGFR gene in 187 GBM tumor samples. For 130 among the 187 samples, only DNA copy number and gene expression data were available (circles in lower middle plot). (b) Simulated dataset consisting of n = 200 independent copies of the synthetic observed data structure described in Section 6.4. Note that the constant O_2^X is added to each value of X so that graphics corresponding to real and simulated data can be more easily compared.

Because association patterns between copy number, expression and methylation are generally non-linear, setting up a realistic simulation model is a difficult task. We design here a simulation strategy based on perturbations of real observed data structures, which mimics situations such as the one observed in Figure 2(a) for the EGFR gene in GBM. This strategy implements the following constraints:
- there are generally up to three copy number classes: normal regions, and regions of copy number gains and losses;
- in normal regions, expression is negatively correlated with methylation;
- in regions of copy number alteration, copy number and expression are positively correlated.
Our simulation scheme relies on three real observed data structures O_1, O_2, O_3 corresponding to three samples from different copy number classes: loss (class 1), normal (class 2), and gain (class 3). We simulate a synthetic observed data structure O = (W, X, Y) ∼ P_s as follows. Given a vector p = (p_1, p_2, p_3) of proportions such that p_1 + p_2 + p_3 = 1, we first draw a class assignment U from the multinomial distribution with parameter (1, p) (in other words, U = u with probability p_u). Conditionally on U, a measure W of DNA methylation is drawn randomly as a perturbation of the DNA methylation O_U^W in the corresponding real observed data structure O_U, the magnitude of the perturbation being governed by a vector ω = (ω_1, ω_2, ω_3) of positive numbers and based on a standard normal random variable Z independent of U. Finally, a couple (X, Y) of DNA copy number and DNA expression is drawn conditionally on (U, W) as a perturbation of the couple (O_U^X, O_U^Y) in the corresponding real observed data structure O_U (with an additional centering applied to X so that the pivot value be equal to 0). Given σ^2 > 0, two 2 × 2 variance-covariance matrices Σ_1 and Σ_3 and a non-increasing mapping λ_0: in the copy-neutral class (U = 2), X = 0 and Y is drawn as a Gaussian perturbation of O_2^Y with conditional mean shift λ_0(W) and variance σ^2, based on a standard normal random variable Z′ independent of (U, W); in the altered classes (U ∈ {1, 3}), (X, Y) is drawn from the bivariate Gaussian distribution centered at (O_U^X − O_2^X, O_U^Y) with covariance matrix Σ_U. In particular, the reference/pivot value is x_0 = 0. Note that λ_0 is chosen non-increasing in order to account for the negative association between DNA expression and methylation. Furthermore, the synthetic observed data structure O drawn from P_s is not bounded. We easily derive closed-form expressions for the features of interest θ(P_s), μ(P_s), g(P_s), and σ^2(P_s), which we report in the Appendix (see Lemma 7). Relying on Lemma 7 makes it possible to evaluate the value of Ψ(P_s), by following the procedure described in Section 3.1 (see details in Section 6.6).
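For concreteness, here is a minimal sketch of one way to instantiate the scheme, under the per-class mechanisms just described (the exact perturbation formulas of the original scheme are not fully recoverable from this section, so every formula and numeric choice below is an illustrative assumption):

```r
# One draw from a simulation scheme of the above type. O is a list of three
# named vectors O[[u]] = c(W = ., X = ., Y = .), u = 1 (loss), 2 (normal),
# 3 (gain); p, omega, lambda0, sigma2, Sigma are as in the text.
simulate_one <- function(O, p, omega, lambda0, sigma2, Sigma) {
  u <- sample(1:3, size = 1, prob = p)          # class assignment U
  W <- O[[u]][["W"]] + omega[u] * rnorm(1)      # perturbed methylation (assumed additive)
  if (u == 2) {                                 # copy-neutral class:
    X <- 0                                      # exposure at the pivot x0 = 0
    Y <- O[[u]][["Y"]] + lambda0(W) + sqrt(sigma2) * rnorm(1)
  } else {                                      # altered classes:
    Z2 <- as.vector(t(chol(Sigma[[u]])) %*% rnorm(2))
    X <- O[[u]][["X"]] - O[[2]][["X"]] + Z2[1]  # centered so the pivot is 0
    Y <- O[[u]][["Y"]] + Z2[2]
  }
  c(W = W, X = X, Y = Y)
}

# Illustration with the parameters reported in the next paragraph
# (class 1 is never drawn since p_1 = 0; the O values are made up):
O <- list(c(W = 0.1, X = 1, Y = 5), c(W = 0.3, X = 2, Y = 6), c(W = 0.2, X = 4, Y = 8))
Sigma <- list(NULL, NULL, matrix(c(9.96, 1, 1, 0.43), 2, 2))
simulate_one(O, p = c(0, 1/2, 1/2), omega = c(0, 3, 3),
             lambda0 = function(w) -w, sigma2 = 1, Sigma = Sigma)
```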
Finally, we provide in Figure 2(b), for the sake of illustration, a visual summary of a simulation run with n = 200 independent copies of the synthetic observed data structure O drawn from P_s, based on the real observed data structures from two GBM samples for the EGFR gene which are described in Table 1. The parameters for this simulation were chosen as follows: p = (0, 1/2, 1/2), ω = (0, 3, 3), λ_0 : w ↦ −w, σ^2 = 1, and

Σ_3 = ( 9.96  1 ; 1  0.43 )

(Σ_1 plays no role here since p_1 = 0).

6.5 Library of algorithms for super-learning
We explain in Section 5.3 that we rely on super-learning [24, 22] in order to estimate some relevant infinite-dimensional features of P_0, including (but not limited to) θ(P_0), μ(P_0) and g(P_0). This algorithmic challenge is easily overcome, thanks to the remarkable R-package SuperLearner [12] and the possibility to rely on the library of R-packages [13] built by the statistical community. As for the base predictors, they involve (in alphabetical order):

- Generalized additive models: we use the gam R-package [4], with its default values.
- Piecewise linear splines: we use the polymars R-function from the polspline R-package [6], with its default values.
- Random forests: we use the randomForest R-package [9], with its default values.
- Support vector machines: we use the svm R-function from the e1071 R-package [3], with its default values.
Note that none of the statistical models associated with the above estimation procedures contains P_s (see Lemma 7).
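For instance, the super-learning of θ(P_0)(X, W) = E(Y|X, W) with this library can be invoked along the following lines (a sketch; dat is a hypothetical data frame with columns W, X, Y):

```r
library(SuperLearner)

# Super-learning of theta(P_0)(X, W) = E(Y | X, W) with the base library of
# this section; weights are chosen by V-fold cross-validation and NNLS.
sl_fit <- SuperLearner(
  Y          = dat$Y,
  X          = dat[, c("X", "W")],
  family     = gaussian(),
  SL.library = c("SL.gam", "SL.polymars", "SL.randomForest", "SL.svm"),
  cvControl  = list(V = 10)
)

# Predicted values theta_hat(x, w), e.g. at the reference exposure x = 0:
theta0_hat <- predict(sl_fit, newdata = data.frame(X = 0, W = dat$W))$pred
```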

6.6 Simulation study
We conduct a simulation study twice; each time, B′ = 10^3 datasets of n = 200 independent observed data structures are (independently) generated under P_s (i.e., under the simulation scheme described in Section 6.4). In each simulation study and for every simulated dataset, we perform the TMLE methodology for the purpose of estimating the target parameter Ψ(P_s).
From one simulation study to the other, we only change the set-up of the super-learning procedure, by modifying the library of algorithms involved in the super-learning of the features of interest:

- the first time, we proceed exactly as described in Section 6.5 (we say that the full-SL is undertaken);
- the second time, we decide to include only algorithms based on generalized linear models (we say that the light-SL is undertaken).
We do not use any index to refer to the super-learning set-up (full-SL or light-SL), for the sake of alleviating notation.
In each simulation study (i.e., for each set-up of the super-learning procedure, full-SL and light-SL) and for each b ≤ B′, we record the values ψ_{n,b}^k = Ψ(P_{n,b}^k) of the initial substitution estimator (k = 0) and subsequent updated substitution estimators (k = 1, 2, 3) targeting Ψ(P_s), as derived on the bth simulated dataset (whose empirical measure is denoted by P_{n,b}). The targeted update steps rely on the Gaussian fluctuations presented in Section 5.2 (the results are very similar when one applies either the general one-step updating procedure of Section 3.2 or the second tailored alternative two-step updating procedure of Section 5.2). We do not record the next updates because the ad hoc stopping criterion that we devise systematically indicates that this is not necessary (heuristically, the criterion elaborates on the gains in likelihood and the variations in the resulting estimates).
The value of Ψ(P_s) is evaluated by simulations, following (5) in Section 3.1 with P_s substituted for P_n^0 (we rely on B = 10^5 simulated observed data structures, whose empirical measure is denoted by P_B; the features θ(P_s) and σ^2(P_s) are explicitly known, see Lemma 7). In order to get a sense of how accurate our evaluation of Ψ(P_s) is, we also use the same large simulated dataset to evaluate Var_{P_s} D^⋆(P_s)(O) (as the empirical variance Var_{P_B} D^⋆(P_s)(O); again, D^⋆(P_s) is known explicitly by Lemma 7). Denoting by ψ_B(P_s) and v_B(P_s) the latter evaluations, we interpret the intervals [ψ_B(P_s) ± ξ_{1−α/2}√(v_B(P_s)/n)] and [ψ_B(P_s) ± ξ_{1−α/2}√(v_B(P_s)/B)] as (1 − α)-accuracy intervals for the evaluation of Ψ(P_s) based on n = 200 and B = 10^5 independent observed data structures, respectively (ξ_{1−α/2} denoting the (1 − α/2)-quantile of the standard normal distribution). The gray intervals in Figure 3 represent these accuracy intervals for α = 5%, n = 200 (light gray) and B = 10^5 (dark gray). Note that (by the convolution theorem) the length of [ψ_B(P_s) ± ξ_{0.975}√(v_B(P_s)/n)] is the optimal length of a 95%-confidence interval based on an efficient (regular) estimator of Ψ(P_s) relying on n observations (assuming that the asymptotic regime is reached). The numerical values are reported in Table 2.
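Computationally, the accuracy intervals are immediate; a minimal sketch, where psi_B is the Monte Carlo evaluation of Ψ(P_s) and Dstar_draws a hypothetical vector of the B evaluations of D^⋆(P_s)(O):

```r
# (1 - alpha)-accuracy intervals for the evaluation of Psi(P_s), based on
# n = 200 and B = length(Dstar_draws) observed data structures respectively.
accuracy_intervals <- function(psi_B, Dstar_draws, n = 200, alpha = 0.05) {
  v_B <- var(Dstar_draws)          # evaluates Var_{P_s} Dstar(P_s)(O)
  xi  <- qnorm(1 - alpha / 2)
  B   <- length(Dstar_draws)
  list(n_scale = psi_B + c(-1, 1) * xi * sqrt(v_B / n),
       B_scale = psi_B + c(-1, 1) * xi * sqrt(v_B / B))
}
```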
The results of this joint simulation study are summarized by Figure 3 (which shows kernel density estimates of the empirical distributions of {ψ_{n,b}^k : b ≤ B′} for 0 ≤ k ≤ 3) and Table 3. They illustrate some of the fundamental characteristics of the TMLE and related confidence intervals: convergence of the iterative updating procedure, robustness, asymptotic normality, and coverage.
Convergence of the iterative updating procedure, and robustness. A substantial bias in the initial estimation is revealed by the location of the mode of {ψ_{n,b}^0 : b ≤ B′} in Figure 3, both for the full-SL and light-SL procedures. We see that the full-SL initial estimator is less biased than its light-SL counterpart. As one can judge visually or by the first rows of Tables 3(a) and 3(b), this initial bias is diminished (if not perfectly corrected) at the first updating step of the TMLE procedure, illustrating the robustness of the targeted estimator. The empirical distributions of {ψ_{n,b}^k : b ≤ B′} for k = 1, 2, 3 are moreover very similar to each other, illustrating the convergence of the iterative updating procedure.

Table 1: Real methylation, copy number and expression data used as a baseline for simulating the datasets according to the simulation scheme presented in Section 6.4. A visual of the simulated dataset is provided in Figure 2(b).

Coverage. The theoretical convergence in distribution of the TMLE to a Gaussian limit (e.g., under the conditions of Corollary 1) promotes the use of the intervals [ψ_{n,b}^k ± ξ_{1−α/2} s_{n,b}^k/√n], where (s_{n,b}^k)^2 is the empirical variance of the efficient influence curve at P_{n,b}^k. Interestingly, the theoretical result of Corollary 1 does not guarantee that it is safe to estimate the limit variance by (s_{n,b}^k)^2 (additional assumptions on the construction and convergence of θ(P_n^{k_n}), μ(P_n^{k_n}) and g(P_n^{k_n}) would be required to get such a result). We nonetheless check whether the latter intervals provide the wished coverage or not. For this purpose, we compute and report in the sixth and seventh rows of Tables 3(a) and 3(b) the empirical coverage, i.e., the proportion of b ≤ B′ such that ψ_B(P_s) ∈ [ψ_{n,b}^k ± ξ_{1−α/2} s_{n,b}^k/√n], and the optimistic empirical coverage, i.e., the proportion of b ≤ B′ such that [ψ_B(P_s) ± ξ_{1−α/2}√(v_B(P_s)/B)] ∩ [ψ_{n,b}^k ± ξ_{1−α/2} s_{n,b}^k/√n] ≠ ∅ (the latter incorporates the remaining uncertainty about the true value of Ψ(P_s)). We conclude that the provided coverage is good for the light-SL procedure (with excellent optimistic coverage), but disappointing for the full-SL procedure (even for the optimistic coverage). The results might have been better if one had relied on the bootstrap in order to estimate the asymptotic variance of the TMLE. We will investigate this issue in future work.

6.7 Real data application
For the real data application, we focus on all 130 genes g ∈ G of chromosome 18 in the OvCa dataset. This choice is notably motivated by the associated sample size, approximately equal to 500 (thus much larger than the sample size associated with the GBM dataset). We estimate the non-parametric variable importance measure of X on Y accounting for W for each gene separately (i.e., Ψ(P_0^g), where P_0^g ∈ M is the true distribution of O = (W, X, Y) for gene g), following exactly one of the statistical methodologies developed in the simulation study. Specifically, the targeted update step relies on the Gaussian fluctuations presented in Section 5.2, and the super-learning involves the library of algorithms that we report in Section 6.5. In particular, we estimate for each gene g the asymptotic variance of the TMLE ψ_n^{g,*} of Ψ(P_0^g) with the empirical variance (s_n^{g,*})^2 of the efficient influence curve at P_n^{g,*}. In a future work solely devoted to this real data application, we will use the bootstrap in order to derive a more robust estimator of the asymptotic variance (again, Corollary 1 requires some conditions on P_0^g and P_n^{g,*} in order to guarantee that (s_n^{g,*})^2 is a consistent estimator). We will also "extend" W, by adding to the DNA methylation of the gene of interest the DNA methylations, DNA copy numbers and gene expressions of its neighboring genes. We only briefly summarize the results of the real data application. For this purpose, we report in Figure 4 the values of the test statistics √n(ψ_n^{g,3} − ψ_ref^g)/s_n^{g,3} (g ∈ G) derived from the TMLE after three updates, using two different reference values ψ_ref^g ∈ {0, F(P_n^g)}.
Here, F(P_n^g) = Σ_{i=1}^n X^{(i)}Y^{(i)} / Σ_{i=1}^n (X^{(i)})^2 is the least squares (substitution, asymptotically efficient) estimator of the parameter F(P_0^g), see (4), a parameter which overlooks the role potentially played by W while quantifying the influence of X on Y. We are aware that F(P_n^g) is not independent of ψ_n^{g,3} and s_n^{g,3}, and will make sure in a future work solely devoted to this real data application that our estimator of F(P_0^g) is derived from an independent dataset (or we will undertake a cross-validated procedure). The reference value ψ_ref^g = 0 is a natural null value to rely on from a testing perspective. Using ψ_ref^g = F(P_n^g) as another null value is relevant because it allows us to identify those genes for which the (possibly intricate) role played by W in quantifying the influence of X on Y is especially important and results in a stark deviation of Ψ(P_0^g) from F(P_0^g). Looking at the left graphic in Figure 4 teaches us that a majority of the Ψ(P_0^g) (g ∈ G) are likely positive. Eight genes stand out (by having a test statistic √n ψ_n^{g,3}/s_n^{g,3} > 45): two genes at 18p11.32 (USP14 and THOC1), a cluster of five genes at 18q11.2 (SNRPD1, RBBP8, RIOK3, NPC1, SS18), and gene MBP at 18q23. This suggests that the region 18q11.2 (especially 19-24 Mb) is of particular relevance in this set of ovarian cancers. Seven out of the latter eight genes (specifically, all of them but gene NPC1) also stand out in the right graphic of Figure 4: six of the seven genes standing out in both graphics (specifically, all of them but gene MBP) exhibit a significantly small test statistic (by having √n(ψ_n^{g,3} − F(P_n^g))/s_n^{g,3} < −6), as does the additional gene SERPINB2, while gene MBP exhibits a significantly large test statistic (by having √n(ψ_n^{g,3} − F(P_n^g))/s_n^{g,3} > 6), as do eight additional genes (MBD1, TXNL1, LMAN1, WDR7, NARS, ZNF236, ATP9B, TXNL4A). All genes standing out in the right graphic of Figure 4 are located at 18q2 (41-76 Mb).

Proof of Lemma 6, last part. Since the Kullback-Leibler divergence between the Bernoulli laws with parameters p and q is non-negative, with equality iff p = q, we obtain that θ(P_0) minimizes P ↦ P_0 L_{a,b}(P), and also that any other minimizer must satisfy θ(P)(X, W) = θ(P_0)(X, W) P_0-almost surely. The second equality is easily obtained by differentiating.

A.3 Proof of Lemma 5
Proof of Lemma 5. Assume for the time being that, for all W ∈ W, there exists λ_n such that (17) holds with λ_n substituted for λ. Then, for all W ∈ W, the point with coordinates (E_{P_n^0}(X|X ≠ 0, W), φ_{n,λ_n}(E_{P_n^0}(X|X ≠ 0, W))) lies in the convex envelope of the set {(X^{(i)}, (X^{(i)})^2) : i ≤ n} \ {(0, 0)}. Equivalently, there exist for all W ∈ W three non-negative weights p_1, p_2, p_3 summing up to 1 and three different values x^{(1)}, x^{(2)}, x^{(3)} ∈ {X^{(i)} : i ≤ n} \ {0} such that

E_{P_n^0}(X|X ≠ 0, W) = Σ_{k=1}^3 p_k x^{(k)} and φ_{n,λ_n}(E_{P_n^0}(X|X ≠ 0, W)) = Σ_{k=1}^3 p_k (x^{(k)})^2,

the right-hand side expressions being, respectively, the mean and second order moment of the distribution Σ_{k=1}^3 p_k Dirac(x^{(k)}). Thus, there exists P_n^{00} ∈ M such that (i) and (ii) hold.
Proof of Lemma 3, first part. Let us first show, by contradiction, that lim_{k→∞} P_n D^⋆(P_n^k) = 0 under the stated assumptions. Suppose that P_n D^⋆(P_n^k) does not converge to 0 as k → ∞: there exist η > 0 and an increasing function φ : N → N such that, for all k ≥ 0,

|P_n D^⋆(P_n^{φ(k)})| ≥ η.   (22)

We show that necessarily lim_{k→∞} ε_n^{φ(k)} = 0, hence by Lemma 2 that lim_{k→∞} P_n D^⋆(P_n^{φ(k)}) = 0, contradicting (22).

Figure 3: Empirical distribution of {ψ_{n,b}^k : b ≤ B′} based on n = 200 independent observed data structures, for k = 0 (initial estimator) and k iterations of the updating procedure (k = 1, 2, 3), as obtained from B′ = 10^3 independent replications of the simulation study (using a Gaussian kernel density estimator). (a) The super-learning procedure involves all algorithms described in Section 6.5. (b) The super-learning procedure only involves algorithms based on generalized linear models. In both graphics, gray rectangles represent 95%-accuracy intervals [ψ_B(P_s) ± ξ_{0.975}√(v_B(P_s)/n)] and [ψ_B(P_s) ± ξ_{0.975}√(v_B(P_s)/B)] for the true parameter Ψ(P_s) based on n = 200 observed data structures (light gray) and B = 10^5 observed data structures (dark gray). The length of [ψ_B(P_s) ± ξ_{0.975}√(v_B(P_s)/n)] is the optimal length of a 95%-confidence interval based on an efficient (regular) estimator of Ψ(P_s) relying on n observations (assuming that the asymptotic regime is reached).

Figure 4: Real data application to the 130 genes of chromosome 18 in the OvCa dataset (ovarian cancers). We represent the test statistics √n(ψ_n^{g,3} − ψ_ref^g)/s_n^{g,3} for ψ_ref^g = 0 (left graphic) and ψ_ref^g = F(P_n^g) (right graphic), along the position of gene g on the genome. We report the names of the genes such that √n|ψ_n^{g,3}|/s_n^{g,3} > 45 (left graphic) and √n|ψ_n^{g,3} − F(P_n^g)|/s_n^{g,3} > 6 (right graphic), the cut-offs being arbitrarily chosen.