Reweighting samples under covariate shift using a Wasserstein distance criterion

Considering two random variables with different laws to which we only have access through finite-size i.i.d. samples, we address how to reweight the first sample so that its empirical distribution converges towards the true law of the second sample as the size of both samples goes to infinity. We study an optimal reweighting that minimizes the Wasserstein distance between the empirical measures of the two samples, and leads to an expression of the weights in terms of Nearest Neighbors. The consistency and some asymptotic convergence rates in terms of expected Wasserstein distance are derived, and do not require the assumption of absolute continuity of one random variable with respect to the other. These results have applications in Uncertainty Quantification for decoupled estimation and in bounding the generalization error of Nearest Neighbor regression under covariate shift.


Regression under covariate shift
This article is dedicated to the study of a method aimed at approximating the law of a random variable
\[ Y = f(X, \Theta), \tag{1} \]
where $X \in \mathbb{R}^d$ and $\Theta \in \boldsymbol{\Theta}$ are independent random variables, with respective laws denoted by $\mu_X$ and $\mu_\Theta$, and $f : \mathbb{R}^d \times \boldsymbol{\Theta} \to \mathbb{R}^e$ is a measurable function. The space $\boldsymbol{\Theta}$ is only assumed to be measurable. The specificity of the problem at stake is that we assume to be provided with:
• a training sample $(X'_j, Y'_j = f(X'_j, \Theta_j))_{j \in J1,mK}$, where $\Theta_j$ has law $\mu_\Theta$ and is independent from $X'_j$, but the law $\mu_{X'}$ of $X'_j$ may differ from $\mu_X$;
• an evaluation sample $(X_i)_{i \in J1,nK}$ with i.i.d. observations distributed according to $\mu_X$.
This situation is known as covariate shift in the statistical learning literature [8,33]. This problem is motivated by the study of decomposition-based uncertainty quantification (UQ) methods in complex industrial systems, as is detailed in Subsection 5.2 below. In this context, the overall objective is to approximate a quantity of interest of the form
\[ \mathrm{QI} := \mathbb{E}[\varphi(Y)] \tag{2} \]
for some function $\varphi : \mathbb{R}^e \to \mathbb{R}$. Following previous works in this direction [2,3,4], our estimator of QI assumes the form
\[ \widehat{\mathrm{QI}}_{m,n} := \frac{1}{m} \sum_{j=1}^m w_j\, \varphi(Y'_j), \tag{3} \]
where the vector of weights $w_m = (w_1, \ldots, w_m)$ is chosen so that the weighted empirical measure $\hat\mu^{w_m}_{X'_m} := \frac{1}{m} \sum_{j=1}^m w_j \delta_{X'_j}$ of the training sample $X'_m = (X'_1, \ldots, X'_m)$ be close, in a sense which will be made precise below, to the empirical measure $\hat\mu_{X_n} := \frac{1}{n} \sum_{i=1}^n \delta_{X_i}$ of the evaluation sample $X_n = (X_1, \ldots, X_n)$. Such a reweighting procedure is a standard approach to the problem of density ratio estimation in the statistical learning literature [35], the purpose of which is to estimate the density $\mathrm{d}\mu_X/\mathrm{d}\mu_{X'}$ from the samples $X_n$ and $X'_m$, without estimating separately the measures $\mu_X$ and $\mu_{X'}$. While the theoretical analysis of such methods almost always requires this density to exist, in the UQ context which motivates the present study it is desirable not to assume that either of the measures $\mu_X$ and $\mu_{X'}$ be absolutely continuous with respect to the other, see in particular Remark 5.3. The first step of our work is thus the resolution of the minimization problem
\[ \min_{w_m}\, W_q\big(\hat\mu_{X_n}, \hat\mu^{w_m}_{X'_m}\big) \tag{4} \]
over vectors of weights $w_m$ satisfying the normalization constraint
\[ w_j \geq 0 \text{ for all } j \in J1,mK, \qquad \frac{1}{m} \sum_{j=1}^m w_j = 1, \tag{5} \]
where $W_q$ denotes the Wasserstein distance of order $q$, and the use of the optimal weights in the estimator (3) of QI. The vector of weights $w_m$ which is optimal for (4)-(5) turns out not to depend on the value of $q$, and the associated estimator $\widehat{\mathrm{QI}}_{m,n}$ defined by (3) coincides with the 1-NN estimator $\widehat{\mathrm{QI}}^{(1)}_{m,n}$. For this reason, we shall denote by $w^{(1)}_m$ the vector of optimal weights for (4)-(5), and more generally by $w^{(k)}_m$ the vector of weights induced by the $k$-NN estimator of $\psi$.
The main results of this paper describe the asymptotic behavior, as the respective sizes $m$ and $n$ of the training and evaluation samples grow to infinity, of both the Wasserstein distance $W_q(\hat\mu_{X_n}, \hat\mu^{w^{(k)}_m}_{X'_m})$ and the estimator $\widehat{\mathrm{QI}}^{(k)}_{m,n}$ of QI. While taking $k = 1$ is optimal for the convergence of $\hat\mu^{w^{(k)}_m}_{X'_m}$ to $\hat\mu_{X_n}$, one may expect from the theory of NN regression that the estimator $\widehat{\mathrm{QI}}^{(k)}_{m,n}$ display better convergence properties if $k$ is chosen to grow to infinity with $m$. Therefore we shall study both regimes $k = 1$ and $k = k_m \to +\infty$.

Outline of the article
The derivation of the Wasserstein optimal weights $w_m$ is detailed in Section 2, where we also highlight connections between our results and various topics in numerical probability and statistical learning. The asymptotic behaviors of $W_q(\hat\mu_{X_n}, \hat\mu^{w^{(k)}_m}_{X'_m})$ and $\widehat{\mathrm{QI}}^{(k)}_{m,n}$ are respectively studied in Sections 3 and 4. Applications to decomposition-based UQ and the generalization error for NN regression under covariate shift, as well as numerical illustrations, are presented in Section 5.

Notation
We denote by $\mathbb{N}$ the set of the natural integers including zero and by $\mathbb{N}^* = \mathbb{N} \setminus \{0\}$ the set of positive integers. Given two integers $n_1 \leq n_2$, the set of the integers between $n_1$ and $n_2$ is written $Jn_1, n_2K = \{n_1, \ldots, n_2\}$. For $x \in \mathbb{R}$, $\lfloor x \rfloor$ (resp. $\lceil x \rceil$) is the unique integer verifying $\lfloor x \rfloor \leq x < \lfloor x \rfloor + 1$ (resp. $\lceil x \rceil - 1 < x \leq \lceil x \rceil$). For $(x, y) \in \mathbb{R}^2$, we use the join and meet notation $x \wedge y = \min(x, y)$ and $x \vee y = \max(x, y)$. Last, we denote by $(x)_+ := 0 \vee x$ and $(x)_- := 0 \vee (-x)$ the nonnegative and nonpositive parts of $x \in \mathbb{R}$.
We fix a norm $|\cdot|$ on $\mathbb{R}^d$, which need not be the Euclidean norm. The supremum norm of $\varphi : \mathbb{R}^d \to \mathbb{R}$ is denoted by $\|\varphi\|_\infty = \sup_{x \in \mathbb{R}^d} |\varphi(x)|$. The distance between a point $x \in \mathbb{R}^d$ and a subset $A \subset \mathbb{R}^d$ is denoted by $\mathrm{dist}(x, A)$. Last, for all $x \in \mathbb{R}^d$ and $r \geq 0$, we denote $B(x, r) := \{x' \in \mathbb{R}^d : |x - x'| \leq r\}$, and recall that the support of a probability measure $\nu \in \mathcal{P}(\mathbb{R}^d)$ is defined by
\[ \mathrm{supp}(\nu) := \big\{ x \in \mathbb{R}^d : \forall r > 0,\ \nu(B(x, r)) > 0 \big\}. \]

Optimal weights for Wasserstein distances
We begin by recalling the definition of the Wasserstein distance.

Definition 2.1 (Wasserstein distance). Let $\mathcal{P}(\mathbb{R}^d)$ be the set of probability measures on $\mathbb{R}^d$ and, for any $q \in [1, +\infty)$, let
\[ \mathcal{P}_q(\mathbb{R}^d) := \Big\{ \mu \in \mathcal{P}(\mathbb{R}^d) : \int_{\mathbb{R}^d} |x|^q\, \mathrm{d}\mu(x) < +\infty \Big\}. \]
The Wasserstein distance of order $q$ between $\mu$ and $\nu \in \mathcal{P}_q(\mathbb{R}^d)$ is defined as
\[ W_q(\mu, \nu) := \inf_{\gamma \in \Pi(\mu, \nu)} \bigg( \int_{\mathbb{R}^d \times \mathbb{R}^d} |x - x'|^q\, \mathrm{d}\gamma(x, x') \bigg)^{1/q}, \]
where $\Pi(\mu, \nu)$ denotes the set of couplings between $\mu$ and $\nu$.

We refer to [38, Section 6] for a general introduction to Wasserstein distances. This definition allows for an explicit resolution of the minimization problem (4)-(5), which relies on the notion of Nearest Neighbor (NN). For $x \in \mathbb{R}^d$ and $k \in J1, mK$, we denote by $\mathrm{NN}^{(k)}_{X'_m}(x)$ the $k$-th Nearest Neighbor ($k$-NN) of $x$ among the sample $X'_m$, that is to say the $k$-th closest point to $x$ among $X'_1, \ldots, X'_m$ for the norm $|\cdot|$. If there are several such points, we define $\mathrm{NN}^{(k)}_{X'_m}(x)$ to be the point $X'_j$ with lowest index $j$. We omit the superscript notation $(k)$ when referring to the 1-NN, i.e. $\mathrm{NN}_{X'_m} := \mathrm{NN}^{(1)}_{X'_m}$.
In the next statement, for any $i \in J1, nK$ and $l \in J1, mK$, we denote by $j^{(l)}_i \in J1, mK$ the index such that $X'_{j^{(l)}_i} = \mathrm{NN}^{(l)}_{X'_m}(X_i)$.

Proposition 2.2 ($k$-NN weights). For $k \in J1, mK$, define the vector of weights $w^{(k)}_m = (w^{(k)}_1, \ldots, w^{(k)}_m)$ by
\[ w^{(k)}_j := \frac{m}{nk} \sum_{i=1}^n \sum_{l=1}^k \mathbb{1}_{\{j = j^{(l)}_i\}}. \tag{7} \]
The vector $w^{(k)}_m$ satisfies (5) and verifies, for all $q \in [1, +\infty)$,
\[ W_q^q\big(\hat\mu_{X_n}, \hat\mu^{w^{(k)}_m}_{X'_m}\big) \leq \frac{1}{nk} \sum_{i=1}^n \sum_{l=1}^k \big| X_i - \mathrm{NN}^{(l)}_{X'_m}(X_i) \big|^q. \tag{8} \]
For $k = 1$, the equality
\[ W_q^q\big(\hat\mu_{X_n}, \hat\mu^{w^{(1)}_m}_{X'_m}\big) = \frac{1}{n} \sum_{i=1}^n \big| X_i - \mathrm{NN}_{X'_m}(X_i) \big|^q \tag{9} \]
is reached and the vector $w^{(1)}_m$ is optimal for (4) in the sense that for any $w_m = (w_1, \ldots, w_m)$ which also satisfies (5), we have
\[ W_q\big(\hat\mu_{X_n}, \hat\mu^{w^{(1)}_m}_{X'_m}\big) \leq W_q\big(\hat\mu_{X_n}, \hat\mu^{w_m}_{X'_m}\big). \tag{10} \]

In other words, for a given $j \in J1, mK$, $w^{(k)}_j$ is proportional to the number of points $X_i$ of which $X'_j$ is one of the first $k$ NNs. We refer to [27] for a numerical illustration of the use of the vector of weights $w^{(1)}_m$ in the context of classification under covariate shift.
Proof. For a general vector of weights $w_m = (w_1, \ldots, w_m)$ which satisfies (5), the Wasserstein distance $W_q^q(\hat\mu_{X_n}, \hat\mu^{w_m}_{X'_m})$ is the solution of the following optimal transport problem:
\[ \inf_{\gamma}\ \sum_{i=1}^n \sum_{j=1}^m \gamma_{i,j}\, |X_i - X'_j|^q, \qquad \text{subject to} \quad \sum_{j=1}^m \gamma_{i,j} = \frac{1}{n}, \quad \sum_{i=1}^n \gamma_{i,j} = \frac{w_j}{m}, \quad \gamma_{i,j} \geq 0, \]
where $\gamma_{i,j}$ is the coefficient of the discrete transport plan between $\delta_{X_i}$ and $\delta_{X'_j}$.
For the $k$-NN vector of weights $w^{(k)}_m$ defined by (7), the transport plan $\gamma_{i,j} := \frac{1}{nk} \sum_{l=1}^k \mathbb{1}_{\{j = j^{(l)}_i\}}$ satisfies the two marginal conditions. Reordering the terms in the associated cost gives the upper bound of Equation (8).
We now prove the equality (9) and the optimality (10) of $w^{(1)}_m$ at the same time. On the one hand, it is clear that for any vector of weights $w_m = (w_1, \ldots, w_m)$ and any transport plan $(\gamma_{i,j})_{(i,j) \in J1,nK \times J1,mK}$ between $\hat\mu_{X_n}$ and $\hat\mu^{w_m}_{X'_m}$, we have
\[ \sum_{i=1}^n \sum_{j=1}^m \gamma_{i,j}\, |X_i - X'_j|^q \geq \frac{1}{n} \sum_{i=1}^n \big| X_i - \mathrm{NN}_{X'_m}(X_i) \big|^q, \]
therefore taking the infimum over all transport plans yields
\[ W_q^q\big(\hat\mu_{X_n}, \hat\mu^{w_m}_{X'_m}\big) \geq \frac{1}{n} \sum_{i=1}^n \big| X_i - \mathrm{NN}_{X'_m}(X_i) \big|^q. \]
On the other hand, taking $w_m = w^{(1)}_m$ in the left-hand side and combining this inequality with (8) for $k = 1$, we obtain both the equality (9) and the optimality (10).

Remark 2.3. In order to alleviate notation, we now write $\hat\mu^{(k)}_{X'_m} := \hat\mu^{w^{(k)}_m}_{X'_m}$.
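To make the construction concrete, here is a minimal numerical sketch of the weights $w^{(k)}_m$ of (7), assuming the Euclidean norm and relying on scikit-learn's neighbor search (whose tie-breaking may differ from the lowest-index convention above); the helper name `knn_weights` is ours, not the paper's.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_weights(X_train, X_eval, k=1):
    """w_j^(k) = (m / (n k)) * #{(i, l) : X'_j is the l-th NN of X_i, l <= k}, cf. (7)."""
    m, n = len(X_train), len(X_eval)
    nbrs = NearestNeighbors(n_neighbors=k).fit(X_train)
    _, idx = nbrs.kneighbors(X_eval)              # shape (n, k): the indices j_i^(l)
    counts = np.bincount(idx.ravel(), minlength=m)
    return m * counts / (n * k)                   # normalized so that w.mean() == 1

# Toy usage: training sample from a shifted law, evaluation sample from mu_X.
rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(500, 2))     # plays the role of (X'_j)
X_eval = rng.normal(0.5, 0.8, size=(200, 2))      # plays the role of (X_i)
w = knn_weights(X_train, X_eval, k=1)
assert np.isclose(w.mean(), 1.0)                  # constraint (5)
```

The returned vector can then be plugged into the estimator (3) by averaging $w_j \varphi(Y'_j)$ over the training sample.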

Comments on Proposition 2.2
In this subsection, we discuss the relation between the result of Proposition 2.2 and other fields in numerical probability and statistical learning, as well as a generalization of this result to a more general framework.

Link with optimal quantization
It is clear from Proposition 2.2 that $\hat\mu^{(1)}_{X'_m}$ is the pushforward of $\hat\mu_{X_n}$ by $\mathrm{NN}_{X'_m}$, and that this transport map yields an optimal coupling between $\hat\mu_{X_n}$ and $\hat\mu^{(1)}_{X'_m}$ in Definition 2.1, for any $q \geq 1$. The idea to associate each point with its nearest neighbor in a given grid is the basis of the theory of optimal quantization [23,24,30]. In this context, the sample $X'_m$ plays the role of the quantization grid, and $\mathrm{NN}_{X'_m}$ is known to be the optimal quantization function. The right-hand side of (9) then corresponds to the $L^q$ mean quantization error induced by the grid $X'_m$ for the measure $\hat\mu_{X_n}$.

Link with geometric inference
When $q = 2$, the right-hand side of (8) rewrites
\[ \frac{1}{n} \sum_{i=1}^n d^2_{\hat\mu_{X'_m},\, k/m}(X_i), \tag{11} \]
where $d_{\mu, \alpha}(\cdot)$ is the distance function to $\mu$ with parameter $\alpha$ introduced by Chazal, Cohen-Steiner and Mérigot in [11, Definition 3.2] in order to perform geometric and topological inference for set estimation, see also [12,9] for robust inference. In particular, [11, Proposition 3.3] shows that for any $x \in \mathbb{R}^d$,
\[ d^2_{\hat\mu_{X'_m},\, k/m}(x) = \inf\Big\{ W_2^2\big(\delta_x, \hat\mu^{w_m}_{X'_m}\big) : w_m = (w_1, \ldots, w_m) \text{ satisfies (5) and } w_j \leq m/k \text{ for all } j \Big\}. \tag{12} \]
This result may be directly compared with the estimates (8) and (9), at least in the case where $\hat\mu_{X_n} = \delta_x$. In this case, if $k = 1$, then the supplementary constraint $w_j \leq m/k$ is necessarily implied by (5) and therefore, combining (12) with (9), we recover the optimality result (10). For arbitrary values of $k$, the combination of (12) with (8) shows that the vector $w^{(k)}_m$ need not be optimal for (10), but yields a transport cost no larger than that of any vector of weights satisfying the supplementary constraint $w_j \leq m/k$.

A more general problem
Proposition 2.2 may appear as a specific instance, restricted to empirical measures, of the following problem: given two probability measures $\mu$ and $\nu$ on $\mathbb{R}^d$, and assuming that $\mu \in \mathcal{P}_q(\mathbb{R}^d)$, compute the infimum of $W_q(\mu, \rho \mathrm{d}\nu)$ over all probability densities $\rho$ with respect to $\nu$. Similar questions were recently addressed in [10]. First, it is clear that if $\mu$ is absolutely continuous with respect to $\nu$, then taking $\rho = \mathrm{d}\mu/\mathrm{d}\nu$ shows that this infimum is $0$. Next, following the proof of Proposition 2.2, it is easily seen that for any $\rho$,
\[ W_q^q(\mu, \rho \mathrm{d}\nu) \geq \int_{\mathbb{R}^d} \mathrm{dist}\big(x, \mathrm{supp}(\nu)\big)^q\, \mathrm{d}\mu(x). \]
To show that the right-hand side actually matches with the infimum of the left-hand side when $\rho$ varies, we keep following the proof of Proposition 2.2. Since $\mathrm{supp}(\nu)$ is closed, for any $x \in \mathbb{R}^d$ the set $\Psi(x) := \{x' \in \mathrm{supp}(\nu) : |x - x'| = \mathrm{dist}(x, \mathrm{supp}(\nu))\}$ is nonempty and closed. Besides, the multifunction $\Psi$ is weakly measurable, therefore by the Kuratowski-Ryll-Nardzewski theorem it admits a measurable selection which we denote by $\mathrm{nn}_\nu$. We denote by $\mu^*$ the associated pushforward of $\mu$ by $\mathrm{nn}_\nu$. Then it is clear that $\mathrm{supp}(\mu^*) \subset \mathrm{supp}(\nu)$ on the one hand, and that
\[ W_q^q(\mu, \mu^*) \leq \int_{\mathbb{R}^d} \mathrm{dist}\big(x, \mathrm{supp}(\nu)\big)^q\, \mathrm{d}\mu(x) \]
on the other hand. We finally deduce from the approximation result stated in Lemma 2.4 below that
\[ \inf_\rho W_q(\mu, \rho \mathrm{d}\nu) = \bigg( \int_{\mathbb{R}^d} \mathrm{dist}\big(x, \mathrm{supp}(\nu)\big)^q\, \mathrm{d}\mu(x) \bigg)^{1/q}, \]
which thereby generalizes the results of Proposition 2.2. Notice that there may not exist a minimizer $\rho$ for this problem as the measure $\mu^*$ need not be absolutely continuous with respect to $\nu$.
Lemma 2.4 ($W_q$ approximation by absolutely continuous measures). Let $\mu^*$ and $\nu$ be two probability measures on $\mathbb{R}^d$ such that $\mathrm{supp}(\mu^*) \subset \mathrm{supp}(\nu)$. For any $\varepsilon > 0$, there exists a probability density $\rho$ with respect to $\nu$ such that, for all $q \geq 1$, $W_q(\mu^*, \rho \mathrm{d}\nu) < \varepsilon$.
The proof constructs, from a random variable $X$ with law $\mu^*$, a coupled random variable $Y$ with the following two properties: on the one hand, the law of $Y$ has a density $\rho$ with respect to $\nu$, and on the other hand we have $|X - Y| < \varepsilon$ almost surely, which ensures that $W_q(\mu^*, \rho \mathrm{d}\nu) < \varepsilon$ for any $q \geq 1$.

Analysis of the Wasserstein distance
In this section, we study the asymptotic behavior of $\mathbb{E}[W_q^q(\hat\mu_{X_n}, \hat\mu^{(k)}_{X'_m})]$ when $n, m \to +\infty$. To this aim, we first notice that by Proposition 2.2, we have
\[ \mathbb{E}\Big[ W_q^q\big(\hat\mu_{X_n}, \hat\mu^{(1)}_{X'_m}\big) \Big] = \mathbb{E}\Big[ \big| X - \mathrm{NN}_{X'_m}(X) \big|^q \Big] \tag{13} \]
for $k = 1$, and
\[ \mathbb{E}\Big[ W_q^q\big(\hat\mu_{X_n}, \hat\mu^{(k)}_{X'_m}\big) \Big] \leq \mathbb{E}\bigg[ \frac{1}{k} \sum_{l=1}^k \big| X - \mathrm{NN}^{(l)}_{X'_m}(X) \big|^q \bigg] \tag{14} \]
for $k \geq 2$, where $X$ has law $\mu_X$ and is independent of $X'_m$. Observe that the right-hand sides of both (13) and (14) no longer depend on $n$.
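Identity (13) makes the left-hand side straightforward to estimate by simulation, without solving any transport problem. The following sketch (Gaussian toy laws of our choosing; it conditions on a single draw of $X'_m$, so an outer loop over training samples would be needed for the full expectation) illustrates the $m^{-q/d}$ decay studied next.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def nn_moment(sample_mu_X, sample_mu_Xp, m, q=2, n_mc=5_000, rng=None):
    """Monte Carlo estimate of E[|X - NN_{X'_m}(X)|^q], the right-hand side of (13)."""
    rng = rng or np.random.default_rng()
    nbrs = NearestNeighbors(n_neighbors=1).fit(sample_mu_Xp(m, rng))  # one draw of X'_m
    dist, _ = nbrs.kneighbors(sample_mu_X(n_mc, rng))                 # i.i.d. draws of X
    return np.mean(dist[:, 0] ** q)

d, q = 2, 2
mu_X = lambda n, rng: rng.normal(0.0, 1.0, size=(n, d))
mu_Xp = lambda n, rng: rng.normal(0.0, 1.5, size=(n, d))
for m in (100, 1_000, 10_000):
    print(m, nn_moment(mu_X, mu_Xp, m, q=q))  # expect decay roughly like m**(-q/d) = 1/m
```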

Consistency
Our first main result is a consistency result. Before stating it in Theorem 3.3, we formulate two crucial assumptions.

Assumption 3.1 (Support condition). $\mathrm{supp}(\mu_X) \subset \mathrm{supp}(\mu_{X'})$.

Assumption 3.2 (Integrability). $\mathbb{E}[|X|^q] < +\infty$, and there exists $m_0 \in \mathbb{N}^*$ such that $\mathbb{E}[\min_{j \in J1, m_0K} |X'_j|^q] < +\infty$.

Theorem 3.3 (Consistency). Under Assumptions 3.1 and 3.2, for any sequence of positive integers $(k_m)_{m \geq 1}$ such that $k_m/m \to 0$,
\[ \lim_{m \to +\infty} \mathbb{E}\Big[ W_q^q\big( \hat\mu_{X_n}, \hat\mu^{(k_m)}_{X'_m} \big) \Big] = 0. \]

Rates of convergence
The next step of our study consists in complementing Theorem 3.3 with a rate of convergence. We first discuss the case $k = 1$. Following (13), we start by writing
\[ \mathbb{E}\Big[ W_q^q\big(\hat\mu_{X_n}, \hat\mu^{(1)}_{X'_m}\big) \Big] = \int_{\mathbb{R}^d} \mathbb{E}\Big[ \big| x - \mathrm{NN}_{X'_m}(x) \big|^q \Big]\, \mathrm{d}\mu_X(x), \]
and observe that, for any $x$ such that there is an open set $U$ of $\mathbb{R}^d$ containing $x$ on which $\mu_{X'}(\cdot \cap U)$ has a density $p_{X'}$ with respect to the Lebesgue measure which is continuous at $x$, an elementary computation shows that, for all $r \geq 0$ small enough,
\[ \mathbb{P}\big( \big| x - \mathrm{NN}_{X'_m}(x) \big| > r \big) = \big( 1 - \mu_{X'}(B(x, r)) \big)^m, \qquad \mu_{X'}(B(x, r)) \sim p_{X'}(x)\, v_d\, r^d, \]
where $v_d$ denotes the volume of the unit ball of $\mathbb{R}^d$ for the norm $|\cdot|$. If $p_{X'}(x) > 0$ then this indicates that the correct order of convergence in Theorem 3.3 should be $m^{-q/d}$. When $\mu_{X'}$ is not absolutely continuous with respect to the Lebesgue measure, it is easy to construct elementary examples yielding different rates of convergence; see also [5, Chapter 2] for the singular case. We leave these peculiarities apart and work under the following strengthening of the support condition of Assumption 3.1.
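For intuition, the $m^{-q/d}$ prediction follows from a back-of-the-envelope computation (a heuristic sketch, not part of the formal proof):
\[ \mathbb{P}\big( |x - \mathrm{NN}_{X'_m}(x)| > r \big) = \big( 1 - \mu_{X'}(B(x, r)) \big)^m \approx \exp\big( - m\, p_{X'}(x)\, v_d\, r^d \big), \]
so that, integrating the tail,
\[ \mathbb{E}\big[ |x - \mathrm{NN}_{X'_m}(x)|^q \big] = \int_0^{+\infty} q\, r^{q-1}\, \mathbb{P}\big( |x - \mathrm{NN}_{X'_m}(x)| > r \big)\, \mathrm{d}r \approx \frac{\Gamma(1 + q/d)}{\big( m\, v_d\, p_{X'}(x) \big)^{q/d}}, \]
which is of order $m^{-q/d}$ whenever $p_{X'}(x) > 0$.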

Assumption 3.6 (Strong support condition).
There exists an open set $U \subset \mathbb{R}^d$ which contains $\mathrm{supp}(\mu_X)$ and such that: (i) $U \subset \mathrm{supp}(\mu_{X'})$; (ii) $\mu_{X'}$ restricted to $U$ has a continuous density $p_{X'}$ with respect to the Lebesgue measure; (iii) there exist $\kappa > 0$ and $r_\kappa > 0$ such that, for all $x \in U$ and all $r \in (0, r_\kappa]$, $\mu_{X'}(B(x, r)) \geq \kappa\, v_d\, r^d\, p_{X'}(x)$.

Obviously, Assumption 3.6 implies Assumption 3.1 because then $\mathrm{supp}(\mu_X) \subset U \subset \mathrm{supp}(\mu_{X'})$. Part (iii) of Assumption 3.6 was introduced in [20] in the context of Nearest Neighbor classification, where it is called the Strong minimal mass assumption. Similar assumptions are commonly used in set estimation, geometric inference and quantization, such as standardness [14] or Ahlfors regularity [24].
Under Assumption 3.6, for all $x \in \mathrm{supp}(\mu_X)$, the rescaled NN distance $m^{1/d} |x - \mathrm{NN}_{X'_m}(x)|$ is stochastically dominated by a positive random variable $Z$ such that $\mathbb{P}(Z > t) \leq \exp(-\kappa\, v_d\, p_{X'}(x)\, t^d)$, and hence
\[ \mathbb{E}[Z^q] = \frac{\Gamma(1 + q/d)}{\big( \kappa\, v_d\, p_{X'}(x) \big)^{q/d}}, \]
where $\Gamma$ denotes Euler's Gamma function. Therefore, as soon as the sequence $(m^{q/d} |X - \mathrm{NN}_{X'_m}(X)|^q)_{m \geq 1}$ is uniformly integrable, its expectation converges when $m$ goes to infinity. This statement appears for example in the literature on stochastic optimal quantization [23, Theorem 9.1]. Here, we provide an explicit moment condition ensuring uniform integrability.

Assumption 3.7 (Moments). In addition to Assumption 3.6, the condition
\[ \mathbb{E}\bigg[ \frac{1}{p_{X'}(X)^{q/d}} \bigg] < +\infty \]
holds, where $X$ has law $\mu_X$.

Assumptions 3.6 and 3.7 are discussed in more detail below. We now state our second main result.

Theorem 3.8 (Rate of convergence for $k = 1$). Under Assumptions 3.1, 3.2, 3.6 and 3.7,
\[ \lim_{m \to +\infty} m^{q/d}\, \mathbb{E}\Big[ W_q^q\big( \hat\mu_{X_n}, \hat\mu^{(1)}_{X'_m} \big) \Big] = \Gamma(1 + q/d)\, \mathbb{E}\bigg[ \frac{1}{\big( v_d\, p_{X'}(X) \big)^{q/d}} \bigg]. \]
We now discuss the estimation of $\hat\mu_{X_n}$ by the weighted empirical measure $\hat\mu^{(k)}_{X'_m}$ for an arbitrary $k \in J1, mK$. By (10), we first observe that we always have
\[ W_q\big( \hat\mu_{X_n}, \hat\mu^{(1)}_{X'_m} \big) \leq W_q\big( \hat\mu_{X_n}, \hat\mu^{(k)}_{X'_m} \big), \]
so that the estimation of $\hat\mu_{X_n}$ is deteriorated by increasing the number of neighbors. Still, in the asymptotic regime of Theorem 3.3, a bound of the same order of magnitude as Theorem 3.8 may be obtained.

Corollary 3.9 (Convergence rates for $k$-NN). Under the assumptions of Theorem 3.8, for any nondecreasing sequence of positive integers $(k_m)_{m \geq 1}$ such that $k_m/m \to 0$, the expected distance $\mathbb{E}[W_q^q(\hat\mu_{X_n}, \hat\mu^{(k_m)}_{X'_m})]$ is of order at most $(k_m/m)^{q/d}$.

Remark 3.10 (Optimal choice of the synthetic density). When $X'$ has a density $p_{X'}$ with respect to the Lebesgue measure, an interesting fact is that the minimum of the quantity $\mathbb{E}[1/p_{X'}(X)^{q/d}]$ over the probability density $p_{X'}$ is not reached when $p_{X'} = p_X$. Instead, according to [41], the minimum is attained when $p_{X'} \propto p_X^{d/(d+q)}$.
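The exponent in Remark 3.10 can be recovered by a short variational computation; the following sketch (ours, ignoring regularity issues) only handles the normalization constraint. Writing $\mathbb{E}[1/p_{X'}(X)^{q/d}] = \int p_X(x)\, p_{X'}(x)^{-q/d}\, \mathrm{d}x$ and introducing a Lagrange multiplier $\lambda$ for the constraint $\int p_{X'} = 1$, the first-order condition reads
\[ -\frac{q}{d}\, p_X(x)\, p_{X'}(x)^{-q/d - 1} + \lambda = 0 \quad \Longleftrightarrow \quad p_{X'}(x) = \bigg( \frac{q\, p_X(x)}{d\, \lambda} \bigg)^{\frac{d}{d+q}} \propto p_X(x)^{\frac{d}{d+q}}, \]
in agreement with the exponent $d/(d+q)$ quoted from [41].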
Remark 3.11 (NN distance without covariate shift). In the case where $\mu_X = \mu_{X'}$, the quantity $\mathbb{E}[|X - \mathrm{NN}_{X_m}(X)|^q]$ is called the Nearest Neighbor distance. It naturally arises in the theoretical study of NN regression and classification [5, Chapter 2]. Previous works on the topic focus mainly on the convergence when $q = 2$ and assume that $X$ has a bounded support [5,17,26,32]. Some works [13,25] consider random variables $X$ with unbounded support in the context of $k$-NN regression, but make the assumption of a bounded regression function $\psi$.
In this perspective, a direct corollary of Theorem 3.8 is the following statement: let $X$ have a density $p_X$ for which the strong minimal mass assumption 3.6 (iii) holds with $U = \mathbb{R}^d$ and $\mathbb{E}[1/p_X(X)^{q/d}] < +\infty$; then
\[ \lim_{m \to +\infty} m^{q/d}\, \mathbb{E}\Big[ \big| X - \mathrm{NN}_{X_m}(X) \big|^q \Big] = \Gamma(1 + q/d)\, \mathbb{E}\bigg[ \frac{1}{\big( v_d\, p_X(X) \big)^{q/d}} \bigg]. \]
This extends the results of the literature by ensuring the asymptotic equivalence for random variables with unbounded support.
Let us conclude this subsection with some comments on Assumptions 3.6 and 3.7. When $X$ has a compact support, Assumptions 3.6 and 3.7 are verified as soon as $\mu_{X'}$ has a continuous density $p_{X'}$ which is bounded from below and from above on an open set $U$ containing the support of $\mu_X$. Indeed, in that case there exist $\varepsilon > 0$ and an open subset $U'$ of $U$ such that $U'$ contains $\mathrm{supp}(\mu_X)$ and $U$ contains the $\varepsilon$-neighborhood of $U'$. Then, Assumption 3.6 (iii) is verified with $r_\kappa < \varepsilon$ and $\kappa = \inf_{x \in U} p_{X'}(x) / \sup_{x \in U} p_{X'}(x)$. Assumptions 3.6 and 3.7 also hold in some nontrivial noncompact cases. An example of a sufficient condition for Assumption 3.6, which does not depend on $\mu_X$, is given in the next statement and is proved in Subsection 3.3.

Lemma 3.12 (Radial density, sufficient condition for Assumption 3.6). Let $\|\cdot\|$ be a norm on $\mathbb{R}^d$, induced by an inner product and not necessarily identical to $|\cdot|$. If $\mu_{X'}$ has a density $p_{X'}$ with respect to the Lebesgue measure on $\mathbb{R}^d$ which writes $p_{X'}(x) = h(\|x - x_0\|)$ for some $x_0 \in \mathbb{R}^d$ and some positive, nonincreasing function $h$, then Assumption 3.6 (iii) holds with $U = \mathbb{R}^d$.

We also refer to [20, Section 2.4] for a discussion of this assumption. Assumption 3.7 gives a relationship between $\mu_X$ and $p_{X'}$ to ensure the convergence. In essence, it asserts that the tail of $\mu_X$ must be quite lightweight compared to the tail of $p_{X'}$. For instance, if $X$ and $X'$ are centered Gaussian vectors with respective covariance matrices $\sigma^2 I_d$ and $\sigma'^2 I_d$, then by Lemma 3.12, Assumption 3.6 is satisfied with $U = \mathbb{R}^d$, and it is easy to check that for $q \in [1, +\infty)$, Assumption 3.7 holds if and only if $\sigma'^2 > \sigma^2 q/d$.
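For the Gaussian example, the claimed condition can be checked directly; in the following sketch (ours) we take $|\cdot|$ to be the Euclidean norm:
\[ \mathbb{E}\bigg[ \frac{1}{p_{X'}(X)^{q/d}} \bigg] \propto \int_{\mathbb{R}^d} \exp\bigg( \frac{q\, |x|^2}{2 d\, \sigma'^2} \bigg) \exp\bigg( -\frac{|x|^2}{2 \sigma^2} \bigg)\, \mathrm{d}x, \]
which is finite if and only if $\frac{q}{d\, \sigma'^2} < \frac{1}{\sigma^2}$, that is to say $\sigma'^2 > \sigma^2 q/d$.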

Proofs
In this subsection, we present the proofs of Theorems 3.3 and 3.8, Corollary 3.9 and Lemma 3.12.
Proof of Theorem 3.3. We begin our proof with the constant case $k_m = 1$ for all $m$ and then extend it to the general case. We recall that by (13),
\[ \mathbb{E}\Big[ W_q^q\big( \hat\mu_{X_n}, \hat\mu^{(1)}_{X'_m} \big) \Big] = \mathbb{E}\Big[ \big| X - \mathrm{NN}_{X'_m}(X) \big|^q \Big]. \]
By Assumption 3.1, $X \in \mathrm{supp}(\mu_{X'})$ almost surely, so that $|X - \mathrm{NN}_{X'_m}(X)|$ converges to $0$ almost surely. Let $m_0$ be the integer given by Assumption 3.2; for $m \geq m_0$, the inequality
\[ \big| X - \mathrm{NN}_{X'_m}(X) \big|^q \leq \Big( |X| + \min_{j \in J1, m_0K} |X'_j| \Big)^q \]
holds, and the right-hand side is integrable by assumption. Then by the dominated convergence theorem,
\[ \lim_{m \to +\infty} \mathbb{E}\Big[ \big| X - \mathrm{NN}_{X'_m}(X) \big|^q \Big] = 0. \]
For the general case $k_m/m \to 0$, we adapt directly the proof of [5, Theorem 2.4] to the context $\mu_X \neq \mu_{X'}$. Let us fix $l \in J1, \lfloor m/2 \rfloor K$ and partition the set $J1, mK$ into $l$ blocks of consecutive indices, each of cardinality at least $\lfloor m/l \rfloor$. We denote by $\mathrm{NN}^{[b]}$ the nearest neighbor of $X$ among the points $X'_j$ with index $j$ in the $b$-th block: since the block-wise nearest neighbors are $l$ distinct points of the sample, the $l$-th NN distance is bounded by $\max_b |X - \mathrm{NN}^{[b]}|$, and consequently the right-hand side of (14) is controlled by 1-NN distances over subsamples of size at least $\lfloor m/k_m \rfloor$. Finally, we deduce from (14) that, as soon as $k_m \leq m/2$,
\[ \mathbb{E}\Big[ W_q^q\big( \hat\mu_{X_n}, \hat\mu^{(k_m)}_{X'_m} \big) \Big] \leq \mathbb{E}\Big[ \max_{b \in J1, k_mK} \big| X - \mathrm{NN}^{[b]} \big|^q \Big], \]
which goes to $0$ as a consequence of the first part of the proof when $m/(2k_m)$ goes to infinity.
Proof of Theorem 3.8. By (13), we have
\[ m^{q/d}\, \mathbb{E}\Big[ \big| X - \mathrm{NN}_{X'_m}(X) \big|^q \Big] = \int_{x \in \mathbb{R}^d} \int_{t=0}^{+\infty} \mathbb{P}\Big( m^{q/d} \big| x - X'_1 \big|^q > t \Big)^m \mathrm{d}t\, \mathrm{d}\mu_X(x), \tag{17} \]
by independence of the $X'_j$. The proof consists in computing the pointwise limit of $\mathbb{P}(m^{q/d} |x - X'_1|^q > t)^m$ for $(x, t) \in \mathrm{supp}(\mu_X) \times \mathbb{R}_+$ and then establishing the convergence of the integral via the dominated convergence theorem.

Pointwise convergence. We have
\[ \mathbb{P}\Big( m^{q/d} \big| x - X'_1 \big|^q > t \Big)^m = \Big( 1 - \mu_{X'}\big( B(x, t^{1/q}/m^{1/d}) \big) \Big)^m. \]
By Assumption 3.6 (continuity of $p_{X'}$ on $U$), we have $\mu_{X'}(B(x, r)) \sim v_d\, p_{X'}(x)\, r^d$ when $r \to 0$, with $v_d$ the volume of the unit ball. Thus $m\, \mu_{X'}(B(x, t^{1/q}/m^{1/d})) \to v_d\, p_{X'}(x)\, t^{d/q}$, and we conclude that
\[ \mathbb{P}\Big( m^{q/d} \big| x - X'_1 \big|^q > t \Big)^m \longrightarrow \exp\big( - v_d\, p_{X'}(x)\, t^{d/q} \big). \]

Dominated convergence. Let $r_\kappa > 0$ be given by Assumption 3.6. We split the integral in the right-hand side of (17) into the terms $\mathrm{I}$ (over $t \leq m^{q/d} r_\kappa^q$) and $\mathrm{II}$ (over $t > m^{q/d} r_\kappa^q$) and study each term separately.

Convergence of I. Using the elementary inequality $(1 - a/n)^n \leq \exp(-a)$ for $a \leq n$, we can write, for $t \leq m^{q/d} r_\kappa^q$,
\[ \mathbb{P}\Big( m^{q/d} \big| x - X'_1 \big|^q > t \Big)^m \leq \exp\big( - \kappa\, v_d\, p_{X'}(x)\, t^{d/q} \big). \]
This bound does not depend on $m$ and its integral is finite by Assumption 3.7. We therefore deduce from the dominated convergence theorem that $\mathrm{I}$ converges to the double integral of the pointwise limit.

Convergence of II. Let $m \geq 2(q+1) m_0$. Using the change of variable $r^q = t/m^{q/d}$, we have
\[ \mathrm{II} = \int_x \int_{r > r_\kappa} q\, m^{q/d}\, r^{q-1}\, V_m(x, r)\, \mathrm{d}r\, \mathrm{d}\mu_X(x), \qquad V_m(x, r) := \mathbb{P}\big( |x - X'_1| > r \big)^m. \]
As $\mathbb{P}(|x - X'_1| > r_\kappa) < 1$ for all $x$ in $U$ by Assumption 3.6, $V_m(x, r)$ converges pointwise to $0$ on the support of $\mu_X$. We check that $m^{q/d} V_m(x, r)$ is bounded from above by an integrable function which does not depend on $m$. Let us denote $\bar m = m - (q+1) m_0 \geq m/2$ and rewrite
\[ m^{q/d}\, V_m(x, r) = m^{q/d}\, \mathbb{P}\big( |x - X'_1| > r \big)^{(q+1) m_0}\, \mathbb{P}\big( |x - X'_1| > r \big)^{\bar m} \leq m^{q/d}\, \mathrm{e}^{- \bar m\, \kappa\, v_d\, p_{X'}(x)\, r_\kappa^d}\, \mathbb{P}\big( |x - X'_1| > r \big)^{(q+1) m_0}, \]
where we have used Assumption 3.6 and the elementary inequality above. We deduce that
\[ m^{q/d}\, V_m(x, r) \leq V(x, r) := \frac{C_1}{p_{X'}(x)^{q/d}}\, \mathbb{P}\big( |x - X'_1| > r \big)^{(q+1) m_0} \tag{18} \]
for some constant $C_1$ which does not depend on $m$. To complete the proof, we verify that $r^{q-1} V(x, r)$ is integrable on $U \times [r_\kappa, +\infty)$. We first fix $x \in \mathbb{R}^d$ and estimate the integral of $r^{q-1} V(x, r)$ in $r$. Using the fact that if $|x - X'_1| > r$ then $|X'_1| > r - |x|$, we first write
\[ \int_{r_\kappa}^{+\infty} r^{q-1}\, \mathbb{P}\big( |x - X'_1| > r \big)^{(q+1) m_0}\, \mathrm{d}r \leq \int_0^{+\infty} (r + |x|)^{q-1}\, \mathbb{P}\big( |X'_1| > r \big)^{(q+1) m_0}\, \mathrm{d}r. \]
On the interval $[0, +\infty)$, we first rewrite $\mathbb{P}(|X'_1| > r)^{m_0} = \mathbb{P}(\min_{j \in J1, m_0K} |X'_j| > r)$ and recall from Assumption 3.2 that $C_2 := \mathbb{E}[\min_{j \in J1, m_0K} |X'_j|] < +\infty$. As a consequence, we deduce from Markov's inequality that the right-hand side in the previous inequality is bounded from above by
\[ \int_0^{+\infty} (r + |x|)^{q-1} \Big( \frac{C_2}{r} \wedge 1 \Big)^{q+1} \mathrm{d}r. \]
If $|x| \leq 1$ then this expression is bounded from above by a constant. If $|x| > 1$, then we have $\int_0^{|x|} (r + |x|)^{q-1}\, \mathrm{d}r \leq 2^{q-1} |x|^q$ on the one hand, and the integral over $[|x|, +\infty)$ is bounded from above, on the other hand. Overall, we conclude that there exists a constant $C_3$ such that
\[ \int_{r_\kappa}^{+\infty} r^{q-1}\, V(x, r)\, \mathrm{d}r \leq \frac{C_3 \big( 1 + |x|^q \big)}{p_{X'}(x)^{q/d}}. \tag{19} \]
As a consequence, the combination of (18) and (19) yields a domination which, by Assumption 3.7 combined with the integrability of $|X|^q$, allows us to apply the dominated convergence theorem to show that II goes to $0$, and thereby completes the proof.
Proof of Corollary 3.9. We start from the second line of Equation (16) and estimate its right-hand side. Let $\varepsilon > 0$. By Theorem 3.8, there exists $u_\varepsilon \geq 0$ such that, for all $u \geq u_\varepsilon$, the rescaled expected 1-NN distance computed on a training subsample of size $u$ is within $\varepsilon$ of its limit. We can remark that for $m \in \mathbb{N}^*$ and $l \in J1, k_mK$, the $l$-th NN distance among $X'_m$ is bounded by a 1-NN distance over a subsample of size at least $\lfloor m/k_m \rfloor \geq m/(2k_m)$. Thus, if we take $m_\varepsilon$ such that for all $m \geq m_\varepsilon$, $\frac{m}{2k_m} \geq u_\varepsilon$, we obtain the claimed estimate because $k_m$ is nondecreasing. This concludes the proof.
Proof of Lemma 3.12. Obviously, it suffices to check that $p_{X'}$ satisfies (iii) in Assumption 3.6. Let us denote by $\langle \cdot, \cdot \rangle$ and $\bar B(x, r)$ respectively the inner product and the ball of center $x$ and radius $r$ associated with $\|\cdot\|$. We set $x_0 = 0$ without loss of generality. As $h$ is positive and nonincreasing, we may fix $r_0 > 0$ and define $\kappa := h(r_0)/h(0) \in (0, 1]$. If $\|x\| \leq r_0/2$, then for all $y \in \bar B(0, r_0/2)$, the monotonicity of $h$ ensures that $p_{X'}(x + y) \geq \kappa\, p_{X'}(x)$. By the equivalence of the norms, there exist $C \geq c > 0$ such that for any $x \in \mathbb{R}^d$ and any $r \geq 0$, $B(x, cr) \subset \bar B(x, r) \subset B(x, Cr)$. Thus $\mu_{X'}(\bar B(x, r)) \geq \kappa\, p_{X'}(x)\, \mathrm{vol}(\bar B(x, r))$ for all $r \leq r_0/2$, and this lower bound transfers to the $|\cdot|$-balls through the constants $c$ and $C$. If $\|x\| > r_0/2$, let us introduce the half-cone $\mathcal{C}_x$ with vertex $x$ directed towards the origin, and notice that for all $r \leq r_0/2$ and $x' \in \mathcal{C}_x \cap \bar B(x, r)$, $\|x'\| \leq \|x\|$. Thus, for all $x' \in \mathcal{C}_x \cap \bar B(x, r)$, $p_{X'}(x') \geq p_{X'}(x)$. For a given $r$, the sets $\mathcal{C}_x \cap \bar B(x, r)$ have the same volume for all $x$, which we denote by $\alpha v_d r^d$ for some $\alpha \in (0, 1/C^d)$. Finally, we have $\mu_{X'}(\bar B(x, r)) \geq \alpha\, v_d\, r^d\, p_{X'}(x)$. If we take $\kappa' = (c/C)^d \min(\alpha C^d, \kappa)$ and $r_{\kappa'} = c\, r_0/2$, we obtain the point (iii) of Assumption 3.6.

Convergence of $\widehat{\mathrm{QI}}^{(k)}_{m,n}$ to QI
This section is dedicated to the study of the convergence of $\widehat{\mathrm{QI}}^{(k)}_{m,n}$ to QI. As a preliminary step, we complement the results from Section 3 by deriving rates of convergence for the Wasserstein distance between $\hat\mu^{(k)}_{X'_m}$ and $\mu_X$ in Subsection 4.1. We then distinguish between the noiseless case in which $Y = f(X)$, addressed in Subsection 4.2, and the noisy case $Y = f(X, \Theta)$, addressed in Subsection 4.3.

Convergence of $\hat\mu^{(k)}_{X'_m}$ to $\mu_X$

Let us fix $q \in [1, +\infty)$ and use Jensen's inequality to write, for $k = k_m \in J1, mK$,
\[ \mathbb{E}\Big[ W_q^q\big( \mu_X, \hat\mu^{(k_m)}_{X'_m} \big) \Big] \leq 2^{q-1} \Big( \mathbb{E}\big[ W_q^q(\mu_X, \hat\mu_{X_n}) \big] + \mathbb{E}\big[ W_q^q(\hat\mu_{X_n}, \hat\mu^{(k_m)}_{X'_m}) \big] \Big). \]
Under the assumptions of Corollary 3.9, the second term has order of magnitude at most $(k_m/m)^{q/d}$. The study of the first term, namely the rate of convergence of the expected $W_q$ distance (taken to the power $q$) between the empirical measure of i.i.d. realizations and their common distribution, has been the subject of several works. Under the condition that there exists $s > 2q$ such that $\mathbb{E}[|X|^s] < +\infty$, we have from [19, Theorem 1] a rate of order $n^{-1/2} + n^{-q/d}$, up to a logarithmic factor when $d = 2q$. These estimates may be improved if more assumptions are made on $\mu_X$. For example, if this measure possesses a lower and upper bounded density on some bounded subset of $\mathbb{R}^d$, then the rate is known to be $n^{-q/d}$ even if $q > d/2$ [21]. This rate may even be improved if $\mu_X$ concentrates on a low-dimensional submanifold of $\mathbb{R}^d$ [39,16], which is particularly relevant in the UQ context which motivates this study, see Remark 5.3. In order to make the use of our results as flexible as possible, from now on we shall denote by $(\tau_{q,d}(n))_{n \geq 1}$ a sequence such that
\[ \mathbb{E}\big[ W_q^q(\mu_X, \hat\mu_{X_n}) \big] = O\big( \tau_{q,d}(n) \big), \]
and thus
\[ \mathbb{E}\Big[ W_q^q\big( \mu_X, \hat\mu^{(k_m)}_{X'_m} \big) \Big] = O\Big( \tau_{q,d}(n) + (k_m/m)^{q/d} \Big). \]
As is sketched in the discussion above, the precise order of $\tau_{q,d}(n)$ depends on properties of the measure $\mu_X$.
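In dimension $d = 1$ with $q = 1$ (so that $d < 2q$ and $\tau_{1,1}(n)$ is of order $n^{-1/2}$), the rate is easy to observe empirically. A minimal sketch using SciPy, with a large auxiliary sample standing in for $\mu_X$ (the Gaussian toy law is our choice):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(1)
ref = rng.normal(size=200_000)              # large sample standing in for mu_X (d = 1)
for n in (100, 1_000, 10_000):
    # Average W_1(mu_X, empirical measure of n points) over 20 repetitions.
    w1 = np.mean([wasserstein_distance(rng.normal(size=n), ref) for _ in range(20)])
    print(n, w1)                            # expect decay roughly like n**(-1/2)
```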
In the sequel, where we study the convergence of $\widehat{\mathrm{QI}}^{(k)}_{m,n}$ to QI, the $W_1$ distance plays a specific role, due to the Kantorovich duality formula [38, Remark 6.5]
\[ W_1(\mu, \nu) = \sup_{|\varphi|_{\mathrm{Lip}} \leq 1} \bigg( \int_{\mathbb{R}^d} \varphi\, \mathrm{d}\mu - \int_{\mathbb{R}^d} \varphi\, \mathrm{d}\nu \bigg), \tag{22} \]
where $|\varphi|_{\mathrm{Lip}}$ denotes the Lipschitz constant of $\varphi$. We shall need the following estimate.
Proof. For any vector $(x_1, \ldots, x_n) \in (\mathbb{R}^d)^n$, let us define
\[ F(x_1, \ldots, x_n) := W_1\bigg( \mu_X, \frac{1}{n} \sum_{i=1}^n \delta_{x_i} \bigg) = \sup_{|\varphi|_{\mathrm{Lip}} \leq 1} \bigg( \int_{\mathbb{R}^d} \varphi\, \mathrm{d}\mu_X - \frac{1}{n} \sum_{i=1}^n \varphi(x_i) \bigg), \]
thanks to (22). Then for any $i \in J1, nK$ and $x'_i \in \mathbb{R}^d$, using the identity above and the fact that $|\sup f - \sup g| \leq \sup |f - g|$, we get
\[ \big| F(x_1, \ldots, x_i, \ldots, x_n) - F(x_1, \ldots, x'_i, \ldots, x_n) \big| \leq \frac{|x_i - x'_i|}{n}. \]
As a consequence, letting $X_n = (X_1, \ldots, X_n)$ and $\tilde X_n = (\tilde X_1, \ldots, \tilde X_n)$ be two independent samples from $\mu_X$, we deduce that the random variables $V_+$ and $V_-$ of the Efron-Stein construction, built from these coordinatewise perturbations, are controlled by the increments above. As a consequence, for any $q \geq 2$ we have a corresponding moment bound by Jensen's inequality. We therefore deduce from the higher-order Efron-Stein inequality [7, Theorem 2] that there exists a universal constant $C_q$ such that the $q$-th centered moment of $W_1(\mu_X, \hat\mu_{X_n})$ is bounded accordingly. We conclude the proof by writing, using Jensen's inequality again, the claimed estimate (21).

Convergence of $\widehat{\mathrm{QI}}^{(k)}_{m,n}$ in the noiseless case
We assume that $Y = f(X)$ and study the rate of convergence of $\widehat{\mathrm{QI}}^{(k_m)}_{m,n}$ to QI. In this case, if $\varphi \circ f$ is Lipschitz continuous, the duality formula (22) bounds the error $|\widehat{\mathrm{QI}}^{(k_m)}_{m,n} - \mathrm{QI}|$ by $|\varphi \circ f|_{\mathrm{Lip}}\, W_1(\mu_X, \hat\mu^{(k_m)}_{X'_m})$, whose expectation is controlled by the estimates of Subsection 4.1, of order $\tau_{1,d}(n) + (k_m/m)^{1/d}$. We therefore obtain the following result.
There is no need for $k_m$ to go to infinity, and thus $k_m = 1$ is optimal. These computations can be adapted to cases where $\varphi \circ f$ is not Lipschitz continuous. For instance, if $A \subset \mathbb{R}^e$, $\varphi(y) = \mathbb{1}_{\{y \in A\}}$ and $f$ is globally Lipschitz continuous, it is possible to use the margin assumption of [37] to deduce theoretical rates of convergence in the estimation of $\mathrm{QI} = \mathbb{P}(Y \in A)$.

Convergence of $\widehat{\mathrm{QI}}^{(k)}_{m,n}$ in the noisy case
We now study the convergence of $\widehat{\mathrm{QI}}^{(k_m)}_{m,n}$ to QI when $Y = f(X, \Theta)$. A first striking result is that even under the assumptions of Theorem 3.3, the estimator $\widehat{\mathrm{QI}}^{(1)}_{m,n}$ need not be consistent. Indeed, consider the case where $X$ is actually deterministic and always equal to some $x_0 \in \mathbb{R}^d$. Then all the indices $j^{(1)}_i$ are equal to some common index $j^{(1)}$, and the estimator rewrites
\[ \widehat{\mathrm{QI}}^{(1)}_{m,n} = \varphi\big( f\big( X'_{j^{(1)}}, \Theta_{j^{(1)}} \big) \big). \]
While Assumption 3.1 ensures that $X'_{j^{(1)}}$ converges to $x_0$ when $m \to +\infty$, in general the corresponding sequence of $\Theta_{j^{(1)}}$ does not converge.
As this example evidences, the presence of an atom in the law of $X$ makes the estimator $\widehat{\mathrm{QI}}^{(1)}_{m,n}$ depend on a single realization of $\Theta$ and therefore prevents this estimator from displaying an averaging behavior with respect to the law of $\Theta$. In Proposition 4.4, we clarify this point by exhibiting a necessary and sufficient condition for the estimator $\widehat{\mathrm{QI}}^{(1)}_{m,n}$ to be consistent, while in Proposition 4.5, we show that replacing $\widehat{\mathrm{QI}}^{(1)}_{m,n}$ with $\widehat{\mathrm{QI}}^{(k_m)}_{m,n}$, with $k_m \to +\infty$, allows us to recover such an averaging behavior and makes the estimator consistent, even when $\mu_X$ has atoms. In the latter case, we also provide rates of convergence in Proposition 4.6. A numerical sketch of this phenomenon is given below.
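The atom phenomenon is easy to reproduce numerically. In the following sketch (the choices of $f$, $\varphi$ and the laws are ours), $X \equiv x_0$, so that $\mathrm{QI} = \mathbb{E}[\varphi(f(x_0, \Theta))] = x_0^2 + 1$ for $f(x, \theta) = x + \theta$, $\varphi(y) = y^2$ and standard Gaussian $\Theta$:

```python
import numpy as np

rng = np.random.default_rng(2)
m, x0 = 1_000, 0.3
X_train = rng.uniform(0, 1, size=m)            # training inputs X'_j
Theta = rng.normal(size=m)                     # internal parameters Theta_j
Y_train = X_train + Theta                      # toy model f(x, theta) = x + theta
phi = lambda y: y ** 2                         # QI = E[(x0 + Theta)^2] = x0**2 + 1

order = np.argsort(np.abs(X_train - x0))       # all evaluation points equal x0,
for k in (1, 10, 100):                         # so the k-NN weights select the
    qi_k = np.mean(phi(Y_train[order[:k]]))    # k nearest X'_j to x0
    print(k, qi_k)  # k = 1: one noisy draw of Theta; larger k averages over Theta
```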
We recall that $\psi(x) = \mathbb{E}[\varphi(f(x, \Theta))]$ is defined in Equation (6). In the next statement, we denote by $\mathcal{A}_X$ the set of atoms of $\mu_X$, that is to say the set of $x \in \mathbb{R}^d$ such that $\mathbb{P}(X = x) > 0$, and introduce the notation $\vartheta(x) := \mathrm{Var}\big(\varphi(f(x, \Theta))\big)$.
Proof of Proposition 4.4. We introduce the notation $\mathcal{A}^+_X := \{x \in \mathcal{A}_X : \vartheta(x) > 0\}$, and denote by $e_1$ and $e_2$ the centered and conditional-variance parts of the error $\widehat{\mathrm{QI}}^{(1)}_{m,n} - \mathrm{QI}$ used below. In Step 1, we prove that $\mathbb{E}[|e_1|]$ converges to $0$, demonstrating at the same time the direct implication of the convergence when $\mathcal{A}^+_X = \emptyset$. In Step 2, we show that if $\mathcal{A}^+_X \neq \emptyset$ then $\mathbb{E}[|e_2|]$ does not converge to $0$, which implies that in this case, $\widehat{\mathrm{QI}}^{(1)}_{m,n}$ is not consistent. In both steps, we shall use the following preliminary remark: taking the conditional expectation with respect to $(X_n, X'_m)$, it is easy to see that for $i \in J1, nK$, the conditional law of $\varphi(f(\mathrm{NN}_{X'_m}(X_i), \Theta_{j^{(1)}_i}))$ has mean $\psi(\mathrm{NN}_{X'_m}(X_i))$ and variance $\vartheta(\mathrm{NN}_{X'_m}(X_i))$.

Step 1. Thanks to the boundedness of $\varphi$, and thus of $\vartheta$, it is immediate that $\mathbb{E}[|e_1|^2]$ is bounded uniformly in $m$. Therefore, to show that $\mathbb{E}[|e_1|^2]$ converges to $0$, it suffices to prove the convergence to $0$ of its leading terms. For this purpose, let us first write the expectation over the pair $(X_1, X_2)$ and recall that, by Assumption 3.1 and Lemma 2.2 in [5, Chapter 2], $\mathrm{NN}_{X'_m}(X_1)$ converges to $X_1$ and $\mathrm{NN}_{X'_m}(X_2)$ converges to $X_2$, almost surely. As a consequence, if $X_1 \in \mathcal{A}_X \setminus \mathcal{A}^+_X$ then $\vartheta(X_1) = 0$ and, by the continuity of $\vartheta$ and the boundedness of $\varphi$, the dominated convergence theorem shows that the corresponding contribution vanishes. On the other hand, if $X_1 \notin \mathcal{A}_X$, then almost surely $X_1 \neq X_2$, and therefore $\mathbb{1}_{\{\mathrm{NN}_{X'_m}(X_1) = \mathrm{NN}_{X'_m}(X_2)\}}$ converges to $0$ almost surely. Using the boundedness of $\varphi$ and the dominated convergence theorem again, we deduce that this contribution vanishes as well, which shows that $\mathbb{E}[|e_1|^2]$, and thus $\mathbb{E}[|e_1|]$, converge to $0$.
Step 2. Let us now assume that $\mathcal{A}^+_X$ is nonempty and show that $e_2$ does not converge to $0$ in $L^1$. We shall actually prove that $e_2$ does not converge to $0$ in $L^2$: since $e_2$ is bounded, this prevents the convergence from occurring in $L^1$. From the preliminary remark, we write $\mathbb{E}[|e_2|^2]$ as an integral against $\mu_X$ and isolate the contribution of the atoms in $\mathcal{A}^+_X$. By Assumption 3.1 and Lemma 2.2 in [5, Chapter 2] again, $\mathrm{NN}_{X'_m}(x)$ converges to $x$ almost surely, therefore using the continuity and boundedness assumptions on $\vartheta$, the dominated convergence theorem shows that $\liminf_{m} \mathbb{E}[|e_2|^2] > 0$, which completes the proof.
We now study the estimator $\widehat{\mathrm{QI}}^{(k_m)}_{m,n}$ and show that it is unconditionally consistent as soon as $k_m \to +\infty$. We provide $L^2$ convergence rates in Proposition 4.6.

Proof of Proposition 4.5. We decompose the error as
\[ \widehat{\mathrm{QI}}^{(k_m)}_{m,n} - \mathrm{QI} = \Big( \widehat{\mathrm{QI}}^{(k_m)}_{m,n} - \widehat{\Psi}^{(k_m)}_{m,n} \Big) + \Big( \widehat{\Psi}^{(k_m)}_{m,n} - \mathrm{QI} \Big), \qquad \widehat{\Psi}^{(k_m)}_{m,n} := \frac{1}{m} \sum_{j=1}^m w^{(k_m)}_j \psi(X'_j). \tag{24} \]
As $\psi$ is globally Lipschitz continuous and does not depend on $\Theta$, we have by Jensen's inequality a bound on the second part of (24) by $L$ times the expected $W_1$ distance between $\mu_X$ and $\hat\mu^{(k_m)}_{X'_m}$, with $L$ the Lipschitz constant of $\psi$, which we split through $\hat\mu_{X_n}$. The second term is bounded from above by $\mathbb{E}[W_2^2(\hat\mu_{X_n}, \hat\mu^{(k_m)}_{X'_m})]^{1/2}$, which goes to $0$ by Theorem 3.3. For the first term, the same arguments as in the proof of Lemma 4.1 control it by $\mathbb{E}[W_1(\mu_X, \hat\mu_{X_n})]$: since $W_1(\mu_X, \hat\mu_{X_n})$ converges to $0$ in probability [31] and the assumption that $\mathbb{E}[|X|^2] < +\infty$ ensures that this sequence is uniformly integrable, we deduce that its expectation converges to $0$ [6, Section 5]. Thus, the second part of the right-hand side of (24) converges to $0$ in $L^2$.

Let us consider the first part in the right-hand side of (24). We write the quadratic error
\[ \mathbb{E}\Big[ \big| \widehat{\mathrm{QI}}^{(k_m)}_{m,n} - \widehat{\Psi}^{(k_m)}_{m,n} \big|^2 \Big] = \mathbb{E}\bigg[ \bigg| \frac{1}{m} \sum_{j=1}^m w^{(k_m)}_j \Big( \varphi\big(f(X'_j, \Theta_j)\big) - \psi(X'_j) \Big) \bigg|^2 \bigg]. \]
Using the fact that $\mathbb{E}[\varphi(f(X'_j, \Theta_j)) \mid X_n, X'_m] = \psi(X'_j)$ by definition, and the independence of the $\Theta_j$, the cross terms vanish. For the remaining quadratic term, we remark that for some fixed $i_1$, $i_2$ and $l_1$, there exists exactly one $l_2 \in J1, mK$ such that $j^{(l_1)}_{i_1} = j^{(l_2)}_{i_2}$, because $(j^{(l)}_{i_2})_{1 \leq l \leq m}$ is a permutation of $J1, mK$. Therefore, there exists at most one $l_2 \in J1, k_mK$ verifying this property and, consequently, the quadratic term is bounded from above by a quantity of order $1/k_m$, which converges to $0$ when $k_m$ goes to infinity.
In order to complement Proposition 4.5 with a rate of convergence, we restart from the decomposition (24). Under the additional assumptions of Corollary 3.9 with $q = 2$, the same arguments as in the proof of Proposition 4.3 control the term $\widehat{\Psi}^{(k_m)}_{m,n} - \mathrm{QI}$, while the quadratic error of the term $\widehat{\mathrm{QI}}^{(k_m)}_{m,n} - \widehat{\Psi}^{(k_m)}_{m,n}$ is of order $1/k_m$ by the proof of Proposition 4.5. Optimizing in $k_m$, we get the following statement.
The loss of convergence order with respect to Proposition 4.3 is similar to the NN regression context, in which the rate deteriorates from $1/d$ in the noiseless case to $1/(d+2)$ in the noisy case [5, Section 14.6 and Section 15.3].

Applications and numerical illustration
We present a reformulation of our results in a standard framework for k-NN regression in Subsection 5.1, and then provide a detailed account of the original motivation of this work by decomposition-based UQ in Subsection 5.2. Last, numerical illustrations of our main results in a simple setting are reported in Subsection 5.3; we refer to [36,Chapter 11] for an application in an industrial context.

Generalization error of k-NN regression under covariate shift
In this subsection, we address the $k$-NN regression problem under covariate shift from the following more standard point of view: the quantity of interest is directly the regression function $r(x) := \mathbb{E}[Y \mid X = x]$, and the $k$-NN estimator of $r(x)$ is defined from the training set by
\[ \hat r_m(x) := \frac{1}{k} \sum_{l=1}^k Y'_{j^{(l)}(x)}, \]
where $j^{(l)}(x)$ denotes the (smallest) index $j$ such that $X'_j = \mathrm{NN}^{(l)}_{X'_m}(x)$. We are no longer interested in some quantity $\mathbb{E}[\varphi(Y)]$ but rather in the $L^2$ generalization error under covariate shift
\[ \mathbb{E}\Big[ \big( \hat r_m(X) - r(X) \big)^2 \Big], \qquad X \sim \mu_X. \]
For the sake of simplicity we assume that $Y$, $r(X)$, etc. take scalar values.
We retrieve essentially the same orders of convergence as in the case without covariate shift. The quantity $\mathbb{E}[1/p_{X'}(X)^{2/d}]^{1/2}$ appears to be the relevant measure of the loss due to the use of $\mu_{X'}$ instead of $\mu_X$, and we expect that the greater this quantity is, the slower the convergence will be.
Proof. The proof is an adaptation of [5, Theorem 14.5], using elements of the proofs of Theorem 3.8 and Corollary 3.9. We can decompose the $L^2$ error into a variance term of order $1/k_m$ and a bias term. By Jensen's inequality, the bias term can be bounded by
\[ L^2\, \mathbb{E}\bigg[ \frac{1}{k_m} \sum_{l=1}^{k_m} \big| X - \mathrm{NN}^{(l)}_{X'_m}(X) \big|^2 \bigg], \]
where $L$ is the Lipschitz constant of $f$, and then, following the proof of Corollary 3.9, we get a lim sup of order $(k_m/m)^{2/d}$. The optimal choice is $k_m \sim m^{2/(2+d)}$, leading to a generalization error of order $m^{-2/(2+d)}$, which completes the proof.
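As an illustration of this subsection's setting, the following sketch (our toy regression function and laws) estimates the generalization error under covariate shift with scikit-learn, comparing $k = 1$ with the rate-optimal scaling $k_m \sim m^{2/(2+d)}$:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(3)
d, m, n_test = 2, 5_000, 20_000
r = lambda x: np.sin(x[:, 0]) + x[:, 1]            # Lipschitz regression function
X_train = rng.normal(0.0, 1.5, size=(m, d))        # training law mu_X'
Y_train = r(X_train) + 0.1 * rng.normal(size=m)    # noisy responses
X_test = rng.normal(0.0, 1.0, size=(n_test, d))    # evaluation law mu_X

for k in (1, round(m ** (2 / (2 + d)))):           # k = 1 versus k ~ m^{2/(2+d)}
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_train, Y_train)
    err = np.mean((knn.predict(X_test) - r(X_test)) ** 2)
    print(k, err)                                  # L2 generalization error under shift
```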

Application to decomposition-based UQ
In the UQ context, the relation (1) represents a computer simulation [18,15]: the random variable X is the input of the simulation, the random variable Θ describes the set of its parameters, the function f is the numerical model and the random variable Y is the output of the simulation. The function φ involved in the definition (2) of the quantity of interest QI is the observable.
The fact that we assume that both X and Θ may be random, but with distinct sources of uncertainty (which is modeled by their statistical independence), comes from the study of uncertainty propagation in complex networks of numerical models [1,3,28,34,29]. In this context, several computer codes, representing various disciplines, are connected with each other by the fact that the outputs of certain codes are taken as inputs of other codes. Then f represents one discipline, with 'internal' uncertain parameters Θ whose law is known by the agent in charge of the simulation, and 'external' uncertain parameters X which are the output of possibly several upstream numerical simulations. Independently from our complex system context, assuming that the internal parameter Θ may be random is a standard practice to take aleatoric or epistemic uncertainty into account [15,22].
If X is deterministic then the computation of a global quantity of interest can be treated by the so-called Collaborative Optimization methods [8,40] in Multidisciplinary Analysis and Optimization. However, if X is random, a direct Monte Carlo evaluation of QI is often impossible to implement in practice.
Indeed, if the number of interacting disciplines is large and each code evaluation is costly, then one cannot wait for a sample $X_1, \ldots, X_n$ to be generated by the upstream simulations before starting to run one's own simulation. Therefore, decomposition-based UQ methods have been introduced in the literature in order to allow disciplines to run their numerical simulations independently. Basically, these methods work in two phases. In an offline phase, each discipline generates its own synthetic sample $X'_1, \ldots, X'_m$ according to some user-chosen probability measure $\mu_{X'}$ on $\mathbb{R}^d$ (or possibly other designs of experiment). The numerical model $f$ is then evaluated on the sample $(X'_1, \Theta_1), \ldots, (X'_m, \Theta_m)$ to obtain a corresponding set of realizations $Y'_1, \ldots, Y'_m$. Once actual realizations $X_1, \ldots, X_n$ become available in a subsequent online phase, they have to be used in combination with the synthetic sample to construct an estimator of QI, but evaluations of the numerical model $f$ are no longer allowed.
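Schematically, the two phases can be sketched as follows, reusing the hypothetical `knn_weights` helper from Section 2 (all names and toy models here are ours):

```python
import numpy as np

rng = np.random.default_rng(4)

# Offline phase: the discipline picks mu_X' and evaluates f once and for all.
m = 2_000
X_syn = rng.uniform(-2, 2, size=(m, 1))        # synthetic design, law mu_X'
Theta = rng.normal(size=m)                     # internal parameters Theta_j
Y_syn = np.sin(X_syn[:, 0]) + 0.2 * Theta      # stored evaluations of f(X'_j, Theta_j)

# Online phase: actual inputs arrive; f may no longer be called.
n = 500
X_real = rng.normal(0.0, 0.7, size=(n, 1))     # realizations of X
w = knn_weights(X_syn, X_real, k=50)           # weights from Proposition 2.2
QI_hat = np.mean(w * Y_syn ** 2)               # estimator (3) with phi(y) = y^2
print(QI_hat)
```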
We refer to [2,3,4] for examples and background on these methods. The k-NN reweighting scheme introduced in the present article is yet another possible approach to this problem. More general nonparametric regression methods, such as Nadaraya-Watson estimators, may also be considered. A more systematic study of such approaches, based on linear reweighting, as well as their generalization to the estimation of quantities of interest defined on a graph of numerical models, may be found in [36,Chapter 11] and will be the object of a future publication.

Remark 5.2 (Stochastic simulators).
Our framework is also suited to the situation where $\Theta$ does not represent well-identified parameters, but must rather be interpreted as the inherent randomness of the numerical model. In the UQ literature, such models are called stochastic simulators (see for instance [42] and the references therein) and their emulation is closely related to the regression problem addressed in this article, interpreting $\Theta$ as a noise term.

Remark 5.3 (A simple example with low-dimensionally supported data). In the multidisciplinary context introduced above, consider the simple setting in which a random variable $X^1$ is taken as an input by two distinct disciplines, represented by two numerical models $f^1$ and $f^2$. Assume in addition that the output $Y^1 = f^1(X^1)$ of the former is taken as an input by the latter, so that $Y^2 = f^2(X^1, Y^1)$. In the decomposition-based approach, the second discipline has to design a synthetic sample $(X'^1_j, Y'^1_j)_{j \in J1,mK}$ without the actual knowledge of $f^1$. Therefore it is unlikely that the law of this sample be absolutely continuous with respect to the true law of $(X^1, Y^1)$, the support of which lies in the manifold $\{(x, f^1(x))\}$.

Numerical illustrations
This subsection investigates numerically the influence of the choice of the synthetic distribution $\mu_{X'}$ on the quality of the respective approximations of $\mu_X$ by $\hat\mu^{(1)}_{X'_m}$, and of QI by $\widehat{\mathrm{QI}}^{(1)}_{m,n}$. We investigate how the relationship between $\mu_X$ and $\mu_{X'}$ impacts the convergence of $\hat\mu^{(1)}_{X'_m}$ in a two-dimensional setting parametrized by $\mu = 0.5$, $\sigma = 0.3$ and various values of $s_{\mathrm{corr}}$ in $(-1, 1)$. Intuitively, the closer $s_{\mathrm{corr}}$ is to $1$, the closer $\mu_{X'}$ is to $\mu_X$, as illustrated in Figure 1. As a first 'purely visual' indication of the quality of the approximation of $\mu_X$ by $\hat\mu^{(1)}_{X'_m}$, we plot on Figure 2 the trace of a kernel smoothing of $\hat\mu^{(1)}_{X'_m}$ on the segment $\{(u, u), u \in [0, 1]\}$. We can see that the greater $s_{\mathrm{corr}}$ is, the better the reconstruction looks. From a more quantitative point of view, this observation is confirmed in Figure 3, where we plot the evolution of $\mathbb{E}[W_2^2(\hat\mu_{X_n}, \hat\mu^{(1)}_{X'_m})]$ as a function of $m$. We can see that although this quantity converges at the theoretical rate $m^{-1}$, the multiplicative constant decreases with $s_{\mathrm{corr}}$. The reference value of QI is computed by Monte Carlo estimation. As highlighted in Figure 4, the closeness of $\mu_{X'}$ to $\mu_X$ is an important factor for the efficiency of the estimator.

Acknowledgments

We thank A. Guyader. Last, we thank two anonymous referees for their careful reading of the article, and their numerous suggestions which allowed us to greatly improve the presentation of this work.