Concentration Inequalities for Two-Sample Rank Processes with Application to Bipartite Ranking

The ROC curve is the gold standard for measuring the performance of a test/scoring statistic regarding its capacity to discriminate between two statistical populations in a wide variety of applications, ranging from anomaly detection in signal processing to information retrieval, through medical diagnosis. Most practical performance measures used in scoring/ranking applications, such as the AUC, the local AUC, the p-norm push, the DCG and others, can be viewed as summaries of the ROC curve. In this paper, the fact that most of these empirical criteria can be expressed as two-sample linear rank statistics is highlighted, and concentration inequalities for collections of such random variables, referred to here as two-sample rank processes, are proved when indexed by VC classes of scoring functions. Based on these nonasymptotic bounds, the generalization capacity of empirical maximizers of a wide class of ranking performance criteria is next investigated from a theoretical perspective. The analysis is also supported by empirical evidence, through convincing numerical experiments.


Introduction
In the context of ranking, a variety of performance measures can be considered. In the simplest framework of bipartite ranking, two independent i.i.d. samples X_1, ..., X_n and Y_1, ..., Y_m, defined on the same probability space (Ω, F, P), valued in the same space Z (say R^d with d ≥ 1 for instance), are drawn from probability distributions G and H respectively (referred to as the 'positive distribution' and the 'negative distribution' respectively). The goal pursued is to learn a preorder on Z, defined through a scoring function s : Z → R (which transports the natural order on the real line onto the feature space Z), such that, for any random observation Z ∈ Z sampled from a distribution equal either to the 'positive distribution' or to the 'negative one', the larger the score s(Z), the likelier it is that Z is drawn from the 'positive distribution' G. Though easy to formulate, this simple framework encompasses many practical problems, from the design of search engines in Information Retrieval (in this case, for a specific request, G is the distribution of the relevant digitized documents, while H is that of the irrelevant ones) to the elaboration of decision support tools in personalized medicine for instance. In spite of its simplicity, there is no single natural scalar criterion for evaluating the performance of a scoring rule s(z), but many possible options. The Receiver Operating Characteristic curve (the ROC curve in abbreviated form), i.e. the PP-plot of the false positive rate vs the true positive rate, t ∈ R → (P{s(Y) > t}, P{s(X) > t}), denoting by X and Y two generic r.v.'s with distributions G and H respectively, provides an exhaustive description of the performance of any candidate scoring rule s. However, its functional nature renders direct optimization strategies rather complex, see e.g. [10].
Empirical risk minimization (ERM) methods are thus generally based on summaries of the ROC curve, which take the form of empirical risk functionals where the averages involved are no longer taken over i.i.d. sequences. The most popular choice is undoubtedly the AUC criterion (AUC standing for Area Under the ROC Curve), see [1] or [4] for instance, but when the focus is on top-ranked instances, various alternatives can be considered, e.g. the Discounted Cumulative Gain or DCG (see [11]), the p-norm push (see [30]), the local AUC (refer to [7]) or other variants such as those recently introduced in [26]. The present paper starts from the simple observation that most of these summary criteria have a common feature: they belong to the class of two-sample linear rank statistics. Such statistics have been extensively studied in the mathematical statistics literature because of their optimality properties in hypothesis testing, see [19]. They are widely used to test whether two samples are drawn from the same distribution, against the alternative stipulating that the distribution of one of the samples is stochastically larger than the other. For instance, the empirical counterpart of the AUC of a scoring function s(z) corresponds to the popular Mann-Whitney-Wilcoxon statistic based on the two (univariate) samples s(X_1), ..., s(X_n) and s(Y_1), ..., s(Y_m). Other rank statistics can be considered, corresponding to other ways of measuring how the distribution of the 'positive score' s(X) is (possibly) stochastically larger than that of the 'negative score' s(Y). From the statistical learning viewpoint, where excess risk bounds are central, the Empirical Risk Minimization paradigm must be revisited, and new problems come up, mainly related to the uniform control of the fluctuations of collections of two-sample linear rank statistics, termed rank processes throughout the article, and to the measurement of the complexity of nonparametric classes of scoring functions.
The arguments required to deal with risk functionals based on two-sample linear rank statistics have been sketched in [7] in a very special case. In the present paper, we relate two-sample linear rank statistics to performance measures relevant for the ranking problem, by showing that the target of ranking algorithms corresponds to optimal ordering rules in this sense. We also show that the generic structure of two-sample linear rank statistics, namely an orthogonal decomposition obtained by projection onto the space of sums of i.i.d. random variables, is the key to all statistical results related to maximizers of such criteria: consistency, rates of convergence or model selection. Notice incidentally that the empirical AUC is also a U-statistic, and that a decomposition method akin to that considered in this paper (though much less general) has been used to handle this specific dependence structure in [4]. In this article, concentration properties of two-sample rank processes (i.e. collections of two-sample linear rank statistics) are investigated using the aforementioned linearization technique. While interesting in themselves, the concentration inequalities established for this class of stochastic processes, when indexed by Vapnik-Chervonenkis classes (VC-classes for short) of scoring functions, are next applied to study the generalization capacity of empirical maximizers of a large collection of performance criteria based on two-sample linear rank statistics. Notice finally that a preliminary version of this work is briefly outlined in the conference paper [8]. This article presents a much deeper analysis of bipartite ranking via maximization of two-sample linear rank statistics. In particular, it offers a complete and detailed study of the concentration properties of two-sample rank processes (in a slightly different framework, stipulating that two independent i.i.d.
samples, positive and negative, are observed, rather than classification data), provides model selection results and, from a practical perspective, tackles the issue of smoothing the risk functionals under study with statistical learning guarantees. The paper is organized as follows. In Section 2, the main notations are set out, the bipartite ranking problem is formulated as a statistical learning task in a rigorous probabilistic framework and the concept of two-sample linear rank statistic is briefly recalled. It is also explained that, unsurprisingly, natural performance criteria in bipartite ranking take the form of two-sample (linear) rank statistics. Concentration results for rank processes are established in Section 3. By means of the latter, the performance of bipartite ranking rules obtained by maximizing two-sample linear rank statistics is investigated in Section 4. Finally, Section 5 displays illustrative experimental results, supporting the theoretical analysis carried out in the present article. Proofs, technical details and additional numerical results are deferred to the Appendix section.

Motivation and Preliminaries
We start by recalling key notions pertaining to ROC analysis and bipartite ranking, which essentially motivate the theoretical analysis carried out in the subsequent sections. We next recall at length the definition of two-sample linear rank statistics, which have been intensively used to design statistical (homogeneity) testing procedures in the univariate setup, and finally highlight that many scalar summaries of empirical ROC curves, commonly used as ranking performance criteria, are precisely of this form. Here and throughout, the indicator function of any event E is denoted by I{E}, the Dirac mass at any point x by δ_x, and the generalized inverse of any cumulative distribution function W(t) on R ∪ {+∞} by W^{-1}(u) = inf{t ∈ ]−∞, +∞] : W(t) ≥ u}, u ∈ [0, 1]. We also denote the floor and ceiling functions by u ∈ R → ⌊u⌋ and u ∈ R → ⌈u⌉ respectively.
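To fix ideas on the notation just introduced, note that the generalized inverse of an empirical c.d.f. reduces to an order statistic of the sample: W^{-1}(u) is the ⌈nu⌉-th smallest observation. The short sketch below checks this numerically (the function name and the grid-based tabulation are illustrative choices of ours, not notation from the paper).

```python
import numpy as np

def generalized_inverse(cdf_values, grid, u):
    """Generalized inverse W^{-1}(u) = inf{t : W(t) >= u} of a
    nondecreasing function W tabulated on a grid of points t."""
    # first grid index where W(t) >= u; the infimum is that grid point
    idx = np.searchsorted(cdf_values, u, side="left")
    return grid[min(idx, len(grid) - 1)]

# empirical c.d.f. of a sample: W(t) = (1/n) * #{i : Z_i <= t}
rng = np.random.default_rng(0)
z = np.sort(rng.normal(size=1000))
cdf = np.arange(1, len(z) + 1) / len(z)

# the generalized inverse of an empirical c.d.f. at u is the
# ceil(n*u)-th order statistic of the sample
u = 0.5
print(generalized_inverse(cdf, z, u))     # empirical median
print(z[int(np.ceil(len(z) * u)) - 1])    # same value
```

The same routine applies verbatim to any nondecreasing tabulated function, e.g. an empirical ROC curve.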

Bipartite Ranking and ROC Analysis
As recalled in the Introduction section, the goal of bipartite ranking is to learn, based on independent 'positive' and 'negative' random samples {X_1, ..., X_n} and {Y_1, ..., Y_m}, how to score any new observations Z_1, ..., Z_k, each being either 'positive' or 'negative', that is to say drawn either from G or from H without prior knowledge of the labels, so that positive instances are mostly at the top of the resulting list with large probability. A natural way of defining a total preorder¹ on Z is to map it onto the natural order on R ∪ {+∞} by means of a scoring rule, i.e. a measurable mapping s : Z → ]−∞, +∞]. The set of all scoring rules is denoted by S. It is by means of ROC analysis that the capacity of a candidate scoring rule s(z) to discriminate between the positive and negative statistical populations is generally evaluated.
ROC curves. The ROC curve is a gold standard to describe the dissimilarity between two univariate probability distributions H and G. This criterion of functional nature, ROC_{H,G}, can be defined as the parametrized curve in [0, 1]²:

t ∈ R → (1 − H(t), 1 − G(t)) = (P{Y > t}, P{X > t}),

where (X, Y) denotes a pair of independent r.v.'s with respective distributions G and H, and where possible jumps are connected by line segments, so as to ensure that the resulting curve is continuous. With this convention, one may then see the ROC curve related to the pair of d.f.'s (H, G) as the graph of a càd-làg (i.e. right-continuous and left-limited) nondecreasing mapping valued in [0, 1], defined by:

α ∈ (0, 1) → ROC_{H,G}(α) = 1 − G ∘ H^{-1}(1 − α).
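The parametrization above translates directly into code: evaluating, at every threshold t, the pair of empirical exceedance probabilities yields the points of the empirical ROC curve. The sketch below (function names are ours, for illustration) computes these points for two simulated samples of scores, omitting the linear interpolation step.

```python
import numpy as np

def empirical_roc_points(pos_scores, neg_scores):
    """Points (1 - H_m(t), 1 - G_n(t)) of the empirical ROC curve,
    evaluated at all observed thresholds t."""
    thresholds = np.sort(np.concatenate([pos_scores, neg_scores]))[::-1]
    fpr = np.array([(neg_scores > t).mean() for t in thresholds])  # 1 - H_m(t)
    tpr = np.array([(pos_scores > t).mean() for t in thresholds])  # 1 - G_n(t)
    return fpr, tpr

rng = np.random.default_rng(1)
pos = rng.normal(loc=1.0, size=200)   # scores s(X_i), drawn from G_s
neg = rng.normal(loc=0.0, size=300)   # scores s(Y_j), drawn from H_s
fpr, tpr = empirical_roc_points(pos, neg)
# the curve starts at (0, 0) (largest threshold) and ends near (1, 1);
# TPR dominates FPR here since the positive scores are stochastically larger
print(fpr[0], tpr[0], fpr[-1], tpr[-1])
```

Connecting consecutive points by line segments gives the continuous broken line mentioned in the text.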
¹ A preorder on a set Z is a reflexive and transitive binary relation on Z. It is said to be total when either z ⪯ z′ or z′ ⪯ z holds true, for all (z, z′) ∈ Z². ² Given two distribution functions H(dt) and G(dt) on R ∪ {+∞}, it is said that G(dt) is stochastically larger than H(dt) iff, for any t ∈ R, we have G(t) ≤ H(t). We then write: H ≤_sto G. Classically, a necessary and sufficient condition for G to be stochastically larger than H is the existence of a coupling of (H, G), i.e. a pair of random variables (X, Y) defined on the same probability space with respective marginal distributions H and G, such that X ≤ Y with probability one.  Bipartite Ranking as ROC curve optimization. Going back to the multivariate setup, where H and G are probability distributions on Z, say Z = R^d with arbitrary dimension d ≥ 1, the goal pursued in bipartite ranking can be phrased as that of building a scoring rule s(z) such that the (univariate) distribution G_s of s(X) is 'as stochastically larger² as possible' than the distribution H_s of s(Y). Hence, the capacity of a candidate s(z) to discriminate between the positive and negative statistical populations can be evaluated by plotting the ROC curve α ∈ (0, 1) → ROC(s, α) = ROC_{H_s,G_s}(α): the closer the curve ROC(s, ·) to the left upper corner of the unit square, the better the scoring rule s. The ROC curve thereby conveys a partial preorder on the set of all scoring functions: for any pair of scoring functions s_1 and s_2, one says that s_2 is more accurate than s_1 when ROC(s_1, α) ≤ ROC(s_2, α) for all α ∈ [0, 1]. It follows from a standard Neyman-Pearson argument that the most accurate scoring rules are increasing transforms of the likelihood ratio Ψ(z) = dG/dH(z). Precisely, it is shown in [9] (see Proposition 2 therein) that the optimal scoring rules are the elements of the set:

S* = {T ∘ Ψ : T strictly increasing}.

We set ROC*(·) = ROC(Ψ, ·)
and recall incidentally that this optimal curve is non-decreasing and concave, and thus always above the main diagonal of the unit square. The bipartite ranking task can now be reformulated in a more quantitative manner: the objective pursued is to build a scoring function s(z), based on the training examples {X_1, ..., X_n} and {Y_1, ..., Y_m}, with a ROC curve as close as possible to ROC*. A typical way of measuring the deviation between the two curves is to consider the distance in sup norm:

d_∞(s) = sup_{α ∈ (0,1)} | ROC*(α) − ROC(s, α) |.   (2.3)

Attention should be paid to the fact that this quantity is a distance between ROC curves (or between the related equivalence classes of scoring functions, the ROC curve of any scoring function being invariant under strictly increasing transforms), not between the scoring functions themselves. Since the curve ROC* is unknown in practice, the major difficulty lies in the fact that no straightforward statistical counterpart of the (functional) loss (2.3) is available. In [9] (see also [10]), it has however been shown that bipartite ranking can be viewed as a superposition of cost-sensitive classification problems and somehow 'discretized' in an adaptive manner, so as to apply empirical risk minimization with statistical guarantees in the d_∞-sense, at the price of an additional bias term inherent to the approximation step. Alternatively, the performance of a candidate scoring rule s can be measured by means of the L¹-norm in the ROC space. Observing that, in this case, the loss can be decomposed as follows:

∫_0^1 | ROC*(α) − ROC(s, α) | dα = AUC* − AUC(s),   (2.4)

since ROC* dominates any ROC curve pointwise, minimizing the L¹-distance to the optimal ROC curve boils down to maximizing the area under the curve ROC(s, ·), that is to say

AUC(s) = P{s(Y) < s(X)} + (1/2) P{s(Y) = s(X)},   (2.5)

where X and Y are independent random variables defined on the same probability space, with respective distributions G and H, denoting by G_s and H_s the distributions of s(X) and s(Y) respectively. The scalar performance criterion AUC(s) defines a total preorder on S and its maximal value is denoted by AUC* = AUC(s*), with s* ∈ S*.
Bipartite ranking through maximization of empirical versions of the AUC criterion has been studied in several articles, including [1] or [4]. The extension to multipartite ranking (i.e. when the number of samples/distributions under study is three or more) is considered in [6], see also [5]. In contrast to [9] or [10], where the task of learning scoring rules with statistical guarantees in sup norm in the ROC space is considered, the present article focuses on the optimization of summary scalar empirical criteria generalizing the AUC, which take the form of two-sample linear rank statistics, as could be naturally expected when addressing ranking problems.

Two-Sample Linear Rank Statistics
If the curve ROC_{H,G} is the appropriate tool to examine to which extent a univariate distribution G is stochastically larger than another one H, practical decisions are generally made on the basis of the observations of two independent univariate i.i.d. samples X_1, ..., X_n and Y_1, ..., Y_m, drawn from G and H respectively. The empirical ROC curve (2.6) is obtained by plugging into the definition recalled above the empirical distribution functions

Ĝ_n(t) = (1/n) Σ_{i=1}^n I{X_i ≤ t} and Ĥ_m(t) = (1/m) Σ_{j=1}^m I{Y_j ≤ t},

and we set N = n + m. Breakpoints of the piecewise linear curve (2.6) necessarily belong to the set of gridpoints {(j/m, i/n) : j ∈ {1, ..., m − 1} and i ∈ {1, ..., n − 1}}. Denote by Rank(Z) the rank of any observation Z in the pooled sample, by X_(i) the order statistics related to the sample {X_1, ..., X_n}, so that Rank(X_(n)) > · · · > Rank(X_(1)), and by Y_(j) those related to the sample {Y_1, ..., Y_m}. Consider also the càd-làg step function (2.8), whose jump points are determined by the 'positive ranks'. The ROC curve (2.6) is the continuous broken line that connects the jump points of the step curve (2.8) and can thus be expressed as a function of the 'positive ranks', i.e. the Rank(X_i)'s only. As a consequence, any summary of the empirical ROC curve is a two-sample rank statistic, that is, a measurable function of the 'positive ranks'. In particular, the empirical AUC, i.e. the AUC of the empirical ROC curve (2.6), also termed the rate of concordant pairs or the Mann-Whitney statistic, can easily be shown to coincide, up to an affine transform, with the sum of 'positive ranks', the well-known rank-sum Wilcoxon statistic [37]:

Ŵ_n = Σ_{i=1}^n Rank(X_i).
However, two-sample rank statistics (i.e. functions of the Rank(X i )'s) form a very rich collection of statistics and this is by no means the sole possible choice to summarize the empirical ROC curve.
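The affine relation between the empirical AUC and the rank-sum Wilcoxon statistic can be checked numerically: for continuous samples (no ties), the sum of 'positive ranks' Ŵ_n = Σ_i Rank(X_i) satisfies AUC_hat = (Ŵ_n − n(n+1)/2)/(nm). The sketch below verifies this identity on simulated data.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 150, 250
x = rng.normal(1.0, 1.0, size=n)   # 'positive' sample, drawn from G
y = rng.normal(0.0, 1.0, size=m)   # 'negative' sample, drawn from H

# ranks of the positive observations in the pooled sample (ranks 1..N)
pooled = np.concatenate([x, y])
ranks = pooled.argsort().argsort() + 1
W = ranks[:n].sum()                        # rank-sum Wilcoxon statistic

# empirical AUC = rate of concordant pairs (no ties almost surely here)
auc = (x[:, None] > y[None, :]).mean()

# affine relation between the two statistics
print(abs(auc - (W - n * (n + 1) / 2) / (n * m)))  # ~ 0 up to float error
```

The identity holds exactly for any tie-free configuration of the two samples, not only on average.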
Definition 1. (Two-sample linear rank statistic) Let φ : [0, 1] → R be a nondecreasing function. The two-sample linear rank statistic with 'score-generating function' φ(u), based on the random samples {X_1, ..., X_n} and {Y_1, ..., Y_m}, is given by:

W^φ_{n,m} = (1/n) Σ_{i=1}^n φ( Rank(X_i) / (N + 1) ).   (2.10)

The statistics (2.10) defined above are all distribution-free when H = G and are, for this reason, particularly useful to detect differences between the distributions H and G; they are widely used to perform homogeneity tests in the univariate setup. Tabulating their distribution under the null assumption, one can design unbiased tests at given levels α in (0, 1). The choice of the score-generating function φ can be guided by the type of difference between the two distributions (e.g. in scale, in location) one possibly expects, and may then lead to locally most powerful testing procedures, capable of detecting 'small' deviations from the homogeneous situation. More generally, depending on the statistical test to perform, one may use a particular function φ; Figure 2 shows classic score-generating functions broadly used for two-sample statistical tests (refer to [17]). One may refer to Chapter 9 in [31] or to Chapter 13 in [35] for an account of the (asymptotic) theory of rank statistics. In the present paper, two-sample linear rank statistics are used for a very different purpose, as empirical performance measures in bipartite ranking based on two independent multivariate samples {X_1, ..., X_n} and {Y_1, ..., Y_m}. The analysis of the bipartite ranking problem carried out in Section 4, based on the concentration inequalities established in Section 3, shows the relevance of evaluating the ranking performance of a scoring rule candidate s(z) by computing a two-sample linear rank statistic based on the univariate samples obtained after scoring, {s(X_1), ..., s(X_n)} and {s(Y_1), ..., s(Y_m)}, and establishes statistical guarantees for the generalization capacity of scoring rules built by optimizing such an empirical criterion.
Figure 2: Classic score-generating functions; for instance, the van der Waerden test uses φ_vdW(u) = Φ^{-1}(u) (in green), Φ being the standard normal quantile function.

Bipartite Ranking as Maximization of Two-Sample Rank Statistics
As foreshadowed above, empirical performance measures in bipartite ranking should unsurprisingly be based on ranks. We propose here to evaluate empirically the ranking performance of any scoring function candidate s(z) in S by means of statistics of the type:

W^φ_{n,m}(s) = (1/n) Σ_{i=1}^n φ( Rank(s(X_i)) / (N + 1) ),   (2.11)

where N = n + m, Rank(s(X_i)) denotes the rank of s(X_i) in the pooled sample of scores and φ : [0, 1] → R is some Borelian nondecreasing function. This quantity is a two-sample linear rank statistic (see Definition 1) related to the score-generating function φ(u) and the samples {s(X_1), ..., s(X_n)} and {s(Y_1), ..., s(Y_m)}. This statistic is invariant under increasing transforms of the scoring function s, just like the (empirical) ROC curve, and, as recalled in the previous section, it is a natural and common choice to quantify differences in distribution between the univariate samples {s(X_1), ..., s(X_n)} and {s(Y_1), ..., s(Y_m)}, in particular to evaluate to which extent the distribution of the first sample is stochastically larger than that of the second. It consequently appears legitimate to learn a scoring function s by maximizing the criterion (2.11). Whereas rigorous arguments are developed in Section 4, we highlight here that, for specific choices of the score-generating function φ, many relevant criteria considered in the ranking literature can be accurately approximated by statistics of this form: • φ(u) = u - this choice leads to the celebrated Wilcoxon-Mann-Whitney statistic, which is related to the empirical version of the AUC.
• φ(u) = uI{u ≥ u 0 }, for some u 0 ∈ (0, 1) -such a score-generating function corresponds to the local AUC criterion, introduced recently in [7]. Such a criterion is of interest when one wants to focus on the highest ranks.
• φ(u) = u q -this is another choice which puts emphasis on high ranks but in a smoother way than the previous one. This is related to the q-norm push approach taken in [30]. However, we point out that the criterion studied in the latter work relies on a different definition of the rank of an observation. Namely, the rank of positive instances among negative instances (and not in the pooled sample) is used. This choice permits to use independence which makes the technical part much simpler, at the price of increasing the variance of the criterion.
• φ(u) = φ_N(u) = c(⌈(N + 1)u⌉) I{u ≥ k/(N + 1)} - this corresponds to the DCG criterion in the bipartite setup (see [11]), one of the 'gold standard' quality measures in information retrieval when grades are binary. The c(i)'s denote the discount factors, c(i) measuring the importance of rank i, and the integer k is the number of top-ranked instances to take into account. Notice that, with our indexation, top positions correspond to the largest ranks, so that the sequence {c(i)}_{i ≤ N} should be chosen increasing.
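The families of criteria listed above can all be computed with the same few lines of code, only the score-generating function changing. The sketch below evaluates the statistic (2.11) on simulated scores for illustrative instances of each family (the particular constants u_0 = 0.8, q = 4 and the logarithmic discount c(i) = log2(1 + i) are arbitrary choices made for the demonstration, not values prescribed by the cited works).

```python
import numpy as np

def W_phi(phi, s_x, s_y):
    """Empirical criterion (2.11): average of phi(Rank/(N+1)) over the
    'positive' scores, ranks being computed in the pooled sample."""
    N = len(s_x) + len(s_y)
    ranks = np.concatenate([s_x, s_y]).argsort().argsort() + 1  # ranks 1..N
    return phi(ranks[:len(s_x)] / (N + 1)).mean()

rng = np.random.default_rng(3)
s_x = rng.normal(1.0, 1.0, size=200)   # scores s(X_i) of the positives
s_y = rng.normal(0.0, 1.0, size=300)   # scores s(Y_j) of the negatives

criteria = {
    "WMW":       lambda u: u,                     # empirical AUC-type
    "local AUC": lambda u: u * (u >= 0.8),        # focus on top ranks
    "q-norm":    lambda u: u ** 4,                # smoother top emphasis
    "DCG-type":  lambda u: np.log2(1.0 + 501 * u) * (u >= 0.9),
}
results = {name: W_phi(phi, s_x, s_y) for name, phi in criteria.items()}
for name, val in results.items():
    print(name, round(val, 4))
```

Since the positive scores are stochastically larger here, the WMW-type value exceeds 1/2; the top-focused criteria discard the contribution of all but the highest-ranked positives.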
Depending on the choice of the score-generating function φ, some specific patterns of the preorder induced by a scoring function s(z) can be either enhanced by the criterion (2.11) or else completely disappear: for instance, the value of (2.11) is essentially determined by the possible presence of positive instances among top-ranked observations, when considering a score generating function φ that rapidly vanishes near 0 and takes much higher values near 1.
Investigating the performance of maximizers of the criterion (2.11) from a nonasymptotic perspective is however far from straightforward, due to the complexity of the latter (a sum of strongly dependent random variables). It requires in particular proving concentration inequalities for collections of two-sample linear rank statistics indexed by classes of scoring functions of controlled complexity (of VC-type), referred to as two-sample rank processes throughout the article. It is the purpose of the next section to establish such results.

Concentration Inequalities for Two-Sample Rank Processes
This section is devoted to proving concentration bounds for collections of two-sample linear rank statistics (2.11), indexed by classes S_0 ⊂ S of scoring functions. In order to study the fluctuations of (2.11) as the full sample size N = n + m tends to infinity, with n/N → p ∈ (0, 1), consider the empirical c.d.f. of the pooled sample of scores:

F_{s,N}(t) = (n/N) Ĝ_{s,n}(t) + (m/N) Ĥ_{s,m}(t), t ∈ R,   (3.1)

where Ĝ_{s,n} and Ĥ_{s,m} denote the empirical c.d.f.'s of the samples {s(X_1), ..., s(X_n)} and {s(Y_1), ..., s(Y_m)} respectively. Since n/N → p as N tends to infinity, the quantity above is a natural estimator of the c.d.f. F_s = pG_s + (1 − p)H_s. Equipped with these notations, and observing that Rank(s(X_i)) = N F_{s,N}(s(X_i)), we can write:

W^φ_{n,m}(s) = (1/n) Σ_{i=1}^n φ( (N/(N + 1)) F_{s,N}(s(X_i)) ).   (3.2)

Hence, the statistic (3.2) can naturally be seen as an empirical version of the quantity defined below, around which it fluctuates.
Definition 2. For a given score-generating function φ, the functional

W_φ(s) = E[ φ( F_s(s(X)) ) ], s ∈ S,   (3.3)

is referred to as the "W_φ-ranking performance measure".
Indeed, replacing F_{s,N}(s(X_i)) in (3.2) by F_s(s(X_i)) and next taking the expectation permits to recover (3.3). Observe in addition that, for φ(u) = u, the quantity (3.3) coincides, up to an affine transform, with AUC(s) (2.5) as soon as the distribution F_s is continuous: in that case, W_φ(s) = p/2 + (1 − p) AUC(s). The next lemma reveals that the criterion (3.3) can be viewed as a scalar summary of the ROC curve.
Lemma 3. Let φ be a score-generating function. We have, for all s in S such that G_s is continuous:

W_φ(s) = ∫_0^1 φ( p·u + (1 − p) H_s ∘ G_s^{-1}(u) ) du.   (3.4)

Proof. Using the decomposition F_s = pG_s + (1 − p)H_s, we are led to the following expression:

W_φ(s) = ∫_{t ∈ R} φ( p G_s(t) + (1 − p) H_s(t) ) G_s(dt).

Then, using the change of variable u = G_s(t), we get (3.4). Since the pairs (1 − H_s(t), 1 − G_s(t)), t ∈ R, parametrize the curve ROC(s, ·), the quantity (3.4) is a functional of the latter. As revealed by Eq. (3.4), a score-generating function φ that takes much higher values near 1 than near 0 defines a criterion (3.3) that mainly summarizes the behavior of the ROC curve near the origin, i.e. the preorder on the set of instances with highest scores.
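With the normalization used in this section (the W_φ measure taken as E[φ(F_s(s(X)))]), the statistic (3.2) and its population counterpart (3.3) can be compared by simulation. The sketch below takes Gaussian score distributions, for which, with φ(u) = u, the population criterion equals p/2 + (1 − p)AUC(s) in closed form; the empirical average of the Rank/(N+1)'s then lands close to this value. All distributional choices here are ours, for illustration.

```python
import numpy as np
from math import erf, sqrt

Phi = lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0)))   # standard normal c.d.f.

# scores: s(X) ~ N(1, 1) ('positive'), s(Y) ~ N(0, 1) ('negative')
mu = 1.0
auc = Phi(mu / sqrt(2.0))          # AUC(s) = P{s(Y) < s(X)}

rng = np.random.default_rng(4)
n, m = 4000, 6000
N, p = n + m, n / (n + m)
sx = rng.normal(mu, 1.0, size=n)
sy = rng.normal(0.0, 1.0, size=m)

# N * F_{s,N}(s(X_i)) is exactly the rank of s(X_i) in the pooled sample,
# so the statistic (3.2) with phi(u) = u is the average of Rank/(N+1)
ranks = np.concatenate([sx, sy]).argsort().argsort() + 1
stat = (ranks[:n] / (N + 1)).mean()

# population counterpart (3.3): E[F_s(s(X))] = p/2 + (1 - p) * AUC(s)
target = p / 2 + (1 - p) * auc
print(round(stat, 3), round(target, 3))
```

The gap between the two printed values is of order 1/√N, in line with the rates discussed below.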
Below, we investigate the concentration properties of the two-sample rank process

{ W^φ_{n,m}(s) }_{s ∈ S_0}.   (3.5)

As a first step, we prove, by means of linearization techniques, that two-sample linear rank statistics can be uniformly approximated by much simpler quantities, involving i.i.d. averages and two-sample U-statistics. This will be key to establishing probability bounds for the maximal deviation

sup_{s ∈ S_0} | W^φ_{n,m}(s) − W_φ(s) |   (3.6)

under adequate complexity assumptions on the class S_0 of scoring functions considered, and to studying next the generalization ability of maximizers of the empirical criterion (3.2) in terms of W_φ-ranking performance. Throughout the article, all the suprema considered, such as (3.6), are assumed to be measurable; we refer to Chapter 2.3 in [36] for more details on the formulation in terms of outer measure/expectation that guarantees measurability.
Uniform approximation of two-sample linear rank statistics. Whereas statistical guarantees for Empirical Risk Minimization in the context of classification or regression can be directly obtained by means of classic concentration results for empirical processes (i.e. averages of i.i.d. random variables), the study of the fluctuations of the process (3.5) is far from straightforward, insofar as the terms averaged in (3.2) are not independent. For averages of non-i.i.d. random variables, the underlying statistical structure can be revealed by orthogonal projections onto the space of sums of i.i.d. random variables in many situations. This projection argument was the key for the study of empirical AUC maximization or that of within cluster point scatter, which involved U -processes, see [4] and [3]. In the case of U -statistics, this orthogonal decomposition is known as the Hoeffding decomposition and the remainder may be expressed as a degenerate U -statistic, see [20]. For rank statistics, a similar though more complex decomposition can be considered. We refer to [18] for a systematic use of the projection method for investigating the asymptotic properties of general statistics. From the perspective of ERM in statistical learning theory, through the projection method, well-known concentration results for standard empirical processes and U -processes may carry over to more complex collections of random variables such as two-sample linear rank processes, as revealed by the approximation result stated below. It holds true under the following technical assumptions.
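The projection argument can be visualized on the simplest instance, the Mann-Whitney statistic (φ(u) = u, so no second-order term). Its Hájek projection onto sums of i.i.d. variables is explicit for Gaussian score distributions, and a small simulation shows that the remainder, a degenerate two-sample U-statistic, fluctuates at the faster 1/N rate, while the statistic itself fluctuates at the 1/√N rate. All choices below (sample sizes, number of replications, Gaussian model) are illustrative.

```python
import numpy as np
from math import erf, sqrt

Phi = np.vectorize(lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0))))
mu = 1.0
theta = float(Phi(mu / sqrt(2.0)))   # AUC = P{Y < X}, X~N(mu,1), Y~N(0,1)

def deviation_and_projection(n, rng):
    x = rng.normal(mu, 1.0, size=n)
    y = rng.normal(0.0, 1.0, size=n)
    u = (y[None, :] < x[:, None]).mean()          # Mann-Whitney statistic
    # Hajek projection of u - theta onto sums of i.i.d. variables:
    # h1(x) = P{Y < x} = Phi(x),  h2(y) = P{y < X} = Phi(mu - y)
    proj = (Phi(x) - theta).mean() + (Phi(mu - y) - theta).mean()
    return u - theta, proj

rng = np.random.default_rng(5)
stds = {}
for n in (50, 200, 400):
    draws = [deviation_and_projection(n, rng) for _ in range(300)]
    dev = np.array([d for d, _ in draws])
    rem = np.array([d - p for d, p in draws])
    stds[n] = (dev.std(), rem.std())
    # the projection captures the O(1/sqrt(N)) fluctuations; the
    # remainder (degenerate two-sample U-statistic) is of order 1/N
    print(n, round(dev.std(), 4), round(rem.std(), 4))
```

The second printed column shrinks roughly linearly in the sample size, an order of magnitude faster than the first, which is exactly what makes the remainder negligible in the nonasymptotic analysis.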
For all s ∈ S_0, the random variables s(X) and s(Y) are continuous, with density functions that are twice differentiable and have Sobolev W^{2,∞}-norms bounded by M < +∞.
For the definition of VC classes of functions, one may refer e.g. to [36] (see Section 2.6.2 therein); it is also recalled in Appendix A.3. By means of the proposition below, the study of the fluctuations of the two-sample linear rank process (3.5) boils down to that of basic empirical processes.
Proposition 4. Suppose that Assumptions 1-3 are fulfilled. The two-sample linear rank process (3.5) can be linearized/decomposed as follows: for all s ∈ S_0,

W^φ_{n,m}(s) = W_φ(s) + V^φ_{n,m}(s) + R_{n,m}(s),   (3.7)

where V^φ_{n,m}(s) is a sum of two independent i.i.d. averages (the Hájek projection of the recentered statistic) and the remainder R_{n,m}(s) is uniformly negligible: for any δ ∈ (0, 1), there exist constants c_1, c_3 > 0, c_2 ≥ 1, c_4 > 6 and c_5 > 3, depending on φ and V, such that, with probability at least 1 − δ, sup_{s ∈ S_0} |R_{n,m}(s)| is of order log(c_2/δ)/N. The proof of this linearization result is detailed in the Appendix section B.1 (refer to it for the explicit form of the bound and for a description of the constants involved).
Its main argument consists in decomposing (3.2) by means of a second-order Taylor expansion of the score-generating function φ(u), and in next applying the Hájek orthogonal projection technique (recalled in Lemma A.1 of the Appendix for completeness) to the component corresponding to the first-order term. The quantity R_{n,m}(s) is then formed by bringing together the remainder of the Hájek projection and the component corresponding to the second-order term of the Taylor expansion, while the probabilistic control of its order of magnitude is established by means of concentration results for (degenerate) one- and two-sample U-processes (see the Appendix section A.4 for more details). It follows from the decomposition (3.7), combined with the triangle inequality, that nonasymptotic bounds for the maximal deviation of the process (3.5) can be deduced from concentration inequalities for standard empirical processes, as shall be seen below. Before this, a few comments are in order.

Remark 1. (On the complexity assumption)
We point out that alternative complexity measures could be naturally considered, such as those based on Rademacher averages, see e.g. [22]. However, as different types of stochastic process (i.e. empirical process, degenerate one-sample U -process and degenerate two-sample U -process) are involved in the present nonasymptotic study, different types of Rademacher complexities (see e.g. [4]) should be introduced to control their fluctuations as well. For the sake of simplicity, the concept of VC-type class of functions is used here.

Remark 2. (Smooth score-generating functions)
The subsequent analysis is restricted to the case of smooth score-generating functions for simplification purposes. We nevertheless point out that, although one may always build smooth approximants of irregular score generating functions, the theoretical results established below can be directly extended to non-smooth situations, at the price of a significantly greater technical complexity.
The theorem below provides a concentration bound for the two-sample rank process (3.5). The proof is based on the uniform approximation result established above; refer to the Appendix section B.3 for technical details.
Theorem 5. Suppose that the assumptions of Proposition 4 are fulfilled. Then, there exist constants C_1, C_2 ≥ 24 depending on φ and V, as well as a constant C_4 ≥ C_1 depending on φ, such that, for all t > 0 and as soon as N is large enough, the maximal deviation (3.6) satisfies an exponential tail bound in t. The concentration inequalities stated above are extensively used in the next section to study the bipartite ranking problem, when formulated as W_φ-ranking performance maximization.

Performance of Maximizers of Two-Sample Rank Statistics in Bipartite Ranking
This section provides a theoretical analysis of bipartite ranking methods based on maximization of the empirical ranking performance measure (2.11).
Optimal elements. The next result states that optimal scoring functions do maximize the W_φ-ranking performance and form a collection that coincides with the set S*_φ of maximizers of (3.3), provided that the score-generating function φ is strictly increasing on (0, 1).
Proposition 6. Let φ be a score-generating function. The assertions below hold true.
Remark 3. (On plug-in ranking rules) Theoretically, a possible approach to bipartite ranking is the plug-in method ([12]), which consists of using an estimate Ψ̂ of the likelihood ratio Ψ as a scoring function. As shown by the subsequent bound, when φ is differentiable with a bounded derivative and Ψ̂ is close to Ψ in the L¹-sense, it leads to a nearly optimal ordering in terms of the W_φ-ranking criterion, the deficit W_φ(Ψ) − W_φ(Ψ̂) being controlled by the L¹ distance between Ψ̂ and Ψ. However, this bound may be loose, and the plug-in approach faces computational difficulties when dealing with high-dimensional data, see [16]; this provides the motivation for exploring algorithms based on W_φ-ranking performance maximization.

Remark 4. (Alternative probabilistic framework)
We point out that the present analysis can be extended to the alternative setup, where, rather than assuming that two samples of sizes n and m, 'positive' and 'negative', are available for the learning tasks considered in this paper, the i.i.d. observations Z are supposed to come with a random label Y either equal to +1 or else to −1, indicating whether Z is distributed according to G or H. If p denotes the probability that the label Y is equal to 1, the number n of positive observations among a training sample of size N is then random, distributed as a binomial of size N with parameter p.
Consider any maximizer of the empirical W_φ-ranking performance measure over a class S_0 ⊂ S of scoring rules: Since we obviously have: the control of the deficit of W-ranking performance of empirical maximizers of (3.2) can be deduced from the concentration properties of the process (3.5).

Generalization Error Bounds and Model Selection
The corollary below describes the generalization capacity of scoring rules based on empirical maximization of W φ -ranking performance criteria. It straightforwardly results from Theorem 5 combined with the bound (4.2).
The result above establishes that maximizers of the empirical criterion (2.11) achieve a classic learning rate bound of order O_P(1/√N) when based on a training data set of size N, just like in standard classification, see e.g. [12]. Refer to Appendix section B.4 for the proof of an additional result that provides a bound in expectation for the deficit of W_φ-ranking performance, similar to that established in the subsequent analysis devoted to the model selection issue.
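The O_P(1/√N) rate can be illustrated with a small Monte Carlo sketch of ours (not one of the paper's experiments): taking φ(u) = u, the W_φ-criterion reduces to the AUC, and the maximal deviation between empirical and population AUC over a small finite class of linear scoring functions, under an illustrative Gaussian location model, shrinks roughly like 1/√N:

```python
import numpy as np
from math import erf

rng = np.random.default_rng(0)

def Phi(x):  # standard normal c.d.f.
    return 0.5 * (1.0 + erf(x / np.sqrt(2.0)))

# Finite class of linear scoring functions s_theta(z) = <theta, z> on R^2:
thetas = [np.array([np.cos(t), np.sin(t)]) for t in np.linspace(0.0, np.pi, 8)]
mu = np.array([1.0, 0.0])         # X ~ N(mu, I), Y ~ N(0, I): illustrative location model

def auc_emp(theta, X, Y):
    sx, sy = X @ theta, Y @ theta
    return np.mean(sx[:, None] > sy[None, :])

def auc_true(theta):
    # s_theta(X) - s_theta(Y) ~ N(<theta, mu>, 2 ||theta||^2)
    return Phi(theta @ mu / (np.sqrt(2.0) * np.linalg.norm(theta)))

def mean_sup_dev(N, n_rep=100):
    """Average over replications of sup over the class of |empirical - true| AUC."""
    n = m = N // 2
    devs = []
    for _ in range(n_rep):
        X = rng.normal(size=(n, 2)) + mu
        Y = rng.normal(size=(m, 2))
        devs.append(max(abs(auc_emp(th, X, Y) - auc_true(th)) for th in thetas))
    return float(np.mean(devs))

dev_small, dev_large = mean_sup_dev(200), mean_sup_dev(800)
# Quadrupling N should roughly halve the supremum deviation.
```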
Model selection by complexity penalization. We have investigated the issue of approximately recovering the best scoring rule in a given class S_0 in the sense of the W_φ-ranking performance measure (3.3), which is satisfactory only when the maximum achieved over S_0 is close to W*_φ of course. We now address the problem of model selection, that is, the problem of selecting a good scoring function from one of a collection of VC classes S_k, k ≥ 1. A model selection method is a data-based procedure that aims at achieving a trade-off between two contradictory objectives: finding a class S_k rich enough to contain a reasonable approximant of an element of S*, while not being so complex that the performance of the empirical maximizer over it, ŝ_k = arg max_{s ∈ S_k} W^φ_{n,m}(s), can no longer be statistically guaranteed. We suppose that all candidate classes S_k, k ≥ 1, fulfill the assumptions of Proposition 4 and denote by V_k the VC dimension of the class S_k. Various model selection techniques, based on (re-)sampling or data-splitting procedures, could naturally be considered for this purpose. Here, in order to avoid overfitting, we focus on a complexity regularization approach, whose study can be directly derived from the rate bound analysis previously carried out, and which consists in subtracting from the empirical ranking performance measure the penalty term (increasing with V_k) given by: for pN ≥ B_2 V_k, where the constants B_1 and B_2 are those involved in Proposition 21 and C = 6(‖φ‖²_∞ + 9‖φ′‖²_∞ + 9‖φ″‖²_∞). The scoring function selected is the one maximizing the penalized empirical ranking performance measure: it is ŝ_k̂(z), where k̂ maximizes the penalized criterion over k ≥ 1. The result below shows that the scoring rule ŝ_k̂ nearly achieves the expected deficit of W_φ-ranking performance that would have been attained with the help of an oracle revealing the model minimizing it.

Proposition 8. Suppose that the assumptions of Proposition 4 are fulfilled for any class S_k with k ≥ 1 and that sup_{k≥1} V_k < +∞.
Then, we have: as soon as pN ≥ B_2 sup_{k≥1} V_k, where the constant B_2 > 0 is the same as that involved in Proposition 21 and C = 6(‖φ‖²_∞ + 9‖φ′‖²_∞ + 9‖φ″‖²_∞). Refer to Appendix section B.5 for the technical proof.
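A minimal sketch of such a penalized selection rule, with φ(u) = u, nested classes of linear scoring functions whose 'dimension' V_k grows with k, a naively fitted direction per class, and an arbitrary penalty constant (all our illustrative choices, not the paper's exact penalty), reads:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, m = 8, 1000, 1000
# Illustrative location model: only the first coordinate carries signal
mu = np.zeros(d); mu[0] = 1.5
X = rng.normal(size=(n, d)) + mu
Y = rng.normal(size=(m, d))

def auc_emp(theta, X, Y):
    sx, sy = X @ theta, Y @ theta
    return np.mean(sx[:, None] > sy[None, :])

N, p = n + m, n / (n + m)
results = []
for k in range(1, d + 1):                 # nested classes S_k: scores using the first k coords
    theta_k = np.zeros(d)
    theta_k[:k] = X[:, :k].mean(axis=0) - Y[:, :k].mean(axis=0)  # simple fitted direction
    pen_k = np.sqrt(k / (p * N))          # penalty increasing with the complexity V_k = k
    results.append((auc_emp(theta_k, X, Y) - pen_k, k))

best_value, k_hat = max(results)          # penalized empirical maximizer k-hat
```

With signal only in the first coordinate, the penalty makes the selection favor the smallest class containing it, mimicking the oracle trade-off described above.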

Kernel Regularization for Ranking Performance Maximization
where κ(t) = ∫_{−∞}^{t} K(u) du and h > 0 is the bandwidth that determines the degree of smoothing, see e.g. [27]. The uniform integrated error sup_{s∈S_0} ∫ |F̂_{s,h}(t) − F_s(t)| dt is shown to be of order O(h²) under the assumptions recalled below, see [21].
Assumption 5. The kernel function K is of the form K 1 • K 2 , where K 1 is a function of bounded variation and K 2 is a polynomial.
Notice that Assumption 4 is fulfilled as soon as Assumption 1 is satisfied with R ≥ M. The statistical counterpart of (4.7) is then: A smooth version of the theoretical criterion (3.3) is given by: for all s ∈ S, and an empirical version of the latter is W^φ_{n,m,h}(s)/n, where: For any maximizer s̄ of (4.10) over the class S_0 of scoring function candidates, we almost surely have: This decomposition is similar to that obtained in (4.2) for maximizers of the criterion (2.11), apart from the additional bias term. Since the latter can be shown to be of order O(h²) under appropriate regularity conditions, and the first term on the right-hand side of the equation above can be controlled as in Theorem 5, one may bound the deficit of W_φ-ranking performance of s̄ as follows.
Proposition 9. Suppose that the assumptions of Proposition 4 are fulfilled, as well as Assumptions 4 and 5. Let s̄ be any maximizer of the smoothed criterion (4.10) over the class S_0. Then, for any δ ∈ (0, 1), there exist constants C_1, C_3 > 0 and C_2 ≥ 24 depending on φ, K, R and V, C_4 ≥ C_1, and a constant C_5 > 0 depending on φ, K and R, such that we have with probability at least 1 − δ: as soon as N ≥ 1/(p min(p, 1 − p)² C_3 C_4²) log(C_2/δ) and δ ≤ C_2 e^{−C_1² C_3}, with: The proof is detailed in Appendix section B.6.
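For illustration, the kernel-smoothed empirical distribution function F̂_{s,h} used above can be computed as follows; the Gaussian kernel (so that κ is the standard normal c.d.f.) and the bandwidth values are our illustrative choices:

```python
import numpy as np
from math import erf

def kappa(t):
    """kappa(t) = int_{-inf}^t K(u) du, for the standard Gaussian kernel K."""
    t = np.asarray(t, dtype=float)
    return 0.5 * (1.0 + np.vectorize(erf)(t / np.sqrt(2.0)))

def smoothed_ecdf(scores, t, h):
    """Smoothed empirical c.d.f. of the scores: (1/N) * sum_i kappa((t - s(Z_i)) / h)."""
    return float(np.mean(kappa((t - np.asarray(scores)) / h)))

rng = np.random.default_rng(0)
scores = rng.uniform(size=2000)

# As h -> 0, the smoothed c.d.f. recovers the raw empirical c.d.f.:
raw = float(np.mean(scores <= 0.5))
smooth = smoothed_ecdf(scores, 0.5, h=1e-3)
```

Unlike the raw empirical c.d.f., this smoothed version is differentiable in t, which is what makes the gradient-based maximization of Section 5 possible.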

Numerical Experiments
It is the purpose of this section to illustrate empirically various points highlighted by the theoretical analysis previously carried out: in particular, the capacity of ranking rules obtained by maximization of empirical W φ -performance measures to generalize well and the impact of the choice of the score generating function φ on ranking performance from the perspective of ROC analysis. Some practical issues, concerning the maximization of smoothed versions of the empirical W φ -performance criterion, are also discussed through numerical experiments. Additional experimental results can be found in the Appendix section C. All experiments displayed in this article can be reproduced using the code available at https://github.com/MyrtoLimnios/grad_2sample.

A Gradient-Based Algorithmic Approach
We start by describing the gradient ascent method (GA) used in the experiments in order to maximize the smoothed criterion (4.10), obtained by kernel smoothing over the class of scoring functions S_0 considered, as proposed in Section 4.2; see Algorithm 1. Precisely, suppose that S_0 is a parametric class indexed by a parameter space Θ ⊂ R^d with d ≥ 1, say: S_0 = {s_θ : X → R, θ ∈ Θ}. Assume also that, for all z ∈ Z, the mapping θ ∈ Θ → s_θ(z) is of class C¹ (i.e. continuously differentiable) with gradient ∂_θ s_θ(z), and that the score-generating function φ fulfills Assumption 2. The gradient of the smoothed ranking performance measure of s_θ w.r.t. the parameter θ is given by: for all θ ∈ Θ, h > 0, where the gradient of F̂_{s_θ,N,h}(s_θ(z)) w.r.t. θ is: for any z ∈ Z, using the fact that κ′ = K.
In practice, the iterations are continued until the order of magnitude of the variations ‖θ^(t+1) − θ^(t)‖ becomes negligible. The approximate maximizer s_{θ̂_{n,m}}(z) output by Algorithm 1 is then used to rank test data. Averages over several Monte Carlo replications are computed in order to produce the results displayed in Subsection 5.3.
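For concreteness, here is a minimal sketch of this gradient ascent for the class of linear scoring functions s_θ(z) = ⟨θ, z⟩, with the identity score-generating function φ(u) = u (so the criterion is MWW-like) and a Gaussian kernel; the bandwidth h, step size, iteration count and data-generating parameters are illustrative choices of ours, not the paper's exact settings:

```python
import numpy as np
from math import erf

rng = np.random.default_rng(0)

def K(u):                                    # Gaussian kernel
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kappa(u):                                # kappa(u) = int_{-inf}^u K(v) dv
    return 0.5 * (1.0 + np.vectorize(erf)(u / np.sqrt(2.0)))

def grad_smoothed_mww(theta, X, Y, h):
    """Gradient of the smoothed criterion for phi(u) = u and s_theta(z) = <theta, z>."""
    Z = np.vstack([X, Y])                    # pooled sample of size N = n + m
    N = len(Z)
    U = ((X @ theta)[:, None] - (Z @ theta)[None, :]) / h
    Kw = K(U) / (N * h)                      # kernel weights
    # d/dtheta F_hat_h(s_theta(X_i)) = sum_j Kw[i, j] * (X_i - Z_j); average over i
    return (Kw.sum(axis=1)[:, None] * X - Kw @ Z).mean(axis=0)

# Illustrative location model: the signal lies along the first coordinate
d, n, m = 5, 400, 400
mu = np.zeros(d); mu[0] = 1.0
X = rng.normal(size=(n, d)) + mu
Y = rng.normal(size=(m, d))

theta = rng.normal(size=d); theta /= np.linalg.norm(theta)
for _ in range(200):                         # plain gradient ascent with a fixed step
    theta += 2.0 * grad_smoothed_mww(theta, X, Y, h=0.5)
    theta /= np.linalg.norm(theta)           # rescaling leaves the induced ranking unchanged

cos_opt = float(theta @ mu / np.linalg.norm(mu))   # alignment with the optimal direction
Z = np.vstack([X, Y])
W_smooth = float(kappa(((X @ theta)[:, None] - (Z @ theta)[None, :]) / 0.5).mean())
```

Renormalizing θ at each step is legitimate here because the ranking induced by a linear score is invariant under positive rescaling; it merely keeps the bandwidth h on a fixed scale.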

Synthetic Data Generation
We now describe the data generating models used in the simulation experiments, as well as the parametric class of scoring functions, which the learning algorithm previously described is applied to.
Probabilistic models. Two classic two-sample statistical models are used here, namely the location and the scale models, where both samples are drawn from multivariate Gaussian distributions. We denote by S^+_d(R) the set of d × d positive definite matrices and by I_d the identity matrix.
Location model. Inspired by the optimality properties of linear rank statistics regarding shift detection in the univariate setup (cf. Subsection 2.2), the model considered stipulates that X ~ N_d(μ_X, Σ) and Y ~ N_d(μ_Y, Σ), where Σ ∈ S^+_d(R) and the mean/location parameters μ_X and μ_Y differ. Algorithm 1 is implemented here with Z = R^d = Θ and S_0 = {s_θ(·) = ⟨·, θ⟩, θ ∈ Θ} as class of scoring functions, where ⟨·, ·⟩ denotes the Euclidean scalar product on the feature space R^d, and consequently exhibits no bias caused by the model. Indeed, by computing the log-likelihood ratio, one may easily check that the function ⟨θ*, ·⟩, where θ* = Σ^{-1}(μ_X − μ_Y), is an optimal scoring function for the related bipartite ranking problem. Denoting by Φ the c.d.f. of the centered standard univariate Gaussian distribution, one may immediately check that the optimal ROC curve is given by:

[Figure 3: Curves of the three score-generating functions under study.]

Three levels of difficulty are tested through the implementations Loc1, Loc2 and Loc3. The nearly diagonal covariance matrix of the three models has its eigenvalues in [0.5, 1.5], and μ_X = (1 + ε)μ_Y with ε = 0.10 (resp. ε = 0.20 and ε = 0.30) for Loc1 (resp. Loc2 and Loc3). The empirical ROC curves over the pooled test samples and additional curves are depicted in Fig. 10, 4 and 11 for Loc1, Loc2 and Loc3 respectively. The averaged ROC curves and the best one are gathered for the three models in Fig. 5. Fig. 6 shows the evolution of the averaged empirical value of the W_φ-criteria on the training set along the iterations of the algorithm. Fig. 14 shows the results for Loc2 and Loc3 for three different parameters of the RTB model, with u_0 ∈ {0.70, 0.90, 0.95}.
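The location model and its optimal scoring direction θ* = Σ^{-1}(μ_X − μ_Y) can be sketched as follows; the covariance construction is an illustrative stand-in for the paper's 'nearly diagonal' matrices, and the theoretical ROC/AUC formulas below are the standard ones for Gaussian location alternatives:

```python
import numpy as np
from math import erf
from statistics import NormalDist

rng = np.random.default_rng(0)
Phi = lambda x: 0.5 * (1.0 + erf(x / np.sqrt(2.0)))   # standard normal c.d.f.

d = 10
# Covariance with eigenvalues in [0.5, 1.5] (illustrative construction):
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
Sigma = Q @ np.diag(np.linspace(0.5, 1.5, d)) @ Q.T

mu_Y = np.ones(d)
mu_X = (1 + 0.20) * mu_Y                               # eps = 0.20, as in Loc2

X = rng.multivariate_normal(mu_X, Sigma, size=500)
Y = rng.multivariate_normal(mu_Y, Sigma, size=500)

theta_star = np.linalg.solve(Sigma, mu_X - mu_Y)       # optimal direction Sigma^{-1}(mu_X - mu_Y)
auc_emp = float(np.mean((X @ theta_star)[:, None] > (Y @ theta_star)[None, :]))

# Theoretical AUC of the optimal score: Phi(Delta / sqrt(2)),
# with Delta^2 = (mu_X - mu_Y)' Sigma^{-1} (mu_X - mu_Y)
Delta = float(np.sqrt((mu_X - mu_Y) @ np.linalg.solve(Sigma, mu_X - mu_Y)))
auc_theo = Phi(Delta / np.sqrt(2.0))

# One point of the optimal ROC curve: alpha -> 1 - Phi(Phi^{-1}(1 - alpha) - Delta)
alpha = 0.1
roc_at_alpha = 1.0 - Phi(NormalDist().inv_cdf(1.0 - alpha) - Delta)
```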
Scale model. Consider now the situation where X ~ N_d(μ, Σ_X) and Y ~ N_d(μ, Σ_Y), the distributions having the same location vector μ ∈ R^d but different scale parameters Σ_X and Σ_Y in S^+_d(R). Algorithm 1 is implemented with Z = R^d, Θ = S^+_d(R) and S_0 = {s_θ(z) = ⟨z, θ^{-1}z⟩ for all z ∈ Z, θ ∈ Θ}, with the notations previously introduced. By computing the likelihood ratio, one immediately checks that s_{θ*}(·), with θ* = Σ_X^{-1} − Σ_Y^{-1}, is an optimal scoring function for the related scale model. For models Scale1, Scale2 and Scale3, observations are centered, Σ_Y = I_d and Σ_X = I_d + (ε/d)H, where ε is taken equal to 0.70, 0.80 and 0.90 respectively, and H is a d × d symmetric matrix with real entries such that all the eigenvalues of Σ_X ∈ S^+_d(R) are close to 1.
Similarly to the location models, the empirical ROC curves over the pooled test samples and additional curves are depicted in Fig. 7, 12 and 13 for Scale1, Scale2 and Scale3 respectively. The averaged ROC curves and the best one are gathered for the three models in Fig. 8. Fig. 9 shows the evolution of the averaged empirical value of the W_φ-criteria on the training set along the iterations of the algorithm. Fig. 15 shows the results for Scale2 for three different parameters of the RTB model, with u_0 ∈ {0.60, 0.70, 0.80}.

Evaluation of the criteria. In order to evaluate the performance of the scoring function produced by an early-stopped version of Algorithm 1, depending on the score-generating function chosen, it is used to score the test sample, and the corresponding ROC curves and their average are compared to those of the optimal scoring function s_{θ*}(z). We also consider the best/worst curves, in the sense of the minimization/maximization of the generalization error, among the set of ROC curves computed over the pooled test sample. Particular attention is paid to the behavior of these curves near the origin, which reflects the ranking performance for the instances with highest score values.

Results and Discussion
We now analyze the experimental results by commenting on the test ROC curves obtained after learning the scoring functions, using the early-stopped version of Algorithm 1 described above, that maximize the chosen (smoothed variant of the) W_φ-performance measure: MWW, Pol and RTB. We compare them with ROC*. All the experiments were run using Python.
For both the location and scale models, we ran the algorithm for three increasing levels of difficulty, defined by decreasing values of the parameter ε. Figures 5 (location) and 8 (scale) show that the three methods (MWW, Pol, RTB) learn an empirical parameter θ̂_{n,m} such that the corresponding ROC curve gets close to ROC* (red curves), and that the larger ε is, the better the learned scoring rule generalizes. Figures 6 (location) and 9 (scale) reveal the monotone evolution of the empirical criteria as the number of iterations of Algorithm 1 increases. Unsurprisingly, all the results show an increasing ability to learn a scoring function that maximizes the three W_φ-performance measures as ε increases (i.e. when the distributions G and H differ more significantly from each other).
Analyzing the averages of the empirical ROC curves obtained, MWW performs best for the location model, as its corresponding curve converges fastest to ROC* for all ε. This phenomenon was expected, due to the well-known high power of the related Mann-Whitney-Wilcoxon test statistic in this setting. The aggregated ROC curve for the Pol method also performs well, while that of RTB shows low performance compared to MWW, see Fig. 5. Indeed, considering only the best ranked observations at each iteration of the learning procedure does not always yield a good scoring parameter, an effect mitigated by the early-stopping rule. This results both in a higher variance and in a wider spread of the empirical curves, see the light blue curves in Fig. 4.3 and 11.3 (Loc2 and Loc3). The slow convergence of the RTB method is illustrated with Loc1, where the two samples nearly coincide, and for which only the ROC curves above the diagonal were kept. For the scale model, the aggregated ROC curves are comparable for the three methods, with a slightly higher performance obtained by RTB, and we note the faster convergence of the algorithm for this model, see Fig. 9.
Looking at the best ROC curves (dark blue lines), defined as those obtained by the scoring function minimizing the generalization error for each criterion, RTB yields a scoring function that generalizes best for most of the models. In particular, by focusing on the 'best' instances in the learning procedure, the empirical scoring functions obtained achieve higher performance at the beginning of the ROC curve, see the zoomed plots. Also, the choice of the proportion 1 − u_0 of observations considered by the score-generating function results in different performance. Figure 14 gathers the resulting plots for models Loc2 and Loc3 with u_0 in {0.7, 0.9, 0.95}, while Fig. 15 depicts the model Scale2 with u_0 in {0.6, 0.7, 0.8} and a larger number of iterations, T = 70. Considering the best ROC curves for all models shows that, as u_0 tends to one, the beginning of the curve is learned more accurately. Incidentally, note that the proportion of observations considered has to be large enough for the optimization algorithm to perform well.

Conclusion
This article argues that two-sample linear rank statistics provide a very flexible and natural class of empirical performance measures for bipartite ranking. We have shown that this class encompasses, in particular, well-known criteria used in medical diagnosis and information retrieval, and proved that, in expectation, these criteria are maximized by optimal scoring functions and put the emphasis on specific parts of their ROC curves, depending on the score-generating function involved in the criterion considered. We have established concentration results for collections of such statistics, referred to here as two-sample rank processes, under general assumptions, and have deduced from them statistical learning guarantees for the maximizers of such ranking criteria, in the form of a generalization bound of order O_P(1/√N), where N denotes the size of the pooled training sample. Algorithmic issues concerning practical maximization have also been investigated, and we have displayed numerical results supporting the theoretical analysis carried out.

A Definitions and Preliminary Results
For the sake of clarity, crucial concepts and results extensively used in the technical analysis subsequently carried out are first recalled.

A.1 Hájek Projection Method
The Hájek projection method, introduced in the seminal contribution [18], aims at decomposing (linearizing) any (possibly complex) square integrable statistic based on independent observations, so as to express it as an average of independent r.v.'s plus an uncorrelated term. The proof of Proposition 4 crucially relies on this technique. For completeness, it is described in the following lemma; one may refer to Chapter 11 in [35] for further details.
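As a quick numerical check of this linearization (the kernel and distribution are our illustrative choices): for the U-statistic with kernel k(x, x′) = |x − x′| and X_i ~ U(0, 1), one has k_1(x) = E|x − X| = x² − x + 1/2 and θ(k) = 1/3, and the Hájek projection θ(k) + (2/n) Σ_i (k_1(X_i) − θ(k)) captures almost all of the variance of U_n(k):

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_rep = 200, 1000
theta = 1.0 / 3.0                                  # E|X - X'| for X, X' ~ U(0, 1)

U_vals, proj_vals = [], []
for _ in range(n_rep):
    x = rng.uniform(size=n)
    diffs = np.abs(x[:, None] - x[None, :])
    U = diffs.sum() / (n * (n - 1))                # U_n(k); diagonal terms are zero
    k1 = x**2 - x + 0.5                            # k_1(x) = E|x - X|
    proj = theta + (2.0 / n) * np.sum(k1 - theta)  # Hajek projection of U_n(k)
    U_vals.append(U); proj_vals.append(proj)

U_vals, proj_vals = np.array(U_vals), np.array(proj_vals)
var_U = U_vals.var()
var_resid = (U_vals - proj_vals).var()   # degenerate part: variance of order O(1/n^2)
```

The residual variance is two orders of magnitude below that of the statistic itself, illustrating why the linear term governs the first-order behavior.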

A.2 U -statistics and U -processes
As mentioned in Section 3, (degenerate) one/two-sample U -statistics are involved in the definition of the residual term introduced in Proposition 4. We recall the definition of such statistics generalizing basic i.i.d. sample averages, as well as some of their properties. See e.g. [23] for an account of the theory of U -statistics.
Definition 11. (One-sample U-statistic of degree two) Let n ≥ 2. Consider an i.i.d. sequence X_1, ..., X_n drawn from a probability distribution μ on a measurable space X, and k : X² → R a square integrable function w.r.t. μ ⊗ μ. The one-sample U-statistic of degree 2 with kernel function k based on the X_i's is defined as: As can be shown by a basic Lehmann-Scheffé argument, the statistic U_n(k) is the unbiased estimator of the parameter θ(k) = ∫∫ k(x_1, x_2) μ(dx_1) μ(dx_2) with minimum variance. Its Hájek projection can be expressed as follows: it is the projection of U_n(k) − θ(k) onto the space of all random variables given by sums of functions of the individual observations. The U-statistic (A.1) is said to be degenerate when the k_{1,l}(X_1)'s are equal to zero with probability one; it is then of order O_P(1/n). Hence, once recentered, the U-statistic (A.1) can be written as the i.i.d. average Û_n(k) plus a degenerate U-statistic. This decomposition is known as the (second) Hoeffding representation of U-statistics and provides the key argument to establish limit results for such functionals, see e.g. [31].
The notion of U -statistic can be generalized in several ways, by considering kernels with a number of arguments (i.e. degree) higher than 2 or by extending it to the multi-sample framework.
Definition 12. (Two-sample U-statistic of degree (1, 1)) Let n, m ∈ N*. Consider two independent i.i.d. sequences X_1, ..., X_n and Y_1, ..., Y_m, respectively drawn from probability distributions μ and ν on the measurable spaces X and Y. Let ℓ : X × Y → R be a square integrable function w.r.t. μ ⊗ ν. The two-sample U-statistic of degree (1, 1), with kernel function ℓ(x, y) and based on the two samples, is defined as: for all (x, y) ∈ X × Y. The U-statistic U_{n,m}(ℓ) is said to be degenerate when the random variables ℓ_{1,1}(X_1) and ℓ_{1,2}(Y_1) are equal to zero with probability one. Similarly to (A.1), the recentered version of the two-sample U-statistic of degree (1, 1) (A.2) can be written as a sum of two i.i.d. averages plus a degenerate U-statistic of order O_P(1/n) + O_P(1/m). Again, the Hoeffding decomposition is the key to directly extend limit results known for i.i.d. averages (e.g. SLLN, CLT, LIL) to statistics of the type (A.2). In the subsequent technical analysis, nonasymptotic uniform results are required for U-processes, namely collections of U-statistics indexed by classes of kernels. By means of the Hoeffding decomposition, concentration bounds for U-processes can be obtained by combining classic concentration bounds for empirical processes with concentration bounds for degenerate U-processes, such as those recalled in A.4.
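For instance, with the kernel ℓ(x, y) = I{y < x}, the two-sample U-statistic of degree (1, 1) is the empirical AUC and, for continuous data (hence no ties), coincides with a two-sample linear rank statistic through the classical Mann-Whitney identity; a quick numerical check:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 7, 5
X = rng.normal(1.0, 1.0, n)      # 'positive' sample
Y = rng.normal(0.0, 1.0, m)      # 'negative' sample

# Two-sample U-statistic of degree (1, 1) with kernel l(x, y) = 1{y < x}:
U = float(np.mean(X[:, None] > Y[None, :]))   # empirical AUC

# Equivalent two-sample linear rank statistic (Mann-Whitney identity):
pooled = np.concatenate([X, Y])
ranks = pooled.argsort().argsort() + 1        # ranks 1..n+m in the pooled sample
W = ranks[:n].sum()                           # rank-sum of the positive sample
identity_holds = bool(np.isclose(U, (W - n * (n + 1) / 2) / (n * m)))
```

This identity is precisely what allows the ranking performance criteria of the paper to be analyzed through two-sample rank processes.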

A.3 V C-type Classes of Functions -Permanence Properties
The concentration inequalities for U -processes recalled in Appendix A.4 and involved in the proof of the main results stated in this article apply to collections of kernels that are of VC-type, a classic concept used to quantify the complexity of classes of functions. It is recalled below, see e.g. [36] for generalizations and further details.
Definition 13. A class F of real-valued functions defined on a measurable space Z is a bounded VC-type class with parameters (A, V) ∈ (0, +∞)² and constant envelope L_F > 0 if for all ε ∈ (0, 1): where the supremum is taken over all probability measures Q on Z, and N(F, L_2(Q), ε) denotes the covering number, i.e. the smallest number of L_2(Q)-balls of radius less than ε required to cover the class F.
Recall that a bounded VC class of functions with VC dimension V < +∞ is of VC-type and fulfills the condition above with V = 2(V − 1) and A = (cV(16e)^V)^{1/(2(V−1))}, where c is a universal constant; see e.g. Theorem 2.6.7 in [36]. The lemma stated below makes it possible to control the complexity of the classes of kernels/functions involved in the Hoeffding decompositions of a two-sample U-process of degree (1, 1) or of a one-sample U-process of degree 2, cf. Subsection A.2.
Lemma 14. Let X and Y be two independent random variables, valued in X and Y respectively, with probability distributions μ and ν. Consider a VC-type bounded class L of kernels ℓ : X × Y → R, with parameters (A, V) and constant envelope L_L > 0. Then, the sets of functions {E[ℓ(·, Y)] : ℓ ∈ L} and {E[ℓ(X, ·)] : ℓ ∈ L}, as well as the class of recentered kernels obtained from them, are also VC-type bounded classes.
Proof. Consider first the uniformly bounded class L_1 composed of the functions x ∈ X → E[ℓ(x, Y)] with ℓ ∈ L. Let ε > 0 and let P be any probability measure on X. Define the probability measure P ⊗ ν(dx, dy) = P(dx)ν(dy) on X × Y and consider an ε-covering of the class L with centers ℓ_1, ..., ℓ_K w.r.t. the metric L_2(P ⊗ ν), K ≥ 1. For all ℓ ∈ L, there exists k ≤ K such that: By virtue of Jensen's inequality, we have: Hence, one gets an ε-covering of the class L_1 with balls of centers {E[ℓ_k(·, Y)] : k = 1, ..., K} in L_2(P). This proves that L_1 is of VC-type. As a similar reasoning can be applied to the two other classes of functions, one then gets the desired result.

A.4 Concentration Inequalities for Degenerate U -processes.
In [24] (see Theorem 2 therein), a concentration bound for one-sample degenerate U -processes of arbitrary degree indexed by L 2 -dense classes of nonsymmetric kernels is established. The lemma below is a formulation of the latter in the specific case of degenerate U -processes of degree 2 indexed by VC-type bounded classes of non-symmetric kernels.
Lemma 15. Let n ≥ 2 and let X_1, ..., X_n be i.i.d. random variables drawn from a probability distribution μ on a measurable space X. Let K be a class of measurable kernels k : X² → R such that sup_{(x,x′)∈X²} |k(x, x′)| ≤ D < +∞ and that defines a degenerate one-sample U-process of degree 2 based on the X_i's: {U_n(k) : k ∈ K}. Suppose in addition that the class K is of VC-type with parameters (A, V). Then, there exist constants C_1 > 0, C_2 ≥ 1 and C_3 ≥ 0, depending on (A, V), such that: as soon as

The next lemma provides a similar nonasymptotic result for degenerate two-sample U-processes of degree (1, 1).
Its proof is given in B.7; it is inspired by that of Lemma 2.14.9 in [36] and of Lemma 3.2 in [34] for empirical processes, and by Lemma 2.4 in [28], which gives a version in expectation applicable to degenerate two-sample U-processes of arbitrary degree indexed by L_p-dense classes of kernels.
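The O_P(1/n) order of a degenerate U-statistic, which these bounds quantify uniformly over a class, can be checked numerically on a single kernel; our illustrative choice is k(x, x′) = x·x′ with centered standard Gaussian inputs, which makes the kernel degenerate:

```python
import numpy as np

rng = np.random.default_rng(0)

def degen_U_sd(n, n_rep=2000):
    """Monte Carlo sd of the degenerate U-statistic with kernel k(x, x') = x * x'."""
    vals = []
    for _ in range(n_rep):
        x = rng.normal(size=n)                         # E[X] = 0 => degenerate kernel
        s = x.sum()
        U = (s * s - np.dot(x, x)) / (n * (n - 1))     # 2/(n(n-1)) * sum_{i<j} x_i x_j
        vals.append(U)
    return float(np.std(vals))

sd_small, sd_large = degen_U_sd(50), degen_U_sd(200)
# n * U_n has (roughly) constant scale: the sd scales like 1/n, not 1/sqrt(n)
ratio = (50 * sd_small) / (200 * sd_large)
```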

B Technical Proofs
The proofs of the results stated in the paper are detailed below.

B.1 Proof of Proposition 4
Let θ_0 ∈ (0, 1). Since φ ∈ C²([0, 1], R) by virtue of Assumption 2, a Taylor expansion of order two yields: for all θ ∈ (0, 1), with probability one. Let i ≤ n; for t = s(X_i), (B.2) becomes: Hence, by summing over i ∈ {1, ..., n}, one gets that the approximation of W_{n,m}(s) stated below holds true almost surely: W_{n,m}(s) = n W^φ(s) + B_{n,m}(s) + T_{n,m}(s).

Linearization of B_{n,m}(·). First, observe that: the first two terms are U-processes indexed by S_0, cf. Section A.2, while the last term is an empirical process. Indeed, one may write the first as a (nondegenerate) one-sample U-process of degree 2 based on the random sample {X_1, ..., X_n}, with nonsymmetric kernel k_s(x, x′) = I{s(x′) ≤ s(x)} φ′ ∘ F_s(s(x)) on X × X; the second as a (nondegenerate) two-sample U-process of degree (1, 1) based on the two samples; and the third as an empirical process based on the X_i's. In order to write B_{n,m} as an empirical process plus a (negligible) remainder term, the Hoeffding decomposition is applied to the U-processes above, cf. Appendix A.2: Consequently, the Hájek projection of the process B_{n,m}(s) is given by (B.15). The following result provides an approximation of (B.15) and is proved in Appendix B.2.2.
Lemma 17. Under Assumptions 1-3, the Hájek projection of the stochastic process B_{n,m}(·), denoted by B̂_{n,m}(·) and indexed by S_0, onto the subspace generated by the random variables X_1, ..., X_n and Y_1, ..., Y_m can be approximated as follows: for all s ∈ S_0, for any δ > 0, there exist constants A_1 > 0 and A_2 ≥ 2, depending on φ and V, such that for all A_4 ≥ A_1:

The last step relies on all previous decompositions, so as to approximate B_{n,m}(·) by the sum of two empirical processes V^X_n(·) and V^Y_m(·), with a uniform control of the error. All residual terms, R̂_{n,m}(s) (Lemma 17) plus the remainders of the U-processes, are the components of the process R^B_{n,m}(s); see the following lemma.

Lemma 18. Let δ > 0. There exist a universal constant D_1 > 0, and constants D_3, D_4 > 0, D_2 ≥ 1, d_1, d_2 > 3 depending on φ and V, such that with probability at least 1 − δ:

Refer to Appendix B.2.3 for the detailed proof.
A uniform bound for T_{n,m}(·). By virtue of (B.5), we have: A classic concentration bound for empirical processes based on the VC inequality (see e.g. Theorems 3.2 and 3.4 in [2]) shows that, for any δ ∈ (0, 1), we have with probability at least 1 − δ: where c > 0 is a universal constant. In a similar fashion, we have, with probability larger than 1 − δ: Combining the bounds above with the union bound, (B.21) and (B.20), we obtain that, for any δ ∈ (0, 1), we have with probability larger than 1 − δ: where B_1 (resp. B_2) is a constant that only depends on φ and V (resp. on φ). To end the proof, it suffices to observe that the remainder process is the sum of lower-order terms involving min(1 − p, p). As B_2 > 1, d ≥ 3, and for small δ, we obtain the upper bound B_1 + (‖φ‖_∞ κ_p D + B_2) log(2d/δ).

B.2 Intermediary Results
The intermediary results involved in Section B.1 are now established.

B.2.1 Permanence Properties
The lemmas below state that the collections of kernels/functions involved in the decomposition obtained in Appendix B.1 are of VC-type and uniformly bounded.

Proof. Recall that:

B.2.2 Proof of Lemma 17
For s ∈ S_0, by adding the diagonal term, the empirical process can be rewritten in terms of the ℓ_{s,1,2}(X_i)'s.
at the price of changing A_2.

B.2.3 Proof of Lemma 18
The remainder of the decomposition (18) is obtained by combining Eq. (B.8) and (B.15), and yields, for all s ∈ S_0: R^B_{n,m}(s) ≤ |R̂_{n,m}(s)| + p²N|R_n(k_s)| + p(1 − p)N|R_{n,m}(ℓ_s)|. Suppose Assumptions 2-3 are fulfilled. The first process can be uniformly bounded on S_0, as proved in Lemma 17. For the two others, we apply the results of Lemmas 15 and 16 as follows. The process R_n(k_s) (resp. R_{n,m}(ℓ_s)) is the residual term obtained by decomposing the U-process U_n(k_s) (Eq. (B.11)) (resp. U_{n,m}(ℓ_s), Eq. (B.12)), for all s ∈ S_0. By Lemma 20, its class of degenerate kernels is uniformly bounded and of VC-type, with parameters depending only on φ and on the VC dimension V. Notice that the three classes of functions have variances and envelopes which can be similarly bounded by σ² = ∫_{[0,1]} φ′² ≤ ‖φ′‖²_∞, up to a multiplicative constant for both residuals. Let δ > 0; there exist constants such that, as soon as N ≥ (4pA_3A_4²)^{-1} log(A_2/δ): Also, by Lemma 15, (B.29) holds when N ≥ (pB_3)^{-1} log(B_2/δ). And, by Lemma 16, there exist constants C_1 > 0, C_2 > 1 depending on V and φ, and a universal constant C_3 > 0, such that: The union bound concludes, by considering constants such that, with probability at least 1 − δ,
as soon as the corresponding sample-size conditions hold.

B.3 Proof of Theorem 5

Proposition 4 provides the existence of constants C > 6, D > 0 and c_3 > 0, c_5 > 3, depending on φ and V, such that the stated bound holds as soon as N ≥ (c_3/p) log(c_5/δ). The remainder process is negligible with respect to the empirical processes, and we gather the four bounds to get the result: where one can choose C_1 as soon as (B.35) is satisfied, and C_3 can be chosen as C_3 = log(1 + C_4/(4C_2))/(C_4 max(Σ², σ²)), as 1/p, 1/(1 − p) ≥ p, at the price of changing C_2.

B.4 A Generalization Bound in Expectation
For the sake of completeness, we state and prove a version in expectation of the generalization result formulated in Corollary 7.
Proposition 21. Under the assumptions of Proposition 4, the expected risk bound is derived as follows:

Proof. Following the decomposition (3.9), we bound in expectation each process, recalling that they are indexed by uniformly bounded VC-type classes; refer to Appendix B.3 for the details concerning the permanence properties. For the empirical processes W^φ, V^X_n and V^Y_m, we use Theorem 2.1 in [13], whereas for the remainder process, we require the following result, which is proved subsequently.
Lemma 22. Under the assumptions of Proposition 4, the remainder process can be uniformly bounded in expectation as follows: for pN ≥ D_2 V, with constants D_1 > 0 depending on φ and V, and D_2 > 0 depending on φ.
(B.40) as well as: The remainder process being of higher order, we conclude.

Proof. For all s ∈ S_0: |R_{n,m}(s)| ≤ |R̂_{n,m}(s)| + N|R_n(k_s)| + N|R_{n,m}(ℓ_s)| + |T̂_{n,m}(s)| (B.43). The process appearing first, the remainder induced by the Hájek projection method (Lemma 17), is composed of sums of empirical processes; hence, applying Theorem 2.1 in [13] to each process of (B.24) yields a bound with constants D_1 > 0 depending on φ and d > 0 depending on φ and V. The stochastic processes R_n(k_s) and R_{n,m}(ℓ_s) being both degenerate U-processes, respectively one-sample of degree 2 and two-sample of degree (1, 1), we apply results in [29] (see Theorem 6 therein) and [28] (see Lemma 2.4 therein) so as to get: Minimizing the bound above w.r.t. u > 0, we obtain the point B_1 + B_2 log(2), and the upper bound then writes B_1 + B_2(1 + log 2), where B_1 (resp. B_2) is a constant that only depends on φ and V (resp. on φ). Combining all bounds together permits to conclude: for N ≥ V log(d), we have: where D > 0 is a constant depending on φ and V.

B.5 Proof of Proposition 8
We first prove the following lemma.
Lemma 23. Let S_0 ⊂ S and suppose that Assumptions 1-3 are fulfilled. For all t > 0, we have:

Proof. Considering that sup_{s∈S_0} W^φ(s) − W^φ_{n,m}(s)/n is a function of the N independent random variables X_1, ..., X_n, Y_1, ..., Y_m, observe that changing the value of any of the X_i's while keeping all the others fixed changes the value of the supremum by at most a quantity obtained by taking into account the jumps of each of the three terms involved in (B.50), see Eq. (B.7) and (B.20). In a similar way, changing the value of any of the Y_j's changes the value of the supremum by at most 2‖φ‖_∞(n/N + 1) + 2‖φ′‖_∞(1 + 2n/N). When taking the squares, both can be upper bounded by 12(‖φ‖²_∞ + 9‖φ′‖²_∞ + 9‖φ″‖²_∞). The desired bound then straightforwardly results from the application of the bounded difference inequality, see [25].
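The bounded-difference property used here can be checked empirically in the simplest case φ(u) = u (empirical AUC): replacing a single X_i changes each statistic, and hence also the supremum over a finite class of scoring functions, by at most 1/n, since X_i enters exactly m of the nm pairs. The finite class of random linear scores below is our illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 50, 60, 3
X = rng.normal(size=(n, d)) + 1.0
Y = rng.normal(size=(m, d))
thetas = rng.normal(size=(10, d))              # finite class of linear scoring directions

def sup_auc(X, Y):
    """Supremum over the class of the empirical AUC of s_theta(z) = <theta, z>."""
    return max(float(np.mean((X @ th)[:, None] > (Y @ th)[None, :])) for th in thetas)

base = sup_auc(X, Y)
max_change = 0.0
for _ in range(20):                            # replace one X_i by a fresh draw, repeatedly
    i = rng.integers(n)
    X2 = X.copy()
    X2[i] = rng.normal(size=d) + 1.0
    max_change = max(max_change, abs(sup_auc(X2, Y) - base))
```

Each replacement indeed moves the supremum by at most 1/n, which is exactly the kind of stability exploited by the bounded difference inequality.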
Let ε > 0. Using Proposition 21 and Lemma 23, we have, for any k ≥ 1, as soon as pN ≥ B_2 V_k and where C = 6(‖φ‖²_∞ + 9‖φ′‖²_∞ + 9‖φ″‖²_∞): For each k ≥ 1, denote the penalized empirical ranking performance measure by: For any ε > 0, we have, as soon as pN ≥ B_2 sup_{k≥1} V_k: For all k ≥ 1, set W*_k = sup_{s∈S_k} W^φ(s) = W^φ(s*_k) and consider the decomposition: The expectation of the second term on the right-hand side of the equation above can be bounded by means of the tail bound (B.53), for any k ≥ 1, as soon as pN ≥ B_2 sup_{k≥1} V_k. Concerning the expectation of the first term, observe that, for any k ≥ 1, as soon as pN ≥ B_2 sup_{k≥1} V_k: Summing the bound obtained and that in (B.54) gives the desired result.

B.6 Proof of Proposition 9
The proof consists in combining the two results stated below with the decomposition (4.11) of the W φ -ranking performance deficit of the maximizer. The first result is the analogue of Theorem 5 for the smoothed criterion.
Theorem 24. Suppose that the assumptions of Proposition 4 are fulfilled. Then, for any δ ∈ (0, 1), there exist constants C_1, C_3 > 0 and C_2 ≥ 24, depending on φ, K, R and V, such that, with probability larger than 1 − δ, the stated bound holds as soon as N ≥ log(C_2/δ) / ( p min(p, 1 − p)² C_3 C_4² ) and δ ≤ C_2 e^{−C_1² C_3}.

The proof being quite similar to that of Theorem 5, it is omitted. Assumption 5 ensures that the class {K((· − t)/h) : t ∈ R^q, h > 0} (q = 1 here) is of bounded VC-type (see e.g. Lemma 22(ii) in [29] and [14]); classic permanence properties can then be used to check that all the classes of functions over which uniform bounds are taken are of finite VC dimension. The second result provides a uniform bound on the additional bias error incurred when approximating W^φ(s) by W^{φ,h}(s) for s ∈ S_0.
Lemma 25. Suppose that Assumption 4 is satisfied. Then, for all h > 0, we have:

sup_{s∈S_0} | W^{φ,h}(s) − W^φ(s) | ≤ C_5 h²,

where C_5 > 0 is a constant depending on φ, K and R only.
Details are left to the reader: the proof is straightforward under Assumption 4, using the regularity of the score-generating function and the uniform integrated error bound obtained in [21].

B.7 Proof of Lemma 16
We shall prove an exponential bound of Hoeffding type for the uniformly bounded two-sample degenerate U-process {U_{n,m}(ℓ) : ℓ ∈ L}. We start by proving the following lemmas, involved in the argument.
Lemma 26. Let P and Q be probability distributions on measurable spaces X and Y respectively. Consider the degenerate two-sample U-statistic of degree (1, 1) given by (B.57), with a bounded kernel ℓ : X × Y → R, based on the independent i.i.d. random samples X_1, . . . , X_n and Y_1, . . . , Y_m drawn from P and Q respectively. Let ε_1, . . . , ε_n and η_1, . . . , η_m be two sequences of i.i.d. Rademacher variables, independent of the X_i's and Y_j's, such that the randomized process (B.58) is defined. Then, for any increasing and convex function Φ : R → R, the corresponding pair of randomization inequalities holds, assuming that the suprema are measurable and that the expectations exist.
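The two objects compared in Lemma 26 can be simulated directly. The sketch below (Python; the kernel ℓ(x, y) = (x − 1/2)(y − 1/2) for uniform inputs is a hypothetical example, chosen because it is degenerate: its conditional expectations given either argument vanish) computes the U-statistic U_{n,m}(ℓ) of (B.57) and its Rademacher-randomized counterpart T_{n,m}(ℓ) of (B.58), and checks that both processes are centred.

```python
import numpy as np

rng = np.random.default_rng(1)

def ell(x, y):
    """Degenerate kernel for X, Y ~ U(0, 1): E[ell(X, y)] = E[ell(x, Y)] = 0."""
    return np.outer(x - 0.5, y - 0.5)         # n x m kernel matrix

n, m, trials = 50, 60, 5000
u_vals, t_vals = [], []
for _ in range(trials):
    X = rng.uniform(0.0, 1.0, n)
    Y = rng.uniform(0.0, 1.0, m)
    K = ell(X, Y)
    u_vals.append(K.mean())                   # U_{n,m}(ell) = (1/nm) sum_i sum_j ell(X_i, Y_j)
    eps = rng.choice([-1.0, 1.0], n)          # Rademacher signs, independent of the samples
    eta = rng.choice([-1.0, 1.0], m)
    t_vals.append((eps[:, None] * K * eta[None, :]).mean())  # randomized T_{n,m}(ell)

u_vals, t_vals = np.array(u_vals), np.array(t_vals)
# Degeneracy: both the U-statistic and its randomized version are centred.
assert abs(u_vals.mean()) < 1e-3
assert abs(t_vals.mean()) < 1e-3
```

Randomization replaces U_{n,m} by a process that is conditionally a Rademacher chaos, which is exactly what makes the chaining argument of Lemma 28 tractable.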
Proof. We prove the first inequality, the proof of the second one being similar. Using the independence of the two samples, Fubini's theorem and the degeneracy property, one gets the result by applying twice the reverse inequality of Lemma 3.5.2 in [33].
Next, we prove an exponential bound of Hoeffding type for degenerate two-sample U-statistics with bounded kernels.
Lemma 27. Let P and Q be probability distributions on measurable spaces X and Y respectively. Consider the degenerate two-sample U-statistic of degree (1, 1) given by (B.57), with a kernel ℓ : X × Y → R bounded by c, based on the independent i.i.d. random samples X_1, . . . , X_n and Y_1, . . . , Y_m drawn from P and Q respectively. For all t > 0, the exponential tail bound then holds.

Proof. A Chernoff argument combined with the fact that (e^u + e^{−u})/2 ≤ e^{u²/2} for all u ∈ R yields a conditional bound; integrating this bound over the X_i's and Y_j's and plugging it into (B.62) gives the desired result when choosing λ = nmt/(16c²).
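The Hoeffding-type tail of Lemma 27 can be checked empirically. The sketch below (Python) simulates the degenerate U-statistic of the previous example and compares its tail frequency to a bound of the form C exp(−nmt²/(c′ c²)); the constants C = 4 and c′ = 64 are illustrative placeholders, not the paper's, since the exact constants of (B.62) are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(3)

def ell(x, y):
    """Hypothetical degenerate kernel, bounded by c = 0.25 for X, Y ~ U(0, 1)."""
    return np.outer(x - 0.5, y - 0.5)

n, m, trials, c = 50, 60, 4000, 0.25
u_vals = np.array([ell(rng.uniform(0.0, 1.0, n), rng.uniform(0.0, 1.0, m)).mean()
                   for _ in range(trials)])

# Hoeffding-type tail with *illustrative* constants:
# P{ |U_{n,m}| >= t } <= 4 exp( -n m t^2 / (64 c^2) ).
t = 0.05
empirical_tail = np.mean(np.abs(u_vals) >= t)
bound = 4.0 * np.exp(-n * m * t ** 2 / (64.0 * c ** 2))
assert empirical_tail <= bound
```

Note the nm scaling in the exponent: the degenerate statistic concentrates at rate 1/√(nm), faster than the 1/√N rate of a non-degenerate two-sample U-statistic.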
Finally, we prove the tail probability version of Lemma 26 stated below.
Lemma 28. Let P and Q be probability distributions on measurable spaces X and Y respectively. Consider the degenerate two-sample U-statistic of degree (1, 1) given by (B.57), with a bounded kernel ℓ : X × Y → R, based on the independent i.i.d. random samples X_1, . . . , X_n and Y_1, . . . , Y_m drawn from P and Q respectively. Let ε_1, . . . , ε_n and η_1, . . . , η_m be two sequences of i.i.d. Rademacher variables, independent of the X_i's and Y_j's, such that the randomized process (B.58) is defined. Then, for all t > 0, the tail inequality holds, assuming that the suprema are measurable and that the expectations exist.
Proof. This lemma, relating the tail probability of sup_{ℓ∈L} |U_{n,m}(ℓ)| to that of sup_{ℓ∈L} |T_{n,m}(ℓ)|, generalizes Lemma 2.7 in [15] and Lemma 3.1 in [32] to degenerate two-sample U-processes. It is proved by applying twice a version of the latter result for independent but not necessarily identically distributed random variables. Indeed, for all t > 0 and all kernels ℓ_1 and ℓ_2 in L, the corresponding intermediate bound holds. For all q ∈ N*, consider a number k_q ≤ (A/ε_q)^V of L_2-balls with radius ε_q ≤ L ≤ 1 and centers ℓ_{q,k}, 1 ≤ k ≤ k_q, w.r.t. the (random) probability measure (1/(nm)) Σ_{i≤n} Σ_{j≤m} δ_{(X_i, Y_j)}, covering the class L. Assume that the sequence (ε_q) is decreasing as q increases, so that (k_q) is increasing. Let ℓ ∈ L, q ≥ 1, and let ℓ̃_q be the center of a ball such that d_{nm}(ℓ, ℓ̃_q) ≤ ε_q. Fixing q_0 ≤ q in N*, the following chaining decomposition holds:

U_{n,m}(ℓ) = ( U_{n,m}(ℓ) − U_{n,m}(ℓ̃_q) ) + U_{n,m}(ℓ̃_{q_0}) + Σ_{ω=q_0+1}^{q} ( U_{n,m}(ℓ̃_ω) − U_{n,m}(ℓ̃_{ω−1}) ).
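The chaining decomposition above is a purely algebraic telescoping identity: it holds for any real values of the process along the chain of approximants, before any probability enters. The sketch below (Python; the numbers stand in for the values U_{n,m}(ℓ) and U_{n,m}(ℓ̃_{q_0}), . . . , U_{n,m}(ℓ̃_q), which are arbitrary here) verifies it.

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in values for U_{n,m} at a kernel ell and along a chain of
# approximants ell_tilde[q0], ..., ell_tilde[q] (arbitrary reals: the
# decomposition is algebraic and holds for any values).
u_ell = rng.normal()
u_chain = rng.normal(size=6)        # U(ell_tilde_{q0}), ..., U(ell_tilde_q)

# (U(ell) - U(ell_tilde_q)) + U(ell_tilde_{q0}) + telescoping increments.
telescoped = (u_ell - u_chain[-1]) + u_chain[0] + np.sum(np.diff(u_chain))
assert np.isclose(telescoped, u_ell)
```

The probabilistic content of the proof lies entirely in bounding each increment term uniformly over the (A/ε_q)^V centers at each level, which is where the covering numbers and Lemma 27 enter.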