Two-level monotonic multistage recommender systems

A recommender system learns to predict user-specific preferences or intentions over many items simultaneously for all users, making personalized recommendations based on a relatively small number of observations. One central issue is how to leverage three-way interactions, referred to as user-item-stage dependencies on a monotonic chain of events, to enhance prediction accuracy. A monotonic chain of events occurs, for instance, in an article sharing dataset, where a "follow" action implies a "like" action, which in turn implies a "view" action. In this article, we develop a multistage recommender system utilizing a two-level monotonic property characterizing a monotonic chain of events for personalized prediction. Specifically, we derive a large-margin classifier based on a nonnegative additive latent factor model in the presence of a high percentage of missing observations, particularly between stages, reducing the number of model parameters for personalized prediction while guaranteeing prediction consistency. On this ground, we construct a regularized cost function to learn user-specific behaviors at different stages, linking decision functions to numerical and categorical covariates to model user-item-stage interactions. Computationally, we develop an algorithm based on blockwise coordinate descent. Theoretically, we show that the two-level monotonic property enhances the accuracy of learning as compared to a standard method treating each stage individually and an ordinal method utilizing only one-level monotonicity. Finally, the proposed method compares favorably with existing methods in simulations and on an article sharing dataset.


Introduction
A multistage recommender system on a monotonic chain of events predicts a user's preferences over a large collection of items based on only a few instances of user-item feedback at multiple stages, where a user's positive feedback at a subsequent stage can be observed only given his/her positive feedback at the present stage. As a result, a user may exhibit an increasing level of intention and preference along the chain, ranging from the most implicit to the most explicit feedback. Multistage recommender systems have been widely used for personalized prediction in e-commerce and social networks, as well as for individual drug responses over multiple phases in personalized medicine (Suphavilai et al., 2018). In such a situation, the objective of a multistage recommender system is to predict a user's subsequent behavior from his/her previous feedback in the presence of a high percentage of missing observations. For instance, as displayed in Figure 1, the Deskdrop article sharing dataset consists of a sequence of actions ranging from view and like to follow, where less than 0.2% of the values are observed, with a total of 73,000 users and more than 3,000 articles. A user likes an article only if this user has viewed it, and the user may follow an article only if he/she has liked it. Thus, three stagewise predictions based on different pairs of a present stage and a subsequent stage are considered: given view, predict like; given view, predict follow; and given like, predict follow.
One key characteristic of multistage prediction on a monotonic chain of events is a two-level monotonic property, as precisely defined in (3). On the one hand, monotonicity occurs over subsequent stages given a present stage: for instance, given that a user has viewed an article, this user will not follow it if he/she does not like it. On the other hand, monotonicity holds over present stages given a subsequent stage, in a reversed order: if a user would follow an article having only viewed it, then the user will certainly follow it once he/she has liked it.

Literature review and our contributions
Despite recent progress in recommender systems, multistage personalized prediction on a monotonic chain of events remains largely unexplored. A standard method is to predict a user's behavior at a subsequent stage given a present stage individually, ignoring the monotonic property, as described in (11). In this sense, essentially all existing methods of recommender systems for a single stage are applicable, including latent factor models (Koren, 2008; Dai et al., 2019b), logistic matrix factorization (Johnson, 2014), tensor factorization (Bi et al., 2018), and deep neural networks (He et al., 2017). It is worth noting that ordinal regression/classification may be adopted by treating subsequent stages as an ordered class given a present stage, as in (12). For example, in Wan and McAuley (2018) and Tran et al. (2012), latent factor models are developed based on an ordered logit model. Yet, such a formulation does not satisfy the two-level monotonic property induced by a monotonic chain of events: it accounts only for the forward monotonicity of subsequent stages given the present one, but not the backward monotonicity over present stages given the subsequent one. As will be seen, the two-level monotonicity yields not only higher prediction accuracy but also prediction consistency (3) among different stage pairs. In practice, prediction consistency is indeed desirable for decision-making.
This article develops a multistage recommender system that models user-item-stage interactions and integrates the two-level monotonic property into personalized prediction on a monotonic chain of events. The key contributions of this article can be summarized as follows:
• The proposed method can produce a recommendation for any subsequent stage given observations at any present stage, which is highly demanded in real applications; see Figure 1. By contrast, most conventional recommender systems focus on a fixed present stage.
• A novel multistage loss function is proposed to treat multistage prediction for any pair of present and subsequent stages. This loss function admits user-item interactions observed at different stages and evaluates the prediction accuracy on all subsequent stages.
• The two-level monotonic property is fully accounted for by our nonnegative additive latent factor model based on the Bayes rule in Lemma 1. As a result, it substantially reduces the number of model parameters and, most importantly, ensures prediction consistency across different stages. By comparison, none of the aforementioned methods can guarantee prediction consistency; cf. Tables 2 and 3.
• An algorithm is developed to implement the proposed method based on blockwise coordinate descent. Moreover, a learning theory is established to quantify the generalization error to demonstrate the benefits of modeling user-item-stage interactions based on the two-level monotonic property.

Multistage classification on a monotonic chain
Consider a recommender system in which triplets (Δ_ij, x_ij, Y_ij), with Y_ij = (Y_ij^1, …, Y_ij^T)^⊤, are observed for user i on item j; 1 ≤ i ≤ n, 1 ≤ j ≤ m. Here n and m are the numbers of users and items, respectively, and ⊤ denotes the transpose. Moreover, Δ_ij = 1/0 indicates whether (x_ij, Y_ij) is observed or missing, Y_ij^t = ±1 with 1/−1 indicating positive/negative feedback at stage t, and Ω_t = {(i, j) : Δ_ij = 1, y_ij^t = 1} is an index set of observed positive feedback at stage t; 1 ≤ t ≤ T. The covariate vector x_ij = (u_i, v_j, s_i, o_j) consists of both numerical and categorical features, where u_i ∈ [0, 1]^{p_1} and v_j ∈ [0, 1]^{p_2} are user-specific and item-specific numerical predictors, respectively, with each numerical feature normalized to [0, 1] if necessary, while s_i = (s_i1, …, s_id_1) with s_il ∈ {1, …, n_l} and o_j = (o_j1, …, o_jd_2) with o_jl ∈ {1, …, m_l} are d_1-dimensional and d_2-dimensional user-specific and item-specific categorical predictor vectors. For instance, in the Deskdrop dataset, v_j is a numerical embedding of the content, s_i consists of "person Id", "user agent", "user region", and "user country", o_j consists of "content Id", "author Id", and "language", and Y_ij^1 and Y_ij^2 indicate whether article j is liked and followed by user i, respectively. One important characteristic of this kind of data is that a subsequent event cannot occur if the present event does not occur. For instance, if a user does not view an item, then the subsequent events of like and follow cannot occur.
On this ground, we define a monotonic behavior chain as

Y_ij^1 ≥ Y_ij^2 ≥ ⋯ ≥ Y_ij^T,    (1)

which says that an event Y_ij^{t−1} = −1 implies that the subsequent event Y_ij^t = −1 must occur. This encodes a certain causal relation as defined by the local Markov property (Edwards, 2012) in a directed acyclic graphical model.
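As a concrete illustration, the chain constraint (1) can be checked in a few lines of Python; the function below is illustrative and not part of the paper's software:

```python
def is_monotonic_chain(y):
    """Check the monotonic behavior chain (1): y is the stagewise feedback
    (y^1, ..., y^T) with values in {+1, -1}; the chain holds if and only if
    y is nonincreasing, i.e., y^{t-1} = -1 implies y^t = -1."""
    return all(y[t - 1] >= y[t] for t in range(1, len(y)))

# (view, like, follow) = (+1, +1, -1): viewed and liked but not followed -- valid.
assert is_monotonic_chain((1, 1, -1))
# (+1, -1, +1): a "follow" without a "like" violates the chain.
assert not is_monotonic_chain((1, -1, 1))
```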

Two-level monotonicity
Multistage classification learns to predict the outcome Y_ij^t at a subsequent stage t given observations at a present stage t′, namely Y_ij^{t′} and x_ij; 0 ≤ t′ < t ≤ T. By convention, we set Y_ij^0 = 1. To predict, we introduce a decision function φ_ij^{t′t} = φ_ij^{t′t}(x_ij, y_ij^{t′}) for classification, which depends on the observations x_ij and y_ij^{t′}. For user i on item j across T stages, the decision functions can be expressed as φ_ij = (φ_ij^{t′t})_{0≤t′<t≤T}. For all users and items across T stages, the decision functions are φ = (φ_ij)_{1≤i≤n; 1≤j≤m}.
To evaluate the overall performance of φ, we define the multistage misclassification error

e(φ) = (nm)^{−1} Σ_{i=1}^{n} Σ_{j=1}^{m} e_ij(φ_ij),  e_ij(φ_ij) = Σ_{0≤t′<t≤T} w_{t′t} E_ij[ Δ_ij I(Y_ij^t φ_ij^{t′t}(x_ij, Y_ij^{t′}) ≤ 0) ],    (2)

where e_ij(φ_ij) is the pairwise misclassification error for user i on item j across T stages, and P_ij(·) = P(· | X_ij = x_ij) and E_ij(·) = E(· | X_ij = x_ij) denote the conditional probability and expectation given X_ij = x_ij. In (2), w_{t′t} ≥ 0 is a pre-specified weight for predicting the outcome at subsequent stage t based on present stage t′; 0 ≤ t′ < t ≤ T, reflecting the relative importance of prediction at different stage pairs in the overall evaluation. In particular, (2) reduces to next-stage prediction and last-stage prediction when w_{t′t} = I(t − t′ = 1) and w_{t′t} = I(t = T), respectively. Moreover, if missingness occurs completely at random, that is, the missing pattern Δ_ij is independent of Y_ij, then (2) is proportional to the standard misclassification error. Lemma 1 gives the multistage Bayes decision function φ̄_ij = argmin_{φ_ij} e_ij(φ_ij) for user i on item j in (2), subject to the constraint that the Y_ij^t predicted by the Bayes rule satisfies the monotonic behavior chain property (1).
Lemma 1 (Multistage Bayes rule) The optimal multistage pairwise decision function φ̄_ij minimizing (2) can be written as

φ̄_ij^{t′t}(x_ij, y_ij^{t′}) = sign(f̄_ij^{t′t}(x_ij)) if y_ij^{t′} = 1, and φ̄_ij^{t′t}(x_ij, y_ij^{t′}) = −1 otherwise,

where f̄_ij^{t′t}(x_ij) = P_ij(Y_ij^t = 1 | Y_ij^{t′} = 1, Δ_ij = 1) − 1/2 is the multistage Bayes rule for predicting Y_ij^t. Moreover, there exist h̄_ij^r(x_ij) ≥ 0 such that f̄_ij^{t′t}(x_ij) can be written in an additive form:

f̄_ij^{t′t}(x_ij) = c_ij(x_ij) ( log α_ij(x_ij) − Σ_{r=t′+1}^{t} h̄_ij^r(x_ij) ),  c_ij(x_ij) > 0.    (4)

The two-level monotonic property (3) says that sign(f̄_ij^{t′t}(x_ij)) is decreasing in t for any fixed t′ (forward monotonicity) and increasing in t′ for any fixed t (backward monotonicity). Note that (4) guarantees (3).
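As a sanity check, the two-level monotonicity implied by the additive form (4) can be verified numerically. The snippet below is purely illustrative: it drops the positive factor c_ij(x_ij), which does not affect signs, and uses arbitrary toy values for the nonnegative stage increments h̄^r:

```python
def f_bar(log_alpha, h, t_prime, t):
    # Additive Bayes score (4), up to a positive factor:
    # log_alpha minus the nonnegative increments h[r], r = t'+1, ..., t.
    return log_alpha - sum(h[r] for r in range(t_prime + 1, t + 1))

log_alpha = 0.4
h = {1: 0.1, 2: 0.3, 3: 0.2}  # toy nonnegative stage increments

# Forward monotonicity: for fixed t' = 0, the score is nonincreasing in t.
forward = [f_bar(log_alpha, h, 0, t) for t in (1, 2, 3)]
assert all(a >= b for a, b in zip(forward, forward[1:]))

# Backward monotonicity: for fixed t = 3, the score is nondecreasing in t'.
backward = [f_bar(log_alpha, h, tp, 3) for tp in (0, 1, 2)]
assert all(a <= b for a, b in zip(backward, backward[1:]))
```

Since the sign function is monotone, monotonicity of the scores carries over to the sign level, which is exactly property (3).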
In view of Lemma 1, we introduce our multistage pairwise decision functions to mimic the additive multistage Bayes rule:

φ_ij^{t′t}(x_ij, y_ij^{t′}) = sign(f_ij^{t′t}(x_ij)) if y_ij^{t′} = 1, and −1 otherwise.    (5)

Two-level monotonic multistage classification
Based on the representation of the decision function in (5), we rewrite e_ij(φ_ij) in (2) as (6). Note that the indicator function I(·) in (6) is difficult to handle in optimization. Therefore, we replace it with a surrogate loss V(u) for large-margin classification, where V is a function of the corresponding functional margin Y_ij^t f_ij^{t′t}(x_ij). Choices include, but are not limited to, the hinge loss V(u) = (1 − u)_+ (Cortes and Vapnik, 1995), the import vector machine loss V(u) = log(1 + exp(−u)) (Zhu and Hastie, 2002), and the ψ-loss V(u) = min(1, (1 − u)_+) (Shen et al., 2003). On this ground, we propose a multistage large-margin loss function (7), where Z_ij = (Δ_ij, Y_ij) and f_ij = (f_ij^{t′t})_{0≤t′<t≤T}. Lemma 2 says that a minimizer of the cost with respect to f satisfies the Bayes rule in Lemma 1.
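For reference, the three surrogate losses can be sketched in a few lines of Python, where the argument u is the functional margin y·f; sample values are asserted below:

```python
import math

def hinge(u):
    # Hinge loss (1 - u)_+ (Cortes and Vapnik, 1995).
    return max(0.0, 1.0 - u)

def logistic(u):
    # Negative binomial log-likelihood loss log(1 + exp(-u)),
    # as used by the import vector machine (Zhu and Hastie, 2002).
    return math.log1p(math.exp(-u))

def psi(u):
    # psi-loss min(1, (1 - u)_+) (Shen et al., 2003): a truncated hinge.
    return min(1.0, hinge(u))

# Sample values:
assert hinge(0.0) == 1.0 and hinge(2.0) == 0.0
assert psi(-5.0) == 1.0 and psi(2.0) == 0.0
assert abs(logistic(0.0) - math.log(2.0)) < 1e-12
```

Note that the psi-loss is bounded by 1, which is the property exploited by the truncated loss V_B(u) = min(V(u), B) in the theory section.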
Lemma 2 (Multistage Fisher consistency) The minimizer of l(f) is Fisher-consistent in that it satisfies the Bayes rule in Lemma 1, provided that the surrogate loss V(·) is Fisher-consistent for binary classification.
Next, we parametrize our decision functions based on an additive latent factor model to incorporate the collaborative information across users, items, and stages, in accordance with (4), setting q_0 = 1 to avoid overparametrization. Moreover, a(u_i, s_i) and b(v_j, o_j) link our decision function linearly to the numerical predictors u_i and v_j, as well as to additive latent factors structured by the categorical predictors s_i and o_j. The proposed prediction function can then be written as in (8), subject to A ≥ 0, B ≥ 0, q_r ≥ 0 for r = 1, …, T, a_lh ≥ 0 for l = 1, …, d_1; h = 1, …, n_l, and b_lh ≥ 0 for l = 1, …, d_2; h = 1, …, m_l, where ∘ is the Hadamard product, A ∈ R^{K×p_1} and B ∈ R^{K×p_2} are two matrices transforming the user-specific and item-specific numerical features into K-dimensional latent vectors, and a_{l s_il} and b_{l o_jl} are K-dimensional latent factors for s_il and o_jl. Representation (8) is highly interpretable: it leverages all feature interactions between users and items through an additive model. For example, in the Deskdrop dataset, the interaction effects of "userId"-"articleId", "userId"-"authorId", "userAgent"-"articleId", and "userAgent"-"authorId" are all captured in (8). Furthermore, the nonnegativity constraints A ≥ 0, B ≥ 0, a_lh ≥ 0, and b_lh ≥ 0 are enforced to ensure the two-level monotonic property (3). Here q = (q_1, …, q_T)^⊤, a_l = (a_l1, …, a_ln_l)^⊤, b_l = (b_l1, …, b_lm_l)^⊤, and λ_1, λ_2, λ_3 ≥ 0 are tuning parameters controlling the trade-off between learning and regularization, and F is the space of candidate decision functions of the form (8), where a(·, ·) and b(·, ·) are defined in (8).
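To make the nonnegative factorization concrete, here is a minimal sketch of evaluating a score of the form (8) for one user-item pair, under the assumption (taken from the computation section) that the stage weight is q_{t′t} = 1_K − Σ_{r=t′+1}^{t} q_r; the latent vectors and q_r values below are toy numbers, not fitted parameters:

```python
def score(a_vec, b_vec, q, t_prime, t):
    """Evaluate f^{t't} = sum_k a_k * b_k * (1 - sum_{r=t'+1}^{t} q[r][k]).

    a_vec, b_vec : nonnegative K-dimensional user/item latent vectors
                   (their entrywise product plays the role of the
                   Hadamard product a(u_i, s_i) o b(v_j, o_j) in (8)).
    q            : dict mapping stage r to a nonnegative K-vector q_r.
    """
    K = len(a_vec)
    return sum(
        a_vec[k] * b_vec[k] * (1.0 - sum(q[r][k] for r in range(t_prime + 1, t + 1)))
        for k in range(K)
    )

a_vec, b_vec = [1.0, 0.5], [0.8, 1.0]   # toy nonnegative latent vectors
q = {1: [0.3, 0.2], 2: [0.4, 0.5]}      # toy nonnegative stage vectors

# Forward monotonicity in t and backward monotonicity in t' hold by construction:
assert score(a_vec, b_vec, q, 0, 2) <= score(a_vec, b_vec, q, 0, 1)
assert score(a_vec, b_vec, q, 0, 2) <= score(a_vec, b_vec, q, 1, 2)
```

Because every factor is nonnegative, adding a stage to the sum over r can only decrease the score, which is exactly how the constraints in (8) enforce (3).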

Connection with existing frameworks
This section compares the proposed method with a standard method treating each stage individually and an ordinal method treating the stage as an ordinal class.
Standard. A standard method treats each (t′, t)-pairwise classification separately and then combines the prediction results for 0 ≤ t′ < t ≤ T. In fact, it estimates f^{t′t} = (f_ij^{t′t})_{1≤i≤n; 1≤j≤m} by solving (11), where F^{t′t} is a parameter space of candidate pairwise decision functions and J_λ(f^{t′t}) regularizes f^{t′t} with a nonnegative tuning parameter λ ≥ 0. Standard methods include latent factor models (Koren, 2008), gradient boosting (Cheng et al., 2014), and deep neural networks (He et al., 2017).
Ordinal. An ordinal method treats each present stage t′ separately, estimating f^{t′} = (f^{t′t})_{t′<t≤T} by solving (12), where F^{t′} is a parameter space of candidate stagewise decision functions and J_λ(f^{t′}) regularizes f^{t′}. For example, Crammer and Singer (2002) formulate a parallel decision function in which the forward monotonicity is ensured by positivity constraints β_{t′t} ≥ 0; t = t′ + 1, …, T. Note that when V(y_ij^t f_ij^{t′t}) is replaced by a negative log-likelihood function, (12) becomes a formulation of ordinal regression (McCullagh and Nelder, 2019; Bhaskar, 2016).
Ordinal classification incorporates the forward monotonicity but ignores the backward monotonicity. Although the forward monotonicity already helps to reduce the size of the parameter space, the proposed method reduces the number of parameters further by utilizing the backward monotonicity. By contrast, a standard method does not leverage any level of monotonicity. Most critically, both the standard and ordinal methods fail to yield prediction consistency.

Large-scale computation
This section develops a computational scheme to solve the nonconvex minimization (9). For illustration, consider the hinge loss V(u) = (1 − u)_+. The scheme minimizes the nonconvex cost function (9) by solving a sequence of relaxed convex subproblems via block successive minimization (BSM), using blockwise descent to alternate among the following convex subproblems. Within each subproblem, we decompose (9) into an equivalent form of many small optimizations, both for parallelization and to alleviate the memory requirement. Letting q_{t′t} = 1_K − Σ_{r=t′+1}^{t} q_r, we solve (9) as follows. User-effect block A. This convex optimization solves for A in (13), where ⊗ is the Kronecker product and vec(A) is the column vectorization of the matrix A.
User-effect block a. This convex optimization solves for (a_1, …, a_d_1) in (14). Note that (14) can be solved separately for each a_lh in a parallel fashion via (15). Item-effect block B. This convex optimization solves for B in (16). Stage-effect block q. This convex optimization solves for q_r, r = 1, …, T, in (18), given the present values of all other variables. As a technical note, (13)-(18) are standard SVMs with fixed intercepts and positivity constraints on the model parameters, which can be solved efficiently by the implementation in our Python package varsvm based on the coordinate descent algorithm (Wright, 2015).
The aforementioned scheme is summarized in Algorithm 1.

Algorithm 1 (Parallelized version: hinge loss)

Step 1 (Initialization). Initialize the values of (A, B, a, b, q) and specify the tolerance error.
Step 2 (Update A and a). Update A by solving (13) given present values of the other variables. For l = 1, · · · , d 1 , update a lh ; h = 1, · · · , n l in a parallel fashion by solving (15) given present values of the other variables.
Step 3 (Update B and b). Update B by solving (16) given present values of the other variables. For l = 1, · · · , d 2 , update b lh ; h = 1, · · · , m l in a parallel fashion by solving (17) given present values of the other variables.
Step 4 (Update q). For r = 1, · · · , T , update q r by solving (18) given present values of the other variables.
Step 5. Iterate Steps 2-4 until the decrement of the cost function in (9) is less than the tolerance error.
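For intuition, the structure of Algorithm 1 can be sketched as a generic block successive minimization loop. This is an illustrative skeleton only, not the varsvm solver; the toy objective in the usage example is strictly convex in each block, so every subproblem has a unique minimizer, mirroring the blockwise strict convexity of (9):

```python
def block_successive_minimization(blocks, update, cost, tol=1e-10, max_iter=200):
    """Cycle through parameter blocks, exactly minimizing the cost over one
    block at a time, until the cost decrement falls below `tol`."""
    prev = cost(blocks)
    for _ in range(max_iter):
        for name in blocks:
            blocks[name] = update(name, blocks)
        cur = cost(blocks)
        if prev - cur < tol:
            break
        prev = cur
    return blocks

# Toy objective: (x - 1)^2 + (y - 2)^2 + x*y, strictly convex in each block,
# with blockwise minimizers x = 1 - y/2 and y = 2 - x/2 from the
# first-order conditions; the joint minimizer is (x, y) = (0, 2).
cost = lambda b: (b["x"] - 1) ** 2 + (b["y"] - 2) ** 2 + b["x"] * b["y"]
update = lambda name, b: 1 - b["y"] / 2 if name == "x" else 2 - b["x"] / 2
sol = block_successive_minimization({"x": 0.0, "y": 0.0}, update, cost)
assert abs(sol["x"] - 0.0) < 1e-4 and abs(sol["y"] - 2.0) < 1e-4
```

In Algorithm 1 the blocks are (A, a, B, b, q), and each block update itself splits into many small SVM-type problems that can run in parallel.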
Algorithm 1 returns an estimate (Â, B̂, â, b̂, q̂) at termination, which in turn yields estimated decision functions. Finally, the prediction at the t-th stage given the t′-th stage is sign(φ̂_ij^{t′t}(y_ij^{t′}, x_ij)) for new user-item covariates x_ij. To implement Algorithm 1, we develop a Python library solving SVM-type optimizations, including weighted SVMs, drifted SVMs, and nonnegative SVMs, based on coordinate descent on the dual problem (Wright, 2015). This implementation is available in the varsvm library in Python on GitHub (https://github.com/statmlben/variant-svm); the source code is released under the MIT license, and releases are available via PyPI. Continuous integration covers Ubuntu Linux and Mac OS X with Python 3.
The successive updates in Algorithm 1 suffice to ensure its convergence because the cost function in (9) is strictly blockwise convex; that is, (13), (15), (16), (17), and (18) are strictly convex in the parameters of each block, yielding a unique minimizer at each step. This aspect differs from the maximum block improvement of Chen et al. (2012), which guarantees the convergence of blockwise coordinate descent for a general objective function. One benefit of successive updating is that it significantly reduces the computational complexity. Note that the solution of Algorithm 1 can be a global minimizer under certain additional assumptions (Haeffele and Vidal, 2015).

Theory
This section investigates the generalization aspect of the proposed multistage recommender f̂ in terms of classification accuracy, as measured by the classification regret e(f̂) − e(f̄), where e(·) is the generalization error defined in (2).
Assume that (Δ_ij, Y_ij) given X_ij; 1 ≤ i ≤ n, 1 ≤ j ≤ m, are conditionally independent, although the (Δ_ij, X_ij, Y_ij)'s themselves may not be independent, as multiple items can be purchased by the same user and multiple users may purchase the same item. Let f̄ ∈ F be a Bayes decision function with parameters (Ā, B̄, ā, b̄, q̄) and K̄ latent factors. Let the truncated V-loss be V_B(u) = min(V(u), B), where B ≥ V(Y_ij^t f̄_ij^{t′t}(x_ij)) for any i = 1, …, n; j = 1, …, m; 0 ≤ t′ < t ≤ T, and let L_B denote the corresponding loss defined by V_B(u). The following technical assumptions are made.

Assumption A (Conversion). There exist constants 0 ≤ α ≤ ∞ and c_1 > 0 such that the corresponding conversion inequality holds for all 0 < ε ≤ B and f ∈ F.

Assumption B (Variance property). There exist constants 0 ≤ µ ≤ 2 and c_2 > 0 such that the corresponding variance bound holds for any ε ≥ 0, n, m ≥ 1, and f ∈ F.

Assumptions A and B are two local smoothness conditions: they relate two different metrics and connect the mean and variance of the regret function in a neighborhood of f̄. Similar assumptions have been used in binary classification; see Zhang and Liu (2014), for example, where α = µ = 1 when V(·) is the ψ-loss (Shen et al., 2003) or the hinge loss under a low-noise assumption.
Let K and K̄ be the numbers of latent factors of f̂ and f̄, respectively.
Theorem 4 Let f̂ be a global minimizer of (9) with K ≥ K̄. If Assumptions A and B hold, then there exist constants c_3 > 0 and c_4 > 0 such that the regret bound (19) holds, where w̄ = Σ_{0≤t′<t≤T} w_{t′t}^2, (Λ + T)K is the total number of model parameters, and Λ = p_1 + p_2 + Σ_{l=1}^{d_1} n_l + Σ_{l=1}^{d_2} m_l.
As indicated in (19), the convergence rate is improved by the two-level monotonic chain property (3) and the additive property (4). As illustrated in Table 1, based on the factorization model in (8) while ignoring the two-level monotonicity (3), a standard method and an ordinal method involve ΛKT(T + 1)/2 and ΛKT parameters, respectively, both larger than the (Λ + T)K parameters of the proposed method, which significantly impedes performance when T becomes large. Furthermore, due to the monotonic property (1), the effective sample size for Y_ij^t = 1 decreases exponentially as t increases, so the standard and ordinal methods suffer from over-fitting at late stages with small sample sizes. In contrast, the proposed method leverages the two-level monotonicity to reduce the dimension of the underlying problem.

Method   | Forward monotonicity | Backward monotonicity | Consistency | #Parameters
Proposed | Yes                  | Yes                   | Yes         | (Λ + T)K
Standard | No                   | No                    | No          | ΛKT(T + 1)/2
Ordinal  | Yes                  | No                    | No          | ΛKT

Table 1: Monotonicity and model parameters of the proposed, standard, and ordinal frameworks, denoting (9), (11), and (12), respectively. Forward and backward monotonicity are defined in (3), consistency denotes whether a method provides consistent prediction results, and #Parameters denotes the number of model parameters based on (4).
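The parameter counts in Table 1 are easy to tabulate; the numbers in the example below are arbitrary illustrative sizes:

```python
def n_params(Lam, K, T, method):
    # Model parameter counts from Table 1 (Lam = Lambda, the total
    # feature dimension; K latent factors; T stages).
    if method == "proposed":
        return (Lam + T) * K
    if method == "standard":
        return Lam * K * T * (T + 1) // 2
    if method == "ordinal":
        return Lam * K * T
    raise ValueError(method)

# e.g. Lambda = 1000 features, K = 20 latent factors, T = 3 stages:
assert n_params(1000, 20, 3, "proposed") == 20060
assert n_params(1000, 20, 3, "standard") == 120000
assert n_params(1000, 20, 3, "ordinal") == 60000
```

The gap widens quadratically in T for the standard method and linearly for the ordinal method, while the proposed count grows only additively in T.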

Numerical examples
This section examines the proposed method in (9) with the hinge loss V(u) = (1 − u)_+ and compares it with standard methods, including a latent factor model (SVD++; Koren, 2008), gradient boosting (GradBoost; Friedman, 2001), the support vector machine (SVM; Cortes and Vapnik, 1995), and a deep neural network (DeepNN; Schmidhuber, 2015), as well as the ordinal method (12), namely the ordinal support vector machine (OSVM), in a simulation and on an article sharing benchmark. As standard methods, SVD++, GradBoost, SVM, and DeepNN treat each (t′, t) pair separately for all possible pairs, where a one-hot encoder (Weinberger et al., 2009) converts each categorical covariate into 0-1 numerical predictors for training. Moreover, for SVD++, the number of latent factors is set to 20, latent factors are estimated using the first columns of s_i and o_j with the rating (Y_ij^t + 1)/2 ∈ {0, 1}, and the label is predicted as 1 if the predicted rating exceeds 0.5 and −1 otherwise. For GradBoost, the minimum number of samples to split an internal node is 2, the minimum number of samples required at a leaf node is 1, and the number of boosting stages is 10. For DeepNN, a ReLU network is used with the number of nodes in each layer fixed at 32. For the ordinal method, we use the OSVM classifier in (12), where the forward monotonicity is ensured by positivity constraints β_{t′t} ≥ 0; t = t′ + 1, …, T.
For implementation, we use our Python library varsvm for the proposed method and OSVM, the Python library sklearn for GradBoost, SVM, and DeepNN, and the Python library Surprise for SVD++. For all methods, the prediction of Y_ij^t is automatically set to −1 when Y_ij^{t′} = −1.
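This post-processing step can be sketched as follows: once any present stage is −1, all subsequent predictions are forced to −1, which also repairs chain violations for methods lacking built-in consistency. The helper below is illustrative and not part of any of the packages above:

```python
def enforce_chain(preds):
    """Force predictions to respect the monotonic chain (1): once a stage is
    predicted -1, every subsequent stage is set to -1."""
    out, blocked = [], False
    for p in preds:
        blocked = blocked or (p == -1)
        out.append(-1 if blocked else p)
    return out

assert enforce_chain([1, 1, 1]) == [1, 1, 1]
assert enforce_chain([1, -1, 1]) == [1, -1, -1]  # repairs an inconsistent triple
```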
As suggested by Table 2, the proposed multistage recommender performs best for one-step, two-step, and three-step forecasting, followed by DeepNN, SVM, and GradBoost. The same phenomenon is observed in the overall performance, where the improvements of the proposed method are 43%, 22%, 20%, and 13% over GradBoost, OSVM, SVM, and DeepNN, respectively. Concerning prediction inconsistency, SVM and OSVM perform the worst, followed by DeepNN and GradBoost. Note that there is no single instance in which a prediction by the proposed method is inconsistent, which is ensured by the two-level monotonicity property of the classifier.

Benchmark: Deskdrop article sharing
The Deskdrop article sharing dataset contains 12 months of logs with about 73,000 logged user interactions on more than 3,000 public articles. In particular, article-specific features include "article Id", "author Id", "plain text", and "language"; user-specific features include "user Id", "user agent", and "user region"; and users' actions, including view, like, and follow, are logged. The goal is to predict user-item interactions over three stage pairs.
Table 2: Results of the six competitors in the simulation, including the proportion of instances violating the monotonicity (3), denoted by %Inconsist. Here proposed, SVD++, GradBoost, SVM, DeepNN, and OSVM denote the proposed method in (9) with the hinge loss, the latent factor model (Koren, 2008), gradient boosting (Friedman, 2001), the support vector machine (Cortes and Vapnik, 1995), the deep learner (Schmidhuber, 2015), and the conditional ordinal support vector machine in (12). The best performer in each case is bold-faced.

For the proposed method, v_j is the numerical embedding based on Doc2Vec (Le and Mikolov, 2014) of the "plain text" of item j, o_j consists of "article Id", "author Id", and "language" of item j, and s_i is composed of "user Id", "user agent", and "user region" of user i. For the other competing methods, we use all features, with a one-hot encoder converting categorical covariates to zero-one dummy predictors. Moreover, we predict (t′ = 0, t = 1) based on the proposed method with T = 1, and predict (t′ = 0, t = 2) and (t′ = 1, t = 2) based on the proposed method with T = 2. For evaluation, we set all stage weights w_{t′t} = 1 and adjust the class weights at each stage inversely proportional to the number of observations in each class, since feedback at each stage is highly imbalanced in this dataset. Hence, all methods are fitted based on balanced class weights or over- or under-sampling, except SVD++, which is based on a regression approach. For consistency of comparison, we focus on the class-balanced zero-one loss, since all methods are fitted and tuned based on a balanced-weighted classification loss.
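The class reweighting can be sketched in a few lines; this mirrors the common "balanced" heuristic (weights inversely proportional to class counts, normalized to average one over classes) and is illustrative rather than the exact tuning used in the experiments:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely proportional to its frequency,
    normalized so that the average weight over classes is one."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

# Highly imbalanced stagewise feedback: 3 positives, 1 negative.
w = balanced_class_weights([1, 1, 1, -1])
assert w[-1] > w[1]                           # the rare class gets the larger weight
assert abs(w[1] * 3 + w[-1] * 1 - 4) < 1e-12  # total weight is preserved
```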
As suggested by Table 3, the proposed multistage recommender outperforms SVD++, GradBoost, SVM, DeepNN, and OSVM in terms of the overall class-balanced zero-one loss, with improvements ranging from 8.20% to 75.3%. Similarly, for one-step and two-step forward prediction, it is either the best or nearly the best. Concerning prediction inconsistency, all other methods yield inconsistent cases, ranging from 0.98% to 13.7%. Interestingly, the conditional ordinal method (OSVM) performs similarly to the conditional individual method (SVM), indicating that it does not fully account for the monotonicity property.

Table 3: Class-balanced zero-one losses (CB-01) of the six competitors and their estimated standard deviations in parentheses on the Deskdrop benchmark, in addition to %Inconsist, the proportion of inconsistent instances violating the monotonicity (3). Here proposed, SVD++, GradBoost, SVM, DeepNN, and OSVM denote the proposed method in (9) with the hinge loss, the latent factor model (Koren, 2008), gradient boosting (Friedman, 2001), the support vector machine (Cortes and Vapnik, 1995), the deep learner (Schmidhuber, 2015), and the ordinal support vector machine (12). The best performance in each case is bold-faced.

Technical Proofs
Proof of Lemma 1. It suffices to show that φ̄_ij^{t′t} is a global minimizer of l_ij(f_ij). By (1), φ̄_ij^{t′t}(x_ij, y_ij^{t′}) is a global minimizer of (20), since φ̄_ij^{t′t}(x_ij, 1) = sign(P_ij(Y_ij^t = 1 | Y_ij^{t′} = 1, Δ_ij = 1) − 1/2) minimizes E_ij[I(Y_ij^t φ_ij^{t′t}(x_ij, 1) ≤ 0) | Y_ij^{t′} = 1, Δ_ij = 1], and φ̄_ij^{t′t}(x_ij, −1) = −1 minimizes E_ij[I(−φ_ij^{t′t}(x_ij, −1) ≤ 0) | Δ_ij = 1, Y_ij^{t′} = −1]. Next, we verify the two-level monotonic property (3) of f̄_ij^{t′t}. Note that, by (1), P_ij(Y_ij^t = 1 | Δ_ij = 1) is nonincreasing over stage t, and P_ij(Y_ij^t = 1 | Y_ij^{t′} = 1, Δ_ij = 1) = P_ij(Y_ij^t = 1 | Δ_ij = 1) / P_ij(Y_ij^{t′} = 1 | Δ_ij = 1), provided that P_ij(Y_ij^{t′} = 1 | Δ_ij = 1) ≠ 0. This ratio is nonincreasing when t increases while t′ is fixed, because the numerator P_ij(Y_ij^t = 1 | Δ_ij = 1) is nonincreasing in t, and is nondecreasing when t′ increases while t is fixed, because the denominator P_ij(Y_ij^{t′} = 1 | Δ_ij = 1) is nonincreasing in t′. This implies the two-level monotonic property.
Next, we construct an alternative form of the Bayes decision function f̄_ij^{t′t}(x_ij) that is additive in stage t, as in (4), where h̄_ij^r(x_ij) ≥ 0, and c_ij(x_ij) > 0 and α_ij(x_ij) > 0 are two arbitrary positive functions. Clearly, this form automatically satisfies the two-level monotonic property. It remains to show that the additive form has the same sign as f̄_ij^{t′t}(x_ij); that is, sign(f̄_ij^{t′t}(x_ij)) = sign(c_ij(x_ij) log(α_ij(x_ij) · 2P_ij(Y_ij^t = 1 | Y_ij^{t′} = 1, Δ_ij = 1))), which equals 1 when P_ij(Y_ij^t = 1 | Y_ij^{t′} = 1, Δ_ij = 1) > 1/2 and −1 otherwise. The desired result then follows.
Proof of Lemma 2. The result follows from the fact that f̄_ij^{t′t} is a global minimizer of E_ij[L(f_ij, Z_ij)] when V(·) is Fisher-consistent for binary classification.

Proof of Lemma 3. Note that (13), (15), (16), (17), and (18) are convex minimization problems. The convergence of Algorithm 1 then follows from Tseng (2001).
Proof of Theorem 4. Our treatment of bounding P(e(f̂) − e(f̄) ≥ ε_nm²) relies on a chaining argument for an empirical process over a suitable partition of F induced by e(f̂) − e(f̄), as in Dai et al. (2019a) and Wong et al. (1995). For u ≥ 1 and v ≥ 0, define the sets A_uv ⊂ F accordingly, with λJ(f) = λ_1‖β‖_2² + Σ_{k=1}^{K} λ_k(‖a_k‖_2² + ‖b_k‖_2²). We then apply Talagrand's inequality (Giné et al., 2006) to each I_uv, with a_2 = 32 c_2 B^{µ−min(1,µ)} + 9B^{2−min(1,µ)}, where the second inequality follows from symmetrization with an independent copy Z_ij′ of Z_ij and the Rademacher complexity of A_uv, defined with independent Rademacher random variables τ_ij, the third inequality follows from Talagrand's inequality, and the last inequality follows from a bound on this Rademacher complexity. A combination of (22) and (23) yields the desired result.