Multi-Label Residual Weighted Learning for Individualized Combination Treatment Rule

Individualized treatment rules (ITRs) have been widely applied in many fields such as precision medicine and personalized marketing. Beyond the extensive studies on ITRs for binary or multiple treatments, there is considerable interest in applying combination treatments. This paper introduces a novel ITR estimation method for combination treatments that incorporates interaction effects among treatments. Specifically, we propose the generalized $\psi$-loss as a non-convex surrogate in the residual weighted learning framework, offering desirable statistical and computational properties. Statistically, the minimizer of the proposed surrogate loss is Fisher-consistent with the optimal decision rules and incorporates interaction effects at any intensity level, a significant improvement over existing methods. Computationally, the proposed method applies the difference-of-convex algorithm for efficient computation. Through simulation studies and real-world data applications, we demonstrate the superior performance of the proposed method in recommending combination treatments.


Introduction
The individualized treatment rule (ITR) in precision medicine has been widely applied to recommend tailored treatments for each individual. Unlike the traditional one-size-fits-all strategy, ITR aims to account for subject heterogeneity to achieve personalization. While existing ITR methods mainly focus on choosing one of two or more treatments, combination treatments have emerged as a promising strategy to achieve better outcomes, including enhanced efficacy and resistance prevention [26,24,13,22]. Furthermore, combination treatments have also been applied in personalized marketing, where a mix of promotional strategies is tailored to diverse customer groups. In summary, developing ITR methods for combination treatments is of great interest not only in precision medicine and personalized marketing, but also potentially in many other fields.
The existing literature on ITR estimation can be broadly summarized into two categories. The first is the indirect approach, including the well-known Q-learning [29,5], D-learning [28,27], and A-learning [33,20,32]. These methods fit parametric or non-parametric models for conditional average treatment effects and recommend the preferable treatment accordingly. The second is the direct approach, including outcome-weighted learning [44,46,41,17,38] and residual-weighted learning [45]. These methods directly maximize the value function with respect to the decision rules. In practice, direct methods have demonstrated superior empirical performance, as they circumvent the model misspecification issues common to indirect approaches. In addition, their decision rules are flexible, accommodating either parametric models (e.g., linear decision rules [44,41]) or nonparametric models (e.g., kernel methods [44,41], neural networks [16], and boosting [36]).
For the combination treatment problem, one can apply the aforementioned multicategory ITR methods by treating each combination as an independent treatment. However, this formulation ignores the correlations among combination treatments. Moreover, as the number of single treatments increases, the number of combination treatments grows exponentially, causing the model complexity of either direct or indirect approaches to explode. Given the limited or moderate sample sizes in biomedical applications, the estimation efficiency of multicategory ITRs is severely compromised. To address this issue, recent work [37] proposed an indirect approach that estimates the conditional average treatment effects (CATE) with a double encoder model, achieving efficient estimation for combination treatments both empirically and theoretically. Among the direct approaches, [16,39] utilized the Hamming loss in place of the 0-1 loss, treating ITR estimation as a weighted multi-label classification problem. Under this formulation, one only needs to estimate a decision rule for each single treatment, which greatly reduces the model complexity and addresses the inefficiency of multicategory ITRs. However, such a parsimonious model may be incapable of incorporating interaction effects among combination treatments. In particular, the method of [16] can be undermined when the interaction effects are non-negligible. Therefore, it is essential to develop direct methods that offer flexible modeling and are capable of incorporating interaction effects among combination treatments.
In this paper, we introduce a novel Multi-Label Residual Weighted Learning (MLRWL) framework for estimating the optimal ITR for combination treatments. Specifically, we propose the generalized ψ-loss as a non-convex surrogate for the 0-1 loss in the multi-label classification problem, with the optimal ITR derived as the minimizer of the weighted generalized ψ-loss. The proposed method has two main advantages over the Hamming hinge loss considered in [16]. First, the generalized ψ-loss guarantees Fisher consistency regardless of whether interaction effects are present. In particular, the minimizer of the weighted generalized ψ-loss exhibits sign consistency with the optimal ITR, a property that holds at any intensity level of interaction effects. This property is especially valuable in real applications, where the intensity of interaction effects can be unknown. Second, the generalized ψ-loss can accommodate negative or shifted outcomes to stabilize empirical performance. In theory, we show that Fisher consistency and the consistency of the proposed estimator are preserved under negative and shifted outcome weights. In contrast, a convex surrogate loss, such as the Hamming hinge loss, can only accommodate positive weights to preserve its convexity, potentially limiting its applicability.
Computationally, the non-convex generalized ψ-loss can be formulated as the difference of two convex functions. Therefore, the minimization of the weighted generalized ψ-loss can be solved efficiently and iteratively by the difference-of-convex (DC) algorithm [34]. Notably, the subproblem within each iteration is a quadratic programming problem for both linear and nonlinear decision rules, which can be handled by standard quadratic programming solvers. We show that estimators obtained through the DC algorithm are stationary points. Our numerical studies indicate that the proposed method achieves superior performance compared with existing ITR approaches for combination treatment problems.
The rest of the article is organized as follows. In Section 2, we introduce the background of the ITR problem and existing works. In Section 3, we propose Multi-Label Residual Weighted Learning with the generalized ψ-loss; algorithms and implementation details for linear and nonlinear decision rules are also illustrated. In Section 4, the theoretical properties of the generalized ψ-loss are provided. In Sections 5 and 6, we present numerical studies to evaluate the empirical performance of the proposed method in simulation settings and a real application to a type-2 diabetes study.

Background
In this paper, we focus on estimating an individualized treatment rule (ITR) for combination treatments using clinical trial data. The variables of interest are (X, A, Y), where X ∈ 𝒳 ⊂ R^p represents the pre-treatment covariates, A = (A^(1), A^(2), ..., A^(K)) ∈ 𝒜 = {−1, 1}^K denotes the combination treatment consisting of up to K single treatments, and Y ∈ R is the observed outcome. We assume that a larger Y indicates a more desirable outcome, and use Y(A) to denote the potential outcome [30] under treatment assignment A.
Since only one of the potential outcomes can be observed for each subject, it is infeasible to recommend the subject-wise optimal treatment directly. Instead, our goal is to learn an ITR d(·): 𝒳 → 𝒜 that maximizes the average outcome across the population. Here, the expected potential outcome under an ITR d(·) is also termed the value function [29] with respect to d(·):

V(d) = E[ Y(d(X)) ].  (2)

To estimate the ITR d(·) from clinical experiments, we rely on the following standard causal assumptions [8]: (a) consistency, Y = Y(A); (b) positivity, P(A = a | X = x) > 0 for any a ∈ 𝒜 and x ∈ 𝒳; and (c) ignorability, A ⊥ Y(a) | X for any a ∈ 𝒜. Under these assumptions, maximizing the value function (2) is equivalent to minimizing the following risk:

R(d) = E[ Y / P(A|X) · I(A ≠ d(X)) ],  (3)

where I(·) is the indicator function and P(A|X) is the propensity score [9]. This is a weighted classification problem, with A being the response comprised of 2^K distinct classes and weights given by Y/P(A|X). However, directly minimizing the risk (3) is an NP-hard problem due to the non-smoothness of the indicator function I(·). More intricate than binary or multicategory treatment problems, minimizing (3) for combination treatments also encounters the curse of dimensionality: as the number of single treatments K grows, the number of possible combination treatments grows exponentially, which requires a rather complex decision rule d(·). Consequently, the estimation efficiency is undermined in combination treatment problems, especially those with a large K.
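To make the value criterion concrete, here is a minimal sketch (not from the paper) of the inverse-probability-weighted estimate of V(d) = E[I(A = d(X))/P(A|X) · Y]. The function name `ipw_value` and the toy data are our own illustration, assuming the randomization probabilities are known.

```python
import numpy as np

def ipw_value(y, a, d_x, propensity):
    """Inverse-probability-weighted estimate of the value function
    V(d) = E[ I(A = d(X)) / P(A|X) * Y ].

    y          : (n,) observed outcomes
    a          : (n, K) observed combination assignments in {-1, 1}
    d_x        : (n, K) combinations recommended by the rule d(.)
    propensity : (n,) P(A_i | X_i), assumed known (e.g., a randomized trial)
    """
    match = np.all(a == d_x, axis=1)  # I(A = d(X)) across all K labels
    return np.mean(match * y / propensity)

# toy check: K = 2 treatments, uniform randomization -> P(A|X) = 1/4
y = np.array([2.0, 1.0, 0.0, 3.0])
a = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]])
d_x = np.array([[1, 1], [1, 1], [-1, 1], [-1, -1]])  # rule agrees for subjects 0, 2, 3
v_hat = ipw_value(y, a, d_x, np.full(4, 0.25))
```

Only subjects whose observed combination matches the rule contribute, each reweighted by the inverse of its assignment probability.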
Since the combination treatment A can be considered as a K-dimensional binary response, it is natural to approach the problem (3) from the multi-label classification perspective [16,35]. Each treatment A^(k) can be treated as a binary response, indicating whether the kth treatment is assigned (A^(k) = 1) or not (A^(k) = −1). Therefore, we can decompose the ITR d(·) into K decision rules d^(1)(·), ..., d^(K)(·), where d^(k)(·) decides whether the kth treatment should be assigned. In contrast to the multicategory classification formulation requiring 2^K decision rules, K decision rules are sufficient under the multi-label classification framework.
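The reduction from 2^K categories to K binary labels can be illustrated with a small encoding helper; `combo_to_labels` and `labels_to_combo` are hypothetical names used only for this sketch.

```python
import numpy as np

def combo_to_labels(index, K):
    """Map a category index in {0, ..., 2^K - 1} to K labels in {-1, 1}:
    bit k of the index encodes whether the kth treatment is assigned."""
    return np.array([1 if (index >> k) & 1 else -1 for k in range(K)])

def labels_to_combo(labels):
    """Inverse map: K labels in {-1, 1} back to a single category index."""
    return sum((1 << k) for k, v in enumerate(labels) if v == 1)

K = 3
# the 2^K = 8 combinations, each represented by only K = 3 binary labels
all_combos = [combo_to_labels(i, K) for i in range(2 ** K)]
```

The multicategory view needs a rule per index (8 here), while the multi-label view needs only one sign rule per bit (3 here).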
There are two mainstream strategies for multi-label classification in the literature. The first is the so-called binary relevance [21,4], which treats each label A^(k) as an independent binary label and builds an independent binary classifier for each label. In combination treatment problems, however, the combination of multiple treatments can induce additional interaction effects, which can be either synergistic or antagonistic in nature. As a result, the outcome Y is largely contingent on the holistic treatment assignment A, rather than solely on the individual assignments A^(k). Therefore, it is risky to adopt the binary relevance strategy in combination treatment problems, as it may ignore considerable interaction effects.
The second strategy is to adopt an appropriate loss function, particularly a convex surrogate loss, in place of the 0-1 loss. Convexity promotes efficient computation algorithms that guarantee global optimality. However, the improved computational efficiency may come at the cost of compromised statistical properties. For instance, [16] combines the Hamming and hinge losses as a convex surrogate loss to tackle the combination treatment problem; the minimizer of their surrogate loss does not guarantee a Fisher-consistent estimation of the optimal ITR d*(·), especially when significant interaction effects exist among treatments. In summary, it is critical to identify a surrogate loss that guarantees sound statistical properties such as Fisher consistency while admitting efficient computation.

Methodology
In Section 3.1, we introduce the proposed residual weighted learning framework from the perspective of multi-label classification. Section 3.2 introduces the algorithm and implementation details for linear and nonlinear decision rules, respectively.

Multi-Label Residual Weighted Learning
In this section, we introduce a novel non-convex surrogate loss, the generalized ψ-loss, which targets the weighted multi-label classification problem to estimate the optimal ITR for combination treatments.
In order to minimize the ψ-risk (4), ψ(Z^(1), ..., Z^(K)) is expected to be small for large weights Y/P(A|X), so the decision functions f^(k)(·) are expected to align with the associated treatment assignment A. Therefore, the overall treatment effects of the combination treatment, including the effects of single treatments and the induced interaction effects, affect the decision rules simultaneously. This property guarantees the Fisher consistency of the proposed method, irrespective of the intensity of interaction effects. In contrast, each single treatment contributes separately to the Hamming hinge loss [16], leading its decision rules to rely more on the treatment effects of single treatments; consequently, the Fisher consistency of that method is achieved only under minor interaction effects.

Outcome Shift
A significant drawback of the outcome-weighted learning framework [44,16] is that the value of Y must be positive to preserve the convexity of the surrogate loss function. Empirically, it is possible to shift the outcome so that this assumption is satisfied; however, the shift of the outcome may impact the empirical performance of the algorithm. We refer readers to [45] for a detailed discussion of this potential issue.
The proposed generalized ψ-loss is also sensitive to a shift of the outcome Y, and the impact of the outcome shift on combination treatment problems is rather significant. Specifically, the minimizer of the ψ-risk under the outcome weights Y and the minimizer under the shifted outcome weights Y − g(X) are not necessarily equivalent, because

R_ψ(f) ≠ R_{ψ,g}(f) + constant,

where R_{ψ,g}(f) is defined as

R_{ψ,g}(f) = E[ (Y − g(X)) / P(A|X) · ψ(Z^(1), ..., Z^(K)) ],  (5)

and g(X) is a measurable function. Therefore, minimizing the ψ-risk with the shifted outcome weights Y − g(X) might yield decision rules other than the optimal decision rules derived from (4). Selecting an appropriate shift is hence crucial for maintaining the statistical consistency and optimality of the decision rules.
In this work, we employ the treatment-free effects g(X) = 2^{−K} Σ_{a∈𝒜} E[Y | X, A = a], the conditional outcome averaged over all combination treatments, as the functional shift, and regard the inverse-probability-weighted residual (Y − g(X)) / P(A|X) as the weight in the multi-label classification. There are two main reasons for choosing g(X) as the treatment-free effects. First, since g(X) does not depend on A, subtracting it does not change the relative order of the conditional average treatment effects E[Y | X, A = a] over all possible treatments a ∈ 𝒜, which is a sufficient condition for the Fisher consistency property elaborated in Section 4. Second, it leads to a straightforward interpretation: for treatments associated with above-average treatment effects, the decision rules are encouraged to match these treatments; for treatments associated with below-average treatment effects, the decision rules are instead encouraged to deviate from them.
Empirically, given i.i.d. samples (x_i, a_i, y_i), i = 1, ..., n, we can estimate the ITR by minimizing the empirical loss

min_{f^(1), ..., f^(K)}  (1/n) Σ_{i=1}^n [ (y_i − g(x_i)) / P(a_i | x_i) ] ψ(a_i^(1) f^(1)(x_i), ..., a_i^(K) f^(K)(x_i)) + P_λ(f),  (6)

where g(·) is the treatment-free effects and P(·|·) is the propensity score in the clinical trial. In observational studies, both the treatment-free effects and the true propensity scores are unknown, so working models of g(x) and P(a|x) are needed. In Appendix A.8, we discuss the estimation of working models, which can be plugged into (6) to estimate the decision rules. The penalty function P_λ(f) determines the function space of f and controls its complexity. More details on the model specification are given in Section 3.2. Once we obtain the estimates f̂^(1), ..., f̂^(K), the estimated ITR d̂ is given by

d̂(x) = ( sign(f̂^(1)(x)), ..., sign(f̂^(K)(x)) ).
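As a small illustration of the residual weighting step, the sketch below computes w_i = (y_i − ĝ(x_i))/P(a_i|x_i) from fitted quantities; the values in `g_hat` are hypothetical stand-ins for an estimated treatment-free effect. Note that, unlike outcome-weighted learning, the weights are allowed to be negative.

```python
import numpy as np

def residual_weights(y, g_hat, propensity):
    """Multi-label classification weights w_i = (y_i - g(x_i)) / P(a_i | x_i).

    g_hat : fitted treatment-free effects g(x_i); any working model
            (e.g., a regression of Y on X) could supply these values.
    """
    return (y - g_hat) / propensity

y = np.array([1.0, 3.0, -1.0])
g_hat = np.array([0.5, 0.5, 0.5])   # hypothetical fitted g(x_i)
p = np.array([0.25, 0.25, 0.25])    # uniform randomization over K = 2 combos
w = residual_weights(y, g_hat, p)   # negative weights are permitted here
```

Subjects with above-average outcomes receive positive weights (their assignments should be matched); those with below-average outcomes receive negative weights (their assignments should be avoided).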

Algorithm and Implementation
In this section, we introduce the algorithm and implementation details of the proposed method under linear and nonlinear decision rules, respectively. Since the generalized ψ-loss is a non-convex surrogate loss, commonly adopted convex optimization algorithms are not applicable. Nevertheless, the generalized ψ-loss admits a decomposition as the difference of two convex functions, so the difference-of-convex (DC) algorithm [25,34] is applicable for efficient computation. Suppose the decision functions f^(1), ..., f^(K) are parameterized by β_k's, respectively. Then the empirical loss (6) can be reformulated as

L(β) = (1/n) Σ_{i=1}^n w_i ψ(a_i^(1) f^(1)(x_i), ..., a_i^(K) f^(K)(x_i)) + λ Σ_{k=1}^K P(β_k),  (7)

where β = (β_1, ..., β_K) collects the parameters of the decision functions, and the weight (y_i − g(x_i)) / P(a_i | x_i) is denoted as w_i for ease of notation. The penalty function P(β_k) is a convex penalty associated with the type of decision rule, and λ serves as the tuning parameter for the penalty.
Since the loss function L(β) can be decomposed into a convex part L_cvx(β) and a concave part L_cave(β), we can employ the difference-of-convex algorithm [25,34] to estimate β. Specifically, at the t-th iteration, the subproblem minimizes a convex majorization [19] of the loss function L(β), obtained by linearizing the concave part:

β^(t) = argmin_β  L_cvx(β) + ⟨∇_β L_cave(β^(t−1)), β − β^(t−1)⟩,  (8)

where ∇_β L_cave(β^(t−1)) is the subgradient of L_cave(β) at the previous iterate β^(t−1), and ⟨·,·⟩ denotes the inner product. The procedure for minimizing L(β) is summarized in Algorithm 1. It guarantees that the convergent point obtained from the iterations is a stationary point of L(β), provided the initial value β^(0) satisfies certain conditions; the detailed result and conditions can be found in Appendix A.3.
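The DC iteration can be sketched generically: linearize the concave part at the current iterate and solve the resulting convex subproblem. The toy objective below (b² − |b|) is ours, not the paper's ψ-loss, and the closed-form convex step stands in for the quadratic program used in the paper.

```python
import numpy as np

def dc_minimize(grad_cave, solve_convex, beta0, max_iter=50, eps=1e-8):
    """Difference-of-convex iteration for L(b) = L_cvx(b) + L_cave(b),
    with L_cave concave. Each step minimizes the convex majorant
    L_cvx(b) + <grad L_cave(b_prev), b>, stopping when iterates stabilize."""
    beta = beta0
    for _ in range(max_iter):
        beta_new = solve_convex(grad_cave(beta))
        if np.abs(beta_new - beta) < eps:
            return beta_new
        beta = beta_new
    return beta

# toy DC objective (not the paper's psi-loss): L(b) = b^2 - |b|,
# decomposed as L_cvx(b) = b^2 and L_cave(b) = -|b| (concave)
grad_cave = lambda b: -np.sign(b) if b != 0 else 1.0  # a subgradient of -|b|
# argmin_b { b^2 + g*b } = -g/2, solved in closed form here
solve_convex = lambda g: -g / 2.0

beta_star = dc_minimize(grad_cave, solve_convex, beta0=0.3)
```

Starting from 0.3, the first subproblem already lands on the stationary point b = 0.5, one of the two minimizers of b² − |b|.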
In order to minimize (7), which involves max operators and indicator functions, we introduce slack variables η_i to convert T_1(β; x_i, a_i) and T_0(β; x_i, a_i) into linear constraints; the convex optimization problem within each iteration of Algorithm 1 then reduces to a quadratic programming problem.

Algorithm 1 Difference of convex algorithm for minimizing L(β)
Initialize β^(0); set the maximum iteration T and a pre-specified threshold ϵ.
For t = 1, 2, ..., T:
  1. Compute the subgradients ∇_β L_cave(β^(t−1)).
  2. Update β^(t) by solving the convex optimization subproblem (8).
  3. If ∥β^(t) − β^(t−1)∥ < ϵ, output β^(t) and stop.
Otherwise, output β^(T).

In the subproblem, γ is a constant depending on λ, and ∇_{β_k} L_cave(β) is the subgradient of the concave part L_cave with respect to β_k. In the following, we provide the implementation details for linear decision rules in Section 3.2.1, and then generalize them to nonlinear decision rules in Section 3.2.2.

Linear Decision Rule for Optimal ITR
Consider linear decision rules of the form

f^(k)(x) = β_{0k} + x^T β_{1k},  k = 1, ..., K,

where β_{1k} ∈ R^p and β_{0k} ∈ R. The associated ITR d(·) assigns a subject with covariates x to the kth treatment if β_{0k} + x^T β_{1k} > 0, and does not assign the kth treatment otherwise. For linear decision functions, we define the penalty function through the Euclidean norm of β_{1k}. Then, the convex programming problem (8) can be rewritten as a quadratic programming problem (10) in the decision function parameters β_{1k}, β_{0k} and the slack variables η_i, where the η_i are associated with individual-wise linear constraints. In scenarios with a large sample size n and a relatively small covariate dimension p, it is computationally efficient to solve the primal form (10) directly. Otherwise, it is preferable to solve the dual form obtained by introducing Lagrange multipliers θ_{ik}; the dual form of (10) can be solved by quadratic programming solvers such as Gurobi [6]. Given the estimated θ_{ik}, we can recover the estimated β_{1k} and β_{0k}; the detailed derivations are provided in Appendix A.1.
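A minimal sketch of applying estimated linear rules: treatment k is assigned exactly when β_{0k} + x^T β_{1k} > 0. The coefficient values here are hypothetical, chosen only to illustrate the sign rule.

```python
import numpy as np

def linear_itr(X, beta0, beta1):
    """Combination-treatment recommendations from linear decision rules:
    assign treatment k iff beta0[k] + x^T beta1[:, k] > 0.

    X     : (n, p) covariates
    beta0 : (K,) intercepts
    beta1 : (p, K) slope coefficients
    Returns an (n, K) matrix of recommendations in {-1, 1}.
    """
    scores = X @ beta1 + beta0      # f^(k)(x_i) for every subject and treatment
    return np.where(scores > 0, 1, -1)

# hypothetical fitted coefficients for p = 2 covariates, K = 2 treatments
X = np.array([[1.0, -1.0],
              [0.5, 2.0]])
beta0 = np.array([0.0, -1.0])
beta1 = np.array([[1.0, 0.0],
                  [0.0, 1.0]])
d = linear_itr(X, beta0, beta1)
```

Each subject gets a K-vector of signs, i.e., one of the 2^K possible combinations.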

Nonlinear Decision Rule for Optimal ITR
In the following, we consider nonlinear decision functions of the form

f^(k)(x) = h^(k)(x) + β_{0k},  h^(k) ∈ H_K,

where K(·,·) is a valid kernel function associated with a reproducing kernel Hilbert space H_K. Therefore, f^(k)(x) can represent nonlinear functions embedded in H_K with a shift β_{0k}. The norm in H_K, denoted as ∥·∥_K, is induced by the inner product of H_K. Analogous to (8), a convex programming problem is formulated in the t-th iteration. Even though f^(k)(x) belongs to an infinite-dimensional space, it is computationally efficient to consider the dual form of the problem via the kernel trick [7]. After the Lagrange multipliers θ_{ik} are introduced, the resulting dual form can also be solved by quadratic programming solvers.

Theoretical Properties
In this section, we develop the theoretical properties of the estimated ITR under the weighted multi-label classification framework. In particular, we establish Fisher consistency under the proposed generalized ψ-risk, to guarantee that the minimizer of the ψ-risk is theoretically optimal. Furthermore, we establish the excess risk bound and the consistency of the proposed estimator within the reproducing kernel Hilbert space.

First, we establish Fisher consistency under the outcome weighted learning framework. Specifically, the following result holds.

Lemma 4.1. Let f* = (f*^(1), ..., f*^(K)) be the minimizer of the ψ-risk (4). Then sign(f*^(k)(X)) = d*^(k)(X) for k = 1, ..., K, where d*(·) is the optimal ITR given in (3).

Lemma 4.1 establishes the validity of using the generalized ψ-loss as the surrogate loss in the outcome weighted learning framework to estimate the optimal ITR. More importantly, there is no requirement on the intensity of the interaction effects among combination treatments as in [16], which is an advantage in estimating the ITR for combination treatments. As emphasized in Section 3.1, the ψ-loss can incorporate interaction effects of any intensity and still guarantees the above property.
In the following, we show that Fisher consistency also holds under the residual weighted learning framework. Theorem 4.2 guarantees that the generalized ψ-loss is a valid surrogate loss in the sense of Fisher consistency. As mentioned in Section 3.1.1, an arbitrary choice of g(X) may violate Fisher consistency; our choice of the treatment-free effects retains the relative order of the CATE among all treatments, so that Fisher consistency is guaranteed. The detailed proofs of Lemma 4.1 and Theorem 4.2 are provided in Appendices A.4 and A.5.
Next, we establish the relationship between the excess risks under the proposed generalized ψ-loss and the 0-1 loss.

Theorem 4.3. For f = (f^(1), ..., f^(K)) with any measurable f^(k): 𝒳 → R, and any probability distribution of (X, A, Y), we have

R(f) − inf_f R(f) ≤ R_{ψ,g}(f) − inf_f R_{ψ,g}(f),

where R_{ψ,g} denotes the risk (5) with g(·) taken as the treatment-free effects.
Theorem 4.3 shows that the excess risk of any measurable decision functions f under the 0-1 loss is no larger than the excess risk under the ψ-risk. This suggests that if we estimate the ITR by minimizing R_{ψ,g}(f), the risk of the minimizer f̂ is close to the Bayes risk.
However, the above theoretical analyses are all based on the population-level distribution of (X, A, Y). In practice, we are more concerned about the estimator obtained from the empirical distribution. In the following, we establish the consistency of the proposed estimator f̂_n learned from the empirical distribution with sample size n.

Theorem 4.4. Suppose the penalty coefficient λ in the primal form (7) satisfies λ → 0 and nλ → ∞, and the weights |Y − g(X)| / P(A|X) are upper bounded by some positive constant M almost surely. Then for any distribution P of (X, A, Y), we have

R_{ψ,g}(f̂_n) → inf_{f ∈ H_K + {1}} R_{ψ,g}(f)  in probability,

where f̂_n is the minimizer of the empirical loss (6) with sample size n, and H_K + {1} denotes the shifted reproducing kernel Hilbert space considered in Section 3.2.2.
Theorem 4.4 states that the risk of the proposed estimator obtained by minimizing the empirical risk (6) converges in probability to the minimum of the population risk as the sample size increases. In other words, the proposed estimator is consistent with respect to the optimal decision rules for combination treatments. In addition, Theorem 4.4 also holds for linear decision rules with a pre-specified linear kernel K(·,·). The technical details of the proof are provided in Appendix A.7. Furthermore, we provide an extension of Theorem 4.4 to observational studies in Appendix A.9.

Numerical Studies
In this section, we assess the performance of the proposed method through simulation studies that mimic real-world scenarios. In these simulations, we consider treatment effects of varying complexity, while interaction effects of different intensities are included in all settings.
The simulation studies are conducted under different sample sizes (n = 400, 800, 2000). The pre-treatment covariates X ∈ R^10 are sampled uniformly from the interval (−1, 1). In all simulation studies, the combination treatments A are uniformly randomly assigned. We consider three data generation processes. In the first two settings, two treatments are considered (K = 2), resulting in four possible combination treatments with different treatment effects. In the third setting, three treatments are considered (K = 3), corresponding to a total of eight possible combinations, with more complex treatment effects involving nonlinear combinations of the covariates. After generating the treatment effects, we design the outcome of interest Y as the sum of the treatment-free effects g(X), the treatment effects, and a random noise ϵ. It is noteworthy that interaction effects among combination treatments are present in all of the above simulation settings. Specifically, in simulation setting 1, we split a two-dimensional plane into four quadrants; except for the quadrant {X_1 + X_2 < 0, −X_1 + X_2 > 0}, the combination of the two treatments induces either positive or negative interaction effects in the other three quadrants, as shown in Figure 2. In simulation settings 2 and 3, the interaction effects are polynomial and nonlinear functions of the pre-treatment covariates X, respectively. Additionally, the true decision rules are linear in simulation setting 1, and nonlinear in simulation settings 2 and 3.
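The data-generating scaffold described above (uniform covariates, uniformly randomized combinations, outcome = treatment-free effects + treatment effects + noise) can be sketched as follows. The specific effect functions below are hypothetical placeholders, not the paper's settings, but they follow the same structure, including a treatment-by-treatment interaction term.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, K = 400, 10, 2

X = rng.uniform(-1.0, 1.0, size=(n, p))   # covariates ~ Uniform(-1, 1)
A = rng.choice([-1, 1], size=(n, K))      # uniformly randomized combinations

# hypothetical stand-ins for the paper's (unspecified here) components:
g = X[:, 0] + X[:, 1]                     # treatment-free effects g(X)
delta = (0.5 * A[:, 0] * X[:, 0]          # main effect of treatment 1
         + 0.5 * A[:, 1] * X[:, 1]        # main effect of treatment 2
         + 0.3 * A[:, 0] * A[:, 1] * X[:, 2])  # interaction effect
Y = g + delta + rng.normal(scale=0.1, size=n)  # outcome with noise
```

Under this randomization, every combination has assignment probability 2^{−K} = 1/4, matching the uniform propensities used in the simulations.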

Table 1 Simulation studies: mean and standard error of the value function under the proposed method with linear and nonlinear decision rules, and five competing methods: the outcome weighted learning with deep learning (OWL-DL, [16]), the L_1 penalized least squares (L_1-PLS, [29]), the outcome weighted learning with multinomial deviance (OWL-MD, [10]), and the multicategory outcome weighted learning with linear and kernel functions (MOWL-Linear and MOWL-Kernel, [41]).
We compare our method with several existing methods that estimate the optimal ITR for combination treatments or multicategory treatments: the outcome weighted learning with deep learning (OWL-DL, [16]), the L_1 penalized least squares (L_1-PLS, [29]), the outcome weighted learning with multinomial deviance (OWL-MD, [10]), and the multicategory outcome weighted learning with linear and kernel functions (MOWL-Linear and MOWL-Kernel, [41]). For the last four competing methods, we first convert the combination treatments into categorical treatments and then apply those methods to estimate the ITR.
All of the above simulation experiments are repeated 100 times, and the empirical performance is evaluated by the prediction accuracy

(1/n) Σ_{i=1}^n I( d̂(x_i) = a_i^opt ),

where a_i^opt is the optimal treatment assignment for the ith subject derived from the data generation process. We also compute the empirical value function [29], measured by the estimator

V̂(d̂) = E_n[ Y I(A = d̂(X)) / P(A|X) ] / E_n[ I(A = d̂(X)) / P(A|X) ],

where E_n denotes the empirical average; this assesses the performance even when the optimal treatment assignments are unknown.
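Both evaluation metrics can be sketched directly. The normalized inverse-probability-weighted form of the empirical value function used below is a common choice and is our assumption here, not a formula quoted from the paper.

```python
import numpy as np

def accuracy(d_hat, a_opt):
    """Share of subjects whose recommended combination matches the optimal one."""
    return np.mean(np.all(d_hat == a_opt, axis=1))

def empirical_value(y, a, d_hat, propensity):
    """Normalized IPW estimate of the value function under rule d_hat,
    usable even when the optimal assignments are unknown."""
    match = np.all(a == d_hat, axis=1)
    w = match / propensity
    return np.sum(w * y) / np.sum(w)

# toy example with K = 2 treatments
a_opt = np.array([[1, 1], [-1, 1], [1, -1]])
d_hat = np.array([[1, 1], [-1, -1], [1, -1]])
acc = accuracy(d_hat, a_opt)   # 2 of 3 subjects match the optimal assignment
```

When all observed assignments agree with the rule and propensities are constant, the empirical value reduces to the plain outcome mean, a quick sanity check for implementations.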
The results of the simulation studies are presented in Tables 1 and 2, which demonstrate the effectiveness of our proposed method in estimating the optimal ITR for combination treatments. Our approach consistently outperforms competing methods in terms of the accuracy of optimal treatment assignment and the value function across various settings, particularly in the presence of interaction effects among combination treatments. In simulation setting 1, the proposed method with linear decision rules improves the optimal treatment assignment accuracy by 13.2% to 34.0% compared with competing methods given 2000 samples. Even though the true decision rules are linear, the proposed method with nonlinear decision rules still achieves comparable accuracy and value function. In simulation settings 2 and 3, where the true decision rules are nonlinear, the proposed method with nonlinear decision rules achieves the best performance. In particular, it improves the optimal treatment assignment accuracy by 27.1% to 39.4% in setting 2, and by 4.1% to 53.0% in setting 3.

Real Data Application
In this section, we apply our method to recommend the optimal combination treatments for type-2 diabetes patients. The dataset is from Electronic Health Record (EHR) data accessible from the Clinical Practice Research Datalink 1. In this study, type-2 diabetes patients were recruited from 2015 to 2018. Each subject was followed for 6 months, and the effectiveness of their assigned treatments was measured. There are four candidate single treatments: dipeptidyl peptidase-4 (DPP4), sulfonylurea (SU), metformin (Met), and thiazolidinedione (TZD), yielding 16 combination treatments in total. In the past decade, researchers have investigated combination treatments for type-2 diabetes patients [31,1,23], and interaction effects among these treatments have been evaluated. For example, [31] suggest that SU combined with DPP4 induces a higher risk of hypoglycemia compared with using the SU treatment alone. Therefore, it is essential to consider interaction effects in assessing the optimal ITR for type-2 diabetes patients.
In this dataset, 21 pre-treatment covariates were collected, including subjects' demographic information (e.g., age, BMI, gender, weight, height), diabetes-related health indices (e.g., high-density lipoprotein, low-density lipoprotein, hematocrit), and medical history (e.g., congestive heart failure, stroke, hypertension). To control for potential confounding, we use all of these covariates except the lower extremity arteries (LEA) variable, as the LEA value is the same for all subjects. The primary index to measure the effectiveness of treatment is the A1C, which reflects average blood glucose levels [12]. The normal A1C level is below 5.7%, and type-2 diabetes patients are generally above 6.5% [42]. The A1C levels are expected to decrease after the treatments are applied. Therefore, we use the negative change of A1C as our outcome, where a larger value indicates a better treatment effect.
In the implementation, we split the dataset into training (800), validation (200), and testing (139) sets. Since some combination treatments were assigned to fewer than 10 subjects, we perform stratified sampling to ensure that the training set includes all possible combination treatments. To validate the results, we repeat the sampling procedure and run the experiment independently 100 times, and report the averaged value function on the test sets. As in the simulation studies, we compare the proposed method with the same five competing methods.

Table 3 Real data application: mean and standard error of the value function using the proposed method with linear and nonlinear decision rules, and five competing methods: the outcome weighted learning with deep learning (OWL-DL, [16]), the L_1 penalized least squares (L_1-PLS, [29]), the outcome weighted learning with multinomial deviance (OWL-MD, [10]), and the multicategory outcome weighted learning with linear and kernel functions (MOWL-Linear and MOWL-Kernel, [41]).
Table 3 provides the means and standard errors of the value function. Our data analysis indicates that the proposed method under linear and nonlinear decision rules outperforms the competing methods, with higher value functions and smaller standard errors. Specifically, compared with the methods for multicategory treatment ITRs (all competing methods except OWL-DL), the proposed method improves the value function by 48.6%, 12.6%, 16.9%, and 3.8%, while reducing the standard errors by 81.2%, 51.8%, 57.3%, and 72.0%, respectively. The improvement is partially due to the proposed MLRWL framework, which requires estimating fewer decision rules than the multicategory ITR estimation methods; accordingly, a reduced standard error is also observed for OWL-DL [16]. Compared with OWL-DL [16], our proposed method improves the value function by 4.4% with a 29.7% decrease in standard error. This suggests that incorporating interaction effects in estimating the optimal ITR for combination treatments is essential and useful.

Conclusion and Discussion
In this paper, we investigate the efficient estimation of individualized treatment rules for combination treatments. Our main contributions are as follows. First, we formulate the value maximization problem as a multi-label classification problem, which greatly reduces the modeling complexity of the decision rules. Second, we propose a non-convex multi-label surrogate loss which can incorporate interaction effects among combination treatments at any intensity level. The proposed method has sound theoretical properties, including Fisher consistency and universal consistency. Third, we solve the non-convex minimization efficiently with the difference-of-convex algorithm, and achieve strong numerical performance in simulation studies and a real data example.
In combination treatment problems, the positivity assumption can be restrictive, especially in observational studies. Several directions merit further exploration. First, we could impose parametric assumptions on the interaction effects among combination treatments. If high-order interaction effects are absent, the treatment effects of a combination treatment can be identified from single treatments and lower-order combination treatments. We refer readers to [40] for a more comprehensive investigation in this direction. A second plausible solution is to identify the value functions with incremental propensity scores [14,43], which shift the propensity scores and recommend treatment assignment probabilities instead of deterministic treatments. This stochastic approach inherently avoids the positivity assumption; however, existing methods only apply to binary treatment problems, so extending incremental propensity scores to multiple or combination treatments is worth further investigation. Another potential solution is based on the pessimism principle [11], which optimizes lower confidence bounds instead of maximizing the point estimate of the policy value. This approach can also relax the positivity assumption, but it has not been studied in the combination treatment literature.

Appendix A
In this appendix, we provide the detailed derivations of the optimization problems for linear and nonlinear decision rules, and the technical proofs of the theoretical properties of the Multi-Label Residual Weighted Learning (MLRWL) method. In addition, we discuss the extension of our method to observational studies.

A.1. Derivation of the optimization problem of linear decision rules
Within each iteration, the subproblem can be formulated as a quadratic programming problem, where γ is associated with the penalty coefficient λ. By introducing the Lagrange multipliers θ_ik's and μ_i's, we obtain the Lagrange function L(β, θ, μ), where θ_ik ≥ 0 for all i = 1, ..., n, k = 1, ..., K, and μ_i ≥ 0 for all i = 1, ..., n. Taking the derivatives of L(β, θ, μ) with respect to the β_0k's, β_1k's, and η_i's and setting them to zero, the primal problem (16) can be transformed into its dual problem. The θ_ik's can be solved via a standard quadratic programming algorithm, and β_1k can be obtained from (17). By the Karush-Kuhn-Tucker conditions [3], β_0k can be recovered from the points satisfying θ_ik > 0 and η_i = 0. For numerical stability, we take the mean value of such β_0k's as the estimate [7].

A.2. Derivation of the optimization problem of nonlinear decision rules
Similar to the linear case, we decompose the loss function into convex and concave parts, but replace the linear decision rule with a nonlinear decision rule represented through a pre-specified kernel function K(·, ·). Within the t-th iteration, we solve a quadratic programming problem, where K = (K_ij)_{n×n} with K_ij = K(x_i, x_j), and K_i is the i-th row of K. Following a procedure similar to (17)-(19), we obtain the subproblem in the t-th iteration, and the standard quadratic programming algorithm can again be applied to solve (22) for the θ_ik's.
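For concreteness, the Gram matrix K above can be computed as follows. This sketch assumes a Gaussian RBF kernel, one common choice of pre-specified kernel (the paper does not commit to a specific one here):

```python
import numpy as np

def rbf_kernel(X, Z, sigma=1.0):
    """Gram matrix K_ij = exp(-||x_i - z_j||^2 / (2 sigma^2))."""
    sq = (np.sum(X**2, axis=1)[:, None]
          + np.sum(Z**2, axis=1)[None, :]
          - 2.0 * X @ Z.T)                 # pairwise squared distances
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))
```

Clamping the squared distances at zero guards against small negative values from floating-point cancellation.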

A.3. Algorithm Convergence
In this section, we show that the convergent points of Algorithm 1 are stationary points.
The level set condition for β^(0) is a standard assumption in the convergence analysis of non-convex programming [15]. Note that Proposition 1 does not exclude the possibility of local optima and saddle points, so the global optimum is not guaranteed. In practice, we can try multiple random initializations and select the one that achieves the best performance on the validation set.
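The difference-of-convex iteration analyzed here can be sketched on a toy one-dimensional DC objective. The function names, the toy objective $L(x) = x^2 - |x|$, and its decomposition into $L_{vex}(x) = x^2$ and $L_{cave}(x) = -|x|$ are purely illustrative, not the paper's actual loss:

```python
import numpy as np

def dca(grad_cave, solve_vex, x0, tol=1e-8, max_iter=100):
    """Difference-of-convex algorithm: at step t, linearize the concave
    part at x^(t-1) and minimize L_vex(x) + <grad L_cave(x^(t-1)), x>."""
    x = x0
    for _ in range(max_iter):
        x_new = solve_vex(grad_cave(x))
        if np.max(np.abs(x_new - x)) < tol:
            return x_new
        x = x_new
    return x

# toy DC objective L(x) = x^2 - |x|  (L_vex = x^2, L_cave = -|x|)
grad_cave = lambda x: -np.sign(x)   # subgradient of the concave part
solve_vex = lambda g: -g / 2.0      # closed-form argmin of x^2 + g*x
x_star = dca(grad_cave, solve_vex, np.array(2.0))
```

Each convex subproblem plays the role of the quadratic program in A.1/A.2; here it happens to have a closed-form minimizer, so the monotone decrease of the objective is easy to observe.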
Proof: First, since $\beta^{(t)}$ minimizes the convex subproblem $L_{vex}(\beta) + \nabla_{\beta} L_{cave}(\beta^{(t-1)})^{\top} \beta$, we have
$$L_{vex}(\beta^{(t)}) + \nabla_{\beta} L_{cave}(\beta^{(t-1)})^{\top} \beta^{(t)} \le L_{vex}(\beta^{(t-1)}) + \nabla_{\beta} L_{cave}(\beta^{(t-1)})^{\top} \beta^{(t-1)}.$$
After rearranging this inequality, we have
$$L_{vex}(\beta^{(t)}) - L_{vex}(\beta^{(t-1)}) \le \nabla_{\beta} L_{cave}(\beta^{(t-1)})^{\top} (\beta^{(t-1)} - \beta^{(t)}).$$
By the definition of the (sub)gradient $\nabla_{\beta} L_{cave}(\beta^{(t-1)})$ and the concavity of $L_{cave}$, we have
$$L_{cave}(\beta^{(t)}) - L_{cave}(\beta^{(t-1)}) \le \nabla_{\beta} L_{cave}(\beta^{(t-1)})^{\top} (\beta^{(t)} - \beta^{(t-1)}).$$
Based on the above two inequalities, we can derive
$$L(\beta^{(t)}) = L_{vex}(\beta^{(t)}) + L_{cave}(\beta^{(t)}) \le L_{vex}(\beta^{(t-1)}) + L_{cave}(\beta^{(t-1)}) = L(\beta^{(t-1)}),$$
which indicates that the sequence $\{L(\beta^{(t)})\}$ is monotonically decreasing.

A.5. Proof of Theorem 4.2
Following the steps in A.4, if we adopt the residual Y − g(X) as the weight, then we obtain similar conclusions, and the desired results follow.

A.6. Proof of Theorem 4.3
In this proof, we follow two steps. In the first step, we introduce the intermediate risks $R_g(f)$ and $R_g^*$ and build the connection between $R_g(f) - R_g^*$ and $R_{\psi,g}(f) - R_{\psi,g}^*$. In the second step, we establish the equivalence between $R_g(f) - R_g^*$ and $R(f) - R^*$ and conclude the results. We first introduce an intermediate risk given $g(X)$ and the 0-1 loss. By the proof of Lemma 4.1 and Theorem 4.2, it is sufficient to prove that $R_{\psi,g}(f) \ge R_g(\mathrm{sign}(f))$ to establish the first excess risk bound. Note that for any $f$, there exists only one combination treatment $a_*$ such that $a_*^{(k)} f^{(k)}(x) > 0$ for all $k$; for any other $a \ne a_*$, there exists some $k_0$ such that $a^{(k_0)} f^{(k_0)}(x) < 0$. Therefore, $\psi(a, f(x)) = 1$ for any $a \ne a_*$. Since $E[Y - g(X) \mid X = x, A = a_*] > 0$ and $\psi(a, f(x)) \ge I(a \ne \mathrm{sign}(f(x)))$ for any measurable $f$, we conclude that $R_{\psi,g}(f) \ge R_g(\mathrm{sign}(f))$, which establishes the first excess risk bound and concludes the results.

A.7. Proof of Theorem 4.4
First, let L(h, b) = (Y − g(X))/P(A|X) · ψ(Z^(1), ..., Z^(K)). For the minimizer of the empirical loss (6), we denote the corresponding estimators as h_n and b_n, respectively. By the definition of h_n and b_n, for any h^(k) ∈ H_K and b^(k) ∈ R, we have P_n(L(h_n, b_n)) ≤ P_n(L(h, b)), where P_n denotes the empirical measure of the observed data (x_i, a_i, y_i), i = 1, ..., n. Then lim sup_n P_n(L(h_n, b_n)) ≤ P(L(h, b)) = R_ψ(h + b) almost surely. Therefore, it suffices to show that P_n(L(h_n, b_n)) − P(L(h_n, b_n)) → 0 in probability to conclude the results.
In the following, we establish bounds for ∥h_n∥_K and b_n to control the complexity of the function space. Taking h = 0 and b = 0 in the above inequality yields the bound for ∥h_n∥_K.
To obtain the bound for b_n, we note that there exists some x_i such that |h_n(x_i) + b_n| < 1, from which the bound follows. Consequently, the desired convergence holds as nλ → ∞.

A.8. Estimation of Working Models of Treatment-free Effects and Propensity Score
In observational studies, the treatment assignment mechanism is usually unknown to practitioners. Therefore, it is essential to estimate the propensity score before estimating the ITR via (6). In this work, we utilize multinomial logistic regression to estimate the propensity score. Specifically, we first encode each combination treatment with a categorical code Ã_i ∈ {1, ..., 2^K}, and then maximize the multinomial likelihood; the estimated propensity score is the fitted probability P(Ã = ã | X). For the treatment-free effects, we assume a linear model and obtain an estimate by minimizing a propensity-weighted least-squares loss. For clinical trials with uniform random assignment, this loss reduces to its unweighted counterpart.
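A minimal sketch of the categorical encoding of combination treatments and of a multinomial logistic propensity model. The helper names `encode` and `fit_propensity` are our own, and the plain gradient-ascent fit stands in for whichever maximum-likelihood routine one would actually use:

```python
import numpy as np

def encode(A):
    """Map each combination a in {-1, 1}^K to a code in {0, ..., 2^K - 1}."""
    bits = (A > 0).astype(int)
    return bits @ (2 ** np.arange(A.shape[1]))

def fit_propensity(X, codes, n_cat, lr=0.1, n_iter=2000):
    """Multinomial logistic regression via gradient ascent on the
    log-likelihood; returns fitted probabilities P(A = a | X)."""
    n, p = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])      # add intercept column
    W = np.zeros((p + 1, n_cat))
    Y = np.eye(n_cat)[codes]                  # one-hot labels
    for _ in range(n_iter):
        logits = Xb @ W
        P = np.exp(logits - logits.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)     # softmax probabilities
        W += lr * Xb.T @ (Y - P) / n          # log-likelihood gradient step
    return P
```

With uninformative covariates, the fitted propensities converge to the empirical treatment frequencies, matching the uniform-assignment special case mentioned above.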

A.9. Consistency of fn in Observational Study
In this section, we show the consistency of the proposed method in observational studies, where the propensity score model is also estimated from finite-sample data. The following theorem states the necessary assumptions and the consistency of the proposed estimator.
Theorem A.1. Suppose the penalty coefficient λ in the primal form (7) satisfies λ → 0 and nλ → ∞, and the weights |Y − g(X)|/P(A|X) are upper bounded by some positive constant M almost surely. Suppose the working model of the propensity score P(A|X; τ_n) is a uniformly consistent estimator of the true propensity score model, that is, ∥τ_n − τ∥ → 0 in probability, and it is bounded below by some constant ξ > 0 for any $X \in \mathcal{X}$ and $A \in \mathcal{A}$. Then for any distribution P of (X, A, Y), the ψ-risk of f_n converges in probability to the minimal ψ-risk over H_K + {1}, where f_n is the minimizer of the empirical loss (6) with the plug-in estimator of the propensity score P(A|X; τ_n), and H_K + {1} denotes the shifted reproducing kernel Hilbert space considered in Section 3.2.2.
Proof. We first introduce some notation for ease of derivation. We denote the propensity score model as P(A|X; τ), where P(·|·) specifies the functional form and τ is the associated parameter. The estimated propensity score is denoted as P(A|X; τ_n), where τ_n is the finite-sample estimator of τ. In addition, we define L(h, b, τ) = (Y − g(X))/P(A|X; τ) · ψ(Z^(1), ..., Z^(K)), and define L(h, b, τ_n) analogously with P(A|X; τ_n) in the denominator. Our expected result can be expressed as the convergence of the ψ-risk of (h_n, b_n) to the infimum over H_K + {1}. For the (≤) part, we can decompose the difference into five terms (I)-(V), where term (III) is non-positive by the definition of h_n and b_n, and term (V) goes to zero in probability by the weak law of large numbers. Therefore, we only need to consider the asymptotic behavior of terms (I), (II), and (IV).
For term (I), the convergence follows from the boundedness of (Y − g(X))/P(A|X; τ) and the fact that 0 ≤ ψ(h_n, b_n) ≤ 1. For term (II), we apply empirical process theory to prove the convergence; before that, we establish bounds for h_n, b_n, and τ_n to control the complexity, taking h = 0 and b = 0 as in the proof of Theorem 4.4. For term (IV), we first consider the upper bound of the difference P_n(L(h, b, τ_n) − L(h, b, τ)). Since τ_n → τ uniformly, for any ϵ > 0 there exists N_ϵ such that for all n > N_ϵ, |P(A_i|X_i; τ)/P(A_i|X_i; τ_n) − 1| < ϵ. Therefore, for n > N_ϵ, P_n(L(h, b, τ_n) − L(h, b, τ)) < Mϵ, which shows that term (IV) converges to zero as n goes to infinity. The desired results are concluded.

Fig 1. Illustration and comparison of the generalized hinge loss and the generalized ψ-loss in the 2-label classification scenario.

Fig 2. Interaction effects induced by the combination of two single treatments in different quadrants.

Table 2
Simulation studies: mean and standard error of the accuracy under the proposed method with linear and nonlinear decision rules, and five competing methods: the outcome weighted learning with deep learning (OWL-DL, 16), the L1-penalized least squares (L1-PLS, 29), the outcome weighted learning with multinomial deviance (OWL-MD, 10), and the multicategory outcome weighted learning with linear and kernel functions (MOWL-Linear and MOWL-Kernel, 41).

Tables 4 and 5 present the evaluation and comparison of our methods with the competing methods, demonstrating that our method still outperforms the competing methods in the observational study settings.

Table 4
Simulation studies: mean and standard error of the value function under the proposed method with linear and nonlinear decision rules, and five competing methods: the outcome weighted learning with deep learning (OWL-DL, 16), the L1-penalized least squares (L1-PLS, 29), the outcome weighted learning with multinomial deviance (OWL-MD, 10), and the multicategory outcome weighted learning with linear and kernel functions (MOWL-Linear and MOWL-Kernel, 41).

Table 5
Simulation studies: mean and standard error of the accuracy under the proposed method with linear and nonlinear decision rules, and five competing methods: the outcome weighted learning with deep learning (OWL-DL, 16), the L1-penalized least squares (L1-PLS, 29), the outcome weighted learning with multinomial deviance (OWL-MD, 10), and the multicategory outcome weighted learning with linear and kernel functions (MOWL-Linear and MOWL-Kernel, 41).