Additive regression for predictors of various natures and possibly incomplete Hilbertian responses

Abstract: In this paper we consider a fully nonparametric additive regression model for responses and predictors of various natures. This includes the case of Hilbertian and incomplete (like censored or missing) responses, and continuous, nominal discrete and ordinal discrete predictors. We propose a backfitting technique that estimates this additive model, and establish the existence of the estimator and the convergence of the associated backfitting algorithm under minimal conditions. We also develop a general asymptotic theory for the estimator such as the rates of convergence and asymptotic distribution. We verify the practical performance of the proposed estimator in a simulation study. We also apply the method to various real data sets, including those for a density-valued response regressed on a mixture of continuous and nominal discrete predictors, for a compositional response regressed on a mixture of continuous and ordinal discrete predictors, and for a censored scalar response regressed on a mixture of continuous and nominal discrete predictors.


Introduction
Data objects that are not Euclidean are now abundant in real-world problems, so that their analysis has become one of the important tasks in modern statistics. As part of this task, this paper provides a general structured nonparametric regression technique for Hilbert-space-valued (Hilbertian) responses coupled with various types of predictors. We consider Hilbert spaces since they form an important class of data spaces equipped with vector operations and an inner product structure that are vital to most regression tools. They cover a very wide scope of data types such as Euclidean, functional, density-valued and compositional data. Functional data means that each data object, corresponding to each subject, is itself a function. This type of data arises from every corner of our lives [35]. Density-valued data is a kind of functional data, but each data object is a nonnegative function that integrates to one on its domain. Examples include population age distributions in cities [14], the distributions of voxel-to-voxel correlation in fMRI signals [33,34] and the distributions of metabolite levels in groups of newborn babies [37]. Compositional data has Euclidean vectors as data objects whose entries are positive and sum to one. Examples are the proportions of votes earned by candidates in an election, the proportions of races/religions in cities/countries and the proportions of chemical materials constituting bodies/air/sea-water/soil [32,9]. For formal definitions of such data objects, see Section 2.1.
There have been a few attempts at dealing with Hilbertian responses. A broad review of regression for functional responses may be found in [42]. Other works include a parametric technique for compositional responses [38], one for density-valued responses [37], and Nadaraya-Watson smoothing [10] and k-nearest neighbor estimation [21] for Hilbertian responses. The latter two works concern nonparametric regression but are based on full-dimensional (i.e., unstructured) modeling, so that they suffer from the curse of dimensionality as the number of predictors increases. Recently, nonparametric additive regression has been developed for Hilbertian responses [15]. Additive modeling is known to be an efficient way of avoiding the dimensionality problem. None of these works on nonparametric regression covers discrete predictors, however. In particular, there has been no nonparametric method dealing with density-valued or compositional responses together with discrete predictors.
In real-world regression problems, discrete predictors are abundant. In many cases it is how to model the effects of discrete predictors, rather than continuous predictors, that determines overall prediction performance. In nonparametric regression, the usual practice is to assume that the effects of discrete predictors are linear, and thus to incorporate them into a partially linear [e.g., 36,45] or a varying coefficient model [e.g., 13,19,20]. This approach certainly lacks flexibility since it cannot accommodate nonlinear effects, which may result in poor practical performance as illustrated in Section 5. To the knowledge of the authors, a systematic nonparametric approach to identifying possibly nonlinear effects of discrete predictors on Hilbertian responses does not yet exist. The problem remains unexplored despite its importance in real-world problems.
The aim of the current paper is to develop a nonparametric additive regression approach for general Hilbertian responses that also enhances flexibility in modeling the effects of various types of predictors. Our approach treats the effects of both continuous and discrete predictors as fully nonparametric. The predictors in our model can be general objects, including nominal and ordinal discrete variables as well as continuous ones. Hence, our coverage includes the case where only continuous predictors are considered [15]. In addition, we allow for incompletely observed responses; by 'incomplete' we mean 'censored' or 'missing'. In particular, our setup covers the case of missing data with general Hilbertian responses. This means that we can deal with missing functional responses, missing density responses, missing compositional responses and so on, and as such it generalizes many current advances in the analysis of missing data.
To introduce our model, let H denote a separable Hilbert space and Y be a possibly incompletely observed H-valued random element. Let W = (X, U, V) be mixed predictors taking values in an appropriate space W, where X = (X_1, . . . , X_{d_x}) is a vector of real-valued continuous predictors, U = (U_1, . . . , U_{d_u}) is a vector of nominal discrete predictors, and V = (V_1, . . . , V_{d_v}) is a vector of discrete predictors for which each V_j takes values in a metric space with finite cardinality. Here, some of d_x, d_u or d_v are allowed to be zero as long as d_x + d_u + d_v ≥ 1. Let ε be an H-valued error such that E(ε|X, U, V) = 0, where 0 is the zero vector in H, and the conditional expectation is defined in terms of the Bochner integral [6]. We consider the following additive model:

Y = m_0 ⊕ (⊕_{j=1}^{d_x} m_{x,j}(X_j)) ⊕ (⊕_{j=1}^{d_u} m_{u,j}(U_j)) ⊕ (⊕_{j=1}^{d_v} m_{v,j}(V_j)) ⊕ ε.  (1.1)

To estimate the model we develop a smooth backfitting (SBF) technique; SBF was originally introduced in [25]. In comparison with other structured nonparametric methods such as marginal integration [22] and ordinary backfitting [31], the SBF technique was proved to have practical advantages [29] as well as theoretical superiority in various structured nonparametric models [e.g., 26,23,46,18,12]. All the aforementioned works are for completely observed real-valued responses regressed on continuous predictors only.
In our theoretical developments, we first prove the existence of the SBF estimator and the convergence of the associated SBF algorithm in a very general setup where predictors take values in arbitrary σ-finite measure spaces. The general treatment allows not only the three types of predictors in the model (1.1) but also other predictors such as functional and manifold-valued predictors. The SBF algorithm, which evaluates the SBF estimator, requires the estimation of the marginal densities and the marginal regression maps of the predictors. In the existing SBF literature, these are confined to kernel-based estimators. In our theory for the existence of the estimator and the convergence of the algorithm, they are not restricted to this type. Instead, we formulate high-level conditions that are minimally required of the estimators of the marginal densities and marginal regression maps. The conditions allow parametric or nonparametric estimators that need not be kernel-based but could be spline- or wavelet-based. Hence, our theory opens the possibility of developing non-kernel-based SBF techniques. Also, the minimal conditions do not require the data to be identically distributed or independent. Thus, the theory even allows dependent data such as sequential/spatial data. We believe that this new framework makes an important contribution to various future studies, including those on non-kernel-based SBF methods for various structured nonparametric regression models.
Second, we derive the rates of convergence and the asymptotic distribution of the estimator of the model (1.1). The theoretical developments here are based on kernel weighting schemes for the estimation of the marginal densities and the marginal regression maps. They are much more complex than in [15]. The latter work considers only continuous predictors and completely observed responses, so that all stochastic terms in the asymptotic expansion of their estimator are based on kernels of the same type and bandwidths of the same size. This enables one to apply a unified and relatively simple approach to deriving the theoretical results. In our case, however, there are many more stochastic terms of different natures, arising from different kernel weighting schemes for different types of predictors with smoothing parameters of different rates, and they involve errors due to incompleteness in the response variables. In addition, the component maps for the discrete predictors are not differentiable, so that the techniques dealing with such terms are quite different from those for continuous predictors. Thus, various sophisticated technical tools are required in our new setting; see the technical details in Appendix A.7.
In Section 2 we describe our methodology in the above general framework. In Section 3 we obtain minimal conditions on the estimators of the marginal densities and regression maps under which we prove the existence of the SBF estimator and the convergence of the SBF algorithm in the general framework. In Section 4 we then specialize these results to the model (1.1), and investigate further the asymptotic properties of the corresponding SBF estimator. We treat in Section A.1 the case where there is no continuous predictor. In Section 5 we demonstrate the numerical superiority of our method and its usefulness in real problems. Auxiliary theoretical results and all proofs are in the Appendix.

Some examples of Hilbert spaces and vector operations
We give three examples of separable Hilbert spaces. These spaces and Euclidean spaces are the spaces we consider for the response variable Y in our numerical study.
(i) L² space. Let S be a Borel subset of R^k. Consider the space L²(S) of square integrable functions defined on S. For this space the zero vector 0 is the identically zero function, and for a scalar c ∈ R and two functions f = f(·) and g = g(·), the vector addition f ⊕ g and scalar multiplication c ⊙ f are defined by (f ⊕ g)(·) = f(·) + g(·) and (c ⊙ f)(·) = c · f(·). The inner product and norm are ⟨f, g⟩ = ∫_S f(s)g(s) ds and ‖f‖ = ⟨f, f⟩^{1/2}.
(ii) The space of density functions. Consider the space of probability density functions f supported on a Borel subset S of R^k with finite Lebesgue measure such that ∫_S (log f(s))² ds < ∞. For this space, the zero vector 0 is the constant density f_0(·) ≡ (Leb_k(S))^{-1}, where Leb_k denotes the k-dimensional Lebesgue measure. For a scalar c ∈ R and two densities f = f(·) and g = g(·), the vector addition f ⊕ g and scalar multiplication c ⊙ f are defined by

(f ⊕ g)(·) = f(·)g(·) / ∫_S f(s)g(s) ds,  (c ⊙ f)(·) = f(·)^c / ∫_S f(s)^c ds.

The inner product and norm are

⟨f, g⟩ = (2 Leb_k(S))^{-1} ∫_S ∫_S log(f(s)/f(t)) log(g(s)/g(t)) ds dt,  ‖f‖ = ⟨f, f⟩^{1/2}.

The space with this inner product forms an infinite-dimensional separable Hilbert space, as proved by [39].
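The density-space operations in (ii) are easy to check numerically. Below is a minimal sketch of these operations on a discretization grid; the grid size and helper names are our own choices, not from the paper, and S = [0, 1] so that the uniform density is the zero vector.

```python
import numpy as np

# Discretization grid on S = [0, 1]; the grid size is our own choice.
n_grid = 1000
s = (np.arange(n_grid) + 0.5) / n_grid
ds = 1.0 / n_grid

def normalize(f):
    """Rescale a positive function so that it integrates to one on S."""
    return f / (np.sum(f) * ds)

def oplus(f, g):
    """Vector addition: (f ⊕ g)(s) = f(s)g(s) / ∫ f(t)g(t) dt."""
    return normalize(f * g)

def odot(c, f):
    """Scalar multiplication: (c ⊙ f)(s) = f(s)^c / ∫ f(t)^c dt."""
    return normalize(f ** c)

zero = np.ones(n_grid)   # the uniform density is the zero vector (Leb(S) = 1)

f = normalize(np.exp(-(s - 0.3) ** 2 / 0.02))
g = normalize(np.exp(-(s - 0.7) ** 2 / 0.02))

# Hilbert-space identities, up to discretization error:
assert np.allclose(oplus(f, zero), f)               # f ⊕ 0 = f
assert np.allclose(oplus(f, odot(-1.0, f)), zero)   # f ⊖ f = 0
```

The identities checked at the end are exactly the vector-space axioms for the zero element and the additive inverse in this geometry.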
(iii) The space of compositional vectors. Consider the space S^k = {(a_1, . . . , a_k) : a_j > 0 for all j, Σ_{j=1}^k a_j = 1}. For this space, the zero vector 0 is the compositional vector (1/k, . . . , 1/k) of equalized components. For a scalar c ∈ R and two compositional vectors a, b ∈ S^k, the vector addition a ⊕ b and scalar multiplication c ⊙ a are defined by

a ⊕ b = (a_1 b_1, . . . , a_k b_k) / Σ_{j=1}^k a_j b_j,  c ⊙ a = (a_1^c, . . . , a_k^c) / Σ_{j=1}^k a_j^c.

The inner product and norm are

⟨a, b⟩ = (2k)^{-1} Σ_{i=1}^k Σ_{j=1}^k log(a_i/a_j) log(b_i/b_j),  ‖a‖ = ⟨a, a⟩^{1/2}.

It is known that S^k with this inner product forms a (k − 1)-dimensional Hilbert space.
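These simplex operations (the Aitchison geometry) can likewise be sketched in a few lines. The centered-log-ratio (clr) form of the inner product used below is an equivalent rewriting of the double-sum formula; all function names are our own.

```python
import numpy as np

def closure(a):
    """Rescale a positive vector so that its entries sum to one."""
    a = np.asarray(a, dtype=float)
    return a / a.sum()

def oplus(a, b):
    """Perturbation: a ⊕ b = C(a_1 b_1, ..., a_k b_k)."""
    return closure(np.asarray(a) * np.asarray(b))

def odot(c, a):
    """Powering: c ⊙ a = C(a_1^c, ..., a_k^c)."""
    return closure(np.asarray(a) ** c)

def inner(a, b):
    """Aitchison inner product via centered log-ratios (equivalent to the
    (2k)^{-1} double-sum formula)."""
    la, lb = np.log(a), np.log(b)
    ca, cb = la - la.mean(), lb - lb.mean()   # clr transforms
    return float(ca @ cb)

k = 3
zero = np.full(k, 1.0 / k)                    # equalized composition
a = closure([0.2, 0.3, 0.5])
b = closure([0.6, 0.1, 0.3])

assert np.allclose(oplus(a, zero), a)                 # a ⊕ 0 = a
assert np.allclose(oplus(a, odot(-1.0, a)), zero)     # a ⊖ a = 0
assert abs(inner(a, b) - inner(b, a)) < 1e-12         # symmetry
```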

General setting
Here, we consider abstract predictors taking values in general σ-finite measure spaces. We take this route, rather than starting with the specific types of predictors in the model (1.1), for a simpler and clearer exposition of the main idea, and also to demonstrate the broad scope of predictor types we may cover. In technical terms, however, the general treatment does not follow directly from the existing literature; it requires a thorough investigation of the way the SBF idea works.
Let (Z_j, A_j, ν_j), 1 ≤ j ≤ d, be σ-finite measure spaces and let (Z, A, ν) denote their product, where A and ν = ⊗_{j=1}^d ν_j are the product σ-field and product measure, respectively. We let Z = (Z_1, . . . , Z_d) be a Z-valued predictor and ε be an H-valued error satisfying E(ε|Z) = 0 and E(‖ε‖²) < ∞, where ‖·‖ denotes the norm of H. We consider the following general additive model:

Y = m_0 ⊕ m_1(Z_1) ⊕ · · · ⊕ m_d(Z_d) ⊕ ε.  (2.1)

We assume that P Z^{-1} is absolutely continuous with respect to ν, and we write p = d(P Z^{-1})/dν for the joint density and p_j and p_jk for the marginal densities of Z_j and (Z_j, Z_k) with respect to ν_j and ν_j ⊗ ν_k. Here, (Z_{-jk}, A_{-jk}, ν_{-jk}) and (Z_{-j}, A_{-j}, ν_{-j}) are the respective product measure spaces resulting from omitting the (j, k)th and the jth measure spaces in (Z, A, ν), and z_{-jk} and z_{-j} are the respective vectors resulting from omitting (z_j, z_k) and z_j in z = (z_1, . . . , z_d).
To take into account various situations where Y is not completely observed, we consider a 'synthetic' or 'surrogate' response that replaces Y. Such a synthetic response can be obtained by some mean-preserving transformation of the 'observed' Y, which may involve the observed predictor Z. For instance, suppose that Y is subject to missingness. Let R = 0 if Y is missing, and R = 1 otherwise. Then, under the Hilbertian MAR (missing at random) condition, R ⊥ Y | Z, it holds that E((1/π(Z)) ⊙ Y* | Z) = E(Y | Z), where π(Z) = P(R = 1|Z), ⊙ denotes scalar multiplication on H, and Y* = Y if R = 1 and Y* = 0 otherwise. In this case, we may take ψ(Z, Y*) := (1/π(Z)) ⊙ Y* as a surrogate of Y. Another example arises in randomly right-censored regression, where Y is a real-valued survival time subject to censoring; see Example 2 in Section 4.
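The mean-preserving property of the surrogate (1/π(Z)) ⊙ Y* can be checked by simulation. The following sketch uses a scalar response and an illustrative data-generating design of our own; it is not the paper's estimator, only a numerical check of the identity E(ψ(Z, Y*)) = E(Y) under MAR with a known π.

```python
import numpy as np

# Monte Carlo check of the inverse-probability-weighted surrogate
# ψ(Z, Y*) = (1/π(Z)) ⊙ Y* under MAR, for a scalar response.
# All data-generating choices are illustrative only.

rng = np.random.default_rng(0)
n = 200_000

z = rng.uniform(size=n)
y = np.sin(2 * np.pi * z) + rng.normal(scale=0.3, size=n)  # E(Y|Z=z) = sin(2πz)

pi = 0.3 + 0.6 * z                    # π(z) = P(R = 1 | Z = z), known here
r = rng.uniform(size=n) < pi          # MAR: missingness depends on Z only
y_star = np.where(r, y, 0.0)          # Y* = Y if observed, 0 otherwise
psi = y_star / pi                     # the surrogate response ψ(Z, Y*)

# Mean preservation: E(ψ(Z, Y*)) = E(Y), up to Monte Carlo error.
assert abs(psi.mean() - y.mean()) < 0.02
```

In practice π is unknown and must be estimated, which is exactly the situation studied in Section 4.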
In case Y is not completely observed, we assume that there exist a completely observable variable Y* ∈ H and an H-valued transformation ψ satisfying

E(ψ(Z, Y*) | Z) = E(Y | Z).  (2.3)

The equation (2.3) may be used to estimate the additive model based on the observed values of ψ(Z, Y*). However, the transformation ψ usually contains unknown parameters or functions. In the missingness example discussed above, the conditional probability π is unknown. In the censoring example as well, the corresponding ψ involves the distribution function of the censoring variable, which is unknown; see Example 2. In such cases, we need to estimate ψ. We study the effect of the error in the estimation of ψ on the estimation of the additive model (1.1) in Section 4. Below, we describe our method and theory in terms of ψ(Z, Y*) and its estimator. In the case of completely observed Y, one may simply set ψ(Z, Y*) = ψ̂(Z, Y*) = Y.

General Bochner SBF estimation
For the estimation of the model (2.1) we first define some relevant spaces of H-valued measurable maps. For any measure space (S, Σ, λ), we define L²((S, Σ, λ), H) as the space of H-valued measurable maps f defined on S such that ∫_S ‖f(s)‖² dλ(s) < ∞. We note that the measure spaces on which the component maps m_j and the sum map ⊕_{j=1}^d m_j in (2.1) are defined correspond to (S, Σ, λ) = (Z_j, A_j, P Z_j^{-1}) and (S, Σ, λ) = (Z, A, P Z^{-1}), respectively.
For the identifiability of the component maps m_j in (2.1), we put the constraints E(m_j(Z_j)) = 0 for all 1 ≤ j ≤ d, which entails m_0 = E(Y). We note that these constraints can be written as Bochner integrals. The notion of Bochner integral generalizes that of the conventional Lebesgue integral to maps taking values in Hilbert or, more generally, in Banach spaces. By Propositions 2.1 and 2.2 in [15], the expected values E(m_j(Z_j)) and also the conditional expected values E(m_k(Z_k)|Z_j = z_j) for k ≠ j may be written as Bochner integrals:

E(m_j(Z_j)) = ∫_{Z_j} m_j(z_j) ⊙ p_j(z_j) dν_j(z_j),
E(m_k(Z_k)|Z_j = z_j) = ∫_{Z_k} m_k(z_k) ⊙ (p_jk(z_j, z_k)/p_j(z_j)) dν_k(z_k).  (2.4)

Here and below, we often write h ⊙ c for the scalar multiplication c ⊙ h of h ∈ H and c ∈ R. The representations at (2.4) are valid if (2.2) and the following assumptions on p_j and p_jk hold.

Condition (P). For all
From the first part at (2.4), the constraints on m_j are equivalent to

∫_{Z_j} m_j(z_j) ⊙ p_j(z_j) dν_j(z_j) = 0,  1 ≤ j ≤ d.  (2.5)

We assume E(‖ψ(Z, Y*)‖²) < ∞ and define

μ(z) = E(ψ(Z, Y*)|Z = z),  μ_j(z_j) = E(ψ(Z, Y*)|Z_j = z_j).  (2.6)

Here, the marginal regression map μ_j is not equal to the component map m_j. The model (2.1) with the constraints (2.5) and the representations of the conditional expectations at (2.4) entail that, under the assumptions (2.2) and (P),

μ_j(z_j) = m_0 ⊕ m_j(z_j) ⊕ ⊕_{k≠j} ∫_{Z_k} m_k(z_k) ⊙ (p_jk(z_j, z_k)/p_j(z_j)) dν_k(z_k),  1 ≤ j ≤ d.  (2.7)

Our method of estimating the component maps m_j, which we detail below, is based on the above system of Bochner integral equations. In fact, we estimate the unknown quantities μ_j(z_j), m_0, p_j(z_j) and p_jk(z_j, z_k) in the system of equations (2.7) to obtain our estimators of the component maps. For this, we consider general estimators of μ_j, m_0, p_j and p_jk that have no specific forms. We let p̂ be any nonnegative estimator of p satisfying

∫_Z p̂(z) dν(z) = 1.  (2.8)

An example of p̂ specialized for the predictors in (1.1) is given in Section 4.2.
Define a probability measure P̂ Z^{-1} on A by P̂ Z^{-1}(A) = ∫_A p̂(z) dν(z). We estimate the marginal densities by

p̂_j(z_j) = ∫_{Z_{-j}} p̂(z) dν_{-j}(z_{-j}),  p̂_jk(z_j, z_k) = ∫_{Z_{-jk}} p̂(z) dν_{-jk}(z_{-jk}).  (2.9)

We also let μ̂ be any estimator of μ, as defined at (2.6), satisfying the integrability conditions (2.10). The estimator μ̂ is a temporary estimator that induces our regression estimator, and it can be a full-dimensional estimator that does not take into account the additive structure of the model (2.1), such as the one considered in Section 4.2.
For the estimation of μ_j, we note that

μ_j(z_j) = ∫_{Z_{-j}} μ(z) ⊙ (p(z)/p_j(z_j)) dν_{-j}(z_{-j}).  (2.11)

Motivated by (2.11), we estimate μ_j by

μ̂_j(z_j) = ∫_{Z_{-j}} μ̂(z) ⊙ (p̂(z)/p̂_j(z_j)) dν_{-j}(z_{-j})  (2.12)

whenever the integral exists, and we set μ̂_j(z_j) = 0 otherwise. We note that the integral on the right hand side of (2.12) exists for ν_j-almost every z_j under the first condition at (2.10). Furthermore, μ̂_j is square integrable; the square integrability of μ̂_j is required in our theoretical developments, such as at (3.4) in Section 3, for example. For the unknown Hilbertian constant m_0 = E(Y) = E(μ(Z)) = ∫_Z μ(z) ⊙ p(z) dν(z), we choose m̂_0 = ∫_Z μ̂(z) ⊙ p̂(z) dν(z). Then, by (2.12) it holds that

∫_{Z_j} μ̂_j(z_j) ⊙ p̂_j(z_j) dν_j(z_j) = m̂_0.  (2.13)

Now, we define our estimator of μ = m_0 ⊕ ⊕_{j=1}^d m_j by

μ̂_+ = m̂_0 ⊕ ⊕_{j=1}^d m̂_j,  (2.14)

where (m̂_1, . . . , m̂_d) solves the system of equations

μ̂_j(z_j) = m̂_0 ⊕ m̂_j(z_j) ⊕ ⊕_{k≠j} ∫_{Z_k} m̂_k(z_k) ⊙ (p̂_jk(z_j, z_k)/p̂_j(z_j)) dν_k(z_k),  1 ≤ j ≤ d.  (2.15)

We note that (2.15) is an estimating equation obtained by substituting μ̂_j, m̂_0, p̂_j and p̂_jk for μ_j, m_0, p_j and p_jk in (2.7). In Section 3, we show that the Bochner integrals in (2.15) are well-defined and the system of equations has a unique solution μ̂_+ under weak conditions. We also demonstrate that each component m̂_j of μ̂_+ is uniquely determined as an estimator of m_j under some condition and the constraints

∫_{Z_j} m̂_j(z_j) ⊙ p̂_j(z_j) dν_j(z_j) = 0,  1 ≤ j ≤ d.  (2.16)

We note that (2.16) is an empirical version of (2.5). We call (2.15) the general Bochner smooth backfitting (gB-SBF) system of equations. We also call μ̂_+ and (m̂_1, . . . , m̂_d) the gB-SBF estimators of μ and (m_1, . . . , m_d), respectively.

General Bochner SBF algorithm
The gB-SBF estimators have no closed form. Hence, to evaluate them, we need an iteration scheme. For an initial estimator in the iteration scheme, we take (m̂_j^[0]) ≡ (0, . . . , 0), which obviously satisfies the constraints. Another option is to take m̂_j^[0] = μ̂_j ⊖ m̂_0, which also satisfies the constraints because of (2.13). Put μ̂_+^[0] = m̂_0 ⊕ ⊕_{j=1}^d m̂_j^[0]. For subsequent updates we apply the gB-SBF system of equations sequentially from j = 1 to j = d. A step-by-step procedure is described below, which we call the gB-SBF algorithm. For r ≥ 1 and 1 ≤ j ≤ d, given the previous updates, we compute

m̂_j^[r](z_j) = μ̂_j(z_j) ⊖ m̂_0 ⊖ ⊕_{k<j} ∫_{Z_k} m̂_k^[r](z_k) ⊙ (p̂_jk(z_j, z_k)/p̂_j(z_j)) dν_k(z_k) ⊖ ⊕_{k>j} ∫_{Z_k} m̂_k^[r−1](z_k) ⊙ (p̂_jk(z_j, z_k)/p̂_j(z_j)) dν_k(z_k),  (2.17)

where f ⊖ g := f ⊕ ((−1) ⊙ g). Ending: Stop the iteration if ∫_Z ‖μ̂_+^[r](z) ⊖ μ̂_+^[r−1](z)‖² dP̂ Z^{-1}(z) is smaller than a prescribed tolerance. The Bochner integrals at (2.17) are well-defined under the condition (S2) to be given in Section 3.1. Indeed, we may prove that the updates (m̂_j^[r]) remain square integrable.
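To make the iteration concrete, the following sketch runs backfitting updates of the above form in the simplest special case we can write down: a scalar response, d = 2 discrete predictors, and empirical frequencies in place of smoothed density estimators, so that the Bochner integrals reduce to finite sums. All design choices (sample size, effect sizes, number of iterations) are ours, not the paper's.

```python
import numpy as np

# gB-SBF-style backfitting for a scalar response and two nominal discrete
# predictors, with empirical frequencies as density estimators.

rng = np.random.default_rng(1)
n, c1, c2 = 5000, 3, 4
z1 = rng.integers(0, c1, size=n)
z2 = rng.integers(0, c2, size=n)
y = (1.0 + np.array([-1.0, 0.0, 1.0])[z1]
     + np.array([0.5, -0.5, 0.0, 0.0])[z2]
     + rng.normal(scale=0.2, size=n))

# Marginal and pairwise frequency estimators p̂_1, p̂_2, p̂_12.
p1 = np.bincount(z1, minlength=c1) / n
p2 = np.bincount(z2, minlength=c2) / n
p12 = np.zeros((c1, c2))
np.add.at(p12, (z1, z2), 1.0 / n)

# Marginal regression estimators μ̂_1, μ̂_2 and the mean m̂_0.
mu1 = np.bincount(z1, weights=y, minlength=c1) / (n * p1)
mu2 = np.bincount(z2, weights=y, minlength=c2) / (n * p2)
m0 = y.mean()

# Backfitting: apply the system of equations sequentially, as in (2.17);
# the integral against p̂_12/p̂_1 becomes a matrix-vector product.
m1, m2 = np.zeros(c1), np.zeros(c2)        # initial estimators (0, ..., 0)
for _ in range(50):
    m1 = mu1 - m0 - (p12 @ m2) / p1
    m2 = mu2 - m0 - (p12.T @ m1) / p2

# The empirical constraints Σ m̂_j(z_j) p̂_j(z_j) = 0 hold at every iteration.
assert abs(m1 @ p1) < 1e-8
assert abs(m2 @ p2) < 1e-8
```

With the zero initialization, one can check by induction that both constraints hold exactly after every update, mirroring the remark above that the initial estimators satisfy the constraints.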

gB-SBF for Euclidean and functional responses
The Euclidean and L² spaces are two common types of Hilbert spaces. The cases with compositional and density responses introduced in Section 2.1 may be treated within these spaces by applying suitable transformations. In case H = R^D for some D ≥ 1, the maps m̂_0, m̂_j and μ̂_j are R^D-valued and the gB-SBF system of equations (2.15) holds coordinatewise, which gives (2.18). Now, in case H is L²(S), the space of square integrable real-valued functions defined on S ⊂ R^D for some D ≥ 1, we may write m̂_j(z_j) = m̂_j(z_j, ·), where m̂_j : Z_j × S → R. Likewise, μ̂_j(z_j) = μ̂_j(z_j, ·) with μ̂_j : Z_j × S → R. Write m̂_0 = m̂_0(·) : S → R. Then, we may write the gB-SBF equations at (2.15) pointwise in s ∈ S, which gives (2.19). The gB-SBF systems of equations at (2.18) and (2.19) can be implemented by the gB-SBF algorithm described in Section 2.4 with the corresponding specializations of the operations ⊕ and ⊙.
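As a sketch of this coordinatewise reduction, the same discrete backfitting updates can be applied to a function-valued response evaluated on a grid, one coordinate at a time via array operations. The setting (two nominal discrete predictors, empirical frequency estimators) and all names are our own choices.

```python
import numpy as np

# Coordinatewise gB-SBF-style backfitting for an L²(S)-valued response,
# represented by its values on a grid of D points.

rng = np.random.default_rng(3)
n, c1, c2, D = 4000, 3, 4, 50
grid = np.linspace(0.0, 1.0, D)
z1 = rng.integers(0, c1, size=n)
z2 = rng.integers(0, c2, size=n)
# Function-valued response: curve amplitudes depend on z1 and z2.
y = (1.0
     + np.array([-1.0, 0.0, 1.0])[z1, None] * np.sin(2 * np.pi * grid)
     + np.array([0.5, -0.5, 0.0, 0.0])[z2, None] * np.cos(2 * np.pi * grid)
     + rng.normal(scale=0.2, size=(n, D)))

p1 = np.bincount(z1, minlength=c1) / n
p2 = np.bincount(z2, minlength=c2) / n
p12 = np.zeros((c1, c2))
np.add.at(p12, (z1, z2), 1.0 / n)

# Marginal regression estimates are now (c_j, D) arrays of curves.
mu1 = np.zeros((c1, D)); np.add.at(mu1, z1, y); mu1 /= (n * p1)[:, None]
mu2 = np.zeros((c2, D)); np.add.at(mu2, z2, y); mu2 /= (n * p2)[:, None]
m0 = y.mean(axis=0)

m1, m2 = np.zeros((c1, D)), np.zeros((c2, D))
for _ in range(50):                      # the same updates, coordinatewise
    m1 = mu1 - m0 - (p12 @ m2) / p1[:, None]
    m2 = mu2 - m0 - (p12.T @ m1) / p2[:, None]

# The empirical constraints hold for every coordinate of the function.
assert np.max(np.abs(p1 @ m1)) < 1e-8
assert np.max(np.abs(p2 @ m2)) < 1e-8
```

The only change relative to the scalar case is that each component map is stored as an array of curves, which is exactly the content of the reduction to (2.19).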

Theory for general predictors
In this section, we prove the existence of the gB-SBF estimators and the convergence of the gB-SBF algorithm in various modes. These are basic properties we need to establish before we study other statistical properties of the gB-SBF method. For other backfitting-based methods [e.g., 3,31,30,43], fairly strong conditions have been imposed to guarantee the corresponding existence and convergence results. Even the initial SBF work for scalar responses [25] assumes somewhat strong conditions to obtain such properties. We establish the basic properties under much weaker conditions, through a deep investigation of the SBF technique. Except for Theorem 3 and Corollary 1 in Section 3.3, the conditions are imposed on datasets rather than on the data generating model, and thus the corresponding theoretical results are non-asymptotic and valid for any dataset satisfying the conditions. Such data-specific results are more useful to practitioners since the data-specific conditions can be checked directly with the dataset at hand.

Minimal conditions
The system of Bochner integral equations at (2.15) involves the estimators μ̂_j, m̂_0, p̂_j and p̂_jk, all of which are determined by the estimators of the joint density p and regression map μ. We state a set of weak conditions for general estimators p̂ and μ̂, under which we prove the existence of the gB-SBF estimators and the convergence of the gB-SBF algorithm. The formulation of the conditions in this general framework, which boils down to those given below, is non-trivial since it requires careful investigation of every step of the way the SBF technique works theoretically.
Note that the condition (S2) is an empirical version of the condition (P). In Section 4.2, we specialize the conditions for the specific estimators we consider in the case of mixed predictors.

Existence of gB-SBF estimators
In this subsection, we prove the existence and the uniqueness of the gB-SBF estimators. Let S_H(p̂) be the 'sum-space' of additive maps f = f_0 ⊕ ⊕_{j=1}^d f_j with f_0 ∈ H and f_j ∈ L²((Z_j, A_j, P̂ Z_j^{-1}), H), which is the space in which we seek the solution μ̂_+ = m̂_0 ⊕ ⊕_{j=1}^d m̂_j of (2.15). To give a key idea for the proof of the existence of the solution, consider the objective functional F defined by

F(f) = E(‖ψ(Z, Y*) ⊖ f(Z)‖²) = ∫_Z ‖μ(z) ⊖ f(z)‖² ⊙ p(z) dν(z) + E(‖ψ(Z, Y*) ⊖ μ(Z)‖²).  (3.2)

The true regression map μ = m_0 ⊕ ⊕_{j=1}^d m_j is the minimizer of (3.2). By Theorem 5.3.19 in [4], μ satisfies DF(μ)(g) = 0 for all g ∈ L²((Z, A, P Z^{-1}), H), provided that the Gâteaux derivative DF(μ) : L²((Z, A, P Z^{-1}), H) → R of F at μ exists. In this case, one may verify that DF(μ)(·) ≡ 0 induces (2.7), which is a population version of (2.15).
Based on the above observation, we formulate the existence of the gB-SBF estimators satisfying (2.15) as the existence of a minimizer of the objective functional F̂ : S_H(p̂) → R defined by

F̂(f) = ∫_Z ‖μ̂(z) ⊖ f(z)‖² p̂(z) dν(z).

We note that F̂ is an empirical version of F with the last integral at (3.2), which is irrelevant in the minimization, omitted. The functional F̂ is well-defined by (S1), where μ̂_j is defined at (2.12). Formally, we prove the following theorem.

Theorem 1. Assume the conditions (S1) and (S2). Then, there exists a solution μ̂_+ = m̂_0 ⊕ ⊕_{j=1}^d m̂_j ∈ L²((Z, A, P̂ Z^{-1}), H) satisfying the system of equations at (2.15), and the solution is unique up to measure zero with respect to P̂ Z^{-1}. Furthermore, each component m̂_j is uniquely determined up to measure zero with respect to ν_j under the constraints (2.16), provided that p̂ > 0 on Z.

Convergence of gB-SBF algorithm
Here, we present the convergence of the iterates μ̂_+^[r] to μ̂_+. The result (Theorem 2) establishes that μ̂_+^[r] converges to μ̂_+ at a geometric speed, with a constant ĉ* > 0 that does not depend on r but only on p̂, μ̂ and the initial estimators (m̂_j^[0]). The next theorem is an asymptotic version of Theorem 2, which deals with the convergence of the individual components m̂_j^[r] to their respective targets m̂_j. For this, we introduce the following high-level conditions.

Condition (A).
The condition (2.8) and the first one at (2.10) hold with probability tending to one. Also, there exists a constant C > 0 bounding the relevant estimation errors, and the one- and two-dimensional density estimators satisfy the consistency requirements at (3.5). We note that the last condition at (3.5) holds for the initial component estimators m̂_j^[0] considered in Section 2.4. From the theorem we may deduce an almost-everywhere convergence of m̂_j^[r] to m̂_j with a geometric rate, with probability tending to one. This entails that, with probability tending to one, m̂_j^[r] converges to m̂_j almost everywhere with respect to ν_j, which gives the following corollary.

Theory for mixed predictors
This section deals with the specialization of the general results in Section 3 to the model (1.1). We assume that each X_j takes values in [0, 1], each U_j takes values in a finite set U_j, and each V_j takes values in a finite metric space (V_j, δ_j). The latter general setting allows V_j to be an ordinal discrete predictor or a continuous predictor on a fixed design. We let ν be the product of the measures Leb, C_{u,j} and C_{v,j} over the respective coordinates, where Leb is the Lebesgue measure on R, C_{u,j} is the counting measure on U_j and C_{v,j} is the counting measure on V_j. We continue to use p to denote the joint density of W with respect to ν, and use μ to denote the regression map E(ψ(W, Y*)|W = ·) in the same spirit as in (2.6), where ψ(W, Y*) is a synthetic response that satisfies (2.3) for Z = W and E(‖ψ(W, Y*)‖²) < ∞. In Section 4.2, we exemplify ψ(W, Y*).
The marginalization of p along the coordinates of interest in W defines the densities of X_j, U_j, V_j, (X_j, X_k), (U_j, U_k), (V_j, V_k), (X_j, U_k), (X_j, V_k) and (U_j, V_k). We denote them, respectively, by p_{x,j}, p_{u,j}, p_{v,j}, p_{xx,jk}, p_{uu,jk}, p_{vv,jk}, p_{xu,jk}, p_{xv,jk} and p_{uv,jk}. We denote the marginal regression maps, which correspond to μ_j in (2.6), by μ_{x,j}, μ_{u,j} and μ_{v,j}. Below we discuss the estimation of these marginal densities and regression maps.

Estimation of marginal densities and regression maps
To estimate the joint density p and regression map μ, we use kernel-based estimators. For this, we introduce a kernel weighting scheme for each of the three types of predictors. First, for smoothing across X_j ∈ [0, 1] with bandwidth h_j > 0, we use the normalized kernel

K_{h_j}(x_j, x'_j) = K_{h_j}(x_j − x'_j) / ∫_0^1 K_{h_j}(t − x'_j) dt whenever ∫_0^1 K_{h_j}(t − x'_j) dt > 0,

where K_h(·) = K(·/h)/h and K is a baseline kernel function, and we set K_{h_j}(x_j, x'_j) = 0 otherwise. Throughout this paper, we assume that K vanishes on R \ [−1, 1]. This kernel has been used in the SBF literature [e.g., 25]. We note that it has the normalization property

∫_0^1 K_{h_j}(x_j, x'_j) dx_j = 1 for all x'_j ∈ [0, 1].  (4.2)

Now, for smoothing across U_j we take a discrete kernel L_{λ_j} that assigns weight 1 − λ_j when u_j = u'_j and weight λ_j/(c_j − 1) otherwise, where λ_j ∈ [0, 1] is a smoothing parameter and c_j is the cardinality of U_j. This kernel was introduced by [1]. We note that L_{λ_j} has the normalization property

Σ_{u_j ∈ U_j} L_{λ_j}(u_j, u'_j) = 1 for all u'_j ∈ U_j.  (4.3)

Next, for smoothing across V_j with a metric δ_j we define a new metric-based discrete kernel W_{s_j}, where 0 ≤ s_j < 1 is a smoothing parameter that is sufficiently small. Basically, this kernel gives more weight to v_j as it gets closer to v'_j in the metric δ_j. It also has the normalization property

Σ_{v_j ∈ V_j} W_{s_j}(v_j, v'_j) = 1 for all v'_j ∈ V_j.  (4.4)

With these kernel weights we define the full-dimensional estimators p̂ and μ̂, the latter based on an appropriate estimator ψ̂ of ψ. We show that these p̂ and μ̂ satisfy the non-asymptotic condition (S1). Because of the normalization properties (4.2), (4.3) and (4.4), p̂ clearly satisfies (2.8). Also, the full-dimensional estimator μ̂ satisfies (2.10). To see this, for a vector x ∈ [0, 1]^{d_x} let x_{−j} denote the (d_x − 1)-vector resulting from omitting the jth entry of x, and likewise define u_{−j} and v_{−j} for u ∈ ∏_{j=1}^{d_u} U_j and v ∈ ∏_{j=1}^{d_v} V_j, respectively. The first condition at (2.10) is then satisfied by the normalization properties, and for the second condition the same type of bound applies to all the integrals involved. Moreover, we may get estimators of the marginal densities and regression maps by integrating p̂ and μ̂ over appropriate domains as in (2.9) and (2.12).
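The three weighting schemes can be sketched as follows. The normalized Epanechnikov-type continuous kernel and the discrete kernel of Aitchison and Aitken [1] follow the text; the concrete form chosen for W_s (weights proportional to s^{δ(v, v')}, normalized to sum to one) is our own reading, since the defining display is not reproduced above. Grid sizes and names are our choices.

```python
import numpy as np

def K_h(x, x_prime, h, n_grid=4000):
    """Normalized baseline kernel on [0, 1], cf. (4.2):
    K_h(x, x') = h^{-1}K((x - x')/h) / integral_0^1 h^{-1}K((t - x')/h) dt,
    with the Epanechnikov baseline K(t) = 0.75 (1 - t^2) on [-1, 1]."""
    K = lambda t: np.where(np.abs(t) <= 1.0, 0.75 * (1.0 - t ** 2), 0.0)
    tgrid = (np.arange(n_grid) + 0.5) / n_grid   # midpoint rule on [0, 1]
    denom = np.mean(K((tgrid - x_prime) / h) / h)
    return K((x - x_prime) / h) / h / denom

def L_lam(u, u_prime, lam, c):
    """Aitchison-Aitken kernel on a nominal support of cardinality c:
    weight 1 - lam if u = u', and lam/(c - 1) otherwise."""
    return 1.0 - lam if u == u_prime else lam / (c - 1)

def W_s(v, v_prime, s, support, dist):
    """Metric-based discrete kernel proportional to s^{dist(v, v')},
    normalized to sum to one over v (our reading of (4.4))."""
    denom = sum(s ** dist(w, v_prime) for w in support)
    return s ** dist(v, v_prime) / denom

# Normalization properties (4.2)-(4.4):
xg = (np.arange(4000) + 0.5) / 4000
assert abs(np.mean(K_h(xg, 0.05, 0.1)) - 1.0) < 1e-2   # boundary point x' = 0.05
assert abs(sum(L_lam(u, 2, 0.3, 5) for u in range(5)) - 1.0) < 1e-12
dist = lambda a, b: abs(a - b)
assert abs(sum(W_s(v, 1, 0.4, [0, 1, 2, 3], dist) for v in [0, 1, 2, 3]) - 1.0) < 1e-12
```

The boundary point x' = 0.05 in the first check illustrates why the denominator is needed: without it, kernel mass would leak outside [0, 1] near the boundary.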
In particular, from the normalization properties (4.2), (4.3) and (4.4) we get

p̂_{x,j}(x_j) = n^{-1} Σ_{i=1}^n K_{h_j}(x_j, X_{ij}),

and we obtain p̂_{u,j}(u_j) and p̂_{v,j}(v_j) simply by substituting the discrete kernel weights L_{λ_j}(u_j, U_{ij}) and W_{s_j}(v_j, V_{ij}), respectively, for K_{h_j}(x_j, X_{ij}). We also obtain the two-dimensional density estimator p̂_{xx,jk} when d_x ≥ 2 by integrating p̂ over (x_l : l ≠ j, k) ∈ [0, 1]^{d_x − 2}, and likewise p̂_{uu,jk}, p̂_{vv,jk}, p̂_{xu,jk}, p̂_{xv,jk} and p̂_{uv,jk}. Similarly, using the normalization properties (4.2), (4.3) and (4.4) again, we get the corresponding estimators of the marginal regression maps.
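As a sketch, the marginal density estimator p̂_{x,j} and its normalization can be checked numerically. The simulated design, grid sizes and helper names below are our own choices.

```python
import numpy as np

# Marginal density estimator p̂_x,j(x_j) = n^{-1} Σ_i K_h(x_j, X_ij) with the
# normalized Epanechnikov kernel on [0, 1].

def K_h(x, x_prime, h, n_grid=4000):
    """Normalized kernel, vectorized over a grid x and observations x'."""
    K = lambda t: np.where(np.abs(t) <= 1.0, 0.75 * (1.0 - t ** 2), 0.0)
    tgrid = (np.arange(n_grid) + 0.5) / n_grid
    denom = np.mean(K((tgrid[:, None] - x_prime) / h) / h, axis=0)  # per x'
    return K((x[:, None] - x_prime) / h) / h / denom

rng = np.random.default_rng(4)
x_obs = rng.beta(2.0, 3.0, size=500)           # observations X_ij in [0, 1]
h = 0.1
xg = (np.arange(4000) + 0.5) / 4000
p_hat = K_h(xg, x_obs, h).mean(axis=1)         # p̂_x,j on the grid

# Thanks to the normalization (4.2), p̂_x,j integrates to one on [0, 1].
assert abs(p_hat.mean() - 1.0) < 1e-2
```

Because each K_{h_j}(·, X_{ij}) integrates to one even near the boundary, the resulting density estimator integrates to one exactly, which is what makes p̂ satisfy (2.8).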

Existence and algorithm convergence
Here, we apply the general results in Theorems 1-3 and Corollary 1 in Section 3 to the model (1.1). In the gB-SBF system of equations (2.15), we consider the enumeration of the marginal regression estimators μ̂_{x,j}, μ̂_{u,j} and μ̂_{v,j} into μ̂_1, . . . , μ̂_{d_x+d_u+d_v}. Likewise, we enumerate the collection of p̂_{x,j}, p̂_{u,j} and p̂_{v,j} into p̂_1, . . . , p̂_{d_x+d_u+d_v}, and that of the two-dimensional density estimators p̂_{xx,jk}, . . . , p̂_{uv,jk} as well in an obvious manner. We let (m̂_{x,j}, m̂_{u,j}, m̂_{v,j}) be the solution of the resulting gB-SBF system of equations, and let m̂_{x,j}^[r], m̂_{u,j}^[r] and m̂_{v,j}^[r] be the rth updates in the resulting gB-SBF algorithm corresponding to m̂_{x,j}, m̂_{u,j} and m̂_{v,j}, respectively. For a more concrete description of the resulting gB-SBF system of equations and algorithm, we refer to the Appendix A.2. We also write μ̂_+ and μ̂_+^[r] as in Section 2. As we verified in the previous subsection, the full-dimensional kernel estimators p̂ and μ̂ satisfy the condition (S1). Below, we give a set of sufficient conditions on the smoothing parameters, the baseline kernel K and a dataset under which the non-asymptotic condition (S2), tailored to the case of mixed predictors, is valid. The sufficient conditions are actually minimal in the sense that they are required even for one-dimensional regression smoothing across [0, 1], U_j and V_j to be well-posed.

Corollary 2.
Assume the condition (S*). Then, the corresponding versions of Theorems 1 and 2 hold for m̂_{x,j}, m̂_{u,j}, m̂_{v,j}, μ̂_+ and μ̂_+^[r]. Now, we present the corresponding versions of Theorem 3 and Corollary 1. Recall that we do not impose the assumption of i.i.d. data for Theorem 3 and Corollary 1, but state the results with the higher-level condition (A). Here, we focus on the case of i.i.d. data, for which we present a set of sufficient conditions that imply the condition (A). Finding sufficient conditions for non-i.i.d. data is more challenging, particularly in Hilbert spaces, but may be solved using the techniques in [6], for example. The sufficient conditions under the i.i.d. assumption are given below.

Condition (B).
The joint density p is bounded away from zero and infinity on W. For all j, k, u_k and v_k, the partial densities p_{xu,jk}(·, u_k) and p_{xv,jk}(·, v_k) satisfy appropriate smoothness conditions. We note that the conditions (B2)-(B4) are standard in kernel smoothing theory. The condition on h_j in (B4) allows the optimal bandwidth rate h_j ≍ n^{−1/5} if α in (B1) is larger than 5/2. When Y is completely observed, we take ψ̂(W, Y*) = ψ(W, Y*) = Y, in which case (B1) reduces to a standard condition in kernel smoothing theory, and (B5) is automatically satisfied. Below, after the statement of a corollary, we give two other examples where the conditions (B1) and (B5) are satisfied.

Corollary 3. Assume the condition (B) and the version of the last condition at (3.5) corresponding to the mixed predictor case. Then, the corresponding versions of Theorem 3 and Corollary 1 hold for m̂_{x,j}, m̂_{u,j}, m̂_{v,j} and their iterates.
Example 1 (missing response). Suppose that Y is subject to missingness as in Section 2.2, and take ψ̂(W, Y*) = (1/π̂(W)) ⊙ Y*, where π̂ is an estimator of π. For the estimation of π, suppose that the predictors in V are real-valued and one applies logistic linear regression. Let β_j denote the regression coefficients in the logistic linear regression and β̂_j be their estimators. Then, under either (i) that Y is a bounded random element, or (ii) a suitable moment condition on Y, the consistency of β̂_j gives (B5). We note that the assumptions on β̂_j in (i) and (ii) are standard results in logistic linear regression.
Example 2 (censored response). Suppose that Y is a real-valued survival time subject to random right censoring by a censoring variable C, independent of (W, Y), with distribution function G. These conditions on C are commonly adopted in the literature on censored regression. Let T = Y ∧ C denote the observed time and Δ = I(Y ≤ C) denote the censoring indicator. In this case, we observe (W, T, Δ) and may take the synthetic response of [16], ψ(Y*) = (Δ/(1 − G(T−))) T. In this case, ψ does not depend on W and clearly satisfies (B1). Also, (B5) holds for ψ̂(Y*) = (Δ/(1 − Ĝ(T−))) T, where Ĝ is the Kaplan-Meier estimator of G. To see the latter, we note that the difference between ψ̂ and ψ is controlled by sup_{t≤τ} |Ĝ(t) − G(t)|. The standard theory in survival analysis [41] gives sup_{t≤τ} |Ĝ(t) − G(t)| = o_p(1), from which (B5) follows.
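The synthetic-response construction for censored data can be sketched as follows, with a hand-rolled Kaplan-Meier estimator of the censoring distribution. The simulation design (exponential survival and censoring times) is illustrative and our own; the check at the end is the mean-preservation property E(ψ) = E(Y).

```python
import numpy as np

# Synthetic response ψ̂ = Δ T / (1 - Ĝ(T-)) for right-censored scalar data,
# with Ĝ the Kaplan-Meier estimator of the censoring distribution G.

rng = np.random.default_rng(2)
n = 20_000
y = rng.exponential(scale=1.0, size=n)      # latent survival times Y, E(Y) = 1
c = rng.exponential(scale=10.0, size=n)     # censoring times C, indep. of Y
t = np.minimum(y, c)                        # observed time T = Y ∧ C
delta = (y <= c).astype(float)              # censoring indicator Δ = I(Y ≤ C)

def one_minus_G(s, t, delta):
    """Kaplan-Meier estimate of the left limit 1 - G(s-) = P(C >= s),
    treating Δ = 0 (a censored survival time) as the 'event' for C."""
    order = np.argsort(t, kind="stable")
    ts, dc = t[order], 1.0 - delta[order]
    at_risk = len(t) - np.arange(len(t))
    surv = np.concatenate(([1.0], np.cumprod(1.0 - dc / at_risk)))
    return surv[np.searchsorted(ts, s, side="left")]

psi = delta * t / one_minus_G(t, t, delta)  # synthetic responses

# Mean preservation, up to Monte Carlo and Kaplan-Meier estimation error.
assert abs(psi.mean() - 1.0) < 0.1
```

Only uncensored observations contribute nonzero synthetic responses, and each is inflated by the inverse censoring-survival probability, which restores the mean of Y in expectation.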

Rates of convergence
In this subsection, we demonstrate that the gB-SBF estimator does not have the dimensionality problem by showing that it achieves the optimal univariate error rate.
where ε is the error term at (1.1). To obtain the rates of convergence, we make use of the following assumptions, which we adopt here and in Section 4.4.

Condition (C).
The condition on p in (B2) holds. In addition, for all j, k, u_k and v_k, the corresponding marginal densities and component maps satisfy suitable smoothness conditions. (C4) The condition (B3) holds. When Y is completely observed so that ψ(W, Y*) = Y, the condition (C1) reduces to a standard condition in kernel smoothing theory, and it remains mild even when Y is incompletely observed as in Examples 1 and 2.

Theorem 4. Assume the condition (C). Then, the following hold for all j.

Remark 1.
We give some remarks on the magnitude of λ*, s* and a_n. One can show that the Nadaraya-Watson-type full-dimensional estimator m̃, based on the observations of U_i, V_i and completely observed Y_i, achieves the optimal error rate when λ* = O(n^{-1/2}) and s* = O(n^{-1/2}). Hence, it makes sense to assume that λ* = O(n^{-C_λ}) for some C_λ ≥ 2/5 and s* = O(n^{-C_s}) for some C_s ≥ 2/5. When Y is completely observed, the term a_n does not appear in the rates in Theorem 4. In the missing data setting (Example 1), we get a_n = n^{-c} for some c > 2/5 when E(‖Y‖^α) < ∞ for some α > 10, together with a further condition. In the censored data setting (Example 2), we have a_n = n^{-1/2}√(log n) when G is continuous, since the Kaplan-Meier estimator Ĝ satisfies sup_{t≤τ} |Ĝ(t) − G(t)| = O_p(n^{-1/2}√(log n)), as proved by [24]. Therefore, a_n = o(n^{-2/5}) under these mild conditions. Theorem 4 together with Remark 1 demonstrates that the gB-SBF estimator may achieve the optimal univariate rates of convergence even though there are multiple predictors. This is an important property in nonparametric inference, not shared by estimators based on full-dimensional approaches. The result is particularly notable since it establishes dimension-free convergence rates in the general data setting.

Asymptotic distribution
In this subsection, we present the asymptotic joint distribution of the gB-SBF estimator. For this, we make further assumptions. Let {e_l : 1 ≤ l ≤ L} denote an orthonormal basis of H. Our theory covers both L < ∞ and L = ∞.

Condition (D).

(D3) For all j, n^{1/5} h_j → α_j and n^{2/5} λ_j → β_j, and n^{2/5} s_j converges likewise.

The conditions on λ_j and s_j in (D3) are satisfied with the sizes of the smoothing parameters discussed in Remark 1. The condition on ψ̃ is also valid under the mild conditions given there. The remaining conditions in (D) are weak regularity conditions. To state the theorem, we need to introduce more terminology.
Let D²f : [0, 1] → L(R, L(R, H)) denote the second Fréchet derivative of f. The second derivative D²f(x) at x ∈ [0, 1], viewed either as a map from R to L(R, H) or as a map from R² to H, is defined accordingly, where α_j are the constants in (D3) and Δ_{x,j}, together with Δ_{u,j} and Δ_{v,j}, are defined in Appendix A.7.1. They constitute the asymptotic bias of the joint distribution of the estimated component maps, as demonstrated in Theorem 5 below. In fact, re-enumerated as (Δ_1, ..., Δ_{d_x+d_u+d_v}), the (d_x+d_u+d_v)-tuple is nothing other than the solution of a system of equations. Note the similarity between this system of equations and the earlier one. Let G(0, C_{j,x_j}) denote an H-valued Gaussian random element with mean 0 and covariance operator C_{j,x_j}. It is a random element such that ⟨G(0, C_{j,x_j}), h⟩ is normally distributed with mean 0 and variance ⟨C_{j,x_j}(h), h⟩ for all h ∈ H. When H = R, it reduces to a normal random variable.
We let L_{x,u,v} denote the joint distribution of (n^{2/5}(m̂_{x,j}(x_j) ⊖ m_{x,j}(x_j)) : 1 ≤ j ≤ d_x) and its analogues. Let m̂^{ora}_{x,j} be the oracle estimator of m_{x,j} obtained by using the knowledge of all other component maps. The asymptotic distribution of m̂^{ora}_{x,j} then follows. This means that m̂_{x,j} and m̂^{ora}_{x,j} have the same asymptotic covariance operator but differ in their asymptotic biases, so that the gB-SBF estimator achieves a 'semi-oracle property'. The difference of the asymptotic biases is (α_j² δ_{x,j}(x_j)) ⊙ Δ_{x,j}(x_j) =: β_j(x_j), and it holds that E(β_j(X_j)) = ∫₀¹ β_j(x_j) p_{x,j}(x_j) dx_j = 0 by (A.33) in the Appendix.

Numerical properties
In this section we report the results of simulation studies and real data applications. Details on the practical implementation of the gB-SBF algorithm, including smoothing parameter selection, can be found in Appendix A.3.

Simulation study
In the first simulation study, we compared our gB-SBF method with the SBF method based on partially linear additive models [45] (SBF-PLAM). Since the latter model can only deal with completely observed scalar responses, we considered the case ψ(W, Y*) = Y, where Y is a scalar response. Partially linear (additive) models are widely used when responses are real-valued and both continuous-type and discrete-type predictors are present. Hence, the comparison we made here is a meaningful check of how our new class of methods works in comparison with the existing class of methods. In the simulation, we estimated the following additive model: where X_1 and X_2 are independent uniform [0, 1] random variables, U_1 and U_2 are nominal discrete random variables taking values in {1, 2} and {1, 2, 3}, respectively, V_1 and V_2 are ordinal discrete random variables taking values in {1, 2, 3, 4} and {0, 0.25, 0.75, 1.5, 2.5}, respectively, and ε is a N(0, 0.5²) random variable. We took m_{x,1}(X_1) = sin(2πX_1), m_{x,2}(X_2) = cos(2πX_2), m_{u,1}(U_1) = −3·I(U_1 = 1) + 3·I(U_1 = 2) and m_{u,2}(U_2) = −5·I(U_2 = 1) + 0·I(U_2 = 2) + 5·I(U_2 = 3). For m_{v,1} and m_{v,2}, we considered two cases, linear and nonlinear. We note that the SBF-PLAM technique is designed for the case where m_{v,1} and m_{v,2} are linear, while the gB-SBF method is for general component maps. For the generation of U_1, U_2, V_1 and V_2, we considered two scenarios. To describe them, let M_k(q_1, ..., q_k) denote a k-variate multinomial distribution with sampling probabilities q_j ≥ 0 such that q_1 + ··· + q_k = 1. In both of the following scenarios, U_1, U_2, V_1 and V_2 are mutually independent.
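A data-generating step of the kind described above can be sketched as follows. The sampling probabilities of the discrete predictors and the ordinal component maps m_{v,1}, m_{v,2} were not fully recoverable here, so the sketch assumes uniform categories and centered linear ordinal maps; those two choices are our assumptions, not the paper's specification.

```python
import math
import random

def gen_observation(rng):
    """One draw from the (assumed) first simulation design."""
    x1, x2 = rng.random(), rng.random()              # X1, X2 ~ U[0, 1]
    u1 = rng.choice([1, 2])                          # nominal, assumed uniform
    u2 = rng.choice([1, 2, 3])
    v1 = rng.choice([1, 2, 3, 4])                    # ordinal, assumed uniform
    v2 = rng.choice([0, 0.25, 0.75, 1.5, 2.5])
    m = (math.sin(2 * math.pi * x1)                  # m_x1
         + math.cos(2 * math.pi * x2)                # m_x2
         + (-3.0 if u1 == 1 else 3.0)                # m_u1
         + {1: -5.0, 2: 0.0, 3: 5.0}[u2]             # m_u2
         + (v1 - 2.5)                                # hypothetical linear m_v1
         + (v2 - 1.0))                               # hypothetical linear m_v2
    y = m + rng.gauss(0.0, 0.5)                      # eps ~ N(0, 0.5^2)
    return (x1, x2, u1, u2, v1, v2, y)

def gen_sample(n, seed=0):
    rng = random.Random(seed)
    return [gen_observation(rng) for _ in range(n)]
```

The centering constants 2.5 and 1.0 are the means of the assumed uniform ordinal distributions, so the hypothetical component maps satisfy the zero-mean constraint.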
We note that the SBF-PLAM technique gains some efficiency, in comparison with the partially linear approach (without additive modeling for the effect of X), only under an additional condition. We generated a training sample of size n and a test sample of size N = 100 for M = 500 times. We computed the gB-SBF estimator based on the gB-SBF algorithm, and compared the gB-SBF and the SBF-PLAM via the mean squared prediction error (MSPE), computed over the test sample for the estimator fitted to the mth training sample. Table 1 gives the MSPE ratios of the gB-SBF relative to those of the SBF-PLAM. The table indicates that the prediction based on the gB-SBF estimator performs better than the one based on the SBF-PLAM estimator, even when m_{v,1} and m_{v,2} are linear. Our interpretation is that the SBF-PLAM procedure, after estimating the parametric and nonparametric parts based on a profiling method, does not update the estimators further, which might have degraded its performance. The same inferior performance might also arise with other methods based on a one-step update. In contrast, our gB-SBF method runs an iterative algorithm, as described in (A.3) in the Appendix, until convergence. In the nonlinear case, there is a large gap in MSPE between the SBF-PLAM and the gB-SBF methods, and the gap grows further as n increases. The results suggest that the gB-SBF procedure is a powerful option.
In the second simulation study, we considered a functional response that is observed at discrete time points with noise. We generated discrete time points T_{ik} uniformly for t ∈ [0, 1], where f_1, f_2 and f_3 are functions that make the component maps satisfy the constraints (2.5). We then generated Y_i(T_{ik}) according to the additive model, where ε_i are i.i.d. standard normal random variables and δ_i are i.i.d. random noises from N(0, 0.1²). We obtained Y_i = Y_i(·) by smoothing the observations. At this pre-smoothing stage, we used the kernel-weighting approach with the standard Gaussian kernel and bandwidths chosen by leave-one-out cross-validation. The measures of performance we chose are the integrated mean squared error (IMSE), integrated squared bias (ISB) and integrated variance (IV), where the estimate of m_s is based on the rth pseudo sample and ν_s is either the Lebesgue measure Leb or the counting measure, depending on s. Table 2 shows the estimation performance. The table demonstrates that the values of IMSE, ISB and IV decrease as the sample size increases. This indicates that our method, after a pre-smoothing procedure, works quite well even in the case of discretely and noisily observed functional responses.
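The pre-smoothing step just described can be sketched as follows: a Nadaraya-Watson smoother with a standard Gaussian kernel, with the bandwidth chosen by leave-one-out cross-validation. This is a simplified stand-in for the paper's procedure; the function names and the bandwidth grid are ours.

```python
import math

def nw_smooth(ts, ys, t, h):
    """Nadaraya-Watson estimate at t with a standard Gaussian kernel."""
    w = [math.exp(-0.5 * ((t - ti) / h) ** 2) for ti in ts]
    sw = sum(w)
    return sum(wi * yi for wi, yi in zip(w, ys)) / sw

def loo_cv_bandwidth(ts, ys, grid):
    """Pick the bandwidth minimizing the leave-one-out squared error."""
    best_h, best_err = None, float('inf')
    for h in grid:
        err = 0.0
        for i in range(len(ts)):
            ts_i = ts[:i] + ts[i + 1:]
            ys_i = ys[:i] + ys[i + 1:]
            err += (ys[i] - nw_smooth(ts_i, ys_i, ts[i], h)) ** 2
        if err < best_err:
            best_h, best_err = h, err
    return best_h
```

Each discretely observed curve would be smoothed with its own selected bandwidth, yielding the functional responses Y_i(·) fed into the gB-SBF algorithm.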

Real data analysis
In this subsection we analyse four datasets. They are the cases of a density-valued response, a compositional response, a missing scalar response and a randomly right-censored scalar response.

Density-valued response
Dose-response data contain several doses of drugs and their effects on a response variable. Dose-response data analysis is important in finding an appropriate dose. Here, we analyzed data from cytotoxicity experiments on the pediatric cancer Ewing sarcoma, obtained from the R package 'braidrm' by combining 'es1data', 'es8data' and 'ew8data' there. The data contain several toxicity levels of several drug types. For each drug type (U_1) and log(toxicity level) (X_1), multiple (ranging from 32 to 112) log-transformed CellTiter-Glo intensities of the tumor are obtained from Ewing sarcoma tumor cells. In the usual experimental data analysis, such multiple observations at each treatment (X_1, U_1) are aggregated into their sample mean or another statistic. However, such aggregation causes a huge loss of information. The multiple log-transformed CellTiter-Glo intensities at each (X_1, U_1) can be understood as a random sample from their conditional distribution given (X_1, U_1). Hence, based on the random sample, we estimated the conditional density of the log(intensity) by a kernel density estimator, and treated it as a density response Y ≡ Y(·). This procedure corresponds to the pre-smoothing in the usual functional data analysis, and similar procedures are adopted in the literature on density-valued data. With the procedure, we obtained the dataset {(Y_i(·), X_{1i}, U_{1i}) : 1 ≤ i ≤ n}, where n = 72 is the number of combinations of (X_1, U_1). Figure 1 shows the plot of the densities Y_i(·).
Estimating E(Y(·)|X_1, U_1) is very important since we can then estimate the conditional distributions of outcomes given new values of (X_1, U_1) without conducting new time-consuming and expensive experiments. Clearly, estimating the conditional distributions of outcomes gives much more information than estimating the conditional means. We believe that this approach will provide a useful tool in experimental data analysis. The same idea can also be applied to various other prediction problems, such as predicting the distribution of survival times at each treatment, the distribution of sales under each sale condition, the distribution of income/housing prices under each national tax condition and the distribution of outputs/defect rates of an item under each process condition in a factory.
We note that our method is the only nonparametric method for density-valued responses and mixed predictors. To see how the discrete predictor U_1 helps in predicting Y(·), we compared the prediction performance of our estimator with those of the Nadaraya-Watson [10] and the k-nearest neighbor [21] estimators for Hilbertian responses. For the latter two nonparametric estimators, we used only the continuous predictor X_1. The measure of performance was the leave-one-observation-out average squared prediction error (ASPE), where the prediction of Y_i(·) is based on the sample without the ith observation and the distance between two density functions f(·) and g(·) supported on a Borel set S ⊂ R is taken with respect to the geometry of Section 2.1. We found that the value of ASPE was 23.29 for our estimator, 41.67 for the Nadaraya-Watson estimator and 36.61 for the k-nearest neighbor estimator. This reveals that the discrete predictor and the additive structure may improve the prediction accuracy substantially.

Figure 2 shows the estimated component maps m̂_{x,1} and m̂_{u,1}. The definitions of 0, ⊕ and ⊙ used to obtain m̂_{x,1} and m̂_{u,1} for this case can be found in Section 2.1. We note that, in Figure 2, each line or bar along the log(intensity) at each log(toxicity level) or drug type represents a density. The first plot indicates that, as the toxicity level decreases, the density of the log(intensity) gradually becomes left-skewed, while the density tends to be right-skewed as the toxicity level increases. This shows that a strong toxicity level tends to kill more tumor cells. It also demonstrates that the toxicity level around x_1 = −12 has an effect on the intensity of the tumor similar to those at higher levels (x_1 > −12), and thus indicates that the level x_1 = −12 is an appropriate dosage for such an effect. The second plot shows that, as the drug type moves from 'Temozolomide' to 'BMN 673', the density gradually becomes right-skewed. This reveals that 'BMN 673' is the most effective drug for the tumor.

Fig 2. Estimated component maps for 'toxicity level' (left) and 'drug types' (right).

Compositional response
It is a general belief that a political election is determined by population characteristics and underlying political orientation. Recently, [15] analyzed the 2017 Korea presidential election data to see the effects of these factors. In that study, however, the effect of underlying political orientation, which is believed to be one of the most important factors, could not be analyzed, because the earlier method can only deal with continuous predictors. In fact, there has been no nonparametric regression method for compositional responses and mixed predictors. This motivated us to analyze the data with the gB-SBF method.
The original dataset analyzed in [15] contains the election result and population characteristics for 250 electoral districts in Korea. For each electoral district as the subject unit, we have the proportions of votes earned by five candidates, people's average age (X_1), people's average years of education (X_2), average housing price (X_3) and people's average paid national health insurance premium (X_4). The variables X_3 and X_4 are measures of richness. Since the election mainly focused on who would be elected among the candidates from the three major parties, representing the progressive, conservative and middle parties, we considered the three-dimensional compositional vector Y = (Y_1, Y_2, Y_3), where Y_1, Y_2 and Y_3 are the proportions of votes earned by the progressive, conservative and middle parties, respectively, divided by the sum of the three proportions.
To incorporate the effect of the underlying political orientation, we added three discrete predictors V_1, V_2 and V_3 representing the numbers of congress members from the progressive, conservative and middle parties, respectively, elected in the 2016 Korea parliamentary election. We excluded two electoral districts since there was a mismatch between the 2017 presidential and the 2016 parliamentary elections. We also removed two other cases, one with V_1 = 4 and the other with V_2 = 3, since those values are not well supported by the data. This resulted in a total of n = 246 observations with all V_j in the range {0, 1, 2}, which we actually used in our study.
To assess the prediction performance, we divided the 246 observations into 10 partitions S_k, 1 ≤ k ≤ 10, with each partition having 24 or 25 observations, and then computed the 10-fold average squared prediction error (ASPE), where |S_k| is the number of observations in S_k and Ŷ_i is the prediction of Y_i based on the sample without the observations in S_k. Here, the distance between two compositional vectors a = (a_1, a_2, a_3) and b = (b_1, b_2, b_3) is taken with respect to the geometry of Section 2.1. To see how the discrete predictors (V_1-V_3) help in predicting Y, we compared the ASPE of our method, which was based on the seven predictors of mixed types (X_1-X_4 and V_1-V_3), with that of the B-SBF estimator [15], which was based on the four continuous predictors. To see the effect of the additive structure, we also included in the comparison the full-dimensional Nadaraya-Watson-type estimator [27], which was also based on the four continuous predictors. We note that there exists no nonparametric or semiparametric competitor designed for compositional responses with mixed predictors. We found that the value of ASPE was 0.31 for the gB-SBF, 0.82 for the B-SBF estimator and 0.99 for the full-dimensional Nadaraya-Watson-type estimator. This reveals that the discrete predictors and the additive structure are helpful in predicting Y, confirming the general belief that political orientation is an important factor in election results. The estimated component maps also show that the measures of richness have some effect, though the strength is much lower than that of age. As for education level, whose effect is shown in the top-right map, people get more conservative as they become more educated up to a medium-high level, and the tendency reverses from that level to the highest. The estimated component maps in Figure 4 suggest that a larger number of members in the parliament does not always lead to a higher proportion of votes. This indicates that there is some non-monotone relationship between V = (V_1, V_2, V_3) and Y, contrary to general expectation, which one would not detect with the other methods.
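The k-fold ASPE machinery used here can be sketched generically. In the sketch below, `predict` is a hypothetical stand-in for the fitted predictor refitted without the held-out fold, and `dist2` is the squared distance of the response space (Euclidean for illustration; for compositional responses one would plug in the squared Aitchison distance).

```python
def kfold_aspe(Y, predict, dist2, k=10):
    """k-fold average squared prediction error.
    predict(train_idx, i): prediction of Y[i] from the indices in train_idx.
    dist2(a, b): squared distance between two responses."""
    n = len(Y)
    folds = [list(range(j, n, k)) for j in range(k)]   # simple interleaved folds
    total = 0.0
    for fold in folds:
        train = [i for i in range(n) if i not in fold]
        for i in fold:
            total += dist2(Y[i], predict(train, i))
    return total / n

def euclid2(a, b):
    """Squared Euclidean distance, used for illustration."""
    return sum((p - q) ** 2 for p, q in zip(a, b))
```

A perfect predictor gives ASPE zero, and a constant predictor gives the average squared distance of the responses to that constant.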

Missing scalar response
We analyzed the 'ACTG175' data in the R package 'speff2trial' (Version 1.0.4). This dataset contains the treatment history of 2,139 patients infected by HIV type I. The prognosis of the treatment depends on factors, known at the beginning of the treatment, such as the CD4 T cell count at baseline (X_1), hemophilia (yes or no, U_1), history of intravenous drug use (yes or no, U_2), race (white or non-white, U_3), type of treatment (zidovudine only or others, U_4), Karnofsky score ({70, 80, 90, 100}, V_1) and antiretroviral history stratification ({1, 2, 3}, V_2). Here, higher V_1 means healthier. For the values of antiretroviral history stratification, V_2 = 1 means no prior antiretroviral therapy, V_2 = 2 means a prior antiretroviral therapy for a period of at most 52 weeks, and V_2 = 3 means a prior antiretroviral therapy for a period of more than 52 weeks. We took, as the response variable Y, the CD4 T cell count observed at 96 ± 5 weeks after the observation of X_1. The response variable has 797 missing observations, caused by the health condition of patients and/or the cessation of the treatment.
We applied the gB-SBF method with the correction for missingness detailed in Example 1 (gB-SBF-correct), and also applied the method to the dataset consisting of non-missing observations only (gB-SBF-non-missing). There were 9 observations with V_1 taking the value 70 in the dataset, among which 7 had missing responses. We deleted them from the dataset, so that we could apply a 10-fold cross-validation to the remaining dataset of size 2,130 to compute the prediction error, where S_k, 1 ≤ k ≤ 10, are partitions each of size 213, n_k is the number of non-missing observations in S_k and Ŷ_i is the prediction of Y_i based on the 'gB-SBF-correct' or the 'gB-SBF-non-missing' applied to the observations not in S_k. The ASPE was 14,683 for the 'gB-SBF-correct' while it was 16,383 for the 'gB-SBF-non-missing'. This suggests that our correction for missing observations improves the prediction performance. The results also suggest that no hemophilia, using an intravenous drug, being white, getting a treatment other than zidovudine, having a higher Karnofsky score and/or getting a shorter prior antiretroviral therapy improve the prognosis.

Randomly right-censored scalar response
We next considered the 'BMT' data in the R package 'KMsurv'. In this dataset, there are 137 patients who received a bone marrow transplant as a treatment for their acute leukemia. The prognosis of the transplant depends on some risk factors known at the time of transplantation, such as the patient's and donor's age and gender. To focus on the effects of age and gender, we considered four predictors to predict survival time: patient's age (X_1), donor's age (X_2), patient's gender (U_1) and donor's gender (U_2). The survival times of 56 patients were censored, so the censoring proportion is about 40%.
We applied to the dataset the gB-SBF method and the SBF method based on varying-coefficient models [44]. The latter models take the form of linear models, but the coefficients are functions of continuous predictors, and different coefficients should be functions of different continuous predictors. Hence, when there are not enough continuous predictors compared to the number of discrete predictors, it is not possible to apply the latter approach. For this dataset, there are only two models that the latter approach can deal with, where Y is the survival time as described in Example 2 in Section 4.2. We compared the prediction performance based on the average squared prediction error (ASPE), where n_1 = 81 is the number of uncensored observations and Ŷ_i is the prediction of Y_i based on the sample without the ith uncensored observation. We used the unbiased transformation given in Example 2 to obtain Ŷ_i. We also considered regression without the unbiased transformation, which ignores the censoring information.
The results are presented in Table 3. We find that the methods with the unbiased transformation are more predictive than those without it. The results also indicate that the proposed new class of methods, based on fully nonparametric modeling for both continuous and discrete predictors, works better than the existing class of methods based on the varying-coefficient models. The estimated component maps suggest that, if the patient or the donor is younger or female, then the patient survives longer. In particular, donor's age and gender are more significant factors than patient's age and gender. We remark that, as demonstrated in this application, the gB-SBF method allows easier interpretation than the methods based on varying-coefficient models, since we may assess the effects of the predictors separately.

Conclusion and discussion
In this paper, we propose and study a new class of methods for mixed predictors, which is the first fully nonparametric counterpart of the standard linear model. Our framework covers the cases of (in)completely observed Hilbertian responses, such as Euclidean, functional, density-valued and compositional responses, with various types of predictors. Our unified approach, based on the novel idea of backfitting the discrete predictors as well as the continuous predictors, possesses many advantages over the existing approaches. It is also the only nonparametric method for certain types of responses, such as density-valued and compositional responses. The proposed method is supported by a complete theory, and its superiority and wide spectrum of applications are illustrated via extensive numerical experiments. In particular, for the existence of the proposed estimators and the convergence of the algorithm materializing them, our theory is developed in full generality under minimal conditions. The conditions are data-specific and are valid in most cases, which is of great significance to practitioners. The general theory includes non-asymptotic results, works for predictors taking values in general σ-finite measure spaces and for general estimators of the main ingredients of the SBF methodology, such as the densities of the predictors and the marginal regression maps, and even allows dependent data. We believe that the general theory and methods we develop here will have a lasting influence on future study in the subject areas.
Our coverage of compositional responses does not include those with zero entries, since spaces that contain such compositional vectors do not form a Hilbert space. For such compositional data with zero entries, one may apply the parametric approach in [38], for example. For density-valued responses with continuous predictors, [11] considered an additive model, but on the transformed conditional Fréchet mean via the log-quantile and log-hazard transformations studied in [33]. It differs from our additive model, which assumes additivity directly on the conditional mean based on the 'Aitchison' geometry given in Section 2.1. The target in [11] is to estimate the conditional Fréchet mean minimizing the expected Wasserstein distance from Y, while our target is to estimate the conditional mean minimizing E(‖Y ⊖ ·‖²) with the Aitchison norm ‖·‖.

Appendix
Here, we first present the results for the case of no continuous predictor. Then, we provide some additional details on the methodology and its implementation, followed by additional theoretical results and all technical proofs.

A.1. Case of no continuous predictor
Recall that the results in Section 4 are valid as long as d_x ≥ 1 and d_x + d_u + d_v ≥ 2, even in the cases where d_u or d_v equals zero, with trivial modification. Here we complement Section 4 by adding some results for the case where there is no continuous predictor. In the latter case the rates of convergence are different from the case d_x ≥ 1, so that it is of theoretical interest. It is also important in practice since we encounter many occasions where we only have discrete predictors. For brevity, we state the results here for the case d_u, d_v ≥ 1. However, the results hold as long as d_u + d_v ≥ 2, even if d_u or d_v is zero, as is clearly seen with trivial modification.
In the case where d_x = 0, the corresponding additive model is the one where E(m_{u,j}(U_j)) = E(m_{v,j}(V_j)) = 0 for all j and E(ε|U, V) = 0. The new definitions of m̂_0, m̂_{u,j}, m̂_{v,j}, m̂^{[r]}_{u,j} and m̂^{[r]}_{v,j} are immediate by omitting the factor ∏_{j=1}^{d_x} K_{h_j}(x_j, X_{ij}) in the definition of κ_i(w) in Section 4. In this case, we put μ̂_+ = μ̂_{U,V}. We first note that the corresponding versions of Theorems 1 and 2 hold under the conditions (S3*) and (S4*); for these, we do not need i.i.d. data. However, for the corresponding versions of Theorem 3 and Corollary 1, and for Theorem 6 below, we assume i.i.d. data for brevity. Let p_uv be the joint density of (U, V) with respect to ⊗_{j=1}^{d_u} C_{u,j} ⊗ ⊗_{j=1}^{d_v} C_{v,j}, where C_{u,j} and C_{v,j} are the counting measures on U_j and V_j, respectively.

Condition (B * ).
(B1*) The joint density p_uv is strictly positive.

We note that (B3*) is weaker than the condition (B5), and under the condition (B*) the corresponding versions of Theorem 3 and Corollary 1 hold. The next theorem shows that the discrete gB-SBF estimator for the model (A.1) may also achieve the univariate error rate.
hold for some sequence b_n. Assume the conditions (B1*) and (B2*). Then, the stated rate holds for all j.

In the simulation study we present below, we compared the discrete gB-SBF estimator μ̂_{U,V} defined at (A.2) with the full-dimensional discrete kernel estimator μ̃_{U,V} defined as in (4.9). Since μ̃_{U,V} is model-free, our focus in this comparison was to see how μ̂_{U,V} and μ̃_{U,V} perform in non-additive scenarios or in higher dimensions, as well as in additive and lower-dimensional settings.
We considered two scenarios, one with a 4-dimensional and the other with a 6-dimensional predictor. Here, m_{u,1} and m_{u,2} are the same as those in the first simulation, and m_{v,1} and m_{v,2} are the functions in the linear case. The predictors U_1, U_2, V_1 and V_2 are the same as those in (b), and ε is again a N(0, 0.5²) random variable. The additional ordinal discrete predictors V_3 and V_4 have the same distributions as V_1 and V_2, and m_{v,3} and m_{v,4} are the same as m_{v,1} and m_{v,2} in the nonlinear case of the first simulation. We note that ρ controls the departure from additivity. We took ρ = 0, 0.25, 0.5, 0.75 and 1 in both scenarios. We considered (5.1) as the measure of performance, with the same n, N and M as in the first simulation. Table 4 shows the results. It shows that the performance of μ̃_{U,V} gets worse as d_u + d_v increases. This indicates that, when the true model departs moderately from additivity or the number of discrete predictors is large, the discrete gB-SBF estimator can be a better option than the full-dimensional estimator.
Likewise, define μ̂_{t,+j}(·;·) and μ̂_{t,j+}(·;·) for t = u and v. Then, the system of equations defining the component estimators is given accordingly, together with the constraints corresponding to (2.16). Next, to express the gB-SBF algorithm, we collect the component maps (m̂_{x,1}, ..., m̂_{v,d_v}). The gB-SBF algorithm for the case of mixed predictors is then given by the iteration at (A.3).

A.3. Implementation and smoothing parameter selection
We may not evaluate the integrals in (A.3) with the usual numerical integration techniques since Bochner integrals are defined in an abstract way. To implement the gB-SBF algorithm, we adopt the following idea: for any measure space (S, Σ, λ), any integrable function f : S → R and any constant b in a Banach space, it holds that (∫_S f dλ) · b = ∫_S f(s) · b dλ(s), where the right-hand side is a Bochner integral. Because of this, the original gB-SBF algorithm at (A.3) based on Bochner integrals may be implemented through a simple iteration scheme based on Lebesgue integrals. Specifically, suppose that we take initial maps m̂^{[0]}_{t,ij} for t = x, u and v that satisfy ∫ (m̂^{[0]}_{x,ij}(x_j))² p̂_{x,j}(x_j) dx_j < ∞. The constraints (A.6) are not restrictive since we may set all initial weights w^{[0]}_{t,ij} as in Section A.2, for each 1 ≤ i ≤ n, and define μ̂_{t,+j} and μ̂_{t,j+} with ⊕ and ⊙ replaced by + and ×, respectively. Then, by making use of (A.4), we may verify the equivalence of the two schemes. In the case where H is a space of compositional vectors or density functions, the ⊕ and ⊙ operations in (A.4) and (A.6) may be performed through the usual Euclidean addition + and scalar multiplication × via centered log-ratio (clr) transformations. In case H = S^k for some k ∈ N, the transformation clr : S^k → R^k and its inverse clr⁻¹ act on (a_1, ..., a_k) componentwise through logarithms centered at the geometric mean.
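The iteration scheme that the transformation makes possible can be illustrated with a much-simplified example: classical backfitting for a real-valued response with one continuous and one discrete predictor, cycling kernel smoothers over partial residuals and re-centering each component. This is our simplification for illustration, not the full gB-SBF update with its smoothed-backfitting integrals.

```python
import math

def nw(xs, rs, x, h):
    """Nadaraya-Watson smoother with a Gaussian kernel."""
    w = [math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in xs]
    return sum(wi * ri for wi, ri in zip(w, rs)) / sum(w)

def backfit(xs, us, ys, h=0.1, n_iter=30):
    """Simplified backfitting for y = m0 + m_x(x) + m_u(u) + eps."""
    n = len(ys)
    m0 = sum(ys) / n
    mx = [0.0] * n                       # m_x evaluated at the sample points
    mu = {u: 0.0 for u in set(us)}       # m_u as a lookup table
    for _ in range(n_iter):
        # update m_x on the partial residuals y - m0 - m_u(u)
        r = [ys[i] - m0 - mu[us[i]] for i in range(n)]
        mx = [nw(xs, r, xs[i], h) for i in range(n)]
        mx_bar = sum(mx) / n
        mx = [v - mx_bar for v in mx]    # enforce the centering constraint
        # update m_u on the partial residuals y - m0 - m_x(x)
        r = [ys[i] - m0 - mx[i] for i in range(n)]
        for u in mu:
            idx = [i for i in range(n) if us[i] == u]
            mu[u] = sum(r[i] for i in idx) / len(idx)
        mu_bar = sum(mu[us[i]] for i in range(n)) / n
        for u in mu:
            mu[u] -= mu_bar
    return m0, mx, mu
```

For a Hilbertian response, each arithmetic operation above would be replaced by its ⊕/⊙ counterpart, or equivalently performed after a clr transformation as described in the text.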
In case H = B²(S), the space of density functions f(·) supported on a Borel subset S of R^k with finite Lebesgue measure for some k ∈ N and satisfying ∫_S (log f(s))² ds < ∞, the transformation clr : B²(S) → L²(S) and its inverse clr⁻¹ take the forms clr(f)(s) = log f(s) − Leb(S)⁻¹ ∫_S log f(u) du and clr⁻¹(g)(s) = exp(g(s)) / ∫_S exp(g(u)) du. In both cases, m̂^{[r]}_{t,j}(t_j) for t = x, u, v and r ≥ 0 can be computed through the transformed iteration. In our numerical studies in Section 5, we used the Epanechnikov kernel K(t) = (3/4)(1 − t²)I(|t| < 1). We chose the initial weights so that they are square integrable and satisfy (A.6). Instead of applying the stopping criteria described in the gB-SBF algorithm in Section 2.4, we simply stopped the iteration when a set of convergence criteria were all met. We now discuss the selection of the smoothing parameters h_j, λ_j and s_j. We note that a full-dimensional grid search is not feasible when the number of predictors, d_x + d_u + d_v, is large. We propose a selection rule named 'CSS' (Coordinate-wise Smoothing-parameter Selection); the same idea was employed in [15]. Let CV(h_1, ..., s_{d_v}) denote a cross-validatory criterion for the tuple of smoothing parameters (h_1, ..., s_{d_v}).
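For the compositional case H = S^k, the clr transformation and its inverse can be sketched as below, together with hypothetical helpers `oplus` and `odot` showing how the Aitchison operations ⊕ and ⊙ reduce to ordinary + and × after the transformation.

```python
import math

def clr(a):
    """Centered log-ratio: log of each entry minus the log geometric mean."""
    g = sum(math.log(x) for x in a) / len(a)
    return [math.log(x) - g for x in a]

def clr_inv(z):
    """Inverse clr: exponentiate and renormalize to sum to one."""
    e = [math.exp(v) for v in z]
    s = sum(e)
    return [x / s for x in e]

def oplus(a, b):
    """Aitchison perturbation a (+) b computed through the clr isometry."""
    return clr_inv([p + q for p, q in zip(clr(a), clr(b))])

def odot(c, a):
    """Aitchison powering c (.) a computed through the clr isometry."""
    return clr_inv([c * p for p in clr(a)])
```

Note that ⊕ computed this way coincides with entrywise multiplication followed by renormalization, which is the standard closed form of the Aitchison perturbation.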
CSS algorithm. Take grids G_t = ∏_{j=1}^{d_t} {g_{t,j,1}, ..., g_{t,j,L_{t,j}}} for t = x, u and v with L_{t,j} ∈ N. Choose initial smoothing parameters, and then update one smoothing parameter at a time by minimizing CV over its grid with all the other parameters held fixed. Repeat the procedure until the vector of selected parameters no longer changes. We note that the CSS algorithm always ends in finitely many steps since the grid sizes are finite. Also, the selected vector of smoothing parameters achieves a coordinate-wise minimum. In our numerical study, we used a 10-fold cross-validation. For the gB-SBF and for the methods of [45] and of [44], we chose G_x = ∏_{j=1}^{d_x} {a_j + 0.01 × k : k = 0, ..., 100} for some small values a_j that satisfy (S1*). We also took the bandwidth grid {a + 0.01 × k : k = 0, ..., 100} for some small a > 0 for the methods of [10] and of [27], and took the nearest-neighbor grid {1, ..., 50} for the method of [21]. The grids G_u and G_v were chosen similarly.
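The coordinate-wise search can be sketched generically as follows; `cv` is a stand-in for the cross-validatory criterion, and a candidate is accepted only when it strictly improves the criterion, which guarantees termination on finite grids.

```python
def css_select(grids, cv):
    """Coordinate-wise Smoothing-parameter Selection (sketch).
    grids: one finite grid (list) per smoothing parameter.
    cv: criterion mapping a parameter list to a real number."""
    params = [g[0] for g in grids]        # initial smoothing parameters
    changed = True
    while changed:
        changed = False
        for j, grid in enumerate(grids):
            cur = cv(params)
            for g in grid:
                cand = params[:j] + [g] + params[j + 1:]
                val = cv(cand)
                if val < cur:             # strict improvement only
                    params, cur = cand, val
                    changed = True
    return params
```

Since the criterion strictly decreases at every accepted update and the grids are finite, the loop ends in finitely many steps at a coordinate-wise minimum, as claimed in the text.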

A.4. Closedness of S H (p)
In this subsection, we prove the closedness of S_H(p) defined at (3.1). The proof is largely based on the projection theory of Hilbert spaces. The material covered here is used in the proofs of other theoretical results. Recall the definition of the probability measure P̂Z⁻¹ introduced immediately below (2.9). Define an inner product ⟨·,·⟩_{2,n} on L²((Z, A, P̂Z⁻¹), H) in terms of the inner product ⟨·,·⟩ of H. Then, L²((Z, A, P̂Z⁻¹), H) with the inner product ⟨·,·⟩_{2,n} is a Hilbert space.
The closedness of S_H(p) is essential for the proof of the existence of the gB-SBF estimators, since it is part of a sufficient condition for the existence of a minimizer of the objective functional F̂ defined in Section 3.2; see Lemma 4 in [5]. The closedness is also important for the convergence of the gB-SBF algorithm. To see why, define the linear operators π̂_j whenever the defining integral exists, and put π̂_j(f)(z_j) = 0 otherwise. The following proposition shows that the π̂_j are projection operators.
This shows that $\hat\pi_j$ is a projection operator. Now, define a linear operator $\hat T := (I - \hat\pi_d) \circ \cdots \circ (I - \hat\pi_1)$, where $I$ is the identity operator. We note that $\hat T$ is an alternating projection operator. According to the projection theory (Lemma S.7 in [15]), the closedness of $S_H(p)$ implies that $\hat T$ is a contraction, i.e.,
$$\|\hat T\|_{L(S_H(p))} := \sup\{\|\hat T(g)\|_{2,n} : g \in S_H(p),\ \|g\|_{2,n} = 1\} < 1, \tag{A.10}$$
where $\|\cdot\|_{2,n}$ is the norm induced by $\langle \cdot, \cdot \rangle_{2,n}$. The property $\|\hat T\|_{L(S_H(p))} < 1$ is essential for the convergence of the gB-SBF algorithm since the constant $\hat\gamma$ in Theorem 2 is in fact $\|\hat T\|^2_{L(S_H(p))}$; see the proof of Theorem 2 given in Section A.6.2. Proving the closedness of $S_H(p)$ requires an advanced theory of functional analysis. To describe this, let $\hat\pi_j | L^2((\mathcal{Z}_k, \mathcal{A}_k, \hat P Z_k^{-1}), H)$ denote the operator $\hat\pi_j$ restricted to $L^2((\mathcal{Z}_k, \mathcal{A}_k, \hat P Z_k^{-1}), H)$ for $k \ne j$. When $H = \mathbb{R}$, $\mathcal{Z}_j = [0,1]$ and the $\nu_j$ are the Lebesgue measure for all $1 \le j \le d$, the common approach in the existing SBF literature to establishing the closedness of $S_H(p)$ is to prove that $\hat\pi_j | L^2((\mathcal{Z}_k, \mathcal{A}_k, \hat P Z_k^{-1}), H)$ for all $1 \le j \ne k \le d$ are compact operators (e.g. [25]). Indeed, if the $\hat\pi_j$ are projection operators and the restrictions $\hat\pi_j | L^2((\mathcal{Z}_k, \mathcal{A}_k, \hat P Z_k^{-1}), H)$ are compact, then $S_H(p)$ is closed by Theorem 8.1 in [7]. Unfortunately, the restricted projection operators are not compact for infinite-dimensional $H$, as we demonstrate below.
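The role of the contraction property (A.10) can be seen in a finite-dimensional toy example, where two one-dimensional subspaces stand in for the component spaces of the additive model (the vectors and dimensions below are illustrative assumptions, not objects from the paper): the alternating projection operator $T = (I - P_2)(I - P_1)$ has operator norm strictly less than one on the sum space, so the iterates contract geometrically, exactly the mechanism behind the convergence of the gB-SBF algorithm.

```python
import numpy as np

def proj(a):
    """Orthogonal projection matrix onto span{a}."""
    a = a / np.linalg.norm(a)
    return np.outer(a, a)

# two non-orthogonal one-dimensional component spaces in R^2,
# toy stand-ins for the spaces of additive components
P1 = proj(np.array([1.0, 0.0]))
P2 = proj(np.array([1.0, 1.0]))
I = np.eye(2)

# alternating projection operator T = (I - P2)(I - P1)
T = (I - P2) @ (I - P1)

g = np.array([1.0, 0.5])     # element of the sum space
norms = [np.linalg.norm(np.linalg.matrix_power(T, r) @ g) for r in range(6)]
ratios = [norms[r + 1] / norms[r] for r in range(5)]
# after the first step the norm shrinks by the same factor gamma < 1
# (here gamma = cos^2(45 degrees) = 1/2, the squared angle between the spaces)
```

When the two spaces are orthogonal the contraction factor is 0 (one sweep suffices); as the angle between them shrinks, the factor approaches 1 and convergence slows, which mirrors the role of $\|\hat T\|_{L(S_H(p))}$ in Theorem 2.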
Proof. Let $L(H)$ denote the space of all bounded linear operators from $H$ to itself. Then, under the condition (S2), $\hat\pi_j | L^2((\mathcal{Z}_k, \mathcal{A}_k, \hat P Z_k^{-1}), H)$ is an integral operator with the kernel $\hat k_{jk} : \mathcal{Z} \times \mathcal{Z} \to L(H)$ defined by $\hat k_{jk}(z, z^*)(h) = h \cdot (\hat p_{jk}(z_j, z^*_k)/(\hat p_j(z_j)\hat p_k(z^*_k)))$. This with Theorems 3.1 and 3.2 in [15] gives the proposition.
To prove that $S_H(p)$ is closed, we use Proposition 1 and Lemma S.7 in [15], the latter of which states that the closedness of $S_H(p)$ is equivalent to the conclusion of the next proposition.
where, with slight abuse of the notation for $\|\cdot\|_{2,n}$, we write $\|g\|_{2,n}$ for real-valued maps $g \in L^2((\mathcal{Z}, \mathcal{A}, \hat P Z^{-1}), \mathbb{R})$ as well. Proposition 3 specialized to $H = \mathbb{R}$ implies that there exists a constant $\hat c > 0$ such that, for any $g \in S_{\mathbb{R}}(p)$, there exist $g_j \in L^2((\mathcal{Z}_j, \mathcal{A}_j, \hat P Z_j^{-1}), \mathbb{R})$ for $1 \le j \le d$ satisfying $g = \sum_{j=1}^d g_j$. Then, it holds that where the inequality follows from (A.11). Moreover, we get a.e. with respect to $\hat P Z^{-1}$. This completes the proof.
The following proposition now follows from Proposition 3.

A.5. Some lemmas
For the lemma below, we write $P Z_{-j}^{-1}$ for the distribution of $Z_{-j}$ and write $p_{Z_{-j}}$ for the density of $Z_{-j}$ with respect to $\nu_{-j}$. The lemma is used to prove Theorems 3, 4 and 5.

Lemma 1. Assume that there is a constant

Proof. We only show that $f_1(z_1) = c_1$ a.e. with respect to $P Z_1^{-1}$, since for $j \ge 2$ we may simply exchange the roles of 1 and $j$. We claim that, if $p(z) \ge c \cdot p_1(z_1) \cdot p_{Z_{-1}}(z_{-1})$ for all $z \in \mathcal{Z}$, then

Let $N \in \mathcal{A}$ be a $P Z^{-1}$-null set. Then, This proves (A.12).

For $D \in \mathcal{B}(H)$, we may prove that for some $A \in \mathcal{A}_1$ and $B \in \mathcal{A}_{-1}$.
For the next lemma, we define where the subscript $i+$ denotes the $i$th observation of the quantity defined at the beginning of Section 4.3.
Recall the definition of $C_{j,x_j}$ given at (4.10). The following lemma is used to prove Theorem 5.
Assume that the conditions on $h_j$, $\lambda_j$ and $s_j$ in (D3) hold, that $K$ is bounded, that $E(\|\epsilon_+\|^\alpha) < \infty$ for some $\alpha > 2$ and that, for all $u_k$, $v_k$ and $1 \le j \le d_x$, (a) $E(\|\epsilon_+\|^\alpha \mid X_j = \cdot)$, $E(\langle \epsilon_+, e_l \rangle \langle \epsilon_+, e_{l'} \rangle \mid X_j = \cdot, U_k = u_k)$ and $E(\langle \epsilon_+, e_l \rangle \langle \epsilon_+, e_{l'} \rangle \mid X_j = \cdot, V_k = v_k)$ are bounded on a respective neighborhood of $x_j$, $E(\langle \epsilon_+, e_l \rangle \langle \epsilon_+, e_{l'} \rangle \mid X_j = \cdot, X_k = \cdot)$ and $p_{xx,jk}$ are bounded on a respective neighborhood of $(x_j, x_k)$, and $E(\langle \epsilon_+, e_l \rangle \langle \epsilon_+, e_{l'} \rangle \mid X_j = \cdot)$ for all $l$ and $l'$ are continuous on a common neighborhood of $x_j$; (b) $p_{x,j}$ is continuous on a neighborhood of $x_j$ and $p_{x,j}(x_j) > 0$. Then, $n^{2/5}(\hat\mu^A_{x,1}(x_1), \dots, \hat\mu^A_{x,d_x}(x_{d_x}), \hat\mu^A_{u,1}(u_1), \dots, \hat\mu^A_{v,d_v}(v_{d_v}))$ $\int_{-1}^{1} K^2(t)\, dt \cdot E(\langle \epsilon_+, e_l \rangle \langle \epsilon_+, e_m \rangle \mid X_j = x_j)$ with $\alpha_j$ being the constants in the condition (D3). Also, if $d_x + 1 \le j \le d$ or $d_x + 1 \le k \le d$, then $E(\langle S_n(w), e_{jl} \rangle_{H^d} \cdot \langle S_n(w), e_{km} \rangle_{H^d}) \to 0$.
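For the Epanechnikov kernel used in Section 5, the kernel constant $\int_{-1}^{1} K^2(t)\,dt$ appearing in the asymptotic covariance has the closed form $\int_{-1}^{1} ((3/4)(1-t^2))^2\,dt = 3/5$. A quick numerical check; the grid resolution below is an arbitrary choice:

```python
import numpy as np

K = lambda t: 0.75 * (1.0 - t**2) * (np.abs(t) < 1)  # Epanechnikov kernel
t = np.linspace(-1.0, 1.0, 200001)
dt = t[1] - t[0]
# trapezoidal rule; the endpoint terms vanish since K(-1) = K(1) = 0
kernel_constant = float(np.sum(K(t) ** 2) * dt)      # approx 3/5
```

The same grid also confirms that $K$ integrates to one, as a kernel must.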
On the other hand, for $g \in H$ and $d_x + 1 \le j \le d$, $\langle P_j \circ C_w \circ P_j^*(g), e_l \rangle = 0$.
This proves $P_j(G(0, C_w)) = 0$ for $d_x + 1 \le j \le d$. The independence of $G(0, C_{j,x_j})$ for different $1 \le j \le d_x$ follows from Theorem 4.2 in [15].

A.6. Proofs for Section 3
A.6.1. Proof of Theorem 1

First, we show that $\hat F$ is a strictly convex and continuous functional satisfying $\hat F(f) \to \infty$ as $\|f\|_{2,n} \to \infty$. This together with Lemma 4 in [5] and Proposition 4 implies that there exists a minimizer of $\hat F$ in $S_H(p)$. Note that the strict convexity is trivial. For the continuity, we note that where $\hat c$ and $\hat M$ are defined in (A.21). For the divergence, we note that Next, we prove that $\hat F$ is Gâteaux differentiable. Here $\hat c$ is the constant in Proposition 3 and $\hat M = \max_{1 \le j \le d} \|\hat\mu_j\|^2_{2,n} < \infty$. The inequality (A.21) may be proved by applying the Hölder inequality and considering a decomposition $\bigoplus_{j=1}^d g_j$ of $g$ with $g_j \in L^2((\mathcal{Z}_j, \mathcal{A}_j, \hat P Z_j^{-1}), H)$ such that $\sum_{j=1}^d \|g_j\|^2_{2,n} \le \hat c \|g\|^2_{2,n}$, whose existence is guaranteed by Proposition 3. Hence, $D\hat F(f)$ is the Gâteaux derivative of $\hat F$ at $f$, and thus $\hat F$ is Gâteaux differentiable. Now, $\hat f \in S_H(p)$ being a minimizer of $\hat F$ is equivalent to $D\hat F(\hat f)(g) = 0$ for all $g \in S_H(p)$ by Theorem 5.3.19 in [4]. Specializing $g \in S_H(p)$ in $D\hat F(\hat f)(g) = 0$ to $g_j \in L^2((\mathcal{Z}_j, \mathcal{A}_j, \hat P Z_j^{-1}), H)$ for each $1 \le j \le d$, the equation implies that
$$\int_{\mathcal{Z}_{-j}} (\hat f(z) \ominus \hat\mu(z))\, \hat p(z)\, d\nu_{-j}(z_{-j}) = 0 \tag{A.22}$$
a.e. with respect to $\nu_j$, for all $1 \le j \le d$.
Let $\hat f = \hat f_0 \oplus \bigoplus_{j=1}^d \hat f_j$ be a decomposition of $\hat f$ with $\hat f_j \in L^2((\mathcal{Z}_j, \mathcal{A}_j, \hat P Z_j^{-1}), H)$ such that $\int_{\mathcal{Z}_j} \hat f_j(z_j)\, \hat p_j(z_j)\, d\nu_j(z_j) = 0$ for all $1 \le j \le d$. Plugging the decomposition into the left hand side of (A.22) and using (2.9), we see that $\hat f_0 = \hat m_0$ and that $(\hat f_j : 1 \le j \le d)$ satisfies (A.23) a.e. with respect to $\nu_j$, for all $1 \le j \le d$. We define the right hand side of (A.23) by $\hat m_j(z_j)$ for all $z_j \in \mathcal{Z}_j$. Then, $(\hat m_j : 1 \le j \le d) \in \bigoplus_{j=1}^d L^2((\mathcal{Z}_j, \mathcal{A}_j, \hat P Z_j^{-1}), H)$ and it satisfies (2.15) and (2.16).
For the uniqueness of $\hat\mu_+$, suppose that there exists another solution $(\tilde m_j : 1 \le j \le d)$ in $\bigoplus_{j=1}^d L^2((\mathcal{Z}_j, \mathcal{A}_j, \hat P Z_j^{-1}), H)$. Since $\hat m_\oplus \ominus \tilde m_\oplus = \hat T(\hat m_\oplus \ominus \tilde m_\oplus)$ and $\|\hat T\|_{L(S_H(p))} < 1$ by Proposition 4, we conclude that $\hat m_\oplus = \tilde m_\oplus$ a.e. with respect to $\hat P Z^{-1}$. This proves the first part of the theorem.
For the proof of the second part, suppose that $\bigoplus_{j=1}^d \hat g_j(z_j) = 0$ a.e. with respect to $\hat P Z^{-1}$ with $\hat g_j$ satisfying (2.16). Since $\hat p > 0$ on $\mathcal{Z}$ by the assumption, this implies $\bigoplus_{j=1}^d \hat g_j(z_j) = 0$ a.e. with respect to $\nu$, so that, for any map $\eta_j \in L^2((\mathcal{Z}_j, \mathcal{A}_j, \hat P Z_j^{-1}), H)$, we get (A.27). Because of the marginalization property $\int_{\mathcal{Z}_{-jk}} \hat p_{Z_{-j}}(z_{-j})\, d\nu_{-jk}(z_{-jk}) = \hat p_k(z_k)$ and the constraints (2.16), the equation (A.27) implies that for all $\eta_j \in L^2((\mathcal{Z}_j, \mathcal{A}_j, \hat P Z_j^{-1}), H)$. This implies $\hat g_j(z_j) = 0$ a.e. with respect to $\nu_j$. This proves the second part of the theorem.
$$\times \sum_{u_k \in \mathcal{U}_k} \int_0^1 \left( \hat p_{xu,jk}(x_j, u_k)\, p_{x,j}(x_j) - p_{xu,jk}(x_j, u_k)\, \hat p_{x,j}(x_j) \right)^2 dx_j.$$
We decompose the integrand of the above integral as in (A.35). For the first term on the right hand side of (A.35), we note (A.36). Lemma 2 implies that the first term on the right hand side of (A.36) is $o_p(1)$. For the second term, we observe that $E(\hat p_{xu,jk}(x_j, u_k)) = E(K_{h_j}(x_j, X_j) L_{\lambda_k}(u_k, U_k)) = \int_0^1 K_{h_j}(x_j, x_j')\, p_{xu,jk}(x_j', u_k)\, dx_j' + o(1)$ uniformly for $x_j \in [0,1]$. Hence, the first term on the right hand side of (A.35) is $o_p(1)$ uniformly for $x_j \in [0,1]$. Similarly, one may prove that the second term on the right hand side of (A.35) is $o_p(1)$ uniformly for $x_j \in [0,1]$. Since $\hat p_{x,j}(x_j) \ge c$ with probability tending to one for some constant $c > 0$, we obtain (A.34).

A.7.3. Proof of Theorem 4
We only sketch the proof since a full proof is too long. Hereafter, we denote $\sum_{u_j \in \mathcal{U}_j}$ and $\sum_{v_j \in \mathcal{V}_j}$ by $\sum_{u_j}$ and $\sum_{v_j}$, respectively. Let $D_{t,k}(z_k, z_k') =$