High-Dimensional Bayesian Inference in Nonparametric Additive Models

A fully Bayesian approach is proposed for ultrahigh-dimensional nonparametric additive models in which the number of additive components may be larger than the sample size, although the true model is believed to include only a small number of components. Bayesian approaches can conduct a stochastic model search and deliver flexible parameter estimation through stochastic draws. Our theory shows that the proposed model selection method has satisfactory properties. For instance, when the hyperparameter associated with the model prior is correctly specified, the true model has posterior probability approaching one as the sample size goes to infinity; when this hyperparameter is misspecified, the selected model is still acceptable since it is proved to be asymptotically nested in the true model. To enhance model flexibility, two new $g$-priors are proposed and their theoretical performance is examined. We also propose an efficient MCMC algorithm to handle the computational issues. Several simulation examples are provided to demonstrate the advantages of our method.


Introduction
Suppose the data $\{Y_i, X_{1i}, \ldots, X_{pi}\}_{i=1}^n$ are iid copies of $(Y, X_1, \ldots, X_p)$ generated from the model
$$Y_i = \sum_{j=1}^p f_j(X_{ji}) + \epsilon_i, \quad i = 1, \ldots, n, \qquad (1.1)$$
where the $\epsilon_i$'s denote zero-mean random errors, and for each $j = 1, \ldots, p$, $X_j$ is a random variable taking values in $[0,1]$ and $f_j$ is a function of $X_j$ satisfying $E\{f_j(X_j)\} = 0$. The zero-expectation constraint is assumed for identifiability. Model (1.1) is called the additive component model; see [37,25] for an excellent introduction. Suppose model (1.1) contains $s_n$ significant covariates, and the remaining $p - s_n$ covariates are insignificant. Here we assume $p/n \to \infty$ as $n \to \infty$, denoted $p \gg n$ or equivalently $n \ll p$, but restrict $s_n = o(n)$, i.e., the true model is sparse. Our goal is to explore an automatic fully Bayesian procedure for selecting and estimating the significant (nonvanishing) $f_j$'s in model (1.1).
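To make the setup concrete, the following Python sketch generates data from a sparse instance of model (1.1) with $p \gg s_n$; the particular component functions, sample size, and dimension are illustrative choices of ours, not the paper's simulation settings.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, s = 400, 1000, 4                       # sample size, ambient dimension, |gamma^0|

# Hypothetical centered component functions (E f_j(X_j) = 0 for X_j ~ U[0,1]);
# only the first s covariates enter the true model, the rest are noise.
f_true = [
    lambda x: np.sin(2 * np.pi * x),
    lambda x: (2 * x - 1) ** 2 - 1.0 / 3.0,  # E{(2X-1)^2} = 1/3, so subtract it
    lambda x: np.cos(2 * np.pi * x),
    lambda x: x - 0.5,
]

X = rng.uniform(0.0, 1.0, size=(n, p))       # covariates supported on [0, 1]
Y = sum(f(X[:, j]) for j, f in enumerate(f_true)) + rng.normal(0.0, 1.0, n)
```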
When each $f_j$ is linear in $X_j$, (1.1) reduces to a linear model. There is a considerable body of frequentist work on model selection in ultrahigh-dimensional situations, i.e., $\log p = O(n^k)$ for some $k > 0$. Representative examples include regularization-based approaches such as [51,33,46,49,23,35,29,44,48], and correlation-based approaches such as [12,14,47]. An insightful review is given by [13].
Model selection within a Bayesian framework is conceptually different. Specifically, Bayesian approaches conduct a stochastic search over the models and evaluate each model by its posterior probability. Three major advantages of Bayesian selection methods are worth mentioning: (1) Bayesian approaches can perform model selection, parameter estimation and inference in a unified manner through posterior samples; no additional procedures such as prescreening, thresholding or data splitting are needed; (2) the choice of the hyperparameters is flexible, since they can be handled by stochastic draws; and (3) Bayesian methods allow practitioners to incorporate prior information in the process of model search. The last feature can be attractive in small-sample problems where prior information may help address data insufficiency. There is a sizable literature on Bayesian model selection in linear models. For example, when $p$ is fixed, [16,1,20,30,7] show that, under certain regularity conditions, the posterior probability of the true model converges to one as $n$ increases; in other words, posterior model consistency holds. This means that the proposed Bayesian selection method is asymptotically valid. Later on, these results were generalized by [38,27] to the growing-$p$ situation with $p = O(n)$. In ultrahigh-dimensional situations, [40] considered a fully Bayesian hierarchical model with a prior controlling the model size and obtained posterior model consistency. A straightforward MCMC algorithm was developed for model search.
Based on an extended Bayesian information criterion, [31] established posterior model consistency in generalized linear models. However, in many practical applications there may be little evidence confirming linearity of the $f_j$'s, in which case a nonparametric assumption on the $f_j$'s will largely enhance model flexibility, leading to the so-called nonparametric additive models. Surprisingly, theoretical studies relating to model selection in nonparametric additive models are almost all in frequentist settings. For instance, [32,28,36] explored component selection with smoothing constraints on the nonparametric functions. [34,24] proposed penalty-based approaches and studied their asymptotic properties. [11] proposed a learning approach based on independent correlation and proved selection consistency. To the best of our knowledge, theoretical studies in Bayesian settings are nonexistent, especially when $p \gg n$. In terms of empirical evaluation, [42] proposed an objective Bayesian approach using penalized splines, [8] proposed a Bayesian framework based on adaptive regression trees, and [41] proposed a Bayesian framework based on a spike-and-slab prior induced from normal-mixture-inverse-gamma distributions. However, the theoretical validity of these methods in ultrahigh-dimensional scenarios has not been justified.
In this paper, we propose a fully Bayesian hierarchical model which involves a new spike-and-slab prior on the function coefficients and a novel prior controlling the model size, namely, the size-control prior. The spike-and-slab prior has two important features: first, it either removes or includes the entire block of function coefficients, which is useful for model selection; second, within each block, suitable decay rates are imposed on the prior variances of the function coefficients to produce smooth estimates of the nonparametric functions. The size-control prior, which involves a size-control parameter, effectively restricts the scope of the target models and facilitates both the theoretical analysis and the computation. Based on the proposed Bayesian framework, we show that when the size-control parameter is correctly specified, posterior model consistency holds uniformly when the hyperparameters are confined to suitable ranges; when the size-control parameter is incorrectly specified, the selection results are still acceptable in the sense that the selected model is asymptotically nested in the true model; in other words, the number of false positives asymptotically vanishes. Interestingly, the asymptotic results are shown to hold even in the hyper-g prior settings.
Furthermore, a novel and nontrivial blockwise MCMC procedure is proposed for computation. Our MCMC procedure allows stochastic search over all critical quantities, including the blocks of the function coefficients, the indicator variables representing inclusion/exclusion of the variables, the size-control parameter, and even the number of basis functions used for model fitting. The most challenging part of the computation is the so-called trans-dimensional problem, which is resolved by a novel variation of the "dimension-matching" technique proposed by [18] in the reversible jump MCMC approach. Simulation results demonstrate satisfactory selection and estimation accuracy of the proposed method. Performance under different basis structures is also examined. To the best of our knowledge, our work is the first to establish a both theoretically and empirically effective fully Bayesian procedure for function component selection in ultrahigh-dimensional settings.
The rest of the paper is organized as follows. In Section 2, we describe our fully Bayesian model and the prior distributions on the model parameters. In Section 3, asymptotic results are provided for both well specified and misspecified model spaces; in addition, two new types of g-priors are constructed and their theoretical properties are studied. Section 4 contains the details of the MCMC algorithm. Section 5 presents simulation examples showing the satisfactory performance of the proposed method. Section 6 summarizes the conclusions. Technical arguments are provided in the appendix.
Let $\gamma = (\gamma_1, \ldots, \gamma_p)^T \in \{0,1\}^p$ index a candidate model, with $\gamma_j = 1$ indicating that $f_j$ is included. Throughout we use $|\gamma|$ to denote the number of ones in $\gamma$, called the size of $\gamma$. Clearly there are in total $2^p$ possible $\gamma$'s representing $2^p$ different models, each of which determines a subset of $\{f_j, j = 1, \ldots, p\}$ included in model (1.1). In other words, under $\gamma$, model (1.1) is equivalent to
$$Y_i = \sum_{j:\, \gamma_j = 1} f_j(X_{ji}) + \epsilon_i, \quad i = 1, \ldots, n.$$
For two models, $\gamma \setminus \gamma'$ is the 0/1 vector indicating the functional components present in model $\gamma$ but absent in model $\gamma'$, and $\gamma \cap \gamma'$ is the 0/1 vector indicating the functional components present in both models $\gamma$ and $\gamma'$. We say that $\gamma$ is nested in $\gamma'$ (denoted $\gamma \subset \gamma'$) if $\gamma \setminus \gamma'$ is zero. We further assume $\{f_j^0, j = 1, \ldots, p\}$ to be the true functional components, and denote $\gamma^0 = (\gamma_1^0, \ldots, \gamma_p^0)^T$ with $\gamma_j^0 = I(f_j^0 \neq 0)$. That is, the data $\{Y_i, X_{1i}, \ldots, X_{pi}\}_{i=1}^n$ are truly sampled from the model $Y_i = \sum_{j \in \gamma^0} f_j^0(X_{ji}) + \epsilon_i$, $i = 1, \ldots, n$. Thus, $\gamma^0$ represents the true model from which the data are generated, and $s_n = |\gamma^0|$ denotes the size of the true model, i.e., the number of components $f_j$ included in the true model.
For $j = 1, \ldots, p$, define an inner product $\langle f_j, \tilde f_j \rangle_j = E\{f_j(X_j)\tilde f_j(X_j)\}$ for any $f_j, \tilde f_j \in H_j$, where $H_j$ is the class of functions on $[0,1]$ satisfying $E\{|f_j(X_j)|^2\} < \infty$ and $E\{f_j(X_j)\} = 0$. This inner product induces a norm, denoted $\|\cdot\|_j$; that is, $\|f_j\|_j = \sqrt{E\{|f_j(X_j)|^2\}}$. Suppose the density function $d_j(x_j)$ of $X_j$ satisfies $0 < K_1 \le d_j(x_j) \le K_2 < \infty$ for any $x_j \in [0,1]$ and $j = 1, \ldots, p$, where $K_1, K_2$ are constants. Clearly, under $\langle\cdot,\cdot\rangle_j$, $H_j$ is a well-defined Hilbert space. Let $\{\varphi_{jl}, l = 1, 2, \ldots\} \subset H_j$ be orthonormal basis functions for $H_j$ under $\langle\cdot,\cdot\rangle_j$. Any function $f_j \in H_j$ thus admits the Fourier series $f_j = \sum_{l=1}^\infty \beta_{jl}\varphi_{jl}$, with $\beta_{jl} = \langle f_j, \varphi_{jl}\rangle_j$ the Fourier coefficients. It can be shown that $f_j = 0$ if and only if all the Fourier coefficients $\beta_{jl}$ are zero. Therefore, to detect whether $f_j$ vanishes, it suffices to detect whether the $\beta_{jl}$'s are zero. In general, $f_j$ may have infinitely many nonzero Fourier coefficients, and handling all of them is computationally infeasible. Furthermore, it is commonly believed that a finite subset of the Fourier coefficients captures most of the information carried by $f_j$. Thus, we consider the partial Fourier series $f_j \approx \sum_{l=1}^m \beta_{jl}\varphi_{jl}$ with truncation parameter $m$, where $m = m_n$ is a sequence increasing with $n$. General Fourier theory gives that $\|f_j - \sum_{l=1}^m \beta_{jl}\varphi_{jl}\|_j$ approaches zero as $m \to \infty$, showing the validity of this approximation. We introduce some matrix notation to simplify the expression of our model. For $j = 1, \ldots, p$ and $l = 1, \ldots, m$, define $\beta_j = (\beta_{j1}, \ldots, \beta_{jm})^T$, $\Phi_{jl} = (\varphi_{jl}(X_{j1}), \ldots, \varphi_{jl}(X_{jn}))^T$, and $Z_j = (\Phi_{j1}, \ldots, \Phi_{jm})$; thus each $Z_j$ is $n \times m$. For any $\gamma$, define $Z_\gamma = (Z_j, j \in \gamma)$, the $n \times m|\gamma|$ matrix formed by the $Z_j$'s with $j \in \gamma$, and define $\beta_\gamma$ to be the $m|\gamma|$-vector of Fourier coefficients formed by the $\beta_j$'s with $j \in \gamma$. Define $Y = (Y_1, \ldots, Y_n)^T$ to be the response vector.
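The following minimal sketch illustrates how the blocks $Z_j$ and the stacked matrix $Z_\gamma$ can be assembled, assuming, purely for illustration, uniform covariates and a trigonometric basis (the empirical construction of Section 3.3 is more general); the function names are ours.

```python
import numpy as np

def design_block(x_j, m):
    """Z_j: n x m matrix with columns phi_jl(X_ji), using the trigonometric
    system sqrt(2)cos(2*pi*k*x), sqrt(2)sin(2*pi*k*x), which is orthonormal
    with mean zero when X_j ~ Uniform[0,1] (an illustrative assumption)."""
    cols = []
    for l in range(1, m + 1):
        k = (l + 1) // 2                             # frequency index
        if l % 2:                                    # odd l -> cosine
            cols.append(np.sqrt(2.0) * np.cos(2 * np.pi * k * x_j))
        else:                                        # even l -> sine
            cols.append(np.sqrt(2.0) * np.sin(2 * np.pi * k * x_j))
    return np.column_stack(cols)

def design(X, gamma, m):
    """Z_gamma: horizontally stacked blocks Z_j for the included j's."""
    return np.hstack([design_block(X[:, j], m) for j in np.flatnonzero(gamma)])
```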
We assume that the model errors $\epsilon_i$ are iid zero-mean Gaussian with variance $\sigma^2$; therefore,
$$Y_i \mid X_{1i}, \ldots, X_{pi}, \sigma^2 \sim N\Big(\sum_{j=1}^p f_j(X_{ji}),\, \sigma^2\Big), \quad \text{independently for } i = 1, \ldots, n. \qquad (2.1)$$
Since each $f_j$ can be well approximated by $\sum_{l=1}^m \beta_{jl}\varphi_{jl}$ for some sufficiently large $m$, the mean of $Y_i$ is approximately $\sum_{j=1}^p\sum_{l=1}^m \beta_{jl}\varphi_{jl}(X_{ji})$. In matrix form, this becomes
$$Y \mid \beta, \sigma^2 \sim N\Big(\sum_{j=1}^p Z_j\beta_j,\, \sigma^2 I_n\Big). \qquad (2.2)$$
When $\gamma_j = 0$, $f_j = 0$ implies that all the Fourier coefficients $\beta_{jl}$ are zero. When $\gamma_j = 1$, $f_j \neq 0$, and we place normal prior distributions on its Fourier coefficients. Explicitly, for $j = 1, \ldots, p$, we adopt the spike-and-slab prior for the $\beta_{jl}$'s,
$$\beta_{jl} \mid \gamma_j, \sigma^2 \ \stackrel{ind}{\sim}\ (1 - \gamma_j)\,\delta_0(\cdot) + \gamma_j\, N(0,\, c_j\tau_l^2\sigma^2), \quad l = 1, \ldots, m, \qquad (2.3)$$
where $\delta_0(\cdot)$ is the point mass concentrated at zero, $\{\tau_l^2, l \ge 1\}$ is a fixed nonincreasing sequence, and the $c_j$'s are temporarily assumed to be fixed. The $c_j$'s control the variances of the nonzero coefficients and can therefore be viewed as variance-control parameters.
In many applications we may choose $\tau_l^2 = l^{-(2\omega+1)}$ for $l \ge 1$, where $\omega > 1/2$ is a fixed constant characterizing the degree of smoothness; see, e.g., [4]. The prior (2.3) can be viewed as a direct multivariate extension of the conventional spike-and-slab prior on scalar coefficients considered by [9]. Note that $\gamma_j = 0$ or $1$ excludes or includes the entire block of coefficients $\beta_{jl}$, and within a nonvanishing block the coefficients follow zero-mean Gaussian priors with variances decaying at the rates $\tau_l^2$, which helps produce smooth estimates of the functions. In [41], a different type of spike-and-slab prior was considered: each coefficient block is represented as the product of a scalar with a normal-mixture-inverse-gamma prior and a vector whose entries follow bivariate mixture normal priors with a constant variance.
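For intuition, one prior draw of a coefficient block under (2.3), as reconstructed above, can be sketched as follows; draw_beta_block and its arguments are our hypothetical names, and $\tau_l^2 = l^{-(2\omega+1)}$ follows the choice just mentioned.

```python
import numpy as np

def draw_beta_block(gamma_j, c_j, sigma2, m, omega=1.0, rng=None):
    """One prior draw of beta_j under (2.3) as reconstructed above: the whole
    block is delta_0 when gamma_j = 0; otherwise the entries are independent
    N(0, c_j * tau_l^2 * sigma2) with tau_l^2 = l^{-(2*omega+1)}."""
    rng = rng or np.random.default_rng()
    if gamma_j == 0:
        return np.zeros(m)                       # spike: remove the entire block
    tau2 = np.arange(1, m + 1, dtype=float) ** (-(2.0 * omega + 1.0))
    return rng.normal(0.0, np.sqrt(c_j * tau2 * sigma2))
```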
A variety of priors can be assumed on $\sigma^2$. For convenience, we consider the inverse $\chi^2$ prior, i.e.,
$$p(\sigma^2) \propto (\sigma^2)^{-(\nu/2+1)}\exp\{-1/(2\sigma^2)\}, \qquad (2.4)$$
where $\nu$ is a fixed hyperparameter. Other priors, such as noninformative priors or inverse Gamma priors, can also be applied; all the results developed in this paper extend to such situations without substantial difficulty.
In high-dimensional inference, it is commonly believed that only a small subset of the covariates contribute substantially to the model. Treating this as prior information, models of larger size should be assigned zero prior probability, and only models of smaller size should receive positive weight. We call this a size-control prior on the model space. Namely, we choose the prior on $\gamma$ as
$$p(\gamma) \propto \pi_\gamma\, I(|\gamma| \le t_n), \qquad (2.5)$$
where the $\pi_\gamma$ for $|\gamma| \le t_n$ are fixed positive numbers, and $t_n \in (0, n)$ is an integer-valued hyperparameter controlling the sizes of the candidate models. We call the set of models whose sizes do not exceed $t_n$ the target model space.
Denote by $D_n = \{Y_i, X_{1i}, \ldots, X_{pi}\}_{i=1}^n$ the full data. Direct calculation shows that, based on the hierarchical model (2.2)-(2.5), the joint posterior distribution of $(\gamma, \beta_\gamma, \sigma^2)$ is
$$p(\gamma, \beta_\gamma, \sigma^2 \mid D_n) \propto \pi_\gamma\, (\sigma^2)^{-(n+m|\gamma|+\nu)/2-1}\det(\Sigma_\gamma)^{-1/2}\exp\Big\{-\frac{\|Y - Z_\gamma\beta_\gamma\|^2 + \beta_\gamma^T\Sigma_\gamma^{-1}\beta_\gamma + 1}{2\sigma^2}\Big\}, \qquad (2.6)$$
where $\|\cdot\|$ denotes the Euclidean norm of a vector, $T_m = \mathrm{diag}(\tau_1^2, \ldots, \tau_m^2)$, and $\Sigma_\gamma$ is the $m|\gamma| \times m|\gamma|$ diagonal matrix with diagonal elements $(c_j\tau_1^2, \ldots, c_j\tau_m^2)$ for $j \in \gamma$. Integrating out $\beta_\gamma$ and $\sigma^2$ in (2.6), it can be checked that the marginal posterior of $\gamma$ is
$$p(\gamma \mid D_n) \propto \pi_\gamma\, \det(W_\gamma)^{-1/2}\big(1 + Y^T W_\gamma^{-1} Y\big)^{-(n+\nu)/2}, \qquad (2.7)$$
where $W_\gamma = I_n + Z_\gamma\Sigma_\gamma Z_\gamma^T$ and $U_\gamma = Z_\gamma^T Z_\gamma + \Sigma_\gamma^{-1}$, so that $\det(W_\gamma) = \det(\Sigma_\gamma)\det(U_\gamma)$. We adopt the convention $Z_\emptyset = 0$ and $\Sigma_\emptyset = U_\emptyset = W_\emptyset = 1$, where $\emptyset$ denotes the null model, i.e., the vector $\gamma$ with all elements zero, so (2.7) is meaningful for $\gamma = \emptyset$. The selected model $\hat\gamma$ is defined to be the one that maximizes $p(\gamma \mid D_n)$. Clearly, $\hat\gamma$ belongs to the target model space, since any model outside the target space has zero posterior probability.
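Under the reconstruction of (2.7) above, the log marginal posterior of a model $\gamma$ can be evaluated directly, as in the following sketch (naive $O(n^3)$ linear algebra, for illustration only; the helper name log_post_gamma is ours):

```python
import numpy as np

def log_post_gamma(Y, Z_gamma, c, tau2, nu, log_pi_gamma=0.0):
    """log p(gamma | D_n) up to an additive constant, under the reconstructed
    (2.7): W_gamma = I_n + Z_gamma Sigma_gamma Z_gamma^T, where Sigma_gamma
    repeats (c*tau_1^2, ..., c*tau_m^2) on its diagonal per included block
    (here all c_j = c)."""
    n = len(Y)
    if Z_gamma is None or Z_gamma.shape[1] == 0:     # null model: W_empty = I_n
        return log_pi_gamma - 0.5 * (n + nu) * np.log1p(Y @ Y)
    k = Z_gamma.shape[1] // len(tau2)                # number of included blocks
    sigma_diag = np.tile(c * np.asarray(tau2), k)    # diagonal of Sigma_gamma
    W = np.eye(n) + (Z_gamma * sigma_diag) @ Z_gamma.T
    _, logdet = np.linalg.slogdet(W)
    quad = Y @ np.linalg.solve(W, Y)                 # Y^T W^{-1} Y
    return log_pi_gamma - 0.5 * logdet - 0.5 * (n + nu) * np.log1p(quad)
```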

Main Results
Suppose the data are truly drawn from the model
$$Y_i = \sum_{j\in\gamma^0} f_j^0(X_{ji}) + \epsilon_i, \quad \epsilon_i \stackrel{iid}{\sim} N(0, \sigma_0^2), \quad i = 1, \ldots, n,$$
where $\sigma_0^2$ is a fixed (unknown) positive number and $f_j^0 \in H_j$ for $j \in \gamma^0$. Recall that $\gamma^0$ is a $p$-dimensional 0/1 vector representing the true model, and $s_n = |\gamma^0|$ denotes its size. We only consider $s_n > 0$, i.e., the true model is nonnull. Any $f_j^0$ with $j \in \gamma^0$ admits the Fourier expansion $f_j^0 = \sum_{l=1}^\infty \beta_{jl}^0\varphi_{jl}$, where the $\beta_{jl}^0$'s represent the "true" unknown Fourier coefficients of $f_j^0$. When $m$ is sufficiently large, $f_j^0$ is approximated by its partial Fourier series, $f_j^0 \approx \sum_{l=1}^m \beta_{jl}^0\varphi_{jl}$. To ensure that this partial Fourier series is a valid approximation, we assume a uniform error rate on the tails of the Fourier series. Specifically, we assume that there are positive constants $a > 1$ and $C_\beta$ such that
$$\max_{j\in\gamma^0}\ \sum_{l=m+1}^\infty |\beta_{jl}^0|^2 \le C_\beta^2\, m^{-a}, \quad \text{uniformly for } m \ge 1. \qquad (3.1)$$
It is easy to see that (3.1) is equivalent to $\max_{j\in\gamma^0}\|f_j^0 - \sum_{l=1}^m \beta_{jl}^0\varphi_{jl}\|_j^2 \le C_\beta^2\, m^{-a}$, uniformly for $m \ge 1$. That is, the errors of the partial Fourier series of the nonzero $f_j^0$'s uniformly decrease to zero at rate $m^{-a}$. For instance, when the $f_j^0$'s uniformly belong to the Sobolev ellipsoid of order $a/2$, i.e., $\max_{j\in\gamma^0}\sum_{l=1}^\infty l^a|\beta_{jl}^0|^2 < \infty$ for some constant $a > 0$, then $\sum_{l>m}|\beta_{jl}^0|^2 \le m^{-a}\sum_{l>m} l^a|\beta_{jl}^0|^2 = O(m^{-a})$, so (3.1) holds. Thus $a$ measures the degree of smoothness of the nonzero functions: a larger $a$ implies that the nonzero functions are smoother.
Define $l_n = \sum_{j\in\gamma^0}\|f_j^0\|_j^2$ and $\theta_n = \min_{j\in\gamma^0}\|f_j^0\|_j$. Define $P_\gamma = Z_\gamma(Z_\gamma^T Z_\gamma)^{-1}Z_\gamma^T$ to be the $n \times n$ projection (or smoothing) matrix corresponding to $\gamma$, with the convention $P_\emptyset = 0$. Let $\lambda_-(A)$ and $\lambda_+(A)$ denote the minimal and maximal eigenvalues of a matrix $A$. Suppose the truncation parameter $m$ is chosen within the range $[m_1, m_2]$, where $m_1 = m_{1n}$ and $m_2 = m_{2n}$ with $m_1 \le m_2$ are positive sequences approaching infinity as $n \to \infty$. The variance-control parameters $c_j$ are chosen within $[\phi_n, \bar\phi_n]$ for some positive sequences $\phi_n, \bar\phi_n$.

Well Specified Target Model Space
In this section we present our first theorem, on posterior consistency of the proposed model selection procedure. We consider the situation $t_n \ge s_n$, that is, the hyperparameter $t_n$ is correctly specified as being no less than the size of the true model. Thus, the true model lies in our target model space, and we say that the target model space is well specified. We present a set of sufficient conditions and show that, under these conditions, the posterior probability of the true model converges to one in probability; hence the selection procedure asymptotically yields the true model.
Define $S_1(t_n) = \{\gamma : \gamma^0 \subset \gamma,\ \gamma \neq \gamma^0,\ |\gamma| \le t_n\}$, the overfitted models, and $S_2(t_n) = \{\gamma : \gamma^0 \not\subset \gamma,\ |\gamma| \le t_n\}$, the models missing at least one true component. It is clear that $S_1(t_n)$ and $S_2(t_n)$ are disjoint, and $S(t_n) = \{\gamma^0\} \cup S_1(t_n) \cup S_2(t_n)$ is the class of all models with size not exceeding $t_n$, i.e., the target model space. We first list some conditions that are used to show our theorem.
Assumption A.1. There exists a positive constant $c_0$ such that, as $n \to \infty$, with probability approaching one,
$$\min_{\gamma\in S(t_n)} \lambda_-\big(Z_\gamma^T Z_\gamma / n\big) \ge c_0.$$
In the following proposition we show that Assumption A.1 holds under a suitable dependence assumption among the predictors $X_j$. To describe this assumption, let $\{X_j\}_{j=1}^\infty$ be a stationary sequence taking values in $[0,1]$, and define its $\rho$-mixing coefficient
$$\rho(|j - j'|) = \sup E\{f(X_j)\, g(X_{j'})\},$$
where the supremum is taken over measurable functions $f$ and $g$ with $E\{f(X_j)\} = E\{g(X_{j'})\} = 0$ and $E\{f(X_j)^2\} = E\{g(X_{j'})^2\} = 1$. For simplicity, we assume that the predictors $X_1, \ldots, X_p$ in model (1.1) are the first $p$ elements of $\{X_j\}_{j=1}^\infty$.
Assumption A.2 holds if we choose $p(\gamma)$ to be constant for all $|\gamma| \le t_n$, i.e., if we adopt an indifference prior over the target model space. To see when Assumption A.3 holds, we examine a special example. Choose $\tau_l^2 = l^{-5}$ for $l \ge 1$. Suppose $\log p \propto n^k$ for $0 < k < 1$, $s_n \propto 1$, $\psi_n \propto 1$, and the smoothness parameter $a = 4$. Choose $t_n \propto 1$, $m_1 = \zeta n^{1/5} + c_{1n}$ and $m_2 = \zeta n^{1/5} + c_{2n}$, where $\zeta > 0$, $c_{1n} \le c_{2n}$ are suitable sequences, and $r > 0$ is a constant. Note that such a choice of $m_1$ and $m_2$ yields the minimax error rate in univariate regression. Let $h_m = (\log m)^r$ for $m \ge 1$, and suppose the selected $t_n$ is greater than $s_n$. Choose $\phi_n$ and $\bar\phi_n$ such that $\log(\phi_n) \propto n^{k_1}$ and $\log(\bar\phi_n) \propto n^{k_2}$ with $\max\{0, k - 1/5\} < k_1 < k_2 < 4/5$. In this simple situation, it can be directly verified that Assumption A.3 holds. Furthermore, Proposition 3.1 says that to satisfy Assumption A.1, an additional sufficient condition is $t_n^2 m_2^2 \log p = o(n)$, which implies $k < 3/5$. Therefore, the dimension $p$ cannot exceed the order $\exp(O(n^{3/5}))$, which coincides with the finding of [36]. Choosing the truncation parameter $m$ is a practically difficult problem in nonparametrics; see [36,11]. A method that is insensitive to the choice of the truncation parameter within a certain range is therefore highly useful. In Theorem 3.2 we show theoretically that the proposed Bayesian selection method is among those providing such insensitive selection results. We also show that our method is insensitive to the choice of the variance-control parameters $c_j$. This is both theoretically and practically useful since it allows us to place an additional prior, such as the g-priors, over the $c_j$'s while preserving the desired posterior model consistency; see Section 3.4. By slightly modifying the assumptions, it is possible to show that (3.3) actually holds uniformly for $t_n$ within some range, as established by [40] in the linear model setting; that is, posterior model consistency is also insensitive to the choice of $t_n$. We omit this extension since, in our nonparametric models with g-priors, insensitivity to the truncation parameter $m$ and the variance-control parameters $c_j$ deserves more attention; omitting it also keeps the statements more readable.
To the best of our knowledge, Theorem 3.2 is the first theoretical result showing the validity of the Bayesian methods in function component selection in ultrahigh-dimensional settings.

Misspecified Target Model Space
In this section, we investigate the case $0 < t_n < s_n$, that is, $t_n$ is misspecified as being smaller than the size of the true model. The true model therefore lies outside the target model space, and we say that the target model space is misspecified. We show that even in this misspecified setting the selected model is still not "bad", because it is asymptotically nested in the true model, uniformly over the choices of $m$ and the $c_j$'s.
Define $T_0(t_n) = \{\gamma : |\gamma| \le t_n,\ \gamma \subset \gamma^0\}$, the models of size at most $t_n$ that are nested in the true model, and let $T_1(t_n)$ and $T_2(t_n)$ partition the remaining models of size at most $t_n$ that are not nested in $\gamma^0$; thus $T_0(t_n) \cup T_1(t_n) \cup T_2(t_n)$ is exactly the target model space, i.e., the class of $\gamma$ with $|\gamma| \le t_n$. Throughout this section, we make the following assumptions.

Assumption B.1.
There exist a positive constant $d_0$ and a positive sequence $\rho_n$ such that, as $n \to \infty$, with probability approaching one,

Assumption B.3. There exists a positive sequence $\{h_m, m \ge 1\}$ such that $h_m$ decreasingly converges to zero and $m h_m$ increasingly diverges to $\infty$.

The following result presents a situation in which Assumption B.1 holds. For technical convenience, we require the predictors to be independent; we conjecture that the result holds in more general settings.
So the growth rate of $p$ again cannot exceed $\exp(O(n^{3/5}))$.

(ii) Furthermore, suppose Assumption A.3 (4) is satisfied, and there exist a $\gamma \in T_0(t_n)\setminus\{\emptyset\}$ and a constant satisfying condition (3.6); then with probability approaching one the selected model is nonnull.

When the hyperparameter $t_n$ is incorrectly specified as being smaller than the size of the true model, the selected model $\hat\gamma$ cannot be the true model, since necessarily $|\hat\gamma| < s_n$. Theorem 3.4 (i) shows that in this misspecified setting, $\hat\gamma$ is asymptotically nested in the true model with probability approaching one. This means that, as $n$ approaches infinity, all the selected components are significant ones which ought to be included in the model. Moreover, the result holds uniformly for $m$ and the $c_j$'s within certain ranges, showing insensitivity to the choice of these hyperparameters. To the best of our knowledge, Theorem 3.4 is the first theoretical examination of function selection when the model space is misspecified.
We should mention that in Theorem 3.4 (i) it is possible that $\hat\gamma = \emptyset$, since $\emptyset$ is a natural subset of $\gamma^0$. When $\gamma^0$ is nonnull, we expect $\hat\gamma$ to include some significant variables. This is possible if there exists a nonnull model that can be separated from the null model. Explicitly, condition (3.6) says that the functions $\{f_j^0, j \in \gamma\}$ dominate the functions $\{f_j^0, j \in \gamma^0\setminus\gamma\}$ in terms of the corresponding norms $\|\cdot\|_j$. This can be interpreted as saying that the model $\gamma$ carries more of the information in the true model than its complement $\gamma^0\setminus\gamma$. Theorem 3.4 (ii) says that under this condition, with probability approaching one, $\gamma$ is preferred over the null model; therefore, $\hat\gamma$ is asymptotically nonnull.

Basis Functions
The proposed approach relies on a proper set of orthonormal basis functions $\{\varphi_{jl}, l \ge 1\}$ in $H_j$ under the inner product $\langle\cdot,\cdot\rangle_j$. In this section we briefly describe how to construct such functions empirically.
Suppose for each $j = 1, \ldots, p$, $\{B_{jl}, l \ge 0\}$ form a set of basis functions in $L^2[0,1]$; without loss of generality, assume $B_{j0}$ is the constant function. For example, in empirical studies we can choose the trigonometric polynomial basis
$$B_{j,2l-1}(x) = \sqrt{2}\cos(2\pi l x), \quad B_{j,2l}(x) = \sqrt{2}\sin(2\pi l x), \quad l = 1, 2, \ldots.$$
Other choices such as the Legendre polynomial basis can also be used; see [6]. Choose a sufficiently large integer $M$ with $M < n$. For $j = 1, \ldots, p$ and $1 \le l \le M$, define $\tilde B_{jl}$ to be the centered basis function whose value at $X_{ji}$ is $\tilde B_{jl}(X_{ji}) = B_{jl}(X_{ji}) - n^{-1}\sum_{i'=1}^n B_{jl}(X_{ji'})$, and let $W_{ji} = (\tilde B_{j1}(X_{ji}), \ldots, \tilde B_{jM}(X_{ji}))^T$ and $\tilde\Sigma_j = n^{-1}\sum_{i=1}^n W_{ji}W_{ji}^T$. Let $A_j$ be an $M \times M$ invertible matrix such that $A_j^T\tilde\Sigma_j A_j = I_M$. Write $A_j = (a_{j1}, \ldots, a_{jM})$, where $a_{jl}$ is the $l$-th column, an $M$-vector. Then define $\varphi_{jl}$ as the real-valued function whose value at $X_{ji}$ is $\varphi_{jl}(X_{ji}) = a_{jl}^T W_{ji}$, for $j = 1, \ldots, p$ and $l = 1, \ldots, M$. In the simplest situation, where the $X_{ji}$'s are iid uniform on $[0,1]$, it can be seen that $\tilde\Sigma_j \approx I_M$, for which we can choose $A_j = I_M$, leading to $\varphi_{jl} = \tilde B_{jl}$ for $l = 1, \ldots, M$.
Next we heuristically show that the functions $\varphi_{jl}$ approximately form an orthonormal basis. By the law of large numbers,
$$\langle\varphi_{jl}, \varphi_{jl'}\rangle_j \approx n^{-1}\sum_{i=1}^n \varphi_{jl}(X_{ji})\varphi_{jl'}(X_{ji}) = a_{jl}^T\tilde\Sigma_j a_{jl'} = \delta_{ll'},$$
where $\delta_{ll'} = 1$ if $l = l'$ and zero otherwise. Thus, $\{\varphi_{jl}, l = 1, \ldots, M\}$ approximately form an orthonormal system. Furthermore, any $f_j \in H_j$ admits the approximate expansion $f_j \approx \sum_{l=1}^M \tilde\beta_{jl}\varphi_{jl}$, where $\tilde\beta_j = (\tilde\beta_{j1}, \ldots, \tilde\beta_{jM})^T$ with $\tilde\beta_{jl} = \langle f_j, \varphi_{jl}\rangle_j$. This means that $f_j$ can be approximately represented by the $\varphi_{jl}$'s, $l = 1, \ldots, M$. Consequently, $\{\varphi_{jl}, l = 1, \ldots, M\}$ approximately form an orthonormal basis of $H_j$ provided $M$ is large enough.
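A minimal numerical sketch of this construction, with the whitening matrix $A_j$ taken to be the inverse symmetric square root of $\tilde\Sigma_j$ (one of many valid choices satisfying $A_j^T\tilde\Sigma_j A_j = I_M$); the function names are ours.

```python
import numpy as np

def empirical_basis(x_j, B, M):
    """Empirically orthonormalized basis values for one covariate.

    x_j : (n,) sample of X_j;  B(l, x) : the l-th raw basis function.
    Returns the n x M matrix with entries phi_jl(X_ji) = a_jl^T W_ji."""
    n = len(x_j)
    W = np.column_stack([B(l, x_j) for l in range(1, M + 1)])
    W -= W.mean(axis=0)                  # centering: tilde-B_jl has sample mean 0
    Sigma = W.T @ W / n                  # tilde-Sigma_j
    vals, vecs = np.linalg.eigh(Sigma)   # assumes tilde-Sigma_j is nonsingular
    A = vecs @ np.diag(vals ** -0.5) @ vecs.T   # A^T Sigma A = I_M
    return W @ A                         # columns are approximately orthonormal

# Raw trigonometric basis B_{j,2l-1}, B_{j,2l} from above:
trig = lambda l, x: (np.sqrt(2.0) * np.cos(2 * np.pi * ((l + 1) // 2) * x) if l % 2
                     else np.sqrt(2.0) * np.sin(2 * np.pi * (l // 2) * x))
```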

Mixtures of g-priors
The results in Sections 3.1 and 3.2 extend to the g-prior setting. Suppose $c_j = c$ for $j = 1, \ldots, p$. We assume $c$ has prior density $g(c)$, a function on $(0, \infty)$ satisfying $\int_0^\infty g(c)\,dc = 1$, i.e., $g$ is a proper prior. Then (2.7) is actually $p(\gamma \mid c, D_n)$, and the posterior distribution of $\gamma$ becomes $p_g(\gamma \mid D_n) = \int_0^\infty p(\gamma \mid c, D_n)\,g(c)\,dc$, with the subscript $g$ emphasizing the g-prior setting. We then have the following results, parallel to Theorems 3.2 and 3.4; their interpretations are similar, and their proofs, being similar to those in [40], are omitted.

Theorem 3.5. Suppose Assumptions A.1-A.3 are satisfied. Furthermore, suppose $g$ is proper and, as $n \to \infty$, $g$ places asymptotically all of its mass on $[\phi_n, \bar\phi_n]$. Then posterior model consistency holds, i.e., $p_g(\gamma^0 \mid D_n) \to 1$ in probability.

Theorem 3.6. Suppose $0 < t_n < s_n$. Let Assumptions B.1-B.3 be satisfied, and let $g$ be proper and place asymptotically all of its mass on $[\phi_n, \bar\phi_n]$. Then
$$\frac{\max_{\gamma\in T_1(t_n)\cup T_2(t_n)}\, p_g(\gamma \mid D_n)}{\max_{\gamma\in T_0(t_n)}\, p_g(\gamma \mid D_n)} \to 0, \quad \text{in probability}.$$
(ii) If, in addition, Assumption A.3 (4) holds, and there exist a $\gamma \in T_0(t_n)\setminus\{\emptyset\}$ and a constant satisfying condition (3.6), then with probability approaching one the selected model is nonnull.

We propose two types of g-priors that generalize the Zellner-Siow prior of [50] and the hyper-g prior of [30]. We name them the generalized Zellner-Siow (GZS) prior and the generalized hyper-g (GHG) prior, with densities given in (3.7) and (3.8), respectively. Let $b, \mu > 0$ be fixed hyperparameters. Both the GZS and GHG priors yield posterior consistency. To see this, note that $p \gg n$; it can be directly examined that, as $n \to \infty$, both priors concentrate their mass on an interval $[\phi_n, \bar\phi_n]$ satisfying the assumptions of Theorem 3.5, implying posterior consistency of the g-prior methods. Moreover, the modes of the GZS and GHG priors are both $p^\mu/(b+1)$, which diverges as $n$ grows, yielding consistent selection results.
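Since $p_g(\gamma \mid D_n)$ is a one-dimensional integral over $c$, it can be approximated by simple Monte Carlo. The sketch below uses an inverse-gamma stand-in for the GZS density (3.7), chosen by us only so that its mode equals $p^\mu/(b+1)$ as stated above; the exact density should be substituted where available. It reuses log_post_gamma from the sketch in Section 2.

```python
import numpy as np

def log_post_gamma_g(Y, Z_gamma, tau2, nu, rng, b=1.0, mu=0.5, p=1000, n_draws=2000):
    """Monte Carlo approximation of log p_g(gamma|D_n) = log of the integral
    of p(gamma|c, D_n) g(c) dc.

    Stand-in prior: c ~ inverse-gamma with shape b and scale p^mu (requires
    b > 0), whose mode is p^mu / (b + 1)."""
    c_draws = (p ** mu) / rng.gamma(shape=b, scale=1.0, size=n_draws)
    logs = np.array([log_post_gamma(Y, Z_gamma, c, tau2, nu) for c in c_draws])
    return np.logaddexp.reduce(logs) - np.log(n_draws)   # log of the MC average
```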

Computational Details
In this section we provide the sampling details. Instead of fixing $t_n$ and $m$, we place priors over them to make the procedure more flexible. Let $c_j = c$ for $j = 1, \ldots, p$, assume a g-prior (either GZS or GHG) $g(c)$ for $c$, and denote the priors for $t_n$ and $m$ by $p(t_n)$ and $p(m)$, respectively.
For convenience, we consider the flat priors $p(\gamma \mid t_n) \propto I(|\gamma| \le t_n)$, $p(t_n) \propto I(1 \le t_n \le T_n)$ for some prefixed positive integer $T_n$, and $p(m) \propto I(m_1 \le m \le m_2)$ for some fixed positive integers $m_1$ and $m_2$. For other choices of $p(t_n)$ and $p(m)$, the computational details in this section require corresponding modifications.
To complete the second step, we apply a nontrivial variation of the conventional blockwise technique (see [21,45]) to sample the $\beta_j$'s and $\gamma_j$'s, given an updated sample $m^{(q+1)}$. Note that the samples $\beta_j^{(q)}$ from the previous $q$-th iteration have dimension $m^{(q)}$, which might differ from $m^{(q+1)}$. This dimension mismatch makes the conventional blockwise sampling approach fail, since there is an underlying conflict between the current state of $m$ and the (conditioning) blocks from the previous iteration. Motivated by the "dimension-matching" technique in the reversible jump MCMC approach (see [18]), we modify the $\beta_j^{(q)}$'s to have dimension $m^{(q+1)}$, matching the current state of $m$. Specifically, if $m^{(q+1)} < m^{(q)}$, define $\tilde\beta_j^{(q)}$ to be the $m^{(q+1)}$-dimensional vector consisting of the first $m^{(q+1)}$ elements of $\beta_j^{(q)}$. If $m^{(q+1)} > m^{(q)}$, define $\tilde\beta_j^{(q)} = ((\beta_j^{(q)})^T, 0_{m^{(q+1)}-m^{(q)}}^T)^T$, where $0_h$ denotes the $h$-dimensional zero vector; that is, $\tilde\beta_j^{(q)}$ is $m^{(q+1)}$-dimensional, with its first $m^{(q)}$ elements exactly those of $\beta_j^{(q)}$ and the remaining $m^{(q+1)} - m^{(q)}$ elements zero. If $m^{(q+1)} = m^{(q)}$, set $\tilde\beta_j^{(q)} = \beta_j^{(q)}$. Repeating this procedure for all $j = 1, \ldots, p$ yields the $\tilde\beta_j^{(q)}$'s, a "modified" set of samples from the previous stage.

Suppose we have updated samples $(\beta_{j'}^{(q+1)}, \gamma_{j'}^{(q+1)})$ for $j' = 1, \ldots, j-1$ and modified samples $(\tilde\beta_{j'}^{(q)}, \gamma_{j'}^{(q)})$ for $j' = j+1, \ldots, p$, and let $b_j = (\beta_j, \gamma_j)$, where $\beta_j$ is an $m^{(q+1)}$-dimensional nominal vector and both $\beta_j$ and $\gamma_j$ are to be updated. For convenience, define $b_{-j} = \{b_{j'}, j' = 1, \ldots, p, j' \neq j\}$ to be the conditioning blocks. The full conditional of $b_j$ given $b_{-j}$ and the other variables depends strongly on the size of $\gamma_{-j} = (\gamma_1^{(q+1)}, \ldots, \gamma_{j-1}^{(q+1)}, \gamma_{j+1}^{(q)}, \ldots, \gamma_p^{(q)})$. Specifically, for effective sampling, $|\gamma_{-j}|$ cannot exceed $t_n^{(q)}$, since otherwise the block $b_j$ has zero posterior probability. When $|\gamma_{-j}| = t_n^{(q)}$, $\gamma_j^{(q+1)}$ has to be zero, since otherwise the conditional probability becomes zero; in this case, one simply sets $\beta_j^{(q+1)} = 0$. Define $Z_{-j}^{(q+1)} = (Z_{j'}^{(q+1)}, j' \neq j)$, an $n \times m^{(q+1)}(p-1)$ matrix, and similarly define $\tilde\beta_{-j}^{(q)}$ to be the $m^{(q+1)}(p-1)$-vector formed by the conditioning blocks. From (4.1) we obtain the full conditional of $b_j$ with $\gamma_j = 1$ in (4.4); integrating out $\beta_j$ in (4.4) yields (4.5). Similarly, (4.1) gives the full conditional with $\gamma_j = 0$, which is proportional to $\delta_0(\beta_j)$ (4.6); integrating out $\beta_j$ in (4.6) yields (4.7). Consequently, from (4.5) and (4.7) we draw $\gamma_j^{(q+1)}$ from a Bernoulli distribution whose success probability involves $\theta_j = \det(Q_j)$, and given $\gamma_j^{(q+1)} = 1$, it can be shown from (4.4) and (4.6) that $\beta_j^{(q+1)}$ is drawn from the Gaussian full conditional in (4.9).

In the above procedure, forming the matrix product $Z_{-j}^{(q+1)}\tilde\beta_{-j}^{(q)}$ is a time-consuming step. It can be avoided by iteratively using the relation (4.10): in practice, one only needs to compute $Z_{-1}^{(q+1)}\tilde\beta_{-1}^{(q)}$, since the subsequent products can be updated iteratively through (4.10).
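The dimension-matching modification of $\beta_j^{(q)}$ described above amounts to truncation or zero-padding; a minimal sketch (the function name is ours):

```python
import numpy as np

def dimension_match(beta_prev, m_new):
    """Adapt a previous draw beta_j^(q) (length m^(q)) to the current
    truncation level m^(q+1): truncate, zero-pad, or leave unchanged."""
    m_old = len(beta_prev)
    if m_new < m_old:
        return beta_prev[:m_new]                    # keep the first m_new entries
    if m_new > m_old:
        return np.concatenate([beta_prev, np.zeros(m_new - m_old)])  # pad with 0s
    return beta_prev
```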
The proposed blockwise sampling scheme (4.8)-(4.9) can be viewed as a generalization of [21] from $m = 1$ (no group structure) to general $m$ (with group structure). The generalization is nontrivial because we allow $m$, the dimension of $\beta_j$, to change across consecutive iterations.
When updating $b_j$ given blocks from the previous iteration whose dimensions may differ from the current value of $m$, we apply the "dimension-matching" technique to the previous samples of the $\beta_j$'s so that they share the dimension of the current $m$. One can then apply the conventional blockwise techniques to update the blocks consecutively. Note that when $m$ does not change across iterations, the "dimension-matching" step is unnecessary.
Furthermore, the proposed blockwise technique applies only in the constrained situation, i.e., when $|\gamma_{-j}| \le t_n^{(q)}$; it is essentially a constrained version (with group structure) of the conventional blockwise sampling approaches.

Sampling $\sigma^2$. From (4.1), it is easily seen that the full conditional of $\sigma^2$ is an inverse Gamma distribution $IG(a, b)$, whose shape and scale follow from (2.2)-(2.4) by direct calculation. Denote by $\sigma^{2(q+1)}$ the updated sample.

Sampling $c$. When $g(c)$ is the GZS prior specified in (3.7), we can use a Gibbs sampling step to draw $c^{(q+1)}$; indeed, the full conditional of $c$ is again of inverse Gamma form. When $g(c)$ is the GHG prior specified in (3.8), a Metropolis-Hastings step is needed.
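A sketch of the Gibbs update for $\sigma^2$, with shape and scale implied by our reconstruction of the hierarchy (2.2)-(2.4) rather than quoted from the paper:

```python
import numpy as np

def draw_sigma2(Y, Z_gamma, beta_gamma, sigma_diag, nu, rng):
    """Gibbs update of sigma^2 from its inverse-gamma full conditional;
    sigma_diag holds the diagonal of Sigma_gamma (the slab prior variances)."""
    resid = Y - (Z_gamma @ beta_gamma) if beta_gamma.size else Y
    shape = (len(Y) + beta_gamma.size + nu) / 2.0
    scale = (resid @ resid + np.sum(beta_gamma ** 2 / sigma_diag) + 1.0) / 2.0
    return scale / rng.gamma(shape)                  # IG(shape, scale) draw
```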

Numerical Study
In this section we demonstrate the performance of the proposed method through empirical studies.
Specifically, we compare our Bayesian methods based on the GZS and GHG priors, denoted BGZS and BGHG respectively, with the iterative nonparametric independence screening combined with penGAM, denoted INIS-penGAM, and its greedy modification, denoted g-INIS-penGAM, both proposed by [11]. Other well-known approaches include the penalized method for additive models (penGAM) proposed by [34], and the iterative sure independence screening (ISIS) combined with SCAD proposed by [12,15]; see [11] for numerical details. We adopted two simulation settings considered by [24,11] in the following examples, in which $p = 1000$ and $n = 400$. We chose, somewhat arbitrarily, the hyperparameter $\nu = 6$ in the prior (2.4). In both the GZS and GHG priors defined by (3.7) and (3.8), we chose $b = 0$. To see how sensitive the results are to the choice of $\mu$, we considered different values of $\mu$. The test functions $f_1$, $f_2$, $f_3$ are as in [24,11], and $f_4(x) = 0.1\sin(2\pi x) + 0.2\cos(2\pi x) + 0.3\sin^2(2\pi x) + 0.4\cos^3(2\pi x) + 0.5\sin^3(2\pi x)$.
Example 5.2. We adopted the simulation setting of Example 4 in [11]. This example is more challenging in that it contains more true functions than Example 5.1. Specifically, the data were generated from the additive model of that example with $\epsilon \sim N(0, 1)$, and the covariates $X_j$ were generated as in Example 5.1.
In Examples 5.1 and 5.2, [11] used five spline basis functions to represent the nonparametric functions. In the present paper we considered both the Legendre polynomial basis and the trigonometric polynomial basis. In both cases, we chose $m_1 = 4$ and $m_2 = 6$, so that the number of basis functions $m$ varies around 5 to enhance flexibility. We used $\mu = 0.5, 0.6, 0.8$ for the Legendre basis and $\mu = 0.8, 0.9, 1.1$ for the trigonometric basis. Under a model $\gamma$, there are at most $m|\gamma|$ nonzero Fourier coefficients; in the present setup, this quantity is bounded by $mT_n$. We chose $T_n = [n/(3m)]$ so that the maximum number of nonzero coefficients does not exceed $n/3$. In [10,29] it was shown that the number of nonzero coefficients cannot exceed $n/2$ for uniqueness of the solution in sparse recovery; here we reduced the upper bound to $n/3$ to obtain sparser solutions. For the GHG prior, we chose $\sigma_\kappa^2 = 0.2$ for the MH update of $\kappa$ in sampling $c$; see Section 4 for details.
Recall that the Fourier coefficient vector $\beta_j$ may change dimension across iterations, the so-called trans-dimensional problem. The resulting chains may thus include components of varying dimension.
It is well known in the literature that classic approaches to convergence diagnostics may fail in this situation. Following [19], we used the chains of MSE, a natural scalar statistic, to monitor MCMC convergence of the Fourier coefficients, which circumvents the trans-dimensional problem.
Although we are aware that such a scalar statistic cannot guarantee convergence of the full chains, its computational convenience is attractive. Moreover, the scope of the current paper focuses on selection and estimation, for which monitoring convergence of the MSE chains is believed to be a reasonable strategy. In our study we used the Gelman-Rubin statistic (see [17]) to monitor convergence of the chains relating to MSE and the remaining parameters. After confirming convergence, we dropped the first half of the posterior samples as burn-in and used only the second half to conduct the statistical procedures.
We report the average number of true positives (TP), the average number of false positives (FP), and the prediction errors (PE) based on BGZS and BGHG, compared with INIS and g-INIS. A marginal inclusion rule is adopted to select the model: the $j$-th variable is selected if its posterior exclusion probability $P_j = 1 - p(\gamma_j = 1 \mid D_n)$ satisfies $P_j \le \bar p$ for some threshold $\bar p \in (0, 1)$. We chose $\bar p = 0.5$ to yield median probability models; see [1]. TP/FP is the number of true/false inclusions in the selected model. The PE was calculated as $\sum_{q=1}^Q \|Y - \hat Y^{(q)}\|^2/(nQ)$, where $\hat Y^{(q)} = Z^{(q)}\beta^{(q)}$ is the fitted response from the $q$-th iteration; in other words, PE is the average of the mean squared errors (MSE) over the iterations.
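For concreteness, PE as defined above can be computed from the stored fitted values as follows (the helper name is ours):

```python
import numpy as np

def prediction_error(Y, fitted_draws):
    """PE: the MSE averaged over iterations, where fitted_draws[q] holds the
    fitted response Z^(q) beta^(q) from iteration q."""
    Q = len(fitted_draws)
    return sum(np.sum((Y - Yq) ** 2) for Yq in fitted_draws) / (len(Y) * Q)
```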
Results on TP, FP and PE using BGZS and BGHG are summarized in Tables 1-2 and Tables 3-4, based on the Legendre polynomial basis and the trigonometric polynomial basis, respectively. Results on INIS and g-INIS are quoted from [11].

Conclusions
A fully Bayesian approach is proposed for ultrahigh-dimensional nonparametric additive models, and its theoretical properties are carefully studied. The numerical results demonstrate satisfactory performance of the method in terms of selection and estimation accuracy. The method achieves high accuracy with both the Legendre polynomial basis and the trigonometric polynomial basis; therefore, basis selection is not a critically important issue for the proposed approach. To keep the approach highly accurate, however, the choice of the hyperparameter $\mu$ in the proposed g-priors should differ slightly across bases: we recommend $\mu \in [0.5, 0.8]$ for the Legendre polynomial basis and $\mu \in [0.8, 1.1]$ for the trigonometric polynomial basis. Values outside these ranges were found to lower the accuracy only slightly, within an acceptable range.
Acknowledgments: Zuofeng Shang was a postdoctoral researcher supported by NSF-DMS 0808864, NSF-EAGER 1249316, a gift from Microsoft, a gift from Google, and the PI's salary recovery account. Ping Li is partially supported by ONR-N000141310261 and NSF-BigData 1249316.

Appendix: Proofs
To prove Theorem 3.2, we need the following preliminary lemma. Its proof is similar to that of Lemma 1 in [40] and is thus omitted.
By Weyl's inequality on eigenvalues (see [22]) and by (7.3), one can choose a small $c_0 > 0$ to satisfy (3.2), which completes the proof. Using arguments similar to the proof of Proposition 2.1 in [38], it can be shown that (3.2) implies Assumption A.1. The details are straightforward and thus omitted.

Proof of Theorem 3.2
Denote $\beta_j^0 = (\beta_{j1}^0, \ldots, \beta_{jm}^0)^T$ for $j = 1, \ldots, p$, and define $k_n = \sum_{j\in\gamma^0}\|\beta_j^0\|^2$ and $\psi_n = \min_{j\in\gamma^0}\|\beta_j^0\|$. Before giving the proof of Theorem 3.2, we mention that Assumption A.3 is actually equivalent to the following Assumption A.4, which imposes growth rates on terms involving the Fourier coefficients of the partial Fourier series, i.e., $k_n$ and $\psi_n$. The difference between Assumptions A.3 and A.4 is that $l_n$ and $\theta_n$ in the former are replaced by $k_n$ and $\psi_n$ in the latter, respectively. This modified assumption is easier to use in the technical proofs.

Assumption A.4. There exists a positive sequence $\{h_m, m \ge 1\}$ such that $h_m$ decreasingly converges to zero and $m h_m$ increasingly diverges to $\infty$, together with the rate conditions of Assumption A.3 stated in terms of $k_n$ and $\psi_n$.

To see the equivalence, it follows directly from (3.1) that, uniformly for $m \in [m_1, m_2]$,
$$\max_{j\in\gamma^0}\big|\,\|\beta_j^0\|^2 - \|f_j^0\|_j^2\,\big| \le C_\beta^2\, m^{-a}. \qquad (7.4)$$
On the other hand, for any $j \in \gamma^0$ and any $m \in [m_1, m_2]$, Parseval's identity gives
$$\|f_j^0\|_j^2 = \|\beta_j^0\|^2 + \sum_{l=m+1}^\infty |\beta_{jl}^0|^2. \qquad (7.5)$$
By (7.4) and (7.5) and direct examination, Assumption A.4 is equivalent to Assumption A.3. We will prove the desired theorem based on the equivalent Assumptions A.1, A.2 and A.4.
Finally, we analyze the term $J_2$. Using the proof of Lemma A.2 in [38], it can be shown that, for any $c_j \in [\phi_n, \bar\phi_n]$ and $m \in [m_1, m_2]$, the first inequality in (7.8) holds for any $\gamma \in S_1(t_n)$, and the second holds for any $\gamma \in S_2(t_n)$. To make the proofs more readable, we give a brief proof of (7.8). By the determinant formula (see [43]), Assumption A.1 and straightforward calculations, we have
Taking logarithms on both sides, we get the first inequality in (7.8). When $\gamma \in S_2(t_n)$, since $\det(W_\gamma) \ge 1$, the second inequality in (7.8) follows. To finish the proof, we note that, based on the above approximations of $J_1$ to $J_5$, there exist large positive constants $C$ and $N$ such that when $n \ge N$, with probability approaching one, for any $c_j \in [\phi_n, \bar\phi_n]$ and $m \in [m_1, m_2]$, the required bound holds,
where the last limit follows by Assumption A.4 (2). This completes the proof of Theorem 3.2.
Before proving Theorem 3.4, we need the following lemma. Its proof is similar to that of Lemma 2 in [40] and is thus omitted.

$$\sup_{c_j\in[\phi_n,\bar\phi_n],\ m\in[m_1,m_2]} \big|\,p(\gamma^0 \mid D_n) - 1\,\big| \to 0 \quad \text{in probability, as } n \to \infty. \qquad (3.3)$$
Theorem 3.2 says that under mild conditions the posterior probability of the true model converges to one in probability. This means that, with probability approaching one, our Bayesian method selects the true model, which guarantees the validity of the proposed approach. Here, convergence holds uniformly over $c_j \in [\phi_n, \bar\phi_n]$ and $m \in [m_1, m_2]$; that is, the selection result is insensitive to the choice of the $c_j$'s and $m$ when they belong to suitable ranges.

Table 1
Simulation results of Example 5.1 using Legendre polynomial basis.

In Example 5.1, both BGZS and BGHG yield smaller FP and PE, and comparable TP. In Example 5.2, where the trigonometric basis was used, the performance is not as good as with the Legendre polynomial basis, but is still satisfactory. Specifically, when $\rho = 1$ and $\mu = 0.8$, both BGZS and BGHG yield slightly larger TP and FP than INIS and g-INIS (implying less conservative selection), and when $\mu = 1.1$, both methods yield slightly smaller TP and FP (implying more conservative selection); when $\rho = 0$ and $\mu = 0.8$ or $0.9$, both BGZS and BGHG select all the significant variables, though with slightly larger FP. In all cases, the proposed Bayesian methods yield smaller PE. These results are not sensitive to the choice of $\mu$, though certain values of $\mu$ may yield slightly better performance. Due to the essentially different basis structures, the feasible ranges of $\mu$ differ slightly: at least in the above examples, $\mu \in [0.5, 0.8]$ and $\mu \in [0.8, 1.1]$ are feasible ranges for the Legendre polynomial basis and the trigonometric polynomial basis, respectively. Any choice of $\mu$ within these ranges provides satisfactory results; values outside them may slightly lower the accuracy.

Table 2
Simulation results of Example 5.2 using Legendre polynomial basis.

Table 3
Simulation results of Example 5.1 using trigonometric polynomial basis.

Table 4
Simulation results of Example 5.2 using trigonometric polynomial basis.