Bayesian Multiple Quantile Regression for Linear Models Using a Score Likelihood

Abstract. We propose a score-based working likelihood function for quantile regression that can perform inference for an arbitrary number of conditional quantiles simultaneously. We show that the proposed likelihood can be used in a Bayesian framework leading to valid frequentist inference, whereas the commonly used asymmetric Laplace working likelihood leads to invalid interval estimates and requires further correction. For computation, we propose a novel adaptive importance sampling algorithm to compute important posterior summaries such as the posterior mean and the covariance matrix. Our proposed approach makes it feasible to perform valid inference for parameters such as slope differences at different quantile levels, which is either not possible or cumbersome using existing Bayesian approaches. Empirical results demonstrate that the proposed likelihood has good estimation and inferential properties and that the proposed computational algorithm is more efficient than its competitors.


Introduction
Quantile regression (QR), first introduced by Koenker and Bassett (1978), provides a more comprehensive description of the relationship between an outcome and a set of covariates of interest than the more commonly used conditional mean regression. It is also more robust against outliers. Suppose Y denotes the continuous response variable of interest and X denotes the p-dimensional covariate vector, with its first element equal to 1. For a given quantile level τ ∈ (0, 1), the linear quantile regression model assumes the τth conditional quantile function of Y given X = x to be

Q_Y(τ | x) = x^T β(τ),

where β(τ) is the regression quantile vector of length p, which depends on the quantile level of interest τ. By allowing the slopes to change with τ, quantile regression allows for a much more general conditional relationship between Y and X (Koenker, 2005).
For a given sample {(y_i, x_i)}_{i=1,...,n}, the standard quantile regression estimator (Koenker and Bassett, 1978) is given by

β̂(τ) = argmin_β Σ_{i=1}^n ρ_τ(y_i − x_i^T β),    (1.1)

where ρ_τ(u) = u(τ − I(u < 0)) is the asymmetric check loss function. Inference on the quantile regression parameters can be carried out based on the asymptotic normality results of Koenker and Bassett (1978) and Koenker (2005). This requires estimating the joint covariance matrix of the estimators at two quantile levels, which involves density estimation that can be sensitive to bandwidth selection. Bayesian approaches have an advantage: they can automatically estimate the variance and, more generally, the posterior distribution of the parameters, from which credible intervals can be constructed, without having to estimate densities (Yu and Moyeed, 2001; Yang and He, 2012; Yang et al., 2016).
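To make the check loss concrete, here is a minimal numerical sketch (in Python, with illustrative simulated data) verifying that, for an intercept-only model, minimizing the total check loss over a fine grid recovers the τth sample quantile:

```python
import numpy as np

def check_loss(u, tau):
    """Asymmetric check loss: rho_tau(u) = u * (tau - I(u < 0))."""
    return u * (tau - (u < 0))

rng = np.random.default_rng(0)
y = rng.normal(size=500)
tau = 0.25

# For an intercept-only model, the minimizer of the summed check loss
# over a fine grid should match the tau-th sample quantile of y.
grid = np.linspace(-3, 3, 6001)
total_loss = np.array([check_loss(y - b, tau).sum() for b in grid])
b_hat = grid[np.argmin(total_loss)]
```

The grid search is only for illustration; in practice the minimization is solved by linear programming.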
The main challenge for Bayesian methods is that the quantile regression model does not specify a parametric likelihood, which is required in the Bayesian framework. Yu and Moyeed (2001) proposed to use a working likelihood based on the Asymmetric Laplace Distribution (ALD), given by

L(Y | X, β) ∝ exp{−(1/σ) Σ_{i=1}^n ρ_τ(y_i − x_i^T β)},

where σ is a fixed scale parameter. The intuition behind this likelihood is that it incorporates the check loss into the likelihood function and is maximized at the standard QR estimator β̂(τ). This is called a working likelihood since the quantile regression problem does not specify any particular data generating process, and there is no "true" likelihood function. Computation of the posterior with the ALD likelihood is easy to implement using Gibbs sampling (Kozumi and Kobayashi, 2011) or Metropolis-Hastings algorithms. Sriram et al. (2013) showed that the resultant posterior exhibits posterior consistency properties. However, Sriram (2015) and Yang et al. (2016) noticed that the asymptotic variance of the posterior distribution based on the ALD likelihood is incorrect, implying invalidity of inference from a frequentist viewpoint. To remedy this, these authors proposed a correction to the posterior variance to obtain valid inference. It is worth noting that if one takes a subjective Bayesian perspective and truly believes that the data are generated from an asymmetric Laplace distribution, inference using the ALD likelihood is valid. However, that is a very strong assumption and is likely to be violated in most applications where quantile regression is of interest. We therefore take an objective Bayesian perspective, in which it is desirable for Bayesian procedures to have valid frequentist properties.
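The role of the fixed scale σ can be seen numerically: it rescales the spread of the ALD working likelihood but not its maximizer, which is why point estimation is unaffected while interval estimation is not. A small sketch (illustrative data, intercept-only median model):

```python
import numpy as np

def check_loss(u, tau):
    return u * (tau - (u < 0))

def ald_loglik(y, b, tau, sigma):
    """Log ALD working likelihood (up to constants) for an intercept-only model."""
    return -check_loss(y - b, tau).sum() / sigma

rng = np.random.default_rng(1)
y = rng.normal(size=400)
tau = 0.5
grid = np.linspace(-1, 1, 2001)

# The maximizer is the same grid point for every sigma > 0,
# even though sigma changes the curvature (and hence the implied variance).
modes = []
for sigma in (0.5, 1.0, 2.0):
    ll = np.array([ald_loglik(y, b, tau, sigma) for b in grid])
    modes.append(grid[np.argmax(ll)])
```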
To test for differences between slopes at two quantile levels τ_1 and τ_2 in the Bayesian framework, we can reparametrize the quantile regression model with θ = β_1(τ_1) − β_1(τ_2), after which testing the slope difference reduces to testing whether θ = 0. If zero is not included in a credible interval based on the posterior distribution of θ, we reject the null hypothesis. This procedure requires a valid posterior distribution that automatically captures the joint uncertainty of the regression quantiles at multiple levels. Sriram et al. (2016) proposed a working likelihood that models multiple quantile levels as an extension of the ALD likelihood and showed that it provides consistent estimation of the regression quantiles. However, it cannot quantify the joint uncertainty across multiple quantile levels, and inference based on this posterior is not valid, as discussed earlier. Although the correction proposed by Yang et al. (2016) can be generalized to this case, it requires fitting two different likelihoods that are not compatible with each other, and it is also unattractive because it does not utilize the information shared between nearby quantile levels.
Bayesian methods for simultaneously modeling conditional quantiles at multiple levels τ_1, τ_2, . . . , τ_m have also been considered. For estimating multiple unconditional quantiles in a Bayesian framework, Lavine (1995) proposed a multinomial substitution likelihood. Dunson and Taylor (2005) extended Lavine (1995)'s idea and proposed a generalization of this substitution likelihood for the regression case with categorical covariates having a limited number of categories. However, this likelihood does not generalize to the usual quantile regression setting with continuous predictors. Several existing Bayesian approaches accommodate continuous predictors, but a majority of them face computational or inferential challenges in modeling a finite collection of quantile levels, which forms a major motivation for our current work.
Lancaster and Jae Jun (2010) used a Bayesian exponentially tilted empirical likelihood and provided an explicit form for the posterior density. Yang and He (2012) proposed a Bayesian empirical likelihood method which also allows simultaneous estimation of multiple quantiles. This approach provides valid posterior inference, unlike the ALD-based approach, and is applicable to modeling quantile slope differences. However, empirical likelihood based methods are computationally demanding even in modest dimensions. Feng et al. (2015) proposed an approximate likelihood based on a linearly interpolated density targeting multiple quantile regression, but they require the number of quantiles to go to infinity, which is not efficient when we are only interested in a finite collection of quantile levels. If the quantile regression model is assumed to hold at all quantile levels, a global likelihood for all the regression quantiles can be specified, because the conditional density can be written as a function of the conditional quantiles (Reich et al., 2011; Yang and Tokdar, 2017). Reich et al. (2011) proposed a model-based spatial quantile regression method within the Bayesian framework by assuming a global quantile regression model at all quantile levels. Using a Bernstein basis polynomial representation, their approach simultaneously estimates the regression quantile process, which is computationally expensive, especially when we are only interested in estimating a few quantile levels. Moreover, unlike our proposed approach, this approach does not ensure validity of the coverage of credible intervals. Yang and Tokdar (2017) proposed a semiparametric Bayesian quantile regression method that places Gaussian process prior distributions on the function-valued parameters. However, the resultant estimators are not necessarily √n-consistent due to the semiparametric nature of the priors.
Furthermore, its likelihood evaluation is computationally intensive and not suitable when interest lies in only a few quantile levels, and it does not guarantee the frequentist validity of the resultant posterior intervals.
In this paper, we propose a working likelihood based on the score function for implementing quantile regression at multiple quantile levels, together with a new adaptive importance sampling algorithm for its computation. The proposed likelihood can be used in Bayesian settings, and we provide theoretical justification for this approach by showing that the corresponding posterior provides valid frequentist inference for multiple quantile levels. With the ability to incorporate multiple quantiles into a single likelihood, we can perform inference on parameters such as slope differences directly, which is not easy to do using existing Bayesian approaches. Compared to frequentist approaches, our formulation allows automatic quantification of the joint uncertainty of multiple regression quantiles and offers the flexibility to incorporate prior information, shrinkage, or smoothness in the regression quantile parameters through an appropriate prior distribution. Computationally, we propose an adaptive importance sampling algorithm, which does not require any optimization steps and is suitable for efficiently estimating multiple quantile levels together. Our empirical results show that the proposed importance sampling algorithm is substantially more efficient than the computational algorithms typically used with the ALD likelihood.
For the purpose of computing generalized method of moments (GMM) estimators, Chernozhukov and Hong (2003) proposed Markov chain Monte Carlo (MCMC) algorithms based on likelihoods constructed by exponentiating the objective function of interest. While the score likelihood advocated in this paper is considered by Chernozhukov and Hong (2003), the focus of our work is twofold: to generalize the score likelihood to perform valid inference for parameters involving multiple quantile levels, which is not feasible using existing Bayesian approaches, and to demonstrate that our proposed computational algorithm is more efficient in terms of effective sample size than the Metropolis-Hastings algorithm proposed by Chernozhukov and Hong (2003) for generic likelihoods.
The rest of the paper is structured as follows: Section 2 presents the formulation of our likelihood. Section 3 describes the proposed computational methods for computing the posterior and making inference based on the proposed likelihood. Section 4 demonstrates the performance of the proposed method and computational algorithms via simulation studies. Section 5 provides an application to a real dataset. Section 6 concludes.

Score Based Likelihood for Quantile Regression
We first focus on modeling a single quantile level. The score function for the quantile regression objective function in (1.1) is the p-dimensional vector

s_τ(β) = n^{−1/2} Σ_{i=1}^n x_i ψ_τ(y_i − x_i^T β),

where ψ_τ(u) = τ − I(u < 0). Since the quantile regression estimator β̂(τ) minimizes the quantile loss function, s_τ(β̂(τ)) ≈ 0. Consider the following working likelihood:

L(Y | X, β) = C exp{−(1/2) s_τ(β)^T W s_τ(β)},    (2.1)

where W is a p × p positive definite weight matrix and C is a constant free of β. We shall also use the short notation L(β) for L(Y | X, β) when it is viewed as a function of β.
Since the exponent in the likelihood is a negative quadratic form in the score, values of β for which s_τ(β) is close to zero correspond to high posterior density. As the expectation of the score function equals zero uniquely at the true regression quantile β_0(τ), we expect the posterior to concentrate around the truth; our Theorem 1 makes this intuition formal under regularity conditions. Quasi-posteriors for Bayesian inference have been considered in the literature; for example, we refer to Li and Jiang (2016) for quasi-posterior inference based on Bayesian generalized method of moments. Chernozhukov and Hong (2003) considered such working likelihoods in the context of computing frequentist estimators using MCMC algorithms.
The weight matrix we use in the likelihood is

W = {τ(1 − τ) · n^{−1} Σ_{i=1}^n x_i x_i^T}^{−1},

the inverse of the asymptotic covariance matrix of s_τ(β_0(τ)). Our theoretical results in Section 2.4 show that this choice of W leads to valid inference based on the posterior. Compared with the ALD likelihood, this allows us to carry out posterior inference directly, without having to correct the posterior covariance as suggested by Sriram (2015) and Yang et al. (2016).
The likelihood (2.1) does not arise from a distributional specification for Y given (X, β), and can therefore be viewed as a working likelihood. To visualize the score likelihood in (2.1), we provide a simple illustration for the univariate case without covariates. We generate samples from N(0, 1) and plot the score likelihood for the median in Figure 1, where the normalizing constant C is obtained by numerically integrating the likelihood over the interval [−0.5, 0.5], which is wide enough to give a good approximation. The ALD likelihood with σ = 1 is also provided for comparison. The dashed black curve in Figure 1 is the true asymptotic distribution for the median. Since all the likelihoods provide consistent estimation, we center all of them at zero to better compare their variances. The variance of the ALD likelihood differs from that of the true asymptotic distribution, as it is affected by the scale parameter σ. The score-based likelihood, on the other hand, is quite close to the asymptotic normal distribution, especially as the sample size increases.
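The Figure 1 computation can be reproduced along the following lines. This sketch (simulated N(0, 1) data; the grid and normalization choices are ours) evaluates the score likelihood for the median and normalizes it numerically:

```python
import numpy as np

def score(y, X, beta, tau):
    """s_tau(beta) = n^{-1/2} sum_i x_i * (tau - I(y_i - x_i' beta < 0))."""
    return X.T @ (tau - ((y - X @ beta) < 0)) / np.sqrt(len(y))

def log_score_lik(y, X, beta, tau):
    """Unnormalized log working likelihood -0.5 s' W s with
    W = {tau * (1 - tau) * (1/n) X'X}^{-1}."""
    W = np.linalg.inv(tau * (1 - tau) * (X.T @ X) / len(y))
    s = score(y, X, beta, tau)
    return -0.5 * s @ W @ s

rng = np.random.default_rng(2)
n, tau = 1000, 0.5
y = rng.normal(size=n)
X = np.ones((n, 1))                      # intercept-only: the median of y

grid = np.linspace(-0.5, 0.5, 1001)
ll = np.array([log_score_lik(y, X, np.array([b]), tau) for b in grid])
p = np.exp(ll - ll.max())
p /= p.sum()                             # discrete normalization over the grid
post_mean = (grid * p).sum()
post_sd = np.sqrt(((grid - post_mean) ** 2 * p).sum())
```

The resulting spread should be close to the asymptotic standard deviation √(τ(1 − τ))/(f(0)√n) ≈ 0.04 for this setup.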
The score likelihood (2.1) can also be generalized to model multiple conditional quantiles. Consider modeling m quantile levels τ_1 < τ_2 < . . . < τ_m at the same time, with true quantile regression parameters β_0 = (β_0(τ_1), β_0(τ_2), . . . , β_0(τ_m)). For any vector β = (β_1, β_2, . . . , β_m), define the score function to be s(β) = (s_{τ_1}(β_1), s_{τ_2}(β_2), . . . , s_{τ_m}(β_m)), where s_{τ_i}(·) is the score function corresponding to the quantile level τ_i for i = 1, . . . , m. The score-based likelihood for multiple quantile levels takes the form

L(Y | X, β) = C exp{−(1/2) s(β)^T W s(β)},

where W is an mp × mp positive definite weight matrix. The weight matrix in this case is given by

W = {A ⊗ (n^{−1} Σ_{i=1}^n x_i x_i^T)}^{−1},  with A_{jk} = τ_j ∧ τ_k − τ_j τ_k,

where ⊗ denotes the Kronecker product and x ∧ y = min(x, y). Our theoretical results show that this W leads to correct posterior inference. It is also natural to incorporate monotonicity of the conditional quantiles, so that the proposed likelihood becomes

L_c(Y | X, β) = C L(Y | X, β) I(x_i^T β_1 ≤ x_i^T β_2 ≤ . . . ≤ x_i^T β_m for all i).

We place the monotonicity constraints on the estimated quantiles only at the observed x-values. This implies monotonicity of the conditional quantiles at all covariate values belonging to the convex hull of the observed x's. Alternative approaches for achieving monotone quantile curves in the Bayesian context have been considered in the literature. For example, Rodrigues and Fan (2017) used a two-stage approach in which posterior samples from Bayesian quantile regression are made monotone by a post-hoc adjustment procedure. Yang and Tokdar (2017) proposed a more elegant way to impose monotonicity at all quantile levels τ ∈ (0, 1) by reparametrizing the quantile regression model. Both of these approaches, however, can be too restrictive in their constraints if we are only interested in a few quantile levels.
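The multi-level weight matrix can be assembled directly from its two factors. A short sketch (with simulated covariates), assuming A_jk = τ_j ∧ τ_k − τ_j τ_k as the quantile-level factor:

```python
import numpy as np

def multi_quantile_weight(X, taus):
    """W = {A kron D0_hat}^{-1}, where A_jk = min(tau_j, tau_k) - tau_j * tau_k
    and D0_hat = (1/n) sum_i x_i x_i'."""
    taus = np.asarray(taus)
    A = np.minimum.outer(taus, taus) - np.outer(taus, taus)
    D0_hat = X.T @ X / X.shape[0]
    return np.linalg.inv(np.kron(A, D0_hat))

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
W = multi_quantile_weight(X, [0.25, 0.5, 0.75])   # mp x mp with m = 3, p = 2
```

For distinct quantile levels A is positive definite (it is the covariance of a Brownian bridge at those levels), so W is a valid positive definite weight matrix.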

Prior Distributions
An advantage of the proposed Bayesian framework for quantile regression is that prior information or smoothness in the regression quantiles can be induced through appropriately specified prior distributions. When no additional information on the regression quantile parameters is available, a bounded uniform prior such as β ∈ [−n, n]^{mp} can be used. Such a prior ensures a proper posterior distribution; in practice, an improper uniform prior may also be used for simplicity. If the regression quantiles are expected to vary smoothly as a function of the quantile level τ, this can be induced by using prior distributions with high correlations between regression parameters corresponding to nearby quantile levels. For example, we can use a multivariate normal distribution with the covariance between β_1(τ_1) and β_1(τ_2) given by a smooth covariance kernel such as the squared exponential kernel

K(τ_1, τ_2) = σ² exp{−(τ_1 − τ_2)²/(2b²)},

where σ² and b are hyperparameters. For simplicity, the prior covariance between parameters of different predictors can be set to zero, as is common in the linear regression context (George and McCulloch, 1993; Narisetty and He, 2014). Such Gaussian priors have been used in quantile regression settings (Yang and Tokdar, 2017; Rodrigues and Fan, 2017) to induce smooth conditional quantiles. With the non-informative uniform prior or the Gaussian prior, the posterior is a proper distribution and can be used for inference.
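As an illustration, the prior covariance across quantile levels implied by the squared exponential kernel can be computed as follows (the hyperparameter values σ² = 1 and b = 0.2 are arbitrary choices for this sketch):

```python
import numpy as np

def se_prior_cov(taus, sigma2=1.0, b=0.2):
    """Squared exponential kernel K(t1, t2) = sigma2 * exp(-(t1 - t2)^2 / (2 b^2)),
    inducing high prior correlation between nearby quantile levels."""
    taus = np.asarray(taus)
    diff = np.subtract.outer(taus, taus)
    return sigma2 * np.exp(-diff ** 2 / (2 * b ** 2))

K = se_prior_cov([0.1, 0.25, 0.5, 0.75, 0.9])
# Nearby levels (0.1, 0.25) are far more correlated than distant ones (0.1, 0.9).
```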

Estimation of Contrasts in Quantile Regression
Using the score-based likelihood and a prior distribution on β, we can obtain the posterior distribution of any linear combination of the quantile regression parameters, and in particular of any contrast. For illustration, consider a simple quantile regression model in which we are interested in testing the slope difference θ = β_1(τ_1) − β_1(τ_2). If we assume a bounded uniform prior, the posterior of θ can be computed directly from the posterior samples of β_1(τ_1) and β_1(τ_2). More generally, if prior information is available only on the parameter θ, we can place a prior π(θ) on θ and a uniform prior on the remaining parameters; writing η for the remaining parameters after reparametrization, the posterior distribution of θ takes the form

π(θ | Y) ∝ π(θ) ∫ L(θ, η) dη.

Similarly, the proposed framework can be used to obtain posterior samples of any linear combination of the quantile regression coefficients across multiple quantile levels.
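Given joint posterior draws, inference on a contrast reduces to a per-draw linear combination. A sketch with mocked draws (the normal approximation below stands in for actual posterior samples and is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

# Mock joint posterior draws of (b0(t1), b1(t1), b0(t2), b1(t2));
# the mean and covariance here are illustrative placeholders.
mean = np.array([0.0, 1.0, 0.5, 1.2])
cov = 0.001 * np.eye(4)
samples = rng.multivariate_normal(mean, cov, size=5000)

theta = samples[:, 1] - samples[:, 3]        # slope difference beta_1(t1) - beta_1(t2)
ci = np.quantile(theta, [0.025, 0.975])      # 95% credible interval
reject_null = not (ci[0] <= 0.0 <= ci[1])    # reject H0: theta = 0 if 0 is excluded
```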

Posterior Concentration Results
In this section, we show that the score-based likelihood, together with either of the priors discussed above, leads to a posterior that is asymptotically normal with posterior variance matching the frequentist variance of the quantile regression estimator. Before stating the assumptions and theorems, we define some notation used in this paper.
Notation. For a sequence of random variables X_n and a sequence of constants a_n, the notation X_n = O_p(a_n) means: for any ε > 0, there exist M > 0 and N > 0 such that P(|X_n/a_n| > M) < ε whenever n > N. To obtain the asymptotic results, the following assumptions are made.
Assumption 1 (On the distribution of X). max_{1≤i≤n} ‖x_i‖ = O_p(n^{1/4}(log n)^{−1/2}) and E‖x_i‖⁴ < B for all i and some constant B.
Assumption 2 (On the data generation mechanism). The conditional quantile function of Y given X is linear, and the conditional distribution of Y given x_i is absolutely continuous with continuous density f_i(· | x_i) that is bounded away from 0 and uniformly bounded in a neighborhood of the conditional quantiles of interest.

Assumption 3 (On the existence of asymptotic limits).
n^{−1} Σ_{i=1}^n f_i(x_i^T β_0(τ_j) | x_i) x_i x_i^T → D_1(τ_j) for some matrix D_1(τ_j), for j = 1, 2, . . . , m, and the sample version of D_0 converges: n^{−1} Σ_{i=1}^n x_i x_i^T → D_0. We would like to point out that these are standard assumptions for quantile regression; similar conditions have been used, for instance, by He et al. (1996). We slightly modify the last two assumptions to accommodate our random design setting.
We further impose the following additional assumptions which appeared in Example 3 in Chernozhukov and Hong (2003). Assumptions A1-A3 are technical assumptions to ensure posterior concentration for score likelihood. Assumption A1 is regarding the identifiability of β 0 and Assumptions A2-A3 guarantee regularity of the score function as a function of β. Heuristically speaking, these conditions will be satisfied if the underlying conditional distribution of the response is sufficiently smooth which is usually the case in most practical applications.

Assumption A1. The norm of the expected score function, ‖E s(β)‖, is uniquely minimized at β = β_0.
The following theorem shows that our proposed likelihood leads to asymptotically valid frequentist inference for multiple quantile levels τ 1 , . . . , τ m at the same time.
where M n is any sequence diverging to infinity with n.
The first part of the theorem concerns the shape of the posterior around the true parameter; this result is analogous to the asymptotic results for frequentist quantile regression estimators (Koenker and Bassett, 1978; Koenker, 2005). The second part of the theorem ensures posterior concentration, which requires stronger technical conditions, as indicated by the results of Chernozhukov and Hong (2003) on which this part is based; we defer those assumptions to the supplementary material (Wu and Narisetty, 2020). For the second part, the prior needs to be supported on a fixed compact set, which is satisfied, for instance, by a uniform prior on a large fixed interval. One way to ensure this condition is to truncate any prior to a compact set. While we expect that this condition is not necessary, we require it for our theoretical results. In practice, priors whose support is not a fixed compact set, such as the Gaussian prior and the uniform prior on [−n, n], also work well, as observed in our simulation studies.
Notice that since we are in the Bayesian framework, it is possible to impose monotonicity constraints to avoid the crossing problem. Therefore, we only consider β_i's such that x_i^T β_1 ≤ x_i^T β_2 ≤ . . . ≤ x_i^T β_m at all observed x_i. Heuristically, since the crossing probability is small, the posterior distribution is not materially affected by these constraints. However, in small samples, our simulation results show that the proposed joint likelihood leads to smaller mean squared error compared with estimating the quantile levels separately using the frequentist approach.

Extension to Censored Quantile Regression
For censored data, quantile regression is preferred over conditional mean regression, since the latter is not necessarily identifiable without stringent distributional assumptions. Motivated by Powell's estimator (Powell, 1984, 1986), the proposed likelihood can be extended to censored quantile regression with fixed censoring. Consider a simple case with fixed left censoring at 0,

y_i = max(0, x_i^T β + u_i),

where u_i is the error term. Powell's estimator is defined as

β̂_P = argmin_β Σ_{i=1}^n ρ_τ(y_i − max(0, x_i^T β)),

and it is consistent and asymptotically normal with a sandwich-form covariance matrix. Define the score function for censored quantile regression to be

c_τ(β) = n^{−1/2} Σ_{i=1}^n I(x_i^T β > 0) x_i ψ_τ(y_i − x_i^T β).

We propose the following likelihood for Bayesian censored quantile regression:

L(Y | X, β) = C exp{−(1/2) c_τ(β)^T W(β) c_τ(β)},    (2.9)

where the weight matrix W(β) is given by (2.10). The motivation for this likelihood is that it is maximized at β̂_P, at which the score is approximately zero (Powell, 1984), so that the posterior is consistent. We shall also show that with our choice of the weight matrix, the resultant posterior is approximately normal with the correct variance. Unlike the uncensored case, notice that the weight matrix now depends on the parameter β, which makes the theoretical analysis more challenging. To obtain the asymptotic posterior concentration results for censored quantile regression, we need the following assumptions, similar to the ones used by Powell (1986).
We also assume that the finite sample version of M_0 converges to the true value.
We further impose the following technical assumptions, which are also made for Proposition 1 in Chernozhukov and Hong (2003).

Assumption A4. The norm of the expected score function, ‖E c_τ(β)‖, is uniquely minimized at β = β_0.
for some M_0(β), where M_0(β) is uniformly continuous and positive semidefinite for all β ∈ Θ.
These are technical conditions ensuring posterior concentration of the score likelihood for censored quantile regression. Assumption A4 concerns the identifiability of β_0, and Assumptions A5-A6 ensure regularity of the score function. Assumption A6 also imposes a distributional assumption on c_τ(β), which is the same as condition (iii) in Proposition 1 of Chernozhukov and Hong (2003).
The following theorem shows that our proposed likelihood asymptotically leads to valid frequentist inference for censored quantile regression.

Theorem 2. a) (Asymptotic Normality of Posterior Distribution)
Consider the proposed likelihood in (2.9) with the bounded uniform prior and the weight matrix W(β) given in (2.10). Under Assumptions 1, 2, and 4-6, for β such that ‖β − β_0‖ = O(n^{−1/2}), the posterior of β satisfies the asymptotic normal approximation, where M_n is any sequence diverging to infinity with n.
For the censored case as well, the posterior distribution concentrates and is asymptotically normal, centered at the true regression parameter β_0, with the correct covariance matrix. The asymptotic normal expansion of the posterior distribution is the same as the asymptotic distribution of Powell's estimator. While both approaches behave similarly asymptotically, the score likelihood based approach is better suited for computation, since Powell's objective function is highly non-convex (Womersley, 1986; Chernozhukov and Hong, 2003). Note that Chernozhukov and Hong (2003)'s approach differs from ours in that they used Powell's objective function in place of the score function in our proposed likelihood. In the next section, we discuss our proposed computational strategies.
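To illustrate Powell's censored objective described above, the following sketch (simulated data at the median level; the one-dimensional profile over the slope with the intercept held at its true value is our simplification for illustration) shows the loss is minimized near the true slope:

```python
import numpy as np

def powell_loss(beta, y, X, tau):
    """Powell's censored check loss: sum_i rho_tau(y_i - max(0, x_i' beta))
    for fixed left censoring at zero."""
    u = y - np.maximum(0.0, X @ beta)
    return np.sum(u * (tau - (u < 0)))

rng = np.random.default_rng(5)
n = 2000
X = np.column_stack([np.ones(n), rng.uniform(0, 4, size=n)])
beta_true = np.array([-1.0, 1.0])
y = np.maximum(0.0, X @ beta_true + rng.normal(size=n))   # left censored at 0

# Profile the loss over the slope, holding the intercept at its true value.
slopes = np.linspace(0.5, 1.5, 101)
losses = [powell_loss(np.array([-1.0, s]), y, X, 0.5) for s in slopes]
slope_hat = slopes[np.argmin(losses)]
```

Even this one-dimensional profile is non-convex in general, which is the computational difficulty the score likelihood sidesteps.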

Computational Methods
We propose an efficient computational method for computing the posterior corresponding to the proposed likelihood function. Since the posterior distribution is asymptotically normal, its mean and variance provide a good description of the distribution. We propose an adaptive importance sampling algorithm for computing the posterior mean and variance; the same approach can be used if higher moments of the posterior distribution are needed. To ensure a good effective sample size, we update the proposal distribution of the importance sampling algorithm iteratively. Importance sampling has been widely used as an alternative to Markov chain Monte Carlo, and its performance relies on a good proposal distribution (Robert and Casella, 2013). The essential idea of adaptive importance sampling is to update the proposal distribution iteratively to gain efficiency (Bugallo et al., 2017; Cornuet et al., 2012). Adaptive importance sampling methods are often used for computation in the contexts of mixture models (Cappé et al., 2008) and missing data models (Celeux et al., 2006).

Adaptive Importance Sampling Algorithm (Ada-IS) for a Single Quantile Level
We now describe our algorithm to compute the posterior mean and the covariance matrix. As the posterior distribution is asymptotically normal, it is reasonable to use a normal distribution as the proposal distribution. We use the results from a linear regression (or median regression) as an initialization, and use importance sampling to update the mean and the covariance matrix of the proposal distribution. The resultant estimators are expected to be better than the initialization and will be used as the new proposal to improve the efficiency of the algorithm. Updating the proposal just a few times turns out to be helpful in making the final proposal closer to the posterior distribution. The details of the algorithm are stated as follows.

Ada-IS Algorithm for a single quantile level:
(1) Fit a linear mean (or median) regression model, and let a_1 denote the fitted slope parameters. Let a_0(τ) be the τth sample quantile of y_i − x_i^T a_1, i = 1, 2, . . . , n, and set a = (a_0(τ), a_1). Use q(b) = N(a, Σ̂) as the initial proposal, where Σ̂ = c · τ(1 − τ)(Σ_{i=1}^n x_i x_i^T)^{−1} and c is a constant to be selected.
(2) Generate samples b^(1), b^(2), . . . , b^(M) from q(b) for some large M and calculate the importance weights w^(r) = L(b^(r)) π(b^(r)) / q(b^(r)), where π is the prior distribution for β. The effective sample size (ESS) is calculated as ESS = (Σ_r w^(r))² / Σ_r (w^(r))².

(3) The posterior mean can be estimated as μ̂_j = Σ_r w^(r) b_j^(r) / Σ_r w^(r), and the jkth entry of the posterior covariance matrix can be estimated as Σ̂_jk = Σ_r w^(r) (b_j^(r) − μ̂_j)(b_k^(r) − μ̂_k) / Σ_r w^(r).

(4) Use the mean and the covariance matrix from Step 3 to form a new normal proposal distribution and repeat Steps 2-4 a pre-specified number of times.
(5) Estimate the posterior mean and the covariance matrix by importance sampling from the final proposal distribution obtained in Step 4.

In Step 1, the motivation for the initialization of the covariance matrix is that when the densities f_i(x_i^T β(τ) | x_i) are approximately equal to a constant k, the true covariance matrix takes the form τ(1 − τ) k^{−2} (Σ_{i=1}^n x_i x_i^T)^{−1}. Initializing Σ̂ therefore amounts to finding a proper constant c. A larger value of c inflates the variance and makes the tails of the proposal distribution heavier, while a smaller c makes the tails lighter. In our empirical studies, we used the common value c = 1 and did not optimize it; in practice, one can try a few values in a pilot run and choose the one providing the largest effective sample size. It is also possible to use the estimates from quantreg as the initialization, which is expected to perform better than the least squares initialization; however, even with the least squares initialization, the proposed method performs well in most cases in terms of obtaining a good effective sample size. For Step 4, the algorithm could alternatively be made adaptive by continuing to update the proposal until a desired effective sample size is reached. In our implementation, we update the proposal a pre-specified number of times rather than aiming for a target effective sample size.
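A compact sketch of the Ada-IS loop follows (single quantile level, flat prior, least-squares initialization; the values of `n_update` and `M` and the intercept-adjustment detail are our illustrative choices, not prescriptions from the algorithm statement):

```python
import numpy as np

def score(y, X, beta, tau):
    """s_tau(beta) = n^{-1/2} sum_i x_i * (tau - I(y_i - x_i' beta < 0))."""
    return X.T @ (tau - ((y - X @ beta) < 0)) / np.sqrt(len(y))

def log_lik(y, X, beta, tau, W):
    s = score(y, X, beta, tau)
    return -0.5 * s @ W @ s

def ada_is(y, X, tau, n_update=3, M=4000, c=1.0, seed=0):
    """Adaptive importance sampling for a single quantile level (flat prior)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    W = np.linalg.inv(tau * (1 - tau) * (X.T @ X) / n)

    # Step 1: least-squares slopes; shift the intercept by the residual quantile.
    a = np.linalg.lstsq(X, y, rcond=None)[0]
    a[0] += np.quantile(y - X @ a, tau)
    mu, Sigma = a, c * tau * (1 - tau) * np.linalg.inv(X.T @ X)

    ess = 0.0
    for _ in range(n_update + 1):
        # Steps 2-3: draw from the normal proposal; self-normalized weights
        # (proposal normalizing constants cancel after normalization).
        b = rng.multivariate_normal(mu, Sigma, size=M)
        dev = b - mu
        log_q = -0.5 * np.einsum('ij,jk,ik->i', dev, np.linalg.inv(Sigma), dev)
        log_w = np.array([log_lik(y, X, bi, tau, W) for bi in b]) - log_q
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        ess = 1.0 / np.sum(w ** 2)
        # Step 4: refit the proposal with the weighted mean and covariance.
        mu = w @ b
        dev = b - mu
        Sigma = dev.T @ (w[:, None] * dev)
    return mu, Sigma, ess

rng = np.random.default_rng(10)
y = rng.normal(size=500)
X = np.ones((500, 1))
mu, Sigma, ess = ada_is(y, X, tau=0.5)   # posterior for the median of y
```

After a few proposal updates, the fitted proposal tracks the posterior closely, which is what yields the large effective sample sizes reported in our experiments.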
For small sample sizes, the posterior distribution can be heavy tailed, causing the posterior variance to be over-estimated. To improve finite sample performance, we can instead estimate the posterior variance from the posterior interquartile range, which is more robust. Simulation examples show that estimation based on the posterior interquartile range performs slightly better in finite samples than estimation based on the posterior variance. The importance sampling algorithm can be used to estimate the posterior quantiles and hence the interquartile range. We now present the procedure for estimating the standard deviation σ_1 of the first slope coefficient β_1 after obtaining samples from the proposal.
• Sort the samples b^(1), . . . , b^(M) corresponding to the parameter β_1, obtained from the importance sampling algorithm, in increasing order. For simplicity, denote the sorted values by b^(1) < b^(2) < . . . < b^(M) and the correspondingly reordered importance weights by w^(1), . . . , w^(M).
Then, an estimator of σ_1, obtained by matching the estimated interquartile range (IQR) with the asymptotic interquartile range, is σ̂_1 = {Q̂(0.75) − Q̂(0.25)}/1.349, where Q̂(q) is the weighted posterior quantile, i.e., the sorted value at which the normalized cumulative weight first reaches q, and the factor 1.349 is the interquartile range of the standard normal distribution.
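The IQR-based estimator can be implemented with a weighted quantile helper. A sketch, validated on the uniform-weight case, where the weighted IQR of N(0, 2²) draws should recover a standard deviation near 2:

```python
import numpy as np

def weighted_quantile(values, weights, q):
    """Quantile of the discrete distribution putting mass `weights` on `values`."""
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cdf = np.cumsum(w) / np.sum(w)
    return v[np.searchsorted(cdf, q)]

def iqr_sd(samples, weights):
    """Robust posterior sd: weighted IQR divided by 1.349, the IQR of N(0, 1)."""
    return (weighted_quantile(samples, weights, 0.75)
            - weighted_quantile(samples, weights, 0.25)) / 1.349

rng = np.random.default_rng(6)
draws = rng.normal(loc=0.0, scale=2.0, size=20000)
sd_hat = iqr_sd(draws, np.ones_like(draws))   # uniform-weight sanity check
```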
As suggested by a reviewer, another possibility for improving the effective sample size of the importance sampling algorithm in finite samples is to use a heavy tailed proposal distribution such as the t distribution. We have observed that this strategy works equally well in practice, and the difference between the two strategies is not substantial, since the Ada-IS algorithm updates the proposal distribution along the way.

Ada-IS Algorithm for Multiple Quantile Levels
For multiple quantile regression, it is more challenging to devise a proposal distribution which will yield good effective sample size. The difficulty is that it is not straightforward to obtain a good initialization for the joint covariance matrix of the multiple regression quantiles. To address this, we first obtain good estimators of the mean and the covariance matrix for each of the individual quantile levels using the Ada-IS algorithm for a single quantile level. This will give a good estimation for the block diagonal part for the joint covariance matrix. We propose a way to impute the off-diagonal blocks, which correspond to the covariances at different quantile levels, based on the diagonal blocks as described in the following algorithm.
(1) For each quantile level τ_i, i = 1, . . . , m, run the Ada-IS algorithm for a single quantile level to obtain the mean estimate μ̂_i and the covariance estimate Σ̂_i.
(2) Use an mp-dimensional multivariate normal proposal, with μ̂ = (μ̂_1, . . . , μ̂_m) as the initialization for the mean and (Σ̂_1, . . . , Σ̂_m) as the block-diagonal part of the joint covariance matrix. For the covariance block between the quantile levels τ_i and τ_j, we use
Σ̂_ij = γ (min(τ_i, τ_j) − τ_i τ_j) / √(τ_i(1 − τ_i) τ_j(1 − τ_j)) Σ̂_i^{1/2} Σ̂_j^{1/2}, (3.1)
where γ ≤ 1 is a value close to 1 chosen to make the covariance matrix Σ̂ positive definite.
(3) Generate samples from the proposal distribution and estimate the mean and covariance matrix as described in the Ada-IS algorithm for a single quantile. Use these estimates to form a new normal proposal distribution, as in Step 3 of the Ada-IS algorithm for a single quantile.
(4) Update the proposal a pre-specified number of times. Estimate the posterior mean and the covariance matrix using importance samples from the final proposal distribution from Step (3).
We now provide a heuristic intuition behind the imputation of the off-diagonal blocks of the covariance matrix in Step (2). Ideally, we would like to have
Σ̂_ij ≈ (min(τ_i, τ_j) − τ_i τ_j) D_1(τ_i)^{−1} D_0 D_1(τ_j)^{−1}.
We aim to approximate D_1(τ_i)^{−1} D_0 D_1(τ_j)^{−1} based on the approximation of the diagonal blocks: if two quantile levels are close, we would assume that D_1(τ_i) − D_1(τ_j) ≈ 0. Therefore, we can obtain the approximation
D_1(τ_i)^{−1} D_0 D_1(τ_j)^{−1} ≈ [D_1(τ_i)^{−1} D_0 D_1(τ_i)^{−1}]^{1/2} [D_1(τ_j)^{−1} D_0 D_1(τ_j)^{−1}]^{1/2} = [Σ̂_i/(τ_i(1 − τ_i))]^{1/2} [Σ̂_j/(τ_j(1 − τ_j))]^{1/2}.
The proposed imputation for Σ_ij given by (3.1) follows from this approximation. Notice that for modeling multiple quantile levels, if a sample from the normal proposal does not satisfy the monotonicity constraint, its weight is set to zero. This ensures that the estimated conditional quantiles are monotone. However, it may make the algorithm less efficient, especially when m × p, the number of coefficients to be estimated, is large and the quantile levels are very close to each other. The constraint can easily be removed from the algorithm if higher efficiency is desired.
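The assembly of the joint proposal covariance from the single-level estimates can be sketched as follows (a hypothetical Python sketch of Step (2); the helper names and the symmetrization of each imputed block are our own assumptions):

```python
import numpy as np

def psd_sqrt(S):
    """Symmetric PSD matrix square root via eigendecomposition."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(np.sqrt(np.clip(vals, 0, None))) @ vecs.T

def joint_covariance(taus, sigmas, gamma=0.9):
    """Assemble the mp x mp joint proposal covariance: diagonal
    blocks are the single-level estimates Sigma_i; off-diagonal
    blocks are imputed by scaling products of their matrix square
    roots, as in the approximation above."""
    m, p = len(taus), sigmas[0].shape[0]
    roots = [psd_sqrt(S) for S in sigmas]
    J = np.zeros((m * p, m * p))
    for i in range(m):
        for j in range(m):
            ti, tj = taus[i], taus[j]
            if i == j:
                block = sigmas[i]
            else:
                scale = gamma * (min(ti, tj) - ti * tj) / np.sqrt(
                    ti * (1 - ti) * tj * (1 - tj))
                B = scale * roots[i] @ roots[j]
                block = 0.5 * (B + B.T)  # symmetrize (our choice)
            J[i*p:(i+1)*p, j*p:(j+1)*p] = block
    return J
```

Because the scaling factor is symmetric in (i, j) and the square roots are symmetric matrices, the assembled joint matrix is symmetric; γ < 1 deflates the off-diagonal blocks toward positive definiteness.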

Simulation Study
In this section, we use simulated datasets to study the performance of our proposed approach. The data are generated from the linear model (4.1): for i = 1, · · · , n, the ε_i's are independent errors and the x_i's are n grid points equally spaced between 0 and 20. This data generating process satisfies the conditions assumed for the theoretical results. We choose n = 2000 to evaluate the asymptotic behavior, while n = 100 and 200 are used to evaluate the small sample performance. Each simulation study uses 1000 Monte Carlo replications to aggregate the results.
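For concreteness, a data-generating sketch of this form might look as follows (the intercept, slope, and N(0, 1) errors here are illustrative placeholders; the paper's exact specification is given in model (4.1)):

```python
import numpy as np

def simulate(n, beta0=1.0, beta1=1.0, seed=0):
    """Generate one dataset: x_i on an equally spaced grid over
    [0, 20] with independent errors. The coefficients and the
    N(0, 1) error distribution are illustrative placeholders."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0, 20, n)
    y = beta0 + beta1 * x + rng.standard_normal(n)
    return x, y
```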

Single Quantile Regression
We first consider median regression with τ = 0.5 and compare the following methods in terms of parameter estimation and computational efficiency.
(1) Importance sampling algorithm (IS): Use the initialization described in Step (1) of the Ada-IS algorithm for a single quantile with c = 1. In principle, one can choose a better c by searching over a small grid of values, but we use c = 1 as a default. Implement an importance sampling algorithm without updating the proposal.
(2) Adaptive importance sampling algorithm (Ada-IS): Implement the Ada-IS algorithm for a single quantile. Start with the same initialization as in (IS) and update the proposal three times, with 2000 samples generated each time, to obtain an informative proposal.
(3) Metropolis-Hastings algorithm (MH): This is a Metropolis-Hastings algorithm as proposed by Chernozhukov and Hong (2003), with a good proposal distribution already obtained from the Ada-IS algorithm. That is, the proposal is a multivariate normal distribution with the same mean and covariance matrix as the informative proposal in (2).
(4) Asymmetric Laplace likelihood (ALD): This method uses the asymmetric Laplace likelihood and implements a Metropolis-Hastings algorithm using the same proposal distribution as in (3). For estimating the variance, the correction proposed by Yang et al. (2016) is used.
We first consider the case with n = 2000 observations generated from the above model. For each algorithm, 10000 final samples are obtained to estimate the parameters and the variances. In Table 1, we report the mean squared error (MSE), bias, and variance of the estimators for the methods above. The bias is calculated based on the average over the 1000 Monte Carlo replications. The average of the estimated variances is compared with the true asymptotic variance, which is given by τ(1 − τ) D_1^{−1} D_0 D_1^{−1}. We also report the coverage of the 90% credible/confidence intervals and the effective sample size for all the methods. Quantile regression estimators using the R package quantreg are also provided for comparison.
The simulation results in Table 1 demonstrate the asymptotic validity of the score based likelihood and the efficiency of the proposed algorithm. The proposed score based likelihood provides good point estimation and has better mean squared error than the other methods. However, the estimated variances tend to be slightly larger than the asymptotic variance of the estimator, which also causes the coverage to be slightly above the nominal level. In terms of effective sample size, the proposed adaptive importance sampling algorithm has clearly superior performance, with an effective sample size of 9485, substantially improving upon the one-step importance sampling algorithm, whose effective sample size is 1089. Moreover, the Ada-IS algorithm for the score likelihood is much more efficient even than the commonly used Metropolis-Hastings algorithm with the ALD likelihood. Also, since importance sampling does not rely on the convergence of a Markov chain, it does not suffer from the potential convergence issues associated with the Metropolis-Hastings algorithm. In the later simulation studies, we use the adaptive importance sampling algorithm as the default algorithm for the proposed score likelihood.
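The effective sample size reported here is the standard importance-sampling diagnostic, computed from the weights as:

```python
import numpy as np

def effective_sample_size(weights):
    """Standard importance-sampling ESS: (sum w)^2 / (sum w^2).
    Equals M for uniform weights over M draws and decreases as
    the weights become more unequal."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()
```

For uniform weights over 10000 draws this returns 10000, matching the best case observed for the adaptive algorithm.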
We next investigate the finite sample performance of the proposed algorithm for smaller sample sizes, n = 100 and n = 200. As mentioned earlier, the posterior distribution can be heavy tailed for small sample sizes, so we use the interquartile range approach discussed in Section 3.1 to estimate the posterior variance. Table 2 compares the finite sample performance of our approach with the frequentist approach and the corrected ALD likelihood approach; the performance of the estimated variances and the corresponding coverage probabilities are reported. Notice that for small sample sizes, our score likelihood based approach tends to have coverage probabilities larger than the nominal level without the finite sample correction, while quantreg and the corrected ALD likelihood tend to under-cover. With the finite sample correction, the score likelihood provides better empirical coverage than the ALD likelihood. We provide further simulation results in the supplementary material (Wu and Narisetty, 2020) based on heavy tailed t distributions as proposals, which also perform well for small sample sizes (see Table A1 in the supplementary material).

Multiple Quantile Regression
To evaluate the performance of our method for multiple quantile regression, we choose five quantile levels that are relatively close to each other: τ_1 = 0.4, τ_2 = 0.45, τ_3 = 0.5, τ_4 = 0.55, and τ_5 = 0.6. This helps us evaluate whether the imposed monotonicity constraint results in higher efficiency. We use the same model given by (4.1) to generate n = 2000 samples and implement our Ada-IS algorithm for multiple quantiles. We use γ = 0.9 when imputing the off-diagonal blocks of the joint covariance matrix as a default, though a better choice can be made in practice by searching over a grid of values. Results from quantreg obtained separately for the different quantile levels and from Ada-IS without the monotonicity constraint are also presented.
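The monotonicity screening used by the joint algorithm can be sketched as follows (a hypothetical vectorized check; draws whose fitted quantiles cross at any design point would receive zero importance weight):

```python
import numpy as np

def monotone_mask(B, X):
    """B: array (M, m, p) of M joint proposal draws over m quantile
    levels; X: (n, p) design matrix. Returns a boolean mask that is
    True only for draws whose fitted quantiles are non-decreasing in
    tau at every design point."""
    # fitted quantiles for each draw, level, and design point: (M, m, n)
    Q = np.einsum('kmp,np->kmn', B, X)
    # require Q non-decreasing along the quantile-level axis everywhere
    return np.all(np.diff(Q, axis=1) >= 0, axis=(1, 2))
```

Weights of draws with `monotone_mask` equal to False are set to zero, which enforces non-crossing at the cost of effective sample size.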
The simulation results for multiple quantiles are shown in Table 3. For estimating the conditional median, the average posterior variance based on the joint estimation is smaller than the corresponding value in Table 1, which is based on a single quantile level. Also, the MSEs obtained using the default Ada-IS algorithm are smaller than those obtained without imposing the monotonicity constraint. These results provide empirical evidence that estimating multiple quantile levels jointly is better owing to the monotonicity constraints, at least when the model is correctly specified. On the other hand, the effective sample size is 1745 out of 10000 samples from the final proposal. This loss in effective sample size compared with separate estimation is expected due to the increased dimension and the monotonicity constraints, but it is not a major drawback: estimating the five quantile levels separately would require roughly five times the computation. Moreover, by jointly modeling multiple quantile levels, we obtain the joint covariance matrix across quantile levels, which is useful for making inference about parameters such as the slope differences at different quantile levels, as demonstrated in Section 4.4.

Simulation Results for Censored Quantile Regression
We now evaluate the performance of the score based likelihood for censored quantile regression. Consider median regression with fixed left censoring for the model (4.1). We use a sample size of n = 2000 and censoring times C = 10 and C = 15, corresponding to censoring rates of around 15% and 25%. We implement the Ada-IS algorithm for censored quantile regression and compare the results with Powell's estimator implemented using quantreg. The results are summarized in Table 4. The proposed likelihood provides good point estimation with a smaller mean squared error than Powell's estimator. Also, the coverage probabilities for Powell's estimator are much lower than the nominal 90% level, while the coverage probability using the score based likelihood is much closer to 90%. This suggests that the proposed method using the score likelihood has a clear advantage over Powell's approach for censored quantile regression.
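For reference, Powell's censored objective for a fixed left-censoring point C can be written down directly (a minimal sketch; the non-convexity of this function in β is what motivates the sampling-based computation discussed in the text):

```python
import numpy as np

def check_loss(u, tau):
    """Koenker-Bassett check function rho_tau(u) = u (tau - 1{u < 0})."""
    return u * (tau - (u < 0))

def powell_objective(beta, X, y, tau, C):
    """Powell's censored quantile objective for fixed left censoring
    at C: sum_i rho_tau(y_i - max(C, x_i' beta)). The max() inside
    the check loss makes this non-convex in beta."""
    fitted = np.maximum(C, X @ beta)
    return check_loss(y - fitted, tau).sum()
```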

Inference for Slope Differences
One advantage of the proposed likelihood is the ability to provide inference for functions of quantile regression coefficients at multiple quantile levels, such as the slope difference θ = β_1(0.75) − β_1(0.25) between the 75% and 25% levels. Existing methods based on the ALD likelihood do not provide valid inference on θ directly. The approach of Yang et al. (2016) requires two separate models to be fitted to estimate D_1(0.25) and D_1(0.75), which may potentially be used to obtain corrected inference for θ, but it does not allow placing a prior distribution directly on θ. Yang and Tokdar (2017)'s method may be used for inference on θ, but it requires modeling all the quantile levels globally, which is computationally intensive, and it does not guarantee the frequentist validity of the resultant posterior intervals.
On the other hand, the score based likelihood we propose provides the flexibility to place a prior distribution on θ directly. This allows either a bounded uniform prior, when no prior information is available, or an informative prior. In our simulation study, along with the uniform prior we consider two informative priors of the form θ ∼ N(μ, σ²), with (μ = 0, σ² = 1) and (μ = 1, σ² = 1). Denote the prior density of θ by π(θ) and place uniform priors on the other parameters β_1(0.25), β_0(0.75), and β_0(0.25). The likelihood is then modified under this reparameterization with the corresponding Jacobian transformation.
For computation, we can still use the Ada-IS algorithm, but we only need the marginal distribution of θ for performing inference on θ. Table 5 presents the simulation results for our approach using the three prior choices, along with the results for standard quantile regression using quantreg. Although an informative prior such as N(1, 1) is rarely available in practice, we include it to assess how much an informative prior would improve the performance. From the simulation results, we can see that all the methods provide reasonably good estimation, while the three score based methods have smaller mean squared error and smaller variance than quantreg. We see a slightly smaller mean squared error for the N(1, 1) informative prior, but the difference from the other prior choices is not substantial. Overall, our approach provides smaller mean squared error and higher coverage.
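To illustrate, the effect of a normal prior on θ can be sketched as a reweighting of importance draws (a hypothetical sketch assuming the draws were obtained under a flat prior on θ; the paper incorporates the prior into the working posterior directly):

```python
import numpy as np

def theta_posterior_mean(b1_low, b1_high, weights, mu=0.0, sd=1.0):
    """Reweight importance draws of (beta_1(0.25), beta_1(0.75)) by a
    N(mu, sd^2) prior on theta = beta_1(0.75) - beta_1(0.25). The
    flat-prior assumption on the original draws is illustrative."""
    theta = b1_high - b1_low
    prior = np.exp(-0.5 * ((theta - mu) / sd) ** 2)  # unnormalized density
    w = weights * prior
    return np.sum(w * theta) / np.sum(w)
```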

Application to Immunoglobulin-G Data
We now apply our methods to the immunoglobulin-G data set analyzed in Yu and Moyeed (2001). This data set contains the serum concentration (grams per litre) of immunoglobulin-G (IgG) in 298 children aged from 6 months to 6 years. Following Yu and Moyeed (2001), we use the IgG concentration as the response variable Y and consider a quadratic quantile regression model with the independent variable x being age. We estimate the regression parameters for τ = 0.25, 0.5, 0.75 jointly using a bounded uniform prior and a multivariate normal prior with zero mean and covariance matrix generated by the squared exponential kernel in expression (2.4), with σ² = 2 and b = 0.5. We also compare the results with quantreg; the results are summarized in Tables 6 and 7.
The results in Table 6 suggest that while the point estimators from the two approaches are quite similar, their variances differ. Our method (Ada-IS) gives a smaller variance at the quantile level τ = 0.25 but larger variances at the other quantile levels compared with quantreg. With the multivariate normal prior, the corresponding parameters at different quantile levels shrink toward similar values and have smaller variances than under the uniform prior. The quantile curves corresponding to the score likelihood approach with the multivariate normal prior and those from quantreg are plotted in Figure 2, which shows that they are quite similar to each other.
We also use our proposed score likelihood to test the hypotheses H_0: θ_i = β_i(0.75) − β_i(0.25) = 0 versus H_1: θ_i ≠ 0, for i = 1, 2. We consider a bounded uniform prior and a N(0, 1) prior for both θ_1 and θ_2. The estimated parameters, corresponding standard errors, and interval estimates for the linear and quadratic terms are shown in Table 7. We report the 90% credible intervals based on the computed posterior distributions for our approach and the 90% confidence intervals for quantreg. All the interval estimates for θ_1 contain zero, while neither of the intervals for θ_2 contains zero, indicating a significant difference in the quadratic effect across quantiles and no significant difference in the linear effect. This is a simple demonstration that the proposed likelihood can be used in a natural way to carry out inference for quantities such as slope differences in a Bayesian framework while incorporating prior information.

Conclusion
In this paper, we propose a score based likelihood for quantile regression that particularly targets modeling multiple quantile levels at the same time. We also propose the Ada-IS algorithm for posterior computation, which is empirically shown to have good efficiency in terms of effective sample size. The score based likelihood automatically and correctly estimates the variance of the regression quantiles and does not require estimation of the conditional densities as needed by frequentist methods, which is an extremely challenging task especially when the data are not independent and identically distributed. While the ALD likelihood also avoids estimation of the conditional densities, the posterior inference based on the score likelihood is valid without further adjustment. The score likelihood can naturally perform inference at multiple quantile levels and deal with the issue of quantile crossing more naturally than frequentist methods, which often resort to ad hoc strategies (He, 1997; Liu and Wu, 2009; Zou et al., 2008; Bondell et al., 2010). Performing valid Bayesian inference for functions of regression quantiles at multiple quantile levels, such as β(τ_1) − β(τ_2), is possible based on the proposed score likelihood, whereas it is not clear how the ALD based inference can be corrected in such situations. For the censored quantile regression case discussed in Section 2.5, the objective function is highly non-convex. The proposed likelihood provides a more efficient computational technique based on importance sampling that avoids direct optimization of the highly non-convex objective function, which is prone to getting stuck in local solutions (Womersley, 1986; Chernozhukov and Hong, 2003).
For the score based likelihood, the contribution of each observation to the likelihood is not multiplicative, as it is for the ALD likelihood. However, the multiplicative property is not necessary for valid inference, and commonly used likelihoods such as the empirical likelihood (Owen, 1988, 2001) are also not multiplicative. Yang et al. (2016) developed a Bayesian quantile regression approach based on the empirical likelihood and showed valid posterior properties.
One potential limitation of the proposed score likelihood is that it is not as straightforward to evaluate in an online manner, as streams of data arrive, as the ALD likelihood is. However, it is still possible to devise computationally efficient strategies that keep track of the score function and the weight matrix from the previous batch to reduce the computational time needed for online evaluation of the score likelihood. Development of such computationally efficient algorithms, which could generalize the proposed importance sampling algorithms to online settings, is an interesting direction for future study.
Another interesting direction for future research is to investigate whether the proposed techniques can be generalized to high dimensional settings, so that problems such as Bayesian variable selection for quantile regression can be addressed using the score likelihood.

Supplementary Material
Supplementary Material for "Bayesian multiple quantile regression for linear models using a score likelihood" (DOI: 10.1214/20-BA1217SUPP; .pdf). The online supplement contains the technical proofs for Theorems 1 and 2, and additional simulation results.