Quantile Regression by Dyadic CART

In this paper we propose and study a version of the Dyadic Classification and Regression Trees (DCART) estimator from Donoho (1997) for (fixed design) quantile regression in general dimensions. We refer to this proposed estimator as the QDCART estimator. Just like the mean regression version, we show that a) a fast dynamic programming based algorithm with computational complexity $O(N \log N)$ exists for computing the QDCART estimator and b) an oracle risk bound (trading off squared error and a complexity parameter of the true signal) holds for the QDCART estimator. This oracle risk bound then allows us to demonstrate that the QDCART estimator enjoys adaptively rate optimal estimation guarantees for piecewise constant and bounded variation function classes. In contrast to existing results for the DCART estimator, which require subgaussianity of the error distribution, our estimation guarantees hold without any restrictive tail decay assumptions on the error distribution. For instance, our results hold even when the error distribution has no first moment, such as the Cauchy distribution. Apart from the Dyadic CART method, we also consider variant methods such as the Optimal Regression Tree (ORT) estimator introduced in Chatterjee and Goswami (2019). In particular, we extend the ORT estimator to the quantile setting as well and establish that it enjoys analogous guarantees. Thus, this paper extends the scope of these globally optimal regression tree based methodologies to heavy tailed data. We then perform extensive numerical experiments on both simulated and real data which illustrate the usefulness of the proposed methods.


Introduction
We consider the problem of nonparametric quantile regression in general dimensions and specifically the setting of fixed/lattice design regression, or array denoising. In this setting, we are given an array of independent random variables y ∈ ℝ^{L_{d,n}}, where L_{d,n} is the d-dimensional square lattice or grid graph with nodes indexed by {1, . . ., n}^d. Fix a quantile level τ ∈ (0, 1). The goal is to estimate the true τ-quantile array θ*, where for all i ∈ L_{d,n},

θ*_i ∈ argmin_{a ∈ ℝ} E ρ_τ(y_i − a),

and ρ_τ(x) = max{τx, (τ − 1)x} is the usual piecewise linear convex quantile loss function. For example, when τ = 0.5, our setting amounts to estimating the true median array of the noisy array y. The model here can be called the quantile sequence model. This generalizes the usual Gaussian sequence model, where the quantile τ is taken to be 0.5 and the distribution of y is taken to be multivariate normal with covariance matrix a multiple of the identity. Assuming a lattice design is common practice for studying nonparametric regression estimators, and our setting is clearly relevant for image denoising and computer vision when d = 2 or 3. The problem of estimating the true signal θ* becomes meaningful when the true array satisfies some additional structure, so that the effective parameter size is much smaller even though the actual number of unknown parameters equals the sample size. Structured signal denoising is a standard problem and arises in several scientific disciplines; see, e.g., applications in computer vision (Bian et al., 2017; Wirges et al., 2018), medical imaging (Lang et al., 2014), and neuroscience (Tansey et al., 2018).
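To make the check loss concrete, the following sketch (our own illustration in Python, with hypothetical variable names, not taken from the paper) verifies numerically that the empirical τ-quantile minimizes the average check loss over constant fits.

```python
import numpy as np

def rho(x, tau):
    # Check loss: rho_tau(x) = max(tau * x, (tau - 1) * x)
    x = np.asarray(x, dtype=float)
    return np.maximum(tau * x, (tau - 1) * x)

# The empirical tau-quantile minimizes c -> sum_i rho_tau(y_i - c).
y = np.array([1.0, 2.0, 3.0, 7.0, 9.0])
tau = 0.5
grid = np.linspace(0.0, 10.0, 10001)            # candidate constants c
losses = np.array([rho(y - c, tau).sum() for c in grid])
best = grid[np.argmin(losses)]                  # close to the sample median 3.0
```

For τ = 0.5 the check loss is half the absolute loss, so the minimizer is the sample median; for other τ it tilts toward the corresponding quantile.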
In this paper we are interested in scenarios where θ* is (or is close to) a piecewise constant array on a rectangular partition of L_{d,n}. For mean regression, the Dyadic Classification and Regression Trees (DCART) method introduced in Donoho (1997) is known to be computationally efficient while achieving adaptively minimax rate optimal convergence for classes of signals which are piecewise constant on a rectangular partition of L_{d,n}; see Chatterjee and Goswami (2019) for a thorough study of the statistical adaptivity of DCART. However, since we are interested in quantile regression, we propose a quantile version of Dyadic CART.
The most natural way to define the quantile version of the Dyadic CART estimator is as follows: where we define k_rdp(θ) as the smallest natural number for which there exists a dyadic partition Π of L_{d,n} such that θ is constant on each element of Π and |Π| = k_rdp(θ). The estimator we propose and study in this article is a slightly modified version of the above estimator in (1). We refer to this proposed estimator as the QDCART estimator. The precise definition of our estimator and the meaning of a dyadic rectangular partition of L_{d,n} and the complexity parameter k_rdp(θ) will be given in Section 2. The usual mean regression version of the Dyadic CART estimator is a computationally feasible decision tree method proposed first in Donoho (1997) in the context of regression on a two-dimensional grid design. That estimator optimizes the same criterion as in (1), except that the quantile loss is replaced by the usual squared loss. Subsequent to Donoho (1997), several papers have used ideas related to dyadic partitioning for regression, classification and density estimation; see, e.g., Nowak et al. (2004), Scott and Nowak (2006), Blanchard et al. (2007), Willett and Nowak (2007). Recently, Chatterjee and Goswami (2019) generalized the Dyadic CART estimator to general dimensions and to higher orders and studied its ability to estimate piecewise constant signals of various types. Dyadic CART has also recently been used for recovering level sets of piecewise constant signals; see Padilla et al. (2021). It is fair to say that the two most important facts about the usual mean regression version of Dyadic CART are:

• The Dyadic CART estimator attains an oracle risk bound; see, e.g., Theorem 2.1 in Chatterjee and Goswami (2019). This oracle risk bound can then be used to show that the Dyadic CART estimator is nearly minimax rate optimal for several function classes of interest.

• The Dyadic CART estimator can be computed by a bottom-up dynamic program with computational complexity linear in the sample size; see Lemma 1.1 in Chatterjee and Goswami (2019).
These two properties make Dyadic CART a very attractive signal denoising method. However, the oracle risk bound satisfied by Dyadic CART is known to hold only under sub-Gaussian errors. A natural question is whether it is possible to define a version of Dyadic CART which satisfies a result like Theorem 2.1 in Chatterjee and Goswami (2019) without any tail decay assumptions on the error distribution, while still retaining essentially linear time computational complexity. This is the main question that motivated the research in this article and naturally led us to study a quantile regression version of Dyadic CART. The results in this paper answer this question in the affirmative. We now summarize our results.
• Theorem 1 gives an oracle risk bound for the QDCART estimator proposed in this paper. The advantage of our risk bound is that it holds under an extremely mild assumption (see Assumption 1 in Section 3) on the distribution of the error or noise variables. For example, our risk bound holds when the error distribution is heavy tailed like the Cauchy distribution, for which even the first moment does not exist. In contrast, Theorem 2.1 in Chatterjee and Goswami (2019) relies heavily on the subgaussian nature of the errors. Therefore, our main contribution here is to establish the robustness of the quantile version of Dyadic CART to heavy tailed errors. The result in Theorem 1 can be thought of as generalizing Theorem 2.1 in Chatterjee and Goswami (2019) to the heavy tailed setting.
• Once the oracle risk bound in Theorem 1 has been established, the arguments of Chatterjee and Goswami (2019) show how it automatically implies that the QDCART estimator is minimax rate optimal for several function/signal classes of interest. In particular, this opens the door for us to establish minimax rate optimality of our QDCART estimator over the spaces of piecewise constant and/or bounded variation arrays. We provide these results in Section 3.1. At the risk of reiterating, the state of the art mean regression estimators for estimating piecewise constant and/or bounded variation arrays typically require subgaussianity of the errors, while the QDCART estimator is robust to heavy tailed error distributions. A natural competing quantile regression estimator to QDCART is the Quantile Total Variation Denoising estimator studied in Padilla and Chatterjee (2020). Just as for the corresponding mean regression counterparts, we argue in Section 3.1 that the QDCART estimator has certain advantages over the Quantile Total Variation Denoising estimator, not least the fact that QDCART is computable in essentially linear time in any dimension, whereas Quantile Total Variation Denoising is not known to have linear time computational complexity in multivariate settings (d > 1).
• We explain in Section 3.3 that our proof technique for Theorem 1 can also be used to derive similar risk bounds for other variants of the QDCART estimator. For example, in Chatterjee and Goswami (2019) the Optimal Regression Tree (ORT) estimator was introduced and studied for mean regression. The ORT estimator is similar to the Dyadic CART estimator, with the same optimization objective function, except that the optimization is done over all decision trees or hierarchical partitions (not necessarily dyadic). It was then shown in Chatterjee and Goswami (2019) that this estimator attains a better risk bound than Dyadic CART in general. However, its computational complexity is worse and scales like O(N^{2+1/d}) in d dimensions, in contrast to the O(N) computational complexity of Dyadic CART. The proof techniques of this paper imply that a quantile version of the ORT estimator can be defined which enjoys the corresponding risk guarantee. We prefer to present our main results only for QDCART to keep the exposition short and because of its significantly better computational complexity.
• We give a bottom-up dynamic programming algorithm which exactly computes the QDCART estimator. This algorithm is similar to the original one proposed for the DCART estimator in Donoho (1997), suitably adapted to our setting. The computational complexity of our algorithm is O(N log N) (see Theorem 2), which is slightly slower than the O(N) computational complexity of the DCART estimator. This extra log factor seems unavoidable to us because of the need to compute and propagate quantiles of the various dyadic rectangles. Our algorithm is described in detail in Section 5.
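To convey the flavor of this dynamic program, here is a minimal one-dimensional sketch in Python (our own illustrative code, not the paper's algorithm): for each dyadic interval it compares the penalized cost of fitting a single empirical τ-quantile against the combined optimal costs of its two dyadic children. The minimum-size constraint γ is omitted for simplicity, and n is assumed to be a power of two.

```python
import numpy as np

def check_loss(resid, tau):
    # Quantile check loss summed over residuals.
    return np.maximum(tau * resid, (tau - 1) * resid).sum()

def qdcart_1d(y, tau, lam):
    """Solve each dyadic interval [a, b): either fit one tau-quantile
    (cost: check loss + lam) or split at the midpoint and combine the
    children's optimal costs. Returns (optimal cost, fitted values)."""
    y = np.asarray(y, dtype=float)

    def solve(a, b):
        seg = y[a:b]
        q = np.quantile(seg, tau)
        leaf_cost = check_loss(seg - q, tau) + lam
        if b - a == 1:
            return leaf_cost, np.full(1, q)
        m = (a + b) // 2
        lc, lf = solve(a, m)
        rc, rf = solve(m, b)
        if lc + rc < leaf_cost:
            return lc + rc, np.concatenate([lf, rf])
        return leaf_cost, np.full(b - a, q)

    return solve(0, len(y))
```

On a signal with one jump, e.g. y = [0, 0, 0, 0, 5, 5, 5, 5] with τ = 0.5 and λ = 0.1, the optimal fit recovers the two constant pieces exactly at total cost 2λ. The paper's actual algorithm works bottom-up over all dyadic rectangles in d dimensions; this top-down recursion visits the same dyadic intervals in d = 1.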

Outline
The rest of the paper is organized as follows. Section 2 presents the precise definition of the QDCART estimator. The main theoretical result of this paper (Theorem 1) is then presented in Section 3. We then provide implications of our main result for the class of bounded variation signals in Section 3.1 and for the class of piecewise constant signals in Section 3.2. Section 3.3 discusses the quantile optimal regression tree (QORT) estimator. Section 4 is a discussion section: Section 4.1 presents an overview of the proof of our main theorem, and in Section 4.2 we compare our theoretical guarantees for QDCART with what is known for a natural competitor estimator, the quantile total variation denoising estimator. Section 5 provides the details of our algorithm for implementing QDCART. Section 6 contains extensive numerical results on both simulated and real data examples. Finally, Section 7 contains the full proofs of all our theoretical results.
Description of QDCART Estimator

Henceforth, we will use the word rectangle to denote an axis aligned rectangle. The size of a rectangle R = ∏_{i=1}^d [a_i, b_i] is denoted by |R| and defined as |R| = ∏_{i=1}^d (b_i − a_i + 1), the number of lattice points in R. A rectangular partition of L_{d,n} is a set of rectangles R such that (a) the rectangles in R are pairwise disjoint and (b) the union of the rectangles in R is all of L_{d,n}. Consider a generic discrete interval [a, b]. We define a dyadic split of the interval to be a split of [a, b] into two intervals of equal size. We assume that the interval has even size for ease of exposition; if not, one can set forth a convention for defining the middle point and then follow it throughout. A dyadic partition of L_{d,n} is constructed iteratively as follows. Starting from the trivial partition, which is just L_{d,n} itself, we can create a refined partition by dyadically splitting L_{d,n}. This results in a partition of L_{d,n} into two rectangles. We can keep on dividing recursively, generating new partitions. In general, if at some stage we have the partition Π = (R_1, . . ., R_k), we can choose any of the rectangles R_i and dyadically split it to get a refinement of Π with k + 1 nonempty rectangles. A recursive dyadic partition (RDP) is any partition reachable by such successive dyadic splitting. We denote the set of all recursive dyadic partitions of L_{d,n} by P_rdp(L_{d,n}). Figure 1 shows a depiction of a dyadic partition.
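As a concrete illustration (our own sketch, not from the paper), the following Python snippet represents rectangles as lists of half-open discrete intervals and builds one recursive dyadic partition of the 4 × 4 grid by successive midpoint splits.

```python
def dyadic_split(rect, axis):
    # rect is a list of half-open intervals [(a_1, b_1), ..., (a_d, b_d)].
    # Split the interval along `axis` at its midpoint into two rectangles.
    a, b = rect[axis]
    mid = (a + b) // 2
    left = rect[:axis] + [(a, mid)] + rect[axis + 1:]
    right = rect[:axis] + [(mid, b)] + rect[axis + 1:]
    return left, right

def size(rect):
    # Number of lattice points in the rectangle.
    n = 1
    for a, b in rect:
        n *= b - a
    return n

# One recursive dyadic partition of the 4x4 grid {0,...,3}^2:
root = [(0, 4), (0, 4)]
top, bottom = dyadic_split(root, 0)   # split the rows at the midpoint
tl, tr = dyadic_split(top, 1)         # split the top half along the columns
partition = [tl, tr, bottom]          # |Pi| = 3; disjoint and covers the grid
```

Each split refines the previous partition by one rectangle, mirroring the recursive construction described above.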
For a given array θ ∈ ℝ^{L_{d,n}}, let k_rdp(θ) denote the smallest positive integer k such that some k rectangles R_1, . . ., R_k form a recursive dyadic partition of L_{d,n} and the restricted array θ_{R_i} is a constant array for all 1 ≤ i ≤ k. In other words, k_rdp(θ) is the cardinality of the minimal recursive dyadic partition of L_{d,n} on which θ is piecewise constant. A visual representation of k_rdp(θ) is given in Figure 2 for a signal θ ∈ ℝ^{L_{2,n}}.
To define our estimator, we will need a few more notations. If Π is any rectangular partition of L_{d,n}, we let S(Π) be the linear subspace of ℝ^{L_{d,n}} consisting of arrays with constant values on each rectangle of Π. We also write R ∈ Π to mean that the rectangle R is one of the constituent rectangles of the partition Π. We now define O_{Π,τ}(·) to be the function from ℝ^{L_{d,n}} to S(Π) such that for each R ∈ Π and each i ∈ R,

(O_{Π,τ}(y))_i = q_τ(y_R),

where q_τ(y_R) is the empirical τ-quantile of the set of values y_R := (y_i)_{i∈R}.
Armed with the above notation, we can reformulate the optimization problem in (1) by noting that the θ defined in (1) is the same as O_{Π,τ}(y), where the partition Π is an optimal solution to the corresponding discrete optimization problem over P_rdp(L_{d,n}). However, the estimator defined in (1) is not quite the estimator we propose and study in this paper, as we need to modify it slightly. To describe our QDCART estimator, which is the main object of study in this paper, we now define, for any fixed quantile level 0 < τ < 1,

θ̂_rdp = O_{Π̂,τ}(y)    (3)

where

Π̂ ∈ argmin_{Π ∈ P_rdp(L_{d,n}) : |R| ≥ γ for all R ∈ Π}  { Σ_{i ∈ L_{d,n}} ρ_τ(y_i − (O_{Π,τ}(y))_i) + λ|Π| }

for tuning parameters λ, γ > 0.
Note that, in view of (2), the above QDCART estimator is basically the same as the estimator in (1) with a slight modification: we restrict the optimization space to partitions in P_rdp(L_{d,n}) with the constraint that the size of each constituent rectangle is at least γ > 0. This restriction is needed to prevent the estimator from being affected by large outliers. We say more on this point in Remark 1.
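For concreteness, here is a small Python sketch (illustrative only; the variable names and the exact form of the objective follow our reading of the definition above) that fills each rectangle of a given partition with its empirical τ-quantile and evaluates the penalized check-loss objective.

```python
import numpy as np

def fit_partition(y, partition, tau):
    # O_{Pi,tau}(y): constant on each rectangle, equal to the empirical
    # tau-quantile of the data restricted to that rectangle.
    theta = np.empty_like(y, dtype=float)
    for (r0, r1), (c0, c1) in partition:      # half-open row/column ranges
        theta[r0:r1, c0:c1] = np.quantile(y[r0:r1, c0:c1], tau)
    return theta

def penalized_objective(y, partition, tau, lam):
    # Check loss of the fitted array plus lam times the partition size.
    theta = fit_partition(y, partition, tau)
    resid = y - theta
    check = np.maximum(tau * resid, (tau - 1) * resid).sum()
    return check + lam * len(partition)

# A 4x4 signal that is 0 on the left half and 5 on the right half:
y = np.zeros((4, 4)); y[:, 2:] = 5.0
two_piece = [((0, 4), (0, 2)), ((0, 4), (2, 4))]
# penalized_objective(y, two_piece, 0.5, 0.1) -> 0.2 (zero check loss + 2*lam)
```

A partition matching the true pieces pays only the complexity penalty, while the trivial one-rectangle partition pays a large check loss, which is exactly the trade-off the criterion encodes.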

Main Results
We first state an assumption on the distribution of the coordinates of the data vector y.
Assumption 1. There exist positive constants L, f̲ and f̄ such that for any

Assumptions like the above are standard and commonly made in the quantile regression literature. For instance, without the upper bound part, Assumption 1 (Equation (5)) also appeared in Padilla and Chatterjee (2020); it is a weaker version of Condition 2 from He and Shi (1994) and is closely related to Condition D.1 in Belloni and Chernozhukov (2011). The lower bound in Assumption 1 is needed to ensure uniqueness of the τ-quantiles of the marginal distributions of the coordinates of y. We believe that Assumption 1 is very mild. For example, sequences of distributions which are stochastically dominated by a distribution with continuous density (w.r.t. Lebesgue measure) bounded away from 0 on any compact interval satisfy Assumption 1. If the errors are assumed to be i.i.d. (a commonly made assumption) and the error distribution has a continuous density (w.r.t. Lebesgue measure) bounded away from 0 on any compact interval, then Assumption 1 is satisfied. In particular, the errors could be i.i.d. Cauchy, with no moments existing.
Before stating our main result we will need to make the following definition.
Definition 1. Let θ̲ and θ̄ be the arrays of the true τ/2-quantiles and (1 − τ)/2-quantiles of y respectively. Then we denote

U := max{‖θ̲‖_∞, ‖θ̄‖_∞}.

Definition 1 simply quantifies the supremum norm of the τ/2-quantiles and (1 − τ)/2-quantiles of y. We are now ready to state our main result for the QDCART estimator.
Theorem 1. Suppose that Assumption 1 holds. There exist universal constants c_1, C_1, C_2, C_3 > 0 such that for any 0 < δ < 1, if we set γ = c_1 log N and λ appropriately (of order at least log² N; see Section 4.1), then with probability at least 1 − δ the oracle risk bound (6) holds, with oracle quantity Q_rdp(θ*).

Theorem 1 provides the generalization of the oracle risk bound in Theorem 2.1 in Chatterjee and Goswami (2019) to the quantile setting. We now list the differences between Theorem 1 and the oracle risk bound (Theorem 2.1 in Chatterjee and Goswami (2019)) known for the mean regression counterpart.
1. Theorem 2.1 in Chatterjee and Goswami (2019) requires that Y − θ * , the vector of errors, consists of i.i.d.mean zero Gaussian random variables.In contrast, Theorem 1 holds under Assumption 1 which does not require any tail decay assumptions for the distributions of the coordinates of the error vector (independence is still assumed).In particular, Assumption 1 allows error distributions with no moments as well like the Cauchy distribution.
2. The result in Chatterjee and Goswami (2019) is stronger in the sense that theirs is an upper bound in expectation, holding for all δ ∈ (0, 1) and for some constant C > 0. Our result in Theorem 1 gives a tail probability inequality, which does not ensure that ‖θ̂_rdp − θ*‖²/N has a finite first moment. It does ensure, however, that ‖θ̂_rdp − θ*‖²/N = O_P(Q_rdp(θ*)), where P refers to an appropriately defined sequence of probability distributions corresponding to denoising problems of increasing size.
3. In effect, the upper bound in Theorem 1 is only off by logarithmic factors compared to the upper bound in Theorem 2.1 in Chatterjee and Goswami (2019). Our bound in Theorem 1 contains some extra terms which are benign. The factor U should scale like O(1) for any realistic error distribution sequence. The factor ‖θ‖_∞ inside the infimum in the definition of Q(θ*) essentially introduces another multiplicative factor of ‖θ*‖_∞ ≤ U.
Remark 1. The choice of γ in Theorem 1 ensures that the QDCART estimator is well behaved in the sense of the ℓ_∞ norm; see Lemma 6. Such a restriction on the size of the rectangles in the optimal partition is actually needed: otherwise, the QDCART estimator can be arbitrarily large in some locations in the presence of heavy tailed errors. Under standard subgaussian type assumptions on the errors, this restriction on the size of the rectangles can be removed.
We now turn to the issue of computation. In this article we also give an algorithm to compute the QDCART estimator based on bottom-up dynamic programming. This algorithm is similar to the original algorithm given in Donoho (1997), adapted to the quantile setting. We now state our computation result as a theorem.
Theorem 2. There exists an absolute constant C > 0 (not depending on d, n) such that the computational complexity, i.e. the number of elementary operations involved in the computation of the QDCART estimator in d dimensions, is bounded by C N log N.

The description of the algorithm and the proof of its computational complexity are given in Section 5.

Implications for Bounded Variation Signals
It was shown in Donoho (1997) and Chatterjee and Goswami (2019) that an oracle risk bound of the type shown in Theorem 1 implies minimax rate optimality (up to log factors) for other function classes of interest as well. We now proceed to discuss consequences of Theorem 1 for the class BV_{d,n}(V) of bounded variation signals, consisting of the arrays θ ∈ ℝ^{L_{d,n}} whose total variation over the edges of the grid graph L_{d,n} is at most V. We arrive at the next corollary by combining Theorem 1 with existing approximation theoretic results shown in Chatterjee and Goswami (2019) (see Proposition 8.9 and Theorem 4.2 there).

Corollary 3. For any θ* ∈ BV_{d,n}(V), there exists a constant C > 0 only depending on the dimension d such that Q_rdp(θ*) admits the bound stated there. Therefore, under the same assumptions and the choice of λ and γ in Theorem 1, the same probability tail bound as in (6) holds for any θ* ∈ BV_{d,n}(V) with Q_rdp(θ*) replaced by the bound above.
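To fix ideas, a common discrete notion of total variation on the grid graph sums absolute differences across edges; the snippet below (our own sketch — the paper's definition of BV_{d,n}(V) may carry a different normalization) computes it for a 2-d array.

```python
import numpy as np

def grid_tv(theta):
    # Sum of |theta_i - theta_j| over all edges (i, j) of the 2-d grid graph:
    # vertical differences (axis 0) plus horizontal differences (axis 1).
    theta = np.asarray(theta, dtype=float)
    return (np.abs(np.diff(theta, axis=0)).sum()
            + np.abs(np.diff(theta, axis=1)).sum())

# One vertical boundary in a 4x4 signal: 4 edges cross it, each with jump 1.
theta = np.zeros((4, 4)); theta[:, 2:] = 1.0
# grid_tv(theta) -> 4.0
```

Piecewise constant signals with few boundaries thus have small total variation, which is why the bounded variation class contains them as a special case.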
The rates implied by Corollary 3 are minimax optimal, save for logarithmic factors, over the class BV_{d,n}(V); see the discussion in Tibshirani (2014) for the case d = 1 and the corresponding ones in Hutter and Rigollet (2016) and Sadhanala et al. (2016) for the case d > 1. It was shown in Theorem 5.1 of Chatterjee and Goswami (2019) that the mean regression version of Dyadic CART is minimax rate optimal (up to log factors) over the class BV_{d,n}(V). Corollary 3 can be seen as an extension of this result to the quantile setting, holding under much weaker tail decay conditions.

Implications for Piecewise Constant Signals
We now discuss consequences of Theorem 1 for the class of piecewise constant signals in dimensions d = 1 and d = 2. Towards that end, given θ ∈ ℝ^{L_{d,n}}, we define k(θ) as the size of the smallest rectangular partition Π of L_{d,n} such that θ is constant on each rectangle of Π. By construction, k(θ) ≤ k_rdp(θ) for all θ ∈ ℝ^{L_{d,n}}. Furthermore, Proposition 3.9 in Chatterjee and Goswami (2019) shows that there exists an absolute constant C > 0 such that for all θ ∈ ℝ^{L_{d,n}}, k_rdp(θ) can be bounded in terms of k(θ). Combining Theorem 1 with (8) and (9), we immediately obtain our next corollary.

Corollary 4. For any θ* ∈ ℝ^{L_{d,n}}, there exists a constant C > 0 only depending on the dimension d such that Q_rdp(θ*) admits the bound stated there. Therefore, under the same assumptions and the choice of λ and γ in Theorem 1, the same probability tail bound as in (6) holds for any θ* ∈ ℝ^{L_{d,n}} with Q_rdp(θ*) replaced by the bound above.
Notice that the rate implied by Corollary 4 is Õ(k(θ*)/N), which is the usual parametric rate of estimation for a signal θ* consisting of k(θ*) pieces when the locations of the boundaries of the constant pieces of θ* are known. Here, of course, the QDCART estimator does not know the true partition corresponding to the true signal. Corollary 4 can be seen as an extension of Corollary 3.10 in Chatterjee and Goswami (2019) to the quantile setting, holding even under heavy tailed error distributions.
Remark 2. The situation when d > 2 is more difficult, as versions of (8) and (9) are not known to hold in dimensions higher than 2. We refer the reader to Chatterjee and Goswami (2019), where this issue has been thoroughly discussed. We therefore prefer to state our results only for dimensions d ≤ 2.

Quantile ORT Estimator
The ORT estimator, introduced in Chatterjee and Goswami (2019), is a variant of the Dyadic CART estimator which enjoys better statistical risk guarantees in general but has significantly slower computational complexity; see Lemma 1 in Chatterjee and Goswami (2019). Just as we have proposed QDCART, it is natural to extend the optimal regression tree (ORT) estimator to the quantile setting as well. This leads us to define the quantile optimal regression tree (QORT) estimator. Before giving the definition of QORT, we need to introduce some additional notation. Given a rectangle, a hierarchical split consists of choosing a coordinate j ∈ {1, . . ., d} and then constructing the rectangles R_1 and R_2 by splitting the jth coordinate interval at some location l with a_j ≤ l ≤ b_j and l ∈ ℤ_+. Thus, the difference between a hierarchical split and a dyadic split is that the former is not restricted to split an interval only at its midpoint. Starting from L_{d,n}, one can keep performing hierarchical splits recursively, creating refined partitions. A hierarchical partition/decision tree is any partition reachable by such successive hierarchical splits. Note that this is the usual definition of a decision tree, except we are carrying out everything on the lattice L_{d,n}. We denote by P_tree(L_{d,n}) the set of hierarchical partitions of L_{d,n}.
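The following Python sketch (our own, one-dimensional, without the minimum-size constraint γ) conveys the difference: the recursion considers every split point of an interval rather than only the midpoint, at the price of visiting O(n²) distinct intervals.

```python
import numpy as np
from functools import lru_cache

def qort_1d(y, tau, lam):
    # Optimal penalized check-loss cost over ALL hierarchical (decision-tree)
    # partitions of {0, ..., n-1}; splits may occur at any location.
    y = np.asarray(y, dtype=float)

    @lru_cache(maxsize=None)
    def best(a, b):
        seg = y[a:b]
        resid = seg - np.quantile(seg, tau)
        cost = np.maximum(tau * resid, (tau - 1) * resid).sum() + lam
        for m in range(a + 1, b):            # any split point, unlike dyadic
            cost = min(cost, best(a, m) + best(m, b))
        return cost

    return best(0, len(y))

# For y = [0, 0, 0, 7, 7] (n = 5, jump at an odd location), tau = 0.5 and
# lam = 0.1, the optimal tree splits at 3 and pays only 2 * lam = 0.2.
```

A dyadic scheme cannot place a split exactly at such an off-center jump, which illustrates why hierarchical partitions yield better risk bounds in general while costing more computation.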
Given θ ∈ ℝ^{L_{d,n}}, we denote by k_tree(θ) the smallest number of elements of any hierarchical partition on which θ is piecewise constant. It is clear that for any θ ∈ ℝ^{L_{d,n}} we must have k_tree(θ) ≤ k_rdp(θ). Armed with the notation above, we can now define the estimator θ̂_tree = O_{Π̂,τ}(y), where Π̂ minimizes the penalized quantile loss over partitions in P_tree(L_{d,n}) whose constituent rectangles have size at least γ, for tuning parameters λ, γ > 0. By construction, θ̂_tree is the quantile version of the ORT estimator proposed and studied in Chatterjee and Goswami (2019). With the notation above in hand, we are now ready to present our main result for the QORT estimator.
Theorem 5. Define for any θ ∈ ℝ^{L_{d,n}} the quantity Q_tree(θ), the analogue of Q_rdp(θ) with k_rdp replaced by k_tree. Under the same assumptions and the choice of λ and γ in Theorem 1, the same probability tail bound as in (6) holds for any θ* ∈ ℝ^{L_{d,n}} with one difference: Q_rdp(θ*) is replaced by Q_tree(θ*).
Remark 3. The above theorem basically says that (1/N)‖θ̂_tree − θ*‖² = O_P(Q_tree(θ*)). This is in general a better bound than (1/N)‖θ̂_tree − θ*‖² = O_P(Q_rdp(θ*)), since k_tree(θ) ≤ k_rdp(θ).

Theorem 5 generalizes to the quantile setting the general risk bound proven in Chatterjee and Goswami (2019) for the ORT estimator. It is clear that the implications presented for bounded variation and piecewise constant function classes continue to hold for the QORT estimator as well. However, the QORT estimator has significantly worse computational complexity than the QDCART estimator, which is why we focus more on the QDCART estimator in this paper. It should be possible to provide an algorithm with computational complexity scaling like Õ(N^{2+1/d}) for the QORT estimator, analogous to Theorem 2. We do not carry this out due to space considerations.
Remark 4. All our theoretical guarantees are in the regime when d is held fixed and n is growing. Our estimator and our results are practically useful when d is small, like 1, 2 or 3.

Proof Technique
Our proof follows an M-estimation approach, viewing the QDCART estimator as a penalized M-estimator. This M-estimation viewpoint was also used to analyze the quantile version of trend filtering in Padilla and Chatterjee (2020), and we borrow some techniques from that paper. The mean regression version of Dyadic CART was thoroughly analyzed in Chatterjee and Goswami (2019); we also use some proof techniques developed there, adapted to our setting. We now discuss a sketch of the proof of our main result in Theorem 1. We have divided the proof sketch into several steps for the convenience of the reader.
From the M-estimation viewpoint, the natural loss function which arises is a population quantile loss function M. Another loss function that plays a role in our proof is a Huber-type loss function ∆². The Huber-type loss function ∆² is always upper bounded (up to a constant) by the population quantile loss; this is the content of Lemma 13 in Padilla and Chatterjee (2020), which says that, under Assumption 1, there exists an absolute constant c_0 > 0 such that the comparison holds for all δ ∈ ℝ^{L_{d,n}}. With this notation in hand, we now proceed to sketch the different steps involved in the proof of Theorem 1.
Step 1: (Preliminary Localization) We first show that ‖θ̂_rdp‖_∞ ≤ U with high probability. The reader can think of this as a preliminary localization step.
Recall U from Definition 1. Within this proof sketch, we will assume U = O(1), which is the regime of interest. For sequences a_n and b_n, we will also use the notation a_n ≲ b_n to denote a_n ≤ C b_n for an absolute constant C.
Step 2: (Equivalence of Loss Functions) The preliminary localization step is crucial in our proof because it allows us to conclude a chain of inequalities, the last of which follows from Lemma 13, showing that the loss functions M and ∆² are essentially equivalent (up to constants) to the squared loss. In the remainder of the proof sketch, all events are intersected with {‖θ̂_rdp‖_∞ ≤ U}, as the complement event {‖θ̂_rdp‖_∞ > U} has negligible probability and can be handled separately.
Step 3: (Peeling) To bound the required probability, we perform a peeling step, which is standard in empirical process theory. We decompose the probability of interest into terms p_j as explained in (28). Therefore, we only need to sum up to J = O(log N) = Õ(1) terms in our peeling step.
Step 4: (Basic Inequality, Suprema and Markov's Inequality) For i ∈ L_{d,n}, we define the sample version of the population quantile loss function as a random function M̂ : ℝ → ℝ. The so called basic inequality then holds for any θ ∈ Θ, with Θ the parameter space defined in (20). We can now bound p_j as follows. For any reference θ ∈ Θ, we apply the basic inequality and add and subtract M(θ), because we would like to obtain an oracle risk bound with respect to θ. We then "sup out" θ̂_rdp, using again the equivalence of the squared loss and the M loss. Next, we simply use Markov's inequality to obtain p_j ≤ T_{1,j} + T_2.

Step 5: (Symmetrization and Contraction) At this point, T_{1,j} can be viewed as the expected supremum of a penalized empirical process. To simplify matters further, we use the tools of symmetrization (Lemma 8) and contraction (Lemma 9) to convert the penalized empirical process into a penalized Rademacher process. Specifically, we show the bound (15), where ξ is a vector of independent Rademacher random variables; see Lemma 12.
Step 6: (Bounding the Penalized Rademacher Complexity) At this point, we are left with the task of bounding the penalized Rademacher complexity term in (15). This task has essentially been done in Proposition 8.9 in Chatterjee and Goswami (2019), the only difference being that they bounded the corresponding Gaussian complexity term. It is not hard to convert the ideas there to our setting with Rademacher variables. In Lemma 12 we show the required bound, provided that λ is chosen to be at least of order log² N.
It follows from this and the previous steps that we can take an infimum over θ ∈ Θ on the right hand side of the resulting bound. A simple approximation lemma (see Lemma 15) then justifies that we can actually take the infimum over all θ ∈ ℝ^{L_{d,n}}, which leads to the final bound. This implies that ‖θ̂_rdp − θ*‖²/N = O_P(Q_rdp(θ*)), and concludes the proof.

Comparison with Quantile Total Variation Denoising
We have shown that QDCART is computable in near linear time and enjoys attractive statistical properties.
We do not compare QDCART with the corresponding mean regression estimators simply because, under heavy tails, mean regression estimators perform poorly while our results continue to hold; see the simulations section in Padilla and Chatterjee (2020). Therefore, comparison is appropriate with other quantile regression estimators. We believe that the most natural competitor to QDCART is the Quantile Total Variation Denoising (QTVD) estimator, proposed and studied in Padilla and Chatterjee (2020). There are two versions of QTVD, the constrained and the penalized version. In the univariate case, Padilla and Chatterjee (2020) refer to the constrained version of this estimator as the constrained quantile fused lasso (CQFL) and to the penalized version as the penalized quantile fused lasso (PQFL). Both of these estimators were analyzed in Padilla and Chatterjee (2020).
In comparing Corollaries 3 and 4 with the existing results known for the CQFL and PQFL estimators, we make the following points.
1. The results proven in Padilla and Chatterjee (2020) for CQFL and PQFL only need the lower bound portion of Assumption 1, while Corollaries 3 and 4 also require the upper bound in Assumption 1 to hold. However, this is a very mild condition that yields a guarantee for QDCART in terms of the mean squared error, a stronger (and much cleaner) result than those for CQFL and PQFL, which are based on the Huber-type loss ∆²_N(·). Results under the squared error loss are not yet known for the CQFL and PQFL estimators. The main reason for this is that the localization result described in Step 1 in Section 4.1 is not available for CQFL and PQFL.
2. The dependence on the total variation V of the true signal in both Corollaries 3 and 4 is optimal, in the sense that it matches the correct dependence known in the mean regression case; see Sadhanala et al. (2016) for lower bounds over bounded variation signal classes. The results for the CQFL and PQFL estimators in Padilla and Chatterjee (2020) seem to have suboptimal dependence on V. Thus, to the best of our knowledge, QDCART is the first quantile regression based estimator which enjoys minimax rate optimality with respect to both the sample size N and the total variation V of the unknown signal.
3. Our results for QDCART depend on the quantity U, something that is not the case for QTVD. However, in any realistic setting we would have U = O(1).
4. Corollary 4 gives a near parametric rate of convergence for piecewise constant signals. In the univariate case, the corresponding result is known for CQFL and PQFL from Theorems 2 and 4 in Padilla and Chatterjee (2020). However, these results need the true signal to satisfy a minimal spacing condition which is not needed for QDCART. This is potentially a significant advantage of QDCART over QTVD, even in d = 1, as far as attaining adaptively optimal rates of convergence for piecewise constant signals is concerned.
5. In the mean regression problem, it is known that when d = 2 the TVD estimator cannot attain near parametric rates of convergence for a rectangular piecewise constant signal; see Theorem 2.3 in Chatterjee and Goswami (2021). Therefore, it is expected that the QTVD estimator would also not be the best tool for estimating rectangular piecewise constant signals. On the other hand, the QDCART estimator does attain the Õ(k(θ*)/N) rate in 2 dimensions and seems to be the right tool for estimating rectangular piecewise constant signals as well.
6. The QDCART estimator is computable in Õ(N) time in any dimension. In contrast, it is unknown, and seems unlikely, that the QTVD estimator is computable in Õ(N) time in dimensions larger than 1.
Remark 5. The last three points above show that the QDCART estimator is a computationally faster alternative to the QTVD estimator while also enjoying some statistical advantages. We perform numerical experiments to further compare the finite sample performance of the QDCART and QTVD estimators in Section 6.

Background and Related Literature
Our work in this paper falls under the scope of nonparametric quantile regression. We now briefly review some classical work in this area. In the context of median regression, some early works include Utreras (1981), Cox (1983), and Eubank (1988). Koenker et al. (1994) proposed one dimensional quantile smoothing splines. These estimators were studied in He and Shi (1994)

Future Directions
There are several research directions that we leave for future work. A natural extension is to consider piecewise polynomial structures in the estimator, similarly to Chatterjee and Goswami (2019). However, we are currently unaware of how to extend our theory to such a setting. The main bottleneck is that, given a fixed sub rectangle of L d,n , we do not know how to obtain an ℓ∞ upper bound when fitting quantile regression constrained to the class of polynomials of degree r > 0. When r = 0 the latter can be done as in Corollary 6. This is a crucial ingredient in our proof that we do not know how to handle when dealing with higher order piecewise polynomial signals.
Moreover, it is worth mentioning that all our theoretical results hold under a theoretical choice of the tuning parameters. In our experiments, following Yu and Moyeed (2001), we choose the tuning parameters by the Bayesian information criterion (BIC) for quantile regression. It would be interesting to provide theoretical guarantees for an estimator which chooses the tuning parameters in a data driven way, for example by some form of cross validation.

Computation of the QDCART Estimator
The goal of this section is to develop a computationally efficient algorithm for the QDCART estimator defined in (3). In doing so, our construction will imply the conclusion of Theorem 2. This algorithm is an adaptation of the original algorithm given in Donoho (1997) (also see Lemma 1 in Chatterjee and Goswami (2019)) to the quantile setting.
We have to solve the discrete optimization problem in (4). Let us first see how to solve the discrete optimization problem in (2), where there are no constraints on the sizes of the rectangles.
In order to study the optimization problem (2), we first define a corresponding subproblem for any rectangle R. For any rectangle R ⊂ L d,n and a partition Π ∈ P rdp (L d,n ), we let Π(R) := {A ∩ R : A ∩ R ≠ ∅, A ∈ Π} be the partition induced by Π in R. We then let P rdp (R) := {Π(R) : Π ∈ P rdp (L d,n )}. In words, P rdp (R) is the set of recursive dyadic partitions of the rectangle R.
Then we define the following subproblem and define its optimal value as

OPT(R) := min over Π ∈ P rdp (R) of { Σ_{R′ ∈ Π} Σ_{i ∈ R′} ρ τ (y i − q τ (y R′ )) + λ |Π| }.

Clearly, OPT(L d,n ) is the optimal value of the objective function associated with QDCART. The basic idea is to solve smaller subproblems as above and combine these smaller solutions to solve the full optimization problem. The following dynamic programming relation allows us to build up from the bottom.

OPT(R) := min { min over (R 1 , R 2 ) a dyadic split of R [ OPT(R 1 ) + OPT(R 2 ) ] , Σ_{i ∈ R} ρ τ (y i − q τ (y R )) + λ },
where by saying that "R 1 and R 2 form a dyadic split of R" we mean that R 1 and R 2 were obtained by performing a dyadic split of R. The above relation holds because of the separable nature of our optimization objective; the second term inside the minimum above corresponds to not splitting R at all. We now proceed by visiting dyadic rectangles bottom-up according to the length of R, where the length of a rectangle is defined as the sum of the lengths of its sides. We start from the minimum possible length d and go all the way up to nd. Our goal is to store OPT(R) and SPLIT(R) for each dyadic rectangle R, where SPLIT(R) indicates the optimal split for rectangle R. Note that the total number of possible splits is d (one for each dimension) and thus SPLIT(R) can be represented by a single integer in the set [d].
For each dyadic rectangle R, let us denote

SQL(R) := Σ_{i ∈ R} ρ τ (y i − q τ (y R )),

where SQL stands for sum of quantile losses and q τ (y R ) is an empirical τ quantile of the set of observations in y R . Assume that we have successfully computed SQL(R) for each dyadic rectangle. Then, at each dyadic rectangle R, to compute OPT(R) we have to compute the sum OPT(R 1 ) + OPT(R 2 ) for each possible nontrivial dyadic split of R into R 1 , R 2 and then take the minimum of d + 1 numbers. Once OPT(R) is computed, SPLIT(R) is also automatically obtained while computing the minimum of these d + 1 numbers.
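As a concrete illustration, the quantities ρ τ and SQL(R) can be computed as follows. This is a minimal Python sketch using numpy; the function names rho_tau and sql are ours, and np.quantile is just one choice of empirical τ quantile.

```python
import numpy as np

def rho_tau(x, tau):
    # Quantile (check) loss: tau * x for x >= 0 and (tau - 1) * x for x < 0.
    return np.maximum(tau * x, (tau - 1.0) * x)

def sql(y_R, tau):
    # SQL(R): sum of quantile losses of the observations in rectangle R,
    # evaluated at an empirical tau-quantile of those observations.
    y_R = np.asarray(y_R, dtype=float)
    q = np.quantile(y_R, tau)  # one choice of empirical tau-quantile
    return float(rho_tau(y_R - q, tau).sum())
```

For example, sql([1, 2, 3], 0.5) centers at the median 2 and pays 0.5 for each of the two unit residuals.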
Note that since we are visiting dyadic rectangles bottom up, we have already computed OPT(R′) for all proper sub rectangles R′ ⊂ R. Therefore, computing OPT(R) only requires taking a minimum of d + 1 numbers. The total number of dyadic rectangles is at most 2^d N. Therefore, the total computation required to compute OPT(R) for all dyadic rectangles R is at most (d + 1) 2^d N.
Now we explain how to compute SQL(R) for each dyadic rectangle R. We again do this by a bottom up scheme, visiting dyadic rectangles according to their length (small to large). Our aim is to maintain a sorted list of the observations within each dyadic rectangle R. For each dyadic rectangle R we consider the particular dyadic split into R 1 and R 2 obtained by dyadically splitting (in dictionary order) the first coordinate of R; in case the first coordinate is a singleton, we use the second coordinate to split, and so on. On our bottom up visits, we iteratively compute sorted lists for these dyadic rectangles. For a given dyadic rectangle R, we take its corresponding dyadic split (in dictionary order) into R 1 , R 2 . We have already created sorted lists for R 1 and R 2 ; to create a sorted list for R we just need to merge two sorted lists, which can be done by the standard merge step of merge sort. The computation required at this step is O(|R|). Once the sorted list of observations within R is available, SQL(R) can be readily computed in O(|R|) time.
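The merge step above can be sketched as follows, in a one dimensional Python illustration under the assumption that n is a power of two; the dictionary keyed by (start, length) and the function name dyadic_sorted_lists are ours.

```python
from heapq import merge

def dyadic_sorted_lists(y):
    # Build a sorted list of observations for every dyadic interval of y,
    # bottom up: a parent's sorted list is obtained by merging the sorted
    # lists of its two dyadic children in O(|R|) time.
    n = len(y)
    lists = {(i, 1): [y[i]] for i in range(n)}
    length = 2
    while length <= n:
        for start in range(0, n, length):
            half = length // 2
            left = lists[(start, half)]
            right = lists[(start + half, half)]
            lists[(start, length)] = list(merge(left, right))
        length *= 2
    return lists
```

For a signal of length n this touches O(n log n) list entries overall, matching the d = 1 case of the O(N (log n)^d) count derived below.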
Now consider a dyadic rectangle R of a given size 2^{i_1} × · · · × 2^{i_d}. The total number of dyadic rectangles of this size is N / 2^{i_1 + · · · + i_d}. Therefore, the total computational work needed to compute SQL(R) for all dyadic rectangles R of this given size is O(N). Now note that the total number of distinct sizes 2^{i_1} × · · · × 2^{i_d} is at most (log n)^d. Therefore, the total computational work needed to compute SQL(R) for all dyadic rectangles of all sizes with our bottom up scheme is O(N (log n)^d).
Finally, we see that the total computation required to compute OPT(R) and SPLIT(R) for all dyadic rectangles R is O(N (log n)^d + d 2^d N). After OPT(R) and SPLIT(R) have been constructed, we can find the optimal partition by going top-down; this is a lower order computation.
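Putting the pieces together, the full bottom-up scheme can be sketched in one dimension as follows. This is an illustrative Python sketch only: qdcart_1d and its helpers are our names, n is assumed to be a power of two, and for brevity we call np.quantile on each segment rather than maintaining merged sorted lists, which changes the constants but not the structure of the dynamic program.

```python
import numpy as np

def qdcart_1d(y, tau, lam):
    # Bottom-up dynamic program over dyadic intervals R = (start, length).
    # OPT(R) = min( min over dyadic splits OPT(R1) + OPT(R2),
    #               SQL(R) + lam ),  with SPLIT(R) recording the choice.
    y = np.asarray(y, dtype=float)
    n = len(y)

    def sql(seg):
        q = np.quantile(seg, tau)
        return float(np.maximum(tau * (seg - q), (tau - 1.0) * (seg - q)).sum())

    OPT, SPLIT = {}, {}
    length = 1
    while length <= n:
        for start in range(0, n, length):
            R = (start, length)
            no_split = sql(y[start:start + length]) + lam  # keep R whole
            if length == 1:
                OPT[R], SPLIT[R] = no_split, False
            else:
                half = length // 2
                split_val = OPT[(start, half)] + OPT[(start + half, half)]
                OPT[R] = min(split_val, no_split)
                SPLIT[R] = split_val < no_split
        length *= 2

    # Recover the optimal dyadic partition top-down from SPLIT.
    parts, stack = [], [(0, n)]
    while stack:
        start, length = stack.pop()
        if SPLIT[(start, length)]:
            half = length // 2
            stack.extend([(start, half), (start + half, half)])
        else:
            parts.append((start, length))
    return OPT[(0, n)], sorted(parts)
```

On a signal with a single jump, for instance four zeros followed by four tens with λ = 1, the program pays λ per piece and recovers the two-interval partition.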
Based on the discussion above, it is not hard to see that to compute (4) we only have to modify the above algorithm slightly. We do not need to compute OPT(R) and SPLIT(R) for rectangles R with |R| < γ. Thus, when we visit dyadic rectangles bottom up according to their size, we only visit the feasible rectangles. Also, for a given rectangle R, to compute OPT(R) we now compute the sum OPT(R 1 ) + OPT(R 2 ) for each possible nontrivial dyadic split of R into feasible rectangles R 1 , R 2 and then take the minimum of these numbers together with Σ_{i ∈ R} ρ τ (y i − q τ (y R )) + λ.

Comparisons in 1d
We now proceed to evaluate the performance of QDCART in the 1d setting. In our simulations we consider as benchmarks the penalized quantile fused lasso (PQFL), proposed in Brantley et al. (2019) and studied in Padilla and Chatterjee (2020), and the univariate mean regression DCART method from Donoho (1997). For our evaluations in this subsection, for QDCART and DCART we consider a grid of 25 values of λ given as {2^{−2}, 2^{−1.75}, . . ., 2^{4}} and we set γ = 8. As for PQFL, we take λ such that log λ ∈ {1 + j(7.5 − 1)/99 : j ∈ {0, 1, . . ., 99}}. Then, for each method and choice of tuning parameter we calculate the average mean squared error, averaging over 100 data sets generated from different scenarios and with n ∈ {512, 1024}.
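The tuning grids just described translate directly into code. The following is a small numpy sketch; the variable names are ours, and we read "log λ" in the PQFL grid as the natural logarithm, which is an assumption on our part.

```python
import numpy as np

# 25-point grid for QDCART/DCART: 2^{-2}, 2^{-1.75}, ..., 2^{4}.
lams_qdcart = 2.0 ** np.arange(-2.0, 4.0 + 0.25, 0.25)

# 100-point grid for PQFL: log(lambda) in {1 + (7.5 - 1) j / 99 : j = 0, ..., 99}.
lams_pqfl = np.exp(1.0 + 6.5 * np.arange(100) / 99)
```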
For each method and each scenario we then report the optimal MSE. The only remaining ingredient is to explain the different data generating scenarios that we consider. These are described next.
For each scenario, we generate the data y ∈ R L 1,n as y i = θ*_i + ε i for i ∈ L 1,n and some θ*, ε ∈ R L 1,n . We now explain the construction of θ* and ε for the different scenarios.
Scenario 1 (Large Segments). In this case θ* ∈ R L 1,n satisfies and we generate ε i ∼ t(2.5) independently, where t(2.5) denotes the t-distribution with 2.5 degrees of freedom.
Scenario 2 (Large and Small Segments). We generate ε as in Scenario 1 and set

Scenario 3 (Large Segments and Cauchy Errors). We take θ* ∈ R L 1,n as in Scenario 1 and generate ε i ∼ Cauchy(0, 1) independently.

Scenario 4 (Large and Small Segments, and Heteroscedastic Errors). The vector θ* is the same as in Scenario 2 and ε satisfies for all i that where ν i ∼ N(0, 1) independently.
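As an illustration of this style of data generation, the following Python snippet draws the three error types above around a stand-in piecewise constant signal. The exact θ* used in each scenario is as defined above; the theta_star vector and the heteroscedastic scale below are hypothetical placeholders of our own.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1024

# Hypothetical piecewise constant median signal (stand-in for theta*).
theta_star = np.repeat([0.0, 3.0, -2.0, 1.0], n // 4)

# Scenario 1/2 style errors: t-distribution with 2.5 degrees of freedom.
y_t = theta_star + rng.standard_t(df=2.5, size=n)

# Scenario 3 style errors: standard Cauchy, which has no first moment.
y_cauchy = theta_star + rng.standard_cauchy(size=n)

# Scenario 4 style errors: heteroscedastic Gaussian noise, with an
# illustrative location-dependent scale multiplying nu_i ~ N(0, 1).
scale = 1.0 + np.linspace(0.0, 1.0, n)
y_het = theta_star + scale * rng.standard_normal(size=n)
```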
A visualization of data generated under each scenario is given in Figure 3. The results of our comparisons are given in Table 1. Overall, we see that the QDCART estimator is competitive against PQFL. In Scenario 2, where some of the constant pieces of the true signal are very small, the QDCART estimator performs better. This is in agreement with Corollary 4, where no minimum length assumption is needed for QDCART to attain near parametric rates. Observe that the mean regression DCART estimator performs poorly under heavy tailed scenarios.

Comparisons in 2d
We now proceed to evaluate the performance of QDCART on 2d grid graphs, using DCART and QTVD as benchmarks. For our experiments in this subsection, the tuning parameter λ for QDCART and DCART is taken such that log 10 (λ) is in the set {−1 + 6.5 j / 59 : j ∈ {0, 1, . . ., 59}}. As for the tuning parameter λ for QTVD, we take it such that log 2 (λ) is in the set {−1 + 7 j / 19 : j ∈ {0, 1, . . ., 19}}. As before, for each method and choice of tuning parameter we calculate the average mean squared error, averaging over 100 data sets generated from different scenarios. We set d = 2 and n ∈ {64, 128}. For each method and each scenario we then report the optimal MSE. Next we describe the different generative models, where in each case the data are generated as y i,j = θ*_{i,j} + ε i,j with ε i,j ∼ t(2.5) independently for i, j ∈ {1, . . ., n} and with θ* ∈ R^{n×n}.
Scenario 5. We set θ*_{i,j} = 1 if n/5 < i < 3n/5 and n/5 < j < 3n/5, and 0 otherwise.

Scenario 6. We now take θ* satisfying

Scenario 7. For this model we let

A visualization of the data generated under Scenarios 5-7 is provided in Figures 4-5. There we can see that QDCART can be competitive against QTVD.
A more comprehensive evaluation of performance comparisons is provided in Tables 2-3. In Scenario 6, where the level sets are non-rectangular, QTVD seems to do better than QDCART. In Scenario 7, however, QDCART performs slightly better; we believe this is because the level set can be well represented by a dyadic partition. We reiterate that a potential practical advantage of QDCART over QTVD in an image denoising setting is the fact that the QDCART estimator can be computed in near linear time.

Ion channels data
We conclude our experiments section with a real data example involving ion channel data. Ion channels are a class of proteins, expressed by all cells, that create pathways for ions to pass through the cell membrane. As explained in Jula Vanegas et al. (2021), over time an ion channel changes its gating behavior by closing and reopening its pore, which leads to a piecewise constant current flow structure. The original data that we use was produced by the Steinem Lab (Institute of Organic and Biomolecular Chemistry, University of Göttingen) and was recently analyzed by Jula Vanegas et al. (2021). It consists of measurements of a single ion channel of the bacterial porin PorB, a protein that plays a role in the pathogenicity of Neisseria gonorrhoeae. The original data consists of 600,000 time instances. For our comparisons we focus on a portion of length 32,511 and construct a subsampled vector y ∈ R^{2048}. Thus, our resulting signal is similar to that in Cappello et al. (2021). A depiction of the data is shown in Figure 6. Given the signal y ∈ R^{2048}, we fit both PQFL and QDCART with values τ ∈ {0.1, 0.5, 0.9}. For PQFL we consider values of the penalty parameter λ such that log λ is in the set {1 + 6.5 j / 99 : j ∈ {0, . . ., 99}}. As for QDCART, we take λ such that log 2 λ is in {−2 + 7 j / 25 : j = 0, . . ., 25}. Then, for both PQFL and QDCART, we choose the tuning parameter that minimizes the BIC for quantile regression criterion from Yu and Moyeed (2001), where v is the estimated degrees of freedom. Motivated by Tibshirani and Taylor (2012), we let With the above choice of tuning parameter for both PQFL and QDCART, we compute the estimates displayed in Figure 6. There, we can see that the estimators are roughly similar, validating our theoretical findings that QDCART and PQFL have similar statistical properties.
Finally, in order to provide a clearer quantitative comparison, we proceed as follows. We randomly choose 50% of the entries of the signal described above and use them as a training set. We then use the remaining 50% of the data as a test set, and for each coordinate in the test set we make a prediction based on the closest coordinate in the training set. With this in hand, for each competing method we compute prop 0.5 , the proportion of test samples that fall below their predicted median. We also compute cov 80% , the proportion of test samples that fall between their predicted 0.1 and 0.9 quantiles. The quantities prop 0.5 and cov 80% are then averaged over 100 repetitions and reported for PQFL and QDCART. For QDCART we obtain prop 0.5 = 0.502 and cov 80% = 0.781, whereas for PQFL we obtain prop 0.5 = 0.502 and cov 80% = 0.772. These results suggest that QDCART provides slightly better prediction intervals than PQFL.
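The two summary statistics above can be computed as follows; this is a small Python sketch and the function names are ours.

```python
import numpy as np

def prop_05(y_test, median_pred):
    # Proportion of test samples that fall below their predicted median.
    return float(np.mean(np.asarray(y_test) < np.asarray(median_pred)))

def cov_80(y_test, q10_pred, q90_pred):
    # Proportion of test samples lying between their predicted
    # 0.1 and 0.9 quantiles (nominal 80% coverage).
    y = np.asarray(y_test)
    inside = (np.asarray(q10_pred) <= y) & (y <= np.asarray(q90_pred))
    return float(np.mean(inside))
```

Values of prop 0.5 near 0.5 and cov 80% near 0.8 indicate well-calibrated quantile estimates, which is how the numbers reported above should be read.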

General estimator
Let S be a collection of linear subspaces of R^N. For any subspace S ∈ S we denote its dimension by Dim(S) and define a penalty function k S : R L d,n → Z + induced by S as

k S (θ) := min { Dim(S) : S ∈ S, θ ∈ S },

with the convention that the minimum of the empty set is ∞. We are interested in subspaces of arrays which are piecewise constant on a rectangular partition of L d,n . We denote by Π S the rectangular partition of L d,n such that S is the subspace of arrays which are constant on every rectangle of Π S . We will denote a generic rectangle of L d,n by R. When we say R ∈ Π S we are referring to a rectangle R of the partition Π S .
The collections of partitions P rdp and P hier give rise to corresponding collections of linear subspaces S and the associated complexity measures k rdp (θ) and k tree (θ), respectively. For a collection of subspaces S corresponding to a collection of rectangular partitions, we now define an additional function s S : R L d,n → Z + as follows: In words, s S (θ) is the size of the smallest rectangle of the minimal rectangular partition Π S within S ∈ S such that θ is constant on every rectangle of Π S .
For a given collection of subspaces S corresponding to a collection of rectangular partitions of L d,n and a constant c 1 > 0, we consider the general quantile estimator at level 0 < τ < 1, where λ > 0 is a tuning parameter and

Preliminary lemmas
Throughout the proofs, without loss of generality we assume that τ ≤ 0.5; the case τ > 0.5 can be handled similarly. Throughout we suppose that Assumption 1 holds. We also drop the subscripts in θ̂ S,c 1 and just write θ̂ to avoid notational clutter.
Remark 6. The sequence U is clearly a function of the set of marginal distributions of y. For realistic distributions the sequence U = O(1). For instance, if the errors are i.i.d., then U is a constant sequence.
Lemma 6. For any α > 0, if we set c 1 ≥ 2α/τ² then we have

Proof. Let Π be the optimal partition on which θ̂ is piecewise constant. Fix v ∈ L d,n and denote by R the rectangle of the partition Π containing v. We know that θ̂ v is a sample τ quantile of the observations in y R . Therefore, we can assert that

where the second inequality follows by Hoeffding's inequality. Choosing c 1 = 2α/τ², the last two displays imply a bound on P(∃ v : θ̂ v < min u∈L d,n θ u ). The same bound can readily be shown for P(∃ v : θ̂ v > max u∈L d,n θ u ) by a similar argument. Both these assertions together with a further union bound finish the proof.
Our next result is a modified version of Lemma 9.1 in Chatterjee and Goswami (2019), where we prove the corresponding result for Rademacher random variables instead of Gaussian random variables.
Lemma 7 (Lemma 9.1 in Chatterjee and Goswami (2019)). Let θ ∈ R L d,n be a fixed array and ξ ∈ R L d,n be an array of independent Rademacher random variables. Then there exists C > 0 such that if λ ≥ C log N then it follows that

Proof. Let S ∈ S and let O S be the orthogonal projection matrix onto S. Then, following the arguments in the proof of Lemma 9.1 in Chatterjee and Goswami (2019), it follows that Let v S,1 , . . ., v S,m S be an orthonormal basis of S with m S = dim(S). Then where the second inequality holds by the usual bound on the expectation of a maximum of sub-Gaussian random variables. Then, from an application of McDiarmid's inequality as on page 62 of van Handel (2014), we obtain that for any t > 0, Hence, by a union bound and the fact that |{S ∈ S : dim(S) = k}| ≤ N^{2k}, we have that And so again, by a union bound, we obtain that Similarly, The claim follows from (23) and (24) by simple integration.
Next, we recall some notation from Section 4.1. Furthermore, let us denote and ∆²_N(θ) = ∆²(θ)/N. Our analysis relies on viewing the estimator defined in (19) as a penalized M estimator, or a penalized empirical risk minimization estimator. Hence the natural loss function for us is the population quantile loss function M given above. However, we would like to give risk bounds for the squared loss function. For this purpose, the ∆ function defined above plays an important role in converting bounds in the M loss to bounds in squared error loss.
We now state some results (Lemmas 8-11) involving the functions M(·) and ∆²(·). These results also appeared in Padilla and Chatterjee (2020); the only difference is that we now use the penalty function k S (θ) instead of the TV(θ) functional. We omit the proofs of these results since they are very similar to those already given in Padilla and Chatterjee (2020).
where the first inequality follows as in Lemmas 8 and 9, the third by the Cauchy-Schwarz inequality, and the last by Lemma 7.

Figure 1 :
Figure 1: From left to right, the panels show an example of a sequence of three dyadic splits that lead to a dyadic partition.

Figure 2 :
Figure 2: The left panel shows the representation of a θ ∈ R L 2,n that takes on two values. The right panel shows a dyadic partition with a minimal number of elements on which θ is piecewise constant. In this example k rdp (θ) = 12, the number of rectangles in the dyadic partition in the right panel.
and E d,n is the edge set of the graph L d,n . The class of signals BV d,n (V) is rich enough to contain signals that are smooth in certain regions of their domain but discontinuous in others. The problem of estimating a signal in the class BV d,n (V) has attracted a lot of attention in the statistics literature; see for instance Mammen and van de Geer (1997); Tibshirani (2014); Sadhanala et al. (2016); Hütter and Rigollet (2016); Padilla et al. (2018); Chatterjee and Goswami (2021); Ortelli and van de Geer (2019); Guntuboyina et al. (2020).
under the assumption that the quantile function is Hölder continuous. Other related nonparametric quantile estimators include the quantile random forest proposed by Meinshausen (2006). Brown et al. (2008) developed a wavelet-based estimator for median regression. Recently, Ye and Padilla (2021) developed the k-nearest neighbor quantile fused lasso approach, and Padilla et al. (2020) studied quantile regression with ReLU networks.

Figure 3 :
Figure 3: The top two panels show the true signal (median) θ* for Scenarios 1 and 2. The remaining panels show an example of data generated under each of the scenarios.

Figure 4 :
Figure 4: The first row of panels shows, from left to right, an instance of data generated under Scenario 5 and the true median signal θ*. The second row shows the estimate provided by QDCART and the one by QTVD. The third and fourth rows of panels show the corresponding plots for Scenario 6.

Figure 5 :
Figure 5: The first row of panels shows, from left to right, an instance of data generated under Scenario 7 and the true median signal θ*. The second row shows the estimate provided by QDCART and the one by QTVD.

Figure 6 :
Figure 6: The top two panels show the estimated PQFL and QDCART fits for τ ∈ {0.1, 0.5, 0.9} on the ion channel data. The bottom panel shows the raw ion channel data.

Table 1 :
Average mean squared error N^{-1} Σ_i (θ*_i − θ̂_i)², averaged over 100 Monte Carlo simulations for the different methods considered. Captions are described in the text.

Table 2 :
Average mean squared error N^{-1} Σ_i (θ*_i − θ̂_i)², averaged over 100 Monte Carlo simulations for the different methods considered, with data generated from Scenarios 5 and 6. Captions are described in the text.

Table 3: Average mean squared error N^{-1} Σ_i (θ*_i − θ̂_i)², averaged over 100 Monte Carlo simulations for the different methods considered, with data generated from Scenario 7. Captions are described in the text.