Optimal Convergence Rates of Deep Neural Networks in a Classification Setting

We establish optimal convergence rates, up to a logarithmic factor, for a class of deep neural networks in a classification setting under a condition sometimes referred to as the Tsybakov noise condition. We construct classifiers in a general setting where the boundary of the Bayes rule can be approximated well by neural networks. Corresponding rates of convergence are proven with respect to the misclassification error. It is then shown that these rates are optimal in the minimax sense if the boundary satisfies a smoothness condition. Non-optimal convergence rates already exist for this setting. Our main contribution lies in improving existing rates and showing optimality, which was an open problem. Furthermore, we show almost optimal rates under some additional constraints which circumvent the curse of dimensionality. For our analysis we require a condition which gives new insight into the constraint used. In a sense, it acts as a requirement for the "correct noise exponent" for a class of functions.


Introduction
We consider i.i.d. data (Y_i, X_i), i = 1, . . . , n, with Y_i ∈ {0, 1} and X_i ∈ R^d. Our goal is to provide an estimator of the form Ŷ = 1(X ∈ Ĝ), where Ĝ is constructed with a neural network which approximates Y well with respect to the misclassification error. We show optimal convergence rates under the following two conditions. First, the underlying distribution Q satisfies a noise condition as in [22], described below. Second, the boundary of the set G*_Q determined by f_Q(x) := Q(Y = 1 | X = x) satisfies certain regularity conditions.
Neural networks have shown outstanding results in many classification tasks such as image recognition [7], language recognition [4], cancer recognition [10], and other disease detection [16]. Our work follows current approaches in the statistical literature to explain the success of neural networks, e.g. the impactful contributions [12,21]. The objective is to fill a gap in the literature by proving optimal convergence rates in a specific setting which was also considered in [11]. We focus on deep feedforward neural networks with ReLU activation functions. Deep networks have been considered in many theoretical articles [12,13,18,19] and have proven useful in many applications [20,15]. Intuitively, we wish to approximate the set G*_Q directly instead of approximating the regression function f_Q.

The classification setting we consider is similar to the setting given in [17,22]. In particular, we assume that Q satisfies a noise condition which can be described as follows. For Q-measurable sets G_1, G_2, define the corresponding distances; the condition then states that there exist constants κ ≥ 1 and c_1 > 0 such that the respective inequality holds for all G. This requirement is sometimes referred to as the Tsybakov noise condition. It can be interpreted as a restriction on the probability distribution regarding regions close to the boundary where f_Q(x) = 1/2. Roughly speaking, it forces the mass to decay at a certain rate when one approaches this boundary. Using this, one can achieve rates approaching n^{-1} for small κ, i.e. if there is not much mass in the region around f_Q(x) = 1/2. The condition has been used in many statistical articles considering classification, such as [1] and [23], who analyse support vector machines.

Similarly to [17], we show optimal convergence rates in the case where the boundary of G*_Q satisfies certain regularity conditions, i.e. is similar to an element of a Dudley class [6]. More precisely, we consider sets which are slightly more general than the sets given in [18]. While many other approximation results using neural networks exist, see [5] using sigmoid activation functions or [24] using piecewise linear functions among others, the methods used in [18] inspired us to obtain the results for our setting. The sets they consider have been used in many articles such as [11,19]. As an estimator, we use a risk minimizer of the empirical version of the misclassification error. Precisely calculating this estimator involves finding a global minimum of a highly non-convex loss with respect to the parameters of a neural network. Typically, such calculations are not feasible in practice. Thus, the results we provide are theoretical in nature and do not have direct practical applications, as is typical for results of this kind [13,21]. From our point of view, the main value of current contributions is to show results such as consistency in situations which are typical for statisticians, using relatively simple classes of neural networks. In time, the techniques developed may be used to establish claims in cases which are closer to those encountered in reality, using classes of neural networks which are closer to those used in practice.

A lot of work has been done regarding consistency of feedforward deep neural networks. [21] prove optimal convergence rates with respect to the uniform norm in a regression setting. Among others, similar results were given by [9] for non-continuous regression functions with respect to the L_2-norm, [14] who did not use a sparsity constraint, and [2].
Regarding results for classification, [19] show convergence rates considering the misclassification error in a noiseless setting. Consistency results which include condition (1.1) in the assumptions are given by [3,13,8]. In contrast to our approach, the previously mentioned articles attempt to estimate the regression function f_Q instead of directly estimating the set G*_Q. Additionally, while some obtain optimal convergence rates, the settings do not correspond to the setting given in [22]. In particular, the (optimal) convergence rates in these papers differ from ours. A very interesting contribution was made by [11], who consider an almost identical situation to ours, although their estimators differ. However, the rates they obtain are not optimal in the minimax sense.

Contribution
Our contribution includes the following.
• First and foremost, Theorem 4.1 together with Corollary 3.7 prove optimal convergence rates in the minimax sense for the setting described above. To the best of our knowledge, we are the first to obtain optimal convergence rates using neural networks corresponding to the setting given in [22] and thus close this gap in the literature.
• Theorem 3.5 establishes convergence rates in a general setting, where the boundary of the set G*_Q can be well approximated by neural networks. This enables us to prove rates in a variety of settings. We use this theorem to prove optimal convergence rates under an additional constraint, which circumvents the curse of dimensionality in the sense that the rates do not deteriorate exponentially in the dimension d.
• In order to prove the results stated here, we require a condition which together with condition (1.1) forces κ to be the "correct parameter" for the distribution Q. We believe that this condition may bring new insights to condition (1.1) and may be helpful in other situations where (1.1) is required.

Outline
After introducing some notation, we rigorously introduce the problem at hand in Section 2. Here, we also provide some convergence results considering empirical risk minimizers with respect to arbitrary sets. These results are then used to prove our main consistency theorems regarding neural networks in Section 3. Section 4 includes the corresponding lower bounds followed by some concluding remarks in Section 5.

Notation
We introduce some general notation which is used throughout this article.
For x ∈ R, let ⌊x⌋ := max{k ∈ Z | k ≤ x} and ⌈x⌉ := min{k ∈ Z | k ≥ x}. Let λ be the Lebesgue measure. For a function g : Ω ⊆ R^s → R and k ∈ N, denote by ‖g‖_∞ and ‖g‖_{L_k} the uniform norm and the L_k-norm, respectively. Note that we omit the dependence on Ω in the notation. For x ∈ R^s, let ‖x‖_2 and ‖x‖_∞ be the Euclidean norm and the uniform norm, respectively. For j ∈ {1, . . . , s}, let x_{−j} := (x_1, . . . , x_{j−1}, x_{j+1}, . . . , x_s).
Additionally, let ‖·‖_{H^β} be the Hölder norm. For B > 0, define the class of Hölder-continuous functions by H^{β,B} := {g : ‖g‖_{H^β} ≤ B}. Let G_1, G_2 ⊆ Ω be two subsets. We write G_1 Δ G_2 := (G_1 \ G_2) ∪ (G_2 \ G_1) for their symmetric difference and 1(x ∈ G_1) := 1 for x ∈ G_1, 0 otherwise, for the indicator function corresponding to G_1.

General Convergence Results
In this section, we state our results in a relatively general setting. The results on neural networks in the next section only consider the case where Q X has a bounded density with respect to the Lebesgue measure. Our setup is similar to the binary classification setup of [22].

Classification Setup
We consider i.i.d. observations (Y_i, X_i), i = 1, . . . , n, distributed according to some probability measure Q, where X_i ∈ R^d and Y_i ∈ {0, 1}. Denote by Q_X the marginal probability distribution with respect to X ∈ R^d. The goal is to predict Y by Ŷ = 1(X ∈ Ĝ) for some Q-measurable set Ĝ ⊆ R^d. Note that a classifier is uniquely determined by Ĝ. Performance is measured by the misclassification error R(Ĝ) := Q(Y ≠ 1(X ∈ Ĝ)). The set G*_Q := {x ∈ R^d : f_Q(x) ≥ 1/2} is a so-called Bayes rule and thus minimizes the misclassification error. Classification can equivalently be seen as estimation of G*_Q by the set Ĝ, which is therefore equally referred to as a classifier. For a Q-measurable set G ⊆ R^d let R_n(G) := (1/n) ∑_{i=1}^{n} 1(Y_i ≠ 1(X_i ∈ G)) be the empirical version of the misclassification error R(G). We consider empirical risk minimization classifiers defined by Ĝ_n ∈ argmin_{G ∈ N_n} R_n(G), where N_n is some finite collection of Q-measurable sets for all n ∈ N.
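To make the estimator concrete, here is a minimal Python sketch of empirical risk minimization over a finite collection of candidate sets. It is purely illustrative: the half-space candidates, the data generation and all function names are our own placeholders, not part of the neural-network construction of Section 3.

```python
import numpy as np

def empirical_risk(indicator, X, Y):
    """Empirical misclassification error R_n(G) = (1/n) * sum 1(Y_i != 1(X_i in G))."""
    preds = indicator(X).astype(int)            # 1(X_i in G)
    return np.mean(preds != Y)

def erm_classifier(candidates, X, Y):
    """Return the candidate set (given as an indicator function) in the finite
    collection N_n that minimizes the empirical misclassification error."""
    risks = [empirical_risk(g, X, Y) for g in candidates]
    return candidates[int(np.argmin(risks))]

# Toy usage: candidate half-spaces {x : x[0] <= t} on simulated data.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
Y = (X[:, 0] <= 0.4).astype(int)
candidates = [lambda x, t=t: x[:, 0] <= t for t in np.linspace(0.0, 1.0, 21)]
G_hat = erm_classifier(candidates, X, Y)
```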

Consistency Results
Proposition 2.1 establishes convergences rates for estimating G * Q usingĜ n under certain conditions on N n and Q. For the loss function, we consider a slight generalization of the misclassification error for p ≥ 1. The proposition is somewhat similar to Theorem 2 from [17]. In contrast to our approach, they consider the discrimination of two probability distributions with underlying distribution functions and do not allow for nonoptimal convergence rates. The proposition is an important component for the proof of our main theorem given in Section 3. The proofs of this section can be found in Appendix A.
Proposition 2.1. Let τ n > 0 be a monotonically increasing sequence. Let Q be a class of potential joint distributions Q of (X, Y ) and N n be a collection of subsets of R d for all n ∈ N such that the following conditions hold.
(i) For all Q ∈ Q, all sets in ∪_{n∈N} N_n as well as G*_Q are Q-measurable. (ii) There exists a constant κ ≥ 1 such that the noise condition (1.1) holds with some constant c_1 > 0 for all G ∈ ∪_{n∈N} N_n and all Q ∈ Q.
Additionally, we assume that there is a constant N 0 ∈ N such that for all n ≥ N 0 the following holds.
(iv) There exist constants c_3, ρ > 0 such that the corresponding complexity bound on N_n holds. Then for all p ≥ 1 the convergence rates stated in the proposition hold. Condition (i) is needed for all terms to be well defined. Condition (iii) states that the set in question must be well approximated by elements of N_n. A sufficient assumption is that N_n is an ε-net of {G*_Q | Q ∈ Q}, where ε := c_1 τ_n^{−κ}. Together with (iv), this indirectly bounds the complexity of Q. If the class of sets {G*_Q | Q ∈ Q} is too large, one will not be able to find sets N_n that satisfy (iii) and (iv) at the same time. It is clear that the best rates are achieved with τ_n = n^{1/(ρ+2κ−1)}. We do not use the same sequences in conditions (iii) and (iv) since one can prove non-optimal convergence rates using this version of the proposition.
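Heuristically, the choice τ_n = n^{1/(ρ+2κ−1)} can be read as balancing the approximation error from (iii) against the stochastic error governed by (ii) and (iv). The following display is a sketch of this balancing under our own simplifying assumptions (namely, that (iii) yields an approximation error of order τ_n^{-κ} and that (iv) bounds log |N_n| by a multiple of τ_n^ρ); constants are ignored:

\[
  \underbrace{\tau_n^{-\kappa}}_{\text{approximation, (iii)}}
  \;\asymp\;
  \underbrace{\Big(\tfrac{\tau_n^{\rho}}{n}\Big)^{\kappa/(2\kappa-1)}}_{\text{stochastic error under (ii)}}
  \quad\Longleftrightarrow\quad
  \tau_n \asymp n^{\frac{1}{\rho+2\kappa-1}},
  \qquad\text{giving the rate }\; \tau_n^{-\kappa} = n^{-\frac{\kappa}{\rho+2\kappa-1}} .
\]

Here the exponent κ/(2κ−1) for the stochastic term is the familiar one for empirical risk minimization under a margin condition with parameter κ.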
The second condition is the noise condition described in the introduction. Note that, following [22], for κ > 1 condition (ii) holds if
Q_X( 0 < |f_Q(X) − 1/2| ≤ t ) ≤ c t^{1/(κ−1)}
for all t > 0 and some c > 0. Roughly speaking, this forces the mass to decay at a certain rate when one approaches the boundary of G*_Q. Note that we use (ii) instead of this assumption since it is slightly more general and includes the case κ = 1. Additionally, it appears more naturally in the proofs. Observing the alternative assumption, κ = 1 corresponds to the case where there is no mass close to the boundary of G*_Q, meaning that f_Q does not take on values close to 1/2. Using Proposition 2.1, one can achieve rates approaching n^{-1} for small κ and ρ, i.e. if there is not much mass in the region around f_Q(x) = 1/2 and the complexity of N_n, and consequently of Q, is moderate. In contrast, Proposition 2.2 below provides convergence rates if condition (ii) is not satisfied.
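As a heuristic illustration of the relation κ = 1 + β used from Section 3 onwards (a sketch under our own assumptions that Q_X has a bounded Lebesgue density and that |f_Q(x) − 1/2| grows like the β-th power of the distance to the boundary of G*_Q):

\[
  Q_X\big( 0 < |f_Q(X) - \tfrac12| \le t \big)
  \;\lesssim\; \lambda\big( \{ x : \operatorname{dist}(x, \partial G^*_Q) \lesssim t^{1/\beta} \} \big)
  \;\lesssim\; t^{1/\beta},
\]

so the sufficient condition above holds with exponent 1/(κ−1) = 1/β, i.e. κ = 1 + β.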
Proposition 2.2. Let τ n > 0 be a monotonically increasing sequence. Let Q be a class of potential joint distributions Q of (X, Y ) and N n be a collection of subsets of R d for all n ∈ N such that the following conditions hold.
(i) For all Q ∈ Q, all sets in ∪_{n∈N} N_n as well as G*_Q are Q-measurable. Additionally, we assume that there is a constant N_0 ∈ N such that for all n ≥ N_0 the following holds.
(ii) There is a constant c_2 > 0 such that for all n ∈ N and Q ∈ Q there is a G ∈ N_n with d_{f_Q}(G, G*_Q) ≤ c_2 τ_n^{−1}. Then for all p ≥ 1 the convergence rates stated in the proposition hold. Note that the requirement in condition (ii) of Proposition 2.2 corresponds to requirement (iii) of Proposition 2.1 with κ = 1. However, for p = 1 the best rate achievable is of order n^{−1/(ρ+2)}, which is always slower than n^{−1/2}. Proposition 2.2 provides rates in the absence of condition (ii) of Proposition 2.1. We do not claim optimality for these rates.

Convergence Rates for Neural Networks
We begin by briefly introducing neural networks. The idea is to use Proposition 2.1 to obtain optimal convergence rates up to a log factor. Neural networks are used to define a suitable class of sets N_n for every n ∈ N.

Definitions regarding Neural Networks
A neural network with network architecture (L, (m_0, m_1, . . . , m_{L+1})) is a collection of weight matrices and shift vectors Φ = ((W_1, b_1), . . . , (W_{L+1}, b_{L+1})), where each W_s ∈ R^{m_s × m_{s−1}} is a weight matrix and b_s ∈ R^{m_s} is a shift vector. The realization of a neural network Φ on a set D ⊆ R^{m_0} is the function R(Φ)(x) = W_{L+1} σ(W_L σ(· · · σ(W_1 x + b_1) · · · ) + b_L) + b_{L+1}, where the activation function σ is applied componentwise. A network without hidden layers is determined by a single weight matrix W and has realization R(Φ)(x) = W x. Note that, in general, the weights of a neural network Φ, i.e. the entries of its shift vectors (b_1, . . . , b_{L+1}) and weight matrices (W_1, . . . , W_{L+1}), are not uniquely determined by its realization R(Φ). In the following, for brevity, we occasionally introduce a network by defining its realization. In such a case, it is clear from the presentation of the realization which precise neural network is considered.
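Purely as an illustration (a minimal sketch of our own, not the construction used later in the paper), the realization of a ReLU feedforward network can be written in a few lines of Python; the toy parameters are placeholders:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def realization(weights, biases, x):
    """Realization R(Phi)(x) of Phi = ((W_1, b_1), ..., (W_{L+1}, b_{L+1})):
    ReLU activations on the L hidden layers, affine map in the output layer."""
    a = np.asarray(x, dtype=float)
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(W @ a + b)
    return weights[-1] @ a + biases[-1]

# Toy usage: d = 2 inputs, one hidden layer of width 3, scalar output.
rng = np.random.default_rng(1)
Ws = [rng.uniform(-1, 1, size=(3, 2)), rng.uniform(-1, 1, size=(1, 3))]
bs = [rng.uniform(-1, 1, size=3), rng.uniform(-1, 1, size=1)]
value = realization(Ws, bs, np.array([0.2, 0.7]))
```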
As a first step, we wish to introduce a suitable finite class of sets parameterized by neural networks and count the number of its elements. We define these sets as R(Φ)^{−1}(1), where Φ is a neural network with realization R(Φ). Equivalently, we could have considered neural networks with a binary step function in the output layer, or we could find a neural network Φ̃ and define the approximating set R(Φ̃)^{−1}((0.5, 1]), which is closer to the idea that the realization of the neural network represents some sort of probability. Since this is not the idea of our approximation results, we stick to the version above. In order to obtain a finite class, we need to reduce the number of considered elements of N_{L,m,σ} while maintaining reasonable approximation capabilities. A typical approach in the theoretical literature is to use a sparsity constraint. For s > 1 we therefore only consider realizations of neural nets which have at most s nonzero weights. If s is the total number of nonzero weights, we say that the network has sparsity s. Additionally, we assume all weights to be elements of the set W_c := {k 2^{−c} : k ∈ Z, |k| ≤ 2^c}. Thus, we only consider weights |w| ≤ 1. Concluding, we use the following notation to describe the collection of sets we are interested in.
Definition 3.2. Let L_0, c ∈ N and s_0 > 1 be fixed. Denote by Ñ_{L_0,s_0,c} the set of realizations of neural networks with d-dimensional input, one-dimensional output, at most L_0 layers, ReLU activation functions and sparsity at most s_0, where all weights are elements of W_c. The class of corresponding sets given by neural networks is then N_{L_0,s_0,c} := {R(Φ)^{−1}(1) : R(Φ) ∈ Ñ_{L_0,s_0,c}}. Note that the requirements from Definition 3.2 allow for realizations of neural networks with arbitrary hidden layer dimensions (m_1, . . . , m_L) ∈ N^L. However, it is easy to see that every element of Ñ_{L_0,s_0,c} is a realization of a neural net which satisfies the properties described in the definition and m_i ≤ s_0 for all i ∈ {1, . . . , L_0}. Using this, we obtain an upper bound on the number of elements of N_{L_0,s_0,c} by counting the number of corresponding neural networks. Thus, the following bound is independent of the choice of activation functions σ.
Proof. First of all, if s_0 ≤ L_0, clearly only the last s_0 layers have an influence on the realization of a neural network. Thus, an upper bound is given by counting the number of neural nets with at most min{s_0, L_0} layers, sparsity at most s_0, weights in W_c and m_i ≤ s_0 for all i ∈ {1, . . . , L}. Each weight can take on |W_c| = 2^{c+1} + 1 different values. The total number of weights can be bounded accordingly. Note that if s_0 ≤ L_0, the input dimension does not influence the outcome. Therefore, there are at most the corresponding number of possible combinations to pick s_0 (possibly) nonzero weights. Thus the bound follows.
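As a rough numerical illustration of this counting argument, the following Python sketch computes an upper bound on log |N_{L_0,s_0,c}| under our own bookkeeping (choose at most s_0 of the available weight positions, each taking one of |W_c| = 2^{c+1} + 1 values); the constants and the exact bound of Lemma 3.3 may differ.

```python
from math import comb, log

def log_network_count_bound(L0, s0, c, d):
    """Crude upper bound on log|N_{L0,s0,c}|: with hidden widths at most s0 and at
    most min(L0, s0) relevant layers, count the total number T of weight/shift
    entries, choose s0 of them to be nonzero, and let each chosen entry take one of
    2**(c+1) + 1 values.  Illustrative bookkeeping only."""
    L = min(L0, s0)                               # more layers than s0 cannot all be active
    widths = [d] + [s0] * L + [1]                 # input, hidden, output dimensions
    T = sum(widths[i + 1] * (widths[i] + 1) for i in range(L + 1))  # weights + shifts
    n_values = 2 ** (c + 1) + 1
    return log(comb(T, min(s0, T))) + min(s0, T) * log(n_values)

# Example: L0 = 5 layers, sparsity s0 = 50, weights in W_8, input dimension d = 3.
bound = log_network_count_bound(L0=5, s0=50, c=8, d=3)
```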

Conditions on the Bayes-Rule
In order to define the set of probability distributions we consider for approximation, we restrict the possible Bayes rules. We then add a smoothness condition on the function f_Q near the boundary of the respective Bayes rule. Intuitively, the boundary should satisfy some kind of smoothness condition so that it can be approximated by neural networks. Additionally, the set must be discretizable in some sense. When using F = F_{β,B,d−1}, the class of sets we use is similar to a class defined in [18]. Note that the class used here is larger. This version depends on a set F which represents a class of boundary functions. The idea is that we can obtain different convergence rates for different classes using the same procedure.
1. For all ν = 1, . . . , u we have the corresponding property.
3. If β > 0, the following holds: for ν = 1, . . . , u and all x ∈ D_ν ∩ ∂H_ν there exists g_{ν,x} ∈ H^{β,B} such that the stated bound holds for y_{j_ν} in the corresponding interval.

The idea is to use sets defined by realizations of neural networks to approximate G*_Q ∈ K^F_{Q,β,B,ε_1,ε_2,r,d} from Definition 3.4 for a suitable class F in order to apply Proposition 2.1. Figure 1 shows an example of an element of K^F_{Q,β,B,ε_1,ε_2,12,2}, where F is the set of piecewise linear functions. The definition contains an additional condition on the function f_Q close to the boundary of G*_Q. Following the intuition mentioned in [17], for β > 0 condition (ii) in Proposition 2.1 means that f_Q acts like x^β close to the boundary of G*_Q, where κ = 1 + β. More precisely, condition (ii) requires that f_Q does not increase slower than x^β. In order to prove the combination of conditions (iii) and (iv), we require that β is the correct rate, meaning that f_Q does not increase faster than x^β. In Section 4 we prove that this condition does not lower the complexity of the problem. Thus, the rates obtained by [17] are still optimal.
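To fix ideas, the following display shows one prototypical piece consistent with the notation of Definition 3.4. This is our own reading of the construction (in particular, the orientation indicator ι_ν may flip the inequality), not a verbatim restatement of the definition:

\[
  H_\nu \;=\; \big\{ x \in D_\nu \;:\; x_{j_\nu} \le \gamma_\nu(x_{-j_\nu}) \big\},
  \qquad
  G^*_Q \;=\; H_1 \cup \dots \cup H_u , \quad u \le r ,
\]

with D_ν a hyperrectangle and γ_ν ∈ F a boundary function of the remaining d − 1 coordinates.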

Main Theorems
We begin by stating the central result of this article. We then use this result to show consistency results for more specific cases. The rates we obtain in Theorem 3.5 are optimal up to a log factor. In the following, all proofs of this section are given in Appendix B.
Theorem 3.5. Let ρ > 0 and let F be a class of boundary functions such that the following holds. There exist ε_0, C_1, C_2 > 0 and C_3, C_4 ∈ N such that for any γ ∈ F and any ε ∈ (0, ε_0) there is a neural network Φ with at most C_1 log(ε^{−1}) layers, sparsity at most C_2 ε^{−ρ} log(ε^{−1}) and weights in W_c with c = C_3 + C_4 log(ε^{−1}) such that ‖R(Φ) − γ‖_∞ ≤ ε. Define κ := 1 + β and let Q be a class of potential joint distributions Q of (X, Y) such that the following conditions hold.
(a) There is a constant M > 1 such that for all Q ∈ Q the marginal distribution of Q X has a Lebesgue density bounded by M .
(b) There are constants r ∈ N and ε_1, ε_2 > 0 such that for all Q ∈ Q the Bayes rule satisfies G*_Q ∈ K^F_{Q,β,B,ε_1,ε_2,r,d}.

Here N_n is the class of sets corresponding to the neural networks described above. Then there exist constants C_1, C_2 > 0 and C_3 ∈ N such that for all p ≥ 1 the corresponding rate bounds hold.

Results for Regular Boundaries
We can now apply Theorem 3.5 to specific classes of boundary functions F to obtain convergence results. A first important example is the class F_{β,B,d}. The following lemma is a consequence of Theorem 5 in [21].
Then there exist constants C_1, C_2 > 0 and C_3 ∈ N such that for all p ≥ 1 the corresponding rate bounds hold, with L_0, s_0, c_0 taken from Lemma 3.6. Corollary 3.7 together with Theorem 4.1 from the next chapter prove optimal convergence rates, which was the main goal of this paper.
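To make the dependence on the dimension explicit, consider the following heuristic (an assumption on our part, stated for illustration): if ReLU networks require sparsity of order ε^{−(d−1)/β} to approximate a (d−1)-variate β-Hölder boundary function to accuracy ε, which is the standard behaviour up to logarithmic factors, then ρ = (d−1)/β in Proposition 2.1 and the rate from Theorem 3.5 reads, up to logarithmic factors,

\[
  n^{-\frac{\kappa}{\rho+2\kappa-1}}
  \;=\; n^{-\frac{1+\beta}{(d-1)/\beta \,+\, 2\beta + 1}}
  \;=\; n^{-\frac{\beta(1+\beta)}{d-1+\beta(2\beta+1)}},
  \qquad \kappa = 1+\beta,\; \rho = \tfrac{d-1}{\beta},
\]

which slows down as d grows.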
Following up, note that the rates we receive from Corollary 3.7 are affected by the curse of dimensionality. Observe that the rates obtained by Theorem 3.5 are influenced by condition (c) on the one hand and by the ability of neural networks to approximate sets in F on the other. The dependence on the dimension d in Corollary 3.7 comes from the latter. Thus, a natural approach to circumvent the curse of dimensionality is to approximate a smaller class F. Intuitively, it is clear that without strong restrictions on the distribution we can only overcome the curse if the complexity of the boundaries of the sets we approximate is small enough so that they themselves can overcome the curse. In the literature, many different sets are considered which yield useful approximation properties for neural networks. Here, we use a class of sets introduced by [21] which is close to the class F_{β,B,d}.
However, this does not enlarge the class considerably. It can easily be seen that we can instead increase the bound B to find an even larger class. The idea for using the set G_{r,t,β,B,d} is that its complexity does not depend on the input dimension d_1, but only on the most difficult component to approximate. The complexity of the components depends on their effective dimension t_i and their implied smoothness. As described by [21], the correct smoothness parameter to consider is the effective smoothness of the components. Examples of sets that can profit from Definition 3.8 are additive models (r = 1, t_1 = 1), interaction models of order k (r = 1, t_1 = k), or multiplicative models (they are a subset of G_{r,t,β,B,d} when r = log_2(d) + 1 and t_i = 2 for all i).
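As an illustration of the first entry in this list (a sketch; the precise embedding into Definition 3.8 is not reproduced here), an additive boundary function has the form

\[
  \gamma(x) \;=\; \sum_{i=1}^{d-1} g_i(x_i), \qquad g_i \in H^{\beta, B},
\]

so every building block depends on a single coordinate only. In the notation above this corresponds to r = 1 and t_1 = 1, and the approximation difficulty is driven by the effective dimension 1 rather than by d − 1.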
Next, our goal is to establish a convergence result when the set of boundary functions is G_{r,t,β,B,d}. Similarly to the approach above, we first provide a lemma which gives approximation results using neural networks.
Let G_{r,t,β,B,d} be defined as in Definition 3.8, with the associated quantities defined accordingly. Then there exist ε_0, c_1, c_2 > 0 and c_3, c_4 ∈ N such that the following holds. For any function γ ∈ G_{r,t,β,B,d} and any ε ∈ (0, ε_0), there exists a neural network with the approximation properties required in Theorem 3.5. The following corollary establishes the corresponding convergence result. Theorem 4.2 provides the lower bound in the case where t_i ≤ min{d_1, . . . , d_i}.
Define κ := 1 + β 1 and let Q be a class of potential joint distributions Q of (X, Y ) such that the following conditions hold.
(a) There is a constant M > 1 such that for all Q ∈ Q the marginal distribution of Q X has a Lebesgue density bounded by M .
(b) There are constants r_1 ∈ N and ε_1, ε_2 > 0 such that for all Q ∈ Q the Bayes rule satisfies G*_Q ∈ K^{G_{r,t,β,B,d}}_{Q,β_1,B,ε_1,ε_2,r_1,d_1}.
Then there exist constants C_1, C_2 > 0 and C_3 ∈ N such that for all p ≥ 1 the corresponding rate bounds hold. Note that Corollary 3.10 is a generalisation of Corollary 3.7. The rate now depends on ρ, which in turn depends on t_1, . . . , t_{r_2} instead of the input dimension d_1. Clearly, the effective dimensions t_i can be much smaller than the input dimension d_1, for example when the boundary functions are additive.

Lower Bound
We now establish lower bounds on the convergence rates from Corollaries 3.7 and 3.10. Note that the lower bounds also prove that the rates obtained in Theorem 3.5 cannot be improved by more than a log-factor. Since Corollary 3.10 is a generalisation of Corollary 3.7, we only have to prove a lower bound for the setting given in the former. For clarity, we provide both statements. The proofs of this section can be found in Appendix C.
Intuitively, B_1 bounds the factor of the x^β-term of f_Q close to the boundary from above. On the other hand, c_1 bounds this term from below. Thus, not every combination of B_1, c_1 > 0 is possible. We prove Theorem 4.1 for large c_1 > 0. We do not provide the exact ratio of B and c_1 required since it is not important for the statement. Lastly, the lower bound corresponding to Corollary 3.10 is given.
Define κ := 1 + β_1 and let Q be the class of all potential joint distributions Q of (X, Y) such that (a), (b) from Corollary 3.10 hold for some M > 1, r ∈ N, ε_1, ε_2 > 0. Let (c) hold with c_1 > 0 large enough and define the quantities as above. Then the corresponding lower bound holds for every p ≥ 0, where G contains all estimators depending on the data (X_1, Y_1), . . . , (X_n, Y_n).

Concluding Remarks
We establish optimal convergence rates up to a log-factor in a classification setting under condition (1.1) using neural networks. Theorem 3.5 can be applied to many different classes of boundary functions. The complexity of the class of boundary functions F is one of the main driving factors of the convergence rate. In particular, many approaches which circumvent the curse of dimensionality in a regression setting can be used to circumvent the curse in this classification setting.
Note that this paper is of a theoretical nature. While sparsity constraints are considered thoroughly in the theoretical literature, they are not widely used in practice. Additionally, we did not discuss the minimization required for the calculation of Ĝ_n. This is a very interesting but complicated topic which is beyond the scope of this article. Observe that the class of neural networks used in Theorem 3.5 depends on κ as well as on ρ. We believe that one can extend the results of this paper by either using adaptive classes of neural networks or a class independent of κ and ρ, in a similar manner to [22]. One obstacle to overcome is the fact that the conditions required on the probability distribution Q are not strictly weaker for larger κ and ρ. Lastly, while the goal of this paper is to prove results for neural networks, it also contains new insights on the noise condition (1.1). In order to establish optimal convergence rates, an additional condition is necessary to show approximation results for neural networks with respect to the metric d_{f_Q}. Intuitively, the reverse inequality is required for certain sets. Note that requiring the reverse inequality is an overly restrictive assumption which holds for almost no class of possible distributions Q for κ = 1. This proved to be a major challenge and is solved by condition (3.) in Definition 3.4. While condition (3.), once satisfied for some β, is also satisfied for larger but not for smaller β (and thus κ), the reverse is true for condition (1.1). Thus, together the requirements force κ to be the "correct rate". Note that condition (3.) still allows for highly discontinuous f_Q close to the boundary of G*_Q. This is essential, since considering only smooth f_Q close to the boundary leads to different convergence rates, as shown in Theorem 2 of [11].

A General Convergence Results
The proof of Proposition 2.1 is similar to the proof of Theorem 2 in [17]. For the sake of completeness, we provide the entire argument here anyway.
Proof of Proposition 2.1. Let n ≥ N_0. Without loss of generality, we may assume that τ_n ≤ n^{1/(ρ+2κ−1)}, since otherwise the conditions are also satisfied when using τ̃_n = n^{1/(ρ+2κ−1)}. We begin by proving the assertion for the first term. The idea is to bound the corresponding probability for some t > 0. First, observe that the stated inequality holds for any G ∈ N_n. Regarding (iii), for every n ∈ N there exists a G_n ∈ N_n such that the approximation bound holds. For t > 0, define the set Ξ_t. Then, for t ≥ 4c_2 and G ∈ Ξ_t we have the corresponding bound. Recall that by definition Ĝ_n minimizes R_n(·). Therefore, in view of the calculations above, the stated inclusion holds for t ≥ 4c_2. Using inequality (A.1) in the third row yields the next bound, and thus the probability splits into two terms. It remains to find upper bounds for the two terms above. In order to bound the first term, note that for (x, y) ∈ R^d × {0, 1} and any G ∈ N_n we have the pointwise bound. For all i = 1, . . . , n this implies |U_i(G)| ≤ 2 and a bound on the variance, where the last inequality follows from (ii). By Bernstein's inequality, for all a > 0 the corresponding tail bound holds for some constant k_2 > 0. Noting that by definition τ_n ≤ n^{1/(ρ+2κ−1)} and κ ≥ 1, by (iv) we obtain the desired bound on the first term. To bound the second term we use Bernstein's inequality with a = c_2 τ_n^{−κ} and receive a tail bound with some constant k_3 > 0. Therefore, for t ≥ max{4c_2, (2c_3/k_2)^{κ/(2κ−1)}} we find an upper bound on the probability, and the assertion for the first term follows. Proving that the second term in the assertion is finite follows directly, since regarding (ii) the noise inequality holds for all Q ∈ Q and sets G ∈ N_n.

Proof of Proposition 2.2. Let n ≥ N_0. Without loss of generality, we may assume that τ_n ≤ n^{1/(ρ+2)}, since otherwise the conditions are also satisfied when using τ_n = n^{1/(ρ+2)}. The idea is to bound the corresponding probability. Regarding (ii), for every n ∈ N there exists a G_n ∈ N_n such that the approximation bound holds. For t > 0, define the set Ξ_t. Then, for t ≥ 4c_2 and G ∈ Ξ_t we have the corresponding bound. Recall that by definition Ĝ_n minimizes R_n(·). Therefore, in view of the calculations above, the stated inclusion holds for t ≥ 4c_2. Using inequality (A.2) in the third row yields the next bound, and thus the probability again splits into two terms. It remains to find upper bounds for the two terms above. In order to bound the first term, note that for (x, y) ∈ R^d × {0, 1} and any G ∈ N_n we have the pointwise bound, and Bernstein's inequality yields a tail bound for some constant k_2 > 0. Noting that by definition τ_n ≤ n^{1/(ρ+2)}, by (iii) we obtain the desired bound on the first term. To bound the second term we use Bernstein's inequality with a = c_2 τ_n^{−1} and receive a corresponding tail bound; together this gives an upper bound on the probability. Observing that d_{f_Q}(Ĝ_n, G*_Q) ≤ 1, we conclude the assertion.
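For reference, the form of Bernstein's inequality used repeatedly above (standard statement; the constants k_2, k_3 in the proofs absorb the ones appearing here): for independent, centred random variables U_1, . . . , U_n with |U_i| ≤ b almost surely and ∑_{i=1}^{n} E[U_i^2] ≤ v,

\[
  \mathbb{P}\Big( \sum_{i=1}^{n} U_i \ge a \Big)
  \;\le\; \exp\!\Big( - \frac{a^2}{2\,( v + a b / 3 )} \Big),
  \qquad a > 0 .
\]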

B Convergence Rates for Neural Networks
The first goal of this section is to prove Theorem 3.5. We then follow this up by proving Lemma 3.6 and Lemma 3.9.

B.1 Proof of the Main Result
In order to simplify the approximation results below, we introduce a lemma considering the parallelization and concatenation of two networks Φ_1 and Φ_2. Since these results have been shown in many other articles, e.g. [18,21], we omit the proof.
Lemma B.1. Let R(Φ_1) and R(Φ_2) be realizations of neural networks with L_1, L_2 layers, sparsity s_1, s_2 and weights in W_{c_1}, W_{c_2}, respectively.
The resulting network can be realized by a neural network with L = max{L_1, L_2} layers, sparsity s ≤ s_1 + s_2 + 2dL and weights in W_{max{c_1, c_2}}.
Note that we only use weights |w| ≤ 1. In order to approximate large numbers, we use the following lemma.
Proof. The network Φ_1 is given by the corresponding construction. The other network is constructed with W_0 = 0 and b_0 = 1.
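To illustrate the idea behind the lemma (a toy sketch of one standard construction, not the network defined in the proof): with weights and shifts bounded by 1, a large constant such as 2^L can be produced by starting from the shift value 1 and doubling it across L layers.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def large_constant_network(L):
    """Toy illustration: with all weights and shifts in [-1, 1], L hidden layers of
    width 2 output 2**L by repeatedly adding a value to itself (doubling).
    This mimics the idea behind the lemma, not its exact construction."""
    # First hidden layer: start from the constant 1 (via the shift), duplicated twice.
    weights = [np.zeros((2, 1))]
    biases = [np.ones(2)]
    # Each further hidden layer sums both coordinates into each coordinate (doubling).
    for _ in range(L - 1):
        weights.append(np.ones((2, 2)))
        biases.append(np.zeros(2))
    # Output layer: sum the two coordinates once more.
    weights.append(np.ones((1, 2)))
    biases.append(np.zeros(1))
    return weights, biases

def realization(weights, biases, x):
    a = np.asarray(x, dtype=float)
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(W @ a + b)
    return weights[-1] @ a + biases[-1]

# realization(..., x=[0.0]) returns 2**L for any input; e.g. L = 5 gives 32.
Ws, bs = large_constant_network(5)
value = realization(Ws, bs, np.array([0.0]))
```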
Next, we construct a neural network for each G ∈ K^F_{Q,β,B,ε_1,ε_2,r,d} which approximates G well with respect to the metric d_{f_Q}. The rough idea for the construction of the network is similar to ideas used in [18]. However, the precise construction needed to adapt to the metric in question differs substantially. The proof of the following theorem is one of the main contributions of this paper.
Let F be a class of boundary functions such that the following holds. There exist ε_0, C_1, C_2 > 0 and C_3, C_4 ∈ N such that for any γ ∈ F and any ε ∈ (0, ε_0) there is a neural network Φ with L ≤ L_0(ε) := C_1 log(ε^{−1}) layers, sparsity s ≤ s_0(ε) := C_2 ε^{−ρ} log(ε^{−1}) and weights in W_c with c = c_0(ε) := C_3 + C_4 log(ε^{−1}) such that ‖R(Φ) − γ‖_∞ ≤ ε. Define κ = 1 + β and let Q be a class of potential joint distributions Q of (X, Y) such that the following conditions hold.
(a) There is a constant M > 1 such that for all Q ∈ Q the marginal distribution of Q X has a Lebesgue density bounded by M .
(b) There are constants r ∈ N and ε_1, ε_2 > 0 such that for all Q ∈ Q the Bayes rule satisfies G*_Q ∈ K^F_{Q,β,B,ε_1,ε_2,r,d}. Let N_n be as above. Then there exist constants C_1, C_2 > 0 and C_3 ∈ N such that the set N_n satisfies the following property. There are constants c_2 > 0 and N_0 ∈ N such that for all n ≥ N_0 and Q ∈ Q there is a G ∈ N_n with d_{f_Q}(G, G*_Q) ≤ c_2 τ_n^{−κ}.

Proof. Set ε_0 := min{ε_1, ε_2/4}. Choose N_0 large enough such that τ_{N_0} ≥ ε_0^{−1}. The proof is outlined as follows. We first construct a candidate set G using neural networks. Then, we show that it satisfies the desired properties.
Let n ≥ N_0, Q ∈ Q and G*_Q = H_1 ∪ · · · ∪ H_u as in Definition 3.4 with u ≤ r. We begin with the construction of the candidate set G. The idea is to define a network which approximates G*_Q well on each set H_ν separately. Define ι_ν, j_ν, a^ν_i, b^ν_i, D_ν, γ_ν and g_{ν,x} as in Definition 3.4. First, for each ν = 1, . . . , u we consider a set D̃_ν with boundaries lying on a grid. The advantage of using H̃_ν = D̃_ν ∩ H_ν instead of H_ν is twofold. On the one hand, the grid and parameters of N_n are defined such that the boundaries of D̃_ν can be constructed precisely. On the other hand, using the grid, two sets H̃_{ν_1}, H̃_{ν_2} have a minimum distance for ν_1 ≠ ν_2, which is important for our method to work. For δ > 0, define the corresponding grid. Set ε := τ_n^{−1}. Define I := {0, h^κ, 2h^κ, . . . , 1 − h^κ} and let the quantities ã^ν_j, b̃^ν_j be defined accordingly for ν = 1, . . . , u, j = 1, . . . , d. Now, set D̃_ν accordingly.
Note that b̃^ν_{j_ν} − ã^ν_{j_ν} ≥ 2ε for all ν = 1, . . . , u by the choice of ε_0. Figure 2 shows the collection of sets D̃_ν in the example considered in Figure 1.

Figure 2: The collection of sets D̃_ν when considering the example from Figure 1. The dotted lines are the boundaries of the sets D_1, . . . , D_{12}. Note that δ is quite large in this example and observe that the distance between two sets D̃_{ν_1}, D̃_{ν_2} is at least 2/(δ + 1).
We begin by finding constants C_1, C_2, C_3 > 0 such that the following holds. Clearly, this realization R(Φ) can be achieved with suitably chosen C_1, C_2 > 0, C_3 ∈ N and weights in W_{C_3 c_0(τ_n)}. Note that the constants do not depend on u.
Next, we show that the set G := R(Φ)^{−1}(1) satisfies the desired approximation property d_{f_Q}(G, G*_Q) ≤ τ_n^{−κ}. First, for ν = 1, . . . , u, define E_ν as follows. It is easy to see that D̃_ν ≠ ∅ implies D_ν ⊆ E_ν. Set E := ∪_{ν=1}^{u} (E_ν ∪ D̃_ν). Figure 3 shows E in the example considered in Figures 1 and 2. Clearly G*_Q, G ⊆ E. Thus the error splits into two terms, which we bound in turn. For (I) we observe that, since by construction h^κ ≤ 2τ_n^{−κ}, the corresponding bound holds.

Figure 3: The set E in the example considered in Figures 1 and 2. Note that E covers a majority of the space, since ε = τ_n^{−1} is quite large in this example. Observe that G*_Q, G ⊆ E.
The calculations for the second term are a bit more involved. First, observe that by construction of G, for all ν = 1, . . . , u we have We have the following cases.
The remainder of the proof of our main result is now simple.

B.2 Proofs for Regular Boundaries
Next, we prove Lemma 3.6. We first provide the corresponding statement from [21]; Lemma 3.6 is a reformulated version of it. Theorem B.4 guarantees a network with the stated architecture and weights |w| ≤ 1 such that the respective approximation bound holds.

Proof. This is Theorem 5 in [21].
Theorem B.4 implies the following. For any f ∈ F_{β,B,d} there exists a network Φ with L = 8 + (k_2 log(ε^{−1}) + 5)(1 + log_2(d)) layers, sparsity s and weights |w_i| ≤ 1 such that the corresponding approximation bound holds. Let V be defined as in the proof of Lemma 3.3. Following the proof of Lemma 12 of [21], we see that for any g ≤ 4(L+1)V there is a neural network Φ with L layers and sparsity s achieving the same accuracy, where the nonzero weights of Φ are discretized with grid size g. Now, define c_3 := log(4(c_1 + 1)(dc_2 + c_1(c_2 + 1)^2)) and c_4 accordingly. Therefore, all weights are elements of W_c with c = c_3 + c_4 log(ε^{−1}).

Lastly, we prove Lemma 3.9. The extension to this case is similar to the extension in [21].
Proof of Lemma 3.9. Let γ be given as in Definition 3.8. We first construct a candidate network Φ for γ. Then, we show that it approximates γ well and satisfies the required properties.
In order to construct a network that approximates γ well, we first approximate the γ_{ij} and τ_{ij} using neural networks. The final network is constructed using concatenation and parallelization. Let i = 1, . . . , r, j = 1, . . . , d_i + 1 and ε_i > 0. Using Lemma 3.6, there exist corresponding approximating networks. Additionally, the function ι_{ij} is the realization of a network with 0 layers and sparsity t_i. Since concatenating and parallelizing networks using Lemma B.1 leads to linear transformations of the upper bounds on the layers, the sparsity and the constant c, there exist constants c_1, c_2 > 0, c_3, c_4 ∈ N such that the function γ̃ is the realization of a network with the corresponding number of layers, sparsity and weights in W_c. Now, let ε > 0 be small enough. We show that γ̃ approximates γ well for suitably chosen ε_i. Following Lemma 9 in [21], we have the corresponding bound for some constant C > 0. Set the ε_i accordingly. Additionally, the network Φ has the stated number of layers and sparsity, and weights in W_c for some constants c_1, c_2 > 0, c_3, c_4 ∈ N.

C Lower Bound
We first prove Theorem 4.1. The outline of the proof is similar to the proof of Theorem 3 in [17]. However, the setting of Theorem 3.5 differs substantially from theirs. This leads to a new situation and new technical challenges to overcome in the proof of Theorem 4.1 .
Proof of Theorem 4.1. By Hölder's inequality and condition (c) it is enough to consider the case p = 1 and the first inequality. Let Q_1 ⊆ Q be a finite set of potential probability measures of (X_1, Y_1), . . . , (X_n, Y_n). Then the minimax risk over Q is bounded from below by the corresponding risk over Q_1. Hence, it suffices to show that for any estimator G_n we have the bound (C.1) almost surely, for some constant c > 0. We now define the set Q_1. Then, we prove Q_1 ⊆ Q. Lastly, we show that Q_1 satisfies (C.1).
for some k_2, k_3 > 0. Finally, let Q_1 be the corresponding set of distributions. We now show that Q_1 ⊆ Q by properly selecting the constants c_1, k_1, k_2, k_3 such that f_{Q_w} is well defined for all w ∈ W and by showing that Q_1 satisfies conditions (a), (b), (c).

First of all, we choose k_1, k_3 small enough and (given k_2 > 0) K_0 large enough such that for all K ≥ K_0 the required bounds hold for all x ∈ [0, 1]^d and w ∈ W.
(a) Clearly, for all w ∈ W the marginal distribution of Q_w with respect to X has a Lebesgue density which is bounded by 1 ≤ M.
(b) We need to show G*_{Q_w} ∈ K^F_{Q,β,B,ε_1,ε_2,r,d} for all w ∈ W.

clear.
This implies the assertion.
We first bound the term ∫ min{dQ_0^n, dQ_1^n} dψ. Using the fact stated above, we obtain a bound with some constant c > 0 for n large enough. Thus the assertion follows for some constant c > 0. This concludes the proof.
Lastly, the proof of Theorem 4.2 is provided. The ideas used in the proof are very similar to those used in the proof of Theorem 4.1 above. We therefore only focus on the differences. As in the proof of Theorem 4.1, define φ : R → [0, 1] to be an infinitely often differentiable function with the following two properties:
• φ(t) = 0 for |t| ≥ 1,
• φ(0) = 1.