Estimation error analysis of deep learning on the regression problem on the variable exponent Besov space

Deep learning has achieved notable success in various fields, including image and speech recognition. One of the factors in the successful performance of deep learning is its high feature extraction ability. In this study, we focus on the adaptivity of deep learning; consequently, we treat the variable exponent Besov space, which has a different smoothness depending on the input location $x$. In other words, the difficulty of the estimation is not uniform within the domain. We analyze the general approximation error of the variable exponent Besov space and the approximation and estimation errors of deep learning. We note that the improvement based on adaptivity is remarkable when the region upon which the target function has less smoothness is small and the dimension is large. Moreover, the superiority to linear estimators is shown with respect to the convergence rate of the estimation error.


Introduction
Machine learning has attracted significant attention and has been applied to various fields. In particular, deep learning has been in the spotlight owing to its notable success in different fields, for example, image and speech recognition [21,14]. Although its success has been confirmed experimentally, the reason why deep learning works so well has not been fully understood theoretically. This problem has been studied in a statistical context by many researchers, and one of the approaches to this problem is the nonparametric regression problem with the minimax optimal risk
$$\inf_{\hat f}\sup_{f^\circ\in\mathcal F} E\big[\|\hat f - f^\circ\|_{L^2(P_X)}^2\big],$$
where the infimum is taken over all measurable mappings $\hat f$ that map the $n$ observations to $L^2(P_X)$. In previous studies on learning theory, these settings are considered on various function classes, such as a Hölder space and a Besov space. The worst-case error analyses on these function classes have been studied for deep learning as well as several other classic estimators, and their minimax optimal rates have been extensively considered. For example, the rate for a Hölder space ($C^\beta(\Omega)$, $\Omega\subset\mathbb R^d$: bounded open domain) is $n^{-\frac{2\beta}{2\beta+d}}$ [27,20,15], whereas that for a Besov space ($B^s_{p,q}([0,1]^d)$) is $n^{-\frac{2s}{2s+d}}$ [19,6,5,10]. The definitions of a Hölder space and a Besov space are given in the following section. It was shown that deep learning can achieve the near minimax optimal rate. Table 1 summarizes the results of previous studies.
In recent studies, the approximation and generalization capabilities of deep ReLU networks have been studied. In [35], the approximation error of deep ReLU networks on a Hölder space is derived. In addition, [25] analyzed the estimation error of deep ReLU networks on a Hölder space and derived a worst-case estimation error that achieves a near optimal rate. [28] studied the approximation and estimation errors on a Besov space, which is a broader function class than a Hölder class. [28] noted that deep learning is "adaptive," that is, deep learning can estimate each function effectively by capturing its local smoothness. Indeed, we can see in the analysis by [28] that adaptivity is important for achieving the near minimax optimal rate. In addition, although deep learning can achieve a near optimal rate even if the spatial homogeneity of the target function smoothness is low, no linear estimator can achieve the optimal rate in such a case. In [16], this superiority was demonstrated when the target functions are piecewise smooth.

Table 1: Approximation and estimation errors of deep learning on function spaces on $[0,1]^d$. The approximation and estimation errors are measured using the $L^2$-norm and its square, respectively. Here, $N$ is the number of units in each layer of the deep neural network and $n$ is the sample size. $s$ and $\beta$ are the smoothness parameters, and $d$ is the dimension of the domain on which the function spaces are defined.
In this study, we analyze the generalization capability of ReLU neural networks on a variable exponent Besov space, whose smoothness condition varies with the coordinate $x$. Because the smoothness of the target function depends on the input location $x$ (we denote it by $s(x)$, i.e., the smoothness of the target function at location $x$), the difficulty of estimating the function is not spatially uniform. This problem setting highlights the necessity of the adaptivity of the estimators in contrast to previous studies. In previous studies, the estimation problem on this type of function class with non-uniform properties over the input location $x$ has not been analyzed. In addition, the approximation theory of the variable exponent Besov space has not been studied, although the wavelet decomposition in the variable exponent Besov space has been analyzed, e.g., in [17]. Therefore, we first develop an approximation theory on the variable exponent Besov space using the B-spline basis expansion. In Section 3, we derive the lower bound for a general $s(x)$ and analyze the upper bound of the approximation error in the case of $s(x) = s + \beta\|x-c\|_2^\alpha$. In Section 4, based on the results in Section 3, we derive the upper bounds of the approximation and estimation errors of deep learning. As shown in Table 1, the upper bound of the approximation error of deep learning is $\tilde O\big(N^{-\frac sd}(\log N)^{-\frac{s-\delta}{\alpha}}\big)$, and that of the estimation error is $\tilde O\big(n^{-\frac{2s}{2s+d}}\big)$ multiplied by a poly-log improvement factor whose exponent involves $\nu$ and $\alpha$, where $N$ is the number of units in each layer of the neural network, $n$ is the sample size, and $\nu$ is a constant that depends only on $p$ and $d$. The polynomial order of the approximation error depends on the minimum value of $s(x)$, and the poly-log order becomes significant when $\alpha$ is small, that is, when the region around the minimum of $s(x)$ is small. Moreover, the influence of the poly-log order on the estimation error increases when the dimension $d$ is large and the region around the minimum of $s(x)$ is small.
In Section 5, we show the superiority of deep learning over linear estimators for $0<p<2$, or $p=2$ and $0<\alpha<\frac d3$: the worst-case estimation error of deep learning converges strictly faster than the minimax optimal estimation error over all linear estimators. The contributions of this paper are summarized as follows:
• For the analysis of the generalization ability of deep learning, we first derive an approximation theory of the variable exponent Besov space using B-spline bases. We derive the lower bound of the approximation error on the variable exponent Besov space with any continuous smoothness function $s(x)$. In addition, we derive the upper bound of the approximation error on the variable exponent Besov space with the specific smoothness function $s(x) = s + \beta\|x-c\|_2^\alpha$.
• To clarify the adaptivity of deep learning, we derive upper bounds of the approximation and estimation errors of deep learning. Subsequently, we show that, as the region where the target function has less smoothness becomes smaller, the approximation and estimation errors improve further. This result supports the adaptivity of deep learning.
• For a relative evaluation, we compare deep learning with popular linear estimators, such as the least squares estimator, the Nadaraya-Watson estimator, and kernel ridge regression. We show the superiority of deep learning over linear estimators with respect to the convergence rate of the estimation error.

Notation
In this section, we introduce the notation used throughout. In this paper, $\Omega$ denotes $[0,1]^d$. For a subset $A\subset\mathbb R^d$, we define $\|f\|_{L^r(A)}$ as the $L^r$-norm of a measurable function $f$ on $A$:
$$\|f\|_{L^r(A)} := \Big(\int_A |f(x)|^r\,dx\Big)^{1/r},$$
with the usual essential supremum for $r=\infty$. In particular, if $A=\Omega$, we abbreviate $\|\cdot\|_{L^r(A)}$ as $\|\cdot\|_r$. Let $(S,\Sigma,P)$ be a probability space consisting of the sample space $S$, a $\sigma$-algebra $\Sigma$ on $S$, and a probability measure $P$. Furthermore, let $X(s)$ be a random variable on $\mathbb R^d$ with probability distribution $P_X$. For a measurable function $f$, we define
$$\|f\|_{L^r(P_X)} := \Big(\int |f(x)|^r\,dP_X(x)\Big)^{1/r}.$$
Let $X$ be a quasi-normed space. We denote the unit ball of $X$ by $U_X$, that is, $U_X := \{x\in X : \|x\|_X\le 1\}$. We write the support of a function $f$ as $\mathrm{supp}\,f := \overline{\{x : f(x)\ne 0\}}$. In addition, for $a>0$ and a mapping $T:\mathbb R^d\to\mathbb R^d$, we use the corresponding dilations and compositions of functions. Finally, we introduce the ceiling and floor notations:
$$\lceil a\rceil := \min\{n\in\mathbb Z \mid a\le n\},\qquad \lfloor a\rfloor := \max\{n\in\mathbb Z \mid n\le a\}.$$

Nonparametric regression and minimax optimal rate
In this paper, we consider the following nonparametric regression model:
$$Y_i = f^\circ(X_i) + \epsilon_i \quad (i=1,\dots,n), \qquad (1)$$
where $X_i\sim P_X$ i.i.d. and $P_X$ is a probability distribution on $\Omega$. Moreover, the noise $\epsilon_i$ is i.i.d. centered Gaussian noise, that is, $\epsilon_i\sim N(0,\sigma^2)$ $(\sigma>0)$. We assume that $f^\circ$ is contained in some function class $\mathcal F$, that is, $f^\circ\in\mathcal F$, and we want to estimate $f^\circ$ from the observations $\{(X_i,Y_i)\}_{i=1}^n$. For the regression model (1), we use the following quantity to evaluate an estimator $\hat f$ for each $f^\circ\in\mathcal F$:
$$R(\hat f, f^\circ) := E\big[\|\hat f - f^\circ\|_{L^2(P_X)}^2\big],$$
where the expectation is taken over the sample observations $\{(X_i,Y_i)\}_{i=1}^n$. To evaluate the quality of the estimator $\hat f$, we define the following worst-case estimation error:
$$R(\hat f,\mathcal F) := \sup_{f^\circ\in\mathcal F} R(\hat f, f^\circ).$$
Hereafter, we call $R(\hat f,\mathcal F)$ an estimation error for simplicity. We can see from the definition that $R(\hat f,\mathcal F)$ is the worst-case estimation error over $f^\circ\in\mathcal F$. We can evaluate estimators based on the convergence rate of $R(\hat f,\mathcal F)$ as the sample size $n$ increases. We use the following lemma ([25], [13]) to derive the estimation error of deep learning. For a normed space $(V,\|\cdot\|)$, $\mathcal F_1\subset V$, and $\delta>0$, we denote the $\delta$-covering number [32] by $\mathcal N(\delta,\mathcal F_1,\|\cdot\|_\infty)$, which represents the minimum number of balls of radius $\delta$ needed to cover $\mathcal F_1$.

Lemma 2.1 ([25], [13]). Let $\mathcal F_1$ be a function set and let $\hat f$ be the least squares estimator in $\mathcal F_1$, that is,
$$\hat f := \mathop{\mathrm{argmin}}_{f\in\mathcal F_1}\ \sum_{i=1}^n (Y_i - f(X_i))^2.$$
Then, for any $\delta>0$,
$$R(\hat f, f^\circ) \le C\Big(\inf_{f\in\mathcal F_1}\|f - f^\circ\|_{L^2(P_X)}^2 + \frac{\log\mathcal N(\delta,\mathcal F_1,\|\cdot\|_\infty)}{n} + \delta\Big), \qquad (3)$$
where $C>0$ depends only on $\sigma$ and the sup-norm bound of $\mathcal F_1$.

In Lemma 2.1, the first term represents the approximation error, and the second term represents the complexity of the model. To reduce the estimation error, we need to set the complexity of the model such that the first and second terms are balanced. It can be seen from this lemma that we need to derive the approximation error in order to derive the estimation error. Therefore, we first discuss the approximation error and then derive the estimation error using Lemma 2.1.
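The balance in Lemma 2.1 can be illustrated numerically. The following minimal sketch (not from the paper; the target function, noise level, and the nested polynomial model classes are illustrative assumptions) fits least squares estimators over classes of increasing dimension: the risk first falls as the approximation term shrinks and eventually stalls or worsens as the complexity term grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Regression model Y_i = f(X_i) + eps_i on [0, 1] (d = 1 for illustration).
def f_true(x):
    return np.sin(2 * np.pi * x)

n = 200
X = rng.uniform(0.0, 1.0, n)
Y = f_true(X) + rng.normal(0.0, 0.3, n)

# Least squares over nested polynomial classes F_1 of increasing dimension.
# Larger classes shrink the approximation term but inflate the complexity term.
Xtest = np.linspace(0.0, 1.0, 2000)
risks = {}
for deg in [0, 1, 3, 5, 9, 15]:
    coef = np.polynomial.polynomial.polyfit(X, Y, deg)   # least squares fit
    fhat = np.polynomial.polynomial.polyval(Xtest, coef)
    risks[deg] = np.mean((fhat - f_true(Xtest)) ** 2)    # approx. L2(P_X) risk

best = min(risks, key=risks.get)
```

A moderate degree attains a far smaller risk than the degree-0 class, which mirrors the tradeoff between the two terms of the bound.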

Adaptive approximation
There are two types of approximation methods: non-adaptive and adaptive. For a set of target functions to be approximated, non-adaptive methods fix the basis functions and change only the coefficients of the linear combination. By contrast, adaptive methods change both the basis functions and the coefficients for each target function. Deep learning is a type of adaptive method because it constructs, for each target function, an appropriate feature extractor that operates as a tailored basis. Indeed, deep learning can achieve an (almost) optimal approximation error rate that no non-adaptive method can achieve. Here, we introduce the quantities used to evaluate these methods and some facts from previous studies.
We define quantities for evaluating both non-adaptive and adaptive methods, following the definitions presented in [8]. First, we introduce the quantity used to evaluate a non-adaptive method. Let $X$ be a quasi-normed space defined on a domain $D\subset\mathbb R^d$ equipped with the norm $\|\cdot\|_X$. Non-adaptive methods are evaluated by the following $N$-term best approximation error (the Kolmogorov $N$-width) with respect to the norm of $X$:
$$d_N(W,X) := \inf_{S_N}\,\sup_{f\in W}\,\inf_{g\in S_N}\|f-g\|_X,$$
where the infimum is taken over all $N$-dimensional subspaces $S_N$ of $X$. In the definition of $d_N(W,X)$, each function in $W$ is approximated by the fixed $N$-dimensional subspace $S_N$. Thus, the $N$ basis functions are fixed regardless of the choice of $f\in W$, and only the coefficients of the linear combination are changed. Next, we introduce two quantities ($\sigma_N$, $\rho_N$) that evaluate adaptive methods. Let $W$ and $B$ be subsets of $X$. The approximation of $W$ by $B$ with respect to the norm of $X$ is evaluated by the following quantity:
$$E(W,B)_X := \sup_{f\in W}\,\inf_{g\in B}\|f-g\|_X.$$
Let $\Phi=\{\phi_k\}_{k\in K}$ be a subset of $X$ indexed by a set $K$ ($\phi_k\in X$). We define $\Sigma_N(\Phi)$ such that it consists of all $N$-term linear combinations of the elements of $\Phi$:
$$\Sigma_N(\Phi) := \Big\{\sum_{i=1}^N c_i\phi_{k_i} \;\Big|\; c_i\in\mathbb R,\ k_i\in K\Big\}.$$
Here, we define the quantity that evaluates the approximation by $N$-term linear combinations of functions in $\Phi$ as
$$\sigma_N(W,\Phi,X) := E(W,\Sigma_N(\Phi))_X.$$
In contrast to $d_N(W,X)$, for each function in $W$, we can choose the $N$ basis functions from $\Phi$ and the coefficients of the linear combination adaptively. Let $\mathcal B$ be a family of subsets of $X$. An approximation by $\mathcal B$ is evaluated by
$$d(W,\mathcal B,X) := \inf_{B\in\mathcal B} E(W,B)_X.$$
If $\mathcal B$ is the family of all subsets $B$ whose pseudo-dimension is at most $N$, we denote $d(W,\mathcal B,X)$ by $\rho_N(W,X)$, which is called a non-linear $N$-width. Here, the pseudo-dimension of $B$ is defined as the largest integer $N$ such that there exist points $a_1,\dots,a_N\in D$ and $b_1,\dots,b_N\in\mathbb R$ that satisfy
$$\big|\{(\mathrm{sgn}(g(a_1)+b_1),\dots,\mathrm{sgn}(g(a_N)+b_N)) : g\in B\}\big| = 2^N,$$
where $\mathrm{sgn}(x) = 1_{\{x>0\}} - 1_{\{x\le 0\}}$. Because the pseudo-dimension of any $N$-dimensional vector space of functions from a set in $D$ to $\mathbb R$ is $N$ (see [12]), $\rho_N(W,X)$ can evaluate both non-adaptive and adaptive methods.
Note that if $X=L^r(\Omega)$, we denote these quantities by $d_N(W)_r$, $\sigma_N(W,\Phi)_r$, and $\rho_N(W)_r$, respectively. It has been shown that the approximation error on a Besov space, and on other function spaces related to a Besov space, can be improved by an adaptive method. We take the example of a Besov space on $\Omega$ and observe the improvement obtained by adaptive methods (see Definition 2.1 for the definition of a Besov space). First, we consider the approximation error of non-adaptive methods. The lower bound for non-adaptive methods is
$$d_N(U_{B^s_{p,q}})_r \gtrsim N^{-\frac sd + (\frac 1p-\frac 1r)_+}$$
(see [24,22,33]). For functions in $B^s_{p,q}$, $s$ controls the smoothness and $p$ controls the spatial homogeneity of the smoothness. In addition, $q$ controls the degree of emphasis on the local smoothness, although it does not directly influence the convergence rate. However, it was shown in [8] that an adaptive method can achieve the optimal rate of the approximation error. Specifically, it was proved in Theorems 5.2 and 5.4 of [8] that, for $0<p,q,r\le\infty$ and $d(\frac 1p-\frac 1r)_+ < s$, the optimal rate over approximation methods containing adaptive methods is $N^{-\frac sd}$, and an adaptive method using the dictionary $\mathcal M$ achieves this rate, $\sigma_N(U_{B^s_{p,q}},\mathcal M)_r \lesssim N^{-\frac sd}$ [8], where $\mathcal M$ is the set of all $M^d_{k,j}$ (see subsection 2.5) whose degree $m$ satisfies $s<\min\{m, m-1+\frac 1p\}$ and which do not vanish identically on $\Omega$. If the parameter $p$ that controls the spatial homogeneity is small, some functions in $B^s_{p,q}$ have smooth and rough parts depending on the input location $x$. The usefulness of an adaptive method for $p<2$ can be interpreted as follows: adaptive methods can increase the resolution on the rough parts, which contributes to an effective approximation.
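The gain from choosing the $N$ basis functions adaptively can be seen in a small experiment. A minimal sketch (the Haar basis and the target $f(x)=\sqrt x$ are illustrative assumptions, standing in for a function with low spatial homogeneity): in an orthonormal basis, the best $N$-term approximation keeps the $N$ largest coefficients, whereas a non-adaptive method keeps a fixed set of $N$ coefficients.

```python
import numpy as np

# Compare non-adaptive (first N basis functions) with adaptive N-term
# approximation (N largest coefficients) in the orthonormal Haar basis.
# Target: f(x) = sqrt(x), smooth away from 0 but rough near 0.
J = 12                      # finest dyadic level, 2^J samples
x = (np.arange(2 ** J) + 0.5) / 2 ** J
f = np.sqrt(x)

def haar(v):
    # Orthonormal Haar transform via repeated averaging/differencing.
    v = v.copy()
    out = []
    while len(v) > 1:
        a = (v[0::2] + v[1::2]) / np.sqrt(2)
        d = (v[0::2] - v[1::2]) / np.sqrt(2)
        out.append(d)
        v = a
    out.append(v)
    return np.concatenate(out[::-1])  # coarse coefficients first

def ihaar(c):
    # Inverse transform: rebuild from coarse to fine.
    v = c[:1]
    pos = 1
    while pos < len(c):
        d = c[pos:pos + len(v)]
        new = np.empty(2 * len(v))
        new[0::2] = (v + d) / np.sqrt(2)
        new[1::2] = (v - d) / np.sqrt(2)
        v, pos = new, pos + len(d)
    return v

c = haar(f)
N = 64
nonadaptive = c.copy(); nonadaptive[N:] = 0.0      # fixed first-N coefficients
idx = np.argsort(np.abs(c))[::-1][:N]              # N largest (adaptive choice)
adaptive = np.zeros_like(c); adaptive[idx] = c[idx]

err_fixed = np.linalg.norm(f - ihaar(nonadaptive)) / np.sqrt(len(f))
err_adapt = np.linalg.norm(f - ihaar(adaptive)) / np.sqrt(len(f))
```

The adaptive selection spends some of its budget on fine-scale coefficients near the rough point $x=0$, so `err_adapt` comes out strictly smaller than `err_fixed`.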
In addition, the improvement of the estimation error using an adaptive method was shown in [28] for a Besov space and a mixed-Besov space, and in [29] for an anisotropic Besov space. By applying adaptive methods to the analysis of the estimation error of deep learning, these works showed that the estimation error can be improved; in [28] and [29], it was proven that the estimation error of deep learning achieves the minimax rate up to a poly-log order.

Besov space
In this section, we introduce basic properties of the function spaces, in particular, the Besov space and the variable exponent Besov space. First, we define a Besov space as follows.

Definition 2.1 (Besov space). Let $0<p,q\le\infty$, $s>0$, $r\in\mathbb N$, and $r>s$. We define the $r$-times difference as
$$\Delta_h^r f(x) := \sum_{j=0}^r \binom{r}{j}(-1)^{r-j} f(x+jh)$$
for $x$ with $x, x+rh\in\Omega$, and $\Delta_h^r f(x) := 0$ otherwise. The $r$-th modulus of smoothness is defined as
$$\omega_{r,p}(f,t) := \sup_{\|h\|_2\le t}\|\Delta_h^r f\|_{L^p(\Omega)}.$$
We define the following quantity using $\omega_{r,p}(f,t)$:
$$|f|_{B^s_{p,q}(\Omega)} := \begin{cases}\Big(\displaystyle\int_0^\infty \big(t^{-s}\omega_{r,p}(f,t)\big)^q\,\frac{dt}{t}\Big)^{1/q} & (q<\infty),\\[2mm] \sup_{t>0}\, t^{-s}\omega_{r,p}(f,t) & (q=\infty).\end{cases}$$
By using $|f|_{B^s_{p,q}(\Omega)}$, we define the norm as
$$\|f\|_{B^s_{p,q}(\Omega)} := \|f\|_{L^p(\Omega)} + |f|_{B^s_{p,q}(\Omega)}.$$
We note some comments regarding the quantities in Definition 2.1.
• The operator $\Delta_h^r$ is analogous to differentiation: if $f$ is an $r$-times continuously differentiable function on $\mathbb R$, it holds that $\lim_{h\to 0}\Delta_h^r f(x)/h^r = f^{(r)}(x)$.
• By applying Hölder's inequality, it can easily be confirmed that $\omega_{r,p}(f,t)$ becomes larger as $p$ increases.
• As $s$ increases, $t^{-s}$ increases for $t$ close to $0$. Therefore, when $s$ is larger, $\omega_{r,p}(f,t)$ needs to be smaller for $t$ close to $0$. Thus, we can interpret $s$ as controlling the local smoothness.
• If $r\in\mathbb N$ satisfies $r>s$, the definition of $B^s_{p,q}(\Omega)$ does not depend on the choice of $r$.
It is known that some other function spaces can be recovered from a Besov space by taking the parameters to satisfy certain conditions. First, we define a Hölder space and a Sobolev space.
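The quantities in Definition 2.1 can be computed numerically for simple functions. A small sketch (the grid sizes and the test function are illustrative assumptions) evaluates $\Delta_h^r f$ and $\omega_{r,p}(f,t)$ on $[0,1]$; for $f(x)=x^2$ and $r=2$, the second difference is exactly $2h^2$, so the modulus equals $2t^2$ for any $p$.

```python
import numpy as np
from math import comb

def r_th_difference(f, x, h, r):
    # Delta_h^r f(x) = sum_j C(r, j) (-1)^(r-j) f(x + j h), with x + r h in [0, 1].
    return sum(comb(r, j) * (-1) ** (r - j) * f(x + j * h) for j in range(r + 1))

def modulus(f, t, r, p, grid=2000):
    # omega_{r,p}(f, t) = sup_{|h| <= t} || Delta_h^r f ||_p, approximated on a grid.
    best = 0.0
    for h in np.linspace(1e-6, t, 50):
        x = np.linspace(0.0, 1.0 - r * h, grid)       # keep x + r h inside [0, 1]
        d = r_th_difference(f, x, h, r)
        best = max(best, (np.mean(np.abs(d) ** p)) ** (1.0 / p))
    return best

# For f(x) = x^2 and r = 2, Delta_h^2 f(x) = 2 h^2 exactly, so omega_{2,p}(f, t) = 2 t^2.
w = modulus(lambda x: x ** 2, 0.1, r=2, p=2)
```

The computed value matches $2t^2 = 0.02$ up to floating-point error, confirming the closed form.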
For $\beta>0$, we define $m := \lceil\beta\rceil - 1$. For an $m$-times continuously differentiable function $f:\mathbb R^d\to\mathbb R$, we define the norm of the Hölder space $C^\beta(\Omega)$ as
$$\|f\|_{C^\beta(\Omega)} := \max_{|a|\le m}\|\partial^a f\|_\infty + \max_{|a|=m}\,\sup_{x\ne y\in\Omega}\frac{|\partial^a f(x)-\partial^a f(y)|}{\|x-y\|^{\beta-m}},$$
and the Hölder space $C^\beta(\Omega)$ is the set of such $f$ with finite norm. For $k\in\mathbb N$ and $1\le p\le\infty$, we define the norm of the Sobolev space $W^k_p(\Omega)$ as
$$\|f\|_{W^k_p(\Omega)} := \Big(\sum_{|a|\le k}\|\partial^a f\|_{L^p(\Omega)}^p\Big)^{1/p}.$$
In [30], the relationships between a Besov space and other function spaces are provided, together with the relationships between Besov spaces with different parameters. We introduce some of them here. For quasi-normed vector spaces $(V_1,\|\cdot\|_{V_1})$ and $(V_2,\|\cdot\|_{V_2})$, we write $V_1\hookrightarrow V_2$ if $V_1\subset V_2$ and $\|\cdot\|_{V_2}\lesssim\|\cdot\|_{V_1}$ on $V_1$. For example, $B^k_{p,1}(\Omega)\hookrightarrow W^k_p(\Omega)\hookrightarrow B^k_{p,\infty}(\Omega)$ for $k\in\mathbb N$ (with $W^k_2(\Omega)=B^k_{2,2}(\Omega)$), and $C^\beta(\Omega)=B^\beta_{\infty,\infty}(\Omega)$ for non-integer $\beta>0$. Thus, a Besov space has close relationships with a Hölder space and a Sobolev space, and we can obtain the properties of these other function spaces by analyzing the Besov space.

Decomposition of functions in Besov space by cardinal B-spline
For the analysis provided after this section, we introduce some facts regarding the approximation on the Besov space and the decomposition using a cardinal B-spline basis studied in [3] and [8]. In this study, we mainly use the approximation theory of a Besov space with a cardinal B-spline basis, because the authors of [35,25] showed efficient approximation of polynomials by deep ReLU networks; thus, a cardinal B-spline basis is convenient for the approximation theory when applying a deep neural network.
First, we introduce facts regarding $B^s_{p,q}(\Omega)$ studied in [3]. Let $m\in\mathbb N$ and let $\mathcal N(x) := 1_{[0,1]}(x)$ be the univariate B-spline of degree $0$; the cardinal B-spline of degree $m$ is defined recursively by $\mathcal N_0 := \mathcal N$ and $\mathcal N_m := \mathcal N_{m-1}*\mathcal N$, so that $\mathcal N_m$ is a piecewise polynomial of degree $m$ supported on $[0,m+1]$. In addition, we define the tensor product of the B-splines as $\mathcal N^d_m(x) := \prod_{i=1}^d\mathcal N_m(x_i)$, and for $k\in\mathbb Z_+$ and $j\in\mathbb Z^d$, we define the cardinal B-spline basis as
$$M^d_{k,j}(x) := \mathcal N^d_m(2^k x - j).$$
$\Lambda(k)$ denotes the set of $j$ for which $M^d_{k,j}$ does not vanish identically on $\Omega$. As can be shown in the same manner as Corollary 2-2 in [8], it satisfies $|\Lambda(k)|\lesssim 2^{kd}$, and the degree $m$ of the cardinal B-spline basis is taken to satisfy $s<\min\{m, m-1+\frac 1p\}$. Let $Q_k(f)$ denote the quasi-interpolant of $f$ at resolution $k$ determined by this basis. Additionally, we let $Q_{-1} := 0$ and $q_k := Q_k - Q_{k-1}$, and it is known that $q_k$ can be represented as
$$q_k(f) = \sum_{j\in\Lambda(k)}\alpha_{k,j}(f)\,M^d_{k,j}$$
with suitable coefficients $\alpha_{k,j}(f)$. Note that, throughout this paper, $Q_k(f)$ and $q_k(f)$ indicate the decomposition of $f$ above; if the decomposition target $f$ is apparent, we denote these by $Q_k$ and $q_k$. For $s<\min\{m, m-1+\frac 1p\}$, by Theorem 5.1 and Corollary 5.3 in [3], $f\in B^s_{p,q}(\Omega)$ can be decomposed as
$$f = \sum_{k=0}^\infty q_k(f),$$
where the convergence is with respect to the $B^s_{p,q}(\Omega)$ norm. Moreover, the following norm equivalence holds:
$$\|f\|_{B^s_{p,q}(\Omega)} \asymp \Big(\sum_{k=0}^\infty\big(2^{k(s-\frac dp)}\,\|(\alpha_{k,j}(f))_{j\in\Lambda(k)}\|_{\ell^p}\big)^q\Big)^{1/q}.$$
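The cardinal B-spline basis can be evaluated directly. A minimal sketch (using the standard closed form for the degree-$m$ cardinal B-spline; the particular $m$ and $k$ are illustrative assumptions) builds $M_{k,j}(x)=\mathcal N_m(2^k x - j)$ in $d=1$ and checks the partition-of-unity property $\sum_{j}\mathcal N_m(2^k x - j)=1$ on $[0,1]$.

```python
import numpy as np
from math import comb, factorial

# Cardinal B-spline of degree m (support [0, m+1]) via the closed form
#   N_m(x) = (1/m!) * sum_{i=0}^{m+1} (-1)^i * C(m+1, i) * (x - i)_+^m .
def bspline(x, m):
    x = np.asarray(x, dtype=float)
    s = np.zeros_like(x)
    for i in range(m + 2):
        s += (-1) ** i * comb(m + 1, i) * np.clip(x - i, 0.0, None) ** m
    return s / factorial(m)

def M(x, k, j, m):
    # Dilated and shifted basis function M_{k,j}(x) = N_m(2^k x - j)  (d = 1 here).
    return bspline(2.0 ** k * x - j, m)

m, k = 3, 3
x = np.linspace(0.0, 1.0, 501)
# Shifts whose support meets [0, 1]; their count grows like 2^k (2^{kd} in general).
shifts = range(-m, 2 ** k)
total = sum(M(x, k, j, m) for j in shifts)   # partition of unity: equals 1 on [0, 1]
```

For the cubic case the center value is $\mathcal N_3(2)=2/3$, and the shifted basis sums to $1$ everywhere on $[0,1]$, which is what makes it convenient for local approximation.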

Variable exponent Besov space
Next, we define a variable exponent Besov space. Here, we consider the case in which only the parameter $s$ is variable and the parameters $p$ and $q$ are fixed. Before defining a variable exponent Besov space, we define the log-Hölder continuity, which we assume for the smoothness function.

Definition 2.4 (log-Hölder continuity). A function $g:\Omega\to\mathbb R$ is said to be log-Hölder continuous if there exists a constant $C>0$ such that
$$|g(x)-g(y)| \le \frac{C}{\log(e + 1/\|x-y\|_2)}\qquad\text{for all } x,y\in\Omega,\ x\ne y.$$

We can see that log-Hölder continuity is stronger than continuity and weaker than Lipschitz continuity.
There are several ways to define a variable exponent Besov space. In this study, we define it as follows. We assume $0<\inf_{x\in\Omega}s(x)$ and let $r := \lfloor\sup_{x\in\Omega}s(x)\rfloor + 1$. We define $\omega^*_{r,p}(f,t)$ in a similar manner to the case of $B^s_{p,q}(\Omega)$:
$$\omega^*_{r,p}(f,t) := \sup_{\|h\|_2\le t}\big\|t^{-s(\cdot)}\,\Delta_h^r f(\cdot)\big\|_{L^p(\Omega)}.$$
Next, we define $|f|_{B^{s(x)}_{p,q}(\Omega)}$ as
$$|f|_{B^{s(x)}_{p,q}(\Omega)} := \begin{cases}\Big(\displaystyle\int_0^\infty \omega^*_{r,p}(f,t)^q\,\frac{dt}{t}\Big)^{1/q} & (q<\infty),\\[2mm] \sup_{t>0}\,\omega^*_{r,p}(f,t) & (q=\infty).\end{cases}$$
In the same manner as for $B^s_{p,q}(\Omega)$, the norm is defined as the sum of the $L^p$ norm and $|f|_{B^{s(x)}_{p,q}(\Omega)}$, that is,
$$\|f\|_{B^{s(x)}_{p,q}(\Omega)} := \|f\|_{L^p(\Omega)} + |f|_{B^{s(x)}_{p,q}(\Omega)}.$$
Note that for $B^s_{p,q}(\Omega)$, the factor $t^{-s}$ is contained in the definition of $|f|_{B^s_{p,q}(\Omega)}$, whereas in the variable exponent case, $t^{-s(x)}$ is contained in the definition of $\omega^*_{r,p}(f,t)$. By definition, we can see that the permissible smoothness changes depending on $x$.
We can consider the definition above as the Besov space with the smoothness parameter $s$ replaced by the variable $s(x)$. From the definition, it holds that $B^{s(x)}_{p,q}(\Omega)\subset B^{s_{\min}}_{p,q}(\Omega)$, where $s_{\min} := \min_{x\in\Omega}s(x)$. Besov spaces with variable exponents have been studied in [18,2]. Next, we introduce some properties of $B^{s(x)}_{p,q}(\mathbb R^d)$ that are studied in [1].
• We assume $s_0,s_1\in L^\infty(\mathbb R^d)$, and for all $x\in\mathbb R^d$, it holds that $s_0(x)\ge s_1(x)$. Then, $B^{s_0(x)}_{p,q}(\mathbb R^d)\hookrightarrow B^{s_1(x)}_{p,q}(\mathbb R^d)$.
• For $\delta>0$, we assume that $s(x)\ge\delta$ holds for all $x\in\mathbb R^d$. Letting $C_u(\mathbb R^d)$ be the set of bounded and uniformly continuous functions, it holds that $B^{s(x)}_{\infty,\infty}(\mathbb R^d)\subset C_u(\mathbb R^d)$.
• We assume that $s(x)<1$ holds for all $x\in\mathbb R^d$. We define the Zygmund space $C^{s(x)}(\mathbb R^d)$ as the set of $f\in C_u(\mathbb R^d)$ for which there exists $C>0$ such that $|f(x+h)-f(x)|\le C\|h\|_2^{s(x)}$ for all $x,h$. Then, it holds that $B^{s(x)}_{\infty,\infty}(\mathbb R^d)=C^{s(x)}(\mathbb R^d)$.
Thus, it is known that a variable exponent Besov space also reproduces other function spaces by taking certain parameters to satisfy some conditions.

Lower bound of approximation error
In this section, we evaluate the lower bound of the approximation error on the unit ball of the variable exponent Besov space $U_{B^{s(x)}_{p,q}(\Omega)}$. In the main result (Theorem 3.1), we show that the polynomial order of the approximation error on $U_{B^{s(x)}_{p,q}(\Omega)}$ cannot be improved beyond $N^{-\frac{s_{\min}}{d}}$ for any $s(x)$. To prove Theorem 3.1, we first show the following lemma.
Moreover, by applying the triangle inequality and using $t^{-s(x)}<t^{-s_{\max}}$ for $t\in(0,1]$, it follows from (3.1) and (3.2) that there exists a constant $C_0>0$ for which the claimed bound holds. Note that $C_0$ does not depend on $f$, but does depend on $\xi$ and $r$. We want to extend the local argument around the minimizer of $s(x)$ to an argument on $\Omega$ by using an extension operator. To introduce the extension operator in Lemma 3.3, we define the minimally smooth domain.

Definition 3.1 (minimally smooth domain [26]). An open set $S\subset\mathbb R^d$ is called a minimally smooth domain if there exist $\varepsilon>0$, $L\in\mathbb N$, $M>0$, and a sequence of open sets $U_1, U_2, \dots$ such that the following conditions hold: (i) for each $x\in\partial S$, the ball $B(x,\varepsilon)$ is contained in some $U_i$; (ii) no point of $\mathbb R^d$ is contained in more than $L$ of the $U_i$; and (iii) for each $i$, after a suitable rotation, $U_i\cap\partial S$ coincides with $U_i\cap\Lambda_i$, where $\Lambda_i = \{(u, h_i(u)) : u\in\mathbb R^{d-1}\}$ is the graph of a Lipschitz continuous function $h_i$ with Lipschitz constant at most $M$. In [26], it is stated that all convex sets in $\mathbb R^d$ are minimally smooth domains. For a better understanding of Definition 3.1, we prove that a cube is a minimally smooth domain.

Lemma 3.2. Let A be the interior of [0, 1] d . Then, A is a minimally smooth domain.
Proof. A point $x\in\partial A$ can be expressed according to the face on which it lies. Let $R$ be the rotation that transfers $(1,1,\dots,1)$ to $(0,0,\dots,\sqrt d)$. Using $R$, we define $h:\mathbb R^{d-1}\to\mathbb R$ so that its graph describes the boundary near the vertex at the origin, where $u^* = \arg\min_{u'}\|u-u'\|_2$. It is clear that $h(u)$ is a Lipschitz continuous function. We also let $S_{(0,\dots,0)}$ be the corresponding open set containing a neighborhood $D$ of the vertex $(0,\dots,0)$; for each point $x\in D$, the conditions of Definition 3.1 can be verified from the definition of $D$ and the chosen radius. We define the vertex set of $[0,1]^d$ as $\{z_1,z_2,\dots,z_{2^d}\}\subset\mathbb R^d$. For each vertex $z_j$, through the translation that transforms $z_j$ into $(0,\dots,0)$, we can apply the same argument as for $(0,\dots,0)$ and obtain $S_{z_j}$; it thus holds that the boundary is covered near every $z_j$, where $x$ is in the neighborhood of $z_j$ corresponding to $D$. Finally, we choose $L\in\mathbb N$ so that the overlap condition of Definition 3.1 is satisfied.

Lemma 3.3. Let $S\subset\mathbb R^d$ be a closed subset whose interior is a minimally smooth domain. Then, for $0<p,q\le\infty$ and $0<s$, there exists an extension operator $E$ such that (i) $Ef|_S = f$ for every $f\in B^s_{p,q}(S)$, and (ii) $\|Ef\|_{B^s_{p,q}(\mathbb R^d)}\le C\|f\|_{B^s_{p,q}(S)}$ for a constant $C$ independent of $f$.
Proof. Although [4] proved this only for $0<p\le 1$, it can also be proved for $1\le p<\infty$ using the technique of Theorem 6.6 in [4]. Furthermore, for $p=\infty$, it can be proven in the same way as the case $0<p\le 1$.

By using Lemma 3.1, Lemma 3.2, and Lemma 3.3, Theorem 3.1 can be proven. We redefine $\mathcal M$ as the set of all $M^d_{k,j}$ whose degree $m$ satisfies $s_{\max}<\min\{m, m-1+\frac 1p\}$.

Proof (Theorem 3.1). Take the extension operator $E$ of Lemma 3.3 satisfying the two properties above; note that Lemma 3.3 can also be used for extending functions to $\Omega$ because, for a function $g:\mathbb R^d\to\mathbb R$, it holds that $\|g\|_{B^s_{p,q}(\Omega)}\le\|g\|_{B^s_{p,q}(\mathbb R^d)}$. By using these two properties and Lemma 3.1, it can be proven that the image of the restriction mapping contains a scaled Besov ball. Let $Ef$ be an approximation function of $f$; the corresponding inequality between the approximation errors on the subdomain and on $\Omega$ clearly holds. Note that the set of functions that satisfy the condition with respect to $\rho_N$ on $\Omega$ is mapped surjectively, by restriction, onto the set of functions that satisfy the condition on $Q$. By applying (2.1), under each condition $\sigma_N$ or $\rho_N$, the corresponding lower bound follows, and the proof is completed.

Because $U_{B^{s(x)}_{p,q}(\Omega)}\subset U_{B^{s_{\min}}_{p,q}(\Omega)}$, it is known that the rate $N^{-\frac{s_{\min}}{d}}$ can be achieved [8]. Therefore, our interest is in improving the rate by a factor slower than any polynomial order. In fact, we will show that a poly-log order improvement can be realized.

Upper bound of approximation error
In this section, we analyze the approximation error using the adaptive method. Because it is difficult to deal with a general $s(x)$, we analyze a specific $s(x)$ defined by
$$s(x) := s + \beta\|x-c\|_2^\alpha.$$
Throughout this paper, we fix the parameters $\alpha,\beta,s>0$ and $c\in\Omega$. It is clear that $s(x)$ takes its minimum value at $x=c$, and as $\alpha$ decreases, the gradient of $s(x)$ around $c$ becomes larger. We note that, in our analysis, the gradient of $s(x)$ around the minimum point is what determines the order of the approximation error. Thus, this form of $s(x)$ is convenient and sufficient to characterize the approximation of the variable exponent Besov space. Consider the case in which $s(x)$ takes its minimum value at several points and the gradient of $s(x)$ at each minimum point behaves as in the single-minimum situation; the approximation error in this case is the same as that of an $s(x)$ taking its minimum value at only one point.

Let us check that $s(x)$ satisfies the log-Hölder continuity. If $\alpha>1$, this is clear because $s(x)$ is then a Lipschitz continuous function. Thus, it suffices to consider the case $0<\alpha\le 1$. Because $s(x)$ is Lipschitz continuous on $\Omega$ if we exclude a neighborhood of $c$, it suffices to consider a neighborhood of $c$. By the triangle inequality and $0<\alpha\le 1$, the following inequality holds:
$$|s(x)-s(y)| = \beta\,\big|\|x-c\|_2^\alpha - \|y-c\|_2^\alpha\big| \le \beta\|x-y\|_2^\alpha.$$
Thus,
$$|s(x)-s(y)|\,\log\big(e+1/\|x-y\|_2\big) \le \beta\|x-y\|_2^\alpha\,\log\big(e+1/\|x-y\|_2\big).$$
By $\lim_{t\to\infty}\frac{\log(e+t)}{t^\alpha} = 0$, the right-hand side is bounded, and the log-Hölder continuity is confirmed.
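The bound above can also be checked numerically. A small sketch (the dimension, constants, and sample count are illustrative assumptions) draws random pairs $x,y\in\Omega$ and verifies that $|s(x)-s(y)|\log(e+1/\|x-y\|_2)$ stays bounded, as the limit argument guarantees.

```python
import numpy as np

# Numerical check (illustration only) that s(x) = s0 + beta * ||x - c||_2^alpha
# satisfies the log-Hoelder condition |s(x) - s(y)| <= C / log(e + 1/||x - y||_2)
# for 0 < alpha <= 1: the proof bounds |s(x) - s(y)| <= beta * ||x - y||_2^alpha
# and uses log(e + t) / t^alpha -> 0 as t -> infinity.
rng = np.random.default_rng(1)
d, s0, beta, alpha = 3, 1.0, 2.0, 0.5
c = np.full(d, 0.5)

def s(x):
    return s0 + beta * np.linalg.norm(x - c) ** alpha

ratios = []
for _ in range(20000):
    x, y = rng.uniform(0, 1, d), rng.uniform(0, 1, d)
    u = np.linalg.norm(x - y)
    if u > 0:
        ratios.append(abs(s(x) - s(y)) * np.log(np.e + 1.0 / u))
C_emp = max(ratios)
```

The empirical constant `C_emp` remains bounded (well below $\beta\,\sup_{0<u\le\sqrt d} u^\alpha\log(e+1/u)$), consistent with the analytic argument.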
Here, let $A$ be a subset of $\mathbb R^d$, and denote by $\Lambda_A(k)$ the set of indexes $j$ that satisfy $\mathrm{supp}\,M^d_{k,j}\cap A\ne\emptyset$. Moreover, let $m_{A,k} := |\Lambda_A(k)|$. We reorder the indexes $j\in\Lambda_A(k)$ as $\{v_{A,j}\}_{j=1}^{m_{A,k}}$ such that the corresponding B-spline coefficients are ordered in descending order of absolute value, that is, $|\alpha_{k,v_{A,1}}(f)|\ge|\alpha_{k,v_{A,2}}(f)|\ge\dots\ge|\alpha_{k,v_{A,m_{A,k}}}(f)|$. Subsequently, we let $\delta := d\big(\tfrac 1p - \tfrac 1r\big)_+$. To prove Theorem 3.2, which provides the upper bound of the approximation error in the case of $s(x) = s+\beta\|x-c\|_2^\alpha$, we first show Lemma 3.4.

Lemma 3.4. Let $0<p,q,r\le\infty$, $t>0$, $A=[c-t,c+t]^d$, and $s(x) = s+\beta\|x-c\|_2^\alpha$. Suppose that $s>\delta$ and that the degree $m$ of the cardinal B-spline satisfies the condition of subsection 2.5. For $f\in U_{B^{s(x)}_{p,q}(\Omega)}$, we define $f_N$ by combining the coarse approximation $Q_k(f)$ on $\Omega$ with a higher-resolution B-spline approximation on $A$, where $Q_k(\cdot)$ is the approximation determined through the B-spline basis for functions in $B^s_{p,q}(\Omega)$ defined in subsection 2.5. Under this condition, $f_N$ attains the approximation error bound used in the proof of Theorem 3.2. The proof is given in Appendix A.1.

Theorem 3.2. Let $s(x) = s+\beta\|x-c\|_2^\alpha$ with $s>\delta$. Then, for any $f\in U_{B^{s(x)}_{p,q}(\Omega)}$, there exists $f_N$, represented as an $N$-term linear combination of cardinal B-splines times indicator functions, that satisfies the following inequality:
$$\|f - f_N\|_r \lesssim N^{-\frac sd}\Big(\frac{\log N}{\log(\log N)}\Big)^{-\frac{s-\delta}{\alpha}}.$$
We take the adaptive approximation method around the minimum of $s(x)$. Let $a_k$ be a positive number depending on $N$, which will be fixed below; the decomposition holds for any $a_k>0$. We consider increasing the resolution of the B-splines in $A$ such that the number of B-splines on $A$ remains of order $2^{kd}$. Accordingly, $N_k$ is defined as $N_k = \big(\frac{\log k}{\log a_k}\big)^{\frac 1\alpha}$. We obtain the approximation error by Lemma 3.4, and therefore the stated bound. Here, setting $a_k = \big(\frac{k}{\log k}\big)^{\frac{s-\delta}{\alpha}}$, we obtain the desired result.

Remark 3.2.
Here, $\beta$ does not appear in the convergence rate of Theorem 3.2 but is hidden in the constant factor. In addition, note that the constant factor in Theorem 3.2 can be taken independently of $c$.

Remark 3.3 (approximation theory for general $s(x)$). The method of Theorem 3.2 can be applied to a general $s(x)$. Note that if the measure of the set $\{x : s(x)=s_{\min}\}$ is not $0$, the method does not help; that is, the approximation error is not better than $N^{-\frac{s_{\min}}{d}}$ even including a poly-log order. A summary of the approximation method is as follows:
1. Set a positive number $a_k$, and increase the resolution of the B-splines on the region where $s(x)$ is close to its minimum.
2. Fix $a_k$ so that $2^{ks_{\min}}\,a_k \asymp 2^{(k+N_k)s_{\min}}$.
It can be seen that the measure around the minimum point is important for the approximation error rate of this method. This is controlled by the exponent $\alpha$ in $s(x) = s+\beta\|x-c\|_2^\alpha$. This also indicates that our analysis for the specific choice of $s(x)$ provides essential insight for more general situations.
It can be seen that the poly-log part of the order $N^{-\frac sd}\big(\frac{\log N}{\log(\log N)}\big)^{-\frac{s-\delta}{\alpha}}$ decreases as $\alpha$ decreases. That is, as the gradient of $s(x)$ around the minimum point sharpens, the approximation error improves. Moreover, for $r<p$, the poly-log order does not depend on the dimension $d$. Thus, if $d$ is large, the poly-log factor has a relatively strong effect. The dependence of the poly-log order on $p$ can be interpreted as follows. Because $p$ controls the homogeneity of the smoothness of functions in a variable exponent Besov space, if $p$ is small, the number of B-spline bases used by the adaptive method around the minimum of $s(x)$ must be larger.
The adaptive method in Theorem 3.2 concentrates the resolution around $c$. Note that if $c$ is fixed and $r\le p$, the B-spline bases can be fixed in a non-adaptive manner, that is, we may fix the bases independently of the target function $f$. Therefore, the corollary below follows immediately.
Corollary 3.1. Suppose $r\le p$ and $c$ is fixed. Then, every $f\in U_{B^{s(x)}_{p,q}(\Omega)}$ can be approximated, at the rate of Theorem 3.2, by a fixed $N$-term linear combination of the B-spline bases times indicator functions.

We can also compare the approximation error of $B^{s(x)}_{p,q}(\Omega)$ with that of $B^{s+\epsilon}_{p,q}(\Omega)$. By calculating the upper bound of $\epsilon$ that satisfies $N^{-\frac{s+\epsilon}{d}}\gtrsim N^{-\frac sd}\big(\frac{\log N}{\log(\log N)}\big)^{-\frac{s-\delta}{\alpha}}$, we can see that the approximation error obtained by the adaptive method is equivalent to that of a Besov space whose smoothness is larger by $\epsilon = O\big(\frac{d(s-\delta)}{\alpha}\cdot\frac{\log\log N}{\log N}\big)$. Therefore, it can be seen that for $N$ that is not too large, the improvement obtained by the adaptive method is significant if $\alpha$ is small.

Approximation error of deep learning
In this section, we evaluate the approximation and estimation errors of deep neural networks on $B^{s+\beta\|x-c\|_2^\alpha}_{p,q}(\Omega)$. We denote the ReLU activation function by $\eta(x) = \max\{x,0\}$, where $\eta$ is applied in an element-wise manner to a vector $x$. We define the class of neural networks with the ReLU activation, depth $L$, width $W$, sparsity constraint $S$, and norm constraint $B$ as
$$\Phi(L,W,S,B) := \Big\{(A_L\,\eta(\cdot)+b_L)\circ\cdots\circ(A_1 x+b_1) \;:\; \max_i\{\|A_i\|_\infty\vee\|b_i\|_\infty\}\le B,\ \textstyle\sum_{i=1}^L(\|A_i\|_0+\|b_i\|_0)\le S\Big\},$$
where the sizes of the weight matrices $A_i$ are at most $W$ in each dimension, and, for a matrix $A$, $\|A\|_\infty$ is the maximum absolute value of its entries and $\|A\|_0$ is the number of its non-zero elements.
We evaluate the worst-case approximation error of the deep neural networks $\Phi(L,W,S,B)$ on $U_{B^{s+\beta\|x-c\|_2^\alpha}_{p,q}(\Omega)}$ in the $L^r$ norm. That is, the quantity we want to evaluate is
$$\sup_{f^\circ\in U}\ \inf_{f\in\Phi(L,W,S,B)}\|f - f^\circ\|_r,$$
where $U := U_{B^{s+\beta\|x-c\|_2^\alpha}_{p,q}(\Omega)}$. We assume that $P_X$ has a density function $p(x)$ with respect to the Lebesgue measure, and moreover that there exists $T>0$ such that $p(x)\le T$. First, we use the lemma below, which concerns the approximation of the B-spline function by a neural network. The proof is similar to that of Proposition 1 in [28]; the outline is to approximate $f_N$ in Theorem 3.2 by a neural network. The proof is presented in Appendix A.2.
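The class $\Phi(L,W,S,B)$ can be made concrete with a short sketch (the random sparse weights, the density, and all shapes are illustrative assumptions, not the construction used in the proofs): a depth-$L$ ReLU network together with the accounting of the sparsity $S$ and the parameter bound $B$.

```python
import numpy as np

rng = np.random.default_rng(2)

def relu(x):
    return np.maximum(x, 0.0)

def make_network(L, W, d, B, density=0.2, rng=rng):
    # Depth-L ReLU network of width W with sparse weights bounded by B in sup norm.
    dims = [d] + [W] * (L - 1) + [1]
    params = []
    for din, dout in zip(dims[:-1], dims[1:]):
        A = rng.uniform(-B, B, (dout, din)) * (rng.random((dout, din)) < density)
        b = rng.uniform(-B, B, dout) * (rng.random(dout) < density)
        params.append((A, b))
    return params

def forward(params, x):
    h = x
    for A, b in params[:-1]:
        h = relu(A @ h + b)
    A, b = params[-1]
    return A @ h + b                      # no activation on the output layer

def sparsity(params):                     # S: total number of non-zero parameters
    return sum(int((A != 0).sum() + (b != 0).sum()) for A, b in params)

def sup_norm(params):                     # B: maximum absolute parameter value
    return max(max(np.abs(A).max(), np.abs(b).max()) for A, b in params)

net = make_network(L=4, W=32, d=3, B=1.0)
y = forward(net, np.ones(3))
S, Bmax = sparsity(net), sup_norm(net)
```

The constraints $S$ and $B$ are exactly the quantities counted by `sparsity` and `sup_norm`; in the theory, they control the covering number of the class and hence the complexity term of Lemma 2.1.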

Estimation error of deep learning
First, regarding the estimation error, we confirm that the lower bound of the polynomial factor is $n^{-\frac{2s_{\min}}{2s_{\min}+d}}$ if $X$ takes values around the minimum point of $s(x)$ with a certain probability. Let $a$ satisfy $s(a)=s_{\min}$. We suppose that there exists a constant $t>0$ such that $p(x)$ satisfies $\inf_{x\in Q}p(x)>0$, where $Q := [a-t,a+t]^d\cap\Omega$. This assumption ensures that $X$ takes values, with a certain probability, in the domain where the estimation is most difficult. Under this assumption, the minimax risk over $U_{B^{s(x)}_{p,q}(\Omega)}$ can be lower bounded by the minimax risk of a classical Besov ball on $Q$, where the expectation is taken with respect to the observed data. Here, the transformation from $f$ into $g$ and from $\hat f$ into $\hat g$ is based on a translation and scale change of $x$; by the definition of a Besov space, it holds that $\|f\|_{B^s_{p,q}(Q)}\asymp\|g\|_{B^s_{p,q}(\Omega)}$. It is known that $\inf_{\hat f} R(\hat f, U_{B^s_{p,q}(\Omega)})\gtrsim n^{-\frac{2s}{2s+d}}$ [19,6,5,10]; thus, by the same argument as Theorem 3.1, for all $s(x)$ it holds that
$$\inf_{\hat f} R\big(\hat f, U_{B^{s(x)}_{p,q}(\Omega)}\big)\gtrsim n^{-\frac{2s_{\min}}{2s_{\min}+d}}.$$
Therefore, it is important to consider the poly-log order when determining the difference in the convergence rates. From now on, we evaluate only the $L^2$ norm risk, and thus introduce $\nu := d\big(\tfrac 1p - \tfrac 12\big)_+$. To obtain the upper bound of the estimation error of a deep neural network, we use the following lemma.
Before the proof of Theorem 4.2, we introduce some notations. Let $F>0$, and define $\Psi(L,W,S,B)$ as the clipping of $\Phi(L,W,S,B)$ by $F$:
$$\Psi(L,W,S,B) := \big\{\min\{\max\{f,-F\},F\} : f\in\Phi(L,W,S,B)\big\}.$$
The clipping is easily realized by the ReLU function, since $\min\{\max\{x,-F\},F\} = \eta(x+F)-\eta(x-F)-F$. Let $\hat f\in\Psi(L,W,S,B)$ be the least squares estimator for $f^\circ$. Because the density function $p(x)$ satisfies $p(x)\le T$, the $L^2(P_X)$ error is bounded by the $L^2$ error up to the factor $T$. By applying Lemma 2.1 with $\delta=\frac 1n$ and choosing $N$ to balance the approximation and complexity terms (with a poly-log factor whose exponent is $\frac{2d(s-\nu)}{(2s+d)\alpha}$), we obtain the desired result.
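The clipping used to form $\Psi(L,W,S,B)$ costs only two extra ReLU units per output, via a standard identity that the following sketch verifies numerically (the value of $F$ and the grid are arbitrary illustrations).

```python
import numpy as np

# eta is the ReLU activation; clip_relu implements
#   clip_F(x) = eta(x + F) - eta(x - F) - F = min(max(x, -F), F).
def eta(x):
    return np.maximum(x, 0.0)

def clip_relu(x, F):
    return eta(x + F) - eta(x - F) - F

F = 2.0
x = np.linspace(-5.0, 5.0, 1001)
agrees = np.allclose(clip_relu(x, F), np.clip(x, -F, F))
```

Because the identity is exact, clipping changes neither the depth order nor the sparsity order of the network class, only the constants.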
For $p>2$, the poly-log order of the estimation error is $(\log n)^{-\frac{2s(d-3\alpha)}{(2s+d)\alpha}}$, and if $s=1$, it is $(\log n)^{-\frac{2(d-3\alpha)}{(2+d)\alpha}}$. Thus, we can see that the influence of the poly-log order increases as the dimension $d$ increases and as $\alpha$ decreases. Moreover, because the polynomial order suffers from the curse of dimensionality, the relative influence of the poly-log improvement also increases as the dimension increases.

Numerical evaluation of the improvement of the estimation error by adaptive approximation
The estimation error is improved by the adaptive method, and we numerically evaluate the effect of this improvement. This numerical evaluation compares the estimation errors for a realistic number of observations as the parameters change. We confirm that the improvement is significant if the parameters satisfy certain conditions. In particular, we will see that the poly-log factor is significant when $\alpha$ is small or $d$ is large. Figure 2 shows, on log-log graphs with $s = 1$ and $2 \le p$, the estimation error with the poly-log improvement (green), the polynomial order of the estimation error of deep learning (orange), and the rate $n^{-\frac{2s}{2s+d}}$ for $B^s_{p,q}(\Omega)$ (blue). In each graph, the horizontal axis represents the number of observations, and the vertical axis represents the estimation error. Figure 2(b) shows the case in which $d$ is larger than that of Figure 2(a). Note that we only need to consider the slopes of the graphs here because the constant factor differs in each graph. We can see that the improvement due to the adaptive method is significant if $d$ is large or $\alpha$ is small. Under this condition, the effective order of the convergence rate in the variable exponent Besov space is considerably faster than $n^{-\frac{2s}{2s+d}}$ when the number of observations is realistic. Therefore, if $\alpha$ is small or $d$ is large, the estimation error is far better than that of $B^s_{p,q}(\Omega)$.
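A comparison of this kind can be reproduced with a few lines of NumPy: we evaluate the fixed-exponent rate $n^{-2s/(2s+d)}$ and the rate carrying the additional poly-log factor over a realistic range of $n$, and compare their effective slopes on the log-log scale. This is a sketch under illustrative parameter choices ($s = 1$, $d = 8$, $\alpha = 1$), not the exact configuration of Figure 2:

```python
import numpy as np

s, d, alpha = 1.0, 8.0, 1.0            # illustrative parameters
n = np.logspace(2, 6, 50)              # realistic sample sizes: 1e2 .. 1e6

rate_besov = n ** (-2.0 * s / (2.0 * s + d))      # fixed-exponent Besov rate
polylog = np.log(n) ** (-2.0 * s * (d - 3.0 * alpha) / ((2.0 * s + d) * alpha))
rate_variable = rate_besov * polylog              # rate with the poly-log improvement

# the poly-log factor strictly improves the bound for every n in this range
assert np.all(rate_variable < rate_besov)

# effective slope (least squares fit on the log-log plot) is steeper
# for the adaptive rate, which is what the figure's inclinations show
slope_besov = np.polyfit(np.log(n), np.log(rate_besov), 1)[0]
slope_variable = np.polyfit(np.log(n), np.log(rate_variable), 1)[0]
assert slope_variable < slope_besov < 0
```

Because only the slopes are compared, the (unknown) constant factors in front of each rate are irrelevant, consistent with the remark above.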

Superiority to linear estimator
In this section, we show the superiority of deep neural networks over linear estimators by focusing on the adaptivity of such networks. A linear estimator is an estimator that depends linearly on the outputs $(Y_1, Y_2, \ldots, Y_n)$. This class includes some popular methods in the field of machine learning, for example, linear regression, the Nadaraya-Watson estimator, and kernel ridge regression. Because kernel methods can be regarded as learning methods using shallow neural networks, this comparison may also be viewed as one between deep and shallow neural networks.

Fig 2. Numerical evaluation of the estimation error
This class includes the least squares estimator, the Nadaraya-Watson estimator, and kernel ridge regression. For example, the estimator from kernel ridge regression can be written as $\hat{f}(x) = k(x)^{\top}(K + \lambda I)^{-1}Y$, where $\lambda > 0$, $Y = (Y_1, Y_2, \ldots, Y_n)^{\top}$, $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is a positive semidefinite kernel, $k(x) := (k(x, X_1), k(x, X_2), \ldots, k(x, X_n))^{\top}$, and $K = (k(X_i, X_j))_{i,j}$.
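The defining property, linearity in the outputs, is easy to exhibit for kernel ridge regression. The sketch below implements the estimator $\hat{f}(x) = k(x)^{\top}(K + \lambda I)^{-1}Y$ with a Gaussian kernel (the kernel choice and all parameter values are our illustrative assumptions) and checks that the prediction is a linear map of $Y$:

```python
import numpy as np

def gaussian_kernel(a, b, gamma=1.0):
    # k(x, x') = exp(-gamma * ||x - x'||^2), a positive semidefinite kernel
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dist)

def krr_predict(X, Y, x_new, lam=0.1):
    # \hat f(x) = k(x)^T (K + lam * I)^{-1} Y
    K = gaussian_kernel(X, X)
    coef = np.linalg.solve(K + lam * np.eye(len(X)), Y)
    return gaussian_kernel(x_new, X) @ coef

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, (30, 2))
Y1 = np.sin(X[:, 0])
Y2 = np.cos(X[:, 1])
x_new = rng.uniform(0.0, 1.0, (5, 2))

# linearity in the outputs: predicting from Y1 + 2*Y2 equals the
# corresponding combination of the individual predictions
lhs = krr_predict(X, Y1 + 2.0 * Y2, x_new)
rhs = krr_predict(X, Y1, x_new) + 2.0 * krr_predict(X, Y2, x_new)
assert np.allclose(lhs, rhs)
```

This linearity is exactly what the lower-bound argument exploits: a linear estimator cannot adapt its effective resolution to where the target function is less smooth.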
Linear estimators were compared with deep learning in previous studies. It was shown that deep learning is superior to any linear estimator for specific target function classes F [28,29,13,16].
Here, we introduce some properties of linear estimators. Previous studies considered whether the minimax optimal rate of linear estimators over the original target function class $F$ is equivalent to that over a larger function space. Indeed, it was proved in [13] that the minimax rate of linear estimators does not differ from that over the convex hull of the original target function class:
$$\inf_{\hat{f}:\,\text{linear}} R(\hat{f}, F^{\circ}) = \inf_{\hat{f}:\,\text{linear}} R(\hat{f}, \mathrm{conv}(F^{\circ})),$$
where the infimum is taken over the linear estimators and $\mathrm{conv}(F^{\circ})$ is defined as
$$\mathrm{conv}(F^{\circ}) := \left\{ \sum_{i=1}^{m} \lambda_i f_i : m \in \mathbb{N},\ f_i \in F^{\circ},\ \lambda_i \ge 0,\ \sum_{i=1}^{m} \lambda_i = 1 \right\}.$$
In addition, under certain assumptions, this was shown not only for the convex hull but also for the $Q$-hull [7,5]. We define a function set $G$ as the union, over the location parameter $c$, of the unit balls of variable exponent Besov spaces with smoothness $s + \beta \|x - c\|_2^{\alpha}$. Because we do not know the location of $c$ when estimating a function in $G$, it is difficult to identify which part of the function is less smooth (hard to estimate). This setting is more natural than that in which the target lies in $U\bigl(B^{s+\beta\|x-c\|_2^{\alpha}}_{p,q}\bigr)$ for a known $c$.

Related work
We note that several papers have investigated estimation problems for function classes with variable smoothness. In the context of kernel density estimation, [9] and [23] treated special Hölder spaces whose smoothness is defined locally. Although the measurement of the estimation error and the treated function spaces differ from this paper, the estimators in [9] and [23] adapt to the variable smoothness. The main idea common to both papers is to adjust the bandwidth locally so as to adapt to the local smoothness. This idea is similar to that of this work, which adjusts the resolution level of the B-spline basis. However, there are many differences between those works and this work. [9] and [23] only treated the case $d = 1$; [9] restricted $s(\cdot)$ to $0 < s(\cdot) \le 1$, and Proposition 3.13 in [23], which corresponds to Theorem 3.1, only considered the case $0 < \inf_{x \in \Omega} s(x) \le 1$. In addition, we describe the relation between the poly-log improvement of the $L^2$ estimation error and $s(\cdot)$ and $d$, whereas Theorem 3.15 in [23] gave the upper bound of the estimation error in an abstract form. Although those works proved pointwise convergence, which is stronger than the $L^2$ convergence of this work, we proved convergence for a broader function class, which includes the function class equipped with locally Hölder smoothness. In addition, we gave an approximation theory of the B-spline basis for the variable exponent Besov space, which is broader than the locally Hölder space, and we proved the superiority to linear estimators.
The nonparametric regression problem on a Besov space $B^s_{p,q}(\Omega)$ has been studied extensively; see Chapters 9-11 in [11] and Chapters 4.3, 6.3, and 8.2 in [10]. In particular, the adaptivity of wavelet threshold estimators is well known [6,5]. Wavelet threshold estimators adapt to the spatial inhomogeneity of smoothness, and their nonlinearity improves the convergence rate when the parameter $p$ is small. However, as far as we know, the performance of wavelet shrinkage estimators in the variable exponent setting has not been analyzed. We defer this to future work.

Conclusion
We showed that the polynomial order of the approximation and estimation errors cannot be improved beyond the order determined by the minimum value of $s(x)$, and that the adaptivity of deep learning yields a poly-log order improvement. This improvement is remarkable when the dimension is large and the region around the minimum point of $s(x)$ is small, that is, when the domain where the estimation is most difficult is small. In addition, we showed that, for $0 < p \le 2$, no linear estimator can achieve the poly-log improvement, which ensures the superiority of deep learning to linear estimators with respect to the estimation error. Notably, these results provide insight into the high performance of deep learning in applied fields.