Deep learning for inverse problems with unknown operator

We consider ill-posed inverse problems where the forward operator $T$ is unknown, and instead we have access to training data consisting of functions $f_i$ and their noisy images $Tf_i$. This is a practically relevant and challenging problem which current methods can solve only under strong assumptions on the training set. Here we propose a new method that requires minimal assumptions on the data, and prove reconstruction rates that depend on the number of training points and the noise level. We show that, in the regime of "many" training data, the method is minimax optimal. The proposed method employs a type of convolutional neural network (U-nets) and empirical risk minimization in order to "fit" the unknown operator. In a nutshell, our approach is based on two ideas: the first is to relate U-nets to multiscale decompositions such as wavelets, thereby linking them to the existing theory, and the second is to use the hierarchical structure of U-nets and the low number of parameters of convolutional neural nets to prove entropy bounds that are practically useful. A significant difference from existing work on neural networks in nonparametric statistics is that we use them to approximate operators and not functions, which we argue is mathematically more natural and technically more convenient.


Introduction
We consider a linear inverse problem, where our goal is to recover a function $f$ given observations from $Tf$ in a white noise regression model [47]
$$dY(x) = Tf(x)\,dx + \sigma\,dW(x), \qquad x \in M, \qquad (1)$$
where $M \subset \mathbb{R}^d$ is a bounded set, $T$ is a linear, bounded operator in $L^2$ and $dW$ is a white noise process (see Section 2.1.2 in [19]). We assume that $f \in C^s_L := \{f \in C^s \mid \operatorname{supp} f \subseteq [0,1]^d,\ \|f\|_{C^s} \le L\}$ is an $s$-Hölder function for $s, L > 0$.
The noise level σ > 0 is assumed to be known, as it can otherwise be estimated efficiently [40,45].
While linear inverse problems are well understood, in this paper we consider the challenging task of solving the inverse problem (1) when the operator $T$ is unknown. Instead, we have access to "training data", which consist of pairs of functions $f_i \in C^s_L$ and their images
$$dY_i(x) = Tf_i(x)\,dx + \sigma\,dW_i(x), \qquad x \in M, \quad i = 1, \dots, N, \qquad (2)$$
where the $f_i$ are functions drawn from a probability distribution $\Pi$ in $C^s_L$ (see Definition 2), and the $dW_i$ are independent white noise processes defined on the same probability space.
Inverse problems with unknown operator appear in applications where the data generating mechanism is (partially) unknown or numerically too demanding to be used in practice. One example is that of physical systems where $T$ is not explicitly known but computed in intensive finite element simulations [48]. Another example is instrumental variable models in econometrics, where the uncertainty in the operator arises from the unknown correlations between the variables of interest [20].
Not knowing the operator makes the problem quite challenging. If the operator were known, solving the problem would, roughly speaking, amount to finding a pseudoinverse of the operator $T$ (technically, of $T^*T$) and applying it to a denoised version of the data (1). Since the operator is unknown, we can only rely on the training data in order to construct an approximate pseudoinverse. In the past twenty years, several methods have been proposed for solving this problem under a quite strong assumption: the functions $f_i$ in the training set have to form an orthonormal system in $L^2$ (see e.g. [17,37] and Subsection 1.1 below for a discussion). If the training data is so structured, it is easy to construct a pseudoinverse even explicitly, as done in [17]. However, it is unrealistic to have such structured training data in practice, and it is hence of interest to ask what can be done for more generic training data.
In this paper we consider that question by imposing a minimal assumption on the training data: the functions $f_i$ are drawn independently from a common probability distribution $\Pi$. This distribution represents the type of signals commonly appearing in a particular application, e.g. brain images in tomography. Now, when dealing with such unstructured training data, it is not straightforward how to extract from it information about the operator $T$ and its pseudoinverse, as opposed to the explicit formula available under the orthonormality assumption. We approach this problem with the current learning paradigm, meaning that we look for a method that performs the inversion well on the given training data. Specifically, we employ neural networks as methods, and look for the neural net that minimizes the sum of squared residuals over the training data,
$$\hat F \in \operatorname*{argmin}_{F \in \mathcal{F}} \; \sum_{i=1}^N \big\| F(Y_i) - f_i \big\|_{L^2}^2, \qquad (3)$$
where $\mathcal{F}$ denotes a class of neural networks to be specified later. The intuition behind this method is as follows: if the network $\hat F$ is able to approximately solve the inverse problem for all the training data, then we expect it to approximately solve it for similar data, that is, for data drawn from the same distribution $\Pi$.
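To illustrate the empirical-risk-minimization idea in a discrete toy setting (this is not the authors' SU-net estimator): train a linear reconstruction map on simulated pairs $(Y_i, f_i)$ and apply it to unseen data from the same distribution. The forward operator, signal distribution and all constants below are hypothetical choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, sigma = 32, 200, 0.01   # grid size, number of training pairs, noise level

# Hypothetical forward operator T: a discrete Gaussian blur, treated as
# unknown by the learner, which only sees the pairs (Y_i, f_i).
kernel = np.exp(-0.5 * (np.arange(-4, 5) / 1.5) ** 2)
kernel /= kernel.sum()
def T(f):
    return np.convolve(f, kernel, mode="same")

# Training data: random smooth signals f_i and noisy images Y_i = T f_i + noise.
t = np.linspace(0.0, 1.0, n)
signals = np.array([np.sin(2 * np.pi * rng.integers(1, 4) * t + rng.uniform(0, 2 * np.pi))
                    for _ in range(N)])
Y = np.array([T(f) + sigma * rng.standard_normal(n) for f in signals])

# Empirical risk minimization over the toy class of linear maps R:
# minimize sum_i ||R(Y_i) - f_i||^2, solved here by least squares.
R, *_ = np.linalg.lstsq(Y, signals, rcond=None)

# The fitted map approximately inverts T on fresh data from the same distribution.
f_new = np.sin(2 * np.pi * 2 * t)
y_new = T(f_new) + sigma * rng.standard_normal(n)
rel_err = np.linalg.norm(y_new @ R - f_new) / np.linalg.norm(f_new)
```

The learner never uses the kernel directly; it only fits the training pairs, which is the same mechanism that motivates (3) for the nonlinear SU-net classes studied in the paper.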
One caveat at this point: we do not consider optimization issues in this work, and assume that we simply have a neural network $\hat F$ that (somehow) minimizes the risk. However, we prove our results in a way that they apply to neural nets that approximately minimize the empirical risk in a precise sense (see Definition 6). This is a slight improvement, as it would be unrealistic to pretend to find the exact minimizer of the nonconvex minimization problem (3), see Remark 6. Our focus in this paper is on the theoretical analysis of the (approximate) minimizer $\hat F$. In particular, we want to quantify its accuracy for solving the inverse problem (1), i.e., we want to show that $\hat F(Y) - f$ tends to zero in an appropriate sense as $N \to \infty$ and $\sigma \to 0$. Our analysis provides such a convergence in a quantitative way, with Theorem 1 proving that
$$\mathbf{E}\,\big\|\hat F(Y) - f\big\|_{L^2}^2 \;\lesssim\; N^{-\frac{s}{2s+2\beta+3d/2}} + \sigma^{\frac{4s}{2s+2\beta+d}} \qquad (4)$$
up to logarithmic factors for $\beta \ge 0$, where the operator $T$ belongs to a class of $\beta$-smoothing linear operators that includes convolution operators, see Definition 1 and Example 1. This statement confirms that the intuition of empirical risk minimization applies here: the network $\hat F$, trained to approximately invert the training data, is able to solve the inverse problem on unseen data drawn from the same distribution.
Furthermore, equation (4) shows that, if we have enough training samples ($N \ge \sigma^{-4\frac{2s+2\beta+3d/2}{2s+2\beta+d}}$), then the convergence rate is the minimax rate over $s$-Hölder functions for $\beta$-smoothing inverse problems (see Theorem 3 in [8]). This is remarkable: it means that, given enough training data, our method performs roughly as well as if the operator were known. Pinpointing this behavior quantitatively is of interest also from an epistemological viewpoint, as it marks the transition from classical model-based methods to modern model-agnostic, training-data-based methods.
We prove statement (4) for the approximate minimizer $\hat F$ over a class $\mathcal{F}$ of neural networks that have the architecture of U-nets, proposed in [42] for image segmentation tasks. These are convolutional neural networks with skip connections that have been used for inverse problems before, see Section 1.2 below. Crucially, our analysis gives quantitative prescriptions for the choice of all parameters in the U-net (see Assumption 3 and equation (14)), thus providing a guide for the practitioner in the difficult task of choosing them.
The proof of our convergence result is based on two steps. First, we prove approximation properties for our class of U-nets; this is done by relating U-nets to the thresholding operator with respect to the wavelet–vaguelette system associated with the operator $T$. In a second step, we bound the complexity of the class of SU-nets in terms of its covering numbers. This is the most challenging step and the heart of the paper; the central argument here is the stability of SU-nets under perturbations of their coefficients. We refer to Section 2.4 for a detailed blueprint of the proof.

Inverse problems with unknown operator
As remarked above, inverse problems with unknown operator pose a relevant and challenging problem. In the past 20 years, several approaches have been proposed to deal with them. The first work was that of Efromovich and Koltchinskii [17], which, given training data, builds a linear estimator for the pseudoinverse of $T^*T$. Their approach allows for different noise levels in the training and the test data. Further, they recover the minimax rate of estimation if the training sample is large enough (specifically, if $N \ge \sigma^{-\frac{1}{2s+2\beta+1}}$ in dimension $d = 1$). A similar approach was proposed by Marteau [37], while a new type of nonlinear estimator with two variants was introduced by Hoffmann and Reiß [21]. They considered a (roughly equivalent) model where, instead of having noisy training data, the operator is perturbed by a random, additive term. This corresponds to our setting with infinite training data and a potentially different noise level in the training and the test data. Their estimators are shown to attain the minimax rates. A somewhat simplified model was considered by Cavalier and Hengartner [9], Johannes and Schwarz [26] and Marteau and Sapatinas [39], where the training data is assumed to consist of eigenfunctions of the operator $T$, so that its eigenvalues are observed with noise. In that setting, the different problem of estimating a quadratic functional was considered recently by Kroll [32].
A different estimator was proposed by Rosenbaum and Tsybakov [43] in the context of high-dimensional regression where the design matrix is observed with noise. This is a sort of discrete counterpart of the problem considered here. Their estimator is given by a Lasso-type method, modified so that the noise in the design matrix is taken into account. Finally, a different approach was introduced by Trabs [46], who proposed a Bayesian method for the joint estimation of the unknown function $f$ and the operator $T$. His setting is similar to that in [21] in that he observes the operator $T$ up to additive random noise, instead of observing training data. His work is also novel in that he considers different degrees of uncertainty in the operator, parametrized by an unknown term that may be finite or infinite dimensional.
For completeness, we mention here a family of works dealing with inverse problems with unknown operator that arise in instrumental variable models in econometrics. These works also correspond to the setting where the operator $T$ is known up to noise, because it is estimated from data in a certain way. The work of Hall and Horowitz [20] provides a good overview of this line of work until 2005, as well as two kernel-based nonparametric estimators with convergence rates. Further developments in this direction were given by Johannes et al. [27], Marteau and Loubes [38], and Loubes and Pelletier [34], who employ an interesting problem-dependent form of regularization.
We remark that, in all the methods described above, one either effectively has infinitely many training data, or the training data $\{f_i\}$ is assumed to form an orthonormal system. This assumption is at the heart of the constructions used, e.g. in [17], as the orthonormality is key for building their linear approximation to the pseudoinverse. We also remark that this assumption is quite restrictive: it cannot be guaranteed in applications where the training data is simply given, or has a very specific structure on which orthonormality cannot naturally be imposed (think e.g. of tomography images). In this sense, we find our approach more satisfactory and realistic, as it does not impose any conditions on the training data besides being drawn independently from a probability distribution.

Why neural networks
As stressed above, the lack of structure in the training data makes it infeasible to construct an explicit pseudoinverse of $T$ directly. Instead, we resort to finding a method that minimizes (3), thus fitting the training data well. For this approach to work, the class $\mathcal{F}$ over which we minimize in (3) should be large enough to contain an operator that approximately inverts $T$. This requires a lot of expressive power, as the operator is unknown and can be anything within a class of smoothing operators (see Definition 1). In this work, we consider the case where $\mathcal{F}$ is a class of neural networks. Specifically, it will be a class of U-nets, proposed by [42]. Even though the U-net was originally proposed for the segmentation of MRI images, it has also been applied with minor modifications to proper inverse problems, such as limited-angle X-ray tomography [24], 2-photon microscopy [33] and denoising of seismic data [36], to mention just a few applications. In this work, we will use a modification of U-nets, which we call simplified U-nets, or SU-nets, see Definition 5.
Let us give an intuition of why U-nets are suitable for our task. As argued in [42] and as can be seen in Figure 1, the U-net operates as follows. First, in the downward ("contracting") path, features are extracted from the input with the help of convolutional layers. The deeper we go in the vertical direction, the more features we extract, and the bigger the (effective) size of the convolutional kernels used. This means that the bottom layers carry information about lower frequencies, while higher-frequency information is contained in the upper layers. Then, in the upward ("expanding") path, information from the deeper layers is combined with the layer immediately on top. This natural interpretation of U-nets in terms of frequencies is one of the key ideas of this work, as it will help us relate them to the wavelet transform, from which we will derive their approximation properties and their ability to invert an operator. See Definition 4 and Example 3 for the precise relation between these objects.

Literature on neural networks for nonparametric problems
In the past few years, following the deployment and success of neural networks in practical applications, researchers in the mathematical statistics and inverse problems communities have begun to explain their performance in model problems with theories. As the literature is vast, in this section we will only mention the works directly relevant to us, that is, those using neural nets for nonparametric problems such as regression and inverse problems and proving reconstruction rates.

(Figure 1: The U-net architecture proposed in [42] for segmentation of MRI images. Image taken from [42], reproduced with permission of Olaf Ronneberger.)

In that category, all of the relevant works study the approximation of functions with different types of neural networks under different assumptions. We refer to the classical works of Kolmogorov [31] for shallow networks, and Cybenko [10] and Hornik [22] for bounded activation functions. See the book by Anthony and Bartlett [2] for an overview of the developments until the 2000s. Since then, new techniques have allowed the treatment of deep networks and of unbounded activation functions, see e.g. [44] and [4] for results in a nonparametric statistics setting, and [7,12,16,18,30,49,50] and the review [13] for an approximation-theoretic viewpoint. We also mention the works of D.-X. Zhou, especially [51], for the explicit treatment of convolutional networks.
The results proven in these works have a common structure: they consider the approximation and entropy properties of a class of neural networks in terms of their complexity (measured by covering numbers and depending on the architecture, number of nonzero parameters, depth, width, etc), and their performance is given by a trade-off between these. In this respect, our strategy coincides with theirs. There is however a crucial difference between our use of neural nets and theirs: in all the papers mentioned above, the goal is to approximate a function given function values (that is, a regression problem or an inverse problem). In our case, the network approximates the solution operator of an inverse problem. This is a big difference, both in terms of the techniques used and of the implications.
In this sense, one of the contributions of this paper is to present a different way of modeling neural networks in statistical applications: as operators (or functionals) rather than as functions. We believe that this viewpoint is more natural in certain applications beyond the one we present here, such as segmentation and other image processing tasks (where the neural net is naturally used as an operator), or classification of complex objects such as images and sound (where the neural net is used as a functional). A naive response to this statement might be that one can approximate functionals with functions, and operators with high-dimensional functions, and so use the function approximation properties of neural networks. Our answer to that has two parts: first, if we substitute an operator with a function from $\mathbb{R}^{n_1}$ to $\mathbb{R}^{n_2}$, the current theory gives an estimation error of the order $O(n_2)$ (times the noise level to a power $O(n_1^{-1})$). In particular, the dependence on $n_2$ is terrible and likely to render the results useless. We take this as a hint that it is incorrect, as well as unnatural, to model operators as high-dimensional functions in this setting. Of course, one could object that one simply has to impose the right regularity properties. This brings us to our second argument: in some applications, such as image processing tasks, the object we look for naturally has operator-like properties, such as locality, boundedness, maybe even a representation as a Fourier integral operator, but it is not apparent, and rather debatable, whether it has function-like properties, such as pointwise smoothness.
Finally, we want to mention that some researchers have already proposed to use neural networks to approximate the solution operator in inverse problems, see e.g. [1] and the very good review in [3]. Their motivation is however not to solve inverse problems with unknown operators, but to learn the right regularizer or regularization parameter for the problem. Furthermore, to the best of our knowledge these approaches do not prove reconstruction rates in terms of the smoothness, noise level and number of training data.

Contributions
To summarize, in this work we make two contributions: First, we apply a type of convolutional neural network to linear inverse problems, and prove quantitative convergence rates. For that, we build a theoretical framework based on relating the neural network architecture to the wavelet transform, and we bound the complexity of the class of neural nets in terms of covering numbers. Furthermore, our theoretical analysis gives clear prescriptions for the choice of the many free parameters of the neural net, such as depth, size of the kernels, etc. Finally, we stress that the present framework is promising in that it can be generalized and applied to other settings, such as classification problems.
As a second contribution, we propose a method for solving inverse problems with unknown operators and prove convergence results for it. As explained in Section 1.1, this is a challenging problem for which solutions are known only under restrictive assumptions. In contrast, our assumptions are significantly weaker than those in the literature, while our results are of the same quality. In particular, our convergence rate depends on the size of the training set and the noise level, and the minimax rate over Hölder spaces is attained for a large enough training set.

Organization of the paper
The paper is organized as follows. In Section 2.1 we present some of the basic definitions and assumptions, in Section 2.2 we introduce SU-nets and compare them at length with the wavelet transform, in Section 2.3 we state our main result and in Section 2.4 we give an overview of the proof. A discussion of our results is given in Section 3, while the main proofs are in Section 4.

Assumptions
Definition 1. For constants $\beta \ge 0$ and $C_T, a_1, a_2 > 0$, the class $\mathcal{T}_{\beta,C_T} = \mathcal{T}$ is defined as the set of all linear operators $T$ in $L^2$ such that: (i) $T$ is bounded, with operator norm at most $C_T$; (ii) $T$ is $\beta$-smoothing, meaning that it maps between Sobolev spaces according to
$$a_1 \|g\|_{H^{t-\beta}} \le \|T^* g\|_{H^t} \le a_2 \|g\|_{H^{t-\beta}}$$
for any $g \in \operatorname{dom}(T^*)$ and any $t \ge 0$; (iii) $T$ is translation invariant, meaning that it commutes with translations, i.e., if the translation operator is defined by $\tau_h g(x) := g(x+h)$, then $T\tau_h g = \tau_h T g$ for all $g \in \operatorname{dom}(T)$ and all $h \in \mathbb{R}^d$.

Assumption 1 (Wavelet–vaguelette decomposition). Given a bounded set $M \subset \mathbb{R}^d$, the linear operator $T$ and a father Daubechies wavelet $\varphi_{J,k,0}$ (at resolution level $J$), we assume that there are functions $\psi_k(x) \in L^2$ whose support is contained in the set $M$ and such that, for any function $g \in \operatorname{dom}(T)$, the wavelet coefficients $\langle g, \varphi_{J,k,0}\rangle$ can be recovered from the inner products $\langle Tg, \psi_k\rangle$. The set of functions $\psi_k$ is the so-called vaguelette system associated to $T$ and the Daubechies wavelet basis, as introduced in [14].

Example 1. An example of operators belonging to the class $\mathcal{T}_{\beta,C_T}$ and satisfying Assumption 1 are convolution operators with sufficiently fast decaying kernel. This includes relevant inverse problems, such as models for microscopy and astronomy [6], and also the solution operators of certain differential equations via the method of Green's functions [5].
Let us consider this example in more detail. On the one hand, a convolution operator with kernel $K$ is bounded, with operator norm at most $\|K\|_{L^1}$ by Young's inequality. Concerning the support properties in Assumption 1, suppose that the operator is given by convolution with a kernel $K$ satisfying
$$\mathcal{F}K(\xi) = (1 + |\xi|^2)^{-L},$$
which is a $2L$-smoothing operator, $L \in \mathbb{N}$, where $\mathcal{F}$ denotes the Fourier transform. The associated vaguelette can be computed explicitly [14] and has the form
$$\psi_k = (\mathrm{Id} - \Delta)^{L}\, \varphi_{0,k,0},$$
where $\Delta$ denotes the Laplacian in $\mathbb{R}^d$. From this equation we read off that $\operatorname{supp} \psi_k \subseteq \operatorname{supp} \varphi_{0,k,0}$, which is compact for Daubechies wavelets. In this case, the bounded set $M$ in Assumption 1 can be taken to be the compact support of the Daubechies wavelets at scale $J = 0$.
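The $L^2$-boundedness claim can be sanity-checked numerically. The sketch below verifies Young's inequality $\|K * g\|_{L^2} \le \|K\|_{L^1} \|g\|_{L^2}$ on a periodic grid; the kernel and signal are arbitrary illustrative choices, not taken from the paper.

```python
import numpy as np

# Numerical check of ||K * g||_{L^2} <= ||K||_{L^1} ||g||_{L^2} (Young's
# inequality) for a convolution operator, on a periodic grid of step h.
n = 4096
h = 1.0 / n
x = np.arange(n) * h
K = np.exp(-80.0 * np.abs(x - 0.5))        # rapidly decaying kernel
g = np.sin(6 * np.pi * x) * (x < 0.7)      # some square-integrable signal

# Circular convolution via FFT approximates the convolution operator K*.
Kg = h * np.real(np.fft.ifft(np.fft.fft(K) * np.fft.fft(g)))
lhs = np.sqrt(h * np.sum(Kg ** 2))                              # ||K*g||_{L^2}
rhs = (h * np.sum(np.abs(K))) * np.sqrt(h * np.sum(g ** 2))     # ||K||_{L^1} ||g||_{L^2}
```

In the discrete periodic setting the inequality holds exactly, since the circular convolution is an average of shifts of $g$ weighted by $h\,|K_m|$.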

Remark 1.
The translation invariance assumption in Definition 1 is made for a particular reason: it guarantees that the vaguelettes $\psi_k(\cdot)$ in Assumption 1 are given by translations of a single function $\psi(\cdot)$, i.e., $\psi_k(x) = \psi(x - k 2^{-J})$. This will simplify the structure of the neural networks employed later. We remark, however, that our approach can be extended to operators that are not translation invariant at the price of a more complex network architecture. See Section 3 for a discussion.
Definition 2 (Probability distribution in $C^s$). For $s > 0$, consider the set $C^s$ of $s$-Hölder functions supported in the unit cube $[0,1]^d$, which is a metric space with the metric induced by the $s$-Hölder norm. Let $\mathcal{B}$ denote the Borel $\sigma$-algebra induced by the topology given by this norm. On the measurable space $(C^s, \mathcal{B})$ we consider probability measures $\Pi$ defined with the usual axioms, see e.g. Chapter II in [41]. In this paper, we will only work with measures $\Pi$ supported in the bounded set $C^s_L$.

An example of a distribution as in Definition 2 is a Gaussian prior on the wavelet coefficients with an appropriate variance. More precisely, we consider random functions $f$ whose wavelet decomposition with respect to a suitable wavelet basis has independent Gaussian coefficients with variances decaying appropriately across scales. This ensures that $\mathbf{E}\|f\|_{C^s} \le L$. Further, the constraint on the support of the function $f$ can be achieved by using compactly supported wavelets (e.g. Daubechies) and by choosing the largest scale carefully.
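As a toy illustration of such a prior (not the paper's exact construction), one can sample Gaussian wavelet coefficients with standard deviation decaying like $2^{-j(s+1/2)}$ across scales and synthesize the function with the inverse wavelet recursion. Haar filters are used here only for simplicity (they are not themselves $C^s$); the smoothness $s$ and the finest scale $J$ are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
s, J = 1.0, 8                        # smoothness and finest scale (illustrative)

# Haar synthesis filters (an illustrative choice of wavelet basis).
h0, h1 = 1 / np.sqrt(2), 1 / np.sqrt(2)
g0, g1 = 1 / np.sqrt(2), -1 / np.sqrt(2)

c = rng.standard_normal(1)           # coarsest scaling coefficient
for j in range(J):
    # Detail coefficients at scale j, with std decaying like 2^{-j(s+1/2)}.
    d = 2.0 ** (-j * (s + 0.5)) * rng.standard_normal(len(c))
    fine = np.empty(2 * len(c))
    fine[0::2] = h0 * c + g0 * d     # inverse wavelet recursion (synthesis)
    fine[1::2] = h1 * c + g1 * d
    c = fine
f = 2.0 ** (J / 2) * c               # approximate function values on a grid of 2^J points
```

Repeating this with fresh random coefficients draws independent samples from the prior, mimicking how the training functions $f_i$ are generated.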

Formalization of SU-nets
In this section we introduce a class of neural networks which we call SU-nets. They are a modification of the U-net introduced in [42]. Before we define these networks in Definition 4, we motivate their structure by comparing them with the computation of the wavelet transform.
Remark 2 (Wavelet transform via recursion). In statistics, the wavelet coefficients of a function $g$ with respect to a wavelet basis $\{\varphi_{j,k,e}\}$ are usually computed and denoted by the inner products $c_{j,k,e} = \langle g, \varphi_{j,k,e}\rangle$, where $\varphi_{j,k,e}(x) = 2^{jd/2}\varphi_e(2^j x - k)$. In typical statistical applications one does not consider the computation of the coefficients any further. In computer science, however, the question of computation is in focus. The wavelet coefficients of a function can be computed in several ways, one of which is especially efficient and will be useful for our formulation of U-nets: the recursive computation. In a nutshell, we begin by computing the wavelet coefficients of the signal at the smallest scale, and compute the ones at larger scales recursively.

In equations, we begin with the coefficients at the smallest scale, say $J$, for the father wavelet only: $c_{J,k,0} = \langle g, \varphi_{J,k,0}\rangle$. The rest of the coefficients are computed recursively as follows:
$$c_{j,k,0} = \sum_l h[l-2k]\, c_{j+1,l,0}, \qquad c_{j,k,e} = \sum_l g_e[l-2k]\, c_{j+1,l,0} \quad (e \neq 0),$$
with the filters $h[l]$ and $g_e[l]$ given by
$$h[l] = \langle \varphi_{0,0,0}, \varphi_{1,l,0}\rangle, \qquad g_e[l] = \langle \varphi_{0,0,e}, \varphi_{1,l,0}\rangle. \qquad (5)$$
The coefficients also satisfy a recursion in the other direction, i.e. from big to small scales:
$$c_{j+1,k,0} = \sum_r \Big( h[r]\, c_{j,(k-r)/2,0} + \sum_{e \neq 0} g_e[r]\, c_{j,(k-r)/2,e} \Big), \qquad (6)$$
where the sum over $r$ runs over all indices $r \in \mathbb{Z}^d$ such that $(k-r)/2 \in \mathbb{Z}^d$. We refer to Theorem 7.10 in [35] for the proof of these statements. A warning here is due: Mallat uses the scale index in the opposite direction than we do, that is, he writes $\phi_{j,k}(t) := 2^{-j/2}\phi(2^{-j}t - k)$. Beware of this change in notation when comparing the results.

While the computations above apply to all wavelets, in this paper we will use Daubechies wavelets [11], which are convenient due to their support and smoothness properties. Furthermore, using them imposes some constraints on the filters $h[l]$ and $g_e[l]$, which we explain now. Let $\{\varphi_{j,k,e}\}$ be a family of one-dimensional Daubechies wavelets with $M \in \mathbb{N}$ vanishing moments. It follows from Theorem 4.2.10 in [19] that their support has length $2M-1$. For a $d$-dimensional basis of Daubechies wavelets constructed by tensorization, this support property propagates in the obvious way. In particular, we see that for such a basis, the Lebesgue measure of the support of its elements is $|\operatorname{supp} \varphi_{0,0,e}| = (2M-1)^d$. Furthermore, by the definitions of $h[l]$ and $g_e[l]$ in (5), we see that they also have a support of size at most $(2M-1)^d$. Further properties of the filters $h[l]$ and $g_e[l]$ can be deduced from Corollary 4.2.4 in [19]. More generally, these filters satisfy a certain orthogonality property, but we will not need it here.
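A one-dimensional numerical sketch of this recursive computation, using the Haar filters $h = (1,1)/\sqrt{2}$ and $g = (1,-1)/\sqrt{2}$ (an illustrative choice; the paper works with higher-order Daubechies filters). The forward recursion computes coarse and detail coefficients from fine-scale ones, and the inverse recursion recovers the fine-scale coefficients exactly.

```python
import numpy as np

# Haar conjugate-mirror filters; the scale convention follows the text
# (larger j = finer scale), not Mallat's.
h = np.array([1.0, 1.0]) / np.sqrt(2.0)   # low-pass filter h[l]
g = np.array([1.0, -1.0]) / np.sqrt(2.0)  # high-pass filter g[l]

def analysis(c_fine):
    """Coarse and detail coefficients c_{j,k}, d_{j,k} from c_{j+1,l}."""
    c = np.array([h[0] * c_fine[2 * k] + h[1] * c_fine[2 * k + 1]
                  for k in range(len(c_fine) // 2)])
    d = np.array([g[0] * c_fine[2 * k] + g[1] * c_fine[2 * k + 1]
                  for k in range(len(c_fine) // 2)])
    return c, d

def synthesis(c, d):
    """Inverse recursion: recover c_{j+1,k} from c_{j,.} and d_{j,.}."""
    out = np.empty(2 * len(c))
    out[0::2] = h[0] * c + g[0] * d
    out[1::2] = h[1] * c + g[1] * d
    return out

c1 = np.random.default_rng(1).standard_normal(16)  # coefficients at a fine scale
c0, d0 = analysis(c1)                              # one step of the forward recursion
```

Applying `synthesis(c0, d0)` returns the original `c1`, which is the perfect-reconstruction property underlying the expanding path of the SU-net.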

Definition 3 (Convolution operations on tensors). 1) For $d \in \mathbb{N}$, we say that $A$ is a $d$-tensor if it is a $d$-dimensional array of real coefficients. 2) Given a $d$-tensor, we define its support size, or simply its size, as its number of components. We will often work with $d$-tensors indexed by an equidistant regular grid of $2^{Jd}$ points in $[0,1]^d$, which we denote by $\Gamma_J$, for $J \in \mathbb{N}$. 3) Let $A$ and $\gamma$ be two $d$-tensors. We define their downsampled convolution as the $d$-tensor with entries
$$\sum_l \gamma_l A_{2k-l},$$
where $k \in \mathbb{Z}^d$ and the sum runs over all indices $l$ such that $\gamma_l$ and $A_{2k-l}$ are defined. 4) Let $A$ and $\gamma$ be two $d$-tensors. We define their upsampled convolution as the $d$-tensor with entries
$$\sum_l \gamma_l A_{(k+l)/2},$$
where $k \in \mathbb{Z}^d$ and the sum runs over all indices $l$ such that $(k+l)/2 \in \mathbb{Z}^d$ and $\gamma_l$ and $A_{(k+l)/2}$ are defined.
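The two operations in Definition 3 can be transcribed directly into code. The following is a one-dimensional ($d = 1$) sketch, with out-of-range terms dropped as in the definition; the small worked examples in the test are illustrative.

```python
import numpy as np

# Direct 1-d transcription of Definition 3:
#   (downsampled conv)_k = sum_l gamma_l A_{2k-l},
#   (upsampled conv)_k   = sum_{l : (k+l)/2 integer} gamma_l A_{(k+l)/2},
# where terms with undefined indices are dropped.
def down_conv(A, gamma):
    n, m = len(A), len(gamma)
    out = []
    for k in range((n + m - 1) // 2 + 1):
        s = sum(gamma[l] * A[2 * k - l] for l in range(m) if 0 <= 2 * k - l < n)
        out.append(s)
    return np.array(out)

def up_conv(A, gamma):
    n, m = len(A), len(gamma)
    out = []
    for k in range(2 * n + m):
        s = sum(gamma[l] * A[(k + l) // 2] for l in range(m)
                if (k + l) % 2 == 0 and 0 <= (k + l) // 2 < n)
        out.append(s)
    return np.array(out)
```

The downsampled convolution roughly halves the index range (the contracting path), while the upsampled convolution roughly doubles it (the expanding path).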
Definition 4 (Structure of the SU-net). Let $d, J, S_{\mathrm{filter}} \in \mathbb{N}$, let $\psi, \phi$ be filters, and let $\tau_j \ge 0$ be real numbers. An operator $F(\cdot)$ that maps a function $g \in L^2([0,1]^d)$ to a function $F(g)$ is called a Simplified U-net, or SU-net, if its output is given by (9) according to the following steps:
- First layer: given the input function $g \in L^2$, define the $d$-tensor obtained by convolving $g$ with the input filter $\psi$ and evaluating on the grid $\Gamma_J$, with $\Gamma_J$ as in Definition 3.
- Contracting path: the tensors $s^{(l)}$ and $d^{(l),e}$ are computed recursively via downsampled convolutions with the discrete filters.
- Activation functions: thresholding nonlinearities are applied, where the nonlinearity acts coefficient-wise on the tensors.
- Expanding path: the tensors are recombined via upsampled convolutions.
- Last layer: the output function is formed as a sum over all indices of the tensor $s^{(J)}$, which are indexed by the grid $\Gamma_J$.
A graphical representation of the SU-net is given in Figure 2. Notice that the support size of the tensors $s^{(l)}$ and $d^{(l),e}$ decreases as $l$ decreases: more precisely, it is divided by $2^d$ every time $l$ decreases by one. Conversely, the support size of $s^{(l)}$ is multiplied by $2^d$ every time $l$ increases by one. This is akin to the pooling operation in U-nets, which is usually performed either as max-pooling or average-pooling.
By definition, SU-nets are operators that map functions to functions. Their definition is such that the class of SU-nets contains the well-known operator that performs wavelet thresholding. Consider the SU-net whose discrete filters are the wavelet filters $h[l]$ and $g_e[l]$ from (5); these have support size $S_{\mathrm{filter}} \le (2M-1)^d$ and bounded norm. Let further the functions $\psi$ and $\phi$ in the SU-net be given by the father wavelet at scale $J$. Consequently, the first layer and the contracting path compute the wavelet coefficients of the signal $g$. Finally, note that the upward ("expanding") path recomputes the wavelet coefficients at the smallest scale according to equation (6), and the last layer multiplies the thresholded coefficients with the father wavelet to yield the reconstructed function, which is the wavelet-thresholded version of $g$.
We now further constrain the class of SU-nets that we will use by imposing some quantitative bounds.
3) The input filter $\psi$ belongs to the Sobolev space $H^r(M)$ and satisfies $\|\psi\|_{L^2} \le C_{\psi,L^2}$ and $\|\psi\|_{H^r} \le C_{\psi,H^r}$. 4) The output filter $\phi$ is fixed and given by the father wavelet of a basis of Daubechies wavelets with $R$ continuous derivatives (and $6R+1$ vanishing moments and support size $(12R+1)^d$).
Definition 5 means that, in order to specify an SU-net from the class $F_J$, we need to choose the input filter $\psi$ (a continuous function), the compactly supported discrete filters $\alpha^{(j)}$, etc., and the thresholds $\tau_j$. The output filter, in particular, is fixed.

Remark 3.
Following the construction and analogy in Example 3, we notice that the choice of the ReLU activation function in (7) is analogous to the soft-thresholding rule used in wavelet thresholding. Accordingly, different activation functions correspond to different thresholding rules (see page 202 in [28] for some examples), which in turn have different motivations and interpretations. We wonder whether the statistical meaning of the thresholding rules can be transferred to the activation functions, thus providing guidance in their choice. We think, for instance, of the similarity between the shrinkage in the James–Stein estimator and the batch normalization operation in neural networks.
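For the ReLU case this analogy is an exact identity: soft-thresholding at level $\tau$ can be written as a difference of two ReLUs, $\mathrm{soft}_\tau(x) = \mathrm{ReLU}(x - \tau) - \mathrm{ReLU}(-x - \tau)$. A minimal numpy check:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def soft_threshold(x, tau):
    """Soft-thresholding rule used in wavelet thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def soft_via_relu(x, tau):
    """The same rule expressed with two ReLU activations."""
    return relu(x - tau) - relu(-x - tau)
```

For $x > \tau$ only the first ReLU is active and returns $x - \tau$; for $x < -\tau$ only the second is active and returns $x + \tau$; in between both vanish, which is exactly the soft-thresholding rule.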
Remark 4 (Ensuring countability). For theoretical purposes in Section 4.3, we will need the set $F_J$ of SU-nets to be countable. In order to achieve that, we define them as above, but all filters and thresholds taking real values will instead take rational values, and the continuous filter $\psi(x) \in H^r$ will be taken from a countable dense subset of the unit ball of $H^r$.

Example 3 (Continued). In order to accommodate the wavelet thresholding construction of Example 3 into the class $F_J = F_J(r, R, S_{\mathrm{filter}}, \kappa_\tau, C_{\psi,L^2}, C_{\psi,H^r})$ of SU-nets, we need to choose a wavelet basis with $M = 1 + 6R$ vanishing moments, as this implies that the wavelet functions have $0.18 \cdot (M-1) \ge R$ continuous derivatives and belong to $H^r$ for $R > r$ (see Remark 4.2.11 in [19]). This implies that we need to choose $S_{\mathrm{filter}} = (12R+1)^d$.
Further, we need to choose $C_{\psi,L^2} = 1$ and $C_{\psi,H^r} = \|\varphi_{0,0,0}\|_{H^r}\, 2^{Jr} = 2^{Jr}$ (see (10) for the last equality). With this choice of parameters, the wavelet thresholding operator described in Example 3 belongs to $F_J$.
We use this property in Section 4.2 below to analyze the approximation properties of SU-nets.

Remark 5.
Notice that while the first layer in an SU-net is a convolutional layer, it is a continuous convolution, which is not standard in the deep learning community. We stress that this continuity is enforced for mathematical convenience, see the remarks after Assumption 2, and it would be discretized in practice.
Further, we remark that having the first layer as a convolution is enough for our purposes here, since we are dealing with translation invariant operators, for which the associated vaguelettes are all shifts of a single function, see Example 4 below.
For more general operators, we may need to perform more complex operations in the first layer in order to invert the operator locally. For operators that are not translation invariant, this could be done with the wavelet–vaguelette decomposition for general operators [14] or with a suitable Galerkin inversion as in [21]. In that case, one layer may not be enough to perform these operations, and several layers may be needed.
Example 4 (Wavelet-vaguelette transform). Let T be an operator satisfying Definition 1 and Assumption 1. Consider the SU-net constructed in Example 3 above, with the only difference that ϕ is a basis of Daubechies wavelets with M = 1 + 6(r + β) vanishing moments, and the function ψ is taken to be the vaguelette associated with the basis of wavelets and the operator T . In this case, the corresponding SU-net performs thresholding with respect to the wavelet-vaguelette decomposition of the operator. This operation was proposed by [14] for solving statistical inverse problems.
In order for a class of SU-nets F J as in Definition 5 to accommodate this operator, we need to choose the parameters such that for the constant a 1 > 0 in Definition 1. This is so because of the β-smoothing property of the operator T , which implies that assuming that ϕ j,k,e is a family of Daubechies wavelets with at least t continuous derivatives and more than t vanishing moments.
Example 5 (U-net). In the Simplified U-net, all filters are chosen freely from a bounded set, as in F J in Definition 5. In particular, SU-nets are a particular subset of the U-nets in [42] with the following properties:
- All layers are convolutional, with filters of size S filter in the middle layers.
- The activation functions are nonlinear in the skip connections, and linear otherwise.
- At each level, a constant number 2 d − 1 of feature channels is used.

Main results
Assumption 2 (Training data). Let M ⊂ R d be a bounded set and T ∈ T β,CT be a β-smoothing operator as in Definition 1, both satisfying Assumption 1. We call training data a set {(Y i , f i ), i = 1, . . . , N } of observations given by
where dW i are independent white noise processes defined in the same probability space.
Notation. In the following, we write E f to represent the expectation with respect to the function f drawn from the distribution Π. Accordingly, E W represents the expectation with respect to the white noise process W .
Notice that we observe the transformed data Y i and the corresponding true signals f i as continuous data (i.e. at "full" resolution). There are several reasons for this: on the one hand, we are analyzing SU-nets from the viewpoint of nonparametric statistics and inverse problems, so it is natural to estimate continuous functions rather than discrete signals (which would fall in the realm of high-dimensional statistics). On the other hand, working with continuous signals is mathematically convenient: in particular, we leverage the approximation properties of wavelet bases, wavelet-vaguelette decompositions, etc., which are naturally expressed in the continuous setting. This being said, we remark that our results below can in principle be extended to discretely sampled f i and dY i in the training data, with the essential modifications being in Section 4.2, where we use the approximation properties of wavelets in order to prove approximation properties for SU-nets.

Definition 6 (Trained SU-nets). 1) We say that a trained SU-net F is a solution to the optimization problem
where we minimize over the filters ψ ∈ H r , α (j) , β (j),e , a (j) , b (j),e and the thresholds τ j .
2) For ρ > 0 fixed, we say that a network F (

Remark 6.
Clearly, a trained SU-net is also ρ-approximately trained for any ρ > 0. We consider ρ-approximately trained networks for an important reason: in practice, training is a difficult, nonconvex problem for which few guarantees are known (although first results look promising, see e.g. [25, 30] and references therein). In that sense, it is reasonable to allow deviations from the exact solution to the optimization problem (11), which is unattainable in practice, and instead consider approximate solutions that may be reached in practice. From the theoretical side, this relaxation is also valuable in order to find an approximate minimizer with rational weights, as required in Remark 4.
Assumption 3 (Choice of parameters). Given J, s, β, N, σ, a 1 > 0, define the parameters r = max{s, J, d/2 + 1}, R = r + β, S filter = (12R + 1) d , and
Remark 7. The support size above corresponds to Daubechies wavelets with R continuous derivatives and M = 1 + 6R vanishing moments, i.e. (2M − 1) d , see Example 3. The choice of the smoothness follows from the requirement that Daubechies wavelets of smoothness R approximate functions in C s well (which is needed in Proposition 2 below). Classical results (see e.g. Section 4.3.1 in [19]) imply that the wavelets need to have M > s vanishing moments and R > s continuous derivatives. In order to achieve this, we need to choose Daubechies wavelets with M = 1 + 6R vanishing moments and support (12R + 1) d (see the continuation of Example 3 above).
Furthermore, for our bounds on the covering numbers of F J in Proposition 3 we will need the smoothness of the wavelets to be as large as possible in order to reduce the size of our parameter space. There we are confronted with another effect: the larger the smoothness, the larger the support of the wavelets. And as stated in Proposition 3, if the support size S filter = (2M − 1) d grows unboundedly, our entropy bound increases.
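The trade-off between wavelet smoothness and filter support size can be made concrete with a short script. Only the closed-form choices r = max{s, J, d/2 + 1}, R = r + β, M = 1 + 6R and S filter = (12R + 1) d from Assumption 3 and Remark 7 are computed; the remaining constants (κ τ and the C ψ constants) are omitted, and the function name is ours:

```python
def sunet_parameters(s, beta, J, d):
    """Closed-form parameter choices from Assumption 3 / Remark 7 (sketch).

    kappa_tau and the C_psi constants are omitted here.
    """
    r = max(s, J, d / 2 + 1)      # smoothness of the Daubechies wavelets
    R = r + beta                  # extra smoothness absorbing the beta-smoothing
    M = 1 + 6 * R                 # vanishing moments
    S_filter = (12 * R + 1) ** d  # filter support size (2M - 1)^d
    return r, R, M, S_filter

# larger depth J forces larger smoothness r and hence larger support S_filter:
for J in (2, 4, 6):
    print(J, sunet_parameters(2, 1, J, 1))
```

The loop illustrates the effect discussed above: as r = max{s, J, d/2 + 1} grows with the depth J, the support size S filter grows with it, which inflates the entropy bound of Proposition 3.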
A reasonable compromise between these effects is to choose wavelets with r = max{s, J} continuous derivatives, where J is the depth of our SU-net (see the proof of Proposition 4).
Theorem 1. Fix s, β, σ, N, L, a 1 > 0 and let
For r, S filter , C ψ,L 2 , C ψ,H r and ρ N,σ as in Assumption 3, let F ∈ F J (r, R, S filter , κ τ , C ψ,L 2 , C ψ,H r ) be a ρ N,σ -approximately trained network as in Definition 6 with N training data sampled from the probability distribution Π from Definition 2 in C s L and with noise level σ. Assume that the data arises from a β-smoothing operator T ∈ T β,CT as in Definition 1 and satisfying Assumption 1. Then there is a constant C > 0 independent of σ, L and N such that

M. del Álamo
Proof. The result follows from Proposition 4 below.
We remark that the constant in (15) is explicit, and can be found in the proof of Proposition 4.
Remark 8. The expectation in (15) is taken with respect to the observed data f following the distribution Π (and also with respect to the training data f i coming from Π, and with respect to the noise in the training data W i and in the observed data W ). On one hand, this means a clear limitation to the predictive power of the trained network. On the other hand, it is only natural that a network trained on a specific set of data generalizes well on similar data, but not necessarily on unrelated data. We find this result valuable in that it shows the limitations of neural networks trained on a limited set of data.

Remark 9.
In the setting of Theorem 1, let the size of the training set be given by √N = σ^{−2γ} for a parameter γ > 0. Then we have the following regimes:
- If γ < (2s + 2β + 3d/2)/(2s + 2β + d), the right-hand side of (15) behaves like N^{−s/(2s+2β+3d/2)} (log N ) 3 , which is slower than the classical nonparametric estimation rate. This is the regime where we do not have enough training data, and the lack of knowledge of T dominates the error. Still, the error tends to zero as N increases.
- If γ ≥ (2s + 2β + 3d/2)/(2s + 2β + d), then the right-hand side of (15) behaves like σ^{4s/(2s+2β+d)} , with an additional logarithm if equality holds. Ignoring the logarithm, this is the minimax rate of estimation for functions in C s L . This means that F performs optimally in this regime. In particular, here we have so many samples N in the training set that not knowing the operator is not an issue.
Regarding the optimality of the result in Theorem 1, we have seen that the right-hand side matches the minimax lower bound for N large enough. On the other hand, if N ≤ σ^{−4(2s+2β+3d/2)/(2s+2β+d)} , the rate is driven by N and we do not know whether there is a matching lower bound in this regime. We suspect that the rate is in fact suboptimal here, as the exponent of N seems arbitrary and arises from a complicated balance of terms in the entropy bound. As explained in Remark 11 below, one possibility could be to find an alternative way of approximating the vaguelettes, possibly using additional information on the operator. More generally, we suspect that a different approach to bounding the entropy term could yield an improvement in the exponent, but we do not pursue this issue any further in this work.
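To make the two regimes of Remark 9 concrete, the following sketch evaluates the rate of Theorem 1 up to constants and logarithmic factors. The function name and the example values are ours, not from the paper:

```python
import math

def convergence_rate(s, beta, d, sigma, N):
    """Polynomial order of the rate in Theorem 1, as dissected in Remark 9."""
    gamma_star = (2 * s + 2 * beta + 1.5 * d) / (2 * s + 2 * beta + d)
    # sqrt(N) = sigma^{-2 gamma} defines the exponent gamma
    gamma = math.log(math.sqrt(N)) / math.log(sigma ** -2)
    if gamma < gamma_star:
        # too little training data: the unknown operator dominates the error
        return N ** (-s / (2 * s + 2 * beta + 1.5 * d))
    # enough data: minimax rate for s-Hoelder functions
    return sigma ** (4 * s / (2 * s + 2 * beta + d))

# s = 2, beta = 1, d = 1, sigma = 0.1:
small = convergence_rate(2, 1, 1, 0.1, 100)      # data-starved regime
large = convergence_rate(2, 1, 1, 0.1, 10 ** 6)  # oversampled (minimax) regime
```

For σ < 1 fixed, increasing N moves the bound from the N-driven exponent to the σ-driven minimax exponent, with the crossover at γ = (2s + 2β + 3d/2)/(2s + 2β + d).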

Structure of the proof
With F J = F J (r, R, S filter , κ τ , C ψ,L 2 , C ψ,H r ), define the set of functionals that map functions g ∈ dom(T ) to nonnegative numbers. The expectation E W is taken with respect to the white noise process dW . In Proposition 1 we split the error into a stochastic error, an approximation error and the optimization error ρ N,σ . The first two errors are taken care of in Propositions 2 and 3.
See Section 4.1 for the proof.
Proposition 2 (Approximation error). For β ≥ 0, let T be a β-smoothing operator as in Definition 1 and satisfying Assumption 1. For s, J ∈ N and r, S filter , κ τ , C ψ,L 2 , C ψ,H r as in (12), consider the set of SU-nets F J = F J (r, R, S filter , κ τ , C ψ,L 2 , C ψ,H r ). Then we have for a constant C > 0 independent of σ and L.
The proof is based on choosing F to be the operator that performs thresholding with respect to the wavelet-vaguelette decomposition of T , see Section 4.2.

The proof of Proposition 3 is based on the usual metric entropy techniques and is given in Section 4.3.

Remark 10.
In order to prove entropy bounds on H J , we apply the classical entropy integral bound to it. Since this bound is usually applied to classes of functions, and not to classes of functionals, we review here the key assumptions needed on H J . Finally, in order to apply the entropy bound in Theorem 3.5.1 in [19] to H J , we need to show that it is countable. This follows from the fact that the set of SU-nets is countable, see Remark 4. All in all, (H J , ρ̃ J ) is a separable pseudo-metric space with the pseudo-metric given by
where f i ∼ Π are independent random functions from the training data.

Remark 11. The term (C ψ,H r )^{d/(2r)} in the upper bound in Proposition 3 is responsible for the awkward-looking exponent s/(2s + 2β + 3d/2) in the final convergence rate (15). The structure of the term (C ψ,H r )^{d/(2r)} is intuitive from the viewpoint of empirical processes: we are dealing with the entropy of a set of d-dimensional and (roughly) r-times differentiable functions, so bounds of the form K^{d/(2r)} are expected, where K is the radius of the set. The issue is that we need the upper bound C ψ,H r to grow like 2^{J(r+β)} (see Assumption 3) in order for our network to be able to approximate vaguelettes of the right regularity. A possible way around this issue, and hence an improvement of the convergence rate in (15), would be to devise an alternative way of approximating the vaguelettes.
Now we are ready to prove our main result by combining Propositions 1 to 3.

Proposition 4.
Let r, R, S filter , C ψ,L 2 , C ψ,H r and ρ N,σ be as in Assumption 3. Let F be a ρ N,σ -approximately trained SU-net as in Theorem 1. Assuming that J is given by (14), we have the bound
where the constant C > 0 is independent of T, f, σ, L and N . In particular, the claim is uniform over operators T ∈ T β,CT satisfying Definition 1 and Assumption 1.
Proof. First, choosing the parameters as in Assumption 3, the claim of Proposition 3 reads
which is bounded by a constant independent of J, σ, L, C T and N . Combining Propositions 1 and 2 with this result we get
Depending on the relation between σ 2 and √N , there are two different values of J that minimize the above expression. In order to analyze this, we introduce the exponent γ > 0 defined such that √N = σ^{−2γ} .
Oversampled regime: γ ≥ (2s + 2β + 3d/2)/(2s + 2β + d). In this case, the optimal choice of J is J = (1/(2s + 2β + d)) log 2 σ^{−2} , which yields the bound
Notice that the term ρ N,σ defined in (13) is of the same asymptotic order as the first two terms.
Undersampled regime: γ < (2s + 2β + 3d/2)/(2s + 2β + d). In this case, the optimal choice of J is J = (1/(2s + 2β + 3d/2)) log 2 (√N ), which yields the bound
Here, again, the term ρ N,σ is of the same asymptotic order as the first term.
Finally, notice that in both regimes the first terms dominate (i.e. are bigger), and only for the choice γ = (2s + 2β + 3d/2)/(2s + 2β + d) are both terms of the same polynomial order.
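The regime-dependent choice of the depth J in the proof above can be sketched numerically. The helper returns the real-valued exponent; in practice J would be rounded to an integer depth, and the function name is ours:

```python
import math

def optimal_depth(s, beta, d, sigma, N):
    """Depth J balancing the terms in the proof of Proposition 4 (sketch)."""
    gamma_star = (2 * s + 2 * beta + 1.5 * d) / (2 * s + 2 * beta + d)
    gamma = math.log(math.sqrt(N)) / math.log(sigma ** -2)
    if gamma >= gamma_star:
        # oversampled regime: J = log2(sigma^{-2}) / (2s + 2beta + d)
        return math.log2(sigma ** -2) / (2 * s + 2 * beta + d)
    # undersampled regime: J = log2(sqrt(N)) / (2s + 2beta + 3d/2)
    return math.log2(math.sqrt(N)) / (2 * s + 2 * beta + 1.5 * d)
```

In the undersampled regime the depth is limited by the amount of training data, while in the oversampled regime it is the noise level σ that dictates the resolution, exactly as in the two bounds above.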

Conclusion and outlook
In summary, in this paper we have considered inverse problems with unknown operators and proposed a method to solve them given training data. We proved a convergence rate for the method that depends on the noise level and the amount of training data, and that coincides with the minimax rate in the regime where the amount of data is above a threshold. We remark once again that the main advantage of our result over the available literature is that our assumption on the training data is minimal and realistic, while the assumptions used in the literature of inverse problems with unknown operators are strong and not always realistic.
Further, we have employed neural networks to solve our problem, hence showing how they can be applied to solve inverse problems in general. In particular, this is, to the best of our knowledge, the first work in which a convergence guarantee is proven for neural networks in inverse problems. Crucially, our analysis yields concrete prescriptions on how to choose the many hyperparameters of the network, such as its depth, the number of channels, the size of the convolution kernels, etc.
As it stands, our result concerns continuous neural networks that are not employed in practice. As remarked above, this is done for mathematical convenience, in order to profit from the approximation theory available in the continuous setting. We see this issue pragmatically, and merely think of the continuous neural net as a mathematical model for the discrete one.
Finally, we remark that in this paper we have proposed a different way of thinking about neural networks, that is, to see them as operators (or functionals), and have presented some techniques to analyze them. We see this, however, as a basic result proven with basic techniques, and both of them can be extended in a variety of directions, a few of which we mention here.
First, the generalization of the present results to more general operators, either lacking translation invariance or even a wavelet-vaguelette decomposition, is of interest. We suspect that dealing with this would require modifying the architecture of the network, probably by introducing a block of dense layers before the SU-net kicks in.
Another generalization of interest is to allow a different noise level in the training and test data. This could be done, again, by modifying the network architecture and including a "module" that is able to estimate the noise level of the input, plus skip connections that bring the estimated noise level to the thresholds in the ReLU nonlinearities. For this extension, it is crucial to compute how this change in architecture affects the entropy estimates.
Further, it would be of interest to look into the distribution Π in more depth: on one hand, it would help to understand what it looks like in practical applications (see e.g. the works of Mumford, such as [23]) and which regularity/structural properties it satisfies. Moreover, it is possible to adapt our results to the distribution Π more precisely, and to try to find the "right" regularity exponent s, and thus the optimal convergence rate for data from a distribution Π, together with the corresponding lower bound. As a first approximation, one could consider the covering numbers of the support of Π, and not of the whole of C s L , in the computation of the entropy bound.
As a last remark, we believe that our techniques can be used to analyze neural networks in other problems, such as segmentation or classification of high-dimensional (continuous) signals, for which they are not yet fully understood.

Proofs
In this section we prove the main propositions of the paper. For compactness of notation, we employ special notation for the expected values: E f denotes the expected value with respect to the random function f ∼ Π, drawn from the distribution Π, and E W denotes the expected value with respect to the white noise process dW .

Proof of Proposition 1
Proof of Proposition 1. Conditionally on the f i 's, we have
The difference of the first two terms can be bounded as follows
where in the last equality we define the functions
The second term in (19) can be dealt with as follows: conditionally on dW 1 , . . . , dW N we have for any F * ∈ F M , using the properties of F in Definition 6. Consequently, using conditional expectations we get
and since the element F * ∈ F M was arbitrary, we can take the infimum in the right-hand side, which yields the claimed bound.

Approximation properties
Here we prove Proposition 2 considering two separate cases for the sake of illustration: first, the case that T is the identity operator, so we just have a denoising problem. In this case, we use an SU-net to perform wavelet thresholding, see Proposition 5. And second, the case where T is a nontrivial smoothing operator, in which case we use an SU-net to perform thresholding with respect to its wavelet-vaguelette decomposition, see Proposition 6.
Proposition 5 (Approximation in the regression model (T = id)). Let J ∈ N be arbitrary and L > 0. Let Π be a probability distribution on C s L as in Definition 2. Consider the class F J = F J (r, R, S filter , κ τ , C ψ,L 2 , C ψ,H r ) from Definition 5 with parameters as in Assumption 3. Then there is a network F ∈ F J such that, for data with dW and f independent, F (·) achieves the estimation error where C > 0 is a numerical constant independent of J, σ and L.
The proof consists of two stages: first, we show that an SU-net can be used to compute the wavelet thresholding operator of a signal; and second, we apply the well known theory on wavelet thresholding to bound its accuracy.
Proof. Let ϕ j,k,e denote a family of Daubechies wavelets with M = 1 + 6R vanishing moments, and hence R > s continuous derivatives. Recall from Remark 2 that its filters h and g e satisfy
and the continuous filters are chosen to be
It is immediately verified that the filters satisfy the conditions of Definition 5 with the constants of Assumption 3. In particular we have
since ϕ 0,0,0 has R = r continuous derivatives, see equation (10) above. All in all, we conclude that the chosen SU-net F indeed belongs to F J with the chosen parameters.
Now, by construction the function F (g)(x) is the wavelet thresholded version of g, where its wavelet coefficients c j,k,e = ⟨g, ϕ j,k,e ⟩ are soft thresholded by a value τ j > 0. Classical results in nonparametric regression (see e.g. [15]) imply that choosing the threshold of the order τ j = σ √ 2 j yields a bound
where ‖f ‖ C s denotes the s-Hölder norm of f , and c 1 , c 2 > 0 are numerical constants independent of f , J and σ. Taking the supremum over f ∈ C s L gives the bound
which we use to bound the expectation with respect to f ∼ Π using the inequality
This completes the proof.
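The wavelet-thresholding operation that the SU-net implements in this proof can be illustrated with a minimal numerical sketch. As a simplification we use the Haar wavelet (instead of the higher-order Daubechies wavelets of the proof) and the universal threshold as a stand-in for the level-dependent thresholds τ j ; all names are ours:

```python
import numpy as np

def haar_dwt(x, levels):
    """Orthonormal 1d Haar analysis: [detail_1, ..., detail_levels, approx]."""
    coeffs, a = [], np.asarray(x, dtype=float)
    for _ in range(levels):
        pairs = a.reshape(-1, 2)
        coeffs.append((pairs[:, 0] - pairs[:, 1]) / np.sqrt(2))  # detail
        a = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)             # approximation
    coeffs.append(a)
    return coeffs

def haar_idwt(coeffs):
    """Inverse of haar_dwt (perfect reconstruction)."""
    a = coeffs[-1]
    for detail in reversed(coeffs[:-1]):
        out = np.empty(2 * a.size)
        out[0::2] = (a + detail) / np.sqrt(2)
        out[1::2] = (a - detail) / np.sqrt(2)
        a = out
    return a

def soft(c, tau):
    """Soft thresholding, as performed in the skip connections of the SU-net."""
    return np.sign(c) * np.maximum(np.abs(c) - tau, 0.0)

def denoise(y, sigma, levels):
    """Wavelet thresholding estimator (universal threshold, a simplification)."""
    coeffs = haar_dwt(y, levels)
    tau = sigma * np.sqrt(2 * np.log(y.size))
    thresholded = [soft(c, tau) for c in coeffs[:-1]] + [coeffs[-1]]
    return haar_idwt(thresholded)
```

For a signal of length 2^J , levels = J gives the full multiscale decomposition; switching to the level-dependent thresholds of the proof only changes the line defining tau.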
Proposition 6 (Approximation in the inverse problem model (T ≠ id)). Let J ∈ N be arbitrary. Consider the class of networks F J = F J (r, R, S filter , κ τ , C ψ,L 2 , C ψ,H r ) with parameters as in Assumption 3. Let the operator T and the set M satisfy Definition 1 and Assumption 1. Then there is a network F ∈ F J such that, given data
where Π is a probability distribution in C s L and C > 0 is a numerical constant independent of σ, J and L.
Proof. The proof essentially follows the same steps as the proof of Proposition 5, so we just point out the differences. The main difference is that here we do not want to perform wavelet thresholding with an SU-net; instead, we want to threshold the wavelet-vaguelette decomposition of the data. In order to do that, consider a basis of Daubechies wavelets ϕ j,k,e with 1 + 6R vanishing moments, just as in the proof of Proposition 5. We hence define an SU-net F whose discrete filters equal the discrete filters of the Daubechies wavelets as in (22), and whose continuous filter in the last layer is chosen to be a father wavelet, i.e., φ(x) = 2 Jd/2 ϕ 0,0,0 (2 J x).
For the continuous filter ψ in the first layer of F , we consider the wavelet-vaguelette decomposition of the operator T , which exists by Assumption 1. Moreover, by the assumption of translation invariance in Definition 1, the vaguelettes are all given by translations of one function, which we denote by ψ. Due to the β-smoothing property of T , this function ψ, which is implicitly defined by the equation T * ψ = 2 Jd/2 ϕ 0,0,0 (2 J x), satisfies
using equation (10) to compute the Sobolev norms, and the fact that ϕ 0,0,0 has R = r + β continuous derivatives. This choice of ψ and its bounds are hence in accordance with the parameters in Assumption 3, and we conclude that the constructed network F belongs to F J with the desired parameters.
But for this choice of the network, the first layer performs the following operation:
for k ∼ N (0, 1) no longer independent. This is however not an issue, as it is well-known that thresholding these observations with a slightly modified threshold yields optimal results (see e.g. [29]), where the modified threshold is of the same order of magnitude as the universal threshold σ √ 2J, so it lies in our interval [0, σ 2 Jβ log N ] of allowed thresholds.
With this choice of ψ in the first layer, we have transformed the data into coefficients with respect to the Daubechies wavelet basis, at the cost of blowing up the noise level by a factor ‖ψ‖ L 2 = O(2 Jβ ), which is why we need to choose
The following layers in the SU-net perform the same operations as in Proposition 5: computation of the wavelet coefficients, thresholding, and synthesizing the function back together. The same theory as for the reconstruction properties of wavelets applies. Hence, taking the increased noise level into account, the final result (23) now reads
The rest of the argument is identical to the steps after equation (23) above, so we do not repeat it.
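The noise amplification by the factor ‖ψ‖ L 2 = O(2 Jβ ) and the correspondingly inflated thresholds can be illustrated in the Fourier domain, where translation-invariant operators diagonalize. The following sketch is our own simplification (a Fourier analogue, not the wavelet-vaguelette construction of the proof): it deconvolves a periodic blur and thresholds each Fourier coefficient at a level proportional to the local noise amplification 1/|k̂(ξ)|:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
grid = np.linspace(0, 1, n, endpoint=False)
f = np.exp(-100 * (grid - 0.5) ** 2)          # smooth ground truth

# translation-invariant smoothing operator T: periodic moving average
width = 9
kernel = np.zeros(n)
kernel[:width] = 1.0 / width
k_hat = np.fft.fft(kernel)

sigma = 0.01
y = np.real(np.fft.ifft(k_hat * np.fft.fft(f))) + sigma * rng.standard_normal(n)

# naive inversion amplifies the noise at frequency xi by 1 / |k_hat(xi)| ...
safe = np.where(np.abs(k_hat) > 1e-3, k_hat, 1.0)
coeff = np.fft.fft(y) / safe
# ... so the thresholds must be inflated by the same factor
tau = 3 * sigma * np.sqrt(n) / np.maximum(np.abs(k_hat), 1e-3)
f_hat = np.real(np.fft.ifft(np.where(np.abs(coeff) > tau, coeff, 0.0)))
```

Frequencies where the operator damps strongly get very large thresholds and are discarded, mirroring how the inflated thresholds σ 2^{Jβ} suppress the amplified noise in the fine wavelet scales.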

Entropy bounds
In this section we prove Proposition 3 by applying the usual covering number techniques to the class H J of functionals. The covering numbers of this class are then related to those of the class F J of neural nets, which we bound in Section 4.3.2. The key ingredient for bounding the covering numbers is an analysis of the stability of neural networks under perturbations of their parameters, given in Section 4.4.
Proof of Proposition 3. We use Theorem 3.5.1 in [19] applied to the class of functions H J = H J (r, R, S filter , κ τ , C ψ,L 2 , C ψ,H r ), which maps C s L functions to R. P N is hence the empirical distribution of the training data f i ∼ Π in C s L , and we have
using Claim 2 below, as well as ‖f i ‖ ≤ L and ‖T ‖ op ≤ C T . In this setting, applying Theorem 3.5.1 in [19] with the modification that 0 ∉ H J , which gives an additional term T 0 , yields
where ρ̃ N is the random semi-metric in (17) and D(H J , ρ̃ N , δ) denotes the δ-packing numbers of H J with respect to ρ̃ N . The additional term T 0 is given by
for an arbitrary element h 0 ∈ H J and independent centered sub-Gaussian random variables ξ i with unit variance. We bound this term by choosing the functional h 0 that corresponds to the zero SU-net F (·) = 0, i.e. the network that always returns zero. This network belongs to F J , as we can take all the filters to be zero. For this choice of h 0 , the term T 0 can be bounded as
It remains to compute the integral above. By Proposition 7 below, we have an upper bound on the covering numbers of H J , which translates into a bound on the packing numbers by the inequality D(H J , ρ, τ ) ≤ N (H J , ρ, τ /2). Using this and equation (24) yields
where the constants H J and C W are defined in the statement of Proposition 7.
Applying Claim 1 to bound the integral yields
Using that B ≥ 2 Jd/2+d 4C 2 ψ,L 2 J 2 max{LC T , σ} 2 , we see that
The expression above can hence be bounded by
For J large enough, there is a constant C 0 such that
Define further the constant C 1 = 2 3(d+1)/2 52 d/(2r) 1−d/(2r) . This implies that
Putting everything together yields
which yields the claim.

Claim 1.
For K ≥ 1, A, a, α, C > 0, r > d/2 and C ≥ e A, we have This completes the proof.

Covering numbers of H J
Given a set of independent random functions f i ∼ Π, define a random semi-metric in F J by
In this subsection, we bound the covering numbers of H J with respect to ρ̃ N by upper bounding them with the covering numbers of F J with respect to the semi-metric ρ N .
Proposition 7. The covering numbers of H J (r, R, S filter , κ τ , C ψ,L 2 , C ψ,H r ) satisfy
where
Proof. By Claim 3 we have the inequality
and the right-hand side can be bounded with Proposition 9 to yield the result.

Claim 2. For any
for F with filters φ, ψ and α (j) , a (j) , etc., as in Definition 5, we have the bound
Proof. Using the definition of h and the bounds in Proposition 10 for bounding
Claim 3. Consider the random semi-metrics ρ̃ N in H J (r, R, S filter , κ τ , C ψ,L 2 , C ψ,H r ) and ρ N in F J (r, R, S filter , κ τ , C ψ,L 2 , C ψ,H r ), based on the same set of random functions f i ∼ Π. Then, conditionally on the functions f i , we have
where in the last inequality we use Claim 2. Hence, we have
which yields the claim.

Proposition 8.
Let F, G be networks in F J , and denote the differences between their filters a (k) by Δ (a,k) , between their filters ψ by Δ (ψ) , and so on. Then
Proof. We construct a sequence {F l , l = 0, . . . , E} of SU-nets contained in F J that begins at F 0 = F , ends at F E = G, and in which each pair of consecutive nets differs in only one filter. Since there are 1 + 2 d+1 J filters and J thresholds in an SU-net, we can achieve this with a sequence of E = 1 + (2 d+1 + 1)J + 1 nets (the "+1" being F 0 = F ). With this construction, we have
P α,k + P a,k + ∑ e P b,k,e + P β,k,e + P τ,k−1 ,
where P ψ denotes the perturbation in the filter ψ, P α,k that in the filter α (k) , and accordingly for the other terms. By Proposition 11 below, each of those terms can be bounded to yield the claim.
Proposition 9. Let J, S filter ∈ N and r, κ τ , C ψ,L 2 , C ψ,H r > 0 be fixed. The covering numbers of F J (r, R, S filter , κ τ , C ψ,L 2 , C ψ,H r ) with respect to the semi-metric ρ N satisfy
where H J = 2 d J S filter log(3 max{1, κ τ 2 Jd/2 }) and C W = C ψ,L 2 26 J 2 2 Jd/2+d max{C T L, σ, 1}.
Proof. First notice that we can cover a ball of Euclidean radius R in R s with (3R/ε) s balls of radius ε (see e.g. Proposition 4.3.34 in [19]). Using that the discrete filters in Definition 5 have support S filter and Euclidean norm bounded by one, this means that we can cover the set of all possible filters in F J with filters whose Euclidean distance to each other is not more than ε. The second term arises from the continuous filter ψ, which belongs to the ball of radius
Using equation (25) in Proposition 8, this implies that we can construct N filters · N thres SU-nets that cover F J with d i -balls of radius
using that E|X| = √(2/π) for X ∼ N (0, 1), and that ‖f i ‖ L 2 ≤ ‖f i ‖ C s ≤ L and ‖T ‖ op ≤ C T . This means that we have N filters · N thres SU-nets that cover F J with ρ N -balls of radius C ψ,L 2 26 J 2 2 Jd/2+d max{C T L, σ, 1} ε.
Defining δ := C ψ,L 2 26 J 2 2 Jd/2+d max{C T L, σ, 1} ε, we can reformulate the above statement as follows: we can cover F J with
N δ := 3^{2 d J S filter} · exp{2 3(d+1) (C ψ,H r C ψ,L 2 26 J 2 2 Jd/2+d max{C T L, σ, 1}/δ) d/r } · (C ψ,L 2 26 J 2 2 Jd/2+d max{C T L, σ, 1}/δ) 2 d J S filter · max{1, (κ τ 2 Jd/2 ) J }
balls of radius δ with respect to the semi-metric ρ N . The logarithm of this number can be further bounded as
log N δ ≤ 2 d J S filter log(3 max{1, κ τ 2 Jd/2 })
Proof. The result follows from Theorem 4.3.36 in [19], where the constant is not specified. Here we give an upper bound for the constant. First notice that the conditions on r imply that r ≥ 1. Hence, the unit ball of H r is contained in the unit ball of H 1 , so we have log N (H r 1 , L 2 , 1) ≤ log N (H 1 1 , L 2 , 1) ≤ K H 1 .
We hence have reduced the problem to computing the constant K H 1 instead of K H r . We proceed as in the proof of Theorem 4.3.36 in [19], but keeping track of the constants. In order to bound the covering numbers of H 1 1 with balls of radius ε, let j 0 = log ε −1 . For any two functions f (a) , f (b) ∈ H 1 1 , we have
the ball of ℓ 2 -radius one in dimension 2 jd (2 d − 1) (i.e. in the k and e coefficients). By Proposition 4.3.34 in [19], this can be done with at most (3ε −1 ) 2 dj (2 d −1) balls. Notice that, by the equations above, the j coverings induce a covering of H 1 1 with balls of radius R = ∑ j≤j 0 2 −j ε j + 2 −j 0 +1 . Combining this with the above yields the result.
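The elementary covering bound used at the start of the proof of Proposition 9 — a Euclidean ball of radius R in R^s can be covered by (3R/ε)^s balls of radius ε — translates into a short log-covering-number helper. The function names are ours, and the second helper only mirrors the structure of the discrete-filter term H J = 2 d J S filter log(3 max{1, ...}) as a sanity check:

```python
import math

def log_covering_ball(dim, radius, eps):
    """log N(B_radius, eps) <= dim * log(3 * radius / eps), valid for eps <= radius
    (Proposition 4.3.34 in [19])."""
    return dim * math.log(3 * radius / eps)

def log_covering_filters(d, J, S_filter, eps):
    """Discrete-filter part of the entropy of F_J: 2^d * J filter blocks,
    each a unit ball in dimension S_filter, covered at resolution eps."""
    return (2 ** d) * J * log_covering_ball(S_filter, 1.0, eps)
```

The linear growth of this bound in S filter is precisely the effect discussed in Remark 7: larger filter support inflates the entropy of the class.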

Stability of F under perturbations
In this section we prove stability estimates for networks under perturbations of their filters. For simplicity of notation, all norms ‖·‖ in this section are ℓ 2 norms unless otherwise stated.
Proposition 10 (Size of the terms). Using the notation and assumptions from Definition 5 for a network F with input f , the following estimates for the inner layers hold: ‖s (J−l) ‖ ≤ 2 Jd/2 |⟨f, ψ⟩ L 2 |