Towards adaptivity via a new discrepancy principle for Poisson inverse problems

A new form of the discrepancy principle for Poisson inverse problems with compact operators is proposed and discussed in relation to various other proposals. It is shown that filter-induced spectral regularization with an a priori chosen smoothing parameter produces estimators that are rate-minimax under source conditions on the estimated function. With the discrepancy principle used for an a posteriori choice of the smoothing parameter, the filter-induced solutions are consistent (in probability), but the convergence rates under source conditions are suboptimal, at least in the finitely smoothing case, as often happens when discrepancy principles are used in stochastic inverse problems. Finite sample performance of the proposed procedure applied to a stereological problem of Spektor, Lord and Willis is illustrated with a simulation experiment.

MSC2020 subject classifications: Primary 62G05, 93E10, 65J20, 45Q05.


Introduction
Let Π_ng be a Poisson process on a measure space (Y, B_Y, μ_Y) with an intensity function ng that is integrable w.r.t. μ_Y. The parameter n ∈ N is known. It quantifies the "experiment size" and will asymptotically tend to infinity. The function g is unknown. Additionally, we assume that

g = Kf (1)

with some operator K. The Poisson inverse problem consists in estimating f, given the observed Poisson process Π_ng. In applications, nf is usually itself an intensity function of an unobservable Poisson process on a possibly different measure space, say (X, B_X, μ_X), and K is related to a data distortion mechanism that maps the points of the unobservable process on X to the points of the observable process Π_ng on Y. A plethora of practical problems fit into this general scheme, including, e.g., restoration of medical images, image deblurring in microscopy and astronomy, and some stereological problems.
In this article, we assume that K is a linear and compact operator from L²(X, B_X, μ_X) to L²(Y, B_Y, μ_Y) of the form

(Kf)(y) = ∫_X K(x, y) f(x) dμ_X(x). (2)

Compactness of K is ensured, e.g., if the kernel K is square-integrable, or if X = Y is a compact subset of R^k, μ_X is the Lebesgue measure and K is weakly singular. For simplicity, we also assume that K is injective. The goal is to estimate f under the L² loss. It is well known that, in this setup, K⁻¹ is unbounded as an operator on Im K and, in effect, some sort of regularization is necessary. For a general introduction to Poisson processes and the related inverse problems, we refer to [25,18,20]. A good introduction to regularization techniques and related problems in a non-stochastic setup is given, e.g., in [14]. A general treatment of minimax convergence rates of so-called filter-induced estimators under various noise models can be found in [4]. Typically, optimal values of regularization parameters depend on the, generally a priori unknown, regularity of the estimated function. Therefore, in all cases, data-driven procedures for choosing the value of the regularization parameter, preferably leading to adaptive estimators, are of central interest. Adaptivity is meant here as the ability to attain the same (preferably minimax) convergence rates as those achieved in cases with a priori known smoothness. Adaptivity issues for inverse problems have been studied mainly under Gaussian white noise assumptions. For a recent review see, e.g., [5] and [6], where, in particular, limits for the performance of stopping criteria based on residuals are obtained for a discretized Gaussian white noise model and various spectral filter estimators. In [15,16], a discrepancy principle is studied in a general stochastic inverse problem setup with unknown error distribution, under the assumption that repeated, independent, noisy measurements of the left-hand side of equation (1) are available. In the present article, we focus on Poissonian noise, not satisfactorily covered in [15], but we borrow several ideas from that article.
Poisson inverse problems were studied in some generality in, e.g., [29,30,1,17]. Rate-minimax solutions of such problems with L² risk were found mainly on a case-by-case basis (e.g., [8,13,32]). Adaptive thresholded wavelet solutions under the L² risk, rate-minimax over some Besov balls (up to a log factor), were found for a stereological Spektor-Lord-Willis problem in [10,11]. In [12], a general Goldenshluger-Lepski oracle-inequality approach was adapted to the same problem and adaptivity of a specific kernel-based estimator was proved over a scale of restricted Sobolev spaces. A more general adaptive thresholded wavelet solution was constructed in [1], but only for the Kullback-Leibler (KL) risk. A variant of the Lepski balancing principle suitable for Poisson inverse problems was also studied in [17] in connection with penalized maximum likelihood estimators. In a recent article [19], circular deconvolution was studied for Poisson data with weighted L² risk, an adaptive solution based on model selection was constructed, and many additional references on Poisson inverse problems were given.
An alternative approach to the problem of data-driven choice of regularization parameters was proposed in [34] (see also [2]). A version of the Morozov discrepancy principle suitable for discretized Poisson inverse problems was constructed there, based on matching the KL divergence to its expectation. Existence and uniqueness of the resulting penalized maximum likelihood estimators were proved in [3]. This was further refined in [26] by proving consistency and convergence rates of the KL risk under some source conditions. However, as the KL distance primarily measures the closeness of the distributions and our main interest is in function rather than distribution estimation, we prefer to work with the L² loss.
In this article, we propose a new version of the discrepancy principle for Poisson inverse problems that, when compared to [3,26], is applicable to a broader class of estimators (i.e., not just penalized maximum likelihood estimators), leads to consistent (in probability) filter-induced estimators with explicit and close to optimal rates of convergence under the L² distance, and does not require any discretization. The latter may be of importance, because it is known that discretization effects are not always negligible and may affect the convergence rates ([29,30]).

The new discrepancy principle
A nice feature of Poisson inverse problems is that the noise level depends on the signal and may thus be estimated from the data. Efficient, √n-consistent estimation of the noise level is possible in the pre-conditioned model

K*Kf = K*g =: q, (3)

and our version of the Morozov discrepancy principle will be formulated in terms of q, its estimator q̂_n and the noise level δ²_n := E‖q̂_n − q‖². The adjoint operator has the form (K*g)(x) = ∫_Y K(x, y) g(y) dμ_Y(y) and the observable process can be represented as a so-called mixed empirical process (cf. [25, p. 17])

Π_ng = Σ_{j=1}^{N_n} δ_{Y_j}, (4)

where N_n is Poisson distributed with mean n ∫ g dμ_Y, δ_a is the Dirac measure concentrated at a, and Y_j, j = 1, 2, . . ., form an i.i.d. sample from a density proportional to g, independent of N_n. Assume that K is non-negative, which is satisfied in all typical applications. Then, by Campbell's theorem (cf., e.g., [18, Sec. 3.2]), the estimator

q̂_n(x) := n⁻¹ ∫_Y K(x, y) dΠ_ng(y) = n⁻¹ Σ_{j=1}^{N_n} K(x, Y_j) (5)

satisfies

E q̂_n(x) = ∫_Y K(x, y) g(y) dμ_Y(y) = q(x), (6)

which means that q̂_n is an unbiased estimator of q(x). Moreover, if ∫_X ∫_Y K²(x, y) g(y) dμ_Y(y) dμ_X(x) is finite, then, again by Campbell's theorem,

δ²_n = E‖q̂_n − q‖² = n⁻¹ ∫_X ∫_Y K²(x, y) g(y) dμ_Y(y) dμ_X(x), (7)

which means that the noise level may be unbiasedly estimated with

δ̂²_n := n⁻² Σ_{j=1}^{N_n} ∫_X K²(x, Y_j) dμ_X(x). (8)

Let f̂_α be an estimator of f with a regularization parameter α. We propose to select α as an approximate solution of the following discrepancy principle (DP) equation

‖K*K f̂_α − q̂_n‖² = τ δ̂²_n, (9)

with q̂_n and δ̂²_n defined in (5)-(8) and with some fixed positive τ. This is a natural reformulation, for the pre-conditioned model, of the general Morozov discrepancy principle, as described, e.g., in [14].
In general, the DP equation (9) does not have to have a solution and, even if a solution exists, it does not have to be unique. If the left-hand side of (9) equals zero for α = 0, as will be the case (in the limit) for the filter-induced regularizations studied later in this article, we set, for definiteness,

α_n := sup{α > 0 : ‖K*K f̂_α − q̂_n‖² ≤ τ δ̂²_n} (10)

and refer to (α_n)_{n∈N} as the DP sequence of regularization parameters. A similar idea of building the DP equation for the pre-conditioned model was studied in [7] for inverse problems with Gaussian white noise error. In that case, it proved necessary to use a weighted version of the discrepancy in order to achieve optimal rates of convergence of DP-based solutions with a priori known smoothness. A posteriori rates, both in [7] and in our Poissonian case, are, however, suboptimal (and, in fact, identical, as will be seen later).
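For concreteness, the following minimal Python sketch shows how the DP parameter (10) can be approximated on a grid, once the SVD coordinates ⟨q̂_n, v_i⟩ and the estimate δ̂²_n are available. This is not the code of [24]; the function names, the grid and the Tikhonov filter used here are illustrative only.

```python
import numpy as np

def dp_alpha(sv, q_coef, delta2_hat, F, tau=1.1, alphas=None):
    """Approximate the DP parameter (10): the largest alpha on a grid for
    which the discrepancy ||K*K f_alpha - q_n||^2 <= tau * delta2_hat.

    sv         -- singular values sigma_i of K
    q_coef     -- SVD coordinates <q_n, v_i> of the pre-conditioned data
    delta2_hat -- estimated squared noise level (8)
    F          -- filter function F(alpha, lambda)
    """
    if alphas is None:                 # decreasing grid of candidates
        alphas = np.logspace(0, -8, 200)
    lam = sv ** 2
    for alpha in alphas:               # scan from large to small alpha
        # discrepancy in SVD coordinates:
        # sum_i (lam_i F(alpha, lam_i) - 1)^2 <q_n, v_i>^2
        disc = np.sum((lam * F(alpha, lam) - 1.0) ** 2 * q_coef ** 2)
        if disc <= tau * delta2_hat:
            return alpha
    return alphas[-1]

# example filter: Tikhonov, F_alpha(lambda) = 1/(lambda + alpha)
tikhonov = lambda alpha, lam: 1.0 / (lam + alpha)
```

Scanning the grid from large to small α mimics the supremum in (10), since the first grid point satisfying the discrepancy inequality is returned.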
The idea proposed in [7] was further developed (including some lower bounds) in [28], to allow for using residuals less smoothed than in [7], which proved essential for adaptation over some range of Sobolev-type ellipsoids. A version of DP based on smoothed residuals was also recently developed in [9] for nonparametric estimation of a regression function within the framework of learning algorithms embedded in a reproducing kernel Hilbert space. Again, smoothing the residuals proved essential for adaptation, especially over ranges of faster rates.

Filter-induced regularizations and estimators with a priori smoothing
Theoretical properties of the discrepancy principle proposed in the previous section will be studied for the family of estimators induced by spectral filters. In this section, we begin, however, with studying the properties of those estimators with an a priori chosen smoothing level. Recall that, for an operator K ∈ L(F, G) (the space of linear, bounded maps from F to G), a family of operators {R_α := F_α(K*K)K*, α > 0}, generated by a family of functions {F_α, α > 0} (a filter), is called a filter-induced regularization if R_α g → K⁻¹g for all g ∈ Im K, when α → 0. For Hilbert spaces F and G, and for a compact operator K, filter-induced regularizations can be conveniently defined in terms of the singular value decomposition (SVD): there exist a sequence σ_1 ≥ σ_2 ≥ · · · > 0 of singular values, finite or converging to zero, and corresponding sequences (v_i)_i and (u_i)_i of orthonormal right and left singular vectors such that K v_i = σ_i u_i, K* u_i = σ_i v_i and Span{v_i : i ≥ 1} = (Ker K)^⊥. Here and in what follows, ⟨·, ·⟩ denotes the inner product in the respective Hilbert spaces.
In order to control the variance of the estimators, we will usually consider so-called bounded filters, i.e., filters satisfying λ|F_α(λ)| ≤ C_R and |F_α(λ)| ≤ C_F/α for all α > 0 and all λ in the spectrum of K*K, with some constants C_R and C_F.

Since, in our setup, q̂_n estimates q = K*g, it seems natural to define f̂_α := F_α(K*K) q̂_n. Notice, however, that F_α(K*K) is an operator in L²(X, B_X, μ_X) but, in general, q̂_n does not have to be in L²(X, B_X, μ_X). Nevertheless, even in that case f̂_α may be well defined as an element of L²(X, B_X, μ_X) with

f̂_α := Σ_i F_α(σ²_i) ⟨q̂_n, v_i⟩ v_i, (11)

where

⟨q̂_n, v_i⟩ := n⁻¹ ∫_Y (K v_i)(y) dΠ_ng(y) = n⁻¹ σ_i ∫_Y u_i(y) dΠ_ng(y), (12)

provided that

E Σ_i F_α(σ²_i)² ⟨q̂_n, v_i⟩² = Σ_i F_α(σ²_i)² (⟨q, v_i⟩² + n⁻¹ σ²_i ⟨u²_i, g⟩) < ∞. (13)

The second equality in (13) holds true because of (12) and Campbell's theorem. Under condition (13), we thus take (11) and (12) as the definition of the estimator. Convergence rates of this estimator will be studied for smoothness classes defined via a so-called source condition (cf., e.g., [4]),

F_ν,ρ := {f : f = (K*K)^{ν/2} s, ‖s‖ ≤ ρ},

with some positive ν and ρ. For finitely smoothing K with σ_i ≍ i^{−b}, the class F_ν,ρ corresponds to an ellipsoid Σ_i i^{2bν} ⟨f, v_i⟩² ≤ R with some R, and one may expect minimax L² convergence rates n^{−2bν/(2bν+2b+1)} (cf. [23,4]).
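To make (11) concrete, here is a minimal Python sketch of the filter-induced estimator in SVD coordinates, together with two standard bounded filters used later in the paper; the function names are ours and finitely many coefficients are assumed to be kept in memory.

```python
import numpy as np

def filter_coef(sv, q_coef, alpha, F):
    """Coefficients <f_alpha, v_i> = F_alpha(sigma_i^2) <q_n, v_i> of (11)."""
    return F(alpha, sv ** 2) * q_coef

# truncated SVD: keep components with sigma_i^2 above the threshold alpha
tsvd = lambda alpha, lam: np.where(lam > alpha, 1.0 / lam, 0.0)

def iterated_tikhonov(alpha, lam, k=2):
    """k-times iterated Tikhonov filter (qualification 2k)."""
    return (1.0 - (alpha / (lam + alpha)) ** k) / lam
```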
Recall that the qualification of the filter {F_α, α > 0} is the maximal ν_0 such that for any ν ∈ (0, ν_0] there exists a constant C_ν > 0 such that

sup_λ λ^{ν/2} |1 − λ F_α(λ)| ≤ C_ν α^{ν/2} for all α > 0.

The following theorem shows that, in the finitely smoothing case, filter-induced estimators with a priori regularization attain the expected convergence rates uniformly over subsets of F_ν,ρ defined by an additional boundedness condition on ‖Kf‖_∞. Typically, the rates are minimax over F_ν,ρ (cf. [23,4]). In many cases, however, the rates remain minimax over subsets of F_ν,ρ with a sufficiently large upper bound for ‖Kf‖_∞. This is easily seen, for example, for the stereological Spektor-Lord-Willis problem, studied in [13,32], where the finite subsets of F_ν,ρ, used in the direct construction of lower bounds for the risk, have an obvious uniform upper bound for ‖Kf‖_∞.
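As a quick worked check of this convention (an illustration, not a statement from the paper), one can verify that the 1-time iterated Tikhonov filter F_α(λ) = 1/(λ + α) has qualification ν_0 = 2, the value quoted for it in the simulation section:

```latex
\[
1-\lambda F_\alpha(\lambda)=\frac{\alpha}{\lambda+\alpha},
\qquad
\lambda^{\nu/2}\,\frac{\alpha}{\lambda+\alpha}
=\alpha^{\nu/2}\,\frac{t^{\nu/2}}{1+t}\Big|_{t=\lambda/\alpha}
\le \alpha^{\nu/2}\quad\text{for }\nu\le 2,
\]
% since t^{nu/2}/(1+t) <= 1 for nu <= 2, while for nu > 2 it is unbounded
% in t, so no bound of the form C_nu * alpha^{nu/2} can hold; hence nu_0 = 2.
```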
Theorem 1. Let {F_α, α > 0} be a bounded regularizing filter with qualification ν_0 and let the singular values of an injective, compact operator K satisfy σ_i ≍ i^{−b} with some b > 0 (for b ≤ 1/2, assume additionally that F_α(λ) = 0 for λ ≤ α). Then, for any ν ∈ (0, ν_0], any M > 0 and α = α(n) ≍ n^{−2b/(2bν+2b+1)},

sup{ E‖f̂_α − f‖² : f ∈ F_ν,ρ, ‖Kf‖_∞ ≤ M } = O(n^{−2bν/(2bν+2b+1)}).

Although, as shortly discussed below, filter-induced estimators, at least for the Hilbert-Schmidt case with b > 1/2, may also be constructed and analysed by either embedding our problem in the general noise model considered in [4], or using the white noise model results in [7], we nevertheless decided to provide (see the Appendix) a direct proof of Theorem 1, because it also covers the case b ≤ 1/2.
To see the connection with the general noise model, let (Ω, F, P) be the basic probability space and write our model of observations in a form analogous to formula (2.11) in [4]:

Y = Kf + n^{−1/2} ε,

where Y := n⁻¹ Π_ng is a Hilbert space process, i.e., a bounded, linear operator from L²(Y, B_Y, μ_Y) to L²(Ω, F, P), and ε := √n (n⁻¹ Π_ng − g) is the scaled Wiener-Itô integral (cf. [20]), also understood as a Hilbert space process with the action

ε φ := √n (n⁻¹ ∫ φ dΠ_ng − ∫ φ g dμ_Y),

and E‖K*ε‖² = n E‖q̂_n − q‖² < ∞ for the Hilbert-Schmidt case and under the assumptions of Theorem 1 (cf. (7)). It follows directly from, e.g., Lemma 12.2 in [20] that the covariance operator of ε is the "multiplication by g" operator M_g: φ ↦ M_g φ := gφ, and its norm is ‖M_g‖ = ‖g‖_∞ ≤ M under the assumptions of Theorem 1. Also, E(εφ)² = ⟨φ, M_g φ⟩ ≤ M‖φ‖², so that condition (2.14) and the whole Assumption 1 in [4] are satisfied and the general theory applies. In particular,

E‖F_α(K*K)(q̂_n − q)‖² = n⁻¹ Σ_i F_α(σ²_i)² σ²_i ⟨u²_i, g⟩ ≤ M n⁻¹ Σ_i F_α(σ²_i)² σ²_i,

and the last expression is the special case of the upper bound in formula (3.5) in [4], obtained as the integral w.r.t. the counting measure on N. The rates for the finitely smoothing Hilbert-Schmidt case with b > 1/2 thus follow from Section 5.3 in [4]. The white noise model results from [7] can also be used in our problem, because our basic bias-variance decomposition (see the proof of Theorem 1) is formally identical to the analogous formula (11) in [7] and the whole rate calculation in [7] applies. In particular, not only the rates given in Theorem 1 follow (for b > 1/2), but one also covers the infinitely smoothing case with a weaker source condition traditionally studied with such operators, which is summarized for completeness in the following theorem.
Theorem 2. Let {F α , α > 0} be a bounded regularizing filter that satisfies the following qualification condition: there exist positive constants r and γ such that for all α > 0 and with

and let the singular values of an injective, compact operator
with C depending only on the filter and operator and on the function class parameters.
Finally, the polynomial case with b > 1/2 can also be viewed in relation to recent results in [15], because q̂_n can be thought of as an arithmetic mean of n independent copies of ∫ K(·, y) dΠ_g(y). If ‖g‖_∞ ≤ M, one has E‖q̂_n − q‖² = O(n⁻¹), and one could try to apply Theorem 2 in [15] directly to our pre-conditioned model (3), written as Af = q with A := K*K. This would, however, mean that one uses the regularization R⁰_α := F_α(A²)A and the estimator f̂⁰_α := R⁰_α q̂_n, with the source condition f = (A²)^{r/2} s = (K*K)^r s. Theorem 2 in [15] would then give for f̂⁰_α the convergence rate n^{−r/(r+1)}, which would correspond to n^{−ν/(ν+2)} under our source condition f = (K*K)^{ν/2} s. Our Theorem 1 gives for our estimator f̂_α the rate n^{−2bν/(2bν+2b+1)}, which is strictly better for b > 1/2. Note that E‖q̂_n − q‖² < ∞ is necessary for the applicability of Theorem 2 in [15], and that this condition is roughly the same as b > 1/2.
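The comparison of the two rate exponents is elementary and can be checked numerically; the snippet below is a plain illustration, not part of any procedure.

```python
# Compare the exponent 2*b*nu/(2*b*nu + 2*b + 1) of Theorem 1 with the
# exponent nu/(nu + 2) obtained via Theorem 2 in [15]: they coincide at
# b = 1/2 and the former is strictly larger for b > 1/2.
nu = 1.0
for b in (0.5, 1.0, 2.0, 10.0):
    thm1 = 2 * b * nu / (2 * b * nu + 2 * b + 1)
    dp15 = nu / (nu + 2)
    print(f"b = {b:5.1f}   Theorem 1: {thm1:.4f}   [15]: {dp15:.4f}")
```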

Consistency and rates of convergence with discrepancy principle
Analysis of the properties of filter-induced estimators with regularization parameters selected with our version of the discrepancy principle will use both some classic techniques, described, e.g., in [14], and some new ideas proposed recently in [15]. As discussed in the latter article, consistency with the L² risk cannot generally be expected if the plain discrepancy principle is used. Intuitively, this is because it can happen, although with diminishing probability, that the true noise level gets significantly underestimated, which may lead to a huge L² loss and, in effect, to the L² risk not converging to zero. We illustrate that with a counterexample that is an adaptation of a similar example from [15] to our Poissonian setup. On the other hand, L²-consistency can usually be guaranteed, although possibly with very slow rates, by adding to the DP procedure a so-called "emergency stop" (cf., e.g., [15]). We, however, do not follow the latter idea and only prove convergence to zero in probability of the L² loss, i.e., weak consistency, and obtain the corresponding convergence rates, both in close parallel to the development in [15].

Counterexample for convergence of the L² risk
Equivalently, in terms of equation (2) and with f = (f_i)_{i∈N}, let X = Y = N, let μ_X and μ_Y be the counting measures and K(i, j) = 10^{−i} δ_{i,j}, where δ_{i,j} stands for the Kronecker delta. We observe a Poisson process Π_ng on N or, equivalently, the independent Poissonian counts, say (Z_i)_{i∈N}, with means n g_i, where Z_i is the number of points Y_j in the representation (4) of Π_ng that are equal to i. Then, obviously, the quantities defined in Section 2 take the form q = (10^{−i} g_i)_{i∈N} and q̂_n = n⁻¹ (10^{−i} Z_i)_{i∈N}. For the TSVD filter, it is more natural to parametrize the estimator directly with the number, say k, of retained components and write

f̂_k := (n⁻¹ 10^{i} Z_i I{i ≤ k})_{i∈N}.

Denote with (k_n)_{n∈N} the corresponding DP sequence of truncation levels, defined in analogy to (10).
On Ω_n one has, for n large enough, a loss bounded away from zero and, finally, E‖f̂_{k_n} − f‖² ↛ 0 as n → ∞. The large error on Ω_n is not compensated by the small probability of Ω_n, and f̂_{k_n} is not L²-consistent.
It should be noted that, due to the exponentially decreasing P(Ω_n), this construction only applies to exponentially smoothing operators K.

Convergence in probability of the L² loss
The discrepancy principle defined in Section 2 is meaningful if the noise level is finite, which is guaranteed by the following assumption (cf. formula (7)).

Assumption 1. γ²_g := ∫_X ∫_Y K²(x, y) g(y) dμ_Y(y) dμ_X(x) < ∞.

For the results in this section, we also need the filter to be left-continuous, i.e., for every λ, the mapping α ↦ F_α(λ) is assumed left-continuous. This condition is clearly satisfied, e.g., for the k-times iterated Tikhonov regularization, truncated SVD and Landweber iteration.
One immediate consequence of this assumption is that the left-hand side of the DP equation (9), which can be written as

‖K*K f̂_α − q̂_n‖² = Σ_i (σ²_i F_α(σ²_i) − 1)² ⟨q̂_n, v_i⟩²,

is left-continuous in α. As it also equals zero for α = 0 (in the limit), a DP sequence of regularization parameters is well defined with formula (10) and it satisfies the inequality ‖K*K f̂_{α_n} − q̂_n‖² ≤ τ δ̂²_n. The following theorem applies, e.g., to truncated SVD and Landweber regularization, which are of infinite qualification, and, with k > 1, to the k-times iterated Tikhonov regularization, which is of qualification 2k.

Theorem 3. Let K be an injective, compact, Hilbert-Schmidt operator, let Assumption 1 hold, and let {F_α, α > 0} be a bounded, left-continuous regularizing filter with qualification ν_0 > 2. Then ‖f̂_{α_n} − f‖ → 0 in probability and, if f ∈ F_ν,ρ with 0 < ν ≤ ν_0 − 2, then ‖f̂_{α_n} − f‖² = O_P(n^{−ν/(ν+2)}).
In the finitely smoothing case, the obtained rates of convergence are suboptimal when compared to estimators with a priori regularization in Theorem 1. However, in the limit, as b ↘ 1/2 in Theorem 1, i.e., on the boundary of the set of cases handled by both theorems, the two rates coincide. On the other extreme, as b → ∞, the rates in Theorem 1 approach n^{−ν/(ν+1)}, while those in Theorem 3 remain constant.
It should be noted that the rate n^{−ν/(ν+2)} is the same (up to a log factor) as that obtained in [7] for DP in the pre-conditioned model with Gaussian white noise (cf. Remark 10 and Corollary 2, together with Example 3, in somewhat different notation). This suboptimal rate also matches the lower bound obtained in [28] for the discretized Gaussian white noise model and truncated SVD, with DP based on smoothed residuals (Proposition 4.2, with 2β = 2bν, p = b > 1/2 and α = 1, which corresponds to our version of DP for the pre-conditioned model).
Although the results in [28] do not directly apply to our Poissonian setup, Proposition 4.2 (case αp < 1/2) may suggest that adaptation could possibly be achieved also in our Poissonian setup, with DP based on less smoothed residuals. This may be a subject of further research.
Although stopping criteria based on cross validation, Lepski balancing principle or penalized risk estimation may allow for better adaptation (cf., e.g., [12] for the Poissonian case and the discussion of the idealized Gaussian white noise case in [5]), their computational cost may be prohibitive (cf., e.g., [12]). On the other hand, using DP is usually computationally much cheaper, because not all possible models have to be used in the optimization process, but its adaptation ability is often limited.

Simulation results
Finite sample behaviour of the proposed parameter selection rule was verified in the Spektor-Lord-Willis (SLW) stereological problem of unfolding the distribution of the radii of balls, randomly placed in an opaque medium, from an observed linear section through the medium. Early versions of the problem were described in [27] and [22] as models of some measurements in material sciences. The following formulation of the SLW problem as a Poisson inverse problem was taken from [31] and [13]. Assume that the balls' radii distribution is supported in [0, 1] and the balls' centers form a homogeneous Poisson process in R³ with an expected number of c points per unit volume. With Lebesgue measures μ_X and μ_Y, if the balls' radii distribution has a density ρ, then the observed radii of the line segments (intersections of the balls with the linear probe) form a Poisson process on [0, 1] with intensity function ng(y), where n is the "size of the experiment" related to the total length of the observed linear probe and

g(y) = 2y ∫_y^1 f(x) dx, with f := cρ.

In order to work with a simpler form of the SVD of K, the following modified formulation of the SLW problem was used in the simulation study. Instead of considering the operator K as an operator from L²([0, 1], μ_X) to L²([0, 1], μ_Y), we change the dominating measures, both in the data space and in the solution space, and consider the folding operator K as an operator from L²([0, 1], μ̃_X) to L²([0, 1], μ̃_Y) with dμ̃_X(x) = x dμ_X(x) and dμ̃_Y(y) = y dμ_Y(y). The functions g and f are then replaced with l(y) = g(y)/y and h(x) = f(x)/x, and the operator becomes

(Kh)(y) = 2 ∫_y^1 h(x) dμ̃_X(x). (17)

The SVD of this operator was found in [13].
In this case, the estimators δ̂_n and q̂_n take the following simple form:

q̂_n(x) = 2n⁻¹ #{j ≤ N_n : Y_j < x},  δ̂²_n = 2n⁻² Σ_{j=1}^{N_n} (1 − Y²_j),

where #A stands for the cardinality of the set A.
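A minimal sketch of their computation from the observed section radii Y_1, . . . , Y_{N_n}, assuming the kernel 2·I{x > y} of the operator (17), could read:

```python
import numpy as np

def slw_empirical(Y, n):
    """q_n and estimated squared noise level for the modified SLW problem,
    assuming the kernel K(x, y) = 2*I{x > y} of the operator (17)."""
    Y = np.asarray(Y)
    q_hat = lambda x: 2.0 / n * np.count_nonzero(Y < x)   # step function in x
    delta2_hat = 2.0 / n ** 2 * np.sum(1.0 - Y ** 2)
    return q_hat, delta2_hat
```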
The simulations were conducted in the Python environment, version 3.7, and the code is available in [24]. Selected results will be presented for the following four functions (taken from [13,11,12]): Beta(4,2), SMLA, SMLB and Bimodal. We set c = 1, so that all functions are probability densities. For a given density f of the spheres' radii and the experiment size n, artificial data samples were generated with the following algorithm, also sketched in code below. First, a number ñ was generated from the Poisson distribution with expectation n, using the method poisson from the package numpy.random. Then, ñ pairs (R, D) of independent random variables were generated: the ball radius R from the density f, with the acceptance-rejection method, and the distance D from the ball center to the linear probe from the density d(x) = 2x I_{[0,1]}(x), using the method beta from the package numpy.random. Pairs with D > R (the probe then misses the ball) were dropped; otherwise, Y = (R² − D²)^{1/2} was taken as the observed value. The mean numbers of the observed Y data points were 47.7%, 54.2%, 41.4%, and 35.4% of n, respectively, for the functions Beta(4,2), SMLA, SMLB and Bimodal.
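A minimal sketch of this data-generating algorithm (using numpy's Generator interface rather than the legacy numpy.random methods named above; the density bound for the acceptance-rejection step is an extra input of ours) could read:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_slw_sample(n, f, f_bound):
    """Artificial SLW data following the algorithm described above.
    f       -- density of the balls' radii on [0, 1]
    f_bound -- an upper bound for f, used by acceptance-rejection
    """
    n_tilde = rng.poisson(n)                       # Poisson experiment size
    R = np.empty(n_tilde)                          # ball radii drawn from f
    filled = 0
    while filled < n_tilde:                        # acceptance-rejection
        prop = rng.uniform(size=n_tilde - filled)
        keep = rng.uniform(size=prop.size) * f_bound <= f(prop)
        accepted = prop[keep]
        R[filled:filled + accepted.size] = accepted
        filled += accepted.size
    D = rng.beta(2.0, 1.0, size=n_tilde)           # distances, density 2x
    hit = D <= R                                   # the probe hits the ball
    return np.sqrt(R[hit] ** 2 - D[hit] ** 2)      # observed section radii Y

# example: Beta(4, 2) radii density 20 x^3 (1 - x), bounded by 2.2 on [0, 1]
Y = generate_slw_sample(10_000, lambda x: 20 * x ** 3 * (1 - x), 2.2)
```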
For each function and each experiment size, 10 artificial data samples were generated and the estimators f̂_n were constructed as f̂_n(x) = x ĥ_n(x), where ĥ_n was the estimator of the function h in the problem (17). Three main regularization methods were used (a sketch of the first one, with DP stopping, follows the list):
• Landweber iterative procedure,
• 2-times iterated Tikhonov procedure,
• truncated singular value decomposition (TSVD).
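As an illustration, here is a minimal sketch of the Landweber procedure with DP stopping, written in SVD coordinates; the default values of τ, β and the iteration cap mirror those reported below, and the function name is ours.

```python
import numpy as np

def landweber_dp(sv, q_coef, delta2_hat, tau=1.1, beta=1.97, max_iter=1000):
    """Landweber iteration for the pre-conditioned equation K*K f = q_n,
    stopped by the DP rule (9); beta is the relaxation parameter and
    max_iter the safety cap used in the simulations."""
    lam = sv ** 2
    f = np.zeros_like(q_coef)
    for _ in range(max_iter):
        if np.sum((lam * f - q_coef) ** 2) <= tau * delta2_hat:
            break                                  # discrepancy small enough
        f = f + beta * (q_coef - lam * f)          # Landweber update
    return f                                       # coefficients <f, v_i>
```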
For each estimation procedure, values of τ between 1 and 3 with step 0.1 were tested. The best results were obtained with τ set to 1.1, 1.4 and 1.0 for, respectively, the Landweber, TSVD and 2-times iterated Tikhonov procedures, and with the relaxation parameter (β) for the Landweber procedure set to 1.97; those values were used in all results presented below. Large values of τ, above 1.5 for the Landweber and Tikhonov procedures and above 2 for the TSVD procedure, led to a visible deterioration of the results. Varying the values of τ in the range of smaller values had only a minor impact on the obtained results. For the relaxation parameter of the Landweber procedure, values 2-, 4- and 8-times smaller than that finally selected were also tested. With all other parameters fixed, the results did not change significantly, but smaller values led to a materially longer computation time. The maximal number of iterations of the Landweber algorithm was restricted to 1000 as a safety measure against occasional "pathological" data.
For each data sample and for each estimation procedure, the best possible regularization parameter was found, i.e., the one minimizing the (numerically computed) L² distance between f̂_n and the true f. The best and worst cases (out of 10 data samples) are presented for the first two regularization methods in the left panels of Figures 1-8. The right panels of those figures illustrate the performance of the discrepancy principle versus the oracle, able to produce for each given data set the best possible solutions. Estimators based on TSVD usually tend to select too few components and perform somewhat worse.
The SMLB and Bimodal functions are much more difficult to estimate in the SLW problem than the "easy" functions Beta(4,2) and SMLA, which can adequately be approximated with only a few low-order singular functions of the SLW operator. The SMLB function is characterized by rapid local changes and needs substantially larger sample sizes than the "easy" functions. The same is true for Bimodal, because of relatively close local maxima separated by a shallow minimum. A clear improvement in the performance of the estimators is seen when n increases from 2 000 to 10 000 and to 1 000 000. The DP-based procedure seems to work reasonably well, the errors of the obtained estimators being close to those of the best possible solutions. The simulation results are also comparable with those presented in [13,11,12].
The simulation study was also performed for the standard 1-time iterated Tikhonov procedure, the qualification ν_0 of which equals 2. This procedure does not fulfil the assumptions of Theorem 3 and was included in the simulations to check the effect of the violated condition. In almost all cases, the results indicate a substantial increase in estimation errors in comparison to the other procedures, which may suggest that the assumption that the qualification of the filter be strictly larger than 2 may indeed be necessary in Theorem 3. Some exemplary outcomes are shown in Figure 9 for τ = 1 and sample size 1 000 000.
In all simulations, the performance of the various DP-based procedures was assessed from the point of view of a potential user who wants to apply them to his/her specific and fixed data set. Because of that, errors (rather than risks) are presented in the right-hand panels of Figures 1-8 and compared to the minimal errors that can be achieved with the regularization parameter chosen optimally for the given sample by an oracle that knows and uses both the true signal and the sample. This also allows for comparisons with other similar results for Poisson inverse problems, published earlier in [13,11,12]. Our approach is in contrast with some simulation results for Gaussian problems, published, e.g., in [6,5,28,33], and aimed more at illustrating theoretical adaptivity properties of the procedures. In that case, it is more natural to compare with oracle risks (rather than sample-wise minimal errors) and to show the dependence of the risk on the sample size.

Summary and conclusions
In this paper, a novel approach to the data-driven choice of regularization parameters in Poisson inverse problems is proposed. This specialized version of the Morozov discrepancy principle leads to weakly consistent filter-induced estimators with explicit rates of convergence of the L² loss under source conditions on the estimated function.
As the reference case, properties of filter-induced estimators with a priori regularization in Poisson inverse problems are summarized. In the finitely smoothing case, minimax rates of convergence are attained over subsets of smoothness classes defined via source conditions and an additional boundedness condition on the image function. In order to also cover the non-Hilbert-Schmidt case, we provide a direct proof of Theorem 1, even if the case of operators with polynomially decaying singular values with exponent b > 1/2 is covered by Section 5.3 in [4]. The case of trace class operators may also be handled similarly to how the white noise model is treated in [7], which not only reproduces the rates in the finitely smoothing case with b > 1/2, but also covers, in Theorem 2, the infinitely smoothing case with a weaker source condition traditionally studied with such operators.
The Morozov principle is used after pre-conditioning the initial model (1), which allows for the construction of an unbiased and √n-consistent estimator of the noise level. As often happens for discrepancy principles in stochastic inverse problems, the resulting estimators are not strictly adaptive in the sense defined in Section 1. In the finitely smoothing case, the rate of convergence with a posteriori regularization coincides with the respective rate of convergence with a priori regularization only on the boundary of the cases handled by both methods, and they differ in all other cases. The results for a posteriori regularization are valid only for filter-induced estimators with qualification strictly larger than 2. This covers, in particular, truncated SVD, Landweber and k-times iterated Tikhonov regularization with k > 1, but excludes the standard (1-time iterated) Tikhonov regularization. Poor behaviour of the estimation procedure in the latter case is clearly seen in simulations.
Our theory covers DP-estimators only in problems with Hilbert-Schmidt operators, which excludes, for example, the well-known and important Wicksell's problem of stereology. Whether the presented methodology can be modified to also cover non-Hilbert-Schmidt cases needs further investigation. It should be noted here that a version of DP applicable to compact but not necessarily Hilbert-Schmidt operators has been studied recently in [16] for discretized inverse problems with independently repeated measurements in each channel and with unknown error distribution. However, as measurements in white noise are assumed in [16], the results do not seem to satisfactorily cover our Poissonian setup (cf. a related discussion in the last paragraph of our Sec. 3).
Another idea worth investigating is using our version of the Morozov principle with a weighted norm, pretty much in the same spirit as in [7]. Introductory simulation experiments suggest that this may significantly improve the quality of estimation, at least in finite samples.

Proofs
In all proofs, Id stands for the identity operator. For brevity, we also define S_α := sup_i |σ²_i F_α(σ²_i) − 1|; note that S_α ≤ 1 + C_R for a bounded filter. The following lemma collects various technical results on regularizing filters. In the proof, we adapt to the current setup standard techniques used, e.g., in many places in [14]. Recall that q = K*Kf. Proof. For the first statement, write

For the second statement, the interpolation inequality ((2.49) in [14]) gives a bound which, combined with the first inequality, proves the second statement.

Proof of Theorem 1
The proof goes along standard lines of writing the risk as the sum of variance and squared bias components, and of separating low and high frequency terms in the variance component. Since E F_α(K*K)(q̂_n − q) = 0, one has

E‖f̂_α − f‖² = ‖F_α(K*K)K*K f − f‖² + E‖F_α(K*K)(q̂_n − q)‖²,

with ‖F_α(K*K)K*K f − f‖² ≤ C²_ν ρ² α^ν because of Lemma 1. For the variance term, we obtain

E‖F_α(K*K)(q̂_n − q)‖² = n⁻¹ Σ_i F_α(σ²_i)² σ²_i ⟨u²_i, g⟩ ≤ M n⁻¹ Σ_i F_α(σ²_i)² σ²_i,

because of (13) and because ‖g‖_∞ ≤ M. Consider first the case b > 1/2. Separating high (σ²_i ≤ α) and low (σ²_i > α) frequency terms, and using the definition of the bounded filter, we obtain

n⁻¹ Σ_i F_α(σ²_i)² σ²_i ≤ n⁻¹ C²_F α⁻² Σ_{σ²_i ≤ α} σ²_i + n⁻¹ C²_R Σ_{σ²_i > α} σ⁻²_i.

The condition σ²_i ≤ α implies that i ≥ L := Bα^{−1/(2b)} with some constant B, and σ²_i > α implies that i < U := Dα^{−1/(2b)} with some constant D. Consequently,

n⁻¹ Σ_i F_α(σ²_i)² σ²_i ≤ C n⁻¹ α^{−(2b+1)/(2b)},

with a constant C = C(C_R, C_F, K, M). Summing this with the bias term and substituting α = α(n) ≍ n^{−2b/(2bν+2b+1)}, which balances the orders of magnitude of the two terms, gives the conclusion. For the case b ≤ 1/2, the high-frequency terms are equal to zero for low-pass filters satisfying F_α(λ) = 0 for λ ≤ α, the upper bound for the variance term has the same order of magnitude, and the conclusion follows in the same way.

Proof of Theorem 3
We use the notation introduced in Section 2 and adapt the ideas from the proofs of Theorems 3 and 4 in [15] to the Poissonian case. Let Π_g^(i), i = 1, . . . , n, be independent copies of a Poisson process Π_g with intensity function g. Then Π_ng =^d Σ_{i=1}^n Π_g^(i), because of the superposition theorem for Poisson processes (Th. 3.3 in [20]). Under Assumption 1, we thus have, by the law of large numbers,

n δ̂²_n =^d n⁻¹ Σ_{i=1}^n ∫_X ∫_Y K²(x, y) dΠ_g^(i)(y) dμ_X(x) →^d ∫∫ K²(x, y) g(y) dμ_X(x) dμ_Y(y) =: γ²_g < ∞,

which implies that

n δ̂²_n →^P γ²_g, (18)

because convergence in distribution to a constant is equivalent to convergence in probability to that constant.
For further use, define Δ²_n := ‖q̂_n − q‖² and note that, by the CLT for random elements in Hilbert spaces (Th. 13.25 in [21]),

√n (q̂_n − q) =^d n^{−1/2} Σ_{i=1}^n (Z_i − q) →^d W, (19)

where Z_i are independent copies of Z := ∫ K(·, y) dΠ_g(y), a random element in L²(X, B_X, μ_X), because K(·, ·) is square integrable as the kernel of a Hilbert-Schmidt operator. Obviously, E Z = K*g = q, by Campbell's theorem. Also, W is a Gaussian random element in L²(X, B_X, μ_X). This implies that nΔ²_n →^d ‖W‖², by the continuous mapping theorem, and, by the Slutsky lemma,

Δ²_n / δ̂²_n →^d ‖W‖² / γ²_g. (20)

We start the proof from the second assertion of Theorem 3. Define

Ω_n := {‖K*K f̂_{c_n} − q̂_n‖² ≤ τ δ̂²_n},

with some c_n → 0. Set ε := √τ γ_g/4 and observe that, by Chebyshev's inequality,

P(√n ‖(K*K F_{c_n}(K*K) − Id)(q̂_n − q)‖ > ε) ≤ ε⁻² Σ_i (σ²_i F_{c_n}(σ²_i) − 1)² σ²_i ⟨u²_i, g⟩ → 0

as n → ∞ because, using the expansion K(x, y) = Σ_i σ_i u_i(y) v_i(x), one easily obtains Σ_i σ²_i ⟨u²_i, g⟩ = ∫∫ K²(x, y) g(y) dμ_X(x) dμ_Y(y) < ∞ by Assumption 1 and, obviously, F_{c_n}(σ²_i) σ²_i → 1 for all i. This and (18) imply that P(Ω_n) → 1, as n → ∞. Using Lemma 1 and setting c_n = [√τ γ_g/(4ρ C_{ν+2} √n)]^{2/(ν+2)}, one has α_n ≥ c_n on Ω_n and, by Lemma 1 and equation (10), and because S_{α_n} ≤ 1 + C_R, the loss ‖f̂_{α_n} − f‖ can be bounded on Ω_n in terms of c_n and Δ_n. Moreover, on the set Ω_n, √n Δ_n is stochastically bounded because of (19) and, combining everything, ‖f̂_{α_n} − f‖² ≤ D n^{−ν/(ν+2)} with probability tending to one, with some constant D, which completes the proof of the second assertion of Theorem 3, because of (18), (19) and (20), and because P(Ω_n) → 1.
It is elementary to see that if f has a finite expansion in terms of the singular elements v_i, then it satisfies the source condition with any positive ν, hence ‖f̂_{α_n} − f‖² = O_P(n^{−ν/(ν+2)}) and, consequently, f̂_{α_n} is consistent. To complete the proof of the consistency part of Theorem 3, it is thus sufficient to consider f with an infinite expansion. Since ⟨q, v_i⟩ = σ²_i ⟨f, v_i⟩, q then also has an infinite expansion. Let us fix ε > 0 and select L such that ⟨q, v_L⟩ ≠ 0 and (σ²_L F_α(σ²_L) − 1)² > 1/2 for all α ≥ ε. This is possible because, for such α, |F_α(σ²_L)| ≤ C_F/α ≤ C_F/ε and σ_i → 0, as i → ∞. Define

Ω¹_n := {⟨q̂_n, v_L⟩² ≥ ⟨q, v_L⟩²/2} ∩ {n δ̂²_n ≤ 2γ²_g}

and note that P(Ω¹_n) → 1 because of (18) and because, by the law of large numbers, ⟨q̂_n, v_L⟩ →^P ⟨q, v_L⟩. Then, for n ≥ 64τ²γ²_g/⟨q, v_L⟩² and for all α ≥ ε, one has, on Ω¹_n,

‖K*K f̂_α − q̂_n‖² ≥ (σ²_L F_α(σ²_L) − 1)² ⟨q̂_n, v_L⟩² > ⟨q, v_L⟩²/4 > τ δ̂²_n.

This means that on Ω¹_n we have α_n ≤ ε, because of (10). Consequently, α_n →^P 0.
Both terms tend to zero in probability: the first one because α_n →^P 0, and the second one because it can, on Ω_n ∩ Ω¹_n, be bounded analogously to the bounds in the proof of the second assertion. This completes the proof of Theorem 3.