Exponential Family Trend Filtering on Lattices

Trend filtering is a modern approach to nonparametric regression that is more adaptive to local smoothness than splines or similar basis procedures. Existing analyses of trend filtering focus on estimating a function corrupted by homoskedastic Gaussian noise, but our work extends this technique to general exponential family distributions. This extension is motivated by the need to study massive, gridded climate data derived from polar-orbiting satellites. We present algorithms tailored to large problems, theoretical results for general exponential family likelihoods, and principled methods for tuning parameter selection without excess computation.


Introduction
Modeling data using exponential family distributions on the vertices of a graph is a standard task in statistics and artificial intelligence. Examples include satellite images or photographs, traffic or mobility patterns, communications networks, spatiotemporal data, and many others. Suppose we observe y_i ∈ R for i = 1, …, n on the nodes of a graph and assume that they independently follow a natural exponential family with density of the form

    p(y | θ*_i) = h(y) exp(θ*_i y − ψ(θ*_i)),     (1)

for functions h : R → [0, ∞) and ψ : Θ → R and natural parameter θ*_i ∈ Θ. The maximum likelihood estimator for θ* is easily shown to be (ψ′)⁻¹(y), where we apply the function componentwise. Unfortunately, this estimator fails to respect the known graphical structure, and therefore has high estimation risk (e.g., E‖(ψ′)⁻¹(y) − θ*‖₂² ∝ n for the Gaussian family). In this paper, we imagine that the natural parameter vector θ* ∈ Θⁿ ⊆ Rⁿ is smooth on the graph in a total variation sense described below. We study methods to filter (estimate) the true parameter vector θ*, given observations y ∈ Rⁿ subject to this structure.
As an example, Figure 1 shows estimates for the instantaneous variance (imagining y_i is a member of the Gamma family) of the temperature for New Year's Day 2010 over a grid for Canada using maximum likelihood and a few configurations of the main family of estimators we investigate. The smoothness imposed by the grid of neighbouring locations leads to predictable patterns in the estimate that follow topographical features like mountain ranges and bodies of water. We will revisit this example in more detail in Section 6. Before describing our methodology more carefully, we define notation.

(Figure 1 caption: The top row shows the absolute centered data, the 0th-order trend filter, and the 2nd-order trend filter, the latter two with reasonable values of the tuning parameter. The bottom row shows the 1st-order trend filter for different tuning parameters, with the leftmost map, labeled "optimal", corresponding to the estimate when the degrees of freedom is chosen by minimizing an unbiased risk estimate.)
Notation. Throughout this paper, we will focus on lattice graphs in d dimensions, though we note that our main theoretical results can be extended to arbitrary graphs with appropriate conditions on the graph Laplacian. We define a graph difference operator D that is crucial for defining our estimators. In one dimension, on a chain graph with n > 1 nodes, the difference operator D^{(1)}_{n,1} ∈ R^{(n−1)×n} is defined by (D^{(1)}_{n,1} θ)_i = θ_i − θ_{i+1} for i ∈ [n − 1]. We use the notation [m] to denote the set {1, 2, …, m} for positive integers m. The (k + 1)th-order (forward) difference matrix D = D^{(k+1)}_{n,1} ∈ R^{(n−k−1)×n} is defined with the recurrence relation D^{(k+1)}_{n,1} = D^{(1)}_{n−k,1} D^{(k)}_{n,1} for k > 0, n > k. For example, the 3rd-order differences look like: (D^{(3)}_{n,1} θ)_i = −θ_{i+3} + 3θ_{i+2} − 3θ_{i+1} + θ_i. For a general graph, let D^{(1)} denote its incidence matrix. In d > 1 dimensions, we focus on lattice graphs with a length of N on each side and a total number of vertices n = N^d. In our estimators, unless otherwise specified, we penalize the variation of signals only along axis-parallel directions, using the stacked operator

    D = [ D^{(k₁+1)}_{N,1} ⊗ I_N ⊗ ⋯ ⊗ I_N ; I_N ⊗ D^{(k₂+1)}_{N,1} ⊗ ⋯ ⊗ I_N ; … ; I_N ⊗ ⋯ ⊗ I_N ⊗ D^{(k_d+1)}_{N,1} ],

where the Kronecker products consist of d terms each, one term for each dimension.
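To make this construction concrete, here is a minimal NumPy sketch (the function names are mine, not the paper's) that builds the one-dimensional difference matrices by the recurrence above and stacks the Kronecker products for a d-dimensional lattice:

```python
import numpy as np

def diff_op(n, order=1):
    """Order-th forward difference matrix D^{(order)}_{n,1}, with
    (D theta)_i = theta_i - theta_{i+1} as the first-order building block."""
    D = np.eye(n - 1, n) - np.eye(n - 1, n, k=1)
    for _ in range(order - 1):
        m = D.shape[0]
        D1 = np.eye(m - 1, m) - np.eye(m - 1, m, k=1)
        D = D1 @ D          # recurrence D^{(k+1)} = D^{(1)} D^{(k)}
    return D

def lattice_diff_op(N, d, k=0):
    """Stack d Kronecker-product blocks, one penalizing (k+1)-th order
    differences along each axis-parallel direction of an N^d lattice."""
    Dk = diff_op(N, k + 1)
    I = np.eye(N)
    blocks = []
    for j in range(d):
        mats = [Dk if jj == j else I for jj in range(d)]
        B = mats[0]
        for M in mats[1:]:
            B = np.kron(B, M)
        blocks.append(B)
    return np.vstack(blocks)
```

Dense matrices are used here purely for clarity; for the large problems targeted by the paper, a sparse representation of D would be needed.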
Define ‖·‖₂ to be the usual Euclidean norm and ‖·‖_n = n^{−1/2}‖·‖₂ to be the empirical norm. We will similarly denote other ℓ_p-norms with an appropriate subscript. When there is no chance of confusion, we will assume that the ψ function in (1) applies componentwise. We use a ⊗ b to denote the Kronecker product of vectors a and b, a ∘ b to denote the elementwise product, and ⟨a, b⟩ = aᵀb to be the dot product. When clear, we will use f′, f″ to denote componentwise first and second derivatives of the function f. We use ∨/∧ for maximum/minimum respectively and (x)₊ = x ∨ 0; while 1{A} is the indicator of the event A, taking the value one if true and zero otherwise. We use a_n ≲ b_n to mean a_n ≤ c b_n eventually for some constant c > 0, a_n = Ω(b_n) to mean that a_n ≥ c b_n eventually, and Y_n = O_P(1) to mean that the sequence of random variables is bounded in probability eventually. We will also use Y_n = Õ_P(1) to mean that Y_n = O_P(log^c(n)) for some c > 0. Finally, for the graph difference operator, we will write the singular value decomposition (SVD) of D = UΣVᵀ ∈ R^{m×n} where U ∈ R^{m×m}, Σ ∈ R^{m×n} and V ∈ R^{n×n}, and we write the null space of D as N = N(D).

Estimators
We consider two canonical estimators. The first filters the natural parameter θ* based on maximizing the likelihood while the second filters the mean β* := ψ′(θ*) directly. This distinction is important with respect to the nature of the expected smoothness. If we were to consider the data without regard for the graphical structure, then there is a direct correspondence between these two: the MLE for β* is given by applying ψ′ to the MLE for θ*. Furthermore, this equivalence holds trivially for estimating the mean of a Gaussian because β* = θ*. However, any requirement for smoothness over the graph destroys this relation for general exponential families.
Penalized MLE. We minimize the negative log-likelihood with a smoothness-imposing penalty:

    θ̂ = argmin_θ (1/n) Σ_{i=1}^n ( −y_i θ_i + ψ(θ_i) ) + λ‖Dθ‖₁.     (2)

Here λ is a parameter for balancing fidelity to any anticipated smoothness over the graph, as encoded by D, with fit to the data y. Taking λ → 0 will result in the minimum occurring at θ̂ = (ψ′)⁻¹(y) while letting λ → ∞ gives the Kullback-Leibler projection of y onto N(D). By the likelihood principle, θ̂ is the natural estimator to use when we expect that θ* is smooth with respect to the graph. However, as we will demonstrate, this estimator can have high excess estimation risk when ψ″(θ*) approaches 0. In Section 2.1 we will argue that this issue can be addressed by adding a penalty on the null-space component of θ. Specifically, the MLE with TF and null space penalty is

    θ̂ = argmin_θ (1/n) Σ_{i=1}^n ( −y_i θ_i + ψ(θ_i) ) + λ₁‖Dθ‖₁ + λ₂‖P_N θ‖₂,     (3)

where λ₁, λ₂ ≥ 0 are regularization parameters and P_N is the projection operator onto N(D).
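As a concrete illustration of the penalized MLE, the sketch below solves a small Poisson instance with ADMM, splitting z = Dθ so that the ℓ₁ term reduces to soft-thresholding. This is one standard way to handle the nonsmooth penalty, not necessarily the algorithm of Section 4, and the function name, initialization, and step sizes are my own choices.

```python
import numpy as np
from scipy.optimize import minimize

def poisson_tf_admm(y, D, lam, rho=1.0, iters=100):
    """ADMM sketch for min_theta (1/n) sum(-y_i theta_i + exp(theta_i))
    + lam * ||D theta||_1  (Poisson log-partition psi(theta) = exp(theta))."""
    n = len(y)
    theta = np.log(np.maximum(y, 0.5))   # crude MLE-style starting point
    z = D @ theta
    u = np.zeros(D.shape[0])             # scaled dual variable
    for _ in range(iters):
        # theta-update: smooth subproblem, solved here with L-BFGS
        def obj(t):
            r = D @ t - z + u
            return (-y @ t + np.exp(t).sum()) / n + 0.5 * rho * (r @ r)
        def grad(t):
            return (np.exp(t) - y) / n + rho * D.T @ (D @ t - z + u)
        theta = minimize(obj, theta, jac=grad, method="L-BFGS-B").x
        # z-update: soft-thresholding, the prox of the l1 penalty
        v = D @ theta + u
        z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)
        u = v - z                        # dual update
    return theta
```

With a large λ the fit is forced toward a constant vector, whose value is log(ȳ), the Poisson MLE of a common natural parameter.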
Mean Trend Filter. When the expected smoothness is in the mean rather than the natural parameter, it may be more appropriate to penalize the roughness in the mean directly. For such a scenario, we consider the trend filtering estimator:

    β̂ = argmin_β (1/2n)‖y − β‖₂² + λ‖Dβ‖₁.     (4)
As before, λ balances data fidelity with smoothness, but here, the interpretation as λ → ∞ is more straightforward. In this case, the minimum occurs at the orthogonal projection onto the null space of D: β̂ = (I − Dᵀ(DDᵀ)⁻¹D)y. This estimator was proposed in Steidl et al. (2006), Kim et al. (2009) and statistically analyzed in Tibshirani (2014), Wang et al. (2016) and others. We provide a thorough overview of previous work on mean trend filtering in a later section.
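The λ → ∞ limit is easy to verify numerically. In the sketch below (function name mine), projecting onto the null space of the first-difference operator recovers the constant vector filled with the sample mean:

```python
import numpy as np

def nullspace_projection(D, y):
    """Euclidean projection of y onto N(D): (I - D^T (D D^T)^{-1} D) y.
    Assumes D has full row rank, as the 1d difference operators do."""
    return y - D.T @ np.linalg.solve(D @ D.T, D @ y)
```

For the first-difference operator the null space is spanned by the constant vector, so the projection is simply ȳ·1; for higher-order D it is the best-fitting polynomial of degree k.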
To understand the nature of the penalty in the above formulations, it is clearly important to understand its null space. Sadhanala et al. (2017) showed that the null space of D consists of Kronecker products of polynomials. We give a generalized version of their Lemma 1 here.
Lemma 1. A basis for the null space of D is given by the family of polynomials {x₁^{a₁} ⊗ x₂^{a₂} ⊗ ⋯ ⊗ x_d^{a_d} : a_j ∈ {0, …, k_j}}, where x_j are the coordinates of the observations along the jth dimension. The dimension of the null space is nullity(D) = ∏_{j=1}^d (k_j + 1).

Therefore, writing P as the matrix formed by the evaluations of this collection of polynomials over the grid, we can also write the Euclidean projection onto the null space of D as P_N := P(PᵀP)⁻¹Pᵀ. When applied to certain kinds of data (for example the satellite temperature data) it may be useful to imagine that some dimensions of the grid "wrap" like a cylinder. If the grid wraps along some dimension j ∈ [d], then a_j = 0 regardless of k_j, and the contribution to the nullity for dimension j is as if k_j = 0.
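Lemma 1 can be checked numerically. The sketch below (function name mine) builds P from Kronecker products of monomials and verifies DP = 0 on a 2d grid with k = 1:

```python
import numpy as np

def poly_nullspace_basis(N, d, k):
    """Columns x1^{a1} (x) ... (x) xd^{ad} for a_j in {0,...,k}, evaluated on
    an N^d grid: a basis of the null space of the lattice operator D."""
    x = np.arange(N, dtype=float)
    cols = []
    for a in np.ndindex(*([k + 1] * d)):   # all exponent tuples (a_1,...,a_d)
        v = np.ones(1)
        for aj in a:
            v = np.kron(v, x ** aj)
        cols.append(v)
    return np.column_stack(cols)
```

The number of columns is (k + 1)^d, matching the nullity formula in Lemma 1 when k₁ = ⋯ = k_d = k.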
Characterizing the null space tells us the sorts of vectors θ* that have ‖Dθ*‖₁ = 0, but it does not say anything about vectors with bounded trend filtering penalty. Consider the ℓ₀ penalty instead, ‖Dθ‖₀, for k₁ = ⋯ = k_d = k. This is small when there are few changepoints, which are the indices j₁, …, j_M at which the kth derivative is non-zero, (Dθ*)_{j₁,…,j_M} ≠ 0. Because the ℓ₁ penalty tends to produce sparse vectors with small ‖Dθ̂‖₀, the reconstructed signals are piecewise polynomials with a few changepoints that are automatically selected. The result is that trend filtering produces estimators that are locally adaptive, meaning that the reconstructed signal is not oversmoothed in regions of high signal variability (in θ*) and not undersmoothed in regions of low variability. In short, the filter does not have one fixed resolution or bandwidth, but adapts the resolution to the observed signal. For a more complete explanation of this phenomenon, see Wang et al. (2016), Bassett and Sharpnack (2019). To simplify the theoretical exposition below, we will assume that k₁ = ⋯ = k_d = k, but our results are easily modified for other situations.

Properties of exponential families
In this section, we review properties of exponential families, many of which will play a key role in our theoretical development. Considering the univariate random variable Y with density of the form in (1), we define the domain Θ = {θ ∈ R : ψ(θ) < ∞} and assume that Θ has a non-empty interior. Recall that the mean and variance of the distributions p(· | θ) are ψ′(θ) and ψ″(θ) respectively, for natural parameter θ ∈ Θ. Therefore, ε := Y − ψ′(θ*) has mean zero and a simple expression for its moment generating function (MGF),

    E[e^{sε}] = exp( ψ(θ* + s) − ψ(θ*) − s ψ′(θ*) ),

for s in a neighborhood of 0. Furthermore, ψ is convex and all its derivatives exist for all θ ∈ Θ (see Brown 1986). We say that a random variable X with mean 0 is sub-exponential if there are non-negative parameters ν, b such that

    E[e^{sX}] ≤ exp(ν²s²/2) for all |s| ≤ 1/b.

For shorthand, we also say X is SE(ν², b). We can show that random variables following exponential family distributions are sub-exponential in this sense.
Lemma 2. Fix θ* in the interior of Θ, and let Y be from a univariate exponential family with parameter θ*. Then for any δ > 0, Y − ψ′(θ*) is sub-exponential with some parameters ν and b depending on θ* and δ. Specifically, ν is related to the variance by ν² = ψ″(θ*) + δ.

Table 1 gives the log-partition function ψ(θ) and sub-exponential parameters for Poisson, exponential, and chi-squared families. These calculations and the proof of Lemma 2 are in Appendix A. In each of the examples in Table 1, ν² is selected to be a multiple of the variance, but these are not the only choices of (ν, b) that would constitute valid sub-exponential parameters. Lemma 2 is not surprising given the form of the MGF, but seems not to be well known. Related results can be seen in Brown (1986) or Kakade et al. (2010). Note that many exponential families have tails which decay faster (e.g., Gaussian or Binomial distributions), but all exponential families have sub-exponential tails.
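As a numerical illustration in the Poisson case (my own check, not from the paper or its Table 1): the centered log-MGF is λ(e^s − 1 − s), and the sub-exponential bound can be verified with ν² = 2λ (twice the variance) and b = 1 on |s| ≤ 1:

```python
import numpy as np

def poisson_centered_log_mgf(lam, s):
    """log E exp(s(Y - lam)) for Y ~ Poisson(lam): lam * (e^s - 1 - s)."""
    return lam * (np.exp(s) - 1.0 - s)
```

Since e^s − 1 − s = Σ_{k≥2} s^k/k! ≤ s² for |s| ≤ 1, the bound λ(e^s − 1 − s) ≤ (2λ)s²/2 holds on that interval, which is what the check below confirms on a grid of s values.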
Finally, we note that in all of these examples (Poisson, exponential, chi-squared) the variance, and hence the curvature ψ″(θ*), depends on θ*, resulting in heteroskedasticity. This is one of the main complications of the exponential family setting that we consider in this paper. Along with the heavy-tailed residuals, this setting is a major departure from the sub-Gaussian homoskedastic setting of most prior work.
KL divergence. The Kullback-Leibler (KL) divergence between exponential distributions of the same family has a simple algebraic form in terms of ψ; see Wainwright and Jordan (2008). The KL divergence with parameter vectors θ₀ and θ₁ ∈ Rⁿ is

    KL(θ₀ ‖ θ₁) = Σ_{i=1}^n [ ψ(θ₁ᵢ) − ψ(θ₀ᵢ) − (θ₁ᵢ − θ₀ᵢ) ψ′(θ₀ᵢ) ].

In the asymptotic setting with n → ∞, it makes more sense to examine the average divergence per coordinate. Thus we define K̄L(θ₀ ‖ θ₁) := (1/n) KL(θ₀ ‖ θ₁). For an exponential family as in (1), the KL divergence is the Bregman divergence of ψ.
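The Bregman form can be checked against a closed-form KL. Below is a small sketch (function name mine) that compares it with the known Poisson KL divergence μ₀ log(μ₀/μ₁) − μ₀ + μ₁:

```python
import numpy as np

def bregman_kl(theta0, theta1, psi, dpsi):
    """KL(p_theta0 || p_theta1) as the Bregman divergence of the
    log-partition function psi, summed over coordinates."""
    return np.sum(psi(theta1) - psi(theta0) - (theta1 - theta0) * dpsi(theta0))
```

For the Poisson family, ψ = exp and ψ′ = exp, and the natural parameter is θ = log μ, so the two expressions agree term by term.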

Summary of our contributions
Most of the existing work on trend filtering referenced above assumes sub-Gaussian noise, that is, y_i = θ*_i + ε_i, where ε_i is mean-zero and sub-Gaussian with common variance σ². For general exponential families of the form in (1), y_i − E y_i has heavier than sub-Gaussian tails. Furthermore, for general exponential families, the variance, as well as higher moments, are tied to the mean parameter. Therefore, consideration of heteroskedasticity is a necessary and fundamental component of our analysis.
Direct analysis for specific exponential families, such as the Poisson (Bassett and Sharpnack, 2019), is rare. Van de Geer (2020) analyses a penalized MLE for the logistic family. However, the logistic family has sub-Gaussian tails and uniformly bounded variance, which allows key parts of the analysis, such as the Dudley entropy integral bound, to work. In other words, the theoretical approach there cannot generalize to arbitrary exponential families.
Our results here apply to the entire exponential family. However, due to this generality, the results are necessarily weaker than could potentially be achieved under additional, more stringent conditions (such as by assuming Gaussian or logistic distributions, or requiring additional bounds on higher moments).
A key ingredient in previous analyses in the sub-Gaussian setting is that the Bregman divergence ψ(θ̂) − ψ(θ*) − (θ̂ − θ*)ᵀψ′(θ*) can be lower bounded by a multiple of ‖θ̂ − θ*‖₂² because ψ is strongly convex. However, for general exponential families, ψ is not strongly convex, even if ‖Dθ*‖₁ is well-controlled, unless θ* satisfies additional conditions. Without such assumptions, ψ″(θ*) can be arbitrarily small. If we make the (rather implausible) assumption that both the estimate θ̂ and the parameter θ* are bounded, then we recover this strong convexity in the relevant region where θ̂ and θ* lie. In this case, we can apply the same techniques used to analyze the sub-Gaussian case. We derive these bounds in Appendix B.7. However, without such an assumption, the analysis requires entirely different techniques, and we show these results in Section 2.2.
Our main contributions are the following.
1. We derive error bounds on excess KL-risk for the penalized maximum likelihood estimator for general exponential families with sub-exponential noise (Section 2). We argue that there is a need to constrain the component of the natural parameter vector that falls in the null space of D as in equation (3).
2. We delineate two types of heteroskedasticity that are relevant under general assumptions: strong heteroskedasticity and mild heteroskedasticity. We show how our general KL-bounds behave under these regimes and how the heteroskedasticity interacts with the smoothness constraints and the dimensionality of the problem.
3. For k = 0, we show that the mean trend filter and the penalized MLE are equivalent estimators, and hence, results for the mean trend filter apply immediately in this special case (though under different smoothness assumptions; Section 3).
4. We show that the mean trend filter nearly achieves the minimax optimal rate under squared error loss for mildly heteroskedastic data and all smoothness levels k and lattice dimensions d (Section 3). This result in fact holds for general sub-exponential noise, not just for the exponential families we consider in the paper, though we incur an additional log n factor in the error bound. It is specific to distributions where the mean parameter has bounded trend filtering penalty.
5. We give an algorithm for solving all of these cases for arbitrary likelihood, smoothness levels, and dimension, with the goal of operating on large data (Section 4).
6. We give a simple estimator for the out-of-sample prediction risk (at the original grid locations) to enable tuning parameter selection without requiring complicated forms of cross validation or other re-estimation procedures (Section 5).

It is important to note that the results for MLE trend filtering and mean trend filtering are not directly comparable because they make different assumptions. The former constrains the natural parameter, while the latter constrains the mean parameter. These only coincide in the Gaussian case. We present empirical results demonstrating our methods on synthetic and real datasets in Section 6. We conclude with a discussion of the results. The remainder of this section gives a concise overview of our theoretical contributions and a thorough discussion of related work.

Overview of theoretical contributions
To better fix the context for our results, we provide here a concise description of them in the simplest cases (more precise statements are in Sections 2 and 3). Define α = (k + 1)/d, and define the "canonical scaling" as ‖Dθ*‖₁ ≍ n^{1−α}. The canonical scaling is called such because it holds for evaluations of Hölder functions (functions where the kth-order partial derivatives are Lipschitz continuous) at the grid locations. Under the canonical scaling, it is shown (Sadhanala et al., 2021) that for Gaussian data and ℓ₂ loss, the minimax rate over this class is n^{−2α/(2α+1)}. Furthermore, the mean trend filter is rate optimal up to logarithmic factors in the Gaussian case.
Because, for Gaussian data, KL(θ₀ ‖ θ₁) ∝ ‖θ₀ − θ₁‖₂², the above immediately provides a lower bound for the minimax KL-risk inf_{θ̂} sup_{θ*∈Θ} (1/n) E KL(θ* ‖ θ̂) across all exponential families. We show that, under additional boundedness conditions on Θ and similar constraints on the estimator, the MLE trend filter in equation (2) achieves this rate up to additional logarithmic factors. The case of the MLE trend filter without the artificial boundedness constraint described above is more complicated (Section 2.2). With the additional penalty on the null space in Equation (3), an addition we prove necessary for consistency, we can achieve the minimax rate for α ≤ 1/2. For α > 1/2, the upper bound is weaker than for Gaussian noise: we can show only that K̄L(θ* ‖ θ̂) = O_P(n^{−1/2}). While we are able to show consistency in this setting, we suspect that this bound is loose.
We also show that, under homoskedastic subexponential noise, the mean trend filter achieves the minimax rate up to additional logarithmic factors. The homoskedasticity condition can be relaxed, and this is examined in Section 3. We consolidate these results in Tables 2 and 3.

Related work
Much is known about trend filtering in one dimension (1d). The trend filtering method in (4) was proposed in Steidl et al. (2006), Kim et al. (2009) for 1d problems. Tibshirani (2014) connected trend filtering to locally adaptive regression splines, proposed in Mammen and van de Geer (1997), and analyzed its statistical properties. Tibshirani (2022) gives an in-depth background of the key ideas that make trend filtering and related methods work. Johnson (2013), Kim et al. (2009), Ramdas and Tibshirani (2016) propose methods to solve the convex optimization problem in 1d trend filtering. Trend filtering with k = 0, or total variation (TV) regularization, is an important technique for denoising images (two dimensions). TV methodology and computation was studied in Rudin et al. (1992), Tibshirani et al. (2005), Condat (2013), Barbero and Sra (2018). Trend filtering over general graphs was first proposed in Wang et al. (2016), and subsequently, other variants of trend filtering have been studied, for example depth-first search TV regularization (Madrid Padilla et al., 2018), kNN TV denoising (Madrid Padilla et al., 2020), quantile trend filtering (Madrid Padilla and Chatterjee, 2021), and sequential TV denoising (Baby and Wang, 2021). These methods use squared error loss, with the exception of Madrid Padilla and Chatterjee (2021), and so are not necessarily suitable for general exponential families.

General exponential family distributions have a long history in statistics. Brown (1986) is a definitive treatment of the properties of exponential families, while McCullagh and Nelder (1989) covers the details of generalized linear models. Direct analysis of trend filtering in this setting is more rare than for Gaussian loss. Van de Geer (2020) derived error bounds for estimating Bernoulli family parameters with bounded variation in 1d. In contrast to most other results, the theory applies without assuming boundedness of the estimated natural parameter. Khodadadi and McDonald (2019) examine computational approaches for variance estimation on spatiotemporal grids. Kakade et al. (2010) discuss strong convexity of general exponential families and use the results to analyze ℓ₁-penalized maximum likelihood. Vaiter et al. (2017) examine the geometry of penalized generalized linear models and derive important results for general regularizers that we use for specialized risk estimation in Section 4. Bassett and Sharpnack (2019) provide a bound on the Hellinger error for total variation denoising for the estimation of densities over edge segments in a general graph.
Our results here are the first to analyze trend filtering over lattice graphs for general exponential families.
An important distinction exists between two varieties of theoretical results for trend filtering examined in the literature: (1) nearly parametric rates under sparsity assumptions with ‖Dθ*‖₀ bounded; and (2) non-parametric rates for signals with bounded trend filtering norm ‖Dθ*‖₁. In general, these bounds are difficult to compare because they hold under different conditions, and either bound can be tighter for specific signals. Rinaldo (2009), Harchaoui and Levy-Leduc (2010), Lin et al. (2017), Guntuboyina et al. (2020), Ortelli and van de Geer (2021) give more general and tighter error bounds when the true signal is sparse (bounded ℓ₀ norm). Throughout this work, we will focus on establishing non-parametric rates with trend filtering norm bounds. Mammen and van de Geer (1997) provide one of the earliest theoretical results on 1d trend filtering. In higher dimensions and on general graphs, researchers have typically confined their theory to special cases, e.g., specific dimensions, graph structure, and trend filtering order. Hütter and Rigollet (2016), Sadhanala et al. (2016) derive error bounds for total variation denoising (trend filtering with k = 0) on lattice graphs. Chatterjee and Goswami (2021), Ortelli and van de Geer (2020) show stronger error bounds when the signal has axis-parallel patches. Sadhanala et al. (2017, 2021) extend the analysis to higher-order trend filtering on lattice graphs of arbitrary dimension. All of the aforementioned works study squared error loss with sub-Gaussian noise. Wang et al. (2016) analyze error bounds for graph trend filtering for specific cases (lattice graphs with a specific trend filtering order). In that work, the "eigenvector incoherence" technique is developed as a tool to analyze the mean squared error of any graph trend filtering problem. In this work, we adapt this technique to work with general exponential families.

Penalized MLE
In this section, we provide general results for trend filtering on d-dimensional lattice graphs with exponential family observations. As mentioned above, general exponential families have two interesting features. First, the distributions can be more heavy tailed than Gaussians, and as we have seen, they are generally sub-exponential. This is reflected in rates that are typically worse than in the Gaussian case. Second, the variance (as well as the sub-exponential parameters ν, b) is a function of the natural parameter, which results in heteroskedasticity. We find that our bounds rely heavily on the "level" of this heteroskedasticity. However, this reliance is most salient with respect to two asymptotic regimes.
We say mild heteroskedasticity occurs when both sub-exponential parameters, ν, b, are bounded as n increases. Henceforth, let ν, b denote the vectors (ν_i), (b_i) for i ∈ [n], where these are the sub-exponential parameters of the centered Y. That is, if there exists an ω such that ‖ν‖_∞, ‖b‖_∞ ≤ ω for all n, we say that the problem is only mildly heteroskedastic. Analysis in this case turns out to be largely similar to the standard homoskedastic setting. We say that strong heteroskedasticity occurs whenever it is not mild; typically we can measure the strength via ‖ν‖_∞ / ‖ν‖_n. When this ratio is close to 1, there is little variation of ν across coordinates. However, when ‖ν‖_∞ / ‖ν‖_n is close to √n, only a few coordinates dominate. Importantly, smoothness of θ* (such as a bound on ‖Dθ*‖₁) does not generally have any implications for the level of heteroskedasticity, and furthermore, it is not generally possible to determine the level from data. Thus, considering both situations is necessary for a complete understanding.
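The strength measure ‖ν‖_∞/‖ν‖_n is simple to compute; here is a small sketch (function name mine) with the two extreme cases, a flat ν (ratio 1) and a single dominating coordinate (ratio √n):

```python
import numpy as np

def heteroskedasticity_ratio(nu):
    """||nu||_inf / ||nu||_n, where ||.||_n = n^{-1/2} ||.||_2.
    Near 1: little variation across coordinates; near sqrt(n): a few
    coordinates dominate (strong heteroskedasticity)."""
    n = len(nu)
    return np.max(np.abs(nu)) / (np.linalg.norm(nu) / np.sqrt(n))
```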
Much of the difficulty for both estimation and theoretical analysis in the exponential family setting is that the negative log-likelihood is not strongly convex in general. If we assume that ψ″(θ*_i) > 1/K for all i, then we can add this constraint to (2), which will ensure strong convexity. We provide an analysis of this approach in Appendix B.7, which is tight in the Gaussian case up to logarithmic factors; see, for example, Sadhanala et al. (2021). Similar results were already derived in the literature, for example, in Prasad et al. (2020). As we will see, however, bounding the curvature in this way excludes important cases and cannot be verified from data. Nonetheless, this assumption has a long history in statistics. For example, the standard approach to proving estimation consistency in low-dimensional generalized linear models is much the same (McCullagh and Nelder, 1989).

Additional penalty on the null space component of θ
The boundedness constraint discussed above is not desirable for at least two reasons. The first is that it is difficult to calibrate the constraint using data. The second is that strong convexity is an indirect way to get control of the nullspace of D, which is what we actually need. We now argue why this is the case.
Let the empirical and population risks at a parameter θ be

    R_n(θ) = (1/n) Σ_{i=1}^n ( −y_i θ_i + ψ(θ_i) )  and  R(θ) = (1/n) Σ_{i=1}^n ( −ψ′(θ*_i) θ_i + ψ(θ_i) ),

respectively, and note that K̄L(θ* ‖ θ) = R(θ) − R(θ*). For Gaussian data, minimization of the empirical risk, the ‖Dθ‖₁ constraint, and strong convexity of the likelihood together control the discrepancy between the empirical risk and the population risk. The reason is that strong convexity controls behaviour of θ in the null space of D. But outside this setting, we no longer have strong convexity, and unfortunately, the penalty alone does not give sufficient control. The result is that, for non-Gaussian data, sup_{θ∈Θ} |R_n(θ) − R(θ)| can become arbitrarily large with high probability, even in simple settings, despite bounds on ‖Dθ*‖₁.

Remark 1 (Degenerate Poisson example). Consider the Poisson family, with true parameter θ*_n = −2 log(n) 1 for any n ≥ 1. The probability that all y_i's are 0 is e^{−1/n}. On this event (where y = 0 · 1), the empirical risk reduces to R_n(θ) = (1/n) Σ_i ψ(θ_i), which is driven to its infimum by constant vectors θ = c1 with c → −∞; since Dθ = 0 for constant vectors, the ‖Dθ‖₁ penalty provides no control, and the estimator in (2) diverges. Notice that in this example, ψ″(θ*_i) = n^{−2}, so the strong convexity bound is diminishing with n.
One can observe similar behaviour for the logistic family. Consider θ*_n = −2 log(n) 1 and verify that all y_i's are 0 with probability (1 + n^{−2})^{−n} ≈ e^{−1/n}. The MLE with only the ‖Dθ‖₁ penalty behaves similarly to the Poisson example described above.
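The probability calculation in the degenerate Poisson example is easy to confirm (my own sanity check): with θ* = −2 log(n) 1, each y_i ~ Poisson(n^{−2}), so the probability that every coordinate is zero is exp(−n · n^{−2}) = e^{−1/n}.

```python
import numpy as np

def prob_all_zero(n):
    """P(y_1 = ... = y_n = 0) when y_i ~ Poisson(exp(-2 log n)) iid."""
    rate = np.exp(-2.0 * np.log(n))        # = n^{-2}
    return np.exp(-n * rate)               # product of P(y_i = 0) = e^{-rate}
```

So for even moderate n this event has probability near 1, and on it the unconstrained penalized MLE is degenerate.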
While artificially imposing strong convexity addresses this issue, it is both more direct and more tractable to instead constrain the component of θ in the null space of D. With this additional constraint, we can show the following risk bound. The proof is in Appendix B.3.
For the above example of the degenerate Poisson, we can set a_n = 2 log n, c_n = 0 to see that the right-hand side converges to 0 as n → ∞. This motivates us to penalize the null space component of θ in the MLE and use the estimator defined in (3) rather than that in (2). In the following, we call the estimator (3) the MLE, and define α = (k + 1)/d. The minimizer in the optimization problem is unique because ψ is strictly convex.

Error bounds for penalized MLE
Generally, there are three degrees of freedom when stating results: (1) the trend filtering order k, (2) the dimension d, and (3) the exponential family and resulting sub-exponential parameters (ν, b). There is a natural trade-off between generality and interpretability of the results presented here, so we will prefer to present specific interpretable results as corollaries.
We introduce some additional notation to state our results. Let ρ_ℓ, ℓ ∈ [N], denote the eigenvalues appearing in the bound below, and let µ be the constant with which the left singular vectors of D are incoherent. We can derive the following error bound on the excess risk of the estimator in (3).
Let θ̂ be our estimate in (3) with parameters λ₂ = 2A_n/n and λ₁ = 2B_n/n. Then, with probability at least 1 − 4nde^{−t}, the stated bound holds. See the proof in Appendix B.1. For regular grids, Lemma 11 (in Appendix B.9) controls the magnitude of L_{κ,1}, L_{κ,2} and hence the bounds in Theorem 1. Applying the lemma to the expression for B_n in Theorem 1, we get the following corollary for regular grids.

Penalized MLE in special cases
We now illustrate Corollary 1.1 in a few special cases to provide intuition. As above, we focus on grid graphs with Poisson and Exponential distributions, and we assume that these are all of width N and dimension d, so that n = N^d. Recall that for natural parameter θ*, the Poisson distribution has mean β* = exp(θ*), while the Exponential distribution has mean β* = −1/θ*. For the Poisson distribution, an additive change in θ* results in a multiplicative change in the mean, and ν² = 2β*, which can easily result in strong heteroskedasticity. Only in special cases does a constraint on ‖Dθ*‖₁ result in a bound on ν², and generally, ‖ν‖_∞ will depend on the signal in question. The first result is an example of mild heteroskedasticity, where the natural parameter is uniformly bounded.
The next example demonstrates Corollary 1.1 under strong heteroskedasticity.
Corollary 1.3. Consider any exponential family on a d-dimensional grid (d > 1) with a natural parameter that satisfies ‖ν‖_∞, ‖b‖_∞ = O(n^c) and ‖ν‖₂ = O(n^c) for some c > 0, and the canonical scaling for k = 0.

An example of a signal satisfying these conditions is the Exponential distribution. In this case, ‖ν‖_∞ is diverging, and so we have strong heteroskedasticity. The level of heteroskedasticity, parameterized by c, determines the rate of convergence, and for c > 1/d we cannot guarantee convergence.

Error bounds for the Mean Trend Filter
When k = 0, remarkably, it turns out that the penalized MLE in (2) is equivalent to the mean trend filtering estimator (4). In fact, this equivalence between the two estimators holds over arbitrary graphs, not just grids.
The proof is in Section B.4. Therefore, in the case k = 0, the penalized MLE can be solved quickly by solving the equivalent mean trend filter problem.
For k ≥ 1, equivalence between the two estimators need not hold in general, with the exception of the mean parameterized Gaussian family, where it holds trivially. The remainder of this section will focus on the general case. For the estimator in (4), we derive the following error bound.
The set of indices J can be chosen to minimize the bound. The proof is in Appendix B.5 and follows an approach similar to that in Wang et al. (2016). Tail bounds on sums of sub-Gaussian variables in their results are replaced with those on sums of sub-exponential variables. This results in additional log n factors in the error bound compared to the sub-Gaussian setting.
The proof technique for Theorem 3 relies on the properties of D. A potential alternative route to error rates is to bound the empirical process (1/n)ε^T(θ̂ − θ*) with the Dudley entropy integral. However, the empirical process in our case is not sub-Gaussian, and we could only derive a trivial upper bound in this way. This is not entirely surprising: the entropy method was also used by Wang et al. (2016) in the sub-Gaussian noise setting, and it failed to give a tight characterization even in that context.

Error bounds with canonical scaling
We simplify this bound in some special cases. Assuming that ν, b are uniformly bounded, we get the following result for regular grids. Denote γ_p = log^{1/p}(n) if pα = 1 and γ_p = 1 otherwise.
Then there is a choice of λ such that for α ≤ 1/2, The proof is in Appendix B.6. This corollary does not discuss the case where α > 1/2 and ω log n is outside of [n^{−α}, √n]. In that case, when the noise is high (ω log n ≳ √n), the polynomial projection estimator β̂ = P_N y gives the tightest bound, and when the noise is low (ω log n < n^{−α}), the identity estimator gives the tightest bound.
The following corollary examines this result for some special cases.
Corollary 3.2. Consider the Poisson and Exponential families on a d-dimensional grid (d > 1) where the mean parameter is constrained. Specifically, suppose that ‖β*‖_∞ = O(1) such that the canonical scaling holds with k = 1. Then the mean trend filter satisfies This result matches the rates in the homoskedastic Gaussian case up to logarithmic factors, shown for example in Sadhanala et al. (2021). An example of a signal satisfying the conditions is a grid graph with width N and dimension d, so that n = N^d and β While the previous result treated the (effectively) homoskedastic case by controlling the largest components of ν, b, the following corollary specializes Theorem 3 to canonical scaling under strongly heteroskedastic noise.
n^{1−α}, and assume σ^2 ≳ n/log^2 n and σ_∞ ≲ n^α/(γ_1 γ_2 log n). Then, the estimator β̂ in Theorem 3 satisfies This result is most useful under strong heteroskedasticity where σ_∞/σ ∝ √n, and slightly stronger rates with weaker heteroskedasticity can be obtained in the 1/2 < α ≤ 1 case (see Corollary 4.1 in Appendix B.6). Suppose ε_i in Theorem 3 is mean-zero Laplace noise with scale parameter τ_i and that ‖Dβ*‖_1 satisfies canonical scaling. For this case, ν_i = b_i = cτ_i for a constant c independent of β_i^*, while σ = c‖τ‖_2/√n and σ_∞ = c‖τ‖_∞, with the natural constraint that σ_∞/√n ≤ σ ≤ σ_∞. For α < 1, the scaling requirement on σ_∞ is the stronger one, meaning that the estimator can only tolerate heteroskedasticity on the order of σ_∞/σ ∝ n^α < √n. On the other hand, for α > 1/2, the constraint on σ is the stronger one, meaning that we can tolerate σ_∞/σ ∝ √n. The associated rates of convergence will necessarily be much slower than in the homoskedastic sub-Gaussian case. Importantly, Corollary 3.3 illustrates that without control of the amount of heteroskedasticity, we cannot guarantee convergence of the estimator. In other words, while the estimator can tolerate strong heteroskedasticity as we have defined it here, it cannot tolerate arbitrary heteroskedasticity. Simply controlling ‖Dβ*‖_1 is not generally enough to guarantee estimation consistency. In the next section, we make this precise, illustrating that in certain settings, no estimator can achieve consistency without additional constraints.

Lower bounds for mean trend filtering
We now show that the upper bound in Corollary 3.1 is minimax optimal up to logarithmic factors. Consider the observation model where β ∈ R^n is the true signal and ε_i, i ∈ [n] are mean-zero noise terms. For a set S ⊂ R^n, denote its minimax risk for integers k ≥ 0, d ≥ 1, n ≥ (k+1)^d and C_n ≥ 0. Let Lap(µ, σ) denote the Laplace distribution centered at µ ∈ R with scale parameter σ > 0, with density p(x) = (1/(2σ)) e^{−|x−µ|/σ} over R.
Proposition 2. Consider the observation model in (6) where the Ω notation absorbs constants depending only on k, d.
The first term in the bound is due to the null space of D. To derive the second term, we embed an ℓ_1 ball in T^k_{n,d}(C_n) and adapt arguments from Birgé and Massart (2001). The final term is obtained similarly to Sadhanala et al. (2017, Theorem 4), by embedding a Hölder ball of appropriate size in T^k_{n,d}(C_n). The proof is in Appendix C.1. Let us compare the lower bound in Proposition 2 with the upper bound in Corollary 3.1. The Laplace distribution with scale parameter σ is sub-exponential with parameters ν = cσ, b = cσ for some constant c > 0. Plugging C_n = n^{1−α} into the lower bound, and ω = cσ into the upper bound stated in Corollary 3.1, we can verify that the bounds match up to logarithmic factors.
The lower bound in Proposition 2 is for homoskedastic noise. When the noise is heteroskedastic, the estimation can be harder, in the sense that the minimax risk can be larger. Specifically, we can show the following lower bound on a TV class of signals for the Exponential family.
Proposition 3. Assume C_n > 1. Consider the class of signals over a 2-dimensional grid The proof is in Appendix C.2. With the canonical scaling C_n ≍ n^{1−α} = √n, this means a lower bound of Ω(1). In other words, there is no consistent estimator for this class of signals when C_n = Θ(√n). This result also hints at the difficulty of handling various regimes of the noise parameters ν, b.

Algorithmic implementation
In this section, we discuss our algorithmic implementation, focusing on the multivariate setting for the MLE trend filter, for which there are not currently generic procedures. For the Mean Trend Filter, many standard approaches apply immediately since it is a quadratic program. In the one-dimensional case with k = 0, Kim et al. (2009) use a primal-dual interior-point method. Ramdas and Tibshirani (2016) examine a fast ADMM algorithm for k > 0. Wang et al. (2016) develop ADMM and Newton methods for general graphs and arbitrary k. For the MLE trend filter (2), we follow the approach of Khodadadi and McDonald (2019) and use an algorithm called linearized ADMM. A more complete description is given in Appendix D.

[Algorithm 1: Linearized ADMM for the MLE trend filter. Input: y, ψ, D, λ_1 > 0, λ_2 ≥ 0. Repeat the x-update (7), the z-update, and the dual update u ← u + Dx − z until convergence; return z.]

First, rewrite Equation (2) (substituting x for θ) as This is equivalent to (2) but with additional variables. The scaled form of the augmented Lagrangian for this problem is The scaled ADMM algorithm iteratively solves this problem by minimizing over x, then z, and then updating u with gradient ascent. However, the x solution involves a matrix inversion due to the quadratic in Dx, which is best avoided when n is large. So we linearize L_ρ(x, z, u) around the current value x_o, resulting in the following update for x:

x ← argmin_x 1^T ψ(x) − y^T x + ρ(Dx_o − z + u)^T Dx + (µρ/2)‖x − x_o‖_2^2,    (7)

where µ is chosen as the largest eigenvalue of D^T D. To include the null space penalty, the changes only impact the x-update, and (7) is adjusted accordingly with a subgradient of the penalty at x_o (when P_N x_o = 0, choose the subgradient to be 0). The solution for the z-update is easily shown to be given by elementwise soft-thresholding, and the u-update is simply vector addition. Solving the x-update is potentially more challenging. Note that (7) is separable, so we can solve n one-dimensional problems. The KKT stationarity condition requires solving ψ′(x_i) + µρ x_i = c_i, where c_i = y_i + µρ x_{o,i} − ρ[D^T(Dx_o − z + u)]_i. For many functions ψ, the solution has a closed form. The binomial distribution with ψ(x) = log(1 + e^x) is an exception, though standard root-finding methods have no difficulties.
To include the nullspace penalty, the x-update changes slightly, but the logic is the same. This procedure is shown in Algorithm 1. In practice, we have found the algorithm converges quickly when initialized from a small value of λ_1 (because the solution will be close to the MLE) and then run over an increasing sequence of λ_1 values, with the solution at the smaller λ_1 used as a warm start. This is the opposite of most pathwise procedures, which use a decreasing sequence of λ_1.
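A minimal sketch of the linearized ADMM iteration, written by us to illustrate the updates (the paper's Algorithm 1 handles general D and the null space penalty, both omitted here): we take the Poisson case ψ(θ) = e^θ on a 1-d chain with k = 0, solve the coordinatewise stationarity condition e^{x_i} + µρx_i = c_i with a few Newton steps, soft-threshold for z, and add vectors for u. All function and variable names are our own.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def poisson_trend_filter(y, lam, rho=1.0, iters=500):
    """Linearized ADMM sketch for min_x sum(exp(x) - y*x) + lam*||Dx||_1."""
    n = len(y)
    D = np.diff(np.eye(n), axis=0)            # first-difference operator (k = 0)
    mu = np.linalg.eigvalsh(D.T @ D).max()    # step size: largest eigenvalue of D'D
    x = np.log(np.maximum(y, 0.5))            # warm start near the MLE
    z = D @ x
    u = np.zeros(n - 1)
    for _ in range(iters):
        # x-update: linearize the quadratic, then Newton on each coordinate
        # of exp(x_i) + mu*rho*x_i = c_i (monotone, so this converges fast).
        c = y + mu * rho * x - rho * (D.T @ (D @ x - z + u))
        for _ in range(20):
            g = np.exp(x) + mu * rho * x - c
            x -= g / (np.exp(x) + mu * rho)
        z = soft_threshold(D @ x + u, lam / rho)   # z-update: soft-thresholding
        u += D @ x - z                             # u-update: vector addition
    return x

rng = np.random.default_rng(0)
signal = np.repeat([1.0, 3.0], 25)            # piecewise-constant natural parameter
y = rng.poisson(np.exp(signal)).astype(float)
theta_hat = poisson_trend_filter(y, lam=5.0)
```

The fitted natural parameter is far smoother than the node-wise MLE log(y): the total variation of the estimate drops by an order of magnitude on this example.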

Degrees of freedom and tuning parameter selection
We describe an unbiased estimator for the KL divergence between the estimate and the truth for the purposes of tuning parameter selection. Additional justification and a description of its derivation are given in Appendix E. If Y ∼ N(θ*, σ^2 I), a now-common method of risk estimation makes use of Stein's Lemma. The utility of this result comes from examining the decomposition of the mean squared error of θ̂(Y) as an estimator of θ*.
where J denotes the Jacobian. This characterization motivates the definition of degrees-of-freedom, df := (1/σ^2) tr Cov(θ̂(y), y) (Efron, 1986), which reduces to tr(H) for a linear predictor θ̂(y) = Hy. Using Stein's Lemma, assuming σ^2 is known, we have Stein's Unbiased Risk Estimator SURE(θ̂) = ‖y − θ̂(y)‖_2^2 − nσ^2 + 2σ^2 tr J_{θ̂}(y). Note that this is the risk for estimating the n-dimensional parameter θ*. This estimator is appropriate for the mean trend filter, but for the MLE trend filter we prefer the "Stein's Unbiased KL" estimator due to Deledalle (2017), which applies to continuous exponential families.
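To illustrate the Gaussian case, the following sketch (our illustration, not from the paper) checks on a simple moving-average smoother that SURE = ‖y − Hy‖^2 − nσ^2 + 2σ^2 tr(H) is an unbiased estimate of the risk E‖Hy − θ*‖^2, by averaging both over Monte Carlo replications:

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 60, 1.0
theta_star = np.sin(np.linspace(0, 3 * np.pi, n))   # a smooth truth

# A simple linear smoother H: moving average over a 5-point window.
H = np.zeros((n, n))
for i in range(n):
    lo, hi = max(0, i - 2), min(n, i + 3)
    H[i, lo:hi] = 1.0 / (hi - lo)
df = np.trace(H)   # degrees of freedom of a linear smoother

sure_vals, risk_vals = [], []
for _ in range(4000):
    y = theta_star + sigma * rng.normal(size=n)
    fit = H @ y
    sure_vals.append(np.sum((y - fit) ** 2) - n * sigma**2 + 2 * sigma**2 * df)
    risk_vals.append(np.sum((fit - theta_star) ** 2))

print(np.mean(sure_vals), np.mean(risk_vals))  # the two averages nearly agree
```

Crucially, SURE never touches θ*, so minimizing it over a grid of tuning parameters is a data-driven proxy for minimizing the true risk.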
Lemma 3 (Theorem 4.1 in Deledalle 2017). Assume h is weakly differentiable and that θ(Y ) is weakly differentiable with essentially bounded partial derivatives. Then Because ψ(θ * ) does not depend on θ, we can ignore it for the purposes of choosing λ 1 , λ 2 in the MLE trend filter. To evaluate SUKL( θ) we need an expression for J β(y). This is given in the following result (the proof is deferred to Appendix E).
Theorem 4. For the MLE trend filter, the divergence of β̂(y), defined to be the trace of the Jacobian of y ↦ β̂(y) and written tr J_{β̂}(y), is given by where P_{N(D̃)} is the projection onto the null space of D̃, and D̃ contains the rows j of D such that (Dθ̂)_j = 0.
Unfortunately, estimating the risk in this manner is not known to be possible for general discrete exponential families, though a few specific cases are tractable. One such case is the Poisson distribution. The following result more closely resembles an empirical derivative of β̂ than the theoretical expression for J_{β̂}(y) used in the previous results.
Lemma 4 (Theorem 4.2 in Deledalle 2017). Assume Y is Poisson and that θ(y) is weakly differentiable with essentially bounded partial derivatives. Then where e i is the i th standard basis vector, and z is a known function of the true parameter.
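The Poisson identity underlying estimators of this form is E[Y_i g(Y)] = µ_i E[g(Y + e_i)], or equivalently E[Y_i g(Y − e_i)] = µ_i E[g(Y)], which lets terms involving the unknown mean be replaced by observable finite differences. The sketch below is our own numerical check of this identity (it does not reproduce Deledalle's exact PUKL constants) for a simple nonlinear statistic g:

```python
import numpy as np

rng = np.random.default_rng(2)
mu = np.array([2.0, 5.0, 1.0])        # Poisson means
n, reps = len(mu), 200000

# A nonlinear statistic of the counts (an arbitrary choice for illustration).
def g(Y):
    return np.log1p(Y.mean(axis=1))

Y = rng.poisson(mu, size=(reps, n)).astype(float)
e1 = np.array([1.0, 0.0, 0.0])

lhs = np.mean(Y[:, 0] * g(Y - e1))    # E[Y_1 * g(Y - e_1)], observable
rhs = mu[0] * np.mean(g(Y))           # mu_1 * E[g(Y)], needs the true mean

print(lhs, rhs)  # the two averages nearly agree
```

Because the left-hand side involves only the data, quantities like y_i θ̂_i(y − e_i) in Lemma 4 can stand in for the unobservable µ_i E[θ̂_i(y)] when estimating the KL risk.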
With these expressions in hand, we can select the tuning parameters λ 1 , λ 2 with minimal additional computations by minimizing SUKL( θ) or PUKL( θ) as appropriate.

Empirical results
We demonstrate the performance of both the MLE and the Mean trend filter estimators in a small-scale simulation designed to compare the two in challenging settings. We also examine two applications: modeling hospital admissions by age due to COVID-19 in Davis, California; and describing changes in temperature measurements for the Northern hemisphere.

Simulation study
We briefly investigate the relative performance of the Mean Trend Filter and the MLE Trend Filter on a few synthetic examples. Our intention is to push the limits of both, thereby illustrating that the user should choose between the two based on whether smoothness is desired in the mean or in the natural parameter. We focus on one dimension for ease of visualization and k = 1. We examine both the exponential distribution and the Poisson distribution.
To create the true signal, we begin with a v-shaped function on the unit interval: Evaluating this at n equally-spaced points for any n gives a signal with ‖Df_n(x)‖_1 having the canonical scaling of 1/n. For the exponential distribution, we set either θ* or β* equal to f_n(x) and evaluate both the Mean Trend Filter and the MLE Trend Filter on sample data. When θ* is controlled, the mean at x = 0.5 approaches infinity as n grows, making estimation very challenging. The reverse occurs if β* is controlled. For the Poisson, because the mapping from natural parameter to mean is exponential, controlling one does not particularly challenge the opposite procedure with the above f_n. To increase the discrepancy, we use g_n(x) = 0.5 − f_n(x) + log(n). This signal should create more discrepancy between the estimators as n grows, but the results are less dramatic than those in the exponential case. Figure 2 shows estimation accuracy for both trend filters across four different scenarios. In all cases, we generated data using the signals described above for 20 values of n ranging from 20 to 1000, evenly spaced on the logarithmic scale. For each n, we repeated the experiment 10 times. The left column (panel A) shows results for both distributions when the mean is smooth (the mean is given by the smooth functions above), and error is measured using the mean-squared error between the estimate and the truth. In the exponential case, the mean trend filter is slightly more accurate for larger n, but the overall error also decreases with n since the problem is becoming easier. In the Poisson case, the estimates (and therefore their errors) are nearly the same. The right column (panel B) shows results when the natural parameter is smooth.
Here, for both distributions, the MLE trend filter performs better (as measured by KL divergence), but the difference is again more pronounced for the exponential distribution. Figure 3 shows all the estimates for all four scenarios when n = 104. In the left two panels, for the exponential distribution, it is clear that whether the mean or natural parameter is smooth makes a substantial difference for the accuracy of the estimator. For the Poisson case (right two panels), there is much less discrepancy. In the case that the mean is smooth, both estimators appear relatively poor, though the MSE remains small in both cases. The reason is that the mean and the variance are the same, and both nearly constant. The difficulty is further exacerbated due to the discreteness of the data and only a small handful of values with non-negligible probability. Therefore, this setting is actually quite challenging. For context, on the typical dataset, the average absolute difference between observations at neighbouring points is about 2.5 compared with a 0.01 change in the signal.
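The exponential-family data-generating step of this simulation can be sketched as follows. This is our illustration: the paper's exact v-shaped f_n is not reproduced here, so we use a representative stand-in f_n(x) = |x − 1/2| + 1/n, and we place θ* on the negative half-line (θ* = −f_n) so that the exponential natural parameter is in its valid domain.

```python
import numpy as np

rng = np.random.default_rng(3)

def f_n(n):
    # Representative v-shaped signal on the unit interval (our stand-in
    # for the paper's f_n; the small offset keeps it strictly positive).
    x = np.linspace(0.0, 1.0, n)
    return np.abs(x - 0.5) + 1.0 / n

n = 200
sig = f_n(n)

# Scenario 1: the MEAN is smooth -- exponential data with beta* = f_n.
y_mean_smooth = rng.exponential(scale=sig)

# Scenario 2: the NATURAL PARAMETER is smooth -- theta* = -f_n (< 0),
# so the mean beta* = -1/theta* = 1/f_n spikes near x = 0.5 and grows
# with n, which is what makes mean estimation so challenging there.
y_theta_smooth = rng.exponential(scale=1.0 / sig)
```

Under this stand-in, the peak mean in scenario 2 scales like n near x = 0.5, mirroring the paper's observation that controlling θ* makes the mean blow up at the center of the interval.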

Example applications
We apply our estimators to two real-world datasets for illustrative purposes. The first examines Poisson trend filtering for estimating the age-time hospitalization rates due to COVID-19 in the University of California system. The second estimates the instantaneous temperature variability over the Northern hemisphere from publicly available observations.

UC COVID-19 hospitalization data
We analyzed the COVID-19 hospitalization rate within five hospitals in the University of California system: UC Davis, UC Los Angeles, UC Irvine, UC San Diego, and UC San Francisco. The data is based on 4,730 patients, all 18 years or older, who were admitted between February 12, 2020 and January 6, 2021. We aggregate the hospitalization counts at the weekly level (48 weeks in total) and by age (in 15 bins of 5 years each). This results in noisy and sparse hospitalization counts at the week-by-age level, with an average count per bin of 6.57. The data was obtained from the authors of Nuño et al. (2021), where a more comprehensive analysis is performed. It is used under a data use agreement and has not been made available to the public due to privacy concerns. We apply k = 1 trend filtering with the Poisson exponential family in 2 dimensions to the COVID-19 hospitalizations. We tune the λ parameter by minimizing PUKL(θ̂). The results appear in Figure 4, with the smoothed version on the left. Due to the low average count per bin, trends in hospitalization rate are much more clearly visible after applying trend filtering. We have marked the local maxima in the smoothed signal, which produces only 4 points; this would not have been possible in the raw data.
Some broad trends are clearly visible from Figure 4. First, we can see two distinct waves for COVID-19 hospitalizations in summer 2020 and winter 2020-2021. Moreover, we can see that the highest hospitalization rates within the summer 2020 wave are among those aged 50-65, while in the winter 2020-2021 wave the highest rates are both within the 50-65 age range but also the 80+ age range. This suggests that the age distribution is not stationary, and changes with successive waves. This may be due to a number of factors, such as behavioral shifts and holiday effects.  Figure 4: Estimated daily hospitalization rate due to COVID-19 by 5 year age group and week in five UC hospitals. We apply k = 1 trend filtering with the Poisson exponential family (left) to the raw count data (right).

Temperature variability
Trends in temperature variability (rather than in the mean) have direct implications for plant and animal life (Huntingford et al., 2013), because changes in variability also impact the probability of extreme weather events (Vasseur et al., 2014). Hansen et al. (2012) and Huntingford et al. (2013) suggest that adaptation to extremes is more difficult than to gradual increases in the mean temperature. Nevertheless, research examining trends in the volatility of spatio-temporal climate data is relatively scarce. Hansen et al. (2012) studied changes in the standard deviation (SD) of surface temperatures at each spatial location relative to that location's SD over a base period and showed that these estimates are increasing. Huntingford et al. (2013) took a similar approach for a different data set. They argued that, while there is an increase in the SDs from 1958-1970 to 1991-2001, it is much smaller than found by Hansen et al. (2012). Huntingford et al. (2013) also computed the time-evolving global SD from the detrended time series at each position and argued that the global SD has been stable. The first row in Figure 5 shows the change in mean temperature averaged over the winter and summer months separately in the 1960s relative to the 2000s, using the ERA 20C dataset (Poli et al., 2016). It shows strong increases in average temperatures in both periods over the majority of the hemisphere. The second row shows the estimated standard deviations from the KL trend filter over the same period. We use k = 1 in the temporal dimension and k = 2 spatially. These estimated SDs are then averaged over the two periods for summer and winter separately, and we plot the difference. There is a slight decrease in the SD during the summer and a more pronounced poleward decrease during the winter, with the exception of Siberia, which shows a dramatic increase over both periods.
To further examine the effect of increasing mean and decreasing standard deviation, we look at the temperature distribution over both periods for Toronto, Canada (circled on both maps). As Figure 6 makes clear, the distributions for both summer and winter have shifted toward higher temperatures in 50 years. At the same time, consistent with Figure 5, the standard deviation has declined, especially in winter, resulting in the loss of "colder" days; this phenomenon is most pronounced in the winter. Thus, there are far fewer cold days (temperatures between −10°C and −20°C) in the 2000s than in the 1960s.

Discussion
We studied estimation error bounds for two estimators with a trend filtering penalty on grid graphs. One estimator minimizes squared distance from the mean, and the other maximizes the log-likelihood. The bounds are more involved than in, say, the homoskedastic sub-Gaussian noise case, because many more parameters influence the estimation error. We illustrated the bounds in several interesting regimes of signals with heteroskedastic and homoskedastic noise. We analyzed two datasets with our models, showing the applicability of our methodology to real-world problems. We showed that both estimators achieve minimax optimal error rates in some scenarios, though addressing all cases remains for future work.
Because our analysis covers the entire class of observations corrupted by sub-exponential noise, the result is a general bound on the error for all exponential families. But this is a large class, and far from the only way to study the estimation error. More specific analyses in specific cases will likely yield sharper bounds. For example, van de Geer (2020) obtains sharper rates for the Bernoulli family, and Brown et al. (2010) examine a set of six families where the variance can be written as a quadratic function of the mean. However, those analyses are much less comprehensive than ours.
Other possible extensions are "mixed" losses and penalties. One could penalize the mean parameter combined with the likelihood loss, or the opposite. Preliminary investigations into the first case revealed similar issues as with the penalty on the natural parameter, namely an inability to control the error in the null space of D. Another natural avenue for future work would note that all of these estimators (those examined here and the mixed versions) have connections to state space models in time series. So the relationship between trend filtering and Kalman-type filters may yield new theoretical insights and computational algorithms.

A Proofs for preliminary results

A.2 Subexponential parameters for some standard distributions
For a Poisson random variable X with mean µ, note that for s ∈ R, Ee^{s(X−µ)} = e^{µ(e^s−s−1)}. Therefore Ee^{s(X−µ)} ≤ e^{µs^2} for s satisfying e^s − 1 − s ≤ s^2. Let s* be the non-zero solution to e^x = 1 + x + x^2. Then s* ≈ 1.793. From this, we can show the parameters given in Table 4. For the exponential distribution, we can do a similar calculation to get the results in Table 4. For an exponential variable X with mean µ and s < 1/µ, Ee^{s(X−µ)} = e^{−µs}/(1 − µs).
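The constant s* above can be checked numerically (our sketch): it is the nonzero root of e^s = 1 + s + s^2, and on |s| ≤ s* the exponent µ(e^s − s − 1) of the Poisson MGF is dominated by µs^2.

```python
import numpy as np
from scipy.optimize import brentq

# Nonzero root of e^s = 1 + s + s^2 (s = 0 is always a root, so we
# bracket away from zero; the difference changes sign on [0.5, 3]).
s_star = brentq(lambda s: np.exp(s) - (1 + s + s * s), 0.5, 3.0)
print(round(s_star, 3))  # 1.793

# Verify the MGF bound exp(mu*(e^s - s - 1)) <= exp(mu*s^2) on |s| <= s*.
mu = 2.0
s = np.linspace(-s_star, s_star, 1001)
assert np.all(mu * (np.exp(s) - s - 1) <= mu * s**2 + 1e-12)
```

For s < 0 the inequality e^s − 1 − s ≤ s^2 holds automatically, so s* only constrains the positive side, matching the table's parameter choices.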
We can verify that X − µ is sub-exponential with parameters (ν^2, b) given in Table 4. To arrive at these parameters, we set b = 2µ and find ν^2 of the form cµ^2 for a constant c such that Ee^{s(X−µ)} ≤ e^{ν^2 s^2/2} for |s| ≤ 1/b. In a similar fashion, one can also verify the sub-exponential parameters for the χ^2 distribution specified in the bottom row of the table.

A.3 Some properties of sub-exponentials

Tail bounds on linear combinations of sub-exponentials
We use the following exponentially decaying tail bound for sums of sub-exponential variables at multiple places in our proofs.
Lemma 5. Let ν and c be vectors such that ε_i is sub-exponential with parameters (ν_i, c_i). Given a matrix A ∈ R^{n×r}, assume we have K and H such that sup_{i=1,…,r} ‖ν ⊗ A_i‖_2 ≤ K and sup_{i=1,…,r} ‖c ⊗ A_i‖_∞ ≤ H, where A_1, …, A_r are the columns of A. Then The proof is similar to that of the Bernstein inequality in Theorem 2.8.1 of Vershynin (2018).
Proof of Lemma 5. We have Note that A T i is mean zero with parameters ( ν ⊗ A i 2 , c ⊗ A i ∞ ). This is because by independence of j . When |s| < 1 a ij c j for all j, which is satisfied when |s| < Therefore, using the Chernoff bound, we have ..,r v i ⊗c ∞ , which we minimize in s to get our bound. This is intractable, so we require ν ⊗ A i 2 ≤ K and c ⊗ A i ∞ ≤ H for all i ∈ [r]. We then have Minimizing in s, for |s| < 1 H , we have s = t/K 2 or 1/H depending on which is smaller. Therefore, We state a few convenient ways of using Bernstein's tail bound inequality on linear combinations of sub-exponential random variables. Denote the sub-exponential tail bound function with probability at least 1 − 2e −u for u ≥ 1.
Lemma 6. Let i be independent, mean-zero, sub-exponential variates with parameters (ν 2 i , b i ) for i ∈ [n]. Let a ∈ R n be a fixed vector. Let φ be the sub-exponential tail bound function defined in (8). Then for t ≥ 0, Further, if ν = b, then for t ≥ 1, with probability at least 1 − 2e −t , both the following hold: Proof of Lemma 6. (10) follows by applying Bernstein's inequality from Theorem 2.8.1 in (Vershynin, 2018). The inequalities (11), (12) follow from (10) by applying Hölder's inequality to the first parameter in different ways. From (9), observe that for t ≥ 1, . By applying Hölder's inequality in two different ways we get the high probability bounds (13) and (14).
Proof of Lemma 7. Denote X m+j = −X j for j ∈ [m]. By union bound, for u > 0, Set u = 2ω(log 2m + t) to get the desired bound.

B Proofs of upper bounds B.1 Proof of Theorem 1
We first state a basic inequality.
Lemma 8 (Basic inequality). Let R be as defined in Section 2.1 and let θ be the estimate in (3). Then, Further, this inequality is true if we replace θ with θ t = t θ + (1 − t)θ * for any t ∈ [0, 1].
Proof of Lemma 8. Optimality of θ̂ and the equality R_n(θ) − R(θ) = −(1/n)ε^T θ give This is equivalent to the main statement in the lemma. The inequality for θ̂_t follows from the fact that θ ↦ R_n(θ) + λ_1‖Dθ‖_1 + λ_2‖P_N θ‖_2 is convex.
Finally, when d > 4 then α = 2/d < 1/2 and Next we show that the example signal satisfies the necessary conditions.
Consider the Poisson distribution where the natural parameter vector θ* is constrained. For Then the mean vector is Because the distribution is Poisson, ‖b‖_∞ is constant while ν_i^2 = 2β_i^* (see Table 1). Thus, ‖ν‖_∞ = √2 e^{d/2}, which is achieved at i = (0, …, 0). The canonical scaling ‖Dθ*‖_1 ≍ n^{1−α} holds with k = 1 because there are on the order of N^{d−1} points at which the Laplacian is non-zero, and they are on the order of 1/N.

B.3 Uniform risk bound with null space penalty
Proof of Proposition 1. From the definitions of R, R n , Applying Lemma 9 with J = [k + 1] d , we get , with probability at least 1 − 4nde −t , for t ≥ 1. Here κ = (k + 1) d and we used the fact that m < dn. By definition of Θ, θ should satisfy Dθ 1 ≤ c n n 1−α and P N θ n ≤ a n . Therefore, | T θ| ≤ A n a n √ n + B n c n n 1−α From the assumptions ν ∞ , b ∞ ≤ c, we can write A ≤ 2tµc √ κ. From Lemma 11, for p ≥ 1, L p ,p ≤ c 1 n (pα−1) + (log n) 1{pα=1} . This yields the following bound on B n : Therefore, with probability at least 1 − 4nde −t , 1 n | T θ| ≤ c 2 tc a n n − 1 2 + c n γn −α n (α− 1 2 ) + = c 2 tc a n n − 1 2 + c n γn − min{α, 1 2 } for a constant c 2 depending only on k, d. This is sufficient to show the desired bound.

B.4 Proof of Theorem 2
Proof of Theorem 2. Writing the KKT conditions, θ̂ and β̂ are solutions to (2) and (4) iff where S(u) is the set of subgradients of x ↦ ‖x‖_1. S(u) depends only on sgn(u). As ψ′ is a strictly increasing function, for any a, b ∈ R, sgn(ψ′(a) − ψ′(b)) = sgn(a − b). Therefore sgn(Dψ′(θ̂)) = sgn(Dθ̂), and hence the subgradient sets S(Dψ′(θ̂)) = S(Dθ̂) coincide. Plugging this into (15), we see that the KKT conditions for the least squares problem are satisfied by ψ′(θ̂), and therefore it is a solution to the least squares problem (4). The solution to the least squares optimization problem (4) is unique because the objective is strictly convex. Therefore, by definition of β̂, β̂ = ψ′(θ̂).

B.5 Proof of Theorem 3
Proof of Theorem 3. The proof follows the strategy of Theorem 6 in Wang et al. (2016). Abbreviate δ = β̂ − β*. From the optimality in the definition of β̂, Rearranging and substituting y = β* + ε, We bound the empirical process term on the right-hand side using Lemma 9: for t ≥ 1 and J ⊂ [N]^d, the following holds with probability at least 1 − 2(m + |J|)e^{−t}: . Applying Young's inequality to the first term and setting λ ≥ B_n, We used the triangle inequality on the penalty terms to get the second line. Canceling terms, This bound holds with probability at least 1 − 2(m + |J|)e^{−t} ≥ 1 − 4nde^{−t}, and so the proof is complete.
B.6 Proofs of Corollaries to Theorem 3

where t = log n. Compared to the bound in Theorem 3, additional log n factors are incurred when translating from the high-probability statement to O_P notation. B_n can be bounded more explicitly by writing down bounds for L_{J,1}, L_{J,2} using Lemma 11. For r ∈ [1, N√d], we can write and where γ_p = log^{1/p}(n) if pα = 1 and 1 otherwise.
If α > 1/2 and σ 2 /σ ∞ √ n/ log n, then Simultaneously, if α > 1/2, In some situations we can get improved results using (23), particularly in situations when σ σ ∞ . This can happen for the Poisson family when the signal β * is dominated by a few components.
Proof of Corollary 4.1. Throughout let t = log n. Start from the bound (16): In the case α ≤ 1/2, set J = [k + 1] d and recall the bound (20) for B n . This gives the desired result in this case. In the other case of α > 1 2 , we prove the bounds (22) and (23) now. Set J = {i : (i − k − 2) + 2 < r} for an r that we choose later.
Bound (22). Recall from (17) that where we get the inequality by taking only the first term of the inner minimum. Plug in the bounds for the L terms from (18) and (19) to write Plug this back into (16) to get For α ≠ 1, when possible we will choose r ∈ [1, N√d] such that which is equivalent to Selecting this r when possible gives the bound in (22), and the assumption √n σ_∞/(tσ^2) ≳ 1 ensures that we are not choosing an impossibly small r. When α = 1, we can retrace the argument with the additional γ_1 factor in (24) to get the bound.
Bound (23). When α ≤ 1, set J = [k +1] d to get the stated bound. Now consider α > 1. Simplify (17) by taking only the second term of the minimum, plug the bound for B n in (16) to get When possible we will choose r ∈ [1, N √ d] to balance the two terms above, that is, This choice of r gives the desired bound. Our assumption that n σ 2 t 2 1 makes sure that this choice of r is not impossibly small. This completes the proof.
Proof of Corollary 3.3. This is a direct result of Corollary 4.1, simplifying the cases.

B.7 Error rates assuming that the estimate is bounded
Consider the penalized maximum likelihood estimator (MLE) The minimum may not be achieved at an interior point of the domain. In that case, we set θ̂ to a limit point of a sequence on which the objective converges to the infimum. If we assume that θ̂ in (25) is constrained in such a way that ψ′′(θ̂) is bounded away from 0, then the error-bounding analysis essentially reduces to that in the Gaussian family case. Consider the constrained estimator where Θ(K) = {θ ∈ R : ψ′′(θ) ≥ 1/K} for some K > 0. Assume that Θ(K) is a convex set for any K > 0. This can be verified for the Poisson, exponential, and logistic families. Suppose θ̃ is the best approximation of θ* within Θ(K)^n. Also define β̃ = ∇ψ(θ̃). Then the constrained estimator in (26) satisfies the following error bound.
Proposition 4. Let y i = β * i + i where i is zero mean sub-exponential with parameters (ν 2 i , b i ) for i ∈ [n]. Let L J,p be as defined in (5) for J ⊂ [N ] d , p ≥ 1. Abbreviate A n = µ |J| n ν 2 ∨ b ∞ log n, B n = (min { ν ∞ L J,2 , ν 2 L J,1 } ∨ b ∞ L J,1 ) log n. Then the estimator (26) with λ = Bn n , satisfies The proof is below. We choose J ⊂ [N ] d to minimize the bound. If we set K = 1/v min where v min = min i∈[n] ψ (θ * i ), then θ = θ * , β = β * and the above bound reads Proof of Proposition 4. Similar to the argument in Theorem 3, from the optimality of θ, we have the basic inequality, To lower bound the left hand side, we see that In the above display, the first inequality holds because both θ, θ ∈ Θ(K) n and Θ(K) n is convex.
for some u i between θ i and θ i . As Θ(K) is convex and u i lies between θ i and θ i , we should have u i ∈ Θ(K) and so ψ (u i ) should be at least 1/K.) The second inequality follows from the fact that 2ab ≥ −ca 2 − 1 c b 2 , for any a, b, c ∈ R with c > 0. Applying this to half of the left hand side of (27), Rearranging, By Lemma 9, for t ≥ 1 and J ⊂ [N ] d , the following holds with probability at least 1−2(m+|J|)e −t , . The sum of the first two terms on the right hand side can be bound by completing squares: Plug this into the bound in the previous display to get The argument from here is similar to that in the proof of Theorem 3.

B.8 Empirical process bound
corresponding to its ℓth smallest eigenvalue. For J ⊂ [N]^d, let V_J denote the n × |J| matrix formed by picking the columns of V corresponding to J, and let P_J = V_J V_J^T be the projection matrix onto the span of those columns.
Lemma 9. Let J ⊂ [N]^d and L_{J,p} be as defined in (5), and let m be the number of rows in D. For any J ⊂ [N]^d containing [k + 1]^d and any t ≥ 1, with probability at least 1 − 2(m + |J|)e^{−t}, the following holds uniformly over all θ ∈ ℝ^n, where we applied Hölder's inequality to each of the two terms separately. We give high-probability bounds for ‖P_J ε‖_2 and ‖(D†)^T(I − P_J)ε‖_∞ separately; a union bound then yields the stated result.
Bounding ‖P_J ε‖_2. This holds with probability at least 1 − 2e^{−t} for any t ≥ 1, by the incoherence property. By a union bound over j ∈ J, for any t ≥ 1, the resulting bound holds with probability at least 1 − 2|J|e^{−t}.
Bounding ‖(D†)^T(I − P_J)ε‖_∞. Rewrite this term as max_{j∈[m]} |⟨g_j, ε⟩|, where g_j = (I − P_J)D†e_j for j ∈ [m] and m is the number of rows in D. From Lemma 6, one can deduce that max_{j∈[m]} |⟨g_j, ε⟩| obeys the displayed bound with probability at least 1 − 2me^{−t} for t ≥ 1. Observe that ‖b ∘ g_j‖_∞ ≤ ‖b‖_∞ ‖g_j‖_∞ and ‖ν ∘ g_j‖_2 ≤ min{‖ν‖_∞ ‖g_j‖_2, ‖ν‖_2 ‖g_j‖_∞}, where ∘ denotes the elementwise product.
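The two elementwise-product norm bounds above are elementary; a small numerical sketch (random vectors, illustrative dimensions) confirms both inequalities:

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(100):
    nu = rng.standard_normal(50)
    g = rng.standard_normal(50)
    # ||nu o g||_2 <= min{ ||nu||_inf ||g||_2 , ||nu||_2 ||g||_inf }
    lhs = np.linalg.norm(nu * g)
    bound = min(
        np.linalg.norm(nu, np.inf) * np.linalg.norm(g),
        np.linalg.norm(nu) * np.linalg.norm(g, np.inf),
    )
    assert lhs <= bound + 1e-12
```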
Therefore, substituting the bounds on ‖g_j‖_2 and ‖g_j‖_∞ from Lemma 10, we get the stated bound with probability at least 1 − 2me^{−t}.
Lemma 10. Define g_j = (I − P_J)D†e_j for j ∈ [m], where m is the number of rows in D. Then the displayed bounds hold for all j ∈ [m].

Proof of Lemma 10. Let Σ̄ ∈ ℝ^{m×n} denote the diagonal matrix with Σ̄_{i,i} = ξ_i for i ∈ J and 0 otherwise, and let Σ̃ = Σ − Σ̄, which is also diagonal and m × n. Then we can write the displayed decomposition. The sole inequality in the above display follows from the incoherence property of U. This establishes the upper bound on the ℓ_2 norms of g_j, j ∈ [m]. For the ℓ_∞-norm bound, we write the displayed estimate using Hölder's inequality. Because every entry of V is at most μ/√n, and by the incoherence property of U, the stated ℓ_∞ bound follows.

B.9 Eigenvalue bounds
Lemma 11. Let {ξ_i² : i = (i_1, …, i_d) ∈ [N]^d} be the eigenvalues of D^T D, where D = D^{(k+1)}_{n,d}, and let p ≥ 1 and α = (k + 1)/d. Then, for large enough n, the first displayed bound holds, where c > 0 is a constant depending only on k and d. In the case pα > 1, for any

Proof of Lemma 11. This is a generalization of Lemma 6 in Sadhanala et al. (2021), which states the bound only for p = 2. In their proof, changing the power applied to the singular values in the summation to a general p ≥ 1 yields (a) the bound in the second display and (b) a bound slightly weaker than the first display, where we used Lemma 12 for the second inequality. In the case pα ≤ 1, this and (28) are sufficient to prove the lemma.
Lemma 12. For k ≥ 1 and N > 2k + 2, the smallest eigenvalue of D^{(k)}_{N,1}(D^{(k)}_{N,1})^T satisfies the displayed bound.

Proof. Case: k is odd. Write G = D^{(k)}_{N,1} = D_{N,1} L^{(k−1)/2}, where L is the graph Laplacian of a chain of length N. Note that, for odd k, G^T G = L^k. The sets of nonzero eigenvalues of GG^T and G^T G are the same. We know that λ_1(L) = 0 and λ_2(L) > 0, so λ_1(G^T G) = 0 and λ_2(G^T G) > 0, while GG^T has full rank. Therefore, λ_1(GG^T) = λ_2(G^T G) = λ_2(L)^k. Plugging in λ_2(L) = 4 sin²(π/(2N)) and using the inequality sin x ≥ x/2 for x ∈ [0, π/2], we have λ_1(GG^T) ≥ c/N^{2k}. As λ_1(D^{(k)}_{N,1}(D^{(k)}_{N,1})^T) = λ_1(GG^T), the bound follows for odd k.

Case: k is even. Apply Lemma 13 to get the bound in this case.
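The spectral facts used in the odd-k case can be checked numerically. The sketch below (taking D_{N,1} to be the usual first-difference matrix, so that D^T D is the chain Laplacian) verifies λ_1(L) = 0, λ_2(L) = 4 sin²(π/(2N)), and λ_1(GG^T) = λ_2(L)^k for an odd k:

```python
import numpy as np

N = 50
# First-difference operator D_{N,1}: (N-1) x N, rows (..., -1, 1, ...).
D = np.diff(np.eye(N), axis=0)
L = D.T @ D  # graph Laplacian of a chain of length N

evals = np.sort(np.linalg.eigvalsh(L))
assert abs(evals[0]) < 1e-10                                  # lambda_1(L) = 0
assert np.isclose(evals[1], 4 * np.sin(np.pi / (2 * N))**2)   # lambda_2(L)

# For odd k, G = D_{N,1} L^{(k-1)/2} satisfies G^T G = L^k, and the
# smallest eigenvalue of G G^T equals lambda_2(L)^k.
k = 3
G = D @ np.linalg.matrix_power(L, (k - 1) // 2)
gg = np.sort(np.linalg.eigvalsh(G @ G.T))
assert np.isclose(gg[0], evals[1]**k)
```

Since λ_2(L) ≍ 1/N², the last assertion illustrates the claimed λ_1(GG^T) ≥ c/N^{2k} scaling.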
Lemma 13. Let λ_i(A) denote the ith smallest eigenvalue of A. For k ≥ 1 and N > 2k + 2, the displayed bound holds.

Proof. Let L_m denote the Laplacian of the cycle graph with m vertices. Its smallest nonzero eigenvalue is 4 sin²(π/m), and its eigenvectors are given by (v_ℓ)_j = e^{2πiℓj/m}. Let u ∈ ℝ^N be the eigenvector of (D^{(2k)}_{N,1})^T D^{(2k)}_{N,1} corresponding to its (2k + 1)th smallest eigenvalue. By Lemma 14, there exists a v ∈ ℝ^{2N−2} satisfying the following properties. With such a v, the displayed chain holds: the equality holds by the definition of u, and the three inequalities follow in order from the three properties satisfied by v above. This suffices to complete the proof, because λ_2(L_{2N−2}) = 4 sin²(π/(2N−2)).
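The cycle-graph facts invoked above are easy to verify numerically. This sketch builds the circulant Laplacian of a cycle and checks that its spectrum is {4 sin²(πj/m) : j = 0, …, m − 1}, with smallest nonzero eigenvalue 4 sin²(π/m):

```python
import numpy as np

m = 40
# Laplacian of the cycle graph on m vertices: 2 on the diagonal,
# -1 on the two adjacent off-diagonals, wrapping around.
A = np.roll(np.eye(m), 1, axis=1) + np.roll(np.eye(m), -1, axis=1)
L = 2 * np.eye(m) - A

# Eigenvalues of the circulant Laplacian: 4 sin^2(pi j / m), j = 0, ..., m-1.
predicted = np.sort(4 * np.sin(np.pi * np.arange(m) / m)**2)
computed = np.sort(np.linalg.eigvalsh(L))
assert np.allclose(computed, predicted)

# Smallest nonzero eigenvalue is 4 sin^2(pi / m).
assert np.isclose(computed[1], 4 * np.sin(np.pi / m)**2)
```

Note the nonzero eigenvalues come in pairs (j and m − j coincide), which is consistent with the complex-exponential eigenvectors (v_ℓ)_j = e^{2πiℓj/m}.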
∆ and ∆^{−1}: Define the following truncated discrete difference operator: ∆u = (0, (D_{N,1}u)_1, (D_{N,1}u)_2, …, (D_{N,1}u)_{N−2}, 0) for u ∈ U, so that ∆ : U → U. We can write the operator as displayed. Then we can construct the inverse as the following truncated discrete integral. Let u ∈ U, define z by the displayed cumulative sum operator, and note that z_1 = z_N = 0. Then ∆z = u for u ∈ U; to see this, let i = 2, …, N − 1 and compute as displayed.

Constructing v: Construct ū ∈ ℝ^N as displayed. As v = ext(ū − p) and ū − p ∈ U, we get ⟨v, 1⟩ = 0 from the definition of ext. Again by the definition of ext, ‖v‖_2² = 2‖ū − p‖_2². Write ū − p = u + (ū − u − p) and note that u ⊥ N(D^{(2k)}_{N,1}). Therefore v satisfies all three properties stated in the lemma.

Lemma 15. Let U = {u ∈ ℝ^N : u_1 = u_N = 0}. Let ext(u) denote the periodic extension of u ∈ U, defined by v ∈ ℝ^{2N−2} where v_{1:N} = u and v_{N+i} = −u_{N−i} for i = 1, …, N − 2. Let L and ∆ be as defined in Lemma 14 and (29), respectively. Then (L^k ext(u))_{1:N} = ∆^k u for u ∈ U.
With this choice of c_0 and the lower bound in (36), we can apply Theorem 2.5 of Tsybakov (2009) to obtain the bound in (33).
Lemma 18. Suppose n ≥ 6 and that q = 1 or q is even with q ≤ n/3. Then for r > 0, the minimax risk of B(r, q) defined in (30) satisfies

n · R_M(B(r, q)) ≥ (q/12) min{ r², (σ²/3) log(en/(8q)) ∨ (σ²/36) log²(en/(8q)) }.

Proof of Lemma 18. We will show a slightly stronger bound:

n · R_M(B(r, q)) ≥ (q/12) ( r ∧ σ g^{−1}( (1/6) log(en/(8q)) ) )²,

where g(x) = e^{−x} + x − 1 for x ≥ 0. From this and Lemma 17, we get the bound in Lemma 18. The proof is adapted from that of Theorem 5 in Birgé and Massart (2001) for the Gaussian error model. We use Fano's lemma from information theory.
Lemma 21 (Generalized Stein Lemma; Eldar, 2009). Assume θ̂(y) is weakly differentiable in y with essentially bounded weak partial derivatives on ℝ^n. Let Y be distributed according to a natural exponential family and assume that the base measure h is weakly differentiable. Then the displayed identity holds. Here ∇h(Y) denotes the vector with entries (d/dy)h(y)|_{y=y_i}, and h(Y) the vector with entries h(y_i).
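In the Gaussian special case, where the base measure h is the standard normal density so that ∇h(y)/h(y) = −y, Lemma 21 reduces to the classical Stein identity E[Y g(Y)] = E[g'(Y)] for Y ~ N(0, 1). A Monte Carlo sketch (with the illustrative choice g = tanh) confirms this numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.standard_normal(1_000_000)

# Classical Stein identity for Y ~ N(0, 1): E[Y g(Y)] = E[g'(Y)].
lhs = np.mean(y * np.tanh(y))        # E[Y tanh(Y)]
rhs = np.mean(1 - np.tanh(y)**2)     # E[tanh'(Y)] = E[1 - tanh^2(Y)]
assert abs(lhs - rhs) < 0.01
```

Using the same sample for both sides keeps the two Monte Carlo estimates correlated, so the tolerance can be tight relative to the sampling noise of either side alone.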
Therefore we define the Generalized SURE (Eldar, 2009) along the lines of the multivariate Gaussian case.
We have the displayed decomposition. Now, the first term is a function of the data only, and to the last term we simply apply Lemma 21. For the second term, we apply Lemma 21 twice along with the quotient rule.
However, we would prefer to estimate the Kullback–Leibler divergence between the density under θ = θ̂(y) and that under θ = θ*. For exponential families, this divergence has the displayed form, and an application of Lemma 21 provides an unbiased estimator of this quantity. The result is given in Lemma 3 in the main body.
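For a natural exponential family, the KL divergence between the densities at natural parameters θ_1 and θ_2 is the Bregman divergence of ψ: KL(θ_1 ‖ θ_2) = ψ(θ_2) − ψ(θ_1) − ψ'(θ_1)(θ_2 − θ_1). A sketch verifying this against the closed-form Poisson KL (with θ = log λ and ψ(θ) = e^θ):

```python
import numpy as np

def kl_bregman(theta1, theta2, psi, psi_prime):
    """KL(p_theta1 || p_theta2) for a natural exponential family,
    as the Bregman divergence of the log-partition function psi."""
    return psi(theta2) - psi(theta1) - psi_prime(theta1) * (theta2 - theta1)

# Poisson: theta = log(lambda), psi(theta) = exp(theta), psi'(theta) = exp(theta).
lam1, lam2 = 3.0, 5.0
kl = kl_bregman(np.log(lam1), np.log(lam2), np.exp, np.exp)

# Closed form: KL(Pois(l1) || Pois(l2)) = l1 log(l1/l2) - l1 + l2.
closed_form = lam1 * np.log(lam1 / lam2) - lam1 + lam2
assert np.isclose(kl, closed_form)
```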
Finally, we conclude this section with the proof of Theorem 4.