Concentration inequalities for non-causal random fields

Concentration inequalities are widely used for analyzing machine learning algorithms. However, current concentration inequalities cannot be applied to some non-causal processes which appear, for instance, in Natural Language Processing (NLP). This is mainly due to the non-causal nature of the data involved, in the sense that each data point depends on the neighboring data points around it. In this paper, we establish a framework for modeling non-causal random fields and prove a Hoeffding-type concentration inequality. The proof of this result is based on a local approximation of the non-causal random field by a function of a finite number of i.i.d. random variables.

MSC2020 subject classifications: Primary 60G60, 60G48, 60E15; secondary 68Q32, 62M45.


Introduction
Concentration inequalities are widely used in statistical learning. For example, model selection techniques rely heavily on concentration inequalities [28]. They have also been used for high-dimensional procedures [1,7] or for studying various machine learning frameworks, such as time series forecasting [25], online machine learning [33] or classification problems [19].
Many concentration inequalities have been proposed under different scenarios and assumptions. The simplest case corresponds to the independent and identically distributed (i.i.d.) hypothesis. Interested readers may consult [9] for an overview of concentration inequalities in this case.
However, in many machine learning applications, the i.i.d. assumption does not hold. This is the case, for instance, for time series, which exhibit inherent temporal dependence.
The simplest models of dependent time series are Markov chains. A Markov chain is a sequence of random variables (X_t) verifying

    X_t = F(X_{t−1}, ε_t),    (1.1)

for a function F and i.i.d. innovations (ε_t)_{t∈Z}.
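As a toy illustration (not taken from the paper), the recursion (1.1) is easy to simulate once F is contracting; `simulate_markov_chain` and the AR(1)-type map below are illustrative names, and the Gaussian innovations are chosen for concreteness:

```python
import random

def simulate_markov_chain(F, n, burn_in=500, seed=0):
    """Simulate X_t = F(X_{t-1}, eps_t) with i.i.d. Gaussian innovations.
    A burn-in period is discarded so the retained path is close to stationary."""
    rng = random.Random(seed)
    x = 0.0
    path = []
    for t in range(burn_in + n):
        eps = rng.gauss(0.0, 1.0)
        x = F(x, eps)
        if t >= burn_in:
            path.append(x)
    return path

# A contracting example: AR(1) with coefficient |a| = 0.5 < 1.
path = simulate_markov_chain(lambda x, e: 0.5 * x + e, n=1000)
```

The burn-in mimics forgetting the arbitrary initial condition, which is exactly what the contraction property guarantees.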
However, Markov chains are not sufficient to model any type of dependence. Some data may exhibit non-causal dependence. In the uni-dimensional case, it means that one data point does not merely depend on past data points, but also future data.
For example, this situation occurs for textual data. Text is a typical example of data generated by a non-causal process, because the dependence between words goes forward, but also backward. A key task involving textual data is the completion problem. It consists in filling blanks in a text using the surrounding words. This task is achieved by creating a language model, i.e., a probability distribution of words for a given context (past and future words).
In practice, non-causal models are already used to learn language models. Notable examples of such models are bidirectional neural networks [34]. They have recently received a lot of attention for their performance in Natural Language Processing. In particular, the BERT model [21] has become a staple for a very large range of NLP tasks, such as translation, part-of-speech tagging and sentiment analysis. However, despite their success in practical applications, a theoretical framework for analyzing such non-causal models is lacking.
If the dimension of the lattice of random variables increases, we obtain a random field. The natural extension of Markov chains to random fields leads to causal random fields: here the dependence propagates along preferential directions (see [17] for an example of application). Non-causal random fields appear more naturally in some applications. For instance, it is natural to model the generation of pictures by a non-causal random field defined over a two-dimensional lattice. In this case, the completion problem consists in filling missing pixels using neighboring pixels [4]. The completion has to use information from each direction (up, down, left, right) to fill in the missing pixel. However, contrary to the case of textual data, such pixel completion techniques are not state-of-the-art for image processing. Another important application is the completion of geographical data sets, which makes sense in an ecology setting (see http://doukhan.u-cergy.fr/EcoDep.html), where applications are of fundamental importance.

Our contribution
In this article, we extend the Markovian framework presented in Eq. (1.1) to handle non-causal data. In the one-dimensional case, we assume that the model is a solution of the following equation:

    X_t = F(X_{t−s}, . . . , X_{t−1}, X_{t+1}, . . . , X_{t+s}, ε_t),

with i.i.d. innovations (ε_t)_{t∈Z}.
We generalize this approach straightforwardly to random fields indexed by Z^κ, i.e., to the multi-dimensional case. In this case, the data are a solution of the following equation:

    X_t = F((X_{t+s})_{s∈B}, ε_t),

for some neighborhood B and i.i.d. innovations (ε_t)_{t∈Z^κ}. This kind of random field was introduced by [18], which provides conditions for the existence and uniqueness of the solution of such an equation.
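To build intuition for what a stationary solution of such a fixed-point equation looks like, here is a minimal sketch, not the construction used in the paper: on a finite circular lattice, a contracting bidirectional linear F is iterated (Jacobi-style) until the non-causal equation holds at every site. The function name and the choice of map are illustrative:

```python
import random

def solve_noncausal_1d(F, eps, n_iter=200):
    """Fixed-point iteration for X_t = F(X_{t-1}, X_{t+1}, eps_t) on a
    circular lattice; it converges when F is contracting in its first
    two arguments."""
    n = len(eps)
    x = [0.0] * n
    for _ in range(n_iter):
        x = [F(x[(t - 1) % n], x[(t + 1) % n], eps[t]) for t in range(n)]
    return x

rng = random.Random(1)
eps = [rng.gauss(0, 1) for _ in range(50)]
# Bidirectional linear F with |a| + |b| = 0.6 < 1, so the iteration contracts.
F = lambda left, right, e: 0.3 * left + 0.3 * right + e
x = solve_noncausal_1d(F, eps)
# x now (numerically) satisfies the non-causal equation at every site.
residual = max(abs(x[t] - F(x[(t - 1) % 50], x[(t + 1) % 50], eps[t]))
               for t in range(50))
```

Each sweep shrinks the distance to the fixed point by the contraction ratio, so after a few hundred sweeps the residual is at machine precision.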
We aim at establishing concentration inequalities for these processes using realistic hypotheses on the data. In particular, we want our hypotheses to be reasonable for text data. These hypotheses are non-causal versions of the hypotheses of [2] or [16].
Our main results are Hoeffding-type inequalities. The main hypothesis for these inequalities is a relaxed version of the contraction assumption used in [2,14,16].
Our proof technique relies on a convenient local approximation of a non-causal random field by a function of a finite number of i.i.d. random variables. We also give some examples of such random fields and compare our framework with other classical dependence frameworks. Finally, we present a simple application of our results to model selection, inspired by [27].

Outline of the paper
In Section 2, we introduce the non-causal random fields proposed in [18]. We also present our hypotheses in this section. We give some examples of models satisfying our framework and define the statistic S_I of interest, for which we aim at proving a concentration inequality. The main results may be found in Section 3, as well as some immediate applications to machine learning and comparisons with other concentration inequalities. The rest of the paper is dedicated to the proofs of the main results.
In Section 4, we introduce an approximation S̃_I of S_I which only depends on a finite number of independent variables. We also establish a result evaluating the quality of this approximation.
Then, in Section 5, we prove the main concentration inequality using the extension of McDiarmid's inequality proposed by [13].

Model
In this section, we present a model inspired by [18] in order to handle non-causal random fields. Then, we introduce our hypotheses and the targeted statistic S_I.

Some definitions and notations
From now on, all random variables are defined on a probability space (Ω, F, P).
Dimension of the random field Let κ ∈ N denote the dimension of the random field of interest.
This dimension should not be confused with the classical dimension that appears in high-dimensional statistics, i.e., the number of parameters of the model. Here, the number of parameters increases exponentially with the dimension κ.
Probabilistic setting Let X be a Banach space endowed with a norm ‖·‖. We define the m-norm ‖X‖_m of a random variable X as

    ‖X‖_m = E[‖X‖^m]^{1/m}.

We also use the uniform norm ‖X‖_∞ = inf{C : ‖X‖ ≤ C a.s.} = lim_{m→∞} ‖X‖_m.
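As a quick numerical illustration of these norms (not from the paper), the m-norm of an empirical sample is non-decreasing in m and approaches the uniform norm; `m_norm` is an illustrative helper computing the power mean of the absolute values:

```python
import random

def m_norm(sample, m):
    """Monte-Carlo estimate of the m-norm ||X||_m = E[|X|^m]^(1/m)
    with respect to the empirical distribution of the sample."""
    return (sum(abs(v) ** m for v in sample) / len(sample)) ** (1.0 / m)

rng = random.Random(0)
sample = [rng.uniform(-1, 1) for _ in range(10000)]
norms = [m_norm(sample, m) for m in (1, 2, 4, 8, 16)]
# The m-norms increase with m and stay below the uniform norm sup|X| = 1.
```

The monotonicity is the power mean inequality; the limit as m grows is the essential supremum, matching the definition of ‖X‖_∞ above.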
Additionally, we introduce the notation B + s, which denotes the set B shifted by s, i.e., B + s = V(δ, s)\{s}.
Innovations ε = (ε_t)_{t∈Z^κ} Let (ε_t)_{t∈Z^κ} be a random field indexed by Z^κ of independent and identically distributed random variables ε_t taking values in a Banach space E. To shorten the notation, we denote by ε = (ε_t)_{t∈Z^κ} the whole random field.
Additionally, we define μ_ε, the probability distribution of a single random variable ε_t, and μ = ⊗_{t∈Z^κ} μ_ε, the product of these distributions on E^{Z^κ}. Then, μ is the distribution of ε = (ε_t)_{t∈Z^κ}.

Local non-causal relations
In this paper, we study the κ-dimensional non-causal random field (X_t)_{t∈Z^κ}. This random field presents a local dependence: for each t ∈ Z^κ, X_t depends on its neighborhood indexed by V(δ, t) and on the innovation ε_t.
Formally, we assume that there exists a function F : X^B × E → X such that (X_t)_{t∈Z^κ} is a stationary solution of the following equation:

    X_t = F((X_{t+s})_{s∈B}, ε_t).    (2.1)

The reference [18] ensures the existence and uniqueness of such solutions under a contraction hypothesis on F. Here, we only suppose that there is one (strongly) stationary random field (X_t)_{t∈Z^κ} verifying this equation, and we do not assume uniqueness.
Thereafter, we denote by μ_X the stationary distribution of one marginal random variable X_t. We assume that the laws μ_X and μ_ε are stable under F. This means that, if (X_t)_{t∈B} is drawn with marginal distribution μ_X and ε is drawn with distribution μ_ε, then the random variable F((X_s)_{s∈B}, ε) follows the distribution μ_X.
For s ∈ B, X_{t+s} depends on X_t, but X_t also depends on X_{t+s}. Therefore, it is no longer possible to describe (X_t)_{t∈Z^κ} as the result of a martingale process. This is why we call (X_t)_{t∈Z^κ} a non-causal random field.

Remark 2.1.
It is important to note that the evolution of the random field is defined by a local equation. This is similar to the setting of auto-regressive time series in the one-dimensional case. This is why our framework can be seen as a generalization of stationary Markov chains. Indeed, when δ = κ = 1 and B = {−1}, (X_t)_{t∈Z} is a homogeneous and stationary Markov chain.

Remark 2.2.
We have to assume that the index set B is finite. Indeed, the quantity n_B = card(B) + 1 plays an important role in our results. Consequently, models with infinite memory (such as AR(∞)) do not fit into the framework defined by Equation (2.1).
Let us consider an X-valued Markov chain (X_t)_{t∈Z} given by

    X_t = F(X_{t−1}, ε_t).

The classical contraction hypothesis requires that there exists ρ < 1 such that, for all y, y′ ∈ X and for all ε ∈ E,

    ‖F(y, ε) − F(y′, ε)‖ ≤ ρ ‖y − y′‖.    (2.3)

It is relatively easy to extend this condition to a non-causal random field defined by Equation (2.1). It would be the following condition.
Definition 2.1 (Absolute contraction). There exists (λ_t)_{t∈B}, with ρ := Σ_{t∈B} λ_t < 1, such that, for any X-valued tuples Y = (y_t)_{t∈B} and Y′ = (y′_t)_{t∈B} indexed by B and for all ε ∈ E,

    ‖F(Y, ε) − F(Y′, ε)‖ ≤ Σ_{t∈B} λ_t ‖y_t − y′_t‖.
This condition is similar to the condition proposed by [18]. It is a strong condition, and it is not satisfied by many usual models.
We therefore want to relax this condition. To that end, we consider that the contraction is only verified for an m-th order moment. This leads to the following hypothesis.
Definition 2.2 (Weak contraction hypothesis (H^m_1)). There exists (λ_s)_{s∈B} ∈ [0, 1]^B with Σ_{s∈B} λ_s < 1 such that, for all X-valued tuples of random variables (Y_t)_{t∈Z^κ} and (Y′_t)_{t∈Z^κ} with marginal distribution μ_X and any random field ε = (ε_t)_{t∈Z^κ} with product distribution μ:
1. the m-norms ‖Y_s‖_m, s ∈ B, are finite;
2. ‖F((Y_s)_{s∈B}, ε_0) − F((Y′_s)_{s∈B}, ε_0)‖_m ≤ Σ_{s∈B} λ_s ‖Y_s − Y′_s‖_m.

We denote thereafter ρ = Σ_{s∈B} λ_s. We emphasize that (H^m_1) depends on m, because the contraction concerns only the m-th moment and does not require the function F to be contracting (Equation (2.3)). This hypothesis is hard to verify on a new model, because ε_t and (Y_t) are in general dependent on one another. To prove this condition, we generally need to verify a stronger contraction hypothesis. However, it should be possible to test this hypothesis, whereas this is impossible for the absolute contraction condition, because we do not have access to the function F. On the other hand, the weak contraction hypothesis leads to weaker results than absolute contraction. Therefore, we also use the following compromise: when m goes to ∞, (H^m_1) becomes the following.

Definition 2.3 (Uniform contraction hypothesis (H^∞_1)). There exists (λ_s)_{s∈B} ∈ [0, 1]^B with Σ_{s∈B} λ_s < 1 such that, for all X-valued tuples of random variables (Y_t)_{t∈Z^κ} and (Y′_t)_{t∈Z^κ} with marginal distribution μ_X and any random field ε = (ε_t)_{t∈Z^κ} with product distribution μ:
1. the uniform norms ‖Y_s‖_∞, s ∈ B, are finite;
2. ‖F((Y_s)_{s∈B}, ε_0) − F((Y′_s)_{s∈B}, ε_0)‖_∞ ≤ Σ_{s∈B} λ_s ‖Y_s − Y′_s‖_∞.
Let us assume that diam(F) < diam(X). In this case, it is tempting to think that hypothesis (H^∞_1) is satisfied. However, there may exist couplings of (Y_t)_{t∈Z^κ} and (Y′_t)_{t∈Z^κ} for which the contraction fails: (H^∞_1) is not necessarily satisfied when the diameter of F is smaller than the diameter of X.
We have the following relation between those hypotheses and the absolute contraction.
Lemma 2.1. If F verifies the absolute contraction and if the first condition of the weak (respectively uniform) contraction condition holds, then it verifies the weak (resp. uniform) contraction condition.
The first conditions of the weak and uniform contraction hypotheses are immediately verified as soon as X is bounded. However, boundedness is not a necessary condition, in particular for (H^m_1): its first condition is verified as soon as μ_X is short-tailed.
The relation between (H^m_1) and (H^∞_1) is not easy to establish. If (H^∞_1) is verified, then, for every coupling, there exists a rank m_0 such that condition 2 of hypothesis (H^m_1) is verified for every m > m_0. However, it is not clear whether such a rank m_0 can be chosen uniformly over all couplings.

Coupling hypothesis
We introduce a coupling hypothesis similar to the condition used in [16]. It controls the moments of the difference between two independent variables following the same distribution μ_X.
This hypothesis is immediately verified as soon as the diameter of X is finite. Moreover, if (H^m_1) is verified, so is (H^m_2). Nevertheless, we point out that the quantity V_m plays an important role in our concentration inequality results and may be significantly smaller than the diameter of X. Consequently, even when the diameter is finite, it may be advantageous to use hypotheses (H^m_1) and (H^m_2).

Examples
We provide below some examples of non-causal random fields.
Non-causal linear fields Under our framework, the simplest possible non-causal random field is the bidirectional linear model. In this case κ = 1, and there exist α_{−1}, α_1 such that

    X_t = α_{−1} X_{t−1} + α_1 X_{t+1} + ε_t,

where the ε_t are a Gaussian white noise of variance σ². In this setting, absolute contraction is satisfied if |α_{−1}| + |α_1| < 1.
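For this bidirectional linear map, the absolute contraction inequality of Definition 2.1 can be checked mechanically; the constants A_M1 and A_P1 below are illustrative choices satisfying |α_{−1}| + |α_1| < 1:

```python
import random

# Bidirectional linear map: F(y_prev, y_next, e) = a_{-1}*y_prev + a_{+1}*y_next + e.
A_M1, A_P1 = 0.25, 0.35          # |a_{-1}| + |a_{+1}| = 0.6 < 1

def F(y_prev, y_next, e):
    return A_M1 * y_prev + A_P1 * y_next + e

rng = random.Random(0)
ok = True
for _ in range(1000):
    y = (rng.uniform(-5, 5), rng.uniform(-5, 5))
    yp = (rng.uniform(-5, 5), rng.uniform(-5, 5))
    e = rng.gauss(0, 1)
    # Check |F(Y,e) - F(Y',e)| <= lambda_{-1}|dy_prev| + lambda_{+1}|dy_next|.
    lhs = abs(F(y[0], y[1], e) - F(yp[0], yp[1], e))
    rhs = abs(A_M1) * abs(y[0] - yp[0]) + abs(A_P1) * abs(y[1] - yp[1])
    ok = ok and lhs <= rhs + 1e-12
```

For a linear F the inequality is in fact an identity up to the triangle inequality, with λ_s = |α_s|, which is why the check never fails.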
Bidirectional linear fields can be generalized to linear random fields

    X_t = Σ_{s∈B} α_s X_{t+s} + ε_t.

In any case, absolute contraction is satisfied as soon as Σ_{s∈B} |α_s| < 1.
Finite LARCH random fields Finite LARCH(n) random fields [35] are defined by

    X_t = ε_t (α_0 + Σ_{j=1}^n α_j X_{t−j}).

They can be generalized to non-causal LARCH(n) fields defined by

    X_t = ε_t (α_0 + Σ_{s∈B} α_s X_{t+s}).

In this case, a sufficient condition to fulfill (H^m_1) is ‖ε_t‖_∞ Σ_{s∈B} |α_s| < 1.
Finite ARCH random fields ARCH models are widely used in econometrics and can easily be extended to the non-causal case. Here, we consider a Bi-ARCH(1,1) model; in this case, (H^m_1) is satisfied if ε is bounded and the process is stationary. These processes can be extended to the multidimensional case.

Bidirectional RNN Bidirectional recurrent neural networks (BRNN) have been used in Natural Language Processing. They have many applications, from text translation to part-of-speech tagging and speech recognition. Here, we present the formal version of a single-layer bidirectional neural network with a white noise ε_t,

    X_t = f(A X̄_t) + ε_t,
where A is a p × 2k matrix, f is an activation function, and X̄_t is the 2k-dimensional vector stacking the neighboring values. We suppose that the activation function is 1-Lipschitz. This is the case for most activation functions (sigmoid, ReLU, softmax). There is also an operator norm ‖·‖_{op,m} associated with the norm ‖·‖_m. With this condition, the contraction condition (H^m_1) is verified as soon as the operator norm of A is smaller than 1. If the white noise is bounded, condition (H^∞_2) is verified; if instead it is subgaussian, we only have (H^m_2).
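As a hedged, scalar-valued illustration of why a 1-Lipschitz activation combined with small weights yields contraction, the sketch below (hypothetical weights W_PREV and W_NEXT, tanh activation) estimates the empirical Lipschitz ratio of a one-dimensional bidirectional cell on random inputs:

```python
import math
import random

W_PREV, W_NEXT = 0.4, 0.4        # illustrative weights; |W_PREV| + |W_NEXT| < 1

def brnn_cell(x_prev, x_next, eps):
    """Scalar analogue of a single-layer bidirectional cell:
    X_t = f(w_prev * X_{t-1} + w_next * X_{t+1}) + eps_t,
    with the 1-Lipschitz activation f = tanh."""
    return math.tanh(W_PREV * x_prev + W_NEXT * x_next) + eps

rng = random.Random(7)
worst = 0.0
for _ in range(2000):
    a, b, ap, bp = (rng.uniform(-3, 3) for _ in range(4))
    e = rng.gauss(0, 1)
    num = abs(brnn_cell(a, b, e) - brnn_cell(ap, bp, e))
    den = abs(a - ap) + abs(b - bp)
    if den > 0:
        worst = max(worst, num / den)
# Since tanh is 1-Lipschitz, the ratio never exceeds max(W_PREV, W_NEXT) = 0.4,
# which is in particular below rho = |W_PREV| + |W_NEXT| = 0.8.
```

The 1-Lipschitz property of f lets the contraction constant be read off directly from the weights, mirroring the role of the operator norm of A in the vector-valued case.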

Function of interest Φ and the statistic S I
Throughout this article, we focus on a function Φ : X^B̃ → R defined on a small neighborhood B̃. Then, for a given subset I of indices, we define the statistic

    S_I = Σ_{s∈I} Φ((X_{s+t})_{t∈B̃}).    (2.5)

We first recall that the process is strongly stationary; thus the expectation E[Φ((X_{s+t})_{t∈B̃})], and hence E[S_I], is well defined. Our goal is to control the difference between S_I and E[S_I], i.e., the deviation of S_I. We introduce below some hypotheses concerning the function Φ.
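A direct computational rendering of S_I might look as follows; the names, the periodic boundary (added so every neighborhood is defined), and the particular choice of Φ are all illustrative:

```python
def statistic_S(x, I, B_tilde, phi):
    """S_I = sum over s in I of Phi((x_{s+t})_{t in B_tilde}),
    with a periodic boundary so every neighbourhood is defined."""
    n = len(x)
    return sum(phi(tuple(x[(s + t) % n] for t in B_tilde)) for s in I)

x = [0.1, -0.2, 0.3, 0.05, -0.1, 0.2]
# Example choice: B_tilde = (-1, 0, 1) and Phi a squared second difference.
S = statistic_S(x, I=range(len(x)), B_tilde=(-1, 0, 1),
                phi=lambda v: (v[0] - 2 * v[1] + v[2]) ** 2)
```

Note that S_I is a sum over overlapping neighborhoods, which is precisely the source of the dependence that the later results must handle.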

Lipschitz separability hypothesis
This hypothesis is close to the condition proposed by [16] in the causal dependent case. Moreover, recent works (see [36]) suggest that such a hypothesis is suitable for dealing with deep learning models and provide algorithms to estimate the Lipschitz constant.
To simplify, we assume throughout the article that L = 1; allowing L ≠ 1 would only add a multiplicative factor in our results.
This hypothesis is immediately verified when X is bounded. The case of unbounded X will be discussed in subsection 3.4.

Concentration inequality within i.i.d. assumption
Numerous concentration inequalities have been established to control the deviation of a random variable. Hoeffding's and McDiarmid's inequalities, proposed respectively in [20] and [30], are among the most widely used in machine learning. Their classical versions are based on the i.i.d. assumption. Below, we recall Hoeffding's inequality.

Theorem 3.1 (Hoeffding's inequality). Let X_1, . . . , X_n be independent real random variables and S_I = Σ_{i=1}^n X_i. We suppose that there exist two tuples (a_1, . . . , a_n) and (b_1, . . . , b_n) such that, with probability 1,

    a_i ≤ X_i ≤ b_i,  for all i ∈ {1, . . . , n}.    (3.1)

Then, for all ε > 0,

    P(|S_I − E[S_I]| ≥ ε) ≤ 2 exp(−2ε² / Σ_{i=1}^n (b_i − a_i)²).

In the field of statistical learning, Hoeffding's and McDiarmid's inequalities have many applications ([9,28] give examples).
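The Hoeffding bound is straightforward to evaluate and compare against simulation. The sketch below (illustrative parameters: a sum of 200 uniform variables on [0, 1]) checks that the empirical exceedance frequency stays below the bound:

```python
import math
import random

def hoeffding_bound(eps, ranges):
    """Two-sided Hoeffding bound:
    P(|S - E[S]| >= eps) <= 2 * exp(-2 * eps^2 / sum_i (b_i - a_i)^2)."""
    return 2.0 * math.exp(-2.0 * eps ** 2 / sum((b - a) ** 2 for a, b in ranges))

rng = random.Random(0)
n, eps, trials = 200, 15.0, 2000
exceed = sum(
    abs(sum(rng.random() for _ in range(n)) - n / 2) >= eps
    for _ in range(trials)
)
empirical = exceed / trials
bound = hoeffding_bound(eps, [(0.0, 1.0)] * n)   # 2 * exp(-2.25), about 0.21
```

As usual with Hoeffding, the bound is conservative: the simulated frequency is far smaller than the guaranteed 0.21.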
In this article, we aim to prove a Hoeffding-type inequality in the non-causal setting defined in Section 2.

Remark 3.1. Other types of concentration inequalities (Bernstein, von Bahr-Esseen) could have been proved in the same context using the same approach.
Indeed, it should also be possible, as in the i.i.d. case, to derive additional inequalities from the exponential inequality given in Lemma 4.8. Nevertheless, we focus on Hoeffding's inequality because of its simplicity and its use in many applications.

Concentration inequalities and expected deviation bounds for non-causal random fields
We state below simplified versions of our results. Full theorems may be found in Section 5.2.
where Υ is the function defined in Lemma 4.6; it can be bounded independently of n_B̃, m, or n.
The constant A does not depend on n_B, n_B̃, or n, and its explicit value may be found in Theorem 5.1. The dominant term in the denominator is of order n(ln(n))^κ. Moreover, this term is strongly affected by the dimension κ of the random field. However, in practice, this constraint is not a problem because, in most practical cases, κ ∈ {1, 2}.
Other important factors are the parametric dimensions of our model, represented by n_B and n_B̃. It is logical that the quality of the inequality decreases when the number of variables to control increases. Our inequality is very sensitive to these factors and is therefore not a suitable result for high-dimensional estimation, i.e., when n is smaller than n_B and n_B̃. We recall that, for a simple Markov chain, both of these terms would be equal to 2.
Finally, the condition ε ≥ 2 n_B̃ V_∞ is not restrictive for applications, as we will show in subsection 3.4.
We now give a simplified version of our concentration inequality for the weaker assumptions (H m 1 ) and (H m 2 ).
These constants are defined in Lemma 4.6 and can be bounded independently of n_B̃, m, or n.
Similar remarks as for the previous theorem apply to the dimension κ and to the parameters n_B and n_B̃.
The denominator in the exponential is O(n^{1+4/m} (ln(n))^κ). Compared to the i.i.d. case, there is an additional factor n^{4/m} (ln(n))^κ, which we now explain. The term n^{4/m} comes from (H^m_1), which is weaker than the classical contraction hypothesis. As in the strong contraction case, the term (ln(n))^κ comes from the dependence; a similar term is obtained in [16] for unidirectional dependence.
(H^m_1) and (H^m_2) are weaker assumptions than (H^∞_1) and (H^∞_2) and thus lead to deteriorated concentration inequalities. In fact, in the exponential term, the denominator is asymptotically dominated by O(n^{1+4/m} (ln(n))^κ), instead of O(n (ln(n))^κ) under assumptions (H^∞_1) and (H^∞_2), and O(n) under the i.i.d. assumption.
We point out that this theorem only gives interesting results when m > 4 (the reason for this becomes clearer in Theorem 3.4).
The two extra additive terms decrease faster than the main term and are not dominant for applications as we will show in subsection 3.4.
These two theorems lead to the next corollaries for the expected deviation.

Remark 3.2. To show how constants occurring in Corollaries 3.2 and 3.3 can be explicitly computed, we provide below a simple example.
We assume that there exist α_{−1} and α_{+1} such that

    X_t = α_{−1} X_{t−1} + α_{+1} X_{t+1} + ε_t.

This framework corresponds to the simplest bi-linear random field. Additionally, we introduce a function φ and a statistic S_I such that

    S_I = Σ_{s∈I} φ(X_s).

In this simple case, the constants involved can be computed explicitly.
Therefore, we get the following bound. Note that S_I is defined as a sum, and not an average, of random variables. Therefore, the expected deviation E[|S_I − E[S_I]|] does not go to 0.

Comparison with other results
In this section, we compare our hypotheses and results with other frameworks and concentration inequalities in the literature.
Independent and identically distributed case The first concentration inequalities were proved for i.i.d. random variables. We have already presented the classical Hoeffding inequality. McDiarmid's is similar, but somewhat more general, since it does not refer to a sum of random variables Σ_{i=1}^n φ(X_i), but to a general function of random variables φ(X_1, . . . , X_n).
The requirements for Hoeffding's inequality (see Equation (3.1)) or its equivalent for McDiarmid's inequality, are uniform in the sense of [24] (uniform bounded difference assumption). However, some attempts [13,24] have been made to relax this condition. Instead of assuming such conditions with probability 1, they may only be required with probability 1 − ρ (with ρ small). However, relaxing these conditions degrades the quality of the bound.
This can be compared to the difference between the conditions (H^∞_1) and (H^m_1). Indeed, the moment condition implies that the map F is a contracting mapping with high probability.

Markov chain
Causal dependence can easily be described using a Markov chain, where the values X_t verify an equation of the type X_t = F(X_{t−1}, ε_t) for i.i.d. innovations ε_t. We have borrowed a lot from this approach: our non-causal master Equation (2.1) has the same form as the classical causal Markov chain equation.
In order to establish concentration inequalities on Markov chains, further assumptions about the behavior of the function F are required. For example, classical hypotheses include conditions on the mixing time and spectral gap [32]. However, these conditions are difficult to translate to a non-causal and multidimensional case. Therefore, we use an approach similar to [2,16] and introduce a contraction hypothesis. However, as far as we know, our contraction hypotheses (H^∞_1) and (H^m_1) are weaker than the absolute contraction hypothesis used in previous works (e.g. [16]). That is why the convergence rate obtained by [16] is better than our own. They obtain a convergence rate in O(√n) instead of the O(√(n ln(n))) of our Corollary 3.1.

Simultaneous auto regressive scheme
The simultaneous autoregressive scheme models the evolution of a lattice through time. At each time step, values on the lattice are updated using the previous values. Causal and non-causal versions of such models were studied in [11]. Our hypothesis (H^m_1) is close to the Stochastic Lipschitz Continuity hypothesis from this article. Our work differs in that we are interested in non-asymptotic results.
Weak dependence for causal time series Another possibility is to use weak dependence (see [15]) to model dependence in time series. It can be applied to more general processes because it accounts for long-range dependence. However, weak dependence conditions are non-local, and therefore more difficult to verify. In the causal case, they lead to a similar rate of convergence as our results under the uniform contraction hypothesis. For example, they are used by [3] for model selection to obtain a convergence rate asymptotically dominated by O(ln^{5/2}(n) √n).
Weak dependence for non-causal time series For non-causal time series, many results have been obtained using the notion of weak dependence [15]. This notion replaces our contraction condition (H ∞ 1 ) or (H m 1 ). However, it is difficult to compare our setting with these results because their concentration inequalities are not Hoeffding or Mc-Diarmid type inequalities. In particular, they introduce an autocovariance term that does not appear in our concentration inequalities.
Markovian random field Few results are similar to ours for general random fields. The usual framework for treating non-causal random fields is the Markovian random field framework, where the random variables are distributed on a graph, instead of a lattice. To our knowledge, there are no similar concentration results that apply to general Markovian random fields.
An interesting sub-case of Markovian random fields is the Ising model, which is commonly used in physics and imagery [29]. Ising models can be viewed as a special case of our model, where X = {−1, 1} and B = {−1, 1}^κ. The transition is governed by a local update equation, where ε_t follows a uniform distribution on (0, 1). Some concentration results have been established under coupling conditions [10]; these lead to polynomial (rather than exponential) concentration inequalities. However, it is hard to compare them with our results because, even though (H^∞_1) is not satisfied for the Ising model, (H^m_1) seems reasonable but hard to verify.
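For concreteness, a heat-bath style rendering of this local update on a one-dimensional Ising ring might look as follows; the sigmoid form of the acceptance probability and the choice of β are assumptions of this sketch, not formulas taken from the paper:

```python
import math
import random

def ising_sweep(x, beta, rng):
    """One sweep of a local non-causal update for a 1-D Ising ring:
    X_t = +1 if eps_t < sigmoid(2 * beta * (X_{t-1} + X_{t+1})), else -1,
    with eps_t uniform on (0, 1) (heat-bath form of the transition)."""
    n = len(x)
    for t in range(n):
        field = x[(t - 1) % n] + x[(t + 1) % n]
        p_plus = 1.0 / (1.0 + math.exp(-2.0 * beta * field))
        x[t] = 1 if rng.random() < p_plus else -1
    return x

rng = random.Random(3)
spins = [rng.choice([-1, 1]) for _ in range(100)]
for _ in range(50):
    spins = ising_sweep(spins, beta=0.4, rng=rng)
```

This makes explicit how the Ising model fits the form X_t = F((X_{t+s})_{s∈B}, ε_t) with a uniform innovation, even though F is here discontinuous.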

Application to learning theory
In this subsection, we provide an application of our results to learning theory. We adapt the approach of [5], [26] and [27] to our framework, to get an oracle bound for the model selection problem.

Application to the completion problem
A classical problem in learning theory is the completion problem (both in regression and classification). The objective is to predict X_t using its neighbors (X_{t+s})_{s∈B} on a κ-dimensional lattice. The prediction is given by a model f̂. This completion problem may seem trivial, but it is actually the backbone of many NLP tasks. Indeed, predicting words is often used as a primary task to train the encoder in an encoder-decoder scheme. It creates a language model, i.e., a probability distribution of a word in a text. This language model is then used for more complex tasks (translation, text segmentation, question answering, etc.).
We also introduce a cost function c : X² → R; c(x̂_s, x_s) quantifies the error made when the considered algorithm predicts x̂_s instead of x_s. In this case, the function of interest Φ introduced earlier corresponds to the cost of f̂ on a sample. In Figure 1, the red neighborhood V(δ, t) is used to compute x̂_t = f̂((x_{t+s})_{s∈B}). However, to compute this in practice, we must know the values of all the x_t in the neighborhood V(δ, t). This implies that the training set (the blue set) is only a subset of the set of all known values (the grey one). In the completion problem, we want to control the gap between the theoretical risk of an estimator f̂, denoted by R(f̂), and its empirical risk R_emp(f̂).

Proof. It is a direct application of Theorems 3.2 and 3.3 with S_I = Σ_{s∈I} c(f̂((X_{s+t})_{t∈B\{0}}), X_s). Moreover, the constants involved are the same as in Theorems 3.2 and 3.3.
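A minimal sketch of the empirical completion risk, assuming the neighborhood B = {−1, +1}, a quadratic cost, and a periodic boundary (all illustrative choices, not the paper's general setting):

```python
def empirical_risk(x, I, f_hat, cost):
    """R_emp(f) = (1/|I|) * sum over s in I of c(f((x_{s+t})_{t in B}), x_s),
    here with the neighbourhood B = {-1, +1} and periodic boundary."""
    n = len(x)
    I = list(I)
    total = sum(cost(f_hat(x[(s - 1) % n], x[(s + 1) % n]), x[s]) for s in I)
    return total / len(I)

x = [0.0, 1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0]
risk = empirical_risk(x, I=range(1, 7),
                      f_hat=lambda left, right: (left + right) / 2,
                      cost=lambda pred, true: (pred - true) ** 2)
# risk == 1/6: only the peak at s = 4 is mispredicted by the averaging model.
```

Note how the index set I excludes the boundary points whose full neighborhood would not be observed, mirroring the blue/grey distinction of Figure 1.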
If we assume that E[exp(|R(f̂) − R_emp(f̂)|)] exists, we get the following useful result (3.6). The proof can be found in Appendix A.

Model selection
In this subsection, we present a model selection result for a finite set of models.
It is an extension of well-known results ([5]) to our framework. It should also be possible to obtain results for an infinite set of learning functions using the Vapnik-Chervonenkis dimension [8].
We consider a finite set F of models f̂ and assume that we have an algorithm that can find the model with the lowest empirical risk. Our goal is to bound the difference between the theoretical risk of the selected model and the theoretical risk of the best model in the set.
We assume here that all models f̂ are independent of the dataset used to compute the empirical risk. This hypothesis has been relaxed in previous work [31]. However, in order to provide a simple example, we retain this hypothesis.
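Empirical risk minimisation over a finite candidate set can be sketched as follows; the three candidate predictors, the toy sample, and the quadratic cost are illustrative choices, not prescribed by the paper:

```python
def completion_risk(f, x, I):
    """Quadratic completion risk of a candidate model f on the index set I,
    with neighbourhood B = {-1, +1} and periodic boundary."""
    n = len(x)
    return sum((f(x[(s - 1) % n], x[(s + 1) % n]) - x[s]) ** 2 for s in I) / len(I)

x = [0.0, 1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0]
I = list(range(1, 7))
candidates = {
    "mean":  lambda l, r: (l + r) / 2,   # non-causal: uses both neighbours
    "left":  lambda l, r: l,             # causal: past neighbour only
    "right": lambda l, r: r,             # anti-causal: future neighbour only
}
best = min(candidates, key=lambda name: completion_risk(candidates[name], x, I))
# On this sample the non-causal predictor achieves the lowest empirical risk.
```

The oracle bound of this subsection then controls how far the theoretical risk of the selected candidate can be from the best one in the set.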
Thus, for all ε, Theorem 3.4 yields the result under both sets of hypotheses.
Bounds for the expectation can also be obtained. We focus here on the case where hypotheses (H^∞_1), (H^∞_2) and (H_3) hold.
We can also provide an asymptotic equivalent.

Proof. We use the bound established in the previous corollary together with the fact that the deviation can be controlled for all s > 0. If (H^∞_1), (H^∞_2) and (H_3) hold, we apply Equation (3.5) from Corollary 3.3. Then, choosing

    s = 2 n_B̃ V_∞ √( 2 ln(N) (1 + A n_B̃ n_B³ κ!² ln(n)^κ) / n ),

we get the stated bound, where H and E are the constants defined in Corollary 3.2.
Remark 3.7. In [27], a bound is provided for the expected maximal deviation in the i.i.d. case.

Approximation
In Sections 4 and 5, we prove the results of Section 3. To this end, we introduce a useful approximation of non-causal random fields. [18] has already shown that a non-causal solution (X_t) of (2.1) can be expressed as a function of an infinite number of i.i.d. random variables. We present an approximation of (X_t) by a function of a finite number of i.i.d. random variables.

Notations
Let us first recall and introduce some notation for handling random fields.

Exact reconstruction
Theorem 1 from [18] ensures, under suitable conditions on F, the existence and uniqueness of a function H such that, for each t ∈ Z^κ,

    X_t = H((ε_{t+s})_{s∈Z^κ}).

We make two comments on this result.
• This theorem provides an expression of each X_t in terms of an infinite number of i.i.d. random variables (the whole random field ε). However, many concentration inequalities involve only a finite number of random variables.
• This theorem relies on an absolute contraction hypothesis on F (similar to Equation (2.3)), which is a stronger assumption than (H^∞_1) and (H^m_1).

For these two reasons, we cannot use this exact reconstruction of the solution X_t of Equation (2.1).

Intuition
The idea is to approximate each X_t by another random variable X̃_t which, similarly to H(ε), depends on the innovations ε. However, unlike H(ε), we only use a finite number of random variables ε_s: the ones located in a finite neighborhood surrounding X_t.

Definition
Let us recall some notations and formally define the function H̃^[d].
where X̃ is an independent random variable sampled from the distribution μ_X.
We can reformulate Equation (4.3) using the notation X̃^[d]_t. In this way, X̃^[d]_t is an approximation of X_t involving the random variables ε_s which belong to the finite neighborhood V(dδ, t). Outside of this neighborhood, we complete the approximation with a random variable X̃ drawn from the law μ_X and independent from (X_t)_{t∈Z^κ} and ε.

Remark 4.1. We emphasize that, if the function H from Theorem 1 of [18] exists and is unique, then lim_{d→∞} H̃^[d](X̃, (ε_s)_{s∈V(dδ,t)}) = H(ε). Nevertheless, this limit might not exist or might not be unique; but even in this case, for all finite d, the approximation H̃^[d](X̃, (ε_s)_{s∈V(dδ,t)}) is always well defined.
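A rough computational analogue of this finite-range approximation (illustrative only: the boundary handling, the linear map, and the initial distribution are assumptions of this sketch) is to initialise a window of half-width d with independent draws, apply the local map d times, and keep the centre value:

```python
import random

def approx_X(F, eps_window, init_window, d):
    """Sketch of the finite-range approximation X~^[d]_t: initialise the
    window around t with independent draws, apply the local map F d times,
    and keep only the centre value."""
    w = list(init_window)
    n = len(w)
    for _ in range(d):
        w = [F(w[max(i - 1, 0)], w[min(i + 1, n - 1)], eps_window[i])
             for i in range(n)]
    return w[n // 2]

rng = random.Random(5)
eps = [rng.gauss(0, 1) for _ in range(33)]           # innovations on a window of half-width 16
F = lambda l, r, e: 0.3 * l + 0.3 * r + e            # contracting bidirectional linear map
x16 = approx_X(F, eps, [rng.gauss(0, 1.5) for _ in range(33)], d=16)
x8 = approx_X(F, eps[8:25], [rng.gauss(0, 1.5) for _ in range(17)], d=8)
# Both approximate the same X_t: they share the same centre innovations and
# differ only by a geometrically small boundary effect.
```

The key point, which the lemmas below make quantitative, is that the influence of the unknown values outside the window decays geometrically with d under contraction.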

Approximation error
The approximation X̃^[d]_t is useful only if we are able to control the approximation error. That is the goal of the two following lemmas.
The proof can be found in Appendix B. This lemma is the key to controlling the quality of our approximation.
Earlier, we defined the statistic S_I (see Equation (2.5)), which depends on the random variables (X_t)_{t∈I}. We now introduce an approximation of the statistic S_I, which we call S̃^[d]_I, relying on the approximations X̃^[d]_t:

    S̃^[d]_I = Σ_{t∈I} Φ((X̃^[d]_{t+s})_{s∈B̃}).    (4.6)

Using Lemma 4.1, we are able to control the difference between S_I and S̃^[d]_I. This is the purpose of the following corollary.
The proof follows using Lemma 4.1.

Concentration inequality for S̃^[d]_I
In this subsection, we establish a concentration inequality for S̃^[d]_I. This theorem is a direct application of a McDiarmid-type inequality. This type of inequality holds if the random variable we want to control is a function of a finite number of independent random variables, which is the case here. Indeed, S̃^[d]_I can be expressed as a function of a finite number of random variables X̃^[d]_t, and each of these variables is itself a function of a finite number of innovations ε_t and of the independent random variable X̃.
To apply a McDiarmid-type inequality, we need two ingredients:
• the counting elements needed to count the number of independent random variables that appear in S̃_I^{[d]};
• the precise version of McDiarmid's inequality we are going to use.

Counting random variables
The rest of this section requires counting the number of random variables that occur in S̃_I^{[d]}. There are two types of random variables we want to count: the approximate random variables X̃_t^{[d]}, and the innovations ε_s. For this purpose, we introduce the following cardinality:
• n_d = Card(V(dδ, t)), the number of innovations of ε used in the approximation X̃_t^{[d]} (see Equation (4.4)).
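As a concrete illustration, assuming V(r, t) is the sup-norm ball of radius r in Z^κ (one possible choice of neighborhood, not fixed by the paper), n_d can be computed by direct enumeration:

```python
from itertools import product

def neighborhood_size(radius, kappa):
    """Card(V(radius, t)): number of lattice points of Z^kappa within
    sup-norm distance `radius` of t; equals (2*radius + 1)**kappa."""
    return sum(1 for _ in product(range(-radius, radius + 1), repeat=kappa))
```

The count grows polynomially in the radius but exponentially in the dimension κ, which is why the counting step matters for the final bounds.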
There are some relations between these numbers.

Lemma 4.2.
• n ≤ N_1 ≤ n n_B.
The upper bounds for N_1 and N_2 are attained when the unions above involve pairwise disjoint sets. Nevertheless, this is rarely the case in practice. For example, in machine learning settings, the training and validation sets are usually connected. Therefore, adding further hypotheses on the topology of I should improve these bounds.
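The bounds of Lemma 4.2 can be checked numerically in a one-dimensional toy setting; `count_N1` below is a hypothetical helper counting the distinct indices t + s for t ∈ I, s ∈ B:

```python
def count_N1(I, B):
    """N1: number of distinct indices t+s (t in I, s in B), i.e. the
    number of distinct approximate variables appearing in the statistic
    (one-dimensional sketch). Lemma 4.2 gives n <= N1 <= n * n_B."""
    return len({t + s for t in I for s in B})
```

A connected index set I gives an N_1 far below n·n_B because the shifted copies of B overlap, while well-separated points of I attain the upper bound, matching the remark above.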

An extension of McDiarmid's inequality
In order to obtain a concentration inequality for S̃_I^{[d]}, note that when our assumptions are verified without absolute contraction, the uniform difference-bound hypothesis (as defined in [24]) is not satisfied; thus, the classical McDiarmid inequality [30] does not hold. Therefore, we need an extended version of McDiarmid's inequality that remains valid when the bounded-difference hypothesis is verified only with high probability. There are several results of this type ([13, 22, 24, 37]). Here, we have chosen to use the extended McDiarmid inequality from [13].
We are under the "A-difference bounded" assumption, which corresponds to Assumption 1.2 from [13], and we apply the corresponding extension of McDiarmid's inequality (Theorem 2.1 from [13]).
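For intuition, the following sketch contrasts the classical bounded-difference (McDiarmid) tail bound with an empirical tail for a function of independent variables. The extended inequality of [13] used in the paper relaxes the uniform difference bound to one holding with high probability; this toy example only illustrates the classical case.

```python
import math
import random

def mcdiarmid_bound(c, t):
    """Classical bounded-difference bound:
    P(|f - E f| >= t) <= 2 * exp(-2 t^2 / sum_i c_i^2)."""
    return 2.0 * math.exp(-2.0 * t * t / sum(ci * ci for ci in c))

def empirical_tail(m, t, trials=20000, seed=0):
    """Empirical tail probability for f = mean of m independent U[0,1]
    variables; changing one coordinate moves f by at most c_i = 1/m,
    and E f = 1/2."""
    rng = random.Random(seed)
    hits = sum(abs(sum(rng.random() for _ in range(m)) / m - 0.5) >= t
               for _ in range(trials))
    return hits / trials
```

The empirical tail sits well below the bound, as expected: McDiarmid's inequality is distribution-free and therefore conservative.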

Difference bound for S̃_I^{[d]}
To check these assumptions (strong difference bound or A-difference bound), we need to bound the difference between the statistic S̃_I^{[d]} and its value when one of the underlying independent random variables is resampled.

Two types of random variables are involved in S̃_I^{[d]}:
• the marginal variables ε_s (for s ∈ ∪_{t∈I} V(dδ, t));
• the independent initialization variable X̃.
We introduce the following notation.
• For all t in Z^κ, X̃_t^{[d],X̃′} = H̃^{[d]}(X̃′, (ε_s)_{s∈V(dδ,t)}), where X̃′ is drawn from the law μ_X and independent of (X_t)_{t∈Z^κ}, ε, and X̃.
• For all t in Z^κ and for all i in V(dδ, t), X̃_t^{[d],ε′_i} denotes the approximation X̃_t^{[d]} with ε_i replaced by an independent copy ε′_i, where ε′_i is drawn from the law μ_ε and independent of (X_t)_{t∈Z^κ}, ε, and X̃.
We assume (H_1^m) and (H_2^m). • The proof of the first point is the same as that of Lemma 4.1.
Then, with the same argument as in Lemma 4.1, we can bound the difference X̃_t^{[d]} − X̃_t^{[d],ε′_i}. We can now bound the difference between S̃_I^{[d]} and its resampled versions. The proof of this lemma can be found in Appendix C.
In the previous lemma, the quantity Σ_{c=1}^{d} c^{κ−1} ρ_c occurs; we show in the next lemma that this quantity can be bounded independently of d.
The function Υ is defined by:
The proof of this lemma can be found in Appendix D.
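A quick numerical check, assuming the geometric choice ρ_c = ρ^c of the uniform-contraction case, shows that the partial sums Σ_{c=1}^{d} c^{κ−1} ρ^c indeed stabilize to a finite limit independent of d (for κ = 2 and ρ = 1/2 the limit is ρ/(1−ρ)² = 2):

```python
def partial_sum(d, kappa, rho):
    """S_d = sum_{c=1}^{d} c^(kappa-1) * rho**c: the quantity from the
    lemma, here with the geometric choice rho_c = rho**c (uniform
    contraction case)."""
    return sum(c ** (kappa - 1) * rho ** c for c in range(1, d + 1))
```

The geometric decay of ρ_c dominates the polynomial factor c^{κ−1}, which is exactly why a d-independent bound such as Υ exists.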

Concentration inequalities for S I
We are now able to prove the results of Section 3, namely the concentration inequalities for S_I. To do this, we combine the concentration inequalities from the previous section (for S̃_I^{[d]}) with Lemma 4.1.
Proof. Suppose that (H_1^∞) and (H_3) are verified. Using Corollary 4.1, we have almost surely:

Concentration inequalities with optimized parameters
Equations (5.1) and (5.2) in Lemma 5.1 were established for every d ∈ N and every positive value of t_1, t_2. Therefore, we can choose an appropriate value for each of these parameters to improve our bounds. Moreover, the quantity N_2 is not easy to interpret, and even less to estimate. Therefore, in the following theorems, we fix the parameters d, t_1, t_2 and replace N_2 by an upper bound corresponding to the worst case.
Theorem 5.1 (Improved concentration inequality for S I , uniform contraction case).
Proof. We use the previous Lemma 5.1 and set d̃ = ln(n)/ln(ρ^{−1}) and d = ⌈d̃⌉. Then, according to Lemma 5.1, it holds that:
And L(n) = n_B V_m … The proof can be found in Appendix E.
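The choice of truncation depth in the proof above can be sketched as follows, assuming the integer level is obtained by rounding d̃ up (the helper name `truncation_level` is illustrative):

```python
import math

def truncation_level(n, rho):
    """d_tilde = ln(n) / ln(1/rho); taking the ceiling gives an integer
    depth d with rho**d <= 1/n, balancing the truncation error
    (of order rho**d) against the 1/n scale of the statistic."""
    return math.ceil(math.log(n) / math.log(1.0 / rho))
```

With n = 1000 and ρ = 1/2 this gives d = 10, the smallest depth for which ρ^d ≤ 1/n.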