Minimax Analysis for Inverse Risk in Nonparametric Planer Invertible Regression

We study the minimax risk of estimating inverse functions on a plane while requiring that the estimator is also invertible. Learning invertibility from data and exploiting invertible estimators arise in many domains, such as statistics, econometrics, and machine learning. Although the consistency and universality of invertible estimators have been well investigated, the analysis of the efficiency of these methods is still under development. In this study, we study the minimax risk for estimating invertible bi-Lipschitz functions on a square in a $2$-dimensional plane. We first introduce two types of $L^2$-risks to evaluate an estimator which preserves invertibility. Then, we derive lower and upper rates for minimax values for the risks associated with inverse functions. For the derivation, we exploit a representation of invertible functions using level-sets. Specifically, to obtain the upper rate, we develop an estimator that is asymptotically almost everywhere invertible, whose risk attains the derived minimax lower rate up to logarithmic factors. The derived minimax rate corresponds to that of non-invertible bi-Lipschitz functions, which shows that invertibility does not reduce the complexity of the estimation problem in terms of the rate.


Background
Learning invertible structures from data is a problem encountered in several fields, from classical to modern ones, where an invertible function is a typical shape constraint on functions. A traditional and well-known application is the nonparametric calibration problem: in a nonparametric regression problem with an unknown invertible function, one estimates an input covariate corresponding to an observed response variable. This problem has been studied by Knafl et al. (1984), Osborne (1991), Chambers et al. (1993), Gruet (1996), Tang et al. (2011) and Tang et al. (2015), and applied in the fields of biology and medicine (Tang et al., 2011, 2015). A different application in econometrics is the nonparametric instrumental variable, developed by Newey and Powell (2003) and Horowitz (2011). This is an ill-posed problem with conditional expectations. For instance, Krief (2017) studies the estimation by direct usage of inverse functions. Another application that has been developed rapidly in recent years is a framework for normalizing flows used for generative models in machine learning, developed by Rezende and Mohamed (2015) and Dinh et al. (2017). A related problem is the analysis of latent independent components using nonlinear invertible maps (Dinh et al., 2014; Hyvarinen and Morioka, 2016). Under this problem, an observed data distribution is regarded as a transformation of a latent variable by an unknown invertible function, and this function is estimated by an invertible estimator to reconstruct the latent variable (for a review, see Kobyzev et al. (2020)). Several methods have been developed for estimating invertible functions, for example, Dinh et al. (2014), Papamakarios et al. (2017), Kingma et al. (2016), Huang et al. (2018), De Cao et al. (2020) and Ho et al. (2019).
In the univariate case ($d = 1$), the error of invertible estimators has been actively analyzed. In this case, the estimation of invertible functions is related to estimating strictly monotone functions, and there are many related studies in the field of isotonic regression (for a general introduction, see Groeneboom and Jongbloed (2014)). Tang et al. (2011, 2015) and Gruet (1996) study the estimation of an input point $x = f^{-1}(t) \in [-1, 1]$ corresponding to an observed output $t \in \mathbb{R}$ with an invertible function $f$. Specifically, Tang et al. (2011) show that a pointwise estimator $\hat{x}$, which is based on the estimation of monotone functions, achieves a parametric convergence rate $|\hat{x} - x| = O_P(n^{-1/2})$, where $n$ is the number of observations. They also establish an asymptotic distribution of the estimator $\hat{x}$. Krief (2017) develops an estimator $\hat{f}$ for an unknown invertible function $f_*$, which is written as a conditional expectation with an $r$-times continuously differentiable distribution function, and studies its convergence rate in terms of the sup-norm. Because this rate is slower than the minimax optimal rate on (even non-invertible) $r$-differentiable functions, it is suggested that this rate does not achieve optimality.
For the multivariate ($d \ge 2$) case, there are few studies on the rate of errors, because a multivariate invertible function may not be represented by a simple monotone function as in the univariate case. Several studies on normalizing flows show the universality of each developed flow model (e.g., Huang et al. (2018); Jaini et al. (2019); Teshima et al. (2020)). However, these studies do not discuss efficiency, and only a few have investigated the magnitude of approximation errors of simple flow models (Kong and Chaudhuri, 2020).
The minimax rate of risk is a specific measure describing the effect of shape constraints such as invertibility, and one primary interest is whether shape constraints change the minimax rate. It has been shown that some shape constraints change the minimax rate to the parametric rate $O(n^{-1/2})$, such as unimodality (Bellec, 2018), convexity (Guntuboyina and Sen, 2015), or log-concavity (Kim et al., 2018), whereas the ordinary rate without shape constraints is $O(n^{-r/(2r+d)})$ with an input dimension $d$ and smoothness $r$ of a target function. Furthermore, even in the invertible setting, Tang et al. (2015) achieved the parametric rate for the pointwise estimator. In contrast, the monotonicity constraint does not change the rate; that is, Low and Lang (2002) show that the nonparametric rate appears in the estimation of monotone functions. Based on these contrasting facts, whether the invertibility constraint improves the $L^2$-risk is an open question for clarifying the efficiency of invertible function estimation.

Problem Setting
We consider a nonparametric planar regression problem with an invertible bi-Lipschitz function, and study an invertible estimator for the problem. We set the input dimension as $d = 2$ and define $I := [-1, 1]$. We consider a set of invertible and bi-Lipschitz functions as
$$\mathcal{F}^L_{\mathrm{INV}} := \{ f : I^2 \to I^2 \mid \forall y \in I^2,\ !\exists x \in I^2 \text{ s.t. } f(x) = y, \text{ and } f \text{ is bi-Lipschitz} \},$$
where $!\exists$ denotes unique existence, and a function $f$ is called bi-Lipschitz if $L^{-1}\|x - x'\|_2 \le \|f(x) - f(x')\|_2 \le L\|x - x'\|_2$ holds for some $L \ge 1$ for any $x, x' \in I^2$. The bi-Lipschitz property is reasonable in dealing with invertible functions, because $f \in \mathcal{F}^L_{\mathrm{INV}}$ holds if and only if $f^{-1} \in \mathcal{F}^L_{\mathrm{INV}}$ holds (see Lemma 16). Note that an invertible and continuous function is called a homeomorphism.
Assume we have observations $D_n := \{(X_i, Y_i)\}_{i=1}^n \subset I^2 \times \mathbb{R}^2$ that independently and identically follow the regression model for $i = 1, \ldots, n$:
$$Y_i = f_*(X_i) + \varepsilon_i, \quad \varepsilon_i \sim N_2(0, \sigma^2 I_2) \qquad (1)$$
with a true function $f_* \in \mathcal{F}^L_{\mathrm{INV}}$ and $\sigma^2 > 0$. Let $P_X$ be a marginal measure of $X_i$, and we assume that $P_X$ has a density function which is positive and bounded on $I^2$.
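The observation model above can be sketched in a few lines of code. This is only an illustration under stated assumptions: the map `f_star`, the constant `A`, and the uniform design are hypothetical choices made here (they are not the true function or design used in the paper), chosen because this particular map sends $[-1,1]^2$ onto itself and is invertible for $|A| < 1/2$.

```python
import random

A = 0.3  # hypothetical distortion strength; |A| < 1/2 keeps the map invertible


def f_star(x1, x2):
    """Hypothetical invertible bi-Lipschitz map of [-1, 1]^2 onto itself.

    For fixed x2, the first coordinate is strictly increasing in x1
    (derivative 1 - 2*A*x1*x2 >= 1 - 2*|A| > 0) and fixes x1 = -1 and x1 = 1.
    """
    return (x1 + A * (1.0 - x1 * x1) * x2, x2)


def simulate(n, sigma, seed=0):
    """Draw (X_i, Y_i), i = 1..n, with Y_i = f_star(X_i) + Gaussian noise.

    X_i is uniform on [-1, 1]^2 (an assumed design with a positive, bounded
    density) and each noise coordinate has standard deviation sigma.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
        fx = f_star(*x)
        y = (fx[0] + rng.gauss(0.0, sigma), fx[1] + rng.gauss(0.0, sigma))
        data.append((x, y))
    return data
```

Calling `simulate(1000, 0.1)` returns a list of `(x, y)` pairs; note that the responses may fall outside $I^2$ because of the noise, which is why the observations live in $I^2 \times \mathbb{R}^2$.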

Analysis Framework with Inverse Risk
The goal is to investigate the difficulty in estimating invertible functions by invertible estimators. To this end, we define two risks: (i) an inverse risk to evaluate both the estimation error and the invertibility of estimators, and (ii) an $L^2$-risk for an inverse. As a preliminary, for any $y \in I^2$, $f_n^{\ddagger}(y)$ denotes $x \in I^2$ if it uniquely satisfies $f_n(x) = y$, and some constant vector $c \in \mathbb{R}^2 \setminus I^2$ otherwise. Namely, $f_n^{\ddagger}$ represents a quasi-inverse of the function $f_n$, which can be defined even for functions that are not entirely invertible.
(i) Inverse risk: As the first risk, we develop an inverse $L^2$-risk, denoted by $R_{\mathrm{INV}}(f_n, f_*)$, where $E_n$ denotes the expectation with respect to the observations $D_n$.
(ii) $L^2$-risk for inverse: As the second risk, more simply, we define an $L^2$-risk for the inverse of $f_*$, denoted by $R^{\ddagger}_{\mathrm{INV}}(f_n, f_*)$. This risk is designed not only to evaluate the estimation error of the inverse function $f_*^{-1}$, but also to account for whether the estimator $f_n$ is invertible, since it utilizes the modified inverse $f_n^{\ddagger}$.
Then, we study the minimax inverse risk and the minimax $L^2$-risk for inverses of the regression problem, that is, we consider the value
$$\inf_{f_n} \sup_{f_* \in \mathcal{F}^L_{\mathrm{INV}}} R_{\mathrm{INV}}(f_n, f_*)$$
and that with $R^{\ddagger}_{\mathrm{INV}}(f_n, f_*)$. Here, the infimum with respect to $f_n$ is taken over all measurable estimators depending on $D_n$. Note that this minimax inverse risk is related to the ordinary minimax risk without the invertibility of estimators.

Approach and Results
Our analysis depends on the representation of invertible functions by level-sets. For an invertible function $f = (f_1, f_2) \in \mathcal{F}^L_{\mathrm{INV}}$, we represent its inverse as
$$f^{-1}(y) = L_{f_1}(y_1) \cap L_{f_2}(y_2), \qquad (3)$$
where $L_{f_j}(y_j) := \{x \in I^2 \mid f_j(x) = y_j\}$ is a level-set for $y_j \in I$ and $j = 1, 2$. In this form, we can characterize the invertibility of $f$ by ensuring the uniqueness of the intersection in (3). This result allows the analysis of the smoothness and composition of an invertible estimator.
Our first main result is a lower bound of the minimax inverse risk and the minimax $L^2$-risk for inverses based on the developed representation. Specifically, we show that with $d = 2$ and any $\psi \in \Psi$:
$$\inf_{f_n} \sup_{f_* \in \mathcal{F}^L_{\mathrm{INV}}} R_{\mathrm{INV}}(f_n, f_*) \gtrsim n^{-2/(2+d)},$$
where $\gtrsim$ denotes an asymptotic inequality up to constants, and $\inf_{f_n}$ takes the infimum over all possible estimators depending on $D_n$. This rate corresponds to the minimax rate of estimating (not necessarily invertible) bi-Lipschitz functions. This result gives a negative answer to the question of whether invertibility improves the minimax optimal rate to the parametric rate. That is, the family of functions restricted to be invertible is still sufficiently complicated, and no rate improvement occurs for the $L^2$-risk when estimating it.
Our second main result is an upper bound of the minimax risks. To derive the bound, we develop a novel estimator for $f_*$, and derive an upper bound on the inverse risk that corresponds to the lower bound. This estimator employs an arbitrary estimator of $f_*$ that is minimax optimal in the sense of the standard $L^2$-risk, and amends it to be asymptotically almost everywhere invertible, so as to inherit the rate of convergence. As a result, for $d = 2$ and $\psi(z) = z^4$, we obtain
$$\inf_{f_n} \sup_{f_* \in \mathcal{F}^L_{\mathrm{INV}}} R_{\mathrm{INV}}(f_n, f_*) \asymp n^{-2/(2+d)},$$
where $\asymp$ denotes the asymptotic equality up to constants and logarithmic factors in $n$. While the above result considers the 4th power penalty $\psi(z) = z^4$ due to the pathological example shown in Supplement D.5, the pathological example does not appear if the Lipschitz constant of $f$ and $f^{-1}$ is less than $L = 2^{1/4} \approx 1.19$: for another penalty $\psi(z) = z^2$, we also prove the same rate, up to logarithmic factors, over the correspondingly restricted function class. Similar to the above discussion, these results state that the problem of learning invertibility has the same minimax rate as that of estimating bi-Lipschitz functions.

Organization
The remainder of this paper is organized as follows. In Section 2, we characterize invertible functions by their level-sets. In Section 3, we provide a minimax lower bound for the inverse risk. In Section 4, we develop an invertible estimator and prove that an upper bound of its risk attains the lower bound up to logarithmic factors. Supporting lemmas, propositions, and proofs of the theorems are listed in the Appendix.

Level-Set Representation on Invertible Function
We consider a representation of invertible functions using the notion of level-sets, which will be used in our main results. That is, we describe the inverse of a function by an intersection of level-sets of the coordinates of the function. This approach is different from the commonly used representation of invertible functions by monotonicity (Krief, 2017), local approximation (Tang et al., 2011, 2015), or Hessian normalization (Rezende and Mohamed, 2015; Dinh et al., 2017).
We consider a vector-valued function $f : I^2 \to I^2$ with its coordinate-wise representation $f(x) = (f_1(x), f_2(x))$ for $f_j : I^2 \to I$. For $j = 1, 2$, we define a level-set of $f_j$ for $y_j \in I$ as
$$L_{f_j}(y_j) := \{x \in I^2 \mid f_j(x) = y_j\}.$$
The notion of level-sets represents a slice of functions, whose shape depends on the nature of these functions. Then, we define the level-set representation of $f(x)$.
Definition 1 (Level-set representation). For a function $f = (f_1, f_2) : I^2 \to I^2$ and $y \in I^2$, the level-set representation is defined as
$$f^{\dagger}(y) := L_{f_1}(y_1) \cap L_{f_2}(y_2).$$
This term is defined with an output-wise level-set of the function $f$. The existence and nature of the intersection $f^{\dagger}(y)$ depend on the nature of $f$. Then, the property of $f^{\dagger}(y)$ explains the invertibility of $f$.
Proposition 2 (Level-set representation for an invertible function). $f : I^2 \to I^2$ is invertible if and only if $f^{\dagger}(y)$ exists and is uniquely determined for all $y \in I^2$. Furthermore, if $f$ is invertible, we have
$$f^{-1}(y) = f^{\dagger}(y) = L_{f_1}(y_1) \cap L_{f_2}(y_2). \qquad (4)$$
From this result, if $f$ is invertible, there exists a corresponding level-set representation. Additionally, the level-set has tractable geometric properties, which are useful for future analyses. We discuss the properties of level-sets in the next section. We illustrate level-sets $L_{f_1}, L_{f_2}$ in Figure 1. The orange and blue lines represent $L_{f_1}(y_1)$ and $L_{f_2}(y_2)$, respectively; $x = f^{-1}(y)$ coincides with the intersection $L_{f_1}(y_1) \cap L_{f_2}(y_2)$ as described in eq. (4).
Figure 1: Level-sets $L_{f_1}(y_1)$ and $L_{f_2}(y_2)$ of $f \in \mathcal{F}^L_{\mathrm{INV}}$. These provide a level-set representation $f^{\dagger}$ of $f$, and the uniqueness of the intersection (black dot) of each level-set ensures invertibility, yielding $f^{-1}(y) = f^{\dagger}(y)$.
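The level-set representation can be illustrated numerically: discretize each level-set on a grid and intersect the two. The following is a minimal sketch, not the construction used in the proofs. The map `f`, the constant `A`, the grid size `m`, and the tolerance rule are all hypothetical choices made for this illustration.

```python
A = 0.3  # hypothetical constant; |A| < 1/2 keeps f invertible on [-1, 1]^2


def f(x1, x2):
    """Hypothetical invertible map of [-1, 1]^2 onto itself (illustration only)."""
    return (x1 + A * (1.0 - x1 * x1) * x2, x2)


def level_set(j, yj, grid, tol):
    """Discretized level-set: grid points x with |f_j(x) - y_j| <= tol."""
    return {(a, b) for a in grid for b in grid if abs(f(a, b)[j] - yj) <= tol}


def inverse_by_level_sets(y, m=201):
    """Approximate f^{-1}(y) by intersecting the two discretized level-sets."""
    grid = [-1.0 + 2.0 * i / (m - 1) for i in range(m)]
    tol = 2.0 / (m - 1)  # one grid step, so the true preimage is not missed
    common = level_set(0, y[0], grid, tol) & level_set(1, y[1], grid, tol)
    if not common:
        raise ValueError("empty intersection; increase m or the tolerance")
    k = len(common)
    # The intersection is a small cluster of grid points; report its centroid.
    return (sum(p[0] for p in common) / k, sum(p[1] for p in common) / k)
```

For instance, `inverse_by_level_sets(f(0.4, -0.2))` returns a point close to `(0.4, -0.2)`. For a non-invertible map, the intersection would contain several well-separated clusters or be empty, which is the failure mode that Proposition 2 rules out.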

Property of Level-Set by Invertible Function
We consider an invertible function $f \in \mathcal{F}^L_{\mathrm{INV}}$, whose level-sets $L_{f_j}(y_j)$ have some geometric properties that are critical for the analyses of the minimax inverse risk in Sections 3 and 4. All results in this section are rigorously proven in Appendix A.
A level-set has a parameterization with a parameter $\alpha \in I$:
Lemma 3. For $f \in \mathcal{F}^L_{\mathrm{INV}}$, the following holds for each $y \in I$:
$$L_{f_1}(y) = \{ f^{-1}(y, \alpha) \mid \alpha \in I \}, \qquad L_{f_2}(y) = \{ f^{-1}(\alpha, y) \mid \alpha \in I \}.$$
This parameterization guarantees the smoothness of level-sets, together with the Lipschitz property of $f$. This property prohibits a "sharp fluctuation" in the level-set $L_{f_j}$, as shown in Figure 2. Furthermore, the level-set $L_{f_j}(y)$ shifts continuously with respect to $y \in I$; more specifically, there exists $C \in (0, \infty)$ such that
$$d_{\mathrm{Haus.}}(L_{f_j}(y), L_{f_j}(y')) \le C |y - y'|$$
holds for all $y, y' \in I$ (see Lemma 17 in Appendix A). The level-sets at $y = \pm 1$ are also included in the boundary of the domain $I^2$: $L_{f_j}(\pm 1) \subset \partial I^2$ (see Lemma 18 in Appendix A).
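The continuity of level-sets in the Hausdorff distance can be checked numerically on an example. The sketch below assumes a hypothetical inverse map `f_inv` given in closed form (so that the level-set of the first coordinate is traced by varying the parameter $\alpha$ in the second argument), and the helper names are illustrative only. For this particular map, the constant $C = 1 + 2A$ works, because the first coordinate of `f_inv` has a partial derivative in `y1` bounded by $1 + 2A$.

```python
import math

A = 0.3  # hypothetical constant; |A| < 1/2 keeps the map invertible


def f_inv(y1, y2):
    """Hypothetical inverse map f^{-1}, given in closed form for illustration."""
    return (y1 + A * (1.0 - y1 * y1) * y2, y2)


def level_set_f1(y, m=101):
    """Discretized level-set {f^{-1}(y, alpha) : alpha in [-1, 1]}."""
    return [f_inv(y, -1.0 + 2.0 * i / (m - 1)) for i in range(m)]


def hausdorff(P, Q):
    """Hausdorff distance between two finite point sets in the plane."""
    def directed(S, T):
        return max(min(math.dist(s, t) for t in T) for s in S)
    return max(directed(P, Q), directed(Q, P))
```

With these definitions, `hausdorff(level_set_f1(0.2), level_set_f1(0.3))` stays below `(1 + 2 * A) * 0.1`, in line with the Lipschitz-type bound above.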
Whereas the above representation is for identifying the inverse function $f^{-1}$, the level-set representation for the inverse function recovers the original function $f$ itself: applying Lemma 3 to $f^{-1}$ proves
$$f(x) = f(x_1, I) \cap f(I, x_2). \qquad (5)$$
As $f(x_1, I)$ and $f(I, x_2)$ are (1-dimensional) curves, they can be regarded as a kind of (skewed) "grid" of the square $I^2$, identifying the unique point $y = f(x)$ by their intersection. We employ this grid-like level-set representation for constructing an invertible estimator in Section 4.3.

Lower Bound Analysis
We develop a lower bound for the minimax risk. The direction of the proof is to utilize the $L^2$-risk $R(f_n, f_*)$ and to develop a certain subset of invertible bi-Lipschitz functions, which yields the relation (6). Then, we derive a lower bound on its right-hand side by two techniques: (i) the level-set representation developed in Section 2, and (ii) the information-theoretic approach for minimax risk (e.g., Section 2 in Tsybakov (2008)).

Minimax Lower Bound of the Inverse Risk
We derive the minimax lower bound for the inverse risk: applying the information-theoretic approach to the subset shown in Section 3.2 yields the following theorem.
Theorem 4. Let $\psi \in \Psi$. For $d = 2$, there exists $C_* > 0$ such that the minimax risk is bounded below by $C_* n^{-2/(2+d)}$. See Section 3.2 for the proof outline, and Appendix B for details. This lower bound on the rate indicates that imposing invertibility on the true function does not improve estimation efficiency in the minimax sense. This is because the lower rate $n^{-2/(2+d)}$ is identical to the rate for estimating (non-invertible) Lipschitz functions (see Tsybakov (2008)). Although the set $\mathcal{F}^L_{\mathrm{INV}}$ is smaller than the set of Lipschitz functions, we find that the estimation difficulty is equivalent in this sense.
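As a worked check of the exponent (an added illustration, assuming that the risk is measured on the squared $L^2$ scale and that Lipschitz functions play the role of smoothness $r = 1$ in the rate $O(n^{-r/(2r+d)})$ quoted in the Background), the rate specializes as follows:

```latex
\[
\Bigl( n^{-\frac{r}{2r+d}} \Bigr)^{2} \Big|_{r=1}
  = n^{-\frac{2}{2+d}},
\qquad
d = 2:\quad n^{-\frac{2}{2+2}} = n^{-1/2}.
\]
```

Under these assumptions, the rate at $d = 2$ is $n^{-1/2}$ on the squared scale, that is, $n^{-1/4}$ on the scale of the norm, which is slower than the parametric rate $n^{-1/2}$ on the same scale.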
We also derive a lower bound for an inverse risk based on the above results. By the relation (6), the corresponding lower bound for the inverse risk holds without further proof. This result implies that the efficiency of estimators preserving invertibility, such as normalizing flows, coincides with that of estimation without invertibility in this sense.
Moreover, we also develop a lower bound on the $L^2$-risk for the inverse functions, stated in the following theorem.
Theorem 6. For $d = 2$, there exists $C_* > 0$ such that the minimax $L^2$-risk for the inverse is bounded below by $C_* n^{-2/(2+d)}$.
This result is obtained by leveraging the bi-Lipschitz property of $f_*$ and the result of Theorem 4. Given that this rate corresponds to the minimax rate of the estimation error for Lipschitz continuous functions, this result also shows that the invertible property does not improve the rate, as in the previous result.

Proof
and grid points t j := −1 + . ., m), we define the bi-Lipschitz function as parameterized by a binary matrix θ = (θ j 1 , j 2 ) ∈ Θ ⊗2 m (Θ m := {0, 1} m ).Using the function χ θ , we define a function class: for k ∈ {1, 2}.See Figure 3a for an illustration of the function ξ θ ∈ Ξ 2 k .Using the function set Ξ 2 k defined in (7), we define the function class as We state the invertibility of f ∈ F ({Ξ 2 k } k ) by the level-set representation in Proposition 2. That is, using the fact that a function ) is also piecewise linear with small slopes.Then, we can prove the uniqueness of the level-set representation f † (x), which indicates the invertibility of f (see Figure 3b).We summarize the result as follows.

Upper Bound Analysis
This section derives an upper bound of the minimax inverse risk by developing an estimator $f_n$ which is almost everywhere invertible in the asymptotic sense. We first present the upper bound, and subsequently describe the developed estimator.
(a) Their slopes are restricted so that the intersection is unique; hence, invertibility is guaranteed.

Minimax Upper Bound of the Inverse Risk
Using our developed estimator, we obtain the following upper bound on an inverse risk: and Assumption 1 hold.Then, for any β > 0, there exists holds for any sufficiently large n.
See Appendix D for the proof.This result is consistent with the lower bound of the inverse minimax in Theorem 4 up to logarithmic factors.We immediately obtain the following result: Corollary 9. Let ψ(z) = z 4 .Consider the setting in Theorem 8.Then, for any β > 0, there exists holds for any sufficiently large n.
With this result, we achieve a tight evaluation of the minimax inverse risk in case d = 2.This result implies that the difficulty of estimating invertible functions is similar to the case without invertibility, and that there are estimators that achieve the same rate up to logarithmic factors.
The penalty function ψ(z) = z 4 can be replaced to ψ(z) = z 2 , by considering a function class Proposition 10.Let ψ(z) = z 2 .Consider the setting in Theorem 8.Then, for any β > 0, there exists holds for any sufficiently large n.
Next, we mention the immediate consequence of the above results.Considering the inequality holds for any sufficiently large n.
This constraint $L = 2^{1/4} \approx 1.19$ is essential and difficult to relax to larger constants. It is necessary so that the quadrilateral, which is the image of a small square in the domain $I^2$ under $f_*$, does not become pathological with twists. As shown in Remark 29 in Appendix D.4, even in the case $L = \sqrt{2} \approx 1.41$, there can be a pathological example that prohibits proving the minimax optimality with $\psi(z) = z^2$.

Idea and Preparation for Invertible Estimator
We describe the developed invertible estimator that attains the above upper bound. The estimator is made by partitioning the domain $I^2$ and the range $I^2$, respectively, and combining local bijective maps between pieces of the partitions. To develop the partitions and bijective maps, we develop (i) a coherent rotation for $f_*$ and (ii) two types of partitions of $I^2$ by squares and quadrilaterals. In this section, we introduce these techniques in preparation.
We refer to ρ as a coherent rotation.We provide a specific form of ρ in Appendix C.2, and the proof of Lemma 12 is shown in Appendix C.3.

Two Partitions of I 2
We develop two types of partitions of I 2 , in order to construct local bijective maps between pieces of the partitions, then combine them to develop an invertible function.
The first partition is defined by grids in $I^2$. We consider a set of grid points $\{0, \pm 1/t, \pm 2/t, \ldots, \pm (t-1)/t, \pm 1\}$ ($t \in \mathbb{N}$), then consider a square □ formed by the grids. For each □, we choose four points $\nu(□) := \{x', x'', x''', x''''\} \subset I^2$ such that they are the vertices of □, and starting from the $x'$ closest to $(1, 1)$, we set the other vertices along a clockwise path $x' \to x'' \to x''' \to x''''$ along the boundary of □. The set of □ forms a straightforward partition of $I^2$.
Remark 13 (Twist of quadrilaterals ♢). If ♢ is twisted as in Figure 6 (left), the partition is not well-defined. However, when the grids for □ are sufficiently fine, i.e., $t$ is sufficiently large, the twisted quadrilaterals vanish in the sense of the Lebesgue measure (see Figure 6 (right)). Since we will consider $t \to \infty$ as $n$ increases when developing an estimator, the effect of the twisted quadrilaterals is asymptotically negligible in the result of estimation. Hence, we assume that there is no twist to simplify the discussion. We provide details of the twist in Appendix D.5.
Using the partitions, we can develop an invertible approximator for $g_*$. For each □ and its corresponding ♢, we can easily find a local bijective map $g_□ : □ \to ♢$ (its explicit construction will be provided in Section D.3). Then, we combine them and define an invertible function $g^{\dagger}_*$, which approaches $g_*$ as $t$ increases to infinity. In the following section, we develop an invertible estimator through estimation of $\rho$ and $g^{\dagger}_*$.
Figure 6: The twisted quadrilateral in $I^2$ (the green region on the left) disappears as the partition by squares becomes finer (the yellow quadrilateral on the right). The yellow and blue curves are level-sets of $g_*$. As $t$ increases, the twists vanish or become negligibly small.
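To make the idea of a local map from a square to a quadrilateral concrete, the following sketch uses bilinear interpolation, the variant mentioned for the numerical experiments. It is an illustration under assumptions, not the explicit construction of Section D.3: the function names, the vertex ordering (counterclockwise from the lower-left corner), and the convexity test are hypothetical choices made here. The convexity test is a sufficient check that rules out twists: the Jacobian determinant of a bilinear map is affine in the local coordinates, so it keeps one sign on the whole square when it has that sign at the four corners.

```python
def bilinear_map(square, quad, x):
    """Map a point x of an axis-aligned square onto a quadrilateral.

    square: (x_lo, y_lo, side) describing the square.
    quad:   images of the corners (lo,lo), (hi,lo), (hi,hi), (lo,hi), in this order.
    """
    x_lo, y_lo, side = square
    s = (x[0] - x_lo) / side  # local coordinates in [0, 1]
    t = (x[1] - y_lo) / side
    q00, q10, q11, q01 = quad
    return tuple(
        (1 - s) * (1 - t) * q00[k] + s * (1 - t) * q10[k]
        + s * t * q11[k] + (1 - s) * t * q01[k]
        for k in range(2)
    )


def is_strictly_convex(quad):
    """True if all turns along the boundary have the same orientation.

    A twisted (self-intersecting) quadrilateral has turns of both signs.
    """
    crosses = []
    for i in range(4):
        p, q, r = quad[i], quad[(i + 1) % 4], quad[(i + 2) % 4]
        crosses.append((q[0] - p[0]) * (r[1] - q[1]) - (q[1] - p[1]) * (r[0] - q[0]))
    return all(c > 0 for c in crosses) or all(c < 0 for c in crosses)
```

For example, with `quad = [(0, 0), (2, 0), (3, 2), (0, 1)]` the corners of the unit square are sent to the listed vertices, while the bow-tie `[(0, 0), (1, 1), (1, 0), (0, 1)]` is flagged as not convex.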

Invertible Estimator
We develop an invertible estimator f n by the following two steps: (i) we develop estimators ρ n for ρ and g † n for g † * , by using a pilot estimator f (1) n (e.g., kernel smoother) which is not necessarily invertible but consistent, and (ii) we define the developed estimator as In preparation, we first introduce the following assumption on the pilot estimator: Assumption 1.There exists an estimator f (1) holds for sufficiently large n, with some α > 0 and a sequence δ n ↘ 0 as n → ∞.
Several estimators are proved to satisfy this assumption, for example, using a kernel method (Tsybakov ( 2008)), a nearest neighbour method (Devroye (1978); Devroye et al. (1994)) and a Gaussian process method (Yoo and Ghosal (2016); Yang et al. (2017)) with various α.In some cases, it is necessary to restrict their ranges to I 2 by clipping.Note that this assumption does not guarantee invertibility of f (1) n as follows: Proposition 14.There exists an estimator f (1) INV and any n ∈ N.
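A pilot estimator of this kind can be sketched as follows. This is a minimal illustration, assuming a Nadaraya-Watson smoother with a Gaussian kernel and a fixed bandwidth `h`; the function names and the bandwidth are hypothetical, and no claim is made here that this particular choice satisfies Assumption 1 with a specific $\alpha$. The clipping step restricts the range to $I^2$ as mentioned above.

```python
import math


def clip(v, lo=-1.0, hi=1.0):
    """Restrict a scalar to the interval [lo, hi]."""
    return max(lo, min(hi, v))


def pilot_estimate(data, x, h):
    """Nadaraya-Watson estimate at x from pairs (x_i, y_i), clipped to [-1, 1]^2.

    data: list of ((x1, x2), (y1, y2)) observations.
    h:    bandwidth of the Gaussian kernel (an assumed tuning parameter).
    """
    num = [0.0, 0.0]
    den = 0.0
    for xi, yi in data:
        sq_dist = (x[0] - xi[0]) ** 2 + (x[1] - xi[1]) ** 2
        w = math.exp(-sq_dist / (2.0 * h * h))  # Gaussian kernel weight
        den += w
        num[0] += w * yi[0]
        num[1] += w * yi[1]
    return (clip(num[0] / den), clip(num[1] / den))
```

Such an estimate is a smooth weighted average of the responses and is therefore, in general, not guaranteed to be an invertible function of `x`, which is why the subsequent construction is needed.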
Herein, we develop the invertible estimator f n by leveraging the (not necessarily invertible) pilot estimator f (1) n as follows: (i-a) Estimator for ρ: We develop the invertible estimator ρ n for ρ, such that ρ n ( f (i-b) Estimator for g * : We define an estimator g n (x) := P ρ n ( f (1) n (x)) for g * , where P constrains g n (x) to an edge of the range I 2 , when x is an endpoint of the domain I 2 : P replaces y 1 in . This operator P is necessary for making g n to have a range I 2 .Note that g n is not always invertible.
(ii) Invertible estimator for f * : We define the estimator for f * as Since ρ n and g † n are invertible, the invertibility of f n is assured.

Numerical Demonstration of the Developed Estimator:
We experimentally demonstrate the developed estimator $f_n$. We set a true function using the functions $\omega, \vartheta, v$ defined in Appendix C.1. We generated $n = 10^4$ covariates $x_i \overset{\text{i.i.d.}}{\sim} U(I^2)$ and the corresponding outcomes. We note that we use bi-linear interpolation for calculating $g^{\dagger}_n$, which coincides with the triangle interpolation (9) in this setting.
R source code to reproduce the experimental results is provided at https://github.com/oknakfm/NPIR.

Conclusions and Future Research Directions
We studied nonparametric planar invertible regression, which estimates an invertible and bi-Lipschitz function $f_* \in \mathcal{F}^L_{\mathrm{INV}}$ on the closed square $[-1, 1]^2$. For $d = 2$, we defined the inverse risk to evaluate invertible estimators $f_n$: the minimax rate is lower bounded by $n^{-2/(2+d)}$. We developed an invertible estimator, which attains the lower bound up to logarithmic factors. This result implies that the estimation of invertible functions is as difficult as the estimation of non-invertible functions in the minimax sense. For this evaluation, we employed output-wise level-sets $L_{f_j}(y) := \{x \in I^2 \mid f_j(x) = y\}$ of the invertible function $f = (f_1, f_2)$, as their intersection $L_{f_1}(y_1) \cap L_{f_2}(y_2)$ identifies the inverse $f^{-1}(y)$. We identified some important properties of the level-set $L_{f_j}$. This study is a first step towards understanding the multidimensional invertible function estimation problem.
However, there remain unsolved problems. For example, (i) We developed an invertible estimator only for a restricted case, $d = 2$. A natural direction would be to extend our estimator and the minimax upper bound of the inverse risk to the general $d \ge 3$. However, a theoretical extension to general $d \ge 3$ does not seem straightforward for the following two reasons: (a) the coherent rotation, which is used to align the endpoints in our estimator, cannot be defined even for $d = 3$, and (b) Donaldson and Sullivan (1989) proved that bi-Lipschitz homeomorphisms cannot be approximated even by piecewise affine functions for $d = 4$. Some additional assumptions seem to be needed. Another ongoing work of ours studies the case $d \in \mathbb{N}$, by additionally imposing $C^2$ smoothness on $f_*$ to eliminate the pathological cases.
Figure panels: (c,h) estimator $g_{n,1}$ transformed by a coherent rotation, (d,i) invertible estimator $g_{n,1}$ using bilinear interpolation, and (e,j) invertible estimator $f_{n,1}$. The upper row is $\sigma^2 = 10^{-3}$ and the lower row is $\sigma^2 = 10^{-1}$.
(ii) The discussions in this paper mostly rely on (a) the existence of the boundary and (b) the simple connectivity of the set $[-1, 1]^2$. It would be worthwhile to generalize our discussion to different types of domains, such as the open multidimensional unit cube $(-1, 1)^2$ (e.g., see Kawamura (1979) and Pourciau (1988) for a characterization of nonsmooth invertible mappings on $\mathbb{R}^2$, where $\mathbb{R}^2$ and $(-1, 1)^2$ are homeomorphic) and sets with a different topology, such as a torus (see Hatcher (2002) for a gentle introduction to the torus, and Rezende et al. (2020) for normalizing flows on tori and spheres).
(iii) It is important to relax the bi-Lipschitz continuity setting. In particular, omitting the lower-Lipschitz restriction is important. If we omit this restriction, we can handle a wider class of functions such as polynomials.
(iv) Whereas the minimax rate is obtained for a supervised regression problem, one of the main applications of the multidimensional invertible function estimation is density estimation, which implicitly trains the invertible function in an unsupervised manner.An interesting direction would be to extend the minimax rate to an unsupervised setting.
Lemma 15 immediately proves the following Lemma 16.
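Lemma 16. Let $f : I^2 \to I^2$ be an invertible function. Both $f$ and $f^{-1}$ are Lipschitz if and only if $f$ is bi-Lipschitz, i.e., there exists $L \ge 1$ such that $L^{-1}\|x - x'\|_2 \le \|f(x) - f(x')\|_2 \le L\|x - x'\|_2$ for any $x, x' \in I^2$.
Proof of Lemma 17. Considering the representation in Lemma 3, the Lipschitz property of $f^{-1}$ proved in Lemma 15 leads to
$$d_{\mathrm{Haus.}}(L_{f_1}(y), L_{f_1}(y')) \le C |y - y'|$$
for some $C \in (0, \infty)$, and $d_{\mathrm{Haus.}}(L_{f_2}(y), L_{f_2}(y')) \le C |y - y'|$ is proved in the same way.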
Lemma 18.Let X , Y ⊂ R 2 be non-empty closed topological spaces and let f : X → Y be a homeomorphism, i.e., invertible and continuous function.Then, f (∂X ) = ∂Y .

B Proofs for Lower Bound Analysis
(ii) ξ θ is surjective from I to I .
Proof of Lemma 19.We prove (i) and (ii) as follows.
(i) We prove that a continuous function χ θ (x k ) = χ θ (x k , x ℓ ) is piecewise linear whose slopes are greater than −1, as it immediately proves the strict monotonicity of Let j ∈ [m] and let j * ℓ ∈ [m] be a minimum index satisfying |x ℓ − t j * ℓ | ≤ 1/m, for l ∈ {1, 2} \ {k}.Recall the function Φ(x) defined in Section 3.2.Consider a function and split I into .
By the definition of the function Φ, we have is piecewise linear with slopes m/M , 0, −m/M .Recalling that M > 2m and θ j 1 , j 2 ∈ {0, 1}, the slope of χ θ is greater than −1 (and is less than 1).The assertion (i) is proved.

B.2 Proof of Proposition 7
As the bi-Lipschitz property is straightforwardly proved, we describe the invertibility of f shown in ( 4) is a unique point for every y = (y 1 , y 2 ) ∈ I 2 .Therefore, in this proof, it is sufficient to prove the uniqueness of f † .
We first examine the level-set L f 1 (y 1 ).Using a function ι(x where the last equality follows from χ θ (x) As slopes of two level-sets L f 1 (y 1 ), L f 2 (y 2 ) take values within (−1, 1) (along with axes 2, 1, respectively), they belong to each region divided by the two dot lines, meaning that the intersection L f 1 (y 1 ) ∩ L f 2 (y 2 ) is unique, i.e., the function f = ( f 1 , f 2 ) is invertible.
(a) Level-set L f 1 (y 1 ), which is piecewise linear and its maximum slope is 2m/M (along with the axis 2).
(b) Intersection of the level-sets L f 1 (y 1 ), L f 2 (y 2 ) is a unique point.

B.3 Proof of Theorem 4.
This proof consists of the following two steps: (step 1) we define an induced set , and (step 2) we apply Lemma 21 to this (sufficiently complex) function set Step 1: Define F ({ Ξ 2 k } k ), a sufficiently complex subset of F L INV .We define an induced subset of the function class m , Lemma 20 proves the existence of T ⊂ Θ ⊗2 m such that |T | ≥ m/8 and min θ̸ =θ ′ ∈T H (θ, θ ′ ) ≥ 2 m/8 .We define an induced function set where the definition of F (•) and Lemma 7 prove indicating the invertibility of the functions k , there exists c 1 ∈ (0, 1) such that we obtain Hence, we obtain α in Lemma 21 (used in the next step) bounded below by m −1 .
Step 2: Apply information-theoretic approach.Finally, we develop a set of invertible functions.In this step, P j ∈ P denotes a joint distribution of D n associated to the probabilistic model ( 1), equipped with the corresponding f * = f j .
Considering the inclusion relation By this form, it is sufficient to study a minimax rate with F ({ Ξ 2 k } k ).We apply the discussion for minimax analysis introduced in Tsybakov ( 2008), which is displayed as Lemma 21.We check the conditions of Lemma 21 one by one.First, we check that with some constant c 2 > 0. Here, the setting of the non-zero bounded density of P X assures the existence of c 2 .Third, we apply an equation (2.36) in Tsybakov (2008) which yields which does not diverge when we set m = n 1/(2+d ) .Hence, we set m = n 1/(2+d ) ; there exists C * > 0 such that, with a probability larger than 1/2, we obtain inf Finally, we apply the discussion of the minimax probability, which is described in (2.5) of Tsybakov (2008).Since Markov's inequality gives that the risk R( , we obtain the minimax lower bound of R( f n , f * ) in the statement.

B.4 Proof of Theorem 6
As f * is bi-Lipschitz, for any point y ∈ Ω( f n ) := {y ∈ I 2 : f n is invertible at y}, we have an inequality where x = f ‡ n (y).Hence, we have with some constant c ∈ (0, 1).By Daneri and Pratelli (2014), it can be shown that the Lebesgue measure of the non-invertible region L (I 2 \ Ω( f n )) converges to 0. Therefore, the assertion is proved by following the proof of Theorem 4.

C.1 Additional Symbol and Notation
We define several functions and vectors with fixed f ∈ F L INV .Recall that D is a unit ball in R 2 .We develop a correspondence between the unit ball D and the square I 2 as the domain, by using some invertible maps.We define a map ω : Its inverse is explicitly written as For each of the vertices, its corresponding point on D 2 is defined as We also consider polar coordinates of elements in the unit ball D. For a radius r ∈ [0, 1] and an angle θ ∈ [0, 2π), v (r, θ) := (r sin θ, r cos θ) ∈ D is a transform from polar coordinate to the ordinary system.For convenience, we define {0} and e = (0, 1).

C.2 Coherent rotation ρ
For a function $f \in \mathcal{F}^L_{\mathrm{INV}}$, we define a coherent rotation $\rho$ with functions and vectors defined in Section C.1: where $R : D^2 \to D^2$ will be defined in the latter half of this section.
In preparation, we consider angles that correspond to the vertices of $I^2$ as where $\theta^{\dagger}$ is a fixed angle θ .
$\tau$ is a piecewise linear function which connects the angles $\{\theta_j\}_{j=1}^4$ defined above (shown in Figure 11).
Using these notions, we define the function $R$ : $R$ rotates the points on the unit ball so that the points $\{\theta_j\}_{j=1}^4$ become equally spaced on the boundary of $D$. Figure 12 illustrates $\rho$, including the role of $R$.
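To make the idea of a piecewise linear angle map concrete, the following Python sketch builds a map that sends a given set of boundary angles to equally spaced angles and interpolates linearly in between. It is only an illustration of the general mechanism: the function name `equalizing_angle_map`, the choice of keeping the first angle fixed, and the handling of the wrap-around segment are our assumptions, and the sketch is not claimed to coincide with the $\tau$ or $R$ defined in this section.

```python
import math

TWO_PI = 2.0 * math.pi

def equalizing_angle_map(thetas):
    """Return a piecewise linear map on angles (hypothetical helper).

    `thetas` must be strictly increasing angles in [0, 2*pi). The returned
    function sends thetas[j] to thetas[0] + j * 2*pi / k (mod 2*pi) and is
    linear on each arc between consecutive angles.
    """
    k = len(thetas)
    # Source breakpoints, closed up by repeating the first angle one turn later.
    src = list(thetas) + [thetas[0] + TWO_PI]
    # Target breakpoints: equally spaced, starting at the first source angle.
    dst = [thetas[0] + j * TWO_PI / k for j in range(k + 1)]

    def tau(theta):
        t = theta
        # Shift the input into the interval [src[0], src[-1]).
        while t < src[0]:
            t += TWO_PI
        while t >= src[-1]:
            t -= TWO_PI
        for j in range(k):
            if src[j] <= t <= src[j + 1]:
                w = (t - src[j]) / (src[j + 1] - src[j])
                return (dst[j] + w * (dst[j + 1] - dst[j])) % TWO_PI
        return t % TWO_PI  # not reached for valid input

    return tau
```

With four input angles, consecutive images are separated by $\pi/2$, which is the "equally spaced" arrangement described above.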
Lemma 22 below proves the invertibility and the bi-Lipschitz property of the function $\rho$. By this fact, we can define g

Proof of Lemma 22. $\omega$ is invertible from the definition, and $\omega, \omega^{-1}$ are Lipschitz as their directional derivatives are bounded on the compact set $I^2$. As $R$ is invertible from the definition, it suffices to show the bi-Lipschitz property of $R$.
We first consider a case |τ( θ .
(ii) We apply Lemma 24 and obtain The inequality in the second line follows from the property of polar coordinates: , 2π). By the result of (i), we have Convergence of $R_n^{-1}$ is proved in the same way.
(iii) We apply Lemma 24 and obtain Then, by the result of (ii), we prove P( The result on $\rho_n^{-1}$ is proved in the same way.
(iv) We apply Lemma 24 and obtain By the result of (iii) and Assumption 1, we obtain the statement of (iv).
See Figure 14 for an illustration of the function $f_n^{\dagger}$. Then, we have ∥ f , and their intersection is obtained as is greater than 1, the intersection is not a unique point, indicating that the function $f_n^{(1)}$ is not injective at $x \in I^2$.

D.3 Proof of Theorem 8
We review some notations. As described in the Introduction, an inverse function $f^{\ddagger}$ for a function f : for some constant vector $c \notin I^2$. We also define two sets that are used to measure a property of invertibility of functions. For a set $\Omega \subset I^2$, $\mathcal{L}(\Omega)$ denotes the Lebesgue measure of $\Omega$.
We develop an upper bound of the inverse risk with the Lipschitz coefficient $L_{f_*}$ of $f_*$ (and $f_*^{-1}$) and some constants $C_1, C_2, C_3 \in (0, \infty)$: The inequality ($\star$) follows from $\mathcal{L}(\Omega(f_n)) \le \mathcal{L}(I^2) = 4$ and the inequality Therefore, we herein evaluate $\mathcal{L}(\Omega(f_n))$ and $\|f_n - f_*\|_{L^\infty}$ in the following Propositions 27 and 28: applying Lemma 24 with these Propositions to (19) proves: By taking the expectation $\mathbb{E}_n$ with the decreasing $\delta_n \lesssim n^{-2/(2+d)} (\log n)^{2\alpha+2\beta}$, the statement is proved.
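To get a feeling for the size of the rate $n^{-2/(2+d)} (\log n)^{2\alpha+2\beta}$, the following Python sketch evaluates its polynomial part and its logarithmic factor separately. The function names and the default values $d = 2$, $\alpha = \beta = 1$ are assumptions made purely for illustration; the constants and exponents of the actual statement are those fixed in the theorem and Assumption 1.

```python
import math

def polynomial_rate(n, d=2):
    """Polynomial part n^{-2/(2+d)} of the rate."""
    return n ** (-2.0 / (2 + d))

def log_factor(n, alpha=1.0, beta=1.0):
    """Logarithmic factor (log n)^{2*alpha + 2*beta} (alpha, beta are placeholders)."""
    return math.log(n) ** (2.0 * alpha + 2.0 * beta)

def upper_rate(n, d=2, alpha=1.0, beta=1.0):
    """Product of the polynomial part and the logarithmic factor."""
    return polynomial_rate(n, d) * log_factor(n, alpha, beta)
```

With $d = 2$ the polynomial part is $n^{-1/2}$, so the logarithmic factor can dominate at moderate sample sizes even though it does not change the polynomial order of the rate.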

Proposition 27. Suppose Assumption 1 holds. There exists
Proof of Proposition 27. Let $g_*^{\dagger}$ be a function for triangle interpolation defined in (18). Then, Proposition 4.1 in Daneri and Pratelli (2014) evaluates the Lebesgue measure of the squares that cannot be linearly interpolated (that is, the twisted ones): there exists $C_1 \in (0, \infty)$ such that (20) is obtained by specifying that $r = \gamma_n$ and $\varepsilon$ is proportional to $\gamma_n$ in Proposition 4.1 in Daneri and Pratelli (2014).
Here, we show Ω ) Since the vertices of the squares converge in probability with the convergence rate for some $C_2 > 0$ and $\delta_n \searrow 0$. Therefore, applying Lemma 24 leads to the assertion.
Proof of Proposition 28. We prove (i) and (ii) by the uniform convergence of $g_n$, which is already proved in Proposition 26 (iv).
First, we consider the case that ♢ is not twisted: with the triangle $\triangle(x)$ whose vertices are $x', x'', s$, we have Second, if ♢ is twisted, Overall, we have obtained and applying Lemma 24 with Proposition 26 (iv) and the definition of $t_n$ as (17) proves the assertion.
(ii) We apply Lemma 24 and obtain with the above (i) and Proposition 26 (iii) leads to (vi).

D.4 Proof of Proposition 10
This proposition is obtained by slightly modifying the proof of Theorem 8 (shown in Appendix D.3). Specifically, we replace the penalty function $\psi(z) = z^4$ in the inequality (19) with $\psi(z) = z^2$ and obtain an inequality with a decreasing sequence $\delta_n \searrow 0$ and $C > 0$, which completes the proof.

Figure 16: The left is □ with its sides of length $a$ and diagonals of length $\sqrt{2}a$. The right is ♢♯, obtained by transforming □ with $f_*$, showing both the twisted and the not twisted cases.

By this fact, it is sufficient to show that ♢♯ is not twisted. Let $a > 0$ be the length of a side of the square □; then the length of a diagonal of □ is $\sqrt{2}a$. Let $b, c, d, e > 0$ be the lengths of the lines obtained by transforming the sides of □ by $f_*$, and $r, s > 0$ be the lengths of the lines obtained by transforming the diagonals of □, as shown in Figure 16. The bi-Lipschitz property of $f_*$ with the Lipschitz constant $L = 2^{1/4}$ yields $2^{-1/4}a \le \min\{b, c, d, e\} \le \max\{b, c, d, e\} \le 2^{1/4}a$ and $2^{1/4}a = 2^{-1/4}(\sqrt{2}a) \le \min\{r, s\}$.
By these facts, we obtain $\max\{b, c, d, e\} \le \min\{r, s\}$.
Then, ♢♯ cannot be twisted, since the lengths of the diagonals of ♢♯ are no less than those of the sides of ♢♯. Therefore, the pathological example of twists (shown in Appendix D.5) does not appear, hence (22) is proved.
Remark 29. While the above proof considers the bi-Lipschitz function with $L \le 2^{1/4} \approx 1.19$, we here consider the case $L = 2^{1/2} \approx 1.41$. Even in this case (which seems theoretically tractable), a twist may appear as shown in Figure 17, and the above proof does not hold.
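The length comparison used in the proof and in Remark 29 can be checked numerically. The following Python sketch computes, for a bi-Lipschitz constant $L$ and side length $a$, the largest possible length of a transformed side ($La$) and the smallest possible length of a transformed diagonal ($\sqrt{2}a/L$). The function names and the numerical tolerance are illustrative assumptions; the check only reproduces the sufficient condition $\max\{b,c,d,e\} \le \min\{r,s\}$ and says nothing about whether a twist actually occurs for a particular $f_*$.

```python
import math

def side_diagonal_bounds(L, a=1.0):
    """Return (upper bound on transformed sides, lower bound on transformed diagonals)."""
    max_side = L * a                       # sides of length a are stretched by at most L
    min_diagonal = math.sqrt(2.0) * a / L  # diagonals of length sqrt(2)*a shrink by at most 1/L
    return max_side, min_diagonal

def twist_excluded_by_lengths(L, a=1.0, tol=1e-12):
    """True if every transformed side is guaranteed to be no longer than every diagonal."""
    max_side, min_diagonal = side_diagonal_bounds(L, a)
    return max_side <= min_diagonal + tol
```

For $L = 2^{1/4}$ the two bounds coincide, so the condition holds with equality in the worst case; for $L = 2^{1/2}$ the side bound is $\sqrt{2}a$ while the diagonal bound is only $a$, so the length argument no longer applies.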

D.5 A Pathological Example of Twists
We defined an interpolation over the quadrilaterals as shown in Figure 5. However, the quadrilateral connecting the four points $u' = g_n(x')$, $u'' = g_n(x'')$, $u''' = g_n(x''')$ and $u'''' = g_n(x'''')$ can be twisted as shown in Figure 6 (left): this twist prevents the estimator $f_n$ from being bijective, whereby $f_n$ is not entirely invertible over $I^2$. These twists can be eliminated by increasing the number of splits $t_n$ in most cases (see Figure 6 (right) and Proposition 27).
Here, a natural question arises: can we further prove that the developed estimator is entirely invertible on $I^2$? For most suitable $f_*$, yes: our developed estimator is (asymptotically) entirely invertible, as all the twists vanish as $t_n$ increases. Unfortunately, however, there exists a pathological example in which such twists do not disappear even if $t_n$ increases. See Figure 18 for such a pathological example. In this example, small twists (which can be ignored in the Lebesgue measure) appear indefinitely, which prevents our simple estimator from being entirely invertible for general $f_* \in \mathcal{F}^L_{\mathrm{INV}}$.
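One simple way to detect a twisted quadrilateral in practice is to test whether its boundary crosses itself. The following Python sketch does this with orientation tests on the two pairs of opposite edges. It assumes that "twisted" means that the closed polygon through the four vertices, taken in cyclic order, is self-intersecting, and the function names are hypothetical; this is an illustration, not the criterion used in the proofs of this appendix.

```python
def orient(p, q, r):
    """Signed area test: positive if p, q, r turn counter-clockwise."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])

def segments_cross(p1, p2, p3, p4):
    """True if the open segments p1p2 and p3p4 cross at a single interior point."""
    d1 = orient(p1, p2, p3)
    d2 = orient(p1, p2, p4)
    d3 = orient(p3, p4, p1)
    d4 = orient(p3, p4, p2)
    return d1 * d2 < 0 and d3 * d4 < 0

def is_twisted(quad):
    """Check whether a quadrilateral given by four vertices in cyclic order is twisted.

    A quadrilateral is reported as twisted if one pair of opposite edges crosses.
    """
    a, b, c, d = quad
    return segments_cross(a, b, c, d) or segments_cross(b, c, d, a)
```

For example, the unit square with vertices listed in cyclic order is not twisted, whereas exchanging two adjacent vertices produces a bow-tie shape that is reported as twisted.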
for vector-valued functions, and $\psi \in \Psi := \{\psi : \mathbb{R}_{\ge 0} \to \mathbb{R}_{\ge 0} \mid \psi \text{ is continuous, increasing, and } \psi(0) = 0\}$ (2) denotes a non-negative penalty function. In our upper-bound analysis, we consider $\psi(z) = z^4$ and $\psi(z) = z^2$. By virtue of the penalty term $\psi(\cdot)$, $R_{\mathrm{INV}}(f_n, f_*) \to 0$ indicates both that $f_n$ is almost everywhere invertible and that $f_n$ and $f_n^{\ddagger}$ are consistent estimators. Using this risk, we can discuss constructing invertible estimators in the context of nonparametric regression.

Figure 1: Level-sets $L_{f_1}(y_1)$ (orange) and $L_{f_2}(y_2)$ (purple) in $I^2$ for $f \in \mathcal{F}^L_{\mathrm{INV}}$. These provide a level-set representation $f^{\dagger}$ of $f$, and the uniqueness of the intersection (black dot) of each level-set ensures invertibility, yielding $f^{-1}(y) = f^{\dagger}(y)$.

Figure 2: Level-sets in $I^2$. [Left] $L_{f_j}(y)$ without the Lipschitz continuity of $f_j$. [Right] $L_{f_j}(y)$ with the Lipschitz continuity of $h_j$. If $f_j$ is Lipschitz continuous, the (excessively) sharp fluctuation along one direction, shown in the left panel, does not appear. This property is clarified by parameterization (Lemma 3).
Outline: Construction of Subset of $\mathcal{F}^L_{\mathrm{INV}}$

Applying an information-theoretic approach to the subset of $\mathcal{F}^L_{\mathrm{INV}}$ constructed below proves Theorem 4. The important technical point is to use the level-set representation developed in Section 2 to guarantee the invertibility of functions in $\mathcal{F}^L_{\mathrm{INV}}$. We first define a set of functions $\Xi^2_k$ for $k \in \{1, 2\}$ as follows. Let $m \in \mathbb{N}$ and let $M > 2m$. Using a hyperpyramid-type basis function $\Phi : \mathbb{R}^2 \to [0, 1]$

Proposition 10 leads to the following upper-bound without proof:

Proposition 11. Consider the setting in Theorem 8. Then, for any $\beta > 0$, there exists C

Figure 4: (Left) Level-sets of $g_* = \rho \circ f_*$, whose endpoints are aligned with a square $I^2$ by the coherent rotation. (Right) Partition of $I^2$ into quadrilaterals. Since the endpoint level-set of $g_*$ is aligned to the endpoint of $I^2$, the partition is well-defined.

Figure 5 (right) illustrates the quadrilaterals. A set of ♢ works as a partition of $I^2$ if the quadrilaterals are not twisted (see Remark 13). Also, ♢ serves as an approximation of $g_*(\square) \subset I^2$.

Figure 12: The effect of the coherent rotation $\rho := \omega^{-1} \circ R \circ \omega : I^2 \to I^2$. $\omega$ converts the points in the square $I^2$ to the unit ball $D$. $R$ rotates the points in $D$ to arrange $\{\theta_j\}_{j=1}^4$ equally spaced on the circle.
$\omega$ are Lipschitz. Lemma 15 proves the bi-Lipschitz property of $\rho$, which indicates the assertion $\rho \in \mathcal{F}^L_{\mathrm{INV}}$.