A (tight) upper bound for the length of confidence intervals with conditional coverage

We show that two popular selective inference procedures, namely data carving (Fithian et al., 2017) and selection with a randomized response (Tian and Taylor, 2018), when combined with the polyhedral method (Lee et al., 2016), result in confidence intervals whose length is bounded. This contrasts with results for confidence intervals based on the polyhedral method alone, whose expected length is typically infinite (Kivaranovic and Leeb, 2021). Moreover, we show that these two procedures always dominate corresponding sample-splitting methods in terms of interval length.


Introduction
Post-model-selection inference, i.e., parametric inference when the fitted model is chosen in a data-driven fashion, is non-trivial. Obviously, such a model is random and may be misspecified. It is well known that the 'naive' approach, where the model-selection step is ignored in the sense that the chosen model is treated as a-priori given and as correct in subsequent analyses, can result in invalid inference procedures; cf. Leeb and Pötscher (2005). The polyhedral method of Lee et al. (2016) is a recently proposed technique that allows one to construct valid inference procedures, like tests or confidence intervals, after model selection, for a parameter of interest that depends on the selected model. The polyhedral method and its variants can be used with a variety of methods, including the Lasso or the sequential testing method considered later in this paper. Kivaranovic and Leeb (2021) showed that the expected length of confidence intervals based on the polyhedral method of Lee et al. (2016) is typically infinite. The polyhedral method can be modified by combining it with data carving (Fithian et al., 2017) or with selection on a randomized response (Tian and Taylor, 2018). These references found, in simulations, that this combination results in significantly shorter intervals than those based on the polyhedral method alone. In this paper, we give a formal analysis of this phenomenon. We show that the polyhedral method, when combined with the proposals of Fithian et al. (2017) or of Tian and Taylor (2018), delivers intervals whose length is always bounded. Our upper bound is easy to compute, easy to interpret, and also applies in situations where variances are estimated. In the interesting case where the polyhedral method alone gives intervals with infinite expected length, our bound is also sharp. Moreover, we show that the intervals of Fithian et al. (2017) and Tian and Taylor (2018) are always shorter than the intervals obtained by corresponding sample-splitting methods.
see Lee et al. (2016, Theorem 5.2). This interval, while having the right coverage probability, can be rather long in practice: Because the truncation set T is bounded on one side, the results of Kivaranovic and Leeb (2021) entail that its expected length is infinite. Now suppose that the conditioning event is randomized as in Example 2 of Tian and Taylor (2018). More precisely, suppose that the event {Z ∈ T} is replaced by {Z + R ∈ T}, where R ∼ N(0, τ²) is independent of Z and where τ is known. While this may not be practical when dealing with publication bias, this is exactly what happens in data carving and selection with a randomized response; see Section 4. Now the underlying distribution becomes that of Z conditional on the event {Z + R ∈ T}, and standard methods can again be used to construct an equal-tailed confidence set [Ľ(Z), Ǔ(Z)] satisfying the analogous conditional coverage requirement. The main finding of this paper is that this and related randomization methods have a dramatic impact on confidence interval length. Without randomization, the expected length of the interval [L̂(Z), Û(Z)] is infinite. With randomization, the length of the interval [Ľ(Z), Ǔ(Z)] is always bounded: Ǔ(Z) − Ľ(Z) < 2 × 1.96 × √(1 + 1/τ²); see Theorem 3.1. The upper bound is just the length of the usual (unconditional) 95% confidence interval Z ± 1.96 multiplied by √(1 + 1/τ²). Now consider a more elaborate scenario, model selection with the Lasso. Consider a response vector Y and a matrix X of explanatory variables. In particular, assume that Y ∼ N(θ, σ²I_n) with n ∈ N, θ ∈ Rⁿ and σ² ∈ (0, ∞), and assume that X ∈ R^{n×d} (d ∈ N) is a fixed matrix whose columns are in general position in the sense of Tibshirani (2013). The Lasso estimator, denoted by β̂(y), is the minimizer of the least squares problem with an additional penalty on the absolute size of the regression coefficients (Frank and Friedman, 1993; Tibshirani, 1996), where y ∈ Rⁿ and where λ ∈ (0, ∞) is a given tuning parameter. Because of our assumptions on X, β̂(y) is well-defined; cf. Lemma 3 in Tibshirani (2013). Conditioning on the outcome of the Lasso selection step again leads to a truncated distribution as in (1), where the
truncation set is now an interval T_{m,s}(w) that depends on the polyhedron (i.e., on m and s) and on w. In particular, where L(…) denotes the indicated (conditional) distributions, where Z ∼ N(η′_m θ, σ²∥η_m∥²), and where T_{m,s}(w) is as above. Note that this distribution is of the same functional form as (1). Using this distribution, Lee et al. (2016) obtain a confidence interval for η′_m θ with pre-specified coverage probability conditional on m̂(Y) = m, ŝ(Y) = s, (I_n − P_{η_m})Y = w, and hence also conditional on the (larger) event {m̂(Y) = m, ŝ(Y) = s}. The construction is the same as that used to obtain the interval [L̂(Z), Û(Z)] in the file-drawer problem considered earlier. And, again similar to the file-drawer problem, the conditional expected length of this interval is infinite because the truncation set T_{m,s}(w) is always bounded either from above or from below (Kivaranovic and Leeb, 2021, Proposition 2).
Next, suppose that the model-selection step is randomized as proposed by Tian and Taylor (2018). Take ω ∼ N(0, τ²I_n) independent of Y. We will select a model based on the randomized data Y + ω, while the original data Y will be used for subsequent inference. For model selection, we first compute the Lasso estimator β̂(Y + ω) from the randomized data. The non-zero coefficients of this estimator and their signs again give a model m̂(Y + ω) and a sign-vector ŝ(Y + ω) (provided that m̂(Y + ω) is non-empty). For m and s as before, we will use the polyhedral method to obtain a confidence interval for η′_m θ with pre-specified coverage probability conditional on the event {m̂(Y + ω) = m, ŝ(Y + ω) = s}. This event is again a polyhedron, but now in (Y + ω)-space. Arguing as in the preceding paragraph, we see that where Z and T_{m,s}(w) are as before and where Based on this distribution, Tian and Taylor (2018) obtain a confidence interval for η′_m θ with pre-specified coverage probability conditional on m̂(Y + ω) = m, ŝ(Y + ω) = s, (I_n − P_{η_m})(Y + ω) = w, and hence also conditional on the (larger) event {m̂(Y + ω) = m, ŝ(Y + ω) = s}. Because the distribution in the preceding display is of the same functional form as (2), the confidence interval of Tian and Taylor (2018) has properties similar to those of the interval [Ľ(Z), Ǔ(Z)] that we constructed in the randomized file-drawer problem. In particular, its length is always bounded. We will return to the Lasso in Subsection 5.2 to cover some more technical details; in particular, we will explicitly compute the upper bound on confidence interval length and also consider the case where the conditioning is on the selected model only (and not on the signs).
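The mechanics of the randomized selection step can be sketched in a few lines of code. This is a hedged illustration only: the data-generating numbers, the tuning parameter λ, and the bare-bones coordinate-descent Lasso solver are our own, not taken from the paper. The point is simply that selection is performed on the noise-added data Y + ω, while inference would use the original Y.

```python
import numpy as np

rng = np.random.default_rng(3)

def lasso(X, y, lam, n_iter=500):
    """Coordinate descent for 0.5*||y - Xb||^2 + lam*||b||_1."""
    n, d = X.shape
    b = np.zeros(d)
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(d):
            r = y - X @ b + X[:, j] * b[j]          # partial residual
            z = X[:, j] @ r
            b[j] = np.sign(z) * max(abs(z) - lam, 0.0) / col_ss[j]
    return b

# illustrative data: two truly active coefficients
n, d, sigma, tau, lam = 50, 10, 1.0, 1.0, 25.0
X = rng.normal(size=(n, d))
theta = X @ np.r_[2.0, -2.0, np.zeros(d - 2)]
Y = theta + rng.normal(0.0, sigma, n)
omega = rng.normal(0.0, tau, n)                     # randomization noise

beta = lasso(X, Y + omega, lam)                     # selection on Y + omega ...
m_hat = np.flatnonzero(beta)                        # ... inference would use Y
s_hat = np.sign(beta[m_hat])
print("selected model:", m_hat, " signs:", s_hat)
```

Re-running with a fresh ω can change the selected model, which is exactly the randomization that Section 4 exploits.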

Main technical result
Here, we study confidence intervals based on an observation from the truncated Gaussian distribution (3) of Z conditional on the event {Z + R ∈ T}, where Z and R are independent, Z ∼ N(µ, σ²) with µ ∈ R and σ² ∈ (0, ∞), R ∼ N(0, τ²) with τ² ∈ (0, ∞), and where the truncation set T is of the form (4): T = ⋃_{j=1}^{k} (a_j, b_j) with −∞ ≤ a_1 < b_1 < a_2 < ⋯ < a_k < b_k ≤ ∞. Let Φ(t) be the cumulative distribution function (c.d.f.) of the standard normal distribution and denote by F_{µ,σ²}^{τ²}(z) the conditional c.d.f. of the random variable in (3). For the sake of readability, we do not show the dependence of this c.d.f. on T in the notation. For α ∈ (0, 1) and z ∈ R, let µ_α(z) satisfy (5): F_{µ_α(z),σ²}^{τ²}(z) = 1 − α. The quantity µ_α(z) is well-defined and strictly increasing as a function of α for fixed z ∈ R (cf. Lemma A.6). For all α_1, α_2 ∈ (0, 1) such that α_1 < α_2, we have (6): P(µ_{α_1}(Z) ≤ µ ≤ µ_{α_2}(Z) | Z + R ∈ T) = α_2 − α_1, by a textbook result for confidence bounds (e.g., Chapter 3.5 in Lehmann and Romano, 2006). A common choice is to set α_1 = α/2 and α_2 = 1 − α/2, such that, conditional on Z + R ∈ T, [µ_{α_1}(Z), µ_{α_2}(Z)] is an equal-tailed confidence interval for µ at level 1 − α. Another option is to choose α_1 and α_2 such that, conditional on Z + R ∈ T, [µ_{α_1}(Z), µ_{α_2}(Z)] is an unbiased confidence interval at level 1 − α (cf. Chapter 5.5 in Lehmann and Romano, 2006).
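The construction can be sketched numerically. This is an illustration under our own assumptions (function names, the bracket-expansion search, and the parameter values are ours): the conditional c.d.f. is a one-dimensional integral, and µ_α(z) is found by solving (5) with root-finding, which works because F_µ(z) is strictly decreasing in µ (Lemma A.6).

```python
import numpy as np
from scipy import integrate, optimize
from scipy.stats import norm

def make_cdf(T, sigma2, tau2):
    """Conditional c.d.f. (z, mu) -> P(Z <= z | Z + R in T), with Z ~ N(mu, sigma2)
    and R ~ N(0, tau2) independent; T is a list of intervals (a, b)."""
    sigma, tau, s = np.sqrt(sigma2), np.sqrt(tau2), np.sqrt(sigma2 + tau2)

    def prob_T(x, scale):  # P(x + noise in T) for noise ~ N(0, scale^2)
        return sum(norm.cdf((b - x) / scale) - norm.cdf((a - x) / scale)
                   for a, b in T)

    def F(z, mu):
        num, _ = integrate.quad(lambda u: norm.pdf(u, mu, sigma) * prob_T(u, tau),
                                -np.inf, z)
        return num / prob_T(mu, s)  # denominator: Z + R ~ N(mu, sigma2 + tau2)

    return F

def interval(z, T, sigma2=1.0, tau2=1.0, alpha=0.05):
    """Equal-tailed interval [mu_{alpha/2}(z), mu_{1-alpha/2}(z)] solving (5)."""
    F = make_cdf(T, sigma2, tau2)

    def mu_a(a):  # F_mu(z) is strictly decreasing in mu (Lemma A.6)
        g = lambda mu: F(z, mu) - (1 - a)
        lo, hi = z - 4.0, z + 4.0
        while g(lo) < 0:   # expand the bracket until the signs are right
            lo -= 2.0
        while g(hi) > 0:
            hi += 2.0
        return optimize.brentq(g, lo, hi)

    return mu_a(alpha / 2), mu_a(1 - alpha / 2)

lo, hi = interval(z=3.0, T=[(1.0, np.inf)])   # truncation set bounded from below
bound = (norm.ppf(0.975) - norm.ppf(0.025)) * np.sqrt(1 + 1.0 / 1.0)
print(f"interval length {hi - lo:.3f} < Theorem 3.1 bound {bound:.3f}")
```

With σ² = τ² = 1, the computed length stays below σ(Φ⁻¹(α_2) − Φ⁻¹(α_1))√(1 + σ²/τ²), in line with Theorem 3.1.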
Theorem 3.1. For every z ∈ R, the length of the interval satisfies µ_{α_2}(z) − µ_{α_1}(z) < σ(Φ⁻¹(α_2) − Φ⁻¹(α_1))√(1 + σ²/τ²). The upper bound in Theorem 3.1 is easy to compute, does not depend on the truncation set T, and increases as the amount of randomization τ² decreases. As τ² goes to zero, the upper bound diverges to infinity, in accordance with Kivaranovic and Leeb (2021). On the other hand, as τ² goes to ∞, the upper bound converges to σ(Φ⁻¹(α_2) − Φ⁻¹(α_1)). Also note that the upper bound is sharp if T is bounded from above or from below, i.e., in the case where confidence intervals based on Z | Z ∈ T have infinite expected length. Finally, as detailed in Remark 3.2 below, the upper bound can also be used in the unknown-variance case, i.e., if σ² or τ² or both are replaced by estimators.
In Figure 1, we plot the length of [µ_{α_1}(z), µ_{α_2}(z)] as a function of z for several truncation sets T. In the left panel the truncation set is of the form (−a, a) (bounded), and in the right panel the truncation set is of the form (−∞, −a) ∪ (a, ∞) (unbounded with a gap in the middle). The top dashed line denotes the upper bound σ(Φ⁻¹(α_2) − Φ⁻¹(α_1))√(1 + σ²/τ²); the bottom dashed line denotes σ(Φ⁻¹(α_2) − Φ⁻¹(α_1)), the length of the confidence interval with unconditional coverage. In the left panel, we see that µ_{α_2}(z) − µ_{α_1}(z) approaches the upper bound as |z| diverges. The smaller a, i.e., the smaller the truncation set T, the faster the convergence. Also, in this case where the truncation set is a bounded interval, the left panel indicates that the length is bounded from below by σ(Φ⁻¹(α_2) − Φ⁻¹(α_1)). In the right panel, we see that our upper bound is not sharp when the truncation set is unbounded on both sides. It appears that the length converges to σ(Φ⁻¹(α_2) − Φ⁻¹(α_1)) as z diverges. However, as the gap in the truncation set becomes larger (i.e., as a grows), we see that the length approaches the upper bound for values around a and −a.
The quantity σ(Φ⁻¹(α_2) − Φ⁻¹(α_1)) is not a lower bound in this case, as we can see that the length is considerably smaller for values in the gap (−a, a). It seems that the length converges to zero for values of z around 0 as a diverges. In Figure 2, we plot Monte-Carlo approximations of the conditional expected length of [µ_{α_1}(Z), µ_{α_2}(Z)] given Z + R ∈ T as a function of µ, for the same scenarios as considered in Figure 1. For the Monte-Carlo simulations, we draw 2000 independent samples from the distribution in (3), compute the confidence interval for each, and estimate the conditional expected length by the sample mean of the lengths. In the left panel, we observe that the conditional expected length is minimized at µ = 0 and converges to the upper bound as µ diverges. We also note that the smaller the truncation set, the larger the conditional expected length. This means that the conditional expected length is close to the upper bound if the probability of the conditioning event is small. In the right panel, we again observe that the conditional expected length is minimized at µ = 0. However, it seems to converge not to the upper bound but to σ(Φ⁻¹(α_2) − Φ⁻¹(α_1)) as µ diverges. Surprisingly, the conditional expected length at µ = 0 decreases as a increases and becomes significantly smaller than σ(Φ⁻¹(α_2) − Φ⁻¹(α_1)). In particular, and in contrast to the left panel, we here observe that, for µ close to zero, the conditional expected length decreases as the probability of the conditioning event decreases; for µ not close to zero, the situation is again as in the left panel.
Remark 3.2 (The unknown-variance case). In the discussion so far, the variances σ² and τ² were assumed to be known. Assume now that one or both of these variances are unknown, and that variance estimators σ̂² and τ̂² are available that take values in (0, ∞); if one of the variances is known, set the corresponding estimator equal to its value. We note that the truncation set T as in (4) may depend on the (estimated) variances, and we stress this dependence here by denoting it by T̂. A natural way to obtain a confidence interval in the unknown-variance case is to proceed as before, using the variance estimators as plug-ins. In particular, following the construction leading up to Theorem 3.1, with σ², τ² and T replaced by σ̂², τ̂² and T̂, respectively, we obtain a confidence interval for µ that we denote by [μ̂_{α_1}(Z), μ̂_{α_2}(Z)]. For this interval, a relation like (6) typically does not hold, and its conditional coverage probability depends on the estimators σ̂² and τ̂²; this topic is further discussed in Section 8.1 of Lee et al. (2016). However, Theorem 3.1 can still be used to obtain an upper bound on the length of this interval: Using the theorem with σ², τ² and T replaced by σ̂², τ̂² and T̂, respectively, we see that μ̂_{α_2}(Z) − μ̂_{α_1}(Z) is smaller than σ̂(Φ⁻¹(α_2) − Φ⁻¹(α_1))√(1 + σ̂²/τ̂²). This is because the theorem only requires that (5) holds with α replaced by α_i, i = 1, 2.

Application to selective inference
Throughout this section, we consider the generic sample mean setting that lies at the heart of the polyhedral method and of many procedures derived from it. To use these methods with specific model selectors, one essentially has to reduce the specific situation at hand to the generic setting considered here. This is demonstrated by the examples in Section 5; further examples can be obtained from the papers on selective inference mentioned in Section 1. For the sake of exposition, we focus here on the known-variance case. As outlined in Remark 3.2, our results can also be applied in situations where variances are estimated, mutatis mutandis.
Let n ∈ N and let Z_1, …, Z_n be i.i.d. normal random variables with mean µ ∈ R and variance σ² ∈ (0, ∞). The outcome of a model-selection procedure can often be characterized through an event of the form Z̄_n ∈ T, where Z̄_n denotes the sample mean and T is as in (4). The polyhedral method provides a confidence interval for µ with pre-specified coverage probability conditional on the event Z̄_n ∈ T, based on the conditional distribution of Z̄_n | Z̄_n ∈ T. The (conditional) expected length of this interval is infinite if and only if T is bounded from above or from below; cf. Kivaranovic and Leeb (2021).

Data carving
Data carving (Fithian et al., 2017) means that only a subset of the data is used for model selection, while the entire dataset is used for inference based on the selected model. Let δ ∈ (0, 1) be such that δn is a positive integer. If only the first δn observations are used for selection, the outcome of a model-selection procedure can often be characterized through an event of the form Z̄_{δn} ∈ T; here Z̄_{δn} is the sample mean of the first δn observations and the truncation set T is as in (4). (Of course, the truncation sets used by the plain polyhedral method and by the polyhedral method with data carving might differ.) Inference for µ is now based on the conditional distribution Z̄_n | Z̄_{δn} ∈ T. (7) In the preceding display, the conditioning variable Z̄_{δn} can be written as Z̄_{δn} = Z̄_n + R for R = Z̄_{δn} − Z̄_n. Using elementary properties of the normal distribution, it is easy to see that Z̄_n and R are independent. We thus obtain the following: Proposition 4.1. The conditional c.d.f. of the random variable in (7) is equal to F_{µ,σ̃²}^{τ̃²} with σ̃² = σ²/n and τ̃² = σ²(1 − δ)/(δn). Let μ̂_α(z) satisfy F_{μ̂_α(z),σ̃²}^{τ̃²}(z) = 1 − α, where σ̃² and τ̃² are as in the proposition. Then [μ̂_{α_1}(Z̄_n), μ̂_{α_2}(Z̄_n)] is a confidence interval for µ with conditional coverage probability P(μ̂_{α_1}(Z̄_n) ≤ µ ≤ μ̂_{α_2}(Z̄_n) | Z̄_{δn} ∈ T) = α_2 − α_1. By Theorem 3.1, its length is smaller than σ̃(Φ⁻¹(α_2) − Φ⁻¹(α_1))√(1 + σ̃²/τ̃²) = σ(Φ⁻¹(α_2) − Φ⁻¹(α_1))/√((1 − δ)n). We see that the length of [μ̂_{α_1}(Z̄_n), μ̂_{α_2}(Z̄_n)] shrinks at the same √n-rate as in the unconditional case. The price of conditioning is at most the factor 1/√(1 − δ). Note that, for δ = 1, data carving reduces to the polyhedral method.
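A quick simulation check of the decomposition above (sample sizes and parameters are illustrative choices of ours): R = Z̄_{δn} − Z̄_n is uncorrelated with Z̄_n, its variance matches τ̃² = σ²(1 − δ)/(δn), and the resulting price of conditioning √(1 + σ̃²/τ̃²) equals 1/√(1 − δ).

```python
import numpy as np

rng = np.random.default_rng(1)
n, delta, sigma = 100, 0.5, 2.0
m = int(delta * n)
Z = rng.normal(0.0, sigma, size=(50_000, n))
Zbar_n = Z.mean(axis=1)           # used for inference
Zbar_dn = Z[:, :m].mean(axis=1)   # used for selection
R = Zbar_dn - Zbar_n

sigma2_t = sigma**2 / n                         # variance of Zbar_n
tau2_t = sigma**2 * (1 - delta) / (delta * n)   # variance of R (Proposition 4.1)
print("corr(Zbar_n, R) ≈", np.corrcoef(Zbar_n, R)[0, 1])   # ≈ 0 (independence)
print("Var(R) ≈", R.var(), " theory:", tau2_t)
print("sqrt(1 + sigma2/tau2) =", np.sqrt(1 + sigma2_t / tau2_t),
      " 1/sqrt(1-delta) =", 1 / np.sqrt(1 - delta))
```

For δ = 1/2 both factors equal √2, i.e., carving costs at most a √2 inflation relative to the unconditional interval.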
A corresponding sample-splitting method is the following: Again, the model is selected based on the first δn observations, resulting in the same event Z̄_{δn} ∈ T. For subsequent inference, however, only the last (1 − δ)n observations are used. Because these are independent of Z̄_{δn}, one obtains the standard confidence interval for µ based on the last (1 − δ)n observations, whose length is σ(Φ⁻¹(α_2) − Φ⁻¹(α_1))/√((1 − δ)n). By the inequality in the preceding display, this sample-splitting interval is always strictly longer than the interval obtained with data carving.

Selection with a randomized response
This method of Tian and Taylor (2018) performs model selection with a randomized version of the data (i.e., after adding noise), while inference based on the selected model is performed with the original data. Let ω ∼ N(0, τ²I_n) be a noise vector independent of Z_1, …, Z_n and write ω̄_n for the mean of its components. If, in the model-selection step, the randomized data Z_1 + ω_1, …, Z_n + ω_n are used instead of the original data, then the outcome of the model-selection step can often be characterized through an event of the form Z̄_n + ω̄_n ∈ T, where T again is a truncation set as in Section 3 (possibly different from the truncation sets used by the plain polyhedral method or by the polyhedral method with data carving). Here, inference for µ is based on the conditional distribution Z̄_n | Z̄_n + ω̄_n ∈ T, (8) which is easy to compute.
Proposition 4.2. The conditional c.d.f. of the random variable in (8) is equal to F_{µ,σ̃²}^{τ̃²} with σ̃² = σ²/n and τ̃² = τ²/n. Let μ̂_α(z) satisfy F_{μ̂_α(z),σ̃²}^{τ̃²}(z) = 1 − α, where σ̃² and τ̃² are as in the proposition, so that [μ̂_{α_1}(Z̄_n), μ̂_{α_2}(Z̄_n)] is a confidence interval for µ with conditional coverage probability P(μ̂_{α_1}(Z̄_n) ≤ µ ≤ μ̂_{α_2}(Z̄_n) | Z̄_n + ω̄_n ∈ T) = α_2 − α_1. Similarly to data carving, the length of the interval shrinks at the same √n-rate as in the unconditional case, where the price of conditioning is controlled by the factor √(1 + σ²/τ²), and the method reduces to the polyhedral method if τ = 0.
To obtain a sample-splitting method that is comparable to selection with a randomized response, we proceed as follows: We use the first nσ²/(σ² + τ²) observations for model selection and the remaining m = nτ²/(σ² + τ²) observations for inference (assuming, for simplicity, that these numbers are positive integers). Write Z̄_m for the mean of the last m observations. Because the first n − m observations are independent of Z̄_m, we thus obtain the standard confidence interval for µ based on Z̄_m, with length σ(Φ⁻¹(α_2) − Φ⁻¹(α_1))/√m. In terms of length, this interval is always dominated by the interval [μ̂_{α_1}(Z̄_n), μ̂_{α_2}(Z̄_n)] considered above.
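The domination can be made explicit with a short computation, using only the definition of m and the notation above:

```latex
\frac{\sigma}{\sqrt{m}}\bigl(\Phi^{-1}(\alpha_2)-\Phi^{-1}(\alpha_1)\bigr)
  =\frac{\sigma\bigl(\Phi^{-1}(\alpha_2)-\Phi^{-1}(\alpha_1)\bigr)}{\sqrt{n\tau^2/(\sigma^2+\tau^2)}}
  =\frac{\sigma}{\sqrt{n}}\bigl(\Phi^{-1}(\alpha_2)-\Phi^{-1}(\alpha_1)\bigr)\sqrt{1+\sigma^2/\tau^2}.
```

The right-hand side is exactly the Theorem 3.1 bound for the randomized-response interval (with σ̃² = σ²/n and τ̃² = τ²/n), and the length of that interval lies strictly below its bound; hence the randomized-response interval dominates the sample-splitting interval.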
For data carving, choosing a corresponding sample-splitting method was obvious; cf. Subsection 4.1. This is not the case in the setting considered here. Nevertheless, the considerations in the preceding paragraph show that selection with a randomized response dominates any sample-splitting scheme that uses at most m = nτ²/(σ² + τ²) observations for inference and the rest for selection.
Remark 4.3. Throughout this section, we have considered Gaussian data. In non-Gaussian settings, our results can be applied asymptotically, provided that, in (7) or (8), (i) the estimator used in the inference step, i.e., Z̄_n, as well as the random variables in the conditioning event, are asymptotically jointly normal and (ii) the probability of the conditioning event does not vanish.

Sequential testing with data carving
Consider a situation where an experiment Z is to be repeated independently n times in order to determine the mean µ of Z. However, the whole process is to be stopped at an earlier stage if results do not look promising, e.g., in a simple clinical trial. In particular, the process is to be stopped if the mean Z̄_{δn} of the first δn repetitions fails to exceed a certain threshold c (assuming that δ is a pre-determined fraction so that δn is an integer less than n). In this situation, a confidence interval for µ is desired conditional on the event that Z̄_{δn} > c, that is, on the event that the process was not stopped early. If Z is assumed to be Gaussian with mean µ and variance σ², this situation can be handled using the results of Section 4.1: Set T = (c, ∞). Based on the conditional distribution of Z̄_n given Z̄_{δn} ∈ T, one obtains an equal-tailed confidence interval for µ with coverage probability 1 − α whose length is less than 2σΦ⁻¹(1 − α/2)/√((1 − δ)n). In particular, the early stopping rule, i.e., conditioning on Z̄_{δn} ∈ T, results in intervals that are longer than the standard interval (that is constructed without early stopping, i.e., without conditioning) by a factor of less than 1/√(1 − δ). In this example, we did not consider controlling for explanatory variables, for simplicity; allowing for additional explanatory variables to influence the experimental outcome is more complex and will be studied elsewhere. We also note that, in the setting of the present example, data carving controls a conditional probability that is not commonly considered in group sequential testing. For the substantial body of literature in that area, we refer to Jennison and Turnbull (2000).
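A small simulation illustrates why the conditional interval is needed in this setting (the values of n, δ, c and µ below are hypothetical, chosen for illustration): conditioning on the trial continuing, i.e., on Z̄_{δn} > c, biases Z̄_n upward, so the naive unconditional interval Z̄_n ± 1.96σ/√n loses conditional coverage.

```python
import numpy as np

rng = np.random.default_rng(2)
n, delta, sigma, c, mu = 100, 0.5, 1.0, 0.2, 0.0
m = int(delta * n)
reps = 400_000
# simulate the first-stage mean and the mean of the remaining observations
Zbar_first = rng.normal(mu, sigma / np.sqrt(m), reps)      # selection stage
Zbar_rest = rng.normal(mu, sigma / np.sqrt(n - m), reps)
Zbar_n = (m * Zbar_first + (n - m) * Zbar_rest) / n        # full-sample mean
sel = Zbar_first > c                                       # not stopped early
half = 1.96 * sigma / np.sqrt(n)
coverage = np.mean(np.abs(Zbar_n[sel] - mu) <= half)
print("P(not stopped early) ≈", sel.mean())
print("naive conditional coverage ≈", coverage, "(nominal 0.95)")
```

The carving interval of Section 4.1 repairs this at a cost of at most the factor 1/√(1 − δ) in length.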

Lasso selection with a randomized response
To complete the discussion of the randomized Lasso from Section 2, we use the assumptions and the notation maintained there. Consider a model m and a sign-vector s such that the corresponding selection event has positive probability, and recall that T = T_{m,s}(w) is an interval. Choose α_i, i = 1, 2, satisfying 0 < α_1 < α_2 < 1, and choose μ̂_{α_i}(z) so that F_{μ̂_{α_i}(z),σ²}^{τ²}(z) = 1 − α_i, i = 1, 2, where the c.d.f. is computed with the truncation set T_{m,s}(w) replacing T. Clearly, μ̂_{α_i}(z) depends on z, on m and s, and on w (through the truncation set). Set Ľ_{m,s,w}(z) = μ̂_{α_1}(z) and Ǔ_{m,s,w}(z) = μ̂_{α_2}(z). Then the conditional coverage statement holds by construction, and the corresponding bound on the interval's length follows from Theorem 3.1. As before, these relations continue to hold if the conditioning on (I_n − P_{η_m})(Y + ω) = w is dropped in the coverage statement and if w is replaced by W = (I_n − P_{η_m})(Y + ω) in both.

Discussion
We have shown that the length of certain confidence intervals with conditional coverage guarantees can be drastically shortened by adding some noise to the data in the model-selection step. Examples include data carving and selection with a randomized response, both combined with the polyhedral method. Our findings clearly support the observations of Fithian et al. (2017) and Tian and Taylor (2018) that sacrificing some power in the model-selection step results in a significant increase in power in subsequent inferences. Selection and inference on the same data is not favorable in the case where the events describing the outcome of the selection step correspond to bounded regions in sample space (in our case, the truncation set T), because then the resulting confidence set has infinite expected length; cf. Kivaranovic and Leeb (2021). There are, however, situations where this case cannot occur: For example, Heller et al. (2019) study a situation where first a global hypothesis is tested against a two-sided alternative and subsequent tests are only performed if the global hypothesis is rejected. There, bounded selection regions do not arise and excessively long intervals are not an issue. Hence, we recommend caution in the choice of selection procedure. In some situations, adding noise in the selection step (e.g., through data carving or randomized selection) may be beneficial; in other situations, it may not be necessary.
Appendix A Proof of Theorem 3.1

We first provide some intuition behind the theorem. Second, we state Propositions A.1 and A.2, the two core results on which the proof of Theorem 3.1 relies. Finally, we prove Theorem 3.1 with the help of these two propositions. In Sections A.1 and A.2 we then prove Propositions A.1 and A.2, respectively. The first of these two propositions is considerably more difficult to prove. In Section A.3 we collect several auxiliary results that are required for the proofs of the main results. Throughout this section, fix σ² and τ² in (0, ∞), and simplify notation by setting F_µ(z) = F_{µ,σ²}^{τ²}(z) and f_µ(z) = f_{µ,σ²}^{τ²}(z), where F_{µ,σ²}^{τ²}(z) and f_{µ,σ²}^{τ²}(z) denote the conditional c.d.f. and the conditional probability density function (p.d.f.), respectively, of the random variable in (3). Set ρ² = τ²/(σ² + τ²), and recall that µ_α(z) is defined by (5). Denote by ϕ(t) and Φ(t) the p.d.f. and c.d.f. of the standard normal distribution, with the usual conventions Φ⁻¹(0) = −∞ and Φ⁻¹(1) = ∞. Observe that the random vector (Z, Z + R)′ has a two-dimensional normal distribution with mean (µ, µ)′, variances (σ², σ² + τ²)′ and covariance σ². It is elementary to verify that, for any v ∈ R, the conditional distribution of Z given Z + R = v is N(µ + (1 − ρ²)(v − µ), σ²ρ²). Let G_µ(z, v) denote the c.d.f.
of this normal distribution, i.e., G_µ(z, v) = P(Z ≤ z | Z + R = v). By definition of F_µ(z), we have F_µ(z) = E[G_µ(z, V_µ)], where V_µ is a random variable that is truncated-normally distributed with mean µ, variance σ² + τ² and truncation set T. Assume, for this paragraph, that T is the singleton set T = {v} for some fixed v ∈ R (singleton truncation sets are excluded by our definition of T in (4)). Then the c.d.f.s G_µ(z, v) and F_µ(z) coincide, and it is elementary to verify that the length of [µ_{α_1}(z), µ_{α_2}(z)] is equal to (σ/ρ)(Φ⁻¹(α_2) − Φ⁻¹(α_1)), which is exactly the upper bound in Theorem 3.1. The theorem thus implies that confidence intervals only become shorter if one conditions on a set T with positive Lebesgue measure instead of a singleton. On the other hand, if T is equal to R, it is clear that the length of [µ_{α_1}(z), µ_{α_2}(z)] is equal to σ(Φ⁻¹(α_2) − Φ⁻¹(α_1)). Surprisingly, this latter quantity is not necessarily a lower bound for the length of [µ_{α_1}(z), µ_{α_2}(z)] if T is a proper subset of R; cf. the right panel of Figure 1, or Figure 1 of Kivaranovic and Leeb (2021) for the case where τ equals 0.
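For completeness, the elementary verification in the singleton case can be spelled out. From the conditional distribution of Z given Z + R = v,

```latex
Z \mid Z+R=v \;\sim\; N\bigl(\mu+(1-\rho^2)(v-\mu),\,\sigma^2\rho^2\bigr),
\qquad
F_\mu(z)=\Phi\Bigl(\frac{z-\mu-(1-\rho^2)(v-\mu)}{\sigma\rho}\Bigr),
```

so solving F_{µ_α(z)}(z) = 1 − α gives µ_α(z) = ρ⁻²(z − (1 − ρ²)v − σρΦ⁻¹(1 − α)), and therefore µ_{α_2}(z) − µ_{α_1}(z) = (σρ/ρ²)(Φ⁻¹(1 − α_1) − Φ⁻¹(1 − α_2)) = (σ/ρ)(Φ⁻¹(α_2) − Φ⁻¹(α_1)), using the symmetry Φ⁻¹(1 − α) = −Φ⁻¹(α).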
This proposition entails that G_{µ_α(z)}(z, b_k) converges to F_{µ_α(z)}(z) as z → ∞ if the truncation set T is bounded from above, and the same is true with a_1 replacing b_k and as z → −∞ if T is bounded from below. We continue now with the proof of Theorem 3.1.
Because Proposition A.1 holds for any z and µ, the corresponding inequality holds for any c ∈ (0, ∞). We plug c = σ(Φ⁻¹(α_2) − Φ⁻¹(α_1))/ρ into the inequality, apply the strictly increasing function Φ to both sides, and use the symmetry Φ⁻¹(1 − α) = −Φ⁻¹(α). Because F_µ(z) is strictly decreasing in µ by Lemma A.6 and µ_{α_2}(z) satisfies the equation F_{µ_{α_2}(z)}(z) = 1 − α_2, the previous inequality yields the required bound on µ_{α_2}(z). Subtracting µ_{α_1}(z) on both sides gives the inequality of the theorem. It remains to show that this upper bound is tight if the truncation set T is bounded from above or from below. We only consider the case sup T = b_k < ∞ here, because the case inf T = a_1 > −∞ can be treated by similar arguments, mutatis mutandis. In view of the definition of G_µ(z, v) in (10), Proposition A.2 and the symmetry of the normal distribution yield the two required limits. Subtracting the second limit from the first and multiplying by σ/ρ shows that the interval's length approaches the upper bound. Hence the upper bound is tight.

A.1 Proof of Proposition A.1
The proof of the proposition is split into a sequence of lemmas, which are proven directly here.
In view of (10)–(11), Leibniz's rule yields an expression for the p.d.f. f_µ(z). Observe now that it is sufficient to show that the function ϕ(Φ⁻¹(α)) is strictly concave because, by Jensen's inequality, it then follows that f_µ(z) is bounded from above by the required quantity, completing the proof.
Elementary calculus shows that .
Note that inequality (13) of this lemma resembles inequality (12) of Proposition A.1. While inequality (13) is surprisingly easy to prove, inequality (12) is more difficult. Equation (11) provides intuition for why this is the case: The distribution of the random variable V_µ does not depend on z, but it does depend on µ. Hence, to prove inequality (12), we cannot exchange integration and differentiation, and we cannot apply Jensen's inequality as we did in the proof of Lemma A.3. We did not find a direct proof of inequality (12). However, in the following we show that inequality (13) in fact implies (12).
To show this implication, we need a more explicit representation of f_µ(z). Elementary calculus and properties of the conditional normal distribution imply that the conditional p.d.f. f_µ(z) can also be written as in (14). Lemma A.4. Let G_µ(z, v) be defined as in (10). For all z ∈ R and all µ ∈ R, the stated identity holds, with the function h(v) as indicated. Proof. The chain rule gives the first decomposition. The inverse function theorem implies that the first derivative on the r.h.s. is equal to 1/ϕ(Φ⁻¹(F_µ(z))). Therefore, it remains to show that ∂F_µ(z)/∂µ is equal to the numerator on the r.h.s. of the equation of the lemma. Leibniz's rule implies that ∂F_µ(z)/∂µ = ∫_{−∞}^{z} ∂f_µ(u)/∂µ du. Therefore, we compute ∂f_µ(z)/∂µ first; Lemma A.7 gives an expression for this derivative. We use the expression on the r.h.s. to compute ∫_{−∞}^{z} ∂f_µ(u)/∂µ du. Lemma A.8 gives the integral of the first summand. Because, in the second-to-last display, the second summand on the r.h.s. depends on z only through f_µ(z), it is easy to compute the integral of this function as well. The sum of the last two expressions is equal to the numerator on the r.h.s. of the equation of the lemma, which completes the proof. This lemma implies that proving Proposition A.1 is equivalent to showing that B_µ(z) < 0 for all z ∈ R and µ ∈ R. An elementary equation chain shows that B_µ(z) converges to 0 as |z| → ∞. Holding µ, σ² and τ² fixed, this is the same as saying that B_µ(z) converges to 0 as F_µ(z) goes to 0 or 1. Let the function F_µ⁻¹(α) be defined by the equation F_µ(F_µ⁻¹(α)) = α. Clearly, F_µ⁻¹(α) is well-defined for all α ∈ (0, 1), and we have that F_µ⁻¹(F_µ(z)) = z for all z ∈ R. To prove Proposition A.1, it is now sufficient to show that B_µ(F_µ⁻¹(α)) is strictly convex as a function of α for any fixed µ, σ² and τ² (in view of the second-to-last display).
Proof. We start by computing the first derivative of B_µ(F_µ⁻¹(α)) with respect to α. The chain rule gives the first decomposition. The inverse function theorem implies that the second derivative on the r.h.s. is equal to 1/f_µ(F_µ⁻¹(α)). In view of the definitions of B_µ(z) and F_µ⁻¹(α) in (15) and (16), we see that to compute the first derivative on the r.h.s. we need to compute the derivatives of α, ϕ(Φ⁻¹(α)), f_µ(F_µ⁻¹(α)) and G_µ(F_µ⁻¹(α), v) with respect to F_µ⁻¹(α). Because ∂F_µ⁻¹(α)/∂α is equal to 1/f_µ(F_µ⁻¹(α)), it follows that the derivative of α with respect to F_µ⁻¹(α) is equal to f_µ(F_µ⁻¹(α)). The chain rule gives the derivative of ϕ(Φ⁻¹(α)); the derivative of f_µ(F_µ⁻¹(α)) involves the function l(z, v) defined in Lemma A.9. And finally, Lemma A.10 implies that the derivative of G_µ(F_µ⁻¹(α), v) with respect to F_µ⁻¹(α) involves h(v) as defined in Lemma A.4 and l(z, v) as defined in Lemma A.9. The previous four derivatives and the definition of B_µ(z) in (15) entail, after straightforward simplifications, the formula for the first derivative, and therefore for the second derivative. Now the claimed convexity is easy to see.
The claim of the lemma follows by evaluating the second derivative at α = F_µ(z).
To prove Proposition A.1, it remains to verify one further inequality, which is equivalent to a second one whose l.h.s. is equal to ∂Φ⁻¹(F_µ(z))/∂z; in Lemma A.3 we have shown that this inequality is true. This completes the proof of Proposition A.1.

A.2 Proof of Proposition A.2
Let ϵ > 0. We will only consider the case where sup T = b_k < ∞; the other case follows by similar arguments, mutatis mutandis. Recall the definition of F_µ(z) in (11). By the law of total probability, we can write F_µ(z) as a weighted average of two conditional expectations. Note that both conditional expectations are bounded by 1. Because the random variable V_µ (defined below equation (11)) is a truncated normal with mean µ, variance σ² + τ² and truncation set T, it follows that V_µ converges in probability to b_k as µ goes to ∞. Now, as z goes to ∞, it follows that µ_α(z) goes to ∞ (cf. the discussion after the proof of Lemma A.6). This yields the desired limit as z → ∞. Because ϵ was arbitrary and G_µ(z, v) is simply a normal c.d.f., which is uniformly continuous, the claim of the proposition follows.

A.3 Auxiliary results
Lemma A.6. For every z ∈ R, F_µ(z) is continuous and strictly decreasing in µ and satisfies lim_{µ→−∞} F_µ(z) = 1 and lim_{µ→∞} F_µ(z) = 0. Proof. Continuity is obvious. For monotonicity, it is sufficient to show that f_µ(z) has a monotone likelihood ratio, because Lee et al. (2016) already showed that a monotone likelihood ratio implies monotonicity. This means that, for µ_1 < µ_2, we need to show that f_{µ_2}(z)/f_{µ_1}(z) is strictly increasing in z. In view of the definition of f_µ(z) in (14), it is easy to see that f_{µ_2}(z)/f_{µ_1}(z) can be written as c exp((µ_2 − µ_1)z/σ²), where c is a positive constant that does not depend on z. Because µ_1 < µ_2 and the exponential function is strictly increasing, it follows that f_µ(z) has a monotone likelihood ratio. Finally, we show that lim_{µ→∞} F_µ(z) = 0; the other part of this equation follows by similar arguments. Let M < b_k. Recall the definition of F_µ(z) in (11). By the law of total probability, we can write F_µ(z) as a weighted average of two conditional expectations, both of which are bounded by 1. Because the random variable V_µ (defined below equation (11)) is a truncated normal with mean µ, variance σ² + τ² and truncation set T, it follows that lim_{µ→∞} P(V_µ ≤ M) = 0. Because G_µ(z, v) is strictly decreasing in v, it follows that the conditional expectation on the r.h.s. is bounded from above by G_µ(z, M). But this means that lim_{µ→∞} F_µ(z) is bounded by lim_{µ→∞} G_µ(z, M). Because the latter limit is equal to 0, the same follows for the former limit.
Lemma A.7. Let the function h(v) be defined as in Lemma A.4. For all x ∈ R and all µ ∈ R, we have

Proof. In view of the definition of f_µ(z) in (14), the chain rule and the product rule imply that the derivative of f_µ(z) with respect to µ is equal to

In view of the definition of f_µ(z), it is easy to see that the first summand is equal to

and, in view of the definitions of f_µ(z) and h(v), that the second summand is equal to

Lemma A.8. Let G_µ(z, v) be defined as in (10) and h(v) as in Lemma A.4. For all x ∈ R and all µ ∈ R, we have

Proof. By the definition of f_µ(z) in (14), the integral can be written as

Note that all integrands in the numerator are of the same form; they differ only in the constants a_1, b_1, …, a_k, b_k. This means we can apply Equation 10,011.1 of Owen (1980) to each integral. This equation implies that the numerator is equal to

(This equation can also easily be verified by differentiating the antiderivative.) In view of the definitions of f_µ(z), h(v) and G_µ(z, v), we see that the claim of the lemma is true.
Lemma A.9. For all z ∈ R and all µ ∈ R, we have

where

Proof. In view of the definition of f_µ(z) in (14), the chain rule and the product rule imply that the derivative of f_µ(z) with respect to z is equal to

It is easy to see that the first summand is equal to

and, in view of the definitions of f_µ(z) and l(z, v), that the second summand is equal to

Lemma A.10. Let h(v) be defined as in Lemma A.4 and l(z, v) as in Lemma A.9. For all z ∈ R and all µ ∈ R, we have

Proof. By the definition of G_µ(z, v) in (10), we have

We claim that

To see this, note that we can write the l.h.s. as c_1 exp(−d_1/2) and the r.h.s. as c_2 exp(−d_2/2), where

Hence the claimed equation is true. Observe that the r.h.s. of that equation can be written as

In view of the definitions of f_µ(z), l(z, v) and h(v), it is easy to see that the previous expression is equal to f_µ(z) l(z, v)/h(v). Hence the derivative of G_µ(z, v) with respect to z is of the claimed form.
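The completing-the-square step in the proof of Lemma A.10 (writing both sides as c·exp(−d/2) and matching the constants) is of the same kind as the classical Gaussian product identity, N(z; v, τ²)·N(v; µ, σ²) = N(z; µ, σ² + τ²)·N(v; m(z), s²) with m(z) = (τ²µ + σ²z)/(σ² + τ²) and s² = σ²τ²/(σ² + τ²). The numeric check below verifies that textbook identity for illustration; it is not the paper's exact c_1, d_1 computation, and the parameter values are arbitrary:

```python
import math

def npdf(x, mean, var):
    # normal density with the given mean and variance
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

mu, sig2, tau2 = 0.7, 1.3, 0.5  # illustrative values
errors = []
for z in (-1.0, 0.2, 2.5):
    for v in (-0.4, 1.1):
        lhs = npdf(z, v, tau2) * npdf(v, mu, sig2)
        m = (tau2 * mu + sig2 * z) / (sig2 + tau2)   # conditional mean of v given z
        s2 = sig2 * tau2 / (sig2 + tau2)             # conditional variance
        rhs = npdf(z, mu, sig2 + tau2) * npdf(v, m, s2)
        errors.append(abs(lhs - rhs))
max_error = max(errors)
```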
Since individual components of β̂(Y) are zero with positive probability, the non-zero coefficients of β̂(Y) can be viewed as the model m̂(Y) selected by the Lasso. More formally, for each y ∈ R^n, let m̂(y) ⊆ {1, …, d} and ŝ(y) ∈ {−1, 1}^{|m̂(y)|} denote the set of indices and the vector of signs, respectively, of the non-zero components of β̂(y) (in case m̂(y) = ∅, ŝ(y) is left undefined). Conditional on events like {m̂(Y) = m} or {m̂(Y) = m, ŝ(Y) = s}, the polyhedral method provides confidence intervals for linear contrasts of the form η′_m θ with pre-specified coverage probability, for a given d-vector η_m ≠ 0.¹ These intervals are based on the (conditional) distribution of η′_m Y, which is the obvious (unconditionally unbiased) estimator for η′_m θ. Consider a model m ≠ ∅ and a sign-vector s ∈ {−1, 1}^{|m|} such that P(m̂(Y) = m, ŝ(Y) = s) > 0. Lee et al. (2016) show that the event {m̂(Y) = m, ŝ(Y) = s} is a polyhedron in Y-space. Hence the distribution of Y given m̂(Y) = m and ŝ(Y) = s is a multivariate Gaussian restricted to said polyhedron. Now decompose Y into the sum of two independent components, where one depends only on η′_m Y: Y = P_{η_m} Y + (I_n − P_{η_m}) Y, with P_{η_m} denoting the orthogonal projection onto the span of η_m. Conditional on m̂(Y) = m, ŝ(Y) = s and (I_n − P_{η_m}) Y = w, we see that Y equals P_{η_m} Y + w and lies on the affine line {α η_m + w : α ∈ R} intersected with the polyhedron corresponding to {m̂(Y) = m, ŝ(Y) = s}. In particular, the distribution of η′_m Y conditional on m̂(Y) = m, ŝ(Y) = s and (I_n − P_{η_m}) Y = w is a truncated normal, similar to (3).

¹ The quantity of interest η′_m θ may depend on the selected model and is often of the form η

Figure 1: The length of [µ_{α_1}(z), µ_{α_2}(z)] plotted as a function of z. In the left panel the truncation sets are of the form (−a, a); in the right panel they are of the form (−∞, −a) ∪ (a, ∞). The different values of a are shown below the plot. The remaining parameters are α_1 = 1 − α_2 = 0.025 and σ² = τ² = 1.

Figure 2: The conditional expected length of [µ_{α_1}(Z), µ_{α_2}(Z)] plotted as a function of µ. In the left panel the truncation sets are of the form (−a, a); in the right panel they are of the form (−∞, −a) ∪ (a, ∞). The different values of a are shown below the plot. The remaining parameters are α_1 = 1 − α_2 = 0.025 and σ² = τ² = 1.
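Endpoints like µ_α(z) can be computed by solving F_µ(z) = α in µ, using the monotonicity from Lemma A.6. The sketch below does this by bisection for a simplified stand-in in which F_µ is the c.d.f. of N(µ, σ² + τ²) truncated to (−a, a), with a = 1 and σ² = τ² = 1 as in the figures; the paper's actual F_µ from (11) differs, so this only illustrates the mechanics:

```python
import math

def Phi(x):
    # standard normal c.d.f. via erfc
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def F(mu, z, a=1.0, s=math.sqrt(2.0)):
    """c.d.f. at z of N(mu, s^2) truncated to (-a, a); stand-in for F_mu(z)."""
    lo, hi = Phi((-a - mu) / s), Phi((a - mu) / s)
    zc = min(max(z, -a), a)
    return (Phi((zc - mu) / s) - lo) / (hi - lo)

def mu_alpha(z, alpha):
    """Solve F(mu, z) = alpha for mu by bisection; F is decreasing in mu."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if F(mid, z) > alpha:
            lo = mid   # F still too large: move mu up
        else:
            hi = mid
    return 0.5 * (lo + hi)

# with alpha_1 = 1 - alpha_2 = 0.025, the interval at z = 0 is symmetric about 0
upper = mu_alpha(0.0, 0.025)   # solves F_mu(0) = 0.025 (the larger endpoint)
lower = mu_alpha(0.0, 0.975)   # solves F_mu(0) = 0.975 (the smaller endpoint)
```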
dropped in the first display and if w is replaced by W = (I_n − P_{η_m})(Y + ω) in both. Now consider a model m with P(m̂(Y) = m) > 0. If m = ∅, then the event {m̂(Y) = m} is again a polyhedron in Y-space; cf. Lee et al. (2016). If m ≠ ∅, then the event {m̂(Y) = m} can be decomposed into the disjoint union of events of the form {m̂(Y) = m, ŝ(Y) = s}, each with positive probability; therefore, the event {m̂(Y) = m} is the disjoint union of finitely many polyhedra in Y-space. This entails that the conditional distribution of η′_m Y given m̂(Y + ω) = m and (I_n − P_{η_m})(Y + ω) = w is again of the form (3), with Z and R as in the preceding paragraph, where T = T_m(w) is now the union of finitely many intervals. Proceeding as in the preceding paragraph, we obtain a confidence interval [Ľ_{m,w}(η′_m Y), Ǔ_{m,w}(η′_m Y)] that satisfies

… f_µ(u) du = −f_µ(z) − Σ_{i=1}^{k} [h(b_i) G_µ(z, b_i) − h(a_i) G_µ(z, a_i)].
G_{µ_α(z)}(z, b_k). On the other hand, observe that the conditional expectation on the r.h.s. of the preceding display is bounded from above by G_{µ_α(z)}(z, b_k − ϵ). This means that lim_{z→∞} F_{µ_α(z)}(z) is bounded from above by lim_{z→∞} G_{µ_α(z)}(z, b_k − ϵ). Because G_µ(z, v) is decreasing in v and V_µ ≤ b_k, F_µ(z) is bounded from below by G_µ(z, b_k). But this also means that lim_{z→∞} F_{µ_α(z)}(z) is bounded from below by lim_{z→∞} G_{µ_α(z)}(z, b_k).