Discussion on the paper: Hypotheses testing by convex optimization by Goldenshluger, Juditsky and Nemirovski

We briefly discuss some interesting questions related to the paper "Hypotheses testing by convex optimization" by Goldenshluger, Juditsky and Nemirovski.

This is an exciting piece of work. I agree with the authors that developing computationally tractable methods for hypotheses testing is an important problem in statistics that has received little attention to date. In this discussion, I would like to emphasize three points of the paper under discussion that are of particular interest.

Connection with the statistical learning theory
The idea of convexification of the loss function in order to construct computationally tractable procedures has been widely used in statistical learning theory [Zhang, 2004]. In this part of the discussion, I would like to share some thoughts about the similarities between the two approaches.
To this end, let me briefly recall the principle of loss convexification in the problem of binary classification. One observes $n$ iid pairs $\{(X_i, Y_i)\}_{i=1,\dots,n}$ drawn from an unknown distribution $P$ on the product space $\mathcal X \times \mathcal Y$ with $\mathcal Y = \{-1, +1\}$, and the goal is to design a prediction rule $g : \mathcal X \to \mathcal Y$ with the smallest possible misclassification error rate

$$R_P(g) = P\big(Y \neq g(X)\big) = \mathbf E_P\big[\mathbb 1\big(-Y g(X) \ge 0\big)\big]. \tag{1}$$

The convexification is achieved in two steps. First, the classification risk is replaced by the $\phi$-risk

$$R_{\phi,P}(g) = \mathbf E_P\big[\phi\big(-Y g(X)\big)\big], \tag{2}$$

where $\phi : \mathbb R \to \mathbb R$ is a convex function often referred to as the convex surrogate loss. Second, the set of "pure" classification rules $g : \mathcal X \to \mathcal Y$ is extended to "generalized" rules $h : \mathcal X \to \mathbb R$, with the convention that the predictions furnished by $h$ and $\mathrm{sgn}(h)$ are the same. The $\phi$-risk is accordingly extended to all generalized prediction rules: $R_{\phi,P}(h) = \mathbf E_P[\phi(-Y h(X))]$. As a consequence of this construction, if $\mathcal H$ is a convex subset of the set of all measurable functions from $\mathcal X$ to $\mathbb R$, then the computation of the empirical risk minimizer (ERM)

$$\hat h_{n,\mathcal H}^{\phi} \in \arg\min_{h \in \mathcal H} \frac1n \sum_{i=1}^n \phi\big(-Y_i h(X_i)\big) \tag{3}$$

amounts to solving a convex program. The most common choices for the function $\phi$ are the hinge loss $\phi(u) = (1 + u)_+$, the exponential loss $\phi(u) = e^u$ and the logistic loss $\phi(u) = \log(1 + e^u)$.
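The two-step convexification can be illustrated on a toy problem. The sketch below is only an illustration under stated assumptions: one-dimensional features, the class of affine rules h(x) = w x + b, plain gradient descent as the solver, and synthetic data. The function names (`phi_risk`, `fit_logistic`, `make_toy_data`) are hypothetical and do not come from the paper. The losses follow the convention of the text, where the surrogate is evaluated at -y h(x).

```python
import math
import random

# Surrogate losses in the convention of the text: the loss is phi(-y * h(x)).
def hinge(u):
    return max(0.0, 1.0 + u)

def exponential(u):
    return math.exp(u)

def logistic(u):
    # log(1 + e^u), written so that it stays stable for large |u|
    return max(u, 0.0) + math.log1p(math.exp(-abs(u)))

def phi_risk(w, b, data, phi):
    """Empirical phi-risk of the affine rule h(x) = w * x + b."""
    return sum(phi(-y * (w * x + b)) for x, y in data) / len(data)

def error_rate(w, b, data):
    """Empirical misclassification rate of sgn(h); h(x) = 0 counts as an error."""
    return sum(1 for x, y in data if y * (w * x + b) <= 0) / len(data)

def fit_logistic(data, steps=500, lr=0.5):
    """Gradient descent on the empirical logistic risk, which is convex in (w, b)."""
    w = b = 0.0
    n = len(data)
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in data:
            u = -y * (w * x + b)
            slope = 1.0 / (1.0 + math.exp(-u))  # derivative of log(1 + e^u)
            gw += slope * (-y * x) / n
            gb += slope * (-y) / n
        w -= lr * gw
        b -= lr * gb
    return w, b

def make_toy_data(n=200, seed=0):
    """Hypothetical sample: y uniform on {-1, +1} and x ~ N(y, 1)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        y = rng.choice([-1, 1])
        data.append((rng.gauss(y, 1.0), y))
    return data
```

Calling `fit_logistic(make_toy_data())` returns the fitted pair (w, b). The rule h = 0 has logistic risk log 2, so comparing `phi_risk` of the fitted rule with log 2 gives a quick sanity check that the surrogate risk decreased.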
Let me turn now to the problem of testing two hypotheses $\Theta_0$ and $\Theta_1$ based on $n$ observations $X_1, \dots, X_n$ independently drawn from a distribution $P_{\theta^*}$ on $\mathcal X$. Let $\Theta = \Theta_0 \cup \Theta_1$ and let $s : \Theta \to \{\pm 1\}$ be the function that equals $-1$ on the set $\Theta_0$ and $+1$ on the set $\Theta_1$. The usual loss of a pure test $T : \mathcal X^n \to \{\pm 1\}$ associated with a sample $X^{(n)} = (X_1, \dots, X_n)$ is $\mathbb 1\big(-s(\theta^*)\, T(X^{(n)}) \ge 0\big)$, and the corresponding worst-case error rate is

$$\varepsilon(T) = \sup_{\theta \in \Theta} \mathbf E_\theta\big[\mathbb 1\big(-s(\theta)\, T(X^{(n)}) \ge 0\big)\big]. \tag{4}$$

Comparing (4) with (1), one can see some clear similarities between the problem of finding binary predictors $g$ minimizing the misclassification error rate and that of finding testing procedures $T$ minimizing the worst-case error rate $\varepsilon(T)$. In both problems the decision rules form a nonconvex set and the performance measure is defined as the expected loss for a nonconvex loss function (the Heaviside step function). However, there is an important difference: contrary to (1), the expectation on the right-hand side of (4) does not admit an empirical counterpart that is easily computable from the sample. Therefore, even if one applies the aforementioned two steps of convexification, this does not readily yield a test procedure computable by solving a convex program (in the spirit of (3)).
Elaborating on these ideas, one can define the following convexified strategy for testing the hypothesis $\Theta_0$ against $\Theta_1$. Given a convex subset $\mathcal H$ of the set of measurable functions from $\mathcal X^n$ to $\mathbb R$ and a convex loss $\phi : \mathbb R \to \mathbb R$, define

$$\hat h_{n,\mathcal H}^{\phi} \in \arg\min_{h \in \mathcal H}\ \sup_{\theta \in \Theta} G_\phi(h, \theta), \qquad G_\phi(h, \theta) = \mathbf E_\theta\big[\phi\big(-s(\theta)\, h(X^{(n)})\big)\big]. \tag{5}$$

In this "saddle-point" formulation, the outer minimization problem has the attractive property of being convex: it has a convex feasible set and a convex cost function. Unfortunately, in general, the inner maximization problem is not concave, and there is no particular reason to expect that it can be efficiently solved for any given $h$ when the dimensionality of $\theta$ is large. To circumvent this drawback, the authors had the ingenious idea to combine the following three facts:
• the saddle point of $G_\phi(h, \theta)$ coincides with the saddle point of $\log G_\phi(h, \theta)$,
• when the model $\{P_\theta : \theta \in \Theta\}$ belongs to an exponential family, it is natural to choose $\mathcal H$ as the linear span of the sufficient statistics: $\mathcal H_0 = \mathrm{Span}(S_j : j = 1, \dots, m)$,
• for some statistical models belonging to an exponential family, for every $h \in \mathcal H_0$, the mapping $\theta \mapsto \log G_{\exp}(h, \theta)$ is concave.

This leads to the test procedure

$$\hat h_{n,\mathcal H_0}^{\exp} \in \arg\min_{h \in \mathcal H_0}\ \sup_{\theta \in \Theta} \log G_{\exp}(h, \theta). \tag{6}$$

The final step of the construction aims at convexifying the feasible set of the inner maximization problem. In the case when $\Theta = \Theta_0 \cup \Theta_1$ with convex sets $\Theta_0$ and $\Theta_1$, this aim is achieved by replacing $\sup_{\theta \in \Theta} \log G_{\exp}(h, \theta)$ by the expression $\sup_{(\theta, \bar\theta) \in \Theta_0 \times \Theta_1} \big\{\log G_{\exp}(h, \theta) + \log G_{\exp}(h, \bar\theta)\big\}$, which does not impact the error of testing too much in view of the inequalities [...]. An important remark to be made here is that, in the case of the exponential loss $\phi$, taking the logarithm of $G_\phi$ does not break the convexity with respect to $h$. So, in this notation, the test proposed and studied by the authors is

$$\tilde h_{n,\mathcal H_0}^{\exp} \in \arg\min_{h \in \mathcal H_0}\ \sup_{(\theta, \bar\theta) \in \Theta_0 \times \Theta_1} \big\{\log G_{\exp}(h, \theta) + \log G_{\exp}(h, \bar\theta)\big\}. \tag{7}$$

I believe that these explanations shed some additional light on the construction proposed in Theorem 2.1 of the paper under discussion. This also raises several questions that might be interesting to investigate in the future. In particular, a compelling question is to characterize the set of surrogate loss functions $\phi$ that lead to computationally tractable testing procedures and for which the testing error rate remains small. Another question is whether one can deal with test (6) directly, without using the final step of convexification. At a heuristic level, the risk of $\hat h_{n,\mathcal H_0}^{\exp}$ should be smaller than that of $\tilde h_{n,\mathcal H_0}^{\exp}$. Therefore, the advantage of the latter would be only computational tractability. I wonder whether it is possible to efficiently compute the test $\hat h_{n,\mathcal H_0}^{\exp}$, despite the lack of convex-concavity of the cost function, by exploiting the facts that (a) for every $h$, the supremum of $\log G_{\exp}(h, \theta)$ over $\Theta$ can be efficiently computed, and (b) for every $\theta$, the minimum of $\log G_{\exp}(h, \theta)$ over $\mathcal H_0$ can be efficiently computed as well.
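To make the convex-concave structure concrete, here is a minimal sketch in a deliberately simple setting that is an assumption of this illustration, not the general framework of the paper: a single observation X ~ N(θ, 1), affine rules h(x) = a x + b, the exponential loss, and two disjoint intervals as hypotheses. In this case the Gaussian moment generating function gives log E_θ[exp(-s h(X))] = -s (a θ + b) + a²/2, which is affine in θ and convex in (a, b). All function names are hypothetical.

```python
import math

def log_G_exp(a, b, theta, s):
    """log E_theta[exp(-s * h(X))] for X ~ N(theta, 1) and h(x) = a * x + b.

    Closed form via the Gaussian moment generating function; it is affine
    (hence concave) in theta and convex in (a, b)."""
    return -s * (a * theta + b) + 0.5 * a * a

def convexified_objective(a, theta0_set, theta1_set):
    """Sup over Theta_0 x Theta_1 of log G_exp(h, theta) + log G_exp(h, theta_bar).

    Each set is an interval (lo, hi); an affine function of theta attains its
    supremum at an endpoint. The intercept b cancels in the sum, so it is set to 0."""
    sup0 = max(log_G_exp(a, 0.0, t, -1) for t in theta0_set)
    sup1 = max(log_G_exp(a, 0.0, t, +1) for t in theta1_set)
    return sup0 + sup1

def minimize_convex(f, lo, hi, iters=200):
    """Ternary search for the minimizer of a convex function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) <= f(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

# Hypothetical hypotheses: Theta_0 = [-2, -0.5] and Theta_1 = [1.5, 3].
theta0_set = (-2.0, -0.5)
theta1_set = (1.5, 3.0)
a_star = minimize_convex(
    lambda a: convexified_objective(a, theta0_set, theta1_set), -10.0, 10.0
)
value = convexified_objective(a_star, theta0_set, theta1_set)
```

For these intervals the gap between the hypotheses is Δ = 2, and the objective reduces to a² - Δ a for a > 0, so the search should return a slope close to Δ/2 = 1 with optimal value close to -Δ²/4 = -1. Both suprema are attained at the endpoints -0.5 and 1.5, the two closest points of the hypotheses.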

Reduction to testing simple hypotheses
The definition of the test given by the authors in Theorem 2.1, see also Eq. (7) above, is well suited for computational purposes but, in my opinion, has the inconvenience of hiding the main reason why the proposed test is a natural one to use in the setting under consideration. In fact, the proposed test can be alternatively defined as follows: in order to distinguish between two (convex) hypotheses $\Theta_0$ and $\Theta_1$ based on a sample $X \sim P_{\theta^*}$,
1. Determine the two closest points $\theta_0 \in \Theta_0$ and $\theta_1 \in \Theta_1$ in terms of the Hellinger distance between the corresponding distributions (in other terms, find the two representatives $P_{\theta_0}$ and $P_{\theta_1}$ of the families $\{P_\theta : \theta \in \Theta_0\}$ and $\{P_\theta : \theta \in \Theta_1\}$ that are the hardest to distinguish). This step is completely data independent.
2. Apply the standard likelihood-ratio test to the problem of choosing between the two simple hypotheses $H_0 : \theta^* = \theta_0$ and $H_1 : \theta^* = \theta_1$.

The equivalence of these two definitions follows from the proof of Theorem 2.1, see Eq. (52). In Section 2.3.2, this interpretation is presented for the discrete observation scheme. At a conceptual level, it is important to underline that the same interpretation holds true in the general case as well. However, from a practical point of view, the definition given in the paper is more convenient than the foregoing one, since the first step of the latter is, in general, not computationally tractable.
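The two-step description can be sketched in a case where the first step happens to be explicit. The assumptions of this illustration are: observations X ~ N(θ, I), for which the Hellinger distance between P_θ0 and P_θ1 is an increasing function of the Euclidean distance between θ0 and θ1, and hypotheses given by two disjoint Euclidean balls, for which the closest pair lies on the segment joining the centers. The function names are hypothetical.

```python
import math

def closest_points_between_balls(c0, r0, c1, r1):
    """Closest pair between two disjoint Euclidean balls B(c0, r0) and B(c1, r1)."""
    diff = [b - a for a, b in zip(c0, c1)]
    dist = math.sqrt(sum(d * d for d in diff))
    if dist <= r0 + r1:
        raise ValueError("the two balls must be disjoint")
    unit = [d / dist for d in diff]
    theta0 = [a + r0 * u for a, u in zip(c0, unit)]
    theta1 = [b - r1 * u for b, u in zip(c1, unit)]
    return theta0, theta1

def likelihood_ratio_test(x, theta0, theta1):
    """Likelihood-ratio test of N(theta0, I) against N(theta1, I).

    Returns +1 when the log-likelihood ratio favors theta1 and -1 otherwise."""
    llr = sum((t1 - t0) * xi for t0, t1, xi in zip(theta0, theta1, x))
    llr -= 0.5 * (sum(t * t for t in theta1) - sum(t * t for t in theta0))
    return 1 if llr > 0 else -1

# Step 1 (data independent): the least favorable pair of parameters.
theta0, theta1 = closest_points_between_balls([0.0, 0.0], 1.0, [4.0, 0.0], 1.0)
# Step 2: the simple-versus-simple likelihood-ratio test applied to an observation.
decision = likelihood_ratio_test([3.5, 1.0], theta0, theta1)
```

In this example the least favorable pair is (1, 0) and (3, 0), so the test reduces to comparing the first coordinate of the observation with the midpoint 2.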

Testing error for inexact solutions
As judiciously noted by the authors, in many practical situations the exact computation of the saddle point in (7) cannot be performed. Then, one relies on an approximation of the saddle point, and it is a central task to assess how this approximation error impacts the error of testing. I find it relevant to measure the error of approximation in terms of the magnitude of violation of first-order optimality conditions (see, for instance, Eq. (8) of the paper under discussion).
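In such a context, the authors establish upper bounds on the error of the test based on an approximate solution to the saddle point problem. For example, in the case of the Gaussian observation scheme explored in Section 2.3.1, it is shown that the worst-case error rate of the test based on the exact solution is

$$\varepsilon^* = 1 - \Phi\Big(\tfrac12 \big\|\Sigma^{-1/2}(\theta_0 - \theta_1)\big\|_2\Big),$$

where $\Phi$ is the cumulative distribution function of the standard normal distribution and $(\theta_0, \theta_1)$ is the second argument of the solution to the saddle point problem. On the other hand, when an inexact solution $(\tilde\theta_0, \tilde\theta_1)$ is used, with an approximation error bounded by $\delta > 0$, the worst-case error rate satisfies (see Eq. (9)):

$$\varepsilon \le 1 - \Phi\bigg(\tfrac12 \big\|\Sigma^{-1/2}(\tilde\theta_0 - \tilde\theta_1)\big\|_2 - \frac{\delta}{\big\|\Sigma^{-1/2}(\tilde\theta_0 - \tilde\theta_1)\big\|_2}\bigg).$$

In my opinion, it is worth complementing this upper bound by another one that involves only the exact solution $(\theta_0, \theta_1)$ and, therefore, makes it easier to compare the two errors $\varepsilon^*$ and $\varepsilon$. In the case of the Gaussian observation scheme, this can be easily done. In fact, one can deduce from the first-order exact and approximate optimality conditions that

$$\big\|\Sigma^{-1/2}(\tilde\theta_0 - \tilde\theta_1)\big\|_2 \ge \big\|\Sigma^{-1/2}(\theta_0 - \theta_1)\big\|_2.$$

Since the Gaussian cdf is increasing, we infer from this inequality that

$$\varepsilon \le 1 - \Phi\bigg(\tfrac12 \big\|\Sigma^{-1/2}(\theta_0 - \theta_1)\big\|_2 - \frac{\delta}{\big\|\Sigma^{-1/2}(\theta_0 - \theta_1)\big\|_2}\bigg).$$

An even more elegant bound can be obtained if the normalized approximate optimality condition is used: for all $(\theta, \bar\theta) \in \Theta_0 \times \Theta_1$, it holds that

$$(\tilde\theta_1 - \tilde\theta_0)^\top \Sigma^{-1} (\theta - \tilde\theta_0) + (\tilde\theta_0 - \tilde\theta_1)^\top \Sigma^{-1} (\bar\theta - \tilde\theta_1) \le \delta\, \big\|\Sigma^{-1/2}(\tilde\theta_0 - \tilde\theta_1)\big\|_2^2.$$

In this case, inequalities (9) take the form

$$\varepsilon \le 1 - \Phi\Big(\big(\tfrac12 - \delta\big) \big\|\Sigma^{-1/2}(\tilde\theta_0 - \tilde\theta_1)\big\|_2\Big),$$

and we get

$$\varepsilon \le 1 - \Phi\Big(\big(\tfrac12 - \delta\big) \big\|\Sigma^{-1/2}(\theta_0 - \theta_1)\big\|_2\Big).$$

This inequality allows for an easy comparison of $\varepsilon$ and $\varepsilon^*$ in the case of Gaussian observations. In the case of other observation schemes, deriving this type of upper bound seems to be more challenging and constitutes an interesting avenue of future research.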
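A small numerical sketch can help compare the two error levels in the Gaussian case. It relies on two assumptions that should be checked against the paper: the error of the likelihood-ratio test between N(θ0, Σ) and N(θ1, Σ) equals 1 - Φ(d/2), where d is the norm of Σ^{-1/2}(θ0 - θ1), and under the normalized optimality condition with accuracy δ the error is bounded by 1 - Φ((1/2 - δ) d). The covariance is taken diagonal for simplicity, and the function names are hypothetical.

```python
import math

def gaussian_tail(t):
    """1 - Phi(t) for the standard normal distribution, via the complementary error function."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

def whitened_distance(theta0, theta1, sigma_diag):
    """Norm of Sigma^{-1/2} (theta0 - theta1) for a diagonal covariance matrix."""
    return math.sqrt(
        sum((a - b) ** 2 / v for a, b, v in zip(theta0, theta1, sigma_diag))
    )

def exact_error(dist):
    """Error of the test built on the exact solution: 1 - Phi(dist / 2)."""
    return gaussian_tail(0.5 * dist)

def inexact_error_bound(dist, delta):
    """Assumed bound under the normalized condition: 1 - Phi((1/2 - delta) * dist)."""
    return gaussian_tail((0.5 - delta) * dist)

# Hypothetical example: two mean vectors at whitened distance 2.
d = whitened_distance([0.0, 0.0], [2.0, 0.0], [1.0, 1.0])
eps_star = exact_error(d)
eps_bound = inexact_error_bound(d, 0.1)
```

With δ = 0 the two quantities coincide, and the bound increases with δ, which gives a direct way to see how much accuracy in the saddle point computation is needed for a target error level.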