Sign-constrained least squares estimation for high-dimensional regression

Many regularization schemes for high-dimensional regression have been put forward. Most require the choice of a tuning parameter, using model selection criteria or cross-validation schemes. We show that a simple non-negative or sign-constrained least squares is a very simple and effective regularization technique for a certain class of high-dimensional regression problems. The sign constraint has to be derived via prior knowledge or an initial estimator but no further tuning or cross-validation is necessary. The success depends on conditions that are easy to check in practice. A sufficient condition for our results is that most variables with the same sign constraint are positively correlated. For a sparse optimal predictor, a non-asymptotic bound on the L1-error of the regression coefficients is then proven. Without using any further regularization, the regression vector can be estimated consistently as long as \log(p) s/n ->0 for n ->\infty, where s is the sparsity of the optimal regression vector, p the number of variables and n sample size. Network tomography is shown to be an application where the necessary conditions for success of non-negative least squares are naturally fulfilled and empirical results confirm the effectiveness of the sign constraint for sparse recovery.


Introduction
High-dimensional regression problems are characterized by a large number of predictor variables in relation to sample size.Regularization (in a broad sense) is of critical importance for high-dimensional problems and much attention has been paid to various schemes and their properties in recent years, including the Ridge estimator (Hoerl and Kennard, 1970), non-negative Garrotte (Breiman, 1995), the Lasso (Tibshirani, 1996) and various variations of the latter, including the group Lasso (Yuan and Lin, 2006) and adaptive Lasso (Zou, 2006).Datasets with very low signal-to-noise ratio offer similar challenges to high-dimensional problems even if the notional sample size is quite high.
Sign-constraints on the regression coefficients are a simpler regularization and have been first advocated by I.J.Good, as covered in the book Lawson and Hanson (1995).There is a wide range of problems where the sign of the regression coefficients can either be estimated by an initial estimator or where it is known a priori, such as in image processing and spectral analysis (Waterman, 1977;Bellavia et al., 2006;Donoho et al., 1992;Chen and Plemmons, 2009).Sign-constraints have also been implemented for matrix factorizations, specifically the non-negative Matrix factorization (Lee et al., 1999;Lee and Seung, 2001;Ding et al., 2010) and non-negative least squares regression can be a useful tool for this factorization (Kim and Park, 2007).We study the performance of non-negative least squares type problems under a so-called Positive Eigenvalue Condition, which can be checked for any given dataset by solving a quadratic programming problem.A sufficient condition uses only the minimum of all entries in the design matrix.It is shown that non-negative (or, in general, sign-constrained) least squares is a surprisingly effective regularization technique for highdimensional regression problems under these conditions.If the Positive Eigenvalue Condition is not fulfilled, the sign constraint is still a good ingredient in a regularization framework.The non-negative Garrote (Breiman, 1995) is, for example, making use of a sign-constraint, where the signs are derived from an initial estimator as is the positive Lasso (Efron et al., 2004).
The data are assumed to be given by a n × 1-vector of real-valued observations Y and a n × p-dimensional matrix X, where column k of X contains all n samples of the k-th predictor variable for k = 1, . . ., p.The non-negative least squares (NNLS) regression estimator is defined as We will work with a positivity constraint without limitation of generality since variables that are constrained to be negative can be replaced by their negative counterpart and the problem can thus always be framed as a non-negative least squares optimisation.Problem (1) is a convex optimization problem and can be solved with general quadratic programming problem solvers, including active set (Lawson and Hanson, 1995), iterative (Kim et al., 2006) and interior-point approaches (Bellavia et al., 2006).A tailor-made fast approximate algorithm based on random projections has recently been proposed in Boutsidis and Drineas (2009).The recent manuscript Slawski et al. (2011) contains independent work on the behaviour of NNLS in high-dimensions.Using the same Positive Eigenvalue Condition (which is called self-regularizing design condition), a bound on the prediction error of NNLS and a sparse recovery property after hard thresholding are shown in Slawski et al. (2011).Our main focus is on sparse recovery in the 1 -sense.The bounds on prediction error are also of different nature since the assumptions are different.We make use of the so-called compatibility condition which is appears in most sparse recovery results in the 1 -norm penalized estimation literature (Van De Geer and Bühlmann, 2009) and derive, with the help of this condition, tight non-asymptotic bounds on the prediction error.
Note that the non-negative least squares estimator (1) does not require the choice of a tuning parameter beyond choosing the sign of the coefficients.Imposing a sign-constraint might seem like a very weak regularization but it will be shown that the estimator is remarkably different from the un-regularized least squares estimator.It can cope with high-dimensional problems, where the number of predictor variables vastly exceeds sample size.It will be shown to be a consistent estimator as long as the underlying optimal prediction is sufficiently sparse (ie using only a small subset of all predictor variables) and the so-called Positive Eigenvalue Condition is fulfilled.
The manuscript is organized as follows.The notation and the main two assumptions, the compatibility and Positive Eigenvalue Condition, are introduced in Section 2. Our main result, a 1 -bound on the difference between the NNLS estimator and the optimal regression coefficients, is shown in Section 3, along with a bound on the prediction error.

Notation and Assumptions
We assume that the n samples Y ∈ R n are drawn from Xβ * + ε for some p-dimensional vector β * with min k β * k ≥ 0 and ε ∼ N (0, σ 2 ) for some σ > 0. Let S be the set of non-zero entries of the optimal solutions, S : {k : β * k = 0} and N = S c be the complement of S. We could also let β * be the best approximation to the data-generating model under positivity constraints but will refrain from doing so for notational simplicity.We assume that the columns of X are standardized to 2 -norm of n.Despite not necessarily assuming that the columns are mean-centered, we call Σ = n −1 X T X the covariance matrix throughout.
We make two major assumptions for the main result, one about sparse eigenvalues and another about the positive eigenvalue between predictor variables.

Compatibility Condition
There has been much recent work on the properties of the Lasso (Tibshirani, 1996).Many similar conditions for success of the Lasso penalization schemes have been derived (for example Zhang and Huang, 2008;Meinshausen and Yu, 2009;Wainwright, 2009;Bunea et al., 2007Bunea et al., , 2006;;Van De Geer, 2008;Bickel et al., 2009).A good overview of all conditions and their relations is given in Van De Geer and Bühlmann (2009).The weakest condition is based on the notion of (L, S) restricted 1 -eigenvalues.
The (L, S) restricted 1 -eigenvalue of matrix A is defined as: A lower bound on this restricted eigenvalue is necessary for success of the Lasso, either in a prediction loss or coefficient recovery sense and was called the compatibility condition in Van De Geer and Bühlmann (2009).It was shown to be weaker than all similar conditions such as the Restricted Isometry Property (Candes and Tao, 2007).
We make the following assumption.
Remark 1.The assumption is formulated for the empirical covariance matrix Σ but can also easily be reformulated on the population covariance matrix Σ for random design.Assume that the maximal difference between the population and empirical covariance matrix is bounded by δ > 0, that is Σ − Σ ∞ ≤ δ.This assumption is fulfilled with high probability for many data sets with larger sample size.If the predictors have for example a multivariate normal distribution (which will not be assumed elsewhere), then the condition is fulfilled with probability 1−2 exp(−t) for δ ≥ √ u+u with u = (4t+8 log(p))/n, see (10.1) in Van De Geer and Bühlmann (2009) √ δs in Corrolary 10.1 in Van De Geer and Bühlmann (2009).The Compatibility Condition could thus be imposed on the population covariance matrix instead of the empirical covariance matrix.

Positive eigenvalue condition
The following Positive Correlation Condition is the main assumption necessary to show success of nonnegative least squares.
The positively constrained minimal 1 -eigenvalue of matrix A is defined as A lower bound on this restricted eigenvalue will be a sufficient condition for sparse recovery success of NNLS.

Assumption 2 (Positive Eigenvalue Condition
).There exists some ν > 0 such that φ 2 pos ( Σ) ≥ ν.A lower bound on this eigenvalue seems to be a much stricter condition than the Compatibility Condition.However, the latter allows for positive and negative regression coefficients, while the Positive Eigenvalue Condition is restricted to positive coefficients.There are thus some immediate examples where it is fulfilled, which we discuss below.
Example I: strictly positive covariance matrix.The Positive Correlation Condition is fulfilled if min i,j Σ ≥ ν > 0, that is all entries in the covariance matrix are strictly positive.Again, this condition could also be formulated for the population covariance matrix, using a bound on Σ − Σ ∞ .
We also remark on the case of general sign-constraints (some variables constrained to be positive, some negative).The condition applies then to the dataset where all variables with a negativity constraint have been replaced with their negative counterparts.The constraint on the original covariance matrix is thus that it forms two blocks.The variables in the first block are the variables with a positivity constraint and the second block is formed by all variables with a negativity constraint.Correlations are required to be positive within a block and negative between blocks.
A generalization of Example I is the following.
Example II: only few negative entries.Let A := {i : Σij < 0 for some 1 ≤ j ≤ p} be the minimal set such that Σij < 0 implies {i, j} ⊆ A for all 1 ≤ i, j ≤ p.The Positive Eigenvalue Condition is fulfilled if both of the conditions below are fulfilled for some ν > 0.
1.All entries of the covariance matrix are strictly positive on A c , that is Σij ≥ 2ν if {i, j} ⊆ A c for all 1 ≤ i, j ≤ n.
2. A restricted eigenvalue condition holds on the set A, ie min If the set A is very small, in particular much smaller than n, the latter restricted 1 -eigenvalue condition is in general not very restrictive.The important criterion is thus whether the set A is small compared to the sample size.
Example III: block matrix.For a p × p-matrix A and a set K ⊆ {1, . . ., p}, let A KK be the |K| × |K|submatrix formed by all elements in set K. Suppose 1. Entries of the covariance matrix can be negative but fulfil Σij ≥ −ρ/p 2 for all 1 ≤ i, j ≤ n and some ρ > 0.
A more specific example is thus: all entries in Σ are larger than −ρ/p 2 for some ρ > 0 and Σij ≥ (ν + ρ)B if both i, j are within the same block.
The Positive Eigenvalue Condition is fulfilled with parameter ν > 0.
The positive aspect of the condition is that it is very easy to check in practice whether it applies (at least approximately) and whether one would thus expect the bounds shown below to apply to a given dataset.

Main Results
It will be shown that non-negative least squares leads to a good recovery of the optimal sparse regression vector for high-dimensional data.We study the 1 -error in the regression vector, which also yields a bound on the 2 -error and prediction loss.
Theorem 1. Assume that the Positive Eigenvalue Condition holds with ν > 0. Choose any 0 < η < 1/3.Assume that the compatibility condition holds with φ > 0 for L = 4ν −1 .Setting and assuming min k∈S β k > K p,η σ/ √ nφ, it then holds with probability at least 1 − η that A proof is in the appendix.The result might be surprising since it implies that non-negative least squares is succeeding in recovering the regression coefficients in an 1 -sense if log(p)s/ √ n → 0 for n → ∞, a scaling that requires for general design a lot more regularization in the form of Lasso penalties (or similar).
The result does not imply exact sign recovery in the sense that the non-zero coefficients equal exactly the set S (and indeed this will in general not be the case), but it implies that the S largest coefficients correspond to the variables in the set S. q q q q q q q q q q Leaf 1 q Leaf 2 q Leaf 3 Node 1 Node 2 Node 3 q q q q q q q q q q q q q q q q q q q q q qq q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q Figure 1: Left: A network with three internal nodes and three leaf nodes.The (unobservable) losses at the internal nodes are (10,10,0), meaning that the first two nodes lead to a loss rate of 10 and the third node is not leading to any losses.The observations of the loss rates at the leaf nodes are then (8,3,9).Using the observations at the leaf nodes and knowledge of the topology, NNLS can correctly identify the two first nodes as responsible for the losses.Right: A network with 78 internal nodes and 22 leaf nodes.Two of the internal nodes have a positive loss (marked with a dot) and the observations at the leaf nodes are again sufficient to pinpoint the (unknown) location of the two nodes using NNLS estimation.
Corollary 1.Under the same conditions as Theorem 1 and the stronger assumption that the minimum over all non-zero coefficients is bounded from below by min k∈S β k ≥ 2K p,η σ (5/ν + 4/ √ φ)s/ √ n, it holds with probability at least 1 − η that the indices of the s largest absolute coefficients in β are identical to the set S.
This follows immediately from Theorem 1 since the 1 -bound on the difference between β and β * implies the same bound in the supremum-norm.
The bound in Theorem 1 also implies a bound on the prediction error.
Theorem 2. Under the same conditions as Theorem 1, with probability at least 1 − η for any 0 < η < 1/3, A proof is given in the appendix.The mean squared error, introduced by using NNLS instead of the oracle estimator is thus proportional to log(p) 2 s/n.The result implies asymptotically vanishing prediction error if s log(p) 2 /n → 0 for n → n.

Numerical Results
The results above imply that NNLS can be very effective if (a) the sign of regression coefficients is known or can easily be estimated and (b) the Positive Eigenvalue Condition holds.Network tomography is a good example (Castro et al., 2004).For others, including image analysis and applications in signal processing, see Slawski et al. (2011).There are different aspects of network tomography, including origin-destimation matrix estimation and link-level network tomography; see Castro et al. (2004) for a good overview of the statistical aspects and Xi et al. (2006) and Lawrence et al. (2006) for a discussion of active tomography in the context of link-level analysis.We will focus on one aspect of the link-level network tomography.The network consists of nodes arranged in a directed acyclic graph (or sometimes as a special case a tree) and measurements can be taken at the leaf nodes.These measurements are used to infer the state of all the nodes in the network.In a communication network, the measurements can be the delay or loss rate of packages, in a transport network (such as water distributions networks) it can be the shortfall of the flow rate compared to the expected rate.Since the network topology is assumed to be known, the measurements consist typically of noise plus a linear combination of the internal and unobservable states of the nodes in the network.If a node in the network has a loss (be it in the form of delaying packages or loss of water flow), it will have a linear effect on all leaf nodes that are descendants of the node in the directed acyclic graph.
Figure 1 shows a toy example.Imagining a flow passing through the tree from the internal nodes to the leaf nodes, the entry X i,j is the proportion of flow in node j that reaches leaf node i if flow is divided equally among all outgoing edges in each node of the tree.Three internal nodes have loss rates (β 1 , β 2 , β 3 ) = (10, 10, 0).The loss rates Y = (Y 1 , Y 2 , Y 3 ) at the three leaf nodes are then given by Y = Xβ + ε for some i.i.d.noise ε and A positivity constraint on the coefficient vectors is clearly appropriate since there will in general not be a negative loss at internal nodes (for example no unexected gain of water in a distribution network).In the noiseless case, the NNLS solution recovers exactly the internal states (10, 10, 0) and thus identifies correctly the first two nodes as responsible for the loss of the flow rate in all three leaf nodes.In this simple example, the number of leaf nodes is equal to the number of internal nodes and ordinary least squares would also work in the noiseless case.Least squares clearly ceases to be useful once the number of internal nodes exceeds the number of leaf nodes.Note that, contrary to the previous literature (for example Castro et al. (2004); Lawrence et al. (2006)) we do not attempt to fit a stochastic model to the observations.We are merely trying to directly estimate the current internal state β of the nodes in the network as accurately as possible.
The theory suggests that a non-negativity constraint can already be very powerful under certain constraints on the design matrix.The main condition is the Positive Eigenvalue Condition.In our simple network tomography example, it is obvious that all entries in X are positive and the same is hence true for Σ = n −1 X T X. Entries in X correspond to the amount of loss (delay of packages or reduction in flow rate) in a leaf node caused by a specific loss at an internal node and is non-zero if and only if there is a connection between the internal and the leaf node.Suppose that all non-zero entries in X have entries at least as large as δ for some δ > 0. Suppose further that we can group all internal nodes into B blocks such that the internal nodes within a block share at least one leaf node to which they all connect.The Positive Eigenvalue Condition is then fulfilled with value δ 2 /B; see Example III in the discussion of the condition.
The theory seems to show that under these conditions the NNLS-regularization is effective.To test this, we examine the effect of placing an additional 1 -constraint on the coefficient by computing Let β be again the NNLS-solution defined in (1).It is obvious that β λ ≡ β for all λ ≥ λ max for λ max := β 1 .We generate networks of similar type as the ones shown in Figure 1.The number N of total nodes is chosen for each of 1000 simulations uniformly out the set {25, 50, 100, 200, 400}.Nodes are distributed uniformly on the area [−1, 1] 2 and numbered in order of their Euclidean distance from the origin.Starting with the first node k = 1 closest to the origin, edges are drawn between it and its K nearest neighbours with a larger ordering number (where K is drawn uniformly from the set {5, 10, 20}).When drawing edges at node k = 1, . . ., N − 1, they are deleted with probability ν (where ν is drawn uniformly from the set {.2, .4,.6,.8,1}) or when the edge would cross a previously drawn edge.Imagining again a flow passing through the tree from the internal nodes to the leaf nodes, the entry X i,j is the proportion of flow in node The average number of correctly identified internal nodes with a positive loss under 1000 different scenarios with an additional 1 -constraint as in (3).The NNLS solution corresponds to λ = λ max and is seen to be in general superior to the solutions under additional shrinkage.j that reaches leaf node i if flow is divided equally among all outgoing edges in each node of the tree.For each of the 1000 simulations, we draw a single graph from the parameters as described above and also draw the noise variance uniformly from the set {0, 0.125, 0.25, 0.5, 1, 2, 4} and a number s of non-zero entries in β (corresponding to nodes with a delay or loss), where s is drawn uniformly from the set {2, 5, 10}.The s non-zero entries from β are generated independently as the absolute value of a standard-normal random variable.For each such setting, we simulate 50 times the vector Y and reconstruct with βλ as in (3) for an evenly spaced grid of 20 points between λ = 0 and λ = λ max , the NNLS solution.Nodes are put in decreasing order of the reconstructed value βλ .We record the first entry in the re-ordered vector βλ that corresponds to a false positive (a zero entry in the equally re-ordered vector β) and call the number of true positives the number of values of βλ with larger value than the first false positive.
Figure 2 shows the average number of true positives as a function of λ.Each line corresponds to the average value over all 50 simulations in a given scenario.For nearly all scenarios there is no benefit in placing an additional 1 -penalty on the coefficients.The NNLS solution is thus a very good and simple estimator in these settings, as expected from theory.Additional regularization by an 1 -penalty does not seem to improve results.

Discussion
We have shown that non-negative (or sign-constrained) least squares can be an effective regularization technique for sparse high-dimensional data under two conditions: (a) the data fulfil the so-called Positive Eigenvalue Condition, which is easy to check for a given dataset, and (b) the sign of the coefficients is known or can easily be estimated.If the conditions hold, NNLS can recover the correct sparsity pattern in the absence of any further shrinkage, as long as log(p)s/n → 0 for n → ∞, where p is the number of variables, s the number of non-zero variables in the optimal regression vector and n is sample size.We have shown network tomography as an example where the sign of regression coefficients is known a priori and the design condition is fulfilled automatically, at least approximately.In other examples the sign can be estimated by an initial estimator.An attractive feature of NNLS is that it does not require any tuning parameter beyond the choice of the signs of the individual regression coefficients.Despite its simplicity, it can remarkably accurate for high-dimensional regression.
6 Appendix: Proofs 6.1 Proof of Theorem 1 and the results follow hence from Lemma 1.

Proof of Theorem 2
Define the oracle non-negative least squares solution as and let δβ = β − βoracle .Let M be the set M := {k : δβ k < 0}.Using Equation ( 9) in the proof of Lemma 1, it follows that, with probability at least 1 and, using δβ M c 1 ≤ δβ 1 and the bound in (7) for the latter quantity, it holds with probability at least 1 Using again C 2 = K 2 p,η = 2 log( √ 2p √ πη ), the claim follows.

Lemmata
and it thus remains to be shown that, if (6) is fulfilled, also Figure2: The average number of correctly identified internal nodes with a positive loss under 1000 different scenarios with an additional 1 -constraint as in (3).The NNLS solution corresponds to λ = λ max and is seen to be in general superior to the solutions under additional shrinkage.