High-dimensional variable selection via low-dimensional adaptive learning

A stochastic search method, the so-called Adaptive Subspace (AdaSub) method, is proposed for variable selection in high-dimensional linear regression models. The method aims at finding the best model with respect to a certain model selection criterion and is based on the idea of adaptively solving low-dimensional sub-problems in order to provide a solution to the original high-dimensional problem. Any of the usual $\ell_0$-type model selection criteria can be used, such as Akaike's Information Criterion (AIC), the Bayesian Information Criterion (BIC) or the Extended BIC (EBIC), with the last being particularly suitable for high-dimensional cases. The limiting properties of the new algorithm are analysed and it is shown that, under certain conditions, AdaSub converges to the best model according to the considered criterion. In a simulation study, the performance of AdaSub is investigated in comparison to alternative methods. The effectiveness of the proposed method is illustrated via various simulated datasets and a high-dimensional real data example.


Introduction
Rapid developments during the last decades in fields such as information technology or genetics have led to an increased collection of huge amounts of data. Nowadays one often faces the challenging scenario where the number of possible explanatory variables p is large while the sample size n can be relatively small. In this high-dimensional setting with p possibly much larger than n (abbreviated by $p \gg n$), statistical modelling and inference is possible under the assumption that the true underlying model is sparse. Hence, we are particularly interested in variable selection, that is, we want to identify a sparse, well-fitted model with only a few of the many candidate explanatory variables.
Although the proposed Adaptive Subspace method can be applied in a more general setup, in this paper we focus on variable selection in linear regression models with a response $Y$ and explanatory variables $X_1, \ldots, X_p$, i.e.
$$Y_i = \mu + \sum_{j=1}^{p} \beta_j X_{i,j} + \epsilon_i, \quad i = 1, \ldots, n, \qquad (1)$$
where the $\epsilon_i$ are i.i.d. random errors, $\epsilon_i \sim N(0, \sigma^2)$, with variance $\sigma^2 > 0$, $\mu \in \mathbb{R}$ is the intercept and $\beta = (\beta_1, \ldots, \beta_p)^T \in \mathbb{R}^p$ is the vector of regression coefficients. The matrix $X = (X_{i,j}) \in \mathbb{R}^{n \times p}$ is the design or data matrix, with its $i$-th row $X_{i,*}$ corresponding to the $i$-th observation and its $j$-th column $X_{*,j}$ to the values of the $j$-th explanatory variable.
Let $\{X_j : j \in P\}$ be the set of all possible explanatory variables, where $P = \{1, \ldots, p\}$ is the corresponding set of indices. Then, for $S \subseteq P$, let $X_S \in \mathbb{R}^{n \times |S|}$ denote the design matrix restricted to the columns with indices in $S$ and let $\beta_S \in \mathbb{R}^{|S|}$ denote the coefficient vector restricted to indices in $S$. Furthermore, let $S_0 = \{j \in P : \beta_j \neq 0\}$ be the set of indices corresponding to the true underlying model, the so-called true active set.
As already mentioned, a usual theoretical assumption in the high-dimensional regime is the sparsity of the true model. Thus, for the linear model (1), the cardinality of $S_0$ is assumed to be small, that is $s_0 = |S_0| \ll p$. The aim is to identify the active set $S_0$, so a variable selection method tries to "estimate" $S_0$ by some subset $\hat{S} \subseteq \{1, \ldots, p\}$.
It is desirable that a selection procedure has the following frequentist properties: the probability $P(\hat{S} = S_0)$ of selecting the correct model should be as large as possible, and the procedure should be variable selection consistent in the sense that $P(\hat{S} = S_0) \to 1$ in an asymptotic setting where $n \to \infty$ and (possibly) $p \to \infty$ with some specified rate.
Although the assumption that the "truth" is linear and sparse cannot be expected to hold in practice, it is desirable to identify the "best" linear, sparse approximation to the "truth" in order to find an interpretable model that avoids overfitting (see e.g. van de Geer et al. 2011).
Many different methods have been proposed to solve the variable selection problem in a high-dimensional situation, including the Lasso (Tibshirani 1996) and its variants (see Tibshirani 2011, for an overview), the SCAD (Fan and Li 2001) or Stability Selection (Meinshausen and Bühlmann 2010). Here we propose an alternative approach, the Adaptive Subspace (AdaSub) method, which tackles the original high-dimensional selection problem by appropriately splitting it into many low-dimensional sub-problems, based on a certain form of adaptive learning.
In Section 2 a selective overview of existing high-dimensional variable selection methods is given along with a motivation for the proposed new approach. The AdaSub algorithm is presented in Section 3. Its limiting properties are analysed in Section 4, where it is shown that, under the ordered importance property (OIP), AdaSub converges to the best model according to the adopted criterion (Theorem 1). It is further argued that, even when OIP is not satisfied, AdaSub provides a stable thresholded model. The performance of AdaSub is investigated through low- and high-dimensional examples in Section 5, demonstrating that AdaSub can outperform other well-established methods in certain situations with small sample sizes or highly correlated covariates. In Section 6, the effectiveness of AdaSub is further illustrated via a very high-dimensional real data example with p = 22,575 explanatory variables. Finally, the results along with directions for future work are discussed in Section 7.

Background and motivation
Many different methods have been proposed to solve the variable selection problem in a linear model. Classical selection criteria include the Akaike Information Criterion AIC (Akaike 1974), aiming for optimal predictions, and the Bayesian Information Criterion BIC (Schwarz 1978), aiming at identifying the "true" generating model. The BIC can be obtained as an approximation to a fully Bayesian analysis with a uniform prior on the model space. Chen and Chen (2008) argue that this model prior underlying the BIC is not suitable for a high-dimensional framework where the truth is assumed to be sparse. Therefore they propose a modified version of the BIC, called the Extended Bayesian Information Criterion (EBIC), with an adjusted underlying prior on the model space: for a fixed additional parameter $\gamma \in [0, 1]$ and a subset $S \subseteq P$ let the prior of the corresponding model be $\pi(S) \propto \binom{p}{|S|}^{-\gamma}$.
If $\gamma = 1$, the model prior is $\pi(S) = \frac{1}{p+1} \binom{p}{|S|}^{-1}$, which gives equal probability to each model size and equal probability to each model of the same size. The choice $\gamma = 1$ also corresponds to a default beta-binomial model prior providing automatic multiplicity correction (see Scott and Berger, 2010). For $\gamma = 0$, the original BIC is obtained.
Similarly to the derivation of the BIC, for a subset $S \subseteq P$, the EBIC with parameter $\gamma \in [0, 1]$ is asymptotically obtained as
$$\mathrm{EBIC}_\gamma(S) = -2 \log f_{\hat{\beta}_S, \hat{\mu}, \hat{\sigma}^2}(Y \mid X_S) + \left[\log(n) + 2\gamma \log(p)\right] |S|,$$
where $f_{\hat{\beta}_S, \hat{\mu}, \hat{\sigma}^2}(Y \mid X_S)$ denotes the maximized normal likelihood under model (1) with restricted design matrix $X_S$ (Chen and Chen 2012). According to the EBIC, the active set $S_0$ is estimated by $\hat{S} = \arg\min_S \mathrm{EBIC}_\gamma(S)$. It has been shown by Chen and Chen (2008) that, under a mild asymptotic identifiability condition, the EBIC is variable selection consistent for a linear model if $p = O(n^k)$ for some $k > 0$ and $\gamma > 1 - \frac{1}{2k}$, where the size of the true active set $s_0 = |S_0|$ is assumed to be fixed. The result has been extended by Foygel and Drton (2010) and Luo and Chen (2013) to the setting of a diverging number of relevant explanatory variables.
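To make the criterion concrete, the following minimal sketch in Python evaluates $\mathrm{EBIC}_\gamma$ for a given submodel via an OLS fit; the function name `ebic` and its interface are our own illustration rather than part of the paper, and the additive constant of the Gaussian log-likelihood that is common to all models is dropped.

```python
import numpy as np

def ebic(y, X_S, p, gamma=0.6):
    """EBIC_gamma of the linear model using the columns collected in X_S.

    -2 times the maximized Gaussian log-likelihood equals n*log(RSS/n) up to
    an additive constant that is identical for all models, so that constant
    is omitted here. p is the total number of candidate variables.
    """
    n, k = X_S.shape
    Z = np.column_stack([np.ones(n), X_S])            # design with intercept
    beta_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)  # OLS fit
    rss = float(np.sum((y - Z @ beta_hat) ** 2))
    return n * np.log(rss / n) + (np.log(n) + 2 * gamma * np.log(p)) * k
```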
The challenging problem with $\ell_0$-type selection criteria like the EBIC is that the resulting combinatorial optimization problems are very difficult to solve in the presence of many possible explanatory variables $p$, since there are $2^p$ possible models for which the criterion has to be evaluated. In fact, best subset selection with an $\ell_0$-penalty is in general NP-hard (see e.g. Huo and Ni 2007). Different alternatives have been proposed to circumvent the costly full enumeration approach. Clever branch-and-bound strategies (see e.g. Furnival and Wilson 1974; Narendra and Fukunaga 1977) reduce the number of model evaluations and in practice allow an exact solution up to $p \approx 40$. Very recently, a mixed integer optimization approach has been proposed by Bertsimas et al. (2016) which practically solves problems with $n \approx 1000$ and $p \approx 100$ exactly and finds approximate solutions for $n \approx 100$ and $p \approx 1000$. Methods like classical forward-stepwise selection, genetic algorithms (see e.g. Yang and Honavar 1998) as well as the more recently proposed "shotgun stochastic search" algorithm of Hans et al. (2007) and the stochastic regrouping algorithm of Cai et al. (2009) try to trace good models in a heuristic way, but there is no guarantee that one obtains the optimal solution according to the selected criterion.
In the 1990s the focus shifted from solving discrete optimization problems to solving continuous, convex relaxations of the original problem. Tibshirani (1996) proposes the celebrated Lasso, which solves a convex optimization problem with an $\ell_1$-penalty on the regression coefficients and then selects those variables whose corresponding regression coefficients are non-zero in the optimal solution. Many modifications of the Lasso have been proposed, such as the Elastic Net (Zou and Hastie 2005) or the Group Lasso (Yuan and Lin 2006), and efficient algorithms for solving the corresponding optimization problems have been developed (see e.g. Efron et al. 2004; Friedman et al. 2007). A drawback of $\ell_1$-regularization methods like the Lasso is that, in order to be variable selection consistent, they typically require quite strong conditions on the design matrix X. For the Lasso in linear regression models, it has been shown that the design matrix X has to satisfy the restrictive "Irrepresentable Condition" to obtain variable selection consistency (Meinshausen and Bühlmann 2006; Zhao and Yu 2006). Alternative methods like the SCAD (Fan and Li 2001), yielding a non-convex optimization problem, or the Adaptive Lasso (Zou 2006) provide consistent variable selection under weaker conditions. A general problem with procedures based on either $\ell_0$- or $\ell_1$-type criteria is that their optimal solution is not very stable with respect to small changes in the sample. In particular, it has been noted that the discrete nature of the $\ell_0$-penalty can lead to "overfitting" of the criterion, if the optimization is carried out among all possible $2^p$ models (see e.g. Breiman 1996; Loughrey and Cunningham 2005). Another problem of $\ell_1$-type criteria is that they do not provide any information about the uncertainty concerning the best model, per se.

Meinshausen and Bühlmann (2010) propose a procedure called Stability Selection which aims at addressing these issues. It is based on the idea of applying a given variable selection method (e.g. the Lasso) multiple times (say L times) on subsamples of the data. At the end, one selects those explanatory variables whose relative selection frequencies exceed some threshold (which is chosen in a way to control the false discovery rate). The subsampling scheme is to draw subsets $I_l$, $l \in \{1, \ldots, L\}$, of size $\lfloor n/2 \rfloor$ without replacement from $\{1, \ldots, n\}$ and then repeatedly consider the model (1) with observations $i \in I_l$ only. Even though Stability Selection has nice theoretical properties and also seems to be used more and more in practice, one might observe that in a high-dimensional situation with $p \gg n$, Stability Selection in combination with the Lasso successively applies a possibly inconsistent selection procedure on even more severe high-dimensional problems with $p \gg n/2$.

The main idea of the proposed AdaSub method is to successively apply a consistent selection procedure ($\ell_0$-type criteria like the EBIC) on data with the original sample size $n$ and only a few, say $q$, covariates (where $q \ll \min(n, p)$). So the concept behind AdaSub can be summarized as: "Solve a high-dimensional problem by solving many low-dimensional sub-problems." Two issues naturally arise in this regime: Which low-dimensional problems should be solved? And how can the information from the solved low-dimensional problems be combined in order to solve the original problem?
AdaSub links the answers to those questions using a certain form of adaptive learning: In each iteration of the algorithm, the solutions from the already solved low-dimensional problems are used to propose (or more precisely "sample" in a stochastic way) a new low-dimensional problem of potentially higher relevance. The construction is based on the principle that a significant explanatory variable for the full model space should also be identified as significant in "many" of the considered low-dimensional problems it is involved in.
The idea of applying variable selection methods successively to different model subspaces appears also in other methods like the Random Subspace Method (Ho 1998; Lai et al. 2006), Tournament Screening, the stochastic regrouping algorithm (Cai et al. 2009), the Bayesian split-and-merge (SAM) approach (Song and Liang 2015), extensions of Stability Selection (Beinrucker et al. 2016) and DECOrrelated feature space partitioning (Wang et al. 2016). Relevant are also the PC-simple algorithm (Bühlmann et al. 2010) and Tilting (Cho and Fryzlewicz 2012), which are discussed later in Sections 4 and 5. A characteristic feature of the proposed AdaSub method is that it makes explicit and effective use of the information learned from the subspaces already considered by using a certain form of adaptive stochastic learning. In particular, the inclusion probabilities of the individual variables to be selected in the subspaces are adjusted after each iteration of AdaSub, based on their currently estimated "importance". Therefore, the sizes of the sampled subspaces in AdaSub are not fixed in advance but are automatically adapted during the algorithm. In addition, the solution of the sub-problems in AdaSub does not necessarily rely on relaxations of the original $\ell_0$-type problem (such as the Lasso with an $\ell_1$-penalty) or on heuristic methods (such as stepwise selection methods). These features distinguish AdaSub from other subspace methods that have been previously considered in the literature.

Notation and assumptions
We first introduce some general notation in a setting with a criterion-based variable selection procedure. For the full set of explanatory variables $\{X_j : j \in P\}$ we identify a subset $S \subseteq P$ with the linear model (1) in which the sum on the right-hand side is restricted to the indices $j \in S$; i.e. in matrix notation the model induced by $S$ is given by
$$Y = \mu + X_S \beta_S + \epsilon,$$
where $Y = (Y_1, \ldots, Y_n)^T$, $\mu = (\mu, \ldots, \mu)^T \in \mathbb{R}^n$ and $\epsilon = (\epsilon_1, \ldots, \epsilon_n)^T$ with error variance $\sigma^2 > 0$. We consider the model space $\mathcal{M} = \{S \subseteq P : |S| < n-2\}$. Here we exclude subsets $S \subseteq P$ with $|S| \geq n-2$ to avoid obvious overfitting and non-identifiability of the regression coefficients. Given that we have observed some data $D = (X, Y)$, let $C_D : \mathcal{M} \to \mathbb{R}$ be a certain model selection criterion. In the following we will write $C \equiv C_D$ for brevity, but one should always recall that the function $C$ depends on the observed data $D$. We aim at identifying the best model, which is assumed to be, without loss of generality, the one that maximizes the given criterion $C$. Examples for $C$ include posterior model probabilities (within the Bayesian setup) or the negative AIC, BIC or EBIC (within the $\ell_0$-penalized criteria framework). We define
$$f_C(V) := \arg\max \{C(S) : S \subseteq V, \ S \in \mathcal{M}\}, \quad V \in \mathcal{P}(P),$$
where $\mathcal{P}(P) = \{V \subseteq P\}$ denotes the power set of $P = \{1, \ldots, p\}$. So for a given $V \subseteq P$, $f_C(V)$ is the best model according to criterion $C$ among all models included in $V$. In the following we will assume for simplicity that any two different models have different criterion values, so that $f_C$ is a well-defined function which maps any $V \subseteq P$ to a single model $f_C(V) \in \mathcal{M}$. In the $\ell_0$-penalized likelihood framework this assumption is almost surely satisfied if the values of the explanatory variables are generated from an absolutely continuous distribution with respect to the Lebesgue measure (Nikolova 2013). Let $S^* := f_C(P)$ with $s^* = |S^*|$ denote the best model according to criterion $C$, which is unique under the made assumptions. Hereafter, $S^*$ will be referred to as the $C$-optimal model. Finally, in the following let $\mathbb{N}$ denote the set of natural numbers and $\mathbb{N}_0 = \mathbb{N} \cup \{0\}$ the set of non-negative integers. For a set $\Omega$ and a subset $A \subseteq \Omega$ the indicator function of $A$ is denoted by $1_A$, i.e. we have $1_A(\omega) = 1$ if $\omega \in A$, and $1_A(\omega) = 0$ if $\omega \in \Omega \setminus A$.
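As an illustration of the map $f_C$, the following brute-force sketch in Python computes the best model among all subsets of a given index set $V$; the helper name `f_C` and the callable `criterion` argument are our own, and the paper itself solves this sub-problem with a branch-and-bound search rather than full enumeration.

```python
from itertools import combinations

def f_C(V, criterion, n):
    """Brute-force version of f_C(V): the best model S contained in V with |S| < n - 2.

    `criterion(S)` returns the value C(S) to be maximized for a tuple of
    indices S; the paper instead uses the branch-and-bound search of the
    R package leaps for the same task.
    """
    best_S, best_val = (), criterion(())
    for size in range(1, min(len(V), n - 3) + 1):
        for S in combinations(sorted(V), size):
            val = criterion(S)
            if val > best_val:
                best_S, best_val = S, val
    return set(best_S), best_val
```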

The algorithm
We will now describe the generic AdaSub method, given as Algorithm 1.

[Algorithm 1: the generic AdaSub algorithm; its final output is the thresholded model $\hat{S}_\rho = \{j \in P : r_j^{(T)} > \rho\}$.]
A first version of the algorithm has been presented at the 31st International Workshop on Statistical Modelling (Staerk et al. 2016). Suppose that we have observed some data $D = (X, Y)$ and we want to identify the $C$-optimal model. As described in Section 2, the basic idea of AdaSub is to solve many low-dimensional problems (i.e. compute $f_C(V)$ for many subspaces $V \subseteq P$ with $|V|$ relatively small) in order to obtain a solution for the given high-dimensional problem (i.e. identify $S^* = f_C(P)$). AdaSub is a stochastic algorithm which in each iteration $t$, for $t = 1, \ldots, T$, samples a subset $V^{(t)} \subseteq P$ of the set of all possible explanatory variables $P = \{1, \ldots, p\}$ and then computes $S^{(t)} = f_C(V^{(t)})$. The probability that $j \in P$ is included in $V^{(t)}$ at iteration $t$ is given by $r_j^{(t-1)}$. These selection probabilities are automatically adapted after each iteration $t$ in the following way:
$$r_j^{(t)} = \frac{q + K \cdot |\{i \leq t : j \in S^{(i)}\}|}{p + K \cdot |\{i \leq t : j \in V^{(i)}\}|}, \quad j \in P,$$
where $q \in (0, p)$ and $K > 0$ are tuning parameters of the algorithm.
If $j \in V^{(t)}$ but $j \notin S^{(t)}$, then $r_j^{(t)} < r_j^{(t-1)}$, so the selection probability of variable $X_j$ decreases in the next iteration. If $j \in V^{(t)}$ and also $j \in S^{(t)}$, then $r_j^{(t)} > r_j^{(t-1)}$, so the selection probability increases. If $j \notin V^{(t)}$, then obviously $j \notin S^{(t)}$, so the selection probability does not change in the next iteration. Note that $r_j^{(t)}$ depends on the whole history (from iteration 1 up to iteration $t$) of the number of times variable $X_j$ has been considered in the search ($j \in V^{(i)}$) and the number of times it has been included in the best subset ($j \in S^{(i)}$). Clearly we have $0 < r_j^{(t)} < 1$ for all $t = 1, \ldots, T$ and $j \in P$. So at each iteration $t$ each variable $X_j$ has positive probability $r_j^{(t)}$ of being considered in the model search ($j \in V^{(t)}$) and also has positive probability $1 - r_j^{(t)}$ of not being considered ($j \notin V^{(t)}$).
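The following Python sketch outlines the generic AdaSub loop under the update rule stated above; it is a simplified illustration of Algorithm 1, reusing the brute-force `f_C` helper sketched in the previous section, and all function names and defaults are ours rather than the paper's reference implementation.

```python
import numpy as np

def adasub(criterion, p, n, q=10, K=None, T=2000, rho=0.9, rng=None):
    """Sketch of the generic AdaSub loop (Algorithm 1), not the reference code.

    criterion(S) returns C(S) for a tuple of indices S; f_C is the brute-force
    helper sketched above. Selection probabilities follow
    r_j = (q + K * s_j) / (p + K * c_j), where c_j counts how often variable j
    was considered (j in V) and s_j how often it was selected (j in S).
    """
    rng = np.random.default_rng() if rng is None else rng
    K = n if K is None else K                        # recommended learning rate K = n
    considered = np.zeros(p)                         # c_j
    selected = np.zeros(p)                           # s_j
    best_S, best_val = set(), criterion(())
    for _ in range(T):
        r = (q + K * selected) / (p + K * considered)
        V = np.nonzero(rng.random(p) < r)[0]         # sample subspace V^(t)
        S, val = f_C(set(V.tolist()), criterion, n)  # solve low-dimensional sub-problem
        considered[V] += 1
        for j in S:
            selected[j] += 1
        if val > best_val:                           # track "best" sampled model S_b
            best_S, best_val = S, val
    r_final = (q + K * selected) / (p + K * considered)
    S_thres = set(np.nonzero(r_final > rho)[0].tolist())  # thresholded model S_rho
    return best_S, S_thres
```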
As the final subset selected by AdaSub one can either (i) choose the "best" sampled model $\hat{S}_b$ for which $C(\hat{S}_b) = \max\{C(S^{(1)}), \ldots, C(S^{(T)})\}$, or (ii) consider the thresholded model $\hat{S}_\rho = \{j \in P : r_j^{(T)} > \rho\}$ with some threshold $\rho \in (0, 1)$. While $\hat{S}_b$ is obviously more likely to coincide with the $C$-optimal model $S^*$, it can be beneficial in terms of variable selection stability to consider the thresholded model $\hat{S}_\rho$ instead (with $\rho$ relatively large).
A detailed relevant discussion follows in Section 4.
Note that we implicitly assume that it is computationally feasible to compute $S^{(t)} = f_C(V^{(t)})$ in each iteration $t$. In fact, if the underlying "truth" is sparse and the criterion used enforces sparsity, $|V^{(t)}|$ is expected to be relatively small. Otherwise one might use heuristic algorithms in place of a full enumeration. Alternatively, if $|V^{(t)}|$ is bigger than some upper bound $U_C$ up to which an exact solution of the sub-problems is computationally feasible, one might resort to such heuristic methods within $V^{(t)}$. In the case of variable selection in linear regression with $C(S) = -\mathrm{EBIC}(S)$ using the fast branch-and-bound algorithm (Lumley and Miller 2017) one might set $U_C \leq 40$. However, in the following we will assume that the original version of AdaSub (Algorithm 1) is used.
The AdaSub method requires that we initialize three parameters: $q$, $K$ and $T$. Here $q \in (0, p)$ is the initial expected search size, which should be relatively small (e.g. $q = 10$).
The initial expected search size $q$ reflects our prior belief about the sparsity of the problem, i.e. $q$ should be a first rough "estimate" of the size of the $C$-optimal model. We have $E(|V^{(1)}|) = \sum_{j=1}^{p} r_j^{(0)} = p \cdot \frac{q}{p} = q$, so the expected search size in the first iteration is indeed $q$. In the following iterations $t$, $t \in \{2, \ldots, T\}$, the expected search size is automatically adapted depending on the sizes of the previously selected models $S^{(i)}$, $i < t$; see Section A2 of the supplement for an illustrative example. The parameter $K > 0$ controls the learning rate of the algorithm. The larger $K$ is chosen, the faster the selection probabilities $r_j^{(t)}$ of the variables $X_j$ are adapted. Based on our experience with numerous simulated and real data examples, we recommend the choices $K = n$ and $q \in [5, 15]$. A more detailed discussion of the tuning parameters is given in Section 5.3, where we investigate the performance of AdaSub with respect to the choices of $q$ and $K$ in a simulation study. The number of iterations $T \in \mathbb{N}$ can be specified in advance. Alternatively one might impose an automatic stopping criterion for the algorithm, but we strongly advise to inspect the output of AdaSub by appropriate diagnostic plots and assess the convergence of the algorithm interactively; see Section A2 of the supplement for suggested diagnostic plots.

Limiting properties of AdaSub
In this section we summarize theoretical results concerning the limiting properties of AdaSub, while a detailed exposition and proofs of the results can be found in the supplement to this paper. In particular, we address the question under which conditions it can be guaranteed that AdaSub "converges correctly" to the $C$-optimal model $S^* = f_C(P)$.
Definition 4.1. For a given selection problem with model selection criterion $C$, the AdaSub algorithm is said to converge to the $C$-optimal model $S^*$ if and only if for all $j \in P$ we have for the selection probability of explanatory variable $X_j$ that
$$r_j^{(t)} \xrightarrow{a.s.} 1_{S^*}(j) \quad \text{as } t \to \infty.$$
By definition, AdaSub converges to the $C$-optimal model $S^*$ if the selection probabilities $r_j^{(t)}$ converge almost surely to one (zero) for explanatory variables included (not included) in $S^*$. The $C$-optimal convergence of AdaSub implies that, for any fixed threshold $\rho \in (0, 1)$, the thresholded model $\hat{S}_\rho = \{j \in P : r_j^{(T)} > \rho\}$ will coincide with the $C$-optimal model $S^*$ if the number of iterations $T$ of AdaSub is large enough. Note that even when AdaSub does not converge to the $C$-optimal model in the sense of Definition 4.1, it is still possible that the $C$-optimal model is identified by AdaSub, by considering the "best" model $\hat{S}_b$ found by AdaSub after a finite number of iterations.
We now introduce the so-called ordered importance property (OIP) of a given variable selection problem with criterion $C$, which turns out to be a sufficient condition for the $C$-optimal convergence of AdaSub.

Definition 4.2. Let $S^* = \{j_1, \ldots, j_{s^*}\}$ denote the $C$-optimal model. Then the selection criterion $C$ is said to fulfil the ordered importance property (OIP) for the sample $D$ if there exists a permutation $(k_1, \ldots, k_{s^*})$ of $(j_1, \ldots, j_{s^*})$ such that for each $i \in \{1, \ldots, s^*\}$ it holds that
$$k_i \in f_C(V) \quad \text{for all } V \subseteq P \text{ with } \{k_1, \ldots, k_i\} \subseteq V. \qquad (8)$$

Theorem 1. Suppose that the ordered importance property (OIP) is satisfied. Then AdaSub converges to the $C$-optimal model.
We briefly describe the main idea behind OIP and the proof of Theorem 1: OIP assumes that there exists a $k_1 \in S^*$ (the "most important" variable $X_{k_1}$) such that it is always selected to be in the best subset $f_C(V)$ for all sets $V \subseteq P$ with $k_1 \in V$. By Theorem A.2 of the supplement we conclude that $r_{k_1}^{(t)} \to 1$ (almost surely). Furthermore, by OIP there exists a $k_2 \in S^*$ (the "second most important" variable $X_{k_2}$) such that it is always selected to be in the best subset $f_C(V)$ for all sets $V \subseteq P$ with $k_1, k_2 \in V$. In other words, variable $X_{k_2}$ is always selected to be in the best subset as long as variable $X_{k_1}$ is also considered.
By Theorem A.2 we similarly conclude that $r_{k_2}^{(t)} \to 1$ (a.s.). We continue in the same way and obtain that $r_{k_i}^{(t)} \to 1$ (a.s.) for each $i = 1, \ldots, s^* - 1$. Now by the definition of the map $f_C$ and the $C$-optimal model $S^*$ it holds that $f_C(V) = S^*$ for all $V \subseteq P$ with $S^* \subseteq V$.
Thus with Theorem A.2 we conclude that $r_{k_{s^*}}^{(t)} \to 1$ (a.s.) and that $r_j^{(t)} \to 0$ (a.s.) for each $j \in P \setminus S^*$. In the supplement of this paper we prove the $C$-optimal convergence of AdaSub under a slightly different (weaker) sufficient condition OIP' (see Definition A.1 and Theorem A.3). For ease of presentation we have focused here on the more intuitive version of OIP in Definition 4.2. Theorem A.3 of the supplement implies Theorem 1 above.
Note that OIP requires only the existence of such a permutation of the variables with indices in $S^*$ and not its identification or uniqueness. So in order to guarantee that OIP holds, we do not need to know any concrete permutation, but only that such a permutation exists. On the other hand, this condition cannot be easily checked, since we do not know the set $S^*$, which AdaSub actually tries to identify. Despite this, note that if we observe that the AdaSub algorithm does not converge to the $C$-optimal model, i.e. if there exists $j \in P$ with $r_j^{(t)} \to r_j^*$ for some $r_j^* \in (0, 1)$ with positive probability, then we can conclude that OIP is not satisfied. In that situation we actually might not wish to select $S^* = f_C(P)$, since then there is no "stable learning path" in the sense of OIP. Instead, we propose to consider the thresholded model $\hat{S}_\rho$ for some large threshold value (e.g. $\rho = 0.9$).
Indeed, Corollary A.2 of the supplement implies that in a situation where OIP does not hold, the thresholded model $\hat{S}_\rho$ will (for fixed $\rho \in (0, 1)$ and $T$ large enough) contain at least those variables in $S^*$ that are included in a maximal "learning path" in the sense of OIP. Although $\hat{S}_\rho$ might also contain additional variables which are possibly not in $S^*$, simulation studies (Section 5) show that in most of the cases when OIP is not satisfied the thresholded model $\hat{S}_\rho$ provides a sparser and more stable model (with fewer false positives) than the "best" model $\hat{S}_b$ found by AdaSub; see also the examples discussed in Sections A2 and A3 of the supplement. Note that in practice the threshold $\rho \in (0, 1)$ should not be chosen too close to one, since otherwise the selection probabilities $r_j^{(T)}$ of "important" variables may not have exceeded that threshold after a finite number of iterations $T \in \mathbb{N}$.
We observe that the choice ρ = 0.9 works empirically well in combination with a sufficiently large number of iterations T (see Sections 5 and 6).
The idea behind the ordered importance property (OIP) is connected to the concept of partial faithfulness (PF) underlying the PC-simple algorithm for variable selection of Bühlmann et al. (2010). In a random design setting, let $\rho(Y, X_j \mid X_S)$ denote the partial correlation between the response $Y$ and variable $X_j$ given the set of variables $X_S := \{X_k : k \in S\}$ for some subset $S \subseteq P$. Bühlmann et al. (2010) show that if the covariance matrix of the explanatory variables is strictly positive definite and the vector of non-zero regression coefficients is generated from a distribution with a density on a subset of $\mathbb{R}^{|S_0|}$ that is absolutely continuous with respect to the Lebesgue measure, then the PF property holds almost surely with respect to the distribution generating the non-zero regression coefficients, which implies that for each $j \in S_0$:
$$\rho(Y, X_j \mid X_S) \neq 0 \quad \text{for all } S \subseteq P \setminus \{j\}. \qquad (9)$$
This means that any truly important variable $X_j$ (i.e. $\beta_j \neq 0$) remains "important" when conditioning on any subset $S \subseteq P \setminus \{j\}$ (i.e. the corresponding partial correlation is non-zero). Therefore, if PF holds, one would hope that the criterion $C$, which aims at identifying $S_0$, also satisfies the following analogous property (for each $j \in P$):
$$j \in S^* \ \Longrightarrow \ j \in f_C(V) \quad \text{for all } V \subseteq P \text{ with } j \in V. \qquad (10)$$
Note that OIP is significantly weaker than the assumption given in (10) in the sense that the inclusion $k_i \in f_C(V)$ is required only for those subsets $V$ that additionally contain the "more important" variables $k_1, \ldots, k_{i-1}$. Similarly, an OIP on the population level (which is a weaker condition than the PF property) assumes that, if $j = k_i \in S_0$, then it holds that $\rho(Y, X_j \mid X_S) \neq 0$ for all $S \subseteq P \setminus \{j\}$ with $\{k_1, \ldots, k_{i-1}\} \subseteq S$. One cannot generally expect that the PF property (9) on the population level implies the analogous property (10) or the weaker OIP in the given finite sample situation. But if OIP does not hold, then this indicates that the best model $S^*$ according to the criterion $C$ is not "stable" in the sense of (10), and that there does not even exist a "learning" path $(k_1, \ldots, k_{s^*})$ such that variable $X_{k_i}$ is selected to be important in each "relevant experiment" in which the "more important" variables $X_{k_1}, \ldots, X_{k_{i-1}}$ are also considered.

Finally, we would like to emphasize that we have focused on the algorithmic convergence of AdaSub to the best model $S^*$ according to a given criterion $C$ (as the number of iterations $T$ diverges). Based on the presented analysis, depending on the properties of the employed selection criterion $C$, one may derive specific statistical consistency results for recovering the true underlying model $S_0 = \{j \in P : \beta_j \neq 0\}$ (as the sample size $n$ and the number of variables $p$ diverge with a certain rate). We briefly indicate how such a consistency result can be obtained in case the employed selection criterion $C$ is the (negative) BIC.
For this, note that optimizing a given selection criterion $C$ inside subspaces $V \subseteq P$ with $S_0 \not\subseteq V$ corresponds to variable selection in the situation of misspecified models. It has been shown that the BIC is a quasi-consistent criterion in such situations under mild regularity conditions for the classical asymptotic setting where the number of variables $p$ is fixed and the sample size $n$ diverges, i.e. with probability tending to one, the BIC selects the model that minimizes the Kullback-Leibler divergence to the true model (see Nishii 1988; Lv and Liu 2014; Song and Liang 2015). By using such a result for each variable selection sub-problem $f_C(V) = \arg\max_{S \subseteq V, S \in \mathcal{M}} C(S)$ for all possible subspaces $V \subseteq P$, one can deduce that AdaSub in combination with the BIC yields a variable selection consistent procedure for the classical asymptotic setting, provided that the OIP condition on the population level (or alternatively the more stringent PF condition (9)) is satisfied; this implies that, with probability tending to one, the thresholded model $\hat{S}_\rho$ of AdaSub equals the true model $S_0$ when the sample size $n$ and the number of iterations $T$ go to infinity for fixed $p$. The detailed investigation of the variable selection consistency of AdaSub, including high-dimensional asymptotic settings where the number of variables $p$ diverges with the sample size $n$, is an interesting topic for future work.

Simulation study
We have investigated the performance of AdaSub in extensive simulation studies and here we present some representative results. The discussion is divided into three parts: First, we examine relatively low-dimensional simulation examples where it is feasible to identify the best model according to an $\ell_0$-type criterion $C$, so that it can be compared to the output of AdaSub. In the second part, we apply AdaSub on high-dimensional simulation examples and compare its performance with different well-known methods. Finally, we investigate the algorithmic stability of AdaSub and the effects of the choice of its tuning parameters.
The following simulation setup is used: for a given sample size $n \in \mathbb{N}$ and a number of explanatory variables $p \in \mathbb{N}$, we simulate the rows of the design matrix $X$ independently from a multivariate normal distribution with covariance matrix $\Sigma$. Here, we consider a Toeplitz correlation structure, i.e. for some $c \in (-1, 1)$ let $\Sigma_{k,l} = c^{|k-l|}$ for all $k \neq l$ (with $\Sigma_{k,k} = 1$). Results for further correlation structures are presented in Section A3 of the supplement.
In particular, we examine the case of independent covariates ($c = 0$) and the case of highly correlated covariates ($c = 0.9$). For each dataset, we select $s_0 \in \{0, \ldots, 10\}$ and $S_0 \subset P$ of size $|S_0| = s_0$ randomly; then for each $j \in S_0$ we independently simulate the corresponding non-zero regression coefficient $\beta_j$ (with $\beta_j = 0$ for $j \notin S_0$) and generate the response according to model (1). We apply AdaSub in combination with the (negative) $\mathrm{EBIC}_\gamma$ as a selection criterion for different regularization constants $\gamma \in [0, 1]$ (recall that $\gamma = 0$ corresponds to the usual BIC). In AdaSub we use the "leaps and bounds" algorithm implemented in the R-package leaps (Lumley and Miller 2017) to compute at iteration $t$ the best model $S^{(t)}$ according to $\mathrm{EBIC}_\gamma$ contained in $V^{(t)}$.
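For concreteness, a Python sketch of this data-generating setup is given below; the law of the non-zero coefficients and the unit error variance are placeholder assumptions, since they are not fully specified in this section.

```python
import numpy as np

def simulate_data(n, p, c=0.9, rng=None):
    """One dataset following the simulation setup of Section 5 (sketch).

    Rows of X are i.i.d. N_p(0, Sigma) with Toeplitz correlation
    Sigma[k, l] = c**|k - l|. The size s0 of the true active set and its
    positions are drawn at random; the distribution of the non-zero
    coefficients and the unit error variance are placeholder assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    Sigma = c ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    s0 = int(rng.integers(0, 11))                  # s0 in {0, ..., 10}
    S0 = rng.choice(p, size=s0, replace=False)     # true active set
    beta = np.zeros(p)
    beta[S0] = rng.standard_normal(s0)             # placeholder coefficient law
    y = X @ beta + rng.standard_normal(n)          # i.i.d. N(0, 1) errors assumed
    return X, y, set(S0.tolist())
```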

Low-dimensional setting
It is illuminating to analyse the performance of AdaSub in a situation where we actually can compute the best model according to the criterion used (here BIC). We are thus able to answer the question whether AdaSub really recovers the BIC-optimal model. In order to compute the BIC-optimal model in reasonable computational time using the "leaps and bounds" algorithm we set $p = 30$. For a given correlation structure, the sample size $n$ is increased from 40 to 200 in steps of size 20 and for each value of $n$ we simulate 100 different datasets according to the simulation setup described above. In AdaSub we set $q = 5$, $K = n$ and $T = 2000$.

Figure 1 summarizes the results of the low-dimensional simulation study in the case of independent explanatory variables. The BIC-optimal model $S^*$ tends to select many false positives for small sample sizes and to overfit the data. On the other hand, $\hat{S}_{0.9}$ and $\hat{S}_b$ from AdaSub yield sparser models and often reduce the number of falsely selected variables in a situation where the BIC is too liberal. This comes at the price of a slightly increased number of false negatives (for small $n$), but the overall effect of selecting a sparser model with AdaSub is beneficial for the given situation, yielding higher relative frequencies of selecting the true model $S_0$, smaller Mean Squared Errors (MSE) and smaller Root Mean Squared Prediction Errors (RMSE). Although the "best" sampled model $\hat{S}_b$ from AdaSub identifies the BIC-optimal model more often than the thresholded model $\hat{S}_{0.9}$ from AdaSub, the choice of $\hat{S}_{0.9}$ is beneficial for the given situation. When the sample size increases, the BIC-optimal model becomes more "stable" and the relative frequencies that the models selected by AdaSub agree with the BIC-optimal models tend to one. We note that the tendency of AdaSub to suggest sparser models in unstable situations is also observed in further simulations with different correlation structures of $X$ (see Section A3 of the supplement).

High-dimensional setting
We now turn to a high-dimensional scenario, in which both the sample size n and the number of explanatory variables p tend to infinity with a certain rate. In particular, we consider the setting p = 10 × n where n increases from 40 to 200 in steps of size 20 (and thus p increases from 400 to 2000). For each pair (n, p) we simulate 100 datasets according to the simulation setup described above. We compare the "best" modelŜ b from AdaSub and the thresholded modelŜ ρ with ρ = 0.9 from AdaSub with different well-known methods for high-dimensional variable selection: We consider the Lasso, Forward Stepwise Regression, the SCAD, the Adaptive Lasso, Stability Selection with Lasso and Tilting.
For the computation of the Lasso and the Adaptive Lasso we use the R-package glmnet (Friedman et al. 2010), for Stability Selection the R-package stabs (Hofner and Hothorn 2017), for the SCAD the R-package ncvreg (Breheny and Huang 2011) and for Tilting the R-package tilting (Cho and Fryzlewicz 2016). In AdaSub we choose the $\mathrm{EBIC}_\gamma$ with parameter $\gamma = 0.6$ or $\gamma = 1$ as the criterion $C$; additionally we set $q = 10$, $K = n$ and $T = 5000$. Note that $p = O(n^k)$ with $k = 1$, so that we have $\gamma > 1 - \frac{1}{2k}$ and thus $\mathrm{EBIC}_\gamma$ is a variable selection consistent criterion for the given asymptotic setting for both choices of $\gamma \in \{0.6, 1\}$.
For comparison reasons we also choose the regularization parameter of the Lasso, the SCAD and Forward Stepwise Regression according to $\mathrm{EBIC}_\gamma$ (with $\gamma = 0.6$ or $\gamma = 1$).
Instead of the usual Lasso and SCAD estimators we use versions of the Lasso-OLS hybrid (see also Efron et al. 2004; Belloni and Chernozhukov 2013), where we compute the $\mathrm{EBIC}_\gamma$ values of all models along the Lasso path (and the SCAD path, respectively) using the ordinary least squares (OLS) estimators and finally select the model (with corresponding OLS estimator) yielding the lowest $\mathrm{EBIC}_\gamma$ value. The additional tuning parameter of the SCAD penalty is set to the default value of 3.7 (as recommended in Fan and Li 2001).
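As an illustration of this comparator, the following Python sketch scores the supports along a Lasso path by $\mathrm{EBIC}_\gamma$ after an OLS refit; it assumes scikit-learn's `lasso_path` and the `ebic` helper sketched in Section 2, and it is not the exact implementation used in the study (which is based on glmnet in R).

```python
import numpy as np
from sklearn.linear_model import lasso_path

def lasso_ols_ebic(X, y, gamma=0.6):
    """Lasso-OLS hybrid (sketch): choose the support on the Lasso path with the
    smallest EBIC_gamma value after an OLS refit, using the ebic helper above.
    """
    n, p = X.shape
    _, coefs, _ = lasso_path(X, y)                      # coefs has shape (p, n_alphas)
    best_S, best_val = set(), ebic(y, np.empty((n, 0)), p, gamma)
    seen = set()
    for k in range(coefs.shape[1]):
        S = tuple(int(j) for j in np.nonzero(coefs[:, k])[0])
        if S in seen or len(S) >= n - 2:                # skip repeated or too large supports
            continue
        seen.add(S)
        val = ebic(y, X[:, list(S)], p, gamma)          # OLS refit and EBIC score
        if val < best_val:
            best_S, best_val = set(S), val
    return best_S
```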
For the Adaptive Lasso we derive the initial estimator with the usual Lasso, where the regularization parameter is chosen using 10-fold cross-validation, and compute in the second step an additional Lasso path where the regularization parameter is chosen according to $\mathrm{EBIC}_\gamma$. We make use of the complementary pairs version of Stability Selection yielding improved error bounds (Shah and Samworth 2013). The parameters for Stability Selection are chosen such that the expected number of type I errors is bounded by 1 (using the per-family error rate bound), while using the threshold 0.6 and considering 100 subsamples.
The final estimator for Stability Selection is the OLS estimator for the model identified by Stability Selection.
Relevant is also the adaptive variable selection approach of Cho and Fryzlewicz (2012) via Tilting. Note that this approach is conceptually different from AdaSub in the sense that it builds a sequence of nested subsets $S^{(1)} \subset S^{(2)} \subset \ldots \subset S^{(m)}$ by gradually adding explanatory variables based on "tilted" correlations and then selecting $\hat{S} = \arg\min_{S^{(i)}} \mathrm{EBIC}_\gamma(S^{(i)})$.
For the Tilting procedure we consider the version TCS2 based on rescaling rule 2 (see Cho and Fryzlewicz 2012) and we always use the $\mathrm{EBIC}_\gamma$ with $\gamma = 1$ for final model selection, since we observe that the choice $\gamma = 0.6$ yields unreasonably large numbers of false positives. Due to the increasing computational demand of Tilting for larger values of $p$, the maximum number of selected variables is set to 10 and results are only reported for $p \leq 1200$ (i.e. $n \leq 120$). Our simulations confirm the observation in Cho and Fryzlewicz (2012) that Tilting tends to outperform the PC-simple algorithm, thus we do not report the detailed results for the PC-simple algorithm here.

In the case of independent explanatory variables, for $\gamma = 0.6$ the "best" model $\hat{S}_b$ from AdaSub tends to include more false positives than the thresholded model $\hat{S}_{0.9}$, while the number of mean false negatives in $\hat{S}_b$ is only slightly reduced for small sample sizes. Thus, in this situation with a quite liberal choice of the selection criterion $\mathrm{EBIC}_{0.6}$, considering the thresholded model is beneficial and yields more "stable" variable selection than the "best" model according to the criterion identified by AdaSub. On the other hand, for $\gamma = 1$, the $\mathrm{EBIC}_\gamma$ criterion enforces more sparsity and the performance of the thresholded and "best" model from AdaSub is very similar, with slight advantages of the "best" model yielding on average fewer false negatives. For $\gamma = 0.6$, the SCAD selects too many false positives if the sample size is small. On the other hand, Stability Selection with the Lasso tends to reduce the number of mean false positives in comparison to a single run of the Lasso (for $\gamma = 0.6$), but at the price of a larger number of mean false negatives, leading to an undesirable estimation and prediction performance. Furthermore, when the aim is the identification of the true underlying model, Stability Selection is uniformly outperformed by the AdaSub models when considering $\mathrm{EBIC}_1$ as the selection criterion in AdaSub. As might have been expected in a situation with independent explanatory variables, the performance of Forward Stepwise Selection is quite similar to the "best" model identified by AdaSub. In the considered setting it is generally observed that the AdaSub models, Forward Stepwise Selection and the Adaptive Lasso in combination with $\mathrm{EBIC}_1$ tend to yield the best results with respect to variable selection, while the AdaSub models with $\mathrm{EBIC}_{0.6}$ and Tilting with $\mathrm{EBIC}_1$ tend to perform best with respect to estimation and prediction.

Figure 3 summarizes the results of the high-dimensional simulation study for a Toeplitz correlation structure with large correlation $c = 0.9$. In this setting the thresholded model from AdaSub again tends to select significantly fewer false positives than the "best" model from AdaSub (particularly for $\gamma = 0.6$), but at the price of missing some truly important variables (particularly for $\gamma = 1$). It is generally observed that the AdaSub models for $\mathrm{EBIC}_1$ tend to yield the best variable selection results, while the "best" model selected by AdaSub for $\mathrm{EBIC}_{0.6}$ tends to show the best predictive performance. Note that using a more liberal selection criterion is beneficial for prediction in the given situation with large correlations among the explanatory variables. The Adaptive Lasso performs generally well, but the AdaSub models with $\mathrm{EBIC}_1$ show a significantly better variable selection performance. Similarly as in the independence case, although Stability Selection reduces the number of false positives in comparison to the usual Lasso, it is generally outperformed by the AdaSub models.
In contrast to the independence scenario, Forward Stepwise Selection does not perform similarly to AdaSub, but tends to include more false positives on average.
Tilting seems not to be competitive for the situation of highly correlated covariates.
The summary of the results of additional simulations can be found in Section A3 of the supplement to this paper. All in all, the performance of AdaSub is very competitive with state-of-the-art methods like the SCAD or the Adaptive Lasso and can lead to improved results in situations with small sample sizes or highly correlated covariates. Additionally, AdaSub tends to outperform Stability Selection with the Lasso in all of the situations considered. We note that the practical computational time needed for a decent convergence behaviour of AdaSub is generally larger in comparison to the considered competitors except for the Tilting method. However, the computational times for AdaSub (on an Intel(R) Core(TM) i7-7700K, 4.2 GHz processor) are not prohibitively large, with on average less than 30 seconds in all considered settings for up to p = 2000 variables, and we are convinced that the extra computational time spent for AdaSub can pay off in many practical situations, as illustrated in this simulation study.

Sensitivity analysis
In order to illustrate the effects of the tuning parameters q (the initial expected search size) and K (the learning rate) on the performance of AdaSub, we specifically reconsider the high-dimensional simulation setting of Section 5.2 with n = 100 (p = 1000) and n = 200 (p = 2000) for the Toeplitz correlation structure with high correlation c = 0.9 and the (negative) EBIC 0.6 as the selection criterion. For both values of n, 100 datasets are simulated as before and for each dataset AdaSub is applied ten times with T = 5000 iterations and specific choices of q and K: For the first five runs of AdaSub K = n is fixed while q ∈ {1, 2, 5, 10, 15} is varied; for the remaining five runs q = 10 is fixed while K ∈ {1, 100, 200, 1000, 2000} is varied.
In this sensitivity analysis we investigate the efficiency in terms of computational time and the effectiveness with respect to optimizing the given criterion $\mathrm{EBIC}_{0.6}$ for the ten considered choices of $q$ and $K$ in AdaSub. In order to evaluate the optimization effectiveness, we proceed as follows: for each dataset and each run of AdaSub we record the number of iterations needed to identify the "best" model as well as the computational time. In this context, the "best" model refers to the model with the smallest EBIC value among all ten runs of AdaSub for that dataset. The number of times the "best" model has not been identified is also reported (denoted by f for "failures"; in such cases 5000 is depicted as the required number of iterations).
If $K$ is chosen very small (e.g. $K = 1$), the selection probabilities are adapted only slowly, so the algorithm needs many iterations to sample more promising low-dimensional sub-problems, resulting in a slow convergence of the algorithm. If instead $K$ is large (e.g. $K = 2000$), the algorithm might focus too quickly on specific classes of sub-problems and thus often a larger number of iterations is needed to identify the "best" model. If for example an important variable $X_j$ is not selected when it is first considered in the model search (i.e. $j \in V^{(t)}$ but $j \notin S^{(t)}$), then $r_j^{(t)} = \frac{q}{p+2000}$ is close to zero for $K = 2000$, so variable $X_j$ will probably not be considered in the model search for a long time. It can be argued that a sensible choice of $K$ depends on the sample size $n$ of the considered dataset, since larger sample sizes come with less uncertainty regarding the "best" model and a faster convergence of the algorithm might be achieved with larger values of $K$. We recommend to choose the learning rate $K = n$; this choice of $K$ is also supported by the results in Figure 5 regarding the required number of iterations to identify the "best" models. We refer to Staerk (2018, Sections 3.4, 3.5) for additional discussions regarding the choice of $K$ and $q$.
Since AdaSub is a stochastic algorithm, it is desirable that the selected models by AdaSub do not largely vary if one repeatedly runs the algorithm for the same dataset and the same selection criterion, but with possibly different choices of the tuning parameters of AdaSub. In order to investigate the algorithmic stability of AdaSub we consider the same setting as in the high-dimensional simulation study of Section 5.2 and rerun the AdaSub algorithm ten times with T = 5000 iterations for a particular dataset with random choices of K and q from a sensible range. Here, we simulate 20 different datasets for each value of n ∈ {40, 60, . . . , 200} (with p = 10n) for both the independence and Toeplitz correlation structure and consider again the (negative) EBIC γ with γ ∈ {0.6, 1} as the selection criterion, yielding in total 2 × 2 × 10 × 20 × 9 = 7200 different runs of AdaSub. For each application of AdaSub, the initial expected search size q is randomly generated from the uniform distribution U(5, 15) and the learning rate K is randomly generated from the uniform distribution U(n/2, 2n).
In Figure 6 it can be seen that the average relative frequencies of model agreement for both the thresholded and the "best" model are reasonably large across different runs of AdaSub for the same datasets (with random choices of $q$ and $K$). Furthermore, the variances of the sizes of the AdaSub models are small, indicating that the selected models are quite similar even if they differ between certain runs of AdaSub. Note that the algorithmic stability of AdaSub further improves with increasing sample size $n$, i.e. the relative frequencies of agreement tend to one and the variances of model sizes tend to zero.

Figure 6: Results for the independence and Toeplitz ($c = 0.9$) correlation structures: mean relative frequency of model agreement and mean variance of model sizes across the ten runs of AdaSub (averaged over 20 simulated datasets for each sample size) for the thresholded model $\hat{S}_{0.9}$ (AdaSubThres) and the "best" model $\hat{S}_b$ (AdaSubBest) for multiple runs of AdaSub with $\mathrm{EBIC}_\gamma$ for $\gamma \in \{0.6, 1\}$.

Real data example
In this section we consider the application of AdaSub on (ultra)-high-dimensional real data.
For comparison reasons we examine a polymerase chain reaction (PCR) dataset which has already been analysed in Song and Liang (2015). They demonstrate that their Bayesian split-and-merge approach (SAM) performs favourably in comparison to hybrid methods like (I)SIS-lasso and (I)SIS-SCAD, so we do not include the results of these methods here.
(I)SIS-lasso and (I)SIS-SCAD are acronyms for the combination of a screening step with (Iterated) Sure Independence Screening (Fan and Lv 2008) and then a selection step of the final model with lasso and SCAD, respectively. A special intention of this section is to show that it is computationally feasible to apply the AdaSub method even in the situation of ultra-high-dimensional data with ten thousands of explanatory variables and that an additional screening step is not necessarily needed.
We consider the preprocessed PCR data from Song and Liang (2015), available in JRSS(B) Datasets Vol. 77(5), which consist of n = 60 samples (mice) with p = 22,575 explanatory variables (expression levels of genes). Phosphoenolpyruvate carboxykinase (physiological phenotype) is chosen as the response variable. For details concerning this data example we refer to Lan et al. (2006) and Song and Liang (2015). We first apply the AdaSub algorithm with $q = 5$, $K = n$ and $T = 500{,}000$ and choose the (negative) $\mathrm{EBIC}_{0.6}$ as the selection criterion (computational time approximately 20 minutes). The evolution of the values $\mathrm{EBIC}_{0.6}(S^{(t)})$ along the iterations $t$ is given in Figure 7a. The criterion $\mathrm{EBIC}_{0.6}$ seems to be too liberal for the given situation, resulting in high uncertainty concerning the $\mathrm{EBIC}_{0.6}$-optimal model and (possibly) failure of the OIP condition. The thresholded model $\hat{S}_{0.9}$ selected by AdaSub consists of five variables (genes), while the "best" model $\hat{S}_b$ consists of ten variables (genes); see Table 1 for a summary of the results.
In order to compare the predictive performances of the selected models we compute the mean and median leave-one-out-cross-validation squared errors (CV-errors) for each fixed model as described in Song and Liang (2015). Note that the CV-errors of the final models (with variables selected based on the full dataset) generally tend to underestimate the true generalization errors on independent test data (compare Ambroise and McLachlan 2002) and only serve for a comparison of models with the same number of selected variables.
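The following Python sketch shows how such leave-one-out CV squared errors can be computed for a fixed selected model; it reflects our reading of the comparison described above, and the helper name `loocv_errors` is our own.

```python
import numpy as np

def loocv_errors(y, X_S):
    """Leave-one-out CV squared errors for a fixed model with design X_S (sketch).

    Each observation is predicted from an OLS fit (with intercept) on the
    remaining n - 1 observations; the mean and median of the squared errors
    mirror the comparison reported for the PCR data.
    """
    n = len(y)
    Z = np.column_stack([np.ones(n), X_S])
    errors = np.empty(n)
    for i in range(n):
        keep = np.delete(np.arange(n), i)
        beta_hat, *_ = np.linalg.lstsq(Z[keep], y[keep], rcond=None)
        errors[i] = (y[i] - Z[i] @ beta_hat) ** 2
    return errors.mean(), np.median(errors)
```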
It can be seen that the CV-errors of the thresholded model $\hat{S}_{0.9}$ with five genes and the CV-errors of the "best" model $\hat{S}_b$ with ten genes are of the same order or even lower than the errors of the best SAM model with five and ten explanatory variables, respectively (compare Figure 5 in Song and Liang 2015). In order to compare the final model from SAM to a model with six genes selected by AdaSub we proceed in the following way: let $g : P \to P$ be a permutation ordering the variables by their final selection probabilities, i.e. $r_{g(1)}^{(T)} \geq r_{g(2)}^{(T)} \geq \ldots \geq r_{g(p)}^{(T)}$. Assuming no "ties", for $k \in P$ we define $\hat{S}_k := \{g(1), \ldots, g(k)\}$ to be the thresholded model from AdaSub with exactly $|\hat{S}_k| = k$ variables. In Table 1 it can be seen that even though the thresholded model $\hat{S}_6$ from AdaSub with six genes is totally different from the model selected by SAM, it has similar predictive performance.

Table 1: Results for PCR data in terms of selected genes and mean/median CV-errors for the final model selected by SAM as well as the "best" models ($\hat{S}_b$) and thresholded models ($\hat{S}_{0.9}$, $\hat{S}_6$) from AdaSub for $\mathrm{EBIC}_{0.6}$ and $\mathrm{EBIC}_1$.

We now apply AdaSub with $q = 5$, $K = n$ and $T = 50{,}000$ and choose the (negative) $\mathrm{EBIC}_1$ as a selection criterion that enforces more sparsity (computational time approximately 1 minute and 30 seconds). The evolution of the values $\mathrm{EBIC}_1(S^{(t)})$ along the iterations $t$ is given in Figure 7b. Now $\hat{S}_{0.9}$ and $\hat{S}_b$ coincide, both consisting of only one gene (1438937_x_at). Note that for the criterion $\mathrm{EBIC}_{0.6}$ the gene 1438937_x_at is also included in the thresholded model $\hat{S}_{0.9}$, in the "best" model $\hat{S}_b$ and in the thresholded model $\hat{S}_1$ with exactly one gene selected by AdaSub, whereas it is not included in the final model selected by SAM.

Discussion
AdaSub has been introduced in order to solve the natural $\ell_0$-regularized optimization problem for high-dimensional variable selection. If the ordered importance property (OIP) is satisfied, then AdaSub converges to the optimal solution of the generally NP-hard $\ell_0$-regularized optimization problem. Furthermore, AdaSub provides a stable thresholded model even when OIP is not guaranteed to hold. It has been demonstrated through simulated and real data examples that the performance of AdaSub is very competitive for high-dimensional variable selection in comparison to state-of-the-art methods like the Adaptive Lasso, the SCAD, Tilting or the Bayesian split-and-merge approach (SAM). It is notable that AdaSub outperforms Stability Selection with the Lasso in many situations, which underpins the argument that usual subsampling in combination with an $\ell_1$-type method might not be optimal in a high-dimensional situation. On the contrary, the application of adaptive "subsampling" in the space of explanatory variables can efficiently reduce the intractable $\ell_0$-type high-dimensional problem to solvable low-dimensional sub-problems even in very high-dimensional situations with tens of thousands of possible explanatory variables.
In this paper we have focused on variable selection in linear regression models, but the proposed AdaSub method is more general and can for example be applied to any variable selection problem in the framework of generalized linear models (GLMs). The practical problem is then that (to the best of our knowledge) there is no efficient algorithm like "leaps and bounds" which could be used for solving the low-dimensional sub-problems for a GLM within reasonable computational time. In particular, a full enumeration is costly since the ML-estimators for the single models are not given in closed form, in general. A possible solution would be to use heuristic algorithms in place of a full enumeration in order to derive approximate solutions for the sub-problems. It is then desirable to extend the convergence properties of AdaSub also to those situations.
Furthermore, even though we have focused on the EBIC as the selection criterion, the AdaSub method is very general and can be combined with any other selection criterion. It is also possible to use other variable selection methods such as $\ell_1$-type methods (like the Lasso) for "solving" the sampled sub-problems in AdaSub. However, the theoretical results concerning the limiting properties of AdaSub are based on the assumption of optimizing a discrete function on the model space, so the presented limiting properties are not directly applicable for such alternative methods. The investigation of the performance of AdaSub for different choices of the selection procedure is an interesting topic for future research.
Another line of our current research concerns the further exploration of the sufficient condition for the C-optimal convergence of AdaSub and particularly attempts to relax OIP by weaker sufficient conditions. We want to emphasize that in this work we have

Supplement to "High-dimensional variable selection via low-dimensional adaptive learning"
In Section A1 of this supplement we provide the theoretical details concerning the limiting properties of AdaSub, which have been omitted in the paper. In Section A2 we illustrate the application of the AdaSub algorithm on a high-dimensional simulated data example and discuss typical "diagnostic plots" for the convergence of the algorithm. In Section A3 we present further results of the simulation study given in Section 5 of the paper.

A1 Theoretical details
In this section we theoretically investigate the limiting properties of AdaSub (see Algorithm 1) by analysing the evolution (along the iterations $t \in \mathbb{N}$) of the selection probabilities $r_j^{(t)}$, $j \in P$. In order to describe the information available after iteration $t$ of the AdaSub algorithm, we define a filtration $(\mathcal{F}^{(t)})_{t \in \mathbb{N}_0}$ on the underlying probability space $\Omega$ of the process, where $\mathcal{F}^{(t)}$ is the $\sigma$-algebra generated by the sampling indicators of the first $t$ iterations of AdaSub. Then, by the construction of AdaSub, $r_j^{(t)}$ is $\mathcal{F}^{(t)}$-measurable for $t \in \mathbb{N}_0$ and $j \in P$. In addition, for $t \in \mathbb{N}_0$ and $j \in P$ we define $p_j^{(t)}$ as, loosely speaking, the conditional probability (given the history $\mathcal{F}^{(t)}$) that variable $X_j$ is included in the best subset $S^{(t+1)}$ whenever it is considered in the model search ($j \in V^{(t+1)}$). In the following, we will make repeated use of the following generalization of Borel-Cantelli's lemma and the strong law of large numbers, which is due to Dubins and Freedman (1965).
Theorem A.1 (Dubins and Freedman, 1965). Let $(\mathcal{F}_n)_{n \in \mathbb{N}_0}$ be a filtration and $A_n \in \mathcal{F}_n$ for $n \in \mathbb{N}$. For $i \in \mathbb{N}$ define $q_i := P(A_i \mid \mathcal{F}_{i-1})$; then, almost surely, $\sum_{i=1}^{\infty} 1_{A_i} = \infty$ if and only if $\sum_{i=1}^{\infty} q_i = \infty$, and on this event $\sum_{i=1}^{n} 1_{A_i} / \sum_{i=1}^{n} q_i \to 1$ as $n \to \infty$.

A first simple but important observation is that, with probability 1, each variable $X_j$ with $j \in P$ is considered infinitely many times in the model search of AdaSub.
Lemma A.1. Let $j \in P$. Then it holds that $P(j \in V^{(t)} \text{ for infinitely many } t \in \mathbb{N}) = 1$.

Proof. Let $(\mathcal{F}^{(t)})_{t \in \mathbb{N}_0}$ be the filtration given by equation (A.2). Fix $j \in P$ and for $t \in \mathbb{N}$ let $A^{(t)} := \{j \in V^{(t)}\}$. Since $P(A^{(t+1)} \mid \mathcal{F}^{(t)}) = r_j^{(t)} \geq \frac{q}{p + Kt}$ for all $t \in \mathbb{N}_0$, the conditional probabilities sum to infinity, and Theorem A.1 yields that $A^{(t)}$ occurs for infinitely many $t$ almost surely.

The following theorem shows that the convergence of $p_j^{(t)}$ as $t \to \infty$ determines the convergence of $r_j^{(t)}$. This result will be the key ingredient needed for the proof of the $C$-optimal convergence of AdaSub (Theorem 1).

Theorem A.2. Let $j \in P$ and suppose that $p_j^{(t)} \xrightarrow{a.s.} p_j^*$ as $t \to \infty$ for some $p_j^* \in [0, 1]$. Then $r_j^{(t)} \xrightarrow{a.s.} p_j^*$ as $t \to \infty$.
Proof. Fix j ∈ P and suppose that p (t) j a.s.
→ p * j as t → ∞. We apply Theorem A.1 again, but using a different filtration (G (t) ) t∈N 0 , where and and Ω := ω ∈ Ω : By Lemma A.1 we have P (Ω ) = 1. Let and where equality in (a1) holds since for each ω ∈ Ω 1 there exists an increasing sequence (l ω i ) i∈N with l ω i ∈ N and Z (l ω i ) j (ω) = 1 for all i ∈ N. So on Ω 1 we have for t large enough (to avoid division by 0) which holds for those increasing sequences (l ω i ) i∈N that additionally fulfil Z Here we applied Cauchy's limit theorem using the fact that p j = ∞ we can use the same argument as above and obtain On Noting that P (Ω 1 ∪ Ω 2 ) = 1 by assumption and combining the arguments on Ω 1 and Ω 2 , we conclude that on Ω we have and Remark A.1. Note that S * = f C (V ) for all V ⊆ P with S * ⊆ V . Therefore (8) always holds for i = s * since k s * ∈ S * . Furthermore, we have Remark A.2. Note that the ordered importance property (OIP') of Definition A.1 is a weaker condition than the ordered importance property (OIP) of Definition 4.2 in the main paper (i.e. OIP implies OIP'). Indeed, equation (8) in the paper implies equation (A.5) since the required condition is only imposed on a generally smaller set of subsets V .
The next theorem shows that OIP' (and thus also OIP) is indeed a sufficient condition for the C-optimal convergence of AdaSub to S*.
Theorem A.3. Suppose that the ordered importance property (OIP') is satisfied. Then AdaSub converges to the C-optimal model in the sense of Definition 4.1.
Proof. Let (k_1, . . . , k_{s*}) be a permutation of the elements of S* satisfying OIP', with corresponding sets N_0, . . . , N_{s*}. By OIP' we have p_{k_1}^(t) → 1 almost surely and therefore, by Theorem A.2, r_{k_1}^(t) → 1 almost surely as t → ∞. Note that, by the independence of the Bernoulli trials in AdaSub, the conditional probability (given F^(t) and j ∈ V^(t+1)) that additionally k_1 ∈ V^(t+1) equals r_{k_1}^(t) for all t ∈ N_0. Thus we conclude with the law of total probability that p_j^(t) → 0 almost surely for all j ∈ N_1, and by Theorem A.2 we also obtain r_j^(t) → 0 almost surely as t → ∞ for all j ∈ N_1. Note that again by the independence of the Bernoulli trials in AdaSub the conditional probability that k_1 is included in V^(t+1) converges almost surely to 1. Thus we similarly conclude with the law of total probability that p_{k_2}^(t) → 1 almost surely, and by Theorem A.2 we also obtain r_{k_2}^(t) → 1 almost surely as t → ∞.
Proceeding by induction we similarly conclude that for each i = 2, . . . , s* − 1 we have r_j^(t) → 0 almost surely as t → ∞ for all j ∈ N_i \ N_{i−1}, and that for each i = 3, . . . , s* − 1 we have r_{k_i}^(t) → 1 almost surely as t → ∞. Note that k_{s*} ∈ S* = f_C(V) for all V ⊆ P with {k_1, . . . , k_{s*}} ⊆ V and that N_{s*} = P \ S* (see Remark A.1). Therefore, by using the same arguments, we also obtain r_{k_{s*}}^(t) → 1 almost surely and r_j^(t) → 0 almost surely as t → ∞ for all j ∈ N_{s*} = P \ S*. This completes the proof.
Corollary A.1. If |S*| ≤ 1, then OIP is satisfied and therefore AdaSub converges to the C-optimal model.

Corollary A.2. Let S* = {j_1, . . . , j_{s*}} and let D = {l_1, . . . , l_d} ⊆ S* be of maximal cardinality |D| = d such that there exists a permutation (k_1, . . . , k_d) of (l_1, . . . , l_d) for which condition (8) of the main paper holds for all i = 1, . . . , d, where the sets N_0, . . . , N_d are defined as in Definition 4.2. Then for all j ∈ D we have r_j^(t) → 1 almost surely as t → ∞.

Proof. The proof is along the lines of the proof of Theorem A.3, using the (partial) permutation (k_1, . . . , k_d) of the variables in D ⊆ S* instead of the (full) permutation (k_1, . . . , k_{s*}) of all variables in S*.

A2 Illustrative example of AdaSub
In order to illustrate the performance of AdaSub in a high-dimensional set-up, we consider a simulated example with p = 1000 and n = 60. We generate one particular dataset with design matrix X ∈ R^{n×p} whose rows are simulated via X_i,* ∼ N_p(0, Σ), where Σ_kl = 0 for k ≠ l and Σ_kk = 1. Furthermore, let β^0 = (0.4, 0.8, 1.2, 1.6, 2.0, 0, . . . , 0)^T ∈ R^p be the true vector of regression coefficients with active set S_0 = {1, . . . , 5}. The responses are simulated via Y_i ∼ N(X_i,* β^0, 1), i = 1, . . . , n. We adopt the (negative) extended BIC (EBIC_γ) as the criterion C and consider the tuning parameter choices γ = 0.6 and γ = 1 in EBIC_γ. For both cases, we apply AdaSub with T = 10,000 iterations on the same dataset simulated as above and choose q = 10 and K = n as the tuning parameters of AdaSub. We present some typical "diagnostic plots" for this simulated data example, which are generally very helpful for examining the convergence of the AdaSub algorithm. The first plot shows the values of EBIC_γ(S^(t)) along the iterations t for γ = 0.6 and γ = 1 (recall that S^(t) = f_C(V^(t)) denotes the "best" submodel contained in V^(t)), while the red lines indicate the values of EBIC_γ for the thresholded model Ŝ_0.9. For γ = 0.6 it is obvious that the algorithm does not converge to the "best" sampled model Ŝ_b = arg min{EBIC_0.6(S^(1)), . . . , EBIC_0.6(S^(T))} and thus OIP' does not hold here.
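Before discussing the results in more detail, the data-generating set-up just described can be reproduced along the following lines. The EBIC_γ formula used in this sketch, EBIC_γ(S) = n log(RSS_S/n) + |S| log n + 2γ|S| log p, is one common form for Gaussian linear models and is an assumption of ours; the exact implementation used in the paper may differ in details (e.g. the handling of the intercept).

```python
import numpy as np

rng = np.random.default_rng(123)
n, p = 60, 1000
beta0 = np.zeros(p)
beta0[:5] = [0.4, 0.8, 1.2, 1.6, 2.0]          # true active set S_0 = {1,...,5}

X = rng.standard_normal((n, p))                # Sigma = identity matrix
y = X @ beta0 + rng.standard_normal(n)         # error variance sigma^2 = 1

def ebic(S, X, y, gamma):
    """One common form of EBIC_gamma for a subset S of column indices;
    an intercept-free least-squares fit is used here for simplicity."""
    n, p = X.shape
    if len(S) == 0:
        rss = np.sum(y ** 2)
    else:
        XS = X[:, list(S)]
        coef, *_ = np.linalg.lstsq(XS, y, rcond=None)
        rss = np.sum((y - XS @ coef) ** 2)
    return n * np.log(rss / n) + len(S) * np.log(n) + 2 * gamma * len(S) * np.log(p)

print(ebic([1, 2, 3, 4], X, y, gamma=0.6))     # EBIC of {X_2,...,X_5} (0-based column indices)
```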
The "best" model identified by AdaSub is given byŜ b = {2, 3, 4, 5, 519, 731, 950}, while the thresholded modelŜ 0.9 = {2, 3, 4, 5, 950} with threshold ρ = 0.9 does not include the "noise variables" X 519 and X 731 and is therefore closer to the true underlying model. This is an example, where the thresholded model from AdaSub reduces the number of false positives in a situation where the criterion used is too liberal (compare Corollary A.2). On the other hand, for γ = 1, the algorithm appears to have converged against the EBIC 0.6 -optimal model; the "best" sampled modelŜ b and the thresholded modelŜ 0.9 agree:Ŝ b =Ŝ 0.9 = {2, 3, 4, 5}. This indicates, that the model identified by AdaSub is "stable" in the sense of OIP.
The selection probability r_1^(t) tends to zero and hence the "signal variable" X_1 is not selected in either case (note that β_1 = 0.4 is quite small). Additionally, the evolution of the selection probabilities r_j^(t) for j ∈ {519, 731, 950} is shown. While for γ = 1 these selection probabilities all tend to zero as desired, the behaviour is different for γ = 0.6: r_950^(t) tends to one, while r_519^(t) and r_731^(t) seem to converge to values close to, but not exactly, zero. This reflects a situation where OIP does not hold and the variables X_519 and X_731 are not "stable" in the sense of OIP. The AdaSub algorithm automatically adjusts the expected search sizes, which, after some time, start to decrease with the number of iterations. For γ = 0.6, the search sizes are a bit larger, since the criterion EBIC_0.6 enforces less sparsity than EBIC_1. The computation times for T = 10,000 iterations of AdaSub were approximately 15.1 seconds for γ = 0.6 and 13.5 seconds for γ = 1.
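Diagnostic plots of the kind described in this section can be produced from the stored selection probabilities. The sketch below assumes that a T×p array R with R[t, j] = r_j^(t) has been recorded during the run (this array, the toy values and the plotting function are our own illustrative choices); it displays the evolution of selected selection probabilities and of the expected search size ∑_j r_j^(t), i.e. the conditional expectation of |V^(t+1)| given F^(t).

```python
import numpy as np
import matplotlib.pyplot as plt

def diagnostic_plots(R, track=(0, 1, 2)):
    """R: array of shape (T, p) with R[t, j] = r_j^(t).
    Left panel: evolution of selected selection probabilities.
    Right panel: expected search size sum_j r_j^(t) over the iterations."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    for j in track:
        ax1.plot(R[:, j], label=f"variable {j + 1}")
    ax1.set_xlabel("iteration t")
    ax1.set_ylabel("selection probability")
    ax1.legend()
    ax2.plot(R.sum(axis=1))
    ax2.set_xlabel("iteration t")
    ax2.set_ylabel("expected search size")
    fig.tight_layout()
    return fig

# toy example: probabilities drifting from the initial value 0.3 towards 0 or 1
T, p = 500, 5
t = np.linspace(0.0, 1.0, T)[:, None]
targets = np.array([1.0, 1.0, 0.0, 0.0, 0.05])   # toy limiting probabilities
R = 0.3 + (targets - 0.3) * t
diagnostic_plots(R, track=(0, 2, 4))
plt.show()
```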

A3 Additional results of simulation study
We present further results of the simulation study given in Section 5 of the paper. The low- and high-dimensional simulation set-ups are as described in Section 5. In particular, the design matrix X = (X_i,j) ∈ R^{n×p} is simulated via X_i,* ∼ N_p(0, Σ). Here, we consider the following correlation structures between the explanatory variables induced by the matrix Σ ∈ R^{p×p} (a code sketch for generating these matrices is given at the end of this section):

(a) Toeplitz-Correlation Structure: For some c ∈ (−1, 1) let Σ_k,l = c^{|k−l|} for all k, l ∈ {1, . . . , p} (so that Σ_k,k = 1).

(b) Equal-Correlation Structure: For some c ∈ (0, 1) let Σ_k,l = c for all k ≠ l and Σ_k,k = 1.
(c) Block-Correlation Structure: For some c ∈ (0, 1) and a fixed number of blocks b ∈ N let Σ_k,l = c for all k ≠ l with (k − l) mod b = 0, let Σ_k,l = 0 otherwise, and let Σ_k,k = 1.

The corresponding figures compare the thresholded model Ŝ_0.9 (AdaSubThres) and the "best" model Ŝ_b (AdaSubBest) from AdaSub with the BIC-optimal model S* (Best Subset BIC) in terms of the mean number of false positives/false negatives, the relative frequency of selecting the true model S_0, the relative frequency of agreement between the AdaSub models and S*, the Mean Squared Error (MSE) and the Root Mean Squared Prediction Error (RMSE) on an independent test set of sample size 100. The relative frequency of agreement between the models selected by AdaSub and the BIC-optimal model increases towards one when the sample size increases, but the "convergence" is markedly slower than in the independent case (see Figure 1). This shows that the models from AdaSub may yield different (and, in the given setting, preferable) results in comparison to the BIC-optimal model, even if the sample size is moderately large.
Next, we consider an equal-correlation structure (correlation c = 0.7) and a block-correlation structure (b = 10 blocks and c = 0.5 as the correlation within blocks). The corresponding results for the low-dimensional setting (p = 30) with (a) the equal-correlation structure and (b) the block-correlation structure again compare the thresholded model Ŝ_0.9 (AdaSubThres) and the "best" model Ŝ_b (AdaSubBest) from AdaSub with the BIC-optimal model S* (Best Subset BIC) in terms of the mean number of false positives/false negatives, the relative frequency of selecting the true model S_0, the relative frequency of agreement between the AdaSub models and S*, the Mean Squared Error (MSE) and the Root Mean Squared Prediction Error (RMSE) on an independent test set of sample size 100. The results further demonstrate that the performance of AdaSub is very competitive in comparison to the other methods considered.
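Finally, the correlation structures (a)-(c) used in this section can be generated, for instance, as follows (a minimal numpy sketch; the function names are ours).

```python
import numpy as np

def toeplitz_corr(p, c):
    """Sigma_{k,l} = c^{|k-l|}."""
    idx = np.arange(p)
    return c ** np.abs(idx[:, None] - idx[None, :])

def equal_corr(p, c):
    """Sigma_{k,l} = c for k != l, Sigma_{k,k} = 1."""
    return np.full((p, p), c) + (1.0 - c) * np.eye(p)

def block_corr(p, c, b):
    """Sigma_{k,l} = c for k != l with (k - l) mod b == 0, 0 otherwise; diagonal 1."""
    idx = np.arange(p)
    same_block = ((idx[:, None] - idx[None, :]) % b) == 0
    Sigma = np.where(same_block, c, 0.0)
    np.fill_diagonal(Sigma, 1.0)
    return Sigma

# simulate rows X_{i,*} ~ N_p(0, Sigma), e.g. for the block structure with b = 10 and c = 0.5
rng = np.random.default_rng(7)
Sigma = block_corr(p=30, c=0.5, b=10)
X = rng.multivariate_normal(np.zeros(30), Sigma, size=100)
```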