Optimal selection of sample-size dependent common subsets of covariates for multi-task regression prediction

An analyst is given a training set consisting of regression datasets $D_j$ of different sizes, which are distributed according to some $G_j$, $j=1,\ldots,\cal J$, where the distributions $G_j$ are assumed to form a random sample generated by some common source. In particular, the $D_j$'s have a common set of covariates and they are all labeled. The training set is used by the analyst for selection of subsets of covariates denoted by ${P}^*(n)$, whose role is described next. The multi-task problem we consider is as follows: given a number of random labeled datasets (which may be in the training set or not) $D_{J_k}$ of size $n_k$, $k=1,\ldots,K$, estimate separately for each dataset the regression coefficients on the subset of covariates ${P}^*(n_k)$ and then predict future dependent variables given their covariates. Naturally, a large sample size $n_k$ of $D_{J_k}$ allows a larger subset of covariates, and the dependence of the size of the selected covariate subsets on $n_k$ is needed in order to achieve good prediction and avoid overfitting. Subset selection is notoriously difficult and computationally demanding, and requires large samples; using all the regression datasets in the training set together amounts to borrowing strength toward better selection under suitable assumptions. Furthermore, using common subsets for all regressions having a given sample size standardizes and simplifies the data collection and avoids having to select and use a different subset for each prediction task. Our approach is efficient when the relevant covariates for prediction are common to the different regressions, while the models' coefficients may vary between different regressions.


A general description of the problem
This paper concerns data consisting of a class of regression datasets, and a multi-task of predictions in different regressions.The emphasis is on selection of common subsets of covariates for prediction in the different regressions, which depend on the regression datasets' sample sizes.As documented in the classical model selection literature, the size or dimension of models for prediction should depend on the sample size, for example through a penalty function that depends on the dimension of the model and the sample size.References and further details will be provided following a description of our motivating problem.
In order to improve service, a hospital wants to develop a tool for predicting the actual duration of planned visits of any particular patient to any doctor in the hospital.Given a sample of size n of different patients' visits to any particular doctor (n can vary between doctors) with covariates such as the past durations of the patients' visits, the nature of the visits, the time scheduled etc., and a response variable, which is the actual duration, the goal is to predict the duration of the next visit of a given patient to the particular doctor.Our objective is to select an optimal subset of covariates, denoted by P * (n), to be used for the prediction of a future visit's duration.We shall provide a procedure that selects the optimal set with high probability.The number of covariates in the set P * (n) depends naturally on n, with a large n allowing more variables in the regression, taking account of the need to find the right balance between efficiency of models, and the pitfall of overfitting.For any given n, we want to select a standard set of covariates to be used for any doctor in the hospital for whom we have a regression dataset of size n.However, we allow different regression coefficients for different doctors since different doctors may be influenced differently by the patient's background.An intercept for each doctor represents her or his general tendency for longer or shorter visits.
Standardization is desirable for more than one reason, and will be discussed in more detail later.First, it obviously simplifies the data collection and maintenance.Second, performing separate model selection for each doctor may be computationally demanding.Third, model selection is notorious for being difficult and to require much data.The idea we present is to perform model selection on the basis of a sample of doctors as described below, and thus borrow strength from different datasets and obtain better subset selection.
In order to perform the subset selection we assume we have a training sample of J doctors each consisting of a dataset containing the covariates and the response (actual duration) of N j visits, j = 1, . . ., J .Under certain assumptions, we use these data to select subsets of covariates for different value of n.We then use the selected subset for prediction for any doctor (who in general may not be in the training sample) on the basis of a sample of visits (of some size n) as described above.When predicting for a doctor in the sample, say doctor j, it is natural to take n = N j .The subset selection procedure and its properties are the focus of this paper.
One may suggest to concatenate the whole training sample and perform a single regression with the same coefficients for all doctors, but allowing a different intercept for each doctor.In certain cases this may result in good subset selection.However, suppose, for example, that for about half of the doctors the covariate "duration of previous visit" has a positive coefficient in the regression and for the other half it is negative.It is easy to conceive of a justification for each possibility.In this case, this covariate may not enter the model if the regression is computed by concatenating the data into a single model.However, allowing different regression coefficients for different doctors, the variable may enter the model and contribute to the prediction with a different coefficient for different doctors.Thus, allowing the regression coefficients to vary between individual regressions adds flexibility to the model, and in particular it improves the prediction in our dataset (verified by cross validation, see Section 6.4).Of course an informal screening of variables is often done, either at the stage of collecting the data, or before conducting formal variable selection and analysis.In particular, researchers may decide to avoid certain variables or interaction terms in order to keep the selection process feasible.
The classical theory of model selection in regression deals with the selection of a subset of covariates (or features) that are useful for prediction based on a single regression dataset.Numerous model selection methods have been suggested; AIC (Akaike [1]), Mallows C p (Mallows [15]), and FIC (Claeskens and Hjort [8]) are prominent examples.These methods apply to a single regression dataset of a given size, for which a model is to be selected and then used for prediction.For a well-known Bayesian approach to model selection, see Schwarz [23].A large body of literature emerged following these articles.In the setup of a single dataset, serious issues of optimality arise; see, e.g., Yang [29].
Breiman's celebrated paper [4] starts with a similar training set of regression datasets with similar assumptions, however, both the subset of covariates and the regression coefficients used for prediction are common to all regressions.A very closely related setup appears in Obozinski, Taskar and Jordan [16] which "addresses the problem of recovering a common set of covariates that are relevant simultaneously to several classification problems."The paper focuses on classification or discrimination problems, but regression is also mentioned.References cited in this paper, which deal with the same problem, are referred to as "transfer learning" or "multi-task learning" in the machine learning literature.They demonstrate that learning multiple related tasks from data simultaneously can be advantageous in terms of predictive performance relative to learning these tasks independently.In Obozinski, Taskar and Jordan [16] the goal is to decide which variables are "relevant to the overall class of prediction problems without making a commitment to a specific value of a parameter," that is, allowing different parameters for the different prediction tasks, and to "borrow strength across multiple estimation problems in order to support a decision that a covariate is to be selected."A large number of papers and review articles on multi-task learning have appeared, mostly in the past decade.For a recent survey containing numerous applications and references, see, for example, Zhang [31].In the latter paper transfer learning refers (in our setup) to predicting for a single target file that may not be in the training sample, while multi-task learning refers to predicting for every dataset in the training sample.Here we consider both possibilities.
Our paper differs from Obozinski, Taskar and Jordan [16] and more generally from the multi-task literature in several ways: our emphasis is on regression, our focus is not on algorithms but rather on asymptotic consistency and optimality type results; however, the main difference is that in the spirit of model selection, the selected common covariate subsets, depend on the sample sizes of the different regression prediction tasks, thus avoiding under and overfitting.The issue of different sample sizes n appears in Zhang [31] in a context where face databases have different image sizes; however the proposed solution is to project these databases to a common subspace, which results in loss of information, rather than taking the task size into account as we propose.

A formal setup
Our setup is formalized as follows.We assume we have a training sample T of regression datasets all having the same set of covariates.Thus T = {D j : j = 1, . . ., J } with D j = {(X ij , Y ij )}, i = 1, . . ., N j , j = 1, . . ., J , where X ij ∈ R d is a column vector of d random covariate values of the ith subject in the jth dataset, and Y ij ∈ R is a response variable.For each j, the N j vectors (X ij , Y ij ) are iid from some distribution G j ∈ G, where G is a set (population) of distributions of size |G| = K.We assume that J ≤ K ≤ ∞, and that {G j } J j=1 is a random sample from G. Now consider a new regression dataset of some size n, D J = {(X iJ , Y iJ )} n i=1 distributed according to G J , a random element of G, which may but need not be in the training set T .If D J is in T then it is natural to assume that n = N J .
We consider the following task: for (X, Y ) ∼ G J independent of the above datasets, we want to predict Y from a given X using the sample D J .It is natural to be interested in the multi-task of prediction for many random D J 's; however, it suffices to study the prediction error for one such D J .Since G J is random, we clearly need to consider different possible values of n, and random covariates.Our treatment of random covariates is based on a generalization of Mallows C p to random covariates that was inspired by notes generously given to us by Larry Brown (see [5]).
As usual, the prediction model involves two components, the subset of variables to be used, and their regression coefficients.The regression coefficients will be estimated by standard least squares based on the sample D J , and thus will vary between D J 's.For the subset selection, a task that is known to require large samples, we shall pool the whole training set.Such pooling can be efficient if the set of distributions G, which may be finite or generated by a probability model (superpopulation model), is sufficiently homogeneous (to be discussed in Section 3) in a way that justifies a common model selection.Besides some technical conditions for such homogeneity, a user would have to apply common sense to decide if one can borrow strength and learn the subset selection from the pooled sample T rather than from the individual dataset D J .As mentioned before, numerous examples appear in Zhang [31] and the references therein.Our goal is to select for each possible value of n, a subset of covariates based on the pooled training sample T and use it for prediction, using least squares estimates, computed for each regression dataset D J separately.Thus, we select subsets for prediction that are common to all regressions having the same sample size, but we allow different parameters for the regressions.
Given a distribution ) be the conditional expectation under G j .We do not assume a linear model or any particular model for m j when we analyze our procedures, but for the sake of prediction we shall approximate m j (X ij ) by a linear function X ij β j , where β j is the vector of projection coefficients under G j .We shall require minimal assumptions on G j such as moment conditions, to be specified later.
When m j (X ij ) is not linear then X ij is not ancillary, and its marginal distribution matters; see, e.g, Buja et al. [6].In this case, conditioning on X or considering it as nonrandom leads to loss of information.For a recent discussion on fixed versus random X in the context of model selection see Rosset and Tibshirani [20].When m j (X ij ) = X ij β j , allowing linear models with different coefficients in different regressions is called heterogeneous regression ANCOVA; see, e.g., Rutherford [21], Chapter 8, and the references therein.Related models appear under titles such as repeated measure regression (see, e.g., Vonesh and Chinchilli [26]), often with mixed effects.
Given datasets {X ij , Y ij } from G j , consider the subset of covariates P of size p ≤ d.We may sometimes refer to P as a model.Let X (P ) ij denote the subvector of X ij consisting of the covariates in P .Let β (P ) j denote the linear projection coefficient vector and let β (P ) j,n be its least squares estimator based on n observations, where we assume that n > p.In Section 2.4 we discuss the case of discrete covariates in which exact (or perfect) multicollinearity may occur with a positive probability, and the least squares estimators are not unique.
For now we focus on the case that J = K, that is we observe all datasets in G. (In Section 3.3 we consider prediction of an out of (the training) sample dataset, in the spirit of transfer learning.)Consider prediction for a regression dataset of some size n, often referred to as the task size, which will be taken to equal N j for the task of predicting for the dataset D j in the multi-task problem of prediction for datasets in the training sample.The linear prediction of a response Y , based on n observations from a random G J ∈ G, and when the subset P is used is given by (X (P ) ) β (P ) J,n .In order to select a subset for regression tasks of size n we make the counterfactual assumption that all datasets in the training sample are of size n.Then the corresponding expected prediction error or risk is given by where (X, Y ) ∼ G j independently of β (P ) j,n , and the expectation on the right-hand side of (1.1) applies to both (X, Y ) and β (P ) j,n .This expression has the alternative interpretation where instead of predicting for a random D J we predict for all D j , j = 1, . . ., J , assuming that there is a common sample size n, and now R(n, P ) represents the average prediction error.With either interpretation, our goal is to estimate R(n, P ) and related quantities, in order to select (with high probability) an optimal common subsets P * (n) for prediction for any dataset in the training set (taking n = N j ) and also for out of sample datasets when we later consider the case that J < K, on the basis of n observations, where the subset selection is based on the pooled training set T .
It is natural to choose common subsets for prediction if the different regression datasets arise from a common source; besides efficiency in subset selection due to pooling, common subsets lead to computational efficiency.However, we assert that in a variety of situations (but obviously not always) it is advantageous to choose standard common sets of covariates to be used for prediction even if the regression datasets do not arise from a homogeneous source.In this case we are trying to select compromise subsets that can be used for the different regressions (and may not be optimal for some or any of them).For example, a large health organization with K clinics often recommends a common standard set of tests for the purpose of certain diagnoses, thereby simplifying the instructions to participating clinics and doctors.In our notation, the set of tests is based on a sample of size J , which is in general ≤ K.The regression coefficients used for prediction based on this common set of tests may differ between communities or doctors, who may attach different weights to different tests.Concerning economics models, consider the OECD, where J = K = 37 (as of 2021) since all countries are sampled and economic prediction are made in all of them.The OECD attempts to standardize sets of common economic indicators to be used for economic predictions (e.g., forecast of GDP growth) for its member countries, which are to be estimated by their bureaus of statistics by the same methodology.In general, it makes sense to assume that in different countries, economic variables may have different weights in economic predictions.For example, oil prices must weigh differently for economic predictions between oil importing and exporting countries.

GENO, a measure of usefulness
In order to compare the quality of different models, we introduce a new measure, GENO, which is inspired by the measure ENO (equivalent number of observations) of Erev, Roth, Slonim, and Barron [10].To describe ENO in the context of experimental economics, consider an experiment where a game is played by a sample of subjects in order to study the average behavior of players, and predict future play.ENO is based on a comparison between the empirical statistics of past actions of the players, and a given model for predicting players' actions.The more subjects who have already played the game, the better the estimate that past play will give of the mean behavior of the subject population on this game.ENO measures the usefulness of the prediction of a particular model by asking how many prior observations of subjects playing the game, say m, would be needed to make the empirical statistics as accurate as the prediction by the model.ENO of a model is this number m.
While ENO compares a given model to the relevant empirical model, GENO generalizes ENO to comparing any two data-based models.Thus, let now R(n, P ) denote the prediction error of some model P in a very general setup.For our present purposes, one can have regression models in mind, with R(n, P ) defined above; however, the definition of GENO below is more general.Given two models P and Q , define GENO(n; P , Q ) to be the value of m satisfying R(m, Q ) = R(n, P ).In words, GENO(n; P , Q ) is the number of observations required in order for a model based on the covariates in Q to predict equally well as a model based on the covariates in P , when the parameters of the latter model are estimated on the basis of n observations.In Section 4 we shall use an approximation to R(n, P ) to formally define and estimate GENO.Such a measure allows us to decide between a set of covariates that may be good for prediction but costly to obtain, and another set of more accessible covariates that we may consider using, even if their predictive value is lower and therefore may require more observations.See the recent paper -Andrade et al. [2] and the references therein for a formal Bayesian approach to minimizing cost of classification in the presence of costly covariates.A comparison in terms of the sample size required by one model (for prediction, testing or estimation) to be as good as another with a given sample size is closely akin to the notion of Pitman efficiency; see, e.g., Zacks [30].Our approach to quantifying the value of a model is close in spirit, but not in detail, to the work of Lindsay and Liu [14] who define a "model credibility index" as the sample size N * , where data from the model and from the true generating process are indistinguishable in the sense that for a given goodness of fit test of the model with N * observations, the probability of rejection under the model is, say, 50%.
A different approach to measuring the usefulness of a model is by Akaike weights, which are defined by the likelihood function of each model evaluated at the MLE, standardized by their sum; see Anderson and Burnham [3] (Page 75), where these weights are referred to informally as "the weight of evidence in favor of model."The AIC weights are sometimes [e.g., 27] interpreted as probabilities of a model to be the best in terms of the AIC criterion.With a uniform prior on the set of models this interpretation could be meaningful if we believe that one of the models is true.Otherwise, the weights are still informative, but their interpretation is less clear.GENO, on the other hand, is measured in units of number of observations, which are easy to grasp.Another advantage of GENO is that it accounts for the number of observations in the data to which the model is applied.This makes sense as the usefulness of a model for a given dataset is also a function of the size of the data.
In Section 2 we restate the problem and provide some basic results and notation for a single regression dataset, as a preliminary to the main part, Section 3, where we consider the multi-task problem of model selection for several regression datasets.In Section 4 we discuss the GENO measure of the relative quality of models.In Section 5 we demonstrate the results by simulations, and in Section 6 we discuss an application to a medical management problem of predicting service times, that is, visit durations of patients in hospital.Section 7 is an appendix containing the proofs.Appendix B summarizes the notation used in the paper.

Prediction error with random covariates: A single dataset
We start with |G| = J = 1, that is, with selection of a model for prediction given a training set consisting of a single regression dataset.This case is treated in the standard model selection literature.Although our real interest is in results for large J , we consider J = 1 as a starting point which simplifies the notation while allowing us to present some of the ideas used in the general case.For now our training set T consists of a single dataset D 1 = {(X i , Y i )} N 1 of N := N 1 iid pairs from some distribution G := G 1 .The distinction between N and n may seem artificial in this case, but we shall make it and consider prediction based on any sample size n for later purposes.We use D 1 for selecting a subset of covariates for linear prediction of a future Y from X distributed by G, with parameters that will be estimated using a dataset We derive some results that will be needed for the general case J > 1, to be discussed in Section 3. Subsets of the covariates are denoted by letters like P , Q , etc., and their sizes by p and q, etc.We refer to the associated linear model as model P .For now we fix P and suppress it in most of our notation and instead of X (P ) we write X and assume it is in R p .The same holds for other vectors and matrices.Later we shall assume that X ∈ R d , and consider different subsets of covariates.

Preliminaries
1 of iid pairs from some distribution G, where X i is a column vector in R p , i = 1, . . ., n.Let (X, Y ) without indexes denote one such "generic" observation, distributed independently of the dataset D as any (X i , Y i ) according to G. The first entry of each X i may be 1, so that the models may include an intercept term.
Set Q := E(XX ) and let Y n ∈ R n denote the n-column vector of the Y i 's, and set m(X) := E(Y |X) for some function m.Assuming that both X and Y have finite second moments and that Q is invertible, the best linear approximation of m(X) is X β, where The same projection coefficient vector β also satisfies β = arg min b E(Y −X b) 2 ; hence X β is the best linear predictor of Y .Our assumptions imply that the minimizer β is unique.Set e i := Y i − X i β, with β defined in (2.1).By (2.25) in Hansen [12], where most of our notation and the standard results we use can be found, we have E(Xe) = 0, where again X and e are "generic" X i and e i .Define X n to be the n × p matrix whose n rows are the row vectors X i .In this common notation the standard linear model will be written as X n β, whereas each of its rows as X i β, and X n X n = n i=1 X i X i .Under standard assumptions, the least squares estimator is 2) The assumption that (X n X n ) −1 exists (with probability 1) holds if we assume that X has a continuous distribution.For the existence of certain moments required later we shall assume that the distribution of X is a mixture of normals.
See Hansen [12], pp.102-3, for a discussion of the existence of (X n X n ) −1 and its moments.Without assuming continuity, the assumption that Q is invertible implies that (X n X n ) −1 exists with probability converging to 1 as n → ∞; however, for discrete distributions this probability is smaller than one, and thus β n may not exist, and has no finite moments, a "conundrum" in the words of Hansen.
In Section 2.4 we extend our discussion to discrete covariates by conditioning on the existence of a bounded inverse, and showing that under simple conditions this amounts to neglecting a set having an exponentially small probability, thus providing some solution to the above conundrum.We now assume that (X n X n ) −1 exists and has sufficiently many moments so that expressions like (2.4) below are finite.If X and Y have finite fourth moments, then by Theorem 7.3 in Hansen [12] √ n( where W := E(XX e 2 ), a p × p matrix assumed to be positive definite.For a single distribution G and a dataset D as above, the prediction error incurred by a model P based on all p covariates with linear regression coefficients computed from a sample of size n is Later we assume that X ∈ R d and set X (P ) ∈ R p to be the vector consisting of the covariates of X in the subset of covariates P of size p.When we consider several models, we set, for example, X n to be the n×p matrix whose n rows are the row vectors n Y n , W (P ) := E(X (P ) X (P ) e 2 ), and likewise for Q, etc.We then have (2.5)

Equally good sequences of models
When selecting the best model for a given n, that is, the subset of covariates that minimizes R(n, P ), we should take into account that different samples yield different estimators β n , leading to different prediction errors; thus, there is no gain in optimizing more precisely than the difference between such errors.Consider the prediction error conditioned on the estimated regression coefficients R n, P ; , expanding the latter term, and taking conditional The first term in (2.6) is a constant and the second equals ( ; see (2.3).This means that R n, P ; β n varies between different β n by a quantity of order O p (1/n).Hence, if two sequences of models P (n) and Q (n) satisfy we consider them to be equally good.If P (n) is best in the sense of minimizing R(n, P (n)) and Q (n) is equally good, we say that Q (n) is adequate, and rather than choose "best models" we settle for adequate models.See, e.g., Nevo and Ritov [17] for a related approach.

Versions of Mallows C p for random covariates
Given a dataset D 1 = {(X i , Y i )} of size N (which constitutes the training set when J = 1), we first estimate the prediction error (2.4) incurred if prediction is to be based on n observations.We shall consider two types of asymptotics: one when n is considered to be large, and the other when n is fixed, and N is large.For now J = 1; asymptotics in J will be considered later.
We use the following notation: set Note that U N is not a statistic and that E(U N U N ) = W since E(Xe) = 0 implies that the expectations of mixed terms vanish.In all the vectors and matrices above and below the index P was suppressed unless otherwise indicated.
Akin to (2.4), we define the approximate prediction error to be where tr denotes trace, and equation (2.9) of Theorem 2.1 below shows that it is an approximation to the quantity R(n, P ) of (2.4).
2 and the trace is an approximation of the difference with precision of order O(1/n 3/2 ); see (2.9).Next we define the statistic C (P ) (n, N ) as an estimator of AR(n, P ) by as shown in (7.6) and (7.7).The fact that tr( V N ) is a biased estimator of tr(V) entails a bias of order 1/n for the estimator C (P ) (n, N ) as an estimator of AR(n, P ).We shall study the latter estimator, and when we use it, we shall apply a standard jackknife correction for its bias; see Efron [9], Equation (2.8).We denote the bias-corrected C (P ) (n, N ) by C (P ) (n, N ).It suffices to bias-correct only tr( V N ) in (2.8) as explained in the fourth paragraph after Theorem 2.1.The superscript P in the statistic C (P ) refers to the set of covariates in P and for now we have X (P ) i = X i ∈ R p and the subset P is fixed and suppressed.The statistic C (P ) (n, N ) is a counterpart of Mallows C p , but here we consider random covariates.Furthermore, we distinguish between the number N of observations used for the choice of the model and the sample size n of observations used for estimating the model's parameters.The classic Mallows C p concerns nonrandom covariates, where n = N , and the true model is assumed to be linear.To see the relation to Mallows C p , assuming a homoskedastic linear model, we have that e i = Y i − X i β is uncorrelated with the covariates, with variance σ 2 , and If we use σ 2 p as an approximation of tr( V N ) (and therefore we only have to estimate σ 2 rather than a trace), then C (P ) in the case N = n coincides with Mallows C p .
The following theorem provides the rate of approximation of AR(n, P ) to R(n, P ), and then analyzes C (P ) as an estimator of AR(n, P ); some of its conditions and implications are discussed below.All proofs are in the Appendix.Our proof shows that Assumption (i) below can be replaced by the assumption that X and Y have 24 finite moments, and a careful inspection of the proof shows that this number can be somewhat reduced.
Theorem 2.1.Assume that (i) The coordinates of X and Y have finite moments of all orders.
(ii) The entries of (X n X n /n) −1 have third moments that are bounded uniformly in n.Then and where Furthermore, ) for some asymptotic variance τ 2 as N → ∞, and n is fixed.
Since there is only a finite number of models, the above terms O, O p , and o p do not depend on the subset of covariates P .For example, we could replace (2.9) by |R(n, P ) − AR(n, P )| ≤ B/n 3/2 for all n and P , where B is a constant.Moreover, the term o p (1/N ) in (2.10) does not depend on n.
Condition (i) of Theorem of 2.1 is standard, and Lemma 2.2 below shows that Condition (ii) is satisfied if X is distributed as a mixture of normals; see Sampson [22].Such mixtures form a dense family of distributions with respect to weak convergence in the space of distribution on R p .As the distribution of X is never known exactly, it makes sense to assume, as an approximation, that the data satisfy such a condition.The case where X has discrete components is discussed in Section 2.4.
We shall later compare models consisting of different subsets of covariates.Equation (2.9) suggests that choosing a model by minimizing a good estimate of AR(n, P ) with respect to P can lead to a model for which R(n, P ) is within o(1/n) of the best model, and thus P is an adequate model in the sense of Section 2.2.This is stated formally in Proposition 2.4.
In view of (2.10) we use C (P ) (n, N ) of (2.8) as an estimator of the approximate prediction error AR(n, P ) and hence of the prediction error R(n, P ).This is formalized in Propositions 2.5 and 2.6 below.We now briefly discuss Equations (2.10) and (2.11).First consider the bias of C (P ) (n, N ) as an estimator of AR(n, P ).It is easy to see that , and after dividing the latter term by n as in (2.10), it is of a smaller order than the term tr( V N ) 1  n + 1 N appearing in C (P ) (n, N ).This shows that the latter term contributes to reducing the bias of C (P ) (n, N ) as an estimator of AR(n, P ).
Our main interest is in the case of J > 1 regressions, and in choosing a model that minimizes an average of J values of AR.Averaging (nearly) unbiased estimates can result in consistency in J , which explains why we care about correcting the bias of C (P ) (n, N ).In this case, a further bias correction using the jackknife is useful (see Section 5.2).The above discussion implies that it suffices to bias-correct the estimator tr( V N ), which is what we do when using the jackknife.
Choosing a good model can be reduced to choosing between two models, say, P and Q at a time, by approximating the difference AR(P ) − AR(Q ) using The leading terms in the latter expression will be the difference between the relevant values of E N for the two models, and it is easy to see that the leading term of this difference is the difference between the values of 1 N ||Y N − X N β|| 2 for the corresponding models, which is of order O p (1/ √ N ) by the central limit theorem.However, when two models having very similar prediction values are compared by differencing their corresponding values of C (P ) (n, N ), their leading terms will approximately cancel, and in this case the second term on the right-hand side of (2.8) plays a role.This holds also for Mallows C p and the AIC, [1], and will be exploited formally in the Propositions 2.5 and 2.6 below.
The following lemma shows that Condition (ii) of Theorem 2.1 holds when X is distributed as a mixture of normals.
Lemma 2.2.Let the distribution of the covariate vectors (excluding the first coordinate in the case that it is a constant 1) be normal, or a finite mixture of normals, or an infinite mixture of normals with covariance matrices in a set Ξ, and inf Σ∈Ξ λ min (Σ) > 0, where λ min denotes the smallest eigenvalue.Then, for n > p + 5, Condition (ii) of Theorem 2.1 is satisfied.
More generally, the rth moments of the entries of (X n X n /n) −1 are bounded under the conditions of Lemma 2.2 provided that n > p + 2r − 1 (see von Rosen [19], Theorem 4.1).Note that the condition on λ min guarantees that X is bounded away from exact multicollinearity.

Discrete covariates
When X contains discrete covariates, the probability that the matrix (X n X n /n) −1 does not exist is positive, and expressions like β n of (2.2) and hence R(n, p) of (2.4) may not exist.When the components of X are bounded, we provide the following limiting approach.Set where λ min is the smallest eigenvalue, and R(n, We have Theorem 2.3.Suppose that Y has all moments, the components of X are bounded, and Q is invertible; then for some a ∈ (0, 1), Moreover, all quantities appearing in Theorem 2.1 are well defined on H N , and can be defined in an arbitrary way outside of H N , and the results (2.10)-(2.13)hold.
Thus, apart from the complement H c n , which has exponentially small probability, the approximation rate of AR(n, P ) to the prediction error is the same as in (2.9) and the rest of Theorem 2.1 still holds.The result follows from Theorem 2.1 and Lemma 2.3 given in the Appendix.

Approximations and consistency
The focus of this section is on choosing a subset of covariates for prediction of future responses on the basis of a single dataset of size N .The linear model parameters are estimated from a sample of size n, with the understanding that different n's may (and should) lead to different choices of subsets; more specifically, a larger n naturally gives rise to a larger set of covariates.Asymptotic results in n are not of major interest in this context; however, they may contribute some understanding when n is not small.Such results are discussed in this section.
In Proposition 2.5 we show that under the conditions of Theorem 2.1, choosing a subset of covariates in the set arg min P C (P ) (n, N ) guarantees that for increasing n and N we choose the best linear model with probability converging to 1, that is, the model minimizing R(n, P ) = E Y − X (P ) β (P ) n 2 , with notation defined after (2.4).In Proposition 2.6 we show that for fixed n, using C (P ) (n, N ), we choose an adequate model in the sense defined in Section 2.2, with probability converging to 1 as N → ∞.
Below arg min P is taken over all subsets of covariates.For a given n, define the following sets: The following proposition shows that the first two sets defined above by arg min converge to the third, which is a singleton.Note that P * is the best linear model in the sense of being the most parsimonious model minimizing the expected square of the projection error Y − X (P ) β (P ) .We deal with the convergence of π * (n, N ) in Proposition 2.5.
Proposition 2.4.Suppose that the conditions of Theorem 2.1 hold.Then (i) Any two sequences in π * (n) and P * (n) are equally good, that is, any sequence of models in π * (n) is adequate in the sense of Section 2.2.
(ii) The set P * is a singleton, and the sets π * (n) and P * (n) converge to the singleton P * as n → ∞.
The proof shows that essentially M is a singleton; that is, besides P * , M may only contain models having the same covariates and regression coefficients as those of P * , and further covariates whose coefficients vanish.Note that since the number of models is finite, it follows that P * (n) = π * (n) = P * for large enough n; that is, the same model P * minimizes both R(n, P ) and AR(n, P ).The model P * is the minimal best linear predictive model that one would ideally use if the projection coefficients β (P ) were known.
The next proposition shows that minimizing the statistic C (P ) (n, N ) leads to correct selection asymptotically, that is, to selecting the model that minimizes the prediction error R(n, P ) with probability converging to 1.
The proof is given in the Appendix, where we also show that the condition n/N → 0 is necessary.The case n = N (with nonrandom covariates) corresponds to the standard Mallows C p , which is inconsistent; more specifically, it is well known that for n = N , the choice π * (n, N ) may lead to models Q that strictly contain P * ; see, e.g., Nishii [18].The equality π * (n, N ) = P * (n), which holds for large enough n and N with high probability, implies that π * (n, N ) is a singleton (by Proposition 2.4 (ii)), and that selecting a model according to the statistic π * (n, N ) yields a model that minimizes the prediction error.Furthermore, the choice of a model by π * (n, N ) leads asymptotically to the choice of P * , the smallest model in terms of the number of covariates in M, that is, the most parsimonious model P that minimizes E Y −X (P ) β (P ) 2 .This property is often referred to as consistency; see, e.g., Shao [24].
In the case of fixed n, Equation (2.13) readily implies that Therefore, as N goes to infinity, the left-hand side converges to zero (at a rate of 1/ √ N ), implying Proposition 2.6.Under the condition of Theorem 2.1, we have for any fixed n, P π * (n, N ) ⊆ π * (n) In words, Proposition 2.6 says that a model that minimizes C (P ) will minimize AR(n, P ) with high probability for fixed n and a suitably large N .Proposition 2.4 (i) asserts that minimizing AR(n, P ) by π * (n) is close to minimizing R(n, P ) by P * (n), which is our goal.

Several datasets
Our main focus is on the case where several regression datasets are observed.We first discuss the case where we observe datasets from all the regressions of interest, and then, in Section 3.3, we consider a hierarchical situation where the data consist of a random sample of regression datasets from a structured collection of regression models.

Model selection observing all regressions
We consider a population of distributions G = {G j : j = 1, . . ., K} with J = K < ∞, that is, the training set comprises of all regression datasets in the population.Thus, we observe data For a given n, the goal is to select a common set of covariates P to be used for prediction of the response Y from X = X (P ) (the subvector with coordinates in P ) for each individual distribution G j from the population, or equivalently for a random G J , see below (3.1),where the coefficients β (P ) j,n , which are allowed to vary with j, are estimated with a sample of size n.The relevant prediction error for this task is (3.1) below.When predicting for individual j, it may be natural to set n = N j .However, other values of n may be of interest in studying the contribution of covariates as a function of the sample size.Later (in Section 3.3), we use the J datasets as a training set for choosing a model to predict for any out-of-sample G J on the basis of n future observations, where n is not determined in advance since J is not in the training set.In this case we use the chosen subset of covariates, and estimate its parameters on the basis of a dataset of size n from G J .The value of n may vary, being the size of the dataset G J .
Let X := X (P ) ∈ R p , where for now P and its size p are suppressed in the notation.For each j and generic observation (X, Y ) from the distribution G j , we define , where Q j := E Gj (XX ).Assuming finite fourth moments, we have for a sample size n → ∞, for each j, as in X j,N and Y j,N are the jth versions of X N , and Y N , and W j := E Gj (XX e 2 ), a p × p matrix, assumed to be positive definite.We further use the notation V j for the jth version of V, that is, when expectations are taken with respect to G j , and similar notation when N = N j observations are used for the estimators Q j,N , V j,N , and W j,N instead of Q N , V N , and W N .
We consider prediction for a random individual regression dataset of size n from the population G, based on a model, that is, a subset of covariates P .As above we suppress P and write X and β rather than X (P ) and β P , etc.The relevant prediction error (see (1.1) and around for a discussion) is where (X, Y ) ∼ G j independently of β j,n , and the expectation on the righthand side of (3.1) is also applied to β j,n .The risk R(n, P ) can be interpreted as an expectation over G J for a uniform choice of a single J ∈ G or equivalently, as the risk per task average for the multi-task of predicting for all G j ∈ G if all datasets sizes (or task size) were n.In R(n, P ) above and similar expressions below, we suppress the number of datasets J .In the case that any of the distributions G j involves discrete covariates, we replace E Gj (Y − X β j,n ) 2 by a conditional expectation as in Section 2.4, where the conditioning is on a set whose complement is exponentially small.In the definition given in Equation (1.1), (3.1), and others below we use boldface letters when J > 1.Next define Using (2.9) we have where N = (N 1 , . . ., N J ).We define the jackknife bias-corrected C (P ) by where C (P ) j (n, N j ) is the bias-corrected C (P ) j (n, N j ); see Efron [9], Equation (2.8), for a precise definition of the jackknife correction we use.
Theorem 3.1 below parallels Theorem 2.1 concerning the error of C (P ) (n, N) as an estimator of AR(n, P ).Theorem 3.1.Suppose that the conditions of Theorem 2.1 are satisfied when (X, Y ) ∼ G j for each j = 1, . . ., J .Then, where E j,Nj is the jth version of E N defined in (2.11), and the o p terms do not depend on n.Moreover, assume that lim N→∞ N 1 /N j := a j exists for all j, where 0 < a j < ∞; then as N → ∞, where τ 2 J = 1 J 2 J j=1 a j τ 2 j and τ 2 j is the asymptotic variance under G j as in Theorem 2.1, Equation (2.13).
Notice that if τ 2 j and a j are bounded (in j), then the asymptotic variance of √ N 1 AR(n, P ) − C (P ) (n, N) decreases like 1/J , which means that the error is decreasing in J .Theorem 3.1 and (3.3) imply properties of C (P ) (n, N) as an estimator of R(n, P ) as discussed next.

Consistency
Analogously to the definitions in Section 2.5, where now the optimal sets of the multi-task problem are denoted using boldface, define The next result is similar to Proposition 2.4, with essentially by the same proof.
The notions equally good and adequate are the same as that of Section 2.2.
Proposition 3.2.Suppose that the conditions of the first part of Theorem 3.1 hold.Then (i) Any two sequences in π * (n) and P P P * (n) are equally good, that is, any sequence of models in π * (n) is adequate.
(ii) The set P P P * is a singleton and the sets P P P * (n) and π * (n) converge to the singleton P P P * as n → ∞.
The following proposition generalizes Propositions 2.6 and 2.5 to J > 1.Here we consider a uniform bound (3.7).Technically, the constant C provides a measure of the notion of "sufficiently homogeneous" of Section 1.2 when referring to the set of distributions G; informally we mean that the regression datasets have enough in common to justify common subsets for prediction.Proposition 3.3.
1. Assume that the conditions of the first part of Theorem 3.1 hold.Then for fixed n we have 2. Let n/N j be bounded for all j = 1, . . ., J , and let C be a constant satisfying for all j and P n/N j , λ max (W where K C depends only on C. The existence of C follows from the assumption on n/N j since only a finite number of bounded terms appear in (3.7) besides n/N j .
Part 1 of the above proposition extends Propositions 2.6.Part 2 extends Proposition 2.5; however, a stronger condition was needed before, namely that n/N → 0, to obtain consistency.Here, we obtain approximate consistency for large J , assuming only that n/N j is bounded, along with the other terms in (3.7).This is useful since in our application it is natural to consider the possibility that n = N j .
The rate K C /J in the theorem was achieved by using Chebyshev's inequality.Since under our assumptions all moments are bounded, a similar argument using a bound on 2m moments leads in the same way to the rate K C,m /J m , where K C,m depends also on m, and with further effort, a large deviation rate (in J ) can be achieved.

A population of distributions
We now consider the situation where we have a sample of J regression datasets from a given, finite or infinite, population of such datasets, and we are interested in predictions for a random (possibly out-of-sample) further regression or several regressions from the same population.In terms of the application considered in this paper, this situation corresponds to the case that we have a training sample of J doctors out of many more, and our goal is to select a subset of covariates to be used to predict service durations for a random doctor from the population who may not be in the training sample.
Formally, let (Θ, T , P) be a probability space and let {G θ : θ ∈ Θ} be a family of distributions; see, e.g., C ¸inlar [7], Chapter VI for a formulation of random measures.Let {θ 1 , . . ., θ J } be a sample where θ j ∼ P, and as in Section 3.1 we observe a training set consisting of datasets D j = {(X ij , Y ij ) ∼ G j , i = 1, . . ., N j }, j = 1, . . ., J , where G j stands for G θj .Given θ ∈ Θ, we consider D = {(X i , Y i ) ∼ G θ , i = 1, . . ., n}.For any function f for which the conditional expectation E G θ f (D) of f (D) given G θ is well defined, we assume that so is , where the outer expectation is over θ ∼ P.
We now fix a set of covariates P , which is suppressed in most of the notation as before.For any G θ we define β θ,n to be the least squares estimator for the given dataset D. If G θ is sampled randomly from P then the population prediction error is defined as where the expectation , and As before, R pop (n, P ) is approximated by where the latter integrand defines AR θ (n, P ) as in (2.7).The quantity AR θ (n, P ), whose estimation was already discussed, is now a random variable, since it depends on G θ with θ ∼ P; its expectation, given by (3.9), is the basis of our estimation of R pop (n, P ) of (3.8).Lemma 3.4 below generalizes (2.9).
Lemma 3.4.Suppose that the conditions of Theorem 2.1 hold uniformly in θ ∈ Θ; that is, for each k, the kth moment with respect to G θ of each entry of X and Y is bounded uniformly in θ, and the entries of (X n X n /n) −1 have third moments with respect to G θ that are bounded uniformly in n and θ.Then The lemma clearly holds if Θ is finite, and in general it follows readily by the uniform boundedness of moments in θ and the proof of (2.9) given in the Appendix.Recall Lemma 2.2, where we showed that the moment conditions of Theorem 2.1 hold when X is a mixture of normals and inf Σ∈Ξ λ min (Σ) > 0. For the bound on moments as assumed in Lemma 3.4 to hold uniformly, it suffices that inf Σ λ min (Σ) > 0, where now the infimum is over all covariance matrices of all the mixing normal distributions involved in all the distributions G θ for all θ ∈ Θ.This technical assumption means that the covariates that are taken into account for the model selection are "bounded away" from multicollinearity.For discrete variables we redefine the prediction error by conditioning as in Section 2.4.Theorem 2.3 extends easily when we assume that all covariates are uniformly bounded in θ, and that λ min (Q θ ) > c for some c > 0, for all θ.
Lemma 3.4 suggests that a consistent estimator of AR pop (n, P ) will lead to selection of an adequate model in the sense of Section 2.2, that is, a model that is as good as the model that minimizes R pop (n, P ).
Recall the definition of AR(n, P ) in (3.2); now this quantity is considered random as it is a function of the sampled distributions G 1 , . . ., G J .In order to generalize the consistency results of Theorem 3.1 to this case, we need to bound AR pop (n, P ) − AR(n, P ) as in the lemma below.π * (n, N) and π * (n) are defined as in (3.6), however the fact that now the G j 's are random adds randomness to π * (n, N), and makes π * (n) a random variable.
Proposition 3.6 parallels Proposition 3.3; it shows consistency properties of π * (n, N), as defined in (3.6) using (3.4).Below, the probability P is obtained by first conditioning on θ 1 , . . ., θ J , and then unconditioning by taking expectation over θ 1 , . . ., θ J with respect to the product measure P J .Proposition 3.6.Assume that the conditions of Lemma 3.4 hold, and in addition that n/N j , λ max (W θ ) ≤ C for all θ and P .1.When n is fixed, where K C depends only on C.
The proof of Proposition 3.6 also shows that P P P * pop is a singleton and P P P * pop (n) and π * pop (n) converge to it when n → ∞.

Definition of GENO
Given a model (i.e., a set of covariates) P with coefficients estimated by a sample of n observations, we can say that it is equivalent to another model Q with m observations if their expected prediction errors satisfy R(m, Q ) = R(n, P ).Using the approximation AR(n, P ) to R(n, P ) given in (2.7), (3.2), and (3.9) for each of the scenarios we consider, we define GENO by If AR(m, Q ) > (<)AR(n, P ) for all m, we set GENO(n, P , Q ) = ∞(0), indicating that model P with n observations is better than model Q with any number of observations (model Q with any number of observations is better than P with n).A direct calculation shows that for J = 1 we have For J > 1 with AR defined in (3.2) we have .
For the case of (3.9), j is replaced by θ, and the averages by integrals P(dθ).GENO(n; P , Q ) = m means that model P with n observations is equivalent in terms of expected prediction error to model Q with m observations.Note that the larger GENO(n; P , Q ) is, the better model P (with n observations) is relative to model Q .For each model P and sample size n, we define where the minimum is over all subsets of covariates R .It follows that the inequality GENO(n, P ) ≤ n holds always, where equality means that P is the best model for n observations, as no other model can achieve the same prediction error with fewer observations.On the other hand, GENO(n, P ) = m < n means that there is a model that achieves, with m < n observations, the same prediction error as P with n observations.Thus, small values of GENO(n, P ) suggest considering another model.By the monotonicity of AR(n, P ) in n, if the inequality GENO(n; P , R ) ≥ GENO(n; Q , R ) holds for some model R , then it holds all R .This readily implies

Estimation of GENO
In the case J = 1, (2.13) shows the consistency of C (P ) (n, N ) as an estimator of AR(n, P ) for fixed n as N → ∞.In view of (4.1) we define a consistent estimator of GENO(n; p, q) by GENO(n; p, q) := m : To avoid cumbersome notation we suppress N in GENO.Using (2.8) we obtain, as before, setting it to be ∞ if the expression in curly brackets is negative or zero.
In the case of J > 1 datasets of Section 3.1, or in the population case of Section 3.3, the above expression (4.4) remains unchanged except that now C (P ) (n, N ) is replaced by C (P ) (n, N) defined in (3.4), and ).We can also define the estimator of (4.4) in terms of the jackknife bias-corrected C C C (P ) (n, N) of (3.5).This is done in estimating GENO in Section 6.3.The results below hold in the same way for all these cases.Similarly to (4.2), we define the statistic GENO(n, P ) := min R GENO(n; P , R ), which is an estimate the minimal number of observations required by the best competing model to achieve the same prediction error as model P with sample size n.
As in (4.3), we have The next proposition follows from (2.13) by applying the δ-method to the inverse function in (4.4).In particular, it shows the consistency of GENO(n; P , Q ) for fixed n as N → ∞.Proposition 4.1.Under the conditions of Theorem 3.1 (which include the case J = 1), we have for any fixed n The variance η 2 is not given explicitly since it is too complicated to be useful, and it can be be computed by the bootstrap.See Theorem 3.1 and the ensuing comment, which show that (under certain conditions) the variance decreases at a rate of 1/J .
A similar problem is to estimate for a given model P and a certain prescribed prediction error E the sample size n that satisfies AR(n, P ) = E.When J = 1, using (2.7) this quantity is given by tr(V (P ) ) E − E Y − X (P ) β (P ) 2 and can be estimated by tr( V N ) N )} is an unbiased estimator of E Y − X (P ) β (P ) 2 ); the extensions to the cases J > 1 and to the population setup are straightforward.

Simulations
In this section we evaluate by simulations the prediction error R(n, P ), its approximation AR(n, P ), and its estimation using C (P ) .We start with a single dataset (J = 1) and then we consider the case of several datasets.This simple example demonstrates the well-known difficulty involved in model selection for a single given dataset with methods such as Mallows C p , AIC, BIC, as well as our version C (P ) .In Section 5.2 we compare the case of model selection for one dataset to that of choosing a common model for successful prediction on the average when we have data from several datasets, that is, a multi-task.Section 5.3 compares the prediction error when model selection is done according to C C C (P ) to the prediction error under other methods.

A single dataset
Suppose that the distribution of (X, Y ) for X ∈ R 5 is given by with X 1 , . . ., X 5 , ε ∼ iid N (0, 1).Setting all models to include the intercept, there are 2 5 possible submodels; for simplicity, we focus for now on two models consisting of the subsets of covariates P 1 = {1, X 1 }, P 2 = {1, X 1 , . . ., X 5 }; more explicitly, we have model These two models are wrong (as linear conditional expectation function models, see Hansen [12] Section 2.15) since the residual e includes the nonlinear term X 2 1 − 1.By the orthogonality of the variables in (5.1), the projection parameters β k are equal to b k for these models; see (2.1).This is used in computing the first part of AR(n, P ) for = 1, 2, and since in this case Q = I, it is also easy to compute tr(V) for each model.We obtain 2)) (dashed line), = 1, 2, as functions of n for the above parameters.We evaluated R(n, P ), where = 1, 2, by a simulation based on 10 3 repetitions and using the decomposition (see (7.1) and recall that the first expectation can be computed explicitly and the second is evaluated using simulations.For small n, R(n, P 2 ) differs from AR(n, P 2 ), and the approximation improves as n increases.For n smaller than about 50, model P 1 has a smaller prediction error; for large n model P 2 is better.This holds approximately for both R and AR.This makes sense as models with fewer parameters have a smaller prediction error for small n.The rest of the models are not optimal for any n (this observation is not shown in Figure 1).Consider GENO as defined in (4.1).Careful inspection of Figure 1 shows, for example, that GENO(49; P 1 , P 2 ) = 49, which means that in order to achieve the same prediction error as model P 1 with n = 49 observations (the value of n where the dashed black line and red the line intersect), model P 2 requires the same number of observations.Also, GENO(60; p 2 , p 1 ) = 95, and therefore, to achieve the same prediction error as model P 2 with n = 60, model P 1 would require 95 observations (the value of n where the dashed black line has the same level as the dashed red line at 60).Since the decrease of AR(n, P 1 ) (the black line) in n is slow, a small increase in n, will result in a much larger value of GENO(n; P 2 , P 1 ); for example, GENO(65; P 2 , P 1 ) = 142.As mentioned before, GENO allows the statistician to compare the cost of additional observations to the cost of measuring additional variables, which may be expensive, or harmful, such as in the case of an invasive medical procedure or imaging that involves radiation.
By (4.5), the numbers of observations for models P 1 and P 2 to obtain a prediction error of 59 are about 29 and 39, respectively; i.e., model P 1 can achieve this prediction error with a sample size that is smaller by 10 observations.On the other hand, for a prediction error of 56, model P 1 requires 118 observations, while model P 2 needs only 63 observations.We now discuss estimation of the prediction error using C (P ) (n, N ) based on a single dataset of size N = 100.Figure 3 depicts simulation estimates of the probability of selecting models P 1 and P 2 as a function of n, using C (P ) (n, N) and the jackknifed C (P ) (n, N), where all 2 5 possible sub-models P are considered; for clarity we present the curves of P 1 and P 2 only.For each n and for each simulated dataset, C (P ) (n, N) and C (P ) (n, N) are calculated for all P .The empirical averages over 100 simulations of selecting models P 1 and P 2 out of the 2 5 sub-models for each n are plotted in Figure 3.The bias correction increases the probability of selecting model P 1 for small n.This improves the selection for small or moderate values of n.For the problem of selecting a common model for J datasets, the bias correction becomes more significant, as demonstrated next.

Multiple datasets
We now consider the case of J > 1 datasets.Suppose that G θ is given by model , where b k is given in (5.4), (Z 1 , . . ., Z 5 ) ∼ N (0, 0.2 2 ), W k is ±1 with equal probability, and all the above random variables are independent (thus determining the distribution P of Section 3.3), and then fixed throughout this section.The expected b 2 θ,k is approximately equal to b 2 k in (5.4), but about half of the b θ,k 's are positive and half are negative.The number of regression datasets is J = 100, and N j = 20, 100, and 200 for 1 ≤ j ≤ 33, 34 ≤ j ≤ 66, 67 ≤ j ≤ 100, respectively.
In the case of observing all regressions (see Section 3.1, Equation (1.1)), we wish to estimate R(n, P ), whereas in the case of observing a sample of regressions from the distribution P (see Section 3.3, Equation (3.8)), the relevant quantity is R pop (n, P ).Computing the latter quantity is difficult, and instead we use the approximation R(n, P ), which is justified by the law of large numbers and the central limit theorem (see Lemma 3.5).Thus we now focus on estimating R(n, P ) and selecting according to its estimate.The plot of R(n, P ) for J =100 and P = P 1 , P 2 is similar to Figure 1 and therefore is not presented here.
Figure 4 parallels Figure 2, where now in the case of J datasets, C (P ) (n, N) and C C C (P ) (n, N) replace C (P ) (n, N ), and C (P ) (n, N ), respectively; see (3.4) and (3.5); the number of simulations to evaluate R pop (n, P ) and to produce Figures 4 and 5 is 100.We see that the jackknife bias correction works well.Here the variances of the estimates are much smaller, indicating that several datasets can lead to better estimates and model selection procedures, as predicted by theory.The y-axis scale varies between Figures 4 and 2, in a way that undermines their difference.
Figure 5 plots the selection probabilities as a function of n (out of all 2 5 submodels).Unlike the case J = 1 (see Figure 3), model P 1 (respectively, model P 2 ) is selected with high probability for small n (respectively, large n).Recall that it is optimal to select model P 1 (respectively, P 2 ) when n ≤ 50, (respectively, n ≥ 50).Selecting according to C (P ) (n, N) leads to favoring P 2 (or other models) when n is greater than approximately 25 (instead of 50) and C C C (P ) (n, N) corrects this bias.Thus, the probability of correct model selection is much higher when using the J = 100 datasets (see Figure 1).Clearly, the probability of making a correct selection depends on the number of datasets J , the similarity among the J models, the noise level in the models, and the sample size n.
(a) Boxplots of

Comparisons to other approaches
We considered the possibility of concatenating the whole training sample and performing a single regression with an intercept for each j.In this simulation, since about half of the b θ,k 's are positive and half are negative, the resulting regression model leads to a higher prediction error than the one of C C C (P ) (n, N).The latter has estimated prediction error of 56.1 (SE=0.03)(see Table 1 below), while for the ordinary least squares applied to the concatenated dataset the corresponding number is 63.6 (SE=0.1),computed by averaging the prediction error over 1000 independent datasets with the same distribution.For ridge and lasso estimators applied to the concatenated dataset (calculated using the glmnet package, where the tuning parameter was computed using cross-validation), the prediction error was slightly higher: 64.1 (SE=0.1)and 63.9 (SE=0.1)for ridge and lasso respectively.Another approach is to consider a separate model selection algorithm for each of the J datasets.We considered three selection criteria: C (P ) (n, N j ) with n = N j as in (2.8) (applied to each dataset separately), Mallows' C p and BIC.The means of the resulting prediction errors are given in the Table 1 below as well as that of C C C (P ) (n, N ) (where the same model is selected for all j's with the same sample size N j ).The datasets are divided into three categories according to their sample sizes and the mean is reported for each category separately.Recall that N j = 20, 100, and 200 for 1 ≤ j ≤ 33, 34 ≤ j ≤ 66, and 67 ≤ j ≤ 100, respectively.The prediction error R j (N j , P ) was evaluated as in Figure 1.Table 1 shows that C C C (P ) (n, N ) leads to smaller prediction errors and the improvements is higher for smaller sample-sizes, where borrowing power from other datasets is more important.

Table 1
The means of the prediction errors R j (N j , P ), where P is selected by different methods.
The standard errors are about 0.03.

Prediction of durations of medical examinations
In this section we analyze a dataset of outpatients' hospital visits.Different models are considered in order to predict the actual appointments' durations as opposed to the planned durations.

Description of the data
The dataset analyzed is taken from the SEE Lab at the Technion.It consists of information on 140,924 hospital visits that took place in a certain US hospital for about two years between 2013 and 2015.For each visit, both the planned time and the actual time are reported.The goal was to provide a more accurate prediction of the actual duration than the planned one.In this dataset there is information on 44,516 patients and 258 doctors, out of whom 34 doctors had fewer than 50 visits.We shall focus on the rest, which corresponds to 99.5% of all visits.The regression coefficients will differ between doctors, and the goal is to select one common subset of covariates (for each n) for all doctors for prediction of visit durations.The distribution of the planned duration is given in Table 2 and Figure 6 plots the estimated density (a normal kernel estimate using the R command "density") of the actual durations for the time slots of 15, 30, and 60 minutes.Actual durations are obtained by a real-time location system (RTLS).The means are 16.7, 21.3, and 41.2, respectively.The vertical dashed lines are at 15, 30, and 60 minutes.

A regression model
The original dataset contains a large number of covariates, of which many did not seem to have any predictive power relative to visit durations.For simplicity of presentation, we focus on a small number of covariates that seem most relevant.We aim to predict actual duration, using the following covariates: • duration planned = the planned duration of the visit in minutes.
• duration planned 2 = the planned duration in minutes of the visit, squared.
• last = the planned minus the actual duration of the previous visit of the same patient (taken to be 0 for the first visit of the patient).• hour end = whether the exam is planned to end on the hour.It turns out that these kinds of visits tend to be somewhat longer.• type = there are two types of examinations: consultation/examination only, or the above plus treatment.In either case, only the first part counts as duration.
Standard statistical inference of the linear regression model of the whole dataset (ignoring the doctors' index) reveals that all of the above covariates besides "type" are statistically significant; however, the standard error of the residuals is 15.33, and R 2 = 0.227, suggesting that the prediction error is quite large.

C C C (P ) and model selection
In our notation, each doctor is indexed by j, and N j is the number of visits to doctor j in the dataset; N j varies between 50 and 2135.We demonstrate our approach by focusing on four candidate models that have the smallest (or nearly smallest) C C C (P ) from all submodels of the five covariates (all models included the intercept term) for relevant sample sizes n.These models are P 1 -the model with the covariates: duration planned, duration planned 2; P 2 -the model with the same covariates as in P 1 and additionally, the variable "last"; P 3 -the model with the same covariates of P 2 and additionally, the variable "type"; and P 4the full model.For certain submodels estimation is possible only for a subset of the doctors since X (P ) j,Nj X (P ) j,Nj is not always invertible.Therefore J varies between the models.For the models P 1 and P 2 , invertibility held for 96 doctors and for the models P 3 and P 4 , the corresponding number is 95, and so for these models J = 96 or J = 95.In this case, C C C (P ) (n, N) is based only on this subset.
Figure 7 plots C C C (P ) (n, N) for P = P where = 1, . . ., 4 and n is between 50 and 500.For n smaller than approximately 80, model P 1 is the best among the candidate models; for n between 80 and 450, P 2 has a smaller C C C (P ) , and for larger n, P 3 is the best, but P 2 is very close.In terms of GENO, we have, for example, that for n = 50, GENO(n, P 1 , Q ) for Q = P 2 , P 3 , and P 4 equals 54, 63, 73, respectively.The latter number means that model P 4 (the full model) would require 73 observations to achieve the same prediction error as model P 1 with n = 50 observations.Also, GENO(200, P 2 , P 1 ) = 370; if one considers using only the planned duration (P 1 ) or using model P 2 , that is, adding the variable "last" with the information on the last visit, which may not be available for some patients, then the estimated prediction error by the model P 2 with n = 200 observations can be achieved without knowing "last" by P 1 , with n = 370.It is then left to the user to decide whether to invest in measuring "last" or in using a larger sample, if such a sample is available.
Table 3 reports C C C (P ) (n, N) for different sample sizes n.Standard deviations estimated by the bootstrap, and cross-validation estimates of R(n, P ), are also provided.The latter estimates are computed only for j's where N j > n.For each such j, the data were split at random into a training set with n observations, and a testing set of size N j − n.The estimates β(P) j,n are based on the training set and the prediction error R j (n, P ) is evaluated using the testing set.This procedure was repeated 1,000 times and the average prediction error is reported.The crossvalidation estimates are mostly within one standard error of the C C C (P ) values, and the two approaches lead to selection of the same models.
Table 4 reports the values of C C C (P ) (n, N) − C C C (Q ) (n, N) together with a bootstrap estimate of the standard deviation for different values of n and various pairs of models.Also the differences of the corresponding cross-validation estimates are given.The standard deviations of Table 4 are much smaller than those of Table 3.This is consistent with our theoretical results that comparison of two similar models leads to a small estimation error (see the discussion after Theorem 2.1).The table shows which pairs P , Q differ significantly, and for which values of n.
for n = 50, 150, 500.Bootstrap standard deviations (SD) and cross-validation (CV) estimates are also provided.Boldface numbers indicate differences that are significantly (more than two SD's) non-zero.

Comparisons to other approaches
As in Section 5.3 we compare our method to other approaches.One possibility is to concatenate the whole training sample and add a categorical variable for the doctors.The (10-fold) cross-validation estimate of the prediction error of the OLS is 204.5.The corresponding numbers for the ridge and lasso estimates (applied to the concatenated data) are similar: 205.0 and 204.8.The estimates of prediction errors of our method are smaller: they vary between 190 and 176 for 50 ≤ n ≤ 500 (See Figure 7 and Table 3).
A different approach is to preform a separate model selection for each of the J datasets.As in Section 5.3, three selection criteria are considered, C (P ) (N j , N j ) (the bias-corrected C (P ) (n, N ) with n = N j ), Mallows' C p and BIC. Figure 8 plots the cross-validation estimates of the prediction errors of the selected models by the three criteria as a function of the sample size n = N j .A normalkernel smoothing is drawn to illustrate the average prediction error as a function of the sample size.Also plotted is the prediction error of the common model selection P 2 (which is close to optimal for sample sizes between 50 and 500) as estimated by C C C (P2) .Table 3 shows that the latter estimate is rather close to its cross validation estimate.Figure 8 shows that: a. the differences between the three selection criteria are small; b. a common model selection by C C C (P ) is better on average than a separate model selection; c. the latter statement is especially true for small sample sizes where borrowing strength is more important.

Appendix A: Proofs
Recall that we use P to denote a subset of the covariates, to which we sometimes refer as a model, and denote its size with the corresponding letter p.We suppress it in most of our notation below and instead of X (P ) we write X and assume it is in R p .Proof of Theorem 2.1.We first prove (2.9).For β n computed from a sample D = {(X i , Y i ) : i = 1, . . ., n}, and a pair of new observations from the same distribution (X, Y ), independent of D, we have where the last term in the first line of (7.1) vanishes since E[(Y − X β)X] = E[eX] = 0 and Y and X are independent of β n .This argument holds also if we condition on H n (see (2.14), and Theorem 2.3).By (7.1) we have that Using independence of X and β n again we have For the proof of Theorem 2.3 the expectations should be conditioned on the set H n , whose probability is large, and the conditioning does not affect the rates we obtain.By Equation (7.3) of Hansen [12], Since E{tr(U n U n Q −1 )} = tr(V) we rewrite the right-hand side of (7.2) as In order to prove (2.9) we now show that the latter expectation is of order O(1/n 1/2 ).To this end, notice that We now deal with the first term on the right-hand side above, the other term being similar, and simpler.We have and therefore (recall (7.3)) we consider the expectation of This matrix is a product of random matrices of the form ABCD where A The trace is a sum of products of entries from all the matrices appearing in the product.Different choices of powers can be made, but we use Hölder's inequality in the form E|abcd| ≤ (Ea 12 ) 1/12 (E|b| 3 ) 1/3 (Ec 4 ) 1/4 (E|d| 3 ) 1/3 for simplicity.Here a is an entry from A, b an entry from from B, etc., and the triangle inequality can then be used to bound the sum comprising the trace.
For each element j, k of Q − Q n we have we see that the number of nonvanishing terms when the expectation is taken It follows that tr( V N ) − tr(V) = O p (1/ √ N ) and it is asymptotically normal.Similar to previous arguments, the random variables in (7.9) and the first part of E N are jointly asymptotically normal and a version of Slutsky's theorem that allows us to ignore the term O p (1/N ) together with (2.10) and (2.11), implies the last statement of Theorem 2.1 about the asymptotic normality of √ N C (P ) (n, N ) − AR(n, P ) .Proof of Lemma 2.2.First assume that the first coordinate of the covariate vectors is 1 (corresponding to an intercept coefficient).Let X denote the n × (p − 1) matrix defined as X n but without the first column of 1's.We suppress n here and in the following notation.Let X := X 1/n, where 1 is an ncolumn vector of 1's, that is, X is the (p − 1)-column of covariates means, and let S := X X/n − X X .Then by Horn and Johnson [13], page 25, Equation (0.8.5.6) Let X denote a (p − 1)-vector of the covariates without the first 1, and assume first that X ∼ N (µ, Σ).Then nS ∼ W ishart p−1 (Σ, n − 1), and X and S are independent.The third moments of S −1 are uniformly (in n) bounded by C max{1, 1/λ min (Σ) 3 } for some C > 0, provided n − p − 5 > 0 by Theorem 4.1 of von Rosen [19].By (7.10) the third moments of (X X/n) −1 are also bounded.If X is distributed according to a mixture of normals, the assumption in the lemma that all these normal distributions have λ min (Σ) > c > 0 implies the uniform boundedness for the mixture.
If the first coordinate is not 1, we append 1 to the covariates and now the matrix of interest in the above notation (with p replacing p − 1) is X X/n, which is now p × p, and we wish to show that its inverse has bounded third moments.The eigenvalues of the latter matrix are larger (not strictly) than those of S = X X/n − X X .The latter relation is revered for the inverses.Now use the inequality that for a positive definite matrix A we have |a ij | ≤ tr(A)/2 to conclude that the entries of ( X X/n) −1 are bounded by tr(S −1 ).By the first part of the theorem (with p replacing p − 1) we know that S −1 has finite third moments and therefore also its trace (by Minkowski inequality), and the result follows.
Lemma 2.3.For X n ∈ H n (see (2.14)) the matrix (X n X n /n) −1 exists and all its entries are bounded uniformly in n.Moreover, if the components of X are bounded, then for some a < 1 we have, P (H n ) > 1 − a nλmin(Q) , which converge to 1 at an exponential rate in n.
Proof.When X n ∈ H n , then λ min (X n X n /n) ≥ 1 2 λ min (Q) > 0, and therefore (X n X n /n) −1 exists.Since the entries of a positive semi-definite matrix are bounded by the maximal eigenvalue, all entries of (X n X n /n) −1 are bounded in this case by 2/λ min (Q).For the moreover part, notice that when the elements of X are bounded then so is λ max (X n X n /n).By Tropp [25], Theorem 1.1, P (H c n ) ≤ a nλmin(Q) for some a < 1.
and similarly for tr(V (Q ) ).We have e (P ) = e (Q ) for any P , Q ∈ M (with probability 1) since otherwise, by the argument in (7.11), P ∪ Q would contradict the assumption that both P and Q are in M as above.We conclude that, tr(V (P ) ) = tr E X (P ) X (P ) {e (P ) } 2 = k∈P E X (P ) k e (P ) 2 = tr(V (Q ) ). (7.12) The strict inequality follows from the fact that W is positive definite, and thus X (P ) X (P ) {e (P ) } 2 = AX (P ) X (P ) {e (P ) } 2 A are matrices with positive definite expectations, and therefore positive diagonal elements.Summing up, the above discussion shows that P * is a singleton, and that tr(V (P * ) ) is minimal among the models in M.This implies that π * (n) → P * as n → ∞.By (2.9), for every model p, R(n, P ) − AR(n, P ) = o(1/n), and therefore P * (n) = π * (n) for large enough n.Hence, also P * (n) → P * as n → ∞.Proof of Proposition 2.5.It suffices to prove that P ( π * (n, N ) = P * ) → 1 when n, N → ∞ and n/N → 0, since by Proposition 2.4, P * (n) = P * for large enough n.Equivalently, we claim that for every P = P * we have P ( π * (n, N ) = P ) → 0, and since there is a finite number of models, the result follows.The latter claim is proved separately for P / ∈ M and then for P ∈ M, (conditions where B is a positive constant implying that P ( π * (n, N ) = P ) → 0 when both n, N go to infinity and n/N → 0.

Fig 1 :
Fig 1: Simulation estimates of R(n, P 1 ) and R(n, P 2 ) (solid line) as well as the approximations AR (dashed line) given in (5.2).
Figure 2 plots R(n, P 1 ) − R(n, P 2 ) (solid line), AR(n, P 1 )−AR(n, P 2 ) (dashed line), and boxplots of the estimators C (P1) (n, N )− C (P2) (n, N ) on the left-hand side, and C (P1) (n, N ) − C (p2) (n, N ), the jackknife bias-corrected version, on the right-hand side, based on 10 3 simulations for each n = 20, 40, . . ., 200.Their means are given by circles.We see that the jackknife corrects the bias of C (P1) (n, N ) − C (P2) (n, N ) as an estimator of AR(n, P 1 ) − AR(n, P 2 ); see the discussion following (2.8).Recall that the bias itself and the correction decrease in n.The mean of the difference C (P1) (n, N ) − C (P2) (n, N ) and C (P1) (n, N ) − C (P2) (n, N ) equals 0 at about n = 40 and n = 50, respectively; thus the jackknife leads to correct selection on average since it is optimal to select model P 1 for about n ≤ 50.

Fig 4 :Fig 5 :
Fig 4: Same plots as in Figure 2 when there are J = 100 samples.

Fig 6 :
Fig 6: Estimated density of the actual duration for the time slots of 15, 30, and 60 minutes.

Fig 8 :
Fig 8: Prediction errors (estimated by cross-validation) of the model selection methods C (p) (N j , N j ), Mallows' Cp and BIC applied to each dataset separately compared to the common model P 2 .The dashed lines are Gaussian-kernel smoothing.The green line is C C C (P 2 ) (n, N) as a function of n ∈ [50, 500].

n 2 .
) N = O p (1/N ).Recalling that AR(n, P ) − AR(n, P * ) = B/n, (2.10) implies that C (P ) (n, N ) − C (P * ) (n, N ) = B/n + O p (1/N ) − tr(V (P ) ) − tr( V (P ) N ) − tr(V (P * ) ) + tr( V (P * ) N ) n , which implies (7.14) by (2.12) (b).Proof of Theorem 3.1.The first part follows from (2.10) of Theorem 2.1.That the o p terms do not depend on n can be seen by inspecting the proof of (2.10) of Theorem 2.1.The moreover part follows from the asymptotic normality of each j; see (2.13).Proof of Proposition 3.3.Part 1 follows from the first part of Theorem 3.1.The proof of Part 2 differs from that of Proposition 2.5 only in taking averages over J in similar expressions.The only real difference is in case (b) of the proof of Proposition 2.5, with P and P P P * both in M. The proof is achieved by showing that there exists B > 0 such that lim sup n/Nj ≤C,n,N→∞ P n{C (P ) (n, N) − C (P P P * ) (n, N)} < B/2 < K/J .(7.15)We haven{C (P ) (n, N) − C (P P P * ) (n, N)) − tr(V (P P P * ) j) + tr( V (P P P * ) j,Nj ), and E j = no p (1/N j ), arising from the last term in (2.10).The proof of (7.15) is accomplished by showing that 1 J J j=1 B j ≥ B, to be defined below, and that the other three sums are small.and therefore,P {P ∈ π * (n)} ∩ {P P P * pop / ∈ π * (n)} ≤ K C /J .Next consider the case P / ∈ M pop .By definition, there exists ε > 0 such that for anyP / ∈ M pop E G θ (Y − X (P ) β (P ) θ ) 2 P(dθ) − E G θ (Y − X (P P P * pop ) β (P P P * pop ) θ ) 2 P(dθ) > ε.It is easy to see that for n 2 large enough this implies E AR(n, P P P * pop ) − AR(n, P ) < −ε/2 ∀n ≥ By an argument as aboveP {P ∈ π * (n)} ∩ {P P P * pop / ∈ π * (n)} ≤ K C /J .Sincethe number of models is finite, (7.20) follows.Now, for fixed n that satisfies n < n 0 := max{n 1 , n 2 } again a similar argument shows that for any P ∈ π * (n) ,P P / ∈ π * pop (n) ≤ K C (n) J ,where K C (n) may depend on n (and on C).Since there are only finite such n's the result of Part 2 follows.Proof of Proposition 3.6.The first part of Proposition 3.6 follows from Part 1 of Proposition 3.3, which shows that π * (n, N) ⊆ π * (n) with probability converging to 1, and Part 2 of Lemma 7.1, which shows that π * (n) ⊆ π * pop (n) with high probability.The second part of Proposition 3.6 follows from a combination of several statements: π * (n, N) = P P P * (n) with high probability (Proposition 3.3, Part 2); P P P * (n) = π * (n) for large n (Proposition 3.2); π * (n) ⊆ π * pop (n) with high probability (Lemma 7.1 Part 2); and for large n, π * pop (n) is a singleton, and π * pop (n) = P P P * pop (n) (Lemma 7.1, Part 1).

Table 2
The distribution of the planned duration.

Table 4
The values of C C C