The workshop 'High-dimensional data: p >> n in mathematical statistics and bio-medical applications' was held at the Lorentz Center in Leiden from 9 to 20 September 2002. This special issue of Bernoulli contains a selection of papers presented at that workshop.
The introduction of high-throughput micro-array technology to measure gene-expression levels and the publication of the pioneering paper by Golub et al. (1999) have brought to life a whole new branch of data analysis under the name of micro-array analysis. Some aspects of micro-array data are quite new and typical of the data-extraction technique, but the issue of using high-dimensional data as explanatory variables in classification or prediction models has been recognized as a scientific problem in its own right in chemometrics, machine learning and mathematical statistics. The aim of the workshop was to bring together researchers from the more theoretical side (mathematical statistics, chemometrics, machine learning) and the applied side (biostatistics) for a cross-disciplinary discussion on the analysis of high-dimensional data, and to be more than just another workshop on micro-arrays. The first lesson learned is that quite different languages are spoken in the different fields and that communication between hardcore mathematical statistics and practical data analysis on micro-arrays is far from easy. Further meetings of this sort will be beneficial because they improve interdisciplinary communication.
This special issue contains papers on different issues of micro-array data analysis and papers on statistical models for high-dimensional data. There are different statistical challenges in micro-array data analysis. A major problem with the micro-array technology (and similar high-throughput techniques) is that the outcomes obtained within one experiment (array) are of a relative nature. The outcomes of one single array can be normalized by comparing them with the (geometric) mean or the median of all values. In the common two-colour (red-green) experiment this problem is partly solved by measuring two samples on the same array in different colours. Relative measures can be obtained directly by comparing red with green. Even then, there appears to be a need for normalization because the relation between red and green can be distorted. Developing proper normalization methods is an important statistical challenge in micro-array data analysis. The paper by Lee and Whitmore gives nice insight into the normalization debate. It is interesting to observe that normalization is made possible by the abundance of data. Having tens of thousands of gene expressions measured on the same array makes it possible to use the variation over the genes within one array to construct a reasonable normalization. Here, the high dimension is a blessing, not a curse.
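As a minimal illustration of such within-array normalization (the function name and the choice of the median are illustrative, and this is not the specific method of Lee and Whitmore), one can median-centre the log-ratios of a two-colour array:

```python
import math
from statistics import median

def normalize_log_ratios(red, green):
    """Median-centre the per-gene log2(red/green) ratios of one array.

    The median over the many genes on the same array estimates the
    global channel bias, which is then subtracted; this is the sense in
    which the abundance of genes makes normalization possible.
    """
    ratios = [math.log2(r / g) for r, g in zip(red, green)]
    bias = median(ratios)            # global red/green distortion
    return [x - bias for x in ratios]
```

After centring, a gene with a normalized log-ratio near zero is equally expressed in the two samples.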
The next step in gene-expression data analysis is to determine which genes are differentially expressed, that is, which genes show differences between subgroups of individuals. This can be a case of supervised learning if the individuals are characterized as normal/abnormal, or of unsupervised learning if no further information on the individuals is available. The paper by Garrett and Parmigiani discusses an interesting mix of unsupervised and supervised learning. Using latent class modelling, they manage to reduce the gene-expression information to a trichotomous outcome: 'underexpressed', 'normal' or 'overexpressed'. This reduction helps to suppress noise and to select the genes that could be of interest for further data analysis. In this search for differentially expressed genes, the large number of genes is again more of a blessing than a curse. Similarities between genes can be used in a multi-level (or empirical Bayes) setting to find the cut-off values for being under- or overexpressed per gene. The large number of genes only becomes cumbersome if one wants to test each gene for differential expression between normals and abnormals or any similar grouping of individuals. Controlling the family-wise error rate by Bonferroni or more sophisticated corrections can be detrimental, but after switching to false discovery rates as introduced by Benjamini and Hochberg (1995), the large number of genes can be helpful in establishing the prevalence of truly differentially expressed genes.
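The Benjamini-Hochberg step-up procedure mentioned above can be sketched in a few lines (variable names are illustrative):

```python
def benjamini_hochberg(pvals, q):
    """Indices of hypotheses rejected at false discovery rate level q.

    Sort the p-values, find the largest rank k with p_(k) <= k*q/m,
    and reject the k hypotheses with the smallest p-values.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * q / m:
            k = rank
    return sorted(order[:k])
```

Unlike a Bonferroni correction, which compares every p-value with q/m, the step-up threshold adapts to the observed p-values, so a large number of genes helps rather than hurts.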
The curse of dimensionality p >> n comes into play when micro-array data are used for diagnosis/classification or prediction. It is this application of gene-expression data in the paper by Golub et al. (1999) that excited a lot of interest in micro-array data among machine learners and statisticians. The remaining papers in this issue all address preventing overfitting in classification/regression models on a high-dimensional predictor.
Early papers on classification using micro-array data exploited rather simple classification rules that appeared hard to beat by more sophisticated alternatives. The paper by Bickel and Levina, inspired by analysing high-dimensional texture data, discusses and explains why the so-called naive Bayes classifier, which ignores the dependencies between the predictors, behaves so well. To put some structure on the high-dimensional explanatory variable, they view the sequence of predictors as a stochastic process and assume stationarity of the covariance function. It is not quite clear how this carries over to the unstructured micro-array data.
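A minimal sketch of such a naive Bayes rule for two Gaussian classes (independent coordinates, equal priors; the class and method names are illustrative, and this is not Bickel and Levina's construction):

```python
from statistics import mean, pvariance

class NaiveBayes:
    """Two-class Gaussian classifier that ignores all dependence between
    coordinates: each feature contributes an independent likelihood term."""

    def fit(self, X0, X1):
        self.mu0 = [mean(col) for col in zip(*X0)]
        self.mu1 = [mean(col) for col in zip(*X1)]
        # per-feature within-class variances, averaged over the two classes
        self.var = [(pvariance(c0) + pvariance(c1)) / 2 + 1e-12
                    for c0, c1 in zip(zip(*X0), zip(*X1))]
        return self

    def predict(self, x):
        # sign of the diagonal (independence) discriminant score
        score = sum((xj - (m0 + m1) / 2) * (m1 - m0) / v
                    for xj, m0, m1, v in zip(x, self.mu0, self.mu1, self.var))
        return 1 if score > 0 else 0
```

The Fisher rule would replace the per-feature variances by the inverse of the full estimated covariance matrix, which is singular once the number of features exceeds the number of observations.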
From the theoretical point of view, it is interesting to understand why simple rules are hard to beat, but from a more practical point of view it is disappointing that the wealth of data cannot be more efficiently analysed. The lesson is that we need more biological understanding of the relations between genes if we want to get more out of gene-expression data. If p >> n, it is impossible to discover the relevant relations from the data and use these in an efficient way for classification or prediction.
The paper by Greenshtein and Ritov approaches a problem very similar to the one in Bickel and Levina's paper, but from a different angle, with the emphasis on linear prediction. They offer a theoretical framework for the popular lasso of Tibshirani (1996), which is closely related to soft-thresholding (Donoho 1995). The lasso restricts the l1-norm when fitting a linear regression model using least squares, or adds an l1-penalty to the sum of squares. The finding of Greenshtein and Ritov is that persistent procedures (as good as the best procedure under the same restrictions) can be obtained under quite liberal conditions on the restriction. They conclude that there is 'asymptotically no harm' in introducing many more explanatory variables than observations as far as prediction is concerned. It is implicit in their paper that finding the best predictor is different from estimating the vector of regression coefficients. The latter is hopeless if p >> n. The message for the practitioner should be that the lasso (and also some other penalized methods leading to sparse representations) can be safely used in combination with proper cross-validation for the purpose of prediction, but that one should avoid any (biological) interpretation of the set of explanatory variables that are thus selected and their regression coefficients. The link with Bickel and Levina might be that penalization by the l1-norm of the regression vector has the effect of undoing multi-collinearity and acting as if the predictors were independent.
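The link between the lasso and soft-thresholding can be seen in a bare-bones coordinate-descent sketch (a didactic implementation under simplifying assumptions, not Tibshirani's original algorithm):

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; exactly zero on the interval [-t, t]."""
    return z - t if z > t else z + t if z < -t else 0.0

def lasso(X, y, lam, n_sweeps=100):
    """Minimise (1/2n)*||y - X b||^2 + lam*||b||_1 by cyclic coordinate
    descent; each coordinate update is a soft-thresholding step, which
    is what produces exact zeros (a sparse representation)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            # partial residuals with coordinate j removed
            r = [y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n)) / n
            ssq = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, lam) / ssq
    return beta
```

With lam = 0 this reduces to least squares; increasing lam sets more coefficients exactly to zero, which is why the selected set should be read as a predictive device rather than a biological finding.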
The paper by Keles, van der Laan and Dudoit is in the same spirit of finding the best predictor, but in the setting of right-censored survival data. The first problem they deal with is the estimation of prediction error in censored data. They show that the problem of censoring can be handled by IPCW, that is, weighting by the inverse probability of censoring. Secondly, they use cross-validation to estimate the prediction error. They do not use the terminology of Greenshtein and Ritov, but their main result basically states that their procedure is 'persistent'. They show that, asymptotically, the rule that minimizes the empirical cross-validated prediction error behaves as well as the rule that minimizes the expected cross-validation error (the benchmark in their terminology). They do not explicitly address the issue of high-dimensional data. The class of prediction rules is left open and the practitioner has to make sure he/she uses a class of predictors that is rich enough to give cross-validation a chance.
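A bare-bones sketch of the IPCW idea (a Kaplan-Meier estimate of the censoring distribution; assumes distinct observation times, and the function names are illustrative rather than the authors' estimator):

```python
def censoring_survival(times, events):
    """Kaplan-Meier estimate of G(t) = P(C > t), the survivor function
    of the censoring variable: censorings play the role of the 'events'.
    Assumes the observation times are distinct."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    at_risk, G, surv = len(times), 1.0, {}
    for i in order:
        if events[i] == 0:                 # a censoring occurred here
            G *= (at_risk - 1) / at_risk
        surv[times[i]] = G
        at_risk -= 1
    return surv

def ipcw_weights(times, events):
    """Inverse-probability-of-censoring weights: an uncensored subject
    observed at time T gets weight 1/G(T); censored subjects get 0."""
    surv = censoring_survival(times, events)
    return [e / max(surv[t], 1e-12) for t, e in zip(times, events)]
```

The up-weighted uncensored subjects stand in for the full sample, so an empirical prediction error computed with these weights approximates the complete-data risk.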
The papers by Birgé and by Kerkyacharian and Picard discuss the problem of estimating an unknown regression function f(X) from a sample from (X, Y) with random X. Birgé's paper is theoretical in nature. He defines model selection as selecting a small number of basis functions of which the unknown f is supposed to be a linear combination. Results about optimal selection in L2-norm are available for designed experiments in which X can be chosen by the observer. Life is more complicated when X is random. Birgé argues that for random pairs (X, Y) the Hellinger distance is more natural and the usual rates can be obtained for this distance, but might not hold for the L2-norm. The paper by Kerkyacharian and Picard is more practical (but also highly technical). Their starting point is the use of shrunken wavelets in the case of a designed experiment with equidistant observations. In the case of random observations they combine shrunken wavelets with warping of the x-axis induced by the distribution function G of X. If G is not known, it can be estimated by the empirical distribution function. They show that under certain regularity conditions, the behaviour of the new basis is quite similar to the behaviour of the regular wavelet basis.
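The two ingredients just described, warping the design through its empirical distribution function and shrinking the empirical coefficients, can be sketched as follows (illustrative names, not the authors' implementation):

```python
def warp(x, design):
    """Map a design point x to Ghat(x) in [0, 1], where Ghat is the
    empirical distribution function of the observed design points."""
    return sum(1 for d in design if d <= x) / len(design)

def threshold_coefficients(coeffs, t):
    """Soft-threshold empirical basis coefficients: those below t in
    absolute value (mostly noise) are set exactly to zero, the rest
    are shrunk toward zero by t."""
    return [(abs(c) - t) * (1 if c > 0 else -1) if abs(c) > t else 0.0
            for c in coeffs]
```

After warping, the design points are approximately uniform on [0, 1], which is the setting in which equidistant-design wavelet thresholding results apply.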
In classification problems arising in genomics research it is common to study populations for which a broad class assignment is known (say, normal versus diseased) and one seeks undiscovered subclasses within one or both of the known classes. Formally, this problem can be thought of as an unsupervised analysis nested within a supervised one. Here we take the view that the nested unsupervised analysis can successfully utilize information from the entire data set for constructing and/or selecting useful predictors. Specifically, we propose a mixture model approach to the nested unsupervised problem, where the supervised information is used to develop latent classes which are in turn used for data mining and robust unsupervised analysis. Our solution is illustrated using data on molecular classification of lung adenocarcinoma.
Let Z_i = (Y_i, X_i1, ..., X_ip), i = 1, ..., n, be independent and identically distributed random vectors with common distribution F. It is desired to predict Y by ∑_j β_j X_j, where (β_1, ..., β_p) ∈ B_n ⊆ R^p, under a prediction loss. Suppose that p >> n, that is, there are many more explanatory variables than observations. We consider sets B_n restricted by the maximal number of non-zero coefficients of their members, or by their l1 radius. We study the following asymptotic question: how 'large' may the set B_n be, so that it is still possible to select empirically a predictor whose risk under F is close to that of the best predictor in the set? Sharp bounds for orders of magnitude are given under various assumptions on F. Algorithmic complexity of the ensuing procedures is also studied. The main message of this paper and the implications of the orders derived are that under various sparsity assumptions on the optimal predictor there is 'asymptotically no harm' in introducing many more explanatory variables than observations. Furthermore, such practice can be beneficial in comparison with a procedure that screens in advance a small subset of explanatory variables. Another main result is that 'lasso' procedures, that is, optimization under l1 constraints, could be efficient in finding optimal sparse predictors in high dimensions.
We show that the 'naive Bayes' classifier which assumes independent covariates greatly outperforms the Fisher linear discriminant rule under broad conditions when the number of variables grows faster than the number of observations, in the classical problem of discriminating between two normal populations. We also introduce a class of rules spanning the range between independence and arbitrary dependence. These rules are shown to achieve Bayes consistency for the Gaussian 'coloured noise' model and to adapt to a spectrum of convergence rates, which we conjecture to be minimax.
Over the last two decades, nonparametric and semi-parametric approaches that adapt well-known techniques such as regression methods to the analysis of right censored data, e.g. right censored survival data, have become popular in the statistics literature. However, the problem of choosing the best model (predictor) among a set of proposed models in the right censored data setting has received little attention. We develop a new cross-validation-based model selection method to select among predictors of right censored outcomes such as survival times. The proposed method considers the risk of a given predictor based on the training sample as a parameter of the full data distribution in a right censored data model. Then, the doubly robust locally efficient estimation method or an ad hoc inverse probability of censoring weighting method, as presented by Robins and Rotnitzky and later by van der Laan and Robins, is used to estimate this conditional risk parameter based on the validation sample. We prove that, under general conditions, the proposed cross-validated selector is asymptotically equivalent to an oracle benchmark selector based on the true data generating distribution. The method presented covers model selection with right censored data in prediction (univariate and multivariate) and density/hazard estimation problems.
This paper is concerned with Gaussian regression with random design, where the observations are independent and identically distributed. It is known from work by Le Cam that the rate of convergence of optimal estimators is closely connected to the metric structure of the parameter space with respect to the Hellinger distance. In particular, this metric structure essentially determines the risk when the loss function is a power of the Hellinger distance. For random design regression, one typically uses as loss function the squared L2-distance between the estimator and the parameter. If the parameter space is bounded with respect to the L∞-norm, both distances are equivalent. Without this assumption, it may happen that there is a large distortion between the two distances, resulting in some unusual rates of convergence for the squared L2-risk, as noticed by Baraud. We explain this phenomenon and then show that the use of the Hellinger distance instead of the L2-distance allows us to recover the usual rates and to carry out model selection in great generality. An extension to the L2-risk is given under a boundedness assumption similar to that given by Wegkamp and by Yang.
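For reference, the two distances being compared are, writing f and g for two densities (in the random design setting, for the distributions of the observations); these are standard definitions, not specific to this paper:

```latex
h^2(f,g) \;=\; \frac{1}{2}\int \bigl(\sqrt{f(x)}-\sqrt{g(x)}\,\bigr)^2\,\mathrm{d}x,
\qquad
\|f-g\|_2^2 \;=\; \int \bigl(f(x)-g(x)\bigr)^2\,\mathrm{d}x .
```

Boundedness in L∞ keeps the factor (√f + √g)^2 relating the two integrands bounded above and below, which is why the two distances are then equivalent.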
We consider the problem of estimating an unknown function f in a regression setting with random design. Instead of expanding the function on a regular wavelet basis, we expand it on the basis warped with the design. This allows us to employ a very stable and computable thresholding algorithm. We investigate the properties of this new basis. In particular, we prove that if the design has a property of Muckenhoupt type, this new basis behaves quite similarly to a regular wavelet basis. This enables us to prove that the associated thresholding procedure achieves rates of convergence which have been proved to be minimax in the uniform design case.