The Annals of Statistics Articles (Project Euclid)
http://projecteuclid.org/euclid.aos
The latest articles from The Annals of Statistics on Project Euclid, a site for mathematics and statistics resources.
en-us
Copyright 2010 Cornell University Library
Euclid-L@cornell.edu (Project Euclid Team)
Thu, 05 Aug 2010 15:41 EDT
Tue, 07 Jun 2011 09:09 EDT
http://projecteuclid.org/collection/euclid/images/logo_linking_100.gif
Project Euclid
http://projecteuclid.org/
Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem
http://projecteuclid.org/euclid.aos/1278861454
<strong>James G. Scott</strong>, <strong>James O. Berger</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 38, Number 5, 2587--2619.</p><p><strong>Abstract:</strong><br/>
This paper studies the multiplicity-correction effect of standard Bayesian variable-selection priors in linear regression. Our first goal is to clarify when, and how, multiplicity correction happens automatically in Bayesian analysis, and to distinguish this correction from the Bayesian Ockham’s-razor effect. Our second goal is to contrast empirical-Bayes and fully Bayesian approaches to variable selection through examples, theoretical results and simulations. Considerable differences between the two approaches are found. In particular, we prove a theorem that characterizes a surprising asymptotic discrepancy between fully Bayes and empirical Bayes. This discrepancy arises from a different source than the failure to account for hyperparameter uncertainty in the empirical-Bayes estimate. Indeed, even at the extreme, when the empirical-Bayes estimate converges asymptotically to the true variable-inclusion probability, the potential for a serious difference remains.
</p>projecteuclid.org/euclid.aos/1278861454 Thu, 05 Aug 2010 15:41 EDT
Nonparametric Bayesian analysis of the compound Poisson prior for support boundary recovery
https://projecteuclid.org/euclid.aos/1594972824
<strong>Markus Reiß</strong>, <strong>Johannes Schmidt-Hieber</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 3, 1432--1451.</p><p><strong>Abstract:</strong><br/>
Given data from a Poisson point process with intensity $(x,y)\mapsto n\mathbf{1}(f(x)\leq y)$, frequentist properties for the Bayesian reconstruction of the support boundary function $f$ are derived. We mainly study compound Poisson process priors with fixed intensity proving that the posterior contracts with nearly optimal rate for monotone support boundaries and adapts to Hölder smooth boundaries. We then derive a limiting shape result for a compound Poisson process prior and a function space with increasing parameter dimension. It is shown that the marginal posterior of the mean functional performs an automatic bias correction and contracts with a faster rate than the MLE. In this case, $(1-\alpha )$-credible sets are also asymptotic $(1-\alpha )$-confidence intervals. As a negative result, it is shown that the frequentist coverage of credible sets is lost for linear functions $f$ outside the function class.
</p>projecteuclid.org/euclid.aos/1594972824_20200717040034 Fri, 17 Jul 2020 04:00 EDT
Entrywise eigenvector analysis of random matrices with low expected rank
https://projecteuclid.org/euclid.aos/1594972825
<strong>Emmanuel Abbe</strong>, <strong>Jianqing Fan</strong>, <strong>Kaizheng Wang</strong>, <strong>Yiqiao Zhong</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 3, 1452--1474.</p><p><strong>Abstract:</strong><br/>
Recovering low-rank structures via eigenvector perturbation analysis is a common problem in statistical machine learning, arising in factor analysis, community detection, ranking and matrix completion, among others. While a large variety of bounds are available for average errors between empirical and population statistics of eigenvectors, few results are tight for entrywise analyses, which are critical for a number of problems such as community detection.
This paper investigates entrywise behaviors of eigenvectors for a large class of random matrices whose expectations are low rank, which helps settle the conjecture in Abbe, Bandeira and Hall (2014) that the spectral algorithm achieves exact recovery in the stochastic block model without any trimming or cleaning steps. The key is a first-order approximation of eigenvectors under the $\ell _{\infty }$ norm: \begin{equation*}u_{k}\approx \frac{Au_{k}^{*}}{\lambda _{k}^{*}},\end{equation*} where $\{u_{k}\}$ and $\{u_{k}^{*}\}$ are eigenvectors of a random matrix $A$ and its expectation $\mathbb{E}A$, respectively. The fact that the approximation is both tight and linear in $A$ facilitates sharp comparisons between $u_{k}$ and $u_{k}^{*}$. In particular, it allows for comparing the signs of $u_{k}$ and $u_{k}^{*}$ even if $\|u_{k}-u_{k}^{*}\|_{\infty }$ is large. The results are further extended to perturbations of eigenspaces, yielding new $\ell _{\infty }$-type bounds for synchronization ($\mathbb{Z}_{2}$-spiked Wigner model) and noisy matrix completion.
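The abstract's first-order approximation can be checked numerically. Below is a small NumPy sketch (not from the paper; the rank-one model and all constants are illustrative) comparing the entrywise error of the top empirical eigenvector against the error of the linear surrogate $Au_{k}^{*}/\lambda _{k}^{*}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400

# Rank-one expectation EA = lambda* u* u*^T plus a symmetric Gaussian noise matrix.
u_star = np.ones(n) / np.sqrt(n)
lam_star = 30.0
g = rng.standard_normal((n, n))
w = (g + g.T) / np.sqrt(2 * n)            # symmetric noise, entries of order n^{-1/2}
a = lam_star * np.outer(u_star, u_star) + w

# Top eigenpair of the observed matrix A.
vals, vecs = np.linalg.eigh(a)
u_k = vecs[:, -1]
u_k *= np.sign(u_k @ u_star)              # resolve the sign ambiguity

# First-order surrogate u_k ~ A u* / lambda*.
surrogate = a @ u_star / lam_star

err_exact = np.max(np.abs(u_k - u_star))      # ell_inf distance to the population eigenvector
err_linear = np.max(np.abs(u_k - surrogate))  # ell_inf distance to the linear surrogate
print(err_exact, err_linear)
```

In this toy run the surrogate tracks $u_{k}$ much more closely than $u_{k}^{*}$ does, which is the sense in which the approximation is tight.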
</p>projecteuclid.org/euclid.aos/1594972825_20200717040034 Fri, 17 Jul 2020 04:00 EDT
Concentration of tempered posteriors and of their variational approximations
https://projecteuclid.org/euclid.aos/1594972826
<strong>Pierre Alquier</strong>, <strong>James Ridgway</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 3, 1475--1497.</p><p><strong>Abstract:</strong><br/>
While Bayesian methods are extremely popular in statistics and machine learning, their application to massive data sets is often challenging, when possible at all. The classical MCMC algorithms are prohibitively slow when both the model dimension and the sample size are large. Variational Bayesian methods aim at approximating the posterior by a distribution in a tractable family $\mathcal{F}$. Thus, MCMC is replaced by an optimization algorithm which is orders of magnitude faster. VB methods have been applied in such computationally demanding applications as collaborative filtering, image and video processing, and NLP, to name a few. However, despite good results in practice, the theoretical properties of these approximations are not well understood. We propose a general oracle inequality that relates the quality of the VB approximation to the prior $\pi $ and to the structure of $\mathcal{F}$. We provide a simple condition that allows one to derive rates of convergence from this oracle inequality. We apply our theory to various examples. First, we show that for parametric models with log-Lipschitz likelihood, Gaussian VB leads to efficient algorithms and consistent estimators. We then study a high-dimensional example (matrix completion) and a nonparametric example (density estimation).
</p>projecteuclid.org/euclid.aos/1594972826_20200717040034 Fri, 17 Jul 2020 04:00 EDT
Robust and rate-optimal Gibbs posterior inference on the boundary of a noisy image
https://projecteuclid.org/euclid.aos/1594972827
<strong>Nicholas Syring</strong>, <strong>Ryan Martin</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 3, 1498--1513.</p><p><strong>Abstract:</strong><br/>
Detection of an image boundary when the pixel intensities are measured with noise is an important problem in image segmentation. From a statistical point of view, a challenge is that likelihood-based methods require modeling the pixel intensities inside and outside the image boundary, even though these distributions are typically not of interest. Since misspecification of the pixel intensity distributions can negatively affect inference on the image boundary, it would be desirable to avoid this modeling step altogether. Toward this, we develop a robust Gibbsian approach that constructs a posterior distribution for the image boundary directly, without modeling the pixel intensities. We prove that the Gibbs posterior concentrates asymptotically at the minimax optimal rate, adaptive to the boundary smoothness. Monte Carlo computation of the Gibbs posterior is straightforward, and simulation results show that the corresponding inference is more accurate than that based on existing Bayesian methodology.
</p>projecteuclid.org/euclid.aos/1594972827_20200717040034 Fri, 17 Jul 2020 04:00 EDT
The hardness of conditional independence testing and the generalised covariance measure
https://projecteuclid.org/euclid.aos/1594972828
<strong>Rajen D. Shah</strong>, <strong>Jonas Peters</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 3, 1514--1538.</p><p><strong>Abstract:</strong><br/>
It is a common saying that testing for conditional independence, that is, testing whether two random vectors $X$ and $Y$ are independent, given $Z$, is a hard statistical problem if $Z$ is a continuous random variable (or vector). In this paper, we prove that conditional independence is indeed a particularly difficult hypothesis to test for. Valid statistical tests are required to have a size that is smaller than a predefined significance level, and different tests usually have power against different classes of alternatives. We prove that a valid test for conditional independence does not have power against any alternative.
Given the nonexistence of a uniformly valid conditional independence test, we argue that tests must be designed so that their suitability for a particular problem may be judged easily. To address this need, we propose, in the case where $X$ and $Y$ are univariate, to nonlinearly regress $X$ on $Z$ and $Y$ on $Z$, and then compute a test statistic based on the sample covariance between the residuals, which we call the generalised covariance measure (GCM). We prove that the validity of this form of test relies almost entirely on the weak requirement that the regression procedures are able to estimate the conditional means of $X$ given $Z$ and of $Y$ given $Z$ at a slow rate. We extend the methodology to handle settings where $X$ and $Y$ may be multivariate or even high dimensional. While our general procedure can be tailored to the setting at hand by combining it with any regression technique, we develop the theoretical guarantees for kernel ridge regression. A simulation study shows that the test based on the GCM is competitive with state-of-the-art conditional independence tests. Code is available as the R package $\mathtt{GeneralisedCovarianceMeasure}$ on CRAN.
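The GCM recipe stated in the abstract — regress $X$ on $Z$, regress $Y$ on $Z$, then normalise the mean of the residual products — can be sketched in a few lines. The snippet below is a hypothetical Python illustration (the authors' implementation is the R package GeneralisedCovarianceMeasure; the crude k-NN smoother and all constants here are stand-ins, not the paper's procedure):

```python
import numpy as np

def knn_smooth(z, x, k=20):
    """Crude k-nearest-neighbour regression of x on z
    (a stand-in for any nonparametric regression method)."""
    fitted = np.empty_like(x, dtype=float)
    for i in range(len(z)):
        idx = np.argsort(np.abs(z - z[i]))[:k]
        fitted[i] = x[idx].mean()
    return fitted

def gcm_statistic(x, y, z, k=20):
    """Normalised sample covariance of the regression residuals:
    T = sqrt(n) * mean(R) / sd(R), with R_i = (x_i - f(z_i)) * (y_i - g(z_i)).
    Under conditional independence (and accurate regressions), T is
    approximately standard normal."""
    rx = x - knn_smooth(z, x, k)
    ry = y - knn_smooth(z, y, k)
    r = rx * ry
    return np.sqrt(len(r)) * r.mean() / r.std()

rng = np.random.default_rng(1)
n = 500
z = rng.normal(size=n)
# X and Y depend nonlinearly on Z but are conditionally independent given Z.
x = np.sin(z) + 0.5 * rng.normal(size=n)
y = z ** 2 + 0.5 * rng.normal(size=n)
t = gcm_statistic(x, y, z)
print(abs(t))  # compare |T| with a N(0, 1) quantile, e.g. 1.96 at level 0.05
```

Because the statistic is built from residuals, only the conditional means need to be estimated well, which is the point of the abstract's "weak requirement".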
</p>projecteuclid.org/euclid.aos/1594972828_20200717040034 Fri, 17 Jul 2020 04:00 EDT
Some theoretical properties of GANs
https://projecteuclid.org/euclid.aos/1594972829
<strong>Gérard Biau</strong>, <strong>Benoît Cadre</strong>, <strong>Maxime Sangnier</strong>, <strong>Ugo Tanielian</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 3, 1539--1566.</p><p><strong>Abstract:</strong><br/>
Generative Adversarial Networks (GANs) are a class of generative algorithms that have been shown to produce state-of-the-art samples, especially in the domain of image creation. The fundamental principle of GANs is to approximate the unknown distribution of a given data set by optimizing an objective function through an adversarial game between a family of generators and a family of discriminators. In this paper, we offer a better theoretical understanding of GANs by analyzing some of their mathematical and statistical properties. We study the deep connection between the adversarial principle underlying GANs and the Jensen–Shannon divergence, together with some optimality characteristics of the problem. An analysis of the role of the discriminator family via approximation arguments is also provided. In addition, taking a statistical point of view, we study the large sample properties of the estimated distribution and prove in particular a central limit theorem. Some of our results are illustrated with simulated examples.
</p>projecteuclid.org/euclid.aos/1594972829_20200717040034 Fri, 17 Jul 2020 04:00 EDT
On post dimension reduction statistical inference
https://projecteuclid.org/euclid.aos/1594972830
<strong>Kyongwon Kim</strong>, <strong>Bing Li</strong>, <strong>Zhou Yu</strong>, <strong>Lexin Li</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 3, 1567--1592.</p><p><strong>Abstract:</strong><br/>
The methodologies of sufficient dimension reduction have undergone extensive developments in the past three decades. However, there has been a lack of systematic and rigorous development of post dimension reduction inference, which has seriously hindered its applications. The current common practice is to treat the estimated sufficient predictors as the true predictors and use them as the starting point of the downstream statistical inference. However, this naive inference approach would grossly overestimate the confidence level of an interval, or the power of a test, leading to distorted results. In this paper, we develop a general and comprehensive framework of post dimension reduction inference, which can accommodate any dimension reduction method and model building method, as long as their corresponding influence functions are available. Within this general framework, we derive the influence functions and present the explicit post reduction formulas for the combinations of numerous dimension reduction and model building methods. We then develop post reduction inference methods for both confidence interval and hypothesis testing. We investigate the finite-sample performance of our procedures by simulations and a real data analysis.
</p>projecteuclid.org/euclid.aos/1594972830_20200717040034 Fri, 17 Jul 2020 04:00 EDT
Statistical and computational limits for sparse matrix detection
https://projecteuclid.org/euclid.aos/1594972831
<strong>T. Tony Cai</strong>, <strong>Yihong Wu</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 3, 1593--1614.</p><p><strong>Abstract:</strong><br/>
This paper investigates the fundamental limits for detecting a high-dimensional sparse matrix contaminated by white Gaussian noise from both the statistical and computational perspectives. We consider $p\times p$ matrices whose rows and columns are individually $k$-sparse. We provide a tight characterization of the statistical and computational limits for sparse matrix detection, which precisely describe when achieving optimal detection is easy, hard or impossible, respectively. Although the sparse matrices considered in this paper have no apparent submatrix structure and the corresponding estimation problem has no computational issue at all, the detection problem has a surprising computational barrier when the sparsity level $k$ exceeds the cube root of the matrix size $p$: attaining the optimal detection boundary is computationally at least as hard as solving the planted clique problem.
The same statistical and computational limits also hold in the sparse covariance matrix model, where each variable is correlated with at most $k$ others. A key step in the construction of the statistically optimal test is a structural property of sparse matrices, which can be of independent interest.
</p>projecteuclid.org/euclid.aos/1594972831_20200717040034 Fri, 17 Jul 2020 04:00 EDT
Segmentation and estimation of change-point models: False positive control and confidence regions
https://projecteuclid.org/euclid.aos/1594972832
<strong>Xiao Fang</strong>, <strong>Jian Li</strong>, <strong>David Siegmund</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 3, 1615--1647.</p><p><strong>Abstract:</strong><br/>
To segment a sequence of independent random variables at an unknown number of change-points, we introduce new procedures that are based on thresholding the likelihood ratio statistic, and give approximations for the probability of a false positive error when there are no change-points. We also study confidence regions based on the likelihood ratio statistic for the change-points and joint confidence regions for the change-points and the parameter values. Applications to segment array CGH data are discussed.
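As a schematic of the thresholding idea (not the paper's procedure, and the threshold below is an illustrative constant rather than one of the paper's false-positive approximations), the following sketch scans a likelihood-ratio-type statistic for a single mean change in unit-variance Gaussian data and compares its maximum to a cutoff:

```python
import numpy as np

def mean_change_lr(x):
    """Likelihood-ratio-type statistic for a single change in mean of iid
    N(mu, 1) data: Z_t = |mean(x[:t]) - mean(x[t:])| * sqrt(t * (n - t) / n)."""
    n = len(x)
    cs = np.cumsum(x)
    stats = np.zeros(n)
    for t in range(1, n):
        left = cs[t - 1] / t
        right = (cs[-1] - cs[t - 1]) / (n - t)
        stats[t] = abs(left - right) * np.sqrt(t * (n - t) / n)
    return stats

rng = np.random.default_rng(2)
# One change-point at t = 200: the mean jumps from 0 to 1.5.
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(1.5, 1.0, 200)])
z = mean_change_lr(x)
t_hat = int(np.argmax(z))
threshold = 3.0  # illustrative; calibrating this is what the paper's approximations do
print(t_hat, z[t_hat] > threshold)
```

A segmentation procedure of this flavour recurses on the two sub-sequences whenever the maximal statistic exceeds the threshold; controlling the false-positive probability of that comparison is the subject of the paper's approximations.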
</p>projecteuclid.org/euclid.aos/1594972832_20200717040034 Fri, 17 Jul 2020 04:00 EDT
Robust covariance estimation under $L_{4}-L_{2}$ norm equivalence
https://projecteuclid.org/euclid.aos/1594972833
<strong>Shahar Mendelson</strong>, <strong>Nikita Zhivotovskiy</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 3, 1648--1664.</p><p><strong>Abstract:</strong><br/>
Let $X$ be a centered random vector taking values in $\mathbb{R}^{d}$ and let $\Sigma=\mathbb{E}(X\otimes X)$ be its covariance matrix. We show that if $X$ satisfies an $L_{4}-L_{2}$ norm equivalence (sometimes referred to as the bounded kurtosis assumption), there is a covariance estimator $\hat{\Sigma}$ that exhibits almost the same performance one would expect had $X$ been a Gaussian vector. The procedure also improves the current state-of-the-art regarding high probability bounds in the sub-Gaussian case (sharp results were only known in expectation or with constant probability).
In both scenarios the new bounds do not depend explicitly on the dimension $d$, but rather on the effective rank of the covariance matrix $\Sigma$.
</p>projecteuclid.org/euclid.aos/1594972833_20200717040034 Fri, 17 Jul 2020 04:00 EDT
Robust inference via multiplier bootstrap
https://projecteuclid.org/euclid.aos/1594972834
<strong>Xi Chen</strong>, <strong>Wen-Xin Zhou</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 3, 1665--1691.</p><p><strong>Abstract:</strong><br/>
This paper investigates the theoretical underpinnings of two fundamental statistical inference problems, the construction of confidence sets and large-scale simultaneous hypothesis testing, in the presence of heavy-tailed data. With heavy-tailed observation noise, finite sample properties of the least squares-based methods, typified by the sample mean, are suboptimal both theoretically and empirically. In this paper, we demonstrate that the adaptive Huber regression, integrated with the multiplier bootstrap procedure, provides a useful robust alternative to the method of least squares. Our theoretical and empirical results reveal the effectiveness of the proposed method, and highlight the importance of having inference methods that are robust to heavy-tailedness.
</p>projecteuclid.org/euclid.aos/1594972834_20200717040034 Fri, 17 Jul 2020 04:00 EDT
On the optimal reconstruction of partially observed functional data
https://projecteuclid.org/euclid.aos/1594972835
<strong>Alois Kneip</strong>, <strong>Dominik Liebl</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 3, 1692--1717.</p><p><strong>Abstract:</strong><br/>
We propose a new reconstruction operator that aims to recover the missing parts of a function given the observed parts. This new operator belongs to a new, very large class of functional operators which includes the classical regression operators as a special case. We show the optimality of our reconstruction operator and demonstrate that the usually considered regression operators generally cannot be optimal reconstruction operators. Our estimation theory allows for autocorrelated functional data and considers the practically relevant situation in which each of the $n$ functions is observed at $m_{i}$, $i=1,\dots ,n$, discretization points. We derive rates of consistency for our nonparametric estimation procedures using a double asymptotic. For data situations such as our real data application, where $m_{i}$ is considerably smaller than $n$, we show that our functional principal components based estimator can provide better rates of convergence than conventional nonparametric smoothing methods.
</p>projecteuclid.org/euclid.aos/1594972835_20200717040034 Fri, 17 Jul 2020 04:00 EDT
Large sample properties of partitioning-based series estimators
https://projecteuclid.org/euclid.aos/1594972836
<strong>Matias D. Cattaneo</strong>, <strong>Max H. Farrell</strong>, <strong>Yingjie Feng</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 3, 1718--1741.</p><p><strong>Abstract:</strong><br/>
We present large sample results for partitioning-based least squares nonparametric regression, a popular method for approximating conditional expectation functions in statistics, econometrics and machine learning. First, we obtain a general characterization of their leading asymptotic bias. Second, we establish integrated mean squared error approximations for the point estimator and propose feasible tuning parameter selection. Third, we develop pointwise inference methods based on undersmoothing and robust bias correction. Fourth, employing different coupling approaches, we develop uniform distributional approximations for the undersmoothed and robust bias-corrected $t$-statistic processes and construct valid confidence bands. In the univariate case, our uniform distributional approximations require seemingly minimal rate restrictions and improve on approximation rates known in the literature. Finally, we apply our general results to three partitioning-based estimators: splines, wavelets and piecewise polynomials. The Supplemental Appendix includes several other general and example-specific technical and methodological results. A companion $\mathsf{R}$ package is provided.
</p>projecteuclid.org/euclid.aos/1594972836_20200717040034 Fri, 17 Jul 2020 04:00 EDT
Statistical inference in two-sample summary-data Mendelian randomization using robust adjusted profile score
https://projecteuclid.org/euclid.aos/1594972837
<strong>Qingyuan Zhao</strong>, <strong>Jingshu Wang</strong>, <strong>Gibran Hemani</strong>, <strong>Jack Bowden</strong>, <strong>Dylan S. Small</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 3, 1742--1769.</p><p><strong>Abstract:</strong><br/>
Mendelian randomization (MR) is a method of exploiting genetic variation to unbiasedly estimate a causal effect in the presence of unmeasured confounding. MR is being widely used in epidemiology and other related areas of population science. In this paper, we study statistical inference in the increasingly popular two-sample summary-data MR design. We show a linear model for the observed associations approximately holds in a wide variety of settings when all the genetic variants satisfy the exclusion restriction assumption, or in genetic terms, when there is no pleiotropy. In this scenario, we derive a maximum profile likelihood estimator with provable consistency and asymptotic normality. However, through analyzing real datasets, we find strong evidence of both systematic and idiosyncratic pleiotropy in MR, echoing the omnigenic model of complex traits recently proposed in genetics. We model the systematic pleiotropy by a random effects model, where no genetic variant satisfies the exclusion restriction condition exactly. In this case, we propose a consistent and asymptotically normal estimator by adjusting the profile score. We then tackle the idiosyncratic pleiotropy by robustifying the adjusted profile score. We demonstrate the robustness and efficiency of the proposed methods using several simulated and real datasets.
</p>projecteuclid.org/euclid.aos/1594972837_20200717040034 Fri, 17 Jul 2020 04:00 EDT
Local uncertainty sampling for large-scale multiclass logistic regression
https://projecteuclid.org/euclid.aos/1594972838
<strong>Lei Han</strong>, <strong>Kean Ming Tan</strong>, <strong>Ting Yang</strong>, <strong>Tong Zhang</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 3, 1770--1788.</p><p><strong>Abstract:</strong><br/>
A major challenge for building statistical models in the big data era is that the available data volume far exceeds the computational capability. A common approach for solving this problem is to employ a subsampled dataset that can be handled by available computational resources. We propose a general subsampling scheme for large-scale multiclass logistic regression and examine the variance of the resulting estimator. We show that asymptotically, the proposed method always achieves a smaller variance than that of uniform random sampling. Moreover, when the classes are conditionally imbalanced, significant improvement over uniform sampling can be achieved. Empirical performance of the proposed method is evaluated and compared to other methods via both simulated and real-world datasets, and these results match and confirm our theoretical analysis.
</p>projecteuclid.org/euclid.aos/1594972838_20200717040034 Fri, 17 Jul 2020 04:00 EDT
Local nearest neighbour classification with applications to semi-supervised learning
https://projecteuclid.org/euclid.aos/1594972839
<strong>Timothy I. Cannings</strong>, <strong>Thomas B. Berrett</strong>, <strong>Richard J. Samworth</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 3, 1789--1814.</p><p><strong>Abstract:</strong><br/>
We derive a new asymptotic expansion for the global excess risk of a local-$k$-nearest neighbour classifier, where the choice of $k$ may depend upon the test point. This expansion elucidates conditions under which the dominant contribution to the excess risk comes from the decision boundary of the optimal Bayes classifier, but we also show that if these conditions are not satisfied, then the dominant contribution may arise from the tails of the marginal distribution of the features. Moreover, we prove that, provided the $d$-dimensional marginal distribution of the features has a finite $\rho $th moment for some $\rho >4$ (as well as other regularity conditions), a local choice of $k$ can yield a rate of convergence of the excess risk of $O(n^{-4/(d+4)})$, where $n$ is the sample size, whereas for the standard $k$-nearest neighbour classifier, our theory would require $d\geq 5$ and $\rho >4d/(d-4)$ finite moments to achieve this rate. These results motivate a new $k$-nearest neighbour classifier for semi-supervised learning problems, where the unlabelled data are used to obtain an estimate of the marginal feature density, and fewer neighbours are used for classification when this density estimate is small. Our worst-case rates are complemented by a minimax lower bound, which reveals that the local, semi-supervised $k$-nearest neighbour classifier attains the minimax optimal rate over our classes for the excess risk, up to a subpolynomial factor in $n$. These theoretical improvements over the standard $k$-nearest neighbour classifier are also illustrated through a simulation study.
</p>projecteuclid.org/euclid.aos/1594972839_20200717040034 Fri, 17 Jul 2020 04:00 EDT
An adaptable generalization of Hotelling’s $T^{2}$ test in high dimension
https://projecteuclid.org/euclid.aos/1594972840
<strong>Haoran Li</strong>, <strong>Alexander Aue</strong>, <strong>Debashis Paul</strong>, <strong>Jie Peng</strong>, <strong>Pei Wang</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 3, 1815--1847.</p><p><strong>Abstract:</strong><br/>
We propose a two-sample test for detecting the difference between mean vectors in a high-dimensional regime based on a ridge-regularized Hotelling’s $T^{2}$. To choose the regularization parameter, a method is derived that aims at maximizing power within a class of local alternatives. We also propose a composite test that combines the optimal tests corresponding to a specific collection of local alternatives. Weak convergence of the stochastic process corresponding to the ridge-regularized Hotelling’s $T^{2}$ is established and used to derive the cut-off values of the proposed test. Large sample properties are verified for a class of sub-Gaussian distributions. Through an extensive simulation study, the composite test is shown to compare favorably against a host of existing two-sample test procedures in a wide range of settings. The performance of the proposed test procedures is illustrated through an application to a breast cancer data set where the goal is to detect the pathways with different DNA copy number alterations across breast cancer subtypes.
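A minimal sketch of the ridge-regularized statistic itself may help fix ideas. The paper's data-driven choice of the regularization parameter, the composite test and its calibration are not reproduced here; the fixed $\lambda$, the pooled-covariance form and the dimensions below are illustrative assumptions:

```python
import numpy as np

def ridge_hotelling_t2(x1, x2, lam):
    """Ridge-regularized two-sample Hotelling-type statistic:
    T^2(lam) = n1*n2/(n1+n2) * d' (S_pooled + lam*I)^{-1} d,
    where d is the difference of the two sample mean vectors."""
    n1, p = x1.shape
    n2 = x2.shape[0]
    d = x1.mean(axis=0) - x2.mean(axis=0)
    s_pooled = ((n1 - 1) * np.cov(x1, rowvar=False) +
                (n2 - 1) * np.cov(x2, rowvar=False)) / (n1 + n2 - 2)
    m = s_pooled + lam * np.eye(p)          # ridge term keeps the inverse stable when p ~ n
    return (n1 * n2 / (n1 + n2)) * d @ np.linalg.solve(m, d)

rng = np.random.default_rng(3)
p, n1, n2 = 50, 40, 40                      # dimension comparable to the sample sizes
x1 = rng.normal(size=(n1, p))
x2_alt = rng.normal(size=(n2, p)) + 0.3     # mean shifted by 0.3 in every coordinate
x2_null = rng.normal(size=(n2, p))          # no shift

t2_alt = ridge_hotelling_t2(x1, x2_alt, lam=1.0)
t2_null = ridge_hotelling_t2(x1, x2_null, lam=1.0)
print(t2_alt, t2_null)
```

The regularization makes the statistic well defined even when the pooled covariance is singular ($p>n_{1}+n_{2}-2$), which is the regime the paper targets.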
</p>projecteuclid.org/euclid.aos/1594972840_20200717040034 Fri, 17 Jul 2020 04:00 EDT
GRID: A variable selection and structure discovery method for high dimensional nonparametric regression
https://projecteuclid.org/euclid.aos/1594972841
<strong>Francesco Giordano</strong>, <strong>Soumendra Nath Lahiri</strong>, <strong>Maria Lucia Parrella</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 3, 1848--1874.</p><p><strong>Abstract:</strong><br/>
We consider nonparametric regression in high dimensions where only a relatively small subset of a large number of variables are relevant and may have nonlinear effects on the response. We develop methods for variable selection, structure discovery and estimation of the true low-dimensional regression function, allowing any degree of interactions among the relevant variables that need not be specified a priori. The proposed method, called the GRID, combines empirical likelihood based marginal testing with the local linear estimation machinery in a novel way to select the relevant variables. Further, it provides a simple graphical tool for identifying the low dimensional nonlinear structure of the regression function. Theoretical results establish consistency of variable selection and structure discovery, and also the oracle risk property of the GRID estimator of the regression function, allowing the dimension $d$ of the covariates to grow with the sample size $n$ at the rate $d=O(n^{a})$ for any $a\in(0,\infty)$ and the number of relevant covariates $r$ to grow at a rate $r=O(n^{\gamma})$ for some $\gamma\in(0,1)$ under some regularity conditions that, in particular, require finiteness of certain absolute moments of the error variables depending on $a$. Finite sample properties of the GRID are investigated in a moderately large simulation study.
</p>projecteuclid.org/euclid.aos/1594972841_20200717040034 Fri, 17 Jul 2020 04:00 EDT
Nonparametric regression using deep neural networks with ReLU activation function
https://projecteuclid.org/euclid.aos/1597370649
<strong>Johannes Schmidt-Hieber</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 4, 1875--1897.</p><p><strong>Abstract:</strong><br/>
Consider the multivariate nonparametric regression model. It is shown that estimators based on sparsely connected deep neural networks with ReLU activation function and properly chosen network architecture achieve the minimax rates of convergence (up to $\log n$-factors) under a general composition assumption on the regression function. The framework includes many well-studied structural constraints such as (generalized) additive models. While there is a lot of flexibility in the network architecture, the tuning parameter is the sparsity of the network. Specifically, we consider large networks with number of potential network parameters exceeding the sample size. The analysis gives some insights into why multilayer feedforward neural networks perform well in practice. Interestingly, for ReLU activation function the depth (number of layers) of the neural network architectures plays an important role, and our theory suggests that for nonparametric regression, scaling the network depth with the sample size is natural. It is also shown that under the composition assumption wavelet estimators can only achieve suboptimal rates.
</p>projecteuclid.org/euclid.aos/1597370649_20200813220442 Thu, 13 Aug 2020 22:04 EDT
Discussion of: “Nonparametric regression using deep neural networks with ReLU activation function”
https://projecteuclid.org/euclid.aos/1597370650
<strong>Behrooz Ghorbani</strong>, <strong>Song Mei</strong>, <strong>Theodor Misiakiewicz</strong>, <strong>Andrea Montanari</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 4, 1898--1901.</p>projecteuclid.org/euclid.aos/1597370650_20200813220442 Thu, 13 Aug 2020 22:04 EDT
Discussion of: “Nonparametric regression using deep neural networks with ReLU activation function”
https://projecteuclid.org/euclid.aos/1597370651
<strong>Gitta Kutyniok</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 4, 1902--1905.</p><p><strong>Abstract:</strong><br/>
I would like to congratulate Johannes Schmidt-Hieber on a very interesting paper in which he considers regression functions belonging to the class of so-called compositional functions and analyzes the ability of estimators based on deep neural networks, in the multivariate nonparametric regression model, to achieve minimax rates of convergence.
In my discussion, I will first consider this type of result from the general viewpoint of the theoretical foundations of deep neural networks. This will be followed by a discussion from the viewpoint of expressivity, optimization and generalization. Finally, I will consider some specific aspects of the main result.
</p>projecteuclid.org/euclid.aos/1597370651_20200813220442Thu, 13 Aug 2020 22:04 EDTDiscussion of: “Nonparametric regression using deep neural networks with ReLU activation function”https://projecteuclid.org/euclid.aos/1597370655<strong>Michael Kohler</strong>, <strong>Sophie Langer</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 4, 1906--1910.</p>projecteuclid.org/euclid.aos/1597370655_20200813220442Thu, 13 Aug 2020 22:04 EDTDiscussion of: “Nonparametric regression using deep neural networks with ReLU activation function”https://projecteuclid.org/euclid.aos/1597370656<strong>Ohad Shamir</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 4, 1911--1915.</p>projecteuclid.org/euclid.aos/1597370656_20200813220442Thu, 13 Aug 2020 22:04 EDTRejoinder: “Nonparametric regression using deep neural networks with ReLU activation function”https://projecteuclid.org/euclid.aos/1597370657<strong>Johannes Schmidt-Hieber</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 4, 1916--1921.</p>projecteuclid.org/euclid.aos/1597370657_20200813220442Thu, 13 Aug 2020 22:04 EDTNonclassical Berry–Esseen inequalities and accuracy of the bootstraphttps://projecteuclid.org/euclid.aos/1597370658<strong>Mayya Zhilova</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 4, 1922--1939.</p><p><strong>Abstract:</strong><br/>
We study the accuracy of bootstrap procedures for estimating quantiles of a smooth function of a sum of independent sub-Gaussian random vectors. We establish higher-order approximation bounds with error terms that depend explicitly on the sample size and the dimension. These results lead to improvements in the accuracy of a weighted bootstrap procedure for general log-likelihood ratio statistics. The key element of our proofs of bootstrap accuracy is a multivariate higher-order Berry–Esseen inequality. We consider the problem of approximating the distributions of two sums of zero-mean independent random vectors, such that summands with the same indices have equal moments up to at least the second order. The derived approximation bound is uniform over the set of all Euclidean balls. The presented approach extends classical Berry–Esseen type inequalities to higher-order approximation bounds. The theoretical results are illustrated with numerical experiments.
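A minimal sketch of a weighted bootstrap for a smooth function of a sample mean. Standard-exponential multipliers are one common choice of weights; the paper covers a more general weighting scheme, and all names below are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def weighted_bootstrap(X, f, n_boot=2000, rng=rng):
    """Weighted-bootstrap replicates of f(sample mean of X): each replicate
    reweights the n observations with i.i.d. standard-exponential
    multipliers, normalised to sum to one."""
    n = X.shape[0]
    stats = np.empty(n_boot)
    for b in range(n_boot):
        w = rng.exponential(1.0, size=n)
        w = w / w.sum()
        stats[b] = f(w @ X)   # weighted mean pushed through the smooth map f
    return stats
```

Quantiles of the returned replicates then serve as estimates of the quantiles of the statistic, e.g. `np.quantile(weighted_bootstrap(X, f), 0.95)`.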
</p>projecteuclid.org/euclid.aos/1597370658_20200813220442Thu, 13 Aug 2020 22:04 EDTOn the validity of the formal Edgeworth expansion for posterior densitieshttps://projecteuclid.org/euclid.aos/1597370659<strong>John E. Kolassa</strong>, <strong>Todd A. Kuffner</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 4, 1940--1958.</p><p><strong>Abstract:</strong><br/>
We consider a fundamental open problem in parametric Bayesian theory, namely the validity of the formal Edgeworth expansion of the posterior density. While the study of valid asymptotic expansions for posterior distributions constitutes a rich literature, the validity of the formal Edgeworth expansion has not been rigorously established. Several authors have claimed connections of various posterior expansions with the classical Edgeworth expansion, or have simply assumed its validity. Our main result settles this open problem. We also prove a lemma concerning the order of posterior cumulants which is of independent interest in Bayesian parametric theory. The most relevant literature is synthesized and compared to the newly derived Edgeworth expansions. Numerical investigations illustrate that our expansion has the behavior expected of an Edgeworth expansion, and that it performs better than the one existing expansion previously claimed to be of Edgeworth type.
</p>projecteuclid.org/euclid.aos/1597370659_20200813220442Thu, 13 Aug 2020 22:04 EDTModel selection for high-dimensional linear regression with dependent observationshttps://projecteuclid.org/euclid.aos/1597370660<strong>Ching-Kang Ing</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 4, 1959--1980.</p><p><strong>Abstract:</strong><br/>
We investigate the prediction capability of the orthogonal greedy algorithm (OGA) in high-dimensional regression models with dependent observations. The rates of convergence of the prediction error of OGA are obtained under a variety of sparsity conditions. To prevent OGA from overfitting, we introduce a high-dimensional Akaike’s information criterion (HDAIC) to determine the number of OGA iterations. A key contribution of this work is to show that OGA, used in conjunction with HDAIC, can achieve the optimal convergence rate without knowledge of how sparse the underlying high-dimensional model is.
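The two ingredients can be sketched as follows. The penalty form and constant in `hdaic_select` are a simplified, hypothetical stand-in for the paper's HDAIC, not its exact definition:

```python
import numpy as np

def oga_path(X, y, max_iter):
    """Orthogonal greedy algorithm: at each iteration add the column most
    correlated with the current residual, then refit by least squares on the
    active set. Returns the active set and the residual-sum-of-squares path."""
    active, rss_path = [], []
    residual = y.copy()
    for _ in range(max_iter):
        j = int(np.argmax(np.abs(X.T @ residual)))
        if j in active:   # residual is already orthogonal to active columns
            break
        active.append(j)
        beta, *_ = np.linalg.lstsq(X[:, active], y, rcond=None)
        residual = y - X[:, active] @ beta
        rss_path.append(float(residual @ residual))
    return active, np.array(rss_path)

def hdaic_select(rss_path, n, p, c=4.0):
    """Illustrative HDAIC-style rule: penalise RSS/n multiplicatively by
    c*k*log(p)/n and keep the minimising number of OGA iterations."""
    k = np.arange(1, len(rss_path) + 1)
    crit = (rss_path / n) * (1.0 + c * k * np.log(p) / n)
    return int(np.argmin(crit)) + 1
```

The point of the pairing is that the stopping rule involves only $n$, $p$ and the RSS path, not the unknown sparsity of the true model.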
</p>projecteuclid.org/euclid.aos/1597370660_20200813220442Thu, 13 Aug 2020 22:04 EDTOptimal estimation of Gaussian mixtures via denoised method of momentshttps://projecteuclid.org/euclid.aos/1597370661<strong>Yihong Wu</strong>, <strong>Pengkun Yang</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 4, 1981--2007.</p><p><strong>Abstract:</strong><br/>
The method of moments (Philos. Trans. R. Soc. Lond. Ser. A 185 (1894) 71–110) is one of the most widely used methods in statistics for parameter estimation, by solving the system of equations that matches the population and sample moments. However, in practice and especially for the important case of mixture models, one frequently needs to contend with the difficulties of non-existence or nonuniqueness of statistically meaningful solutions, as well as the high computational cost of solving large polynomial systems. Moreover, theoretical analyses of the method of moments are mainly confined to asymptotic-normality-type results established under strong assumptions.
This paper considers estimating a $k$-component Gaussian location mixture with a common (possibly unknown) variance parameter. To overcome the aforementioned theoretical and algorithmic hurdles, a crucial step is to denoise the moment estimates by projecting to the truncated moment space (via semidefinite programming) before solving the method of moments equations. Not only does this regularization ensure existence and uniqueness of solutions, it also yields fast solvers by means of Gauss quadrature. Furthermore, by proving new moment comparison theorems in the Wasserstein distance via polynomial interpolation and majorization techniques, we establish the statistical guarantees and adaptive optimality of the proposed procedure, as well as an oracle inequality in misspecified models. These results can also be viewed as provable algorithms for the generalized method of moments (Econometrica 50 (1982) 1029–1054), which involves nonconvex optimization and lacks theoretical guarantees.
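The quadrature step can be illustrated for two atoms: the atoms of the mixing distribution are the roots of the degree-two orthogonal polynomial, obtained from a Hankel linear system in the moments. The SDP denoising step is omitted here, and the moments below are assumed already denoised:

```python
import numpy as np

def two_atom_quadrature(moments):
    """Given moments (m0, m1, m2, m3) of a 2-atom mixing distribution,
    recover atoms and weights by Gauss quadrature: solve the Hankel system
    for the monic orthogonal polynomial x^2 + c1 x + c0, take its roots as
    atoms, then solve a Vandermonde system for the weights."""
    m0, m1, m2, m3 = moments
    H = np.array([[m0, m1], [m1, m2]])
    c = np.linalg.solve(H, -np.array([m2, m3]))   # c = (c0, c1)
    atoms = np.sort(np.roots([1.0, c[1], c[0]]))
    V = np.vstack([np.ones(2), atoms])            # Vandermonde in the atoms
    weights = np.linalg.solve(V, np.array([m0, m1]))
    return atoms, weights
```

With noiseless moments of a genuine two-atom distribution the recovery is exact, which is the existence-and-uniqueness point the abstract makes; without the denoising projection, raw empirical moments need not lie in the moment space and the Hankel system can be ill-posed.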
</p>projecteuclid.org/euclid.aos/1597370661_20200813220442Thu, 13 Aug 2020 22:04 EDTSharp instruments for classifying compliers and generalizing causal effectshttps://projecteuclid.org/euclid.aos/1597370662<strong>Edward H. Kennedy</strong>, <strong>Sivaraman Balakrishnan</strong>, <strong>Max G’Sell</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 4, 2008--2030.</p><p><strong>Abstract:</strong><br/>
It is well known that, without restricting treatment effect heterogeneity, instrumental variable (IV) methods only identify “local” effects among compliers, that is, those subjects who take treatment only when encouraged by the IV. Local effects are controversial since they seem to only apply to an unidentified subgroup; this has led many to denounce these effects as having little policy relevance. However, we show that such pessimism is not always warranted: it can be possible to accurately predict who compliers are, and obtain tight bounds on more generalizable effects in identifiable subgroups. We propose methods for doing so and study estimation error and asymptotic properties, showing that these tasks can sometimes be accomplished even with very weak IVs. We go on to introduce a new measure of IV quality called “sharpness,” which reflects the variation in compliance explained by covariates, and captures how well one can identify compliers and obtain tight bounds on identifiable subgroup effects. We develop an estimator of sharpness and show that it is asymptotically efficient under weak conditions. Finally, we explore finite-sample properties via simulation, and apply the methods to study canvassing effects on voter turnout. We propose that sharpness should be presented alongside strength to assess IV quality.
</p>projecteuclid.org/euclid.aos/1597370662_20200813220442Thu, 13 Aug 2020 22:04 EDTEmpirical risk minimization and complexity of dynamical modelshttps://projecteuclid.org/euclid.aos/1597370663<strong>Kevin McGoff</strong>, <strong>Andrew B. Nobel</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 4, 2031--2054.</p><p><strong>Abstract:</strong><br/>
A dynamical model consists of a continuous self-map $T:\mathcal{X}\to \mathcal{X}$ of a compact state space $\mathcal{X}$ and a continuous observation function $f:\mathcal{X}\to \mathbb{R}$. This paper considers the fitting of a parametrized family of dynamical models to an observed real-valued stochastic process using empirical risk minimization. The limiting behavior of the minimum risk parameters is studied in a general setting. We establish a general convergence theorem for minimum risk estimators and ergodic observations. We then study conditions under which empirical risk minimization can effectively separate signal from noise in an additive observational noise model. The key condition in the latter results is that the family of dynamical models has limited complexity, which is quantified through a notion of entropy for families of infinite sequences that connects covering number based entropies with topological entropy studied in dynamical systems. We establish close connections between entropy and limiting average mean widths for stationary processes, and discuss several examples of dynamical models.
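As a toy illustration of fitting a dynamical model by risk minimization, one can grid-search the parameter of a noisy logistic map. The one-step-ahead least-squares criterion below is a deliberate simplification of the paper's trajectory-level empirical risk, and all names are our own:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(theta=3.7, n=500, noise=1e-3, x0=0.3):
    """Noisy observations of a logistic-map trajectory
    x_{t+1} = theta * x_t * (1 - x_t), an additive observational noise model."""
    x = np.empty(n)
    x[0] = x0
    for t in range(n - 1):
        x[t + 1] = theta * x[t] * (1.0 - x[t])
    return x + rng.normal(0.0, noise, size=n)

def erm_fit(y, grid):
    """Minimise the empirical one-step prediction risk over a parameter grid,
    a simplified stand-in for empirical risk minimisation over trajectories."""
    risks = [np.mean((y[1:] - th * y[:-1] * (1.0 - y[:-1])) ** 2)
             for th in grid]
    return grid[int(np.argmin(risks))]
```

The parametrized family here is one-dimensional, so its complexity (in the sense quantified by the paper's entropy notion) is trivially limited, and the signal is recovered from small additive noise.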
</p>projecteuclid.org/euclid.aos/1597370663_20200813220442Thu, 13 Aug 2020 22:04 EDTAdaptive estimation in structured factor models with applications to overlapping clusteringhttps://projecteuclid.org/euclid.aos/1597370664<strong>Xin Bing</strong>, <strong>Florentina Bunea</strong>, <strong>Yang Ning</strong>, <strong>Marten Wegkamp</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 4, 2055--2081.</p><p><strong>Abstract:</strong><br/>
This work introduces a novel estimation method, called LOVE, of the entries and structure of a loading matrix $A$ in a latent factor model $X=AZ+E$, for an observable random vector $X\in \mathbb{R}^{p}$, with correlated unobservable factors $Z\in \mathbb{R}^{K}$, with $K$ unknown, and uncorrelated noise $E$. Each row of $A$ is scaled, and allowed to be sparse. In order to identify the loading matrix $A$, we require the existence of pure variables, which are components of $X$ that are associated, via $A$, with one and only one latent factor. Despite the fact that the number of factors $K$, the number of the pure variables and their location are all unknown, we only require a mild condition on the covariance matrix of $Z$, and a minimum of only two pure variables per latent factor to show that $A$ is uniquely defined, up to signed permutations. Our proofs for model identifiability are constructive, and lead to our novel estimation method of the number of factors and of the set of pure variables, from a sample of size $n$ of observations on $X$. This is the first step of our LOVE algorithm, which is optimization-free, and has low computational complexity of order $p^{2}$. The second step of LOVE is an easily implementable linear program that estimates $A$. We prove that the resulting estimator is near minimax rate optimal for $A$, with respect to the $\|\cdot \|_{\infty ,q}$ loss, for $q\geq 1$, up to logarithmic factors in $p$, and that it can be minimax-rate optimal in many cases of interest.
The model structure is motivated by the problem of overlapping variable clustering, ubiquitous in data science. We define the population level clusters as groups of those components of $X$ that are associated, via the matrix $A$, with the same unobservable latent factor, and multifactor association is allowed. Clusters are respectively anchored by the pure variables, and form overlapping subgroups of the $p$-dimensional random vector $X$. The Latent model approach to OVErlapping clustering is reflected in the name of our algorithm, LOVE.
The third step of LOVE estimates the clusters from the support of the columns of the estimated $A$. We guarantee cluster recovery with zero false positive proportion, and with false negative proportion control. The practical relevance of LOVE is illustrated through the analysis of a RNA-seq data set, devoted to determining the functional annotation of genes with unknown function.
</p>projecteuclid.org/euclid.aos/1597370664_20200813220442Thu, 13 Aug 2020 22:04 EDTPartial identifiability of restricted latent class modelshttps://projecteuclid.org/euclid.aos/1597370665<strong>Yuqi Gu</strong>, <strong>Gongjun Xu</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 4, 2082--2107.</p><p><strong>Abstract:</strong><br/>
Latent class models have wide applications in social and biological sciences. In many applications, prespecified restrictions are imposed on the parameter space of latent class models, through a design matrix, to reflect practitioners’ assumptions about how the observed responses depend on subjects’ latent traits. Though widely used in various fields, such restricted latent class models suffer from nonidentifiability due to their discrete nature and the complex structure of the restrictions. This work addresses the fundamental identifiability issue of restricted latent class models by developing a general framework for strict and partial identifiability of the model parameters. Under correct model specification, the developed identifiability conditions only depend on the design matrix and are easily checkable, which provides useful practical guidelines for designing statistically valid diagnostic tests. Furthermore, the new theoretical framework is applied to establish, for the first time, identifiability of several designs from cognitive diagnosis applications.
</p>projecteuclid.org/euclid.aos/1597370665_20200813220442Thu, 13 Aug 2020 22:04 EDTPosterior concentration for Bayesian regression trees and forestshttps://projecteuclid.org/euclid.aos/1597370666<strong>Veronika Ročková</strong>, <strong>Stéphanie van der Pas</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 4, 2108--2131.</p><p><strong>Abstract:</strong><br/>
Since their inception in the 1980s, regression trees have been one of the more widely used nonparametric prediction methods. Tree-structured methods yield a histogram reconstruction of the regression surface, where the bins correspond to terminal nodes of recursive partitioning. Trees are powerful, yet susceptible to overfitting. Strategies against overfitting have traditionally relied on pruning greedily grown trees. The Bayesian framework offers an alternative remedy against overfitting through priors. Roughly speaking, a good prior charges smaller trees where overfitting does not occur. While the consistency of random histograms, trees and their ensembles has been studied quite extensively, the theoretical understanding of the Bayesian counterparts has been missing. In this paper, we take a step toward understanding why and when Bayesian trees and forests do not overfit. To address this question, we study the speed at which the posterior concentrates around the true smooth regression function. We propose a spike-and-tree variant of the popular Bayesian CART prior and establish new theoretical results showing that regression trees (and forests) (a) are capable of recovering smooth regression surfaces (with smoothness not exceeding one), achieving optimal rates up to a log factor, (b) can adapt to the unknown level of smoothness and (c) can perform effective dimension reduction when $p>n$. These results provide a piece of missing theoretical evidence explaining why Bayesian trees (and additive variants thereof) have worked so well in practice.
</p>projecteuclid.org/euclid.aos/1597370666_20200813220442Thu, 13 Aug 2020 22:04 EDTDouble-slicing assisted sufficient dimension reduction for high-dimensional censored datahttps://projecteuclid.org/euclid.aos/1597370667<strong>Shanshan Ding</strong>, <strong>Wei Qian</strong>, <strong>Lan Wang</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 4, 2132--2154.</p><p><strong>Abstract:</strong><br/>
This paper provides a unified framework and an efficient algorithm for analyzing high-dimensional survival data under weak modeling assumptions. In particular, it imposes neither parametric distributional assumption nor linear regression assumption. It only assumes that the survival time $T$ depends on a high-dimensional covariate vector $\mathbf{X}$ through low-dimensional linear combinations of covariates $\Gamma ^{T}\mathbf{X}$. The censoring time is allowed to be conditionally independent of the survival time given the covariates. This general framework includes many popular parametric and semiparametric survival regression models as special cases. The proposed algorithm produces a number of practically useful outputs with theoretical guarantees, including a consistent estimate of the sufficient dimension reduction subspace of $T\mid \mathbf{X}$, a uniformly consistent Kaplan–Meier-type estimator of the conditional distribution function of $T$ and a consistent estimator of the conditional quantile survival time. Our asymptotic results significantly extend the classical theory of sufficient dimension reduction for censored data (particularly that of Li, Wang and Chen in Ann. Statist. 27 (1999) 1–23) and the celebrated nonparametric Kaplan–Meier estimator to the setting where the number of covariates $p$ diverges exponentially fast with the sample size $n$. We demonstrate the promising performance of the proposed new estimators through simulations and a real data example.
</p>projecteuclid.org/euclid.aos/1597370667_20200813220442Thu, 13 Aug 2020 22:04 EDTAsymptotic frequentist coverage properties of Bayesian credible sets for sieve priorshttps://projecteuclid.org/euclid.aos/1597370668<strong>Judith Rousseau</strong>, <strong>Botond Szabo</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 4, 2155--2179.</p><p><strong>Abstract:</strong><br/>
We investigate the frequentist coverage properties of (certain) Bayesian credible sets in a general, adaptive, nonparametric framework. It is well known that the construction of adaptive and honest confidence sets is not possible in general. To overcome this problem (in the context of sieve-type priors), we introduce an extra assumption on the functional parameters, the so-called “general polished tail” condition. We then show that under standard assumptions, both the hierarchical and empirical Bayes methods result in honest confidence sets for sieve-type priors in general settings, and we characterize their size. We apply the derived abstract results to various examples, including the nonparametric regression model, density estimation using exponential families of priors, density estimation using histogram priors and the nonparametric classification model, for which we show that their size is near minimax adaptive with respect to the considered specific pseudometrics.
</p>projecteuclid.org/euclid.aos/1597370668_20200813220442Thu, 13 Aug 2020 22:04 EDTConvergence rates of variational posterior distributionshttps://projecteuclid.org/euclid.aos/1597370669<strong>Fengshuo Zhang</strong>, <strong>Chao Gao</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 4, 2180--2207.</p><p><strong>Abstract:</strong><br/>
We study convergence rates of variational posterior distributions for nonparametric and high-dimensional inference. We formulate general conditions on prior, likelihood and variational class that characterize the convergence rates. Under similar “prior mass and testing” conditions considered in the literature, the rate is found to be the sum of two terms. The first term stands for the convergence rate of the true posterior distribution, and the second term is contributed by the variational approximation error. For a class of priors that admit the structure of a mixture of product measures, we propose a novel prior mass condition, under which the variational approximation error of the mean-field class is dominated by the convergence rate of the true posterior. We demonstrate the applicability of our general results for various models, prior distributions and variational classes by deriving convergence rates of the corresponding variational posteriors.
</p>projecteuclid.org/euclid.aos/1597370669_20200813220442Thu, 13 Aug 2020 22:04 EDTTwo-sample hypothesis testing for inhomogeneous random graphshttps://projecteuclid.org/euclid.aos/1597370670<strong>Debarghya Ghoshdastidar</strong>, <strong>Maurilio Gutzeit</strong>, <strong>Alexandra Carpentier</strong>, <strong>Ulrike von Luxburg</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 4, 2208--2229.</p><p><strong>Abstract:</strong><br/>
The study of networks leads to a wide range of high-dimensional inference problems. In many practical applications, one needs to draw inference from one or few large sparse networks. The present paper studies hypothesis testing of graphs in this high-dimensional regime, where the goal is to test between two populations of inhomogeneous random graphs defined on the same set of $n$ vertices. The size of each population $m$ is much smaller than $n$, and can even be a constant as small as 1. The critical question in this context is whether the problem is solvable for small $m$.
We answer this question from a minimax testing perspective. Let $P$, $Q$ be the population adjacencies of two sparse inhomogeneous random graph models, and $d$ be a suitably defined distance function. Given a population of $m$ graphs from each model, we derive minimax separation rates for the problem of testing $P=Q$ against $d(P,Q)>\rho $. We observe that if $m$ is small, then the minimax separation is too large for some popular choices of $d$, including total variation distance between corresponding distributions. This implies that some models that are widely separated in $d$ cannot be distinguished for small $m$, and hence, the testing problem is generally not solvable in these cases.
We also show that if $m>1$, then the minimax separation is relatively small if $d$ is the Frobenius norm or operator norm distance between $P$ and $Q$. For $m=1$, only the latter distance provides small minimax separation. Thus, for these distances, the problem is solvable for small $m$. We also present near-optimal two-sample tests in both cases, where tests are adaptive with respect to sparsity level of the graphs.
</p>projecteuclid.org/euclid.aos/1597370670_20200813220442Thu, 13 Aug 2020 22:04 EDTBeyond HC: More sensitive tests for rare/weak alternativeshttps://projecteuclid.org/euclid.aos/1597370671<strong>Thomas Porter</strong>, <strong>Michael Stewart</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 4, 2230--2252.</p><p><strong>Abstract:</strong><br/>
Higher criticism (HC) is a popular method for large-scale inference problems based on identifying unusually high proportions of small $p$-values. It has been shown to enjoy a lower-order optimality property in a simple normal location mixture model which is shared by the ‘tailor-made’ parametric generalised likelihood ratio test (GLRT) for the same model; however, HC has also been shown to perform well outside this ‘narrow’ model.
We develop a higher-order framework for analysing the power of these and similar procedures, which reveals the perhaps unsurprising fact that the GLRT enjoys an edge in power over HC for the normal location mixture model. We also identify a similar parametric mixture model to which HC is similarly ‘tailor-made’ and show that the situation is (at least partly) reversed there. We also show that in the normal location mixture model a procedure based on the empirical moment-generating function enjoys the same local power properties as the GLRT and may be recommended as an easy-to-implement (and easy-to-interpret) complementary procedure to HC. Some other practical advice regarding the implementation of these procedures is provided. Finally, we provide some simulation results to help interpret our theoretical findings.
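The baseline HC statistic discussed above is easy to compute from sorted p-values. This is the standard form of higher criticism, not the new procedures proposed in the paper:

```python
import numpy as np

def higher_criticism(pvals, alpha0=0.5):
    """Higher-criticism statistic: the maximal standardised exceedance of the
    empirical p-value CDF over the uniform CDF, taken over the smallest
    alpha0-fraction of the order statistics. Assumes p-values lie strictly
    in (0, 1)."""
    p = np.sort(np.asarray(pvals, dtype=float))
    n = p.size
    i = np.arange(1, n + 1)
    hc = np.sqrt(n) * (i / n - p) / np.sqrt(p * (1.0 - p))
    keep = i <= max(1, int(alpha0 * n))
    return float(np.max(hc[keep]))
```

Under the global null of uniform p-values the statistic grows very slowly, while an unusually high proportion of small p-values inflates it, which is the detection mechanism HC exploits.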
</p>projecteuclid.org/euclid.aos/1597370671_20200813220442Thu, 13 Aug 2020 22:04 EDTMinimax optimal rates for Mondrian trees and forestshttps://projecteuclid.org/euclid.aos/1597370672<strong>Jaouad Mourtada</strong>, <strong>Stéphane Gaïffas</strong>, <strong>Erwan Scornet</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 4, 2253--2276.</p><p><strong>Abstract:</strong><br/>
Introduced by Breiman (Mach. Learn. 45 (2001) 5–32), Random Forests are widely used classification and regression algorithms. While initially designed as batch algorithms, several variants have been proposed to handle online learning. One particular instance of such forests is the Mondrian forest (In Adv. Neural Inf. Process. Syst. (2014) 3140–3148; In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS) (2016)), whose trees are built using the so-called Mondrian process, which makes it easy to update their construction in a streaming fashion. In this paper we provide a thorough theoretical study of Mondrian forests in a batch learning setting, based on new results about Mondrian partitions. Our results include consistency and convergence rates for Mondrian trees and forests, which turn out to be minimax optimal on the set of $s$-Hölder functions with $s\in (0,1]$ (for trees and forests) and $s\in (1,2]$ (for forests only), assuming a proper tuning of their complexity parameter in both cases. Furthermore, we prove that an adaptive procedure (to the unknown $s\in (0,2]$) can be constructed by combining Mondrian forests with a standard model aggregation algorithm. These results are the first to demonstrate that some particular random forests achieve minimax rates in arbitrary dimension. Owing to their remarkably simple distributional properties, which lead to minimax rates, Mondrian trees are a promising basis for more sophisticated yet theoretically sound random forest variants.
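A one-dimensional Mondrian partition, the building block of a Mondrian tree, can be sampled in a few lines. The paper works in general dimension; the lifetime parameter below plays the role of the complexity parameter mentioned in the abstract, and the names are ours:

```python
import numpy as np

rng = np.random.default_rng(2)

def mondrian_1d(a, b, budget, rng=rng):
    """Leaf cells of a one-dimensional Mondrian process on [a, b] with the
    given lifetime: a cell of length L is split after an Exp(L)-distributed
    time (mean 1/L) at a uniform location, and recursion stops once the
    remaining budget is exhausted. Cells are returned left to right."""
    cost = rng.exponential(1.0 / (b - a))
    if cost > budget:
        return [(a, b)]
    split = rng.uniform(a, b)
    return (mondrian_1d(a, split, budget - cost, rng)
            + mondrian_1d(split, b, budget - cost, rng))
```

Larger lifetimes give finer partitions, which is why tuning this single parameter with the sample size controls the bias-variance trade-off of the resulting tree estimator.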
</p>projecteuclid.org/euclid.aos/1597370672_20200813220442Thu, 13 Aug 2020 22:04 EDTIdentifiability of nonparametric mixture models and Bayes optimal clusteringhttps://projecteuclid.org/euclid.aos/1597370673<strong>Bryon Aragam</strong>, <strong>Chen Dan</strong>, <strong>Eric P. Xing</strong>, <strong>Pradeep Ravikumar</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 4, 2277--2302.</p><p><strong>Abstract:</strong><br/>
Motivated by problems in data clustering, we establish general conditions under which families of nonparametric mixture models are identifiable by introducing a novel framework involving clustering overfitted parametric (i.e., misspecified) mixture models. These identifiability conditions generalize existing conditions in the literature and are flexible enough to include, for example, mixtures of infinite Gaussian mixtures. In contrast to the recent literature, we allow for general nonparametric mixture components and instead impose regularity assumptions on the underlying mixing measure. As our primary application we apply these results to partition-based clustering, generalizing the notion of a Bayes optimal partition from classical parametric model-based clustering to nonparametric settings. Furthermore, this framework is constructive, so that it yields a practical algorithm for learning identified mixtures, which is illustrated through several examples on real data. The key conceptual device in the analysis is the convex, metric geometry of probability measures on metric spaces and its connection to the Wasserstein convergence of mixing measures. The result is a flexible framework for nonparametric clustering with formal consistency guarantees.
</p>projecteuclid.org/euclid.aos/1597370673_20200813220442Thu, 13 Aug 2020 22:04 EDTA test for separability in covariance operators of random surfaceshttps://projecteuclid.org/euclid.aos/1597370674<strong>Pramita Bagchi</strong>, <strong>Holger Dette</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 4, 2303--2322.</p><p><strong>Abstract:</strong><br/>
The assumption of separability is a simplifying and very popular assumption in the analysis of spatiotemporal or hypersurface data structures. It is often made in situations where the covariance structure cannot be easily estimated, for example, because of a small sample size or because of computational storage problems. In this paper we propose a new and very simple test to validate this assumption. Our approach is based on a measure of separability which is zero in the case of separability and positive otherwise. We derive the asymptotic distribution of a corresponding estimate under the null hypothesis and the alternative and develop an asymptotic and a bootstrap test which are very easy to implement. In particular, our approach requires neither projections on subspaces generated by the eigenfunctions of the covariance operator nor distributional assumptions, as recently used by (Ann. Statist. 45 (2017) 1431–1461) and (Biometrika 104 425–437) to construct tests for separability. We investigate the finite sample performance by means of a simulation study and also provide a comparison with the currently available methodology. Finally, the new procedure is illustrated by analyzing a data example.
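One concrete separability measure in finite dimensions, given here as an illustration rather than the paper's estimator, is the squared Frobenius distance from a covariance matrix to its nearest Kronecker product, computable through a rank-one approximation of a rearranged matrix (Van Loan's rearrangement):

```python
import numpy as np

def separability_gap(C, p, q):
    """Squared Frobenius distance from C (a pq x pq covariance matrix) to
    its best separable approximation A kron B. Rearranging the q x q blocks
    of C into the rows of a p^2 x q^2 matrix turns the nearest-Kronecker
    problem into a nearest-rank-one problem, so the gap is the energy in all
    singular values beyond the first. Zero iff C is exactly separable."""
    R = np.array([C[i * q:(i + 1) * q, j * q:(j + 1) * q].ravel()
                  for i in range(p) for j in range(p)])   # shape (p^2, q^2)
    s = np.linalg.svd(R, compute_uv=False)
    return float(np.sum(s[1:] ** 2))
```

A test of separability can then be built by comparing an estimate of such a gap to its null distribution, which is the general shape of the approach described above (the paper's functional-data measure and limit theory differ).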
</p>projecteuclid.org/euclid.aos/1597370674_20200813220442Thu, 13 Aug 2020 22:04 EDTA general approach for cure models in survival analysishttps://projecteuclid.org/euclid.aos/1597370675<strong>Valentin Patilea</strong>, <strong>Ingrid Van Keilegom</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 4, 2323--2346.</p><p><strong>Abstract:</strong><br/>
In survival analysis it often happens that some subjects under study do not experience the event of interest; they are considered to be “cured.” The population is thus a mixture of two subpopulations, one of cured subjects and one of “susceptible” subjects. We propose a novel approach to estimate a mixture cure model when covariates are present and the lifetime is subject to random right censoring. We work with a parametric model for the cure proportion, while the conditional survival function of the uncured subjects is unspecified. The approach is based on an inversion which allows us to write the survival function as a function of the distribution of the observable variables. This leads to a very general class of models which allows a flexible and rich modeling of the conditional survival function. We show the identifiability of the proposed model as well as the consistency and the asymptotic normality of the model parameters. We also consider in more detail the case where kernel estimators are used for the nonparametric part of the model. The new estimators are compared with the estimators from a Cox mixture cure model via simulations. Finally, we apply the new model on a medical data set.
</p>projecteuclid.org/euclid.aos/1597370675_20200813220442Thu, 13 Aug 2020 22:04 EDTAdaptive distributed methods under communication constraintshttps://projecteuclid.org/euclid.aos/1597370676<strong>Botond Szabó</strong>, <strong>Harry van Zanten</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 4, 2347--2380.</p><p><strong>Abstract:</strong><br/>
We study estimation methods under communication constraints in a distributed version of the nonparametric random design regression model. We derive minimax lower bounds and exhibit methods that attain those bounds. Moreover, we show that adaptive estimation is possible in this setting.
</p>projecteuclid.org/euclid.aos/1597370676_20200813220442Thu, 13 Aug 2020 22:04 EDTBayesian analysis of the covariance matrix of a multivariate normal distribution with a new class of priorshttps://projecteuclid.org/euclid.aos/1597370677<strong>James O. Berger</strong>, <strong>Dongchu Sun</strong>, <strong>Chengyuan Song</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 4, 2381--2403.</p><p><strong>Abstract:</strong><br/>
Bayesian analysis for the covariance matrix of a multivariate normal distribution has received a lot of attention in the last two decades. In this paper, we propose a new class of priors for the covariance matrix, including both inverse Wishart and reference priors as special cases. The main motivation for the new class is to have available priors—both subjective and objective—that do not “force eigenvalues apart,” which is a criticism of inverse Wishart and Jeffreys priors. Extensive comparison of these “shrinkage priors” with inverse Wishart and Jeffreys priors is undertaken, with the new priors seeming to have considerably better performance. A number of curious facts about the new priors are also observed, such as that the posterior distribution will be proper with just three vector observations from the multivariate normal distribution—regardless of the dimension of the covariance matrix—and that useful inference about features of the covariance matrix can be possible. Finally, a new MCMC algorithm is developed for this class of priors and is shown to be computationally effective for matrices of up to 100 dimensions.
</p>projecteuclid.org/euclid.aos/1597370677_20200813220442Thu, 13 Aug 2020 22:04 EDTExtending the validity of frequency domain bootstrap methods to general stationary processeshttps://projecteuclid.org/euclid.aos/1597370678<strong>Marco Meyer</strong>, <strong>Efstathios Paparoditis</strong>, <strong>Jens-Peter Kreiss</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 4, 2404--2427.</p><p><strong>Abstract:</strong><br/>
Existing frequency domain methods for bootstrapping time series have a limited range. Essentially, these procedures cover the case of linear time series with independent innovations, and some even require the time series to be Gaussian. In this paper we propose a new frequency domain bootstrap method—the hybrid periodogram bootstrap (HPB)—which is consistent for a much wider range of stationary, even nonlinear, processes and which can be applied to a large class of periodogram-based statistics. The HPB is designed to combine desirable features of different frequency domain techniques while overcoming their respective limitations. It is capable of imitating the weak dependence structure of the periodogram by invoking the concept of convolved subsampling in a novel way that is tailor-made for periodograms. We show consistency of the HPB procedure for a general class of stationary time series, ranging clearly beyond linear processes, and for spectral means and ratio statistics, on which we mainly focus. The finite sample performance of the new bootstrap procedure is illustrated via simulations.
</p>projecteuclid.org/euclid.aos/1597370678_20200813220442Thu, 13 Aug 2020 22:04 EDTMinimax estimation of large precision matrices with bandable Cholesky factorhttps://projecteuclid.org/euclid.aos/1597370679<strong>Yu Liu</strong>, <strong>Zhao Ren</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 4, 2428--2454.</p><p><strong>Abstract:</strong><br/>
The last decade has witnessed significant methodological and theoretical advances in estimating large precision matrices. In particular, there are scientific applications such as longitudinal data, meteorology and spectroscopy in which the ordering of the variables can be interpreted through a bandable structure on the Cholesky factor of the precision matrix. However, the minimax theory has remained largely unknown, as opposed to the well-established minimax results over the corresponding bandable covariance matrices. In this paper we focus on two commonly used types of parameter spaces and develop the optimal rates of convergence under both the operator norm and the Frobenius norm. A striking phenomenon is found. The two types of parameter spaces are fundamentally different under the operator norm but enjoy the same rate optimality under the Frobenius norm, which is in sharp contrast to the equivalence of the corresponding two types of bandable covariance matrices under both norms. This fundamental difference is established by carefully constructing the corresponding minimax lower bounds. Two new estimation procedures are developed. For the operator norm, our optimal procedure is based on a novel local cropping estimator, targeting all principal submatrices of the precision matrix, while for the Frobenius norm our optimal procedure relies on a delicate regression-based thresholding rule. Lepski’s method is considered to achieve optimal adaptation. We further establish rate optimality in the nonparanormal model. Numerical studies are carried out to confirm our theoretical findings.
</p>projecteuclid.org/euclid.aos/1597370679_20200813220442Thu, 13 Aug 2020 22:04 EDTEstimation and inference for precision matrices of nonstationary time serieshttps://projecteuclid.org/euclid.aos/1597370680<strong>Xiucai Ding</strong>, <strong>Zhou Zhou</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 4, 2455--2477.</p><p><strong>Abstract:</strong><br/>
We consider the estimation of and inference on precision matrices of a rich class of univariate locally stationary linear and nonlinear time series, assuming that only one realization of the time series is observed. Using a Cholesky decomposition technique, we show that the precision matrices can be directly estimated via a series of least squares linear regressions with smoothly time-varying coefficients. The method of sieves is utilized for the estimation and is shown to be optimally adaptive in terms of estimation accuracy and efficient in terms of computational complexity. We establish an asymptotic theory for a class of $\mathcal{L}^{2}$ tests based on the nonparametric sieve estimators. The latter are used for testing whether the precision matrices are diagonal or banded. A Gaussian approximation result is established for a wide class of quadratic forms of nonstationary and possibly nonlinear processes of diverging dimensions which is of interest by itself.
</p>projecteuclid.org/euclid.aos/1597370680_20200813220442Thu, 13 Aug 2020 22:04 EDTIsotropic covariance functions on graphs and their edgeshttps://projecteuclid.org/euclid.aos/1597370681<strong>Ethan Anderes</strong>, <strong>Jesper Møller</strong>, <strong>Jakob G. Rasmussen</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 4, 2478--2503.</p><p><strong>Abstract:</strong><br/>
We develop parametric classes of covariance functions on linear networks and their extension to graphs with Euclidean edges, that is, graphs with edges viewed as line segments or more general sets with a coordinate system allowing us to consider points on the graph which are vertices or points on an edge. Our covariance functions are defined on the vertices and edge points of these graphs and are isotropic in the sense that they depend only on the geodesic distance or on a new metric called the resistance metric (which extends the classical resistance metric developed in electrical network theory on the vertices of a graph to the continuum of edge points). We discuss the advantages of using the resistance metric in comparison with the geodesic metric as well as the restrictions these metrics impose on the investigated covariance functions. In particular, many of the commonly used isotropic covariance functions in the spatial statistics literature (the power exponential, Matérn, generalized Cauchy and Dagum classes) are shown to be valid with respect to the resistance metric for any graph with Euclidean edges, whilst they are only valid with respect to the geodesic metric in more special cases.
</p>projecteuclid.org/euclid.aos/1597370681_20200813220442Thu, 13 Aug 2020 22:04 EDTTesting for stationarity of functional time series in the frequency domainhttps://projecteuclid.org/euclid.aos/1600480922<strong>Alexander Aue</strong>, <strong>Anne van Delft</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 5, 2505--2547.</p><p><strong>Abstract:</strong><br/>
Interest in functional time series has spiked in the recent past with papers covering both methodology and applications being published at a much increased pace. This article contributes to the research in this area by proposing a new stationarity test for functional time series based on frequency domain methods. The proposed test statistic is based on joint dimension reduction via functional principal components analysis across the spectral density operators at all Fourier frequencies, explicitly allowing for frequency-dependent levels of truncation to adapt to the dynamics of the underlying functional time series. The properties of the test are derived both under the null hypothesis of stationary functional time series and under the smooth alternative of locally stationary functional time series. The methodology is theoretically justified through asymptotic results. Evidence from simulation studies and an application to annual temperature curves suggests that the test works well in finite samples.
</p>projecteuclid.org/euclid.aos/1600480922_20200918220221Fri, 18 Sep 2020 22:02 EDTOn spike and slab empirical Bayes multiple testinghttps://projecteuclid.org/euclid.aos/1600480923<strong>Ismaël Castillo</strong>, <strong>Étienne Roquain</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 5, 2548--2574.</p><p><strong>Abstract:</strong><br/>
This paper explores a connection between empirical Bayes posterior distributions and false discovery rate (FDR) control. In the Gaussian sequence model, this work shows that empirical Bayes-calibrated spike and slab posterior distributions allow correct FDR control under sparsity. In doing so, it offers a frequentist theoretical validation of empirical Bayes methods in the context of multiple testing. Our theoretical results are illustrated with numerical experiments.
</p>projecteuclid.org/euclid.aos/1600480923_20200918220221Fri, 18 Sep 2020 22:02 EDTTheoretical and computational guarantees of mean field variational inference for community detectionhttps://projecteuclid.org/euclid.aos/1600480924<strong>Anderson Y. Zhang</strong>, <strong>Harrison H. Zhou</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 5, 2575--2598.</p><p><strong>Abstract:</strong><br/>
The mean field variational Bayes method is becoming increasingly popular in statistics and machine learning. Its iterative coordinate ascent variational inference algorithm has been widely applied to large scale Bayesian inference. See Blei et al. (2017) for a recent comprehensive review. Despite the popularity of the mean field method, there exists remarkably little fundamental theoretical justification. To the best of our knowledge, the iterative algorithm has never been investigated for any high-dimensional and complex model. In this paper, we study the mean field method for community detection under the stochastic block model. For an iterative batch coordinate ascent variational inference algorithm, we show that it has a linear convergence rate and converges to the minimax rate within $\log n$ iterations. This complements the results of Bickel et al. (2013), which studied the global minimum of the mean field variational Bayes and obtained asymptotically normal estimation of global model parameters. In addition, we obtain similar optimality results for Gibbs sampling and an iterative procedure to calculate maximum likelihood estimation, which can be of independent interest.
</p>projecteuclid.org/euclid.aos/1600480924_20200918220221Fri, 18 Sep 2020 22:02 EDTMinimax optimal sequential hypothesis tests for Markov processeshttps://projecteuclid.org/euclid.aos/1600480925<strong>Michael Fauß</strong>, <strong>Abdelhak M. Zoubir</strong>, <strong>H. Vincent Poor</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 5, 2599--2621.</p><p><strong>Abstract:</strong><br/>
Under mild Markov assumptions, sufficient conditions for strict minimax optimality of sequential tests for multiple hypotheses under distributional uncertainty are derived. First, the design of optimal sequential tests for simple hypotheses is revisited, and it is shown that the partial derivatives of the corresponding cost function are closely related to the performance metrics of the underlying sequential test. Second, an implicit characterization of the least favorable distributions for a given testing policy is stated. By combining the results on optimal sequential tests and least favorable distributions, sufficient conditions for a sequential test to be minimax optimal under general distributional uncertainties are obtained. The cost function of the minimax optimal test is further identified as a generalized $f$-dissimilarity and the least favorable distributions as those that are most similar with respect to this dissimilarity. Numerical examples for minimax optimal sequential tests under different uncertainties illustrate the theoretical results.
</p>projecteuclid.org/euclid.aos/1600480925_20200918220221Fri, 18 Sep 2020 22:02 EDTTest of significance for high-dimensional longitudinal datahttps://projecteuclid.org/euclid.aos/1600480926<strong>Ethan X. Fang</strong>, <strong>Yang Ning</strong>, <strong>Runze Li</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 5, 2622--2645.</p><p><strong>Abstract:</strong><br/>
This paper concerns statistical inference for longitudinal data with ultrahigh dimensional covariates. We first study the problem of constructing confidence intervals and hypothesis tests for a low-dimensional parameter of interest. The major challenge is how to construct a powerful test statistic in the presence of high-dimensional nuisance parameters and sophisticated within-subject correlation of longitudinal data. To deal with the challenge, we propose a new quadratic decorrelated inference function approach which simultaneously removes the impact of nuisance parameters and incorporates the correlation to enhance the efficiency of the estimation procedure. When the parameter of interest is of fixed dimension, we prove that the proposed estimator is asymptotically normal and attains the semiparametric information bound, based on which we can construct an optimal Wald test statistic. We further extend this result and establish the limiting distribution of the estimator under the setting with the dimension of the parameter of interest growing with the sample size at a polynomial rate. Finally, we study how to control the false discovery rate (FDR) when a vector of high-dimensional regression parameters is of interest. We prove that applying the Storey ( J. R. Stat. Soc. Ser. B. Stat. Methodol. 64 (2002) 479–498) procedure to the proposed test statistics for each regression parameter controls FDR asymptotically in longitudinal data. We conduct simulation studies to assess the finite sample performance of the proposed procedures. Our simulation results imply that the newly proposed procedure can control both Type I error for testing a low dimensional parameter of interest and the FDR in the multiple testing problem. We also apply the proposed procedure to a real data example.
</p>projecteuclid.org/euclid.aos/1600480926_20200918220221Fri, 18 Sep 2020 22:02 EDTGeometrizing rates of convergence under local differential privacy constraintshttps://projecteuclid.org/euclid.aos/1600480927<strong>Angelika Rohde</strong>, <strong>Lukas Steinberger</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 5, 2646--2670.</p><p><strong>Abstract:</strong><br/>
We study the problem of estimating a functional $\theta ({\mathbb{P}})$ of an unknown probability distribution ${\mathbb{P}}\in {\mathcal{P}}$ in which the original iid sample $X_{1},\dots ,X_{n}$ is kept private even from the statistician via an $\alpha $-local differential privacy constraint. Let $\omega _{\mathrm{TV}}$ denote the modulus of continuity of the functional $\theta $ over ${\mathcal{P}}$ with respect to total variation distance. For a large class of loss functions $l$ and a fixed privacy level $\alpha $, we prove that the privatized minimax risk is equivalent to $l(\omega _{\mathrm{TV}}(n^{-1/2}))$ to within constants, under regularity conditions that are satisfied, in particular, if $\theta $ is linear and ${\mathcal{P}}$ is convex. Our results complement the theory developed by Donoho and Liu (1991) with the nowadays highly relevant case of privatized data. Somewhat surprisingly, the difficulty of the estimation problem in the private case is characterized by $\omega _{\mathrm{TV}}$, whereas it is characterized by the Hellinger modulus of continuity if the original data $X_{1},\dots ,X_{n}$ are available. We also find that for locally private estimation of linear functionals over a convex model a simple sample mean estimator, based on independently and binary privatized observations, always achieves the minimax rate. We further provide a general recipe for choosing the functional parameter in the optimal binary privatization mechanisms and illustrate the general theory in numerous examples. Our theory allows us to quantify the price to be paid for local differential privacy in a large class of estimation problems. This price appears to be highly problem specific.
</p>projecteuclid.org/euclid.aos/1600480927_20200918220221Fri, 18 Sep 2020 22:02 EDTAdditive regression with Hilbertian responseshttps://projecteuclid.org/euclid.aos/1600480928<strong>Jeong Min Jeon</strong>, <strong>Byeong U. Park</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 5, 2671--2697.</p><p><strong>Abstract:</strong><br/>
This paper develops a foundation of methodology and theory for the estimation of structured nonparametric regression models with Hilbertian responses. Our method and theory are focused on the additive model, while the main ideas may be adapted to other structured models. For this, the notion of Bochner integration is introduced for Banach-space-valued maps as a generalization of Lebesgue integration. Several statistical properties of Bochner integrals, relevant for our method and theory and also of importance in their own right, are presented for the first time. Our theory is complete. The existence of our estimators and the convergence of a practical algorithm that evaluates the estimators are established. These results are nonasymptotic as well as asymptotic. Furthermore, it is proved that the estimators achieve the univariate rates in pointwise, $L^{2}$ and uniform convergence, and that the estimators of the component maps converge jointly in distribution to Gaussian random elements. Our numerical examples include the cases of functional, density-valued and simplex-valued responses, demonstrating the validity of our approach.
</p>projecteuclid.org/euclid.aos/1600480928_20200918220221Fri, 18 Sep 2020 22:02 EDTNonparametric Bayesian estimation for multivariate Hawkes processeshttps://projecteuclid.org/euclid.aos/1600480929<strong>Sophie Donnet</strong>, <strong>Vincent Rivoirard</strong>, <strong>Judith Rousseau</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 5, 2698--2727.</p><p><strong>Abstract:</strong><br/>
This paper studies nonparametric estimation of parameters of multivariate Hawkes processes. We consider the Bayesian setting and derive posterior concentration rates. First, rates are derived for $\mathbb{L}_{1}$-metrics for stochastic intensities of the Hawkes process. We then deduce rates for the $\mathbb{L}_{1}$-norm of the interaction functions of the process. Our results are exemplified by using priors based on piecewise constant functions, with regular or random partitions, and priors based on mixtures of Beta distributions. We also present a simulation study to illustrate our results and to study empirically the inference on functional connectivity graphs of neurons.
</p>projecteuclid.org/euclid.aos/1600480929_20200918220221Fri, 18 Sep 2020 22:02 EDTHypothesis testing for high-dimensional time series via self-normalizationhttps://projecteuclid.org/euclid.aos/1600480930<strong>Runmin Wang</strong>, <strong>Xiaofeng Shao</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 5, 2728--2758.</p><p><strong>Abstract:</strong><br/>
Self-normalization has attracted considerable attention in the recent literature of time series analysis, but its scope of applicability has been limited to low-/fixed-dimensional parameters for low-dimensional time series. In this article, we propose a new formulation of self-normalization for inference about the mean of high-dimensional stationary processes. Our original test statistic is a U-statistic with a trimming parameter to remove the bias caused by weak dependence. Under the framework of nonlinear causal processes, we show the asymptotic normality of our U-statistic with the convergence rate dependent upon the order of the Frobenius norm of the long-run covariance matrix. The self-normalized test statistic is then constructed on the basis of recursive subsampled U-statistics and its limiting null distribution is shown to be a functional of time-changed Brownian motion, which differs from the pivotal limit used in the low-dimensional setting. An interesting phenomenon associated with self-normalization is that it works in the high-dimensional context even if the convergence rate of original test statistic is unknown. We also present applications to testing for bandedness of the covariance matrix and testing for white noise for high-dimensional stationary time series and compare the finite sample performance with existing methods in simulation studies. At the root of our theoretical arguments, we extend the martingale approximation to the high-dimensional setting, which could be of independent theoretical interest.
</p>projecteuclid.org/euclid.aos/1600480930_20200918220221Fri, 18 Sep 2020 22:02 EDTVariational analysis of constrained M-estimatorshttps://projecteuclid.org/euclid.aos/1600480931<strong>Johannes O. Royset</strong>, <strong>Roger J-B Wets</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 5, 2759--2790.</p><p><strong>Abstract:</strong><br/>
We propose a unified framework for establishing existence of nonparametric $M$-estimators, computing the corresponding estimates, and proving their strong consistency when the class of functions is exceptionally rich. In particular, the framework addresses situations where the class of functions is complex involving information and assumptions about shape, pointwise bounds, location of modes, height at modes, location of level-sets, values of moments, size of subgradients, continuity, distance to a “prior” function, multivariate total positivity and any combination of the above. The class might be engineered to perform well in a specific setting even in the presence of little data. The framework views the class of functions as a subset of a particular metric space of upper semicontinuous functions under the Attouch–Wets distance. In addition to allowing a systematic treatment of numerous $M$-estimators, the framework yields consistency of plug-in estimators of modes of densities, maximizers of regression functions, level-sets of classifiers and related quantities, and also enables computation by means of approximating parametric classes. We establish consistency through a one-sided law of large numbers, here extended to sieves, that relaxes assumptions of uniform laws, while ensuring global approximations even under model misspecification.
</p>projecteuclid.org/euclid.aos/1600480931_20200918220221Fri, 18 Sep 2020 22:02 EDTWhich bridge estimator is the best for variable selection?https://projecteuclid.org/euclid.aos/1600480932<strong>Shuaiwen Wang</strong>, <strong>Haolei Weng</strong>, <strong>Arian Maleki</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 5, 2791--2823.</p><p><strong>Abstract:</strong><br/>
We study the problem of variable selection for linear models under the high-dimensional asymptotic setting, where the number of observations $n$ grows at the same rate as the number of predictors $p$. We consider two-stage variable selection techniques (TVS) in which the first stage uses bridge estimators to obtain an estimate of the regression coefficients, and the second stage simply thresholds this estimate to select the “important” predictors. The asymptotic false discovery proportion ($\operatorname{AFDP}$) and true positive proportion (ATPP) of these TVS are evaluated. We prove that for a fixed ATPP, in order to obtain a smaller $\operatorname{AFDP}$, one should pick a bridge estimator with smaller asymptotic mean square error in the first stage of TVS. Based on this principled finding, we present a sharp comparison of different TVS, via an in-depth investigation of the estimation properties of bridge estimators. Rather than “orderwise” error bounds with loose constants, our analysis focuses on precise error characterization. Various interesting signal-to-noise ratio and sparsity settings are studied. Our results offer new and thorough insights into high-dimensional variable selection. For instance, we prove that a TVS with Ridge in its first stage outperforms TVS with other bridge estimators in large noise settings; two-stage LASSO becomes inferior when the signal is rare and weak. As a by-product, we show that two-stage methods outperform some standard variable selection techniques, such as $\operatorname{LASSO}$ and Sure Independence Screening, under certain conditions.
</p>projecteuclid.org/euclid.aos/1600480932_20200918220221Fri, 18 Sep 2020 22:02 EDTPermutation methods for factor analysis and PCAhttps://projecteuclid.org/euclid.aos/1600480933<strong>Edgar Dobriban</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 5, 2824--2847.</p><p><strong>Abstract:</strong><br/>
Researchers often have datasets measuring features $x_{ij}$ of samples, such as test scores of students. In factor analysis and PCA, these features are thought to be influenced by unobserved factors, such as skills. Can we determine how many components affect the data? This is an important problem, because decisions made here have a large impact on all downstream data analysis. Consequently, many approaches have been developed. Parallel Analysis is a popular permutation method: it randomly scrambles each feature of the data. It selects components if their singular values are larger than those of the permuted data. Despite widespread use, as well as empirical evidence for its accuracy, it currently has no theoretical justification.
In this paper, we show that parallel analysis (or permutation methods) consistently select the large components in certain high-dimensional factor models. However, when the signals are too large, the smaller components are not selected. The intuition is that permutations keep the noise invariant, while “destroying” the low-rank signal. This provides justification for permutation methods. Our work also uncovers drawbacks of permutation methods, and paves the way to improvements.
</p>projecteuclid.org/euclid.aos/1600480933_20200918220221Fri, 18 Sep 2020 22:02 EDTA general framework for Bayes structured linear modelshttps://projecteuclid.org/euclid.aos/1600480934<strong>Chao Gao</strong>, <strong>Aad W. van der Vaart</strong>, <strong>Harrison H. Zhou</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 5, 2848--2878.</p><p><strong>Abstract:</strong><br/>
High dimensional statistics deals with the challenge of extracting structured information from complex model settings. Compared with a large number of frequentist methodologies, there are rather few theoretically optimal Bayes methods for high dimensional models. This paper provides a unified approach to both Bayes high dimensional statistics and Bayes nonparametrics in a general framework of structured linear models. With a proposed two-step prior, we prove a general oracle inequality for posterior contraction under an abstract setting that allows model misspecification. The general result can be used to derive new results on optimal posterior contraction under many complex model settings including recent works for stochastic block model, graphon estimation and dictionary learning. It can also be used to improve upon posterior contraction results in the literature, including sparse linear regression and nonparametric aggregation. The key to this success lies in the novel two-step prior distribution: one for model structure, that is, model selection, and the other one for model parameters. The prior on the parameters of a model is an elliptical Laplace distribution that is capable of modeling signals with large magnitude, and the prior on the model structure involves a factor that compensates for the effect of the normalizing constant of the elliptical Laplace distribution, which is important to attain rate-optimal posterior contraction.
</p>projecteuclid.org/euclid.aos/1600480934_20200918220221Fri, 18 Sep 2020 22:02 EDTAsymptotic distribution and detection thresholds for two-sample tests based on geometric graphshttps://projecteuclid.org/euclid.aos/1600480935<strong>Bhaswar B. Bhattacharya</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 5, 2879--2903.</p><p><strong>Abstract:</strong><br/>
In this paper, we consider the problem of testing the equality of two multivariate distributions based on geometric graphs constructed using the interpoint distances between the observations. These include the tests based on the minimum spanning tree and the $K$-nearest neighbor (NN) graphs, among others. These tests are asymptotically distribution-free, universally consistent and computationally efficient, making them particularly useful in modern applications. However, very little is known about the power properties of these tests. In this paper, using the theory of stabilizing geometric graphs, we derive the asymptotic distribution of these tests under general alternatives, in the Poissonized setting. Using this, the detection threshold and the limiting local power of the test based on the $K$-NN graph are obtained, where interesting exponents depending on dimension emerge. This provides a way to compare and justify the performance of these tests in different examples.
</p>projecteuclid.org/euclid.aos/1600480935_20200918220221Fri, 18 Sep 2020 22:02 EDTControlled sequential Monte Carlohttps://projecteuclid.org/euclid.aos/1600480936<strong>Jeremy Heng</strong>, <strong>Adrian N. Bishop</strong>, <strong>George Deligiannidis</strong>, <strong>Arnaud Doucet</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 5, 2904--2929.</p><p><strong>Abstract:</strong><br/>
Sequential Monte Carlo methods, also known as particle methods, are a popular set of techniques for approximating high-dimensional probability distributions and their normalizing constants. These methods have found numerous applications in statistics and related fields; for example, for inference in nonlinear non-Gaussian state space models, and in complex static models. Like many Monte Carlo sampling schemes, they rely on proposal distributions which crucially impact their performance. We introduce here a class of controlled sequential Monte Carlo algorithms, where the proposal distributions are determined by approximating the solution to an associated optimal control problem using an iterative scheme. This method builds upon a number of existing algorithms in econometrics, physics and statistics for inference in state space models, and generalizes these methods so as to accommodate complex static models. We provide a theoretical analysis concerning the fluctuation and stability of this methodology that also provides insight into the properties of related algorithms. We demonstrate significant gains over state-of-the-art methods at a fixed computational complexity on a variety of applications.
</p>projecteuclid.org/euclid.aos/1600480936_20200918220221Fri, 18 Sep 2020 22:02 EDTA framework for adaptive MCMC targeting multimodal distributionshttps://projecteuclid.org/euclid.aos/1600480937<strong>Emilia Pompe</strong>, <strong>Chris Holmes</strong>, <strong>Krzysztof Łatuszyński</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 5, 2930--2952.</p><p><strong>Abstract:</strong><br/>
We propose a new Monte Carlo method for sampling from multimodal distributions. The technique splits the task in two: finding the modes of a target distribution $\pi$, and sampling given knowledge of the mode locations. The sampling algorithm relies on steps of two types: local ones, which preserve the current mode, and jumps to regions associated with different modes. Moreover, the method learns the optimal parameters of the algorithm as it runs, without requiring user intervention. Our technique should be viewed as a flexible framework in which the design of moves can follow various strategies known from the broad MCMC literature.
In order to design an adaptive scheme that facilitates both local and jump moves, we introduce an auxiliary variable representing each mode, and we define a new target distribution $\tilde{\pi}$ on an augmented state space $\mathcal{X}\times\mathcal{I}$, where $\mathcal{X}$ is the original state space of $\pi$ and $\mathcal{I}$ is the set of the modes. As the algorithm runs and updates its parameters, the target distribution $\tilde{\pi}$ also keeps being modified. This motivates a new class of algorithms, Auxiliary Variable Adaptive MCMC. We prove general ergodic results for the whole class before specialising to the case of our algorithm.
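A toy illustration of the local/jump move structure on a symmetric two-mode Gaussian mixture on the real line; unlike the paper's algorithm, this hand-rolled sketch assumes the mode locations are known and does no adaptation.

```python
import math
import random

def log_target(x, modes=(-5.0, 5.0), s=1.0):
    # Equal-weight two-component Gaussian mixture (normalizing constant dropped).
    return math.log(sum(math.exp(-0.5 * ((x - m) / s) ** 2) for m in modes))

def jumping_sampler(n_iter=20000, p_jump=0.1, modes=(-5.0, 5.0), seed=0):
    rng = random.Random(seed)
    x = modes[0]
    samples = []
    for _ in range(n_iter):
        if rng.random() < p_jump:
            # Jump move: translate the state to the neighbourhood of a
            # (possibly different) mode; symmetric here, hence a valid
            # Metropolis-Hastings proposal.
            nearest = min(modes, key=lambda m: abs(x - m))
            prop = x - nearest + rng.choice(modes)
        else:
            # Local move: small random walk preserving the current mode.
            prop = x + rng.gauss(0.0, 0.5)
        if rng.random() < math.exp(min(0.0, log_target(prop) - log_target(x))):
            x = prop
        samples.append(x)
    return samples
```

A plain random-walk sampler with step 0.5 would essentially never cross the valley between the modes at $\pm 5$; the jump moves let the chain divide its time between both.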
</p>projecteuclid.org/euclid.aos/1600480937_20200918220221Fri, 18 Sep 2020 22:02 EDTValid post-selection inference in model-free linear regressionhttps://projecteuclid.org/euclid.aos/1600480938<strong>Arun K. Kuchibhotla</strong>, <strong>Lawrence D. Brown</strong>, <strong>Andreas Buja</strong>, <strong>Junhui Cai</strong>, <strong>Edward I. George</strong>, <strong>Linda H. Zhao</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 5, 2953--2981.</p><p><strong>Abstract:</strong><br/>
Modern data-driven approaches to modeling make extensive use of covariate/model selection. Such selection incurs a cost: it invalidates classical statistical inference. A conservative remedy to the problem was proposed by Berk et al. (Ann. Statist. 41 (2013) 802–837) and further extended by Bachoc, Preinerstorfer and Steinberger (2016). These proposals, labeled “PoSI methods,” provide valid inference after arbitrary model selection. They are computationally NP-hard and have limitations in their theoretical justifications. We therefore propose computationally efficient confidence regions, named “UPoSI” (“U” is for “uniform” or “universal”), and prove large-$p$ asymptotics for them. We do this for linear OLS regression allowing misspecification of the normal linear model, for both fixed and random covariates, and for independent as well as some types of dependent data. We start by proving a general equivalence result for the post-selection inference problem and a simultaneous inference problem in a setting that strips inessential features still present in a related result of Berk et al. (Ann. Statist. 41 (2013) 802–837). We then construct valid PoSI confidence regions that are the first to have vastly improved computational efficiency in that the required computation times grow only quadratically rather than exponentially with the total number $p$ of covariates. These are also the first PoSI confidence regions with guaranteed asymptotic validity when the total number of covariates $p$ diverges (almost exponentially) with the sample size $n$. Under standard tail assumptions, we only require $(\log p)^{7}=o(n)$ and $k=o(\sqrt{n/\log p})$ where $k$ ($\le p$) is the largest number of covariates (model size) considered for selection. We study various properties of these confidence regions, including their Lebesgue measures, and compare them theoretically with those proposed previously.
</p>projecteuclid.org/euclid.aos/1600480938_20200918220221Fri, 18 Sep 2020 22:02 EDTInference for spherical location under high concentrationhttps://projecteuclid.org/euclid.aos/1600480939<strong>Davy Paindaveine</strong>, <strong>Thomas Verdebout</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 5, 2982--2998.</p><p><strong>Abstract:</strong><br/>
Motivated by the fact that circular or spherical data are often highly concentrated around a location $\pmb{\theta }$, we consider inference about $\pmb{\theta }$ under high concentration asymptotic scenarios for which the probability of any fixed spherical cap centered at $\pmb{\theta }$ converges to one as the sample size $n$ diverges to infinity. Rather than restricting to Fisher–von Mises–Langevin distributions, we consider a much broader, semiparametric, class of rotationally symmetric distributions indexed by the location parameter $\pmb{\theta }$, a scalar concentration parameter $\kappa $ and a functional nuisance $f$. We determine the class of distributions for which high concentration is obtained as $\kappa $ diverges to infinity. For such distributions, we then consider inference (point estimation, confidence zone estimation, hypothesis testing) on $\pmb{\theta }$ in asymptotic scenarios where $\kappa _{n}$ diverges to infinity at an arbitrary rate with the sample size $n$. Our asymptotic investigation reveals that, interestingly, optimal inference procedures on $\pmb{\theta }$ show consistency rates that depend on $f$. Using asymptotics “à la Le Cam,” we show that the spherical mean is, at any $f$, a parametrically superefficient estimator of ${\pmb{\theta }}$ and that the Watson and Wald tests for $\mathcal{H}_{0}:{\pmb{\theta }}={\pmb{\theta }}_{0}$ enjoy similar, nonstandard, optimality properties. We illustrate our results through simulations and analyze a real data example. From a technical point of view, our asymptotic derivations require challenging expansions of rotationally symmetric functionals for large arguments of $f$.
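A minimal sketch of the spherical mean (the normalized resultant vector) discussed above, with a usage example on highly concentrated circular data; all names are illustrative.

```python
import math
import random

def spherical_mean(points):
    # Normalized resultant vector of the unit vectors: the directional mean.
    dim = len(points[0])
    total = [sum(p[i] for p in points) for i in range(dim)]
    norm = math.sqrt(sum(c * c for c in total))
    return [c / norm for c in total]
```

On data tightly concentrated around a direction (small angular dispersion, mimicking large $\kappa$), the spherical mean recovers that direction very accurately.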
</p>projecteuclid.org/euclid.aos/1600480939_20200918220221Fri, 18 Sep 2020 22:02 EDTSemiparametric Bayesian causal inferencehttps://projecteuclid.org/euclid.aos/1600480940<strong>Kolyan Ray</strong>, <strong>Aad van der Vaart</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 5, 2999--3020.</p><p><strong>Abstract:</strong><br/>
We develop a semiparametric Bayesian approach for estimating the mean response in a missing data model with binary outcomes and a nonparametrically modelled propensity score. Equivalently, we estimate the causal effect of a treatment, correcting nonparametrically for confounding. We show that standard Gaussian process priors satisfy a semiparametric Bernstein–von Mises theorem under smoothness conditions. We further propose a novel propensity score-dependent prior that provides efficient inference under strictly weaker conditions. We also show that it is theoretically preferable to model the covariate distribution with a Dirichlet process or Bayesian bootstrap, rather than modelling its density.
</p>projecteuclid.org/euclid.aos/1600480940_20200918220221Fri, 18 Sep 2020 22:02 EDTRelaxing the assumptions of knockoffs by conditioninghttps://projecteuclid.org/euclid.aos/1600480941<strong>Dongming Huang</strong>, <strong>Lucas Janson</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 5, 3021--3042.</p><p><strong>Abstract:</strong><br/>
The recent paper Candès et al. ( J. R. Stat. Soc. Ser. B. Stat. Methodol. 80 (2018) 551–577) introduced model-X knockoffs, a method for variable selection that provably and nonasymptotically controls the false discovery rate with no restrictions or assumptions on the dimensionality of the data or the conditional distribution of the response given the covariates. The one requirement for the procedure is that the covariate samples are drawn independently and identically from a precisely-known (but arbitrary) distribution. The present paper shows that the exact same guarantees can be made without knowing the covariate distribution fully, but instead knowing it only up to a parametric model with as many as $\Omega (n^{*}p)$ parameters, where $p$ is the dimension and $n^{*}$ is the number of covariate samples (which may exceed the usual sample size $n$ of labeled samples when unlabeled samples are also available). The key is to treat the covariates as if they are drawn conditionally on their observed value for a sufficient statistic of the model. Although this idea is simple, even in Gaussian models conditioning on a sufficient statistic leads to a distribution supported on a set of zero Lebesgue measure, requiring techniques from topological measure theory to establish valid algorithms. We demonstrate how to do this for three models of interest, with simulations showing the new approach remains powerful under the weaker assumptions.
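For intuition, here is a minimal sketch of the knockoff filter with marginal-correlation statistics, in the special case of i.i.d. standard normal covariates, where independent N(0,1) draws are themselves valid knockoffs (every coordinate is independent of every other). This does not implement the paper's conditional construction; all names and defaults are illustrative.

```python
import random

def knockoff_filter(X, Xk, y, q=0.2):
    # W_j = |<X_j, y>| - |<Xk_j, y>| with the knockoff+ selection threshold.
    n, p = len(X), len(X[0])

    def score(M, j):
        return abs(sum(M[i][j] * y[i] for i in range(n)))

    W = [score(X, j) - score(Xk, j) for j in range(p)]
    # Knockoff+ threshold: smallest t with (1 + #{W_j <= -t}) / #{W_j >= t} <= q.
    for t in sorted(abs(w) for w in W if w != 0):
        neg = sum(w <= -t for w in W)
        pos = sum(w >= t for w in W)
        if pos > 0 and (1 + neg) / pos <= q:
            return [j for j in range(p) if W[j] >= t]
    return []
```

The point of the paper is that the exact knockoff construction (here trivial because the covariate distribution is fully known) can be relaxed to knowing the distribution only up to a parametric model, by conditioning on a sufficient statistic.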
</p>projecteuclid.org/euclid.aos/1600480941_20200918220221Fri, 18 Sep 2020 22:02 EDTAnalytical nonlinear shrinkage of large-dimensional covariance matriceshttps://projecteuclid.org/euclid.aos/1600480942<strong>Olivier Ledoit</strong>, <strong>Michael Wolf</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 5, 3043--3065.</p><p><strong>Abstract:</strong><br/>
This paper establishes the first analytical formula for nonlinear shrinkage estimation of large-dimensional covariance matrices. We achieve this by identifying and mathematically exploiting a deep connection between nonlinear shrinkage and nonparametric estimation of the Hilbert transform of the sample spectral density. Previous nonlinear shrinkage methods were numerical in nature: QuEST requires numerical inversion of a complex equation from random matrix theory whereas NERCOME is based on a sample-splitting scheme. The new analytical method is more elegant and also has more potential to accommodate future variations or extensions. Immediate benefits are (i) that it is typically 1000 times faster with basically the same accuracy as QuEST and (ii) that it accommodates covariance matrices of dimension up to 10,000 and more. The difficult case where the matrix dimension exceeds the sample size is also covered.
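For contrast with the nonlinear approach, the classical linear shrinkage baseline, which pulls the sample covariance toward a scaled identity, can be sketched as follows. This is not the paper's estimator: the shrinkage intensity `rho` is supplied by hand here, whereas Ledoit–Wolf-type estimators choose it from the data, and nonlinear shrinkage moves each sample eigenvalue individually rather than applying one common intensity.

```python
import random

def sample_cov(X):
    # Plain (biased, divide-by-n) sample covariance of the rows of X.
    n, p = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(p)]
    return [[sum((row[i] - mean[i]) * (row[j] - mean[j]) for row in X) / n
             for j in range(p)] for i in range(p)]

def linear_shrinkage(X, rho):
    """Linear shrinkage toward a scaled identity:
    Sigma_hat = (1 - rho) * S + rho * mu * I, with mu = trace(S) / p."""
    S = sample_cov(X)
    p = len(S)
    mu = sum(S[i][i] for i in range(p)) / p
    return [[(1 - rho) * S[i][j] + (rho * mu if i == j else 0.0)
             for j in range(p)] for i in range(p)]
```

At `rho = 0` the estimator is the raw sample covariance; at `rho = 1` it collapses to the scaled identity.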
</p>projecteuclid.org/euclid.aos/1600480942_20200918220221Fri, 18 Sep 2020 22:02 EDTCoupled conditional backward sampling particle filterhttps://projecteuclid.org/euclid.aos/1600480943<strong>Anthony Lee</strong>, <strong>Sumeetpal S. Singh</strong>, <strong>Matti Vihola</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 5, 3066--3089.</p><p><strong>Abstract:</strong><br/>
The conditional particle filter (CPF) is a promising algorithm for general hidden Markov model smoothing. Empirical evidence suggests that the variant of CPF with backward sampling (CBPF) performs well even with long time series. Previous theoretical results have not been able to demonstrate the improvement brought by backward sampling, whereas we provide rates showing that CBPF can remain effective with a fixed number of particles independent of the time horizon. Our result is based on analysis of a new coupling of two CBPFs, the coupled conditional backward sampling particle filter (CCBPF). We show that CCBPF has good stability properties in the sense that, with a fixed number of particles, the coupling time in terms of iterations increases only linearly with respect to the time horizon under a general (strong mixing) condition. The CCBPF is useful not only as a theoretical tool, but also as a practical method that allows for unbiased estimation of smoothing expectations, following the recent developments by Jacob, Lindsten and Schön (2020). Unbiased estimation has many advantages, such as enabling the construction of asymptotically exact confidence intervals and straightforward parallelisation.
</p>projecteuclid.org/euclid.aos/1600480943_20200918220221Fri, 18 Sep 2020 22:02 EDTAsymptotic risk and phase transition of $l_{1}$-penalized robust estimatorhttps://projecteuclid.org/euclid.aos/1600480944<strong>Hanwen Huang</strong>. <p><strong>Source: </strong>Annals of Statistics, Volume 48, Number 5, 3090--3111.</p><p><strong>Abstract:</strong><br/>
The mean square error (MSE) of an estimator can be used to evaluate the performance of a regression model. In this paper, we derive the asymptotic MSE of $l_{1}$-penalized robust estimators in the limit of both sample size $n$ and dimension $p$ going to infinity with fixed ratio $n/p\rightarrow \delta $. We focus on the $l_{1}$-penalized least absolute deviation and $l_{1}$-penalized Huber’s regressions. Our analytic study shows the appearance of a sharp phase transition in the two-dimensional sparsity-undersampling phase space. We derive the explicit formula of the phase boundary. Remarkably, the phase boundary is identical to the phase transition curve of LASSO, which is also identical to the previously known Donoho–Tanner phase transition for sparse recovery. Our derivation is based on the asymptotic analysis of the generalized approximate message passing (GAMP) algorithm. We establish the asymptotic MSE of the $l_{1}$-penalized robust estimator by connecting it to the asymptotic MSE of the corresponding GAMP estimator. Our results provide some theoretical insight into high-dimensional regression methods. Extensive computational experiments have been conducted to validate the correctness of our analytic results. We obtain fairly good agreement between theoretical prediction and numerical simulations on finite-size systems.
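The $l_{1}$ proximal (soft-thresholding) step at the heart of GAMP and related iterations can be illustrated with ISTA, a simpler proximal-gradient relative of the estimators analyzed in the paper, applied here to the $l_{1}$-penalized least-squares problem; the function names and the conservative step-size choice are illustrative assumptions.

```python
import math

def soft_threshold(v, t):
    # eta(v; t) = sign(v) * max(|v| - t, 0), the l1 proximal operator.
    return math.copysign(max(abs(v) - t, 0.0), v)

def ista_lasso(X, y, lam, n_iter=500, step=None):
    """Proximal gradient (ISTA) for 0.5 * ||y - X b||^2 + lam * ||b||_1."""
    n, p = len(X), len(X[0])
    if step is None:
        # Conservative step from a crude bound on the largest
        # eigenvalue of X^T X (p times the largest column norm squared).
        step = 1.0 / (p * max(sum(X[i][j] ** 2 for i in range(n)) for j in range(p)))
    b = [0.0] * p
    for _ in range(n_iter):
        r = [sum(X[i][j] * b[j] for j in range(p)) - y[i] for i in range(n)]
        g = [sum(X[i][j] * r[i] for i in range(n)) for j in range(p)]
        b = [soft_threshold(b[j] - step * g[j], step * lam) for j in range(p)]
    return b
```

For an orthonormal design the solution has the closed form $b_{j}=\eta (y_{j};\lambda )$, which makes the iteration easy to sanity-check.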
</p>projecteuclid.org/euclid.aos/1600480944_20200918220221Fri, 18 Sep 2020 22:02 EDT