Electronic Journal of Statistics Articles (Project Euclid)
http://projecteuclid.org/euclid.ejs
The latest articles from Electronic Journal of Statistics on Project Euclid, a site for mathematics and statistics resources.
Language: en-us
Copyright 2010 Cornell University Library
Euclid-L@cornell.edu (Project Euclid Team)
Published: Thu, 05 Aug 2010 15:41 EDT
Last build: Fri, 03 Jun 2011 09:20 EDT
http://projecteuclid.org/collection/euclid/images/logo_linking_100.gif
Project Euclid
http://projecteuclid.org/
The bias and skewness of M-estimators in regression
http://projecteuclid.org/euclid.ejs/1262876992
<strong>Christopher Withers</strong>, <strong>Saralees Nadarajah</strong><p><strong>Source: </strong>Electron. J. Statist., Volume 4, 1--14.</p><p><strong>Abstract:</strong><br/>
We consider M-estimation of a regression model with a nuisance parameter and a vector of other parameters. The unknown distribution of the residuals is not assumed to be normal or symmetric. Simple and easily estimated formulas are given for the dominant terms of the bias and skewness of the parameter estimates. For the linear model these are proportional to the skewness of the ‘independent’ variables. For a nonlinear model, its linear component plays the role of these independent variables, and a second term must be added, proportional to the covariance of its linear and quadratic components. For the least squares estimate with normal errors this term was derived by Box [1]. We also consider the effect of a large number of parameters, and the case of random independent variables.
</p>
Published: Thu, 05 Aug 2010 15:41 EDT

On misspecifications in regularity and properties of estimators
https://projecteuclid.org/euclid.ejs/1515747853
<strong>Oleg V. Chernoyarov</strong>, <strong>Yury A. Kutoyants</strong>, <strong>Andrei P. Trifonov</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 80--106.</p><p><strong>Abstract:</strong><br/>
The problem of parameter estimation from continuous-time observations of a deterministic signal in white Gaussian noise is considered. The asymptotic properties of the maximum likelihood estimator are described in the asymptotics of small noise (large signal-to-noise ratio). We are interested in the situation where there is a misspecification in the regularity conditions. In particular, it is supposed that the statistician uses a discontinuous (change-point type) model of the signal, when the true signal is a continuously differentiable function of the unknown parameter.
</p>
Published: Tue, 05 Jun 2018 22:01 EDT

Locally stationary functional time series
https://projecteuclid.org/euclid.ejs/1516006818
<strong>Anne van Delft</strong>, <strong>Michael Eichler</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 107--170.</p><p><strong>Abstract:</strong><br/>
The literature on time series of functional data has focused on processes of which the probabilistic law is either constant over time or constant up to its second-order structure. Especially for long stretches of data it is desirable to be able to weaken this assumption. This paper introduces a framework that will enable meaningful statistical inference of functional data of which the dynamics change over time. We put forward the concept of local stationarity in the functional setting and establish a class of processes that have a functional time-varying spectral representation. Subsequently, we derive conditions that allow for fundamental results from nonstationary multivariate time series to carry over to the function space. In particular, time-varying functional ARMA processes are investigated and shown to be functional locally stationary according to the proposed definition. As a side-result, we establish a Cramér representation for an important class of weakly stationary functional processes. Important in our context is the notion of a time-varying spectral density operator of which the properties are studied and uniqueness is derived. Finally, we provide a consistent nonparametric estimator of this operator and show it is asymptotically Gaussian using a weaker tightness criterion than what is usually deemed necessary.
</p>
Published: Tue, 05 Jun 2018 22:01 EDT

A nearest neighbor estimate of the residual variance
https://projecteuclid.org/euclid.ejs/1528250442
<strong>Luc Devroye</strong>, <strong>László Györfi</strong>, <strong>Gábor Lugosi</strong>, <strong>Harro Walk</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 1752--1778.</p><p><strong>Abstract:</strong><br/>
We study the problem of estimating the smallest achievable mean-squared error in regression function estimation. The problem is equivalent to estimating the second moment of the regression function of $Y$ on $X\in{\mathbb{R}} ^{d}$. We introduce a nearest-neighbor-based estimate and obtain a normal limit law for the estimate when $X$ has an absolutely continuous distribution, without any condition on the density. We also compute the asymptotic variance explicitly and derive a non-asymptotic bound on the variance that does not depend on the dimension $d$. The asymptotic variance does not depend on the smoothness of the density of $X$ or of the regression function. A non-asymptotic exponential concentration inequality is also proved. We illustrate the use of the new estimate through testing whether a component of the vector $X$ carries information for predicting $Y$.
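The idea behind such an estimate can be sketched in a few lines. The helper below is an illustrative one-nearest-neighbor variant, not the paper's exact estimator or conditions: since $E[(Y-m(X))^{2}]=E[Y^{2}]-E[m(X)^{2}]$, and the product $Y_{i}Y_{N(i)}$ (with $N(i)$ the nearest neighbor of $X_{i}$) is nearly unbiased for $E[m(X)^{2}]$ when $m$ is smooth, one simply subtracts the two empirical means.

```python
def nn_residual_variance(X, Y):
    """Illustrative 1-nearest-neighbor sketch of a residual-variance
    estimate: E[(Y - m(X))^2] = E[Y^2] - E[m(X)^2], and the empirical mean
    of Y_i * Y_{N(i)} approximates E[m(X)^2] for smooth m.  The paper's
    exact estimator, conditions, and limit law are not reproduced here."""
    n = len(Y)
    total = 0.0
    for i in range(n):
        # index of the nearest neighbor of X[i] (X is one-dimensional here)
        j = min((k for k in range(n) if k != i),
                key=lambda k: (X[k] - X[i]) ** 2)
        total += Y[i] * Y[j]
    return sum(v * v for v in Y) / n - total / n
```

On simulated data with unit-variance noise, the output should hover near 1 for moderate sample sizes.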
</p>
Published: Tue, 05 Jun 2018 22:01 EDT

Cluster analysis of longitudinal profiles with subgroups
https://projecteuclid.org/euclid.ejs/1517367715
<strong>Xiaolu Zhu</strong>, <strong>Annie Qu</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 171--193.</p><p><strong>Abstract:</strong><br/>
In this paper, we cluster profiles of longitudinal data using a penalized regression method. Specifically, we allow heterogeneous variation of longitudinal patterns for each subject, and utilize a pairwise-grouping penalization on the coefficients of nonparametric B-spline models to form subgroups. Consequently, we identify clusters based on different patterns of the predicted longitudinal curves. One advantage of the proposed method is that there is no need to pre-specify the number of clusters; instead, the number of clusters is selected automatically through a model selection criterion. Our method is also applicable to unbalanced data, where different subjects may have measurements at different time points. To implement the proposed method, we develop an alternating direction method of multipliers (ADMM) algorithm which has desirable convergence properties. In theory, we establish consistency properties for the approximated nonparametric function estimates and the subgroup memberships. In addition, we show that our method outperforms existing competitive approaches in our simulation studies and a real data example.
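The final grouping step implied by a pairwise penalty can be illustrated in isolation. The hypothetical helper below (not the paper's ADMM algorithm) merges subjects whose fitted coefficient vectors have been shrunk to (numerically) equal values, using union-find:

```python
def extract_subgroups(coefs, tol=1e-6):
    """Illustrative post-processing, assuming the pairwise-grouping penalty
    has already shrunk differences between cluster-mates to ~0: subjects
    whose coefficient vectors agree within `tol` in sup-norm are merged
    into one subgroup via union-find.  Hypothetical helper, not the
    paper's estimation algorithm itself."""
    n = len(coefs)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if max(abs(a - b) for a, b in zip(coefs[i], coefs[j])) < tol:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Note that no number of clusters is supplied; it emerges from which coefficient vectors coincide, mirroring the automatic selection described above.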
</p>
Published: Fri, 08 Jun 2018 22:02 EDT

Asymptotic confidence bands in the Spektor-Lord-Willis problem via kernel estimation of intensity derivative
https://projecteuclid.org/euclid.ejs/1518080460
<strong>Bogdan Ćmiel</strong>, <strong>Zbigniew Szkutnik</strong>, <strong>Jakub Wojdyła</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 194--223.</p><p><strong>Abstract:</strong><br/>
The stereological problem of unfolding the distribution of spheres radii from linear sections, known as the Spektor-Lord-Willis problem, is formulated as a Poisson inverse problem and an $L^{2}$-rate-minimax solution is constructed over some restricted Sobolev classes. The solution is a specialized kernel-type estimator with boundary correction. For the first time for this problem, non-parametric, asymptotic confidence bands for the unfolded function are constructed. Automatic bandwidth selection procedures based on empirical risk minimization are proposed. It is shown that a version of the Goldenshluger-Lepski procedure of bandwidth selection ensures adaptivity of the estimators to the unknown smoothness. The performance of the procedures is demonstrated in a Monte Carlo experiment.
</p>
Published: Fri, 08 Jun 2018 22:02 EDT

Semi-parametric regression estimation of the tail index
https://projecteuclid.org/euclid.ejs/1518426109
<strong>Mofei Jia</strong>, <strong>Emanuele Taufer</strong>, <strong>Maria Michela Dickson</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 224--248.</p><p><strong>Abstract:</strong><br/>
Consider a distribution $F$ with regularly varying tails of index $-\alpha$. An estimation strategy for $\alpha$, exploiting the relation between the behavior of the tail at infinity and of the characteristic function at the origin, is proposed. A semi-parametric regression model does the job: a nonparametric component controls the bias and a parametric one produces the actual estimate. Implementation of the estimation strategy is quite simple as it can rely on standard software packages for generalized additive models. A generalized cross validation procedure is suggested in order to handle the bias-variance trade-off. Theoretical properties of the proposed method are derived and simulations show the performance of this estimator in a wide range of cases. An application to data sets on city sizes, facing the debated issue of distinguishing Pareto-type tails from Log-normal tails, illustrates how the proposed method works in practice.
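For orientation, the classical Hill estimator, the standard benchmark against which regression-based tail-index methods are usually compared, fits in a few lines (the paper's characteristic-function regression itself is not reproduced):

```python
import math

def hill_estimator(sample, k):
    """Classical Hill estimator of a positive tail index alpha, based on
    the k largest order statistics.  Shown only as the familiar baseline,
    not as the semi-parametric regression method of the paper."""
    xs = sorted(sample, reverse=True)
    # the average log-excess over the (k+1)-th largest point estimates 1/alpha
    return k / sum(math.log(xs[i] / xs[k]) for i in range(k))
```

On exact Pareto data with index $\alpha=2$ the estimate concentrates near 2; the bias-variance trade-off in the choice of $k$ is the difficulty the regression approach above is designed to handle.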
</p>
Published: Fri, 08 Jun 2018 22:02 EDT

Empirical evolution equations
https://projecteuclid.org/euclid.ejs/1518426110
<strong>Susan Wei</strong>, <strong>Victor M. Panaretos</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 249--276.</p><p><strong>Abstract:</strong><br/>
Evolution equations comprise a broad framework for describing the dynamics of a system in a general state space: when the state space is finite-dimensional, they give rise to systems of ordinary differential equations; for infinite-dimensional state spaces, they give rise to partial differential equations. Several modern statistical and machine learning methods concern the estimation of objects that can be formalized as solutions to evolution equations, in some appropriate state space, even if not stated as such. The corresponding equations, however, are seldom known exactly, and are empirically derived from data, often by means of non-parametric estimation. This induces uncertainties on the equations and their solutions that are challenging to quantify, and moreover the diversity and the specifics of each particular setting may obscure the path for a general approach. In this paper, we address the problem of constructing general yet tractable methods for quantifying such uncertainties, by means of asymptotic theory combined with bootstrap methodology. We demonstrate these procedures in important examples including gradient line estimation, diffusion tensor imaging tractography, and local principal component analysis. The bootstrap perspective is particularly appealing as it circumvents the need to simulate from stochastic (partial) differential equations that depend on (infinite-dimensional) unknowns. We assess the performance of the bootstrap procedure via simulations and find that it demonstrates good finite-sample coverage.
</p>
Published: Fri, 08 Jun 2018 22:02 EDT

Corrigendum to “Classification with asymmetric label noise: Consistency and maximal denoising”
https://projecteuclid.org/euclid.ejs/1528509712
<strong>Gilles Blanchard</strong>, <strong>Clayton Scott</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 1779--1781.</p><p><strong>Abstract:</strong><br/>
We point out a flaw in Lemma 15 of [1]. We also indicate how the main results of that section are still valid using a modified argument.
</p>
Published: Fri, 08 Jun 2018 22:02 EDT

Adaptive estimation in the nonparametric random coefficients binary choice model by needlet thresholding
https://projecteuclid.org/euclid.ejs/1518426111
<strong>Eric Gautier</strong>, <strong>Erwan Le Pennec</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 277--320.</p><p><strong>Abstract:</strong><br/>
In the random coefficients binary choice model, a binary variable equals 1 iff an index $X^{\top}\beta$ is positive. The vectors $X$ and $\beta$ are independent and belong to the sphere $\mathbb{S}^{d-1}$ in $\mathbb{R}^{d}$. We prove lower bounds on the minimax risk for estimation of the density $f_{\beta}$ over Besov bodies where the loss is a power of the $\mathrm{L}^{p}(\mathbb{S}^{d-1})$ norm for $1\le p\le \infty$. We show that a hard thresholding estimator based on a needlet expansion with data-driven thresholds achieves these lower bounds up to logarithmic factors.
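The thresholding step itself is generic and can be sketched with the needlet frame and the paper's data-driven thresholds abstracted away:

```python
def hard_threshold(coefs, thresholds):
    """Generic hard thresholding of expansion coefficients: keep a
    coefficient only when it beats its threshold in magnitude.  The
    thresholds here are user-supplied; the needlet expansion and the
    paper's data-driven threshold choice are not reproduced."""
    return [c if abs(c) > t else 0.0 for c, t in zip(coefs, thresholds)]
```

The adaptivity result above says, roughly, that with the right data-driven `thresholds` this simple kill-or-keep rule attains the minimax rates up to logarithmic factors without knowing the smoothness of $f_{\beta}$.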
</p>
Published: Mon, 11 Jun 2018 22:06 EDT

Statistical properties of simple random-effects models for genetic heritability
https://projecteuclid.org/euclid.ejs/1518663656
<strong>David Steinsaltz</strong>, <strong>Andrew Dahl</strong>, <strong>Kenneth W. Wachter</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 321--358.</p><p><strong>Abstract:</strong><br/>
Random-effects models are a popular tool for analysing total narrow-sense heritability for quantitative phenotypes, on the basis of large-scale SNP data. Recently, there have been disputes over the validity of conclusions that may be drawn from such analysis. We derive some of the fundamental statistical properties of heritability estimates arising from these models, showing that the bias will generally be small. We show that the score function may be manipulated into a form that facilitates intelligible interpretations of the results. We go on to use this score function to explore the behavior of the model when certain key assumptions of the model are not satisfied — shared environment, measurement error, and genetic effects that are confined to a small subset of sites.
The variance and bias depend crucially on the variance of certain functionals of the singular values of the genotype matrix. A useful baseline is the singular value distribution associated with genotypes that are completely independent — that is, with no linkage and no relatedness — for a given number of individuals and sites. We calculate the corresponding variance and bias for this setting.
</p>
Published: Mon, 11 Jun 2018 22:06 EDT

Kernel estimation of extreme regression risk measures
https://projecteuclid.org/euclid.ejs/1518663657
<strong>Jonathan El Methni</strong>, <strong>Laurent Gardes</strong>, <strong>Stéphane Girard</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 359--398.</p><p><strong>Abstract:</strong><br/>
The Regression Conditional Tail Moment (RCTM) is the risk measure defined as the moment of order $b\geq0$ of a loss distribution above the upper $\alpha$-quantile, where $\alpha\in (0,1)$, when covariate information is available. The purpose of this work is first to establish the asymptotic properties of the RCTM in the case of extreme losses, i.e. when $\alpha\to 0$ is no longer fixed, under general extreme-value conditions on the distribution tail. In particular, no assumption is made on the sign of the associated extreme-value index. Second, the asymptotic normality of a kernel estimator of the RCTM is established, which allows us to derive similar results for estimators of related risk measures such as the Regression Conditional Tail Expectation/Variance/Skewness. When the distribution tail is bounded above, an application to frontier estimation is also proposed. The results are illustrated both on simulated data and on a real dataset from the field of nuclear reactor reliability.
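In the covariate-free case the empirical analogue of the target quantity is straightforward; the rough sketch below only illustrates what is being estimated (the kernel smoothing over the covariate and the extreme-value asymptotics, which are the paper's contribution, are not shown, and the quantile convention is deliberately simplistic):

```python
def conditional_tail_moment(losses, alpha, b):
    """Crude empirical Conditional Tail Moment without covariates: the
    mean of L**b over losses above the empirical upper alpha-quantile.
    Assumes 0 < alpha < 1 and enough observations in the tail."""
    q = sorted(losses)[int((1 - alpha) * len(losses))]  # crude upper quantile
    tail = [l ** b for l in losses if l > q]
    return sum(tail) / len(tail)
```

Taking $b=1$ recovers the (conditional) tail expectation, $b=2$ the ingredient for the tail variance, and so on, matching the family of measures named above.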
</p>
Published: Mon, 11 Jun 2018 22:06 EDT

Regularity properties and simulations of Gaussian random fields on the sphere cross time
https://projecteuclid.org/euclid.ejs/1518663658
<strong>Jorge Clarke De la Cerda</strong>, <strong>Alfredo Alegría</strong>, <strong>Emilio Porcu</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 399--426.</p><p><strong>Abstract:</strong><br/>
We study the regularity properties of Gaussian fields defined over spheres cross time. In particular, we consider two alternative spectral decompositions for a Gaussian field on $\mathbb{S}^{d}\times \mathbb{R}$. For each decomposition, we establish regularity properties through Sobolev and interpolation spaces. We then propose a simulation method and study its level of accuracy in the $L^{2}$ sense. The method turns out to be both fast and efficient.
</p>
Published: Mon, 11 Jun 2018 22:06 EDT

Large and moderate deviations for kernel–type estimators of the mean density of Boolean models
https://projecteuclid.org/euclid.ejs/1518750030
<strong>Federico Camerlenghi</strong>, <strong>Elena Villa</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 427--460.</p><p><strong>Abstract:</strong><br/>
The mean density of a random closed set with integer Hausdorff dimension is a crucial notion in stochastic geometry; indeed, it is a fundamental tool in a large variety of applied problems, such as image analysis, medicine, and computer vision. Hence the estimation of the mean density is a problem of interest from both a theoretical and a computational standpoint. Different kinds of estimators are now available in the literature; here we focus on a kernel-type estimator, which may be considered a generalization of the traditional kernel density estimator of random variables to the case of random closed sets. The aim of the present paper is to provide asymptotic properties of such an estimator in the context of Boolean models, a broad class of random closed sets. More precisely, we prove large and moderate deviation principles, which allow us to derive the strong consistency of the estimator of the mean density as well as asymptotic confidence intervals. Finally, we underline the connection of our theoretical findings with the classical literature on density estimation of random variables.
</p>
Published: Mon, 11 Jun 2018 22:06 EDT

Dimension reduction and estimation in the secondary analysis of case-control studies
https://projecteuclid.org/euclid.ejs/1528769120
<strong>Liang Liang</strong>, <strong>Raymond Carroll</strong>, <strong>Yanyuan Ma</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 1782--1821.</p><p><strong>Abstract:</strong><br/>
Studying the relationship between covariates based on retrospective data is the main purpose of secondary analysis, an area of increasing interest. We examine the secondary analysis problem when multiple covariates are available, while only a regression mean model is specified. Despite the completely parametric modeling of the regression mean function, the case-control nature of the data requires special treatment and semiparametric efficient estimation generates various nonparametric estimation problems with multivariate covariates. We devise a dimension reduction approach that fits with the specified primary and secondary models in the original problem setting, and use reweighting to adjust for the case-control nature of the data, even when the disease rate in the source population is unknown. The resulting estimator is both locally efficient and robust against the misspecification of the regression error distribution, which can be heteroscedastic as well as non-Gaussian. We demonstrate the advantage of our method over several existing methods, both analytically and numerically.
</p>
Published: Mon, 11 Jun 2018 22:06 EDT

Solution of linear ill-posed problems by model selection and aggregation
https://projecteuclid.org/euclid.ejs/1528769121
<strong>Felix Abramovich</strong>, <strong>Daniela De Canditiis</strong>, <strong>Marianna Pensky</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 1822--1841.</p><p><strong>Abstract:</strong><br/>
We consider a general statistical linear inverse problem, where the solution is represented via a known (possibly overcomplete) dictionary that allows its sparse representation. We propose two different approaches. A model selection estimator selects a single model by minimizing the penalized empirical risk over all possible models. By contrast with direct problems, the penalty depends on the model itself rather than on its size only as for complexity penalties. A Q-aggregate estimator averages over the entire collection of estimators with properly chosen weights. Under mild conditions on the dictionary, we establish oracle inequalities both with high probability and in expectation for the two estimators. Moreover, for the latter estimator these inequalities are sharp. The proposed procedures are implemented numerically and their performance is assessed by a simulation study.
</p>
Published: Mon, 11 Jun 2018 22:06 EDT

Community detection by $L_{0}$-penalized graph Laplacian
https://projecteuclid.org/euclid.ejs/1528769122
<strong>Chong Chen</strong>, <strong>Ruibin Xi</strong>, <strong>Nan Lin</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 1842--1866.</p><p><strong>Abstract:</strong><br/>
Community detection in network analysis aims at partitioning nodes into disjoint communities. Real networks often contain outlier nodes that do not belong to any community, and the number of communities is typically unknown. However, most current algorithms assume that the number of communities is known, and few can handle networks with outliers. In this paper, we propose detecting communities by maximizing a novel model-free tightness criterion. We show that this tightness criterion is closely related to the $L_{0}$-penalized graph Laplacian and develop an efficient algorithm to extract communities based on the criterion. Unlike many other community detection methods, this method does not assume that the number of communities is known and can properly detect communities in networks with outliers. Under the degree-corrected stochastic block model, we show that, even for networks with outliers, maximizing the tightness criterion can extract communities with small misclassification rates as the number of communities grows to infinity with the network size. Simulations and real data analysis also show that the proposed method performs significantly better than existing methods.
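The flavor of extraction-style (one-community-at-a-time) detection can be conveyed with a toy: grow a set from a seed node, adding whichever node most increases the average internal degree, and stop when no addition helps. This is only an illustration of tightness-based extraction; it is not the paper's $L_{0}$-penalized criterion or algorithm.

```python
def greedy_tight_community(adj, seed):
    """Toy greedy extraction of one tight community: grow S from `seed`,
    at each step adding the node that most increases the average internal
    degree 2 e(S) / |S|.  Illustrative only; not the paper's criterion."""
    n = len(adj)
    S = {seed}

    def score(T):
        e = sum(adj[i][j] for i in T for j in T if i < j)  # internal edges
        return 2.0 * e / len(T)

    while True:
        best, best_s = None, score(S)
        for v in sorted(set(range(n)) - S):
            s = score(S | {v})
            if s > best_s:
                best, best_s = v, s
        if best is None:
            return sorted(S)
        S.add(best)
```

On a triangle $\{0,1,2\}$ with a pendant outlier node 3, the procedure returns the triangle and leaves the outlier unassigned, mirroring the outlier-robustness discussed above.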
</p>
Published: Mon, 11 Jun 2018 22:06 EDT

Stochastic heavy ball
https://projecteuclid.org/euclid.ejs/1519030878
<strong>Sébastien Gadat</strong>, <strong>Fabien Panloup</strong>, <strong>Sofiane Saadane</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 461--529.</p><p><strong>Abstract:</strong><br/>
This paper deals with a natural stochastic optimization procedure derived from the so-called Heavy-ball differential equation, introduced by Polyak in the 1960s in his seminal contribution [Pol64]. The Heavy-ball method is a second-order dynamics investigated to minimize convex functions $f$. The family of second-order methods has recently received a large amount of attention, especially since the famous contribution of Nesterov [Nes83] and the explosion of large-scale optimization problems. This work provides an in-depth description of the stochastic heavy-ball method, an adaptation of the deterministic one in which only unbiased evaluations of the gradient are available and used throughout the iterations of the algorithm. We first describe some almost sure convergence results in the case of general non-convex coercive functions $f$. We then examine the situation of convex and strongly convex potentials and derive some non-asymptotic results about the stochastic heavy-ball method. We end our study with limit theorems on several rescaled algorithms.
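The basic iteration is short enough to write down. The sketch below uses an illustrative step-size schedule and constants (not the paper's): a velocity accumulates momentum while being driven by noisy gradient evaluations.

```python
import random

def stochastic_heavy_ball(grad, x0, steps=2000, momentum=0.9,
                          noise=0.1, seed=0):
    """Sketch of the stochastic heavy-ball iteration in one dimension:
        v_{k+1} = m * v_k - gamma_k * (grad f(x_k) + noise_k)
        x_{k+1} = x_k + v_{k+1}
    with decaying steps gamma_k = 1 / (k + 10).  Schedule and constants
    are illustrative assumptions, not the paper's choices."""
    rng = random.Random(seed)
    x, v = x0, 0.0
    for k in range(steps):
        gamma = 1.0 / (k + 10)
        g = grad(x) + rng.gauss(0.0, noise)  # unbiased gradient evaluation
        v = momentum * v - gamma * g
        x = x + v
    return x
```

On the quadratic $f(x)=x^{2}$ the iterate oscillates (the "ball" overshoots) but settles near the minimizer as the steps decay.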
</p>
Published: Tue, 12 Jun 2018 22:06 EDT

Consistent algorithms for multiclass classification with an abstain option
https://projecteuclid.org/euclid.ejs/1519030879
<strong>Harish G. Ramaswamy</strong>, <strong>Ambuj Tewari</strong>, <strong>Shivani Agarwal</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 530--554.</p><p><strong>Abstract:</strong><br/>
We consider the problem of $n$-class classification ($n\geq2$), where the classifier can choose to abstain from making predictions at a given cost, say, a factor $\alpha$ of the cost of misclassification. Our goal is to design consistent algorithms for such $n$-class classification problems with a ‘reject option’; while such algorithms are known for the binary ($n=2$) case, little has been understood for the general multiclass case. We show that the well known Crammer-Singer surrogate and the one-vs-all hinge loss, albeit with a different predictor than the standard argmax, yield consistent algorithms for this problem when $\alpha=\frac{1}{2}$. More interestingly, we design a new convex surrogate, which we call the binary encoded predictions surrogate, that is also consistent for this problem when $\alpha=\frac{1}{2}$ and operates on a much lower dimensional space ($\log(n)$ as opposed to $n$). We also construct modified versions of all these three surrogates to be consistent for any given $\alpha\in[0,\frac{1}{2}]$.
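The Bayes-optimal behavior being targeted is the classical Chow-type reject rule: abstain whenever the top class probability falls below $1-\alpha$. The paper's contribution is surrogates that provably learn this behavior; the sketch below shows only the prediction rule itself, assuming class-probability estimates are given.

```python
def predict_with_abstain(probs, alpha=0.5):
    """Chow-type rule with abstain cost alpha: return the argmax class, or
    None (abstain) when its estimated probability is below 1 - alpha.
    With alpha = 1/2 this abstains unless some class reaches probability
    1/2.  Illustrative plug-in rule, not the paper's surrogate losses."""
    j = max(range(len(probs)), key=probs.__getitem__)
    return j if probs[j] >= 1 - alpha else None
```

A confident posterior yields a class label; a diffuse one yields an abstention, which is cheaper than a likely misclassification whenever $\alpha<1$.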
</p>
Published: Tue, 12 Jun 2018 22:06 EDT

Quantum non demolition measurements: Parameter estimation for mixtures of multinomials
https://projecteuclid.org/euclid.ejs/1519376423
<strong>Tristan Benoist</strong>, <strong>Fabrice Gamboa</strong>, <strong>Clément Pellegrini</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 555--571.</p><p><strong>Abstract:</strong><br/>
In Quantum Non Demolition measurements, the sequence of observations is distributed as a mixture of multinomial random variables. Parameters of the dynamics are naturally encoded into this family of distributions. We show the local asymptotic mixed normality of the underlying statistical model and the consistency of the maximum likelihood estimator. Furthermore, we prove the asymptotic optimality of this estimator, as it saturates the usual Cramér-Rao bound.
</p>
Published: Tue, 12 Jun 2018 22:06 EDT

Flexible linear mixed models with improper priors for longitudinal and survival data
https://projecteuclid.org/euclid.ejs/1519700495
<strong>F. J. Rubio</strong>, <strong>M. F. J. Steel</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 572--598.</p><p><strong>Abstract:</strong><br/>
We propose a Bayesian approach using improper priors for hierarchical linear mixed models with flexible random effects and residual error distributions. The error distribution is modelled using scale mixtures of normals, which can capture tails heavier than those of the normal distribution. This generalisation is useful to produce models that are robust to the presence of outliers. The case of asymmetric residual errors is also studied. We present general results for the propriety of the posterior that also cover cases with censored observations, allowing for the use of these models in the contexts of popular longitudinal and survival analyses. We consider the use of copulas with flexible marginals for modelling the dependence between the random effects, but our results cover the use of any random effects distribution. Thus, our paper provides a formal justification for Bayesian inference in a very wide class of models (covering virtually all of the literature) under attractive prior structures that limit the amount of required user elicitation.
</p>
Published: Tue, 12 Jun 2018 22:06 EDT

Robust boosting with truncated loss functions
https://projecteuclid.org/euclid.ejs/1519700496
<strong>Zhu Wang</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 599--650.</p><p><strong>Abstract:</strong><br/>
Boosting is a powerful machine learning tool with attractive theoretical properties. In recent years, boosting algorithms have been extended to many statistical estimation problems. For data contaminated with outliers, however, development of boosting algorithms is very limited. In this paper, innovative robust boosting algorithms utilizing the majorization-minimization (MM) principle are developed for binary and multi-category classification problems. Based on truncated loss functions, the robust boosting algorithms share a unified framework for linear and nonlinear effects models. The proposed methods can reduce the heavy influence from a small number of outliers which could otherwise distort the results. In addition, adaptive boosting for the truncated loss functions is developed to construct sparser predictive models. We present convergence guarantees for smooth surrogate loss functions with both iteration-varying and constant step-sizes. We conducted empirical studies using data from simulations, a pediatric database developed for the US Healthcare Cost and Utilization Project, and breast cancer gene expression data. Compared with non-robust boosting, robust boosting improves classification accuracy and variable selection.
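The MM-with-truncation idea can be shown in miniature on the simplest possible problem, a single location parameter under the truncated squared loss. Because $\min(r^{2},\tau)$ is concave in $r^{2}$, its linear majorizer assigns weight 1 to points with squared residual below $\tau$ and weight 0 otherwise, so each M-step is an ordinary fit on the current inliers. This toy is only the MM principle at work, not the paper's boosting algorithms.

```python
def mm_truncated_mean(y, tau, iters=50):
    """Toy MM scheme minimizing sum_i min((y_i - mu)^2, tau) over mu.
    Each M-step refits using only the points whose current squared
    residual is below tau, so gross outliers stop influencing the fit.
    Starts from the median for robustness."""
    mu = sorted(y)[len(y) // 2]          # robust starting value
    for _ in range(iters):
        inliers = [v for v in y if (v - mu) ** 2 < tau] or y
        mu = sum(inliers) / len(inliers)  # weighted (0/1) least-squares step
    return mu
```

With one wild observation among small values, the untruncated mean is dragged far off while the MM fit stays near the bulk of the data.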
</p>
Published: Tue, 12 Jun 2018 22:06 EDT

Minimax lower bounds for function estimation on graphs
https://projecteuclid.org/euclid.ejs/1519700497
<strong>Alisa Kirichenko</strong>, <strong>Harry van Zanten</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 651--666.</p><p><strong>Abstract:</strong><br/>
We study minimax lower bounds for function estimation problems on large graphs when the target function is smoothly varying over the graph. We derive minimax rates in the context of regression and classification problems on graphs that satisfy an asymptotic shape assumption and with a smoothness condition on the target function, both formulated in terms of the graph Laplacian.
</p>
Published: Tue, 12 Jun 2018 22:06 EDT

Variable screening for high dimensional time series
https://projecteuclid.org/euclid.ejs/1519700498
<strong>Kashif Yousuf</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 667--702.</p><p><strong>Abstract:</strong><br/>
Variable selection is a widely studied problem in high dimensional statistics, primarily since estimating the precise relationship between the covariates and the response is of great importance in many scientific disciplines. However, most of the theory and methods developed towards this goal for the linear model invoke the assumption of i.i.d. sub-Gaussian covariates and errors. This paper analyzes the theoretical properties of Sure Independence Screening (SIS) (Fan and Lv [20]) for high dimensional linear models with dependent and/or heavy tailed covariates and errors. We also introduce a generalized least squares screening (GLSS) procedure which utilizes the serial correlation present in the data. By utilizing this serial correlation when estimating our marginal effects, GLSS is shown to outperform SIS in many cases. For both procedures we prove sure screening properties, which depend on the moment conditions and the strength of dependence in the error and covariate processes, amongst other factors. Additionally, combining these screening procedures with the adaptive Lasso is analyzed. Dependence is quantified by functional dependence measures (Wu [49]), and the results rely on the use of Nagaev-type and exponential inequalities for dependent random variables. We also conduct simulations to demonstrate the finite sample performance of these procedures, and include a real data application of forecasting the US inflation rate.
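The baseline SIS step is simple to state and to code: rank the $p$ covariates by absolute marginal sample correlation with the response and keep the top $d$. The sketch below shows only this marginal screening (the GLSS refinement exploiting serial correlation is not reproduced):

```python
def sis_screen(X, y, d):
    """Marginal (Sure Independence) screening: rank covariates by absolute
    sample correlation with the response and keep the top d indices.
    X is a list of n rows of p features; illustrative implementation."""
    n, p = len(X), len(X[0])
    ybar = sum(y) / n
    syy = sum((b - ybar) ** 2 for b in y)

    def abscorr(j):
        col = [row[j] for row in X]
        xbar = sum(col) / n
        sxy = sum((a - xbar) * (b - ybar) for a, b in zip(col, y))
        sxx = sum((a - xbar) ** 2 for a in col)
        return abs(sxy) / ((sxx * syy) ** 0.5 + 1e-12)

    return sorted(range(p), key=abscorr, reverse=True)[:d]
```

The "sure screening" property above says that, under suitable conditions, the retained set contains all truly active covariates with probability tending to one.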
</p>projecteuclid.org/euclid.ejs/1519700498_20180612220622Tue, 12 Jun 2018 22:06 EDTEfficient semiparametric estimation and model selection for multidimensional mixtureshttps://projecteuclid.org/euclid.ejs/1519700499<strong>Elisabeth Gassiat</strong>, <strong>Judith Rousseau</strong>, <strong>Elodie Vernet</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 703--740.</p><p><strong>Abstract:</strong><br/>
In this paper, we consider nonparametric multidimensional finite mixture models and we are interested in the semiparametric estimation of the population weights. Here, the i.i.d. observations are assumed to have at least three components which are independent given the population. We approximate the semiparametric model by projecting the conditional distributions on step functions associated to some partition. Our first main result is that if we refine the partition slowly enough, the associated sequence of maximum likelihood estimators of the weights is asymptotically efficient, and the posterior distribution of the weights, when using a Bayesian procedure, satisfies a semiparametric Bernstein-von Mises theorem. We then propose a cross-validation like method to select the partition in a finite horizon. Our second main result is that the proposed procedure satisfies an oracle inequality. Numerical experiments on simulated data illustrate our theoretical results.
</p>projecteuclid.org/euclid.ejs/1519700499_20180612220622Tue, 12 Jun 2018 22:06 EDTNew FDR bounds for discrete and heterogeneous testshttps://projecteuclid.org/euclid.ejs/1528855551<strong>Sebastian Döhler</strong>, <strong>Guillermo Durand</strong>, <strong>Etienne Roquain</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 1867--1900.</p><p><strong>Abstract:</strong><br/>
To find interesting items in genome-wide association studies or next-generation sequencing data, a crucial point is to design powerful false discovery rate (FDR) controlling procedures that suitably combine discrete tests (typically binomial or Fisher tests). In particular, recent research has been striving for appropriate modifications of the classical Benjamini-Hochberg (BH) step-up procedure that accommodate discreteness and heterogeneity of the data. However, despite a substantial number of attempts, these procedures did not come with theoretical guarantees. In this paper, we provide new FDR bounds that allow us to fill this gap. More specifically, these bounds make it possible to construct BH-type procedures that incorporate the discrete and heterogeneous structure of the data and provably control the FDR for any fixed number of null hypotheses (under independence). Notably, our FDR-controlling methodology also allows us to incorporate the quantity of signal in the data (thus corresponding to a so-called $\pi_{0}$-adaptive procedure) and to recover some prominent results of the literature. The power advantage of the new methods is demonstrated in a numerical experiment and on appropriate real data sets.
</p>projecteuclid.org/euclid.ejs/1528855551_20180612220622Tue, 12 Jun 2018 22:06 EDTImproved bounds for Square-Root Lasso and Square-Root Slopehttps://projecteuclid.org/euclid.ejs/1519722051<strong>Alexis Derumigny</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 741--766.</p><p><strong>Abstract:</strong><br/>
Extending the results of Bellec, Lecué and Tsybakov [1] to the setting of sparse high-dimensional linear regression with unknown variance, we show that two estimators, the Square-Root Lasso and the Square-Root Slope, can achieve the optimal minimax prediction rate, which is $(s/n)\log\left (p/s\right )$, up to some constant, under some mild conditions on the design matrix. Here, $n$ is the sample size, $p$ is the dimension and $s$ is the sparsity parameter. We also prove optimality for the estimation error in the $l_{q}$-norm, with $q\in[1,2]$ for the Square-Root Lasso, and in the $l_{2}$ and sorted $l_{1}$ norms for the Square-Root Slope. Both estimators are adaptive to the unknown variance of the noise. The Square-Root Slope is also adaptive to the sparsity $s$ of the true parameter. Next, we prove that any estimator depending on $s$ which attains the minimax rate admits a version adaptive to $s$ that still attains the same rate. We apply this result to the Square-Root Lasso. Moreover, for both estimators, we obtain valid rates for a wide range of confidence levels, and improved concentration properties as in [1], where the case of known variance is treated. Our results are non-asymptotic.
</p>projecteuclid.org/euclid.ejs/1519722051_20180613220156Wed, 13 Jun 2018 22:01 EDTHypothesis testing sure independence screening for nonparametric regressionhttps://projecteuclid.org/euclid.ejs/1520046228<strong>Adriano Zanin Zambom</strong>, <strong>Michael G. Akritas</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 767--792.</p><p><strong>Abstract:</strong><br/>
In this paper we develop a sure independence screening method based on hypothesis testing (HT-SIS) in a general nonparametric regression model. The ranking utility is based on a powerful test statistic for the hypothesis of predictive significance of each available covariate. The sure screening property of HT-SIS is established, demonstrating that all active predictors will be retained with high probability as the sample size increases. The threshold parameter is chosen in a theoretically justified manner based on the desired false positive selection rate. Simulation results suggest that the proposed method performs competitively against procedures found in the literature of screening for several models, and outperforms them in some scenarios. A real dataset of microarray gene expressions is analyzed.
</p>projecteuclid.org/euclid.ejs/1520046228_20180613220156Wed, 13 Jun 2018 22:01 EDTImproved classification rates under refined margin conditionshttps://projecteuclid.org/euclid.ejs/1520046229<strong>Ingrid Blaschzyk</strong>, <strong>Ingo Steinwart</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 793--823.</p><p><strong>Abstract:</strong><br/>
In this paper we present a simple partitioning-based technique to refine the statistical analysis of classification algorithms. The core of our idea is to divide the input space into two parts such that the first part contains a suitable vicinity around the decision boundary, while the second part is sufficiently far away from the decision boundary. Using a set of margin conditions we are then able to control the classification error on both parts separately. By balancing out these two error terms we obtain a refined error analysis in a final step. We apply this general idea to the histogram rule and show that even for this simple method we obtain, under certain assumptions, better rates than the ones known for support vector machines, for certain plug-in classifiers, and for a recently analyzed tree-based adaptive-partitioning ansatz. Moreover, we show that a margin condition which sets the critical noise in relation to the decision boundary makes it possible to improve the optimal rates proven for distributions without this margin condition.
</p>projecteuclid.org/euclid.ejs/1520046229_20180613220156Wed, 13 Jun 2018 22:01 EDTEfficient estimation in the partially linear quantile regression model for longitudinal datahttps://projecteuclid.org/euclid.ejs/1520046230<strong>Seonjin Kim</strong>, <strong>Hyunkeun Ryan Cho</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 824--850.</p><p><strong>Abstract:</strong><br/>
The focus of this study is efficient estimation in a quantile regression model with partially linear coefficients for longitudinal data, where repeated measurements within each subject are likely to be correlated. We propose a weighted quantile regression approach for time-invariant and time-varying coefficient estimation. The proposed approach can employ two types of weights obtained from an empirical likelihood method to account for the within-subject correlation: the global weight using all observations and the local weight using observations in the neighborhood of the time point of interest. We investigate the influence of the choice of weights on asymptotic estimation efficiency and find theoretical results that are counterintuitive: it is essential to use the global weight for both time-invariant and time-varying coefficient estimation. This benefits from the within-subject correlation and avoids an adverse effect due to weight discordance. For statistical inference, a random perturbation approach is utilized and evaluated through simulation studies. The proposed approach is also illustrated through a Multi-Center AIDS Cohort study.
</p>projecteuclid.org/euclid.ejs/1520046230_20180613220156Wed, 13 Jun 2018 22:01 EDTNormalizing constants of log-concave densitieshttps://projecteuclid.org/euclid.ejs/1520240451<strong>Nicolas Brosse</strong>, <strong>Alain Durmus</strong>, <strong>Éric Moulines</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 851--889.</p><p><strong>Abstract:</strong><br/>
We derive explicit bounds for the computation of normalizing constants $Z$ for log-concave densities $\pi =\mathrm{e}^{-U}/Z$ w.r.t. the Lebesgue measure on $\mathbb{R}^{d}$. Our approach relies on a Gaussian annealing combined with recent and precise bounds on the Unadjusted Langevin Algorithm [15]. Polynomial bounds in the dimension $d$ are obtained with an exponent that depends on the assumptions made on $U$. The algorithm also provides a theoretically grounded choice of the annealing sequence of variances. A numerical experiment supports our findings. Results of independent interest on the mean squared error of the empirical average of locally Lipschitz functions are established.
</p>projecteuclid.org/euclid.ejs/1520240451_20180613220156Wed, 13 Jun 2018 22:01 EDTEstimation of the asymptotic variance of univariate and multivariate random fields and statistical inferencehttps://projecteuclid.org/euclid.ejs/1520326826<strong>Annabel Prause</strong>, <strong>Ansgar Steland</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 890--940.</p><p><strong>Abstract:</strong><br/>
Correlated random fields are a common way to model dependence structures in high-dimensional data, especially for data collected in imaging. One important parameter characterizing the degree of dependence is the asymptotic variance, which adds up all autocovariances in the temporal and spatial domain. In particular, it arises in the standardization of test statistics based on partial sums of random fields, and thus the construction of tests requires its estimation. In this paper we propose consistent estimators for this parameter for strictly stationary $\varphi $-mixing random fields with arbitrary dimension of the domain and taking values in a Euclidean space of arbitrary dimension, thus allowing for multivariate random fields. We establish consistency, provide central limit theorems and show that distributional approximations of related test statistics based on sample autocovariances of random fields can be obtained by the subsampling approach.
Since in applications the spatial-temporal correlations are often quite local, so that many autocovariances vanish or are negligible, we also investigate a thresholding approach where sample autocovariances of small magnitude are omitted. Extensive simulation studies show that the proposed estimators work well in practice and, when used to standardize image test statistics, can provide highly accurate image testing procedures. With automated applications on a big-data scale in mind, as arising in data science problems, these examinations also cover the proposed data-adaptive procedures for selecting method parameters.
</p>projecteuclid.org/euclid.ejs/1520326826_20180613220156Wed, 13 Jun 2018 22:01 EDTLeast tail-trimmed absolute deviation estimation for autoregressions with infinite/finite variancehttps://projecteuclid.org/euclid.ejs/1520413266<strong>Rongning Wu</strong>, <strong>Yunwei Cui</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 941--959.</p><p><strong>Abstract:</strong><br/>
We propose least tail-trimmed absolute deviation estimation for autoregressive processes with infinite/finite variance. We explore the large sample properties of the resulting estimator and establish its asymptotic normality. Moreover, we study convergence rates of the estimator under different moment settings and show that it attains a super-$\sqrt{n}$ convergence rate when the innovation variance is infinite. Simulation studies are carried out to examine the finite-sample performance of the proposed method and that of relevant statistical inferences. A real example is also presented.
</p>projecteuclid.org/euclid.ejs/1520413266_20180613220156Wed, 13 Jun 2018 22:01 EDTSupervised dimensionality reduction via distance correlation maximizationhttps://projecteuclid.org/euclid.ejs/1520586206<strong>Praneeth Vepakomma</strong>, <strong>Chetan Tonde</strong>, <strong>Ahmed Elgammal</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 960--984.</p><p><strong>Abstract:</strong><br/>
In our work, we propose a novel formulation for supervised dimensionality reduction based on a nonlinear dependency criterion called Statistical Distance Correlation (Székely et al., 2007). We propose an objective which is free of distributional assumptions on the regression variables and of regression model assumptions. Our proposed formulation is based on learning a low-dimensional feature representation $\mathbf{z}$, which maximizes the squared sum of Distance Correlations between the low-dimensional features $\mathbf{z}$ and the response $y$, and also between the features $\mathbf{z}$ and the covariates $\mathbf{x}$. We propose a novel algorithm to optimize our proposed objective using the Generalized Minimization Maximization method of Parizi et al. (2015). We show superior empirical results on multiple datasets, demonstrating the effectiveness of our proposed approach over several relevant state-of-the-art supervised dimensionality reduction methods.
</p>projecteuclid.org/euclid.ejs/1520586206_20180613220156Wed, 13 Jun 2018 22:01 EDTRidge regression for the functional concurrent modelhttps://projecteuclid.org/euclid.ejs/1521079461<strong>Tito Manrique</strong>, <strong>Christophe Crambes</strong>, <strong>Nadine Hilgert</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 985--1018.</p><p><strong>Abstract:</strong><br/>
The aim of this paper is to propose estimators of the unknown functional coefficients in the Functional Concurrent Model (FCM). We extend the Ridge Regression method developed in the classical linear case to the functional data framework. Two distinct penalized estimators are obtained: one with a constant regularization parameter and the other with a functional one. We prove the convergence in probability of these estimators, with rates. Then we study the practical choice of both regularization parameters. Additionally, we present some simulations that show the accuracy of these estimators despite a very low signal-to-noise ratio.
</p>projecteuclid.org/euclid.ejs/1521079461_20180613220156Wed, 13 Jun 2018 22:01 EDTHigh dimensional efficiency with applications to change point testshttps://projecteuclid.org/euclid.ejs/1528941678<strong>John A.D. Aston</strong>, <strong>Claudia Kirch</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 1901--1947.</p><p><strong>Abstract:</strong><br/>
This paper rigorously introduces the asymptotic concept of high dimensional efficiency, which quantifies the detection power of different statistics in high dimensional multivariate settings. It allows for comparisons of different high dimensional methods with different null asymptotics and even different asymptotic behavior such as extremal-type asymptotics. The concept will be used to understand the power behavior of different test statistics, as the performance will greatly depend on the assumptions made, such as sparseness or denseness of the signal. The effect of misspecification of the covariance on the power of the tests is also investigated, because in many high dimensional situations estimation of the full dependency (covariance) between the multivariate observations in the panel is often either computationally or even theoretically infeasible. The theoretical quantification is accompanied by simulation results which confirm the asymptotic findings for surprisingly small samples. The development of this concept was motivated by, but is by no means limited to, high-dimensional change point tests. It is shown that the concept of high dimensional efficiency is indeed suitable to describe small sample power.
</p>projecteuclid.org/euclid.ejs/1528941678_20180613220156Wed, 13 Jun 2018 22:01 EDTFeasible invertibility conditions and maximum likelihood estimation for observation-driven modelshttps://projecteuclid.org/euclid.ejs/1521079462<strong>Francisco Blasques</strong>, <strong>Paolo Gorgi</strong>, <strong>Siem Jan Koopman</strong>, <strong>Olivier Wintenberger</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 1019--1052.</p><p><strong>Abstract:</strong><br/>
Invertibility conditions for observation-driven time series models often fail to be guaranteed in empirical applications. As a result, the asymptotic theory of maximum likelihood and quasi-maximum likelihood estimators may be compromised. We derive considerably weaker conditions that can be used in practice to ensure the consistency of the maximum likelihood estimator for a wide class of observation-driven time series models. Our consistency results hold for both correctly specified and misspecified models. We also obtain an asymptotic test and confidence bounds for the unfeasible “true” invertibility region of the parameter space. The practical relevance of the theory is highlighted in a set of empirical examples. For instance, we derive the consistency of the maximum likelihood estimator of the Beta-$t$-GARCH model under weaker conditions than those considered in previous literature.
</p>projecteuclid.org/euclid.ejs/1521079462_20180618040214Mon, 18 Jun 2018 04:02 EDTExact post-selection inference for the generalized lasso pathhttps://projecteuclid.org/euclid.ejs/1521252212<strong>Sangwon Hyun</strong>, <strong>Max G’Sell</strong>, <strong>Ryan J. Tibshirani</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 1053--1097.</p><p><strong>Abstract:</strong><br/>
We study tools for inference conditioned on model selection events that are defined by the generalized lasso regularization path. The generalized lasso estimate is given by the solution of a penalized least squares regression problem, where the penalty is the $\ell_{1}$ norm of a matrix $D$ times the coefficient vector. The generalized lasso path collects these estimates as the penalty parameter $\lambda$ varies (from $\infty$ down to 0). Leveraging a (sequential) characterization of this path from Tibshirani and Taylor [37], and recent advances in post-selection inference from Lee et al. [22] and Tibshirani et al. [38], we develop exact hypothesis tests and confidence intervals for linear contrasts of the underlying mean vector, conditioned on any model selection event along the generalized lasso path (assuming Gaussian errors in the observations).
Our construction of inference tools holds for any penalty matrix $D$. By inspecting specific choices of $D$, we obtain post-selection tests and confidence intervals for specific cases of generalized lasso estimates, such as the fused lasso, trend filtering, and the graph fused lasso. In the fused lasso case, the underlying coordinates of the mean are assigned a linear ordering, and our framework allows us to test selectively chosen breakpoints or changepoints in these mean coordinates. This is an interesting and well-studied problem with broad applications; our framework applied to the trend filtering and graph fused lasso cases serves several applications as well. Aside from the development of selective inference tools, we describe several practical aspects of our methods, such as (valid, i.e., fully-accounted-for) post-processing of generalized lasso estimates before performing inference in order to improve power, and problem-specific visualization aids that may be given to the data analyst to aid in choosing linear contrasts to be tested. Many examples, from both simulated and real data sources, are presented to examine the empirical properties of our inference methods.
</p>projecteuclid.org/euclid.ejs/1521252212_20180618040214Mon, 18 Jun 2018 04:02 EDTInference for heavy tailed stationary time series based on sliding blockshttps://projecteuclid.org/euclid.ejs/1522116040<strong>Axel Bücher</strong>, <strong>Johan Segers</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 1098--1125.</p><p><strong>Abstract:</strong><br/>
The block maxima method in extreme value theory consists of fitting an extreme value distribution to a sample of block maxima extracted from a time series. Traditionally, the maxima are taken over disjoint blocks of observations. Alternatively, the blocks can be chosen to slide through the observation period, yielding a larger number of overlapping blocks. Inference based on sliding blocks is found to be more efficient than inference based on disjoint blocks. The asymptotic variance of the maximum likelihood estimator of the Fréchet shape parameter is reduced by more than 18%. Interestingly, the amount of the efficiency gain is the same whatever the serial dependence of the underlying time series: as for disjoint blocks, the asymptotic distribution depends on the serial dependence only through the sequence of scaling constants. The findings are illustrated by simulation experiments and are applied to the estimation of high return levels of the daily log-returns of the Standard & Poor’s 500 stock market index.
</p>projecteuclid.org/euclid.ejs/1522116040_20180618040214Mon, 18 Jun 2018 04:02 EDTA strong converse bound for multiple hypothesis testing, with applications to high-dimensional estimationhttps://projecteuclid.org/euclid.ejs/1522116041<strong>Ramji Venkataramanan</strong>, <strong>Oliver Johnson</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 1126--1149.</p><p><strong>Abstract:</strong><br/>
In statistical inference problems, we wish to obtain lower bounds on the minimax risk, that is to bound the performance of any possible estimator. A standard technique to do this involves the use of Fano’s inequality. However, recent work in an information-theoretic setting has shown that an argument based on binary hypothesis testing gives tighter converse results (error lower bounds) than Fano for channel coding problems. We adapt this technique to the statistical setting, and argue that Fano’s inequality can always be replaced by this approach to obtain tighter lower bounds that can be easily computed and are asymptotically sharp. We illustrate our technique in three applications: density estimation, active learning of a binary classifier, and compressed sensing, obtaining tighter risk lower bounds in each case.
</p>projecteuclid.org/euclid.ejs/1522116041_20180618040214Mon, 18 Jun 2018 04:02 EDTSupervised multiway factorizationhttps://projecteuclid.org/euclid.ejs/1522116042<strong>Eric F. Lock</strong>, <strong>Gen Li</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 1150--1180.</p><p><strong>Abstract:</strong><br/>
We describe a probabilistic PARAFAC/CANDECOMP (CP) factorization for multiway (i.e., tensor) data that incorporates auxiliary covariates, called SupCP. SupCP generalizes the supervised singular value decomposition (SupSVD) for vector-valued observations to allow for observations that have the form of a matrix or higher-order array. Such data are increasingly encountered in biomedical research and other fields. We use a novel likelihood-based latent variable representation of the CP factorization, in which the latent variables are informed by additional covariates. We give conditions for identifiability, and develop an EM algorithm for simultaneous estimation of all model parameters. SupCP can be used for dimension reduction, capturing latent structures that are more accurate and interpretable due to covariate supervision. Moreover, SupCP specifies a full probability distribution for a multiway data observation with given covariate values, which can be used for predictive modeling. We conduct comprehensive simulations to evaluate the SupCP algorithm. We apply it to a facial image database with facial descriptors (e.g., smiling/not smiling) as covariates, and to a study of amino acid fluorescence. Software is available at https://github.com/lockEF/SupCP.
</p>projecteuclid.org/euclid.ejs/1522116042_20180618040214Mon, 18 Jun 2018 04:02 EDTAn MM algorithm for estimation of a two component semiparametric density mixture with a known componenthttps://projecteuclid.org/euclid.ejs/1522224150<strong>Zhou Shen</strong>, <strong>Michael Levine</strong>, <strong>Zuofeng Shang</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 1181--1209.</p><p><strong>Abstract:</strong><br/>
We consider a semiparametric mixture of two univariate density functions where one of them is known while the weight and the other function are unknown. We do not assume any additional structure on the unknown density function. For this mixture model, we derive a new sufficient identifiability condition and pinpoint a specific class of distributions describing the unknown component for which this condition is mostly satisfied. We also suggest a novel approach to estimation of this model that is based on an idea of applying a maximum smoothed likelihood to what would otherwise have been an ill-posed problem. We introduce an iterative MM (Majorization-Minimization) algorithm that estimates all of the model parameters. We establish that the algorithm possesses a descent property with respect to a log-likelihood objective functional and prove that the algorithm, indeed, converges. Finally, we also illustrate the performance of our algorithm in a simulation study and apply it to a real dataset.
</p>projecteuclid.org/euclid.ejs/1522224150_20180618040214Mon, 18 Jun 2018 04:02 EDTConvex and non-convex regularization methods for spatial point processes intensity estimationhttps://projecteuclid.org/euclid.ejs/1522288952<strong>Achmad Choiruddin</strong>, <strong>Jean-François Coeurjolly</strong>, <strong>Frédérique Letué</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 1210--1255.</p><p><strong>Abstract:</strong><br/>
This paper deals with feature selection procedures for spatial point process intensity estimation. We consider regularized versions of estimating equations based on the Campbell theorem. In particular, we consider two classical functions: the Poisson likelihood and the logistic regression likelihood. We provide general conditions on the spatial point processes and on the penalty functions which ensure the oracle property, consistency, and asymptotic normality under the increasing domain setting. We discuss the numerical implementation and assess finite sample properties in simulation studies. Finally, an application to tropical forestry datasets illustrates the use of the proposed method.
</p>projecteuclid.org/euclid.ejs/1522288952_20180618040214Mon, 18 Jun 2018 04:02 EDTFast adaptive estimation of log-additive exponential models in Kullback-Leibler divergencehttps://projecteuclid.org/euclid.ejs/1522828871<strong>Cristina Butucea</strong>, <strong>Jean-François Delmas</strong>, <strong>Anne Dutfoy</strong>, <strong>Richard Fischer</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 1256--1298.</p><p><strong>Abstract:</strong><br/>
We study the problem of nonparametric estimation of probability density functions (pdf) with a product form on the domain $\triangle =\{(x_{1},\ldots ,x_{d})\in{\mathbb{R}} ^{d},0\leq x_{1}\leq \dots\leq x_{d}\leq 1\}$. Such pdf’s appear in the random truncation model as the joint pdf of the observations. They are also obtained as maximum entropy distributions of order statistics with given marginals. We propose an estimation method based on the approximation of the logarithm of the density by a carefully chosen family of basis functions. We show that the method achieves a fast convergence rate in probability with respect to the Kullback-Leibler divergence for pdf’s whose logarithm belongs to a Sobolev function class with known regularity. In the case when the regularity is unknown, we propose an estimation procedure using convex aggregation of the log-densities to obtain adaptability. The performance of this method is illustrated in a simulation study.
</p>projecteuclid.org/euclid.ejs/1522828871_20180618040214Mon, 18 Jun 2018 04:02 EDTConditional kernel density estimation for some incomplete data modelshttps://projecteuclid.org/euclid.ejs/1524881058<strong>Ting Yan</strong>, <strong>Liangqiang Qu</strong>, <strong>Zhaohai Li</strong>, <strong>Ao Yuan</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 1299--1329.</p><p><strong>Abstract:</strong><br/>
A class of density estimators based on observed incomplete data is proposed. The method is to use a conditional kernel, defined as the expectation of a given kernel for the complete data conditional on the observed data, to construct the density estimator. We study such kernel density estimators for several commonly used incomplete data models and establish their basic asymptotic properties. Some characteristics different from the classical kernel estimators are discovered. For instance, the asymptotic results of the proposed estimator do not depend on the choice of the kernel $k(\cdot )$. A simulation study is conducted to evaluate the performance of the estimator and to compare it with some existing methods.
</p>projecteuclid.org/euclid.ejs/1524881058_20180618040214Mon, 18 Jun 2018 04:02 EDTBayesian nonparametric estimation of survival functions with multiple-samples informationhttps://projecteuclid.org/euclid.ejs/1525334453<strong>Alan Riva Palacio</strong>, <strong>Fabrizio Leisen</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 1330--1357.</p><p><strong>Abstract:</strong><br/>
In many real problems, dependence structures more general than exchangeability are required. For instance, in some settings partial exchangeability is a more reasonable assumption. For this reason, vectors of dependent Bayesian nonparametric priors have recently gained popularity. They provide flexible models which are tractable from a computational and theoretical point of view. In this paper, we focus on their use for estimating multivariate survival functions. Our model extends the work of Epifani and Lijoi (2010) to an arbitrary dimension and allows us to model the dependence among survival times of different groups of observations. Theoretical results about the posterior behaviour of the underlying dependent vector of completely random measures are provided. The performance of the model is tested on a simulated dataset arising from a distributional Clayton copula.
</p>projecteuclid.org/euclid.ejs/1525334453_20180618040214Mon, 18 Jun 2018 04:02 EDTBayesian inference for spectral projectors of the covariance matrixhttps://projecteuclid.org/euclid.ejs/1529308884<strong>Igor Silin</strong>, <strong>Vladimir Spokoiny</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 1948--1987.</p><p><strong>Abstract:</strong><br/>
Let $X_{1},\ldots ,X_{n}$ be an i.i.d. sample in $\mathbb{R}^{p}$ with zero mean and the covariance matrix ${\boldsymbol{\varSigma }^{*}}$. The classical PCA approach recovers the projector $\boldsymbol{P}^{*}_{\mathcal{J}}$ onto the principal eigenspace of ${\boldsymbol{\varSigma }^{*}}$ by its empirical counterpart $\widehat{\boldsymbol{P}}_{\mathcal{J}}$. The recent paper [24] investigated the asymptotic distribution of the Frobenius distance between the projectors $\|\widehat{\boldsymbol{P}}_{\mathcal{J}}-\boldsymbol{P}^{*}_{\mathcal{J}}\|_{2}$, while [27] offered a bootstrap procedure to measure uncertainty in recovering this subspace $\boldsymbol{P}^{*}_{\mathcal{J}}$ even in a finite sample setup. The present paper considers this problem from a Bayesian perspective and suggests using the credible sets of the pseudo-posterior distribution on the space of covariance matrices induced by the conjugate Inverse Wishart prior as sharp confidence sets. This yields a numerically efficient procedure. Moreover, we theoretically justify this method and derive finite sample bounds on the corresponding coverage probability. Contrary to [24, 27], the obtained results are valid for non-Gaussian data: the main assumption that we impose is the concentration of the sample covariance $\widehat{\boldsymbol{\varSigma }}$ in a vicinity of ${\boldsymbol{\varSigma }^{*}}$. Numerical simulations illustrate the good performance of the proposed procedure even on non-Gaussian data in a rather challenging regime.
</p>projecteuclid.org/euclid.ejs/1529308884_20180618040214Mon, 18 Jun 2018 04:02 EDTSelection by partitioning the solution pathshttps://projecteuclid.org/euclid.ejs/1529308885<strong>Yang Liu</strong>, <strong>Peng Wang</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 1988--2017.</p><p><strong>Abstract:</strong><br/>
The performance of penalized likelihood approaches depends profoundly on the selection of the tuning parameter; however, there is no commonly agreed-upon criterion for choosing the tuning parameter. Moreover, penalized likelihood estimation based on a single value of the tuning parameter suffers from several drawbacks. This article introduces a novel approach for feature selection based on the entire solution paths rather than the choice of a single tuning parameter, which significantly improves the accuracy of the selection. Moreover, the approach allows for feature selection using ridge or other strictly convex penalties. The key idea is to classify variables as relevant or irrelevant at each tuning parameter and then to select all of the variables which have been classified as relevant at least once. We establish the theoretical properties of the method, which requires significantly weaker conditions than existing methods in the literature. We also illustrate the advantages of the proposed approach with simulation studies and a data example.
</p>projecteuclid.org/euclid.ejs/1529308885_20180618040214Mon, 18 Jun 2018 04:02 EDTCommon price and volatility jumps in noisy high-frequency datahttps://projecteuclid.org/euclid.ejs/1529308886<strong>Markus Bibinger</strong>, <strong>Lars Winkelmann</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 2018--2073.</p><p><strong>Abstract:</strong><br/>
We introduce a statistical test for simultaneous jumps in the price of a financial asset and its volatility process. The proposed test is based on high-frequency data and is robust to market microstructure frictions. For the test, local estimators of volatility jumps at price jump arrival times are designed using a nonparametric spectral estimator of the spot volatility process. A simulation study and an empirical example with NASDAQ order book data demonstrate the practicability of the proposed methods and highlight the important role played by price volatility co-jumps.
</p>projecteuclid.org/euclid.ejs/1529308886_20180618040214Mon, 18 Jun 2018 04:02 EDTChange detection via affine and quadratic detectorshttps://projecteuclid.org/euclid.ejs/1514970025<strong>Yang Cao</strong>, <strong>Arkadi Nemirovski</strong>, <strong>Yao Xie</strong>, <strong>Vincent Guigues</strong>, <strong>Anatoli Juditsky</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 1--57.</p><p><strong>Abstract:</strong><br/>
The goal of the paper is to develop a specific application of the convex optimization based hypothesis testing techniques developed in A. Juditsky, A. Nemirovski, “Hypothesis testing via affine detectors,” Electronic Journal of Statistics 10 :2204–2242, 2016. Namely, we consider the Change Detection problem as follows: observing one by one noisy observations of outputs of a discrete-time linear dynamical system, we intend to decide, in a sequential fashion, on the null hypothesis that the input to the system is a nuisance, vs. the alternative that the input is a “nontrivial signal,” with both the nuisances and the nontrivial signals modeled as inputs belonging to finite unions of some given convex sets. Assuming the observation noises are zero mean sub-Gaussian, we develop “computation-friendly” sequential decision rules and demonstrate that in our context these rules are provably near-optimal.
</p>projecteuclid.org/euclid.ejs/1514970025_20180621040108Thu, 21 Jun 2018 04:01 EDTConfidence intervals for the means of the selected populationshttps://projecteuclid.org/euclid.ejs/1515142842<strong>Claudio Fuentes</strong>, <strong>George Casella</strong>, <strong>Martin T. Wells</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 58--79.</p><p><strong>Abstract:</strong><br/>
Consider an experiment in which $p$ independent populations $\pi_{i}$ with corresponding unknown means $\theta_{i}$ are available, and suppose that for every $1\leq i\leq p$, we can obtain a sample $X_{i1},\ldots,X_{in}$ from $\pi_{i}$. In this context, researchers are sometimes interested in selecting the populations that yield the largest sample means as a result of the experiment, and then estimate the corresponding population means $\theta_{i}$. In this paper, we present a frequentist approach to the problem and discuss how to construct simultaneous confidence intervals for the means of the $k$ selected populations, assuming that the populations $\pi_{i}$ are independent and normally distributed with a common variance $\sigma^{2}$. The method, based on the minimization of the coverage probability, obtains confidence intervals that attain the nominal coverage probability for any $p$ and $k$, taking into account the selection procedure.
</p>projecteuclid.org/euclid.ejs/1515142842_20180621040108Thu, 21 Jun 2018 04:01 EDTUniformly valid confidence sets based on the Lassohttps://projecteuclid.org/euclid.ejs/1526284830<strong>Karl Ewald</strong>, <strong>Ulrike Schneider</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 1358--1387.</p><p><strong>Abstract:</strong><br/>
In a linear regression model of fixed dimension $p\leq n$, we construct confidence regions for the unknown parameter vector based on the Lasso estimator that uniformly and exactly hold the prescribed coverage level in finite samples as well as in an asymptotic setup. We thereby quantify estimation uncertainty as well as the “post-model selection error” of this estimator. More concretely, in finite samples with Gaussian errors and asymptotically in the case where the Lasso estimator is tuned to perform conservative model selection, we derive exact formulas for computing the minimal coverage probability over the entire parameter space for a large class of shapes for the confidence sets, thus enabling the construction of valid confidence regions based on the Lasso estimator in these settings. The choice of shape for the confidence sets and the comparison with the confidence ellipse based on the least-squares estimator are also discussed. Moreover, in the case where the Lasso estimator is tuned to enable consistent model selection, we give a simple confidence region with minimal coverage probability converging to one. Finally, we also treat the case of unknown error variance and present some ideas for extensions.
</p>projecteuclid.org/euclid.ejs/1526284830_20180621040108Thu, 21 Jun 2018 04:01 EDTA two stage $k$-monotone B-spline regression estimator: Uniform Lipschitz property and optimal convergence ratehttps://projecteuclid.org/euclid.ejs/1526544023<strong>Teresa M. Lebair</strong>, <strong>Jinglai Shen</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 1388--1428.</p><p><strong>Abstract:</strong><br/>
This paper considers $k$-monotone estimation and the related asymptotic performance analysis over a suitable Hölder class for general $k$. A novel two stage $k$-monotone B-spline estimator is proposed: in the first stage, an unconstrained estimator with optimal asymptotic performance is considered; in the second stage, a $k$-monotone B-spline estimator is constructed (roughly) by projecting the unconstrained estimator onto a cone of $k$-monotone splines. To study the asymptotic performance of the second stage estimator under the sup-norm and other risks, a critical uniform Lipschitz property for the $k$-monotone B-spline estimator is established under the $\ell_{\infty }$-norm. This property uniformly bounds the Lipschitz constants associated with the mapping from a (weighted) first stage input vector to the B-spline coefficients of the second stage $k$-monotone estimator, independent of the sample size and the number of knots. This result is then exploited to analyze the second stage estimator performance and develop convergence rates under the sup-norm, pointwise, and $L_{p}$-norm (with $p\in [1,\infty )$) risks. By employing recent results in $k$-monotone estimation minimax lower bound theory, we show that these convergence rates are optimal.
</p>projecteuclid.org/euclid.ejs/1526544023_20180621040108Thu, 21 Jun 2018 04:01 EDTHigh-dimensional robust precision matrix estimation: Cellwise corruption under $\epsilon $-contaminationhttps://projecteuclid.org/euclid.ejs/1526630484<strong>Po-Ling Loh</strong>, <strong>Xin Lu Tan</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 1429--1467.</p><p><strong>Abstract:</strong><br/>
We analyze the statistical consistency of robust estimators for precision matrices in high dimensions. We focus on a contamination mechanism acting cellwise on the data matrix. The estimators we analyze are formed by plugging appropriately chosen robust covariance matrix estimators into the graphical Lasso and CLIME. Such estimators were recently proposed in the robust statistics literature, but only analyzed mathematically from the point of view of the breakdown point. This paper provides complementary high-dimensional error bounds for the precision matrix estimators that reveal the interplay between the dimensionality of the problem and the degree of contamination permitted in the observed distribution. We also show that although the graphical Lasso and CLIME estimators perform equally well from the point of view of statistical consistency, the breakdown property of the graphical Lasso is superior to that of CLIME. We discuss implications of our work for problems involving graphical model estimation when the uncontaminated data follow a multivariate normal distribution, and the goal is to estimate the support of the population-level precision matrix. Our error bounds do not make any assumptions about the contaminating distribution and allow for a nonvanishing fraction of cellwise contamination.
</p>projecteuclid.org/euclid.ejs/1526630484_20180621040108Thu, 21 Jun 2018 04:01 EDTDimension reduction-based significance testing in nonparametric regressionhttps://projecteuclid.org/euclid.ejs/1526695233<strong>Xuehu Zhu</strong>, <strong>Lixing Zhu</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 1468--1506.</p><p><strong>Abstract:</strong><br/>
A dimension reduction-based adaptive-to-model test is proposed for the significance of a subset of covariates in the context of a nonparametric regression model. Unlike existing locally smoothing significance tests, the new test behaves like a locally smoothing test as if the number of covariates were just that under the null hypothesis, and it can detect local alternatives distinct from the null hypothesis at a rate that is related only to the number of covariates under the null hypothesis. Thus, the curse of dimensionality is largely alleviated when nonparametric estimation is inevitably required. In cases where there are many insignificant covariates, the new test improves markedly over existing locally smoothing tests in both significance level maintenance and power enhancement. Simulation studies and a real data analysis are conducted to examine the finite sample performance of the proposed test.
</p>projecteuclid.org/euclid.ejs/1526695233_20180621040108Thu, 21 Jun 2018 04:01 EDTSlice inverse regression with score functionshttps://projecteuclid.org/euclid.ejs/1526889626<strong>Dmitry Babichev</strong>, <strong>Francis Bach</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 1507--1543.</p><p><strong>Abstract:</strong><br/>
We consider non-linear regression problems where we assume that the response depends non-linearly on a linear projection of the covariates. We propose score function extensions to sliced inverse regression problems, both for first-order and second-order score functions. We show that they provably improve estimation in the population case over the non-sliced versions and we study finite sample estimators and their consistency given the exact score functions. We also propose to learn the score function as well, in two steps, i.e., first learning the score function and then learning the effective dimension reduction space, or directly, by solving a convex optimization problem regularized by the nuclear norm. We illustrate our results on a series of experiments.
</p>projecteuclid.org/euclid.ejs/1526889626_20180621040108Thu, 21 Jun 2018 04:01 EDTAn extended empirical saddlepoint approximation for intractable likelihoodshttps://projecteuclid.org/euclid.ejs/1527300140<strong>Matteo Fasiolo</strong>, <strong>Simon N. Wood</strong>, <strong>Florian Hartig</strong>, <strong>Mark V. Bravington</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 1544--1578.</p><p><strong>Abstract:</strong><br/>
The challenges posed by complex stochastic models used in computational ecology, biology and genetics have stimulated the development of approximate approaches to statistical inference. Here we focus on Synthetic Likelihood (SL), a procedure that reduces the observed and simulated data to a set of summary statistics, and quantifies the discrepancy between them through a synthetic likelihood function. SL requires little tuning, but it relies on the approximate normality of the summary statistics. We relax this assumption by proposing a novel, more flexible, density estimator: the Extended Empirical Saddlepoint approximation. In addition to proving the consistency of SL, under either the new or the Gaussian density estimator, we illustrate the method using three examples. One of these is a complex individual-based forest model for which SL offers one of the few practical possibilities for statistical inference. The examples show that the new density estimator is able to capture large departures from normality, while being scalable to high dimensions, and this in turn leads to more accurate parameter estimates, relative to the Gaussian alternative. The new density estimator is implemented by the esaddle R package, which is freely available on the Comprehensive R Archive Network (CRAN).
</p>projecteuclid.org/euclid.ejs/1527300140_20180621040108Thu, 21 Jun 2018 04:01 EDTModified sequential change point procedures based on estimating functionshttps://projecteuclid.org/euclid.ejs/1527300141<strong>Claudia Kirch</strong>, <strong>Silke Weber</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 1579--1613.</p><p><strong>Abstract:</strong><br/>
A large class of sequential change point tests are based on estimating functions where estimation is computationally efficient as (possibly numeric) optimization is restricted to an initial estimation. This includes examples as diverse as mean changes, linear or non-linear autoregressive and binary models. While the standard cumulative-sum-detector (CUSUM) has recently been considered in this general setup, we consider several modifications that have faster detection rates in particular if changes occur late in the monitoring period. More precisely, we use three different types of detector statistics based on partial sums of a monitoring function, namely the modified moving-sum-statistic (mMOSUM), Page’s cumulative-sum-statistic (Page-CUSUM) and the standard moving-sum-statistic (MOSUM). The statistics only differ in the number of observations included in the partial sum. The mMOSUM uses a bandwidth parameter which multiplicatively scales the lower bound of the moving sum. The MOSUM uses a constant bandwidth parameter, while Page-CUSUM chooses the maximum over all possible lower bounds for the partial sums. So far, the first two schemes have only been studied in a linear model, and the MOSUM only for a mean change.
We develop the asymptotics under the null hypothesis and alternatives under mild regularity conditions for each test statistic, which include the existing theory but also many new examples. In a simulation study we compare all four types of test procedures in terms of their size, power and run length. Additionally we illustrate their behavior by applications to exchange rate data as well as the Boston homicide data.
</p>projecteuclid.org/euclid.ejs/1527300141_20180621040108Thu, 21 Jun 2018 04:01 EDTOn penalized estimation for dynamical systems with small noisehttps://projecteuclid.org/euclid.ejs/1527300142<strong>Alessandro De Gregorio</strong>, <strong>Stefano Maria Iacus</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 1614--1630.</p><p><strong>Abstract:</strong><br/>
We consider a dynamical system with small noise for which the drift is parametrized by a finite dimensional parameter. For this model, we consider minimum distance estimation from continuous time observations under $l^{p}$-penalty imposed on the parameters in the spirit of the Lasso approach, with the aim of simultaneous estimation and model selection. We study the consistency and the asymptotic distribution of these Lasso-type estimators for different values of $p$. For $p=1,$ we also consider the adaptive version of the Lasso estimator and establish its oracle properties.
</p>projecteuclid.org/euclid.ejs/1527300142_20180621040108Thu, 21 Jun 2018 04:01 EDTBayesian pairwise estimation under dependent informative samplinghttps://projecteuclid.org/euclid.ejs/1527300143<strong>Matthew R. Williams</strong>, <strong>Terrance D. Savitsky</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 1631--1661.</p><p><strong>Abstract:</strong><br/>
An informative sampling design leads to the selection of units whose inclusion probabilities are correlated with the response variable of interest. Inference under the population model performed on the resulting observed sample, without adjustment, will be biased for the population generative model. One approach that produces asymptotically unbiased inference employs marginal inclusion probabilities to form sampling weights used to exponentiate each likelihood contribution of a pseudo likelihood used to form a pseudo posterior distribution. Conditions for posterior consistency restrict applicable sampling designs to those under which pairwise inclusion dependencies asymptotically limit to $0$. There are many sampling designs excluded by this restriction; for example, a multi-stage design that samples individuals within households. Viewing each household as a population, the dependence among individuals does not attenuate. We propose a more targeted approach in this paper for inference focused on pairs of individuals or sampled units; for example, the substance use of one spouse in a shared household, conditioned on the substance use of the other spouse. We formulate the pseudo likelihood with weights based on pairwise or second order probabilities and demonstrate consistency, removing the requirement for asymptotic independence and replacing it with restrictions on higher order selection probabilities. Our approach provides a nearly automated estimation procedure applicable to any model specified by the data analyst. We demonstrate our method on the National Survey on Drug Use and Health.
</p>projecteuclid.org/euclid.ejs/1527300143_20180621040108Thu, 21 Jun 2018 04:01 EDTHeritability estimation in case-control studieshttps://projecteuclid.org/euclid.ejs/1527559245<strong>Anna Bonnet</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 1662--1716.</p><p><strong>Abstract:</strong><br/>
In the field of genetics, the concept of heritability refers to the proportion of variations of a biological trait or disease that can be explained by genetic factors. Quantifying the heritability of a disease is a fundamental challenge in human genetics, especially when the causes are plural and not clearly identified. Although the literature regarding heritability estimation for binary traits is less rich than for quantitative traits, several methods have been proposed to estimate the heritability of complex diseases. However, to the best of our knowledge, the existing methods are not supported by theoretical grounds. Moreover, most of the methodologies do not take into account a major specificity of the data coming from medical studies, which is the oversampling of the number of patients compared to controls. We propose in this paper to investigate the theoretical properties of the Phenotype Correlation Genotype Correlation (PCGC) regression developed by Golan, Lander and Rosset (2014), which is one of the major techniques used in statistical genetics and which is very efficient in practice, despite the oversampling of patients. Our main result is the proof of the consistency of this estimator, under several assumptions that we will state and discuss. We also provide a numerical study to compare two approximations leading to two heritability estimators.
</p>projecteuclid.org/euclid.ejs/1527559245_20180621040108Thu, 21 Jun 2018 04:01 EDTA deconvolution path for mixtureshttps://projecteuclid.org/euclid.ejs/1527559246<strong>Oscar-Hernan Madrid-Padilla</strong>, <strong>Nicholas G. Polson</strong>, <strong>James Scott</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 1717--1751.</p><p><strong>Abstract:</strong><br/>
We propose a class of estimators for deconvolution in mixture models based on a simple two-step “bin-and-smooth” procedure applied to histogram counts. The method is both statistically and computationally efficient: by exploiting recent advances in convex optimization, we are able to provide a full deconvolution path that shows the estimate for the mixing distribution across a range of plausible degrees of smoothness, at far less cost than a full Bayesian analysis. This enables practitioners to conduct a sensitivity analysis with minimal effort. This is especially important for applied data analysis, given the ill-posed nature of the deconvolution problem. Our results establish the favorable theoretical properties of our estimator and show that it offers state-of-the-art performance when compared to benchmark methods across a range of scenarios.
</p>projecteuclid.org/euclid.ejs/1527559246_20180621040108Thu, 21 Jun 2018 04:01 EDTHigh-dimensional inference for personalized treatment decisionhttps://projecteuclid.org/euclid.ejs/1529568040<strong>X. Jessie Jeng</strong>, <strong>Wenbin Lu</strong>, <strong>Huimin Peng</strong>. <p><strong>Source: </strong>Electronic Journal of Statistics, Volume 12, Number 1, 2074--2089.</p><p><strong>Abstract:</strong><br/>
Recent developments in statistical methodology for personalized treatment decisions have utilized high-dimensional regression to take into account a large number of patients’ covariates, describing personalized treatment decisions through interactions between treatment and covariates. While a subset of interaction terms can be obtained by existing variable selection methods to indicate relevant covariates for making treatment decisions, the results often lack statistical interpretation. This paper proposes an asymptotically unbiased estimator based on the Lasso solution for the interaction coefficients. We derive the limiting distribution of the estimator when the baseline function of the regression model is unknown and possibly misspecified. Confidence intervals and p-values are derived to infer the effects of the patients’ covariates in making treatment decisions. We confirm the accuracy of the proposed method and its robustness against a misspecified baseline function in simulations and apply the method to the STAR∗D study for major depressive disorder.
</p>projecteuclid.org/euclid.ejs/1529568040_20180621040108Thu, 21 Jun 2018 04:01 EDT