This paper proposes a uniqueness Shapley measure to compare the extent to which different variables are able to identify a subject. Revealing the value of a variable on subject t shrinks the set of possible subjects that t could be. The extent of the shrinkage depends on which other variables have also been revealed. We use the Shapley value to combine all of the reductions in log cardinality due to revealing a variable after some subset of the other variables has been revealed. This uniqueness Shapley measure can be aggregated over subjects, where it becomes a weighted sum of conditional entropies. Aggregation over subsets of subjects can address questions such as how identifying age is for people in a given zip code. Such aggregates have a corresponding expression in terms of cross entropies. We use the uniqueness Shapley measure to investigate the differential effects of revealing variables from the North Carolina voter registration rolls and in identifying anomalous solar flares. An enormous speedup (approaching 2000-fold in one example) is obtained by using the all-dimensions trees of Moore and Lee (1998) to store the cardinalities we need.
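As a rough illustration of the idea (not the authors' implementation), the sketch below computes the uniqueness Shapley value of each variable for one subject in a tiny hypothetical table: each variable's value is the Shapley-weighted average reduction in log cardinality of the matching set when that variable is revealed after a subset of the others.

```python
from itertools import combinations
from math import log2, factorial

# Tiny hypothetical dataset: each row is a subject, columns are variables.
ROWS = [
    ("NC", 34, "F"),
    ("NC", 34, "M"),
    ("NC", 51, "F"),
    ("SC", 34, "F"),
    ("SC", 51, "M"),
]
VARS = [0, 1, 2]  # column indices, e.g. state, age, sex

def log_card(subject, revealed):
    """log2 of the number of rows matching `subject` on the revealed columns."""
    matches = [r for r in ROWS if all(r[j] == subject[j] for j in revealed)]
    return log2(len(matches))

def uniqueness_shapley(subject, var):
    """Shapley value of `var`: weighted average reduction in log cardinality
    when `var` is revealed after each subset of the other variables."""
    others = [v for v in VARS if v != var]
    n = len(VARS)
    total = 0.0
    for k in range(len(others) + 1):
        for S in combinations(others, k):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            total += weight * (log_card(subject, S) - log_card(subject, S + (var,)))
    return total

subject = ROWS[0]
vals = {v: uniqueness_shapley(subject, v) for v in VARS}
# By Shapley efficiency, the values sum to the total reduction in log cardinality.
print(vals, sum(vals.values()))
```

The efficiency property makes the per-variable values interpretable as shares of the total identifying information revealed about the subject.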
The logistic and probit link functions are the most common choices for regression models with a binary response. However, these choices are not robust to the presence of outliers or unexpected observations. The robit link function, which is equal to the inverse CDF of the Student's t-distribution, provides a robust alternative to the probit and logistic link functions. A multivariate normal prior for the regression coefficients is the standard choice for Bayesian inference in robit regression models. The resulting posterior density is intractable, and a Data Augmentation (DA) Markov chain is used to generate approximate samples from the desired posterior distribution. Establishing geometric ergodicity for this DA Markov chain is important, as it provides theoretical guarantees for the asymptotic validity of MCMC standard errors for desired posterior expectations/quantiles. Previous work established geometric ergodicity of this robit DA Markov chain assuming (i) the sample size n dominates the number of predictors p, and (ii) an additional constraint which requires the sample size to be bounded above by a fixed constant which depends on the design matrix X. In particular, modern high-dimensional settings where p exceeds n are not considered. In this work, we show that the robit DA Markov chain is trace-class (i.e., the eigenvalues of the corresponding Markov operator are summable) for arbitrary choices of the sample size n, the number of predictors p, the design matrix X, and the prior mean and variance parameters. The trace-class property implies geometric ergodicity. Moreover, this property allows us to conclude that the sandwich robit chain (obtained by inserting an inexpensive extra step in between the two steps of the DA chain) is strictly better than the robit DA chain in an appropriate sense, and enables the use of recent methods to estimate the spectral gap of trace-class DA Markov chains.
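A minimal sketch of the robit link itself, using SciPy's Student-t CDF (the degrees-of-freedom value here is an arbitrary illustration, not a recommendation from the paper):

```python
import numpy as np
from scipy.stats import t, norm

def robit_inv_link(eta, df=4.0):
    """Robit inverse link: P(y = 1) = F_t(eta), the Student-t CDF with df
    degrees of freedom. Its heavier tails make fitted probabilities less
    sensitive to outlying observations than the probit's normal CDF."""
    return t.cdf(eta, df)

# As df grows, the robit link approaches the probit (normal CDF).
eta = np.linspace(-3.0, 3.0, 7)
print(np.max(np.abs(t.cdf(eta, 1e6) - norm.cdf(eta))))
```

The logistic link is also well approximated by a robit with roughly 7 degrees of freedom and a rescaled linear predictor, which is one reason the robit is a natural robust drop-in.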
A method for the detection of changes in the expectation of univariate sequences is provided. Moving sum (MOSUM) processes are studied. These rely on the selection of a tuning bandwidth. Here, a framework that overcomes bandwidth selection is presented, in which the bandwidth adjusts gradually. To that end, MOSUM statistics are made dependent on both time and the bandwidth, so that the domain becomes a triangle. On the triangle, paths are constructed which systematically lead to change points. An algorithm is provided that estimates change points by subsequent consideration of paths. Strong consistency for the number and location of change points is shown. Simulation studies corroborate the estimation precision and reveal competitiveness with state-of-the-art change point detection methods. A companion R package mscp is made available on CRAN.
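The fixed-bandwidth MOSUM statistic that the triangle construction varies over bandwidths can be sketched as follows (an illustrative toy version assuming unit noise variance; the data and bandwidth are arbitrary):

```python
import numpy as np

def mosum(x, G):
    """Fixed-bandwidth MOSUM statistic: at each time t, the scaled difference
    between the means of the G observations to the right and to the left.
    Assumes unit noise variance for simplicity."""
    x = np.asarray(x, float)
    n = len(x)
    stat = np.full(n, np.nan)
    for t in range(G, n - G):
        left = x[t - G:t].mean()
        right = x[t:t + G].mean()
        stat[t] = np.sqrt(G / 2.0) * (right - left)
    return stat

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(2, 1, 100)])
s = mosum(x, G=30)
print(np.nanargmax(np.abs(s)))  # peaks near the true change at t = 100
```

The statistic peaks near mean changes, but a single bandwidth G trades off detectability of small jumps against resolution of nearby change points, which is exactly the selection problem the triangle of time-bandwidth pairs is designed to avoid.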
In many statistical problems the hypotheses are naturally divided into groups, and the investigators are interested in performing group-level inference, possibly along with inference on individual hypotheses. We consider the goal of discovering groups containing u or more signals with group-level false discovery rate (FDR) control. This goal can be addressed by multiple testing of partial conjunction hypotheses with a parameter u, which reduce to global null hypotheses for u = 1. We consider the case where the partial conjunction p-values are combinations of within-group p-values, and obtain sufficient conditions on (1) the dependencies among the p-values within and across the groups, (2) the combining method for obtaining partial conjunction p-values, and (3) the multiple testing procedure, for obtaining FDR control on partial conjunction discoveries. We consider separately the dependencies encountered in the meta-analysis setting, where multiple features are tested in several independent studies, and the p-values within each study may be dependent. Based on the results for this setting, we generalize the procedure of Benjamini, Heller, and Yekutieli (2009) for assessing replicability of signals across studies, and extend their theoretical results regarding FDR control with respect to replicability claims.
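To illustrate the object being tested, here is one standard way to form a partial conjunction p-value, sketched under independence: apply Fisher's combination to the n − u + 1 largest within-group p-values (this is only an example of a combining method, not the specific choice analyzed in the paper).

```python
from math import log
from scipy.stats import chi2

def pc_pvalue_fisher(pvals, u):
    """Partial conjunction p-value for 'at least u of the n hypotheses are
    signals': Fisher's combination applied to the n - u + 1 largest p-values
    (valid under independence). With u = 1 this is the usual global null test."""
    largest = sorted(pvals)[u - 1:]          # drop the u - 1 smallest p-values
    stat = -2.0 * sum(log(p) for p in largest)
    return chi2.sf(stat, df=2 * len(largest))

# A group with two strong signals and one null: testing "at least 2 signals".
print(pc_pvalue_fisher([1e-6, 0.001, 0.8], u=2))
```

Dropping the u − 1 smallest p-values makes the test valid even when up to u − 1 hypotheses are true signals, which is what lets discoveries be interpreted as "at least u signals in this group."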
In this work we consider the problem of estimating function-on-scalar regression models when the functions are observed over multi-dimensional or manifold domains and with potentially multivariate output. We establish the minimax rates of convergence and present an estimator based on reproducing kernel Hilbert spaces that achieves the minimax rate. To better interpret the derived rates, we extend well-known links between RKHS and Sobolev spaces to the case where the domain is a compact Riemannian manifold. This is accomplished using an interesting connection to Weyl’s Law from partial differential equations. We conclude with a numerical study and an application to 3D facial imaging.
In this paper, we consider the inverse problem of estimating the product of two densities, given a d-dimensional n-sample of i.i.d. observations drawn from each distribution. We propose a general method of estimation encompassing both projection estimators with a model selection device and kernel estimators with bandwidth selection strategies. The procedures do not consist of taking the product of two separate density estimators, but of plugging an overfitted estimator of one of the two densities into an estimator based on the second sample. Our findings are a first step toward a better understanding of the good performance of overfitting in the Nadaraya-Watson regression estimator.
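A toy one-dimensional sketch of the plug-in idea (an illustration under assumed Gaussian data, not the paper's estimator or its selection devices): an overfitted, small-bandwidth estimate of the first density is evaluated at the second sample and then smoothed by a kernel average over that sample, so the outer average targets E_g[K_h(x − Y) f(Y)] ≈ (fg)(x).

```python
import numpy as np

def gauss_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kde(sample, x, h):
    """Kernel density estimate at points x from a 1-d sample, bandwidth h."""
    return gauss_kernel((np.asarray(x)[:, None] - sample[None, :]) / h).mean(axis=1) / h

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, 500)   # n-sample from f
Y = rng.normal(0.0, 1.0, 500)   # n-sample from g
x = np.linspace(-1.0, 1.0, 5)

# Overfitted (deliberately small-bandwidth) estimate of f at the second sample,
# then a kernel average over that sample; the outer smoothing tames the noise
# of the overfitted inner estimate.
f_over = kde(X, Y, h=0.05)
weights = gauss_kernel((x[:, None] - Y[None, :]) / 0.3) / 0.3
fg_hat = (weights * f_over[None, :]).mean(axis=1)

true = (np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)) ** 2   # here f = g = N(0, 1)
print(np.max(np.abs(fg_hat - true)))
```

Note how the bandwidths play asymmetric roles: the inner estimate is undersmoothed to keep its bias small, while the outer average does the variance reduction.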
Tie-breaker experimental designs are hybrids of Randomized Controlled Trials (RCTs) and Regression Discontinuity Designs (RDDs) in which subjects with moderate scores are placed in an RCT while subjects with extreme scores are deterministically assigned to the treatment or control group. In settings where it is unfair or uneconomical to deny the treatment to the more deserving recipients, the tie-breaker design (TBD) trades off the practical advantages of the RDD with the statistical advantages of the RCT. The practical costs of the randomization in TBDs can be hard to quantify in general, while the statistical benefits conferred by randomization in TBDs have only been studied under linear and quadratic models. In this paper, we discuss and quantify the statistical benefits of TBDs without using parametric modelling assumptions. If the goal is estimation of the average treatment effect or the treatment effect at more than one score value, the statistical benefits of using a TBD over an RDD are apparent. If the goal is nonparametric estimation of the mean treatment effect at merely one score value, we prove that about 2.8 times more subjects are needed for an RDD in order to achieve the same asymptotic mean squared error. We further demonstrate, using both theoretical results and simulations based on the Angrist and Lavy (1999) classroom size dataset, that larger experimental radii for the TBD lead to greater statistical efficiency.
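The assignment rule itself is simple; here is a minimal sketch (with hypothetical score thresholds `lo` and `hi` defining the randomization window, i.e. the experimental radius):

```python
import numpy as np

def tie_breaker_assign(scores, lo, hi, rng):
    """Tie-breaker design: subjects with scores >= hi always get treatment,
    those below lo never do, and those in the middle window [lo, hi) are
    randomized 50/50, forming the embedded RCT."""
    scores = np.asarray(scores, float)
    z = np.where(scores >= hi, 1, 0)
    middle = (scores >= lo) & (scores < hi)
    z[middle] = rng.integers(0, 2, middle.sum())
    return z

rng = np.random.default_rng(1)
scores = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
z = tie_breaker_assign(scores, lo=0.4, hi=0.6, rng=rng)
print(z)  # low scorers get 0, high scorers get 1, the middle one is random
```

Setting `lo = hi` recovers a pure RDD, and widening the window toward the full score range recovers a pure RCT, which is the efficiency trade-off the paper quantifies.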
Consider n independent observations of dimension p, each belonging to one of c distinct classes, where observations within class a are characterized by a distribution whose covariance is one of c given non-negative definite matrices. This paper studies the asymptotic behavior of an associated symmetric kernel matrix when p and n grow to infinity at comparable rates. In particular, we prove that, if the class covariance matrices are sufficiently close in a certain sense, this matrix behaves like a low-rank perturbation of a Wigner matrix, possibly presenting some isolated eigenvalues outside the bulk of the semi-circular law. We carry out a careful analysis of some of the isolated eigenvalues and their associated eigenvectors, and illustrate how these results can help understand spectral clustering methods that use this matrix as a kernel matrix.
In recent years, generative adversarial networks (GANs) have demonstrated impressive experimental results, while only a few works foster statistical learning theory for GANs. In this work, we propose an infinite-dimensional theoretical framework for generative adversarial learning. We assume that the probability density functions of the underlying measures are uniformly bounded, k-times α-Hölder differentiable, and uniformly bounded away from zero. Under these assumptions, we show that the Rosenblatt transformation induces an optimal generator, which is realizable in the considered hypothesis space of generators. With a consistent definition of the hypothesis space of discriminators, we further show that the Jensen-Shannon divergence between the distribution induced by the generator from the adversarial learning procedure and the data generating distribution converges to zero. Under certain regularity assumptions on the density of the data generating process, we also provide rates of convergence based on chaining and concentration.
In this paper, we aim to test the overall significance of regression coefficients in high-dimensional single-index models. We first reformulate the hypothesis testing problem under elliptical distributions for predictors. Applying a distribution-based transformation, we introduce a high-dimensional score-type test statistic. Notably, no moment condition for the error term is required. Our introduced procedures are thus robust with respect to outliers in the response. Moreover, our procedure is free of variance estimation of the error term. We establish the test statistic's asymptotic normality under the null hypothesis. Power analysis is also investigated. To further improve computational efficiency and enhance empirical power, we also introduce a two-stage test procedure under ultrahigh-dimensional settings based on random data splitting. To eliminate the additional randomness induced by data splitting, we further develop a powerful ensemble algorithm based on multiple data splits. We show that the ensemble algorithm can control the type I error rate at a given significance level. Extension to the partial significance testing problem is also investigated. Lastly, numerical studies and real data analysis are conducted to compare with existing approaches and to illustrate the robustness and validity of our proposed test procedures.
A bipartite experiment consists of one set of units being assigned treatments and another set of units for which we measure outcomes. The two sets of units are connected by a bipartite graph, governing how the treated units can affect the outcome units. In this paper, we consider estimation of the average total treatment effect in the bipartite experimental framework under a linear exposure-response model. We introduce the Exposure Reweighted Linear (ERL) estimator, and show that the estimator is unbiased, consistent and asymptotically normal, provided that the bipartite graph is sufficiently sparse. To facilitate inference, we introduce an unbiased and consistent estimator of the variance of the ERL point estimator. Finally, we introduce a cluster-based design, Exposure-Design, that uses heuristics to increase the precision of the ERL estimator by realizing a desirable exposure distribution.
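As a simplified illustration of the linear exposure-response idea (not the ERL estimator itself, and with hypothetical weights), each outcome unit's exposure is a weighted fraction of its treated neighbors in the bipartite graph, and under the linear model the slope of outcomes on exposures recovers the total treatment effect:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical bipartite weights: W[i, j] = influence of treatment unit j on
# outcome unit i, rows normalized so each exposure lies in [0, 1].
W = rng.random((6, 4))
W /= W.sum(axis=1, keepdims=True)

z = np.array([1, 0, 1, 0])       # treatment assignment for the 4 treatment units
exposure = W @ z                 # each outcome unit's treatment "dose"
y = 1.0 + 2.0 * exposure + rng.normal(0.0, 0.01, 6)   # linear exposure-response

# Regressing outcomes on exposures recovers the effect of moving every
# outcome unit from exposure 0 to exposure 1 (the true slope here is 2).
slope = np.polyfit(exposure, y, 1)[0]
print(slope)
```

The ERL estimator refines this by reweighting with the known exposure distribution induced by the design, which is what yields unbiasedness without fitting a regression.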
Envelope methodology is succinctly pitched as a class of procedures for increasing efficiency in multivariate analyses without altering traditional objectives [5, first sentence of page 1]. This description comes with the additional caveat that efficiency gains obtained by envelope methodology are mitigated by model selection volatility to an unknown degree. Recent strides to account for model selection volatility have been made on two fronts: 1) development of a weighted envelope estimator to account for this variability directly in the context of the multivariate linear regression model; 2) development of model selection criteria that facilitate consistent dimension selection for more general settings. We unify these two directions and provide weighted envelope estimators that directly account for the variability associated with model selection and are appropriate for general multivariate estimation settings. Our weighted estimation technique provides practitioners with robust and useful variance reduction in finite samples. Theoretical and empirical justification is given for our estimators and validity of a nonparametric bootstrap procedure for estimating their asymptotic variance are established. Simulation studies and a real data analysis support our claims and demonstrate the advantage of our weighted envelope estimator when model selection variability is present.
Predictive classification considered in this paper concerns the problem of identifying subgroups based on a continuous biomarker through estimation of an unknown cutpoint, and assessing whether these subgroups differ in treatment effect relative to some clinical outcome. The problem is considered under a generalized linear model framework for clinical outcomes and formulated as testing the significance of the interaction between the treatment and the subgroup indicator. When the main effect of the subgroup indicator does not exist, the cutpoint is non-identifiable under the null. Existing procedures are not adaptive to the identifiability issue and do not work well when the main effect is small. In this work, we propose profile score-type and Wald-type test statistics, and further propose m-out-of-n bootstrap techniques to obtain their critical values. The proposed procedures do not rely on knowledge of model identifiability, and we establish their asymptotic size validity and study the power under local alternatives in both cases. Further, we show that the standard bootstrap is inconsistent for the non-identifiable case. Simulation results corroborate our theory, and the proposed method is applied to a dataset from a clinical trial on advanced colorectal cancer.
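The m-out-of-n bootstrap resamples fewer than n observations; a generic sketch of the mechanics (shown with the sample mean purely for illustration — the paper applies it to the non-regular profile test statistics, and the choice m = n^{3/4} is an arbitrary example):

```python
import numpy as np

def m_out_of_n_bootstrap(stat, data, m, B, rng):
    """m-out-of-n bootstrap: resample m < n points with replacement, B times.
    Taking m to grow slower than n can restore consistency in non-regular
    problems where the standard (m = n) bootstrap fails."""
    return np.array([stat(rng.choice(data, size=m, replace=True))
                     for _ in range(B)])

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, 400)
m = int(len(data) ** 0.75)       # an illustrative subsample size, m = n^(3/4)
draws = m_out_of_n_bootstrap(np.mean, data, m, B=200, rng=rng)

# Critical values are read off the quantiles of the resampled statistic.
print(np.quantile(draws, [0.025, 0.975]))
```

The practical cost is that m must be chosen, and the resampled statistic's scale reflects m rather than n, so a rescaling step is typically needed when forming critical values.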