Social and economic data commonly have a nested structure (for example, households nested within neighborhoods). Recently, techniques and computer programs have become available for dealing with such data, permitting the formulation of explicit multilevel models with hypotheses about effects occurring at each level and across levels. If data users are planning to analyze survey data using multilevel models rather than concentrating on means, totals, and proportions, this needs to be accounted for in the survey design. The implications for determining sample sizes (for example, the number of neighborhoods in the sample and the number of households sampled within each neighborhood) are explored.
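To make the design implication concrete, the classical design-effect formula, deff = 1 + (m - 1)rho, shows how the within-neighborhood intraclass correlation rho and the number of households per neighborhood m inflate the sample size needed to match a simple random sample. The sketch below is a minimal illustration with invented numbers (it uses the formula for estimating a mean, not the paper's multilevel-model calculations):

    import numpy as np

    def design_effect(cluster_size, icc):
        """Classical design effect for a two-stage sample:
        deff = 1 + (m - 1) * rho, where m is the number of households
        per neighborhood and rho is the intraclass correlation."""
        return 1.0 + (cluster_size - 1) * icc

    def required_clusters(n_srs, cluster_size, icc):
        """Number of neighborhoods needed so that the clustered sample
        matches the precision of a simple random sample of size n_srs
        when estimating an overall mean."""
        n_effective = n_srs * design_effect(cluster_size, icc)
        return int(np.ceil(n_effective / cluster_size))

    # Illustrative numbers only: matching the precision of a simple random
    # sample of 1,000 households, with 20 households per neighborhood and
    # an intraclass correlation of 0.05.
    print(required_clusters(n_srs=1000, cluster_size=20, icc=0.05))  # 98 neighborhoods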
The paper considers the statistical work of the physicist Harold Jeffreys. In 1933-4 Jeffreys had a controversy with R.A. Fisher, the leading statistician of the time. Prior to the encounter, Jeffreys had worked on probability as the basis for scientific inference and had used methods from the theory of errors in astronomy and seismology. He had also started to rework the theory of errors on the basis of his theory of probability. After the encounter Jeffreys produced a full-scale Bayesian treatment of statistics in the form of his Theory of Probability.
This paper considers parametric statistical decision problems conducted within a Bayesian nonparametric context. Our work was motivated by the realisation that typical parametric model selection procedures are essentially incoherent. We argue that one solution to this problem is to use a flexible enough model in the first place, a model that will not be checked no matter what data arrive. Ideally, one would use a nonparametric model to describe all the uncertainty about the density function generating the data. However, parametric models are the preferred choice for many statisticians, despite the incoherence involved in model checking, incoherence that is quite often ignored for pragmatic reasons. In this paper we show how coherent parametric inference can be carried out via decision theory and Bayesian nonparametrics. None of the ingredients discussed here are new, but our main point only becomes evident when one sees all priors, even parametric ones, as measures on sets of densities as opposed to measures on finite-dimensional parameter spaces.
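As a toy numerical illustration of the "priors as measures on sets of densities" viewpoint, the sketch below projects a flexible reference density onto a parametric (normal) family by minimising the Kullback-Leibler divergence, one simple decision-theoretic way of selecting a parametric summary; the mixture, the family, and all numbers are invented for illustration and are not taken from the paper:

    import numpy as np
    from scipy import stats
    from scipy.optimize import minimize

    # Reference density: a two-component normal mixture standing in for a
    # flexible (nonparametric-style) description of the data-generating density.
    grid = np.linspace(-8, 8, 2001)
    f_ref = 0.7 * stats.norm.pdf(grid, -1.0, 1.0) + 0.3 * stats.norm.pdf(grid, 2.5, 0.7)

    def kl_to_normal(params):
        """KL(f_ref || N(mu, sigma^2)), approximated on the grid."""
        mu, log_sigma = params
        f_par = stats.norm.pdf(grid, mu, np.exp(log_sigma))
        return np.trapz(f_ref * (np.log(f_ref + 1e-300) - np.log(f_par + 1e-300)), grid)

    # Decision-theoretic choice within the normal family: minimise expected
    # log-loss, i.e. project the reference density onto the parametric model.
    res = minimize(kl_to_normal, x0=[0.0, 0.0])
    mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
    print(mu_hat, sigma_hat)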
This is an expository paper. Here we propose a decision-theoretic framework for addressing aspects of the problem of confidentiality of information in publicly released data. Our basic premise is that the problem needs to be conceptualized in terms of the actions of three agents: a data collector, a legitimate data user, and an intruder. We aim to prescribe the actions of the first agent, who desires to provide useful information to the second agent but must protect against possible misuse by the third. The first agent operates under the constraint that the released data must be public to all; in some societies this may not be the case.
A novel aspect of our paper is that all utilities, fundamental to decision making, are expressed in terms of Shannon's information entropy. Thus what gets released is a distribution whose entropy maximizes the expected utility of the first agent. This means that the distribution that gets released will differ from the one that generates the collected data. The discrepancy between the two distributions can be assessed via the Kullback-Leibler cross-entropy function. Our proposed strategy therefore boils down to the notion that it is the information content of the data, not the actual data, that gets masked. Current practice of "statistical disclosure limitation" masks the observed data via transformations or cell suppression. These transformations are guided by balancing what are known as "disclosure risks" and "data utility". The entropy-indexed utility functions we propose are isomorphic to these two entities. Thus our approach provides a formal link to current practice in statistical disclosure limitation.
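As a small numerical sketch of the quantities involved, the code below computes the Shannon entropy of a collected and a released cell-probability distribution and the Kullback-Leibler discrepancy between them; the probabilities are invented for illustration, and the paper's actual masking mechanism is not reproduced:

    import numpy as np

    def entropy(p):
        """Shannon entropy (in nats) of a discrete distribution."""
        p = np.asarray(p, dtype=float)
        return -np.sum(p * np.log(p, where=p > 0, out=np.zeros_like(p)))

    def kl_divergence(p, q):
        """Kullback-Leibler discrepancy KL(p || q) between the distribution
        underlying the collected data (p) and the released distribution (q)."""
        p, q = np.asarray(p, float), np.asarray(q, float)
        mask = p > 0
        return np.sum(p[mask] * np.log(p[mask] / q[mask]))

    # Illustrative cell probabilities for a small frequency table.
    collected = np.array([0.50, 0.30, 0.15, 0.05])
    # A smoothed, higher-entropy version that might be released instead.
    released  = np.array([0.40, 0.30, 0.20, 0.10])

    print(entropy(collected), entropy(released))   # released has higher entropy
    print(kl_divergence(collected, released))      # discrepancy between the two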
Consider using a likelihood ratio to measure the strength of statistical evidence for one hypothesis over another. Recent work has shown that when the model is correctly specified, the likelihood ratio is seldom misleading. But when the model is not, misleading evidence may be observed quite frequently. Here we consider how to choose a working regression model so that the statistical evidence is correctly represented as often as it would be under the true model. We argue that the criterion for choosing a working model should be how often it correctly represents the statistical evidence about the object of interest (the regression coefficient in the true model). We see that misleading evidence about the object of interest is more likely to be observed when the working model is chosen according to other criteria (e.g., parsimony or predictive accuracy).
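A toy sketch of the basic object, a likelihood ratio computed under a working regression model that omits a covariate of the true model; the data are simulated and both models are invented for illustration, not taken from the paper:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    # Simulated data: the "true" model has two covariates, but the working
    # model below deliberately omits x2 (a simple form of misspecification).
    n = 200
    x1 = rng.normal(size=n)
    x2 = 0.6 * x1 + rng.normal(size=n)                     # correlated with x1
    y = 1.0 + 0.5 * x1 + 0.8 * x2 + rng.normal(size=n)

    def working_loglik(beta1, y, x1):
        """Log-likelihood of the working model y = a + beta1*x1 + e, with the
        intercept and error standard deviation set to their maximising values
        for the given beta1."""
        a = np.mean(y - beta1 * x1)
        resid = y - a - beta1 * x1
        sigma = np.sqrt(np.mean(resid ** 2))
        return np.sum(stats.norm.logpdf(resid, scale=sigma))

    # Evidence for beta1 = 0.5 (the true-model coefficient of x1) versus beta1 = 0,
    # as measured under the working model. Because the working slope absorbs part
    # of x2's effect, this evidence concerns a different quantity than the
    # true-model coefficient and can therefore be misleading.
    lr = np.exp(working_loglik(0.5, y, x1) - working_loglik(0.0, y, x1))
    print(lr)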
This paper illustrates the versatility of biplot methodology when analysing multivariate data from diverse disciplines. The modern approach of Gower & Hand (1996), whereby biplots are regarded as multivariate analogues of ordinary scatter plots, is utilised to extend biplot methodology and introduce several novel applications. The focus is on biplot applications where the merits of principal component biplots and canonical variate analysis biplots are illustrated with data sets from higher education, the manufacturing industry, the mining industry, agriculture, finance and archaeology. It is shown how to equip biplots with quality regions, classification regions and acceptance regions; how α-bags superimposed on biplots quantify the multidimensional overlap of classes and enable biplots to be used with large data sets; and how to use biplots for exploring multidimensional reality and in sophisticated classification procedures.
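A minimal principal component biplot in the conventional points-and-arrows form gives a feel for the kind of display the paper builds on (the Gower & Hand approach uses calibrated variable axes rather than arrows, which is not shown here, and none of the paper's data sets are reproduced; the data matrix below is random):

    import numpy as np
    import matplotlib.pyplot as plt

    # Illustrative data matrix only: 50 samples measured on 4 variables.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))
    labels = ["var1", "var2", "var3", "var4"]

    # Centre, then take the SVD; the first two singular vectors give the
    # two-dimensional principal component approximation used in a PCA biplot.
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :2] * s[:2]         # sample points
    loadings = Vt[:2].T               # variable directions

    fig, ax = plt.subplots()
    ax.scatter(scores[:, 0], scores[:, 1], s=15)
    scale = np.abs(scores).max()
    for (dx, dy), name in zip(loadings, labels):
        ax.arrow(0, 0, scale * dx, scale * dy, head_width=0.05, color="red")
        ax.annotate(name, (scale * dx, scale * dy))
    ax.set_xlabel("PC1")
    ax.set_ylabel("PC2")
    plt.show()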
Remote sensing can be a valuable tool for agricultural statistics when area frames or multiple frames are used. At the design level, remote sensing typically helps in the definition of sampling units and the stratification, but it can also be exploited to optimise the sample allocation and the size of sampling units. At the estimator level, classified satellite images are generally used as auxiliary variables in a regression estimator or in estimators based on confusion matrices. The most frequently used satellite images are LANDSAT-TM and SPOT-XS. In general, classified or photo-interpreted images should not be used directly to estimate crop areas, because the proportion of pixels classified into the specific crop is often strongly biased. Vegetation indices computed from satellite images can, in some cases, give a good indication of the potential crop yield.
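A minimal sketch of the regression estimator mentioned above, with the classified-pixel proportion of a crop as the auxiliary variable and the ground-surveyed crop proportion as the study variable; the population, sample size, and relationship are all simulated for illustration:

    import numpy as np

    rng = np.random.default_rng(2)

    # Illustrative setup: for every segment of an area frame we know the
    # proportion of pixels classified into the crop (auxiliary variable x);
    # the true crop proportion y is observed only on a ground-surveyed sample.
    N = 5000                                   # segments in the population
    x_pop = np.clip(rng.beta(2, 5, N), 0, 1)   # classified-pixel proportion
    y_pop = np.clip(0.8 * x_pop + 0.05 + rng.normal(0, 0.05, N), 0, 1)

    sample = rng.choice(N, size=200, replace=False)
    x_s, y_s = x_pop[sample], y_pop[sample]

    # Regression estimator of the mean crop proportion: the sample mean of y,
    # corrected by the gap between the population and sample means of x.
    b = np.cov(x_s, y_s)[0, 1] / np.var(x_s, ddof=1)
    y_reg = y_s.mean() + b * (x_pop.mean() - x_s.mean())

    print(y_s.mean(), y_reg, y_pop.mean())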
Databases containing a great deal of data very often convey little information. This is because of collinearity among the variables that make up the database; this collinearity is, in effect, a form of redundancy in the database. In this study a new indicator is proposed. With this indicator, which is based on the eigenvalues of the variables' correlation matrix, it is possible to quantify the degree of collinearity as a percentage: from 0% (all eigenvalues equal to 1) to 100% (all eigenvalues, except the first, equal to 0).
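The abstract does not give the exact formula, so the sketch below uses one indicator consistent with the stated endpoints, phi = sqrt( sum_i (lambda_i - 1)^2 / (p(p - 1)) ), which is 0 when every eigenvalue of the correlation matrix equals 1 and 1 when all eigenvalues except the first are 0; treat this as an assumption rather than the paper's own definition:

    import numpy as np

    def collinearity_percent(X):
        """Percentage collinearity of the columns of X, using an indicator
        consistent with the endpoints described in the abstract (0% when all
        eigenvalues of the correlation matrix are 1, 100% when all eigenvalues
        except the first are 0); the paper's exact formula may differ."""
        R = np.corrcoef(X, rowvar=False)      # correlation matrix of the variables
        eig = np.linalg.eigvalsh(R)           # its eigenvalues (they sum to p)
        p = len(eig)
        phi = np.sqrt(np.sum((eig - 1.0) ** 2) / (p * (p - 1)))
        return 100.0 * phi

    rng = np.random.default_rng(3)
    uncorrelated = rng.normal(size=(1000, 5))
    base = rng.normal(size=(1000, 1))
    redundant = base + 0.01 * rng.normal(size=(1000, 5))   # nearly identical columns

    print(collinearity_percent(uncorrelated))   # small: only sampling noise
    print(collinearity_percent(redundant))      # close to 100%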