Gridded estimates of rainfall intensity at very high spatial and temporal resolution are needed as main inputs for weather prediction models to obtain accurate precipitation forecasts, and to verify the performance of precipitation forecast models. These gridded rainfall fields are also the main driver for hydrological models that forecast flash floods, and they are essential for disaster prediction associated with heavy rain. Rainfall information can be obtained from rain gages, which provide relatively accurate estimates of the actual rainfall values at point-referenced locations, but they do not adequately characterize the spatial and temporal structure of the rainfall fields. Doppler radar data offer better spatial and temporal coverage, but Doppler radar measures effective radar reflectivity (Ze) rather than rainfall rate (R). Thus, rainfall estimates from radar data suffer from various uncertainties due to their measuring principle and the conversion from Ze to R. We introduce a framework to combine radar reflectivity and gage data by writing the different sources of rainfall information in terms of an underlying, unobservable spatio-temporal process representing the true rainfall values. We use spatial logistic regression to model the probability of rain for both sources of data in terms of the latent true rainfall process. We characterize the different sources of bias and error in the gage and radar data, and we estimate the true rainfall intensity with its posterior predictive distribution, conditioning on the observed data. Our model allows for nonstationarity and asymmetry in the spatio-temporal dependence structure of the rainfall process, allows the temporal evolution of the rainfall process to depend on the motions of rain fields, and allows the spatial correlation to depend on geographic features. We apply our methods to estimate rainfall intensity every 10 minutes in a subdomain over South Korea, with a spatial resolution of 1 km by 1 km.
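As a rough illustration of the data-fusion setting just described, the following sketch simulates a latent rainfall field and generates gage and radar observations from it; the exponential correlation, the logistic occurrence link, and the Marshall-Palmer-type Z-R coefficients are illustrative assumptions, not the paper's fitted model.

```python
import numpy as np

# Latent "true rainfall" field observed by gages (noisy point measurements)
# and by radar (reflectivity via a Z-R relation).  All settings are assumed.
rng = np.random.default_rng(0)

n_sites = 50
coords = np.linspace(0.0, 10.0, n_sites)                  # 1-D transect of sites
dist = np.abs(coords[:, None] - coords[None, :])
cov = np.exp(-dist / 2.0)                                  # exponential spatial correlation
latent = rng.multivariate_normal(np.zeros(n_sites), cov)

# Probability of rain through a logistic link on the latent process.
p_rain = 1.0 / (1.0 + np.exp(-(latent - 0.3)))
is_rain = rng.uniform(size=n_sites) < p_rain
true_rate = np.where(is_rain, np.exp(0.5 + latent), 0.0)   # mm/h where raining

# Gage data: relatively accurate, but only at a few point-referenced locations.
gage_idx = rng.choice(n_sites, size=8, replace=False)
gage_obs = true_rate[gage_idx] * np.exp(rng.normal(0.0, 0.1, size=8))

# Radar data: full coverage, but reflectivity Ze = a * R**b plus multiplicative
# error, so the naive inversion back to a rain rate is biased and noisy.
a, b = 200.0, 1.6
radar_Ze = a * np.maximum(true_rate, 1e-6) ** b * np.exp(rng.normal(0.0, 0.3, n_sites))
radar_rate = (radar_Ze / a) ** (1.0 / b)

print("gage sites:", np.sort(gage_idx))
print("mean abs radar error (mm/h):", round(float(np.mean(np.abs(radar_rate - true_rate))), 3))
```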
Short-range forecasts of precipitation fields are needed in a wealth of agricultural, hydrological, ecological and other applications. Forecasts from numerical weather prediction models are often biased and do not provide uncertainty information. Here we present a postprocessing technique for such numerical forecasts that produces correlated probabilistic forecasts of precipitation accumulation at multiple sites simultaneously.
The statistical model is a spatial version of a two-stage model that represents the distribution of precipitation by a mixture of a point mass at zero and a Gamma density for the continuous distribution of precipitation accumulation. Spatial correlation is captured by assuming that two Gaussian processes drive precipitation occurrence and precipitation amount, respectively. The first process is latent and drives precipitation occurrence via a threshold. The second process explains the spatial correlation in precipitation accumulation. It is related to precipitation via a site-specific transformation function, so as to retain the marginal right-skewed distribution of precipitation while modeling spatial dependence. Both processes take into account the information contained in the numerical weather forecast and are modeled as stationary isotropic spatial processes with an exponential correlation function.
The two-stage spatial model was applied to 48-hour-ahead forecasts of daily precipitation accumulation over the Pacific Northwest in 2004. The predictive distributions from the two-stage spatial model were calibrated and sharp, and outperformed reference forecasts for spatially composite and areally averaged quantities.
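The following is a minimal simulation sketch of the two-stage structure described above (a latent thresholded Gaussian process for occurrence, and a second correlated process mapped to a Gamma margin for amounts); the correlation range, threshold, Gamma parameters, and the probability-integral transform standing in for the site-specific transformation function are all assumptions for illustration.

```python
import numpy as np
from scipy.stats import norm, gamma

# Illustrative simulation of the two-stage spatial precipitation model.
rng = np.random.default_rng(1)

n_sites = 40
coords = rng.uniform(0, 100, size=(n_sites, 2))            # site locations (km)
d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
corr = np.exp(-d / 30.0)                                    # exponential correlation

# Stage 1: latent occurrence process; rain occurs where it exceeds a threshold.
w_occ = rng.multivariate_normal(np.zeros(n_sites), corr)
occurs = w_occ > 0.2

# Stage 2: a correlated amount process mapped to a right-skewed Gamma margin
# by the probability integral transform (a stand-in for the paper's
# site-specific transformation function).
w_amt = rng.multivariate_normal(np.zeros(n_sites), corr)
amounts = gamma.ppf(norm.cdf(w_amt), a=2.0, scale=3.0)      # mm, shape/scale assumed

precip = np.where(occurs, amounts, 0.0)
print("fraction of wet sites:", round(float(occurs.mean()), 2))
print("mean accumulation at wet sites:",
      round(float(precip[occurs].mean()), 2) if occurs.any() else 0.0)
```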
Self-organizing maps (SOMs) are a technique that has been used with high-dimensional data vectors to develop an archetypal set of states (nodes) that span, in some sense, the high-dimensional space. Noteworthy applications include weather states as described by weather variables over a region and speech patterns as characterized by frequencies in time. The SOM approach is essentially a neural network model that implements a nonlinear projection from a high-dimensional input space to a low-dimensional array of neurons. In the process, it also becomes a clustering technique, assigning to any vector in the high-dimensional data space the node (neuron) to which it is closest (using, say, Euclidean distance) in the data space. The number of nodes is thus equal to the number of clusters. However, the primary use of the SOM is as a representation technique, that is, finding a set of nodes which representatively span the high-dimensional space. These nodes are typically displayed using maps to enable visualization of the continuum of the data space. The technique does not appear to have been discussed in the statistics literature, so it is our intent here to bring it to the attention of the community. The technique is implemented algorithmically through a training set of vectors. However, through the introduction of stochasticity in the form of a space–time process model, we seek to illuminate and interpret its performance in the context of an application to daily data collection. That is, the observed daily state vectors are viewed as a time series of multivariate process realizations which we try to understand under the dimension reduction achieved by the SOM procedure.
The application we focus on here is to synoptic climatology where the goal is to develop an array of atmospheric states to capture a collection of distinct circulation patterns. In particular, we have daily weather data observed in the form of 11 variables measured for each of 77 grid cells yielding an 847×1 vector for each day. We have such daily vectors for a period of 31 years (11,315 days). Twelve SOM nodes have been obtained by the meteorologists to represent the space of these data vectors. Again, we try to enhance our understanding of dynamic SOM node behavior arising from this dataset.
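For readers unfamiliar with the algorithm, a generic SOM training loop is sketched below; the 3×4 grid mirrors the twelve nodes mentioned above, but the learning-rate and neighborhood schedules, and the random placeholder data, are illustrative choices rather than the meteorologists' implementation.

```python
import numpy as np

def train_som(data, grid_shape=(3, 4), n_iter=2000, lr0=0.5, sigma0=1.5, seed=0):
    """Minimal self-organizing map: maps high-dimensional vectors onto a small
    grid of nodes (a generic sketch, not the implementation used in the paper)."""
    rng = np.random.default_rng(seed)
    rows, cols = grid_shape
    n_nodes, dim = rows * cols, data.shape[1]
    # Node positions on the 2-D output grid (used for the neighborhood kernel).
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    # Initialize node codebook vectors from randomly chosen data points.
    nodes = data[rng.choice(len(data), n_nodes, replace=False)].astype(float)

    for t in range(n_iter):
        lr = lr0 * np.exp(-t / n_iter)           # decaying learning rate
        sigma = sigma0 * np.exp(-t / n_iter)     # shrinking neighborhood width
        x = data[rng.integers(len(data))]
        # Best-matching node in data space (Euclidean distance).
        bmu = np.argmin(np.sum((nodes - x) ** 2, axis=1))
        # Neighborhood of the winner, measured on the output grid.
        g = np.exp(-np.sum((grid - grid[bmu]) ** 2, axis=1) / (2 * sigma ** 2))
        nodes += lr * g[:, None] * (x - nodes)
    return nodes

# Example with 847-dimensional daily state vectors (random placeholders here).
days = np.random.default_rng(1).normal(size=(500, 847))
codebook = train_som(days)
assignments = np.argmin(((days[:, None, :] - codebook[None, :, :]) ** 2).sum(-1), axis=1)
print(np.bincount(assignments, minlength=12))    # days assigned to each of 12 nodes
```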
Nonlinear regression is a useful statistical tool, relating observed data to a nonlinear function of unknown parameters. When the parameter-dependent nonlinear function is computationally intensive, a straightforward regression analysis by maximum likelihood is not feasible. The method presented in this paper constructs a faster-running surrogate for such a computationally intensive nonlinear function and uses it in a related nonlinear statistical model that accounts for the uncertainty associated with the surrogate. A pivotal quantity in the Earth's climate system is the climate sensitivity: the change in global temperature due to a doubling of atmospheric CO2 concentrations. This quantity, along with other climate parameters, is estimated by applying the statistical method developed in this paper, where the computationally intensive nonlinear function is the MIT 2D climate model.
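A compact sketch of the surrogate idea, under simplifying assumptions: a toy one-parameter "simulator" stands in for the expensive climate model, a Gaussian-process surrogate is fitted to a small set of runs, and the surrogate's predictive uncertainty is folded into the error variance when maximizing the likelihood. The toy function, design, and noise level below are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

def expensive_model(theta):
    # Placeholder "simulator"; in the application this would be a climate-model run.
    return 2.0 * np.log1p(theta)

rng = np.random.default_rng(0)
design = np.linspace(0.0, 5.0, 12)[:, None]         # small designed set of runs
runs = expensive_model(design.ravel())

# Gaussian-process surrogate fitted to the limited model runs.
gp = GaussianProcessRegressor(ConstantKernel() * RBF(length_scale=1.0),
                              normalize_y=True).fit(design, runs)

# Observed data generated at an unknown "true" parameter value with noise.
theta_true, sigma_obs = 2.3, 0.1
y_obs = expensive_model(theta_true) + rng.normal(0.0, sigma_obs, size=20)

def neg_log_lik(theta):
    mu, sd = gp.predict(np.atleast_2d(theta), return_std=True)
    var = sigma_obs ** 2 + sd[0] ** 2               # surrogate uncertainty added in
    return 0.5 * np.sum((y_obs - mu[0]) ** 2) / var + 0.5 * len(y_obs) * np.log(var)

fit = minimize(neg_log_lik, x0=np.array([1.0]), bounds=[(0.0, 5.0)])
print("estimated parameter:", round(float(fit.x[0]), 3), " true value:", theta_true)
```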
Atmospheric carbon monoxide (CO) provides a window on the chemistry of the atmosphere since it is one of the few chemical constituents that can be remotely sensed, and it can be used to determine budgets of other greenhouse gases such as ozone and OH radicals. Remote sensing platforms in geostationary Earth orbit will soon provide regional observations of CO at several vertical layers with high spatial and temporal resolution. However, cloudy locations cannot be observed, and the complete CO concentration fields have to be estimated based on the cloud-free observations. The current state-of-the-art solution to this interpolation problem is to combine cloud-free observations with prior information computed by a deterministic physical model, which might introduce uncertainties that do not derive from data. This paper suggests a Bayesian hierarchical model, sharing features with the physical model, to estimate the complete CO concentration fields. The paper also provides a direct comparison to state-of-the-art methods. To our knowledge, such a model and comparison have not been considered before.
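A kriging-style sketch of the interpolation problem, under strong simplifications: a one-dimensional field observed only at cloud-free cells is completed by conditioning a Gaussian process on the available observations. The covariance and noise level are assumed, and the role of the physical-model prior is omitted; the paper's Bayesian hierarchical model is considerably richer.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30                                            # 1-D transect of grid cells
coords = np.arange(n, dtype=float)
dist = np.abs(coords[:, None] - coords[None, :])
K = np.exp(-dist / 5.0)                           # assumed spatial covariance

field = rng.multivariate_normal(np.zeros(n), K)   # "true" CO anomaly field
cloudy = rng.uniform(size=n) < 0.4                # 40% of cells unobserved
obs_idx, mis_idx = np.where(~cloudy)[0], np.where(cloudy)[0]
noise_var = 0.05
y = field[obs_idx] + rng.normal(0, np.sqrt(noise_var), obs_idx.size)

# Conditional (posterior) mean and variance at the cloudy cells.
K_oo = K[np.ix_(obs_idx, obs_idx)] + noise_var * np.eye(obs_idx.size)
K_mo = K[np.ix_(mis_idx, obs_idx)]
post_mean = K_mo @ np.linalg.solve(K_oo, y)
post_var = np.diag(K[np.ix_(mis_idx, mis_idx)] - K_mo @ np.linalg.solve(K_oo, K_mo.T))

print("RMSE at cloudy cells:", round(float(np.sqrt(np.mean((post_mean - field[mis_idx]) ** 2))), 3))
print("mean posterior sd:", round(float(np.sqrt(np.maximum(post_var, 0)).mean()), 3))
```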
This paper presents an approach to estimating the health effects of an environmental hazard. The approach is general in nature, but is applied here to the case of air pollution. It uses a computer model involving ambient pollution and temperature input to simulate the exposures experienced by individuals in an urban area, while incorporating the mechanisms that determine exposures. The output from the model comprises a set of daily exposures for a sample of individuals from the population of interest. These daily exposures are approximated by parametric distributions so that the predictive exposure distribution of a randomly selected individual can be generated. These distributions are then incorporated into a hierarchical Bayesian framework (with inference using Markov chain Monte Carlo simulation) in order to examine the relationship between short-term changes in exposures and health outcomes, while making allowance for long-term trends, seasonality, the effect of potential confounders and the possibility of ecological bias.
The paper applies this approach to particulate pollution (PM10) and respiratory mortality counts for seniors (≥65 years) in greater London during 1997. Within this substantive epidemiological study, the effects on health of ambient concentrations and (estimated) personal exposures are compared. The proposed model incorporates within-day (or between-individual) variability in personal exposures, and is compared to the more traditional approach of assuming that a single pollution level applies to the entire population on each day. Effects were estimated using single lags and distributed lag models, with the highest relative risk, RR=1.02 (1.01–1.04), being associated with a two-day lag in ambient concentrations of PM10. Individual exposures to PM10 for this group (seniors) were lower than the measured ambient concentrations, with the corresponding risk, RR=1.05 (1.01–1.09), being higher than would be suggested by the traditional approach using ambient concentrations.
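A stripped-down illustration of the time-series health model: daily counts regressed on lagged PM10 with Poisson errors and seasonal terms. This is a plain (non-Bayesian) distributed-lag GLM fitted to simulated placeholder data, not the exposure-simulation and hierarchical framework of the paper; the assumed log relative risk and baseline rate are arbitrary.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n_days = 365
pm10 = 20 + 10 * rng.gamma(2.0, 1.0, n_days)                  # ambient PM10 (ug/m3)
season = 0.3 * np.sin(2 * np.pi * np.arange(n_days) / 365)

# Simulated truth: risk acts at a two-day lag with an assumed log relative risk.
lag2 = pd.Series(pm10).shift(2)
log_mu = np.log(15) + season + 0.002 * lag2.fillna(pm10.mean())
deaths = rng.poisson(np.exp(log_mu))

df = pd.DataFrame({
    "deaths": deaths,
    "pm10_lag0": pm10,
    "pm10_lag1": pd.Series(pm10).shift(1),
    "pm10_lag2": lag2,
    "sin_season": np.sin(2 * np.pi * np.arange(n_days) / 365),
    "cos_season": np.cos(2 * np.pi * np.arange(n_days) / 365),
}).dropna()

X = sm.add_constant(df[["pm10_lag0", "pm10_lag1", "pm10_lag2",
                        "sin_season", "cos_season"]])
fit = sm.GLM(df["deaths"], X, family=sm.families.Poisson()).fit()
rr_per_10 = np.exp(10 * fit.params["pm10_lag2"])               # RR per 10 ug/m3 at lag 2
print("estimated RR per 10 ug/m3 (lag 2):", round(float(rr_per_10), 3))
```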
With the widespread availability of satellite-based instruments, many geophysical processes are measured on a global scale, and they often show strong nonstationarity in the covariance structure. In this paper we present a flexible class of parametric covariance models that can capture the nonstationarity in global data, especially a strong dependence of the covariance structure on latitude. We apply the Discrete Fourier Transform to data on regular grids, which enables us to calculate the exact likelihood for large data sets. Our covariance model is applied to global total column ozone level data on a given day. We discuss how our covariance model compares with some existing models.
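For concreteness, the sketch below evaluates a Whittle-type log-likelihood for gridded data via the FFT: the periodogram is computed once, and the likelihood is a sum over Fourier frequencies. The isotropic spectral-density form is an assumed illustrative choice, not the latitude-dependent nonstationary model of the paper; in practice this function would be minimized over the parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
ny, nx = 64, 128                                   # e.g., latitude x longitude grid
data = rng.normal(size=(ny, nx))                   # placeholder gridded field

def whittle_neg_loglik(params, field):
    sigma2, rho = params                           # variance and range parameters
    f = np.fft.fft2(field - field.mean())
    periodogram = (np.abs(f) ** 2) / field.size    # 2-D periodogram
    ky = np.fft.fftfreq(field.shape[0])[:, None]
    kx = np.fft.fftfreq(field.shape[1])[None, :]
    # Assumed isotropic spectral density, positive at all frequencies.
    spec = sigma2 / (1.0 + (2.0 * np.pi * rho) ** 2 * (kx ** 2 + ky ** 2))
    # Sum over Fourier frequencies of log f(omega) + I(omega) / f(omega).
    return float(np.sum(np.log(spec) + periodogram / spec))

print(whittle_neg_loglik((1.0, 5.0), data))        # would be minimized over params
```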
Fisher-consistent loss functions play a fundamental role in the construction of successful binary margin-based classifiers. In this paper we establish the Fisher-consistency condition for multicategory classification problems. Our approach uses the margin vector concept which can be regarded as a multicategory generalization of the binary margin. We characterize a wide class of smooth convex loss functions that are Fisher-consistent for multicategory classification. We then consider using the margin-vector-based loss functions to derive multicategory boosting algorithms. In particular, we derive two new multicategory boosting algorithms by using the exponential and logistic regression losses.
Defining the energy function as the negative logarithm of the density, we explore the energy landscape of a distribution via the tree of sublevel sets of its energy. This tree represents the hierarchy among the connected components of the sublevel sets. We propose ways to annotate the tree so that it provides information on both topological and statistical aspects of the distribution, such as the local energy minima (local modes), their local domains and volumes, and the barriers between them. We develop a computational method to estimate the tree and reconstruct the energy landscape from Monte Carlo samples simulated at a wide energy range of a distribution. This method can be applied to any arbitrary distribution on a space with defined connectedness. We test the method on multimodal distributions and posterior distributions to show that our estimated trees are accurate compared to theoretical values. When used to perform Bayesian inference of DNA sequence segmentation, this approach reveals much more information than the standard approach based on marginal posterior distributions.
In large scale multiple testing, the use of an empirical null distribution rather than the theoretical null distribution can be critical for correct inference. This paper proposes a “mode matching” method for fitting an empirical null when the theoretical null belongs to any exponential family. Based on the central matching method for z-scores, mode matching estimates the null density by fitting an appropriate exponential family to the histogram of the test statistics by Poisson regression in a region surrounding the mode. The empirical null estimate is then used to estimate local and tail false discovery rate (FDR) for inference. Delta-method covariance formulas and approximate asymptotic bias formulas are provided, as well as simulation studies of the effect of the tuning parameters of the procedure on the bias-variance trade-off. The standard FDR estimates are found to be biased down at the far tails. Correlation between test statistics is taken into account in the covariance estimates, providing a generalization of Efron’s “wing function” for exponential families. Applications with χ2 statistics are shown in a family-based genome-wide association study from the Framingham Heart Study and an anatomical brain imaging study of dyslexia in children.
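The sketch below illustrates the histogram-based fitting step for the familiar Gaussian z-score case: counts in bins near the mode are regressed on a quadratic by Poisson regression, and the fitted coefficients are converted to an empirical null mean and standard deviation. The bin width, central window, and simulated mixture are assumptions; the paper's mode matching extends this idea to general exponential families.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
z = np.concatenate([rng.normal(0.1, 1.15, 9500),    # slightly off-theoretical null
                    rng.normal(3.0, 1.0, 500)])     # a small non-null component

# Histogram the z-scores and keep bins in a window around the mode.
edges = np.arange(-4.0, 6.01, 0.2)
counts, _ = np.histogram(z, bins=edges)
centers = 0.5 * (edges[:-1] + edges[1:])
mode = centers[np.argmax(counts)]
keep = np.abs(centers - mode) < 1.5                  # central region (tuning choice)

# Poisson regression of bin counts on a quadratic in the bin centers.
X = sm.add_constant(np.column_stack([centers[keep], centers[keep] ** 2]))
fit = sm.GLM(counts[keep], X, family=sm.families.Poisson()).fit()
b1, b2 = fit.params[1], fit.params[2]

# For a Gaussian null, log f(z) is quadratic; map the coefficients to mean and sd.
sigma0 = np.sqrt(-1.0 / (2.0 * b2))
mu0 = b1 * sigma0 ** 2
print("empirical null mean, sd:", round(float(mu0), 3), round(float(sigma0), 3))
```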
We propose a new prior distribution for classical (nonhierarchical) logistic regression models, constructed by first scaling all nonbinary variables to have mean 0 and standard deviation 0.5, and then placing independent Student-t prior distributions on the coefficients. As a default choice, we recommend the Cauchy distribution with center 0 and scale 2.5, which in the simplest setting is a longer-tailed version of the distribution attained by assuming one-half additional success and one-half additional failure in a logistic regression. Cross-validation on a corpus of datasets shows the Cauchy class of prior distributions to outperform existing implementations of Gaussian and Laplace priors.
We recommend this prior distribution as a default choice for routine applied use. It has the advantage of always giving answers, even when there is complete separation in logistic regression (a common problem, even when the sample size is large and the number of predictors is small), and also automatically applying more shrinkage to higher-order interactions. This can be useful in routine data analysis as well as in automated procedures such as chained equations for missing-data imputation.
We implement a procedure to fit generalized linear models in R with the Student-t prior distribution by incorporating an approximate EM algorithm into the usual iteratively weighted least squares. We illustrate with several applications, including a series of logistic regressions predicting voting preferences, a small bioassay experiment, and an imputation model for a public health data set.
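A direct sketch of the default prior described above, on simulated data: nonbinary predictors are rescaled to mean 0 and standard deviation 0.5, independent Cauchy(0, 2.5) priors are placed on the coefficients (a wider Cauchy on the intercept), and the posterior mode is found by general-purpose optimization rather than the paper's approximate EM step inside iteratively weighted least squares.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(0)
n, p = 200, 4
X_raw = rng.normal(size=(n, p))
beta_true = np.array([1.5, -2.0, 0.0, 0.5])
y = rng.binomial(1, expit(X_raw @ beta_true))

# Scale nonbinary predictors to mean 0 and standard deviation 0.5.
X = 0.5 * (X_raw - X_raw.mean(0)) / X_raw.std(0)
X = np.column_stack([np.ones(n), X])                  # intercept column

def neg_log_posterior(beta, scale=2.5, intercept_scale=10.0):
    eta = X @ beta
    loglik = np.sum(y * eta - np.logaddexp(0.0, eta))            # logistic log-likelihood
    scales = np.r_[intercept_scale, np.full(p, scale)]           # weaker prior on intercept
    logprior = -np.sum(np.log(1.0 + (beta / scales) ** 2))       # Cauchy log-density (up to constants)
    return -(loglik + logprior)

fit = minimize(neg_log_posterior, x0=np.zeros(p + 1), method="BFGS")
print("posterior-mode coefficients:", np.round(fit.x, 2))
```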
A virologic marker, the number of HIV RNA copies or viral load, is currently used to evaluate antiretroviral (ARV) therapies in AIDS clinical trials. This marker can be used to assess the ARV potency of therapies, but it is easily affected by drug exposure, drug resistance and other factors during the long-term treatment evaluation process. HIV dynamic studies have significantly contributed to the understanding of HIV pathogenesis and ARV treatment strategies. However, the models in these studies are used to quantify short-term HIV dynamics (<1 month), and they are not applicable to describing long-term virological response to ARV treatment because of the difficulty of relating antiviral response to multiple treatment factors, such as drug exposure and drug susceptibility, during long-term treatment. Long-term therapy with ARV agents in HIV-infected patients often results in failure to suppress the viral load. Pharmacokinetics (PK), drug resistance and imperfect adherence to prescribed antiviral drugs are important factors explaining the resurgence of virus. To better understand the factors responsible for virological failure, this paper develops mechanism-based nonlinear differential equation models for characterizing long-term viral dynamics with ARV therapy. The models directly incorporate drug concentration, adherence and drug susceptibility into a function of treatment efficacy and, hence, fully integrate virologic, PK, drug adherence and resistance data from an AIDS clinical trial into the analysis. A Bayesian nonlinear mixed-effects modeling approach, in conjunction with a rescaled version of the dynamic differential equations, is investigated to estimate the dynamic parameters and make inference. In addition, the correlations of baseline factors with estimated dynamic parameters are explored, and some biologically meaningful results are presented. Further, the estimated dynamic parameters in patients with virologic success were compared to those in patients with virologic failure, and important findings are summarized. These results suggest that viral dynamic parameters may play an important role in understanding HIV pathogenesis and in designing new treatment strategies for long-term care of AIDS patients.
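An illustrative mechanism-based viral dynamic system of the kind described above: a standard target-cell/infected-cell/virus ODE in which the time-varying drug efficacy depends on drug concentration, adherence, and susceptibility. All parameter values and the efficacy function below are assumptions for this sketch, not estimates from the trial data.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Uninfected target cells T, infected cells I, and free virus V.
lam, d_T, k, delta, N_v, c = 100.0, 0.1, 2.4e-5, 0.5, 1000.0, 3.0

def efficacy(t):
    """Time-varying inhibition in [0, 1): depends on a daily PK cycle, adherence,
    and a slowly decaying susceptibility term (all purely illustrative)."""
    concentration = 1.0 + 0.5 * np.cos(2 * np.pi * t)     # daily PK cycle
    susceptibility = np.exp(-0.002 * t)                    # resistance builds up
    adherence = 0.9
    return 0.85 * adherence * susceptibility * concentration / (1.0 + concentration)

def rhs(t, state):
    T, I, V = state
    infection = (1.0 - efficacy(t)) * k * T * V
    return [lam - d_T * T - infection,
            infection - delta * I,
            N_v * delta * I - c * V]

sol = solve_ivp(rhs, (0.0, 200.0), [1000.0, 10.0, 5e4], max_step=0.1)
viral_load = np.log10(np.maximum(sol.y[2], 1.0))
print("log10 viral load at day 0, 50, 200:",
      round(float(viral_load[0]), 2),
      round(float(viral_load[np.searchsorted(sol.t, 50.0)]), 2),
      round(float(viral_load[-1]), 2))
```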
In vaccine studies for infectious diseases such as human immunodeficiency virus (HIV), the frequency and type of contacts between study participants and infectious sources are among the most informative risk factors, but they are often not adequately adjusted for in standard analyses. Such adjustment can improve the assessment of vaccine efficacy as well as the assessment of risk factors, and it can be attained by modeling transmission per contact with infectious sources. However, information about contacts that relies on self-reporting by study participants is subject to nontrivial measurement error in many studies. We develop a Bayesian hierarchical model, fitted using Markov chain Monte Carlo (MCMC) sampling, to estimate vaccine efficacy controlled for exposure to infection, while adjusting for measurement error in contact-related factors. Our method is used to re-analyze two recent HIV vaccine studies, and the results are compared with the published primary analyses that used standard methods. The proposed method could also be used for other vaccines where contact information is collected, such as human papillomavirus vaccines.
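A bare-bones sketch of the per-contact transmission idea: each participant's infection probability is 1 − (1 − θp)^n, with n the reported number of contacts, p the per-contact transmission probability for controls, and θ = 1 − VE for vaccinees. The sketch fits this by maximum likelihood to simulated data and omits the measurement-error adjustment and hierarchical structure that are the point of the paper.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n_subj = 2000
vaccinated = rng.integers(0, 2, n_subj)
contacts = rng.poisson(8, n_subj) + 1               # reported contacts (no error here)
p_true, ve_true = 0.02, 0.6                         # assumed per-contact risk and efficacy
p_inf = 1 - (1 - (1 - ve_true * vaccinated) * p_true) ** contacts
infected = rng.binomial(1, p_inf)

def neg_log_lik(params):
    p, ve = params
    prob = 1 - (1 - (1 - ve * vaccinated) * p) ** contacts
    prob = np.clip(prob, 1e-12, 1 - 1e-12)
    return -np.sum(infected * np.log(prob) + (1 - infected) * np.log(1 - prob))

fit = minimize(neg_log_lik, x0=[0.01, 0.3], bounds=[(1e-6, 0.5), (-1.0, 0.999)])
print("per-contact transmission probability, vaccine efficacy:", np.round(fit.x, 3))
```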
Understanding the seizure initiation process and its propagation pattern(s) is a critical task in epilepsy research. Characteristics of the pre-seizure electroencephalograms (EEGs), such as oscillating powers and high-frequency activities, are believed to be indicative of the seizure onset and spread patterns. In this article, we analyze epileptic EEG time series using nonparametric spectral estimation methods to extract information on seizure-specific power and characteristic frequency [or frequency band(s)]. Because the EEGs may become nonstationary before seizure events, we develop methods for both stationary and locally stationary processes. Based on the penalized Whittle likelihood, we propose direct generalized maximum likelihood (GML) and generalized approximate cross-validation (GACV) methods to estimate smoothing parameters, both in smoothing spline spectrum estimation of a stationary process and in smoothing spline ANOVA time-varying spectrum estimation of a locally stationary process. We also propose permutation methods to test whether a locally stationary process is stationary. Extensive simulations indicate that the proposed direct methods, especially the direct GML, are stable and perform better than other existing methods. We apply the proposed methods to the intracranial electroencephalograms (IEEGs) of an epileptic patient to gain insights into the seizure generation process.
A voting bloc is defined to be a group of voters who have similar voting preferences. The cleavage of the Irish electorate into voting blocs is of interest. Irish elections employ a “single transferable vote” electoral system; under this system voters rank some or all of the electoral candidates in order of preference. These rank votes provide a rich source of preference information from which inferences about the composition of the electorate may be drawn. Additionally, the influence of social factors or covariates on the electorate composition is of interest.
A mixture of experts model is a mixture model in which the model parameters are functions of covariates. A mixture of experts model for rank data is developed to provide a model-based method to cluster Irish voters into voting blocs, to examine the influence of social factors on this clustering and to examine the characteristic preferences of the voting blocs. The Benter model for rank data is employed as the family of component densities within the mixture of experts model; generalized linear model theory is employed to model the influence of covariates on the mixing proportions. Model fitting is achieved via a hybrid of the EM and MM algorithms. The methodology is illustrated by examining an Irish presidential election. The existence of voting blocs in the electorate is established, and it is determined that age and government satisfaction levels are important factors influencing voting in this election.
Consider a multinomial regression model where the response, which indicates a unit’s membership in one of several possible unordered classes, is associated with a set of predictor variables. Such models typically involve a matrix of regression coefficients, with the (j, k) element of this matrix modulating the effect of the kth predictor on the propensity of the unit to belong to the jth class. Thus, a supposition that only a subset of the available predictors are associated with the response corresponds to some of the columns of the coefficient matrix being zero. Under the Bayesian paradigm, the subset of predictors which are associated with the response can be treated as an unknown parameter, leading to typical Bayesian model selection and model averaging procedures. As an alternative, we investigate model selection and averaging, whereby a subset of individual elements of the coefficient matrix are zero. That is, the subset of predictors associated with the propensity to belong to a class varies with the class. We refer to this as class-specific predictor selection. We argue that such a scheme can be attractive on both conceptual and computational grounds.
A dynamic decision-making system that includes a mass of indistinguishable agents could manifest impressive heterogeneity. This kind of nonhomogeneity is postulated to result from macroscopic behavioral tactics employed by almost all involved agents. A State-Space Based (SSB) mass event-history model is developed here to explore the potential existence of such macroscopic behaviors. By imposing an unobserved internal state-space variable on the system, each individual's event history is decomposed into a common state duration and an individual-specific time to action. With the common-state modeling of the macroscopic behavior, parametric statistical inferences are derived under the current-status data structure and conditional independence assumptions. Identifiability and computation-related problems are also addressed. From the dynamic perspective of system-wise heterogeneity, this SSB mass event-history model is shown, via Principal Component Analysis (PCA) in a numerical experiment, to be very distinct from a random effect model. Real data showing the mass invasion by two species of parasitic nematode into two species of host larvae are also analyzed. The results of the analysis are not only coherent with the biology of the nematode as a parasite, but also offer new quantitative interpretations.
The paper focuses on the adaptation of local polynomial filters at the end of the sample period. We show that for real-time estimation of signals (i.e., exactly at the boundary of the time support) we cannot rely on the automatic adaptation of local polynomial smoothers, since the direct real-time filter turns out to be strongly localized and thereby yields extremely volatile estimates. As an alternative, we evaluate a general family of asymmetric filters that minimizes the mean square revision error subject to polynomial reproduction constraints; in the case of the Henderson filter, it nests Musgrave's well-known surrogate filters. The class of filters depends on unknown features of the series, such as the slope and the curvature of the underlying signal, which can be estimated from the data. Several empirical examples illustrate the effectiveness of our proposal.
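The boundary problem can be made concrete with a small weighted-least-squares sketch: at the last observation only past values are available, so the local polynomial fit uses an asymmetric window, and the hat-matrix row at the current time gives the implied real-time filter weights. The kernel, bandwidth, and degree below are illustrative choices, not the revision-minimizing filters of the paper.

```python
import numpy as np

def realtime_local_poly(y, bandwidth=6, degree=2):
    """Weighted least-squares polynomial fit over the last `bandwidth`+1 points,
    evaluated at the current time point; returns the signal estimate and the
    implied asymmetric filter weights applied to those points."""
    window = y[-(bandwidth + 1):]
    t = np.arange(-bandwidth, 1, dtype=float)             # 0 is the current point
    kern = 1.0 - (t / (bandwidth + 1)) ** 2               # one-sided parabolic kernel
    X = np.vander(t, degree + 1, increasing=True)         # columns [1, t, t^2, ...]
    W = np.diag(kern)
    # Row 0 of (X'WX)^{-1} X'W is the fitted value at t = 0 as a linear filter.
    beta_map = np.linalg.solve(X.T @ W @ X, X.T @ W)
    weights = beta_map[0]
    return float(weights @ window), weights

rng = np.random.default_rng(0)
signal = np.sin(np.linspace(0, 3, 120))
series = signal + rng.normal(0, 0.2, 120)
estimate, w = realtime_local_poly(series)
print("real-time estimate:", round(estimate, 3), " true signal:", round(float(signal[-1]), 3))
print("filter weights sum to", round(float(w.sum()), 3))   # = 1 (polynomial reproduction)
```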
In this paper we propose and discuss variance reduction techniques for the estimation of quantiles of the output of a complex model with random input parameters. These techniques are based on the use of a reduced model, such as a metamodel or a response surface. The reduced model can be used as a control variate; or a rejection method can be implemented to sample the realizations of the input parameters in prescribed relevant strata; or the reduced model can be used to determine a good biased distribution of the input parameters for the implementation of an importance sampling strategy. The different strategies are analyzed and the asymptotic variances are computed, which shows the benefit of an adaptive controlled stratification method. This method is finally applied to a real example (computation of the peak cladding temperature during a large-break loss of coolant accident in a nuclear reactor).
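A sketch of the stratification variant: strata are defined by quantiles of a cheap reduced model g(x), the expensive model is run with a fixed budget per stratum, and the quantile is read off a stratum-weighted empirical CDF. Both models below are toy stand-ins, and the allocation is proportional rather than adaptively controlled as in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def expensive_model(x):                    # "full" model output (placeholder)
    return np.sin(x[:, 0]) + 0.5 * x[:, 1] ** 2 + 0.1 * rng.normal(size=len(x))

def reduced_model(x):                      # cheap surrogate, correlated with the full model
    return np.sin(x[:, 0]) + 0.5 * x[:, 1] ** 2

alpha, n_strata, budget = 0.95, 10, 2000
per_stratum = budget // n_strata

# 1. A cheap pilot run of the surrogate defines the strata boundaries on g(x).
pilot = rng.normal(size=(100_000, 2))
cuts = np.quantile(reduced_model(pilot), np.linspace(0, 1, n_strata + 1))
cuts[0], cuts[-1] = -np.inf, np.inf        # make the strata cover all inputs

# 2. Run the expensive model with an equal budget in each stratum.
values, weights = [], []
for s in range(n_strata):
    collected = np.empty(0)
    while collected.size < per_stratum:
        cand = rng.normal(size=(5000, 2))
        g = reduced_model(cand)
        keep = cand[(g >= cuts[s]) & (g < cuts[s + 1])][: per_stratum - collected.size]
        if len(keep):
            collected = np.concatenate([collected, expensive_model(keep)])
    values.append(collected)
    weights.append(np.full(per_stratum, 1.0 / (n_strata * per_stratum)))

# 3. Invert the stratum-weighted empirical CDF to estimate the alpha-quantile.
values, weights = np.concatenate(values), np.concatenate(weights)
order = np.argsort(values)
q_strat = values[order][np.searchsorted(np.cumsum(weights[order]), alpha)]
print("stratified estimate of the 95% quantile:", round(float(q_strat), 3))
```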