<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>The Annals of Applied Statistics Articles (Project Euclid)</title>
    <link>http://projecteuclid.org/euclid.aoas</link>
    <description>The latest articles from The Annals of Applied Statistics on Project Euclid, a site for mathematics and statistics resources.</description>
    <language>en-us</language>
    <copyright>Copyright 2010 Cornell University Library</copyright>
    <webMaster>Euclid-L@cornell.edu (Project Euclid Team)</webMaster>
    <pubDate>Thu, 05 Aug 2010 15:41 EDT</pubDate>
    <lastBuildDate>Mon, 21 Mar 2011 09:49 EDT</lastBuildDate>
    <image>
      <url>http://projecteuclid.org/collection/euclid/images/logo_linking_100.gif</url>
      <title>Project Euclid</title>
      <link>http://projecteuclid.org/</link>
    </image>
    <item>
      <title>Introduction to papers on the modeling and analysis of network data—II</title>
      <link>http://projecteuclid.org/euclid.aoas/1280842129</link>
      <description>&lt;strong&gt;Stephen E. Fienberg&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 4, Number 2, 533--534.&lt;/p&gt;</description>
      <guid isPermaLink="false">projecteuclid.org/euclid.aoas/1280842129_Thu, 05 Aug 2010 15:41 EDT</guid>
      <pubDate>Thu, 05 Aug 2010 15:41 EDT</pubDate>
    </item>
  <item><title>People born in the Middle East but residing in the Netherlands: Invariant population size estimates and the role of active and passive covariates</title><link>http://projecteuclid.org/euclid.aoas/1346418564</link><description>&lt;strong&gt;Peter G. M. van der Heijden&lt;/strong&gt;, &lt;strong&gt;Joe Whittaker&lt;/strong&gt;, &lt;strong&gt;Maarten Cruyff&lt;/strong&gt;, &lt;strong&gt;Bart Bakker&lt;/strong&gt;, &lt;strong&gt;Rik van der Vliet&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 6, Number 3, 831--852.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
Including covariates in loglinear models of population registers improves population size estimates for two reasons. First, it is possible to take heterogeneity of inclusion probabilities over the levels of a covariate into account; and second, it allows subdivision of the estimated population by the levels of the covariates, giving insight into characteristics of individuals that are not included in any of the registers. The issue of whether or not marginalizing the full table of registers by covariates over one or more covariates leaves the estimated population size invariant is intimately related to collapsibility of contingency tables [Biometrika 70 (1983) 567–578]. We show that, with information from two registers, population size invariance is equivalent to the simultaneous collapsibility of each margin consisting of one register and the covariates. We give a short path characterization of the loglinear model which describes when marginalizing over a covariate leads to different population size estimates. Covariates that are collapsible are called passive, to distinguish them from covariates that are not collapsible and are termed active. We make the case that it can be useful to include passive covariates within the estimation model, because they allow a finer description of the population in terms of these covariates. As an example we discuss the estimation of the population size of people born in the Middle East but residing in the Netherlands.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1346418564_Fri, 31 Aug 2012 09:10 EDT</guid><pubDate>Fri, 31 Aug 2012 09:10 EDT</pubDate></item><item><title>Matching markers and unlabeled configurations in protein gels</title><link>http://projecteuclid.org/euclid.aoas/1346418565</link><description>&lt;strong&gt;Kanti V. Mardia&lt;/strong&gt;, &lt;strong&gt;Emma M. Petty&lt;/strong&gt;, &lt;strong&gt;Charles C. Taylor&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 6, Number 3, 853--869.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
Unlabeled shape analysis is a rapidly emerging and challenging area of statistics. This has been driven by various novel applications in bioinformatics. We consider here the situation where two configurations are matched under various constraints, namely, the configurations have a subset of manually located “markers” with high probability of matching each other while a larger subset consists of unlabeled points. We consider a plausible model and give an implementation using the EM algorithm. The work is motivated by a real experiment of gels for renal cancer and our approach allows for the possibility of missing and misallocated markers. The methodology is successfully used to automatically locate and remove a grossly misallocated marker within the given data set.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1346418565_Fri, 31 Aug 2012 09:10 EDT</guid><pubDate>Fri, 31 Aug 2012 09:10 EDT</pubDate></item><item><title>Functional dynamic factor models with application to yield curve forecasting</title><link>http://projecteuclid.org/euclid.aoas/1346418566</link><description>&lt;strong&gt;Spencer Hays&lt;/strong&gt;, &lt;strong&gt;Haipeng Shen&lt;/strong&gt;, &lt;strong&gt;Jianhua Z. Huang&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 6, Number 3, 870--894.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
Accurate forecasting of zero coupon bond yields for a continuum of maturities is paramount to bond portfolio management and derivative security pricing. Yet a universal model for yield curve forecasting has been elusive, and prior attempts often resulted in a trade-off between goodness of fit and consistency with economic theory. To address this, herein we propose a novel formulation which connects the dynamic factor model (DFM) framework with concepts from functional data analysis: a DFM with functional factor loading curves. This results in a model capable of forecasting functional time series. Further, in the yield curve context we show that the model retains economic interpretation. Model estimation is achieved through an expectation-maximization algorithm, where the time series parameters and factor loading curves are simultaneously estimated in a single step. Efficient computing is implemented and a data-driven smoothing parameter is nicely incorporated. We show that our model performs very well on forecasting actual yield data compared with existing approaches, especially in regard to profit-based assessment for an innovative trading exercise. We further illustrate the viability of our model to applications outside of yield forecasting.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1346418566_Fri, 31 Aug 2012 09:10 EDT</guid><pubDate>Fri, 31 Aug 2012 09:10 EDT</pubDate></item><item><title>Bootstrapping data arrays of arbitrary order</title><link>http://projecteuclid.org/euclid.aoas/1346418567</link><description>&lt;strong&gt;Art B. Owen&lt;/strong&gt;, &lt;strong&gt;Dean Eckles&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 6, Number 3, 895--927.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
In this paper we study a bootstrap strategy for estimating the variance of a mean taken over large multifactor crossed random effects data sets. We apply bootstrap reweighting independently to the levels of each factor, giving each observation the product of independently sampled factor weights. No exact bootstrap exists for this problem [McCullagh (2000) Bernoulli 6 285–301]. We show that the proposed bootstrap is mildly conservative, meaning biased toward overestimating the variance, under sufficient conditions that allow very unbalanced and heteroscedastic inputs. Earlier results for a resampling bootstrap only apply to two factors and use multinomial weights that are poorly suited to online computation. The proposed reweighting approach can be implemented in parallel and online settings. The results for this method apply to any number of factors. The method is illustrated using a three-factor data set of comment lengths from Facebook.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1346418567_Fri, 31 Aug 2012 09:10 EDT</guid><pubDate>Fri, 31 Aug 2012 09:10 EDT</pubDate></item><item><title>A model for sequential evolution of ligands by exponential enrichment (SELEX) data</title><link>http://projecteuclid.org/euclid.aoas/1346418568</link><description>&lt;strong&gt;Juli Atherton&lt;/strong&gt;, &lt;strong&gt;Nathan Boley&lt;/strong&gt;, &lt;strong&gt;Ben Brown&lt;/strong&gt;, &lt;strong&gt;Nobuo Ogawa&lt;/strong&gt;, &lt;strong&gt;Stuart M. Davidson&lt;/strong&gt;, &lt;strong&gt;Michael B. Eisen&lt;/strong&gt;, &lt;strong&gt;Mark D. Biggin&lt;/strong&gt;, &lt;strong&gt;Peter Bickel&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 6, Number 3, 928--949.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
A Systematic Evolution of Ligands by EXponential enrichment (SELEX) experiment begins in round one with a random pool of oligonucleotides in equilibrium solution with a target. Over a few rounds, oligonucleotides having a high affinity for the target are selected. Data from a high throughput SELEX experiment consists of lists of thousands of oligonucleotides sampled after each round. Thus far, SELEX experiments have been very good at suggesting the highest affinity oligonucleotide, but modeling lower affinity recognition site variants has been difficult. Furthermore, an alignment step has always been used prior to analyzing SELEX data.
 
 
We present a novel model, based on a biochemical parametrization of SELEX, which allows us to use data from all rounds to estimate the affinities of the oligonucleotides. Most notably, our model also aligns the oligonucleotides. We use our model to analyze a SELEX experiment containing double stranded DNA oligonucleotides and the transcription factor Bicoid as the target. Our SELEX model outperformed other published methods for predicting putative binding sites for Bicoid as indicated by the results of an in-vivo ChIP-chip experiment.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1346418568_Fri, 31 Aug 2012 09:10 EDT</guid><pubDate>Fri, 31 Aug 2012 09:10 EDT</pubDate></item><item><title>Correlation analysis of enzymatic reaction of a single protein molecule</title><link>http://projecteuclid.org/euclid.aoas/1346418569</link><description>&lt;strong&gt;Chao Du&lt;/strong&gt;, &lt;strong&gt;S. C. Kou&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 6, Number 3, 950--976.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
New advances in nano sciences open the door for scientists to study biological processes on a microscopic molecule-by-molecule basis. Recent single-molecule biophysical experiments on enzyme systems, in particular, reveal that enzyme molecules behave fundamentally differently from what the classical model predicts. A stochastic network model was previously proposed to explain the experimental discovery. This paper conducts detailed theoretical and data analyses of the stochastic network model, focusing on the correlation structure of the successive reaction times of a single enzyme molecule. We investigate the correlation of experimental fluorescence intensity and the correlation of enzymatic reaction times, and examine the role of substrate concentration in enzymatic reactions. Our study shows that the stochastic network model is capable of explaining the experimental data in depth.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1346418569_Fri, 31 Aug 2012 09:10 EDT</guid><pubDate>Fri, 31 Aug 2012 09:10 EDT</pubDate></item><item><title>Classification in postural style</title><link>http://projecteuclid.org/euclid.aoas/1346418570</link><description>&lt;strong&gt;Antoine Chambaz&lt;/strong&gt;, &lt;strong&gt;Christophe Denis&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 6, Number 3, 977--993.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
This article contributes to the search for a notion of postural style, focusing on the issue of classifying subjects in terms of how they maintain posture. Longer term, the hope is to make it possible to determine on a case by case basis which sensorial information is prevalent in postural control, and to improve/adapt protocols for functional rehabilitation among those who show deficits in maintaining posture, typically seniors. Here, we specifically tackle the statistical problem of classifying subjects sampled from a two-class population. Each subject (enrolled in a cohort of 54 participants) undergoes four experimental protocols which are designed to evaluate potential deficits in maintaining posture. These protocols result in four complex trajectories, from which we can extract four small-dimensional summary measures. Because undergoing several protocols can be unpleasant, and sometimes painful, we try to limit the number of protocols needed for the classification. Therefore, we first rank the protocols by decreasing order of relevance, then we derive four plug-in classifiers which involve the best (i.e., most informative), the two best, the three best and all four protocols. This two-step procedure relies on the cutting-edge methodologies of targeted maximum likelihood learning (a methodology for robust and efficient inference) and super-learning (a machine learning procedure for aggregating various estimation procedures into a single better estimation procedure). A simulation study is carried out. The performances of the procedure applied to the real data set (and evaluated by the leave-one-out rule) go as high as an 87% rate of correct classification (47 out of 54 subjects correctly classified), using only the best protocol.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1346418570_Fri, 31 Aug 2012 09:10 EDT</guid><pubDate>Fri, 31 Aug 2012 09:10 EDT</pubDate></item><item><title>Fibre-generated point processes and fields of orientations</title><link>http://projecteuclid.org/euclid.aoas/1346418571</link><description>&lt;strong&gt;Bryony J. Hill&lt;/strong&gt;, &lt;strong&gt;Wilfrid S. Kendall&lt;/strong&gt;, &lt;strong&gt;Elke Thönnes&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 6, Number 3, 994--1020.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
This paper introduces a new approach to analyzing spatial point data clustered along or around a system of curves or “fibres.” Such data arise in catalogues of galaxy locations, recorded locations of earthquakes, aerial images of minefields and pore patterns on fingerprints. Finding the underlying curvilinear structure of these point-pattern data sets may not only facilitate a better understanding of how they arise but also aid reconstruction of missing data. We base the space of fibres on the set of integral lines of an orientation field. Using an empirical Bayes approach, we estimate the field of orientations from anisotropic features of the data. We then sample from the posterior distribution of fibres, exploring models with different numbers of clusters, fitting fibres to the clusters as we proceed. The Bayesian approach permits inference on various properties of the clusters and associated fibres, and the results perform well on a number of very different curvilinear structures.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1346418571_Fri, 31 Aug 2012 09:10 EDT</guid><pubDate>Fri, 31 Aug 2012 09:10 EDT</pubDate></item><item><title>A two-way regularization method for MEG source reconstruction</title><link>http://projecteuclid.org/euclid.aoas/1346418572</link><description>&lt;strong&gt;Tian Siva Tian&lt;/strong&gt;, &lt;strong&gt;Jianhua Z. Huang&lt;/strong&gt;, &lt;strong&gt;Haipeng Shen&lt;/strong&gt;, &lt;strong&gt;Zhimin Li&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 6, Number 3, 1021--1046.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
The MEG inverse problem refers to the reconstruction of the neural activity of the brain from magnetoencephalography (MEG) measurements. We propose a two-way regularization (TWR) method to solve the MEG inverse problem under the assumptions that only a small number of locations in space are responsible for the measured signals (focality), and each source time course is smooth in time (smoothness). The focality and smoothness of the reconstructed signals are ensured respectively by imposing a sparsity-inducing penalty and a roughness penalty in the data fitting criterion. A two-stage algorithm is developed for fast computation, where a raw estimate of the source time course is obtained in the first stage and then refined in the second stage by the two-way regularization. The proposed method is shown to be effective on both synthetic and real-world examples.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1346418572_Fri, 31 Aug 2012 09:10 EDT</guid><pubDate>Fri, 31 Aug 2012 09:10 EDT</pubDate></item><item><title>Detecting mutations in mixed sample sequencing data using empirical Bayes</title><link>http://projecteuclid.org/euclid.aoas/1346418573</link><description>&lt;strong&gt;Omkar Muralidharan&lt;/strong&gt;, &lt;strong&gt;Georges Natsoulis&lt;/strong&gt;, &lt;strong&gt;John Bell&lt;/strong&gt;, &lt;strong&gt;Hanlee Ji&lt;/strong&gt;, &lt;strong&gt;Nancy R. Zhang&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 6, Number 3, 1047--1067.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
We develop statistically based methods to detect single nucleotide DNA mutations in next generation sequencing data. Sequencing generates counts of the number of times each base was observed at hundreds of thousands to billions of genome positions in each sample. Using these counts to detect mutations is challenging because mutations may have very low prevalence and sequencing error rates vary dramatically by genome position. The discreteness of sequencing data also creates a difficult multiple testing problem: current false discovery rate methods are designed for continuous data, and work poorly, if at all, on discrete data.
 
 
We show that a simple randomization technique lets us use continuous false discovery rate methods on discrete data. Our approach is a useful way to estimate false discovery rates for any collection of discrete test statistics, and is hence not limited to sequencing data. We then use an empirical Bayes model to capture different sources of variation in sequencing error rates. The resulting method outperforms existing detection approaches on example data sets.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1346418573_Fri, 31 Aug 2012 09:10 EDT</guid><pubDate>Fri, 31 Aug 2012 09:10 EDT</pubDate></item><item><title>Inference and characterization of multi-attribute networks with application to computational biology</title><link>http://projecteuclid.org/euclid.aoas/1346418574</link><description>&lt;strong&gt;Natallia Katenka&lt;/strong&gt;, &lt;strong&gt;Eric D. Kolaczyk&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 6, Number 3, 1068--1094.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
Our work is motivated by and illustrated with application of association networks in computational biology, specifically in the context of gene/protein regulatory networks. Association networks represent systems of interacting elements, where a link between two different elements indicates a sufficient level of similarity between element attributes. While in reality relational ties between elements can be expected to be based on similarity across multiple attributes, the vast majority of work to date on association networks involves ties defined with respect to only a single attribute. We propose an approach for the inference of multi-attribute association networks from measurements on continuous attribute variables, using canonical correlation and a hypothesis-testing strategy. Within this context, we then study the impact of partial information on multi-attribute network inference and characterization, when only a subset of attributes is available. We consider in detail the case of two attributes, wherein we examine through a combination of analytical and numerical techniques the implications of the choice and number of node attributes on the ability to detect network links and, more generally, to estimate higher-level network summary statistics, such as node degree, clustering coefficients and measures of centrality. Illustration and applications throughout the paper are developed using gene and protein expression measurements on human cancer cell lines from the NCI-60 database.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1346418574_Fri, 31 Aug 2012 09:10 EDT</guid><pubDate>Fri, 31 Aug 2012 09:10 EDT</pubDate></item><item><title>Tree-guided group lasso for multi-response regression with structured sparsity, with an application to eQTL mapping</title><link>http://projecteuclid.org/euclid.aoas/1346418575</link><description>&lt;strong&gt;Seyoung Kim&lt;/strong&gt;, &lt;strong&gt;Eric P. Xing&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 6, Number 3, 1095--1117.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
We consider the problem of estimating a sparse multi-response regression function, with an application to expression quantitative trait locus (eQTL) mapping, where the goal is to discover genetic variations that influence gene-expression levels. In particular, we investigate a shrinkage technique capable of capturing a given hierarchical structure over the responses, such as a hierarchical clustering tree with leaf nodes for responses and internal nodes for clusters of related responses at multiple granularities, and we seek to leverage this structure to recover covariates relevant to each hierarchically-defined cluster of responses. We propose a tree-guided group lasso, or tree lasso, for estimating such structured sparsity under multi-response regression by employing a novel penalty function constructed from the tree. We describe a systematic weighting scheme for the overlapping groups in the tree-penalty such that each regression coefficient is penalized in a balanced manner despite the inhomogeneous multiplicity of group memberships of the regression coefficients due to overlaps among groups. For efficient optimization, we employ a smoothing proximal gradient method that was originally developed for a general class of structured-sparsity-inducing penalties. Using simulated and yeast data sets, we demonstrate that our method shows a superior performance in terms of both prediction errors and recovery of true sparsity patterns, compared to other methods for learning a multivariate-response regression.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1346418575_Fri, 31 Aug 2012 09:10 EDT</guid><pubDate>Fri, 31 Aug 2012 09:10 EDT</pubDate></item><item><title>The importance of distinct modeling strategies for gene and gene-specific treatment effects in hierarchical models for microarray data</title><link>http://projecteuclid.org/euclid.aoas/1346418576</link><description>&lt;strong&gt;Steven P. Lund&lt;/strong&gt;, &lt;strong&gt;Dan Nettleton&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 6, Number 3, 1118--1133.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
When analyzing microarray data, hierarchical models are often used to share information across genes when estimating means and variances or identifying differential expression. Many methods utilize some form of the two-level hierarchical model structure suggested by Kendziorski et al. [Stat. Med. (2003) 22 3899–3914] in which the first level describes the distribution of latent mean expression levels among genes and among differentially expressed treatments within a gene. The second level describes the conditional distribution, given a latent mean, of repeated observations for a single gene and treatment. Many of these models, including those used in the EBarrays package of Kendziorski et al. [Stat. Med. (2003) 22 3899–3914], assume that expression level changes due to treatment effects have the same distribution as expression level changes from gene to gene. We present empirical evidence that this assumption is often inadequate and propose three-level hierarchical models as extensions to the two-level log-normal based EBarrays models to address this inadequacy. We demonstrate that use of our three-level models dramatically changes analysis results for a variety of microarray data sets and verify the validity and improved performance of our suggested method in a series of simulation studies. We also illustrate the importance of accounting for the uncertainty of gene-specific error variance estimates when using hierarchical models to identify differentially expressed genes.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1346418576_Fri, 31 Aug 2012 09:10 EDT</guid><pubDate>Fri, 31 Aug 2012 09:10 EDT</pubDate></item><item><title>Gene-centric gene–gene interaction: A model-based kernel machine method</title><link>http://projecteuclid.org/euclid.aoas/1346418577</link><description>&lt;strong&gt;Shaoyu Li&lt;/strong&gt;, &lt;strong&gt;Yuehua Cui&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 6, Number 3, 1134--1161.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
Much of the natural variation for a complex trait can be explained by variation in DNA sequence levels. As part of sequence variation, gene–gene interaction has been ubiquitously observed in nature, where its role in shaping the development of an organism has been broadly recognized. The identification of interactions between genetic factors has been progressively pursued via statistical or machine learning approaches. A large body of currently adopted methods, whether parametric or nonparametric, predominantly focus on pairwise single marker interaction analysis. As genes are the functional units in living organisms, analysis by focusing on a gene as a system could potentially yield more biologically meaningful results. In this work, we conceptually propose a gene-centric framework for genome-wide gene–gene interaction detection. We treat each gene as a testing unit and derive a model-based kernel machine method for two-dimensional genome-wide scanning of gene–gene interactions. In addition to the biological advantage, our method is statistically appealing because it reduces the number of hypotheses tested in a genome-wide scan. Extensive simulation studies are conducted to evaluate the performance of the method. The utility of the method is further demonstrated with applications to two real data sets. Our method provides a conceptual framework for the identification of gene–gene interactions which could shed novel light on the etiology of complex diseases.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1346418577_Fri, 31 Aug 2012 09:10 EDT</guid><pubDate>Fri, 31 Aug 2012 09:10 EDT</pubDate></item><item><title>Tree models for difference and change detection in a complex environment</title><link>http://projecteuclid.org/euclid.aoas/1346418578</link><description>&lt;strong&gt;Yong Wang&lt;/strong&gt;, &lt;strong&gt;Ilze Ziedins&lt;/strong&gt;, &lt;strong&gt;Mark Holmes&lt;/strong&gt;, &lt;strong&gt;Neil Challands&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 6, Number 3, 1162--1184.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
A new family of tree models is proposed, which we call “differential trees.” A differential tree model is constructed from multiple data sets and aims to detect distributional differences between them. The new methodology differs from the existing difference and change detection techniques in its nonparametric nature, model construction from multiple data sets, and applicability to high-dimensional data. Through a detailed study of an arson case in New Zealand, where an individual is known to have been laying vegetation fires within a certain time period, we illustrate how these models can help detect changes in the frequencies of event occurrences and uncover unusual clusters of events in a complex environment.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1346418578_Fri, 31 Aug 2012 09:10 EDT</guid><pubDate>Fri, 31 Aug 2012 09:10 EDT</pubDate></item><item><title>Semiparametric regression in testicular germ cell data</title><link>http://projecteuclid.org/euclid.aoas/1346418579</link><description>&lt;strong&gt;Anastasia Voulgaraki&lt;/strong&gt;, &lt;strong&gt;Benjamin Kedem&lt;/strong&gt;, &lt;strong&gt;Barry I. Graubard&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 6, Number 3, 1185--1208.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
It is possible to approach regression analysis with random covariates from a semiparametric perspective where information is combined from multiple multivariate sources. The approach assumes a semiparametric density ratio model where multivariate distributions are “regressed” on a reference distribution. A kernel density estimator can be constructed from many data sources in conjunction with the semiparametric model. The estimator is shown to be more efficient than the traditional single-sample kernel density estimator, and its optimal bandwidth is discussed in some detail. Each multivariate distribution and the corresponding conditional expectation (regression) of interest are estimated from the combined data using all sources. Graphical and quantitative diagnostic tools are suggested to assess model validity. The method is applied in quantifying the effect of height and age on weight of germ cell testicular cancer patients. Comparisons are made with multiple regression, generalized additive models (GAM) and nonparametric kernel regression.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1346418579_Fri, 31 Aug 2012 09:10 EDT</guid><pubDate>Fri, 31 Aug 2012 09:10 EDT</pubDate></item><item><title>Network inference and biological dynamics</title><link>http://projecteuclid.org/euclid.aoas/1346418580</link><description>&lt;strong&gt;Chris. J. Oates&lt;/strong&gt;, &lt;strong&gt;Sach Mukherjee&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 6, Number 3, 1209--1235.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
Network inference approaches are now widely used in biological applications to probe regulatory relationships between molecular components such as genes or proteins. Many methods have been proposed for this setting, but the connections and differences between their statistical formulations have received less attention. In this paper, we show how a broad class of statistical network inference methods, including a number of existing approaches, can be described in terms of variable selection for the linear model. This reveals some subtle but important differences between the methods, including the treatment of time intervals in discretely observed data. In developing a general formulation, we also explore the relationship between single-cell stochastic dynamics and network inference on averages over cells. This clarifies the link between biochemical networks as they operate at the cellular level and network inference as carried out on data that are averages over populations of cells. We present empirical results, comparing thirty-two network inference methods that are instances of the general formulation we describe, using two published dynamical models. Our investigation sheds light on the applicability and limitations of network inference and provides guidance for practitioners and suggestions for experimental design.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1346418580_Fri, 31 Aug 2012 09:10 EDT</guid><pubDate>Fri, 31 Aug 2012 09:10 EDT</pubDate></item><item><title>Semiparametric zero-inflated modeling in multi-ethnic study of atherosclerosis (MESA)</title><link>http://projecteuclid.org/euclid.aoas/1346418581</link><description>&lt;strong&gt;Hai Liu&lt;/strong&gt;, &lt;strong&gt;Shuangge Ma&lt;/strong&gt;, &lt;strong&gt;Richard Kronmal&lt;/strong&gt;, &lt;strong&gt;Kung-Sik Chan&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 6, Number 3, 1236--1255.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
We analyze the Agatston score of coronary artery calcium (CAC) from the Multi-Ethnic Study of Atherosclerosis (MESA) using the semiparametric zero-inflated modeling approach, where the observed CAC scores from this cohort consist of a high frequency of zeroes and continuously distributed positive values. Both partially constrained and unconstrained models are considered to investigate the underlying biological processes of CAC development from zero to positive, and from small amount to large amount. Different from existing studies, a model selection procedure based on likelihood cross-validation is adopted to identify the optimal model, which is justified by comparative Monte Carlo studies. A shrinkage version of the cubic regression spline is used for model estimation and variable selection simultaneously. When applying the proposed methods to the MESA data analysis, we show that the two biological mechanisms influencing the initiation of CAC and the magnitude of CAC when it is positive are better characterized by an unconstrained zero-inflated normal model. Our results are significantly different from those in published studies, and may provide further insights into the biological mechanisms underlying CAC development in humans. This highly flexible statistical framework can be applied to zero-inflated data analyses in other areas.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1346418581_Fri, 31 Aug 2012 09:10 EDT</guid><pubDate>Fri, 31 Aug 2012 09:10 EDT</pubDate></item><item><title>Order selection in nonlinear time series models with application to the study of cell memory</title><link>http://projecteuclid.org/euclid.aoas/1346418582</link><description>&lt;strong&gt;Ying Hung&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 6, Number 3, 1256--1279.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
Cell adhesion experiments are biomechanical experiments studying the binding of a cell to another cell at the level of single molecules. Such a study plays an important role in tumor metastasis in cancer study. Motivated by analyzing a repeated cell adhesion experiment, a new class of nonlinear time series models with an order selection procedure is developed in this paper. Due to the nonlinearity, there are two types of overfitting. Therefore, a double penalized approach is introduced for order selection. To implement this approach, a global optimization algorithm using mixed integer programming is discussed. The procedure is shown to be asymptotically consistent in estimating both the order and parameters of the proposed model. Simulations show that the new order selection approach outperforms standard methods. The finite-sample performance of the estimator is also examined via a simulation study. The application of the proposed methodology to a T-cell experiment provides a better understanding of the kinetics and mechanics of cell adhesion, including quantifying the memory effect on a repeated unbinding force experiment and identifying the order of the memory.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1346418582_Fri, 31 Aug 2012 09:10 EDT</guid><pubDate>Fri, 31 Aug 2012 09:10 EDT</pubDate></item><item><title>Statistical methods for tissue array images—algorithmic scoring and co-training</title><link>http://projecteuclid.org/euclid.aoas/1346418583</link><description>&lt;strong&gt;Donghui Yan&lt;/strong&gt;, &lt;strong&gt;Pei Wang&lt;/strong&gt;, &lt;strong&gt;Michael Linden&lt;/strong&gt;, &lt;strong&gt;Beatrice Knudsen&lt;/strong&gt;, &lt;strong&gt;Timothy Randolph&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 6, Number 3, 1280--1305.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
Recent advances in tissue microarray technology have allowed immunohistochemistry to become a powerful medium-to-high throughput analysis tool, particularly for the validation of diagnostic and prognostic biomarkers. However, as study size grows, the manual evaluation of these assays becomes a prohibitive limitation; it vastly reduces throughput and greatly increases variability and expense. We propose an algorithm—Tissue Array Co-Occurrence Matrix Analysis (TACOMA)—for quantifying cellular phenotypes based on textural regularity summarized by local inter-pixel relationships. The algorithm can be easily trained for any staining pattern, is free of sensitive tuning parameters and has the ability to report salient pixels in an image that contribute to its score. Pathologists’ input via informative training patches is an important aspect of the algorithm that allows the training for any specific marker or cell type. With co-training, the error rate of TACOMA can be reduced substantially for a very small training sample (e.g., with size $30$). We give theoretical insights into the success of co-training via thinning of the feature set in a high-dimensional setting when there is “sufficient” redundancy among the features. TACOMA is flexible, transparent and provides a scoring process that can be evaluated with clarity and confidence. In a study based on an estrogen receptor (ER) marker, we show that TACOMA is comparable to, or outperforms, pathologists’ performance in terms of accuracy and repeatability.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1346418583_Fri, 31 Aug 2012 09:10 EDT</guid><pubDate>Fri, 31 Aug 2012 09:10 EDT</pubDate></item><item><title>The screening and ranking algorithm to detect DNA copy number variations</title><link>http://projecteuclid.org/euclid.aoas/1346418584</link><description>&lt;strong&gt;Yue S. Niu&lt;/strong&gt;, &lt;strong&gt;Heping Zhang&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 6, Number 3, 1306--1326.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
DNA copy number variation (CNV) has recently gained considerable interest as a source of genetic variation that likely influences phenotypic differences. Many statistical and computational methods have been proposed and applied to detect CNVs based on data generated by genome analysis platforms. However, most algorithms are computationally intensive with complexity at least $O(n^{2})$, where $n$ is the number of probes in the experiments. Moreover, the theoretical properties of those existing methods are not well understood. A faster and better characterized algorithm is desirable for the ultra high throughput data. In this study, we propose the Screening and Ranking algorithm (SaRa) which can detect CNVs quickly and accurately with complexity down to $O(n)$. In addition, we characterize theoretical properties and present numerical analysis for our algorithm.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1346418584_Fri, 31 Aug 2012 09:10 EDT</guid><pubDate>Fri, 31 Aug 2012 09:10 EDT</pubDate></item><item><title>Integrative Model-based clustering of microarray methylation and expression data</title><link>http://projecteuclid.org/euclid.aoas/1346418585</link><description>&lt;strong&gt;Matthias Kormaksson&lt;/strong&gt;, &lt;strong&gt;James G. Booth&lt;/strong&gt;, &lt;strong&gt;Maria E. Figueroa&lt;/strong&gt;, &lt;strong&gt;Ari Melnick&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 6, Number 3, 1327--1347.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
In many fields, researchers are interested in large and complex biological processes. Two important examples are gene expression and DNA methylation in genetics. One key problem is to identify aberrant patterns of these processes and discover biologically distinct groups. In this article we develop a model-based method for clustering such data. The basis of our method involves the construction of a likelihood for any given partition of the subjects. We introduce cluster specific latent indicators that, along with some standard assumptions, impose a specific mixture distribution on each cluster. Estimation is carried out using the EM algorithm. The methods extend naturally to multiple data types of a similar nature, which leads to an integrated analysis over multiple data platforms, resulting in higher discriminating power.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1346418585_Fri, 31 Aug 2012 09:10 EDT</guid><pubDate>Fri, 31 Aug 2012 09:10 EDT</pubDate></item><item><title>Section on the special year for mathematics of planet earth (MPE 2013)</title><link>http://projecteuclid.org/euclid.aoas/1356629042</link><description>&lt;strong&gt;Tilmann Gneiting&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 6, Number 4, 1349--1351.&lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1356629042_Thu, 27 Dec 2012 12:24 EST</guid><pubDate>Thu, 27 Dec 2012 12:24 EST</pubDate></item><item><title>Inference for population dynamics in the Neolithic period</title><link>http://projecteuclid.org/euclid.aoas/1356629043</link><description>&lt;strong&gt;Andrew W. Baggaley&lt;/strong&gt;, &lt;strong&gt;Richard J. Boys&lt;/strong&gt;, &lt;strong&gt;Andrew Golightly&lt;/strong&gt;, &lt;strong&gt;Graeme R. Sarson&lt;/strong&gt;, &lt;strong&gt;Anvar Shukurov&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 6, Number 4, 1352--1376.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
We consider parameter estimation for the spread of the Neolithic incipient farming across Europe using radiocarbon dates. We model the arrival time of farming at radiocarbon-dated, early Neolithic sites by a numerical solution to an advancing wavefront. We allow for (technical) uncertainty in the radiocarbon data, lack-of-fit of the deterministic model and use a Gaussian process to smooth spatial deviations from the model. Inference for the parameters in the wavefront model is complicated by the computational cost required to produce a single numerical solution. We therefore employ Gaussian process emulators for the arrival time of the advancing wavefront at each radiocarbon-dated site. We validate our model using predictive simulations.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1356629043_Thu, 27 Dec 2012 12:24 EST</guid><pubDate>Thu, 27 Dec 2012 12:24 EST</pubDate></item><item><title>Finding a consensus on credible features among several paleoclimate reconstructions</title><link>http://projecteuclid.org/euclid.aoas/1356629044</link><description>&lt;strong&gt;Panu Erästö&lt;/strong&gt;, &lt;strong&gt;Lasse Holmström&lt;/strong&gt;, &lt;strong&gt;Atte Korhola&lt;/strong&gt;, &lt;strong&gt;Jan Weckström&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 6, Number 4, 1377--1405.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
We propose a method to merge several paleoclimate time series into one that exhibits a consensus on the features of the individual time series. The paleoclimate time series can be noisy, nonuniformly sampled and the dates at which the paleoclimate is reconstructed can have errors. Bayesian inference is used to model the various sources of uncertainty and smoothing of the posterior distribution of the consensus is used to capture its credible features in different time scales. The technique is demonstrated by analyzing a collection of six Holocene temperature reconstructions from Finnish Lapland based on various biological proxies. Although the paper focuses on paleoclimate time series, the proposed method can be applied in other contexts where one seeks to infer features that are jointly supported by an ensemble of irregularly sampled noisy time series.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1356629044_Thu, 27 Dec 2012 12:24 EST</guid><pubDate>Thu, 27 Dec 2012 12:24 EST</pubDate></item><item><title>Approximating the conditional density given large observed values via a multivariate extremes framework, with application to environmental data</title><link>http://projecteuclid.org/euclid.aoas/1356629045</link><description>&lt;strong&gt;Daniel Cooley&lt;/strong&gt;, &lt;strong&gt;Richard A. Davis&lt;/strong&gt;, &lt;strong&gt;Philippe Naveau&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 6, Number 4, 1406--1429.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
Phenomena such as air pollution levels are of greatest interest when observations are large, but standard prediction methods are not specifically designed for large observations. We propose a method, rooted in extreme value theory, which approximates the conditional distribution of an unobserved component of a random vector given large observed values. Specifically, for $\mathbf{Z}=(Z_{1},\ldots,Z_{d})^{T}$ and $\mathbf{Z}_{-d}=(Z_{1},\ldots,Z_{d-1})^{T}$, the method approximates the conditional distribution of $[Z_{d}|\mathbf{Z}_{-d}=\mathbf{z}_{-d}]$ when $\|\mathbf{z}_{-d}\|&amp;gt;r_{*}$. The approach is based on the assumption that $\mathbf{Z}$ is a multivariate regularly varying random vector of dimension $d$. The conditional distribution approximation relies on knowledge of the angular measure of $\mathbf{Z}$, which provides explicit structure for dependence in the distribution’s tail. As the method produces a predictive distribution rather than just a point predictor, one can answer any question posed about the quantity being predicted, and, in particular, one can assess how well the extreme behavior is represented.
 
 
Using a fitted model for the angular measure, we apply our method to nitrogen dioxide measurements in metropolitan Washington DC. We obtain a predictive distribution for the air pollutant at a location given the air pollutant’s measurements at four nearby locations and given that the norm of the vector of the observed measurements is large.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1356629045_Thu, 27 Dec 2012 12:24 EST</guid><pubDate>Thu, 27 Dec 2012 12:24 EST</pubDate></item><item><title>A hierarchical max-stable spatial model for extreme precipitation</title><link>http://projecteuclid.org/euclid.aoas/1356629046</link><description>&lt;strong&gt;Brian J. Reich&lt;/strong&gt;, &lt;strong&gt;Benjamin A. Shaby&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 6, Number 4, 1430--1451.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
Extreme environmental phenomena such as major precipitation events manifestly exhibit spatial dependence. Max-stable processes are a class of asymptotically-justified models that are capable of representing spatial dependence among extreme values. While these models satisfy modeling requirements, they are limited in their utility because their corresponding joint likelihoods are unknown for more than a trivial number of spatial locations, preventing, in particular, Bayesian analyses. In this paper, we propose a new random effects model to account for spatial dependence. We show that our specification of the random effect distribution leads to a max-stable process that has the popular Gaussian extreme value process (GEVP) as a limiting case. The proposed model is used to analyze the yearly maximum precipitation from a regional climate model.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1356629046_Thu, 27 Dec 2012 12:24 EST</guid><pubDate>Thu, 27 Dec 2012 12:24 EST</pubDate></item><item><title>A dynamic nonstationary spatio-temporal model for short term prediction of precipitation</title><link>http://projecteuclid.org/euclid.aoas/1356629047</link><description>&lt;strong&gt;Fabio Sigrist&lt;/strong&gt;, &lt;strong&gt;Hans R. Künsch&lt;/strong&gt;, &lt;strong&gt;Werner A. Stahel&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 6, Number 4, 1452--1477.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
Precipitation is a complex physical process that varies in space and time. Predictions and interpolations at unobserved times and/or locations help to solve important problems in many areas. In this paper, we present a hierarchical Bayesian model for spatio-temporal data and apply it to obtain short term predictions of rainfall. The model incorporates physical knowledge about the underlying processes that determine rainfall, such as advection, diffusion and convection. It is based on a temporal autoregressive convolution with spatially colored and temporally white innovations. By linking the advection parameter of the convolution kernel to an external wind vector, the model is temporally nonstationary. Further, it allows for nonseparable and anisotropic covariance structures. With the help of the Voronoi tessellation, we construct a natural parametrization, that is, space as well as time resolution consistent, for data lying on irregular grid points. In the application, the statistical model combines forecasts of three other meteorological variables obtained from a numerical weather prediction model with past precipitation observations. The model is then used to predict three-hourly precipitation over 24 hours. It performs better than a separable, stationary and isotropic version, and it performs comparably to a deterministic numerical weather prediction model for precipitation and has the advantage that it quantifies prediction uncertainty.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1356629047_Thu, 27 Dec 2012 12:24 EST</guid><pubDate>Thu, 27 Dec 2012 12:24 EST</pubDate></item><item><title>Spatial analysis of wave direction data using wrapped Gaussian processes</title><link>http://projecteuclid.org/euclid.aoas/1356629048</link><description>&lt;strong&gt;Giovanna Jona-Lasinio&lt;/strong&gt;, &lt;strong&gt;Alan Gelfand&lt;/strong&gt;, &lt;strong&gt;Mattia Jona-Lasinio&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 6, Number 4, 1478--1498.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
Directional data arise in various contexts such as oceanography (wave directions) and meteorology (wind directions), as well as with measurements on a periodic scale (weekdays, hours, etc.). Our contribution is to introduce a model-based approach to handle periodic data in the case of measurements taken at spatial locations, anticipating structured dependence between these measurements. We formulate a wrapped Gaussian spatial process model for this setting, induced from a customary linear Gaussian process.
 
 
We build a hierarchical model to handle this situation and show that the fitting of such a model is possible using standard Markov chain Monte Carlo methods. Our approach enables spatial interpolation (and can accommodate measurement error). We illustrate with a set of wave direction data from the Adriatic coast of Italy, generated through a complex computer model.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1356629048_Thu, 27 Dec 2012 12:24 EST</guid><pubDate>Thu, 27 Dec 2012 12:24 EST</pubDate></item><item><title>A toolbox for fitting complex spatial point process models using integrated nested Laplace approximation (INLA)</title><link>http://projecteuclid.org/euclid.aoas/1356629049</link><description>&lt;strong&gt;Janine B. Illian&lt;/strong&gt;, &lt;strong&gt;Sigrunn H. Sørbye&lt;/strong&gt;, &lt;strong&gt;Håvard Rue&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 6, Number 4, 1499--1530.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
This paper develops methodology that provides a toolbox for routinely fitting complex models to realistic spatial point pattern data. We consider models that are based on log-Gaussian Cox processes and include local interaction in these by considering constructed covariates. This enables us to use integrated nested Laplace approximation and to considerably speed up the inferential task. In addition, methods for model comparison and model assessment facilitate the modelling process. The performance of the approach is assessed in a simulation study. To demonstrate the versatility of the approach, models are fitted to two rather different examples, a large rainforest data set with covariates and a point pattern with multiple marks.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1356629049_Thu, 27 Dec 2012 12:24 EST</guid><pubDate>Thu, 27 Dec 2012 12:24 EST</pubDate></item><item><title>Phenotypic evolution studied by layered stochastic differential equations</title><link>http://projecteuclid.org/euclid.aoas/1356629050</link><description>&lt;strong&gt;Trond Reitan&lt;/strong&gt;, &lt;strong&gt;Tore Schweder&lt;/strong&gt;, &lt;strong&gt;Jorijntje Henderiks&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 6, Number 4, 1531--1551.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
Time series of cell size evolution in unicellular marine algae (division Haptophyta; Coccolithus lineage), covering 57 million years, are studied by a system of linear stochastic differential equations of hierarchical structure. The data consists of size measurements of fossilized calcite platelets (coccoliths) that cover the living cell, found in deep-sea sediment cores from six sites in the world oceans and dated to irregular points in time. To accommodate biological theory of populations tracking their fitness optima, and to allow potentially interpretable correlations in time and space, the model framework allows for an upper layer of partially observed site-specific population means, a layer of site-specific theoretical fitness optima and a bottom layer representing environmental and ecological processes. While the modeled process has many components, it is Gaussian and analytically tractable. A total of 710 model specifications within this framework are compared and inference is drawn with respect to model structure, evolutionary speed and the effect of global temperature.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1356629050_Thu, 27 Dec 2012 12:24 EST</guid><pubDate>Thu, 27 Dec 2012 12:24 EST</pubDate></item><item><title>Gap bootstrap methods for massive data sets with an application to transportation engineering</title><link>http://projecteuclid.org/euclid.aoas/1356629051</link><description>&lt;strong&gt;S. N. Lahiri&lt;/strong&gt;, &lt;strong&gt;C. Spiegelman&lt;/strong&gt;, &lt;strong&gt;J. Appiah&lt;/strong&gt;, &lt;strong&gt;L. Rilett&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 6, Number 4, 1552--1587.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
In this paper we describe two bootstrap methods for massive data sets. Naive applications of common resampling methodology are often impractical for massive data sets due to computational burden and due to complex patterns of inhomogeneity. In contrast, the proposed methods exploit certain structural properties of a large class of massive data sets to break up the original problem into a set of simpler subproblems, solve each subproblem separately where the data exhibit approximate uniformity and where computational complexity can be reduced to a manageable level, and then combine the results through certain analytical considerations. The validity of the proposed methods is proved and their finite sample properties are studied through a moderately large simulation study. The methodology is illustrated with a real data example from Transportation Engineering, which motivated the development of the proposed methods.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1356629051_Thu, 27 Dec 2012 12:24 EST</guid><pubDate>Thu, 27 Dec 2012 12:24 EST</pubDate></item><item><title>Dynamical functional prediction and classification, with application to traffic flow prediction</title><link>http://projecteuclid.org/euclid.aoas/1356629052</link><description>&lt;strong&gt;Jeng-Min Chiou&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 6, Number 4, 1588--1614.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
Motivated by the need for accurate traffic flow prediction in transportation management, we propose a functional data method to analyze traffic flow patterns and predict future traffic flow. In this study we approach the problem by sampling traffic flow trajectories from a mixture of stochastic processes. The proposed functional mixture prediction approach combines functional prediction with probabilistic functional classification to take distinct traffic flow patterns into account. The probabilistic classification procedure, which incorporates functional clustering and discrimination, hinges on subspace projection. The proposed methods not only assist in predicting traffic flow trajectories, but also identify distinct patterns in daily traffic flow of typical temporal trends and variabilities. The proposed methodology is widely applicable in analysis and prediction of longitudinally recorded functional data.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1356629052_Thu, 27 Dec 2012 12:24 EST</guid><pubDate>Thu, 27 Dec 2012 12:24 EST</pubDate></item><item><title>Dating medieval English charters</title><link>http://projecteuclid.org/euclid.aoas/1356629053</link><description>&lt;strong&gt;Gelila Tilahun&lt;/strong&gt;, &lt;strong&gt;Andrey Feuerverger&lt;/strong&gt;, &lt;strong&gt;Michael Gervers&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 6, Number 4, 1615--1640.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
Deeds, or charters, dealing with property rights, provide a continuous documentation which can be used by historians to study the evolution of social, economic and political changes. This study is concerned with charters (written in Latin) dating from the tenth through early fourteenth centuries in England. Of these, at least one million were left undated, largely due to administrative changes introduced by William the Conqueror in 1066. Correctly dating such charters is of vital importance in the study of English medieval history. This paper is concerned with computer-automated statistical methods for dating such document collections, with the goal of reducing the considerable efforts required to date them manually and of improving the accuracy of assigned dates. Proposed methods are based on such data as the variation over time of word and phrase usage, and on measures of distance between documents. The extensive (and dated) Documents of Early England Data Set (DEEDS) maintained at the University of Toronto was used for this purpose.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1356629053_Thu, 27 Dec 2012 12:24 EST</guid><pubDate>Thu, 27 Dec 2012 12:24 EST</pubDate></item><item><title>Assessing transient carryover effects in recurrent event processes, with application to chronic health conditions</title><link>http://projecteuclid.org/euclid.aoas/1356629054</link><description>&lt;strong&gt;Candemir Çiğşar&lt;/strong&gt;, &lt;strong&gt;Jerald F. Lawless&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 6, Number 4, 1641--1663.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
In some settings involving recurrent events, the occurrence of one event may produce a temporary increase in the event intensity; we refer to this phenomenon as a transient carryover effect. This paper provides models and tests for carryover effect. Motivation for our work comes from events associated with chronic health conditions, and we consider two studies involving asthma attacks in children in some detail. We consider how carryover effects can be modeled and assessed, and note some difficulties in the context of heterogeneous groups of individuals. We give a simple intuitive test for no carryover effect and examine its properties. In addition, we demonstrate the need for detailed modeling in trying to deconstruct the dynamics of recurrent events.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1356629054_Thu, 27 Dec 2012 12:24 EST</guid><pubDate>Thu, 27 Dec 2012 12:24 EST</pubDate></item><item><title>Multiple hypothesis testing adjusted for latent variables, with an application to the AGEMAP gene expression data</title><link>http://projecteuclid.org/euclid.aoas/1356629055</link><description>&lt;strong&gt;Yunting Sun&lt;/strong&gt;, &lt;strong&gt;Nancy R. Zhang&lt;/strong&gt;, &lt;strong&gt;Art B. Owen&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 6, Number 4, 1664--1688.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
In high throughput settings we inspect a great many candidate variables (e.g., genes) searching for associations with a primary variable (e.g., a phenotype). High throughput hypothesis testing can be made difficult by the presence of systemic effects and other latent variables. It is well known that those variables alter the level of tests and induce correlations between tests. They also change the relative ordering of significance levels among hypotheses. Poor rankings lead to wasteful and ineffective follow-up studies. The problem becomes acute for latent variables that are correlated with the primary variable. We propose a two-stage analysis to counter the effects of latent variables on the ranking of hypotheses. Our method, called LEAPP, statistically isolates the latent variables from the primary one. In simulations, it gives better ordering of hypotheses than competing methods such as SVA and EIGENSTRAT. For an illustration, we turn to data from the AGEMAP study relating gene expression to age for 16 tissues in the mouse. LEAPP generates rankings with greater consistency across tissues than the rankings attained by the other methods.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1356629055_Thu, 27 Dec 2012 12:24 EST</guid><pubDate>Thu, 27 Dec 2012 12:24 EST</pubDate></item><item><title>Truth and memory: Linking instantaneous and retrospective self-reported cigarette consumption</title><link>http://projecteuclid.org/euclid.aoas/1356629056</link><description>&lt;strong&gt;Hao Wang&lt;/strong&gt;, &lt;strong&gt;Saul Shiffman&lt;/strong&gt;, &lt;strong&gt;Sandra D. Griffith&lt;/strong&gt;, &lt;strong&gt;Daniel F. Heitjan&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 6, Number 4, 1689--1706.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
Studies of smoking behavior commonly use the time-line follow-back (TLFB) method, or periodic retrospective recall, to gather data on daily cigarette consumption. TLFB is considered adequate for identifying periods of abstinence and lapse but not for measurement of daily cigarette consumption, thanks to substantial recall and digit preference biases. With the development of the hand-held electronic diary (ED), it has become possible to collect cigarette consumption data using ecological momentary assessment (EMA), or the instantaneous recording of each cigarette as it is smoked. EMA data, because they do not rely on retrospective recall, are thought to more accurately measure cigarette consumption. In this article we present an analysis of consumption data collected simultaneously by both methods from 236 active smokers in the pre-quit phase of a smoking cessation study. We define a statistical model that describes the genesis of the TLFB records as a two-stage process of mis-remembering and rounding, including fixed and random effects at each stage. We use Bayesian methods to estimate the model, and we evaluate its adequacy by studying histograms of imputed values of the latent remembered cigarette count. Our analysis suggests that both mis-remembering and heaping contribute substantially to the distortion of self-reported cigarette counts. Higher nicotine dependence, white ethnicity and male sex are associated with greater remembered smoking given the EMA count. The model is potentially useful in other applications where it is desirable to understand the process by which subjects remember and report true observations.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1356629056_Thu, 27 Dec 2012 12:24 EST</guid><pubDate>Thu, 27 Dec 2012 12:24 EST</pubDate></item><item><title>Toxicity profiling of engineered nanomaterials via multivariate dose-response surface modeling</title><link>http://projecteuclid.org/euclid.aoas/1356629057</link><description>&lt;strong&gt;Trina Patel&lt;/strong&gt;, &lt;strong&gt;Donatello Telesca&lt;/strong&gt;, &lt;strong&gt;Saji George&lt;/strong&gt;, &lt;strong&gt;André E. Nel&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 6, Number 4, 1707--1729.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
New generation in vitro high-throughput screening (HTS) assays for the assessment of engineered nanomaterials provide an opportunity to learn how these particles interact at the cellular level, particularly in relation to injury pathways. These types of assays are often characterized by small sample sizes, high measurement error and high dimensionality, as multiple cytotoxicity outcomes are measured across an array of doses and durations of exposure. In this paper we propose a probability model for the toxicity profiling of engineered nanomaterials. A hierarchical structure is used to account for the multivariate nature of the data by modeling dependence between outcomes and thereby combining information across cytotoxicity pathways. In this framework we are able to provide a flexible surface-response model that provides inference and generalizations of various classical risk assessment parameters. We discuss applications of this model to data on eight nanoparticles evaluated in relation to four cytotoxicity parameters.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1356629057_Thu, 27 Dec 2012 12:24 EST</guid><pubDate>Thu, 27 Dec 2012 12:24 EST</pubDate></item><item><title>Optimal obstacle placement with disambiguations</title><link>http://projecteuclid.org/euclid.aoas/1356629058</link><description>&lt;strong&gt;Vural Aksakalli&lt;/strong&gt;, &lt;strong&gt;Elvan Ceyhan&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 6, Number 4, 1730--1774.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
We introduce the optimal obstacle placement with disambiguations problem wherein the goal is to place true obstacles in an environment cluttered with false obstacles so as to maximize the total traversal length of a navigating agent (NAVA). Prior to the traversal, the NAVA is given location information and probabilistic estimates of each disk-shaped hindrance (hereinafter referred to as disk) being a true obstacle. The NAVA can disambiguate a disk’s status only when situated on its boundary. There exists an obstacle placing agent (OPA) that locates obstacles prior to the NAVA’s traversal. The goal of the OPA is to place true obstacles in between the clutter in such a way that the NAVA’s traversal length is maximized in a game-theoretic sense. We assume the OPA knows the clutter spatial distribution type, but not the exact locations of clutter disks. We analyze the traversal length using repeated measures analysis of variance for various obstacle number, obstacle placing scheme and clutter spatial distribution type combinations in order to identify the optimal combination. Our results indicate that as the clutter becomes more regular (clustered), the NAVA’s traversal length gets longer (shorter). On the other hand, the traversal length tends to follow a concave-down trend as the number of obstacles increases. We also provide a case study on a real-world maritime minefield data set.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1356629058_Thu, 27 Dec 2012 12:24 EST</guid><pubDate>Thu, 27 Dec 2012 12:24 EST</pubDate></item><item><title>A likelihood-based scoring method for peptide identification using mass spectrometry</title><link>http://projecteuclid.org/euclid.aoas/1356629059</link><description>&lt;strong&gt;Qunhua Li&lt;/strong&gt;, &lt;strong&gt;Jimmy K. Eng&lt;/strong&gt;, &lt;strong&gt;Matthew Stephens&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 6, Number 4, 1775--1794.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
Mass spectrometry provides a high-throughput approach to identify proteins in biological samples. A key step in the analysis of mass spectrometry data is to identify the peptide sequence that, most probably, gave rise to each observed spectrum. This is often tackled using a database search: each observed spectrum is compared against a large number of theoretical “expected” spectra predicted from candidate peptide sequences in a database, and the best match is identified using some heuristic scoring criterion. Here we provide a more principled, likelihood-based, scoring criterion for this problem. Specifically, we introduce a probabilistic model that allows one to assess, for each theoretical spectrum, the probability that it would produce the observed spectrum. This probabilistic model takes account of peak locations and intensities, in both observed and theoretical spectra, which enables incorporation of detailed knowledge of chemical plausibility in peptide identification. Besides placing peptide scoring on a sounder theoretical footing, the likelihood-based score also has important practical benefits: it provides natural measures for assessing the uncertainty of each identification, and in comparisons on benchmark data it produced more accurate peptide identifications than other methods, including SEQUEST. Although we focus here on peptide identification, our scoring rule could easily be integrated into any downstream analyses that require peptide-spectrum match scores.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1356629059_Thu, 27 Dec 2012 12:24 EST</guid><pubDate>Thu, 27 Dec 2012 12:24 EST</pubDate></item><item><title>Latent demographic profile estimation in hard-to-reach groups</title><link>http://projecteuclid.org/euclid.aoas/1356629060</link><description>&lt;strong&gt;Tyler H. McCormick&lt;/strong&gt;, &lt;strong&gt;Tian Zheng&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 6, Number 4, 1795--1813.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
The sampling frame in most social science surveys excludes members of certain groups, known as hard-to-reach groups. These groups, or subpopulations, may be difficult to access (the homeless, e.g.), camouflaged by stigma (individuals with HIV/AIDS), or both (commercial sex workers). Even basic demographic information about these groups is typically unknown, especially in many developing nations. We present statistical models which leverage social network structure to estimate demographic characteristics of these subpopulations using aggregated relational data (ARD), or questions of the form “How many X’s do you know?” Unlike other network-based techniques for reaching these groups, ARD require no special sampling strategy and are easily incorporated into standard surveys. ARD also do not require respondents to reveal their own group membership. We propose a Bayesian hierarchical model for estimating the demographic characteristics of hard-to-reach groups, or latent demographic profiles, using ARD. We propose two estimation techniques. First, we propose a Markov-chain Monte Carlo algorithm for existing data or cases where the full posterior distribution is of interest. For cases when new data can be collected, we propose guidelines and, based on these guidelines, propose a simple estimate motivated by a missing data approach. Using data from McCarty et al. [Human Organization 60 (2001) 28–39], we estimate the age and gender profiles of six hard-to-reach groups, such as individuals who have HIV, women who were raped, and homeless persons. We also evaluate our simple estimates using simulation studies.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1356629060_Thu, 27 Dec 2012 12:24 EST</guid><pubDate>Thu, 27 Dec 2012 12:24 EST</pubDate></item><item><title>Addressing missing data mechanism uncertainty using multiple-model multiple imputation: Application to a longitudinal clinical trial</title><link>http://projecteuclid.org/euclid.aoas/1356629061</link><description>&lt;strong&gt;Juned Siddique&lt;/strong&gt;, &lt;strong&gt;Ofer Harel&lt;/strong&gt;, &lt;strong&gt;Catherine M. Crespi&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 6, Number 4, 1814--1837.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
We present a framework for generating multiple imputations for continuous data when the missing data mechanism is unknown. Imputations are generated from more than one imputation model in order to incorporate uncertainty regarding the missing data mechanism. Parameter estimates based on the different imputation models are combined using rules for nested multiple imputation. Through the use of simulation, we investigate the impact of missing data mechanism uncertainty on post-imputation inferences and show that incorporating this uncertainty can increase the coverage of parameter estimates. We apply our method to a longitudinal clinical trial of low-income women with depression where nonignorably missing data were a concern. We show that different assumptions regarding the missing data mechanism can have a substantial impact on inferences. Our method provides a simple approach for formalizing subjective notions regarding nonresponse so that they can be easily stated, communicated and compared.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1356629061_Thu, 27 Dec 2012 12:24 EST</guid><pubDate>Thu, 27 Dec 2012 12:24 EST</pubDate></item><item><title>Composite Gaussian process models for emulating expensive functions</title><link>http://projecteuclid.org/euclid.aoas/1356629062</link><description>&lt;strong&gt;Shan Ba&lt;/strong&gt;, &lt;strong&gt;V. Roshan Joseph&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 6, Number 4, 1838--1860.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
A new type of nonstationary Gaussian process model is developed for approximating computationally expensive functions. The new model is a composite of two Gaussian processes, where the first one captures the smooth global trend and the second one models local details. The new predictor also incorporates a flexible variance model, which makes it more capable of approximating surfaces with varying volatility. Compared to the commonly used stationary Gaussian process model, the new predictor is numerically more stable and can more accurately approximate complex surfaces when the experimental design is sparse. In addition, the new model can also improve the prediction intervals by quantifying the change of local variability associated with the response. Advantages of the new predictor are demonstrated using several examples.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1356629062_Thu, 27 Dec 2012 12:24 EST</guid><pubDate>Thu, 27 Dec 2012 12:24 EST</pubDate></item><item><title>A semiparametric regression model for paired longitudinal outcomes with application in childhood blood pressure development</title><link>http://projecteuclid.org/euclid.aoas/1356629063</link><description>&lt;strong&gt;Hai Liu&lt;/strong&gt;, &lt;strong&gt;Wanzhu Tu&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 6, Number 4, 1861--1882.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
This research examines the simultaneous influences of height and weight on longitudinally measured systolic and diastolic blood pressure in children. Previous studies have shown that both height and weight are positively associated with blood pressure. In children, however, the concurrent increases of height and weight have made it all but impossible to discern the effect of height from that of weight. To better understand these influences, we propose to examine the joint effect of height and weight on blood pressure. Bivariate thin plate spline surfaces are used to accommodate the potentially nonlinear effects as well as the interaction between height and weight. Moreover, we consider a joint model for paired blood pressure measures, that is, systolic and diastolic blood pressure, to account for the underlying correlation between the two measures within the same individual. The bivariate spline surfaces are allowed to vary across different groups of interest. We have developed related model fitting and inference procedures. The proposed method is used to analyze data from a real clinical investigation.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1356629063_Thu, 27 Dec 2012 12:24 EST</guid><pubDate>Thu, 27 Dec 2012 12:24 EST</pubDate></item><item><title>Probabilistic prediction of neurological disorders with a statistical assessment of neuroimaging data modalities</title><link>http://projecteuclid.org/euclid.aoas/1356629064</link><description>&lt;strong&gt;M. Filippone&lt;/strong&gt;, &lt;strong&gt;A. F. Marquand&lt;/strong&gt;, &lt;strong&gt;C. R. V. Blain&lt;/strong&gt;, &lt;strong&gt;S. C. R. Williams&lt;/strong&gt;, &lt;strong&gt;J. Mourão-Miranda&lt;/strong&gt;, &lt;strong&gt;M. Girolami&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 6, Number 4, 1883--1905.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
For many neurological disorders, prediction of disease state is an important clinical aim. Neuroimaging provides detailed information about brain structure and function from which such predictions may be statistically derived. A multinomial logit model with Gaussian process priors is proposed to: (i) predict disease state based on whole-brain neuroimaging data and (ii) analyze the relative informativeness of different image modalities and brain regions. Advanced Markov chain Monte Carlo methods are employed to perform posterior inference over the model. This paper reports a statistical assessment of multiple neuroimaging modalities applied to the discrimination of three Parkinsonian neurological disorders from one another and healthy controls, showing promising predictive performance of disease states when compared to nonprobabilistic classifiers based on multiple modalities. The statistical analysis also quantifies the relative importance of different neuroimaging measures and brain regions in discriminating between these diseases and suggests that for prediction there is little benefit in acquiring multiple neuroimaging sequences. Finally, the predictive capability of different brain regions is found to be in accordance with the regional pathology of the diseases as reported in the clinical literature.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1356629064_Thu, 27 Dec 2012 12:24 EST</guid><pubDate>Thu, 27 Dec 2012 12:24 EST</pubDate></item><item><title>Evaluating stationarity via change-point alternatives with applications to fMRI data</title><link>http://projecteuclid.org/euclid.aoas/1356629065</link><description>&lt;strong&gt;John A. D. Aston&lt;/strong&gt;, &lt;strong&gt;Claudia Kirch&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 6, Number 4, 1906--1948.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
Functional magnetic resonance imaging (fMRI) is now a well-established technique for studying the brain. However, in many situations, such as when data are acquired in a resting state, it is difficult to know whether the data are truly stationary or if level shifts have occurred. To this end, change-point detection in sequences of functional data is examined where the functional observations are dependent and where the distributions of change-points from multiple subjects are required. Of particular interest is the case where the change-point is an epidemic change—a change occurs and then the observations return to baseline at a later time. The case where the covariance can be decomposed as a tensor product is considered with particular attention to the power analysis for detection. This is of interest in the application to fMRI, where the estimation of a full covariance structure for the three-dimensional image is not computationally feasible. Using the developed methods, a large study of resting state fMRI data is conducted to determine whether the subjects undertaking the resting scan have nonstationarities present in their time courses. It is found that a sizeable proportion of the subjects studied are not stationary. The change-point distribution for those subjects is determined empirically, and its theoretical properties are examined.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1356629065_Thu, 27 Dec 2012 12:24 EST</guid><pubDate>Thu, 27 Dec 2012 12:24 EST</pubDate></item><item><title>The ranking lasso and its application to sport tournaments</title><link>http://projecteuclid.org/euclid.aoas/1356629066</link><description>&lt;strong&gt;Guido Masarotto&lt;/strong&gt;, &lt;strong&gt;Cristiano Varin&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 6, Number 4, 1949--1970.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
Ranking a vector of alternatives on the basis of a series of paired comparisons is a relevant topic in many instances. A popular example is ranking contestants in sport tournaments. For this purpose, paired comparison models such as the Bradley–Terry model are often used. This paper suggests fitting paired comparison models with a lasso-type procedure that forces contestants with similar abilities to be classified into the same group. Benefits of the proposed method are easier interpretation of rankings and a significant improvement of the quality of predictions with respect to the standard maximum likelihood fitting. Numerical aspects of the proposed method are discussed in detail. The methodology is illustrated through ranking of the teams of the National Football League 2010–2011 and the American College Hockey Men’s Division I 2009–2010.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1356629066_Thu, 27 Dec 2012 12:24 EST</guid><pubDate>Thu, 27 Dec 2012 12:24 EST</pubDate></item><item><title>Bayesian inference and the parametric bootstrap</title><link>http://projecteuclid.org/euclid.aoas/1356629067</link><description>&lt;strong&gt;Bradley Efron&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 6, Number 4, 1971--1997.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
The parametric bootstrap can be used for the efficient computation of Bayes posterior distributions. Importance sampling formulas take on an easy form relating to the deviance in exponential families and are particularly simple starting from Jeffreys invariant prior. Because of the i.i.d. nature of bootstrap sampling, familiar formulas describe the computational accuracy of the Bayes estimates. Besides computational methods, the theory provides a connection between Bayesian and frequentist analysis. Efficient algorithms for the frequentist accuracy of Bayesian inferences are developed and demonstrated in a model selection example.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1356629067_Thu, 27 Dec 2012 12:24 EST</guid><pubDate>Thu, 27 Dec 2012 12:24 EST</pubDate></item><item><title>Variance function estimation in quantitative mass spectrometry with application to iTRAQ labeling</title><link>http://projecteuclid.org/euclid.aoas/1365527188</link><description>&lt;strong&gt;Micha Mandel&lt;/strong&gt;, &lt;strong&gt;Manor Askenazi&lt;/strong&gt;, &lt;strong&gt;Yi Zhang&lt;/strong&gt;, &lt;strong&gt;Jarrod A. Marto&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 7, Number 1, 1--24.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
This paper describes and compares two methods for estimating the variance function associated with iTRAQ (isobaric tag for relative and absolute quantitation) isotopic labeling in quantitative mass spectrometry based proteomics. Measurements generated by the mass spectrometer are proportional to the concentration of peptides present in the biological sample. However, the iTRAQ reporter signals are subject to errors that depend on the peptide amounts. The variance function of the errors is therefore an essential parameter for evaluating the results, but estimating it is complicated, as the number of nuisance parameters increases with sample size while the number of replicates for each peptide remains small. Two experiments that were conducted with the sole goal of estimating the variance function and its stability over time are analyzed, and the resulting estimated variance function is used to analyze an experiment targeting aberrant signaling cascades in cells harboring distinct oncogenic mutations. Methods for constructing conservative $p$-values and confidence intervals are discussed.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1365527188_Tue, 09 Apr 2013 13:07 EDT</guid><pubDate>Tue, 09 Apr 2013 13:07 EDT</pubDate></item><item><title>Stronger instruments via integer programming in an observational study of late preterm birth outcomes</title><link>http://projecteuclid.org/euclid.aoas/1365527189</link><description>&lt;strong&gt;José R. Zubizarreta&lt;/strong&gt;, &lt;strong&gt;Dylan S. Small&lt;/strong&gt;, &lt;strong&gt;Neera K. Goyal&lt;/strong&gt;, &lt;strong&gt;Scott Lorch&lt;/strong&gt;, &lt;strong&gt;Paul R. Rosenbaum&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 7, Number 1, 25--50.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
In an optimal nonbipartite match, a single population is divided into matched pairs to minimize a total distance within matched pairs. Nonbipartite matching has been used to strengthen instrumental variables in observational studies of treatment effects, essentially by forming pairs that are similar in terms of covariates but very different in the strength of encouragement to accept the treatment. Optimal nonbipartite matching is typically done using network optimization techniques that can be quick, running in polynomial time, but these techniques limit the tools available for matching. Instead, we use integer programming techniques, thereby obtaining a wealth of new tools not previously available for nonbipartite matching, including fine and near-fine balance for several nominal variables, forced near balance on means and optimal subsetting. We illustrate the methods in our on-going study of outcomes of late-preterm births in California, that is, births of 34 to 36 weeks of gestation. Would lengthening the time in the hospital for such births reduce the frequency of rapid readmissions? A straightforward comparison of babies who stay for a shorter or longer time would be severely biased, because the principal reason for a long stay is some serious health problem. We need an instrument, something inconsequential and haphazard that encourages a shorter or a longer stay in the hospital. It turns out that babies born at certain times of day tend to stay overnight once with a shorter length of stay, whereas babies born at other times of day tend to stay overnight twice with a longer length of stay, and there is nothing particularly special about a baby who is born at 11:00 pm. Therefore, we use hour-of-birth as an instrument for a longer hospital stay. Using integer programming, we form 80,600 pairs of two babies who are similar in terms of observed covariates but very different in anticipated lengths of stay based on their hours of birth. 
We ask whether encouragement to stay an extra day reduces readmissions within two days of discharge. A sensitivity analysis addresses the possibility that the instrument is not valid as an instrument, that is, not random but rather biased by an unmeasured covariate associated with the hour of birth. Bias can give the impression of a treatment effect when there is no effect, but it can also mask an actual effect, leaving the impression of no effect, and both possibilities are examined in analyses for effects and for near equivalence.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1365527189_Tue, 09 Apr 2013 13:07 EDT</guid><pubDate>Tue, 09 Apr 2013 13:07 EDT</pubDate></item><item><title>Variable selection and sensitivity analysis using dynamic trees, with an application to computer code performance tuning</title><link>http://projecteuclid.org/euclid.aoas/1365527190</link><description>&lt;strong&gt;Robert B. Gramacy&lt;/strong&gt;, &lt;strong&gt;Matt Taddy&lt;/strong&gt;, &lt;strong&gt;Stefan M. Wild&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 7, Number 1, 51--80.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
We investigate an application in the automatic tuning of computer codes, an area of research that has come to prominence alongside the recent rise of distributed scientific processing and heterogeneity in high-performance computing environments. Here, the response function is nonlinear and noisy and may not be smooth or stationary. Clearly needed are variable selection, decomposition of influence, and analysis of main and secondary effects for both real-valued and binary inputs and outputs. Our contribution is a novel set of tools for variable selection and sensitivity analysis based on the recently proposed dynamic tree model. We argue that this approach is uniquely well suited to the demands of our motivating example. In illustrations on benchmark data sets, we show that the new techniques are faster and offer richer feature sets than do similar approaches in the static tree and computer experiment literature. We apply the methods in code-tuning optimization, examination of a cold-cache effect, and detection of transformation errors.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1365527190_Tue, 09 Apr 2013 13:07 EDT</guid><pubDate>Tue, 09 Apr 2013 13:07 EDT</pubDate></item><item><title>Geostatistical modeling in the presence of interaction between the measuring instruments, with an application to the estimation of spatial market potentials</title><link>http://projecteuclid.org/euclid.aoas/1365527191</link><description>&lt;strong&gt;Francesco Finazzi&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 7, Number 1, 81--101.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
This paper addresses the problem of recovering the spatial market potential of a retail product from spatially distributed sales data. In order to tackle the problem in a general way, the concept of spatial potential is introduced. The potential is concurrently measured at different spatial locations and the measurements are analyzed in order to recover the spatial potential. The measuring instruments used to collect the data interact with each other, that is, the measurement at a given spatial location is affected by the concurrent measurements at other locations. An approach based on a novel geostatistical model is developed. In particular, the model is able to handle both the measuring instrument interaction and the missing data. A model estimation procedure based on the expectation–maximization algorithm is provided as well as standard inferential tools. The model is applied to the estimation of the spatial market potential of a newspaper for the city of Bergamo, Italy. The estimated spatial market potential is eventually analyzed in order to identify the areas with the highest potential, to identify the areas where it is profitable to open additional newsstands and to evaluate the newspaper total market volume of the city.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1365527191_Tue, 09 Apr 2013 13:07 EDT</guid><pubDate>Tue, 09 Apr 2013 13:07 EDT</pubDate></item><item><title>Varying coefficient model for modeling diffusion tensors along white matter tracts</title><link>http://projecteuclid.org/euclid.aoas/1365527192</link><description>&lt;strong&gt;Ying Yuan&lt;/strong&gt;, &lt;strong&gt;Hongtu Zhu&lt;/strong&gt;, &lt;strong&gt;Martin Styner&lt;/strong&gt;, &lt;strong&gt;John H. Gilmore&lt;/strong&gt;, &lt;strong&gt;J. S. Marron&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 7, Number 1, 102--125.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
Diffusion tensor imaging provides important information on tissue structure and orientation of fiber tracts in brain white matter in vivo. It results in diffusion tensors, which are $3\times3$ symmetric positive definite (SPD) matrices, along fiber bundles. This paper develops a functional data analysis framework to model diffusion tensors along fiber tracts as functional data in a Riemannian manifold with a set of covariates of interest, such as age and gender. We propose a statistical model with varying coefficient functions to characterize the dynamic association between functional SPD matrix-valued responses and covariates. We calculate weighted least squares estimators of the varying coefficient functions for the log-Euclidean metric in the space of SPD matrices. We also develop a global test statistic to test specific hypotheses about these coefficient functions and construct their simultaneous confidence bands. Simulated data are further used to examine the finite sample performance of the estimated varying coefficient functions. We apply our model to study potential gender differences and find a statistically significant aspect of the development of diffusion tensors along the right internal capsule tract in a clinical study of neurodevelopment.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1365527192_Tue, 09 Apr 2013 13:07 EDT</guid><pubDate>Tue, 09 Apr 2013 13:07 EDT</pubDate></item><item><title>Bayesian analysis of dynamic item response models in educational testing</title><link>http://projecteuclid.org/euclid.aoas/1365527193</link><description>&lt;strong&gt;Xiaojing Wang&lt;/strong&gt;, &lt;strong&gt;James O. Berger&lt;/strong&gt;, &lt;strong&gt;Donald S. Burdick&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 7, Number 1, 126--153.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
Item response theory (IRT) models have been widely used in educational measurement testing. When there are repeated observations available for individuals through time, a dynamic structure for the latent trait of ability needs to be incorporated into the model, to accommodate changes in ability. Other complications that often arise in such settings include a violation of the common assumption that test results are conditionally independent, given ability and item difficulty, and that test item difficulties may be partially specified, but subject to uncertainty. Focusing on time series dichotomous response data, a new class of state space models, called Dynamic Item Response (DIR) models, is proposed. The models can be applied either retrospectively to the full data or on-line, in cases where real-time prediction is needed. The models are studied through simulated examples and applied to a large collection of reading test data obtained from MetaMetrics, Inc.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1365527193_Tue, 09 Apr 2013 13:07 EDT</guid><pubDate>Tue, 09 Apr 2013 13:07 EDT</pubDate></item><item><title>Modeling temporal gradients in regionally aggregated California asthma hospitalization data</title><link>http://projecteuclid.org/euclid.aoas/1365527194</link><description>&lt;strong&gt;Harrison Quick&lt;/strong&gt;, &lt;strong&gt;Sudipto Banerjee&lt;/strong&gt;, &lt;strong&gt;Bradley P. Carlin&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 7, Number 1, 154--176.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
Advances in Geographical Information Systems (GIS) have led to the enormous recent burgeoning of spatial-temporal databases and associated statistical modeling. Here we depart from the rather rich literature in space–time modeling by considering the setting where space is discrete (e.g., aggregated data over regions), but time is continuous. Our major objective in this application is to carry out inference on gradients of a temporal process in our data set of monthly county level asthma hospitalization rates in the state of California, while at the same time accounting for spatial similarities of the temporal process across neighboring counties. Use of continuous time models here allows inference at a finer resolution than at which the data are sampled. Rather than use parametric forms to model time, we opt for a more flexible stochastic process embedded within a dynamic Markov random field framework. Through the matrix-valued covariance function we can ensure that the temporal process realizations are mean square differentiable, and may thus carry out inference on temporal gradients in a posterior predictive fashion. We use this approach to evaluate temporal gradients where we are concerned with temporal changes in the residual and fitted rate curves after accounting for seasonality, spatiotemporal ozone levels and several spatially-resolved important sociodemographic covariates.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1365527194_Tue, 09 Apr 2013 13:07 EDT</guid><pubDate>Tue, 09 Apr 2013 13:07 EDT</pubDate></item><item><title>Clustering for multivariate continuous and discrete longitudinal data</title><link>http://projecteuclid.org/euclid.aoas/1365527195</link><description>&lt;strong&gt;Arnošt Komárek&lt;/strong&gt;, &lt;strong&gt;Lenka Komárková&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 7, Number 1, 177--200.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
Multiple outcomes, both continuous and discrete, are routinely gathered on subjects in longitudinal studies and during routine clinical follow-up in general. To motivate our work, we consider a longitudinal study on patients with primary biliary cirrhosis (PBC) with a continuous bilirubin level, a discrete platelet count and a dichotomous indication of blood vessel malformations as examples of such longitudinal outcomes. An apparent requirement is to use all the outcome values to classify the subjects into groups (e.g., groups of subjects with a similar prognosis in a clinical setting). In recent years, numerous approaches have been suggested for classification based on longitudinal (or otherwise correlated) outcomes, targeting not only traditional areas like biostatistics, but also rapidly evolving bioinformatics and many others. However, most available approaches consider only continuous outcomes as a basis for classification, or if noncontinuous outcomes are considered, then not in combination with other outcomes of a different nature. Here, we propose a statistical method for clustering (classification) of subjects into a prespecified number of groups with a priori unknown characteristics on the basis of repeated measurements of several longitudinal outcomes of a different nature. This method relies on a multivariate extension of the classical generalized linear mixed model where a mixture distribution is additionally assumed for random effects. We base the inference on a Bayesian specification of the model and simulation-based Markov chain Monte Carlo methodology. To apply the method in practice, we have prepared ready-to-use software for use in R (http://www.R-project.org). We also discuss evaluation of uncertainty in the classification and the use of a recently proposed methodology for model comparison (the selection of a number of clusters in our case) based on the penalized posterior deviance proposed by Plummer [ Biostatistics 9 (2008) 523–539].
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1365527195_Tue, 09 Apr 2013 13:07 EDT</guid><pubDate>Tue, 09 Apr 2013 13:07 EDT</pubDate></item><item><title>Local tests for identifying anisotropic diffusion areas in human brain with DTI</title><link>http://projecteuclid.org/euclid.aoas/1365527196</link><description>&lt;strong&gt;Tao Yu&lt;/strong&gt;, &lt;strong&gt;Chunming Zhang&lt;/strong&gt;, &lt;strong&gt;Andrew L. Alexander&lt;/strong&gt;, &lt;strong&gt;Richard J. Davidson&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 7, Number 1, 201--225.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
Diffusion tensor imaging (DTI) plays a key role in analyzing the physical structures of biological tissues, particularly in reconstructing fiber tracts of the human brain in vivo. On the one hand, eigenvalues of diffusion tensors (DTs) estimated from diffusion weighted imaging (DWI) data usually contain systematic bias, which subsequently biases the diffusivity measurements popularly adopted in fiber tracking algorithms. On the other hand, correctly accounting for the spatial information is important in the construction of these diffusivity measurements since the fiber tracts are typically spatially structured. This paper aims to establish test-based approaches to identify anisotropic water diffusion areas in the human brain. These areas in turn indicate the areas passed by fiber tracts. Our proposed test statistic not only takes into account the bias components in eigenvalue estimates, but also incorporates the spatial information of neighboring voxels. Under mild regularity conditions, we demonstrate that the proposed test statistic asymptotically follows a $\chi^{2}$ distribution under the null hypothesis. Simulation and real DTI data examples are provided to illustrate the efficacy of our proposed methods.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1365527196_Tue, 09 Apr 2013 13:07 EDT</guid><pubDate>Tue, 09 Apr 2013 13:07 EDT</pubDate></item><item><title>Sparse least trimmed squares regression for analyzing high-dimensional large data sets</title><link>http://projecteuclid.org/euclid.aoas/1365527197</link><description>&lt;strong&gt;Andreas Alfons&lt;/strong&gt;, &lt;strong&gt;Christophe Croux&lt;/strong&gt;, &lt;strong&gt;Sarah Gelper&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 7, Number 1, 226--248.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
Sparse model estimation is a topic of high importance in modern data analysis due to the increasing availability of data sets with a large number of variables. Another common problem in applied statistics is the presence of outliers in the data. This paper combines robust regression and sparse model estimation. A robust and sparse estimator is introduced by adding an $L_{1}$ penalty on the coefficient estimates to the well-known least trimmed squares (LTS) estimator. The breakdown point of this sparse LTS estimator is derived, and a fast algorithm for its computation is proposed. In addition, the sparse LTS is applied to protein and gene expression data of the NCI-60 cancer cell panel. Both a simulation study and the real data application show that the sparse LTS has better prediction performance than its competitors in the presence of leverage points.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1365527197_Tue, 09 Apr 2013 13:07 EDT</guid><pubDate>Tue, 09 Apr 2013 13:07 EDT</pubDate></item><item><title>Robust partial likelihood approach for detecting imprinting and maternal effects using case-control families</title><link>http://projecteuclid.org/euclid.aoas/1365527198</link><description>&lt;strong&gt;Jingyuan Yang&lt;/strong&gt;, &lt;strong&gt;Shili Lin&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 7, Number 1, 249--268.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
Genomic imprinting and maternal effects are two epigenetic factors that have been increasingly explored for their roles in the etiology of complex diseases. This is part of a concerted effort to find the “missing heritability.” Accordingly, statistical methods have been proposed to detect imprinting and maternal effects simultaneously based on either a case-parent triads design or a case-mother/control-mother pairs design. However, existing methods are full-likelihood based and have to make strong assumptions concerning mating type probabilities (nuisance parameters) to avoid overparametrization. In this paper we propose to augment the two popular study designs by combining them and including control-parent triads, so that our sample may contain a mixture of case-parent/control-parent triads and case-mother/control-mother pairs. By matching the case families with control families of the same structure and stratifying according to the familial genotypes, we are able to derive a partial likelihood that is free of the nuisance parameters. This renders unnecessary any unrealistic assumptions and leads to a robust procedure without sacrificing power. Our simulation study demonstrates that our partial likelihood method has correct type I error rate, little bias and reasonable power under a variety of settings.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1365527198_Tue, 09 Apr 2013 13:07 EDT</guid><pubDate>Tue, 09 Apr 2013 13:07 EDT</pubDate></item><item><title>Sparse integrative clustering of multiple omics data sets</title><link>http://projecteuclid.org/euclid.aoas/1365527199</link><description>&lt;strong&gt;Ronglai Shen&lt;/strong&gt;, &lt;strong&gt;Sijian Wang&lt;/strong&gt;, &lt;strong&gt;Qianxing Mo&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 7, Number 1, 269--294.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
High resolution microarrays and second-generation sequencing platforms are powerful tools to investigate genome-wide alterations in DNA copy number, methylation and gene expression associated with a disease. An integrated genomic profiling approach measures multiple omics data types simultaneously in the same set of biological samples. Such an approach renders an integrated data resolution that would not be available with any single data type. In this study, we use penalized latent variable regression methods for joint modeling of multiple omics data types to identify common latent variables that can be used to cluster patient samples into biologically and clinically relevant disease subtypes. We consider lasso [ J. Roy. Statist. Soc. Ser. B 58 (1996) 267–288], elastic net [ J. R. Stat. Soc. Ser. B Stat. Methodol. 67 (2005) 301–320] and fused lasso [ J. R. Stat. Soc. Ser. B Stat. Methodol. 67 (2005) 91–108] methods to induce sparsity in the coefficient vectors, revealing important genomic features that have significant contributions to the latent variables. An iterative ridge regression is used to compute the sparse coefficient vectors. In model selection, a uniform design [ Monographs on Statistics and Applied Probability (1994) Chapman &amp;amp; Hall] is used to seek “experimental” points that are scattered uniformly across the search domain for efficient sampling of tuning parameter combinations. We compared our method to sparse singular value decomposition (SVD) and penalized Gaussian mixture model (GMM) using both real and simulated data sets. The proposed method is applied to integrate genomic, epigenomic and transcriptomic data for subtype analysis in breast and lung cancer data sets.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1365527199_Tue, 09 Apr 2013 13:07 EDT</guid><pubDate>Tue, 09 Apr 2013 13:07 EDT</pubDate></item><item><title>Agnostic notes on regression adjustments to experimental data: Reexamining Freedman’s critique</title><link>http://projecteuclid.org/euclid.aoas/1365527200</link><description>&lt;strong&gt;Winston Lin&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 7, Number 1, 295--318.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
Freedman [ Adv. in Appl. Math. 40 (2008) 180–193; Ann. Appl. Stat. 2 (2008) 176–196] critiqued ordinary least squares regression adjustment of estimated treatment effects in randomized experiments, using Neyman’s model for randomization inference. Contrary to conventional wisdom, he argued that adjustment can lead to worsened asymptotic precision, invalid measures of precision, and small-sample bias. This paper shows that in sufficiently large samples, those problems are either minor or easily fixed. OLS adjustment cannot hurt asymptotic precision when a full set of treatment–covariate interactions is included. Asymptotically valid confidence intervals can be constructed with the Huber–White sandwich standard error estimator. Checks on the asymptotic approximations are illustrated with data from Angrist, Lang, and Oreopoulos’s [ Am. Econ. J.: Appl. Econ. 1:1 (2009) 136–163] evaluation of strategies to improve college students’ achievement. The strongest reasons to support Freedman’s preference for unadjusted estimates are transparency and the dangers of specification search.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1365527200_Tue, 09 Apr 2013 13:07 EDT</guid><pubDate>Tue, 09 Apr 2013 13:07 EDT</pubDate></item><item><title>Robust VIF regression with application to variable selection in large data sets</title><link>http://projecteuclid.org/euclid.aoas/1365527201</link><description>&lt;strong&gt;Debbie J. Dupuis&lt;/strong&gt;, &lt;strong&gt;Maria-Pia Victoria-Feser&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 7, Number 1, 319--341.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
The sophisticated and automated means of data collection used by an increasing number of institutions and companies leads to extremely large data sets. Subset selection in regression is essential when a huge number of covariates can potentially explain a response variable of interest. The recent statistical literature has seen an emergence of new selection methods that provide some type of compromise between implementation (computational speed) and statistical optimality (e.g., prediction error minimization). Global methods such as Mallows’ $C_{p}$ have been supplanted by sequential methods such as stepwise regression. More recently, streamwise regression, faster than the former, has emerged. A recently proposed streamwise regression approach based on the variance inflation factor (VIF) is promising, but its least-squares based implementation makes it susceptible to the outliers inevitable in such large data sets. This lack of robustness can lead to poor and suboptimal feature selection. In our case, we seek to predict an individual’s educational attainment using economic and demographic variables. We show how classical VIF performs this task poorly and a robust procedure is necessary for policy makers. This article proposes a robust VIF regression, based on fast robust estimators, that inherits all the good properties of classical VIF in the absence of outliers, but also continues to perform well in their presence where the classical approach fails.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1365527201_Tue, 09 Apr 2013 13:07 EDT</guid><pubDate>Tue, 09 Apr 2013 13:07 EDT</pubDate></item><item><title>Incorporating external information in analyses of clinical trials with binary outcomes</title><link>http://projecteuclid.org/euclid.aoas/1365527202</link><description>&lt;strong&gt;Minge Xie&lt;/strong&gt;, &lt;strong&gt;Regina Y. Liu&lt;/strong&gt;, &lt;strong&gt;C. V. Damaraju&lt;/strong&gt;, &lt;strong&gt;William H. Olson&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 7, Number 1, 342--368.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
External information, such as prior information or expert opinions, can play an important role in the design, analysis and interpretation of clinical trials. However, little attention has been devoted thus far to incorporating external information in clinical trials with binary outcomes, perhaps due to the perception that binary outcomes can be treated as normally-distributed outcomes by using normal approximations. In this paper we show that these two types of clinical trials could behave differently, and that special care is needed for the analysis of clinical trials with binary outcomes. In particular, we first examine a simple but commonly used univariate Bayesian approach and observe a technical flaw. We then study the full Bayesian approach using different beta priors and a new frequentist approach based on the notion of confidence distribution (CD). These approaches are illustrated and compared using data from clinical studies and simulations. The full Bayesian approach is theoretically sound, but surprisingly, under skewed prior distributions, the estimate derived from the marginal posterior distribution may not fall between those from the marginal prior and the likelihood of clinical trial data. This counterintuitive phenomenon, which we call the “discrepant posterior phenomenon,” does not occur in the CD approach. The CD approach is also computationally simpler and can be applied directly to any prior distribution, symmetric or skewed.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1365527202_Tue, 09 Apr 2013 13:07 EDT</guid><pubDate>Tue, 09 Apr 2013 13:07 EDT</pubDate></item><item><title>Efficient computation with a linear mixed model on large-scale data sets with applications to genetic studies</title><link>http://projecteuclid.org/euclid.aoas/1365527203</link><description>&lt;strong&gt;Matti Pirinen&lt;/strong&gt;, &lt;strong&gt;Peter Donnelly&lt;/strong&gt;, &lt;strong&gt;Chris C. A. Spencer&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 7, Number 1, 369--390.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
Motivated by genome-wide association studies, we consider a standard linear model with one additional random effect in situations where many predictors have been collected on the same subjects and each predictor is analyzed separately. Three novel contributions are (1) a transformation between the linear and log-odds scales which is accurate for the important genetic case of small effect sizes; (2) a likelihood-maximization algorithm that is an order of magnitude faster than the previously published approaches; and (3) efficient methods for computing marginal likelihoods which allow Bayesian model comparison. The methodology has been successfully applied to a large-scale association study of multiple sclerosis including over 20,000 individuals and 500,000 genetic variants.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1365527203_Tue, 09 Apr 2013 13:07 EDT</guid><pubDate>Tue, 09 Apr 2013 13:07 EDT</pubDate></item><item><title>Bootstrap inference for network construction with an application to a breast cancer microarray study</title><link>http://projecteuclid.org/euclid.aoas/1365527204</link><description>&lt;strong&gt;Shuang Li&lt;/strong&gt;, &lt;strong&gt;Li Hsu&lt;/strong&gt;, &lt;strong&gt;Jie Peng&lt;/strong&gt;, &lt;strong&gt;Pei Wang&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 7, Number 1, 391--417.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
Gaussian Graphical Models (GGMs) have been used to construct genetic regulatory networks where regularization techniques are widely used since the network inference usually falls into a high–dimension–low–sample–size scenario. Yet, finding the right amount of regularization can be challenging, especially in an unsupervised setting where traditional methods such as BIC or cross-validation often do not work well. In this paper, we propose a new method—Bootstrap Inference for Network COnstruction (BINCO)—to infer networks by directly controlling the false discovery rates (FDRs) of the selected edges. This method fits a mixture model for the distribution of edge selection frequencies to estimate the FDRs, where the selection frequencies are calculated via model aggregation. This method is applicable to a wide range of applications beyond network construction. When we applied our proposed method to building a gene regulatory network with microarray expression breast cancer data, we were able to identify high-confidence edges and well-connected hub genes that could potentially play important roles in understanding the underlying biological processes of breast cancer.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1365527204_Tue, 09 Apr 2013 13:07 EDT</guid><pubDate>Tue, 09 Apr 2013 13:07 EDT</pubDate></item><item><title>Variable selection for sparse Dirichlet-multinomial regression with an application to microbiome data analysis</title><link>http://projecteuclid.org/euclid.aoas/1365527205</link><description>&lt;strong&gt;Jun Chen&lt;/strong&gt;, &lt;strong&gt;Hongzhe Li&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 7, Number 1, 418--442.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
With the development of next generation sequencing technology, researchers have now been able to study the microbiome composition using direct sequencing, whose output is bacterial taxa counts for each microbiome sample. One goal of microbiome study is to associate the microbiome composition with environmental covariates. We propose to model the taxa counts using a Dirichlet-multinomial (DM) regression model in order to account for overdispersion of observed counts. The DM regression model can be used for testing the association between taxa composition and covariates using the likelihood ratio test. However, when the number of covariates is large, multiple testing can lead to loss of power. To address the high dimensionality of the problem, we develop a penalized likelihood approach to estimate the regression parameters and to select the variables by imposing a sparse group $\ell_{1}$ penalty to encourage both group-level and within-group sparsity. Such a variable selection procedure can lead to selection of the relevant covariates and their associated bacterial taxa. An efficient block-coordinate descent algorithm is developed to solve the optimization problem. We present extensive simulations to demonstrate that the sparse DM regression can result in better identification of the microbiome-associated covariates than models that ignore overdispersion or only consider the proportions. We demonstrate the power of our method in an analysis of a data set evaluating the effects of nutrient intake on human gut microbiome composition. Our results have clearly shown that the nutrient intake is strongly associated with the human gut microbiome.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1365527205_Tue, 09 Apr 2013 13:07 EDT</guid><pubDate>Tue, 09 Apr 2013 13:07 EDT</pubDate></item><item><title>Estimating treatment effect heterogeneity in randomized program evaluation</title><link>http://projecteuclid.org/euclid.aoas/1365527206</link><description>&lt;strong&gt;Kosuke Imai&lt;/strong&gt;, &lt;strong&gt;Marc Ratkovic&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 7, Number 1, 443--470.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
When evaluating the efficacy of social programs and medical treatments using randomized experiments, the estimated overall average causal effect alone is often of limited value and the researchers must investigate when the treatments do and do not work. Indeed, the estimation of treatment effect heterogeneity plays an essential role in (1) selecting the most effective treatment from a large number of available treatments, (2) ascertaining subpopulations for which a treatment is effective or harmful, (3) designing individualized optimal treatment regimes, (4) testing for the existence or lack of heterogeneous treatment effects, and (5) generalizing causal effect estimates obtained from an experimental sample to a target population. In this paper, we formulate the estimation of heterogeneous treatment effects as a variable selection problem. We propose a method that adapts the Support Vector Machine classifier by placing separate sparsity constraints over the pre-treatment parameters and causal heterogeneity parameters of interest. The proposed method is motivated by and applied to two well-known randomized evaluation studies in the social sciences. Our method selects the most effective voter mobilization strategies from a large number of alternative strategies, and it also identifies the characteristics of workers who greatly benefit from (or are negatively affected by) a job training program. In our simulation studies, we find that the proposed method often outperforms some commonly used alternatives.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1365527206_Tue, 09 Apr 2013 13:07 EDT</guid><pubDate>Tue, 09 Apr 2013 13:07 EDT</pubDate></item><item><title>Multiple testing of local maxima for detection of peaks in ChIP-Seq data</title><link>http://projecteuclid.org/euclid.aoas/1365527207</link><description>&lt;strong&gt;Armin Schwartzman&lt;/strong&gt;, &lt;strong&gt;Andrew Jaffe&lt;/strong&gt;, &lt;strong&gt;Yulia Gavrilov&lt;/strong&gt;, &lt;strong&gt;Clifford A. Meyer&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 7, Number 1, 471--494.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
A topological multiple testing approach to peak detection is proposed for the problem of detecting transcription factor binding sites in ChIP-Seq data. After kernel smoothing of the tag counts over the genome, the presence of a peak is tested at each observed local maximum, followed by multiple testing correction at the desired false discovery rate level. Valid $p$-values for candidate peaks are computed via Monte Carlo simulations of smoothed Poisson sequences, whose background Poisson rates are obtained via linear regression from a Control sample at two different scales. The proposed method identifies nearby binding sites that other methods do not.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1365527207_Tue, 09 Apr 2013 13:07 EDT</guid><pubDate>Tue, 09 Apr 2013 13:07 EDT</pubDate></item><item><title>Regression trees for longitudinal and multiresponse data</title><link>http://projecteuclid.org/euclid.aoas/1365527208</link><description>&lt;strong&gt;Wei-Yin Loh&lt;/strong&gt;, &lt;strong&gt;Wei Zheng&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 7, Number 1, 495--522.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
Previous algorithms for constructing regression tree models for longitudinal and multiresponse data have mostly followed the CART approach. Consequently, they inherit the same selection biases and computational difficulties as CART. We propose an alternative, based on the GUIDE approach, that treats each longitudinal data series as a curve and uses chi-squared tests of the residual curve patterns to select a variable to split each node of the tree. Besides being unbiased, the method is applicable to data with fixed and random time points and with missing values in the response or predictor variables. Simulation results comparing its mean squared prediction error with that of MVPART are given, as well as examples comparing it with standard linear mixed effects and generalized estimating equation models. Conditions for asymptotic consistency of regression tree function estimates are also given.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1365527208_Tue, 09 Apr 2013 13:07 EDT</guid><pubDate>Tue, 09 Apr 2013 13:07 EDT</pubDate></item><item><title>Joint and individual variation explained (JIVE) for integrated analysis of multiple data types</title><link>http://projecteuclid.org/euclid.aoas/1365527209</link><description>&lt;strong&gt;Eric F. Lock&lt;/strong&gt;, &lt;strong&gt;Katherine A. Hoadley&lt;/strong&gt;, &lt;strong&gt;J. S. Marron&lt;/strong&gt;, &lt;strong&gt;Andrew B. Nobel&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 7, Number 1, 523--542.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
Research in several fields now requires the analysis of data sets in which multiple high-dimensional types of data are available for a common set of objects. In particular, The Cancer Genome Atlas (TCGA) includes data from several diverse genomic technologies on the same cancerous tumor samples. In this paper we introduce Joint and Individual Variation Explained (JIVE), a general decomposition of variation for the integrated analysis of such data sets. The decomposition consists of three terms: a low-rank approximation capturing joint variation across data types, low-rank approximations for structured variation individual to each data type, and residual noise. JIVE quantifies the amount of joint variation between data types, reduces the dimensionality of the data and provides new directions for the visual exploration of joint and individual structures. The proposed method represents an extension of Principal Component Analysis and has clear advantages over popular two-block methods such as Canonical Correlation Analysis and Partial Least Squares. A JIVE analysis of gene expression and miRNA data on Glioblastoma Multiforme tumor samples reveals gene–miRNA associations and provides better characterization of tumor types.
 
 
Data and software are available at https://genome.unc.edu/jive/.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1365527209_Tue, 09 Apr 2013 13:07 EDT</guid><pubDate>Tue, 09 Apr 2013 13:07 EDT</pubDate></item><item><title>Bayesian semiparametric analysis for two-phase studies of gene-environment interaction</title><link>http://projecteuclid.org/euclid.aoas/1365527210</link><description>&lt;strong&gt;Jaeil Ahn&lt;/strong&gt;, &lt;strong&gt;Bhramar Mukherjee&lt;/strong&gt;, &lt;strong&gt;Stephen B. Gruber&lt;/strong&gt;, &lt;strong&gt;Malay Ghosh&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 7, Number 1, 543--569.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
The two-phase sampling design is a cost-efficient way of collecting expensive covariate information on a judiciously selected subsample. It is natural to apply such a strategy for collecting genetic data in a subsample enriched for exposure to environmental factors for gene-environment interaction ($G\times E$) analysis. In this paper, we consider two-phase studies of $G\times E$ interaction where phase I data are available on exposure, covariates and disease status. Stratified sampling is done to prioritize individuals for genotyping at phase II conditional on disease and exposure. We consider a Bayesian analysis based on the joint retrospective likelihood of phase I and phase II data. We address several important statistical issues: (i) we consider a model with multiple genes, environmental factors and their pairwise interactions, and employ a Bayesian variable selection algorithm to reduce the dimensionality of this potentially high-dimensional model; (ii) we use the assumption of gene-gene and gene-environment independence to trade off between bias and efficiency for estimating the interaction parameters through use of hierarchical priors reflecting this assumption; (iii) we posit a flexible model for the joint distribution of the phase I categorical variables using the nonparametric Bayes construction of Dunson and Xing [J. Amer. Statist. Assoc. 104 (2009) 1042–1051]. We carry out a small-scale simulation study to compare the proposed Bayesian method with weighted likelihood and pseudo-likelihood methods that are standard choices for analyzing two-phase data. The motivating example originates from an ongoing case-control study of colorectal cancer, where the goal is to explore the interaction between the use of statins (drugs used for lowering lipid levels) and 294 genetic markers in the lipid metabolism/cholesterol synthesis pathway. The subsample of cases and controls on which these genetic markers were measured is enriched in terms of statin users. The example and simulation results illustrate that the proposed Bayesian approach has a number of advantages for characterizing joint effects of genotype and exposure over existing alternatives and makes efficient use of all available data in both phases.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1365527210_Tue, 09 Apr 2013 13:07 EDT</guid><pubDate>Tue, 09 Apr 2013 13:07 EDT</pubDate></item><item><title>Canonical correlation analysis between time series and static outcomes, with application to the spectral analysis of heart rate variability</title><link>http://projecteuclid.org/euclid.aoas/1365527211</link><description>&lt;strong&gt;Robert T. Krafty&lt;/strong&gt;, &lt;strong&gt;Martica Hall&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 7, Number 1, 570--587.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
Although many studies collect biomedical time series signals from multiple subjects, there is a dearth of models and methods for assessing the association between frequency domain properties of time series and other study outcomes. This article introduces the random Cramér representation as a joint model for collections of time series and static outcomes where power spectra are random functions that are correlated with the outcomes. A canonical correlation analysis between cepstral coefficients and static outcomes is developed to provide a flexible yet interpretable measure of association. Estimates of the canonical correlations and weight functions are obtained from a canonical correlation analysis between the static outcomes and maximum Whittle likelihood estimates of truncated cepstral coefficients. The proposed methodology is used to analyze the association between the spectrum of heart rate variability and measures of sleep duration and fragmentation in a study of older adults who serve as the primary caregiver for their ill spouse.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1365527211_Tue, 09 Apr 2013 13:07 EDT</guid><pubDate>Tue, 09 Apr 2013 13:07 EDT</pubDate></item><item><title>Daily minimum and maximum temperature simulation over complex terrain</title><link>http://projecteuclid.org/euclid.aoas/1365527212</link><description>&lt;strong&gt;William Kleiber&lt;/strong&gt;, &lt;strong&gt;Richard W. Katz&lt;/strong&gt;, &lt;strong&gt;Balaji Rajagopalan&lt;/strong&gt;&lt;p&gt;&lt;strong&gt;Source: &lt;/strong&gt;Ann. Appl. Stat., Volume 7, Number 1, 588--612.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br/&gt; 
 
Spatiotemporal simulation of minimum and maximum temperature is a fundamental requirement for climate impact studies and hydrological or agricultural models. Particularly over regions with variable orography, these simulations are difficult to produce due to terrain-driven nonstationarity. We develop a bivariate stochastic model for the spatiotemporal field of minimum and maximum temperature. The proposed framework splits the bivariate field into two components of “local climate” and “weather.” The local climate component is a linear model with spatially varying process coefficients capturing the annual cycle and yielding local climate estimates at all locations, not only those within the observation network. The weather component spatially correlates the bivariate simulations, whose matrix-valued covariance function we estimate using a nonparametric kernel smoother that retains nonnegative definiteness and allows for substantial nonstationarity across the simulation domain. The statistical model is augmented with a spatially varying nugget effect to allow for locally varying small-scale variability. Our model is applied to a daily temperature data set covering the complex terrain of Colorado, USA, and successfully accommodates substantial temporally varying nonstationarity in both the direct-covariance and cross-covariance functions.
 
 &lt;/p&gt;</description><guid isPermaLink="false">projecteuclid.org/euclid.aoas/1365527212_Tue, 09 Apr 2013 13:07 EDT</guid><pubDate>Tue, 09 Apr 2013 13:07 EDT</pubDate></item></channel>
</rss>
