Molecular biology has posed a number of fascinating and sometimes daunting computational problems, which came naturally expressed in its native language of character strings. Through the years, some such problems have found elegant and even useful solutions in response to the needs that originally motivated them. What is perhaps even more remarkable, several of the ideas inspired by computational molecular biology have found application in remote and diverse domains, so that it may be argued that molecular biology did more for computing than the latter did for it. As a modest tribute, this paper reviews a small sample of these cases drawing from the personal exposure of the author.
Single Nucleotide Polymorphisms (SNPs) are common among human populations. SNPs that are proximally located within a small human chromosome region are generally so strongly correlated that a subset of SNPs, termed tag SNPs, can provide enough information to infer neighboring SNPs. Such correlations are generally known as linkage disequilibrium (LD) and are measured either pair-wise, such as by $r^2$, or multi-to-one (multi-marker). For any given set of SNPs, a variety of algorithms have been proposed to identify a subset of tag SNPs from which the remaining SNPs can be inferred. This paper focuses on finding the minimum number of tag SNPs from which the remaining SNPs can be inferred through multi-allelic LD or pair-wise LD with a pre-defined $r^2$ threshold. We call this the optimal tag SNP selection problem. Although this problem is theoretically NP-hard, it can be formulated as an integer programming (IP) problem under a certain constraint, and the optimal solution can be efficiently found by our newly developed IPMarker program. In addition, the flexibility of the computational framework allows us to formulate and solve the problem of finding common tag SNPs for multiple populations that have different LD patterns. Various datasets, including ENCODE and the Major Histocompatibility Complex (MHC) region, were used to evaluate the performance of IPMarker. We also extended IPMarker to the whole-genome HapMap Phase I data. Results showed that IPMarker significantly reduces the number of tag SNPs required when compared to the most widely used program, Haploview, although a significantly longer running time is required. Thus, overall, genotyping a selected set of tag SNPs is the most cost-effective way to conduct large-scale genome-wide association studies.
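To make the pair-wise version of the tag SNP selection problem concrete, the following is a minimal sketch in Python. It is not the IPMarker integer program described in the abstract: it assumes phased biallelic haplotypes coded as 0/1, uses the standard pair-wise $r^2$ statistic, and finds a smallest tag set by exhaustive search, which is only feasible for a handful of SNPs. The function names `r_squared` and `min_tag_snps`, the toy haplotype matrix, and the default threshold of 0.8 are all illustrative assumptions.

```python
from itertools import combinations


def r_squared(a, b):
    """Pair-wise r^2 between two biallelic SNPs given as 0/1 allele lists."""
    n = len(a)
    pa = sum(a) / n                                  # frequency of allele 1 at SNP a
    pb = sum(b) / n                                  # frequency of allele 1 at SNP b
    pab = sum(x & y for x, y in zip(a, b)) / n       # frequency of the 1-1 haplotype
    denom = pa * (1 - pa) * pb * (1 - pb)
    if denom == 0:                                   # a monomorphic SNP carries no LD signal
        return 0.0
    d = pab - pa * pb                                # LD coefficient D
    return d * d / denom


def min_tag_snps(haplotypes, threshold=0.8):
    """Smallest set of SNP indices such that every SNP is either a tag or has
    r^2 >= threshold with some tag. Exhaustive search (assumption: tiny input)."""
    m = len(haplotypes[0])
    cols = [[row[j] for row in haplotypes] for j in range(m)]
    # covers[j] = SNPs that SNP j can stand in for, including itself
    covers = [
        {k for k in range(m) if k == j or r_squared(cols[j], cols[k]) >= threshold}
        for j in range(m)
    ]
    for size in range(1, m + 1):
        for subset in combinations(range(m), size):
            covered = set().union(*(covers[j] for j in subset))
            if len(covered) == m:
                return list(subset)
    return list(range(m))


# Hypothetical toy data: rows are haplotypes, columns are SNPs.
# SNP 0 and SNP 1 are identical; SNP 2 and SNP 3 are perfectly anti-correlated.
haplotypes = [
    [0, 0, 0, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
]
tags = min_tag_snps(haplotypes)
print(tags)
```

In this toy example two tags suffice, one from each correlated pair. An integer programming formulation would express the same covering requirement with one binary variable per SNP and one constraint per SNP, letting a solver replace the exhaustive loop.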
In low- and medium-budget association studies, a limited number of tag SNPs are selected out of a large set of available SNPs previously typed in an initial cohort. These tag SNPs are then typed in a larger set of control and affected individuals. Current association studies pick the set of tag SNPs based on the correlation criterion. Here we show that association studies that use tag SNPs selected according to their imputation accuracy are more powerful than those relying on tag SNPs selected by the correlation criterion. The advantage is particularly striking when the set of tag SNPs is sparse; thus, picking tag SNPs to maximize the imputation accuracy will increase the effectiveness of future association studies without additional cost.
Patterns of locomotor activity of a freely moving organism can help characterize its behavioral phenotypes. To infer behavior from such activity in Drosophila melanogaster, we use a real-time image acquisition system to track the movement of multiple flies in three dimensions. When dealing with fly movement trajectories, we must take into account that similar movement patterns can be expressed in different orientations and speeds. In this paper, we present methods to transform the three-dimensional fly movement trajectories into a space that is translation, rotation, scale and timescale invariant. We then propose an approach motivated by sequence alignment to detect similar movement patterns from fly trajectories in order to infer specific behaviors. We demonstrate the accuracy of the methods and highlight their usefulness in studies aimed at characterizing behavioral phenotypes.
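One simple way to obtain a trajectory representation with the invariances listed above can be sketched as follows. This is an illustrative assumption, not the transformation used in the paper: the trajectory is resampled at equal arc-length intervals (removing the dependence on speed and sampling rate), and then described by the turning angles between successive segments, which do not change under translation, rotation, or uniform scaling. The function names `resample_by_arc_length`, `turning_angles`, and `invariant_signature` are hypothetical.

```python
import math


def resample_by_arc_length(points, n):
    """Return n points (n >= 2) evenly spaced along the polyline's arc length."""
    cum = [0.0]                                      # cumulative length at each vertex
    for p, q in zip(points, points[1:]):
        cum.append(cum[-1] + math.dist(p, q))
    total = cum[-1]
    out, seg = [], 0
    for i in range(n):
        target = total * i / (n - 1)
        # advance to the segment that contains the target arc length
        while seg < len(points) - 2 and cum[seg + 1] < target:
            seg += 1
        span = cum[seg + 1] - cum[seg]
        t = 0.0 if span == 0 else (target - cum[seg]) / span
        p, q = points[seg], points[seg + 1]
        out.append(tuple(a + t * (b - a) for a, b in zip(p, q)))
    return out


def turning_angles(points):
    """Angle (radians) between each pair of consecutive segments."""
    angles = []
    for p, q, r in zip(points, points[1:], points[2:]):
        u = [b - a for a, b in zip(p, q)]
        v = [b - a for a, b in zip(q, r)]
        nu = math.sqrt(sum(x * x for x in u))
        nv = math.sqrt(sum(x * x for x in v))
        if nu == 0 or nv == 0:                       # degenerate segment: no direction
            angles.append(0.0)
            continue
        c = sum(x * y for x, y in zip(u, v)) / (nu * nv)
        angles.append(math.acos(max(-1.0, min(1.0, c))))
    return angles


def invariant_signature(points, n=16):
    """Turning-angle sequence of the arc-length-resampled trajectory."""
    return turning_angles(resample_by_arc_length(points, n))
```

Two trajectories reduced to such angle sequences can then be compared element by element or with an alignment-style dynamic program. Note that turning angles alone discard the sense of twisting in three dimensions, so a fuller descriptor would need an additional quantity such as a signed torsion-like term.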