## The Annals of Statistics

- Ann. Statist.
- Volume 22, Number 3 (1994), 1603-1629.

### A Poisson Approximation for Sequence Comparisons with Insertions and Deletions

#### Abstract

We construct a statistical test for a sequence alignment problem which enables us to decide whether two given sequences are related. Such a test can be used in DNA and protein sequence comparisons. It is based on a comparison of two long sequences of i.i.d. letters taken from a finite alphabet. The test statistic typically employed is the length of the longest matching region between the two sequences in which a certain number of insertions and deletions but no mismatches are allowed. We give a distributional result which enables one to compute $P$-values, and hence to decide whether or not the two sequences are related. Its proof utilizes the Chen-Stein method for Poisson approximation. The test is based on a greedy algorithm that searches for the longest matching region. We show that this algorithm finds the longest matching region with probability approaching 1 as the lengths of the two sequences go to infinity.
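As a rough illustration of the kind of statistic described above, the following sketch computes the largest number of matched letters in a region of two sequences aligned with at most `k` insertions and deletions and no mismatches. It is an exhaustive dynamic program, not the greedy algorithm analyzed in the paper, and it rests on assumptions not stated in the abstract: each skipped letter counts as one indel, and the region's length is measured by the number of matched letters. The function name `longest_match_with_indels` is hypothetical.

```python
def longest_match_with_indels(a: str, b: str, k: int) -> int:
    """Most matched letters in a region of a and b aligned with at most
    k skipped letters (indels) and no mismatches. Illustrative sketch only."""
    m, n = len(a), len(b)
    # best[e][i][j]: most matches in an alignment whose last column pairs
    # a[i-1] with b[j-1], using at most e indels; 0 if those letters differ.
    best = [[[0] * (n + 1) for _ in range(m + 1)] for _ in range(k + 1)]
    answer = 0
    for e in range(k + 1):
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                if a[i - 1] != b[j - 1]:
                    continue  # mismatches are not allowed inside a region
                prev = 0  # 0 means the region starts at this match
                for p in range(e + 1):          # letters skipped in a
                    for q in range(e + 1 - p):  # letters skipped in b
                        ii, jj = i - 1 - p, j - 1 - q
                        if ii >= 1 and jj >= 1:
                            prev = max(prev, best[e - p - q][ii][jj])
                best[e][i][j] = 1 + prev
                answer = max(answer, best[e][i][j])
    return answer


if __name__ == "__main__":
    # With no indels this reduces to the longest common substring.
    print(longest_match_with_indels("ACGTT", "ACTT", 0))
    # Allowing one indel lets the region skip the extra letter.
    print(longest_match_with_indels("ACGTT", "ACTT", 1))
```

The run time grows as the product of the two sequence lengths times a polynomial in `k`, so this is only suitable for short inputs. Turning the resulting value into a $P$-value requires the distributional result from the paper, which the abstract does not state and which is therefore not reproduced here.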

#### Article information

**Source**

Ann. Statist., Volume 22, Number 3 (1994), 1603-1629.

**Dates**

First available in Project Euclid: 11 April 2007

**Permanent link to this document**

https://projecteuclid.org/euclid.aos/1176325645

**Digital Object Identifier**

doi:10.1214/aos/1176325645

**Mathematical Reviews number (MathSciNet)**

MR1311992

**Zentralblatt MATH identifier**

0817.62013

**Subjects**

Primary: 62F05: Asymptotic properties of tests

Secondary: 92D20: Protein sequences, DNA sequences

**Keywords**

Chen-Stein method; sequence matching; Poisson approximation; DNA sequences; greedy algorithm

#### Citation

Neuhauser, Claudia. A Poisson Approximation for Sequence Comparisons with Insertions and Deletions. Ann. Statist. 22 (1994), no. 3, 1603-1629. doi:10.1214/aos/1176325645. https://projecteuclid.org/euclid.aos/1176325645