Abstract
Studies of inhomogeneities in long DNA sequences can give insight into the organization of the human genome (or any genome). Questions about the spacings of a marker array, together with general issues of sequence heterogeneity arising in our studies of DNA and protein sequences, led us to statistical considerations of $r$-scan lengths: the distances between marker $i$ and marker $i+r$, for $i=1,2,3,\ldots$. It is of interest to characterize the $r$-scan lengths that harbor clusters or that indicate regions of over-dispersion of the markers along the sequence. Applications are reviewed for certain words in the Haemophilus genome and the Cyanobacter genome.
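As an illustration of the definition only, the $r$-scan lengths of a marker array can be computed directly from the marker positions. The sketch below is not taken from the paper: the function name `r_scan_lengths` and the toy positions are hypothetical, and it assumes that markers are given as integer coordinates along a single linear sequence.

```python
from typing import List, Sequence


def r_scan_lengths(positions: Sequence[int], r: int) -> List[int]:
    """Return the distances between marker i and marker i+r for every valid i.

    Assumption: `positions` holds marker coordinates on one linear sequence;
    they are sorted here so that "marker i" means the i-th marker in order.
    """
    if r < 1:
        raise ValueError("r must be a positive integer")
    ordered = sorted(positions)
    # One r-scan length per marker that still has r markers to its right.
    return [ordered[i + r] - ordered[i] for i in range(len(ordered) - r)]


# Hypothetical marker positions, chosen only to illustrate the calculation.
markers = [3, 10, 12, 40, 41, 43, 90]
scans = r_scan_lengths(markers, r=2)
print(scans)       # [9, 30, 29, 3, 49]
print(min(scans))  # 3
print(max(scans))  # 49
```

In this toy array the smallest 2-scan length (3) comes from the tightly spaced markers at 40, 41 and 43, while the largest (49) spans the long gap before the marker at 90. Whether such values are statistically significant depends on the null model for marker placement, which this sketch does not address.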
Information
Digital Object Identifier: 10.1214/lnms/1196285397