next up previous
Next: Using compression models to Up: Using Markov Models and



karplus@cse.ucsc.edu

UCSC-CRL-94-24

Abstract:

This paper presents a technique for using simple Markov models and hidden Markov models (HMMs) to search for interesting sequences in a database of DNA sequences. The models are used to create a cost map for each sequence in the database. These cost maps can be searched rapidly for subsequences that have significantly lower costs than a null model. Milosavljevic's algorithmic significance test is used to determine when a subsequence is significantly found. The sequences reported are trimmed to maximize the signal-to-noise ratio (cost savings /tex2html_wrap_inline13 ).

Methods are given for automatically constructing simple Markov models and hidden Markov models from small training sets.

The techniques are illustrated by searching a database of E. coli genomic DNA, EcoSeq6, for clusters of Repetitive Extragenic Palindromic sequences (REPs). Of the known REPs, 91% are found with simple Markov models starting with a single REP cluster as a seed, and 95% are found by a hidden Markov model built from the results of the simple Markov model search. There are no false positives from the simple Markov models, and the few extra sequences found by the HMMs may be genuinely related sequences.


Using Markov Models and Hidden Markov Models to Find Repetitive Extragenic Palindromic Sequences in Escherichia coli

Kevin Karplus

26 July 1994

This paper describes a technique for using models suitable for data compression as a tool for finding interesting sequences in a database. The technique does not require that the structure or consensus sequence for the target sequences be known in advance--only that an example sequence be provided.

The paper presents an efficient algorithm for the search, methods for constructing the models automatically, and results of finding clusters of Repetitive Extragenic Palindromic sequences (REPs) in the E. coli genome database, EcoSeq6 [12]. It compares the set of REPs found with previous lists [7, 2, 13]. The new search techniques do as well as the best of previous techniques (self-BLAST), finding about 95% of the known REPs. Furthermore, the hidden Markov model produced for the search provides a good, human-readable description of the structure of the REP family, clearly distinguishing the two main REP sequences and the REPv variant.

Although the technique could be used for sequences over any alphabet, the software has only been implemented for DNA or RNA sequences so far.




next up previous
Next: Using compression models to Up: Using Markov Models and

Rey Rivera
Thu Aug 22 14:04:06 PDT 1996