
Looking for repeated elements without a seed

 

Since REPs are so common (REP clusters make up about 0.6% of the E. coli genome), it should be possible to find them without using a seed, simply by examining the database itself for repeated patterns.

An attempt was made to concentrate the EcoSeq6 database by building a simple order-8 Markov model (with zero-offset and complement blurring both set to 1 and no neighbor blurring) from the entire database, then using the model to search the database for sequences that compressed significantly better than average (see the EcoSeq6c line in Table 3).
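The concentration step can be illustrated with a minimal Python sketch. It is not the program used for the experiments: it omits zero-offset, complement, and neighbor blurring, uses add-one smoothing (an assumption), and runs at order 4 on a small synthetic database rather than at order 8 on EcoSeq6. The function names, the `margin` threshold, and the made-up repeat in `toy_database` are all hypothetical.

```python
import math
import random
from collections import defaultdict

ALPHABET = "ACGT"

def train_markov(seqs, order):
    """Count how often each base follows each context of length `order`."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1
    return counts

def bits_per_base(seq, counts, order, pseudo=1.0):
    """Average encoding cost of `seq` in bits per scored base.

    The first `order` bases only supply context and are not scored.
    Add-`pseudo` smoothing is an assumption of this sketch.
    """
    total, scored = 0.0, 0
    for i in range(order, len(seq)):
        ctx = counts.get(seq[i - order:i], {})
        n = sum(ctx.values()) + pseudo * len(ALPHABET)
        total -= math.log2((ctx.get(seq[i], 0) + pseudo) / n)
        scored += 1
    return total / scored if scored else float("nan")

def concentrate(seqs, order, window, margin):
    """Return (seq_index, start, bits) for every window that costs at
    least `margin` bits per base less than the average window."""
    counts = train_markov(seqs, order)  # model built from the whole database
    scored = [(j, start, bits_per_base(s[start:start + window], counts, order))
              for j, s in enumerate(seqs)
              for start in range(len(s) - window + 1)]
    mean = sum(b for _, _, b in scored) / len(scored)
    return [(j, start, b) for j, start, b in scored if b < mean - margin]

def toy_database(n_seqs=20, flank=100, seed=0):
    """Random sequences that each carry one copy of a made-up repeat."""
    rng = random.Random(seed)
    motif = "GCCGGATGCGGCGTGAACGCCTTATCCGGCCTAC"  # hypothetical, not a real REP
    def noise(n):
        return "".join(rng.choice(ALPHABET) for _ in range(n))
    return [noise(flank) + motif + noise(flank) for _ in range(n_seqs)], motif

db, motif = toy_database()
hits = concentrate(db, order=4, window=len(motif), margin=0.5)
```

Because the repeat occurs in every sequence, its contexts predict the next base far better than random background does, so only windows overlapping the repeat fall below the threshold.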

The resulting model found about 45% of the REP sequences, but only about 5.6% of the bases found were in a REP cluster. Growing the set of sequences by repeated expansion increased the fraction of REP clusters found to about 79% (EcoSeq6c-gxn in Table 3), but still only about 7.4% of the bases were in REP clusters. The problem is that there are some much larger repeated sequences, particularly the numerous IS sequences, and the repeated expansion process is looking simultaneously for REPs and IS sequences.
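The two percentages quoted above measure different things: the fraction of known REPs that the search touched, and the fraction of reported bases that lie inside REP clusters. A minimal sketch of how such figures could be computed is given here. It assumes that a REP counts as found when any reported interval overlaps it and that all intervals lie on a single coordinate axis; neither assumption is taken from the experiments, and `coverage_stats` is a hypothetical name.

```python
def coverage_stats(known, found):
    """known, found: lists of half-open (start, end) intervals.

    Returns (fraction of known intervals overlapped by some found interval,
             fraction of found bases that lie inside a known interval).
    """
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    hit = sum(1 for k in known if any(overlaps(k, f) for f in found))
    known_bases = {p for s, e in known for p in range(s, e)}
    found_bases = {p for s, e in found for p in range(s, e)}
    frac_found = hit / len(known) if known else 0.0
    frac_inside = (len(found_bases & known_bases) / len(found_bases)
                   if found_bases else 0.0)
    return frac_found, frac_inside

# One of two known elements is touched, but a long unrelated hit
# dominates the reported bases.
frac_found, frac_inside = coverage_stats([(10, 20), (50, 60)],
                                         [(15, 25), (100, 200)])
```

A single long hit outside the known elements, such as an IS sequence, drives the second fraction down even when the first is high, which is the pattern seen in the results above.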

If each of the sequences in the "concentrated" file EcoSeq6c is individually used as a seed that is grown by repeated expansion, we get many different sets of sequences. Most of the sets of sequences are clearly identifiable (REPs, IS2, IS5, ... ). If we look just at the sets in which one or more REPs are found, we find very similar coverage of the REPs (21-27 REPs missed), no matter which seed is used (see Table 7). Although REP99 and REP106 were originally chosen as seeds for Table 3 because they were the biggest known REP clusters, it does not seem necessary to start with them: almost any REP cluster found by concentrating the database works about as well.
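The observation that different seeds from the same family converge on nearly the same set can be illustrated with a toy fixed-point loop. This sketch swaps in a much cruder similarity test than the Markov-model search used in the experiments: a sequence joins the set when a sufficient fraction of its k-mers already occurs in the set. The function names, the parameters `k` and `min_shared`, and the five short sequences are all hypothetical.

```python
def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def grow(seed_idx, seqs, k=6, min_shared=0.3, max_rounds=20):
    """Repeatedly expand the set {seed} until no new sequence qualifies."""
    members = {seed_idx}
    for _ in range(max_rounds):
        # Rebuild the "model" from everything collected so far.
        model = set().union(*(kmers(seqs[i], k) for i in members))
        added = set()
        for i, s in enumerate(seqs):
            if i in members:
                continue
            ks = kmers(s, k)
            if ks and len(ks & model) / len(ks) >= min_shared:
                added.add(i)
        if not added:
            break
        members |= added
    return frozenset(members)

# Two made-up families; sequence 2 resembles 1 but barely resembles 0,
# so it is reached from seed 0 only in the second round.
seqs = [
    "TTGACCGGATGCGGCGTGAAC",
    "GGATGCGGCGTGAACGCCTTA",
    "GTGAACGCCTTATCCGGCCTA",
    "ATATATTTAAACCCATTTAGG",
    "TTTAAACCCATTTAGGCACAC",
]
families = {}
for seed in range(len(seqs)):
    families.setdefault(grow(seed, seqs), []).append(seed)
```

In this toy data every seed ends in one of two sets, and seeds from the same family reach the same set regardless of where the expansion starts.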

  


Table 7: Search results starting with each sequence from the "concentrated" file EcoSeq6c as a seed for repeated expansion. Only those searches that found at least one REP are reported here. The results are sorted by the number of known REPs that they missed.



Rey Rivera
Thu Aug 22 14:04:06 PDT 1996