Since REPs are so common (REP clusters make up about 0.6% of the E. coli genome), it should be possible to find them without using a seed--just from looking at the database itself and trying to find repeated patterns.
An attempt was made to concentrate the EcoSeq6 database by
building a simple order-8 Markov model (with zero-offset and
complement blurring both set to 1 and no neighbor blurring) from
the entire database, then using the model to search the database for
sequences that compressed significantly better than average (see the
EcoSeq6c line in Table 3).
The resulting model found about 45% of the REP sequences, but only
about 5.6% of the bases found were in a REP cluster. Growing the set
of sequences by repeated expansion increased the fraction of REP
clusters found to about 79% (EcoSeq6c-gxn in Table 3), but still only
about 7.4% of the
bases were in REP clusters. The problem is that there are some much
larger repeated sequences, particularly the numerous IS sequences, and
the repeated expansion process is looking simultaneously for REPs and
IS sequences.
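The concentration step can be illustrated with a minimal sketch. It is not the software used for the experiments: the function names are hypothetical, "zero-offset" is assumed to act as a pseudocount added to every count, "complement blurring" is assumed to mean that the reverse-complement strand is counted with the given weight, and "significantly better than average" is approximated by a cutoff a chosen number of standard deviations below the mean cost.

```python
import math
import statistics
from collections import defaultdict

ALPHABET = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def build_model(sequences, order=8, zero_offset=1.0, complement_weight=1.0):
    """Count which base follows each length-`order` context.

    Assumption: complement blurring is modelled by also counting the
    reverse-complement strand with weight `complement_weight`.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        strands = ((seq, 1.0), (reverse_complement(seq), complement_weight))
        for strand, weight in strands:
            for i in range(order, len(strand)):
                counts[strand[i - order:i]][strand[i]] += weight
    return {"order": order, "zero_offset": zero_offset, "counts": counts}

def bits_per_base(model, seq):
    """Average encoding cost of `seq` under the model, in bits per base."""
    order = model["order"]
    zero_offset = model["zero_offset"]
    counts = model["counts"]
    if len(seq) <= order:
        return 2.0  # nothing to predict; fall back to the uniform cost
    total_bits = 0.0
    for i in range(order, len(seq)):
        context = counts.get(seq[i - order:i], {})
        # Assumption: zero-offset is a pseudocount added to each base count.
        denominator = sum(context.values()) + zero_offset * len(ALPHABET)
        probability = (context.get(seq[i], 0.0) + zero_offset) / denominator
        total_bits -= math.log2(probability)
    return total_bits / (len(seq) - order)

def concentrate(sequences, order=8, n_sigma=1.0):
    """Return indices of sequences that cost fewer bits than average.

    Assumption: the cutoff is `n_sigma` population standard deviations
    below the mean cost over all sequences.
    """
    model = build_model(sequences, order=order)
    costs = [bits_per_base(model, seq) for seq in sequences]
    cutoff = statistics.mean(costs) - n_sigma * statistics.pstdev(costs)
    return [i for i, cost in enumerate(costs) if cost < cutoff]
```

Because the model is trained on the same database it then scores, any sufficiently repeated element compresses well, whether it is a REP or a much longer IS sequence, which is consistent with the low fraction of bases that fall in REP clusters.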
If each of the sequences in the ``concentrated'' file EcoSeq6c is
individually used as a seed that is grown by repeated expansion, we
get many different sets of sequences.
Most of the sets of sequences are clearly
identifiable (REPs, IS2, IS5, ... ). If we look just at the sets in
which one or more REPs are found, we find very similar coverage of
the REPs (21-27 REPs missed), no matter which seed is used (see
Table 7). Although REP99 and REP106 were originally chosen as seeds
for Table 3 because they were the biggest known REP clusters, it does
not seem necessary to start with them--almost any REP cluster found
by concentrating the database works about as well.
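The two measures used throughout this comparison, the number of known REPs missed and the fraction of found bases that lie in a REP cluster, can be computed as in the sketch below. The representation is an assumption made for illustration: hits and known REPs are half-open (start, end) intervals on a single coordinate axis, intervals within each list do not overlap one another, and the seed names in the usage example are hypothetical.

```python
def overlap_length(a, b):
    """Number of positions shared by two half-open intervals (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def summarize(found, known_reps):
    """Score one search result against the known REP annotations.

    Assumes the intervals in `found` are mutually disjoint, and likewise
    for `known_reps`, so that overlaps are never counted twice.
    """
    missed = sum(
        1 for rep in known_reps
        if all(overlap_length(rep, hit) == 0 for hit in found)
    )
    found_bases = sum(end - start for start, end in found)
    rep_bases = sum(
        overlap_length(hit, rep) for hit in found for rep in known_reps
    )
    return {
        "reps_missed": missed,
        "fraction_reps_found": 1 - missed / len(known_reps) if known_reps else 0.0,
        "fraction_bases_in_rep": rep_bases / found_bases if found_bases else 0.0,
    }

def rank_seeds(runs, known_reps):
    """Sort seeds by the number of known REPs their expansion missed."""
    scored = [(seed, summarize(found, known_reps)["reps_missed"])
              for seed, found in runs.items()]
    return sorted(scored, key=lambda item: item[1])
```

For example, `rank_seeds({"seedA": hits_a, "seedB": hits_b}, known_reps)` returns the seeds ordered by missed REPs, the same ordering used to present the per-seed results.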
Table 7: Search results starting with each sequence from the
``concentrated'' file EcoSeq6c as a seed for repeated expansion. Only
those searches that found at least one REP are reported here.
The results are sorted by the number of known REPs that they missed.