One of the two largest REP clusters (REP99 or REP106) was chosen as a seed sequence and an order-8 simple Markov model was constructed from it. Since we are looking for the sequences on either strand of the DNA and a large part of what we are looking for is palindromic, the complement blurring parameter was set to 1.
The zero offset and neighbor blurring parameters were chosen arbitrarily, but near values that had given optimal adaptive compression for sets of similar sequences. The zero-offset was set to 0.1, and the neighbor blurring parameters were 0.05 (for A<->G and C<->T) and 0.01 for the other substitutions.
The entire EcoSeq6 database was searched for sequences whose cost using the models was significantly better than 1.99 bits/base. This expand step was repeated using the result as a seed for a new model until the number of sequences found no longer increased (5 expansions at 36 seconds each on a SparcStation 10). The final results were called REP99-gxn and REP106-gxn. These both did an excellent job of finding REPs, with 90-91% of the previously identified REP sequences found [13]. All of the sequences overlapped with a known REP sequence, and only 18-20 REP clusters were missed, all of them short clusters with a single REP in them.
A smaller seed was also tried--a single copy of the REPv consensus.
Growing by repeated expansion from this seed found essentially the
same set of REPs.
The statistics for all three searches are listed in the first three
lines of Table 3
on page
.
To check the sensitivity to the model parameters, the repeated
expansion from REP99 was done with three more sets of parameters,
chosen to give optimal adaptive coding of the REP99-gxn sequences for
order-6, order-8, and order-10 Markov models (lines REP99-gx6,
REP99-gx8, and REP99-gx10 in Table 3
on
page
).
The ``extra'' sequences found by the order-8 model are just regions of
unknown bases in EcoSeq6.
Since the program converts these unknowns arbitrarily to sequences of
As, once a piece of one of them gets into the seed, the entire region of
unknowns will compress very well.
The order-10 model does poorly, probably because any base substitution
will disrupt the compression for 11 bases, not just 7 or 9 as with the
smaller models.
The neighbor blurring can compensate somewhat for single errors in the
11-word, but not for two errors.
An experiment was also done using a significance threshold that varied
with the square-root of sequence length, rather than being fixed, as
described in Section 1.2
.
The standard deviation of the encoding cost per base was computed for
the entire EcoSeq6 database, and the threshold set arbitrarily at five
standard deviations plus three bits (the extra three bits was to
prevent too many very short sequences from being falsely reported).
Line REP99-gxa in Table 3
reports the results of
repeatedly growing the set of sequences with this criterion.
Over 30 iterations were required before the process converged, as the
standard deviation was recomputed for each iteration, and the
sequences found varied slightly as the threshold changed.
The huge number of false negatives were almost all from the regions of
unknown characters, and weeding these out produced a fairly clean set
of REPs (REP99-gxawa).
Using this set as a seed for the standard repeated expansion search
(using a threshold of 27.483 bits) produced REP99-gxawa-gxn, which
finds one more REP than REP99-gxn.
Of the three ``false +'' sequences, one is also found by most of the
hidden Markov models, and is probably a genuinely related sequence not
on Rudd's list (3660932 3660963 uspAeco+846).
Table 3:
Summary of using various models to search for REP clusters in EcoSeq6.
The results of the search were compared with a list of 244 known
sequences grouped into 128 clusters [13].
The table reports the number of sequences (or clusters) on the list
that do not overlap with any sequence found (false -), and the
number of sequences found that do not overlap with any on the list
(false +).