next up previous
Next: Using hidden Markov models Up: Looking for REPs using Previous: Looking for REPs using

Using simple Markov models

 

One of the two largest REP clusters (REP99 or REP106) was chosen as a seed sequence and an order-8 simple Markov model was constructed from it. Since we are looking for the sequences on either strand of the DNA and a large part of what we are looking for is palindromic, the complement blurring parameter was set to 1.

The zero offset and neighbor blurring parameters were chosen arbitrarily, but near values that had given optimal adaptive compression for sets of similar sequences. The zero-offset was set to 0.1, and the neighbor blurring parameters were 0.05 (for A<->G and C<->T) and 0.01 for the other substitutions.

The entire EcoSeq6 database was searched for sequences whose cost using the models was significantly better than 1.99 bits/base. This expand step was repeated using the result as a seed for a new model until the number of sequences found no longer increased (5 expansions at 36 seconds each on a SparcStation 10). The final results were called REP99-gxn and REP106-gxn. These both did an excellent job of finding REPs, with 90-91% of the previously identified REP sequences found [13]. All of the sequences overlapped with a known REP sequence, and only 18-20 REP clusters were missed, all of them short clusters with a single REP in them.

A smaller seed was also tried--a single copy of the REPv consensus. Growing by repeated expansion from this seed found essentially the same set of REPs. The statistics for all three searches are listed in the first three lines of Table 3  gif on page gif.

To check the sensitivity to the model parameters, the repeated expansion from REP99 was done with three more sets of parameters, chosen to give optimal adaptive coding of the REP99-gxn sequences for order-6, order-8, and order-10 Markov models (lines REP99-gx6, REP99-gx8, and REP99-gx10 in Table 3  gif on page gif). The ``extra'' sequences found by the order-8 model are just regions of unknown bases in EcoSeq6. Since the program converts these unknowns arbitrarily to sequences of As, once a piece of one of them gets into the seed, the entire region of unknowns will compress very well. The order-10 model does poorly, probably because any base substitution will disrupt the compression for 11 bases, not just 7 or 9 as with the smaller models. The neighbor blurring can compensate somewhat for single errors in the 11-word, but not for two errors.

An experiment was also done using a significance threshold that varied with the square-root of sequence length, rather than being fixed, as described in Section 1.2  gif. The standard deviation of the encoding cost per base was computed for the entire EcoSeq6 database, and the threshold set arbitrarily at five standard deviations plus three bits (the extra three bits was to prevent too many very short sequences from being falsely reported). Line REP99-gxa in Table 3  gif reports the results of repeatedly growing the set of sequences with this criterion. Over 30 iterations were required before the process converged, as the standard deviation was recomputed for each iteration, and the sequences found varied slightly as the threshold changed. The huge number of false negatives were almost all from the regions of unknown characters, and weeding these out produced a fairly clean set of REPs (REP99-gxawa). Using this set as a seed for the standard repeated expansion search (using a threshold of 27.483 bits) produced REP99-gxawa-gxn, which finds one more REP than REP99-gxn. Of the three ``false +'' sequences, one is also found by most of the hidden Markov models, and is probably a genuinely related sequence not on Rudd's list (3660932 3660963 uspAeco+846).

   table3

 

 


Table 3: Summary of using various models to search for REP clusters in EcoSeq6. The results of the search were compared with a list of 244 known sequences grouped into 128 clusters [13]. The table reports the number of sequences (or clusters) on the list that do not overlap with any sequence found (false -), and the number of sequences found that do not overlap with any on the list (false +).


next up previous
Next: Using hidden Markov models Up: Looking for REPs using Previous: Looking for REPs using

Rey Rivera
Thu Aug 22 14:04:06 PDT 1996