next up previous
Next: Markov Models as compression Up: Using compression models to Previous: When is a subsequence

Scanning algorithm for finding the best subsequences

 

Given the definition of significantly found sequences in Section 1.2  gif, it is easy to scan a cost map to find significant subsequences. We simply start at the left end of the map and add up the savings in each position ( tex2html_wrap_inline733 ), keeping track of the greatest cumulative savings encountered. When the cumulative savings becomes negative, we have just scanned an uninteresting sequence, and so we reset the savings and greatest savings to zero, and continue the scan with the left endpoint in the current position.

A significant sequence is present when the greatest savings seen is greater than T, but the scan is continued until either we reach the end of the cost map or the savings per position drops to less than half the savings per position at the point of greatest savings. The significant sequence runs from the start of the scan to the location of the greatest savings.

This method finds sequences that are significant, but the endpoints may not be optimally chosen. If we just picked the maximal segment (as in [4]), we would have extraneous characters on the ends of the segments. If the model for the interesting sequences provides estimates for uninteresting positions that are as good as the null models, then the savings for a junk character is zero or slightly positive (violating the requirements of a scoring system used with a maximal-segment method). In fact, early versions of the program reported the maximal segments and picked up long strings of junk.

Maximizing the savings for a sequence is too greedy, reporting uninteresting sequences. One way to correct the problem would be to trim the significant sequences to maximize the savings per position. This approach, however, is too conservative, throwing away parts of the sequence that remain interesting.

Instead, the program tries to maximize the signal-to-noise ratio (SNR). The noise (the standard deviation of the savings for an uninteresting sequence) grows with for junk sequences of length n, so maximizing

displaymath741

should give us the best signal-to-noise ratio.

We can move the endpoints inward one position at a time, keeping track of the position that gives the maximum SNR. To make sure we do not lose any significant sequences, we stop the trimming before the remaining savings would be less than tex2html_wrap_inline743.

After finding a sequence, we can restart the scan at the right endpoint of the reported sequence. Restarting here will scan some bases repeatedly, which may seem like a violation of the assumption used for determining the significance threshold, but the only way to find a sequence containing one of these rescanned positions is for the position to be part of a sequence that saves at least twice the threshold. This savings is great enough that it would be significant even if all tex2html_wrap_inline745 subsequences had been examined.


next up previous
Next: Markov Models as compression Up: Using compression models to Previous: When is a subsequence

Rey Rivera
Thu Aug 22 14:04:06 PDT 1996