Given the definition of significantly found sequences in
Section 1.2
, it is easy to scan a cost map to
find significant subsequences. We simply start at the left end of the
map and add up the savings in each position
(
),
keeping track of the greatest
cumulative savings encountered.
When the cumulative savings becomes negative, we have just scanned an
uninteresting sequence, and so we reset the savings and greatest
savings to zero, and continue the scan with the left endpoint in the
current position.
A significant sequence is present when the greatest savings seen is greater than T, but the scan is continued until either we reach the end of the cost map or the savings per position drops to less than half the savings per position at the point of greatest savings. The significant sequence runs from the start of the scan to the location of the greatest savings.
This method finds sequences that are significant, but the endpoints may not be optimally chosen. If we just picked the maximal segment (as in [4]), we would have extraneous characters on the ends of the segments. If the model for the interesting sequences provides estimates for uninteresting positions that are as good as the null models, then the savings for a junk character is zero or slightly positive (violating the requirements of a scoring system used with a maximal-segment method). In fact, early versions of the program reported the maximal segments and picked up long strings of junk.
Maximizing the savings for a sequence is too greedy, reporting uninteresting sequences. One way to correct the problem would be to trim the significant sequences to maximize the savings per position. This approach, however, is too conservative, throwing away parts of the sequence that remain interesting.
Instead, the program tries to maximize the signal-to-noise ratio (SNR). The noise (the standard deviation of the savings for an uninteresting sequence) grows with for junk sequences of length n, so maximizing
![]()
should give us the best signal-to-noise ratio.
We can move the endpoints inward one position at a time, keeping track of
the position that gives the maximum SNR.
To make sure we do not lose any significant sequences, we stop the
trimming before the remaining savings would be less than
.
After finding a sequence, we can restart the scan at the right
endpoint of the reported sequence.
Restarting here will scan some bases repeatedly, which may seem like a
violation of the assumption used for determining the significance
threshold, but the only way to find a sequence containing one of these
rescanned positions is for the position to be part of a sequence that
saves at least twice the threshold.
This savings is great enough that it would be significant even if all
subsequences had been examined.