next up previous
Next: Scanning algorithm for finding Up: Using compression models to Previous: Efficient search using a

When is a subsequence significant?

 

A sequence is considered interesting when the model gives it a significantly lower cost than we would expect. Formally, we look for sequences where the cost of the sequence using the model (cost_m ) is significantly less than the cost with a null model (cost_0 ).

There are many ways to determine when the cost is significantly lower. The current program uses Milosavljevic's Algorithmic Significance Method [8], which has a simple threshold test to determine if a sequence s is significant:

displaymath695

where the threshold T is computed from N, the number of sequences checked for significance:

displaymath701

If we want a probability of 0.01 or less of getting a sequence returned due to chance matching, then the threshold is computed as tex2html_wrap_inline703. In the searching technique described in Section 1.3  gif, one sequence is checked for each position in the database, so N is the length of the database d. For EcoSeq6, with 1,875,932 bases, finding a sequence requires that its cost under the model be 27.483 bits less than the null model cost.

This simple threshold technique is easy to implement, but the database is not really drawn randomly from the distribution implied by the null model. So even if we can refute the null model with high confidence, we have not necessarily found a good match to our model. For example, a hidden Markov model may assign cost slightly less than 2 bits/base to G and C in the junk loop that matches the uninteresting parts of the database. If the null model assigns 2 bits/base, then almost any sufficiently large GC-rich region would have significant savings.

On the other hand, a short REP cluster with only 24 bases would need to be encodable in only 48-27.479 = 20.521 bits (0.855 bits/base) in order to be found. This stringent condition is difficult to meet with the simple models described in Section 2  gif. Luckily, the regions around the REP clusters are fairly similar, and so the model can usually find a significant, slightly larger region around a REP.

In general, using this threshold yields too many long sequences and not enough short ones. The current program has some methods for suppressing the bogus long sequences, but does not find short sequences unless they compress very well. Since the standard deviation of the cost of a sequence of length n grows with tex2html_wrap_inline713, the threshold should probably have the form tex2html_wrap_inline715, rather than being a simple constant, but it is unclear how to set a and b to get a given significance level for a database.

Suppressing the incorrectly found long sequences is done by adding the extra condition that the model must save at least 0.1 bits/base and by artificially replacing the 2 bits/base of the null model by a smaller estimate of the true entropy of the database (say, 1.99 bits/base). A better technique would be to use a better null model than the very crude 2 bit/base model--perhaps a second- or third-order Markov model. Note that although using a better null model or estimating the entropy at less than 2 bits/base helps to eliminate the incorrectly found long sequences, it becomes more difficult to find the correct short ones.


next up previous
Next: Scanning algorithm for finding Up: Using compression models to Previous: Efficient search using a

Rey Rivera
Thu Aug 22 14:04:06 PDT 1996