The problem of finding interesting sequences can be broken into several subproblems: defining what sequences are interesting, devising an algorithm to find them efficiently, and determining whether the sequences found are statistically significant or just chance variations.
The approach used here is to define the interesting sequences by using a model m that assigns probabilities
to sequences (
). The
probabilities are assigned so that all the probabilities for sequences of a given length sum to 1--the length of the sequence is
not predicted by the model.
A sequence is interesting if the probability assigned by the model is
significantly higher than the probability of the sequence using a
null model.
Normally we use the negative log probability of the sequence as a
cost measure, since the probabilities become
extremely small for longer sequences.
Using base-2 logarithms gives us the encoding cost in bits of
sequence s using model m:
![]()
Section 2
will talk about two ways that models can be
constructed automatically, but first let's discuss how they are used
for searching efficiently.