The simplest algorithm for finding sequences using a model would be to
enumerate all possible sequences and compute the cost of each.
Unfortunately, this crude algorithm is too expensive.
In a database of d characters, up to
different
sequences may be found.
If computing the cost of a sequence takes time proportional to the
length of the sequence, then the entire search algorithm would be
order
.
Furthermore, if several overlapping sequences have a low cost, we
would like the algorithm to pick out the best of them--deferring for
the moment exactly what we mean by ``best''.
Therefore, we need a search technique whose execution time is linear
in d and which returns only disjoint sequences.
First, let's assume that the model used to compute the probabilities
provides not just an overall cost for a sequence, but assigns a cost
to each position of the sequence.
Second, let's assume that the cost of any given sequence is the sum of
the costs for each of the characters of the sequence.
Section 2.4
discusses the ways in which the
models actually used violate these assumptions and why the violations
are not serious in this context.
With the above assumptions, we can compute the costs for each character of the database once and save the costs in an array, called a cost map. From the cost map we can compute the cost for any subsequence of the database by taking the sum of costs in the array starting at the beginning of the subsequence and stopping at the end. Hence, given a cost map, finding interesting sequences becomes independent of the model used to create the cost map. This separation of the model from the search technique is one of the main advantages of the techniques described in this paper.
Our problem then is to find non-overlapping subsequences of a cost map whose cost is significantly lower than what we would expect from random strings of characters. Furthermore, we would like to do this in time proportional to the number of positions in the cost map.