next up previous
Next: Complement blurring Up: Simple Markov models Previous: Zero-offset

Neighbor blurring

When doing searches, we are usually interested not only in sequences identical to the seed, but in sequences that have only a few characters different from the seed. Unfortunately, even a single mutation can change a k-long context into one that has zero counts of all four words beginning with that context. To compensate somewhat for this limitation of high-order models, we can blur the word counts by adding a weighted sum of all neighboring word counts:

displaymath793

where N(w) is the set of all words that differ from w in exactly one position, and n is the blurring weight for neighbor blurring.

The program uses a slightly more sophisticated model, with three blurring parameters tex2html_wrap_inline801, tex2html_wrap_inline803, and tex2html_wrap_inline805 . The tex2html_wrap_inline801 weight is used for words whose difference is (AG) or (CT). The tex2html_wrap_inline803 weight is used for words whose difference is (AC) or (GT), and tex2html_wrap_inline805 is used for (AT) or (GC).

The neighbor weights are best chosen by optimizing the adaptive compression of a related set of sequences, using a simple optimization technique, such as gradient descent, on the four parameters z, , , and . The optimal value of z is much smaller when neighbor blurring is used, since the neighbor blurring eliminates many of the zero counts, and provides a more accurate estimate of the probabilities than the simple zero-offset.

One problem with neighbor blurring is that we do not treat the predicted position of the word differently from the part that establishes the context. When the counts get large in a given context, blurring them may produce a worse estimate of the probabilities than the raw counts. Since we use the blurring only for models with extremely small counts, where blurring in the predicted position generally produces better predictions, this problem has been ignored.


next up previous
Next: Complement blurring Up: Simple Markov models Previous: Zero-offset

Rey Rivera
Thu Aug 22 14:04:06 PDT 1996