next up previous
Next: Zero-offset Up: Markov Models as compression Previous: Markov Models as compression

Simple Markov models

 

A simple Markov model of order k estimates the probabilities for letters in a given position based only on the characters in the preceding k positions. The model is trained by giving it a seed (a set of sequences that are interesting) and counting the number of times each word of length k+1 occurs in the seed. The words that start with the same k characters constitute a context, and their counts can be converted to estimates of the probability of the characters in the final position of the word.

The strengths of a simple Markov model are that it allows very fast searching (the cost of each position can be computed in constant time), it has fairly small memory demands (tex2html_wrap_inline755 words for an alphabet of four letters), and the model can be built quickly from a fairly small seed.

The weaknesses are that the models are not human-readable and are not directly useful for aligning sequences. Furthermore, any extraneous junk that is in the set of seed sequences is also searched for--not just the interesting part of the seed.

Some previous work with Markov models has concentrated on using them to predict the frequencies of short words [10, 1, 14]. These studies have generally found that fairly low-order models (k=2 to k=4) trained on the entire database work best for that application. In contrast, for searching we use fairly high-order Markov models (k=6 to k=10) and small seeds (as low as 30 characters). Even with a large seed of 12,000 characters (essentially all the REPs), the average value of a count for the words of an order-8 model is only 0.046. With such small seeds and large models, almost all contexts have zero counts for all four characters, making the way zero counts are handled particularly important.




next up previous
Next: Zero-offset Up: Markov Models as compression Previous: Markov Models as compression

Rey Rivera
Thu Aug 22 14:04:06 PDT 1996