A simple Markov model of order k estimates the probabilities for letters in a given position based only on the characters in the preceding k positions. The model is trained by giving it a seed (a set of sequences that are interesting) and counting the number of times each word of length k+1 occurs in the seed. The words that start with the same k characters constitute a context, and their counts can be converted to estimates of the probability of the characters in the final position of the word.
The strengths of a simple Markov model are that it allows very fast
searching (the cost of each position can be computed in constant
time), it has fairly small memory demands
(
words for an
alphabet of four letters), and the model can be built quickly from a
fairly small seed.
The weaknesses are that the models are not human-readable and are not directly useful for aligning sequences. Furthermore, any extraneous junk that is in the set of seed sequences is also searched for--not just the interesting part of the seed.
Some previous work with Markov models has concentrated on using them to predict the frequencies of short words [10, 1, 14]. These studies have generally found that fairly low-order models (k=2 to k=4) trained on the entire database work best for that application. In contrast, for searching we use fairly high-order Markov models (k=6 to k=10) and small seeds (as low as 30 characters). Even with a large seed of 12,000 characters (essentially all the REPs), the average value of a count for the words of an order-8 model is only 0.046. With such small seeds and large models, almost all contexts have zero counts for all four characters, making the way zero counts are handled particularly important.