next up previous
Next: Direct construction of HMM Up: Markov Models as compression Previous: Violations of assumption needed

Building a hidden Markov model from a simple Markov model

 

Although hand-crafted hidden Markov models may be useful for searches where a consensus sequence is already known, I am primarily interested in HMMs that are generated automatically from a set of seed sequences.

We need a way to control the size of the HMM, and ensure that the most useful, shared information in the seed sequences is reflected in the HMM. We already have a mechanism for pooling information from a set of seed sequences: the counts of simple Markov models. The construction process used for the results in Section 3.1  gif starts by building an order-k simple Markov model for a set of seed sequences, constructing an HMM from the counts of k+1-words in a simple Markov model of the seed, then training the model on the sequences, using the cross-training technique described in Section 2.7  gif.

The construction technique creates states and maps words to them. First, a junk-loop state is created, and all words that aren't explicitly mapped to another state will map to the junk state. Then words are examined in decreasing order of frequency, creating a state for each one that isn't already mapped. The state will be used to match the character in the middle position of the word, so all words that differ only in that position will map to the same state. The count for each character in a state is initialized to the count for the word that has that character in the middle position.

There is a natural shift relationship between two words of the same length. This relationship is exploited to create the edges of the HMM, and to ensure some connection between the various states.

definition158

Note that the middle character position for the right-shift of w corresponds to the position one to the right of the middle in w.

We will provide an edge from state s to state t if some right-shift of one of the words mapping to s maps to t or if some left-shift of one of the words mapping to t maps to s. The count for the edge is the minimum of the counts for the corresponding words.

To ensure some connectivity in the Markov model, we don't just map the high-frequency words to states, but map their most probable predecessors and successors as well. When a new state is created for a word w, we examine the four right-shifts of the word and choose the one with the largest count . If this is above some threshold, then we create a state for and repeat until either the maximum count drops below the threshold or all four right-shifts of the word already map to states. We do the same with the four left-shifts of w.

For example, if two seeds aaaaaatggggggg and aaaaaacgggggg were used with an order-3 Markov model, the words that map to states would be aAa, aAt, aAc, aCg and aTg, tGg, cGg, and gGg. The HMM will have only one state for aCg and aTg (perhaps best named aYg, to indicate what letters it has low costs for), but the right-shifted words (tGg and cGg) map to different states, even though they match the same character (G). Similarly the left-shifted words aAt and aAc map to different states, though both match A.

As a slight added complexity, we also create the states corresponding to the dyadic complements of the words, and tie the complementary states together with a variable-weight tie. The tie has no effect on the model when we are determining the probability of a sequence, but when the probability estimates for a state are updated, a weighted multiple of the counts for the complementary characters in the tied state is added to the counts. This trick helps the model learn palindromic sequences and sequences that can occur on either strand.


next up previous
Next: Direct construction of HMM Up: Markov Models as compression Previous: Violations of assumption needed

Rey Rivera
Thu Aug 22 14:04:06 PDT 1996