The basic mechanism for training an HMM to a set of sequences is to repeatedly run the following tuning algorithm. First, run the Viterbi algorithm to determine the best path through the model for each sequence, and count the number of times each edge is used and the number of times each character occurs in each state. More sophisticated training algorithms, such as the Baum-Welch method [11], could be used to get expected counts, but this simple method works fairly well.
After all the sequences in the training set have been counted, the counts can be converted to probabilities for each set of characters in a state and for each set of edges out of a state. As with the counts in the simple Markov model, some care has to be taken with zero counts. Since the counts in the HMM tend to be much larger than in the simple Markov models, the handling does not have to be as careful, and the program just adds the old probability estimate plus a small fixed zero-offset to all the counts.
After the probabilities have been recomputed and converted to costs, the whole tuning step is repeated, either for a fixed number of iterations or until the change in cost per character is less than a user-specified threshold. Generally four iterations are enough to get convergence to better than 0.01 bits/base.
After tuning an HMM, some edges or states may never have been used. A simple pruning step after tuning removes these unused states and edges from the HMM.
Useless states and edges occur fairly commonly on the automatically
constructed HMMs because parallel sequences are often constructed.
For example, in Section 2.5
the states for tGg and cGg
matched exactly the same characters.
In training, one of the two states will get a higher count (hence
lower cost) and the other will become unused.
The smaller models in Table 3
on
page
(hmm125, small, and cross) were
constructed from simple Markov models by the methods of
Section 2.5
and tuned on the seed sequences (REP99-gxn).
Unused edges and states were removed, and the models were made
slightly more general by flattening the
probabilities--re-estimating the probabilities from the counts using
a larger zero-offset than used in the tuning. (These models were
flattened with a zero-offset of 0.1.)