One great advantage of HMMs is that they can be estimated from sequences, without having to align the sequences first. The sequences used to estimate or train the model are called the training sequences, and any reserved sequences used to evaluate the model are called the test sequences. The model estimation is done with the forward-backward algorithm, also known as the Baum-Welch algorithm, which is described in [Rabiner, 1989]. It is an iterative algorithm that maximizes the likelihood of the training sequences. We have modified the algorithm to implement maximum a posteriori (MAP) estimation.
In the basic algorithm the expected number of times a certain
transition or letter emission is used by the training sequences
is calculated [Rabiner, 1989].
For a letter emission probability
this is called n(x|q). Then the reestimated parameter is
where the sum is over all the characters x' in the alphabet, such as the 20 amino acids. The reestimation formula for the transition probabilities are similar. To begin with all the n(x|q) values are found from an arbitrary initial model. Next, a new set of parameters is found using equation (4), and the similar formula for the transition probabilities. Then, using the new model, the reestimation procedure is repeated until the change in the parameters is insignificant.