
Prior distribution and regularization

When estimating a model from data, there is always the possibility that the model will over-fit the data -- it models the training sequences very well, but will not fit other sequences from the same family. This is particularly likely if there are few training sequences. With only one training sequence, a perfect model would have one match state for each residue, in which that residue would have probability one and all other residues probability zero. Such a model would give zero probability to every sequence other than the training sequence! For larger sets of training data, similar problems are still present but not as extreme.
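The single-sequence case can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: a toy four-letter alphabet, match states only (no insert or delete states, no transition probabilities), and hypothetical function names that do not come from SAM.

```python
from collections import Counter

ALPHABET = "ACGT"  # toy alphabet, assumed for illustration


def ml_match_emissions(training_seqs):
    """Unregularized estimate: per-column relative frequencies of residues."""
    length = len(training_seqs[0])
    model = []
    for k in range(length):
        counts = Counter(seq[k] for seq in training_seqs)
        total = sum(counts.values())
        # Counter returns 0 for residues never seen in this column.
        model.append({x: counts[x] / total for x in ALPHABET})
    return model


def sequence_probability(model, seq):
    """Product of the per-column emission probabilities (match states only)."""
    prob = 1.0
    for column, residue in zip(model, seq):
        prob *= column[residue]
    return prob


model = ml_match_emissions(["ACGT"])              # one training sequence
p_train = sequence_probability(model, "ACGT")     # the training sequence
p_other = sequence_probability(model, "ACGA")     # differs in one residue
print(p_train, p_other)
```

Under these assumptions the training sequence gets probability one and the sequence that differs in a single residue gets probability zero, which is the over-fitting described above.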

To avoid this problem a regularizer can be used. Regularization is a method to avoid over-fitting the data, and in Bayesian statistics it is closely connected with the so-called prior distribution. The prior distribution is a distribution over the model parameters; for the HMM it is a probability distribution over probability distributions. The prior encodes our beliefs about the parameters of the model before the data are seen. In our work we use Dirichlet distributions for the prior [Berger, 1985, Santner & Duffy, 1989]. For a discrete probability distribution $p_1, \ldots, p_M$ a Dirichlet distribution is described by $M$ parameters $\alpha_1, \ldots, \alpha_M$. The mean of the Dirichlet distribution is $\bar{p}_i = \alpha_i / \alpha_0$, where $\alpha_0 = \sum_i \alpha_i$, and the variance is inversely proportional to $\alpha_0$. If $\alpha_0$ is large, it is highly probable that $p_i \approx \alpha_i / \alpha_0$. For each probability distribution in the HMM, a Dirichlet distribution is used as a prior, and we call the corresponding $\alpha$ values $\alpha_P(x|q)$ for the distributions over characters, and $\alpha_T(q'|q)$ for the transition probabilities. The reestimation formula corresponding to (4) is

\[
\hat{P}(x|q) = \frac{n(x|q) + \alpha_P(x|q)}{\sum_{x'} \left[ n(x'|q) + \alpha_P(x'|q) \right]},
\qquad
\hat{T}(q'|q) = \frac{n(q'|q) + \alpha_T(q'|q)}{\sum_{q''} \left[ n(q''|q) + \alpha_T(q''|q) \right]}
\qquad (5)
\]

where $n(x|q)$ and $n(q'|q)$ are the emission and transition counts.

The set of all the $\alpha$s is called the regularizer. We call this the maximum a posteriori (MAP) estimate, although the correct MAP formula has $\alpha - 1$ instead of $\alpha$. Equation (5) is really a least squares estimate (see [Krogh et al., 1994a]), but one can also view it as a MAP estimate with redefined $\alpha$s.
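As a minimal sketch of the regularized reestimation in (5), the Python below adds each regularizer value to its count and renormalizes, and contrasts the result with the strict posterior mode that uses the regularizer value minus one. The function names, the four-symbol distribution and the numbers are assumptions chosen for illustration; they are not taken from SAM.

```python
def regularized_estimate(counts, alphas):
    """Estimate of the form (n_i + alpha_i) / sum_j (n_j + alpha_j)."""
    if len(counts) != len(alphas):
        raise ValueError("counts and alphas must have the same length")
    total = sum(counts) + sum(alphas)
    return [(n + a) / total for n, a in zip(counts, alphas)]


def strict_map_estimate(counts, alphas):
    """Mode of the Dirichlet posterior: uses alpha_i - 1 in place of alpha_i.

    Only defined here when every n_i + alpha_i exceeds 1.
    """
    numerators = [n + a - 1.0 for n, a in zip(counts, alphas)]
    if any(v <= 0.0 for v in numerators):
        raise ValueError("posterior mode requires n_i + alpha_i > 1 for all i")
    total = sum(numerators)
    return [v / total for v in numerators]


# Hypothetical counts for one match state over a four-letter alphabet.
counts = [3.0, 0.0, 1.0, 0.0]
alphas = [2.0, 2.0, 2.0, 2.0]   # illustrative regularizer values

p_reg = regularized_estimate(counts, alphas)
p_map = strict_map_estimate(counts, alphas)
print(p_reg)
print(p_map)
```

Both estimates give non-zero probability to symbols that were never observed; they differ only in how much weight the regularizer contributes.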

Even without the theoretical justification this formula is appealing. For each parameter in the model a number ($\alpha$) is added to the corresponding $n$ before the new parameter is found. If $n$ is small compared to $\alpha$, as when there is little training data, the regularizer essentially determines the parameter, and

\[
\hat{P}(x|q) \approx \frac{\alpha_P(x|q)}{\sum_{x'} \alpha_P(x'|q)}
\qquad (6)
\]

and correspondingly for the transition probabilities. (This is the mean of the Dirichlet distribution.) The size of the sum $\sum_{x'} \alpha_P(x'|q)$ determines the strength of the regularization, or the strength of the prior beliefs. If this sum is small, say 1, just a few sequences will be enough to `take over' the model. On the other hand, if the sum is large, say 1000, then on the order of 1000 training sequences will be needed to make the model differ significantly from the prior beliefs.
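The effect of the size of the sum can be checked numerically. The sketch below assumes a four-symbol distribution whose prior mean is uniform, and a column in which every training sequence shows the same symbol; the function name and all numbers are illustrative assumptions.

```python
def estimate_probability(n_obs, alpha_sum, n_symbols=4):
    """Regularized estimate for a symbol seen in all n_obs training sequences.

    The prior mean is assumed uniform, so each regularizer value is
    alpha_sum / n_symbols.
    """
    alpha = alpha_sum / n_symbols
    return (n_obs + alpha) / (n_obs + alpha_sum)


for alpha_sum in (1.0, 1000.0):
    for n_obs in (0, 5, 1000):
        p = estimate_probability(n_obs, alpha_sum)
        print(f"sum of alphas {alpha_sum:6.0f}, {n_obs:4d} sequences: P = {p:.3f}")
```

With a sum of 1, five sequences already move the estimate from 0.25 to 0.875. With a sum of 1000, five sequences leave it at about 0.254, and even 1000 sequences only bring it to 0.625.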

This type of regularization is convenient when modeling biological sequences because we have prior knowledge from conventional alignment methods. For example, in both pairwise and multiple alignments the penalty for starting a deletion is usually larger than for continuing a deletion. This `prior belief' that match-to-delete transitions are less probable than delete-to-delete transitions can easily be built into an HMM by giving the match-to-delete transition a smaller regularizer value than the delete-to-delete transition, $\alpha_T(d|m) < \alpha_T(d|d)$.

The SAM system can also use the more complicated Dirichlet mixture priors for regularization. These priors include several different distributions for some number of different types of columns, such as hydrophobic and hydrophilic positions. For more information on these distributions, please refer to our previous work [Brown et al., 1993].

Normalizing the regularizer as in (6) yields a valid model. Since this model represents prior beliefs, it is natural to use it as the initial model for the estimation process. Usually noise is added to this initial model, for reasons discussed next.
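A minimal sketch of this normalization for the transition probabilities is given below. The state names, the dictionary layout and every regularizer value are hypothetical; they were chosen only so that the match-to-delete value is smaller than the delete-to-delete value. No noise is added in this sketch.

```python
# Hypothetical transition regularizer: outer key is the source state,
# inner key is the destination state. Values are illustrative only.
transition_alphas = {
    "match":  {"match": 9.0, "insert": 0.5, "delete": 0.5},
    "insert": {"match": 6.0, "insert": 3.0, "delete": 1.0},
    "delete": {"match": 6.0, "insert": 1.0, "delete": 3.0},
}


def normalize(alphas):
    """Divide each regularizer value by the sum over its distribution."""
    total = sum(alphas.values())
    return {state: value / total for state, value in alphas.items()}


# Initial model: one normalized distribution per source state.
initial_transitions = {
    source: normalize(row) for source, row in transition_alphas.items()
}

for source, row in initial_transitions.items():
    print(source, row)
```

Each row of the result sums to one, so it can serve directly as a set of initial transition probabilities, and starting a deletion is less probable than continuing one.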



Rey Rivera
Thu Aug 29 15:28:54 PDT 1996