When estimating a model from data, there is always the possibility that the model will over-fit the data: it models the training sequences very well but does not fit other sequences from the same family. This is particularly likely if there are few training sequences. With only one training sequence, a perfect model would have a match state for each residue in which that residue has probability one and all other residues have probability zero. Such a model would give zero probability to every sequence other than the training sequence! For larger sets of training data, similar problems are still present but are less extreme.
To avoid this problem a regularizer can be used.
Regularization is a method to avoid over-fitting the data,
and in Bayesian statistics it is tightly connected with the so-called
prior distribution.
The prior distribution is a distribution over the model parameters;
for the HMM it is a probability distribution over probability
distributions. The prior contains our prior beliefs about the parameters
of the model.
In our work we use Dirichlet distributions for the prior
[Berger, 1985, Santner & Duffy, 1989].
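For a discrete probability distribution $p_1, \ldots, p_M$, a Dirichlet
distribution is described by $M$ parameters $\alpha_1, \ldots, \alpha_M$.
The mean of the Dirichlet distribution is $\bar{p}_i = \alpha_i / \alpha_0$,
where $\alpha_0 = \sum_k \alpha_k$, and the variance is roughly inversely
proportional to $\alpha_0$ (exactly, to $\alpha_0 + 1$ for a fixed mean).
If $\alpha_0$ is large, it is highly probable that
$p_i \approx \alpha_i / \alpha_0$.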
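The effect of the parameter sum on the spread of the distribution can be checked numerically. The following is a minimal sketch, not part of the original method: it draws Dirichlet samples by normalizing gamma variates (a standard construction), and the parameter values, sample size, and function names are illustrative assumptions.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one probability vector from Dirichlet(alphas) by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def empirical_mean_and_variance(alphas, n_samples=20000, seed=0):
    """Estimate the per-component mean and variance from n_samples draws."""
    rng = random.Random(seed)
    m = len(alphas)
    sums = [0.0] * m
    sq_sums = [0.0] * m
    for _ in range(n_samples):
        p = sample_dirichlet(alphas, rng)
        for i, p_i in enumerate(p):
            sums[i] += p_i
            sq_sums[i] += p_i * p_i
    means = [s / n_samples for s in sums]
    variances = [sq / n_samples - mu * mu for sq, mu in zip(sq_sums, means)]
    return means, variances


# Two hypothetical priors with the same mean (0.2, 0.3, 0.5) but different sums.
weak = [2.0, 3.0, 5.0]          # parameter sum 10
strong = [200.0, 300.0, 500.0]  # parameter sum 1000

for alphas in (weak, strong):
    means, variances = empirical_mean_and_variance(alphas)
    print(alphas, [round(x, 3) for x in means], [round(v, 5) for v in variances])
```

Both runs should report means close to (0.2, 0.3, 0.5), with much smaller variances for the second prior, whose parameter sum is 100 times larger.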
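For each probability distribution in the HMM, a Dirichlet distribution is
used as a prior, and we call the corresponding $\alpha$ values
$\alpha^{P}(x|i)$ for the distributions over characters, and
$\alpha^{T}(j|i)$ for the
transition probabilities. The reestimation formula corresponding to
(4) is
$$
T(j|i) = \frac{n(j|i) + \alpha^{T}(j|i)}{\sum_{j'} \left[ n(j'|i) + \alpha^{T}(j'|i) \right]},
\qquad
P(x|i) = \frac{n(x|i) + \alpha^{P}(x|i)}{\sum_{x'} \left[ n(x'|i) + \alpha^{P}(x'|i) \right]}.
\tag{5}
$$
The set of all the $\alpha$s is called the regularizer.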
We call this the maximum a posteriori (MAP) estimate, although
the correct MAP formula has $\alpha - 1$ instead of $\alpha$.
Equation (5) is really a least squares estimate (see
[Krogh et al., 1994a]), but one can also view it as a MAP estimate with
redefined $\alpha$s.
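The difference between the two estimates can be written out for a single distribution. This is a sketch under a standard assumption not stated in the text: the counts $n_1, \ldots, n_M$ are treated as multinomial observations with a Dirichlet prior with parameters $\alpha_1, \ldots, \alpha_M$, so that the posterior is again a Dirichlet distribution with parameters $n_i + \alpha_i$. The symbols $\hat{p}_i^{\mathrm{MAP}}$ and $\hat{p}_i^{\mathrm{mean}}$ are introduced here for illustration only.

```latex
\begin{align*}
\text{posterior mode (MAP):}\quad
  \hat{p}_i^{\mathrm{MAP}} &= \frac{n_i + \alpha_i - 1}{\sum_{k=1}^{M} \left( n_k + \alpha_k - 1 \right)},
  \qquad \text{valid when every } n_k + \alpha_k > 1, \\
\text{posterior mean:}\quad
  \hat{p}_i^{\mathrm{mean}} &= \frac{n_i + \alpha_i}{\sum_{k=1}^{M} \left( n_k + \alpha_k \right)}.
\end{align*}
```

The posterior mean has the same form as equation (5), and it is the estimate that minimizes the expected squared error under the posterior. Replacing each $\alpha_i$ by $\alpha_i + 1$ in the mode gives the mean, which is the sense in which the formula can be read as a MAP estimate with redefined parameters.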
Even without the theoretical justification this formula is
appealing. For each parameter in the model a number ($\alpha$) is added
to the corresponding $n$ before the new parameter is found. If $n$ is
small compared to $\alpha$, as when there is little
training data, the regularizer essentially determines the parameter,
and
$$
T(j|i) \approx \frac{\alpha^{T}(j|i)}{\sum_{j'} \alpha^{T}(j'|i)}.
\tag{6}
$$
(This is the mean of the Dirichlet distribution.)
The size of the sum $\sum_{j'} \alpha^{T}(j'|i)$ determines the strength of
the regularization, or the strength of the prior beliefs.
If this sum is small, say 1, just a few
sequences will be enough to 'take over' the model. On the other hand, if
the sum is large, say 1000, then on the order of 1000
training sequences will be needed to make the model
differ significantly from the prior beliefs.
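The interplay between the counts and the strength of the regularizer can be illustrated with a small sketch. It assumes the additive form described above (a regularizer value is added to each count and the result is normalized); the state names, counts, prior values, and function names are hypothetical and chosen only for illustration.

```python
def regularized_estimate(counts, alphas):
    """Return (n + alpha) / sum(n + alpha) for one probability distribution."""
    if counts.keys() != alphas.keys():
        raise ValueError("counts and alphas must cover the same outcomes")
    total = sum(counts[k] + alphas[k] for k in counts)
    return {k: (counts[k] + alphas[k]) / total for k in counts}


def scaled(alphas, strength):
    """Rescale a regularizer so its values sum to `strength`, keeping its mean."""
    s = sum(alphas.values())
    return {k: strength * a / s for k, a in alphas.items()}


# Hypothetical prior mean over the three transitions out of a match state.
prior = {"match": 0.90, "insert": 0.05, "delete": 0.05}
# Hypothetical counts from 20 training sequences; none of them uses "insert".
counts = {"match": 12.0, "insert": 0.0, "delete": 8.0}

for strength in (1.0, 1000.0):
    estimate = regularized_estimate(counts, scaled(prior, strength))
    print(strength, {k: round(v, 3) for k, v in estimate.items()})
```

With a sum of 1 the estimate stays close to the raw frequencies while still giving the unseen transition a non-zero probability; with a sum of 1000 the 20 observations barely move the estimate away from the prior mean.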
This type of regularization is convenient when modeling biological
sequences because we have prior knowledge from conventional alignment
methods. For example, in both pairwise and multiple alignments the
penalty for starting a deletion is usually larger than for continuing
a deletion. This 'prior belief' that match-to-delete transitions are
less probable than delete-to-delete transitions can easily be built
into an HMM by setting
$\alpha^{T}(d_{k+1}|m_k) < \alpha^{T}(d_{k+1}|d_k)$.
The SAM system can also use the more complicated Dirichlet mixture priors for regularization. These priors include several different distributions for some number of different types of columns, such as hydrophobic and hydrophilic positions. For more information on these distributions, please refer to our previous work [Brown et al., 1993].
By normalizing the regularizer as in (6), a valid model is obtained. Since this model represents prior beliefs, it is natural to use it as the initial model for the estimation process. Usually noise is added to this initial model for reasons discussed next.
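Normalizing a regularizer into a starting model can be sketched as follows. The example also encodes the prior belief that opening a deletion is less probable than extending one. All state labels, regularizer values, and the function name are hypothetical, and the noise step mentioned above is not shown.

```python
def normalize_regularizer(regularizer):
    """Map {state: {target: alpha}} to {state: {target: probability}}."""
    model = {}
    for state, alphas in regularizer.items():
        total = sum(alphas.values())
        if total <= 0:
            raise ValueError(f"regularizer for state {state!r} must have a positive sum")
        model[state] = {target: a / total for target, a in alphas.items()}
    return model


# Hypothetical transition regularizer for one position k of the model.
# Both rows sum to 10, so a smaller value means a smaller prior probability.
transition_regularizer = {
    "m_k": {"m_k+1": 9.0, "i_k": 0.5, "d_k+1": 0.5},  # match state
    "d_k": {"m_k+1": 6.0, "i_k": 0.5, "d_k+1": 3.5},  # delete state
}

initial_model = normalize_regularizer(transition_regularizer)
for state, row in initial_model.items():
    print(state, row)
```

Because each row is normalized separately, comparing raw regularizer values across two states translates directly into a comparison of probabilities only when the rows have equal sums, as they do in this example.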