This section briefly describes the structure and basic theory of the type of hidden Markov model used in this work. For a general introduction to hidden Markov models, refer to [Rabiner, 1989]. For additional details on our hidden Markov models, refer to [Krogh et al., 1994a].
The basic model topology is shown in Figure 1. Each position, or module, in the model has three states. A state shown as a rectangular box is a match state that models the distribution of letters in the corresponding column of an alignment. A state shown by a diamond-shaped box models insertions of random letters between two alignment positions, and a state shown by a circle models a deletion, corresponding to a gap in an alignment. States of neighboring positions are connected, as shown by lines. For each of these lines there is an associated `transition probability', which is the probability of going from one state to the other.
This model topology was chosen so as to feature insertions and deletions similar to biological sequence alignment techniques based on affine gap penalties. Other model topologies can certainly be considered, see for example [Baldi et al., 1994, Krogh et al., 1994b], but for reasons of efficiency the software described here is limited to the topology shown in Figure 1. Gibbs sampling is an alternate approach that does not allow arbitrary gaps within aligned blocks [Lawrence et al., 1993].
Alignment of a sequence to a model means that each letter in the sequence is associated with a match or insert state in the model. Two 5-character sequences, A and B, are shown in a 4-state model in Figure 2, along with the corresponding alignment between the sequences.
One can specify such an alignment by giving the corresponding sequence
of states with the restriction that the transition lines in the figure
must be followed. For example, to match a letter to the first match
state (
) and the next letter to the third match state (
) can
only be done by using the intermediate delete state (
), so that
part of the alignment can be written as
. In HMM
terminology such an alignment of a sequence to a model is called a
path through the model. A sequence can be aligned to a model in many
different ways, just as in sequence-to-sequence alignments, or
sequence-to-profile alignments. Each alignment of a given sequence to
the model can be scored by using the probability parameters of the
model.
In the following, a sequence of L letters is denoted
. The path specifying the particular alignment is denoted
, where
is the kth state in the path. To
simplify notation, as compared to [Krogh et al., 1994a], only the match
and insert states are included in the path; for two states not
directly connected, delete states are filled in as needed. Because the
first state in a path is always the begin state
, and the last
one is always the end state
, these two trivial states are
also not included in the state sequence. The number of match
positions in the model is called M, and is referred to as the length
of the model. The probability of an alignment of a sequence to a model
is the product of all the probabilities met in the path, written as
where
is the probability associated with the
transition between states
and
. The probability of
matching letter x to state q is called
. If two
states in the path are not directly connected by a transition, the
transition probability has to be calculated by filling in the
appropriate delete states. For example, the probability of a transition
from the match state at position 1 (
) to the insert state at
position 4 (
), passing through delete states
,
, and
, is
By taking the negative logarithm of equation (1) it can be turned into a more familiar type of alignment `cost':
Now the numbers can be interpreted as `penalties'. A term like
is a positive penalty for initiating a
deletion after position i, and
is the
penalty for continuing a deletion at position i. Some of the other
terms have similar interpretations.