Imagine that the same domain appears in several sequences. Then to build an HMM of that domain and using it for searching or aligning requires a few extensions. We augment the HMM by insertion states on both ends that have no preference for which letters are inserted (i.e., they have uniform character distributions). The cost of aligning a sequence to such a model does not strongly depend on the position of the pattern in the sequence, and thus it will pick up the piece of the sequence that fits the model the best. A new parameter then has to be set, namely the the probability for making an insertion in these flanking insertion modules. If this probability is low, deletions in the main part of the model are encouraged and insertions discouraged, whereas a high insertion probability encourages the opposite behavior, because it becomes `expensive' to use the flanking insertion states. The setting of this parameter is most important when estimating a model like this, because there is a danger of finding a model in which for instance only the delete states are used if it is too `cheap' to make insertion in the flanking modules. These flanking modules are called a free insertion modules (FIMs), because the the self-transition probability in the insert state is set to unity. Other values can of course be used.
FIMs are implemented by only using the delete state and the insert state in a standard module. All the transitions from the previous module go into the delete state, which is ensured by setting the other probabilities to zero. From the delete state there is a transition to the insert state with the probability set to one. In the insert state all characters have the same probability and the probability of insert to insert is set to one. From the delete and insert states there are transitions to the next module which have the same probabilities (delete to match and insert to match have the same probability, and so on). Note that the probabilities do not sum up to one properly in such a module. A model with three FIMs is shown in Figure 4.
These special modules can be used to learn, align, or discriminate domains that occur once per sequence. Typically, FIMs are used at the beginning and the end of a model to allow an arbitrary number of insertions at either end. To train a model to find several (different) domains, one can also add FIMs at different positions in the model. Note, however, that only domains always occouring in the same order and the same number of times can be modelled this way. To model domains that come in different order or in a variable number, a model with backward transitions is required, which is something that we might add to SAM at a later stage.
There are two different ways to model subsequences. The first method is to cut out the subsequences from their host molecules, build a model from these, and then add a FIM in both ends afterwards. The second method is to use the full sequences and train with the FIMs. The second method is generally easier, but also slower in terms of CPU time, especially if the sequences are much longer than the subsequences. In [Krogh et al., 1994a] the first method was used to model the protein kinase catalytic domain and the EF hand domain, and in this paper the second method will be demonstrated on the SH2 domain.
In [Lawrence & Reilly, 1990] and [Lawrence et al., 1993], two related methods of automatically finding common patterns in a set of sequences are described. Those papers deal only with gap-less alignments, i.e., patterns without insertions or deletions. The method described here can be viewed as an extension to alignments with gaps.