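By summing the probabilities of all the different alignments of a sequence to a model, one can calculate the total probability of the sequence given that model,
$$\mathrm{Prob}(\text{sequence} \mid \text{model}) = \sum_{\text{path}} \mathrm{Prob}(\text{sequence}, \text{path} \mid \text{model}),$$
where the sum is over all possible alignments (paths), and the probability in the sum is given by (1).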
This probability can be calculated efficiently, without explicitly considering all the possible alignments, by the forward algorithm [Rabiner, 1989]. The negative logarithm of this probability is called the negative log-likelihood (NLL) score,
$$\text{NLL-score} = -\log \mathrm{Prob}(\text{sequence} \mid \text{model}).$$
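To illustrate why the alignments need not be enumerated, the following is a minimal sketch of the forward recursion for a generic HMM with emitting states only. It is not SAM's implementation: the delete states of a profile HMM and any end state are left out, and the names `forward_nll`, `brute_force_nll`, `init`, `trans`, and `emit` are hypothetical. The brute-force function is included only to show that both routes give the same score.

```python
import math
from itertools import product

def forward_nll(seq, init, trans, emit):
    """Return -log Prob(seq | model) for a simple HMM with emitting states only.

    init[i]     : probability of starting in state i
    trans[i][j] : probability of moving from state i to state j
    emit[i][c]  : probability that state i emits symbol c
    """
    n = len(init)
    # f[j] = probability of the prefix read so far, summed over all paths ending in state j
    f = [init[j] * emit[j][seq[0]] for j in range(n)]
    for c in seq[1:]:
        f = [sum(f[i] * trans[i][j] for i in range(n)) * emit[j][c] for j in range(n)]
    return -math.log(sum(f))

def brute_force_nll(seq, init, trans, emit):
    """Same quantity by explicit enumeration of every path (exponential cost)."""
    n = len(init)
    total = 0.0
    for path in product(range(n), repeat=len(seq)):
        p = init[path[0]] * emit[path[0]][seq[0]]
        for k in range(1, len(seq)):
            p *= trans[path[k - 1]][path[k]] * emit[path[k]][seq[k]]
        total += p
    return -math.log(total)
```

The recursion costs on the order of (sequence length) x (number of states)^2 operations, whereas the enumeration grows exponentially with sequence length. For long sequences the raw probabilities in this sketch would underflow, so a practical version would rescale `f` at each step or work with logarithms.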
Any sequence can be compared to a model by calculating this NLL score. For sequences of equal length, the NLL score measures how "far" they are from the model, and it can be used to select sequences that are from the same family. However, the NLL score depends strongly on sequence length and model length (see Figure 3). One means of overcoming this length bias is to use Z-scores: the number of standard deviations by which each NLL score differs from the average NLL score of sequences of the same length that are not part of the family being modeled, or that do not contain the motif being modeled.
When searching a database like SWISS-PROT [Bairoch & Boeckmann, 1994] with an HMM, the smooth average and the Z-scores are calculated as follows. For a fixed sequence length we assume that the NLL scores are normally distributed, with some outliers representing the sequences in the modeled family. The smooth average should be the mean of this normal distribution, and it is found by iteratively removing outliers:
This procedure often produces excellent results on a large database like SWISS-PROT, but there is no guarantee that it works. Failure is easy to detect, because the sequences in the family, such as the training sequences, then have low Z-scores. In this case, the training sequences and other obvious outliers can be removed by hand and the above process repeated. In practice this usually yields good results.
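The outlier-removal idea can be sketched for a single sequence length as follows. This is only an assumed variant, not the exact procedure: the cutoff of 4 standard deviations, the iteration cap, the sign convention of the Z-score, and the names `background_stats` and `z_score` are all assumptions, and the smoothing of the average across neighbouring lengths is left out.

```python
import statistics

def background_stats(nll_scores, cutoff=4.0, max_iter=100):
    """Estimate mean and standard deviation of the background NLL scores
    by repeatedly discarding scores more than `cutoff` deviations from the mean."""
    kept = list(nll_scores)
    for _ in range(max_iter):
        mean = statistics.mean(kept)
        sd = statistics.pstdev(kept)
        if sd == 0:
            break
        survivors = [s for s in kept if abs(s - mean) <= cutoff * sd]
        if len(survivors) == len(kept):
            break  # nothing was removed, so the estimates have settled
        kept = survivors
    return mean, sd

def z_score(nll, mean, sd):
    """Standard deviations by which `nll` lies below the background mean.

    Assumed sign convention: family members have unusually low NLL scores,
    so they receive large positive Z-scores.
    """
    return (mean - nll) / sd
```

With this convention, a search hit is a sequence whose Z-score exceeds some chosen threshold. If family members dominate the scores at a given length, the clipping can converge on the wrong population, which is the failure case described above.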
In some sequences there are unknown residues that are indicated by special characters. A completely unknown residue is represented by the letter X in proteins and by N in DNA and RNA. For proteins, the letters B (amino acid N or D) and Z (Q or E) are also taken into account. For DNA and RNA, the letters R for purine and Y for pyrimidine are recognized. All other letters that are neither part of the sequence alphabet nor one of these wild card characters are taken to be unknown, i.e., changed to X or N depending on the sequence type. The probability of a wild card character in a state of the HMM is set equal to the maximum probability of all the letters it represents. This has the unfortunate side effect that sequences with many unknowns automatically receive a large probability, so these sequences have to be inspected separately. Another solution would be to set the probability equal to the average probability of the letters the wild card represents, but then the opposite problem might occur. This alternative can also be chosen as an option in SAM.
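The two wild card rules can be compared with a small sketch. The function name `wildcard_prob`, the `use_average` flag, and the table names are hypothetical and do not reflect SAM's actual option names. The DNA table assumes the alphabet A, C, G, T; for RNA, T would be replaced by U.

```python
# Letters each wild card stands for, as described in the text.
PROTEIN_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
PROTEIN_WILDCARDS = {"B": "ND", "Z": "QE", "X": PROTEIN_ALPHABET}
DNA_WILDCARDS = {"R": "AG", "Y": "CT", "N": "ACGT"}

def wildcard_prob(symbol, state_probs, wildcards, use_average=False):
    """Emission probability of `symbol` in one HMM state.

    state_probs maps each ordinary letter to its probability in that state.
    A wild card gets the maximum (default) or the average probability
    of the letters it represents.
    """
    if symbol in state_probs:
        return state_probs[symbol]
    # Letters outside both the alphabet and the wild card table count as fully unknown.
    letters = wildcards.get(symbol, "".join(state_probs))
    probs = [state_probs[c] for c in letters]
    return sum(probs) / len(probs) if use_average else max(probs)
```

For a state with probabilities A = 0.1, C = 0.2, G = 0.3, T = 0.4, the wild card R scores 0.3 under the maximum rule and 0.2 under the average rule, which shows how the maximum rule favours sequences with many unknowns.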