next up previous
Next: Model surgery Up: Algorithm Previous: Multiple sequence alignment

Searching

By summing the probabilities of all the different alignments of a sequence to a model, one can calculate the total probability of the sequence given that model,

equation3003

where the sum is over all possible alignments (paths) tex2html_wrap_inline1602 , and the probability in the sum is given by (1). This probability can be calculated efficiently without having to explicitly consider all the possible alignments by the forward algorithm [Rabiner, 1989]. The negative logarithm of this probability is called the negative log-likelihood score,

equation3010


 

figure3015
Figure 3: NLL scores vs. sequence length for a search of SWISS-PROT using a model of the SH2 domain described in section 5.3. The tex2html_wrap_inline3753 show the SH2 containing sequences (the training set), and dots show the about 30.000 remaining sequences. All sequences containing the letter 'X' are excluded.  

Any sequence can be compared to a model by calculating this NLL score. For sequences of equal length the NLL scores measures how `far' they are from the model, and it can be used to select sequences that are from the same family. However, the NLL score has a strong dependence on sequence length and model length, see Figure 3. One means of overcoming this length bias is using Z-scores, or the number of standard deviations each NLL is away from the average NLL of sequences of the same length, but which are not part of the family being modeled, or do not contain the motif being modeled.

When searching a database like SWISS-PROT [Bairoch & Boeckmann, 1994] with an HMM, the smooth average and the Z-scores are calculated as follows. For a fixed sequence length we assume that the NLL scores are distributed as a normal distribution with some outliers representing the sequences in the modeled family. The smooth average should be the average of the normal distribution, and it is found by iteratively removing outliers:

  1. For k larger than or equal to the minimum sequence length, find the minimum length tex2html_wrap_inline3757 such that the length interval tex2html_wrap_inline3759 contains at least K sequences. We usually use K=1000. For all such intervals calculate the average sequence length tex2html_wrap_inline3765 and the average NLL score.
  2. For any integer length l the smoothened average NLL score is found by linear interpolation of the average NLLs corresponding to the closest tex2html_wrap_inline3769 and tex2html_wrap_inline3771 . For l either smaller than the smallest tex2html_wrap_inline3765 or larger than the largest tex2html_wrap_inline3765 , linear extrapolation from the two closest points is used.
  3. For each interval the standard deviation of the NLL scores from the averages is found. For any integer length the smoothened standard deviation is found by linear interpolation or extrapolation as above.
  4. All sequences with an NLL score more than a certain number of standard deviations (usually 4) from the smooth average are considered outliers and excluded in the next iteration.
  5. If the excluded sequences are identical to the ones excluded in the previous iteration, the process is stopped. Otherwise it is repeated unless the maximum number of iterations have been performed.

This procedure often produces excellent results on a large database like SWISS-PROT, but there is no guarantee that it works. It is easy to detect when it is not working, because the sequences in the family, such as the training sequences, have low Z-scores. In this case, the training sequences and other obvious outliers can be removed by hand, and the above process repeated. This method always yields good results.

In some sequences there are unknown residues that are indicated by special characters. A completely unknown residue is represented by the letter X in proteins and by N in DNA and RNA. For proteins also the letters B, meaning amino acid N or D, and Z (Q or E) are taken into account. For DNA and RNA the letters R for purine and Y pyrimidine are recognized. All other letters that are not part of the sequence alphabet or equal to one of these wild card characters are taken to be unknown, i.e., changed to X or N depending on the sequence type. The probability of a wild card character in a state of the HMM is set equal to the maximum probability of all the letters it represents. It has the unfortunate side effect that sequences with many unknowns automatically receive a large probability, and these sequences have to be inspected separately. Another solution would be to set the probability equal to the average probability of the letters the wild card represents, but then the opposite problem might occur. This can also be chosen as an option in SAM.


next up previous
Next: Model surgery Up: Algorithm Previous: Multiple sequence alignment

Rey Rivera
Thu Aug 29 15:28:54 PDT 1996