next up previous
Next: 5 Results Up: Scoring Hidden Markov Models Previous: 3 System and Methods

Subsections

4 Algorithm

4.1 Hidden Markov Models

Since full treatment of SAM's HMMs is available elsewhere [Krogh et al., 1994,Hughey & Krogh, 1996], discussion will be limited to a brief overview of the method.

Each position, or module, in a linear HMM (Figure 1) has three states. A state shown as a rectangular box is a match state that models the distribution of letters in the corresponding column of an alignment. Diamond-shaped states model insertions of letters between two alignment positions, and circular states model a deletion, corresponding to a gap in an alignment. States of neighboring positions are connected, as shown by lines. For each of these lines there is an associated transition probability, which is the probability of going from one state to the other. This model topology features insertions and deletions similar to biological sequence alignment techniques based on affine gap penalties [Gotoh, 1982].


 
Figure: A linear hidden Markov model.  
\begin{figure}
\centerline{
\psfig {figure=model.drawing2.ps,width=\columnwidth}
}\end{figure}

Building an HMM is a process that iteratively estimates the parameters from a set of unaligned training sequences. SAM uses a maximum a posteriori (MAP) approach, where a prior probability distribution, or regularizer, over all possible parameters is assumed. These priors encapsulate our beliefs on what a model should be like. The MAP estimation process tries to maximize the posterior probability of the model given the training sequences. While this process is not guaranteed to find the best model in a global sense, various heuristics, including noise injection and model surgery, are used to (hopefully) approach a global, rather than merely local, maximum.

Once an HMM has been trained (or during training, if the sequences are not trimmed to the region of interest), Free Insertion Modules (FIMs) are added to the ends of the model. The purpose of the FIMs is to allow the HMM to model a region within a larger sequence, rather than the larger sequence in its entirety. Each FIM is a table of residue insertion probabilities with a no-cost, or unity probability, loop. In essence, the FIMs allow the model core to match the most likely part of the sequence and treat the rest as randomly generated residues with no length distribution. The FIMs enable the HMM to position itself within a sequence and to find the best match.

Once an HMM has been built for a family of proteins, negative log-likelihood (NLL) scores, $-\log(P(s\vert m))$, based on the probability that the sequence s was generated by the model m are returned by SAM. This raw NLL-score is exceedingly dependent on the lengths of s and m, and cannot be used directly to classify the sequence.

Using the original Z-score method, SAM overcame this problem by appropriately normalizing the NLL-score. First, all proteins in a standard database, such as SwissProt [Bairoch & Boeckmann, 1994], are scored against the model, and smoothed average curves of NLL-score versus length and standard deviation versus sequence length are calculated. Using this data, the difference between the NLL-score of a sequence and the mean NLL-score of sequences of similar length in standard deviations is calculated. This value is the Z-score for the sequence. A Z-score cutoff is then chosen by inspecting the histogram of Z-scores for sequences in the database and choosing an appropriate value, ideally one that cleanly separates the distributions of family members and non-members. Sequences with Z-scores above this cutoff are deemed to be in the family modeled by the HMM.

The major shortcomings of this procedure include the difficulty of scoring a single sequence without scoring an entire database, the subjective choice of Z-score cutoff, and the presumption of a normal distribution of scores.

We present an analysis of methods that solve these problems and in the process eliminate the need for Z-scoring altogether.

4.2 Scoring

Sequence alignments under probabilistic models can be scored using the log-odds ratio, an information-theoretic means of scoring an alignment [Altschul, 1991] that has been used by HMMer since its inception [Eddy et al., 1995]. The log-odds score asserts whether or not the probability that the sequence s was generated by the model m is larger than the probability that the sequence was generated by the null model $\phi$:

\begin{displaymath}
\mbox{score}(s)=\log_z\frac{P_m(s)}{P_{\phi}(s)}, \end{displaymath}

where the logarithm can be to any base z, most often to base 2, in which case the score is reported in bits. For historical reasons, SAM uses natural logarithms.

Once the model m has been trained, a key question is, of course, what should be used for the null model $\phi$ in the calculation of these log-odds, or NLL-NULL, scores.

One obvious choice is to have the null model match the FIMs appended to the model. The FIMs model parts of a sequence not modeled by the core of the HMM, and this is just the effect desired in a null model. Making this choice, the probability that the entire sequence was randomly generated is simply the product of the sequence residue probabilities in the FIM.

One possible problem with the simple looping null model is that, unlike the trained model, it has no distribution over sequence lengths. We have evaluated an alternative complex null model that is an HMM with transition probabilities identical to those of the trained model, but without position-specific character emission probabilities. Because the character emission probabilities are the same in all states (and identical to the FIM probabilities), with the complex null model, $P_{\phi_c}(s)$ can be decomposed into two parts for efficient calculation: the length-independent loop of the simple null model and the length distribution of the complex null model: $P_{\phi_c}(s) =
P_{\phi_c}(\vert s\vert)\cdot P_{\phi}(s)$.

Having decided on the relation between FIMs and the null model -- looked at in the other direction, FIMs are just null models attached the sides (or middle, in the case of a 2-part domain) of the model -- the distribution to place in the null model must be decided.

4.3 Null model selection

First, the null model can be fixed for all scoring sequences and all models. For example, the FIM and null model tables can be set to the flat distribution or the background distribution of amino acids over all proteins. This second one is commonly used in sequence analysis. HMMer uses these background frequencies as its default null model. Both of these distributions are invariant over all protein families, and thus have an advantage in their simplicity.

Second, the null model can be fixed for all scoring sequences, but different for each model. The goal here is to correct for compositional bias within the model. For example, if a certain family is alanine-rich, the core of its model may produce a high score for any alanine-rich protein.

Third, the null model can be different for every scoring sequence, but the same for each model. For example, the tables can contain the amino acid distribution of the sequence currently being scored. This is a pessimistic null model, as it demands not that the HMM model the sequence better than fixed background frequencies, but that it model the sequence significantly better than frequencies exactly matching the sequence's composition. This will both correct for compositional bias and ensure that a sequence scores well only if the extra effort of creating an HMM for the family outperforms simpler representations.

Fourth, the null model can depend on both the scored sequence and the model. We have not investigated this possibility.

The experimental work has shown that the second category, null models that depend on the trained model, are usually best. This work examines three null models in this category.

This third method, because it has a strong foundation, can be calculated directly from a model (rather than from information saved during model training), and performs similarly or better than the other choices, is our favorite. As a result of this work, SAM has been modified to use the geometric mean by default and, during the training procedure, to update the insert and FIM tables to be the geometric mean of the match states after each training iteration.

Once a null model has been selected, the task remains to settle how to distinguish sequences that contain the model from sequences that do not.

4.4 Significance

When the log-odds score is greater than zero, the HMM m fits sequence s better than the null model $\phi$ fits sequence s. The question remains, however, whether or not this difference is significant. To address this issue, we make use of Milosavljevic's algorithmic significance test, which asserts that the probability of getting a score larger than d is less than or equal to z-d (where z is the logarithm base), assuming that the null model is a reasonably accurate description of the space the sequences are drawn from [Milosavljevic & Jurka, 1993]. When a database is searched, and the search includes N individual placements of the model, for a given d, the equation becomes:

\begin{displaymath}
P(\exists_i,\mbox{score}_i\geq d) \leq \sum_{i=1}^{N}
P(\mbox{score}_i\geq d) \leq Nz^{-d}.\end{displaymath}

Thus, to assure a certain level of significance $\sigma$ (typically 0.01 to 10.0), a score such that $\sigma \leq N z^{-d}$, or $d \geq -\log_z(\sigma) + \log_z(N)$ will certainly indicate significance, though this significance level is pessimistic as it assumes an independence of placements that does not exist in HMMs. The meaning of $\sigma$ is the same as BLAST's E parameter, the expected number of false positives [Altschul et al., 1994].

Consider searching a database of length L residues in S sequences with an HMM. What value should be assigned to N? If the HMM is considered to be matching each sequence in its entirety, then N should just be the number of sequences S. On the other hand, if a motif is being searched for (with, for example, free insertion modules added to the ends of the HMM), then an experiment is effectively performed at each starting point within the database, so N is the total length L of the database. Because HMMs can have insertions and deletions, however, perhaps L2, representing the number of starting points and ending points tested, should be considered. In practice, HMMs tend to be rigid enough that once the starting point is picked, the ending point is not a free variable.

Taking the middle ground, letting N be the number of starting points within the database, one further adjustment can be made, also based on the rigidity of HMMs. Because the HMM has a certain preferential length (adjusted by SAM's surgery heuristics), sequences shorter than the HMM really only contribute one placement to N: does the model fit the sequence starting at the beginning and ending at the end. Similarly, a longer sequence will contribute a number of placements to N approximately equal to the difference between its length and the HMM's. So, for this work, we set

\begin{displaymath}
N=\sum_{i=1}^S
\max(\mbox{length}_{s_i}-\mbox{length}_m,1). \end{displaymath}

With semi-local scoring, which allows a complete sequence to match only part of the model, the number of placements is the adjusted sequence length plus the model length. With full local scoring, which allows a subregion of the sequence to match a subregion of the model [Smith & Waterman, 1981], the number of placements is the sum of the length of the model and the length of the sequence. These corrections are critical when the scores of a single sequence against different models are compared.

This value of N continues to be pessimistic because, as mentioned, it assumes independence among the N placements. This is a problem at two levels. First, the sequences in most databases are not independent, though various non-redundant databases strive to correct this problem. Second, within a sequence, if the HMM does not produce a significant score for a sequence s starting at the first residue, it almost certainly will not starting at the second either, though it may very well match the model somewhere further along the sequence. Furthermore, SAM's default scoring method will only report one motif in a sequence even when more than one exist. The independence of placements can be experimentally evaluated for any given model, or estimated as some number of residues or fraction of the overall model length. Rather than delve into these subjective choices, we choose instead to use the pessimistic N. Log-odds scoring has always been used by HMMer, which suggests use of $\log(L)$ for discrimination ($\sigma = 1$) [Eddy, 1995].


next up previous
Next: 5 Results Up: Scoring Hidden Markov Models Previous: 3 System and Methods
SAM
sam-info@cse.ucsc.edu
UCSC Computational Biology Group