
5 Results for training and testing on full database

 

This section reports the average encoding costs obtained for different sample sizes and different regularizers. For each regularizer and sample size, the excess entropy is reported, that is, the difference between the average encoding cost per column using the regularizer and the average encoding cost per column using the best theoretically possible regularizer. For the entire database, the best possible encoding costs are reported in Table 1.

The last row of each of Tables 2 through 8 reports the excess entropy if the full column is given to the regularizer, rather than a sample s. Since the encoding cost for a column is minimized when the posterior counts exactly match the observed counts, this measures how much the regularizer distorts the data. Because some of the columns have few counts, it is not the same as letting |s| → ∞, but it offers a more realistic idea of what can be expected with large sample sizes.
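The encoding cost itself is defined in Section 2 and is not repeated here; the sketch below (Python, not from the original paper) assumes it is simply the negative log-likelihood of the column counts in bits, and shows how the excess entropy of a regularizer would then be computed from the per-column estimates.

import numpy as np

def encoding_cost(column_counts, p_hat):
    # Assumed cost: bits to encode the column's counts t_i with the
    # estimated distribution p_hat, i.e. -sum_i t_i * log2(p_hat_i).
    t = np.asarray(column_counts, dtype=float)
    return -np.sum(t * np.log2(p_hat))

def excess_entropy(columns, estimates):
    # Average extra bits per column relative to the best theoretically
    # possible regularizer, which uses the observed frequencies t_i / |t|.
    excess = []
    for t, p_hat in zip(columns, estimates):
        t = np.asarray(t, dtype=float)
        ideal = t / t.sum()
        nonzero = t > 0                      # 0 * log 0 is taken as 0
        best = -np.sum(t[nonzero] * np.log2(ideal[nonzero]))
        excess.append(encoding_cost(t, p_hat) - best)
    return float(np.mean(excess))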

5.1 Zero-offset

The simplest method for ensuring that no probability is estimated as zero is to add a small, fixed, positive zero-offset z to each count to generate the posterior counts:

$$ x_i = s_i + z . $$

(The estimated probabilities are the normalized posterior counts.) For large sample sizes, the zero-offset has little effect on the probability estimation, and the estimate approaches the maximum-likelihood estimate s_i/|s| as |s| → ∞.

For |s|=0, the estimated distribution will be flat (probability 1/20 for each amino acid), which is generally a poor approximation to the amino acid distribution in a context about which nothing is known yet.

It is fairly traditional to use z=1 when nothing is known about the distributions being approximated, but this value is much too large for highly conserved regions like the BLOCKS database---the optimal value is much closer to 0.05. Table 2 presents the excess entropy for three optimized regularizers, as well as the popular ``add-one'' regularizer.
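As a concrete illustration, here is a minimal Python sketch of the zero-offset regularizer; the function name and vector layout are mine, not the paper's.

import numpy as np

def zero_offset_estimate(s, z=0.05):
    # Posterior counts x_i = s_i + z; the estimated probabilities are the
    # normalized posterior counts.  z = 0.05 is close to optimal for the
    # BLOCKS database, while the traditional z = 1 is far too large.
    x = np.asarray(s, dtype=float) + z
    return x / x.sum()

# An empty sample (|s| = 0) gives the flat distribution, 1/20 per amino acid.
print(zero_offset_estimate(np.zeros(20)))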

  
Table 2: Excess entropy for the zero-offset regularizers. The popular ``add-one'' regularizer is clearly a poor choice for the BLOCKS database.

5.2 Pseudocounts

 

Pseudocount methods are a slight variant on the zero-offset, intended to produce more reasonable distributions when |s|=0. Instead of adding a constant zero-offset, a different positive constant z_i is added for each amino acid:

$$ x_i = s_i + z_i . $$

These zero-offsets are referred to as pseudocounts, since they are used in a way equivalent to having counted z_i extra occurrences of amino acid i.

Again, as |s| → ∞ the pseudocounts have diminishing influence on the probability estimate, and the estimate approaches s_i/|s|. For |s|=0, we can get the background distribution P_0 as the estimate by setting z_i = a P_0(i), for any positive constant a. This setting of the pseudocounts has been referred to as background pseudocounts [15] or the Bayesian prediction method [17] (for the Bayesian interpretation of pseudocounts, see [13]). For the BLOCKS database and |s|>0, the optimal value of a is near 1.0.

For non-empty samples, the pseudocounts that minimize the encoding cost of Section 2 are not necessarily multiples of the background distribution P_0 (see [13] for details).
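A minimal Python sketch of the background-pseudocount special case z_i = a P_0(i); the optimized pseudocounts reported in Table 3 were fit directly and need not have this form.

import numpy as np

def pseudocount_estimate(s, background, a=1.0):
    # Posterior counts x_i = s_i + a * P0(i).  For |s| = 0 this returns the
    # background distribution exactly; as |s| grows it approaches the
    # maximum-likelihood estimate s_i / |s|.  a = 1.0 is near-optimal for
    # the BLOCKS database.
    x = np.asarray(s, dtype=float) + a * np.asarray(background, dtype=float)
    return x / x.sum()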

The pseudocounts were optimized for |s|=0 through |s|=3, both separately for each sample size and jointly, minimizing the average entropy over all four sample sizes combined. The excess entropy for different pseudocounts is given in Table 3. The pseudocount regularizers do much better than the zero-offset regularizers for |s|=0 and |s|=1, but already by |s|=5, the difference is only 0.025 bits per column.

  
Table 3: Excess entropy for pseudocount regularizers.

5.3 Gribskov average score

 

The Gribskov profile [7] method or average-score method [17] computes the weighted average of scores from a score matrix M. There are several standard scoring matrices in use, most notably the Dayhoff matrices [6] and the BLOSUM matrices [9], which were originally created for aligning one sequence with another (|s|=1).

The scores are best interpreted as the logarithm of the ratio of the probability of the amino acid in the context to the background probability [1]:

$$ M(i,j) = \log \frac{P(i \mid j)}{P_0(i)} , $$

where P(i|j) is the probability of amino acid i in a column in which amino acid j has been observed.
The averaging of the score matrices is intended to create a new score. With the interpretation of scores given above, and assuming natural logarithms are used, the posterior counts are

$$ x_i = P_0(i) \, \exp\left( \frac{\sum_j s_j M(i,j)}{|s|} \right) . $$
We can avoid recording the extra parameters P_0(i) by redefining the score matrix slightly. If we let M'(i,j) = M(i,j) + ln P_0(i), then

$$ x_i = \exp\left( \frac{\sum_j s_j M'(i,j)}{|s|} \right) . $$
The BLOSUM substitution matrices provide a score matrix

$$ M(i,j) = \log \frac{q_{ij}}{P_0(i) \, P_0(j)} $$

for matching amino acid i and amino acid j, where q_ij is the probability of i and j appearing as an ordered pair in any column of a correct alignment. Let's take natural logarithms in creating the score matrix (to match the exponential in the computation of x_i). If we use j to name the sample consisting of a single amino acid j, then

$$ x_i = P_0(i) \, e^{M(i,j)} = P_0(i) \, \frac{q_{ij}}{P_0(i) \, P_0(j)} = \frac{q_{ij}}{P_0(j)} = P(i \mid j) . $$
This is the optimal value for the posterior counts (the normalized counts give exactly P(i|j)), and so the Gribskov method is optimal for |s|=1 (with a properly chosen score matrix).
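The following Python sketch implements the average-score estimate using the redefined matrix M'; it is an illustration under the assumptions above, not code from the paper. The flat estimate returned for an empty sample anticipates the |s|=0 behavior noted below.

import numpy as np

def gribskov_estimate(s, M_prime):
    # M_prime[i, j] is a natural-log score with the background folded in
    # (roughly ln P(i | j)).  Posterior counts:
    #     x_i = exp( sum_j s_j * M_prime[i, j] / |s| )
    s = np.asarray(s, dtype=float)
    if s.sum() == 0:
        n = M_prime.shape[0]
        return np.full(n, 1.0 / n)          # flat distribution for |s| = 0
    x = np.exp(M_prime.dot(s) / s.sum())
    return x / x.sum()

# For a sample containing a single amino acid j, x_i = exp(M_prime[i, j]),
# which recovers P(i | j) when the matrix is built as described above.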

Although the Gribskov method is optimal for |s|=1, it does not perform well at the extremes. For |s|=0, it predicts a completely flat distribution (just as zero-offset methods do). As |s| → ∞, the Gribskov method does not approach the maximum-likelihood estimate s_i/|s|.

We can get much better performance for |s|>1 by optimizing the score matrix, but the Gribskov method does not generalize to other values of |s| as well as the substitution matrix method described in Section 5.4.

Four score matrices were tested: the BLOSUM62 matrix (appropriately modified to represent ln P(i|j) for the BLOSUM62 data), the log-odds matrix for the test data, a matrix optimized for |s|=2, and one optimized for |s|=0,1,2,3. The excess entropies are presented in Table 4.

  
Table 4: Excess entropy for Gribskov average-score regularizers.

5.4 Substitution matrices

 

A substitution matrix computes the posterior counts as a linear combination of the counts:

$$ x_i = \sum_j s_j \, M(j,i) . $$
This method is similar to the Gribskov average-score method of Section 5.3, with one major difference---the matrix M is not a logarithmic score matrix.

Note that for |s|=0, all the sample counts are zero, and so the posterior counts are also zero. This violates the constraints on posterior counts, and so some other method of deriving posterior counts is needed for |s|=0. For the experiments reported in this paper, all-zero count vectors are replaced by all-one count vectors (that is, each s_j is treated as 1, giving x_i = Σ_j M(j,i)). This is equivalent to adding an infinitesimal zero-offset to the count vectors before multiplying by the substitution matrix M.
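A minimal Python sketch of the substitution-matrix regularizer, including the all-one replacement for empty samples (illustrative only; the matrix M is assumed to hold non-negative frequencies):

import numpy as np

def substitution_matrix_estimate(s, M):
    # M holds non-negative frequencies, not log scores.
    # Posterior counts: x_i = sum_j s_j * M[j, i].
    s = np.asarray(s, dtype=float)
    if s.sum() == 0:
        s = np.ones_like(s)                 # all-one count vector for |s| = 0
    x = s.dot(M)
    return x / x.sum()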

Substitution matrices, like score matrices, are designed for use in sequence-sequence alignment, where the sample consists of exactly one amino acid (|s|=1). Let P(i|j) be the distribution we expect in a column in which amino acid j has been seen (closely related to the relatedness odds ratio, which has been widely studied; see, for example, [12]). Then we can get optimal behavior for |s|=1 by setting M(j,i) = c_j P(i|j), for arbitrary positive constants c_j.

Furthermore, if we set c_j = P_0(j), then M(j,i) = q_ij, and we get optimal estimation for |s|=0 (the background distribution P_0, using the all-one count vector) as well as |s|=1. Neither the log-odds matrix (M(j,i) = log(q_ij / (P_0(i) P_0(j)))) nor this frequency matrix approach works well for larger sample sizes.

For large values of |s|, the substitution matrix does not guarantee that the estimated distribution approaches the true distribution, unless the count vector s happens to be an eigenvector of the matrix.

Four substitution matrices were tested: the frequency matrix from which the BLOSUM62 scoring matrix was derived, a frequency matrix computed from the weighted BLOCKS database, a substitution matrix optimized for |s|=2, and one optimized for |s|=0, 1, 2, 3. Table 5 presents the excess entropies.

The pure frequency matrix is optimal for |s|=0 and |s|=1, but degrades badly for larger samples, and is worse than pseudocounts for |s|=3. The BLOSUM62 matrix does not do well for any sample size greater than zero, probably because of the difference in weighting schemes used for building the matrix and for testing.

Optimizing the substitution matrix can preserve its superiority over pseudocounts up to |s|=4, but as the sample size increases, the pseudocounts approach the optimum regularizer, while substitution matrices get farther from the optimum.

  
Table 5: Excess entropy for substitution matrix regularizers.

5.5 Substitution matrices plus extra terms

 

In an attempt to avoid the rather ad hoc approach for handling |s|=0 with substitution methods, I tried a method which combines substitution matrices and pseudocount methods:

$$ x_i = z_i + \sum_j s_j \, M(j,i) . $$
If one thinks of a substitution matrix as a mutation model, then the pseudocounts represent a mutation or substitution that does not depend on what is currently in the sequence. For doing single alignments, where there is exactly one s_j that is non-zero, one could obtain the same effect by adding the pseudocounts to each column of the substitution matrix, but for other sample sizes, the separation of residue-specific and background substitutions turns out to be quite useful.

If z_i is set to a P_0(i) for a very small positive number a, then the method is essentially identical to the pure substitution matrix method. If M is set to be the identity matrix, then the method is identical to the pure pseudocount method. In practice, the optimal matrix is closer to the identity matrix than the simple substitution matrix is, but still has significant off-diagonal elements.

As with substitution matrices, substitution matrices plus pseudocounts do not converge to the optimal distribution as |s| → ∞. They do a little better than pure substitution matrices, since the matrix is closer to being an identity matrix.

Adding pseudocounts makes substitution matrices work better for |s|=0, but they still do not converge to the maximum-likelihood estimate as |s| → ∞. This problem can be solved by adding one more term to the posterior counts, proportional to the counts and growing faster than the vector Ms does. One easy way to accomplish this is to add the counts scaled by their sum:

$$ x_i = z_i + \sum_j s_j \, M(j,i) + |s| \, s_i . $$

This substitution-matrix method is a slight generalization of the data-dependent pseudocount method [17]. The data-dependent method sets

$$ x_i = s_i + B \sum_j \frac{s_j}{|s|} \, e^{\lambda A(i,j)} $$

for arbitrary parameters B and λ and a substitution matrix A. Scaling this by |s| and absorbing the constants and exponentiation into the matrix gives us

$$ x_i = |s| \, s_i + \sum_j s_j \, M(j,i) , $$
which is identical to the method here, if the pseudocounts are all zero. However, the construction of the matrices in [17] is rather ad hoc, and probably not optimized for the task.

Claverie proposed a similar method [5]---his method is equivalent to setting the pseudocounts to zero and scaling the counts by max_j s_j, instead of |s|. It might be interesting to try other scaling functions besides |s| or max_j s_j; any positive function f(s) with f(s) → ∞ as |s| → ∞ would give the correct convergence to the maximum-likelihood estimate. Lacking any theoretical justification for choosing the scaling function, I took the simplest one: |s|.
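A sketch of the full method, with the counts scaled by their sum as the extra term; the parameter names are mine, and any relative weighting of the three terms is assumed to be absorbed into M and the pseudocounts during optimization.

import numpy as np

def substitution_plus_extras_estimate(s, M, z):
    # Posterior counts: x_i = z_i + sum_j s_j * M[j, i] + |s| * s_i.
    # The |s| * s_i term eventually dominates, so the estimate converges to
    # the maximum-likelihood estimate s_i / |s| as |s| grows, while the
    # pseudocounts z handle the empty sample.
    s = np.asarray(s, dtype=float)
    x = np.asarray(z, dtype=float) + s.dot(M) + s.sum() * s
    return x / x.sum()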

Adding pseudocounts, scaled counts, or both to the substitution matrices improves their performance significantly. Table 6 presents the excess entropies for these regularizers. The full method, using scaled counts and pseudocounts as well as the substitution matrix, has the best results of any of the methods mentioned so far.

 

 


Table 6: Excess entropy for substitution matrix regularizers with pseudocounts and pseudocounts plus scaled counts.

5.6 Dirichlet mixtures

 

The Dirichlet mixture method introduced in [4] has similarities to the pseudocount methods, but is somewhat more complex. Dirichlet mixtures have been used quite successfully by several researchers [4,17,11]. The results here show that Dirichlet mixtures are quantitatively superior to all the other regularizers examined, and that they leave little room for improvement by better regularizers.

One way to view the posterior counts of Dirichlet mixtures is as a linear combination of pseudocount regularizers, where the weights on the combination vary from one sample to another, but the underlying regularizers are fixed. Each pseudocount regularizer is referred to as a component of the mixture. The weights for the components are the product of two numbers---a prior weight called the mixture coefficient and a weight that is proportional to the likelihood of the sample given the component.

Each pseudocount regularizer defines a Dirichlet density function (ρ_1 through ρ_k) on the possible distributions of amino acids, with ρ_c characterized by its pseudocount vector z_c. We need to introduce some notation---the Gamma and Beta functions. The Gamma function is the continuous generalization of the integer factorial function and the Beta function is a generalization of the binomial coefficients:

$$ \Gamma(x) = \int_0^{\infty} t^{x-1} e^{-t} \, dt , \qquad B(z) = \frac{\prod_i \Gamma(z_i)}{\Gamma\left(\sum_i z_i\right)} . $$
With this notation, we can define the posterior counts for a mixture with mixture coefficients m_c:

$$ x_i = \frac{\sum_c m_c \, \frac{B(s+z_c)}{B(z_c)} \, \frac{s_i + z_{c,i}}{|s| + |z_c|}}{\sum_c m_c \, \frac{B(s+z_c)}{B(z_c)}} , $$

where s + z_c should be interpreted as the component-wise sum of the two vectors and |z_c| is the sum of the pseudocounts in component c. The derivation of this formula using Bayesian statistics can be found in [13].

Because each of the pseudocount regularizers converges to the correct estimate as |s| → ∞, the Dirichlet mixture will also have the correct behavior in the limit. For |s|=0, the Beta functions cancel, and (since the mixture coefficients sum to one) we have

$$ x_i = \sum_c m_c \, \frac{z_{c,i}}{|z_c|} , $$

which can easily be made to fit the background distribution.
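A Python sketch of this estimate, computing the Beta-function weights in log space to avoid underflow; it follows the formula as reconstructed above, with the mixture coefficients m_c and pseudocount vectors z_c supplied as inputs.

import numpy as np
from scipy.special import gammaln

def log_beta(z):
    # ln B(z) = sum_i ln Gamma(z_i) - ln Gamma(sum_i z_i)
    return np.sum(gammaln(z)) - gammaln(np.sum(z))

def dirichlet_mixture_estimate(s, mix_coeffs, pseudocounts):
    # mix_coeffs:   mixture coefficients m_c, summing to one
    # pseudocounts: one pseudocount vector z_c per component
    s = np.asarray(s, dtype=float)
    zs = [np.asarray(z, dtype=float) for z in pseudocounts]
    log_w = np.array([np.log(m) + log_beta(s + z) - log_beta(z)
                      for m, z in zip(mix_coeffs, zs)])
    w = np.exp(log_w - log_w.max())         # component weights, up to a constant
    w /= w.sum()
    return sum(wc * (s + z) / (s.sum() + z.sum())
               for wc, z in zip(w, zs))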

Dirichlet mixtures are clearly the luxury choice among regularizers. The need for computing Gamma functions in order to evaluate the regularizer makes them much more expensive to use than any of the other regularizers reviewed here. However, the excess entropy results in Tables 7 and 8 show that the mixtures perform better than any other regularizer tested, and may well be worth the extra computational cost in creating a profile or hidden Markov model.

The regularizers in Tables 7 and 8 (except for the 9-component one) were created by fitting a single component to the data, or by adding components to a previously created mixture, optimizing for |s|=1,2 after each addition. The components were selected using the greedy strategy described in [13]. The 1-component mixture is just a set of pseudocounts, and so performs almost identically to the pseudocounts optimized for |s|=1 or |s|=2.

The 9-component mixture was optimized by Kimmen Sjölander to produce a good Bayesian prior for the count vectors from the BLOCKS database with all sequences given equal weight [4]. Sjölander's 9-component mixture is the best we have for |s|=5, but it does fairly poorly for |s|=0,1,2.

The overall best regularizer is the 21-component Dirichlet mixture, which gets within 0.027 bits of the best possible regularizer for sample sizes up to 5, and probably never takes more than 0.09 bits more than the optimum regularizer.

 

 


Table 7: Excess entropy for small Dirichlet mixture regularizers optimized for |s|=1,2. The mixtures were built by adding new components to a previous mixture, except for the nine-component mixture, which was provided by Kimmen Sjölander.

 

 


Table 8: Excess entropy for larger Dirichlet mixture regularizers optimized for |s|=1,2. The mixtures were built by adding new components to a previous mixture, with the history of the additions shown in the name. The 21-component mixture 1+2+3+4+5+6 is the best overall regularizer for the BLOCKS database.


