Next: 2 Quantitative measure for Up: Evaluating Regularizers for Estimating Previous: Evaluating Regularizers for Estimating

1 Why estimate amino acid distributions?

Most search and comparison algorithms for proteins need to estimate the probabilities of the twenty amino acids in a given context. This probability is often expressed indirectly as a score for each of the amino acids, with positive scores for expected amino acids and negative scores for unexpected ones.

As Altschul pointed out [1], any alignment-scoring system is really making an assertion about the probability of the test sequences given the reference sequence. The score for an alignment is the sum of the scores for individual matched positions, plus the costs for insertions and deletions. For each match position, there are twenty scores---one for each of the possible amino acids in the test sequence. Each match score can be interpreted as the logarithm of the ratio of two estimated probabilities: the probability of the test amino acid given the amino acid in the reference sequence and the probability of the test amino acid in the background distribution. If we define as the estimated probability of amino acid i in position x and as the estimated background probability in any position, then the score for i in column t is for some arbitrary logarithmic base b [1].

Any method for estimating the probabilities and defines a match-scoring system. Rather than looking at the final scoring system, this paper will concentrate on methods that can be used for estimating the probabilities themselves.

In more sophisticated models than single sequence alignments, such as multiple alignments, profiles [7], and hidden Markov models [14,3], we may have more than one reference sequence in our training set. Each position in such a model defines a context for which we need to estimate the probabilities of the twenty amino acids. In this paper, s refers to a sample of amino acids from a column and to the number of times that amino acid i appears in that sample. Our problem, then, is to compute the estimated probabilities for the context from which sample s was taken, given only the twenty numbers .

For alignment and search problems, we usually add scores from many positions, and so fairly small improvements in computing individual match scores can add up to significant overall differences. For example, the small differences between the PAM and BLOSUM scoring matrices have been shown to make a significant difference in the quality of search results [9].

The differences between regularizers are often fairly small; this paper attempts to quantify these small differences for several regularizers. Section 2 explains the measure used to quantify the tests; Section 3 explains the notion of posterior counts; Section 4 describes the data used for training and testing; and Section 5 presents the different methods and quantitative comparisons of them.

Next: 2 Quantitative measure for Up: Evaluating Regularizers for Estimating Previous: Evaluating Regularizers for Estimating

karplus@cse.ucsc.edu