Most search and comparison algorithms for proteins need to estimate the probabilities of the twenty amino acids in a given context. This probability is often expressed indirectly as a score for each of the amino acids, with positive scores for expected amino acids and negative scores for unexpected ones.
As Altschul pointed out [1], any alignment-scoring
system is really making an assertion about the probability of the test
sequences given the reference sequence.
The score for an alignment is the sum of the scores for individual
matched positions, plus the costs for insertions and deletions.
For each match position, there are twenty scores---one for each of the
possible amino acids in the test sequence.
Each match score can be interpreted as the logarithm of the ratio of
two estimated probabilities: the probability of the test amino acid
given the amino acid in the reference sequence and the probability of
the test amino acid in the background distribution.
If we define
as the estimated probability of amino acid
i in position x and
as the estimated background
probability in any position, then the score for i in column t
is
for some arbitrary logarithmic base
b [1].
Any method for estimating the probabilities
and
defines a match-scoring system.
Rather than looking at the final scoring system, this paper
will concentrate on methods that can be used for estimating the
probabilities themselves.
In more sophisticated models than single sequence alignments, such as
multiple alignments, profiles [7], and hidden Markov
models [14,3], we may have more than one reference
sequence in our training set.
Each position in such a model defines a context for which we need to
estimate the probabilities of the twenty amino acids.
In this paper, s refers to a sample of amino acids from a
column and
to the number of times that amino acid i appears
in that sample.
Our problem, then, is to compute the estimated probabilities
for the context from which sample s was taken, given
only the twenty numbers
.
For alignment and search problems, we usually add scores from many positions, and so fairly small improvements in computing individual match scores can add up to significant overall differences. For example, the small differences between the PAM and BLOSUM scoring matrices have been shown to make a significant difference in the quality of search results [9].
The differences between regularizers are often fairly small; this paper attempts to quantify these small differences for several regularizers. Section 2 explains the measure used to quantify the tests; Section 3 explains the notion of posterior counts; Section 4 describes the data used for training and testing; and Section 5 presents the different methods and quantitative comparisons of them.