Next: 4 Experimental method Up: Evaluating Regularizers for Estimating Previous: 2 Quantitative measure for

3 Posterior Counts

Section 2 introduced the maximum-likelihood method for estimating probabilities from counts, . Maximum likelihood is asymptotically optimal as , but performs very badly for small sample sizes, since the encoding cost for any amino acid not seen in the sample is . To avoid this infinitely high cost, we will constrain regularizers to provide non-zero estimates for all probabilities: .

Regularizers can be viewed as making an adjustment to the sample counts, s, to produce posterior counts, , from which we estimate the probability:

To get legal estimated probabilities and avoid infinite costs, the primary constraint is .

Note: there will be several different formulas given for computing , corresponding to different regularizers. The symbols will be used for defining the different methods. The notations and refer to the background probabilities and their estimates (that is, the probabilities given a sample of size zero).

Once we have decided that the goal is to minimize the average encoding cost of the columns, and chosen a method to try, we can try to optimize the parameters of the method, using Newton's method or gradient descent to find parameter values at which all the first derivatives of the encoding cost are zero, and all the second derivatives are positive.

We can compute the derivatives of the encoding cost with respect to any parameter fairly easily from and its derivatives. For many of the methods, the second derivative of is 0 for all the parameters, further simplifying the optimization. For a more complete discussion of optimization, see [13].

karplus@cse.ucsc.edu