
Complement blurring

In many cases, we want to look for sequences on either strand of the DNA. We can achieve this by complement blurring of the word counts:

[displayed equation not recovered]

where w' is the dyadic complement of w and c is the complement blurring weight.
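A minimal Python sketch of complement blurring on word counts. It rests on two assumptions made only for illustration: that the blurred count of w is the raw count of w plus c times the raw count of w', and that the dyadic complement is the reverse complement of the word. The function names (`reverse_complement`, `count_words`, `complement_blur`) and the toy sequences are hypothetical.

```python
from collections import Counter

# Base-pairing table for DNA (assumes the alphabet A, C, G, T only).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    """Return the word as read on the opposite strand (assumed meaning of w')."""
    return word.translate(COMPLEMENT)[::-1]


def count_words(sequences, length):
    """Count every word of the given length in every sequence."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - length + 1):
            counts[seq[i:i + length]] += 1
    return counts


def complement_blur(counts, c):
    """Assumed blurring rule: blurred(w) = count(w) + c * count(w')."""
    blurred = Counter()
    for word, n in counts.items():
        blurred[word] += n
        blurred[reverse_complement(word)] += c * n
    return blurred


# Toy data, not taken from REP99-gxn.
raw = count_words(["ACGTT", "AACGT"], 3)
both_strands = complement_blur(raw, c=1)
one_strand = complement_blur(raw, c=0)
```

Under this assumed rule, c=0 leaves the counts unchanged, while c=1 gives a word and its complement the same blurred count, so the model no longer distinguishes the two strands.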

Generally, c is set to either 0 or 1, depending on the application. Choosing c by optimizing adaptive compression does not seem to work well: gradient descent methods do not converge when c, z, and the neighbor blurring parameters are optimized simultaneously.

  

Table 1: Optimal blurring parameters for adaptive compression of the 109 sequences of REP99-gxn (12077 bases) with the complement parameter fixed at c=1.

Table 1 gives the parameter settings for the best adaptive compression of the 109 sequences (12077 bases) of REP99-gxn (with c=1). The number of counts for an order-k model is 12077-109k, since the first k bases of each sequence do not generate counts.
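The count formula can be checked with a short Python sketch. The function names and the toy sequence lengths are hypothetical; the closed form assumes that every sequence is at least k bases long.

```python
def num_counts(seq_lengths, k):
    """Sum the counts per sequence: a sequence of n bases yields n - k counts."""
    return sum(max(n - k, 0) for n in seq_lengths)


def num_counts_closed_form(total_bases, num_seqs, k):
    """Closed form from the text: total bases minus k per sequence."""
    return total_bases - num_seqs * k


# Values for the REP99-gxn totals quoted above (12077 bases, 109 sequences).
for k in (2, 10):
    print(k, num_counts_closed_form(12077, 109, k))
# prints:
# 2 11859
# 10 10987
```

The two functions agree whenever no sequence is shorter than k; for example, `num_counts([5, 7, 10], 3)` and `num_counts_closed_form(22, 3, 3)` both give 13.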

Note that the order-2 and order-3 models use the zero-offset in preference to neighbor blurring, but as the order gets larger (and the counts per context smaller), neighbor blurring becomes much more important. If we look at the ratio of z to the expected count, we see it increasing from 0.0163 for order-2 models to 2.97 for order-10 models, even though the value of z itself is decreasing rapidly.

If we assume that all contexts contain a single count of 1 and three counts of zero, we can approximate the single-point mutation frequencies for each of the three types of substitution as

[displayed equation not recovered]

We can improve this estimate somewhat by scaling down the z value in the formula by the expected count for non-zero contexts. We can estimate the number of non-zero contexts as roughly

[displayed equation not recovered],

where x is the total number of counts made. The expected count for a non-zero context is thus

[displayed equation not recovered]
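As a rough illustration of this kind of estimate, the Python sketch below uses a standard occupancy approximation: if x counts fall uniformly at random into 4^k contexts, the expected number of non-zero contexts is 4^k (1 - (1 - 4^-k)^x). This uniformity assumption and the function names are introduced here for illustration only and are not necessarily the estimate used in the text.

```python
def expected_nonzero_contexts(x, k, alphabet_size=4):
    """Expected number of occupied contexts when x counts are spread
    uniformly at random over alphabet_size**k contexts (assumed model)."""
    n_contexts = alphabet_size ** k
    return n_contexts * (1.0 - (1.0 - 1.0 / n_contexts) ** x)


def expected_count_per_nonzero_context(x, k, alphabet_size=4):
    """Total counts divided by the expected number of occupied contexts."""
    return x / expected_nonzero_contexts(x, k, alphabet_size)


# x is the number of counts made by an order-k model, 12077 - 109k.
for k in (2, 6, 10):
    x = 12077 - 109 * k
    print(k, expected_count_per_nonzero_context(x, k))
```

Under this assumed model, low-order contexts are all occupied and hold many counts each, whereas at order 10 almost every occupied context holds a single count, which is the regime in which the single-count assumption above is most plausible.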

The mutation frequencies estimated with this method are given in Table 2. Since the assumptions of the method are more reasonable for higher-order models, the mutation-rate estimates are probably most accurate for the highest-order model. The very high predicted substitution rates for k=5 and k=6 probably result from merging contexts from the two parts of the REP, which are similar but not identical. Higher-order models can identify the separate parts of the REP more reliably, so their predicted substitution rates are more likely to be reasonable.

Unfortunately, I have no way to check these predicted substitution rates against other methods for estimating substitution rates (such as character counts in a multiple alignment), as I do not have a multiple alignment for all the REP sequences.

  


Table 2: Estimated substitution frequencies for the three types of substitutions (and combined mutation frequency) in REP99-gxn based on the optimal neighbor blurring parameters from Table 1.



Rey Rivera
Thu Aug 22 14:04:06 PDT 1996