In many cases, we want to look for sequences on either strand of the DNA. We can achieve this by complement blurring of the word counts:
![]()
where w' is the dyadic complement of w and c is the complement blurring weight.
Generally, c is set to either 0 or 1, depending on the application. Choosing c by optimizing adaptive compression does not seem to work well, as gradient descent methods do not converge when c, z, and the neighbor blurring parameters are optimized simultaneously.
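As a rough illustration of complement blurring, the following sketch counts fixed-length words and then adds to each word's count a weighted copy of its complement's count. Two things here are assumptions rather than statements from the text: that the blurred count has the form count(w) + c·count(w'), and that the dyadic complement w' is the reverse complement of w. The function names are hypothetical.

```python
from collections import Counter

# Assumption: the "dyadic complement" w' is the reverse complement of w.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    """Return the word as read on the opposite strand."""
    return word.translate(_COMPLEMENT)[::-1]


def count_words(sequences, length):
    """Count every word of the given length in each sequence."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - length + 1):
            counts[seq[i:i + length]] += 1
    return counts


def complement_blur(counts, c):
    """Assumed form of the blurring: blurred(w) = count(w) + c * count(w')."""
    # Include complements of observed words, since they receive counts too.
    words = set(counts) | {reverse_complement(w) for w in counts}
    return {
        w: counts.get(w, 0) + c * counts.get(reverse_complement(w), 0)
        for w in words
    }


counts = count_words(["AACG"], 2)          # AA, AC, CG each seen once
both_strands = complement_blur(counts, 1)  # c = 1: strands treated alike
one_strand = complement_blur(counts, 0)    # c = 0: original counts only
```

With c=1 the unseen word TT inherits the count of AA, and the self-complementary word CG is counted twice; with c=0 the counts are unchanged.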
Table 1: Optimal blurring parameters for adaptive compression of the
109 sequences of REP99-gxn (12077 bases) with the complement parameter
fixed at c=1.
Table 1 gives the parameter settings
for the best adaptive compression of the 109 sequences (12077 bases)
of REP99-gxn (with c=1).
The number of counts for an order-k model is 12077-109k, since the
first k bases of a sequence do not generate counts.
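The count formula can be checked with a few lines of arithmetic. The sketch below (function names are hypothetical) computes the number of counts both directly from the sequence lengths and from the closed form 12077-109k. The two agree only when every sequence has at least k bases, which the closed form implicitly assumes; the orders looped over are simply those mentioned in the text.

```python
def number_of_counts(sequence_lengths, k):
    """Counts generated by an order-k model: each sequence of n bases
    contributes n - k counts (none if it is shorter than k)."""
    return sum(max(n - k, 0) for n in sequence_lengths)


def number_of_counts_closed_form(total_bases, num_sequences, k):
    """Closed form from the text; valid when every sequence has >= k bases."""
    return total_bases - num_sequences * k


# REP99-gxn: 109 sequences, 12077 bases in total.
for k in (2, 3, 5, 6, 10):
    print(k, number_of_counts_closed_form(12077, 109, k))

# Toy lengths (not from the data set) showing where the two forms differ.
toy_lengths = [5, 7, 3]
direct_k2 = number_of_counts(toy_lengths, 2)
direct_k4 = number_of_counts(toy_lengths, 4)
```

For the toy lengths, the direct count and the closed form agree at k=2 but differ at k=4, because one sequence is shorter than four bases.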
Note that the order-2 and order-3 models use the zero-offset in preference to neighbor blurring, but as the order gets larger (and the counts per context smaller), neighbor blurring becomes much more important. If we look at the ratio of z to the expected count, we see it increasing from 0.0163 for order-2 models to 2.97 for order-10, even though the value of z itself is decreasing rapidly.
If we assume that all contexts contain a single count of 1 and three
counts of zero, we can approximate the single-point mutation
frequencies for each of the three types of substitution as
We can improve this estimate somewhat by scaling down the z value in the formula by the expected count for non-zero contexts. We can estimate the number of non-zero contexts as roughly
,
where x is the total number of counts made. The expected count for a non-zero context is thus
The estimated mutation frequencies using this method are given in
Table 2. Since the assumptions of the method
are more reasonable for higher-order models, the mutation rate
estimates are probably most accurate for the highest-order model.
The very high predicted rate of substitution for k=5 and k=6
probably results from merging together contexts from the two parts of
the REP, which are similar but not identical. Higher-order models can
identify the separate parts of the REP more reliably, and so the
predicted substitution rates are more likely to be reasonable.
Unfortunately, I have no way to check these predicted substitution rates against other methods for estimating substitution rates (such as character counts in a multiple alignment), as I do not have a multiple alignment for all the REP sequences.
Table 2: Estimated substitution frequencies for the three types of
substitutions (and combined mutation frequency) in REP99-gxn
based on the optimal neighbor blurring parameters from Table 1.